New Postgres Backup and Restore Book

A couple of months ago the folks at PACKT asked me if I could tech review one of their new books, PostgreSQL Backup and Restore How-to. What caught my eye was the idea behind the book: pick a single topic that is important to people using the software, and then cover it quickly and efficiently. Postgres is a really large piece of software, with a heck of a lot of moving parts, so it’s difficult to cover the entire thing in one book. This approach is one I have been suggesting to publishers for a while, so I was happy to help PACKT with their attempt. The book itself covers a number of different options for Postgres backups, from pg_dump to making filesystem backups using PITR and the WAL system. If you’re working with Postgres and have questions about the different options available for doing backups and/or restores, I encourage you to check it out.

phpPgAdmin 5.1 Released!

The phpPgAdmin Team is proud to announce the new major release of phpPgAdmin. Version 5.1 adds many new features, bug fixes, and updated translations over the previous version. This release has been long overdue, and brings with it stable support for all current versions of PostgreSQL, including 9.1 and 9.2. In addition, a fair number of bugs have been fixed, including one that could lead to corrupted bytea data, so all users are strongly encouraged to upgrade. We appreciate the large number of people who use phpPgAdmin on a regular basis, and hope this new version will help make things even better!

Download

To download phpPgAdmin 5.1 right now, visit: http://phppgadmin.sourceforge.net/doku.php?id=download

Features

  • Full support for PostgreSQL 9.1 and 9.2
  • New plugin architecture, including addition of several new hooks (asleonardo, ioguix)
  • Support nested groups of servers (Julien Rouhaud & ioguix)
  • Expanded test coverage in Selenium test suite
  • Highlight referencing fields when hovering over Foreign Key values while browsing tables (asleonardo)
  • Simplified translation system implementation (ioguix)
  • Don’t show cancel/kill options in process page to non-superusers
  • Add download ability from the History window (ioguix)
  • User queries now paginate by default

Translations

  • Lithuanian (artvras)

Bug Fixes

  • Fix several bugs with bytea support, including possible data corruption bugs when updating rows that have bytea fields
  • Numerous fixes for running under PHP Strict Standards
  • Fix an issue with autocompletion of text based Foreign Keys
  • Fix a bug when browsing tables with no unique key

Incompatibilities

  • phpPgAdmin core is now UTF-8 only
  • We have stopped testing against Postgres versions < 8.4, which are EOL

Regards,
The phpPgAdmin Team

Join the 5%

In the next 48 hours, Americans all across the country (well, half of them anyway) will head to the polls to cast their votes for President. But what does it mean to have a vote that counts? Sitting in Florida in the year 2000, watching that election unfold, I don’t think I have ever been closer to having a vote that counted. For a true cynic, sure, my one vote would not have changed the election. However, with a margin of roughly 500 votes, it wasn’t lost on me that I actually knew enough people that, had we all voted together, we could have changed the entire election. You can’t get much closer to a vote that counts than that.

In 2004, the election was not nearly as close. With a margin in Florida of almost 400,000 votes, I certainly didn’t know enough people to swing that one. After that I moved to Maryland, and any illusion of a vote that would change the outcome of an election completely disappeared; Maryland is a state that has voted Democrat by double-digit margins for years, with no signs of change. Regardless of whether you are voting Republican or Democrat, the outcome here is fairly certain. Of course, Maryland is not alone.

The above graph lists the “likelihood your state will determine the presidency” (source). If you aren’t in one of those states, the truth is that your vote means very little to the outcome of who becomes president. This isn’t to say you shouldn’t vote; it never hurts to take part in the political process, and to be sure there are always a number of state level initiatives that are worth voting on. Some would look at that and say that for most people, voting for president doesn’t really matter. Normally I’d agree, but this year there is a chance that things could be different.

While I’ve no illusion that they will win the election, this year the Libertarian party has a chance to do something significant: obtain 5% of the popular vote. If that happens, they will be eligible to receive matching funds for 2016. While this isn’t significant to the two major parties (who have opted out of the program so as not to *limit* their fundraising), for a third party it would be a major milestone. If you’ve been dissatisfied with your party, or you live in a state where the outcome is solid, I’d urge you to join me in voting for Gary Johnson. Even if you don’t agree with all of their policies, you probably agree with some; but whether you do or not, the real issue is that getting the Libertarian party to 5% also means getting a whole slew of issues up for discussion that are sorely lacking from the current two-party system we’re working under. That’s something that would count, and definitely something worth voting for.

Shoot the Automated Failure in the Head

This past week Github experienced their most significant service disruption of the year, and much of it came at the hands of an automated failover system they had designed to try to avoid disruptions. A number of different factors made the situation as bad as it was, but the basic summary of what led to the problem looks like this:
  1. On Monday, they attempted a schema migration which led to a load spike.
  2. The high load triggered an automated failover to one of their MySQL slaves.
  3. Once failed over to, the new master also experienced high load, so the automated failover attempted to revert back.
  4. At this point, the ops team put the automated failover system into “maintenance mode” to prevent further failovers.
There’s actually more that went wrong for them after this point, and I encourage you to read the full post on the Github blog, but I wanted to focus on the initial problems for a moment. Our database team at OmniTI is often asked what type of process we normally recommend for dealing with failover situations, and we stand by our assessment that for most people, manual failover is typically the best option. It’s not that the idea of automated failover isn’t appealing, but the decisions involved can be very complex, and it’s hard to get that right in a scripted fashion. In their post, the Github team mentions that, had a person been involved in the decision, neither of the failovers would have been done.

To be clear, manual failover should not mean a bunch of manual steps; I think many people get confused on this idea. When you do need to failover, you need that to happen as quickly, and as correctly, as possible. When we say “manual” failover, we mean that the decision to failover should be manual, but the process should be as scripted and automated as possible.

Another key factor in setting up scripted failover systems, and one that we see forgotten time and time again, is ye old STONITH concept. While it’s not 100% clear from the description in the Github post, it seems that their system not only allowed automated failover, but was also allowed to do automated fail-back. Just like any decision to failover needs to be manual, I always like to have at least one manual step after failover that is needed to declare the system “back to normal”. This is extremely useful because it acts as a clear sign for your ops team that everyone agrees things are back to normal. Before that happens, your scripted failover solution should be unable to perform; why allow failover back to a machine that you’ve not agreed is ready to go back into service?

I know none of this sounds particularly sexy, but it’s battle tested and it works. And if you really don’t think you can wait for a human to intervene, build your systems for fault tolerance, not failover; just be warned that it is more expensive, complicated, and time consuming to implement (and the current open source options leave a lot to be desired). Wondering about ways to help ensure availability in your environment? I’ll be speaking at Velocity Europe the first week of October, talking about “Managing Databases in a DevOps Environment”; if you’re going to be attending I’d love to swap war stories. And yes, that’s the week after Surge, which is war story nirvana; if you haven’t gotten tickets for one of these events, there’s still time left. I hope to see you there.
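
To make the distinction concrete, here is a minimal sketch of the “manual decision, automated process” pattern, written as a small node script since that’s the sort of glue we script anyway. Everything in it is hypothetical: the flag file paths, hostnames, and promotion commands are placeholders, not anything Github or OmniTI actually runs. The point is simply that the failover process is automated, while both the decision to fail over and the decision to allow fail-back require a human to flip a flag.

    // failover.js -- a minimal sketch of "manual decision, automated process".
    // All paths, hostnames, and commands below are hypothetical placeholders.
    var fs = require('fs');
    var execSync = require('child_process').execSync;

    var ARMED_FLAG = '/etc/failover/armed';       // a human must create this to authorize a failover
    var PRIMARY_OK = '/etc/failover/primary-ok';  // a human must recreate this before fail-back is allowed

    function failover(newPrimary) {
      // The decision is manual: refuse to do anything unless a person armed it.
      if (!fs.existsSync(ARMED_FLAG)) {
        console.error('Failover has not been armed by a human; refusing to run.');
        process.exit(1);
      }

      // STONITH-ish guard: revoke the old primary's "good to serve" marker so
      // nothing (including this script) can automatically fail back to it.
      if (fs.existsSync(PRIMARY_OK)) {
        fs.unlinkSync(PRIMARY_OK);
      }

      // The process is scripted: promote the replica and repoint traffic.
      // These commands stand in for whatever steps your stack actually requires.
      execSync('ssh ' + newPrimary + ' /usr/local/bin/promote-replica');
      execSync('/usr/local/bin/repoint-app-traffic ' + newPrimary);

      // Disarm, so a second failover requires another human decision.
      fs.unlinkSync(ARMED_FLAG);
      console.log('Failed over to ' + newPrimary + '; old primary must be manually re-certified.');
    }

    failover(process.argv[2] || 'db-replica-1');

An ops person arms it by creating the “armed” file, runs the script, and the old primary cannot be failed back to until someone deliberately recreates the “primary-ok” marker; that manual re-certification is the agreement that things really are back to normal.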

Contents of an Office

Today is moving day at OmniTI. We’re moving to our new offices in Maple Lawn, which are pretty kick ass. Of course, moving means packing up all of your things and taking them to the new place, or perhaps throwing a bunch of them out. When I first came to OmniTI, I sat at a desk next to Wez Furlong for about 2 weeks. I’ve had, I think, 6 different desks since then, and now reside in an office. During all that moving, I tried to consolidate; I’m not sure I succeeded. While cleaning and packing, I decided to write down all of the stuff I had collected in my office; I’ve thrown almost all of it away, so from now on it can live on the internets for posterity.

empties:
  • root beer bottles
    • hanks (philadelphia, pa) (2 cases, plus a spare)
    • appalachian brewing company (harrisburg, pa) (1 case, plus a few spares)
    • jack black’s dead red root beer
    • mccutchensons (frederick, MD)
    • old soaker (atlantic brewing company, bar harbor, maine)
    • aj stephens (boston)
    • 1 abita brewing root beer cap
  • scotch bottles
    • auchentoshan three wood
    • copper fox rye whisky (bottled 2011-05-05)
    • balvenie 12
    • glenlivet 12
    • grangestone 12 (two bottles)
    • glenkinchie 12
    • bunnahabain 12
    • chivas regal 12
    • willett reserved whisky kentucky bourbon
  • 1 empty can of frescolita

non-empties:
  • 1 bottle of scaldis noel
  • 1 2-liter bottle of pennsylvania dutch birch beer
  • 1 16oz bottle of pennsylvania dutch birch beer
  • books
    • scalable internet architectures (2 copies)
    • beginning php and postgresql 8
    • version control by example
    • mysql tutorial
    • mysql database design and tuning
    • unix power tools
    • perl best practices
    • beautiful data
    • head first php & mysql
    • sterlings gold
  • hats
    • omniti
    • surge 2011
    • opensolaris
  • half-dozen or more conference badges
  • one bottle chipotle tabasco
  • 3 boxes of old business cards (3 different designs)
  • 1 gift bag from client (and friend)
  • 3 MTA cards from NYC
  • 1 container of jellybeans from Truviso (thanks Greg!)
  • a blues clues sticker from my daughter
  • oscon data elephant sticker
  • busch gardens elephant fact sticker
  • codango php elephant squeeze toy
  • 1 printed photo of gier magnessun
  • surge postcards
  • 1 copy of Communications of the ACM
  • 1 menu from pudgies
  • 1 postgresql banner
  • real estate brochures of about 2 dozen area buildings
  • 2 sharpies
  • 1 highlighter
  • 1 omniti pen
  • 1 hilton garden inn pen
  • 1 pewter elephant bookend
  • 1 ceramic statue of Apsara
  • 1 plastic balancing jet fighter toy
  • 1 organic fruit sticker
  • several thank you cards from tech friends
  • an old contract proposal, full of highlighted issues
  • 1 worksheet on goal driven performance optimization from percona
  • 1 old sticky brain lying on the floor
  • a billing breakdown for one of our long time customers
  • 1 screw
  • 1 allen wrench
  • 2 whiteboard markers
  • 1 whiteboard eraser
  • 1 business card for “gas station tacos”
  • countless ERD’s for PE2
  • schema layout for podshow databases
  • several resumes, mostly from people we didn’t hire (sorry Jiraporn)
  • fax information for my daughter’s pre-school (she’s in 4th now)
  • 1 random screw
  • 1 paper clip

What Todd Akin Can Teach Us About DevOps

By now I’m sure most of you have heard the story of Todd Akin and his comments on “legitimate” rape; they’ve been hard to avoid. Or at least, the backlash against those comments was hard to avoid. Most people (well, most in my circles) expressed some form of outrage, exasperation, or utter dismissal towards the comments and the man who made them. This is, of course, the nature of political discourse in America; we tend to vilify those who say things we don’t understand or find offensive first, and then demonize them later.

When I first heard the quote my reaction was not that this guy was some ass-clown who just hates women; I thought “What does he mean, ‘legitimate’ rape? And where is he getting his information?” Yes, I understand; my reaction probably disappoints a lot of people, and probably makes other people’s heads explode. I find that most people try to do the right thing. Of course, what you think the right thing is depends a lot on the information you’ve come to believe. If I said that I was basing my beliefs on what doctors say, I think most people would be OK with that. In this case, he said that doctors had told him these things. So my problem isn’t with the conclusions he reached (1), it’s with the way he got there. And this was what was most frustrating; no one was stopping to question the source material.

Well, no one until I happened to see the Anderson Cooper show. Here’s a good write-up on their episode, where they actually attempt to track the statement to its source, and they find a doctor who has written and lectured on the information that Akin was referring to. They then brought in their own doctor to counter those claims, and they made some inferences about the reasons for the false information. (Yes, I know, actual journalism; hard to imagine.) For anyone who thought there might be something to Akin’s comments, watching that episode should have put a lot of those thoughts to rest.

So what the hell does this have to do with DevOps, you might be asking? Well, one of the tenets of DevOps culture that we try to employ, and that I have seen inside really successful DevOps shops, is the idea of blameless post mortems. In practical terms this means that when something goes wrong, you work to find the cause of the problem, not to assign blame to any particular person, but to figure out how to make improvements. One of the reasons for this goes back to what I said earlier: people try to do the right thing. Whether you are an SA or a Web Dev or whatever, your goal is not to crash the site, and if your actions caused that, we start with the idea that it wasn’t your intention, but that some piece of information caused you to think it was OK. Why did you do the thing you did? Why did you believe it was safe and/or a good idea?

As a technical manager or leader within an organization, answering these questions is critical to your success, because chances are that you have also played a part in the failure, because you did not adequately prepare the person for the mission they were about to embark on. Yes, you can blame the person, call them the ass-clown, even get rid of them, but chances are that if they thought they were acting on good information, someone else has probably heard similar information, and they are getting ready to make the same bad decisions.
So the next time you see someone do something, or say something, that seems boneheadedly wrong, before you start castigating them, take a brief moment to find out why they did what they did, and what information they were relying on that caused them to act as they did. Then, rather than persecute the person, persecute the poor information; make sure everyone you think might be working under incorrect pretenses gets the opportunity to hear the real situation. If you’re lucky, your “bad actor” might even become a champion for your cause. OK, perhaps not in politics, but I have seen it happen in technical shops, and when it does, it’s awesome.

ADDENDUM: This morning my son missed his bus. It was his first day of middle school. We went to the bus stop at the time we saw in the paper and posted at his school during orientation. He was understandably upset by this, and with new-school nerves in a bundle, was feeling quite angry. At first he blamed himself and was worried that his teachers would be mad at him. After we explained that wouldn’t be the case, he then got angry at the bus driver for not showing up at the right time. We again told him that he shouldn’t be so upset, but he wasn’t having it. I then explained to him the concept of the blameless post mortem: we didn’t really know what went wrong; we showed up when we thought we were supposed to, and it was possible the bus driver showed up when she thought she was supposed to, or maybe the bus didn’t show up at all (my older son’s bus broke down this morning, and he had to catch a ride). The point for us now was to figure out what the right time for the bus was, make sure it got communicated to all parties, and make sure we made the bus tomorrow.

(1) OK, yes, I have a problem with the conclusion, but I don’t think it’s the problem people should be focusing on.

Root Cause of Success

Like most companies, we do root cause analysis when things go wrong. “Root cause” is a bit of a misnomer; we deal with complex systems, usually with different levels of redundancy, so having a single root cause is usually not realistic; really these are more like post mortems. In any case, when we have an incident, it’s important to review what went wrong, gathering logs, graphs, and other data, to try to learn why the assumptions we made did not hold up as we thought, and to determine what changes we might need to make for the future. This cycle of review and learning is critical for continued success.

This past weekend, the OmniTI operations folks went through a number of significant production excursions, most of which were pulled off with good success. Afterwards, we didn’t do a post mortem. This probably isn’t too different from most shops; I think most people don’t do a post mortem when things work. We probably should. Even when things work, there are usually surprises along the way, and if you only do an in-depth look back when things fail, you’re probably overlooking use cases and scenarios you are likely to encounter again. Additionally, it’s good information for people to be able to review, especially when bringing on new hires. You might think this would be boring, but I happen to love reading a well written post mortem. You probably do too; you just don’t think of something like Apollo 13 as a giant post mortem, but for the most part that’s what it is.

So I’m curious: are there shops where people do a regular, detailed accounting when things go right? Not just having audit trail information around, but walking through those logs as a group and talking out loud about the areas that were more hope than plan, but that everyone now feels confident in because they worked. I know a lot of different people running web operations, but this doesn’t seem like a common practice; if you’ve worked in such an environment, I’d love to hear about your experiences.

Slides for Big Bad Upgraded Postgres Talk

Howdy folks! I finally got the slides up for the “Big Bad `Upgraded` Postgres” talk which I gave at PGCon 2012 (and previously at PGDay DC). The talk walks through a multi-terabyte database upgrade project, and discusses many of the problems and delays we encountered, both technical and non-technical. I think the slides stand up pretty well by themselves, but you can also find some additional info on my co-worker Keith’s blog, where he has chronicled some of the fun times we’ve had along the way. He also has some posts on the benefits we’ve seen since upgrading. Anyway, the slides are on my slideshare page, please have a look.

I Built a Node Site

Two weekends ago I was in need of a website. The local Postgres Users Group is putting on a 1-day mini-conference (featuring some of the best speakers you can get, I might add; you should probably go) and we wanted to put up a site with information on the conference. We didn’t need anything fancy, just some static pages with some basic info. We also don’t really have any money, so I wanted something simple that I could toss on-line and have hosted for free, with the caveat that I wanted something I could code (i.e. not a WYSIWYG template thing) because I have some predefined Postgres related graphics and CSS type stuff I wanted to re-use.

After browsing around a little I ran across an interesting service that I almost used called Static Cloud, which is designed to store HTML, CSS, and JavaScript files on-line. This seemed fine for such a simple site, but when I started tossing together the HTML, I realized I did have some repeatable content that I wanted to repeat (header, footer type stuff). There’s probably a way to do this, but it took me out of my comfort zone, so I decided I should use a scripting language to do my dirty work. I looked at the various PHP, Ruby, and Python offerings, but sadly nothing seemed to fit what I wanted, mainly on account of them not being free.

Then I stumbled upon Nodester. Nodester is a node.js based hosting service, which allows you to host node based apps on their servers for free. How friendly! Now, I’ve looked at node before, probably 6+ months ago, and thought it was interesting, but didn’t really have too much use for it at the time. Since then OmniTI has used it for a couple of projects, including one recent project (still ongoing actually) where we built a hefty section of the back-end for a large, asynchronous, services system. And we did it in node.js. So, having seen some of that work, I thought why not give node.js another go around.

So, I built a site. It’s not fancy. It’s half a dozen pages that don’t need to do much. Some files get processed, some pages get displayed. I mostly mention it here because when I started putting it together, I couldn’t actually find anything like this: a complete site that was more than just the most trivial example of how to plumb things together. This doesn’t go much beyond that, but if you are getting your feet wet with node, being able to check this site out, do a “node services.js”, and have a real working site to look at, one where you could easily add or modify pages, well, it might be handy. Also, it gives me a chance to write down a bunch of links I found useful so I can refer back to them. For starters though, the code is on my github. (Yes, I should replumb the routes.)

I mentioned I used Nodester, so the first thing to check out is the Nodester page, which has a demo about having your app up and running in 1 minute. I hate those kinds of demos, but it is really freakin’ easy. Here’s another link for wiring up your domain with Nodester. This was something I wanted, and FYI it also works fine for subdomains. Now, I have to give a warning about Nodester. They’ve been having service problems lately (obligatory monitoring graph here), and while they are responsive on twitter, they aren’t proactive. If I were just doing occasional demos of my app for people, I’d still use them, but I needed the site to stay up, and I work at a company with massive hosting capabilities, so I did move the site. Sorry Nodester. I did leave a copy of the app running there though.
The site itself is written in node, yes, but makes use of 2 npm modules, specifically Express and Jade. (Minor note: I hit the “node env” error, in case you see it.) These seem to be the de facto web framework / stack for node stuff, and they work well enough. Here’s the link on wiring up Express apps on Nodester. I also made use of this Express tutorial from the guys at Nodetuts. I don’t think I actually watched the whole thing, but it was handy getting me over the hump on a couple of things. For the Jade stuff, I mostly used the docs and some googling (which tended to end with questions on Stack Overflow). To be honest, I was tempted to scrap Jade and just use straight HTML, but in the end Jade did seem efficient enough that it was worth the bother.
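
For anyone who wants the flavor of it without cloning the repo, here is a stripped-down sketch of what an Express + Jade setup like this looks like. This is not the actual services.js from my github; the routes, titles, and port are made up, and the exact calls differ a bit between Express versions, so treat it as an outline rather than the real thing.

    // services.js -- illustrative sketch only; the real site's routes and views differ.
    var express = require('express');
    var app = express();

    // Tell Express where the Jade templates live and which engine renders them.
    app.set('views', __dirname + '/views');
    app.set('view engine', 'jade');

    // Shared header/footer live in the Jade layout; each route just renders its page.
    app.get('/', function (req, res) {
      res.render('index', { title: 'PgDay Mini-Conference' });
    });

    app.get('/schedule', function (req, res) {
      res.render('schedule', { title: 'Schedule' });
    });

    // Serve static assets (the pre-existing Postgres graphics and css).
    app.use(express.static(__dirname + '/public'));

    app.listen(process.env.PORT || 3000);

Run it with “node services.js” and hit http://localhost:3000; adding a page is just another app.get() plus a Jade template, which is about all the repeatable header/footer problem really needed.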

Interest Free (Technical) Debt Is Risky

Earlier today I read a post from Javier Salado that asked the question “If the interest rate is 0%, do you want to pay back your debt?”. In this case Javier was referring to technical debt, but I felt like the conclusion he reached reflected the same misunderstanding that people apply to regular debt. Let me back up a bit. In Javier’s post, he lays out the following scenario:

“Imagine you convince a bank (not likely) to grant you a loan with 0% interest rate until the end of time, would you pay back? I wouldn’t. It’s free money. Who doesn’t like free money?”

He then goes on to apply this thinking to technical debt.

“You have an application with, let’s say, $1,000,000 measured technical debt. It was developed 10 years ago when your organization didn’t have a fixed quality model nor coding standards for the particular technologies involved, hence the debt. Overtime, the application has been steadily provided useful functionality to users and what they have to say about it is mainly good. You have adapted to your organization’s new quality process, the maintenance cost is reasonable and any changes you have to make have an expected time-to-market that allows business growth. We could say the interest rate on your debt is close to 0%, why should I invest in reducing the debt?”

I think the answer to both questions is yes, and he makes the same mistake a lot of people do when it comes to taking on debt (technical or otherwise). Calculating the cost of debt cannot be based on the interest rate alone; you must also factor in risk. In financial transactions, even a debt with 0% interest likely has some form of payment terms and collateral. (One might argue that Javier really meant a loan from a bank that was 0% interest, required no collateral, and had no terms for repayment. I’d argue that’s a gift, not a loan.) It turns out 0% interest loans aren’t just make believe. A simple, real-world example is the 0% interest car loan. While this looks great from an interest point of view, it’s not so good from a risk assessment point of view; if you get into an accident and the car is totaled, you still owe a bunch of money and no longer have the collateral to pay it off. It’s a double whammy if you figure you might also have to deal with fallout from the accident itself.

So the question is, does risk assessment carry over to the technical debt metaphor? I believe it does. In most cases technical debt comes from legacy code, which means the people who can work on it are mostly folks who have been around a long time. In most cases, rather than teach new people how to develop on the legacy system, you just have the “old timers” deal with it when needed. But of course this is risky, because as time goes by, you probably have fewer and fewer people who can serve in this role. You also have to be aware that, while you are carrying that large amount of managed technical debt, it’s always possible that some new, unforeseen event could change the dynamics of things. Perhaps a large client or market opens up to you, or some similar opportunity. Perhaps a merger with another company is proposed. You now have to re-evaluate your technical situation, and in many cases that technical debt may come back to bite you.
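
To put some (entirely made up) numbers on that intuition: even when the ongoing “interest” is tiny, a modest chance of being forced to pay the debt down under pressure can dominate the expected cost. Something like the following back-of-the-envelope math is how I’d frame it; the figures are hypothetical, not taken from Javier’s example.

    // Hypothetical figures only, to show why risk matters even at ~0% "interest".
    var carryingCostPerYear = 10000;   // small ongoing drag from working around the debt
    var eventProbability    = 0.10;    // chance per year of an event (merger, big client) forcing a rushed fix
    var forcedPaydownCost   = 500000;  // cost of paying the debt down under pressure

    var expectedAnnualCost = carryingCostPerYear + eventProbability * forcedPaydownCost;
    console.log('Expected annual cost of keeping the debt: $' + expectedAnnualCost);
    // Prints $60000 -- six times what the "interest" alone suggested.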

In the end, I don’t think Javier was way off base with his recommendation, which was essentially to follow Elizabeth Naramore’s “D.E.B.T.” system (pdf/slides) to measure your debt and then decide how and what needs to be paid off. But I think it’s important to remember that once you have identified your debt, even if the “interest” on that debt is low, it still represents risk within your organization (or your personal finances), and you would be best served to eliminate as much of it as you can.