What Defines a Guru?

I recently wrote an entry for OmniTI’s Seeds Blog. The article talks about the idea of pattern matching vs. troubleshooting, and how those ideas relate to what people consider a “Tech Guru”. I hope you’ll check it out.

New Blog Rising

Almost 5 years ago (really?) I moved my site from the old “planet postgresql” blog hosting service over to my own server, pointing to my own domain. This server was hosted in OmniTI’s Ashburn data center, and it’s been a good home, but we are in the final stages of a data center evacuation project, in which all of our sites, and our clients’ sites, are moving to either our new Ashburn facility or our Chicago data center.

My old blog ran on an elderly version of S9Y, a PHP based blogging engine that also supported PostgreSQL. Like all new things, I liked it when I started with it, and even now I’d recommend it to folks looking for a blogging system, but rather than trying to port or upgrade the old system, I thought I would take this as an opportunity to try something new. The code I am most likely to touch these days is probably either PHP or node.js, with perl the outside third, so I figured I would look for something else. I did want to limit myself to a pre-built blog engine; while I think running software in different languages is valuable, I know I don’t have time to commit to maintaining my own system from scratch. You’d think that a blog wouldn’t be that complicated, but it’s harder than you’d think. Anyway, I originally was hoping to find something in Go, but blog options were lacking, and as I understand it, Go support on Illumos is still lacking, so I decided to look elsewhere. In the end, I settled on Octopress/Jekyll. I have to admit, I was a bit resistant to go this route, because it almost seems cliche, but I figured that between the ruby and static page generation, it would be different enough to be interesting.

Some of the guys at OmniTI are doing a bunch of hacking on chef, but I can’t say that I’ve done any real ruby hacking in probably seven years. Still, Octopress is supposed to make everything easy, right?

Well, to be frank, I found the state of ruby’s development ecosystem to be a bit of a mess. I suppose that all the bits and pieces are meant to make things work better, but the need to juggle multiple ruby versions feels pretty painful. I’ll be honest, I tried to bypass it all by just using macports ruby, but it wasn’t up-to-date enough. Even when I started looking at the various Ruby environment managers, the version differences coupled with compiler issues on OSX made me question the whole thing, but eventually I ended up going with rbenv, and also making use of rbenv-gem-rehash to get everything to work correctly.

Once I was able to get a working ruby environment going, I proceeded to get a basic blog system up and going. That in and of itself required more wrangling than I expected, but I want to save that for another post, since this post is already getting long. The last thing I want to mention is that, although I originally had planned to move everything to one of our hosted OmniOS systems, in the end I decided to take a different route and host the system on Heroku. Once again, this was mainly in the spirit of learning something new. I know a bunch of the guys working on Heroku Postgres and I’ve always been interested in Heroku services in general, and this particular site seemed simple enough that hosting it on one of their free hosts seemed pretty doable. I do miss having tools like mod_rewrite or ATS to handle any black magic I might need, but for now things are running ok.

New Postgres Backup and Restore Book

A couple of months ago the folks at PACKT asked me if I could tech review one of their new books: PostgreSQL Backup and Restore How-to. What caught my eye about this was the idea behind the book: pick a single topic that is important to people using the software, and then cover the topic quickly and efficiently. Postgres is a really large piece of software, with a heck of a lot of moving parts, so it’s difficult to cover the entire thing in one book. This approach is one that I have been suggesting to publishers for a while, so I was happy to help PACKT with their attempt. The book itself covers a number of different options when it comes to Postgres backups, from pg_dump to how to make filesystem backups using PITR and the WAL system. If you’re working with Postgres and you have questions about the different options available for doing backups and/or restores, I encourage you to check it out.

phpPgAdmin 5.1 Released!

The phpPgAdmin Team is proud to announce the new major release of phpPgAdmin. Version 5.1 adds many new features, bug fixes and updated translations over the previous version. The version has been long overdue, and brings with it stable support for all current versions of PostgreSQL, including 9.1 and 9.2. In addition, there are also a fair number of bugs that have been fixed, including a bug that could lead to corrupted bytea data, so all users are strongly encouraged to upgrade. We appreciate the large number of people that use phpPgAdmin on a regular basis, and hope this new version will help make things even better!

Download

To download phpPgAdmin 5.1 right now, visit: http://phppgadmin.sourceforge.net/doku.php?id=download

Features

  • Full support for PostgreSQL 9.1 and 9.2
  • New plugin architecture, including addition of several new hooks (asleonardo, ioguix)
  • Support nested groups of servers (Julien Rouhaud & ioguix)
  • Expanded test coverage in Selenium test suite
  • Highlight referencing fields on hovering Foreign Key values when browsing tables (asleonardo)
  • Simplified translation system implementation (ioguix)
  • Don’t show cancel/kill options in process page to non-superusers
  • Add download ability from the History window (ioguix)
  • User queries now paginate by default

Translations

  • Lithuanian (artvras)

Bug Fixes

  • Fix several bugs with bytea support, including possible data corruption bugs when updating rows that have bytea fields
  • Numerous fixes for running under PHP Strict Standards
  • Fix an issue with autocompletion of text based Foreign Keys
  • Fix a bug when browsing tables with no unique key

Incompatibilities

  • phpPgAdmin core is now UTF-8 only
  • We have stopped testing against Postgres versions < 8.4, which are EOL

Regards, The phpPgAdmin Team

Join the 5%

In the next 48 hours, Americans all across the country (well, half of them anyway) will head to the polls to cast their votes for President. But what does it mean to have a vote that counts? In the year 2000, sitting in Florida, watching that election unfold, I think I have never been closer to having a vote that counted. For a true cynic, sure, my one vote would not have changed the election. However, with a margin of ~500 people, it wasn’t lost on me that I actually knew enough people that had we all voted together, it could have changed the entire election. You can’t get much closer to a vote that counts than that.

In 2004, the election was not nearly as close. With a margin in Florida of almost 400,000, I certainly didn’t know enough people to swing that one. After that I moved to Maryland, and any illusion of a vote that would change the outcome of an election completely disappeared; Maryland is a state that has voted Democrat by double-digit margins for years, with no signs of a change. Regardless of whether you are voting Republican or Democrat, the outcome here is fairly certain. Of course, Maryland is not alone.

The above graph lists the “likelihood your state will determine the presidency” (source). If you aren’t in one of those states, the truth is that your vote means very little to the outcome of who becomes president. This isn’t to say you shouldn’t vote; it never hurts to take part in the political process, and to be sure there are always a number of state level initiatives that are worth voting on. Some would look at that and say that for most people, voting for president doesn’t really matter. Normally I’d agree, but this year there is a chance that things could be different.

While I’ve no illusion that they will win the election, this year the Libertarian party has the chance to do something significant: obtain 5% of the popular vote. If that happens, they will be eligible to receive matching funds for 2016. While this isn’t significant to the two major parties (who have opted out of the program so as to not *limit* their fundraising), for a third party this would be a major milestone. If you’ve been dissatisfied with your party, or you live in a state where the outcome is solid, I’d urge you to join me in voting for Gary Johnson. Even if you don’t agree with all of their policies, you probably agree with some; but whether you do or not, the real issue here is that getting the Libertarian party to 5% also means getting a whole slew of issues up for discussion which are sorely lacking from the current two-party system we’re working under. That’s something that would count, and definitely something worth voting for.

Shoot the Automated Failure in the Head

This past week Github experienced their most significant service disruption of the year, and much of it came at the hands of an automated failover system they had designed to try to avoid disruptions. There are a number of different factors that made the situation as bad as it was, but the basic summary of what led to the problem looks like this:
  1. On Monday, they attempted a schema migration which led to a load spike.
  2. The high load triggered an automated failover to one of their MySQL slaves.
  3. Once failed over, the new master also experienced high load, and so the automated failover attempted to revert back.
  4. At this point, the ops team put the automated failover system into “maintenance mode”, to prevent further failover.

There’s actually more that goes wrong for them after this point, and I encourage you to read the full post on the Github blog, but I wanted to focus on the initial problems for a moment. Our database team at OmniTI is often asked about what type of process we normally recommend for dealing with failover situations, and we stand by our assessment that for most people, manual failover is typically the best option. It’s not that the idea of automated failover isn’t appealing, but the decisions involved can be very complex, and it’s hard to get that right in a scripted fashion. In their post, the Github team mentions that had a person been involved in the decision, neither of the failovers would have been done.

To be clear, manual failover should not mean a bunch of manual steps. I think many people get confused on this point. When you do need to fail over, you need that to happen as quickly, and as correctly, as possible. When we say “manual” failover, we mean that your goal should be to have the decision to fail over be manual, but the process to be as scripted and automated as possible.

Another key factor in setting up scripted failover systems, and one that we see forgotten time and time again, is ye olde STONITH (“shoot the other node in the head”) concept. While it’s not 100% clear from the description in the Github post, it seems that not only did their system allow automated failover, but it was also allowed to do automated fail-back. Just like any decision to fail over needs to be manual, I always like to have at least one manual step involved after failover that is needed to reset the system as “back to normal”. This is extremely useful because it can act as a clear sign for your ops team that everyone agrees things are back to normal. Before that happens, your scripted failover solution should be unable to perform; why allow failover back to a machine that you’ve not agreed is ready to go back into service? I know none of this sounds particularly sexy, but it’s battle tested and it works.
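
To make the “manual decision, scripted process” idea a little more concrete, here is a minimal Python sketch. It is not Github’s system or anything we actually run; the `FailoverController` class, its method names, and the host names are all hypothetical, and the fencing step is only a stand-in for a real STONITH action. It shows just the two guards: a named person has to approve the failover, and no further failover is possible until someone manually resets the system as “back to normal”.

```python
from dataclasses import dataclass, field


class FailoverBlocked(Exception):
    """Raised when a failover or reset request is refused."""


@dataclass
class FailoverController:
    # Hypothetical illustration only; names and structure are assumptions.
    primary: str
    standby: str
    armed: bool = True  # flips to False after a failover until a human resets
    fenced: set = field(default_factory=set)  # hosts barred from service
    log: list = field(default_factory=list)

    def failover(self, approved_by: str) -> str:
        """Run the scripted failover, but only on a named person's say-so."""
        if not approved_by:
            raise FailoverBlocked("a person must make the failover decision")
        if not self.armed:
            raise FailoverBlocked("not reset to 'back to normal' since last failover")
        old_primary = self.primary
        # Stand-in for STONITH: a real script would power off or isolate the host.
        self.fenced.add(old_primary)
        self.primary, self.standby = self.standby, old_primary
        self.armed = False
        self.log.append(f"{approved_by}: failover {old_primary} -> {self.primary}")
        return self.primary

    def reset_to_normal(self, approved_by: str) -> None:
        """The manual step that clears the old primary to serve as a standby."""
        if not approved_by:
            raise FailoverBlocked("a person must sign off on the reset")
        self.fenced.discard(self.standby)
        self.armed = True
        self.log.append(f"{approved_by}: reset, {self.standby} cleared as standby")
```

With this in place, a second call to `failover()` right after the first raises `FailoverBlocked`, which is the behavior that would have stopped an automated fail-back; only after `reset_to_normal()` does the old primary become a valid failover target again.
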

And if you really don’t think you can wait for a human to intervene, build your systems for fault tolerance, not failover; just be warned that it is more expensive, complicated, and time consuming to implement (and the current open source options leave a lot to be desired in the options available to you).

Wondering about ways to help ensure availability in your environment? I’ll be speaking at Velocity Europe the first week of October, talking about “Managing Databases in a DevOps Environment”; if you’re going to be attending I’d love to swap war stories. And yes, that’s the week after Surge, which is war story nirvana; if you haven’t gotten tickets for one of these events, there’s still time left; I hope to see you there.

Contents of an Office

Today is moving day at OmniTI. We’re moving to our new offices in Maple Lawn. They are pretty kick ass. Of course, moving means packing up all of your things and taking them to the new place, or perhaps throwing a bunch of them out. When I first came to OmniTI, I sat at a desk next to Wez Furlong for about 2 weeks. I’ve had, I think, 6 different desks since then, and now reside in an office. During all that moving, I tried to consolidate; I’m not sure I succeeded. While cleaning and packing, I decided to write down all of the stuff I had collected in my office; I’ve thrown almost all of it away, so from now on it can live on the internets for posterity.

empties:
  • root beer bottles
    • hanks (philadelphia, pa) (2 cases, plus a spare)
    • appalachian brewing company (harrisburg, pa) (1 case, plus a few spares)
    • jack black’s dead red root beer
    • mccutcheon’s (frederick, MD)
    • old soaker (atlantic brewing company, bar harbor, maine)
    • aj stephens (boston)
    • 1 abita brewing root beer cap
  • scotch bottles
    • auchentoshan three wood
    • copper fox rye whisky (bottled 2011-05-05)
    • balvenie 12
    • glenlivet 12
    • grangestone 12 (two bottles)
    • glenkinchie 12
    • bunnahabhain 12
    • chivas regal 12
    • willett reserved whisky kentucky bourbon
  • 1 empty can of frescolita

non-empties:
  • 1 bottle of scaldis noel
  • 1 2 liter bottle of pennsylvania dutch birch beer
  • 1 16oz bottle of pennsylvania dutch birch beer
  • books
    • scalable internet architectures (2 copies)
    • beginning php and postgresql 8
    • version control by example
    • mysql tutorial
    • mysql database design and tuning
    • unix power tools
    • perl best practices
    • beautiful data
    • head first php & mysql
    • sterling’s gold
  • hats
    • omniti
    • surge 2011
    • opensolaris
  • half-dozen or more conference badges
  • one bottle chipotle tabasco
  • 3 boxes of old business cards (3 different designs)
  • 1 gift bag from client (and friend)
  • 3 MTA cards from NYC
  • 1 container of jellybeans from Truviso (thanks Greg!)
  • a blues clues sticker from my daughter
  • oscon data elephant sticker
  • busch gardens elephant fact sticker
  • codango php elephant squeeze toy
  • 1 printed photo of gier magnessun
  • surge postcards
  • 1 copy of Communications of the ACM
  • 1 menu from pudgies
  • 1 postgresql banner
  • real estate brochures of about 2 dozen area buildings
  • 2 sharpies
  • 1 highlighter
  • 1 omniti pen
  • 1 hilton garden inn pen
  • 1 pewter elephant bookend
  • 1 ceramic statue of Apsara
  • 1 plastic balancing jet fighter toy
  • 1 organic fruit sticker
  • several thank you cards from tech friends
  • an old contract proposal, full of highlighted issues
  • 1 worksheet on goal driven performance optimization from percona
  • 1 old sticky brain lying on the floor
  • a billing breakdown for one of our long time customers
  • 1 screw
  • 1 allen wrench
  • 2 whiteboard markers
  • 1 whiteboard eraser
  • 1 business card for “gas station tacos”
  • countless ERDs for PE2
  • schema layout for podshow databases
  • several resumes, mostly from people we didn’t hire (sorry Jiraporn)
  • fax information for my daughter’s pre-school (she’s in 4th now)
  • 1 random screw
  • 1 paper clip

What Todd Akin Can Teach Us About DevOps

By now I’m sure most of you have heard the story of Todd Akin, and his comments on “legitimate” rape; they’ve been hard to avoid. Or at least, the backlash against those comments was hard to avoid. Most people (well, most in my circles) expressed some form of outrage, exasperation, or utter dismissal towards the comments and the man who made them. This is, of course, the nature of political discourse in America; we tend to vilify those who say things we don’t understand or find offensive first, and then demonize them later.

When I first heard the quote my reaction was not that this guy was some ass-clown who just hates women; I thought, “What does he mean, ‘legitimate’ rape? And where is he getting his information?” Yes, I understand; my reaction probably disappoints a lot of people, and probably makes others’ heads explode.

I find that most people try to do the right thing. Of course, what you think the right thing is depends a lot on the information you’ve come to believe. If I said that I was basing my beliefs off of what doctors say, I think most people would be ok with that. In this case, he said that doctors had told him these things. So to me, the problem isn’t with the conclusions he reached(1), it’s with the way he got there. And this was what was more frustrating; no one was stopping to question the source material.

Well, no one until I happened to see the Anderson Cooper show. Here’s a good write up on their episode where they actually attempt to track the statement to the source, and they find a doctor who has written and lectured on the information that Akin was referring to. They of course then brought in their own doctor to counter those claims, and they made some inferences into the reason for the false information. (Yes, I know, actual journalism, hard to imagine.) For anyone who thought there might be something to Akin’s comments, watching that episode should have put a lot of those thoughts to rest.

So what the hell does this have to do with DevOps, you might be asking? Well, one of the tenets of DevOps culture that we try to employ, and that I have seen inside of really successful DevOps shops, is the idea of blameless post mortems. In practical terms this means that when something goes wrong, you work to find out the cause of the problem, not to assign blame to any particular person, but to figure out how to make improvements. One of the reasons for this goes back to what I said earlier: people try to do the right thing. Whether you are an SA or a Web Dev or whatever, your goal is not to crash the site, and if your actions caused that, we start with the idea that it wasn’t your intention, but that some piece of information caused you to think what you did was ok.

Why did you do the thing you did? Why did you believe it was safe and/or a good idea? As a technical manager or leader within an organization, answering these questions is critical to your success, because chances are that you have also played a part in the failure, because you did not adequately prepare the person for the mission they were about to embark on. Yes, you can blame the person, call them the ass-clown, even get rid of them, but chances are if they thought they were acting on good information, someone else has probably heard similar information, and they are getting ready to make the same bad decisions.

So the next time you see someone do something, or say something, that seems boneheadedly wrong, before you start castigating them, take a brief moment to find out why they did what they did, and what information they were relying on that caused them to act as they did. Then, rather than persecute the person, persecute the poor information; make sure everyone you think might be working under incorrect pretenses gets the opportunity to hear the real situation. If you’re lucky, your “bad actor” might even become a champion for your cause. OK, perhaps not in politics, but I have seen that happen in technical shops, and when it does, it’s awesome.

ADDENDUM: This morning my son missed his bus. It was his first day of middle school. We went to the bus stop at the time we saw in the paper and posted at his school during orientation. He was understandably upset by this, and with new school nerves in a bundle, was feeling quite angry. At first he blamed himself and was worried that his teachers would be mad at him. After we explained that wouldn’t be the case, he then got angry at the bus driver for not showing up at the right time. We again told him that he shouldn’t be so upset, but he wasn’t having it. I then explained to him the concept of the blameless post mortem: that we didn’t really know what went wrong; we showed up when we thought we were supposed to, and it was possible the bus driver showed up when she thought she was supposed to, or maybe the bus didn’t show up at all (my older son’s bus broke down this morning, and he had to catch a ride). The point for us now was to figure out what the right time for the bus was, make sure it got communicated to all parties, and make sure we made the bus tomorrow.

(1) OK, yes, I have a problem with the conclusion, but I don’t think it’s the problem people should be focusing on.

Root Cause of Success

Like most companies, we do root cause analysis when things go wrong. “Root cause” is a bit of a misnomer; we deal with complex systems, usually with different levels of redundancy, so having a single root cause is usually not really realistic; really they are more like post mortems. In any case, when we have an incident, it’s important to review what went wrong; gathering logs, graphs, and other data; to try to learn why the assumptions we made did not manifest as we thought, and to determine what changes we might need to make for the future. This cycle of review and learning is critical for continued success.

This past weekend, the OmniTI operations folks went through a number of significant production excursions, most of which were pulled off with good success. After which, we didn’t do a post mortem. This probably isn’t too different from most shops; I think most people don’t do a post mortem when things work. We probably should. Even when things work, there are usually surprises along the way, and if you only decide to do an in-depth look back when things fail, you’re probably overlooking use cases and scenarios you are likely to encounter again. Additionally, it’s good information for people to be able to review, especially when bringing on new hires. You might think this would be boring, but I happen to love reading a well written post mortem. You probably do too; you just don’t think of something like Apollo 13 as a giant post mortem, but for the most part that’s what it is.

So I’m curious, are there shops where people do regular detailed accounting when things go right? Not just having audit trail information around, but walking through those logs as a group and talking out loud about the areas that were more hope than plan, but that everyone feels confident in since it worked. I know a lot of different people running web operations, but this doesn’t seem like a common practice; if you’ve worked in such an environment, I’d love to hear about your experiences.

Slides for Big Bad Upgraded Postgres Talk

Howdy folks! I finally got the slides up for the “Big Bad Upgraded Postgres” talk which I gave at PGCon 2012 (and previously at PGDay DC). The talk walks through a multi-terabyte database upgrade project, and discusses many of the problems and delays we encountered, both technical and non-technical. I think the slides stand up pretty well by themselves, but you can also find some additional info on my co-worker Keith’s blog, where he has also chronicled some of the fun times we’ve had along the way. He also has some posts on benefits we’ve seen since upgrading. Anyway, slides are on my slideshare page; please have a look.