Sunday, November 4. 2012

Join The 5%

In the next 48 hours, Americans all across the country (well, half of them anyway) will head to the polls to cast their votes for President. But what does it mean to have a vote that counts? In the year 2000, sitting in Florida, watching that election unfold, I don't think I have ever been closer to having a vote that counted. For a true cynic, sure, my one vote would not have changed the election. However, with a margin of ~500 votes, it wasn't lost on me that I actually knew enough people that, had we all voted together, it could have changed the entire election. You can't get much closer to a vote that counts than that.

In 2004, the election was not nearly as close. With a margin in Florida of almost 400,000, I certainly didn't know enough people to swing that one. After that I moved to Maryland, and any illusion of a vote that would change the outcome of an election completely disappeared; Maryland is a state that has voted Democrat by double digit margins for years, with no signs of a change. Regardless of whether you are voting Republican or Democrat, the outcome here is fairly certain.

Of course, Maryland is not alone. The above graph lists the "likelihood your state will determine the presidency" (source). If you aren't in one of those states, the truth is that your vote means very little to the outcome of who becomes president. This isn't to say you shouldn't vote; it never hurts to take part in the political process, and to be sure there are always a number of state level initiatives that are worth voting on.

Some would look at that and say that for most people, voting for president doesn't really matter. Normally I'd agree, but this year there is a chance that things could be different. While I've no illusion that they will win the election, this year the Libertarian party has the chance to do something significant: obtain 5% of the popular vote. If that happens, they will be eligible to receive matching funds for 2016. While this isn't significant to the two majority parties (who have opted out of the program so as to not limit their fundraising), for a third party this would be a major milestone.

If you've been dissatisfied with your party, or you live in a state where the outcome is solid, I'd urge you to join me in voting for Gary Johnson. Even if you don't agree with all of the Libertarian party's policies, you probably agree with some; but whether you do or not, the real issue here is that getting the Libertarian party to 5% also means getting a whole slew of issues up for discussion which are sorely lacking from the current two-party system we're working under. That's something that would count, and definitely something worth voting for.

Monday, September 17. 2012

Shoot The Automated Failure In The Head
This past week Github experienced their most significant service disruption of the year, and much of it came at the hands of an automated failover system they had designed to try to avoid disruptions. There are a number of different factors that made the situation as bad as it was, but the basic summary of what led to the problem looks like this:
There's actually more that goes wrong for them after this point, and I encourage you to read the full post on the Github blog, but I wanted to focus on the initial problems for a moment. Our database team at OmniTI is often asked what type of process we normally recommend for dealing with failover situations, and we stand by our assessment that for most people, manual failover is typically the best option. It's not that the idea of automated failover isn't appealing, but the decisions involved can be very complex, and it's hard to get that right in a scripted fashion. In their post, the Github team mentions that had a person been involved in the decision, neither of the failovers would have been done.

To be clear, manual failover should not mean a bunch of manual steps. I think many people get confused on this point. When you do need to fail over, you need that to happen as quickly, and as correctly, as possible. When we say "manual" failover, we mean that your goal should be to have the decision to fail over be manual, but the process itself should be as scripted and automated as possible.

Another key factor in setting up scripted failover systems, and one that we see forgotten time and time again, is ye olde STONITH concept. While it's not 100% clear from the description in the Github post, it seems that not only did their system allow automated failover, but it was also allowed to do automated fail-back. Just as any decision to fail over needs to be manual, I always like to have at least one manual step after failover that is needed to declare the system "back to normal". This is extremely useful because it acts as a clear sign for your ops team that everyone agrees things are back to normal. Before that happens, your scripted failover solution should refuse to perform; why allow failover back to a machine that you've not agreed is ready to go back into service?

I know none of this sounds particularly sexy, but it's battle tested and it works. And if you really don't think you can wait for a human to intervene, build your systems for fault tolerance, not failover; just be warned that it is more expensive, complicated, and time consuming to implement (and the current open source options leave a lot to be desired).

Wondering about ways to help ensure availability in your environment? I'll be speaking at Velocity Europe the first week of October, talking about "Managing Databases in a DevOps Environment"; if you're going to be attending, I'd love to swap war stories. And yes, that's the week after Surge, which is war story nirvana; if you haven't gotten tickets for one of these events, there's still time left; I hope to see you there.
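To make the "manual decision, scripted execution" idea a bit more concrete, here's a minimal sketch of what such a wrapper might look like. This is illustrative only, not anything we actually run: the promote and DNS commands, and the marker file path, are hypothetical stand-ins for whatever your environment uses. The point is the two gates described above: a human has to explicitly ask for the failover, and failing back is impossible until someone manually removes the "not back to normal yet" marker.

// failover.js - scripted failover with a manual decision gate (illustrative sketch only)
const fs = require('fs');
const { execSync } = require('child_process');

const MARKER = '/var/run/failover-pending-review';   // hypothetical "not signed off yet" marker

const cmd = process.argv[2];

if (cmd === 'promote') {
  // A human must explicitly confirm; no monitoring system should ever call this on its own.
  if (process.argv[3] !== '--yes-i-mean-it') {
    console.error('Refusing to fail over without explicit operator confirmation.');
    process.exit(1);
  }
  // The steps themselves are fully scripted so they run quickly and consistently.
  execSync('repmgr standby promote');                 // hypothetical promote command
  execSync('update-dns --target new-primary');        // hypothetical traffic switch
  fs.writeFileSync(MARKER, new Date().toISOString() + '\n');
  console.log('Failover complete; remove ' + MARKER + ' once everyone agrees the old primary is ready again.');
} else if (cmd === 'failback') {
  // The STONITH-flavored gate: fail-back is blocked until a human has signed off.
  if (fs.existsSync(MARKER)) {
    console.error('Previous failover has not been signed off on; not failing back.');
    process.exit(1);
  }
  execSync('update-dns --target original-primary');   // hypothetical traffic switch
  console.log('Failed back to the original primary.');
} else {
  console.error('usage: node failover.js promote --yes-i-mean-it | failback');
  process.exit(1);
}

None of the individual commands matter much; what matters is that the decision points stay with people while the mechanics stay in the script.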
Wednesday, August 29. 2012

Contents of an Office

Today is moving day at OmniTI. We're moving to our new offices in Maple Lawn. They are pretty kick ass. Of course, moving means packing up all of your things and taking them to the new place, or perhaps throwing a bunch of them out. When I first came to OmniTI, I sat at a desk next to Wez Furlong for about 2 weeks. I've had, I think, 6 different desks since then, and now reside in an office. During all that moving, I tried to consolidate; I'm not sure I succeeded. While cleaning and packing, I decided to write down all of the stuff I had collected in my office; I've thrown almost all of it away, so from now on it can live on the internets for posterity.
empties:
Monday, August 27. 2012

What Todd Akin Can Teach Us About DevOps
By now I'm sure most of you have heard the story of Todd Akin and his comments on "legitimate" rape; they've been hard to avoid. Or at least, the backlash against those comments was hard to avoid. Most people (well, most in my circles) expressed some form of outrage, exasperation, or utter dismissal towards the comments and the man who made them. This is, of course, the nature of political discourse in America; we tend to vilify those who say things we don't understand or find offensive first, and then demonize them later. When I first heard the quote, my reaction was not that this guy was some ass-clown who just hates women; I thought "What does he mean, 'legitimate' rape? And where is he getting his information?" Yes, I understand; my reaction probably disappoints a lot of people, and probably makes other people's heads explode.
I find that most people try to do the right thing. Of course, what you think the right thing is depends a lot on the information you've come to believe. If I said that I was basing my beliefs off of what doctors say, I think most people would be ok with that. In this case, he said that doctors had told him these things. So to me, my problem isn't with the conclusions he reached (1), it's with the way he got there. And this was what was more frustrating; no one was stopping to question the source material. Well, no one until I happened to see the Anderson Cooper show. Here's a good write up on their episode where they actually attempt to track the statement to its source, and they find a doctor who has written and lectured on the information that Akin was referring to. They of course then brought in their own doctor to counter those claims, and they made some inferences into the reason for the false information. (Yes, I know, actual journalism, hard to imagine). For anyone who thought there might be something to Akin's comments, watching that episode should have put a lot of those thoughts to rest.

So what the hell does this have to do with DevOps, you might be asking? Well, one of the tenets of DevOps culture that we try to employ, and that I have seen inside of really successful DevOps shops, is the idea of blameless post mortems. In practical terms this means that when something goes wrong, you work to find out the cause of the problem, not to assign blame to any particular person, but instead to figure out how to make improvements. One of the reasons for this goes back to what I said earlier; people try to do the right thing. Whether you are an SA or a Web Dev or whatever, your goal is not to crash the site, and if your actions caused that, we start with the idea that it wasn't your intention, but that some piece of information caused you to think it was ok. Why did you do the thing you did? Why did you believe it was safe and/or a good idea? As a technical manager or leader within an organization, answering these questions is critical to your success, because chances are that you have also played a part in the failure, because you did not adequately prepare the person for the mission they were about to embark on. Yes, you can blame the person, call them the ass-clown, even get rid of them, but chances are that if they thought they were acting on good information, someone else has probably heard similar information, and they are getting ready to make the same bad decisions.

So the next time you see someone do something, or say something, that seems boneheadedly wrong, before you start castigating them, take a brief moment to find out why they did what they did, and what information they were relying on that caused them to act as they did. Then, rather than persecute the person, persecute the poor information; make sure everyone you think might be working under incorrect pretenses gets the opportunity to hear the real situation. If you're lucky, your "bad actor" might even become a champion for your cause. OK, perhaps not in politics, but I have seen that happen in technical shops, and when it does, it's awesome.

ADDENDUM: This morning my son missed his bus. It was his first day of middle school. We went to the bus stop at the time we saw in the paper and posted at his school during orientation. He was understandably upset by this, and with new school nerves in a bundle, was feeling quite angry. At first he blamed himself and was worried that his teachers would be mad at him.
After we explained that wouldn't be the case, he got angry at the bus driver for not showing up at the right time. We again told him that he shouldn't be so upset, but he wasn't having it. I then explained to him the concept of the blameless post mortem; that we didn't really know what went wrong; we showed up when we thought we were supposed to, and it was possible the bus driver showed up when she thought she was supposed to, or maybe the bus didn't show up at all (my older son's bus broke down this morning, and he had to catch a ride). The point for us now was to figure out what the right time for the bus was, make sure it got communicated to all parties, and make sure we made the bus tomorrow.

(1) OK, yes, I have a problem with the conclusion, but I don't think it's the problem people should be focusing on.

Thursday, June 21. 2012

Root Cause of Success
Like most companies, we do root cause analysis when things go wrong. "Root cause" is a bit of a misnomer; we deal with complex systems, usually with different levels of redundancy, so having a single root cause is usually not realistic; really these are more like post mortems. In any case, when we have an incident, it's important to review what went wrong, gathering logs, graphs, and other data, to try to learn why the assumptions we made did not manifest as we thought, and to determine what changes we might need to make for the future. This cycle of review and learning is critical for continued success.
This past weekend, the OmniTI operations folks went through a number of significant production excursions, most of which were pulled off with good success. Afterwards, we didn't do a post mortem. This probably isn't too different from most shops; I think most people don't do a post mortem when things work. We probably should. Even when things work, there are usually surprises along the way, and if you only do an in-depth look back when things fail, you're probably overlooking use cases and scenarios you are likely to encounter again. Additionally, it's good information for people to be able to review, especially when bringing on new hires. You might think this would be boring, but I happen to love reading a well written post mortem. You probably do too, you just don't think of something like Apollo 13 as a giant post mortem, but for the most part that's what it is.

So I'm curious, are there shops where people do a regular detailed accounting when things go right? Not just having audit trail information around, but walking through those logs as a group and talking out loud about the areas that were more hope than plan, but that everyone now feels confident in because it worked. I know a lot of different people running web operations, but this doesn't seem like a common practice; if you've worked in such an environment, I'd love to hear about your experiences.

Monday, May 28. 2012

Slides for Big Bad Upgraded Postgres Talk
Howdy folks! I finally got the slides up for the "Big Bad `Upgraded` Postgres" talk which I gave at PGCon 2012 (and previously at PGDay DC). The talk walks through a multi-terabyte database upgrade project, and discusses many of the problems and delays we encountered, both technical and non-technical. I think the slides stand up pretty well by themselves, but you can also find some additional info on my co-worker Keith's blog, where he has chronicled some of the fun times we've had along the way. He also has some posts on benefits we've seen since upgrading. Anyway, the slides are on my slideshare page, please have a look.
Tuesday, February 21. 2012

I built a node site
Two weekends ago I was in need of a website. The local Postgres Users Group is putting on a 1 day mini-conference (featuring some of the best speakers you can get, I might add; you should probably go) and we wanted to put up a site with information on the conference. We didn't need anything fancy, just some static pages with some basic info. We also don't really have any money, so I wanted something simple that I could toss on-line and have hosted for free, with the caveat that I wanted something I could code (ie. not a wysiwyg template thing) because I have some predefined Postgres related graphics and css type stuff I wanted to re-use.
After browsing around a little I ran across an interesting service that I almost used called Static Cloud, which is designed to store html, css, and javascript files on-line. This seemed fine for such a simple site, but when I started tossing together the html, I realized I did have some content that I wanted to repeat across pages (header, footer type stuff). There's probably a way to do this, but it took me out of my comfort zone, so I decided I should use a scripting language to do my dirty work. I looked at the various PHP, Ruby, and Python offerings, but sadly nothing seemed to fit what I wanted, mainly on account of them not being free. Then I stumbled upon Nodester. Nodester is a node.js based hosting service, which allows you to host node based apps on their servers for free. How friendly!

Now, I've looked at node before, probably 6+ months ago, and thought it was interesting, but didn't really have too much use for it at the time. Since then OmniTI has used it for a couple of projects, including one recent project (still ongoing actually) where we built a hefty section of the back-end for a large, asynchronous, services system. And we did it in node.js. So, having seen some of that work, I thought why not give node.js another go around.

So, I built a site. It's not fancy. It's half a dozen pages that don't need to do much. Some files get processed, some pages get displayed. I mostly mention it here because when I started putting it together, I couldn't actually find anything like this: a complete site that was more than just the most trivial example of how to plumb things together. This doesn't go much beyond that, but if you are getting your feet wet with node, I think being able to check this site out and just do a "node services.js" and have a real working site to look at, one where you could easily add or modify pages, well, it might be handy. Also, it gives me a chance to write down a bunch of links I found useful so I can refer back to them. For starters though, the code is on my github. (Yes, I should replumb the routes)

I mentioned I used Nodester, so the first thing to check out is the Nodester page, which has a demo about having your app up and running in 1 minute. I hate those kinds of demos, but it is really freakin' easy. Here's another link for wiring up your domain with Nodester. This was something I wanted, and fyi it also works fine for subdomains. Now, I have to give a warning about Nodester. They've been having service problems lately (obligatory monitoring graph here), and while they are responsive on twitter, they aren't proactive. If I were just doing occasional demos of my app for people, I'd still use them, but I needed the site to stay up, and I work at a company with massive hosting capabilities, so I did move the site. Sorry Nodester. I did leave a copy of the app running there though.

The site itself is written in node, yes, but makes use of 2 npm modules, specifically Express and Jade. (Minor note, I hit the "node env" error, in case you see it). These seem to be the defacto web framework / stack for node stuff, and they work well enough. Here's the link on wiring up Express apps on Nodester. I also made use of this Express Tutorial from the guys at Nodetuts. I don't think I actually watched the whole thing, but it was handy getting me over the hump on a couple things. For the Jade stuff, I mostly used the docs and some googling (which tended to end with questions on stack overflow).
To be honest, I was tempted to scrap Jade and just use straight HTML, but in the end Jade did seem efficient enough that it was worth the bother.
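For anyone who wants to see the shape of the thing without digging through the repo, here's a rough sketch of what a services.js along these lines might look like. To be clear, this is not the actual code from my github; the page names and port handling are made up for illustration, it's just the basic Express + Jade plumbing pattern.

// services.js - minimal Express + Jade site (illustrative sketch, not the real conference site code)
// assumes: npm install express jade
const express = require('express');
const app = express();

app.set('views', __dirname + '/views');
app.set('view engine', 'jade');   // each page is a .jade template in ./views

// static assets (the pre-existing Postgres css and graphics) get served as-is
app.use(express.static(__dirname + '/public'));

// hypothetical page names; adding a page means adding a name here and a views/<name>.jade file
const pages = ['index', 'speakers', 'schedule', 'location'];

pages.forEach(function (page) {
  app.get(page === 'index' ? '/' : '/' + page, function (req, res) {
    res.render(page, { title: 'PGDay Mini-Conference', page: page });
  });
});

app.listen(process.env.PORT || 8080);

In a setup like this, the shared header and footer would live in a layout.jade that each page template extends, which is what takes care of the repeatable content mentioned above.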
Hi! I'm Robert Treat, COO of OmniTI, perhaps the best internet technology consulting company on the planet. A veteran open source developer and advocate, I have been recognized as a major contributor to the PostgreSQL project, and can often be found speaking on open source, databases, and large scale web operations.

Upcoming Events

PGDay NYC, March 20th, New York City
PGCon 2013, May 21st - May 22nd, Ottawa, Canada
Surge 2013, September 12th - 13th, Washington, D.C.
Postgres Open, September 16th - 18th, Chicago, Illinois