Why the F&#% Doesn’t Postgres Have Hints?!?!

Once again someone has brought up the idea of having a hints system in Postgres, which means once again we’re all subjected to watching people trot out the same tired, faulty, and even self-contradictory reasons to justify the idea that Postgres doesn’t have, need, or even want a hinting system. As frail as the arguments might be, people are so entrenched in their positions now that even having a discussion on the topic is difficult. And in fact, there are really two discussions going on here: one is whether Postgres should have *any* type of hinting system, and the other is a more specific discussion on whether we should have a *query* hinting system.

Does Postgres Need Hinting?

I think anyone who argues against hinting in an absolute sense is approaching things from a standpoint of intellectual dishonesty. Even posts that purport to defend the no-hints position (like Josh’s recent post) eventually seem to acquiesce to the idea that some system for hinting would be OK; maybe not query hinting, but “selectivity” hints or “statistics” hints or some similar jargon. Now, it’s true, some people do take the overly high-minded position that even that level of effort is misguided and that the time would be better spent fixing these so-called bugs in the planner. While that sounds great *in theory*, I’ve found that if you watch the mailing lists long enough, you’ll find people who have reported optimizer bugs and never gotten a fix. Greg’s laundry list of work-arounds isn’t something new that he made up; it’s a collection built from watching people, year after year, run into performance problems and then be forced to come up with solutions because a planner improvement just isn’t going to be forthcoming. Don’t get me wrong; I’d love nothing more than to be able to send poor queries to the list and have code improvements flow out of that. But after ten years of me and thousands of others sending in exactly those types of emails, we still don’t have a planner good enough to have killed the hinting conversation. That tells me it’s an unachievable goal, and for those of us who need to get stuff done, we’d be better off with something more tangible.

Hinting Yes, But Certainly Not Query Hinting

Ok, so presuming you’re on board with the idea that hinting is necessary, even if not ideal, the question really should be framed as “what kind of hinting should we use?”. Most of the people involved in Postgres development who are vocal on this subject tend to be firmly against the idea of query hinting as implemented in all of the other major databases. I think the arguments against it break down into three different ideas, all of which are inter-related.

The first idea is that solving these problems at the query level is far too narrow a solution, and that we need a more general solution that can fix things without having to modify individual queries. While I think that would be great, I’ve often pointed out that the reason so many implementations have fallen back to query hinting is that the query is where performance problems actually manifest. In other words, if I have a table that gets 20 different queries run against it, I don’t necessarily want any special statistics changes or hinting mechanism at all, until the one query comes along that has performance issues. At that point I need something to improve that query, while being careful not to have any negative impact on the already existing queries. Tricks like modifying join_collapse_limit or random_page_cost might solve your specific query problem, but if you change them at a global level you might wreck the rest of your system too. This pushes us back to query-level hinting. Granted, you could argue that those knobs don’t require direct query modification, but it’s not like tricks of that type aren’t already in the system as well. Example? The highly touted “offset 0” optimization you’ll hear talked about is one. I admit, I’ve told people to use this very same trick to fix queries after they have upgraded from older Postgres versions and had their query performance break.
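For anyone who hasn’t seen the trick: putting OFFSET 0 in a subquery is a no-op for the results, but it stops the planner from flattening the subquery into the outer query, effectively acting as an optimization fence. A hypothetical sketch (the orders/customers schema here is made up for illustration):

```sql
-- Without the OFFSET 0, the planner is free to pull the subquery up and
-- reorder joins and filters however it estimates is cheapest; with it,
-- the subquery is planned as its own unit. Results are identical either way.
SELECT c.name, o.order_id
  FROM (SELECT order_id, customer_id
          FROM orders
         WHERE status = 'open'
        OFFSET 0) AS o
  JOIN customers c ON c.id = o.customer_id;
```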
On the one hand, this makes me look like a “performance guru” because I know this super double-secret-handshake optimizer trick; on the other hand, what kind of pompous jackass can stand there with a straight face and tell someone the “problem” with their query is that they aren’t using a syntax hack which was entirely unnecessary on the same queries, using the same database product, released two years earlier? Yes, that’s right. This isn’t just that it makes some queries slow; it makes queries that used to run perfectly fine totally nosedive on performance. (You might think that’s a bug, but let’s try not to digress.) Any query hinting system we had would ostensibly be no worse than the majority of work-arounds that are in play today, and IMHO it would likely be better understood than explaining to people the intricacies of the different enable_foo GUCs (which aren’t a good enough solution for problems within complex queries anyway).

The second idea against traditional query hinting is that the Postgres community can come up with a superior method of hinting, which would likely target data statistics rather than the query planner, and that efforts should be focused on that instead. This is again one of those ideas that sounds great *in theory*, and nowadays there’s even some fancy hand-waving about overriding the optimizer’s notion of column distinctness, or perhaps storing statistics on the correlation of data between multiple columns in a table, that makes it sound like there’s something to this line of thinking. The problem is that while these solutions do address some of the (relatively) simpler cases, there are whole classes of problems yet to be dealt with. Have you ever tried to do joins across aggregated subquery columns in Postgres?
Not only has Postgres struggled with these types of queries for years, the whole idea of improved statistics or selectivity hinting is pretty much irrelevant here, because no one has even come up with a plan for how such suggestions could be stored and fed into the optimizer. (Hint: it’s hard to store statistics on columns that don’t exist beyond the execution of a single statement.) (Side note: I do think you might be able to feed these in on a per-query basis, but then we’d be back to argument number one.) This isn’t to say it can’t be done; I do occasionally see academic papers involving Postgres and improved query performance pop up, but I can’t remember the last time I saw one cited in a Postgres commit, and again, we’ve been working on this problem for *years* and have yet to see a concrete proposal.

The last major argument against query-level hints is that because this solution ignores the underlying data, the fixes are at best temporary (since data can change over time) and will eventually bite you once that happens. Or, to put it a different way, query hints are nothing more than footguns which will eventually cause you pain. To that, all I can say is: DBAs hate being eaten by crocodiles. (OK, if that isn’t good enough for you, I’ll also point out that it’s a bit of a fallacy to think the database can be a better predictor of data changes than a DBA. The database can, at best, do statistical analysis after the fact; the DBA can do that as well, but can also know about potential data changes before they are implemented.)
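To make that concrete, here’s the general shape of query I mean (the users/events schema is hypothetical). The join and the filter both depend on cnt, an aggregate that only exists for the life of the statement, so there is no entry in pg_statistic for the planner to consult, and nowhere obvious to hang a statistics-style hint:

```sql
-- The planner has to guess both how many groups the subquery produces and
-- how selective "cnt > 100" is; neither estimate is backed by statistics,
-- since cnt is computed on the fly.
SELECT u.name, t.cnt
  FROM users u
  JOIN (SELECT user_id, count(*) AS cnt
          FROM events
         GROUP BY user_id) AS t
    ON t.user_id = u.id
 WHERE t.cnt > 100;
```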

Ok, Nice Rant, But You Know This Doesn’t Change Anything, Right?

Yes, I know. I know because I have had all of these discussions before. Hints are religious in Postgres, so you shouldn’t expect a rapid change here. On the other hand, Windows support used to be religious in Postgres. And you know what else? Replication used to be religious in Postgres. So, while you should never underestimate the stubbornness of the Postgres development group, it is possible for them to come around to ideas that were once considered verboten. I think it will probably be at least another five years before it happens, though; there isn’t enough overlap between the Oracle and Postgres communities for people to realize how many are unable to migrate to Postgres because of planner/optimizer deficiencies. But I do think that day will come.

A Present From My Filesystem

People ask me why I haven’t blogged much lately, and I generally tell them I’ve been busy fixing broken stuff. Then they tell me to blog about the broken stuff I’m fixing. So, ok. Here’s a random encounter from before the holidays…

There’s nothing like trying to log into a database server to do some simple schema work, only to have your login hang trying to load your session. After some brief poking around and attempts to log in with different accounts, I was able to see that we had a disk problem on the database. (The nail in the coffin: zpool status, the ZFS command which shows the condition of your drives, was hanging trying to show the last pool.) So, poke the SAs and see what they can dig up… and unfortunately things were bad enough that we were going to need a reboot. So, we reboot.

Upon restart, well, things don’t restart; where “things” means Postgres. SMF shows that we tried to start, but something has gone awry. Looking in the database logs, I see this:

2010-12-23 17:29:16 EST:[email protected](32789):782: FATAL: the database system is starting up
2010-12-23 17:29:17 EST:[email protected](58978):783: FATAL: the database system is starting up
2010-12-23 17:29:18 EST:@:494: FATAL: could not create relation 16385/58326/58757497: File exists
2010-12-23 17:29:18 EST:@:494: CONTEXT: xlog redo file create: 16385/58326/58757497
2010-12-23 17:29:18 EST:@:492: LOG: startup process (PID 494) exited with exit code 1
2010-12-23 17:29:18 EST:@:492: LOG: aborting startup due to startup process failure

Uh, that’s not good. And looking at the file, it’s even weirder. It’s owned (and grouped) by root, something that shouldn’t ever actually happen. There are a couple of ways to go with that; the one I thought would work was to copy the file off to a side directory (“borked” seemed like a good directory name for this), and then change the permissions so that postgres owned the file again. After that, fire it back up:

2010-12-23 17:46:07 EST:@:2790: FATAL: could not create relation 16385/58326/58757498: File exists
2010-12-23 17:46:07 EST:@:2790: CONTEXT: xlog redo file create: 16385/58326/58757498
2010-12-23 17:46:07 EST:@:2788: LOG: startup process (PID 2790) exited with exit code 1
2010-12-23 17:46:07 EST:@:2788: LOG: aborting startup due to startup process failure

Well duh. Ok, so look in the dir, and it turns out there are only three files like this, so fix them up the same way and see if it will work. A pg_ctl later, and now I see this in my logs:

2010-12-23 17:49:32 EST:@:3224: LOG: record with zero length at 3B02/E9565410
2010-12-23 17:49:32 EST:@:3224: LOG: redo done at 3B02/E95653E0
2010-12-23 17:49:32 EST:@:3224: LOG: last completed transaction was at log time 2010-12-23 17:17:01.558461-05
2010-12-23 17:49:32 EST:@:3224: LOG: checkpoint starting: shutdown immediate

zoinks… but the server doesn’t die and actually, I think it will work. And it does.

2010-12-23 17:53:10 EST:@:3224: LOG: checkpoint complete: wrote 131072 buffers (100.0%); 0 transaction log file(s) added, 0 removed, 23 recycled; write=215.734 s, sync=2.262 s, total=218.043 s
2010-12-23 17:53:10 EST:@:3222: LOG: database system is ready to accept connections
2010-12-23 17:53:10 EST:@:3682: LOG: autovacuum launcher started

Well, at least enough to get things working. But that’s really not the whole story; when something looks like it has gone wrong, you need to dig just a little further. First I want to check out what those three files were:

pgods=# select * from pg_class where relfilenode in (58757497, 58757498, 58757499);
        relname         | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl | reloptions
------------------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+------------
 ban_activity_log_snap1 | 17322 | 41561 | 10 | 0 | 58757497 | 16385 | 860 | 77570 | 41562 | 0 | f | f | r | 6 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 1281741311 | |
 pg_toast_41559 | 99 | 41563 | 10 | 0 | 58757498 | 16385 | 0 | 0 | 0 | 41564 | t | f | t | 3 | 0 | 0 | 0 | 0 | 0 | f | t | f | f | 1281741311 | |
 pg_toast_41559_index | 99 | 0 | 10 | 403 | 58757499 | 16385 | 1 | 0 | 0 | 0 | f | f | i | 2 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 0 | |

Well, the relation is totally harmless; it’s a replicated copy from a different database, so I can toss it out and rebuild it if needed, but I also need to see what the toast table and index are for.

pgods=# select * from pg_class where oid = 41559;
        relname         | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl | reloptions
------------------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+------------
 ban_activity_log_snap1 | 17322 | 41561 | 10 | 0 | 58757497 | 16385 | 860 | 77570 | 41562 | 0 | f | f | r | 6 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 1281741311 | |

Again, I’ve dodged a bullet. It looks like all of those files came from the same relation, so chances are we’re ok. But before I start this under SMF, I do a couple of checkpoints and xlog switches. A little bit superstitious, but I also want to avoid having to go back through the previous batch of xlogs if I can. (I also did a vacuum verbose on the postgres database in the cluster.)
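For the curious, that belt-and-suspenders step amounts to something like this (pg_switch_xlog() is the function name from the 8.x/9.0 era this box was running; later releases renamed it pg_switch_wal()):

```sql
-- Flush everything out, then force rotation onto a fresh xlog segment so a
-- future recovery never has to walk back through the suspect segments.
CHECKPOINT;
SELECT pg_switch_xlog();
```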

Now I know that some people might be uncomfortable running their database after something like that; they might want to do some kind of dump/restore, or a reload, or a backup. Realistically, this is a multi-TB database, and we can’t afford the downtime needed for those luxuries, so we’re going to keep pushing forward. It’d be nice if Postgres had better tools for filesystem-level verification of table and index files, but sometimes you have to make do. We’ve got some things working in our favor, like ZFS checksums on the files, and we’ll dig around a bit to see what else we can turn up. But for now, I consider it within the spirit of the season that things didn’t go worse.

BWPUG Meeting November 10th

Howdy folks! A reminder that this Wednesday (the 10th) is the BWPUG meeting for November. This month’s topic will be “Achieving PCI Compliance with Postgres”. The PCI-DSS standard was developed by the credit card companies to help protect cardholders from identity theft and enhance cardholder security. While the standard has a history going back almost 10 years, there are still many questions and challenges for people trying to stay compliant, especially within the open source arena.
When: November 10th, ~6:30PM.
Where: 7070 Samuel Morse Dr, Columbia, MD, 21042.
Host: OmniTI
As always we will have time for networking and some more open Q&A, and we’ll likely hit one of the local restaurants after the meeting.

Scalability, I Has It

On Monday I spoke at the Highload++ conference in Moscow, Russia. This is an annual conference put on by several local Russian tech folks that focuses on large-scale website architectures. The conference covers the full stack, which includes databases, which includes Postgres, which is where I get included in all of this. I gave a talk with a rather generic title, “Scaling with Postgres”, which incorporated both some thoughts on “tech culture” within organizations trying to scale (gleaned from years of working with these types of clients at OmniTI, plus quite a bit of mind-melding and discussion a few weeks ago at the Surge conference), mixed in with some hands-on pointers for those trying to do it with Postgres. I’ve put the slides up on Slideshare for those who want to view them, and I believe the Highload folks will have video up at some point. Alongside my talk were talks from fellow Postgres contributors Simon Riggs and Bruce Momjian, which focused on performance, upgrading, and replication. I thought all of the talks were good, although given some overlap in Simon’s and Bruce’s talks, I think it would have been cool to cut the talk times and maybe have a short panel for open Postgres Q&A with all three of us. (This is likely difficult from a practical perspective, given the need for translations, but it seems doable.) Instead, since Simon went first, he ended up taking the brunt of people’s questions; which was amusing for Bruce and me, but I think a panel would have been good. Maybe next time. Anyway, I want to thank the folks running Highload for inviting me to speak, and for working through the travel shenanigans. I found the conference very nicely run, and the growth they are getting in attendance is a good sign for the Russian tech community!
PS, someone at the conference had asked me about getting reconnoiter support; since I couldn’t get a card, I’ll just toss a link out here for him, or anyone else interested: http://labs.omniti.com/labs/reconnoiter/wiki/Support

BWPUG September Meeting 2010-09-08: PostgreSQL Security and SE-Postgres

A reminder that tomorrow night, BWPUG will be gathering for its September meeting. This month we’re rounding up the gang to discuss “PostgreSQL Security via SE-Postgres”, including the current TODOs, and what to focus on now that 9.0 is just around the corner and 9.1 development will be starting in earnest. We expect to have a good crowd on hand, including some potential out-of-towners, so we hope to see you there.
When: September 8th, ~6:30PM.
Where: 7070 Samuel Morse Dr, Columbia, MD, 21042.
Host: OmniTI
As always we’ll have time for networking and will likely hit one of the local restaurants after the meeting; hope to see you there.

Now What? (Wrt OpenSolaris and Your Database)

Last week’s “announcement” of the death of OpenSolaris has steered a lot of questions my way about where people should go, and/or where OmniTI will go, now that OpenSolaris’s future looks non-existent. As one of the more open users of Solaris-related technology, running some beefy loads on top of it, it makes sense that people would be curious about what we might do next. I’ll start by saying that as a company, we don’t have an official policy on this yet, and probably won’t. We evaluate each situation on a customer-by-customer basis, so what follows here is more my personal feeling on what people should do at this point in time. The one thing I have noticed from the people I’ve already spoken with is that there seem to be two major camps. This is an oversimplification to be sure, but I break it down into the free software camp (those motivated by a desire to remain on, and/or support, free software as a primary driver of technology decisions), and those more interested in the technology than the ideals behind it. Depending on where you fall in that spectrum, you have different options available to you, and will likely reach very different conclusions.

Too Soon?

The first thing I have said to everyone is that it is honestly too soon to make any moves. Oracle is notorious for being a poor communicator, and at this point I don’t think we’ve seen enough official communication to really know what’s going to happen. This doesn’t mean you can’t start planning though! We’ve been looking at some of the available options since before the Oracle/Sun merger closed, and it doesn’t hurt to start evaluating what’s out there. However, there’s no need to rush into things; it is possible that the announcement of OpenSolaris’s death is premature. I personally don’t believe Solaris can survive based on the model we’ve just seen laid out; there are too many people learning the GNU toolchain who won’t be willing to invest big money into a tool that is hard for them to use. They need a low-cost or free option to familiarize themselves on (and all the better if it installs GNU tools by default). There’s an outside chance Oracle might come to this conclusion, which would give new life to OpenSolaris. A more likely alternative to that theory is that some other group might pick up OpenSolaris maintenance and start pushing it forward. That’s certainly not an easy task, but there are already several different distributions of OpenSolaris available, so the userland-level management has the resources; we mostly need to figure out how to handle the more core technologies that have been maintained by Sun. I think this might also be possible: with numerous companies already heavily invested in OpenSolaris technology, and Solaris internals hackers looking to move out of Oracle, it’s not an impossible leap to think we might see something worked out. And if Oracle continues to make the technology available via the CDDL (which most of the current signs seem to indicate), this could work out.
I would say that this might not resemble the OpenSolaris as it is now, but could definitely be an option for current users who’d like to remain on the OpenSolaris platform.

Other Options?

Of course, you might not want to put all your eggs in that basket. So what other options do we have? Well, that mostly depends on what you’re getting out of OpenSolaris now, and what you want out of your OS going forward. For many people, I suspect that Solaris 11 Express might be a suitable replacement, especially for those running mixed OpenSolaris / Solaris environments. Migrating up to full Solaris 11 will also cover most of your technology needs, so depending on pricing, people may find that a cheaper alternative than migrating to a new platform. Of course, if you want to stick with a free software solution, this won’t really be an option. FreeBSD seems the most obvious alternative platform. If you’re currently taking advantage of DTrace, ZFS, and zones, FreeBSD gives you options to cover all three. It won’t be the same; the DTrace and ZFS implementations are pretty close, as I understand it, but for zones you’ll probably have to use either Jails or OpenVZ, neither of which I’m a fan of. I think you’d also find a larger overlap in system utilities (tar, find, grep, etc.) between FreeBSD and Solaris, so for people (and scripts) making the transition, this might be an easier move. The big question here is probably hardware support; if you can’t get FreeBSD running on your hardware, that’s likely to be a show-stopper, unless you can work a new hardware purchase into the transition :-) So, if you don’t want to go closed Solaris, and FreeBSD isn’t an option, that probably leaves you with Linux. People sometimes think I don’t like Linux; I’m actually very comfortable on it. My first “unix” was Linux, and we run some extremely demanding systems on Linux, where it has performed well. However, if you’re trying to do deep introspection, SystemTap is a poor man’s DTrace. And if you are relying on ZFS, you’ll have a hard time finding a suitable replacement amongst the current Linux options.
Personally I am most comfortable on ext3, but with it I tend to give up on filesystem snapshots, which is a painful concession if you have to make it. XFS is probably the next most common option, and generally I have no qualms about using it if you want to avoid ext3. Of the three “advanced” replacements (ext4, btrfs, and ZFS on Linux) I think ext4 is probably your best bet, but only because ZFS on Linux is too new for any serious database system, and if you are moving off OpenSolaris to get away from Oracle, “butter” seems like an odd choice.

And so…

I think it’s wise to keep things in perspective. There are some cases where you want to be a technology leader (we’ve been running Postgres 9 for months on some systems), but generally speaking, when it comes to picking the operating system and filesystem for your database, it’s best to tread lightly. Now is a fine time to start evaluating your options; at least figure out which features critical to your enterprise you’ll need to replace (and don’t just think about the database; you might be relying on Crossbow for something, or who knows what else). We’ll certainly be watching the available options, and I suspect diversifying a little, over the next 6 months as we wait for the picture to clear up where it can. We’re not in a hurry (after all, we do have the source code of what we’re running now), and I don’t see much reason for others to be either.

I Hardly Gnu, You?

This morning I was browsing some twitter, when I came across this tweet:
“Everyone who bitches about the GPL … I’d like them to remove all the GPL software from their computers, and see how they get on.”
Now, this is all a part of some brouhaha over Wordpress and its templates or something. I personally don’t care about any of that (I’ve long since moved past Wordpress, as I find them to be bad open source citizens), but it did get me thinking, as a person who does think the BSD license is a better open source license than the GPL: just how much of an inconvenience would this be? I certainly do rely on a bunch of GNU software that I’d just as soon not live without, like my Playstation, our DVR, and probably our router too. However, I don’t really have a choice in those things, and to be fair I don’t think there is anything particularly special about Linux that makes it better for a DVR than what you could do with BSD. But let’s consider things where I do have a choice, where most software developers have a choice, which is in the tools we work with and the things we work on. This is pretty broad, so let’s just consider the classic LAMP stack that most people work on. Apache is, of course, available under the Apache license, so it’s already in the clear. While I like Linux, I’ve long since replaced it in my life with a mix of OSX, *Solaris, and *BSD on the computers where I get to choose the OS. Yes, our eeepc does run Linux, but I could probably switch that to some BSD system if needed. Likewise, MySQL has always taken a backseat to Postgres or Oracle for me, and even the NoSQL fanboys have plenty of non-GPL options (Couch, Hadoop, Cassandra, Voldemort, etc.) to pick from. And finally, whether your P is Perl, PHP, or Python, all of those languages are available under non-GPL licenses. So, I guess the “LAMP” stack could go on. Surely there must be some things, though, right? I started to think about other tools I work with regularly, like X and vim, and while I’m sure some tools might be GPL, certainly many are not, and I’d guess that between Solaris and BSD, I could put together a GPL-free stack I’d be comfortable working on, with software that is already available.
This isn’t to say I wouldn’t miss anything. I love my recursive grep, find the -P argument for xargs amazing, and find BSD tar just crippling to work with. None of those would be as bad as losing screen, which is a must-have for any serious server work. Of course there are alternatives for all of these (not that I know anyone who uses tmux), and I’d bet some of these features could easily be re-implemented in a BSD version if needed. I think where this really gets you is in the software that has been built on top of these base tools, particularly in the area of PHP software development. A lot of people over the last 10 years have produced GPL-licensed software, like Drupal, PHP-Nuke, phpBB, phpMyAdmin, and more. I think most people didn’t really think about their license choices back then (perhaps not now either); “if it’s good enough for Linus”. It’s too bad; I know I’d much rather license phpPgAdmin under the BSD license than the GPL, but I think we’re pretty much stuck at this point. I have noticed some newer projects (Habari, for instance) have chosen non-GPL licenses; I don’t know if that is a trend or anything, but it wouldn’t hurt if it was.

BWPUG July Meeting 2010-07-14: PostgreSQL Performance Farm

A reminder note from Stephen Frost:
Hey all! This month BWPUG returns to DC/Falls Church!
When: July 14th, 6:30PM (this Wednesday!)
Where: 3150 Fairview Park Dr, Falls Church, VA
Host: Noblis, Inc. and the Noblis Innovation and Collaboration Center (NICC)
It’s basically at 495 and 50 in Northern Virginia. We’ll be discussing the work and progress made so far on the PostgreSQL Performance Farm scripts, done by Scott (an intern working for Noblis over the summer, from Virginia Tech). If you’re planning to attend, please RSVP to [email protected]. The host/facility we’ll be meeting at does require a government-issued photo ID (e.g. driver’s license or passport). Hope to see you there! Thanks, Stephen

PGCon 2010 Slides Are Up

They’ve actually been up for a few days now, but since I haven’t mentioned it, I guess no one noticed :-) This includes slides for both my “Know More Waiting / Postgres 9” talk and my “Check, Please! / Postgres Monitoring” talk. I’ve uploaded the slides to the conference website for download, but I think it needs a site cache rebuild or something; in the meantime you can view the slides on my slideshare page. As always, if you have any questions, please feel free to contact me.