A Present From My Filesystem

People ask me why I haven’t blogged much lately and I generally tell them I’ve been busy fixing broken stuff. Then they tell me to blog about the broken stuff I’m fixing. So, ok. Here’s a random encounter from before the holidays….

There’s nothing like trying to log into a database server to do some simple schema work, only to have your login hang while trying to load your session. After some brief poking around and attempts to log in with different accounts, I was able to see that we had a disk problem on the database. (The nail in the coffin: zpool status, the ZFS command that shows the condition of your pools and drives, was hanging while trying to display the last pool.) So, poke the SAs and see what they can dig up… and unfortunately things were bad enough that we were going to need a reboot. So, we reboot.
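For the curious, the ZFS-side checks here are nothing exotic:

zpool status        # show the state of every pool and its devices
zpool status -x     # only show pools that have problems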

Upon restart, well, things don’t restart; where “things” means Postgres. SMF shows that we tried to start, but something has gone awry.
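I no longer have the SMF output handy, but the usual first stop there is something like:

svcs -xv    # explain which services are in maintenance or otherwise impaired, and why

Looking in the database logs, I see this: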

2010-12-23 17:29:16 EST:[email protected](32789):782: FATAL: the database system is starting up
2010-12-23 17:29:17 EST:[email protected](58978):783: FATAL: the database system is starting up
2010-12-23 17:29:18 EST:@:494: FATAL: could not create relation 16385/58326/58757497: File exists
2010-12-23 17:29:18 EST:@:494: CONTEXT: xlog redo file create: 16385/58326/58757497
2010-12-23 17:29:18 EST:@:492: LOG: startup process (PID 494) exited with exit code 1
2010-12-23 17:29:18 EST:@:492: LOG: aborting startup due to startup process failure

Uh, that’s not good. And looking at the file, it’s even weirder: it’s owned (and grouped) by root, something that shouldn’t ever happen. There are a couple of ways to go with that; the one I thought would work was to copy the file off to a side directory (“borked” seemed like a good name for it), and then change the ownership so that postgres owned the file again.
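The exact paths are long gone from my scrollback, so treat these as illustrative (the directory comes from the tablespace/database numbers in the error message):

cd /pgdata/pg_tblspc/16385/58326     # illustrative data directory path; 16385/58326 from the error above
mkdir -p /var/tmp/borked
cp -p 58757497 /var/tmp/borked/      # stash a copy of the suspect file, just in case
chown postgres:postgres 58757497     # hand the file back to postgres

After that, fire it back up: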

2010-12-23 17:46:07 EST:@:2790: FATAL: could not create relation 16385/58326/58757498: File exists
2010-12-23 17:46:07 EST:@:2790: CONTEXT: xlog redo file create: 16385/58326/58757498
2010-12-23 17:46:07 EST:@:2788: LOG: startup process (PID 2790) exited with exit code 1
2010-12-23 17:46:07 EST:@:2788: LOG: aborting startup due to startup process failure

Well, duh. Ok, so look in the directory, and it turns out there are only three files like this, so fix them up the same way and see if it will work.
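Spotting the rest and restarting is nothing fancy; roughly (paths still illustrative):

find /pgdata/pg_tblspc -user root -type f    # list any other data files that ended up owned by root
pg_ctl -D /pgdata start                      # then try bringing the cluster up again

A pg_ctl later, and now I see this in my logs: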

2010-12-23 17:49:32 EST:@:3224: LOG: record with zero length at 3B02/E9565410
2010-12-23 17:49:32 EST:@:3224: LOG: redo done at 3B02/E95653E0
2010-12-23 17:49:32 EST:@:3224: LOG: last completed transaction was at log time 2010-12-23 17:17:01.558461-05
2010-12-23 17:49:32 EST:@:3224: LOG: checkpoint starting: shutdown immediate

zoinks… but the server doesn’t die and actually, I think it will work. And it does.

2010-12-23 17:53:10 EST:@:3224: LOG: checkpoint complete: wrote 131072 buffers (100.0%); 0 transaction log file(s) added, 0 removed, 23 recycled; write=215.734 s, sync=2.262 s, total=218.043 s
2010-12-23 17:53:10 EST:@:3222: LOG: database system is ready to accept connections
2010-12-23 17:53:10 EST:@:3682: LOG: autovacuum launcher started

Well, at least enough to get things working. But that’s really not the whole story; when something looks like it has gone wrong, you need to dig just a little further. First, I want to check out what those three files actually belonged to:

pgods=# select * from pg_class where relfilenode in (58757497, 58757498, 58757499);
relname | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl | reloptions
------------------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+------------
ban_activity_log_snap1 | 17322 | 41561 | 10 | 0 | 58757497 | 16385 | 860 | 77570 | 41562 | 0 | f | f | r | 6 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 1281741311 | |
pg_toast_41559 | 99 | 41563 | 10 | 0 | 58757498 | 16385 | 0 | 0 | 0 | 41564 | t | f | t | 3 | 0 | 0 | 0 | 0 | 0 | f | t | f | f | 1281741311 | |
pg_toast_41559_index | 99 | 0 | 10 | 403 | 58757499 | 16385 | 1 | 0 | 0 | 0 | f | f | i | 2 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 0 | |

Well, the relation is totally harmless; it’s a replicated copy from a different database, so I can toss it out and rebuild it if needed. But I also want to see which table the toast table and its index belong to.

pgods=# select * from pg_class where oid = 41559;
relname | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl | reloptions
------------------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+------------
ban_activity_log_snap1 | 17322 | 41561 | 10 | 0 | 58757497 | 16385 | 860 | 77570 | 41562 | 0 | f | f | r | 6 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 1281741311 | |

Again, I’ve dodged a bullet. It looks like all of those files came from the same relation, so chances are we’re ok. But before I start this up under SMF, I do a couple of checkpoints and xlog switches. A little bit superstitious, maybe, but I also want to avoid having to go back through the previous batch of xlogs if I can. (I also ran a vacuum verbose on the postgres database in the cluster.)
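None of that is exotic; as the postgres superuser it boils down to something like this, run a couple of times for good measure:

psql -d pgods -c "CHECKPOINT"                # flush everything out so we won't need the older xlogs again
psql -d pgods -c "SELECT pg_switch_xlog()"   # close out the current xlog segment and move to a fresh one
psql -d postgres -c "VACUUM VERBOSE"         # vacuum the postgres database and eyeball the output for anything odd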

Now I know that some people might be uncomfortable running their database after something like that; they might want to do some kind of dump/restore, reload, or backup. Realistically, this is a multi-TB database, and we can’t afford the downtime needed for those luxuries, so we’re going to keep pushing forward. It’d be nice if Postgres had better tools for filesystem-level verification of table and index files, but sometimes you have to make do. We’ve got some things working in our favor, like ZFS checksums on the files, and we’ll dig around a bit to see what else we can turn up.
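Part of that digging is just letting ZFS earn its keep: a scrub re-reads the pool and verifies every block against its checksum (the pool name below is made up):

zpool scrub pgpool         # re-read and checksum-verify every block in the pool; pool name is illustrative
zpool status -v pgpool     # watch scrub progress and list any files with checksum errors

But for now, I consider it within the spirit of the season that things didn’t go worse.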