People ask me why I haven’t blogged much lately and I generally tell them I’ve been busy fixing broken stuff. Then they tell me to blog about the broken stuff I’m fixing. So, ok. Here’s a random encounter from before the holidays….
There’s nothing like trying to log into a database server to do some simple schema work, only to have your login hang while loading your session. After some brief poking around and attempts to log in with different accounts, I was able to see that we had a disk problem on the database. (The nail in the coffin: zpool status, the ZFS command that shows the condition of your drives, was hanging while trying to display the last pool.) So, poke the SAs and see what they can dig up… and unfortunately things were bad enough that we were going to need a reboot. So, we reboot.
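That triage step can be sketched as a guarded health check. The `timeout` wrapper and the fallback messages are my additions here (the original diagnosis was just running zpool status by hand and watching it hang); `timeout` is a GNU coreutils tool and won’t necessarily be on an older Solaris box:

```shell
# Guarded ZFS health check: on the real host this was plain `zpool status`,
# which hung on the bad pool. Wrapping it in a timeout turns a hang into a
# visible failure instead of a stuck shell.
if command -v zpool >/dev/null 2>&1; then
  timeout 10 zpool status || echo "zpool status hung or failed -- suspect a sick pool"
  pool_check=ran
else
  echo "zpool not available on this host"
  pool_check=skipped
fi
```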
Upon restart, well, things don’t restart; where “things” means postgres. SMF shows that we tried to start, but something has gone awry. Looking in the database logs, I see this:
Uh, that’s not good. And looking at the file, it’s even weirder: it’s owned (and grouped) by root, something that shouldn’t ever happen. There are a couple of ways to go with that; the way I thought would work was to copy the file off to a side directory (“borked” seemed like a good directory name for this), and then change the ownership so that postgres owned the file again. After that, fire it back up.
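The fix-up looks roughly like this. The OIDs below are the ones from this incident, but I’m using a scratch directory as a stand-in for the real tablespace path, and the chown is shown as a comment since that’s the part you’d only run on the live box as root:

```shell
# Simulate the three root-owned relation files in a scratch directory that
# stands in for the real tablespace path.
DATADIR=$(mktemp -d)
touch "$DATADIR/58757497" "$DATADIR/58757498" "$DATADIR/58757499"

# Keep copies of the suspect files in a side directory before touching them.
mkdir -p "$DATADIR/borked"
for f in 58757497 58757498 58757499; do
  cp -p "$DATADIR/$f" "$DATADIR/borked/$f"
  # On the real box: chown postgres:postgres "$DATADIR/$f"
done
ls "$DATADIR/borked"
```

Copy first, chown second: if the file turns out to be garbage, you still have the untouched original sitting in borked to poke at later.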
Well, duh. Ok, so look in the dir, and it turns out there are only three files like this, so fix them up the same way and see if it will work. A pg_ctl later, and now I see this in my logs:
zoinks… but the server doesn’t die and actually, I think it will work. And it does.
Well, at least enough to get things working. But that’s really not the whole story; when something looks like it has gone wrong, you need to dig just a little further. First I want to check out what those three files were:
pgods=# select * from pg_class where relfilenode in (58757497, 58757498, 58757499);
relname | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl | reloptions
------------------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+------------
ban_activity_log_snap1 | 17322 | 41561 | 10 | 0 | 58757497 | 16385 | 860 | 77570 | 41562 | 0 | f | f | r | 6 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 1281741311 | |
pg_toast_41559 | 99 | 41563 | 10 | 0 | 58757498 | 16385 | 0 | 0 | 0 | 41564 | t | f | t | 3 | 0 | 0 | 0 | 0 | 0 | f | t | f | f | 1281741311 | |
pg_toast_41559_index | 99 | 0 | 10 | 403 | 58757499 | 16385 | 1 | 0 | 0 | 0 | f | f | i | 2 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 0 | |
Well, the relation is totally harmless; it’s a replicated copy from a different database, so I can toss it out and rebuild it if needed. But I also need to see which relation the toast table and its index belong to.
pgods=# select * from pg_class where oid = 41559;
relname | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relfrozenxid | relacl | reloptions
------------------------+--------------+---------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+------------+-------------+----------------+--------------+--------+------------
ban_activity_log_snap1 | 17322 | 41561 | 10 | 0 | 58757497 | 16385 | 860 | 77570 | 41562 | 0 | f | f | r | 6 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | 1281741311 | |
Again, I’ve dodged a bullet. It looks like all of those files came from the same relation. So chances are we’re ok. But before I start this under SMF, I do a couple of checkpoints and xlog switches. A little bit superstitious, but also I want to avoid having to go back through the previous batch of xlogs if I can. (I also did a vacuum verbose on the postgres database in the cluster).
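That pre-SMF ritual amounts to a few commands against the cluster. This is a sketch, not the exact session: the wrapper function and its fallback echo are mine, the connection options are assumptions for your environment, and pg_switch_xlog() is the 8.x-era name (it became pg_switch_wal() in PostgreSQL 10):

```shell
# Force checkpoints and an xlog switch before restarting under SMF, plus a
# vacuum. Guarded so it degrades to printing the command when no cluster is
# reachable (e.g. when trying this sketch on a workstation).
run_sql() {
  if command -v psql >/dev/null 2>&1 && psql -U postgres -c '\q' 2>/dev/null; then
    psql -U postgres -c "$1"
  else
    echo "would run: $1"
  fi
}
run_sql "CHECKPOINT;"
run_sql "SELECT pg_switch_xlog();"   # pg_switch_wal() in PostgreSQL 10+
run_sql "VACUUM VERBOSE;"            # run against the postgres database
```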
Now I know that some people might be uncomfortable running their database after something like that; they might want to do some kind of dump/restore or reload or backup. Realistically, this is a multi-TB database, and we can’t afford the downtime needed for those luxuries, so we’re going to keep pushing forward. It’d be nice if Postgres had better tools for filesystem-level verification of table/index files, but sometimes you have to make do. We’ve got some things working in our favor: ZFS checksums on the files, and we’ll dig around a bit to see what else we can turn up. But for now, I consider it within the spirit of the season that things didn’t go worse.