Killing Invincible Processes With Dbx

So yesterday I was doing some maintenance on my database, [http://www.postgresql.org/docs/8.2/interactive/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE reviewing index usage] and adjusting [http://www.postgresql.org/docs/8.2/interactive/manage-ag-tablespaces.html tablespaces], when I ran across a table that could not be altered because some long running transactions were holding locks on it. As it turned out, those transactions belonged to database jobs that connect over to an Oracle database and pull information back into the PostgreSQL instance. Looking in pg_stat_activity I saw
pgods=# select * from pg_stat_activity where procpid in (21726, 21975);
 datname | procpid |                       current_query                       |          query_start
---------+---------+-----------------------------------------------------------+-------------------------------
 pgods   |   21975 | select ods.p_g_f('FL', 'TBLFL_UNSUB_USERS_PG','noDEBUG'); | 2007-01-22 21:05:04.053497-05
 pgods   |   21726 | select ods.p_g_tx('FL', 'USERS_PG','noDEBUG');            | 2007-01-22 21:03:03.061543-05
(2 rows)
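As an aside, when a blocked ALTER TABLE is the first symptom, it can help to know exactly which backends are holding locks on the table before digging through pg_stat_activity. A rough sketch of one way to do that, joining pg_locks to pg_stat_activity (the table name is just a placeholder, and the column names are the 8.1-era ones):

-- 'some_table' is a placeholder; substitute the table you are trying to alter
select a.procpid, a.datname, l.mode, l.granted, a.query_start, a.current_query
  from pg_locks l
  join pg_stat_activity a on a.procpid = l.pid
 where l.relation = 'some_table'::regclass;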
Based on their query times, I knew these jobs were bogus, because last week we had a server crash on the Oracle machine due to a hardware failure. While it wasn't impossible that these jobs would continue to run even with an outage on the Oracle instance, they shouldn't have taken this long. To first verify that nothing was actually going on, I trussed the process, confirmed it was just sleeping, and then went to cancel the query
pgods=# select pg_cancel_backend(21975);
 pg_cancel_backend
-------------------
 t
(1 row)

pgods=# select * from pg_stat_activity where procpid in (21975);
 datname | procpid |                       current_query                       |          query_start
---------+---------+-----------------------------------------------------------+-------------------------------
 pgods   |   21975 | select ods.p_g_f('FL', 'TBLFL_UNSUB_USERS_PG','noDEBUG'); | 2007-01-22 21:05:04.053497-05
(1 row)

-bash-3.00$ sudo pgtruss -p 21975
Password:
read(12, 0x012F68C6, 2064)      (sleeping...)
    Received signal #2, SIGINT, in read() [caught]
read(12, 0x012F68C6, 2064)      Err#91 ERESTART
lwp_sigmask(SIG_SETMASK, 0x00000002, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
setcontext(0xFFFFFD7FFFDEFA30)
read(12, 0x012F68C6, 2064)      (sleeping...)
    Received signal #16, SIGUSR1, in read() [caught]
read(12, 0x012F68C6, 2064)      Err#91 ERESTART
lwp_sigmask(SIG_SETMASK, 0x00008000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
setcontext(0xFFFFFD7FFFDEFA30)
read(12, 0x012F68C6, 2064)      (sleeping...)
From the above, you can see my “successful” attempt at canceling the query, followed by the query still running when looking in pg_stat_activity. More concerning is the truss output, which shows the sleeping process catching the cancel signal and then going right back to sleep in its read() call. Given the inability to cancel the query, I also tried to kill the process from the command line, with little success.
bash-3.00$ kill 21975
bash-3.00$
bash-3.00$ ps -eaf | grep 21975
    root 16136 19194   0 18:15:19 pts/3       0:00 /opt/perl/bin/perl pgtruss -p 21975
    root 16160 16136   0 18:15:25 pts/3       0:00 truss -p 21975
postgres 21975   486   0   Jan 22  ?          0:00 /opt/pgsql814/bin/postmaster -D /pgods/slowdata1/pgdata
postgres 17177  6856   0 18:21:21 pts/12      0:00 grep 21975
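For what it's worth, my understanding of why none of this worked: pg_cancel_backend() just sends SIGINT, and a bare kill sends SIGTERM, and the backend's handlers for both merely set a flag that the server acts on at its next interrupt check. A backend parked in a read() on the Oracle socket, deep inside the OCI driver, never reaches that check, so both of these amount to the same nothing:

bash-3.00$ kill -TERM 21975    # what a bare kill sends; only honored at the next interrupt check
bash-3.00$ kill -INT 21975     # what pg_cancel_backend(21975) sends under the hood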
So there I was, with queries that wouldn't die and a need to make them stop without causing any trouble. I thought about playing with different flags for kill, though obviously kill -9 was not an option (taking out a single backend that way would force the postmaster to reset every other connection on the box). Unfortunately, I have seen this problem before; it is caused by a bug in the OCI drivers, where you just can't kill the process. Generally when I have gotten into this state I was able to kill the session from the Oracle side and continue on, but in this case there was no Oracle side, and short of a shutdown I didn't seem to have many options. Luckily, one option I did have was running the scenario by Theo, who said to try attaching with dbx and seeing if I could stop the backend with an exit call. I was a little skeptical of the idea, but figured it was worth a shot. For those not in the know, [http://en.wikipedia.org/wiki/Dbx_debugger dbx] is a debugger available on Solaris, similar to gdb. In fact, attaching to a process works much the same way
bash-3.00# /opt/SUNWspro/bin/dbx /opt/pgsql814/bin/postgres 21975
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading postgres
dbx: Cannot attach to process 21975 -- Device busy
OK, so it would work much the same way if I weren't still running truss on the process :-). I killed truss and reattached.
(dbx) attach 21975
Reading ld.so.1
Reading libz.so.1
Reading librt.so.1
Reading libresolv.so.2
Reading libgen.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libdl.so.1
Reading libm.so.2
Reading libc.so.1
Reading libaio.so.1
Reading libmd5.so.1
Reading en_US.UTF-8.so.3
Reading methods_unicode.so.3
Reading plpgsql.so
Reading plperl.so
Reading libperl.so
Reading libpthread.so.1
Reading libscf.so.1
Reading libdoor.so.1
Reading libuutil.so.1
Reading libmp.so.2
Reading DBI.so
Reading Util.so
Reading Pg.so
Reading libpq.so.4.1
Reading libssl.so.0.9.7
Reading libcrypto.so.0.9.7
Reading Oracle.so
Reading libocci.so.10.1
Reading libclntsh.so.10.1
Reading libthread.so.1
Reading libnnz10.so
Reading libociicus.so
Reading libkstat.so.1
Reading libsched.so.1
Attached to process 21975
t@1 (l@1) stopped in _read at 0xfffffd7fff0ddeda
0xfffffd7fff0ddeda: _read+0x000a:       jae      _read+0x16     [ 0xfffffd7fff0ddee6, .+0xc ]
Current function is plperl_call_perl_func
  905         count = perl_call_sv(desc->reference, G_SCALAR | G_EVAL);
(dbx) call exit(0)
execution completed, exit code is 0
(dbx)
After attaching I gave the exit call a swing and lo and behold it actually claimed that it worked. A quick check from the command line verified the success.
bash-3.00$ ps -eaf | grep 21975
    root 17293 14876   0 18:21:58 pts/2       0:01 /opt/SUNWspro/bin/dbx /opt/pgsql814/bin/postgres 21975
postgres 17465  6856   0 18:22:56 pts/12      0:00 grep 21975
And a double check to make sure the server was alive.
pgods=# select count(*) from pg_stat_activity;
 count
-------
    58
(1 row)

Time: 10.500 ms
Yep, all was well. So toss another option into my bag of tricks, crazy as it might seem. Next time you end up with a connection that won't die, you might try using dbx, the DataBase connection eXterminator!
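For anyone on a platform where dbx isn't available, I'd expect the same trick to work with gdb, since attaching to a live process and calling functions in it works much the same way there; a rough sketch I haven't run through this exact scenario, reusing the binary path and pid from above:

bash-3.00$ gdb /opt/pgsql814/bin/postgres 21975
(gdb) call exit(0)
(gdb) quit

Same idea, same caveats: you are still yanking a backend out from under the postmaster, so I'd keep it strictly as a last resort for connections that refuse to die any other way.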