I had a PostgreSQL database crash overnight due to the $PGDATA file system running out of disk space. Looking at the disk growth, some process had steadily increasing disk usage for hrs and consumed 170GB before dying and taking out several other overnight batch processes.
The log file contained:
EST:ERROR: could not write block 22455034 of temporary file: No space left on device
EST:HINT: Perhaps out of disk space?
EST:STATEMENT: SELECT bunch of columns
FROM big_ass_table a, another_huge_table b
WHERE useless_criteria_that_filters_nothing
AND no_table_joins
GROUP BY lets_throw_a_group_by_in_here_too_just_to_fuck_it_up_some_more
The offending query was traced back to a single power user running an ad-hoc query from their desktop. Looking at the explain plan of that query revealed some rather excessive temp file usage due to cartesian joins on multi-GB tables.
So here is a way for a single user to crash the entire PostgreSQL database cluster. As a DBA this makes me shudder. What are my options to prevent this happening again:
1. Find some parameters that can avoid this. The only parameter remotely relevant was the temp_buffers which only governs space usage in memory.
2. Use a separate temporary tablespace. I could create a separate file system for a new tablespace, but that isn’t something I can do quickly on 40 different servers. I can’t point the new tablespace to an existing file system as there are no tablespace quotas in PostgreSQL and it would just fill up that file system if given the chance.
3. Write script to kill end user queries that run longer than reasonable before they have a chance to fill up the disk.
I can’t investigate all this without comparing to Oracle so here goes
All Oracle temp files are stored in one or more temporary tablespaces, each of which have a fixed size. They can be made to autoextend in size but the DBA always has control over maximum space used. If a query uses all the allocated space for that user/tablespace, their query errors and rolls back. Other users are not affected (unless they were unlucky enough to try and grab temp space from the same tablespace just when it filled up). Most importantly – the database stays UP.
Naturally, I ended up implementing choice #3 – automatically kill slow user queries. I have done similar with Oracle in the past to minimise performance problems and it is a very effective solution. It also helps to train the end users into writing more efficient queries
The basic steps are:
1. Connect to a database cluster and query for slow queries.
psql -U postgres postgres -Atc \
"select procpid, datname, usename, client_addr, current_query
from pg_stat_activity where current_query <> '<IDLE>'
and client_addr::varchar like '%.140.%'
and current_timestamp-query_start > interval '1 hr'" > /tmp/$$.out
2. For each query, send an alert email with details of who is running what then, if you’re brave enough, kill it.
while read LINE; do
KILL=`echo $LINE | awk -F\| '{print $1}'`
DATABASE=`echo $LINE | awk -F\| '{print $2}'`
USER=`echo $LINE | awk -F\| '{print $3}'`
IP=`echo $LINE | awk -F\| '{print $4}'`
SQL=`echo $LINE | awk -F\| '{print $5}'`
MSG="Killing $USER ($IP) in $DATABASE running $SQL"
/var/lib/dba/alert.sh "ALERT: Killing query" "$MSG"
`kill $KILL`
done < /tmp/$$.out
You can download the full script kill_old_queries.sh here.
you can set the statement_timeout(integer) for individual client connections http://www.postgresql.org/docs/8.4/static/runtime-config-client.html to automatically kill off those long queries.