Mar 7, 2011
Nathan Thom

PostgreSQL backup crashed – PANIC!


Over the weekend, one of my biggest databases crashed (400+ GB). I’m running 8.3.6 on Redhat. The log file included:

EST:PANIC:  could not write to file "pg_xlog/xlogtemp.9109": No space left on device
EST:CONTEXT:  writing block 103261 of relation 2597680/1222492/2600092
EST:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
EST:LOG:  archiver process (PID 6578) exited with exit code 1

So it seems pg_xlog directory filled up (yet again) due to too much activity. I know from the timestamp that there was a backup running at the time, as well as a newly introduced forced vacuum process. Looks like the two of them combined proved too much.

$ ls -ltr
-rw------- 1 postgres postgres      252 Mar  4 00:15 000000010000169500000077.00000050.backup
-rw------- 1 postgres postgres 16777216 Mar  5 00:54 000000010000169A0000004A
-rw------- 1 postgres postgres 16777216 Mar  5 00:54 000000010000169A0000004B
-rw------- 1 postgres postgres 16777216 Mar  5 00:54 000000010000169A0000004C
-rw------- 1 postgres postgres 16777216 Mar  5 00:54 000000010000169A0000004D
...
-rw------- 1 postgres postgres 16777216 Mar  5 05:21 000000010000169F000000A2
-rw------- 1 postgres postgres 16777216 Mar  5 05:21 000000010000169F000000A3
-rw------- 1 postgres postgres 16777216 Mar  5 05:21 000000010000169F000000A4
drwx------ 2 postgres postgres    57344 Mar  5 05:21 archive_status

There are 1259 files in the pg_xlog directory (1259 x 16mb = 20GB).
Attempts to start it up again resulted in:

EST:LOG:  all server processes terminated; reinitializing
EST:LOG:  database system was interrupted; last known up at 2011-03-05 04:28:59 EST
EST:LOG:  could not open file "pg_xlog/000000010000169700000011" (log file 5783, segment 17): No such file or directory
EST:LOG:  invalid checkpoint record
EST:PANIC:  could not locate required checkpoint record
EST:HINT:  If you are not restoring from a backup, try removing the file "/var/lib/pgsql/data/backup_label".
EST:LOG:  startup process (PID 32565) was terminated by signal 6: Aborted
EST:LOG:  aborting startup due to startup process failure

I know it was in the middle of a backup when it failed. I looked for the backup_label file:

-rw------- 1 postgres postgres   152 Mar  4 18:00 backup_label
$ cat backup_label
START WAL LOCATION: 1697/110000C8 (file 000000010000169700000011)
CHECKPOINT LOCATION: 1697/110000C8
START TIME: 2011-03-04 18:00:14 EST
LABEL: tape_bu

This file gets created by pg_start_backup() and removed by pg_stop_backup(). It is used by a backup to store information required for recovering that backup. When the server starts and finds the backup_label file, it tries to recover using the checkpoint information in that file. It could not find the right file (000000010000169700000011) and so aborted. The log file was not there because it had already been archived well after the (very slow) backup had started and new checkpoints had been taken. Removing the file will cause it to use the pg_control file to find the latest checkpoint and logs required for recovery. Scenarios I can think of:
• Database shut down during backup (this case).
• Backup files have been restored and some scheduled script tries to start the database before you have created a recovery.conf file or locked down access.

I can understand the reason for not starting due to having ambiguous checkpointing data, but I don’t think it should even attempt the recovery – just abort immediately. However, why use such a system in the first place? If a backup fails, fuck it. Don’t let that prevent restarting a production database at 5am on a Saturday.

I’m sure you’re all thinking “how does Oracle handle this”? Well, the checkpoints are irrelevant in the recovery process and it merely needs to know where the logs are. Every data file has a record of the latest change number and can use the catalog information to apply the appropriate log files without fuss. So, basically there is no equivalent of the backup_label file to mess things up. If a database crashes, it will automatically perform instance recovery on startup using the online redo logs and undo tablespace. If a database is restored from hot backup, it requires additional commands to perform the recovery (similar to a recovery.conf file) to prevent accidental changes to an in progress restore.

OK rant over, let’s remove the backup_label file and try again:

$ mv backup_label /tmp
$ pg_ctl start

This time it started up OK:

EST:LOG:  database system was interrupted; last known up at 2011-03-05 04:28:59 EST
EST:LOG:  database system was not properly shut down; automatic recovery in progress
EST:LOG:  redo starts at 169B/2CF0488
EST:FATAL:  the database system is starting up
EST:FATAL:  the database system is starting up
EST:LOG:  could not open file "pg_xlog/000000010000169F000000A5" (log file 5791, segment 165): No such file or directory
EST:LOG:  redo done at 169F/A4FFCB78
EST:LOG:  last completed transaction was at log time 2011-03-05 05:21:56.048169+10
EST:FATAL:  the database system is starting up
EST:FATAL:  the database system is starting up
EST:FATAL:  the database system is starting up
EST:LOG:  autovacuum launcher started
EST:LOG:  database system is ready to accept connections

It took a little while to come up fully while it performed some recovery. The file system is still very full though:

$ ls -l | wc -l
1251

I’m impatient so I’ll force a checkpoint now:

postgres=# checkpoint;
CHECKPOINT
$ ls -l | wc -l
69

Good, now the old pg_xlog files have been removed and we have the disk space back.

Related posts:

  1. PostgreSQL pg_xlog filling up disk
  2. FATAL: the database system is starting up
  3. PostgreSQL: Moving pg_xlog to different drive on Windows
  4. Top 10 Missing PostgreSQL Features

2 Comments

  • Hi. We had the same trouble, with 8.1. Which is your version?
    Y.

    • This one was 8.3, but you will get the same error with any version I think.

Leave a comment

Article Series

Kick Ass PostgreSQL Books

Kick Ass Oracle Books

I've read lots of Oracle books, but these are by far the best I've encountered:

Categories