I have a big table that is quite bloated (has about 50% wasted space) and I need to recover the disk. The table is about 25GB in size with plenty of indexes, triggers, foreign keys, rules, etc. Recreating the table is a pain as all dependent objects need to be considered. I’m running PostgreSQL v8.3.
Plan A – Performing a vacuum full takes a ridiculously long time and locks the table even against selects (I let it run for 5hrs before having to cancel it – luckily this doesn’t seem to make things worse).
Plan B – Moving tablespaces didn’t work (as discovered here).
Plan C – Start looking at a pg_dump/drop index/truncate/pg_restore/create index and hope it is fast enough for the time I can get an outage.
Before I started that, I was introduced to a script written by Hubert Lubaczewski (aka Depesz) described here. This compact_table.sh script is pure awesome! It can compact a table without blocking active users. It will take a long time to run and can bloat the indexes as a side effect, but they are easy to fix afterwards.
The magic inside this script is achieved by a simple update statement:
UPDATE <TABLE> SET <COL> = <COL>
WHERE ctid = '(<PAGE>,<TUPLE>)'::tid
RETURNING ctid,'(<PAGE>,<TUPLE>)';
It loops over every page/tuple and runs the update which doesn’t actually change the data, but forces PostgreSQL to create a copy of the tuple somewhere else. If the table is fragmented, that somewhere else is usually in a gap somewhere lower in the table. Eventually by moving all the tuples in high pages to fill the gaps in the low pages, the table will shrink in size.
Sample output:
$ compact_table.sh -n myschema -t bigass -k 10 -d mydb -v
2011-03-30 12:04:57 : ./compact_table.sh Settings:
- CLEAN_PAGES : 10
- EXTENDED_DEBUG : 0
- FINAL_VACUUM : 1
- INITIAL_VACUUM :
- PGDATABASE : mydb
- PGHOST :
- PGPORT :
- PGUSER :
- PSQL : psql
- TABLE_NAME : bigass
- TABLE_SCHEMA : myschema
- VERBOSE : 1
2011-03-30 12:04:57 : At most, we can have 315 tuples per page.
2011-03-30 12:04:57 : Updates will be done on column: appearance_date
2011-03-30 12:04:57 : Entering main loop.
2011-03-30 12:04:57 : Current table size: 3185024 pages.
2011-03-30 12:04:57 : Working on page 3185023 (1 of 10)
2011-03-30 12:05:00 : Working on page 3185022 (2 of 10)
2011-03-30 12:05:03 : Working on page 3185021 (3 of 10)
2011-03-30 12:05:07 : Working on page 3185020 (4 of 10)
2011-03-30 12:05:10 : Working on page 3185019 (5 of 10)
2011-03-30 12:05:13 : Working on page 3185018 (6 of 10)
2011-03-30 12:05:16 : Working on page 3185017 (7 of 10)
2011-03-30 12:05:19 : Working on page 3185016 (8 of 10)
2011-03-30 12:05:23 : Working on page 3185015 (9 of 10)
2011-03-30 12:05:26 : Working on page 3185014 (10 of 10)
2011-03-30 12:05:29 : Final vacuuming
2011-03-30 12:09:50 : Final table size: 3185024 pages.
<snip create index sql>
2011-03-30 12:09:50 : All done.
Sometimes it doesn’t truncate the last pages and sometimes it does. It depends on what other activity is happening.
As awesome as this script is, I’m not satisfied. It’s incredibly slow – on my system it takes about 1hr to process 1000 pages. On my 3M page table where I know 1.5M of it is empty, it would take me about 2 months of continuous running to get through the top half of the table. So I started playing around with it to try and make it faster. My changes include:
1. Instead of getting the starting page from the catalog, I scan the table for the max ctid – yeah much slower, but it was worth it to avoid repeated effort due to the way I am running it (i.e. multiple runs over thousands of pages without necessarily vacuuming in between).
2. Don’t call psql once for every tuple. I avoid the connection overheads by batching the updates in a script and calling psql only once per page. This made it about 3x faster.
3. Wrap the page level updates inside a transaction to minimise the number of commits. This made a big different for pages that were already empty.
4. For my table, it always says I can have max 315 tuples per page. I checked and the actual max is 144. A 2x reduction is effort sounds pretty good. I calculate this at the same time as the max ctid from above, effectively getting it for free since I already invested in a full scan.
5. Because of the previous changes I had to change the way it detects pages moved higher than where it’s currently cleaning, and it now repeats entire pages when in doubt. This doesn’t really slow it down much as it doesn’t often have to re-clean pages. However, it does seem to move tuples higher on the same page quite often which results in lots of re-updating of the same row within each page. I couldn’t figure out a way to prevent this.
6. Since we are walking the file backwards, we are making the worst possible use of any disk read ahead in place. The initial table scan should cache the whole table, but there may not be enough space or it could get flushed out by other system activity. By first reading a page from lower in the table, we can try to pre-cache all required pages via the read ahead. This didn’t make much difference to me as my table fits in cache but I left the code in there. Use the -r parameter with the number of pages to read ahead.
7. I don’t care about safe transactions so set synchronous_commit = off. No measureable difference, but can’t hurt.
8. The overheads of checking if a page needs re-cleaning started to build up so I offloaded it to a background process so the main processing can continue without waiting.
9. By now it was fast enough that all the screen echos were slowing it down, so I only print every 50 rows.
10. I tried setting up full parallelism where several processes ran over different blocks of pages. The problem here is that one thread may move its tuples into the pages already done by another thread and not be able to detect that. From my testing, this seemed to happen quite often and meant the same pages had to be reprocessed each time it ran.
$ compact_table.sh -n myschema -t bigass -d mydb -k 20000 -r 4096 -v
2011-04-06 10:56:03 : ./compact_table.sh Settings:
- CLEAN_PAGES : 20000
- EXTENDED_DEBUG : 0
- FINAL_VACUUM : 1
- INITIAL_VACUUM :
- PGDATABASE : mydb
- PGHOST :
- PGPORT :
- PGUSER :
- PSQL : psql
- TABLE_NAME : bigass
- TABLE_SCHEMA : myschema
- VERBOSE : 1
- READAHEAD : 4096
2011-04-06 10:56:03 : Scanning table for highest tuple
2011-04-06 10:58:03 : At most, we can have 144 tuples per page.
2011-04-06 10:58:03 : Updates will be done on column: appearance_date
2011-04-06 10:58:03 : Entering main loop.
2011-04-06 10:58:03 : Current table size: 3010189 pages.
2011-04-06 10:58:03 : Clean at 3006092 to fill read ahead buffer
(3010164,10 -> 3010164,1) re-cleaning 3010164
2011-04-06 10:58:08 : Working on page 3010139 (50 of 20000)
2011-04-06 10:58:14 : Working on page 3010089 (100 of 20000)
2011-04-06 10:58:19 : Working on page 3010039 (150 of 20000)
<snip>
(2990257,6 -> 2990257,5) re-cleaning 2990257
2011-04-06 11:41:17 : Working on page 2990239 (19950 of 20000)
2011-04-06 11:41:26 : Working on page 2990189 (20000 of 20000)
2011-04-06 11:41:26 : Final vacuuming
INFO: vacuuming "myschema.bigass"
<snip index vacuums>
INFO: "bigass": found 464078 removable, 46019311 nonremovable row versions in 3023753 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 37291948 unused item pointers.
1175600 pages contain useful free space.
0 pages are entirely empty.
CPU 55.49s/76.47u sec elapsed 558.82 sec.
INFO: "bigass": truncated 3023753 to 2990189 pages
<snip other index vacuums>
That was 20,000 pages in 45 mins on my relatively busy system and when there are lots of blocks to move (much faster when most of them are already empty). So it’s an average of 25x faster than the original script. In this example, it was able to remove the completely empty pages at the end of the file, however VACUUM will not always do this if there are active transactions that may need to see that data. If that’s the case, a VACUUM at a later time will eventually reclaim the space.
It should be noted that as of PostgreSQL 9.0, the VACUUM FULL command is MUCH faster and actually re-creates the entire table instead of performing row by row processing. If you have enough time for the exclusive lock involved, this is a much easier solution and far better than the old 8.x implementation.
Download my updated script.
something is going wrong if you are 50% bloated. try running auto-vacuum in 8.3. if you already are see why it’s not working or aggressive enough.
auto-vacuum is not always an option for me. Some of my clusters have databases that are rebuilt every night and the auto-vacuum interferes with that process. Within the same cluster, there are other completely different databases which then suffer from this bloat without scheduled vacuums. The joy of inheriting old systems like this
Nathan, are you aware that you can turn off autovacuum per table? So you can easily have database that is recreated daily, but has autovacuum turned off.
I have seen it but not tried it. Can I add thousands of tables to the exclusion list?
Never tried with thousands, but I don’t see why not.