Oct 25, 2012
Nathan Thom

How To Implement PostgreSQL Full Text Search


I recently spent a LOT of time building a new search system for my awesome LEGO site. There was a lot of manually coded logic involved and some very large and ugly dynamically generated queries against the database. It worked but felt dirty to me. Then I did what I should have done at the start and read about PostgreSQL’s built in Full Text Search capability.

My conclusion: Full Text Search is awesome and I’m an idiot :)

After spending 20 mins reading the documentation, and an hour or so of coding afterwards I had a solution that was simpler, faster, easier to maintain, and gave far better results too!

There are many variations on how you could set it up, but the following is an outline of what I have implemented.

Step 1 – Add an extra column on the table which will store the list of keywords used by the search engine.

Each “document” in my search results is a set – a single row from the sets table. The key to fast searches is to pre-compute everything, which is what this new column will be for. It will store all the data that we want to search through in a pre-processed format (the tsvector data type) for easy searching.

ALTER TABLE sets ADD COLUMN fts tsvector;

Step 2 – Populate this column with the concatenated tsvectors from all tables of interest.

I need to search across data from many tables related to sets. The tsvector data type natively supports concatenation so it’s a simple matter to update the field with data from each table of interest.

-- First set fields on the main sets table
UPDATE sets SET fts = to_tsvector('english', set_id || ' ' || year || ' ' || coalesce(descr,''));

-- Concatenate fields from tags table
UPDATE sets SET fts = sets.fts || to_tsvector('english', x.fts)
FROM (
   select set_id, array_to_string(array_agg(tag),' ') as fts
   from tags group by set_id
) as x
WHERE sets.set_id = x.set_id;

-- Repeat for all other tables with data to be searched
etc...

When I query the table, I can see how the data is represented internally:

# select set_id, fts from sets where set_id = '9395-1';
 set_id |                                          fts
--------+---------------------------------------------------------------------------------------------
 9395-1 | '-1':2 '2012':3 '9395':1 'model':10 'pick':5 'pick-up':4 'technic':9 'tow':7 'truck':8

It automatically converts some words such as removing plurals, lower cases everything, etc.

Now that's fine for the initial data load, but I need to keep this data up to date every time any of the related tables change. I decided to create a trigger on all relevant tables as the best solution.

Step 3 – Feed the users’ query into the parser, calculate the rank and sort the results.

Say the user queries for "red racing car". I can now just use the @@ operator to compare against the fts column and it will search through the data from all other sets related tables.

select set_id, descr, ts_rank(fts, plainto_tsquery('english', 'red racing car'), 1 ) AS rank, fts
from sets
where fts @@ plainto_tsquery('english', 'red racing car')
order by rank desc;

I also use the ts_rank function to get a rank/score of how well the row fits the search query. This particular query gives:

  set_id  |                          descr                          |   rank    |                                                                    fts
----------+---------------------------------------------------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------
 1477-1   | {Red Race Car Number 3}                                 |  0.104282 | '-1':2 '1477':1 '1991':3 '3':8 'car':6 'classic':9 'number':7 'race':5,11 'red':4 'town':10,12
 4948-1   | Black and Red Racer                                     | 0.0738371 | '-1':2 '2006':3 '4948':1 'black':4 'car':8 'race':10 'racer':7,9 'red':6 'tini':11 'turbo':12
 4582-1   | Red Bullet                                              | 0.0734034 | '-1':2 '2002':3 '4582':1 'bullet':5 'car':6 'drome':7 'race':10 'racer':8,9 'red':4
 4592-1   | Red Monster                                             | 0.0734034 | '-1':2 '2002':3 '4592':1 'car':6 'drome':7 'monster':5 'race':10 'racer':8,9 'red':4
 7801-1   | Red Racer Polybag                                       | 0.0727254 | '-1':2 '2009':3 '7801':1 'car':7 'polybag':6 'race':9 'racer':5,8 'red':4 'tini':10 'turbo':11
 4593-1   | Zero Hurricane & Red Blizzard                           | 0.0686227 | '-1':2 '2002':3 '4593':1 'blizzard':7 'car':8 'drome':9 'hurrican':5 'race':12 'racer':10,11 'red':6 'zero':4
 1290-1   | Kabaya Promotional Set: Red (Volcano Climber) RoboRider |  0.066416 | '-1':2 '1290':1 '2000':3 'car':11 'climber':9 'kabaya':4 'promot':5 'race':12 'red':7 'roborid':10,13 'set':6 'technic':14 'volcano':8
 MOC-0011 | Small RC Red Racecar                                    | 0.0634835 | '-0011':2 '2010':3 'car':8 'function':9 'jurgen':13,14 'krooshoop':15 'moc':1,10 'power':11 'race':12 'racecar':7 'rc':5 'red':6 'small':4
(8 rows)

Note how the #1 result has no mention of the word "racing" but has "race" which is close enough for the search to match it.

I chose to use the ts_rank normalization parameter of 1 which helps to reduce the bias of sets with more information to search through from appearing higher in the list. It does this by dividing the scores by 1+log(doc length). There are other fancier options, but this seems to work quite well.

I actually use to_tsquery instead of plainto_tsquery to be able to use wildcard searches, but you have to validate the format of the tsquery data type first (which I do in PHP).

I also do some pre-processing of the queries for some things e.g. 1x1 -> 1 x 1 to more closely match the data. I could have probably used a PostgreSQL thesaurus to achieve this, but a few basic regexes seemed simpler.

Step 4 – Create a GIN index on this column to make it go even faster.

Pretty simple really:

CREATE INDEX sets_fts ON sets USING gin(fts);

Now the performance of any query the user can throw at it will be awesome.

To emphasize how much better this performs than my original query, check out the old plan and the new plan. The new query actually allows me to search through even more data than I was originally since it's effectively free now.

You can see the search in action here!
Pages:1234567...62»

Article Series

Kick Ass PostgreSQL Books

Kick Ass Oracle Books

I've read lots of Oracle books, but these are by far the best I've encountered:

Categories