PostgreSQL Full Text Search Testing

[http://www.omniti.com/home We] have a couple of upcoming projects that are likely to revolve heavily around free text searching, so naturally one of the solutions we want to consider is tsearch2. Our expectation is that the projects will start with only around one million items, but will need to handle a good 10 million entries before all is said and done. After looking at a couple of different data sources, we settled on a data dump from the [http://citizendium.com/ citizendium] project, which is comparable to using wikipedia. The best table to work with in this database is the page_content table, which contains 3,639,937 entries, a fairly good sample size for what we will be needing. (While many of these entries are longer than what we need, that can only hurt things as I see it.) I had a few issues with the import of the data itself, and ended up recreating the tsearch column on the table, which looked something like this:

cz=# UPDATE pc_plain SET fti_idx = to_tsvector(old_text);
NOTICE:  A word you are indexing is too long. It will be ignored.
NOTICE:  A word you are indexing is too long. It will be ignored.
<snip>
NOTICE:  A word you are indexing is too long. It will be ignored.
NOTICE:  A word you are indexing is too long. It will be ignored.
UPDATE 3639937
Time: 4247118.034 ms

This is notable only because it looks scary but isn't really. The point is simply that words that are too long don't get indexed. Chances are this is due to some test / bogus words in the sample data, so in practice it isn't an issue (note that the other words in a given article still get indexed). In any case, once you do that, you have to vacuum full, which also provides some good information about your table:

cz=# vacuum full verbose pc_plain;
INFO:  vacuuming "public.pc_plain"
INFO:  "pc_plain": found 3639937 removable, 3639937 nonremovable row versions in 441653 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 60 to 2032 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 1566160228 bytes.
47880 pages are or will become empty, including 0 at the end of the table.
408374 pages containing 1565224764 free bytes are potential move destinations.
CPU 17.41s/3.23u sec elapsed 197.42 sec.
INFO:  "pc_plain": moved 2345887 row versions, truncated 441653 to 253504 pages
DETAIL:  CPU 68.12s/1245.23u sec elapsed 2500.38 sec.
INFO:  vacuuming "pg_toast.pg_toast_5167411"
INFO:  "pg_toast_5167411": found 0 removable, 3457146 nonremovable row versions in 770157 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 45 to 2030 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 602646972 bytes.
0 pages are or will become empty, including 0 at the end of the table.
630968 pages containing 598825444 free bytes are potential move destinations.
CPU 38.71s/4.12u sec elapsed 605.83 sec.
INFO:  index "pg_toast_5167411_index" now contains 3457146 row versions in 9481 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.45s/0.04u sec elapsed 10.34 sec.
INFO:  "pg_toast_5167411": moved 2 row versions, truncated 770157 to 770157 pages
DETAIL:  CPU 0.00s/0.01u sec elapsed 0.10 sec.
INFO:  index "pg_toast_5167411_index" now contains 3457146 row versions in 9481 pages
DETAIL:  2 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.08u sec elapsed 0.10 sec.
VACUUM
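For reference, the whole column setup boils down to something like the sketch below. The ALTER TABLE step wasn't captured above, so the exact statement (and the reliance on the default dictionary configuration) is an assumption on my part; the table and column names are the ones used throughout this post.

-- Sketch of the tsvector column setup (the ALTER TABLE is assumed, not shown above)
ALTER TABLE pc_plain ADD COLUMN fti_idx tsvector;
-- populate it from the raw article text using the default configuration
UPDATE pc_plain SET fti_idx = to_tsvector(old_text);
-- reclaim the space churned by the full-table update
VACUUM FULL VERBOSE pc_plain;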
So now on to some basic testing. Just to get a feel for things, I followed [http://people.planetpostgresql.org/mha/index.php?/archives/112-GIN-performance-postgresql.org-websearch-part-2.html Magnus' idea] of wrapping a simple select in a count() and taking the query times from that. We needed to search for something pop culture-ish, such as "britney spears", though I ended up choosing "super mario" instead as it makes for a [http://en.wikipedia.org/wiki/Super_Mario#In_popular_culture more diverse] search term. Anyway, a straight select (with no indexes) on the table looks a lot like this:

cz=# explain analyze select count(*) from pc_plain where fti_idx @@ plainto_tsquery('super mario');
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=299012.32..299012.33 rows=1 width=0) (actual time=257871.208..257871.208 rows=1 loops=1)
   ->  Seq Scan on pc_plain  (cost=0.00..299003.21 rows=3640 width=0) (actual time=615.218..257848.716 rows=2176 loops=1)
         Filter: (fti_idx @@ '''super'' & ''mario'''::tsquery)
 Total runtime: 257871.255 ms
(4 rows)

Time: 257924.829 ms

with average query times running around 242427.0616 ms. Really, that's pretty crappy, so let's get on with looking at indexed queries. I created copies of the table with both a GIN and a GiST index. The index creation seemed to take about twice as long for GIN as for GiST, though I didn't really time it scientifically (I started one up, then later fired off the other before the first had finished).
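The index builds themselves weren't captured in the session output. Assuming the table copies and index names that show up in the plans and catalog output below (pc_gist/fti_gist and pc_gin/fti_gin), they would have looked roughly like this; treat it as a sketch rather than the exact commands run:

-- Sketch only: how the two indexed copies were presumably built
-- (table and index names taken from the plans below; exact statements not shown in the original)
CREATE TABLE pc_gist AS SELECT * FROM pc_plain;
CREATE INDEX fti_gist ON pc_gist USING gist (fti_idx);

CREATE TABLE pc_gin AS SELECT * FROM pc_plain;
CREATE INDEX fti_gin ON pc_gin USING gin (fti_idx);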
Given the relative sizes of the indexes involved, that estimate is probably pretty accurate:

cz=# select relname, reltuples, pg_size_pretty(pg_relation_size(oid)) from pg_class where relname ~ 'fti' or relname = 'pc_plain';
 relname  |  reltuples  | pg_size_pretty
----------+-------------+----------------
 pc_plain | 3.63994e+06 | 1981 MB
 fti_gist | 3.63994e+06 | 717 MB
 fti_gin  | 2.60522e+08 | 2842 MB
(3 rows)

Yes, it turns out the GIN index is actually larger than the entire table! That's a point of concern for the app we have in mind, because we will likely have a fairly high rate of inserts and updates, so index creation and/or update time will be important. Anyway, for now I need to make sure query times are ok, so since I expected GIN to outperform GiST, I started by running the above search against the GiST table:

cz=# explain analyze select count(*) from pc_gist where fti_idx @@ plainto_tsquery('super mario');
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13618.51..13618.52 rows=1 width=0) (actual time=284777.725..284777.726 rows=1 loops=1)
   ->  Bitmap Heap Scan on pc_gist  (cost=398.97..13609.41 rows=3640 width=0) (actual time=84487.263..284772.891 rows=2176 loops=1)
         Filter: (fti_idx @@ '''super'' & ''mario'''::tsquery)
         ->  Bitmap Index Scan on fti_gist  (cost=0.00..398.97 rows=3640 width=0) (actual time=84058.794..84058.794 rows=13677 loops=1)
               Index Cond: (fti_idx @@ '''super'' & ''mario'''::tsquery)
 Total runtime: 284795.707 ms
(6 rows)

The plan looked right, but the times sure looked bad… with average times around 271864.5408 ms, it actually got slower using the index. It's been said that GiST has scalability issues once you get into very large set sizes; I guess there is some truth to that. Would GIN be any better?

cz=# explain analyze select count(*) from pc_gin where fti_idx @@ plainto_tsquery('super mario');
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=14717.39..14717.40 rows=1 width=0) (actual time=21685.663..21685.664 rows=1 loops=1)
   ->  Bitmap Heap Scan on pc_gin  (cost=1497.85..14708.29 rows=3640 width=0) (actual time=489.247..21684.045 rows=2176 loops=1)
         Recheck Cond: (fti_idx @@ '''super'' & ''mario'''::tsquery)
         ->  Bitmap Index Scan on fti_gin  (cost=0.00..1497.85 rows=3640 width=0) (actual time=458.731..458.731 rows=2176 loops=1)
               Index Cond: (fti_idx @@ '''super'' & ''mario'''::tsquery)
 Total runtime: 21809.538 ms
(6 rows)

Better, yes, but even that seemed way too slow. Of course, that was the run coming from disk… once everything was in memory, things sped up dramatically:

cz=# explain analyze select count(*) from pc_gin where fti_idx @@ plainto_tsquery('super mario');
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=14717.39..14717.40 rows=1 width=0) (actual time=12.119..12.119 rows=1 loops=1)
   ->  Bitmap Heap Scan on pc_gin  (cost=1497.85..14708.29 rows=3640 width=0) (actual time=8.123..11.506 rows=2176 loops=1)
         Recheck Cond: (fti_idx @@ '''super'' & ''mario'''::tsquery)
         ->  Bitmap Index Scan on fti_gin  (cost=0.00..1497.85 rows=3640 width=0) (actual time=7.661..7.661 rows=2176 loops=1)
               Index Cond: (fti_idx @@ '''super'' & ''mario'''::tsquery)
 Total runtime: 12.179 ms
(6 rows)

Yeah… that's what you want to see… and the times stayed that low. But it is somewhat concerning that things were so bad on the initial run, and more testing will certainly be required.
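One obvious mitigation for that first-run penalty (not something I tested here, just a thought): take the cold-cache hit on your own schedule rather than on a user's first search, by running a representative query or two right after a restart so that the upper levels of the GIN index and some heap pages are already cached.

-- Hypothetical warm-up, not part of the tests above: a throwaway search run
-- after startup to pull index and heap pages into cache before real traffic.
SELECT count(*) FROM pc_gin
 WHERE fti_idx @@ plainto_tsquery('super mario');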
For now, I take away a few thoughts:

    * If things are in RAM, it's good. (duh)
    * This was on good hardware, but with SATA drives. See #1.
    * If 3 million articles works out to roughly 3 GB of RAM, scaling to 10 million seems doable (a rough sizing check is sketched after this list).
    * There's still room for optimization: the data samples were larger than we actually need, and I didn't configure the dictionaries for optimal performance either.
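As for that sizing check, pg_total_relation_size() counts the heap, TOAST, and indexes together, so it gives a rough upper bound on the hot set for each copy of the table. This is a sketch of the query, not output from this test:

-- Rough sizing check (sketch): total on-disk footprint of each copy,
-- heap + TOAST + indexes, as an upper bound on what would need to be cached.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
  FROM pg_class
 WHERE relname IN ('pc_plain', 'pc_gist', 'pc_gin');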
In any case, it's at least a plausible solution, which is what we wanted to know. We've got more testing ahead of us as well, so we'll see how that goes.