PostgreSQL Full Text Search Testing PART II

This past week I spent some time banging against the [http://www.citizendium.org/ citizendium] database that [http://people.planetpostgresql.org/xzilla/index.php?/archives/278-PostgreSQL-full-text-search-testing.html I had set up] before with GIN indexes. To make the test more applicable, I did some quasi-random trimming of the articles down to an average size of around 146 characters per entry (keeping the sample size at 3.5 million entries). I then split my schema up, with one table of items and another table consisting of just the fti information. I set up a number of different search queries, the main one being keyword searching. To get the keywords, I took a random sample of the [http://www.techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/ aol search data] to generate various search query combinations. Because the sampling was random, I did get duplicate queries in my batches; around 10% in one batch I looked at. Since we're likely to see a fair number of duplicate queries in the real world, this seemed OK.

I started with a very simple run: 10k queries piped over a single connection from a remote PC, which took about 13 minutes, giving about 12.6 queries/second. Honestly I had no idea whether this was a good number or not, but since using a single thread is always a bad idea, I went on to split the 10k queries across 10 connections. On average this ran in about 8 1/2 minutes, giving me 19.3 qps. That was a good improvement, so I went ahead and tried 20 connections, which came in around 7 1/6 minutes, giving 23.2 qps. Again an improvement, and since PostgreSQL is designed for high concurrency, it seemed to me that going further was a good idea. I boosted the test up to 100 concurrent connections, which trimmed things down to about 6 3/4 minutes and around 24.5 qps. Not bad, eh?

Honestly, I didn't think this was nearly good enough, and I had grown suspicious of my test setup. A couple of clues: an individual query ran in milliseconds, so I should have been able to push through far more queries; and I seldom saw more than 50 active connections in pg_stat_activity, which didn't seem right.
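Before getting into what I found, here is roughly what the split schema and the basic search look like. This is a hedged reconstruction, not the actual DDL: the page_fti table, its fti tsvector column, and the GIN index name all appear in the query plans below, but the item table's name (page_description here) and its columns are my own guesses based on the description above.

-- fti side of the split schema: just a key plus the tsvector
-- (tsvector and GIN support come from the tsearch2 contrib module on 8.2)
CREATE TABLE page_fti (
    page_id integer PRIMARY KEY,
    fti     tsvector NOT NULL
);

-- the GIN index the query plans below hit
CREATE INDEX page_fti_gin_idx ON page_fti USING gin (fti);

-- item side of the split (hypothetical name and columns), holding the
-- created_on date used later for the date-constrained variants
CREATE TABLE page_description (
    page_id    integer PRIMARY KEY,
    created_on timestamp with time zone NOT NULL
);

-- the basic keyword search each batch runs, with keywords drawn from
-- the aol search data
SELECT page_id
FROM page_fti
WHERE fti @@ plainto_tsquery('names');

Keeping the tsvector in its own narrow table means the heap pages a search touches hold nothing but search data, which presumably was the point of the split.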
It all became clear on one run when I observed the following:

cz=# select * from pg_stat_activity where current_query <> '';

  datid  | datname | procpid | usesysid | usename  |                            current_query                             | waiting |          query_start          |         backend_start         |  client_addr  | client_port
---------+---------+---------+----------+----------+----------------------------------------------------------------------+---------+-------------------------------+-------------------------------+---------------+-------------
 2041328 | cz      |    2710 |       10 | postgres | select page_id from page_fti where fti @@ plainto_tsquery('names');   | f       | 2006-12-21 14:04:03.43054-05  | 2006-12-21 13:57:52.619198-05 | 10.80.116.166 |       46348
 2041328 | cz      |    2478 |       10 | postgres | select * from pg_stat_activity where current_query <> '';             | f       | 2006-12-21 14:07:07.981442-05 | 2006-12-21 13:10:46.111449-05 | 10.80.116.166 |       45594

There was no way this query should be taking minutes (note the first backend's query_start, more than three minutes before the second's), especially when I could do the following at the same time:

cz=# explain analyze select page_id from page_fti where fti @@ plainto_tsquery('names');
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on page_fti  (cost=269.75..9832.77 rows=3647 width=4) (actual time=145.049..446.920 rows=281027 loops=1)
   Recheck Cond: (fti @@ '''name'''::tsquery)
   ->  Bitmap Index Scan on page_fti_gin_idx  (cost=0.00..269.75 rows=3647 width=0) (actual time=84.818..84.818 rows=281027 loops=1)
         Index Cond: (fti @@ '''name'''::tsquery)
 Total runtime: 503.501 ms
(5 rows)

Very fast… so apparently my network throughput had gone into the toilet. The plan also shows why: rows=281027, meaning each of those keyword queries was shipping a couple hundred thousand rows back over the wire. I modified the search queries to return a count rather than the full result set (since I knew the count overhead to be quite minimal). The result of the count(*) version across 100 connections? 14.444 seconds. Sweet baby jesus! That breaks down to about 692 qps, and pretty much knocked out the earlier runs. Obviously network latency was an unacceptable differentiator, and soon after I moved all of my testing to local (TCP/IP) connections.

Generally, doing keyword matching, every batch of 10k queries ran somewhere between 15 and 25 seconds, depending on the exact query combination. Here's a sample of some of the results:

 Time (s)   Search
 --------   ------------------------------------------------------------------
     14.3   basic keyword search
     15.2   keyword with rank and limit
     80.8   keyword with rank, limit, joined to table for created within 30 days
     25.9   keyword with rank, limit, created within 30 days (same table)
     25.9   keyword created last 30 days, order by created date (same table)

In the above I combined the fti table with the description table (which held the created_on date information) to see how much of an effect the join had, and it was pretty significant; the joined version ran more than three times slower than its same-table equivalent. So we estimate a baseline of at least 350 qps with basic tuning (postgresql.conf, a fresh ANALYZE, that type of stuff). Certainly there can be a lot of other factors, but hopefully this will give folks a window into what you can do with tsearch and PostgreSQL 8.2.
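The results table above doesn't include the SQL for each variant, so here is a hedged sketch of what the query shapes plausibly looked like. It uses tsearch2's rank() function (renamed ts_rank() when tsearch was integrated into core in 8.3) and carries over the hypothetical page_description/created_on names from the earlier sketch:

-- the count(*) form used to take result-set transfer off the wire
SELECT count(*)
FROM page_fti
WHERE fti @@ plainto_tsquery('names');

-- keyword with rank and limit
SELECT page_id, rank(fti, plainto_tsquery('names')) AS score
FROM page_fti
WHERE fti @@ plainto_tsquery('names')
ORDER BY score DESC
LIMIT 10;

-- keyword with rank and limit, joined to the description table for the
-- created-within-30-days constraint (the 80.8 second case)
SELECT f.page_id, rank(f.fti, plainto_tsquery('names')) AS score
FROM page_fti f
JOIN page_description d ON d.page_id = f.page_id
WHERE f.fti @@ plainto_tsquery('names')
  AND d.created_on > now() - interval '30 days'
ORDER BY score DESC
LIMIT 10;

-- the "(same table)" variants, assuming created_on was denormalized
-- onto the fti table for those runs
SELECT page_id, rank(fti, plainto_tsquery('names')) AS score
FROM page_fti
WHERE fti @@ plainto_tsquery('names')
  AND created_on > now() - interval '30 days'
ORDER BY score DESC
LIMIT 10;

Given that the joined version ran roughly three times slower than the same-table one, denormalizing a frequently filtered column like created_on onto the fti table looks like a worthwhile trade when date constraints are common.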