PGroonga versus textsearch and pg_trgm

PostgreSQL provides two full text search modules. They are textsearch and pg_trgm.

The textsearch module is a built-in full text search feature. It supports GiST and GIN indexes. If you use GiST or GIN index, you can improve full text search performance.

The pg_trgm module is a contrib module. pg_trgm provides full text search feature. pg_trgm also supports GiST and GIN indexes. If you use GiST or GIN index, you can improve full text search performance.

This documents describes about difference between PGroonga, textsearch and pg_trgm.

Characteristics

Here are characteristics of each modules.

PGroonga

PGroonga implements both language specific full text search and not language specific full text search. PGroonga uses not language specific full text search by default. Normally, PGroonga provides satisfactory full text search result by default.

PGroonga supports all languages by default because PGroonga uses not language specific full text search by default. You don't need to change source code.

PGroonga is fast for searching because it doesn't need "recheck".

PGroonga is also fast for searching while updating because it doesn't block searching while updating. PGroonga doesn't decrease search performance while updating.

PGroonga is also fast for updating.

PGroonga index is large because it keeps index target text that is already stored in PostgreSQL.

The textsearch module

The textsearch module implements language specific full text search. For example, English terms are stemmed. All of "work", "works" and "worked" are converted to "work". It'll improve recall.

Language specific full text search may improve full text search feature but "language specific" means someone implements "language specific" process. PostgreSQL 9.6.1 supports 15 languages such as English, French and Russian but other many languages aren't supported such as Asian languages including Japanese, Chinese and Korean.

The textsearch module is fast for searching because it doesn't need "recheck". (If you use weight, you need "recheck".)

The textsearch module can't process very large text. 1MiB - 1B is the max bytes of text. If you put 1MiB or more text, you will get the following error:

string is too long for tsvector (XXX bytes, max 1048575 bytes)

The pg_trgm module

The pg_trgm module implements not language specific full text search. But pg_trgm disables non ASCII characters support. It means that pg_trgm doesn't support many Asian languages such as Japanese and Chinese by default. You need to comment out the following line in contrib/pg_trgm/trgm.h to enable non ASCII characters support:

#define KEEPONLYALNUM

If you're using Debian based system, you can enable non ASCII characters support without changing pg_trgm source code by using C.UTF-8 locale.

The pg_trgm module is slow when many documents are matched and each document is long. Because pg_trgm need "recheck" after index search.

Summary

Module Supported languages Search Update Size
PGroonga All Faster Faster Larger
textsearch 15 (Asian languages aren't included) Faster More slower Smaller
pg_trgm ASCII only languages Slower Slower Smaller

Benchmark

This section shows benchmark result against English Wikipedia. You can find benchmark script at postgresql.sh.

Summary

Here is the summary of full text search index creation benchmark result:

Index creation time

Here is the summary of full text search index size benchmark result:

Index size

Here is the summary of full text search performance benchmark result:

Full text search performance

Here is the full text search performance graph without pg_trgm because pg_trgm is too slow:

Full text search performance without pg_trgm

Environment

Here is the benchmark environment:

CPU Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (24cores)
Memory 32GiB
Swap 2GiB
Storage SSD (500GB)
OS CentOS 7.2

Version

Here are the software versions:

PostgreSQL PGroonga
9.6.1 1.1.9

Data

Here are statistics of the target data:

Size About 33GiB
The number of records About 5.3millions
The average of title length in byte About 19.6B
The max title length in byte 211B
The average of body length in byte About 6.4KiB
The max body length in byte About 1MiB (1047190B)

It's important for textsearch that the max body length in byte is less than 1MiB. If it's equals to 1MiB - 1B or more, you can't create textsearch index with the following error message:

string is too long for tsvector (1618908 bytes, max 1048575 bytes)

Data load

Here is the benchmark result of data load. It's not related with full text search modules. Because any indexes aren't created yet.

Elapsed time Database size
About 26minutes About 21GB

Here is the SQL to load data:

COPY wikipedia FROM 'en-all-pages.csv' WITH CSV ENCODING 'utf8';

You can download the CSV data at en-all-pages.csv.xz.

Here is the SQL to define the wikipedia table:

CREATE TABLE wikipedia (
  id integer PRIMARY KEY,
  title text,
  text text
);

Index creation

Here are benchmark results of creating full text search indexes:

Module Elapsed time Index size Note
PGroonga About 1hour 24minutes About 39GB PGroonga copies data and creates index from them. Data are compressed by zlib. Index only size is about 21GB.
textsearch About 2hours 53minutes About 12GB maintenance_work_mem is 2GB. There are 3923 words that can't be indexed. (*)
pg_trgm About 1hour 50minutes About 7.6GB maintenance_work_mem is 2GB.

(*) Here is the error message of the case:

NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.

Here is the index definition of PGroonga:

CREATE INDEX wikipedia_index_pgroonga ON wikipedia
 USING pgroonga (title, text);

Here is the index definition of textsearch:

CREATE INDEX wikipedia_index_textsearch ON wikipedia
 USING GIN (to_tsvector('english', title),
            to_tsvector('english', text));

Here is the index definition of pg_trgm:

CREATE INDEX wikipedia_index_pg_trgm ON wikipedia
 USING GIN (title gin_trgm_ops, text gin_trgm_ops);

Here are benchmark results of full text searches.

Query: "animation"

Module Elapsed time N hits Relative elapsed time Note
PGroonga About 173ms About 40thousands About 29  
Groonga About 6ms About 40thousands 1  
textsearch About 1s About 420thousands About 167 N hits is about 10 times than other cases. It's caused by stemming. "animation" is stemmed as "anim" and "anim" is searched.
pg_trgm About 44s About 30thousands About 7333  

Query: "database"

Module Elapsed time N hits Relative elapsed time
PGroonga About 698ms About 210thousands About 37
Groonga About 19ms About 210thousands 1
textsearch About 602ms About 190thousands About 32
pg_trgm About 33s About 130thousands About 1736

Query: "PostgreSQL OR MySQL"

Module Elapsed time N hits Relative elapsed time
PGroonga About 6ms 1636 About 2
Groonga About 3ms 1636 1
textsearch About 3ms 1506 1
pg_trgm About 241ms 1484 About 80

Query: "America"

Module Elapsed time N hits Relative elapsed time
PGroonga About 1.3s About 470thousands About 29
Groonga About 45ms About 470thousands 1
textsearch About 1.2s About 480thousands About 26
pg_trgm About 1m32s About 1.4millions About 2044