PGroonga versus pg_bigm

PostgreSQL doesn't support full text search for non alphabet based languages such as Asian languages including Japanese, Chinese and Korean by default. There are some third party modules to support Asian languages. PGroonga is one of them. pg_bigm is another module. There are some textsearch parsers such as zhparser for Chinese and textsearch_ja (not maintained) for Japanese.

The pg_bigm module is similar to the pg_trgm module. The pg_trgm module doesn't support Asian languages by default but the pg_bigm module supports them.

This documents describes about difference between PGroonga and pg_bigm.

Characteristics

Here are characteristics of each modules.

PGroonga

PGroonga is fast for searching because it doesn't need "recheck".

PGroonga is also fast for searching while updating because it doesn't block searching while updating. PGroonga doesn't decrease search performance while updating.

PGroonga is also fast for alphabet based languages and Asian languages mixed case. PGroonga uses N-gram based full text search by default. It's similar to pg_bigm because pg_bigm uses 2-gram (N is 2 case of N-gram). But PGroonga uses variable size N-gram. N-gram is slower for alphabet based languages because the number of letter types are few in alphabet. It's about 26 ("a" to "z" and more). Posting list in inverted index will be longer when the number of letter types are few. (If you're not familiar with inverted index, you may not understand this description.) See also PGroonga versus textsearch and pg_trgm. You can find pg_trgm that uses N-gram based full text search is slower. PGroonga uses word based full text search instead of N-gram based full text search for alphabet based languages even if the target text mixes alphabet based languages and Asian languages such as "私はPostgreSQLユーザーです。".

PGroonga is also fast for updating.

PGroonga index is large because it keeps index target text that is already stored in PostgreSQL.

PGroonga index isn't crash safe. You need to run REINDEX when your PGroonga index is broken by crash.

The pg_bigm module

The pg_bigm module is crash safe because it uses GIN. GIN is crash safe.

The pg_bigm module is slow when many documents are matched and each document is long. Because pg_bigm need "recheck" after index search.

Summary

Module Supported languages Search Update Size
PGroonga All Faster Faster Larger
pg_bigm All Slower Slower Smaller

Benchmark

This section shows benchmark result against Japanese Wikipedia. You can find benchmark script at postgresql.sh.

Summary

Here is the summary of full text search index creation benchmark result:

Index creation time

Here is the summary of full text search index size benchmark result:

Index size

Here is the summary of full text search performance benchmark result:

Full text search performance

Here is the full text search performance graph without pg_bigm because pg_bigm is about 50 times slower than other modules:

Full text search performance without pg_bigm

Environment

Here is the benchmark environment:

CPU Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (24cores)
Memory 32GiB
Swap 2GiB
Storage SSD (500GB)
OS CentOS 7.2

Version

Here are the software versions:

PostgreSQL PGroonga pg_bigm
9.6.1 1.1.9 1.2-20161011

Data

Here are statistics of the target data:

Size About 5.9GiB
The number of records About 900thousands
The average of title length in byte About 21.6B
The max title length in byte 250B
The average of body length in byte About 6.7KiB
The max body length in byte About 677KiB

Data load

Here is the benchmark result of data load. It's not related with full text search modules. Because any indexes aren't created yet.

Elapsed time Database size
About 5minutes About 5GB

Here is the SQL to load data:

COPY wikipedia FROM 'ja-all-pages.csv' WITH CSV ENCODING 'utf8';

You can download the CSV data at ja-all-pages.csv.xz.

Here is the SQL to define the wikipedia table:

CREATE TABLE wikipedia (
  id integer PRIMARY KEY,
  title text,
  text text
);

Index creation

Here are benchmark results of creating full text search indexes:

Module Elapsed time Index size Note
PGroonga About 19minutes About 9.8GB PGroonga copies data and creates index from them. Data are compressed by zlib. Index only size is about 6.4GB.
pg_bigm About 33minutes About 4.2GB maintenance_work_mem is 2GB.

Here is the index definition of PGroonga:

CREATE INDEX wikipedia_index_pgroonga ON wikipedia
 USING pgroonga (title, text);

Here is the index definition of pg_bigm:

CREATE INDEX wikipedia_index_pg_bigm ON wikipedia
 USING GIN (title gin_bigm_ops, text gin_bigm_ops);

Here are benchmark results of full text searches.

Query: "テレビアニメ"

Module Elapsed time N hits Relative elapsed time
PGroonga About 65ms About 20thousands About 1.1
Groonga About 38ms About 20thousands 1
pg_bigm About 2.8s About 20thousands About 48

Query: "データベース"

Module Elapsed time N hits Relative elapsed time
PGroonga About 49ms About 15thousands About 1.6
Groonga About 31ms About 15thousands 1
pg_bigm About 1.3s About 15thousands About 41

Query: "PostgreSQL OR MySQL"

Module Elapsed time N hits Relative elapsed time
PGroonga About 2ms 316 About 2
Groonga About 1ms 316 1
pg_bigm About 49ms 311 About 49

Query: "日本"

Module Elapsed time N hits Relative elapsed time
PGroonga About 563ms About 530thousands About 10
Groonga About 59ms About 530thousands 1
pg_bigm About 479ms About 530thousands About 8