This is a document for PGroonga 2.X and 3.X. See PGroonga 1.x document when you're using old PGroonga.
PostgreSQL doesn't support full text search for non alphabet based languages such as Asian languages including Japanese, Chinese and Korean by default. There are some third party modules to support Asian languages. PGroonga is one of them. pg_bigm is another module. There are some textsearch parsers such as zhparser for Chinese and textsearch_ja (not maintained) for Japanese.
The pg_bigm module is similar to the pg_trgm module. The pg_trgm module doesn't support Asian languages by default but the pg_bigm module supports them.
This documents describes about difference between PGroonga and pg_bigm.
Here are characteristics of each modules.
PGroonga is fast for searching because it doesn't need "recheck".
PGroonga is also fast for searching while updating because it doesn't block searching while updating. PGroonga doesn't decrease search performance while updating.
PGroonga is also fast for alphabet based languages and Asian languages mixed case. PGroonga uses N-gram based full text search by default. It's similar to pg_bigm because pg_bigm uses 2-gram (N is 2 case of N-gram). But PGroonga uses variable size N-gram. N-gram is slower for alphabet based languages because the number of letter types are few in alphabet. It's about 26 ("a" to "z" and more). Posting list in inverted index will be longer when the number of letter types are few. (If you're not familiar with inverted index, you may not understand this description.) See also PGroonga versus textsearch and pg_trgm. You can find pg_trgm that uses N-gram based full text search is slower. PGroonga uses word based full text search instead of N-gram based full text search for alphabet based languages even if the target text mixes alphabet based languages and Asian languages such as "私はPostgreSQLユーザーです。".
PGroonga is also fast for updating.
PGroonga index is large because it keeps index target text that is already stored in PostgreSQL.
PGroonga index isn't crash safe. You need to run REINDEX
when your PGroonga index is broken by crash.
The pg_bigm module is crash safe because it uses GIN. GIN is crash safe.
The pg_bigm module is slow when many documents are matched and each document is long. Because pg_bigm need "recheck" after index search.
Module | Supported languages | Search | Update | Size |
---|---|---|---|---|
PGroonga | All | Faster | Faster | Larger |
pg_bigm | All | Slower | Slower | Smaller |
This section shows benchmark result against Japanese Wikipedia. You can find benchmark script at postgresql.sh.
Here is the summary of full text search index creation benchmark result:
PGroonga is the fastest module.
pg_bigm is about 73% slower than PGroonga.
Here is the summary of full text search index size benchmark result:
pg_bigm is the smallest module.
PGroonga is about 2.3 times larger than pg_bigm.
Here is the summary of full text search performance benchmark result:
Full text search performance by pg_bigm is about 50 times slower than other modules expect "日本" (two characters case).
Full text search performances of PGroonga and Groonga are similar expect "日本" (many hit records case).
Here is the full text search performance graph without pg_bigm because pg_bigm is about 50 times slower than other modules:
Here is the benchmark environment:
CPU | Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (24cores) |
Memory | 32GiB |
Swap | 2GiB |
Storage | SSD (500GB) |
OS | CentOS 7.2 |
Here are the software versions:
PostgreSQL | PGroonga | pg_bigm |
---|---|---|
9.6.1 | 1.1.9 | 1.2-20161011 |
Here are statistics of the target data:
Size | About 5.9GiB |
The number of records | About 900thousands |
The average of title length in byte | About 21.6B |
The max title length in byte | 250B |
The average of body length in byte | About 6.7KiB |
The max body length in byte | About 677KiB |
Here is the benchmark result of data load. It's not related with full text search modules. Because any indexes aren't created yet.
Elapsed time | Database size |
---|---|
About 5minutes | About 5GB |
Here is the SQL to load data:
COPY wikipedia FROM 'ja-all-pages.csv' WITH CSV ENCODING 'utf8';
You can download the CSV data at ja-all-pages.csv.xz.
Here is the SQL to define the wikipedia
table:
CREATE TABLE wikipedia (
id integer PRIMARY KEY,
title text,
text text
);
Here are benchmark results of creating full text search indexes:
Module | Elapsed time | Index size | Note |
---|---|---|---|
PGroonga | About 19minutes | About 9.8GB | PGroonga copies data and creates index from them. Data are compressed by zlib. Index only size is about 6.4GB. |
pg_bigm | About 33minutes | About 4.2GB | maintenance_work_mem is 2GB . |
Here is the index definition of PGroonga:
CREATE INDEX wikipedia_index_pgroonga ON wikipedia
USING pgroonga (title, text);
Here is the index definition of pg_bigm:
CREATE INDEX wikipedia_index_pg_bigm ON wikipedia
USING GIN (title gin_bigm_ops, text gin_bigm_ops);
Here are benchmark results of full text searches.
"Groonga" means pgroonga_command('select ...')
result. See also pgroonga_command
.
"Relative elapsed time" is the ratio between the target elapsed time and the fastest case elapsed time. Larger means slower.
Query: "テレビアニメ"
Module | Elapsed time | N hits | Relative elapsed time |
---|---|---|---|
PGroonga | About 65ms | About 20thousands | About 1.1 |
Groonga | About 38ms | About 20thousands | 1 |
pg_bigm | About 2.8s | About 20thousands | About 48 |
Query: "データベース"
Module | Elapsed time | N hits | Relative elapsed time |
---|---|---|---|
PGroonga | About 49ms | About 15thousands | About 1.6 |
Groonga | About 31ms | About 15thousands | 1 |
pg_bigm | About 1.3s | About 15thousands | About 41 |
Query: "PostgreSQL OR MySQL"
Module | Elapsed time | N hits | Relative elapsed time |
---|---|---|---|
PGroonga | About 2ms | 316 | About 2 |
Groonga | About 1ms | 316 | 1 |
pg_bigm | About 49ms | 311 | About 49 |
Query: "日本"
Module | Elapsed time | N hits | Relative elapsed time |
---|---|---|---|
PGroonga | About 563ms | About 530thousands | About 10 |
Groonga | About 59ms | About 530thousands | 1 |
pg_bigm | About 479ms | About 530thousands | About 8 |