SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
https://www.2ndQuadrant.com
Event / Conference name
Location, Date
The State of (Full) Text
Search in PostgreSQL 12
FOSDEM 2020
Jimmy Angelakos
Senior PostgreSQL Architect
Twitter: @vyruss 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🇪🇺 🇬🇷
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Contents
●
(Full) Text Search
●
Operators
●
Functions
●
Dictionaries
●
Examples
●
Indexing
●
Non-natural text
●
Collation
●
Other “text” types
●
Maintenance
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Your attention please
● This presentation contains linguistics, NLP,
Markov chains, Levenshtein distances, and
various other confounding terms.
● These have been known to induce drowsiness
and inappropriate sleep onset in lecture theatres.
Allergy advice
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
What is Text?
(Baby don’t hurt me)
●
PostgreSQL character types
– CHAR(n)
– VARCHAR(n)
– VARCHAR, TEXT
●
Trailing spaces: significant (e.g. for LIKE / regex)
●
Storage
– Character Set (e.g. UTF-8)
– 1+126 bytes 4+→ n bytes
– Compression, TOAST
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
What is Text Search?
●
Information retrieval Text retrieval→
●
Search on metadata
– Descriptive, bibliographic, tags, etc.
– Discovery & identification
●
Search on parts of the text
– Matching
– Substring search
– Data extraction, cleaning, mining
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Text search operators in PostgreSQL
●
LIKE, ILIKE (~~, ~~*)
●
~, ~* (POSIX regex)
●
regexp_match(string text, pattern text)
●
But are SQL/regular expressions enough?
– No ranking of results
– No concept of language
– Cannot be indexed
●
Okay okay, can be somewhat indexed*
●
SIMILAR TO best forget about this one→
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
What is Full Text Search (FTS)?
●
Information retrieval Text retrieval Document retrieval→ →
●
Search on words (on tokens) in a database (all documents)
●
No index Serial search (e.g.→ grep)
●
Indexing Avoid scanning whole documents→
●
Techniques for criteria-based matching
– Natural Language Processing (NLP)
●
Precision vs Recall
– Stop words
– Stemming
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Documents? Tokens?
●
Document: a chunk of text (a field in a row)
●
Parsing of documents into classes of tokens
– PostgreSQL parser (or write your own… in C)
●
Conversion of tokens into lexemes
– Normalisation of strings
●
Lexeme: an abstract lexical unit representing related
words (i.e. word root)
– SEARCH searched, searcher→
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Stop words
●
Very common and have no value for our search
●
Filtering them out increases precision of search
●
Removal based on dictionaries
– Some check stoplist first
●
But: phrase search?
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Stemming
●
Reducing words to their roots (lexemes)
●
Increases number of results (recall)
●
Algorithms
– Normalisation using dictionaries
– Prefix/suffix stripping
– Automatic production rules
– Lemmatisation rules
– n-gram models
●
Multilingual stemming?
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS representation in PostgreSQL
●
tsvector
– A document!
– Preprocessed
●
tsquery
– Our search query!
– Normalized into lexemes
●
Utility functions
– to_tsvector(), plainto_tsquery(),
ts_debug(), etc.
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS operators in PostgreSQL
@@ tsvector matches tsquery
|| tsvector concatenation
&&, ||, !! tsquery AND, OR, NOT
<-> tsquery followed by tsquery
@> tsquery contains
<@ tsquery is contained in
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Dictionaries in PostgreSQL
●
Programs!
●
Accept tokens as input
●
Improve search quality
– Eliminate stop words
– Normalise words into lexemes
●
Reduce size of tsvector
●
CREATE TEXT SEARCH DICTIONARY name
(TEMPLATE = simple, STOPWORDS = english);
●
Can be chained: most specific more general→
ALTER TEXT SEARCH CONFIGURATION name
ADD MAPPING FOR word WITH english_ispell, simple;
●
ispell, myspell, hunspell, etc.
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Text matching example (1)
fts=# SELECT to_tsvector('A nice day for a car ride')
fts-# @@ plainto_tsquery('I am riding');
?column?
----------
t
(1 row)
fts=# SELECT to_tsvector('A nice day for a car ride');
to_tsvector
-----------------------------------
'car':6 'day':3 'nice':2 'ride':7
(1 row)
fts=# SELECT plainto_tsquery('I am riding');
plainto_tsquery
-----------------
'ride'
(1 row)
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Text matching example (2)
fts=# SELECT to_tsvector('A nice day for a car ride')
fts-# @@ plainto_tsquery('I am riding a bike');
?column?
----------
f
(1 row)
fts=# SELECT to_tsvector('A nice day for a car ride');
to_tsvector
-----------------------------------
'car':6 'day':3 'nice':2 'ride':7
(1 row)
fts=# SELECT plainto_tsquery('I am riding a bike');
plainto_tsquery
-----------------
'ride' & 'bike'
(1 row)
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Text matching example (3)
fts=# SELECT 'Starman' @@ 'star';
?column?
----------
f
(1 row)
fts=# SELECT 'Starman' @@ to_tsquery('star:*');
?column?
----------
t
(1 row)
fts=# SELECT websearch_to_tsquery('"The Stray Cats" -"cat shelter"');
websearch_to_tsquery
----------------------------------------------
'stray' <-> 'cat' & !( 'cat' <-> 'shelter' )
(1 row)
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
An example table
●
pgsql-hackers mailing list archive subset
fts=# d mail_messages
Table "public.mail_messages"
Column | Type | Collation | Nullable |
------------+-----------------------------+-----------+----------+-------------
id | integer | | not null | nextval('mai
parent_id | integer | | |
sent | timestamp without time zone | | |
subject | text | | |
author | text | | |
body_plain | text | | |
fts=# dt+ mail_messages
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------------+-------+----------+--------+-------------
public | mail_messages | table | postgres | 478 MB |
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Ranking results
ts_rank (and Cover Density variant ts_rank_cd)
fts=# SELECT subject, ts_rank(to_tsvector(coalesce(body_plain,'')),
fts(# to_tsquery('aggregate'), 32) AS rank
fts-# FROM mail_messages ORDER BY rank DESC LIMIT 5;
subject | rank
--------------------------------------------------------------+-------------
Re: Window functions patch v04 for the September commit fest | 0.08969686
Re: Window functions patch v04 for the September commit fest | 0.08940695
Re: [HACKERS] PoC: Grouped base relation | 0.08936066
Re: [HACKERS] PoC: Grouped base relation | 0.08931142
Re: [PERFORM] not using index for select min(...) | 0.08925897
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS Stats
ts_stat for verifying your TS configuration, identifying stop words
fts=# SELECT * FROM ts_stat(
fts(# 'SELECT to_tsvector(body_plain)
fts'# FROM mail_messages')
fts-# ORDER BY nentry DESC, ndoc DESC, word
fts-# LIMIT 5;
word | ndoc | nentry
-------+--------+--------
use | 173833 | 380951
wrote | 231174 | 350905
would | 157169 | 316416
think | 149858 | 256661
patch | 100991 | 226099
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Text indexing
Normal default:
●
B-Tree
– with B-Tree text_pattern_ops for left, right anchored text
– CREATE INDEX name ON table (column varchar_pattern_ops);
For FTS we have:
●
GIN
– Inverted index: one entry per lexeme
– Larger, slower to update Better on less dynamic data→
– On tsvector columns
●
GiST
– Lossy index, smaller but slower (to eliminate false positives)
– Better on fewer unique items
– On tsvector or tsquery columns
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS, unindexed
fts=# EXPLAIN ANALYZE SELECT count(*) FROM mail_messages
fts-# WHERE to_tsvector('english',body_plain) @@ to_tsquery('aggregate');
QUERY PLAN
-------------------------------------------------------------------------------
Finalize Aggregate (cost=122708.56..122708.57 rows=1 width=8) (actual time=26
-> Gather (cost=122708.34..122708.55 rows=2 width=8) (actual time=26981.64
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=121708.34..121708.35 rows=1 width=8) (act
-> Parallel Seq Scan on mail_messages (cost=0.00..121706.49 ro
Filter: (to_tsvector('english'::regconfig, body_plain) @@
Rows Removed by Filter: 116770
Planning Time: 0.258 ms
JIT:
Functions: 14
Options: Inlining false, Optimization false, Expressions true, Deforming tru
Timing: Generation 3.243 ms, Inlining 0.000 ms, Optimization 1.534 ms, Emiss
Execution Time: 26991.805 ms
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS indexing
CREATE INDEX ON mail_messages USING GIN
(to_tsvector('english',
subject ||' '|| body_plain));
●
New in PG12: Generated columns (stored):
ALTER TABLE mail_messages
ADD COLUMN fts_col tsvector
GENERATED ALWAYS AS (to_tsvector('english',
coalesce(subject, '') ||' '||
coalesce(body_plain, ''))) STORED;
CREATE INDEX ON mail_messages USING GIN (fts_col);
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS, GiST indexed
fts=# EXPLAIN ANALYZE SELECT count(*) FROM mail_messages
fts-# WHERE to_tsvector('english',body_plain) @@ to_tsquery('aggregate');
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=7210.61..7210.62 rows=1 width=8) (actual time=5630.167..5630.
-> Bitmap Heap Scan on mail_messages (cost=330.46..7206.16 rows=1781 width
Recheck Cond: (to_tsvector('english'::regconfig, body_plain) @@ to_tsq
Rows Removed by Index Recheck: 4267
Heap Blocks: exact=7883
-> Bitmap Index Scan on mail_messages_to_tsvector_idx (cost=0.00..33
Index Cond: (to_tsvector('english'::regconfig, body_plain) @@ to
Planning Time: 0.620 ms
Execution Time: 5630.249 ms
●
26.99 seconds 5.63 seconds! ~4.8x faster→ →
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
FTS, GIN indexed
fts=# EXPLAIN ANALYZE SELECT count(*) FROM mail_messages
fts-# WHERE to_tsvector('english',body_plain) @@ to_tsquery('aggregate');
QUERY PLAN
-------------------------------------------------------------------------------
Aggregate (cost=6873.60..6873.61 rows=1 width=8) (actual time=6.133..6.134 ro
-> Bitmap Heap Scan on mail_messages (cost=33.96..6869.18 rows=1769 width=
Recheck Cond: (to_tsvector('english'::regconfig, body_plain) @@ to_tsq
Heap Blocks: exact=4630
-> Bitmap Index Scan on mail_messages_to_tsvector_idx (cost=0.00..33
Index Cond: (to_tsvector('english'::regconfig, body_plain) @@ to
Planning Time: 0.433 ms
Execution Time: 5.684 ms
●
26.99 seconds 5.684→ milliseconds! → ~4700x faster
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
GIN, GiST indexed operations
●
GIN
– tsvector: @@
– jsonb: ? ?& ?| @> @? @@
●
GIST
– tsvector: @@
– tsquery: <@ @>
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Super useful modules
●
pg_trgm
– Trigram indexing operations
●
unaccent
– Dictionary: removes accents / diacritics
●
fuzzystrmatch
– String similarity: Levenshtein distances
(also Soundex, Metaphone, Double Metaphone)
– SELECT name FROM users WHERE
levenshtein('Stephen', name) <= 2;
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Other index types
●
VODKA =)
●
RUM
– https://github.com/postgrespro/rum
– Lexeme positional information stored
– Faster ranking
– Faster phrase search
– <=> Distance between timestamps, floats, money
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Free text but not natural?
●
One use case: identifying arbitrary strings
– e.g. keywords in device logs
●
Dictionaries not very helpful here
●
Arbitrary example: 10M * ~100 char “IoT device” log entries
– Some contain strings that are significant to user
(but we don’t know these keywords)
– Populate table with random hex codes but 1% of log entries
contains a keyword from /etc/dictionaries-common/words:
c4f2cede5da57f0ace6e669b51186cbaexcruciating9635d8a26a
efb2b4ee8b9845e89718577b3266f68dffa5ae12ebfebf1a508b21
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Free text but not natural?
fts=# SELECT message FROM logentries LIMIT 5 OFFSET 495;
message
--------------------------------------------------------------------------------------------------
da40c1006cd75105c1eb8ea70705828d195b264565f047c6d449e51cf99d01e901cf532f03018e793a394fdac9bb5d2a
aa88a5c43ec8b2a8578d44f924053e842584c0e6b8295b72230f7d19aa3ba2f2b9e1a4bffcf0f82e4d29344645b714ca
fe9731c39108a74714cad9fc8570b115howlingb9904fa4ad86544fb778ef5edfe362e02a94c66851c3c8d7fe47b26e5
b68430decf30085cc2e7810585c5d681source2b638d61c5972f25aa3fa5c35aa2be282f04843cfca007689cc6ecdbe3
5b7ba17108e416d04788dc9ac15121fad7625fa7c216666bf54c1b0ca21ab618829262dfd67a5cd40aefd66235cf9c7f
(5 rows)
fts=# dt+ logentries
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------+-------+----------+---------+-------------
public | logentries | table | postgres | 1421 MB |
(1 row)
fts=# SELECT * FROM logentries WHERE message LIKE '%source%';
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
How long?
fts=# EXPLAIN ANALYZE SELECT * FROM logentries WHERE message LIKE '%source%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..235029.95 rows=1000 width=109) (actual time=143.010..9654.769 rows=16 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on logentries (cost=0.00..233929.95 rows=417 width=109) (actual time=1017.442..
Filter: (message ~~ '%source%'::text)
Rows Removed by Filter: 3333594
Planning Time: 0.220 ms
JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 18.918 ms, Inlining 0.000 ms, Optimization 41.736 ms, Emission 121.955 ms, Total 18
Execution Time: 9673.582 ms
(12 rows)
●
9.6 seconds!
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Trigrams
●
n-gram model: probabilistic language model (Markov Chains)
●
3 characters trigrams→
●
Similarity of alphanumeric text number of shared trigrams→
●
CREATE EXTENSION pg_trgm;
●
fts=# SELECT show_trgm('source');
show_trgm
-------------------------------------
{" s"," so","ce ",our,rce,sou,urc}
●
fts=# CREATE INDEX ON logentries
fts-# USING GIN (message gin_trgm_ops);
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Did trigrams help?
fts=# EXPLAIN ANALYZE SELECT * FROM logentries WHERE message LIKE '%source%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on logentries (cost=87.75..3870.45 rows=1000 width=109) (actual time=0.152..0.206 rows
Recheck Cond: (message ~~ '%source%'::text)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=18
-> Bitmap Index Scan on logentries_message_idx (cost=0.00..87.50 rows=1000 width=0) (actual time=0.1
Index Cond: (message ~~ '%source%'::text)
Planning Time: 0.222 ms
Execution Time: 0.258 ms
(8 rows)
●
0.258 milliseconds! → ~37000x faster
●
Also work with regex
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
This comes at a cost
fts=# di+ logentries_message_idx
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+------------------------+-------+----------+------------+---------+-------------
public | logentries_message_idx | index | postgres | logentries | 1601 MB |
(1 row)
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Other neat trigram tricks
●
similarity(text, text) real→
●
text <-> text → Distance (1-similarity)
●
text % text true→ if over similarity_threshold
●
Supported by indexes:
– GIN
– GiST is efficient: k-nearest neighbour (k-NN)
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Character set support
●
pg_client_encoding()
●
convert(string bytea, src_encoding name,
dest_encoding name)
●
convert_from, convert_to
●
Automatic character set conversion
SET CLIENT_ENCODING TO 'value';
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Collation in PostgreSQL
●
Sort order and character classification
– Per-column: CREATE TABLE test1 (a text
COLLATE "de_DE" …
– Per-operation: SELECT a < b COLLATE "de_DE"
FROM test1;
– Not restricted by DB LC_COLLATE, LC_CTYPE
●
New in PG12: Nondeterministic collations (case-
insensitive, ignore accents)
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Other types of documents JSON→
●
Also a real world use case
●
JSONB supports indexing
(article ->> 'title' ||''||
article ->> 'author')::tsvector
●
jsonb_to_tsvector()
SELECT jsonb_to_tsvector('english', column,
'["numeric","key","string","boolean"]') FROM table;
●
New in PG12: SQL/JSON (SQL:2016) jsonpath expressions→
●
JsQuery: JSONB query language with GIN support
– Equivalent to tsquery, JSON query as a single value
– https://github.com/postgrespro/jsquery
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Finally, maintenance
●
VACUUM ANALYZE
– Keep your table statistics up-to-date
– Pending GIN entries
●
ALTER TABLE SET STATISTICS
– Keep your table statistics accurate
●
Number of distinct values
●
Correlated columns
●
EXPLAIN ANALYZE from time to time
– Your query works now – but a year from now?
●
maintenance_work_mem
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
The curious case of TEXT NAME 🤪
CREATE TABLE user (id serial, text name)
Type NAME
●
Sleepy developer 😴
●
Internal type for object names, 64 bytes
https://www.2ndQuadrant.com
FOSDEM
Brussels, 2020-02-02
Thanks! More info:
●
Dictionaries:
https://www.postgresql.org/docs/current/textsearch-dictionaries.html
●
Parsers:
https://www.postgresql.org/docs/current/textsearch-parsers.html
●
Ranking/Weights:
https://www.postgresql.org/docs/current/textsearch-controls.html
●
FTS functions:
https://www.postgresql.org/docs/current/functions-textsearch.html
●
Trigrams: https://www.postgresql.org/docs/current/pgtrgm.html
●
Collations: https://www.postgresql.org/docs/current/collation.html

Más contenido relacionado

Similar a The State of (Full) Text Search in PostgreSQL 12

Eff Plsql
Eff PlsqlEff Plsql
Eff Plsqlafa reg
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorMasahiko Sawada
 
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Ontico
 
Domain Specific Languages In Scala Duse3
Domain Specific Languages In Scala Duse3Domain Specific Languages In Scala Duse3
Domain Specific Languages In Scala Duse3Peter Maas
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Databasejavier ramirez
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
 
Meetup cassandra for_java_cql
Meetup cassandra for_java_cqlMeetup cassandra for_java_cql
Meetup cassandra for_java_cqlzznate
 
Of Haystacks And Needles
Of Haystacks And NeedlesOf Haystacks And Needles
Of Haystacks And NeedlesZendCon
 
Aileen heal postgis osmm cou
Aileen heal postgis osmm couAileen heal postgis osmm cou
Aileen heal postgis osmm couMatt Travis
 
Non-Relational Postgres
Non-Relational PostgresNon-Relational Postgres
Non-Relational PostgresEDB
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkLi Ming Tsai
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
11 Things About11g
11 Things About11g11 Things About11g
11 Things About11gfcamachob
 
11thingsabout11g 12659705398222 Phpapp01
11thingsabout11g 12659705398222 Phpapp0111thingsabout11g 12659705398222 Phpapp01
11thingsabout11g 12659705398222 Phpapp01Karam Abuataya
 
External Language Stored Procedures for MySQL
External Language Stored Procedures for MySQLExternal Language Stored Procedures for MySQL
External Language Stored Procedures for MySQLAntony T Curtis
 

Similar a The State of (Full) Text Search in PostgreSQL 12 (20)

Eff Plsql
Eff PlsqlEff Plsql
Eff Plsql
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributor
 
MYSQL -1.pptx
MYSQL -1.pptxMYSQL -1.pptx
MYSQL -1.pptx
 
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
 
Writing MySQL UDFs
Writing MySQL UDFsWriting MySQL UDFs
Writing MySQL UDFs
 
SQLQueries
SQLQueriesSQLQueries
SQLQueries
 
Sql analytic queries tips
Sql analytic queries tipsSql analytic queries tips
Sql analytic queries tips
 
Domain Specific Languages In Scala Duse3
Domain Specific Languages In Scala Duse3Domain Specific Languages In Scala Duse3
Domain Specific Languages In Scala Duse3
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Database
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?
 
Meetup cassandra for_java_cql
Meetup cassandra for_java_cqlMeetup cassandra for_java_cql
Meetup cassandra for_java_cql
 
Of Haystacks And Needles
Of Haystacks And NeedlesOf Haystacks And Needles
Of Haystacks And Needles
 
Aileen heal postgis osmm cou
Aileen heal postgis osmm couAileen heal postgis osmm cou
Aileen heal postgis osmm cou
 
Non-Relational Postgres
Non-Relational PostgresNon-Relational Postgres
Non-Relational Postgres
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
11 Things About11g
11 Things About11g11 Things About11g
11 Things About11g
 
11thingsabout11g 12659705398222 Phpapp01
11thingsabout11g 12659705398222 Phpapp0111thingsabout11g 12659705398222 Phpapp01
11thingsabout11g 12659705398222 Phpapp01
 
External Language Stored Procedures for MySQL
External Language Stored Procedures for MySQLExternal Language Stored Procedures for MySQL
External Language Stored Procedures for MySQL
 

Más de Jimmy Angelakos

Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]Jimmy Angelakos
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Jimmy Angelakos
 
Practical Partitioning in Production with Postgres
Practical Partitioning in Production with PostgresPractical Partitioning in Production with Postgres
Practical Partitioning in Production with PostgresJimmy Angelakos
 
Changing your huge table's data types in production
Changing your huge table's data types in productionChanging your huge table's data types in production
Changing your huge table's data types in productionJimmy Angelakos
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesJimmy Angelakos
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataJimmy Angelakos
 
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλονEισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλονJimmy Angelakos
 
PostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data ReplicationPostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data ReplicationJimmy Angelakos
 

Más de Jimmy Angelakos (9)

Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]Don't Do This [FOSDEM 2023]
Don't Do This [FOSDEM 2023]
 
Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]Slow things down to make them go faster [FOSDEM 2022]
Slow things down to make them go faster [FOSDEM 2022]
 
Practical Partitioning in Production with Postgres
Practical Partitioning in Production with PostgresPractical Partitioning in Production with Postgres
Practical Partitioning in Production with Postgres
 
Changing your huge table's data types in production
Changing your huge table's data types in productionChanging your huge table's data types in production
Changing your huge table's data types in production
 
Deploying PostgreSQL on Kubernetes
Deploying PostgreSQL on KubernetesDeploying PostgreSQL on Kubernetes
Deploying PostgreSQL on Kubernetes
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Using PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic DataUsing PostgreSQL with Bibliographic Data
Using PostgreSQL with Bibliographic Data
 
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλονEισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
Eισαγωγή στην PostgreSQL - Χρήση σε επιχειρησιακό περιβάλλον
 
PostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data ReplicationPostgreSQL: Mέθοδοι για Data Replication
PostgreSQL: Mέθοδοι για Data Replication
 

Último

Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 

Último (20)

Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 

The State of (Full) Text Search in PostgreSQL 12

  • 1. https://www.2ndQuadrant.com Event / Conference name Location, Date The State of (Full) Text Search in PostgreSQL 12 FOSDEM 2020 Jimmy Angelakos Senior PostgreSQL Architect Twitter: @vyruss 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🇪🇺 🇬🇷
  • 2. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Contents ● (Full) Text Search ● Operators ● Functions ● Dictionaries ● Examples ● Indexing ● Non-natural text ● Collation ● Other “text” types ● Maintenance
  • 3. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Your attention please ● This presentation contains linguistics, NLP, Markov chains, Levenshtein distances, and various other confounding terms. ● These have been known to induce drowsiness and inappropriate sleep onset in lecture theatres. Allergy advice
  • 4. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 What is Text? (Baby don’t hurt me) ● PostgreSQL character types – CHAR(n) – VARCHAR(n) – VARCHAR, TEXT ● Trailing spaces: significant (e.g. for LIKE / regex) ● Storage – Character Set (e.g. UTF-8) – 1+126 bytes 4+→ n bytes – Compression, TOAST
  • 5. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 What is Text Search? ● Information retrieval Text retrieval→ ● Search on metadata – Descriptive, bibliographic, tags, etc. – Discovery & identification ● Search on parts of the text – Matching – Substring search – Data extraction, cleaning, mining
  • 6. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Text search operators in PostgreSQL ● LIKE, ILIKE (~~, ~~*) ● ~, ~* (POSIX regex) ● regexp_match(string text, pattern text) ● But are SQL/regular expressions enough? – No ranking of results – No concept of language – Cannot be indexed ● Okay okay, can be somewhat indexed* ● SIMILAR TO best forget about this one→
  • 7. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 What is Full Text Search (FTS)? ● Information retrieval Text retrieval Document retrieval→ → ● Search on words (on tokens) in a database (all documents) ● No index Serial search (e.g.→ grep) ● Indexing Avoid scanning whole documents→ ● Techniques for criteria-based matching – Natural Language Processing (NLP) ● Precision vs Recall – Stop words – Stemming
  • 8. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Documents? Tokens? ● Document: a chunk of text (a field in a row) ● Parsing of documents into classes of tokens – PostgreSQL parser (or write your own… in C) ● Conversion of tokens into lexemes – Normalisation of strings ● Lexeme: an abstract lexical unit representing related words (i.e. word root) – SEARCH searched, searcher→
  • 9. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Stop words ● Very common and have no value for our search ● Filtering them out increases precision of search ● Removal based on dictionaries – Some check stoplist first ● But: phrase search?
  • 10. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Stemming ● Reducing words to their roots (lexemes) ● Increases number of results (recall) ● Algorithms – Normalisation using dictionaries – Prefix/suffix stripping – Automatic production rules – Lemmatisation rules – n-gram models ● Multilingual stemming?
  • 11. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS representation in PostgreSQL ● tsvector – A document! – Preprocessed ● tsquery – Our search query! – Normalized into lexemes ● Utility functions – to_tsvector(), plainto_tsquery(), ts_debug(), etc.
  • 12. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS operators in PostgreSQL @@ tsvector matches tsquery || tsvector concatenation &&, ||, !! tsquery AND, OR, NOT <-> tsquery followed by tsquery @> tsquery contains <@ tsquery is contained in
  • 13. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Dictionaries in PostgreSQL ● Programs! ● Accept tokens as input ● Improve search quality – Eliminate stop words – Normalise words into lexemes ● Reduce size of tsvector ● CREATE TEXT SEARCH DICTIONARY name (TEMPLATE = simple, STOPWORDS = english); ● Can be chained: most specific more general→ ALTER TEXT SEARCH CONFIGURATION name ADD MAPPING FOR word WITH english_ispell, simple; ● ispell, myspell, hunspell, etc.
  • 14. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Text matching example (1) fts=# SELECT to_tsvector('A nice day for a car ride') fts-# @@ plainto_tsquery('I am riding'); ?column? ---------- t (1 row) fts=# SELECT to_tsvector('A nice day for a car ride'); to_tsvector ----------------------------------- 'car':6 'day':3 'nice':2 'ride':7 (1 row) fts=# SELECT plainto_tsquery('I am riding'); plainto_tsquery ----------------- 'ride' (1 row)
  • 15. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Text matching example (2) fts=# SELECT to_tsvector('A nice day for a car ride') fts-# @@ plainto_tsquery('I am riding a bike'); ?column? ---------- f (1 row) fts=# SELECT to_tsvector('A nice day for a car ride'); to_tsvector ----------------------------------- 'car':6 'day':3 'nice':2 'ride':7 (1 row) fts=# SELECT plainto_tsquery('I am riding a bike'); plainto_tsquery ----------------- 'ride' & 'bike' (1 row)
  • 16. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Text matching example (3) fts=# SELECT 'Starman' @@ 'star'; ?column? ---------- f (1 row) fts=# SELECT 'Starman' @@ to_tsquery('star:*'); ?column? ---------- t (1 row) fts=# SELECT websearch_to_tsquery('"The Stray Cats" -"cat shelter"'); websearch_to_tsquery ---------------------------------------------- 'stray' <-> 'cat' & !( 'cat' <-> 'shelter' ) (1 row)
  • 17. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 An example table ● pgsql-hackers mailing list archive subset fts=# d mail_messages Table "public.mail_messages" Column | Type | Collation | Nullable | ------------+-----------------------------+-----------+----------+------------- id | integer | | not null | nextval('mai parent_id | integer | | | sent | timestamp without time zone | | | subject | text | | | author | text | | | body_plain | text | | | fts=# dt+ mail_messages List of relations Schema | Name | Type | Owner | Size | Description --------+---------------+-------+----------+--------+------------- public | mail_messages | table | postgres | 478 MB |
  • 18. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Ranking results ts_rank (and Cover Density variant ts_rank_cd) fts=# SELECT subject, ts_rank(to_tsvector(coalesce(body_plain,'')), fts(# to_tsquery('aggregate'), 32) AS rank fts-# FROM mail_messages ORDER BY rank DESC LIMIT 5; subject | rank --------------------------------------------------------------+------------- Re: Window functions patch v04 for the September commit fest | 0.08969686 Re: Window functions patch v04 for the September commit fest | 0.08940695 Re: [HACKERS] PoC: Grouped base relation | 0.08936066 Re: [HACKERS] PoC: Grouped base relation | 0.08931142 Re: [PERFORM] not using index for select min(...) | 0.08925897
  • 19. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS Stats ts_stat for verifying your TS configuration, identifying stop words fts=# SELECT * FROM ts_stat( fts(# 'SELECT to_tsvector(body_plain) fts'# FROM mail_messages') fts-# ORDER BY nentry DESC, ndoc DESC, word fts-# LIMIT 5; word | ndoc | nentry -------+--------+-------- use | 173833 | 380951 wrote | 231174 | 350905 would | 157169 | 316416 think | 149858 | 256661 patch | 100991 | 226099
  • 20. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Text indexing Normal default: ● B-Tree – with B-Tree text_pattern_ops for left, right anchored text – CREATE INDEX name ON table (column varchar_pattern_ops); For FTS we have: ● GIN – Inverted index: one entry per lexeme – Larger, slower to update Better on less dynamic data→ – On tsvector columns ● GiST – Lossy index, smaller but slower (to eliminate false positives) – Better on fewer unique items – On tsvector or tsquery columns
  • 21. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS, unindexed fts=# EXPLAIN ANALYZE SELECT count(*) FROM mail_messages fts-# WHERE to_tsvector('english',body_plain) @@ to_tsquery('aggregate'); QUERY PLAN ------------------------------------------------------------------------------- Finalize Aggregate (cost=122708.56..122708.57 rows=1 width=8) (actual time=26 -> Gather (cost=122708.34..122708.55 rows=2 width=8) (actual time=26981.64 Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=121708.34..121708.35 rows=1 width=8) (act -> Parallel Seq Scan on mail_messages (cost=0.00..121706.49 ro Filter: (to_tsvector('english'::regconfig, body_plain) @@ Rows Removed by Filter: 116770 Planning Time: 0.258 ms JIT: Functions: 14 Options: Inlining false, Optimization false, Expressions true, Deforming tru Timing: Generation 3.243 ms, Inlining 0.000 ms, Optimization 1.534 ms, Emiss Execution Time: 26991.805 ms
  • 22. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS indexing CREATE INDEX ON mail_messages USING GIN (to_tsvector('english', subject ||' '|| body_plain)); ● New in PG12: Generated columns (stored): ALTER TABLE mail_messages ADD COLUMN fts_col tsvector GENERATED ALWAYS AS (to_tsvector('english', coalesce(subject, '') ||' '|| coalesce(body_plain, ''))) STORED; CREATE INDEX ON mail_messages USING GIN (fts_col);
  • 23. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS, GiST indexed fts=# EXPLAIN ANALYZE SELECT count(*) FROM mail_messages fts-# WHERE to_tsvector('english',body_plain) @@ to_tsquery('aggregate'); QUERY PLAN ------------------------------------------------------------------------------- Aggregate (cost=7210.61..7210.62 rows=1 width=8) (actual time=5630.167..5630. -> Bitmap Heap Scan on mail_messages (cost=330.46..7206.16 rows=1781 width Recheck Cond: (to_tsvector('english'::regconfig, body_plain) @@ to_tsq Rows Removed by Index Recheck: 4267 Heap Blocks: exact=7883 -> Bitmap Index Scan on mail_messages_to_tsvector_idx (cost=0.00..33 Index Cond: (to_tsvector('english'::regconfig, body_plain) @@ to Planning Time: 0.620 ms Execution Time: 5630.249 ms ● 26.99 seconds 5.63 seconds! ~4.8x faster→ →
  • 24. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 FTS, GIN indexed fts=# EXPLAIN ANALYZE SELECT count(*) FROM mail_messages fts-# WHERE to_tsvector('english',body_plain) @@ to_tsquery('aggregate'); QUERY PLAN ------------------------------------------------------------------------------- Aggregate (cost=6873.60..6873.61 rows=1 width=8) (actual time=6.133..6.134 ro -> Bitmap Heap Scan on mail_messages (cost=33.96..6869.18 rows=1769 width= Recheck Cond: (to_tsvector('english'::regconfig, body_plain) @@ to_tsq Heap Blocks: exact=4630 -> Bitmap Index Scan on mail_messages_to_tsvector_idx (cost=0.00..33 Index Cond: (to_tsvector('english'::regconfig, body_plain) @@ to Planning Time: 0.433 ms Execution Time: 5.684 ms ● 26.99 seconds 5.684→ milliseconds! → ~4700x faster
  • 25. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 GIN, GiST indexed operations ● GIN – tsvector: @@ – jsonb: ? ?& ?| @> @? @@ ● GIST – tsvector: @@ – tsquery: <@ @>
  • 26. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Super useful modules ● pg_trgm – Trigram indexing operations ● unaccent – Dictionary: removes accents / diacritics ● fuzzystrmatch – String similarity: Levenshtein distances (also Soundex, Metaphone, Double Metaphone) – SELECT name FROM users WHERE levenshtein('Stephen', name) <= 2;
  • 27. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Other index types ● VODKA =) ● RUM – https://github.com/postgrespro/rum – Lexeme positional information stored – Faster ranking – Faster phrase search – <=> Distance between timestamps, floats, money
  • 28. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Free text but not natural? ● One use case: identifying arbitrary strings – e.g. keywords in device logs ● Dictionaries not very helpful here ● Arbitrary example: 10M * ~100 char “IoT device” log entries – Some contain strings that are significant to user (but we don’t know these keywords) – Populate table with random hex codes but 1% of log entries contains a keyword from /etc/dictionaries-common/words: c4f2cede5da57f0ace6e669b51186cbaexcruciating9635d8a26a efb2b4ee8b9845e89718577b3266f68dffa5ae12ebfebf1a508b21
  • 29. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Free text but not natural? fts=# SELECT message FROM logentries LIMIT 5 OFFSET 495; message -------------------------------------------------------------------------------------------------- da40c1006cd75105c1eb8ea70705828d195b264565f047c6d449e51cf99d01e901cf532f03018e793a394fdac9bb5d2a aa88a5c43ec8b2a8578d44f924053e842584c0e6b8295b72230f7d19aa3ba2f2b9e1a4bffcf0f82e4d29344645b714ca fe9731c39108a74714cad9fc8570b115howlingb9904fa4ad86544fb778ef5edfe362e02a94c66851c3c8d7fe47b26e5 b68430decf30085cc2e7810585c5d681source2b638d61c5972f25aa3fa5c35aa2be282f04843cfca007689cc6ecdbe3 5b7ba17108e416d04788dc9ac15121fad7625fa7c216666bf54c1b0ca21ab618829262dfd67a5cd40aefd66235cf9c7f (5 rows) fts=# dt+ logentries List of relations Schema | Name | Type | Owner | Size | Description --------+------------+-------+----------+---------+------------- public | logentries | table | postgres | 1421 MB | (1 row) fts=# SELECT * FROM logentries WHERE message LIKE '%source%';
  • 30. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 How long? fts=# EXPLAIN ANALYZE SELECT * FROM logentries WHERE message LIKE '%source%'; QUERY PLAN --------------------------------------------------------------------------------------------------------- Gather (cost=1000.00..235029.95 rows=1000 width=109) (actual time=143.010..9654.769 rows=16 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on logentries (cost=0.00..233929.95 rows=417 width=109) (actual time=1017.442.. Filter: (message ~~ '%source%'::text) Rows Removed by Filter: 3333594 Planning Time: 0.220 ms JIT: Functions: 6 Options: Inlining false, Optimization false, Expressions true, Deforming true Timing: Generation 18.918 ms, Inlining 0.000 ms, Optimization 41.736 ms, Emission 121.955 ms, Total 18 Execution Time: 9673.582 ms (12 rows) ● 9.6 seconds!
  • 31. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Trigrams ● n-gram model: probabilistic language model (Markov Chains) ● 3 characters trigrams→ ● Similarity of alphanumeric text number of shared trigrams→ ● CREATE EXTENSION pg_trgm; ● fts=# SELECT show_trgm('source'); show_trgm ------------------------------------- {" s"," so","ce ",our,rce,sou,urc} ● fts=# CREATE INDEX ON logentries fts-# USING GIN (message gin_trgm_ops);
  • 32. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Did trigrams help? fts=# EXPLAIN ANALYZE SELECT * FROM logentries WHERE message LIKE '%source%'; QUERY PLAN --------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on logentries (cost=87.75..3870.45 rows=1000 width=109) (actual time=0.152..0.206 rows Recheck Cond: (message ~~ '%source%'::text) Rows Removed by Index Recheck: 2 Heap Blocks: exact=18 -> Bitmap Index Scan on logentries_message_idx (cost=0.00..87.50 rows=1000 width=0) (actual time=0.1 Index Cond: (message ~~ '%source%'::text) Planning Time: 0.222 ms Execution Time: 0.258 ms (8 rows) ● 0.258 milliseconds! → ~37000x faster ● Also work with regex
  • 33. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 This comes at a cost fts=# di+ logentries_message_idx List of relations Schema | Name | Type | Owner | Table | Size | Description --------+------------------------+-------+----------+------------+---------+------------- public | logentries_message_idx | index | postgres | logentries | 1601 MB | (1 row)
  • 34. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Other neat trigram tricks ● similarity(text, text) real→ ● text <-> text → Distance (1-similarity) ● text % text true→ if over similarity_threshold ● Supported by indexes: – GIN – GiST is efficient: k-nearest neighbour (k-NN)
  • 35. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Character set support ● pg_client_encoding() ● convert(string bytea, src_encoding name, dest_encoding name) ● convert_from, convert_to ● Automatic character set conversion SET CLIENT_ENCODING TO 'value';
  • 36. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Collation in PostgreSQL ● Sort order and character classification – Per-column: CREATE TABLE test1 (a text COLLATE "de_DE" … – Per-operation: SELECT a < b COLLATE "de_DE" FROM test1; – Not restricted by DB LC_COLLATE, LC_CTYPE ● New in PG12: Nondeterministic collations (case- insensitive, ignore accents)
  • 37. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Other types of documents JSON→ ● Also a real world use case ● JSONB supports indexing (article ->> 'title' ||''|| article ->> 'author')::tsvector ● jsonb_to_tsvector() SELECT jsonb_to_tsvector('english', column, '["numeric","key","string","boolean"]') FROM table; ● New in PG12: SQL/JSON (SQL:2016) jsonpath expressions→ ● JsQuery: JSONB query language with GIN support – Equivalent to tsquery, JSON query as a single value – https://github.com/postgrespro/jsquery
  • 38. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Finally, maintenance ● VACUUM ANALYZE – Keep your table statistics up-to-date – Pending GIN entries ● ALTER TABLE SET STATISTICS – Keep your table statistics accurate ● Number of distinct values ● Correlated columns ● EXPLAIN ANALYZE from time to time – Your query works now – but a year from now? ● maintenance_work_mem
  • 39. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 The curious case of TEXT NAME 🤪 CREATE TABLE user (id serial, text name) Type NAME ● Sleepy developer 😴 ● Internal type for object names, 64 bytes
  • 40. https://www.2ndQuadrant.com FOSDEM Brussels, 2020-02-02 Thanks! More info: ● Dictionaries: https://www.postgresql.org/docs/current/textsearch-dictionaries.html ● Parsers: https://www.postgresql.org/docs/current/textsearch-parsers.html ● Ranking/Weights: https://www.postgresql.org/docs/current/textsearch-controls.html ● FTS functions: https://www.postgresql.org/docs/current/functions-textsearch.html ● Trigrams: https://www.postgresql.org/docs/current/pgtrgm.html ● Collations: https://www.postgresql.org/docs/current/collation.html