Full Text Search In PostgreSQL

Practical
full-text search
in PostgreSQL
Bill Karwin
PostgreSQL Conference West 09 • 2009/10/17

Me
• 20+ years experience
• Application/SDK developer
• Support, Training, Proj Mgmt
• C, Java, Perl, PHP

• SQL maven
• MySQL, PostgreSQL, InterBase
• Zend Framework
• Oracle, SQL Server, IBM DB2, SQLite

• Community contributor

Text search

• Web applications demand speed
• Let’s compare 5 solutions for text search

Sample data

• StackOverﬂow.com Posts
• Data dump exported September 2009

• 1.2 million tuples

• ~850 MB

Naive Searching
Some people, when confronted with a problem,
think “I know, I’ll use regular expressions.”
Now they have two problems.
— Jamie Zawinsky

Performance issue

• LIKE with wildcards: time: 91 sec
SELECT * FROM Posts
WHERE body LIKE ‘%postgresql%’
• POSIX regular expressions:
SELECT * FROM Posts
WHERE body ~ ‘postgresql’ time: 105 sec

Why so slow?

CREATE TABLE telephone_book (

full_name

VARCHAR(50)
);
CREATE INDEX name_idx ON telephone_book

(full_name);
INSERT INTO telephone_book VALUES

(‘Riddle, Thomas’),

(‘Thomas, Dean’);

Why so slow?

• Search for all with last name “Thomas”
uses
SELECT * FROM telephone_book index
WHERE full_name LIKE ‘Thomas%’

• Search for all with ﬁrst name “Thomas”
SELECT * FROM telephone_book
WHERE full_name LIKE ‘%Thomas’
doesn’t
use index

Indexes don’t help
searching for substrings

Accuracy issue

• Irrelevant or false matching words
‘one’, ‘money’, ‘prone’, etc.:
body LIKE ‘%one%’
• Regular expressions in PostgreSQL
support escapes for word boundaries:
body ~ ‘yoney’

Solutions

• Full-Text Indexing in the RDBMS
• Sphinx Search
• Apache Lucene
• Inverted Index
• Search Engine Service

PostgreSQL Text-Search

• Since PostgreSQL 8.3
• TSVECTOR to represent text data
• TSQUERY to represent search predicates
• Special indexes

PostgreSQL Text-Search:

Basic Querying

SELECT * FROM Posts
WHERE to_tsvector(title || ‘ ’ || body || ‘ ’ || tags)

@@ to_tsquery(‘postgresql & performance’);

text-search
matching
operator


Basic Querying

SELECT * FROM Posts
WHERE title || ‘ ’ || body || ‘ ’ || tags

@@ ‘postgresql & performance’;

time with no index:
8 min 2 sec


Add TSVECTOR column

ALTER TABLE Posts ADD COLUMN

PostText TSVECTOR;
UPDATE Posts SET PostText =

to_tsvector(‘english’, title || ‘ ’ || body || ‘ ’ || tags);

Special index types

• GIN (generalized inverted index)
• GiST (generalized search tree)


Indexing

CREATE INDEX PostText_GIN ON Posts

USING GIN(PostText);

time: 39 min 36 sec


Querying

SELECT * FROM Posts
WHERE PostText @@ ‘postgresql & performance’;

time with index:
20 milliseconds


Keep TSVECTOR in sync

CREATE TRIGGER TS_PostText

BEFORE INSERT OR UPDATE ON Posts
FOR EACH ROW
EXECUTE PROCEDURE

tsvector_update_trigger(
ostText,
P

‘english’, title, body, tags);

Lucene

• Full-text indexing and search engine
• Apache Project since 2001
• Apache License
• Java implementation
• Ports exist for C, Perl, Ruby, Python, PHP,
etc.

Lucene:

How to use

1. Add documents to index
2. Parse query
3. Execute query

Lucene:

Creating an index

• Programmatic solution in Java...
time: 8 minutes 55 seconds

Lucene:

Indexing
String url = "jdbc:postgresql:stackoverﬂow";
Properties props = new Properties();
props.setProperty("user", "postgres");
run any SQL query
Class.forName("org.postgresql.Driver");
Connection con = DriverManager.getConnection(url, props);

Statement stmt = con.createStatement();
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
ResultSet rs = stmt.executeQuery(sql);
open Lucene
Date start = new Date(); index writer
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),

new StandardAnalyzer(Version.LUCENE_CURRENT),

true, IndexWriter.MaxFieldLength.LIMITED);

Lucene:

Indexing
loop over SQL result

while (rs.next()) {
Document doc = new Document();

doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));

writer.addDocument(doc); each row is
}
a Document
writer.optimize();
writer.close();
with four Fields

ﬁnish and
close index

Lucene:

Querying

• Parse a Lucene query define fields
String[] fields = new String[3];
fields[0] = “title”; fields[1] = “body”; fields[2] = “tags”;

Query q = new MultiFieldQueryParser(fields,

new StandardAnalyzer()).parse(‘performance’);

• Execute the query parse search
query
Searcher s = new IndexSearcher(indexName);

Hits h = s.search(q);
time: 80 milliseconds

Sphinx Search

• Embedded full-text search engine
• Started in 2001
• GPLv2 license
• Good database integration

Sphinx Search:

How to use

1. Edit conﬁguration ﬁle
2. Index the data
3. Query the index
4. Issues

Sphinx Search:

sphinx.conf

source stackoverﬂowsrc
{

type = pgsql

sql_host = localhost

sql_user = postgres

sql_pass = xxxx

sql_db = stackoverﬂow

sql_query = SELECT PostId, Title, Body, Tags FROM Posts

sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}

Sphinx Search:

sphinx.conf

index stackoverflow
{

source = stackoverflowsrc

path = /opt/local/var/db/sphinx/stackoverflow
}

Sphinx Search:

Building index

indexer -c sphinx.conf stackoverﬂow
collected 1242365 docs, 720.5 MB
sorted 88.3 Mhits, 100.0% done
total 1242365 docs, 720452944 bytes
total 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec

time: 5 min 57 sec

Sphinx Search:

Querying index

search -c sphinx.conf -i stackoverﬂow

-b “sql & performance”

time: 8 milliseconds

Sphinx Search:

Issues

• Index updates are as expensive as
rebuilding the index from scratch
• Maintain “main” index plus “delta” index for
recent changes

• Merge indexes periodically

• Not all data ﬁts into this model

Inverted index

searchable words

Posts Tags TagTypes

intersection of
words / Posts

Inverted index:

Updated ER Diagram

Inverted index:

Data deﬁnition
CREATE TABLE TagTypes (

TagId

SERIAL PRIMARY KEY,

Tag

VARCHAR(50) NOT NULL
);

CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);

CREATE TABLE Tags (

PostId

INT NOT NULL,

TagId

INT NOT NULL,

PRIMARY KEY (PostId, TagId),

FOREIGN KEY (PostId) REFERENCES Posts (PostId),

FOREIGN KEY (TagId) REFERENCES TagTypes (TagId)
);

CREATE INDEX Tags_PostId_index ON Tags(PostId);
CREATE INDEX Tags_TagId_index ON Tags(TagId);

Inverted index:

Indexing

INSERT INTO Tags (PostId, TagId)

SELECT p.PostId, t.TagId

FROM Posts p JOIN TagTypes t

ON (p.Tags LIKE ‘%<’ || t.Tag || ‘>%’);

90 seconds
per tag!!

Inverted index:

Querying

SELECT p.* FROM Posts p
JOIN Tags t USING (PostId)
JOIN TagTypes tt USING (TagId)
WHERE tt.Tag = ‘performance’;

40 milliseconds

Search engine services:

Google Custom Search Engine

• http://www.google.com/cse/

• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html

even big web sites
use this solution

Search engine services:

Is it right for you?

• Your site is public and allows external index
• Search is a non-critical feature for you
• Search results are satisfactory
• You need to ofﬂoad search processing

Comparison: Time to Build Index
LIKE predicate none

PostgreSQL / GIN 40 min

Sphinx Search 6 min

Apache Lucene 9 min

Inverted index high

Google / Yahoo! ofﬂine

Comparison: Index Storage
LIKE predicate none

PostgreSQL / GIN 532 MB

Sphinx Search 533 MB

Apache Lucene 1071 MB

Inverted index 101 MB

Google / Yahoo! ofﬂine

Comparison: Query Speed
LIKE predicate 90+ sec

PostgreSQL / GIN 20 ms

Sphinx Search 8 ms

Apache Lucene 80 ms

Inverted index 40 ms

Google / Yahoo! *

Comparison: Bottom-Line
indexing storage query solution

LIKE predicate none none 11,250x SQL

PostgreSQL / GIN 7x 5.3x 2.5x RDBMS

Sphinx Search 1x * 5.3x 1x 3rd party

Apache Lucene 1.5x 10x 10x 3rd party

Inverted index high 1x 5x SQL

Google / Yahoo! ofﬂine ofﬂine * Service

Copyright 2009 Bill Karwin
www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License:
http://creativecommons.org/licenses/by-nc-nd/3.0/

You are free to share - to copy, distribute and
transmit this work, under the following conditions:

Attribution. Noncommercial. No Derivative Works.
You must attribute this You may not use this work You may not alter,
work to Bill Karwin. for commercial purposes. transform, or build
upon this work.

Full Text Search In PostgreSQL

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Full Text Search In PostgreSQL

Similar a Full Text Search In PostgreSQL (20)

Más de Karwin Software Solutions LLC

Más de Karwin Software Solutions LLC (17)

Último

Último (20)

Full Text Search In PostgreSQL