Practical full-text search in PostgreSQL
Bill Karwin
PostgreSQL Conference West 09 • 2009/10/17

Full Text Search In PostgreSQL


A comparison of different solutions for full-text search in web applications using PostgreSQL and other technology. Presented at the PostgreSQL Conference West, in Seattle, October 2009.

Published in: Technology
Comments
  • To replicate what you're doing on slide 17:

    CREATE table posts (
    id int primary key,
    body text,
    title varchar (250),
    tags varchar (150)
    );
    INSERT INTO posts (id, body, title, tags) VALUES (1, 'postgresql & performance rank me high', 'postgresql', 'performance');
    INSERT INTO posts (id, body, title, tags) VALUES (2, 'Some post ranks 2nd best?', 'postgresql performance', 'performance');
    INSERT INTO posts (id, body, title, tags) VALUES (3, 'still match lowest rank ?', 'postgresql','performance');
    INSERT INTO posts (id, body, title, tags) VALUES (4, 'do not match', 'mysql','123456789');

    SELECT * FROM posts ;

    SELECT * FROM posts
    WHERE to_tsvector (title||''||body||''||tags)
    @@ to_tsquery('postgresql & performance');

    Only returns ID #2. Why do #1 and #3 above not match?
    Thanks
  • Thank you very much for your presentation!
  • Many thanks for sharing it :)
  • Thank you for that very compact tutorial. I would have wasted much time without it. Greetings from Vienna! :) (By the way, andrès is right: I also had to use to_tsquery(); without it, the query didn't work with Postgres 9.1.)

Full Text Search In PostgreSQL

  1. Practical full-text search in PostgreSQL
      Bill Karwin
      PostgreSQL Conference West 09 • 2009/10/17
  2. Me
      • 20+ years experience
        • Application/SDK developer
        • Support, Training, Proj Mgmt
        • C, Java, Perl, PHP
      • SQL maven
      • MySQL, PostgreSQL, InterBase
      • Zend Framework
      • Oracle, SQL Server, IBM DB2, SQLite
      • Community contributor
  3. Full Text Search
  4. Text search
      • Web applications demand speed
      • Let’s compare 5 solutions for text search
  5. Sample data
      • StackOverflow.com Posts
        • Data dump exported September 2009
        • 1.2 million tuples
        • ~850 MB
  6. StackOverflow ER diagram
  7. Naive Searching
      Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
      - Jamie Zawinski
  8. Performance issue
      • LIKE with wildcards (time: 91 sec):
          SELECT * FROM Posts
          WHERE body LIKE '%postgresql%'
      • POSIX regular expressions (time: 105 sec):
          SELECT * FROM Posts
          WHERE body ~ 'postgresql'
  9. Why so slow?
      CREATE TABLE telephone_book (
        full_name VARCHAR(50)
      );
      CREATE INDEX name_idx ON telephone_book (full_name);
      INSERT INTO telephone_book VALUES
        ('Riddle, Thomas'),
        ('Thomas, Dean');
  10. Why so slow?
      • Search for all with last name “Thomas” (uses index):
          SELECT * FROM telephone_book
          WHERE full_name LIKE 'Thomas%'
      • Search for all with first name “Thomas” (doesn’t use index):
          SELECT * FROM telephone_book
          WHERE full_name LIKE '%Thomas'
  11. Indexes don’t help searching for substrings
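
Why an ordinary index helps with a fixed prefix but not with a leading wildcard can be illustrated outside the database. The following is a minimal Java sketch of the idea, not PostgreSQL's B-tree code: a sorted list stands in for name_idx, and the class name, method names, and extra sample names are invented for illustration. A prefix search jumps to the first candidate and stops at the first non-match, while a suffix search has to read every entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an ordered index: entries are kept sorted.
class SortedNameIndex {
    private final List<String> sorted;
    int scanned = 0; // number of entries read during the scan phase

    SortedNameIndex(List<String> names) {
        sorted = new ArrayList<>(names);
        Collections.sort(sorted);
    }

    // Analogous to LIKE 'prefix%': binary-search to the first entry >= prefix,
    // then read forward until an entry no longer starts with the prefix.
    List<String> startsWith(String prefix) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).compareTo(prefix) < 0) lo = mid + 1; else hi = mid;
        }
        List<String> out = new ArrayList<>();
        for (int i = lo; i < sorted.size(); i++) {
            scanned++;
            if (!sorted.get(i).startsWith(prefix)) break; // sort order allows an early stop
            out.add(sorted.get(i));
        }
        return out;
    }

    // Analogous to LIKE '%suffix': sort order says nothing about how entries end,
    // so every entry must be examined.
    List<String> endsWith(String suffix) {
        List<String> out = new ArrayList<>();
        for (String name : sorted) {
            scanned++;
            if (name.endsWith(suffix)) out.add(name);
        }
        return out;
    }
}
```

With five names in the list, the prefix search for "Thomas" reads two entries and the suffix search reads all five; on 1.2 million rows the same difference is what separates milliseconds from a full scan.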
  12. Accuracy issue
      • Irrelevant or false matching words (‘one’, ‘money’, ‘prone’, etc.):
          body LIKE '%one%'
      • Regular expressions in PostgreSQL support escapes for word boundaries:
          body ~ '\yone\y'
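
The difference between a substring match and a whole-word match can be tried in a few lines of Java. This is only a sketch of the concept: Java's java.util.regex spells the word-boundary escape \b, whereas PostgreSQL uses \y, and the class and method names here are made up for the example.

```java
import java.util.regex.Pattern;

class WordMatch {
    // Comparable to body LIKE '%one%': matches anywhere, including inside other words.
    static boolean substringMatch(String body, String word) {
        return body.contains(word);
    }

    // Comparable to a word-boundary regular expression: matches only when the
    // word is delimited by non-word characters or the start/end of the text.
    static boolean wholeWordMatch(String body, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b")
                      .matcher(body)
                      .find();
    }
}
```

For the text "I have no money", substringMatch finds "one" but wholeWordMatch does not. The regular expression fixes accuracy only; it still has to scan every row.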
  13. Solutions
      • Full-Text Indexing in the RDBMS
      • Sphinx Search
      • Apache Lucene
      • Inverted Index
      • Search Engine Service
  14. PostgreSQL Text-Search
  15. PostgreSQL Text-Search
      • Since PostgreSQL 8.3
      • TSVECTOR to represent text data
      • TSQUERY to represent search predicates
      • Special indexes
  16. PostgreSQL Text-Search: Basic Querying
      SELECT * FROM Posts
      WHERE to_tsvector(title || ' ' || body || ' ' || tags)
        @@ to_tsquery('postgresql & performance');
      (@@ is the text-search matching operator)
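
The single-space separators between title, body, and tags in this query matter: without them, the last word of one column fuses with the first word of the next and neither can be found. The Java sketch below shows the effect with a deliberately crude tokenizer (lowercase, split on non-alphanumerics). It is an assumption-laden stand-in for to_tsvector, which also stems words and drops stop words; the class name and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ConcatTokens {
    // Crude tokenizer: lowercase the text and split on anything that is not a letter or digit.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // True when every search term appears as a separate token (an AND query).
    static boolean matchesAll(String document, String... terms) {
        return tokens(document).containsAll(Arrays.asList(terms));
    }
}
```

With title "postgresql", body "tuning tips", and tags "performance", joining with " " produces the tokens postgresql, tuning, tips, performance, so a search for both postgresql and performance matches. Joining with an empty string produces only postgresqltuning and tipsperformance, and the same search finds nothing.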
  17. PostgreSQL Text-Search: Basic Querying
      SELECT * FROM Posts
      WHERE title || ' ' || body || ' ' || tags
        @@ 'postgresql & performance';
      time with no index: 8 min 2 sec
  18. PostgreSQL Text-Search: Add TSVECTOR column
      ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;
      UPDATE Posts SET PostText =
        to_tsvector('english', title || ' ' || body || ' ' || tags);
  19. Special index types
      • GIN (generalized inverted index)
      • GiST (generalized search tree)
  20. PostgreSQL Text-Search: Indexing
      CREATE INDEX PostText_GIN ON Posts
        USING GIN(PostText);
      time: 39 min 36 sec
  21. PostgreSQL Text-Search: Querying
      SELECT * FROM Posts
      WHERE PostText @@ 'postgresql & performance';
      time with index: 20 milliseconds
  22. PostgreSQL Text-Search: Keep TSVECTOR in sync
      CREATE TRIGGER TS_PostText
        BEFORE INSERT OR UPDATE ON Posts
        FOR EACH ROW EXECUTE PROCEDURE
        tsvector_update_trigger(PostText, 'pg_catalog.english', title, body, tags);
  23. Lucene
  24. Lucene
      • Full-text indexing and search engine
      • Apache Project since 2001
      • Apache License
      • Java implementation
      • Ports exist for C, Perl, Ruby, Python, PHP, etc.
  25. Lucene: How to use
      1. Add documents to index
      2. Parse query
      3. Execute query
  26. Lucene: Creating an index
      • Programmatic solution in Java...
      time: 8 minutes 55 seconds
  27. Lucene: Indexing
      String url = "jdbc:postgresql:stackoverflow";
      Properties props = new Properties();
      props.setProperty("user", "postgres");
      Class.forName("org.postgresql.Driver");
      Connection con = DriverManager.getConnection(url, props);
      // run any SQL query
      Statement stmt = con.createStatement();
      String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
      ResultSet rs = stmt.executeQuery(sql);
      Date start = new Date();
      // open Lucene index writer
      IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
          new StandardAnalyzer(Version.LUCENE_CURRENT),
          true, IndexWriter.MaxFieldLength.LIMITED);
  28. Lucene: Indexing
      // loop over SQL result; each row is a Document with four Fields
      while (rs.next()) {
        Document doc = new Document();
        doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
      }
      // finish and close index
      writer.optimize();
      writer.close();
  29. Lucene: Querying
      • Parse a Lucene query:
          // define fields (names must match those used when indexing)
          String[] fields = new String[3];
          fields[0] = "Title";
          fields[1] = "Body";
          fields[2] = "Tags";
          // parse search query
          Query q = new MultiFieldQueryParser(fields,
              new StandardAnalyzer()).parse("performance");
      • Execute the query:
          Searcher s = new IndexSearcher(indexName);
          Hits h = s.search(q);
      time: 80 milliseconds
  30. Sphinx Search
  31. Sphinx Search
      • Embedded full-text search engine
      • Started in 2001
      • GPLv2 license
      • Good database integration
  32. Sphinx Search: How to use
      1. Edit configuration file
      2. Index the data
      3. Query the index
      4. Issues
  33. Sphinx Search: sphinx.conf
      source stackoverflowsrc
      {
        type = pgsql
        sql_host = localhost
        sql_user = postgres
        sql_pass = xxxx
        sql_db = stackoverflow
        sql_query = SELECT PostId, Title, Body, Tags FROM Posts
        sql_query_info = SELECT * FROM Posts WHERE PostId=$id
      }
  34. Sphinx Search: sphinx.conf
      index stackoverflow
      {
        source = stackoverflowsrc
        path = /opt/local/var/db/sphinx/stackoverflow
      }
  35. Sphinx Search: Building index
      indexer -c sphinx.conf stackoverflow
      collected 1242365 docs, 720.5 MB
      sorted 88.3 Mhits, 100.0% done
      total 1242365 docs, 720452944 bytes
      total 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec
      time: 5 min 57 sec
  36. Sphinx Search: Querying index
      search -c sphinx.conf -i stackoverflow -b "sql & performance"
      time: 8 milliseconds
  37. Sphinx Search: Issues
      • Index updates are as expensive as rebuilding the index from scratch
      • Maintain “main” index plus “delta” index for recent changes
      • Merge indexes periodically
      • Not all data fits into this model
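
The main-plus-delta arrangement can be sketched in a few lines of Java. This is a toy in-memory model under stated assumptions, not Sphinx's API or file format: every class and method name is hypothetical, and it handles only newly added documents. Searches consult both indexes, new documents go into the small delta, and a periodic merge folds the delta into the main index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MainDeltaIndex {
    private final Map<String, Set<Integer>> main = new HashMap<>();
    private final Map<String, Set<Integer>> delta = new HashMap<>();

    // Adds each word of the text to the given index, pointing at docId.
    private static void index(Map<String, Set<Integer>> target, int docId, String text) {
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                target.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Stand-in for the expensive full rebuild of the main index.
    void rebuildMain(Map<Integer, String> allDocs) {
        main.clear();
        delta.clear();
        allDocs.forEach((id, text) -> index(main, id, text));
    }

    // Recent documents go into the small delta index only.
    void addRecent(int docId, String text) {
        index(delta, docId, text);
    }

    // A search has to consult both indexes and combine the results.
    Set<Integer> search(String word) {
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(main.getOrDefault(word, Collections.<Integer>emptySet()));
        hits.addAll(delta.getOrDefault(word, Collections.<Integer>emptySet()));
        return hits;
    }

    // Periodic merge: fold the delta into the main index and start a fresh delta.
    void merge() {
        delta.forEach((word, ids) ->
            main.computeIfAbsent(word, k -> new TreeSet<>()).addAll(ids));
        delta.clear();
    }

    // Number of distinct words currently held in the delta index.
    int deltaSize() {
        return delta.size();
    }
}
```

The sketch has no way to update or delete a document that is already in the main index, which is one way to read the last bullet: data that changes in place does not fit the model well.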
  38. Inverted Index
  39. Inverted index
      Diagram: Posts, Tags, TagTypes. TagTypes holds the searchable words; Tags holds the intersection of words / Posts.
  40. Inverted index: Updated ER Diagram
  41. Inverted index: Data definition
      CREATE TABLE TagTypes (
        TagId SERIAL PRIMARY KEY,
        Tag VARCHAR(50) NOT NULL
      );
      CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);
      CREATE TABLE Tags (
        PostId INT NOT NULL,
        TagId INT NOT NULL,
        PRIMARY KEY (PostId, TagId),
        FOREIGN KEY (PostId) REFERENCES Posts (PostId),
        FOREIGN KEY (TagId) REFERENCES TagTypes (TagId)
      );
      CREATE INDEX Tags_PostId_index ON Tags(PostId);
      CREATE INDEX Tags_TagId_index ON Tags(TagId);
  42. Inverted index: Indexing
      INSERT INTO Tags (PostId, TagId)
        SELECT p.PostId, t.TagId
        FROM Posts p JOIN TagTypes t
          ON (p.Tags LIKE '%<' || t.Tag || '>%');
      90 seconds per tag!!
  43. Inverted index: Querying
      SELECT p.* FROM Posts p
      JOIN Tags t USING (PostId)
      JOIN TagTypes tt USING (TagId)
      WHERE tt.Tag = 'performance';
      40 milliseconds
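
To make the data structure concrete, here is a small in-memory inverted index in Java. It is a sketch, not the author's SQL method: it assumes, based on the LIKE pattern on slide 42, that the Tags column stores tags as <tag> tokens, and it parses each post's tag string once instead of scanning all posts once per tag. The class and method names are hypothetical. The map plays the role of TagTypes and Tags combined: each word points to the set of posts that carry it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagInvertedIndex {
    // Assumed tag format: "<postgresql><performance>".
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    // tag -> ids of the posts carrying that tag
    private final Map<String, Set<Integer>> postsByTag = new HashMap<>();

    // Parse one post's tag string and record the post under each of its tags.
    void addPost(int postId, String tags) {
        Matcher m = TAG.matcher(tags);
        while (m.find()) {
            postsByTag.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(postId);
        }
    }

    // Comparable to the slide's query: one lookup by tag, no scan of the posts.
    Set<Integer> postsWithTag(String tag) {
        return postsByTag.getOrDefault(tag, Collections.<Integer>emptySet());
    }

    // Posts carrying both tags: the intersection of two posting sets.
    Set<Integer> postsWithAllTags(String first, String second) {
        Set<Integer> result = new TreeSet<>(postsWithTag(first));
        result.retainAll(postsWithTag(second));
        return result;
    }
}
```

Lookup cost depends on the number of matching posts rather than on the size of the Posts table, which is consistent with the 40 millisecond query time against 90+ seconds for LIKE.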
  44. Search Engine Services
  45. Search engine services: Google Custom Search Engine
      • http://www.google.com/cse/
      • DEMO ➪ http://www.karwin.com/demo/gcse-demo.html
      even big web sites use this solution
  46. Search engine services: Is it right for you?
      • Your site is public and allows external index
      • Search is a non-critical feature for you
      • Search results are satisfactory
      • You need to offload search processing
  47. Comparison: Time to Build Index
      LIKE predicate      none
      PostgreSQL / GIN    40 min
      Sphinx Search       6 min
      Apache Lucene       9 min
      Inverted index      high
      Google / Yahoo!     offline
  48. Comparison: Index Storage
      LIKE predicate      none
      PostgreSQL / GIN    532 MB
      Sphinx Search       533 MB
      Apache Lucene       1071 MB
      Inverted index      101 MB
      Google / Yahoo!     offline
  49. Comparison: Query Speed
      LIKE predicate      90+ sec
      PostgreSQL / GIN    20 ms
      Sphinx Search       8 ms
      Apache Lucene       80 ms
      Inverted index      40 ms
      Google / Yahoo!     *
  50. Comparison: Bottom-Line
                          indexing   storage   query     solution
      LIKE predicate      none       none      11,250x   SQL
      PostgreSQL / GIN    7x         5.3x      2.5x      RDBMS
      Sphinx Search       1x *       5.3x      1x        3rd party
      Apache Lucene       1.5x       10x       10x       3rd party
      Inverted index      high       1x        5x        SQL
      Google / Yahoo!     offline    offline   *         Service
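
The multipliers on slide 50 appear to be each raw figure from slides 47 to 49 divided by the best (smallest) value in the same column, then rounded. That reading is an inference from the numbers, not something the slides state. The Java helper below, with a hypothetical name, reproduces the arithmetic.

```java
class Multiplier {
    // Ratio of a measurement to the best value in its column, rounded to one decimal place.
    static double relativeTo(double value, double best) {
        return Math.round(value / best * 10.0) / 10.0;
    }
}
```

For query speed, relativeTo(90000, 8) gives 11250.0 (LIKE at 90 seconds against Sphinx at 8 ms) and relativeTo(20, 8) gives 2.5 for PostgreSQL / GIN. For storage, relativeTo(532, 101) gives 5.3. The slide rounds some entries further: 40 min against 6 min is about 6.7 and is shown as 7x, and 1071 MB against 101 MB is about 10.6 and is shown as 10x.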
  51. Copyright 2009 Bill Karwin
      www.slideshare.net/billkarwin
      Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/
      You are free to share - to copy, distribute and transmit this work, under the following conditions:
      • Attribution. You must attribute this work to Bill Karwin.
      • Noncommercial. You may not use this work for commercial purposes.
      • No Derivative Works. You may not alter, transform, or build upon this work.