Full Text Search In PostgreSQL
Upcoming SlideShare
Loading in...5
×
 

Full Text Search In PostgreSQL

on

  • 58,572 reproducciones

A comparison of different solutions for full-text search in web applications using PostgreSQL and other technology. Presented at the PostgreSQL Conference West, in Seattle, October 2009.

A comparison of different solutions for full-text search in web applications using PostgreSQL and other technology. Presented at the PostgreSQL Conference West, in Seattle, October 2009.

Statistics

reproducciones

Total Views
58,572
Views on SlideShare
56,989
Embed Views
1,583

Actions

Likes
52
Downloads
854
Comments
8

22 insertados 1,583

http://rofaidaawad.wordpress.com 596
http://www.scoop.it 457
http://www.chebyte.com 312
http://www.slideshare.net 132
http://localhost:1970 25
https://twitter.com 9
http://192.168.4.223 8
http://blog.hashdog.com 8
http://www.techgig.com 6
http://localhost 5
http://www.twylah.com 5
url_unknown 4
http://kowidi.com 3
http://www.365dailyjournal.com 2
http://www.hashdog.com 2
http://itnlu.info 2
http://a0.twimg.com 2
http://115.112.206.131 1
http://www.m.techgig.com 1
http://s.deeeki.com 1
http://www.google.com 1
https://www.linkedin.com 1
Más...

Accesibilidad

Categorias

Detalles de carga

Uploaded via as Adobe PDF

Derechos de uso

© Todos los derechos reservados

Report content

Marcada como inapropiada Marcar como inapropiada
Marcar como inapropiada

Seleccione la razón para marcar esta presentación como inapropiada.

Cancelar
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Tu mensaje aparecerá aquí
    Processing...
  • To replicate what your doing on slide 17:

    CREATE table posts (
    id int primary key,
    body text,
    title varchar (250),
    tags varchar (150)
    );
    INSERT INTO posts (id, body, title, tags) VALUES (1, 'postgresql & performance rank me high', 'postgresql', 'performance');
    INSERT INTO posts (id, body, title, tags) VALUES (2, 'Some post ranks 2nd best?', 'postgresql performance', 'performance');
    INSERT INTO posts (id, body, title, tags) VALUES (3, 'still match lowest rank ?', 'postgresql','performance');
    INSERT INTO posts (id, body, title, tags) VALUES (4, 'do not match', 'mysql','123456789');

    SELECT * FROM posts ;

    SELECT * FROM posts
    WHERE to_tsvector (title||''||body||''||tags)
    @@ to_tsquery('postgresql & performance');

    Only returns ID#2 why does #1 and #3 above not match.
    Thanks
    Are you sure you want to
    Tu mensaje aparecerá aquí
    Processing...
  • Thank you for your presentation!!!!!! Very thank you
    Are you sure you want to
    Tu mensaje aparecerá aquí
    Processing...
  • Many thanks for sharing it :)
    Are you sure you want to
    Tu mensaje aparecerá aquí
    Processing...
  • Best one
    Hope you are in good health. My name is AMANDA . I am a single girl, Am looking for reliable and honest person. please have a little time for me. Please reach me back amanda_n14144@yahoo.com so that i can explain all about myself .
    Best regards AMANDA.
    amanda_n14144@yahoo.com
    Are you sure you want to
    Tu mensaje aparecerá aquí
    Processing...
  • thank you for that very compact tutorial. I would have wasted much time without it. greetings from vienna! :) (bzw: andrès is right: I also HAD TO use to_tsquery() - without that it didn't work with postgres 9.1)
    Are you sure you want to
    Tu mensaje aparecerá aquí
    Processing...
Publicar comentario
Edite su comentario

Full Text Search In PostgreSQL Full Text Search In PostgreSQL Presentation Transcript

  • Practical full-text search in PostgreSQL Bill Karwin PostgreSQL Conference West 09 • 2009/10/17
  • Me • 20+ years experience • Application/SDK developer • Support, Training, Proj Mgmt • C, Java, Perl, PHP • SQL maven • MySQL, PostgreSQL, InterBase • Zend Framework • Oracle, SQL Server, IBM DB2, SQLite • Community contributor
  • Full Text Search
  • Text search • Web applications demand speed • Let’s compare 5 solutions for text search
  • Sample data • StackOverflow.com Posts • Data dump exported September 2009 • 1.2 million tuples • ~850 MB
  • StackOverflow ER diagram
  • Naive Searching Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky
  • Performance issue • LIKE with wildcards: time: 91 sec SELECT * FROM Posts WHERE body LIKE ‘%postgresql%’ • POSIX regular expressions: SELECT * FROM Posts WHERE body ~ ‘postgresql’ time: 105 sec
  • Why so slow? CREATE TABLE telephone_book ( full_name VARCHAR(50) ); CREATE INDEX name_idx ON telephone_book (full_name); INSERT INTO telephone_book VALUES (‘Riddle, Thomas’), (‘Thomas, Dean’);
  • Why so slow? • Search for all with last name “Thomas” uses SELECT * FROM telephone_book index WHERE full_name LIKE ‘Thomas%’ • Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE ‘%Thomas’ doesn’t use index
  • Indexes don’t help searching for substrings
  • Accuracy issue • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: body LIKE ‘%one%’ • Regular expressions in PostgreSQL support escapes for word boundaries: body ~ ‘yoney’
  • Solutions • Full-Text Indexing in the RDBMS • Sphinx Search • Apache Lucene • Inverted Index • Search Engine Service
  • PostgreSQL Text-Search
  • PostgreSQL Text-Search • Since PostgreSQL 8.3 • TSVECTOR to represent text data • TSQUERY to represent search predicates • Special indexes
  • PostgreSQL Text-Search: Basic Querying SELECT * FROM Posts WHERE to_tsvector(title || ‘ ’ || body || ‘ ’ || tags) @@ to_tsquery(‘postgresql & performance’); text-search matching operator
  • PostgreSQL Text-Search: Basic Querying SELECT * FROM Posts WHERE title || ‘ ’ || body || ‘ ’ || tags @@ ‘postgresql & performance’; time with no index: 8 min 2 sec
  • PostgreSQL Text-Search: Add TSVECTOR column ALTER TABLE Posts ADD COLUMN PostText TSVECTOR; UPDATE Posts SET PostText = to_tsvector(‘english’, title || ‘ ’ || body || ‘ ’ || tags);
  • Special index types • GIN (generalized inverted index) • GiST (generalized search tree)
  • PostgreSQL Text-Search: Indexing CREATE INDEX PostText_GIN ON Posts USING GIN(PostText); time: 39 min 36 sec
  • PostgreSQL Text-Search: Querying SELECT * FROM Posts WHERE PostText @@ ‘postgresql & performance’; time with index: 20 milliseconds
  • PostgreSQL Text-Search: Keep TSVECTOR in sync CREATE TRIGGER TS_PostText BEFORE INSERT OR UPDATE ON Posts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger( ostText, P ‘english’, title, body, tags);
  • Lucene
  • Lucene • Full-text indexing and search engine • Apache Project since 2001 • Apache License • Java implementation • Ports exist for C, Perl, Ruby, Python, PHP, etc.
  • Lucene: How to use 1. Add documents to index 2. Parse query 3. Execute query
  • Lucene: Creating an index • Programmatic solution in Java... time: 8 minutes 55 seconds
  • Lucene: Indexing String url = "jdbc:postgresql:stackoverflow"; Properties props = new Properties(); props.setProperty("user", "postgres"); run any SQL query Class.forName("org.postgresql.Driver"); Connection con = DriverManager.getConnection(url, props); Statement stmt = con.createStatement(); String sql = "SELECT PostId, Title, Body, Tags FROM Posts"; ResultSet rs = stmt.executeQuery(sql); open Lucene Date start = new Date(); index writer IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  • Lucene: Indexing loop over SQL result while (rs.next()) { Document doc = new Document(); doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO)); doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED)); doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED)); writer.addDocument(doc); each row is } a Document writer.optimize(); writer.close(); with four Fields finish and close index
  • Lucene: Querying • Parse a Lucene query define fields String[] fields = new String[3]; fields[0] = “title”; fields[1] = “body”; fields[2] = “tags”; Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse(‘performance’); • Execute the query parse search query Searcher s = new IndexSearcher(indexName); Hits h = s.search(q); time: 80 milliseconds
  • Sphinx Search
  • Sphinx Search • Embedded full-text search engine • Started in 2001 • GPLv2 license • Good database integration
  • Sphinx Search: How to use 1. Edit configuration file 2. Index the data 3. Query the index 4. Issues
  • Sphinx Search: sphinx.conf source stackoverflowsrc { type = pgsql sql_host = localhost sql_user = postgres sql_pass = xxxx sql_db = stackoverflow sql_query = SELECT PostId, Title, Body, Tags FROM Posts sql_query_info = SELECT * FROM Posts WHERE PostId=$id }
  • Sphinx Search: sphinx.conf index stackoverflow { source = stackoverflowsrc path = /opt/local/var/db/sphinx/stackoverflow }
  • Sphinx Search: Building index indexer -c sphinx.conf stackoverflow collected 1242365 docs, 720.5 MB sorted 88.3 Mhits, 100.0% done total 1242365 docs, 720452944 bytes total 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec time: 5 min 57 sec
  • Sphinx Search: Querying index search -c sphinx.conf -i stackoverflow -b “sql & performance” time: 8 milliseconds
  • Sphinx Search: Issues • Index updates are as expensive as rebuilding the index from scratch • Maintain “main” index plus “delta” index for recent changes • Merge indexes periodically • Not all data fits into this model
  • Inverted Index
  • Inverted index searchable words Posts Tags TagTypes intersection of words / Posts
  • Inverted index: Updated ER Diagram
  • Inverted index: Data definition CREATE TABLE TagTypes ( TagId SERIAL PRIMARY KEY, Tag VARCHAR(50) NOT NULL ); CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag); CREATE TABLE Tags ( PostId INT NOT NULL, TagId INT NOT NULL, PRIMARY KEY (PostId, TagId), FOREIGN KEY (PostId) REFERENCES Posts (PostId), FOREIGN KEY (TagId) REFERENCES TagTypes (TagId) ); CREATE INDEX Tags_PostId_index ON Tags(PostId); CREATE INDEX Tags_TagId_index ON Tags(TagId);
  • Inverted index: Indexing INSERT INTO Tags (PostId, TagId) SELECT p.PostId, t.TagId FROM Posts p JOIN TagTypes t ON (p.Tags LIKE ‘%<’ || t.Tag || ‘>%’); 90 seconds per tag!!
  • Inverted index: Querying SELECT p.* FROM Posts p JOIN Tags t USING (PostId) JOIN TagTypes tt USING (TagId) WHERE tt.Tag = ‘performance’; 40 milliseconds
  • Search Engine Services
  • Search engine services: Google Custom Search Engine • http://www.google.com/cse/ • DEMO ➪ http://www.karwin.com/demo/gcse-demo.html even big web sites use this solution
  • Search engine services: Is it right for you? • Your site is public and allows external index • Search is a non-critical feature for you • Search results are satisfactory • You need to offload search processing
  • Comparison: Time to Build Index LIKE predicate none PostgreSQL / GIN 40 min Sphinx Search 6 min Apache Lucene 9 min Inverted index high Google / Yahoo! offline
  • Comparison: Index Storage LIKE predicate none PostgreSQL / GIN 532 MB Sphinx Search 533 MB Apache Lucene 1071 MB Inverted index 101 MB Google / Yahoo! offline
  • Comparison: Query Speed LIKE predicate 90+ sec PostgreSQL / GIN 20 ms Sphinx Search 8 ms Apache Lucene 80 ms Inverted index 40 ms Google / Yahoo! *
  • Comparison: Bottom-Line indexing storage query solution LIKE predicate none none 11,250x SQL PostgreSQL / GIN 7x 5.3x 2.5x RDBMS Sphinx Search 1x * 5.3x 1x 3rd party Apache Lucene 1.5x 10x 10x 3rd party Inverted index high 1x 5x SQL Google / Yahoo! offline offline * Service
  • Copyright 2009 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works. You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work.