SlideShare una empresa de Scribd logo
1 de 35
Descargar para leer sin conexión
PostgreSQL FTS
   Solutions
FOSDEM PGDAY 2013
  by Emanuel Calvo
About me:
●   Operational DBA at PalominoDB.
     ○ MySQL, Maria and PostgreSQL databases.
●   Spanish Press Contact.
●   Check out my LinkedIn Profile at: http://es.linkedin.com/in/ecbcbcb/
Credits
●   Thanks to:
     ○ Andrew Atanasoff
     ○ Vlad Fedorkov
     ○ All the PalominoDB people that help out !
Palomino - Service Offerings
●   Monthly Support:
     ○ Being renamed to Palomino DBA as a service.
     ○ Eliminating 10 hour monthly clients.
     ○ Discounts are based on spend per month (0-80, 81-160, 161+
     ○ We will be penalizing excessive paging financially.
     ○ Quarterly onsite day from Palomino executive, DBA and PM for clients
        using 80 hours or more per month.
     ○ Clients using 80-160 hours get 2 new relic licenses. 160 hours plus
        get 4.
●   Adding annual support contracts:
     ○ Consultation as needed.
     ○ Emergency pages allowed.
     ○ Small bucket of DBA hours (8, 16 or 24)

For more information, please go to: Spreadsheet
Agenda
●   What we are looking for?
●   Concepts
●   Native Postgres Support
     ○ http://www.postgresql.org/docs/9.2/static/textsearch.html
●   External solutions
     ○ Sphinx
         ■ http://sphinxsearch.com/
     ○ Solr
         ■ http://lucene.apache.org/solr/
Goals of FTS

●   Add complex searches using synonyms, specific operators or spellings.
     ○ Improving performance sacrificing accuracy.
●   Reduce IO and CPU utilization.
     ○ Text consumes a lot of IO for read and CPU for operations.
●   FTS can be handled:
     ○ Externally
         ■ using tools like Sphinx or Solr
     ○ Internally
         ■ native FTS support.
●   Order words by relevance
●   Language sensitive
●   Faster than regular expressions or LIKE operands
Concepts
●   Parsers
     ○ 23 token types (url, email, file, etc)
●   Token
●   Stop word
●   Lexeme
     ○ array of lexemes + position + weight = tsvector
●   Dictionaries
     ○ Simple Dictionary
          ■ The simple dictionary template operates by converting the input
             token to lower case and checking it against a file of stop words.
     ○ Synonym Dictionary
     ○ Thesaurus Dictionary
     ○ Ispell Dictionary
     ○ Snowball Dictionary
Limitations
 ●   The length of each lexeme must be less than 2K bytes
 ●   The length of a tsvector (lexemes + positions) must be less than 1
     megabyte
 ●   The number of lexemes must be less than 264
 ●   Position values in tsvector must be greater than 0 and no more than
     16,383
 ●   No more than 256 positions per lexeme
 ●   The number of nodes (lexemes + operators) in a tsquery must be less than
     32,768
 ●   Those limits are hard to be reached!

For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of
335,420 words, and the most frequent word “postgresql” was mentioned 6,127 times in 655
documents.
Another example — the PostgreSQL mailing list archives contained 910,989 unique words with
57,491,343 lexemes in 461,020 messages.
psql commands
●   dF[+] [PATTERN]     list text search configurations
●    dFd[+] [PATTERN]   list text search dictionaries
●    dFp[+] [PATTERN]   list text search parsers
●    dFt[+] [PATTERN]   list text search templates
Elements
full_text_search=# dFd+ *
                                                     List of text search dictionaries
  Schema |           Name |               Template             |                    Init options                    |                     Description
------------+-----------------+---------------------+---------------------------------------------------+-----------------------------------------------------------
pg_catalog | danish_stem                  | pg_catalog.snowball | language = 'danish', stopwords = 'danish'                               | snowball stemmer for danish language
pg_catalog | dutch_stem                   | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch'                                 | snowball stemmer for dutch language
pg_catalog | english_stem                 | pg_catalog.snowball | language = 'english', stopwords = 'english'                             | snowball stemmer for english language
pg_catalog | finnish_stem                 | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish'                             | snowball stemmer for finnish language
pg_catalog | french_stem                  | pg_catalog.snowball | language = 'french', stopwords = 'french'                               | snowball stemmer for french language
pg_catalog | german_stem                  | pg_catalog.snowball | language = 'german', stopwords = 'german'                               | snowball stemmer for german language
pg_catalog | hungarian_stem | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian' | snowball stemmer for hungarian language
pg_catalog | italian_stem                 | pg_catalog.snowball | language = 'italian', stopwords = 'italian' | snowball stemmer for italian language
pg_catalog | norwegian_stem | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian' | snowball stemmer for norwegian language
pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language
pg_catalog | romanian_stem | pg_catalog.snowball | language = 'romanian'                                                       | snowball stemmer for romanian language
pg_catalog | russian_stem                 | pg_catalog.snowball | language = 'russian', stopwords = 'russian'                             | snowball stemmer for russian language
pg_catalog | simple            | pg_catalog.simple |                                                      | simple dictionary: just lower case and check for stopword
pg_catalog | spanish_stem                 | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish'                             | snowball stemmer for spanish language
pg_catalog | swedish_stem                 | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish'                             | snowball stemmer for swedish language
pg_catalog | turkish_stem                 | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish'                             | snowball stemmer for turkish language
(16 rows)
Elements
postgres=# dF
           List of text search configurations
  Schema |           Name |               Description
------------+------------+---------------------------------------
 pg_catalog | danish           | configuration for danish language
 pg_catalog | dutch            | configuration for dutch language
 pg_catalog | english          | configuration for english language
 pg_catalog | finnish          | configuration for finnish language
 pg_catalog | french           | configuration for french language
 pg_catalog | german | configuration for german language
 pg_catalog | hungarian | configuration for hungarian language
 pg_catalog | italian          | configuration for italian language
 pg_catalog | norwegian | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian | configuration for romanian language
 pg_catalog | russian          | configuration for russian language
 pg_catalog | simple           | simple configuration
 pg_catalog | spanish | configuration for spanish language
 pg_catalog | swedish | configuration for swedish language
 pg_catalog | turkish          | configuration for turkish language
(16 rows)
Elements
                             List of data types
  Schema | Name |                                    Description
------------+-----------+---------------------------------------------------------
 pg_catalog | gtsvector | GiST index internal text representation for text search
 pg_catalog | tsquery | query representation for text search
 pg_catalog | tsvector | text representation for text search
(3 rows)

Some operators:
 ●    @@ (tsvector against tsquery)
 ●    || concatenate tsvectors (it reorganises lexemes and ranking)
Small Example
full_text_search=# create table basic_example (i serial PRIMARY KEY, whole text, fulled tsvector, dictionary
regconfig);
postgres=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON basic_example FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(fulled, "pg_catalog.
english", whole);
CREATE TRIGGER
postgres=# insert into basic_example(whole,dictionary) values ('This is an example','english'::regconfig);
INSERT 0 1
full_text_search=# create index on basic_example(to_tsvector(dictionary,whole));
CREATE INDEX
full_text_search=# create index on basic_example using GIST(to_tsvector(dictionary,whole));
CREATE INDEX
postgres=# select * from basic_example;
 i|    whole          | fulled | dictionary
---+--------------------+------------+------------
 5 | This is an example | 'exampl':4 | english
(1 row)
Pre processing
●   Documents into tokens
          ■ Find and clean
●   Tokens into lexemes
     ○ Token normalised to a language or dictionary
     ○ Eliminate stop words ( high frequently words)
●   Storing
     ○ Array of lexemes (tsvector)
          ■ the position of the word respect the presence of stop words, although
            they are not stored
          ■ Stores positional information for proximity info
Highlighting

 ●   ts_headline
      ○ it doesn't use tsvector and needs to use the entire document, so could be
         expensive.
 ●   Only for certain type of queries or titles

postgres=# SELECT ts_headline('english','Just a simple example of a highlighted query and similarity.',
to_tsquery('query & similarity'),'StartSel = <, StopSel = >');
                       ts_headline
------------------------------------------------------------------
 Just a simple example of a highlighted <query> and <similarity>.
(1 row)

Default:
StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "
Ranking
●   Weights: (A B C D)
●   Ranking functions:
     ○ ts_rank
     ○ ts_rank_cd
●   Ranking is expensive cause re process and check each tsvector.

SELECT to_tsquery(’english’, ’Fat | Rats:AB’);
to_tsquery
------------------
’fat’ | ’rat’:AB

Also, * can be attached to a lexeme to specify prefix matching:
SELECT to_tsquery(’supern:*A & star:A*B’);
to_tsquery
--------------------------
’supern’:*A & ’star’:*AB
Maniputaling tsvectors and tsquery
●   Manipulating tsvectors
    ○ setweight(vector tsvector, weight "char") returns tsvector
    ○ lenght (tsvector) : number of lexemes
    ○ strip (tsvector): returns tsvector without additional position as weight or
       position

●   Manipulating Queries
●   If you need a dynamic input for a query, parse it with numnode(tsquery), it will
    avoid unnecessary searches if contains a lot of stop words
     ○ numnode(plainto_tsquery(’a the is’))
     ○ clean the queries using querytree also, is useful
Example
postgres=# select * from ts_debug('english','The doctor saids I''m sick.');
  alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+----------------+--------------+----------
 asciiword | Word, all ASCII | The | {english_stem} | english_stem | {}
 blank | Space symbols |                   | {}             |                |
 asciiword | Word, all ASCII | doctor | {english_stem} | english_stem | {doctor}
 blank | Space symbols |                   | {}             |                |
 asciiword | Word, all ASCII | saids | {english_stem} | english_stem | {said}
 blank | Space symbols |                   | {}             |                |
 asciiword | Word, all ASCII | I           | {english_stem} | english_stem | {}
 blank | Space symbols | '                 | {}             |                |
 asciiword | Word, all ASCII | m | {english_stem} | english_stem | {m}
 blank | Space symbols |                   | {}             |                |
 asciiword | Word, all ASCII | sick | {english_stem} | english_stem | {sick}
 blank | Space symbols | .                 | {}             |                |
(12 rows)

postgres=# select numnode(plainto_tsquery('The doctor saids I''m sick.')), plainto_tsquery('The doctor saids I''m sick.'),
to_tsvector('english','The doctor saids I''m sick.'), ts_lexize('english_stem','The doctor saids I''m sick.');
 numnode |                plainto_tsquery            |        to_tsvector               |        ts_lexize
---------+----------------------------------+------------------------------------+--------------------------------
         7 | 'doctor' & 'said' & 'm' & 'sick' | 'doctor':2 'm':5 'said':3 'sick':6 | {"the doctor saids i'm sick."}
(1 row)
Maniputaling tsquery
postgres=# SELECT querytree(to_tsquery('!defined'));
 querytree
-----------
 T
(1 row)

postgres=# SELECT querytree(to_tsquery('cat & food | (dog & run & food)'));
               querytree
-----------------------------------------
 'cat' & 'food' | 'dog' & 'run' & 'food'
(1 row)

postgres=# SELECT querytree(to_tsquery('the '));
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
 querytree
-----------

(1 row)
Automating updates on tsvector
●   Postgresql provide standard functions for this:
     ○ tsvector_update_trigger(tsvector_column_name, config_name,
        text_column_name [, ... ])
     ○ tsvector_update_trigger_column(tsvector_column_name,
        config_column_name, text_column_name [, ...

CREATE TABLE messages (
title text,
body text,
tsv tsvector
);
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, ’pg_catalog.english’, title, body);
Automating updates on tsvector (2)

If you want to keep a custom weight:

CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
    begin
    new.tsv :=
    setweight(to_tsvector(’pg_catalog.english’, coalesce(new.title,”)), ’A’) ||
    setweight(to_tsvector(’pg_catalog.english’, coalesce(new.body,”)), ’D’);
    return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
Tips and considerations
●   Store the text externally, index on the database
     ○ requires superuser
●   Store the whole document on the database, index on Sphinx/Solr
●   Don't index everything
     ○ Solr /Sphinx are not databases, just index only what you want to search.
         Smaller indexes are faster and easy to maintain.
●   ts_stats
     ○ can help you out to check your FTS configuration
●   You can parse URLS, mails and whatever using ts_debug function for nun
    intensive operations
Tips and considerations
●   You can index by language

CREATE INDEX pgweb_idx_en ON pgweb USING gin(to_tsvector(’english’, body))
WHERE config_language = 'english';
CREATE INDEX pgweb_idx_fr ON pgweb USING gin(to_tsvector(’french’, body))
WHERE config_language = 'french';
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_language,
body));
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(’english’, title || ’ ’ ||
body));
Features on 9.2
●   Move tsvector most-common-element statistics to new pg_stats columns
    (Alexander Korotkov)
●   Consult most_common_elems and most_common_elem_freqs for the data
    formerly available in most_common_vals and most_common_freqs for a tsvector
    column.

most_common_elems     | {exampl}
most_common_elem_freqs | {1,1,1}
Links
●   http://www.postgresql.org/docs/9.2/static/textsearch.htm
●   http://www.postgresql.org/docs/9.2/static/textsearch-migration.html >
    migration from version pre-8.3
Sphinx
Sphinx
●   Standalone daemon written on C++
●   Highly scalable
     ○ Known installation consists 50+ Boxes, 20+ Billions of documents
●   Extended search for text and non-full-text data
     ○ Optimized for faceted search
     ○ Snippets generation based on language settings
●   Very fast
     ○ Keeps attributes in memory
         ■ See Percona benchmarks for details
●   Receiving data from PostgreSQL
     ○ Dedicated PostgreSQL datasource type.



http://sphinxsearch.com
Key features- Sphinx
●   Scalability & failover
●   Extended FT language
      ●  Faceted search support
      ●  GEO-search support
●   Integration and pluggable architecture
      ●  Dedicated PostgreSQL source, UDF support
●   Morphology & stemming
●   Both batch & real-time indexing is available
●   Parallel snippets generation
What's new on Sphinx
●   1. added AOT (new morphology library, lemmatizer) support
     ○ Russian only for now; English coming soon; small 10-20% indexing
         impact; it's all about search quality (much much better "stemming")
●   2. added JSON support
     ○ limited support (limited subset of JSON) for now; JSON sits in a
         column; you're able to do thing like WHERE jsoncol.key=123 or
         ORDER BY or GROUP BY
●   3. added subselect syntax that reorders result sets, SELECT * FROM
    (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y
●   4. added bigram indexing, and quicker phrase searching with bigrams
    (bigram_index, bigram_freq_words directives)
     ○ improves the worst cases for social mining
●   5. added HA support, ha_strategy, agent_mirror directives
●   6. added a few new geofunctions (POLY2D, GEOPOLY2D, CONTAINS)
●   7. added GROUP_CONCAT()
●   8. added OPTIMIZE INDEX rtindex, rt_merge_iops, rt_merge_maxiosize
    directives
●   9. added TRUNCATE RTINDEX statement
Sphinx - Postgres compilation
[root@ip-10-55-83-238 ~]# yum install gcc-c++.noarch
[root@ip-10-55-83-238 sphinx-2.0.6-release]# ./configure --prefix=/opt/sphinx --
without-mysql --with-pgsql-includes=$PGSQL_INCLUDE --with-pgsql-
libs=$PGSQL_LIBS --with-pgsql
[root@ip-10-55-83-238 sphinx]# /opt/pg/bin/psql -Upostgres -hmaster test <
etc/example-pg.sql

* Package is compiled with mysql libraries dependencies
Sphinx - Daemon
●   For speed
     ●   to offload main database
     ●   to make particular queries faster
           ●  Actually most of search-related
●   For failover
     ●   It happens to best of us!
●   For extended functionality
     ●   Morphology & stemming
     ●   Autocomplete, “do you mean” and “Similar items”
SOLR/Lucene
Solr Features
●   Advanced Full-Text Search Capabilities
●   Optimized for High Volume Web Traffic
●   Standards Based Open Interfaces - XML, JSON and HTTP
●   Comprehensive HTML Administration Interfaces
●   Server statistics exposed over JMX for monitoring
●   Linearly scalable, auto index replication, auto failover and recovery
●   Near Real-time indexing
●   Flexible and Adaptable with XML configuration
●   Extensible Plugin Architecture
Solr
●   http://lucene.apache.org/solr/features.html
●   Solr uses Lucene Library
Thanks!
      Contact us!
     We are hiring!
emanuel@palominodb.com

Más contenido relacionado

La actualidad más candente

Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling StrategiesWebscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling StrategiesJonathan Katz
 
PostgreSQL 8.4 TriLUG 2009-11-12
PostgreSQL 8.4 TriLUG 2009-11-12PostgreSQL 8.4 TriLUG 2009-11-12
PostgreSQL 8.4 TriLUG 2009-11-12Andrew Dunstan
 
Big Data Grows Up - A (re)introduction to Cassandra
Big Data Grows Up - A (re)introduction to CassandraBig Data Grows Up - A (re)introduction to Cassandra
Big Data Grows Up - A (re)introduction to CassandraRobbie Strickland
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016Duyhai Doan
 
Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019Dave Stokes
 
Real time indexes in Sphinx, Yaroslav Vorozhko
Real time indexes in Sphinx, Yaroslav VorozhkoReal time indexes in Sphinx, Yaroslav Vorozhko
Real time indexes in Sphinx, Yaroslav VorozhkoFuenteovejuna
 
Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinxAdrian Nuta
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWJonathan Katz
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Holden Karau
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql databasebigdatagurus_meetup
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertablehypertable
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Hypertable
HypertableHypertable
Hypertablebetaisao
 
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...DataStax
 

La actualidad más candente (20)

Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
What Reika Taught us
What Reika Taught usWhat Reika Taught us
What Reika Taught us
 
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling StrategiesWebscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies
 
Full Text Search in PostgreSQL
Full Text Search in PostgreSQLFull Text Search in PostgreSQL
Full Text Search in PostgreSQL
 
Drill 1.0
Drill 1.0Drill 1.0
Drill 1.0
 
PostgreSQL 8.4 TriLUG 2009-11-12
PostgreSQL 8.4 TriLUG 2009-11-12PostgreSQL 8.4 TriLUG 2009-11-12
PostgreSQL 8.4 TriLUG 2009-11-12
 
Big Data Grows Up - A (re)introduction to Cassandra
Big Data Grows Up - A (re)introduction to CassandraBig Data Grows Up - A (re)introduction to Cassandra
Big Data Grows Up - A (re)introduction to Cassandra
 
Spark Cassandra 2016
Spark Cassandra 2016Spark Cassandra 2016
Spark Cassandra 2016
 
Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019Hybrid Databases - PHP UK Conference 22 February 2019
Hybrid Databases - PHP UK Conference 22 February 2019
 
Real time indexes in Sphinx, Yaroslav Vorozhko
Real time indexes in Sphinx, Yaroslav VorozhkoReal time indexes in Sphinx, Yaroslav Vorozhko
Real time indexes in Sphinx, Yaroslav Vorozhko
 
Real time fulltext search with sphinx
Real time fulltext search with sphinxReal time fulltext search with sphinx
Real time fulltext search with sphinx
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDW
 
Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017Spark ML for custom models - FOSDEM HPC 2017
Spark ML for custom models - FOSDEM HPC 2017
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertable
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Hypertable
HypertableHypertable
Hypertable
 
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
Bucket Your Partitions Wisely (Markus Höfer, codecentric AG) | Cassandra Summ...
 

Similar a PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY

Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrKai Chan
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastHolden Karau
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013Emanuel Calvo
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course PROIDEA
 
Constraint Grammar and Apertium
Constraint Grammar and ApertiumConstraint Grammar and Apertium
Constraint Grammar and Apertiumunhammer
 
Meetup C++ A brief overview of c++17
Meetup C++  A brief overview of c++17Meetup C++  A brief overview of c++17
Meetup C++ A brief overview of c++17Daniel Eriksson
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018Holden Karau
 
Productive bash
Productive bashProductive bash
Productive bashAyla Khan
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
Elasticsearch for Data Engineers
Elasticsearch for Data EngineersElasticsearch for Data Engineers
Elasticsearch for Data EngineersDuy Do
 
Elasticsearch for Data Analytics
Elasticsearch for Data AnalyticsElasticsearch for Data Analytics
Elasticsearch for Data AnalyticsFelipe
 
Making Postgres Central in Your Data Center
Making Postgres Central in Your Data CenterMaking Postgres Central in Your Data Center
Making Postgres Central in Your Data CenterEDB
 
Making.postgres.central.2015
Making.postgres.central.2015Making.postgres.central.2015
Making.postgres.central.2015EDB
 
Custom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBCustom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBArangoDB Database
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4LKoji Sekiguchi
 

Similar a PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY (20)

Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and Solr
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Language Search
Language SearchLanguage Search
Language Search
 
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
JDD 2016 - Tomasz Borek - DB for next project? Why, Postgres, of course
 
Constraint Grammar and Apertium
Constraint Grammar and ApertiumConstraint Grammar and Apertium
Constraint Grammar and Apertium
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
 
Meetup C++ A brief overview of c++17
Meetup C++  A brief overview of c++17Meetup C++  A brief overview of c++17
Meetup C++ A brief overview of c++17
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 
Productive bash
Productive bashProductive bash
Productive bash
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
Elasticsearch for Data Engineers
Elasticsearch for Data EngineersElasticsearch for Data Engineers
Elasticsearch for Data Engineers
 
Elasticsearch for Data Analytics
Elasticsearch for Data AnalyticsElasticsearch for Data Analytics
Elasticsearch for Data Analytics
 
Making Postgres Central in Your Data Center
Making Postgres Central in Your Data CenterMaking Postgres Central in Your Data Center
Making Postgres Central in Your Data Center
 
Making.postgres.central.2015
Making.postgres.central.2015Making.postgres.central.2015
Making.postgres.central.2015
 
Custom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBCustom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDB
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 

Más de Emanuel Calvo

Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL DatabasesEmanuel Calvo
 
Demystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live scDemystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live scEmanuel Calvo
 
Pgbr 2013 postgres on aws
Pgbr 2013   postgres on awsPgbr 2013   postgres on aws
Pgbr 2013 postgres on awsEmanuel Calvo
 
LSWC PostgreSQL 9.1 (2011)
LSWC PostgreSQL 9.1 (2011)LSWC PostgreSQL 9.1 (2011)
LSWC PostgreSQL 9.1 (2011)Emanuel Calvo
 
Monitoreo de MySQL y PostgreSQL con SQL
Monitoreo de MySQL y PostgreSQL con SQLMonitoreo de MySQL y PostgreSQL con SQL
Monitoreo de MySQL y PostgreSQL con SQLEmanuel Calvo
 

Más de Emanuel Calvo (7)

Open Source SQL Databases
Open Source SQL DatabasesOpen Source SQL Databases
Open Source SQL Databases
 
Demystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live scDemystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live sc
 
Pgbr 2013 postgres on aws
Pgbr 2013   postgres on awsPgbr 2013   postgres on aws
Pgbr 2013 postgres on aws
 
LSWC PostgreSQL 9.1 (2011)
LSWC PostgreSQL 9.1 (2011)LSWC PostgreSQL 9.1 (2011)
LSWC PostgreSQL 9.1 (2011)
 
Admon PG 1
Admon PG 1Admon PG 1
Admon PG 1
 
Monitoreo de MySQL y PostgreSQL con SQL
Monitoreo de MySQL y PostgreSQL con SQLMonitoreo de MySQL y PostgreSQL con SQL
Monitoreo de MySQL y PostgreSQL con SQL
 
Osol Pgsql
Osol PgsqlOsol Pgsql
Osol Pgsql
 

Último

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Último (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY

  • 1. PostgreSQL FTS Solutions FOSDEM PGDAY 2013 by Emanuel Calvo
  • 2. About me: ● Operational DBA at PalominoDB. ○ MySQL, Maria and PostgreSQL databases. ● Spanish Press Contact. ● Check out my LinkedIn Profile at: http://es.linkedin.com/in/ecbcbcb/
  • 3. Credits ● Thanks to: ○ Andrew Atanasoff ○ Vlad Fedorkov ○ All the PalominoDB people that help out !
  • 4. Palomino - Service Offerings ● Monthly Support: ○ Being renamed to Palomino DBA as a service. ○ Eliminating 10 hour monthly clients. ○ Discounts are based on spend per month (0-80, 81-160, 161+ ○ We will be penalizing excessive paging financially. ○ Quarterly onsite day from Palomino executive, DBA and PM for clients using 80 hours or more per month. ○ Clients using 80-160 hours get 2 new relic licenses. 160 hours plus get 4. ● Adding annual support contracts: ○ Consultation as needed. ○ Emergency pages allowed. ○ Small bucket of DBA hours (8, 16 or 24) For more information, please go to: Spreadsheet
  • 5. Agenda ● What we are looking for? ● Concepts ● Native Postgres Support ○ http://www.postgresql.org/docs/9.2/static/textsearch.html ● External solutions ○ Sphinx ■ http://sphinxsearch.com/ ○ Solr ■ http://lucene.apache.org/solr/
  • 6. Goals of FTS ● Add complex searches using synonyms, specific operators or spellings. ○ Improving performance sacrificing accuracy. ● Reduce IO and CPU utilization. ○ Text consumes a lot of IO for read and CPU for operations. ● FTS can be handled: ○ Externally ■ using tools like Sphinx or Solr ○ Internally ■ native FTS support. ● Order words by relevance ● Language sensitive ● Faster than regular expressions or LIKE operands
  • 7. Concepts ● Parsers ○ 23 token types (url, email, file, etc) ● Token ● Stop word ● Lexeme ○ array of lexemes + position + weight = tsvector ● Dictionaries ○ Simple Dictionary ■ The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. ○ Synonym Dictionary ○ Thesaurus Dictionary ○ Ispell Dictionary ○ Snowball Dictionary
  • 8. Limitations ● The length of each lexeme must be less than 2K bytes ● The length of a tsvector (lexemes + positions) must be less than 1 megabyte ● The number of lexemes must be less than 264 ● Position values in tsvector must be greater than 0 and no more than 16,383 ● No more than 256 positions per lexeme ● The number of nodes (lexemes + operators) in a tsquery must be less than 32,768 ● Those limits are hard to be reached! For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of 335,420 words, and the most frequent word “postgresql” was mentioned 6,127 times in 655 documents. Another example — the PostgreSQL mailing list archives contained 910,989 unique words with 57,491,343 lexemes in 461,020 messages.
  • 9. psql commands ● dF[+] [PATTERN] list text search configurations ● dFd[+] [PATTERN] list text search dictionaries ● dFp[+] [PATTERN] list text search parsers ● dFt[+] [PATTERN] list text search templates
  • 10. Elements full_text_search=# dFd+ * List of text search dictionaries Schema | Name | Template | Init options | Description ------------+-----------------+---------------------+---------------------------------------------------+----------------------------------------------------------- pg_catalog | danish_stem | pg_catalog.snowball | language = 'danish', stopwords = 'danish' | snowball stemmer for danish language pg_catalog | dutch_stem | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch' | snowball stemmer for dutch language pg_catalog | english_stem | pg_catalog.snowball | language = 'english', stopwords = 'english' | snowball stemmer for english language pg_catalog | finnish_stem | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish' | snowball stemmer for finnish language pg_catalog | french_stem | pg_catalog.snowball | language = 'french', stopwords = 'french' | snowball stemmer for french language pg_catalog | german_stem | pg_catalog.snowball | language = 'german', stopwords = 'german' | snowball stemmer for german language pg_catalog | hungarian_stem | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian' | snowball stemmer for hungarian language pg_catalog | italian_stem | pg_catalog.snowball | language = 'italian', stopwords = 'italian' | snowball stemmer for italian language pg_catalog | norwegian_stem | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian' | snowball stemmer for norwegian language pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language pg_catalog | romanian_stem | pg_catalog.snowball | language = 'romanian' | snowball stemmer for romanian language pg_catalog | russian_stem | pg_catalog.snowball | language = 'russian', stopwords = 'russian' | snowball stemmer for russian language pg_catalog | simple | pg_catalog.simple | | simple dictionary: just lower case and check for stopword pg_catalog | spanish_stem | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish' | snowball stemmer for spanish language pg_catalog | swedish_stem | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish' | snowball stemmer for swedish language pg_catalog | turkish_stem | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish' | snowball stemmer for turkish language (16 rows)
  • 11. Elements postgres=# dF List of text search configurations Schema | Name | Description ------------+------------+--------------------------------------- pg_catalog | danish | configuration for danish language pg_catalog | dutch | configuration for dutch language pg_catalog | english | configuration for english language pg_catalog | finnish | configuration for finnish language pg_catalog | french | configuration for french language pg_catalog | german | configuration for german language pg_catalog | hungarian | configuration for hungarian language pg_catalog | italian | configuration for italian language pg_catalog | norwegian | configuration for norwegian language pg_catalog | portuguese | configuration for portuguese language pg_catalog | romanian | configuration for romanian language pg_catalog | russian | configuration for russian language pg_catalog | simple | simple configuration pg_catalog | spanish | configuration for spanish language pg_catalog | swedish | configuration for swedish language pg_catalog | turkish | configuration for turkish language (16 rows)
  • 12. Elements List of data types Schema | Name | Description ------------+-----------+--------------------------------------------------------- pg_catalog | gtsvector | GiST index internal text representation for text search pg_catalog | tsquery | query representation for text search pg_catalog | tsvector | text representation for text search (3 rows) Some operators: ● @@ (tsvector against tsquery) ● || concatenate tsvectors (it reorganises lexemes and ranking)
  • 13. Small Example full_text_search=# create table basic_example (i serial PRIMARY KEY, whole text, fulled tsvector, dictionary regconfig); postgres=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON basic_example FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(fulled, "pg_catalog. english", whole); CREATE TRIGGER postgres=# insert into basic_example(whole,dictionary) values ('This is an example','english'::regconfig); INSERT 0 1 full_text_search=# create index on basic_example(to_tsvector(dictionary,whole)); CREATE INDEX full_text_search=# create index on basic_example using GIST(to_tsvector(dictionary,whole)); CREATE INDEX postgres=# select * from basic_example; i| whole | fulled | dictionary ---+--------------------+------------+------------ 5 | This is an example | 'exampl':4 | english (1 row)
  • 14. Pre processing ● Documents into tokens ■ Find and clean ● Tokens into lexemes ○ Token normalised to a language or dictionary ○ Eliminate stop words ( high frequently words) ● Storing ○ Array of lexemes (tsvector) ■ the position of the word respect the presence of stop words, although they are not stored ■ Stores positional information for proximity info
  • 15. Highlighting ● ts_headline ○ it doesn't use tsvector and needs to use the entire document, so could be expensive. ● Only for certain type of queries or titles postgres=# SELECT ts_headline('english','Just a simple example of a highlighted query and similarity.', to_tsquery('query & similarity'),'StartSel = <, StopSel = >'); ts_headline ------------------------------------------------------------------ Just a simple example of a highlighted <query> and <similarity>. (1 row) Default: StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE, MaxFragments=0, FragmentDelimiter=" ... "
  • 16. Ranking ● Weights: (A B C D) ● Ranking functions: ○ ts_rank ○ ts_rank_cd ● Ranking is expensive cause re process and check each tsvector. SELECT to_tsquery(’english’, ’Fat | Rats:AB’); to_tsquery ------------------ ’fat’ | ’rat’:AB Also, * can be attached to a lexeme to specify prefix matching: SELECT to_tsquery(’supern:*A & star:A*B’); to_tsquery -------------------------- ’supern’:*A & ’star’:*AB
  • 17. Maniputaling tsvectors and tsquery ● Manipulating tsvectors ○ setweight(vector tsvector, weight "char") returns tsvector ○ lenght (tsvector) : number of lexemes ○ strip (tsvector): returns tsvector without additional position as weight or position ● Manipulating Queries ● If you need a dynamic input for a query, parse it with numnode(tsquery), it will avoid unnecessary searches if contains a lot of stop words ○ numnode(plainto_tsquery(’a the is’)) ○ clean the queries using querytree also, is useful
  • 18. Example postgres=# select * from ts_debug('english','The doctor saids I''m sick.'); alias | description | token | dictionaries | dictionary | lexemes -----------+-----------------+--------+----------------+--------------+---------- asciiword | Word, all ASCII | The | {english_stem} | english_stem | {} blank | Space symbols | | {} | | asciiword | Word, all ASCII | doctor | {english_stem} | english_stem | {doctor} blank | Space symbols | | {} | | asciiword | Word, all ASCII | saids | {english_stem} | english_stem | {said} blank | Space symbols | | {} | | asciiword | Word, all ASCII | I | {english_stem} | english_stem | {} blank | Space symbols | ' | {} | | asciiword | Word, all ASCII | m | {english_stem} | english_stem | {m} blank | Space symbols | | {} | | asciiword | Word, all ASCII | sick | {english_stem} | english_stem | {sick} blank | Space symbols | . | {} | | (12 rows) postgres=# select numnode(plainto_tsquery('The doctor saids I''m sick.')), plainto_tsquery('The doctor saids I''m sick.'), to_tsvector('english','The doctor saids I''m sick.'), ts_lexize('english_stem','The doctor saids I''m sick.'); numnode | plainto_tsquery | to_tsvector | ts_lexize ---------+----------------------------------+------------------------------------+-------------------------------- 7 | 'doctor' & 'said' & 'm' & 'sick' | 'doctor':2 'm':5 'said':3 'sick':6 | {"the doctor saids i'm sick."} (1 row)
  • 19. Maniputaling tsquery postgres=# SELECT querytree(to_tsquery('!defined')); querytree ----------- T (1 row) postgres=# SELECT querytree(to_tsquery('cat & food | (dog & run & food)')); querytree ----------------------------------------- 'cat' & 'food' | 'dog' & 'run' & 'food' (1 row) postgres=# SELECT querytree(to_tsquery('the ')); NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored querytree ----------- (1 row)
  • 20. Automating updates on tsvector ● Postgresql provide standard functions for this: ○ tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ]) ○ tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... CREATE TABLE messages ( title text, body text, tsv tsvector ); CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(tsv, ’pg_catalog.english’, title, body);
  • 21. Automating updates on tsvector (2) If you want to keep a custom weight: CREATE FUNCTION messages_trigger() RETURNS trigger AS $$ begin new.tsv := setweight(to_tsvector(’pg_catalog.english’, coalesce(new.title,”)), ’A’) || setweight(to_tsvector(’pg_catalog.english’, coalesce(new.body,”)), ’D’); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
  • 22. Tips and considerations ● Store the text externally, index on the database ○ requires superuser ● Store the whole document on the database, index on Sphinx/Solr ● Don't index everything ○ Solr /Sphinx are not databases, just index only what you want to search. Smaller indexes are faster and easy to maintain. ● ts_stats ○ can help you out to check your FTS configuration ● You can parse URLS, mails and whatever using ts_debug function for nun intensive operations
  • 23. Tips and considerations ● You can index by language CREATE INDEX pgweb_idx_en ON pgweb USING gin(to_tsvector(’english’, body)) WHERE config_language = 'english'; CREATE INDEX pgweb_idx_fr ON pgweb USING gin(to_tsvector(’french’, body)) WHERE config_language = 'french'; CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_language, body)); CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(’english’, title || ’ ’ || body));
  • 24. Features on 9.2 ● Move tsvector most-common-element statistics to new pg_stats columns (Alexander Korotkov) ● Consult most_common_elems and most_common_elem_freqs for the data formerly available in most_common_vals and most_common_freqs for a tsvector column. most_common_elems | {exampl} most_common_elem_freqs | {1,1,1}
  • 25. Links ● http://www.postgresql.org/docs/9.2/static/textsearch.htm ● http://www.postgresql.org/docs/9.2/static/textsearch-migration.html > migration from version pre-8.3
  • 27. Sphinx ● Standalone daemon written on C++ ● Highly scalable ○ Known installation consists 50+ Boxes, 20+ Billions of documents ● Extended search for text and non-full-text data ○ Optimized for faceted search ○ Snippets generation based on language settings ● Very fast ○ Keeps attributes in memory ■ See Percona benchmarks for details ● Receiving data from PostgreSQL ○ Dedicated PostgreSQL datasource type. http://sphinxsearch.com
  • 28. Key features- Sphinx ● Scalability & failover ● Extended FT language ● Faceted search support ● GEO-search support ● Integration and pluggable architecture ● Dedicated PostgreSQL source, UDF support ● Morphology & stemming ● Both batch & real-time indexing is available ● Parallel snippets generation
  • 29. What's new on Sphinx ● 1. added AOT (new morphology library, lemmatizer) support ○ Russian only for now; English coming soon; small 10-20% indexing impact; it's all about search quality (much much better "stemming") ● 2. added JSON support ○ limited support (limited subset of JSON) for now; JSON sits in a column; you're able to do thing like WHERE jsoncol.key=123 or ORDER BY or GROUP BY ● 3. added subselect syntax that reorders result sets, SELECT * FROM (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y ● 4. added bigram indexing, and quicker phrase searching with bigrams (bigram_index, bigram_freq_words directives) ○ improves the worst cases for social mining ● 5. added HA support, ha_strategy, agent_mirror directives ● 6. added a few new geofunctions (POLY2D, GEOPOLY2D, CONTAINS) ● 7. added GROUP_CONCAT() ● 8. added OPTIMIZE INDEX rtindex, rt_merge_iops, rt_merge_maxiosize directives ● 9. added TRUNCATE RTINDEX statement
  • 30. Sphinx - Postgres compilation [root@ip-10-55-83-238 ~]# yum install gcc-c++.noarch [root@ip-10-55-83-238 sphinx-2.0.6-release]# ./configure --prefix=/opt/sphinx -- without-mysql --with-pgsql-includes=$PGSQL_INCLUDE --with-pgsql- libs=$PGSQL_LIBS --with-pgsql [root@ip-10-55-83-238 sphinx]# /opt/pg/bin/psql -Upostgres -hmaster test < etc/example-pg.sql * Package is compiled with mysql libraries dependencies
  • 31. Sphinx - Daemon ● For speed ● to offload main database ● to make particular queries faster ● Actually most of search-related ● For failover ● It happens to best of us! ● For extended functionality ● Morphology & stemming ● Autocomplete, “do you mean” and “Similar items”
  • 33. Solr Features ● Advanced Full-Text Search Capabilities ● Optimized for High Volume Web Traffic ● Standards Based Open Interfaces - XML, JSON and HTTP ● Comprehensive HTML Administration Interfaces ● Server statistics exposed over JMX for monitoring ● Linearly scalable, auto index replication, auto failover and recovery ● Near Real-time indexing ● Flexible and Adaptable with XML configuration ● Extensible Plugin Architecture
  • 34. Solr ● http://lucene.apache.org/solr/features.html ● Solr uses Lucene Library
  • 35. Thanks! Contact us! We are hiring! emanuel@palominodb.com