SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
TEXT TAGGING WITH FINITE STATE
TRANSDUCERS
David Smiley
Software Systems Engineer, Lead
Text Tagging with
Finite State Transducers
David Smiley
Lucene/Solr Revolution 2013
© 2012 The MITRE Corporation. All rights reserved.
About David Smiley
 Working at MITRE, for 13 years
 web development, Java, search
 Published 1st book on Solr; then 2nd edition (2009, 2011)
 Apache Lucene / Solr committer/PMC member (2012)
 Presented at Lucene Revolution (2010) & Basis O.S. Search
Conference (2011, 2012)
 Taught Solr classes at MITRE (2010, 2011, 2012)
 Solr search consultant within MITRE and its sponsors, and
privately
3
What is “Text Tagging” and “FSTs”?
 First, I need to establish the context:
 JIEDDO’s OpenSextant project
 Though this presentation is not about OpenSextant or
geotagging
 Ultimately, I want to convey how cool Lucene’s FSTs are
 And you may have a need for a text tagger
 Or a geotagger (like OpenSextant)
OpenSextant
A DoD Funded Project: JIEDDO/COIC & NGA
Open Source approval recently obtained
OpenSextant Project
 A geotagging solution for unstructured text
 Finds place name references in natural language
 “… I live near Boston … ”
 Finds “Boston” with input character offset #s
 Often resolves to multiple gazetteer entries: “Boston” has 73
 What’s a Gazetteer?
 A dictionary of place names with metadata like latitude &
longitude
How does it work?
The “Naïve” Tagger
 AKA “Text Tagger”
 Simply consults a dictionary/gazetteer; no fancy NLP
 There’s nothing geospatial about it
 Subsequent NLP processing eliminates low-confidence tags
 Actually, not so simple
 Names vary in word length
 Must find overlapping names
 but not names within names
The Gazetteer
 13 million place name records
 8.1M distinct place names
 Why not 13M?
 Ambiguous names (e.g. San Diego)
 Text analysis normalization (e.g. diacritic removal, etc.)
 2.8M are single-word names (1/3rd)
 2.3 avg. words / name
 14 avg. chars / name
3 Naïve Tagger Implementations
 GATE’s Tagger
 In-memory Aho-Corasick string-matching algorithm
 Requires an estimated 80 GB RAM !! (for our data)
 FAST
 A JIEDDO developed MySQL based Tagger
 “Reasonable” RAM requirements ~4GB
 SLOW (~15x, 20x? not certain). ~1 doc/second
 A JIEDDO developed Solr/FST based Tagger …
Finite State Transducers
Applied to text tagging
Finite State Automata (FSA)
 SortedSet<char[]>:
 mop, moth, pop, slop, sloth, stop, top
Note: a “Trie” data structure is similar but only shares prefixes
Finite State Transducer (FST)
 Adds optional output to each arc
 SortedMap<char[],int>
 mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
Lucene’s FST Implementation
 FST encoded as a byte[]
 Memory efficient! And fast to load from disk.
 Write-once API (immutable)
 Build minimal, acyclic FST from pre-sorted inputs
 Fast (linear time with input size), low memory
 Optional two-pass packing can shrink by ~25%
 SortedMap<int[],T>: arcs are sorted by label
 getByOutput also possible if outputs are sorted
 http://s.apache.org/LuceneFSTs
Based on a
Mihov & Maurel
paper, 2001
FSTs and Text Tagging
 My approach involves two layers of FSTs:
 A word dictionary FST to hold each unique word
 Enables using integers as substitutes for char[]
 Via getByOutput(12345) -> “New”
 Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345
 A word phrase FST comprised of word id string keys
 Ex: “New York City” -> [12345, 5522111, 345]
 Value are arrays of gazetteer primary keys
Memory Use
 Word Dict FST:
 3.3M words with ordinal ids in 26MB of RAM
 Name Phrase FST:
 8.1M word id phrases in 90 MB of RAM
 Plus 82MB of arrays of gazetteer primary key ids
 Total: 198 MB (compare to 80GB GATE Aho-Corasick)
 Building it consumes ~1.5GB Java heap, for 2 minutes
Experimental measurements
 Single FST Experiment
 1 FST of analyzed character word phrase -> int id
 “new york city” -> 6344207
 Theory: more than 2x the memory
 Result: 69 MB! (compare to 26+90) 41% reduction
 Retrospective: What I would have done differently
 Index a field of concatenated terms (custom TokenFilter).
 More disk needed but reduces build time & memory
requirements. Unclear effect on tagging performance.
 Potential to use MemoryPostingsFormat, a Lucene Codec that
uses an FST internally + vInt doc ids, instead of custom FST code.
Tagging Algorithm
It’s complicated! Single-pass (streaming) algorithm
 For each input word, lookup its ordinal id, then:
1. Create an FST arc iterator for name phrase
2. Append the iterator onto a queue of active ones
3. Try to advance all iterators
 Remove those that don’t advance
Iterator linked-list queue:
Head: New, York, City ✔
Head+1: York, City
Head+2: City …
Speed Benchmarks
Docs/Sec RAM (GB)
OpenSextant: GATE Tagger ? 80
OpenSextant: MySQL based Tagger 1.1 4
OpenSextant: Solr/FST Tagger 15.9 2*
Measures single-threaded performance of geotagging 428
documents in the “ACE” collection. OpenSextant tests all had
the same gazetteer.
Integrated with Solr
 As a custom Solr Request Handler
 Builds the FSTs from the index (the gazetteer)
 Configurable
 Text analysis (e.g. phonetic)
 Exclude gazetteer docs by configured query
 Optional partial word phrase matching
 Optional sub-tags tagging
 Solr integration benefits
 Solr as a taxonomy manager! Web-service, searchable,
scalable, easy to update, …
~$ curl -XPOST 'http://localhost:8983/solr/tag
?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston"
{
"responseHeader":{
"status":0,
"QTime":1898},
"tagsCount":1,
"tags":[[
"startOffset",12,
"endOffset",18,
"ids",[1190927,
1099063,
2562742,
2667203,
2684629,
2695904,
2653982,
2657690,
2585165,
2597292,
…
… 11890986,
11891415]]],
"matchingDocs":{"numFound":73,"start":0,"docs":[
{
"id":12719030,
"place_id":"USGS1893700",
"name":"Boston",
"lat":65.01667,
"lon":-163.28333,
"feat_class":"L",
"feat_code":"AREA",
"FIPS_cc":"US",
"ISO_cc":["US"],
"cc":"US",
"ISO3_cc":"USA",
"adm1":"US02",
"adm2":"US02.0180",
"name_bias":0.0,
"id_bias":0.04,
"geo":"65.01667,-163.28333"},
…
Where can you get this?
 https://github.com/openSextant/SolrTextTagger
 An independent module of OpenSextant
 Might seek incubator status at http://www.osgeo.org
 Includes documentation, tests
Concluding Remarks
 Lucene FSTs are awesome!
 Great for storing large amounts of strings in-memory
 Or other string-like data: e.g. IP addresses, geohashes
 The API is hard to use, however
 The text tagger should be useful independent of
OpenSextant
 Tag people/org names or special keywords
 Might be ported to Lucene as an alternative to its synonym
token filter
 I’ve got an idea on applying these concepts to Lucene
“Shingling” as a codec to make it more scalable
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets
you in the door
TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
CONTACT
David Smiley
dsmiley@mitre.org

Más contenido relacionado

La actualidad más candente

Consensus and Raft algorithm (Vietnamese version)
Consensus and Raft algorithm (Vietnamese version)Consensus and Raft algorithm (Vietnamese version)
Consensus and Raft algorithm (Vietnamese version)Thao Huynh Quang
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
Deep dive into flink interval join
Deep dive into flink interval joinDeep dive into flink interval join
Deep dive into flink interval joinyeomii
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015Kohei KaiGai
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSematext Group, Inc.
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkDatabricks
 
Looking ahead at PostgreSQL 15
Looking ahead at PostgreSQL 15Looking ahead at PostgreSQL 15
Looking ahead at PostgreSQL 15Jonathan Katz
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksEDB
 
Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링
Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링
Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링Jiho Lee
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 

La actualidad más candente (20)

Consensus and Raft algorithm (Vietnamese version)
Consensus and Raft algorithm (Vietnamese version)Consensus and Raft algorithm (Vietnamese version)
Consensus and Raft algorithm (Vietnamese version)
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Deep dive into flink interval join
Deep dive into flink interval joinDeep dive into flink interval join
Deep dive into flink interval join
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for You
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache Spark
 
Looking ahead at PostgreSQL 15
Looking ahead at PostgreSQL 15Looking ahead at PostgreSQL 15
Looking ahead at PostgreSQL 15
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
 
Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링
Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링
Pg 클러스터 기반의 구성 및 이전, 그리고 인덱스 클러스터링
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 

Destacado

Class 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoristsClass 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoriststjcarter
 
Class 5 adult development theories___longer_version
Class 5 adult development theories___longer_versionClass 5 adult development theories___longer_version
Class 5 adult development theories___longer_versiontjcarter
 
Adult development theory
Adult development theoryAdult development theory
Adult development theorycccscoetc
 
類義語検索と類義語ハイライト
類義語検索と類義語ハイライト類義語検索と類義語ハイライト
類義語検索と類義語ハイライトShinichiro Abe
 
Current state and future state using VE
Current state and future state using VECurrent state and future state using VE
Current state and future state using VECharles Palus
 
VE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesVE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesCharles Palus
 
HMC Conference 2011 Scotland
HMC Conference 2011 ScotlandHMC Conference 2011 Scotland
HMC Conference 2011 ScotlandCharles Palus
 

Destacado (10)

Adult Development
Adult DevelopmentAdult Development
Adult Development
 
Class 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theoristsClass 5, adlt 671 developmental theorists
Class 5, adlt 671 developmental theorists
 
Class 5 adult development theories___longer_version
Class 5 adult development theories___longer_versionClass 5 adult development theories___longer_version
Class 5 adult development theories___longer_version
 
Adult Development
Adult Development Adult Development
Adult Development
 
Adult development theory
Adult development theoryAdult development theory
Adult development theory
 
Automata Invasion
Automata InvasionAutomata Invasion
Automata Invasion
 
類義語検索と類義語ハイライト
類義語検索と類義語ハイライト類義語検索と類義語ハイライト
類義語検索と類義語ハイライト
 
Current state and future state using VE
Current state and future state using VECurrent state and future state using VE
Current state and future state using VE
 
VE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future statesVE plus graphic facilitation for currrent / future states
VE plus graphic facilitation for currrent / future states
 
HMC Conference 2011 Scotland
HMC Conference 2011 ScotlandHMC Conference 2011 Scotland
HMC Conference 2011 Scotland
 

Similar a Text tagging with finite state transducers

Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Christopher Biow
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"Jihyun Ahn
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfakAsfak Mahamud
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)Serhii Kartashov
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at nightMichael Yarichuk
 
Hatkit Project - Datafiddler
Hatkit Project - DatafiddlerHatkit Project - Datafiddler
Hatkit Project - Datafiddlerholiman
 
Clojure talk at Münster JUG
Clojure talk at Münster JUGClojure talk at Münster JUG
Clojure talk at Münster JUGAlex Ott
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalSpark Summit
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPSujit Pal
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
MongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB
 
Component Framework Primer for JSF Users
Component Framework Primer for JSF UsersComponent Framework Primer for JSF Users
Component Framework Primer for JSF UsersAndy Schwartz
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Ravi Okade
 

Similar a Text tagging with finite state transducers (20)

Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010Mark Logic StrangeLoop 2010
Mark Logic StrangeLoop 2010
 
About "Apache Cassandra"
About "Apache Cassandra"About "Apache Cassandra"
About "Apache Cassandra"
 
Basics of XML
Basics of XMLBasics of XML
Basics of XML
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' MeetupMongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
MongoDB at the Silicon Valley iPhone and iPad Developers' Meetup
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
 
eXtensible Markup Language (XML)
eXtensible Markup Language (XML)eXtensible Markup Language (XML)
eXtensible Markup Language (XML)
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
Open source Technology
Open source TechnologyOpen source Technology
Open source Technology
 
Hatkit Project - Datafiddler
Hatkit Project - DatafiddlerHatkit Project - Datafiddler
Hatkit Project - Datafiddler
 
Clojure talk at Münster JUG
Clojure talk at Münster JUGClojure talk at Münster JUG
Clojure talk at Münster JUG
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
MongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo SeattleMongoDB Auto-Sharding at Mongo Seattle
MongoDB Auto-Sharding at Mongo Seattle
 
MongoDB @ fliptop
MongoDB @ fliptopMongoDB @ fliptop
MongoDB @ fliptop
 
Component Framework Primer for JSF Users
Component Framework Primer for JSF UsersComponent Framework Primer for JSF Users
Component Framework Primer for JSF Users
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 

Último (20)

INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 

Text tagging with finite state transducers

  • 1. TEXT TAGGING WITH FINITE STATE TRANSDUCERS David Smiley Software Systems Engineer, Lead
  • 2. Text Tagging with Finite State Transducers David Smiley Lucene/Solr Revolution 2013 © 2012 The MITRE Corporation. All rights reserved.
  • 3. About David Smiley  Working at MITRE, for 13 years  web development, Java, search  Published 1st book on Solr; then 2nd edition (2009, 2011)  Apache Lucene / Solr committer/PMC member (2012)  Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)  Taught Solr classes at MITRE (2010, 2011, 2012)  Solr search consultant within MITRE and its sponsors, and privately 3
  • 4. What is “Text Tagging” and “FSTs”?  First, I need to establish the context:  JIEDDO’s OpenSextant project  Though this presentation is not about OpenSextant or geotagging  Ultimately, I want to convey how cool Lucene’s FSTs are  And you may have a need for a text tagger  Or a geotagger (like OpenSextant)
  • 5. OpenSextant A DoD Funded Project: JIEDDO/COIC & NGA Open Source approval recently obtained
  • 6. OpenSextant Project  A geotagging solution for unstructured text  Finds place name references in natural language  “… I live near Boston … ”  Finds “Boston” with input character offset #s  Often resolves to multiple gazetteer entries: “Boston” has 73  What’s a Gazetteer?  A dictionary of place names with metadata like latitude & longitude
  • 7.
  • 8. How does it work?
  • 9. The “Naïve” Tagger  AKA “Text Tagger”  Simply consults a dictionary/gazetteer; no fancy NLP  There’s nothing geospatial about it  Subsequent NLP processing eliminates low-confidence tags  Actually, not so simple  Names vary in word length  Must find overlapping names  but not names within names
  • 10. The Gazetteer  13 million place name records  8.1M distinct place names  Why not 13M?  Ambiguous names (e.g. San Diego)  Text analysis normalization (e.g. diacritic removal, etc.)  2.8M are single-word names (1/3rd)  2.3 avg. words / name  14 avg. chars / name
  • 11. 3 Naïve Tagger Implementations  GATE’s Tagger  In-memory Aho-Corasick string-matching algorithm  Requires an estimated 80 GB RAM !! (for our data)  FAST  A JIEDDO developed MySQL based Tagger  “Reasonable” RAM requirements ~4GB  SLOW (~15x, 20x? not certain). ~1 doc/second  A JIEDDO developed Solr/FST based Tagger …
  • 13. Finite State Automata (FSA)  SortedSet<char[]>:  mop, moth, pop, slop, sloth, stop, top Note: a “Trie” data structure is similar but only shares prefixes
  • 14. Finite State Transducer (FST)  Adds optional output to each arc  SortedMap<char[],int>  mop: 0, moth: 1, pop: 2, slop: 3, sloth: 4, stop: 5, top: 6
  • 15. Lucene’s FST Implementation  FST encoded as a byte[]  Memory efficient! And fast to load from disk.  Write-once API (immutable)  Build minimal, acyclic FST from pre-sorted inputs  Fast (linear time with input size), low memory  Optional two-pass packing can shrink by ~25%  SortedMap<int[],T>: arcs are sorted by label  getByOutput also possible if outputs are sorted  http://s.apache.org/LuceneFSTs Based on a Mihov & Maurel paper, 2001
  • 16. FSTs and Text Tagging  My approach involves two layers of FSTs:  A word dictionary FST to hold each unique word  Enables using integers as substitutes for char[]  Via getByOutput(12345) -> “New”  Ex: “New” -> 12345, “York” -> 5522111, “City” -> 345  A word phrase FST comprised of word id string keys  Ex: “New York City” -> [12345, 5522111, 345]  Value are arrays of gazetteer primary keys
  • 17. Memory Use  Word Dict FST:  3.3M words with ordinal ids in 26MB of RAM  Name Phrase FST:  8.1M word id phrases in 90 MB of RAM  Plus 82MB of arrays of gazetteer primary key ids  Total: 198 MB (compare to 80GB GATE Aho-Corasick)  Building it consumes ~1.5GB Java heap, for 2 minutes
  • 18. Experimental measurements  Single FST Experiment  1 FST of analyzed character word phrase -> int id  “new york city” -> 6344207  Theory: more than 2x the memory  Result: 69 MB! (compare to 26+90) 41% reduction  Retrospective: What I would have done differently  Index a field of concatenated terms (custom TokenFilter).  More disk needed but reduces build time & memory requirements. Unclear effect on tagging performance.  Potential to use MemoryPostingsFormat, a Lucene Codec that uses an FST internally + vInt doc ids, instead of custom FST code.
  • 19. Tagging Algorithm It’s complicated! Single-pass (streaming) algorithm  For each input word, lookup its ordinal id, then: 1. Create an FST arc iterator for name phrase 2. Append the iterator onto a queue of active ones 3. Try to advance all iterators  Remove those that don’t advance Iterator linked-list queue: Head: New, York, City ✔ Head+1: York, City Head+2: City …
  • 20. Speed Benchmarks Docs/Sec RAM (GB) OpenSextant: GATE Tagger ? 80 OpenSextant: MySQL based Tagger 1.1 4 OpenSextant: Solr/FST Tagger 15.9 2* Measures single-threaded performance of geotagging 428 documents in the “ACE” collection. OpenSextant tests all had the same gazetteer.
  • 21. Integrated with Solr  As a custom Solr Request Handler  Builds the FSTs from the index (the gazetteer)  Configurable  Text analysis (e.g. phonetic)  Exclude gazetteer docs by configured query  Optional partial word phrase matching  Optional sub-tags tagging  Solr integration benefits  Solr as a taxonomy manager! Web-service, searchable, scalable, easy to update, …
  • 22. ~$ curl -XPOST 'http://localhost:8983/solr/tag ?fl=*&wt=json&indent=2' -H 'Content-Type:text/plain' -d "I live near Boston" { "responseHeader":{ "status":0, "QTime":1898}, "tagsCount":1, "tags":[[ "startOffset",12, "endOffset",18, "ids",[1190927, 1099063, 2562742, 2667203, 2684629, 2695904, 2653982, 2657690, 2585165, 2597292, … … 11890986, 11891415]]], "matchingDocs":{"numFound":73,"start":0,"docs":[ { "id":12719030, "place_id":"USGS1893700", "name":"Boston", "lat":65.01667, "lon":-163.28333, "feat_class":"L", "feat_code":"AREA", "FIPS_cc":"US", "ISO_cc":["US"], "cc":"US", "ISO3_cc":"USA", "adm1":"US02", "adm2":"US02.0180", "name_bias":0.0, "id_bias":0.04, "geo":"65.01667,-163.28333"}, …
  • 23. Where can you get this?  https://github.com/openSextant/SolrTextTagger  An independent module of OpenSextant  Might seek incubator status at http://www.osgeo.org  Includes documentation, tests
  • 24. Concluding Remarks  Lucene FSTs are awesome!  Great for storing large amounts of strings in-memory  Or other string-like data: e.g. IP addresses, geohashes  The API is hard to use, however  The text tagger should be useful independent of OpenSextant  Tag people/org names or special keywords  Might be ported to Lucene as an alternative to its synonym token filter  I’ve got an idea on applying these concepts to Lucene “Shingling” as a codec to make it more scalable
  • 25. CONFERENCE PARTY The Tipsy Crow: 770 5th Ave Starts after Stump The Chump Your conference badge gets you in the door TOMORROW Breakfast starts at 7:30 Keynotes start at 8:30 CONTACT David Smiley dsmiley@mitre.org