Querying Rich Text with Lucene XQuery

Michael Sokolov
Senior Architect
Safari Books Online

The plan for this talk

- Overview of Lux
- Why we want (or need) a rich(er) query language
- Implementation stories
  - Indexing tagged text
  - Storing documents in Lucene
  - Lazy searching
- Demo

What is Lux?

- XQuery in Solr
  - Query optimizer
  - Efficient XML document format
  - XQuery function library

- as a Java library (Lucene only)
- as Solr plugins
- as a standalone App Server

Search: to find something

Query: to get an answer

XML is not cool

- Maybe it was once, 10 years ago?
- Legacy stuff: DTDs, namespaces, etc.
- Arcane Java programming interfaces
- Don’t we use JSON now?
- So why do we care about it?

But it still matters

- There’s a huge amount of it out there
- HTML is XML, or can be
- Lux is about making it easy (and free) to deal with XML

Why did we make it?

- We make content-rich sites:
  - our own site: safaribooksonline.com
  - our clients’ sites: oed.com, degruyter.com, oxfordreference.com, …
- Publishers provide us with content
  - we debug content problems
  - we add new features nimbly
- Piles of random data (XML, mostly)

How can XQuery help?

- Complex queries over semi-structured data, typically documents
  - You don’t need it for edismax-style “quick” search
  - or highly-structured data
- XQuery comes with a rich function library:
  - rich string, numeric and date functions
  - extensions for HTTP, filesystem, zip

How does Lux work?

[Architecture diagram. Components shown: DispatchFilter, UpdateProcessor, XML Indexer, XML text fields, Tagged TokenStream, XPath fields, Tinybin storage, External Field Codec, QueryComponent, QParserPlugin, Evaluator, Saxon XQuery/XSLT Processor, XQuery Function Library, Lazy Searcher, ResponseWriter, Compiler, Optimizer, Tagged Highlighter]

XML is text with context

- “hamlet”
- “hamlet” in //title
- “hamlet” in //scene/title, //speaker, etc…
- XQuery, but we need an index
- DIH XPathEntityProcessor
- But are XPath indexes enough?

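The idea of searching for a word in a given element context can be sketched without any index at all. The following is a minimal illustration, assuming Python's standard `xml.etree.ElementTree` and a made-up sample document; the helper `find_in_context` is hypothetical and is not part of Lux. It scans the whole document on every call, which is exactly why an index is needed at scale.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the speech line is invented for illustration.
PLAY = """
<play>
  <title>Hamlet</title>
  <act>
    <scene>
      <title>SCENE I. Elsinore</title>
      <speech><speaker>Hamlet</speaker><line>a toy line</line></speech>
    </scene>
  </act>
</play>
"""

def find_in_context(root, path, word):
    """Return the elements selected by `path` whose text contains `word`."""
    word = word.lower()
    return [e for e in root.iterfind(path)
            if word in "".join(e.itertext()).lower()]

root = ET.fromstring(PLAY)

# "hamlet" anywhere in the document
anywhere = "hamlet" in "".join(root.itertext()).lower()
# "hamlet" in //title: only the play title matches
in_titles = find_in_context(root, ".//title", "hamlet")
# "hamlet" in //scene/title: the scene title does not mention Hamlet
in_scene_titles = find_in_context(root, ".//scene/title", "hamlet")
# "hamlet" in //speaker
in_speakers = find_in_context(root, ".//speaker", "hamlet")
```
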
How do we index context?

- In which speeches does Hamlet talk about poison?
  - +speaker:Hamlet +line:poison
  - Works great if we indexed speaker and line for each speech
- What if we only indexed at the scene level?
- What if we just indexed speech text as a field?
- XPath indexes are precise and fine-grained
  - Great when you know exactly what you need

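Why the indexing granularity matters can be shown with a toy example. This is a sketch under stated assumptions: the speeches are invented for illustration (they are not quotations), and `make_doc` and `search` are hypothetical helpers that mimic a boolean query such as `+speaker:Hamlet +line:poison` over in-memory sets.

```python
# (scene, speaker, line): toy lines invented for illustration
SPEECHES = [
    (1, "Hamlet", "something is wrong here"),
    (1, "Claudius", "the poison is ready"),
    (2, "Hamlet", "the blade carries poison"),
]

def make_doc(speeches):
    """Build one indexed document with a speaker field and a line field."""
    return {
        "speaker": {speaker.lower() for _, speaker, _ in speeches},
        "line": {w for _, _, line in speeches for w in line.lower().split()},
    }

def search(docs, speaker, word):
    """Evaluate +speaker:<speaker> +line:<word> against every document."""
    return [key for key, doc in docs.items()
            if speaker in doc["speaker"] and word in doc["line"]]

# One document per speech: the query is precise.
speech_docs = {i: make_doc([s]) for i, s in enumerate(SPEECHES)}

# One document per scene: speaker and word may come from different speeches.
scene_docs = {}
for scene in sorted({s[0] for s in SPEECHES}):
    scene_docs[scene] = make_doc([s for s in SPEECHES if s[0] == scene])

speech_hits = search(speech_docs, "hamlet", "poison")
scene_hits = search(scene_docs, "hamlet", "poison")
```

At the speech level only the third speech matches. At the scene level scene 1 also matches, although there it is Claudius who mentions poison, so the coarser index needs a post-filter.
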
Contextual Indexes

<play>
<title>Hamlet</title>
<act act="1">
<scene act="1" scene="1">
<title>SCENE I. Elsinore ... </title>

Index         Values
Tags          title, act, @act
Tag Paths     /play, /play/title, /play/act, /play/act/@act
Text          hamlet, scene, elsinore
Tagged Text   play:hamlet, title:hamlet, @act:1
XPath         user-defined XPath 2.0 expression; e.g.: count(//line), replace(//title, 'SCENE|ACT S+', '')

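The index values in the table above can be derived from a document with a short tree walk. This is a minimal sketch, not Lux's indexer: `index_values` is a hypothetical helper, the tokenizer is a plain lowercase word split, and keeping attribute values out of the plain text index is an assumption.

```python
import re
import xml.etree.ElementTree as ET

def words(text):
    """Lowercase word tokens; a stand-in for a real Lucene Analyzer."""
    return re.findall(r"\w+", (text or "").lower())

def index_values(xml_text):
    root = ET.fromstring(xml_text)
    tags, paths, text, tagged = set(), set(), set(), set()

    def add_text(value, enclosing):
        for w in words(value):
            text.add(w)
            # every enclosing tag contributes one tagged term
            tagged.update(f"{t}:{w}" for t in enclosing)

    def visit(elem, ancestors, parent_path):
        path = f"{parent_path}/{elem.tag}"
        enclosing = ancestors + [elem.tag]
        tags.add(elem.tag)
        paths.add(path)
        for name, value in elem.attrib.items():
            tags.add(f"@{name}")
            paths.add(f"{path}/@{name}")
            tagged.update(f"@{name}:{w}" for w in words(value))
        add_text(elem.text, enclosing)
        for child in elem:
            visit(child, enclosing, path)
            add_text(child.tail, enclosing)  # text after a child belongs to elem

    visit(root, [], "")
    return {"tags": tags, "paths": paths, "text": text, "tagged_text": tagged}

SAMPLE = (
    '<play><title>Hamlet</title><act act="1">'
    '<scene act="1" scene="1"><title>SCENE I. Elsinore</title></scene>'
    '</act></play>'
)
values = index_values(SAMPLE)
```
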
General purpose indexes

- Tagged Text, Path index
- Imprecise, generic indexes, but more context than just full text
- XQuery post-processing to patch over the gaps
- Query optimizer applies indexes
- For when you don’t want to sweat the details: ad hoc queries, content analysis and debugging

TaggedTokenStream

<scene><speech>
<speaker>Hamlet</speaker>
<line>To be or not to be, … </line>

Tag stack for each token:
  Hamlet   …, scene, speech, speaker
  To       …, scene, speech, line
  be       …, scene, speech, line

Tokens emitted:
  ttext:scene:hamlet     pos=1
  ttext:speech:hamlet    pos=1
  ttext:speaker:hamlet   pos=1
  ttext:scene:to         pos=2
  ttext:speech:to        pos=2
  …

- Wraps an existing Analyzer (for the text)
- Responds to XML events (start element, etc.)
- Maintains a tag name stack
- Emits each token prefixed by enclosing tags

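The four behaviours listed for TaggedTokenStream can be sketched with a SAX handler. This is an illustration only, assuming Python's standard `xml.sax` and a simple word-splitting tokenizer in place of a wrapped Lucene Analyzer; `TaggedTokenHandler` and `tagged_tokens` are hypothetical names, and the field-name prefix shown on the slide is left out.

```python
import re
import xml.sax

class TaggedTokenHandler(xml.sax.ContentHandler):
    """Emit each word once per enclosing tag, all at the same position."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of the currently open elements
        self.buffer = []     # character data seen since the last XML event
        self.position = 0
        self.tokens = []     # (term, position) pairs

    def _flush(self):
        text = "".join(self.buffer)
        self.buffer.clear()
        for word in re.findall(r"\w+", text.lower()):
            self.position += 1
            for tag in self.stack:
                self.tokens.append((f"{tag}:{word}", self.position))

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        # SAX may deliver text in several chunks, so buffer until the next event
        self.buffer.append(content)

def tagged_tokens(xml_text):
    handler = TaggedTokenHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens

tokens = tagged_tokens(
    "<scene><speech><speaker>Hamlet</speaker>"
    "<line>To be</line></speech></scene>"
)
```

Because every tagged variant of a word shares one position, phrase and proximity queries can mix terms with different tag prefixes.
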
TagQueryParser

- XPath:
    //speech[speaker="Hamlet"][contains(.,"poison")]
- “optimized” XQuery:
    lux:search("+<speaker:Hamlet +<speech:poison")
        //speech[speaker="Hamlet"][contains(.,"poison")]
- Lucene Query:
    tagged_text:(+speaker:Hamlet +speech:poison)

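The mapping from the `<tag:term` syntax to a query on the single tagged-text field can be approximated with a few lines of string handling. This is a sketch of the mapping shown above, not Lux's actual TagQueryParser: `tag_query_to_lucene` is a hypothetical function, it only handles whitespace-separated single-term clauses, and the field name is taken from the Lucene query on the slide.

```python
import re

# optional +/- occurrence flag, "<", tag name (optionally "@attr"), ":", term
CLAUSE = re.compile(r"([+-]?)<(@?[\w.-]+):(.+)")

def tag_query_to_lucene(query, field="tagged_text"):
    """Rewrite '+<speaker:Hamlet' style clauses as terms of one Lucene field."""
    clauses = []
    for clause in query.split():
        match = CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"not a tagged clause: {clause!r}")
        occur, tag, term = match.groups()
        clauses.append(f"{occur}{tag}:{term}")
    return f"{field}:({' '.join(clauses)})"

lucene_query = tag_query_to_lucene("+<speaker:Hamlet +<speech:poison")
```
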
Tagged token examples

- Generic JSON index
- Overlapping tags (part-of-speech, phrase-labeling, NLP)
- Citation classification w/probabilistic labeling

- One stored field for all the text makes highlighting easier
- One Lucene field means you can use PhraseQuery, e.g.:
    PhraseQuery(<speaker:hamlet <speech:to) finds all speeches by hamlet starting with “to”.

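Why a single field with shared positions enables the PhraseQuery above can be shown with a tiny positional index. This is a sketch, assuming the token positions from the TaggedTokenStream slide; `build_postings` and `phrase_positions` are hypothetical helpers, not Lucene APIs.

```python
from collections import defaultdict

# (term, position) pairs in one field, as a tagged token stream might emit them
TOKENS = [
    ("scene:hamlet", 1), ("speech:hamlet", 1), ("speaker:hamlet", 1),
    ("scene:to", 2), ("speech:to", 2), ("line:to", 2),
    ("scene:be", 3), ("speech:be", 3), ("line:be", 3),
]

def build_postings(tokens):
    """Map each term to the set of positions where it occurs."""
    postings = defaultdict(set)
    for term, position in tokens:
        postings[term].add(position)
    return postings

def phrase_positions(postings, terms):
    """Start positions where the terms occur at consecutive positions."""
    starts = postings.get(terms[0], set())
    return sorted(
        p for p in starts
        if all((p + i) in postings.get(t, set()) for i, t in enumerate(terms))
    )

postings = build_postings(TOKENS)
hit = phrase_positions(postings, ["speaker:hamlet", "speech:to"])
miss = phrase_positions(postings, ["speaker:hamlet", "speech:be"])
```

The first phrase matches because `speaker:hamlet` sits at position 1 and `speech:to` at position 2 of the same field; with separate fields per tag the positions could not be compared.
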
What’s the cost?

- stored document = 100%
- qnames = +1.3%
- paths = +2.4%
- text tokens = 18%
- tagged text (opaque) = 18%
- tagged text (all transparent) = 71%

Query optimization

Original query:

  subsequence(
    for $doc in collection()[.//SPEAKER="Hamlet"]
    order by $doc/lux:key("title")
    return $doc, 1000, 20)

Optimized query:

  subsequence(
    lux:search("<SPEAKER:Hamlet", "title", 1000)
      [.//SPEAKER="Hamlet"],
    1, 20)

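The benefit of pushing the search into the index and evaluating the rest lazily can be sketched with generators. This is an illustration under assumptions, not Lux's Lazy Searcher: `CountingStore` and `lazy_page` are hypothetical, the candidate list stands in for an imprecise index result that is already sorted, and the data is synthetic.

```python
from itertools import islice

class CountingStore:
    """Document store that counts how many documents are actually loaded."""

    def __init__(self, docs):
        self.docs = docs
        self.loads = 0

    def load(self, doc_id):
        self.loads += 1
        return self.docs[doc_id]

def lazy_page(store, candidate_ids, predicate, start, count):
    """Load candidates on demand, re-check them, and return one page (1-based start)."""
    loaded = (store.load(doc_id) for doc_id in candidate_ids)
    verified = (doc for doc in loaded if predicate(doc))
    return list(islice(verified, start - 1, start - 1 + count))

# Synthetic data: even ids have Hamlet as a speaker.
docs = {
    i: {"title": f"t{i:03d}", "speakers": ["Hamlet"] if i % 2 == 0 else ["Ophelia"]}
    for i in range(100)
}
# The imprecise index returns all true matches plus some false positives.
candidates = [i for i in range(100) if i % 2 == 0 or i % 10 == 1]

store = CountingStore(docs)
page = lazy_page(store, candidates, lambda d: "Hamlet" in d["speakers"], 1, 5)
titles = [d["title"] for d in page]
```

Only six of the hundred documents are loaded to produce the first five results: the five matches plus one false positive that the post-filter rejects.
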
Document storage

- Lux uses Lucene as its primary document store
- Lux tinybin (based on Saxon TinyTree) storage format avoids XML parsing overhead
- Experimental new codec stores fields as files

“Big” binary stored fields

- Problem: “big” stored fields
  - Text documents get stored for highlighting
  - Take time to copy when merging
- Can we do better by storing as files, but managing w/Lucene?

ExternalFieldCodec

[Diagram: large stored fields and small stored fields]

Deleting is complicated

- Real-time deletes
  - Track deletions when merging
  - Keep commits with IndexDeletionPolicy
  - Delete unmerged (empty) segments
- Off-line deletes
  - Cleanup tool traverses entire index

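The off-line delete approach can be illustrated with a toy store that keeps large values as files. This is a sketch under loud assumptions and is not the ExternalFieldCodec: the class `ExternalFieldStore`, the size threshold, the file naming, and the in-memory dictionary standing in for the Lucene index are all hypothetical.

```python
import os
import tempfile

class ExternalFieldStore:
    """Keep small values inline and write large values to files (toy model)."""

    def __init__(self, directory, threshold=1024):
        self.directory = directory
        self.threshold = threshold   # assumed cut-off between small and large
        self.index = {}              # doc_id -> ("inline", bytes) or ("file", name)

    def add(self, doc_id, value):
        if len(value) < self.threshold:
            self.index[doc_id] = ("inline", value)
            return
        name = f"{doc_id}.bin"       # assumes doc ids are safe file names
        with open(os.path.join(self.directory, name), "wb") as f:
            f.write(value)
        self.index[doc_id] = ("file", name)

    def get(self, doc_id):
        kind, payload = self.index[doc_id]
        if kind == "inline":
            return payload
        with open(os.path.join(self.directory, payload), "rb") as f:
            return f.read()

    def delete(self, doc_id):
        # Only the index entry goes away; the file is left for cleanup().
        del self.index[doc_id]

    def cleanup(self):
        """Off-line delete: traverse the index and remove unreferenced files."""
        live = {p for kind, p in self.index.values() if kind == "file"}
        removed = []
        for name in sorted(os.listdir(self.directory)):
            if name not in live:
                os.remove(os.path.join(self.directory, name))
                removed.append(name)
        return removed

with tempfile.TemporaryDirectory() as directory:
    store = ExternalFieldStore(directory)
    store.add("small", b"x" * 10)
    store.add("large", b"y" * 5000)
    large_roundtrip = store.get("large")
    files_after_add = sorted(os.listdir(directory))
    store.delete("large")
    files_after_delete = sorted(os.listdir(directory))
    removed = store.cleanup()
    files_after_cleanup = sorted(os.listdir(directory))
```

The file outlives the delete until the cleanup pass runs, which is the trade-off the off-line approach accepts in exchange for simpler bookkeeping.
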
Codec Performance (preliminary)

- 2-3x write speedup for unindexed stored fields
- A bit slower in the worst case
- But, text analysis can take most of the time
- Net: useful if you are storing large binaries

App Server

- Custom DispatchFilter provides:
  - HTTP request/response handling in XQuery
  - file uploads, redirects
  - Ability to roll your own: cookies, authentication
- Rapid prototyping, testing query performance, relevance, in an application setting

Isn’t Solr enough?

- Yes, but did you remember to index all the fields you need in advance?
- Yes, but did you want to format the result into a nice report *using your query language*?
- Yes, but did you want access to a complete XPath 2.0 implementation in your indexer?

Example uses

- Find some sample content with a new tag we need to support
- Perform complex updates to patch broken content
- Troubleshoot content
- Explore unfamiliar content
- Write prototypes and admin tools entirely in HTML, JS and XQuery
- Demo: http://localhost:8080

Thank You!

- Downloads and documentation at http://luxdb.org
- Source code at http://github.com/msokolov/lux
- Freely available under OSS license (MPL 2)
- Contributions welcome
- Thank you, Safari Books!
