SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
Multilingual Search and Text
     Analytics with Solr
     Steve Kearns, Basis Technology
         skearns@basistech.com
            October 19, 2011
Agenda
 Agenda

   About me
   Language is important
   Approaches for language-aware search
   Configuring Solr




                      3
About Me
 Steve Kearns
 Product Manager at Basis Technology
 MS Information Technology

 BBN Technologies
  • Speech to text, Machine translation
  • Distributed architecture
 EMD Serono (Merck KGaA)
  • Scientific application development
  • Molecular modeling Linux cluster commodity hw

                          4
Language
    is
Important
    5
Language is Important
 Content is produced and consumed in the
  language of the user

 Document collections often contain more than one
  language

 Each language is unique, and presents different
  challenges to the search engine



                        6
Language is Complex
 Tokenization
   • Some languages do not use spaces
   • Compound words combine two or more words


 Inflection
   • In grammar, inflection or inflexion is the modification
     of a word to express different grammatical
     categories such as tense, grammatical mood,
     grammatical voice, aspect, person, number, gender
     and case.

                                             http://en.wikipedia.org/wiki/Inflection


                            7
Language is Complex




                                                    8

http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png
Language is Complex
 The Spanish word “pasaportar” has more than
  50 inflected forms:

pasaportando                       pasaportareis                       pasaportarán
pasaportes                         pasaportaron                        pasaporte
pasaportada                        pasaportase                         pasaportan
pasaportaba                        pasaportemos                        pasaporta
pasaportarían                      pasaportaría                        pasaportaste
pasaportarais                      pasaportara                         pasaportad
pasaportasen                       pasaportasteis                      pasaportéis
pasaportaren                       pasaportáramos                      pasaportadas
pasaportado                        pasaportaban                        pasaporté
pasaportaremos                     pasaportásemos                      pasaportados
pasaportábamos                     pasaportamos                        pasaportaré
pasaportases                       pasaporten                          pasaportare
pasaportaríais                     pasaportaréis                       pasaportará
pasaportaran                       pasaportabas                        pasaportó
pasaportarías                      pasaportaríamos                     pasaportabais
pasaportaras                       pasaportáremos                      pasaportaseis
pasaportarás                       pasaporto                           …

                                                     9
  http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar
Language is Complex
 English:
  • Spoke: Noun, part of a wheel
  • Spoke: Verb, Past tense of “speak”
 French:
  • Été: Noun, Summer
  • Été: Verb, Past tense of être (to be)
 German:
  • Samstagmorgen: Compound Noun
      Compound of Samstag (Saturday) and Morgen (Morning)
 Japanese:
  • 首脳会談後、オバマ大統領は記者団の質問に答える予定
      Where are the words??
                               10
Language-Aware Search
 Language Technology
  • Language Identification
  • Tokenization
      N-Gram
      Morphological
  • Token processing
      Stemming
      Lemmatization
  • Higher level analytics
      Entity Extraction
      Relationship Extraction



                                 11
Language Identification
 Find a single dominant language in a document
 Find multiple languages in a single document




                           12
Tokenization
 Morphological Analysis vs. N-gram
 Search Term: 東京 ルパン上映時間
 N-gram:




 Morphological Analysis:




                        13
Token Processing
 Stemming vs. Lemmatization
 English: “I have spoken at several conferences”
 Stemming:




 Lemmatization:




                        14
Token Processing
German: “Am Samstagmorgen fliege ich zurueck
  nach Boston.”
 Stemming:



 Lemmatization (and decompounding!):




                      15
Additional Linguistic Analysis
 Decompounding
 Part of Speech
 Noun Phrase Extraction




                       16
Higher Level Analysis
 Entity Extraction
 Relationship Extraction




                        17
How to Configure Solr
 Goals:
  • Language Identification
  • Language-aware:
      Tokenization
      Token Processing
      Entity Extraction


 Challenges
  • Multiple languages in the data set




                           18
How to Configure Solr
 What Solr tools do we have to work with?
  • UpdateRequestProcessor
  • Analyzer/Tokenizer/TokenFilter
  • Solr Cores


 Pre-processor to Solr?




                           19
UpdateRequestProcessor
 Runs Before Analyzers
 Full Access to Document

 Two options:
  • Run the analysis directly in Solr
      Good for Lightweight Analysis
  • Call out to external analysis services
      Web Services/UIMA. Increases Complexity


 Limitations:
  • Think through your indexing strategy

                               20
Analyzer/Tokenizer
 Good for:
  • Segmentation of Asian Language
  • Linguistics
 Limitations:
  • No access to document object


 Schema.xml
   • FieldType
      Analyzer
        – CharFilter
        – Tokenize
        – TokenFilter

                        21
How to Configure Language ID
 UpdateRequestProcessor
  • Runs before field-level analysis takes place
  • Basic Language Identifier URP to be included in Solr


 Outside Solr



What do you do with the language information??



                         22
Multiple Languages: Method 1
        One field for each language
              • Pro:
                     Simple approach and implementation
                     Guarantees that queries are processed the same way as index
              • Con:
                     Increased query-time complexity (Dismax, etc)
                     Decreased speed as additional fields are queried
                     May require storing multiple copies of text




Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-
%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf

                                                               23
Multiple Languages: Method 2
 One Solr core per language
  Each Core has the same field, with a language-specific
    Analyzer/Tokenizer
  • Pros:
      No query-time performance overhead
      Guarantees that queries are processed the same way as index
  • Cons:
      Significant complexity in managing multiple cores
      Must implement custom sharding
      Does not support multilingual documents




                                24
Multiple Languages: Method 3
 All Languages in one field
  • Pros:
      Single field makes queries and indexing easy
      Same schema/core as more languages added
  • Cons:
      Requires complex custom Tokenizer/Analyzer
      Must pass in language information for queries and indexing
      Does not guarantee queries are processed the same as the
       index
      Potential TF/IDF confusion




                               25
Language is Important
 Use language information at index and query time
 Increase recall, maintain precision



 Better search results for your users




                         26
Contact
 Steve Kearns
  • skearns@basistech.com
  • http://www.basistech.com




                        27

Más contenido relacionado

Destacado

Sonova.com building multilingual and multidomain drupal website
Sonova.com building multilingual and multidomain drupal websiteSonova.com building multilingual and multidomain drupal website
Sonova.com building multilingual and multidomain drupal websiteAmazee Labs
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
Language support in searching Drupal with SOLR - Drupalcamp London 2013
Language support in searching Drupal with SOLR - Drupalcamp London 2013Language support in searching Drupal with SOLR - Drupalcamp London 2013
Language support in searching Drupal with SOLR - Drupalcamp London 2013Exove
 
Using solr to find the right person for the right job - By Kang Laura
Using solr to find the right person for the right job - By Kang Laura   Using solr to find the right person for the right job - By Kang Laura
Using solr to find the right person for the right job - By Kang Laura lucenerevolution
 
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...Lucidworks
 
Shrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPShrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPlucenerevolution
 
Managing Translation Workflows in Drupal 7
Managing Translation Workflows in Drupal 7Managing Translation Workflows in Drupal 7
Managing Translation Workflows in Drupal 7Suzanne Dergacheva
 
Analyze this! tips and tricks on getting the lucene solr analyzer to index an...
Analyze this! tips and tricks on getting the lucene solr analyzer to index an...Analyze this! tips and tricks on getting the lucene solr analyzer to index an...
Analyze this! tips and tricks on getting the lucene solr analyzer to index an...Lucidworks (Archived)
 

Destacado (8)

Sonova.com building multilingual and multidomain drupal website
Sonova.com building multilingual and multidomain drupal websiteSonova.com building multilingual and multidomain drupal website
Sonova.com building multilingual and multidomain drupal website
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
Language support in searching Drupal with SOLR - Drupalcamp London 2013
Language support in searching Drupal with SOLR - Drupalcamp London 2013Language support in searching Drupal with SOLR - Drupalcamp London 2013
Language support in searching Drupal with SOLR - Drupalcamp London 2013
 
Using solr to find the right person for the right job - By Kang Laura
Using solr to find the right person for the right job - By Kang Laura   Using solr to find the right person for the right job - By Kang Laura
Using solr to find the right person for the right job - By Kang Laura
 
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
Semantic & Multilingual Strategies in Lucene/Solr: Presented by Trey Grainger...
 
Shrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLPShrinking the Haystack" using Solr and OpenNLP
Shrinking the Haystack" using Solr and OpenNLP
 
Managing Translation Workflows in Drupal 7
Managing Translation Workflows in Drupal 7Managing Translation Workflows in Drupal 7
Managing Translation Workflows in Drupal 7
 
Analyze this! tips and tricks on getting the lucene solr analyzer to index an...
Analyze this! tips and tricks on getting the lucene solr analyzer to index an...Analyze this! tips and tricks on getting the lucene solr analyzer to index an...
Analyze this! tips and tricks on getting the lucene solr analyzer to index an...
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Último (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Multilingual Search and Text Analytics with Solr - Steve Kearns

  • 1. Multilingual Search and Text Analytics with Solr Steve Kearns, Basis Technology skearns@basistech.com October 19, 2011
  • 2. Agenda  Agenda  About me  Language is important  Approaches for language-aware search  Configuring Solr 3
  • 3. About Me  Steve Kearns  Product Manager at Basis Technology  MS Information Technology  BBN Technologies • Speech to text, Machine translation • Distributed architecture  EMD Serono (Merck KGaA) • Scientific application development • Molecular modeling Linux cluster commodity hw 4
  • 4. Language is Important 5
  • 5. Language is Important  Content is produced and consumed in the language of the user  Document collections often contain more than one language  Each language is unique, and presents different challenges to the search engine 6
  • 6. Language is Complex  Tokenization • Some languages do not use spaces • Compound words combine two or more words  Inflection • In grammar, inflection or inflexion is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case. http://en.wikipedia.org/wiki/Inflection 7
  • 7. Language is Complex 8 http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png
  • 8. Language is Complex  The Spanish word “pasaportar” has more than 50 inflected forms: pasaportando pasaportareis pasaportarán pasaportes pasaportaron pasaporte pasaportada pasaportase pasaportan pasaportaba pasaportemos pasaporta pasaportarían pasaportaría pasaportaste pasaportarais pasaportara pasaportad pasaportasen pasaportasteis pasaportéis pasaportaren pasaportáramos pasaportadas pasaportado pasaportaban pasaporté pasaportaremos pasaportásemos pasaportados pasaportábamos pasaportamos pasaportaré pasaportases pasaporten pasaportare pasaportaríais pasaportaréis pasaportará pasaportaran pasaportabas pasaportó pasaportarías pasaportaríamos pasaportabais pasaportaras pasaportáremos pasaportaseis pasaportarás pasaporto … 9 http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar
  • 9. Language is Complex  English: • Spoke: Noun, part of a wheel • Spoke: Verb, Past tense of “speak”  French: • Été: Noun, Summer • Été: Verb, Past tense of être (to be)  German: • Samstagmorgen: Compound Noun  Compound of Samstag (Saturday) and Morgen (Morning)  Japanese: • 首脳会談後、オバマ大統領は記者団の質問に答える予定  Where are the words?? 10
  • 10. Language-Aware Search  Language Technology • Language Identification • Tokenization  N-Gram  Morphological • Token processing  Stemming  Lemmatization • Higher level analytics  Entity Extraction  Relationship Extraction 11
  • 11. Language Identification  Find a single dominant language in a document  Find multiple languages in a single document 12
  • 12. Tokenization  Morphological Analysis vs. N-gram  Search Term: 東京 ルパン上映時間  N-gram:  Morphological Analysis: 13
  • 13. Token Processing  Stemming vs. Lemmatization  English: “I have spoken at several conferences”  Stemming:  Lemmatization: 14
  • 14. Token Processing German: “Am Samstagmorgen fliege ich zurueck nach Boston.”  Stemming:  Lemmatization (and decompounding!): 15
  • 15. Additional Linguistic Analysis  Decompounding  Part of Speech  Noun Phrase Extraction 16
  • 16. Higher Level Analysis  Entity Extraction  Relationship Extraction 17
  • 17. How to Configure Solr  Goals: • Language Identification • Language-aware:  Tokenization  Token Processing  Entity Extraction  Challenges • Multiple languages in the data set 18
  • 18. How to Configure Solr  What Solr tools do we have to work with? • UpdateRequestProcessor • Analyzer/Tokenizer/TokenFilter • Solr Cores  Pre-processor to Solr? 19
  • 19. UpdateRequestProcessor  Runs Before Analyzers  Full Access to Document  Two options: • Run the analysis directly in Solr  Good for Lightweight Analysis • Call out to external analysis services  Web Services/UIMA. Increases Complexity  Limitations: • Think through your indexing strategy 20
  • 20. Analyzer/Tokenizer  Good for: • Segmentation of Asian Language • Linguistics  Limitations: • No access to document object  Schema.xml • FieldType  Analyzer – CharFilter – Tokenize – TokenFilter 21
  • 21. How to Configure Language ID  UpdateRequestProcessor • Runs before field-level analysis takes place • Basic Language Identifier URP to be included in Solr  Outside Solr What do you do with the language information?? 22
  • 22. Multiple Languages: Method 1  One field for each language • Pro:  Simple approach and implementation  Guarantees that queries are processed the same way as index • Con:  Increased query-time complexity (Dismax, etc)  Decreased speed as additional fields are queried  May require storing multiple copies of text Informed by Trey Grainger @ Careerbuilder: http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20- %20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf 23
  • 23. Multiple Languages: Method 2  One Solr core per language Each Core has the same field, with a language-specific Analyzer/Tokenizer • Pros:  No query-time performance overhead  Guarantees that queries are processed the same way as index • Cons:  Significant complexity in managing multiple cores  Must implement custom sharding  Does not support multilingual documents 24
  • 24. Multiple Languages: Method 3  All Languages in one field • Pros:  Single field makes queries and indexing easy  Same schema/core as more languages added • Cons:  Requires complex custom Tokenizer/Analyzer  Must pass in language information for queries and indexing  Does not guarantee queries are processed the same as the index  Potential TF/IDF confusion 25
  • 25. Language is Important  Use language information at index and query time  Increase recall, maintain precision  Better search results for your users 26
  • 26. Contact  Steve Kearns • skearns@basistech.com • http://www.basistech.com 27