Integrating Advanced Text Analytics into Solr

               Lucene Revolution



Steve Kearns
Product Manager
www.basistech.com
Agenda

• About Basis Technology

• Why Text Analytics and Solr?

• Overview and Uses of Text Analytics

• Integration Strategies
About Basis Technology

• HQ in Cambridge, MA; offices in
  Tokyo, San Francisco, Washington DC


• Specialists in multilingual text analytics for
  Web/enterprise search
  Document/OSINT/media exploitation


• Rosette Linguistics Platform is widely used by
  commercial enterprises and government
  agencies
Why Text Analytics and Solr?


• More than Keyword Search and Result Lists

• More Metadata
    New ways to visualize, navigate and explore
    New knobs to tune relevance
    New info to connect disparate data sources

• Solr can be the consumer, host, or broker
Overview of Text Analytics

• Document-Level
    Language Identification, Categorization

• Sub-Document Level
    Entity Extraction, Fact Extraction, Sentiment, Linguistics

• Cross-Document
    Cross-Document Entity Resolution, Near Duplicate Detection,
     Unsupervised Clustering
Document Level Analysis: Language Identification


        • Sub-document Lang ID is possible

[Figure: sample passages in French, Russian, and Japanese, each labeled with its detected language, followed by a mixed-language document whose regions are labeled German (29%), French (33%), Japanese (21%), and Arabic (17%).]
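
Sub-document language identification first has to decide where one language stops and the next begins. The sketch below uses a deliberately crude stand-in for that step: it splits text wherever the Unicode script changes. This is an assumption made for illustration, not how Rosette or any statistical identifier works, and it cannot tell French from German because both use Latin script. The function names `script_of` and `script_runs` are hypothetical.

```python
import unicodedata

def script_of(ch):
    """Return a coarse script label for one character, or None for neutral characters."""
    if not ch.isalpha():
        return None  # digits, spaces, and punctuation do not decide the script
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK")):
        return "Japanese/CJK"
    if name.startswith("ARABIC"):
        return "Arabic"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"

def script_runs(text):
    """Split text into (script, substring) runs; neutral characters stay in the current run."""
    runs = []
    current, start = None, 0
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None or script == current:
            continue
        if current is not None:
            # The script changed: close the previous run just before this character.
            runs.append((current, text[start:i].strip()))
            start = i
        current = script
    if current is not None:
        runs.append((current, text[start:].strip()))
    return runs

if __name__ == "__main__":
    for script, chunk in script_runs("Le président В данный момент 送信キー"):
        print(script, "|", chunk)
```

Each run could then be passed to a real language identifier and indexed into a language-specific field.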
Document Level Analysis: Categorization

      • Group Documents into Pre-defined categories




http://news.google.com/
http://www.bbc.co.uk/
Sub-Document Analysis: Linguistics

 • Segmentation of Asian languages

 • Lemmatization


[Figure: diagram with the labels Stemming, N-Gram, Morphological, Lemmatization, and Segmentation]
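
To make the n-gram option concrete, here is a minimal sketch of overlapping character bigrams for text written without spaces. The function name `char_ngrams` is hypothetical. A morphological segmenter would output dictionary words instead, which needs a lexicon that is not shown here.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Too short to form a full n-gram: keep what there is as a single token.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

if __name__ == "__main__":
    print(char_ngrams("東京都庁"))
```

For 東京都庁 (Tokyo Metropolitan Government) the bigrams include 京都 (Kyoto), so a query for Kyoto would match a document about Tokyo. That loss of precision is the usual argument for morphological segmentation.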
Sub-Document Analysis: Sentiment

     • Sentence, paragraph, entity, aspect, emotion




http://twittersentiment.appspot.com/search?query=Lucene
http://maps.google.com/maps/place?cid=7410753351872099397
Sub-Document Analysis: Entity Extraction

     • Identify Named Concepts in Unstructured Text
            Statistical, rules, lists




http://www.twitscoop.com/
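
Of the three approaches named on this slide, the list-based one is the simplest to sketch. The gazetteer contents, the type labels, and the function name `extract_entities` below are all assumptions made for illustration. Statistical and rule-based extractors are not shown.

```python
import re

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "Cambridge": "LOCATION",
    "Steve Kearns": "PERSON",
}

def extract_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, type) for every gazetteer match, in text order."""
    # Longest names first, so a longer name wins over a shorter one nested inside it.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0)])
        for m in pattern.finditer(text)
    ]

if __name__ == "__main__":
    for entity in extract_entities("Steve Kearns works at Basis Technology in Cambridge."):
        print(entity)
```

A list lookup like this finds only exact, known names. It cannot tell Cambridge, MA from Cambridge, UK, which is where statistical models and cross-document resolution come in.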
Sub-Document: Fact / Relationship / Event Extraction

      • Identify Facts, Link Entities, Events and Times




http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
Cross-Document: Entity Co-reference Resolution

• Map extracted entities to real-world Concepts
Cross-Document Analysis: Clustering

• Near Duplicate Detection

• Unsupervised Clustering
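
One common way to approach near-duplicate detection is to compare sets of overlapping word sequences (shingles) using Jaccard similarity. The sketch below assumes that technique. The shingle size, the 0.5 threshold, and the function names are illustrative choices, not values from the talk.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc1, doc2, k=3, threshold=0.5):
    """Treat two documents as near duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold

if __name__ == "__main__":
    a = "the G8 members agreed on Sunday evening to support the plan"
    b = "the G8 members agreed on Sunday evening to support the initiative"
    print(jaccard(shingles(a), shingles(b)), is_near_duplicate(a, b))
```

Comparing every pair of documents this way does not scale to a large collection, which is one reason cross-document analysis is usually run outside the indexing path.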
Integration Strategies

• Analyzer/Tokenizer/TokenFilter

• UpdateRequestProcessor
    Run Analysis in Solr
    Call External Analysis Service

• Pre-Processor to Solr
Integration Point: Analyzer/Tokenizer

• Good for:
    Linguistics
    Segmentation of Asian Languages

• Limitations:
    No access to document object
Analyzer/Tokenizer Configuration

• schema.xml

   FieldType
     • Analyzer
        – CharFilter
        – Tokenizer
        – TokenFilter
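
The three stages listed above run in a fixed order for each field value. The sketch below mimics that data flow in plain Python. It is only a model of the idea, not Solr or Lucene code, and every function name in it is hypothetical.

```python
import re

def strip_markup(text):
    """CharFilter stage: rewrite the raw character stream (here, drop tags)."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenize(text):
    """Tokenizer stage: cut the character stream into tokens."""
    return text.split()

def lowercase(tokens):
    """TokenFilter stage: transform each token."""
    return [t.lower() for t in tokens]

def drop_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """TokenFilter stage: remove tokens found in a stopword list."""
    return [t for t in tokens if t not in stopwords]

def analyze(field_value, char_filters, tokenizer, token_filters):
    """Run char filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        field_value = char_filter(field_value)
    tokens = tokenizer(field_value)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

if __name__ == "__main__":
    print(analyze("<b>The</b> Rosette Platform",
                  [strip_markup], whitespace_tokenize, [lowercase, drop_stopwords]))
```

Note that `analyze` receives a single field value and nothing else. That mirrors the limitation from the previous slide: an analyzer has no access to the rest of the document.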
Integration Point: UpdateRequestProcessor

• Runs Before Analyzers

• Full Access to Document




• Two options:
    Run the analysis directly in Solr
    Call out to external analysis services




• Limitations:
    Think through your indexing strategy
Integration Point: UpdateRequestProcessor

• Run the analysis directly in Solr
     Good for lightweight analytics
     Not good for cross-document analytics




• Call out to external analysis services
     Web Services, UIMA, OpenPipeline, GATE, custom code
     Note that these external calls are synchronous
     Additional complexity / points of failure
UpdateRequestProcessor Configuration

• solrconfig.xml
    RequestHandler
       • update.processor = UpdateRequestProcessorChain.name (the parameter is named update.chain in later Solr releases)
    UpdateRequestProcessorChain
       • Processors
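
A processor chain passes each incoming document through a list of processors before it reaches the analyzers. The sketch below models only that data flow in Python. In Solr itself a processor is a Java class (a subclass of `UpdateRequestProcessor` that overrides `processAdd`) wired up in solrconfig.xml. The processor names, the field names, and the Cyrillic-means-Russian shortcut are all assumptions made for illustration.

```python
def detect_language(doc):
    """Hypothetical processor: tag the document with a language code.

    Crude placeholder logic: any Cyrillic letter is taken to mean Russian.
    """
    body = doc.get("body", "")
    has_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in body)
    doc["language"] = "ru" if has_cyrillic else "unknown"
    return doc

def copy_to_language_field(doc):
    """Hypothetical processor: copy the body into a per-language field."""
    doc["body_" + doc.get("language", "unknown")] = doc.get("body", "")
    return doc

def run_chain(doc, processors):
    """Apply processors in order; a processor may return None to drop the document."""
    for processor in processors:
        doc = processor(doc)
        if doc is None:
            return None
    return doc

if __name__ == "__main__":
    print(run_chain({"id": "1", "body": "Программное обеспечение"},
                    [detect_language, copy_to_language_field]))
```

Unlike an analyzer, each processor sees the whole document, so it can read one field and write another. Here that is used to route text to a field with language-specific analysis.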
Integration Point: Pre-Processor

• Index in Solr as Last Step of Analysis




• Good For:
     Finer-grained control
     Managing dependencies between components
     Scalability

• Limitations:
     Complexity / New points of failure
     Cannot use Solr’s content acquisition features
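
In the pre-processor strategy, analysis finishes before Solr sees the document, and indexing is just the final hand-off. The sketch below builds the update request without sending it. It assumes a Solr version whose update handler accepts a JSON array of documents. The URL, the core name `collection1`, the `word_count` field, and the placeholder `analyze` step are all hypothetical.

```python
import json

def analyze(doc):
    """Placeholder for an external analytics pipeline; adds one derived field."""
    enriched = dict(doc)
    enriched["word_count"] = len(doc.get("body", "").split())
    return enriched

def build_update_request(docs, solr_url="http://localhost:8983/solr/collection1/update"):
    """Enrich documents first, then build the JSON update request as the last step."""
    payload = json.dumps([analyze(d) for d in docs], ensure_ascii=False)
    headers = {"Content-Type": "application/json"}
    return solr_url, headers, payload

if __name__ == "__main__":
    url, headers, payload = build_update_request(
        [{"id": "1", "body": "Solr as the last step"}])
    print(url)
    print(payload)
```

The request is returned rather than posted so that the pipeline can batch, retry, or reorder documents on its own terms. That is the finer-grained control this strategy offers, at the cost of one more component to operate.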
Integration Summary

• There are Many Options!




• Document-Level Analysis:
    Generally safe to run in UpdateRequestProcessor

• Sub-Document Analysis:
    Sometimes run in UpdateRequestProcessor, sometimes external

• Cross-Document Analysis:
    Run external

• Multiple-Analysis Components:
    Run external document processing pipeline
Questions?

Integrating Advanced Text Analytics into Solr

  • 1. Integrating Advanced Text Analytics into Solr Lucene Revolution Steve Kearns Product Manager www.basistech.com
  • 2. Agenda • About Basis Technology • Why Text Analytics and Solr? • Overview and Uses of Text Analytics • Integration Strategies
  • 3. About Basis Technology • HQ in Cambridge, MA, Offices in: Tokyo, San Francisco, Washington DC • Specialists in multilingual text analytics for Web/enterprise search Document/OSINT/media exploitation • Rosette Linguistics Platform is widely used by commercial enterprises and government agencies
  • 4. Why Text Analytics and Solr? • More than Keyword Search and Result Lists • More Metadata  New ways to visualize, navigate and explore  New knobs to tune relevance  New info to connect disparate data sources • Solr can be the consumer, host, or broker
  • 5. Overview of Text Analytics • Document-Level  Language Identification, Categorization • Sub-Document Level  Entity Extraction, Fact Extraction, Sentiment, Linguistics • Cross-Document  Cross-Document Entity Resolution, Near Duplicate Detection, Unsupervised Clustering
  • 6. Document Level Analysis: Language Identification • Sub-document Lang ID is possible • [Figure: overlapping sample passages in French, Russian, Japanese, German, and transliterated Arabic, each labeled with its detected language, with per-language percentages for a mixed-language document]
  • 7. Document Level Analysis: Categorization • Group Documents into Pre-defined Categories • Examples: http://news.google.com/, http://www.bbc.co.uk/
  • 8. Sub-Document Analysis: Linguistics • Segmentation of Asian languages • Lemmatization • [Figure labels: Stemming, N-Gram, Morphological, Lemmatization, Segmentation]
  • 9. Sub-Document Analysis: Sentiment • Levels: sentence, paragraph, entity, aspect, emotion • Examples: http://twittersentiment.appspot.com/search?query=Lucene, http://maps.google.com/maps/place?cid=7410753351872099397
  • 10. Sub-Document Analysis: Entity Extraction • Identify Named Concepts in Unstructured Text (statistical, rules, lists) • Example: http://www.twitscoop.com/
  • 11. Sub-Document: Fact / Rel. / Event Extraction • Identify Facts, Link Entities, Events and Times • Example: http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360
  • 12. Cross-Document: Entity Co-reference Resolution • Map extracted entities to real-world Concepts
  • 13. Cross-Document Analysis: Clustering • Near Duplicate Detection • Unsupervised Clustering
  • 14. Integration Strategies • Analyzer/Tokenizer/TokenFilter • UpdateRequestProcessor: run analysis in Solr, or call an external analysis service • Pre-Processor to Solr
  • 15. Integration Point: Analyzer/Tokenizer • Good for: linguistics; segmentation of Asian languages • Limitations: no access to the document object
  • 16. Analyzer/Tokenizer Configuration • schema.xml FieldType • Analyzer: CharFilter, Tokenizer, TokenFilter
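
A minimal sketch of what such a field type might look like in schema.xml. The field type name `text_analyzed` and the particular stock factories chosen here are illustrative assumptions, not taken from the slides; a linguistics package would typically plug in its own tokenizer or filter factory class in the same positions.

```xml
<!-- schema.xml: illustrative field type; the name "text_analyzed" is hypothetical -->
<fieldType name="text_analyzed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- CharFilter: rewrites the raw character stream before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Tokenizer: splits the character stream into tokens (one per analyzer) -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- TokenFilters: applied to the token stream in the order listed -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the chain only ever sees the text of a single field, this is the layer where the "no access to document object" limitation applies.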
  • 17. Integration Point: UpdateRequestProcessor • Runs Before Analyzers • Full Access to Document • Two options: run the analysis directly in Solr, or call out to external analysis services • Limitations: think through your indexing strategy
  • 18. Integration Point: UpdateRequestProcessor • Run the analysis directly in Solr: good for lightweight analytics; not good for cross-document analytics • Call out to external analysis services: Web Services, UIMA, OpenPipeline, GATE, custom code; note that these external calls are synchronous; additional complexity / points of failure
  • 19. UpdateRequestProcessor Configuration • solrconfig.xml • RequestHandler: update.processor = UpdateRequestProcessorChain.name • UpdateRequestProcessorChain: Processors
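
One possible wiring in solrconfig.xml, sketched under assumptions: the chain name `text-analytics` and the class `com.example.LanguageIdUpdateProcessorFactory` are hypothetical placeholders for a custom processor. The `update.processor` parameter follows the slide; later Solr releases use `update.chain` instead, and the update handler class name also differs between versions.

```xml
<!-- solrconfig.xml: illustrative chain; "text-analytics" is a hypothetical name -->
<updateRequestProcessorChain name="text-analytics">
  <!-- Hypothetical custom processor: sees the whole document before the analyzers run -->
  <processor class="com.example.LanguageIdUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory performs the actual index update, so it goes last -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Point the update handler at the chain by name -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">text-analytics</str>
  </lst>
</requestHandler>
```

Processors run in the order listed, so any analysis step that adds fields to the document has to appear before the processor that writes to the index.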
  • 20. Integration Point: Pre-Processor • Index in Solr as Last Step of Analysis • Good for: finer-grained control; managing dependencies between components; scalability • Limitations: complexity / new points of failure; cannot use Solr's content acquisition features
  • 21. Integration Summary • There are Many Options! • Document-Level Analysis: generally safe to run in UpdateRequestProcessor • Sub-Document Analysis: sometimes run in UpdateRequestProcessor, sometimes external • Cross-Document Analysis: run external • Multiple-Analysis Components: run an external document processing pipeline