CONTENT PROCESSING ARCHITECTURE AND APPLICATIONS
 Introduction to text mining – Warsaw University of Technology
Plan	

  Findwise – who we are, what we do.
  What is content?
  Why content processing is important
  Content processing and information retrieval
  Technology for content processing
  Methods for content processing
  Examples of usage
Findwise – Search Driven Solutions
 •  Founded in 2005
 •  Offices in Sweden, Denmark, Norway, Poland and Australia
 •  90 employees

 Our objective is to be a leading provider of Findability solutions utilising
 the full potential of search technology to create customer business value.

 •  Paweł Wróblewski & Marcin Goss
  
WHAT IS CONTENT?
Content ≥ Information
 From the business point of view, INFORMATION is the key to success.

 "Information can only be an asset when it enables a task to be completed."
 "The value is in the outcome of the task, not in the information itself."
 Martin White

 Employee productivity (The hidden cost… IDC April 2006):
 "the cost for wasted time on the part of professional searching, but not
 finding relevant information, amounts to $5.3 million annually for an enterprise
 with 1000 knowledge workers."
Information is hidden
 Big Data is commonly described with 3V:

 1.  Variety
        Human generated vs. machine generated
        Text & multimedia
 2.  Volume
        Up to petabytes
 3.  Velocity
        Stream of data
        GBs per day, hour, minute, second
  
Information lives in the context
 The right information is hidden in text.

 Text forms a context:
 word -> sentence -> paragraph -> chapter -> document

 Content processing is about extracting the required
 information from that context.
WHY IS CONTENT PROCESSING IMPORTANT?
Why content processing is important
 To get the right information in seconds
 •  Usage of faceted search

 To tag a large document set consistently
 •  Usage of an automatic extractor

 To build a semantic database
 •  Extraction of concepts with linkage to a taxonomy/ontology

 To perform document classification
 •  Extraction of entities with grouping / clustering

 Examples from publicly available websites [live show]
  
Conclusion
 Content processing is a set of techniques enabling text analytics.

 Content processing leverages the value of data stored in companies
 by improving data consumption.

 Content processing used with search engines helps find information
 in any context.
 •  Enterprise Findability
 •  E-commerce
  
TECHNOLOGY FOR CONTENT PROCESSING
General architecture of search engines
Content Processing – the idea
 [Diagram: a document passes through a chain of processing stages on its way to the index]
 Stage labels in the diagram: Format Conversion, Language Detection, Spell Checking,
 Lemmas (tenses, forms), Synonyms, Taxonomy Classification, Vectorizer, Custom PLUG-IN,
 Entities (Geography, Companies, People), Scopifier

 Sample document:
 PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million
 French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
 The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a
 blustery center court to become the first seed to advance at Roland Garros. "I love
 being here, I love the French Open and more than anything I'd love to do well here,"
 the American said.

 Input:  byte stream
 Output: structured document ready to be indexed
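 The idea of turning a byte stream into a structured document can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the stage functions, the field names (`raw`, `text`, `language`, `tokens`) and the toy language detector are hypothetical, and none of this is the actual API of any product named in these slides.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def convert_format(doc: Document) -> Document:
    # Format conversion: decode the raw byte stream into text (UTF-8 assumed).
    doc["text"] = doc.pop("raw").decode("utf-8", errors="replace")
    return doc


def detect_language(doc: Document) -> Document:
    # Toy language detection: count a few common English function words.
    words = doc["text"].lower().split()
    hits = sum(w in {"the", "and", "of", "to", "in"} for w in words)
    doc["language"] = "en" if hits >= 2 else "unknown"
    return doc


def tokenize(doc: Document) -> Document:
    # Lowercased word tokens, ready for later stages such as lemmatization.
    doc["tokens"] = re.findall(r"\w+", doc["text"].lower())
    return doc


def run_pipeline(raw: bytes, stages: List[Stage]) -> Document:
    # Input: byte stream. Output: structured document ready to be indexed.
    doc: Document = {"raw": raw}
    for stage in stages:
        doc = stage(doc)
    return doc


STAGES = [convert_format, detect_language, tokenize]
result = run_pipeline(
    b"Venus Williams raced into the second round of the French Open", STAGES
)
```

 Each stage only reads and writes fields of the document, so stages can be reordered or swapped without touching the others.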
  
Content Processing – the implementation
 Hydra is used to refine content before it reaches the index. Every
 document fetched from a source runs through a targeted pipeline
 that consists of a number of stages. A stage can be thought of as an
 "app" in the App Store or the Android Market. Findwise has created
 a large number of such stages, each with one small purpose:
 to enhance the content of the item. Additional stages can be
 created to serve customer-specific functionality.
Hydra - example
 Select the stages to use in the pipeline: the left column corresponds to the
 "market", and the right column lists the stages in use.
  
Hydra - example
 Modify the format of the date to include only the year.

Hydra - example
 The new year metadata can be used as a facet.
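 The two steps above (reduce a date to its year, then use the year as a facet) might look roughly like this. It is a sketch, not Hydra's own stage code: the field names `date` and `year`, the ISO date format and the sample documents are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def add_year(doc, date_field="date", year_field="year"):
    """Stage sketch: derive a year-only field from an ISO date (YYYY-MM-DD)."""
    value = doc.get(date_field)
    if value:
        doc[year_field] = datetime.strptime(value, "%Y-%m-%d").year
    return doc


def facet_counts(docs, field):
    """Count how many documents fall under each value of a field."""
    return Counter(d[field] for d in docs if field in d)


# Hypothetical documents; the last one has no date and gets no year.
docs = [
    {"title": "A", "date": "2011-03-14"},
    {"title": "B", "date": "2012-07-01"},
    {"title": "C", "date": "2012-11-30"},
    {"title": "D"},
]
docs = [add_year(d) for d in docs]
year_facet = facet_counts(docs, "year")
```

 A search front end could show `year_facet` as clickable filters with a document count next to each year.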
  
Hydra - example
 Map every author field to a metadata field called author.
 Pipeline A
 Pipeline B
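 One way to picture this mapping: each pipeline knows which source fields carry the author in its own content source and renames them to a common `author` field. The source field names (`creator`, `written_by`, `dc_author`) and the sample values are hypothetical; the slides do not say which fields the two pipelines actually read.

```python
def make_author_mapper(source_fields, target="author"):
    """Build a stage that renames any of source_fields to the target field."""

    def stage(doc):
        for name in source_fields:
            if name in doc and name != target:
                value = doc.pop(name)
                # Keep an existing target value if one is already present.
                doc.setdefault(target, value)
        return doc

    return stage


# Two pipelines, each configured for the field names of its own source.
pipeline_a = make_author_mapper(["creator"])
pipeline_b = make_author_mapper(["written_by", "dc_author"])

doc_a = pipeline_a({"title": "Report", "creator": "Jan Nowak"})
doc_b = pipeline_b({"title": "Memo", "written_by": "Mary"})
```

 After both pipelines have run, the index sees a single `author` field regardless of where a document came from, so one facet or filter covers all sources.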
  
 	
  
 	
  
 	
  
Hydra - example
 In the search result…
  
 	
  
 	
  
Hydra is Open Source
 http://findwise.github.com/Hydra/
METHODS FOR CONTENT PROCESSING
Named entity recognition – statistical classifiers

     •  OpenNLP (http://opennlp.apache.org/)
                 •  Markov chains
     •  Mallet (http://mallet.cs.umass.edu/)
                 •  Conditional random fields

     Input:
                   Mark has been in London since Mary dumped him.

     Output:
                   <person>Mark</person> has been in <place>London</place>
                   since <person>Mary</person> dumped him.
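     The output format above can be produced from per-token labels with a few lines of code. The sketch below assumes the classifier has already assigned one label per token; the labels are written by hand here, and the function handles single-token names only. It does not call OpenNLP or Mallet.

```python
def render_entities(tokens, labels):
    """Wrap each labelled token in an inline tag; "O" means no entity."""
    parts = []
    for token, label in zip(tokens, labels):
        parts.append(token if label == "O" else f"<{label}>{token}</{label}>")
    text = " ".join(parts)
    # Re-attach the sentence-final period to the preceding token.
    return text.replace(" .", ".")


tokens = ["Mark", "has", "been", "in", "London", "since", "Mary", "dumped", "him", "."]
# Hand-written labels standing in for real classifier output.
labels = ["person", "O", "O", "O", "place", "O", "person", "O", "O", "O"]
tagged = render_entities(tokens, labels)
```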
Classifiers - training

     •  Training set - language corpora
                       •  (http://nkjp.pl/) for Polish

     A set of manually tagged texts in the given language, preferably from various
     sources and on various topics.

     Tokens      PoS tags     Name tags
     He          Pronoun      O
     went        Verb         O
     to          Prep.        O
     United      Adjective    Place
     States      Noun         Place
     .           Interp       O
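     In code, one tagged sentence from such a corpus can be held as a list of (token, PoS tag, name tag) rows. The sketch below uses the rows from the table and merges neighbouring tokens with the same name tag into one entity; the function name and the tuple layout are assumptions for illustration, not a corpus format.

```python
TRAINING_SENTENCE = [
    ("He", "Pronoun", "O"),
    ("went", "Verb", "O"),
    ("to", "Prep.", "O"),
    ("United", "Adjective", "Place"),
    ("States", "Noun", "Place"),
    (".", "Interp", "O"),
]


def extract_entities(rows):
    """Merge runs of consecutive tokens that share a non-"O" name tag."""
    entities = []
    current_tokens, current_tag = [], "O"
    for token, _pos, tag in rows:
        if tag != current_tag and current_tokens:
            # The tag changed, so the current entity is complete.
            entities.append((" ".join(current_tokens), current_tag))
            current_tokens = []
        if tag != "O":
            current_tokens.append(token)
        current_tag = tag
    if current_tokens:
        entities.append((" ".join(current_tokens), current_tag))
    return entities


entities = extract_entities(TRAINING_SENTENCE)
```

     With plain tags like these, two different places standing next to each other would be merged into one entity, which is one reason for position-aware tags such as B-, I-, L- and U-.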
  
Classifiers – supervised training

     •  Training input
             •  Features extracted from each token
                  token: text, PoS tag, token class
                  prev token: text, PoS tag, token class
                  next token: text, PoS tag, token class
                  previous tags assigned

             •  Token class examples
                  lowercase alphabetic, digits, contains a number and a letter, contains a number and
                  a hyphen, all caps, all caps with dots in between ...

     •  Training output
             •  <place> <location> <person>
             •  <B-place> <I-place> <L-place> <U-place>
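     A rough sketch of the training input and output described above. The token classes are a subset of those listed, the feature names are invented for illustration, and the B/I/L/U prefixes are read here as Begin, Inside, Last and Unit (single-token name), which is the usual interpretation but is not spelled out on the slide.

```python
import re


def token_class(text):
    """Coarse orthographic class of a token (a subset of the listed classes)."""
    if text.isdigit():
        return "digits"
    if re.fullmatch(r"(?:[A-Z]\.)+", text):
        return "all-caps-with-dots"
    if text.isalpha() and text.isupper():
        return "all-caps"
    if text.isalpha() and text.islower():
        return "lowercase"
    if any(c.isdigit() for c in text) and "-" in text:
        return "number-and-hyphen"
    if any(c.isdigit() for c in text) and any(c.isalpha() for c in text):
        return "number-and-letter"
    return "other"


def features(tokens, pos_tags, i, previous_tags):
    """Features for token i: itself, its neighbours, and tags assigned so far."""
    feats = {}
    for name, j in (("token", i), ("prev", i - 1), ("next", i + 1)):
        if 0 <= j < len(tokens):
            feats[f"{name}.text"] = tokens[j]
            feats[f"{name}.pos"] = pos_tags[j]
            feats[f"{name}.class"] = token_class(tokens[j])
    feats["previous_tags"] = tuple(previous_tags[-2:])
    return feats


def to_bilou(tags):
    """Convert plain name tags to B-/I-/L-/U- tags; "O" stays "O"."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        starts = i == 0 or tags[i - 1] != tag
        ends = i == len(tags) - 1 or tags[i + 1] != tag
        prefix = "U" if starts and ends else "B" if starts else "L" if ends else "I"
        out.append(f"{prefix}-{tag}")
    return out


example = features(["He", "went", "to"], ["Pronoun", "Verb", "Prep."], 0, [])
bilou = to_bilou(["O", "O", "O", "place", "place", "O"])
```

     A classifier is then trained to predict the tag of each token from its feature dictionary, one token at a time, with the tags it has already assigned fed back in as features.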
Classifiers – approaches

     „Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w
     Sheratonie”
     ("The Jan Nowak Warsaw Bridge Club is organising a tournament at the Sheraton")

     Location? Organisation name? Person name?

     •  One classifier for all name types
             •  faster
             •  automatically resolves conflicts

     •  One classifier per name type
             •  slower, memory consuming
             •  provides more information
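     With one classifier per name type, overlapping candidates have to be reconciled in a separate step. The sketch below shows one possible rule, keeping the highest-scoring span wherever candidates overlap. The rule, the span layout and the scores are assumptions for illustration; the slides do not describe how conflicts are actually resolved.

```python
def resolve_conflicts(spans):
    """Keep the highest-scoring span wherever candidate spans overlap.

    Each span is (start, end, label, score), with token offsets and an
    exclusive end.
    """
    kept = []
    for span in sorted(spans, key=lambda s: s[3], reverse=True):
        start, end = span[0], span[1]
        # Accept the span only if it overlaps nothing accepted so far.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# Tokens: Warszawskie Koło Brydżowe im. Jana Nowaka organizuje turniej w Sheratonie
# Hypothetical outputs of three per-type classifiers, with made-up scores.
candidates = [
    (0, 6, "organisation", 0.80),  # Warszawskie Koło Brydżowe im. Jana Nowaka
    (0, 1, "location", 0.55),      # Warszawskie
    (4, 6, "person", 0.60),        # Jana Nowaka
    (9, 10, "location", 0.70),     # Sheratonie
]
resolved = resolve_conflicts(candidates)
```

     The discarded candidates are the "more information" that per-type classifiers provide: an application could keep them as secondary readings instead of dropping them.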
EXAMPLES
Naive approach

 People's names often coincide with location names:
 - Kazimierz
 - Washington

 Company names may come from common language:
 - Oracle
 - Dialog

 Conclusion: dictionaries are not enough
 without contextual analysis
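 A minimal sketch of the naive dictionary approach shows where it breaks down. The dictionary contents are a toy example built from the names on this slide plus London from the earlier example; the lookup returns every reading and has no way to choose between them.

```python
# Toy dictionary: each known name maps to all of its possible readings.
GAZETTEER = {
    "Kazimierz": {"person", "location"},
    "Washington": {"person", "location"},
    "Oracle": {"company", "common word"},
    "Dialog": {"company", "common word"},
    "London": {"location"},
}


def dictionary_lookup(tokens):
    """Return every dictionary reading for each known token."""
    return {t: sorted(GAZETTEER[t]) for t in tokens if t in GAZETTEER}


def ambiguous(tokens):
    """Tokens for which the dictionary alone cannot decide on a type."""
    return [t for t, readings in dictionary_lookup(tokens).items() if len(readings) > 1]


sentence = "Washington flew to London".split()
readings = dictionary_lookup(sentence)
undecided = ambiguous(sentence)
```

 Here "London" resolves cleanly, but "Washington" stays undecided between a person and a location; only the surrounding words ("flew to") can settle it, which is what the contextual features of a statistical classifier capture.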
Findwise implementation
QUESTIONS?
Paweł Wróblewski
pawel.wroblewski@findwise.com

Marcin Goss
marcin.goss@findwise.com
