Apache UIMA - Hands on code
  Gestione delle Informazioni su Web - 2010/2011
                  Tommaso Teofili
          tommaso [at] apache [dot] org
Use Cases - Agenda

UC1 : Real Estate market analysis

UC2 : Automatic information extraction from tenders

UIMA & search engines

Tutorial

Assignment
UC1 : Source


An online announcement site for sellers and
buyers

Wide purpose (cars, RE, hi-fi, etc...)

Local scope (Rome and nearby)
UC1 - Goals

Track real estate market in order to:

  Take smart decisions

  Predict how things will go in the (near) future

Estate listing text is unstructured

Aggregate queries for statistical analysis need
structured information
UC1 - Source
UC1 - Blocks
UC1 - Crawler
A specialized crawler extracts data from the source

Estate listing data is stored in files, grouped by zone,
in a directory on a managed machine

Site navigation is defined with one XML file for each
city zone

The crawler downloads page fragments twice a
week

The free text extracted from the estate listings is saved
as XML, grouped by zone
UC1 - Crawler

Issues:

  Cookies had to be enabled

  Some HTTP headers were required

  Fixed sleep intervals had to be inserted
  between requests
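
The three crawler issues above can be sketched with the Python standard library. This is a minimal illustration, not the crawler used in the project: the header values, the delay, and the function names `build_opener` and `fetch_all` are assumptions made for the example.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def build_opener(headers):
    """Return an opener that keeps cookies and sends the given headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # addheaders is the list of (name, value) pairs sent with every request.
    opener.addheaders = list(headers.items())
    return opener


def fetch_all(urls, fetch, delay_seconds, sleep=time.sleep):
    """Fetch each URL in order, pausing a fixed interval between requests."""
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # fixed pause, skipped before the first request
        pages.append(fetch(url))
    return pages
```

In a real run, `fetch` could be something like `lambda url: opener.open(url).read()`. Passing `fetch` and `sleep` as parameters keeps the pacing logic testable without network access.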
UC1 - Domain


Announcement

Zone

MagazineNumber

HouseStructure (with properties)
UC1 - Information Extraction Engine
Goal: extract price, zone and telephone
number

The first version used huge regular
expressions

Hard to maintain and inefficient

Poor extraction quality
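
To give an idea of what a regular-expression-based extractor looks like, here is a small sketch. The patterns are hypothetical and much simpler than the ones the slides call "huge": a price is assumed to be a dotted number after a euro marker, a phone number is assumed to be 9 to 11 digits starting with 0 or 3, and a zone is assumed to be the first word written in capitals.

```python
import re

# Hypothetical patterns, for illustration only.
PRICE = re.compile(r"(?:€|\beuro\b|\be\b)\s*(\d{1,3}(?:\.\d{3})+)", re.IGNORECASE)
PHONE = re.compile(r"\b[03]\d{8,10}\b")
ZONE = re.compile(r"\b[A-Z]{3,}\b")


def extract(text):
    """Return price (as an integer), zone and phone, or None when not found."""
    price = PRICE.search(text)
    phone = PHONE.search(text)
    zone = ZONE.search(text)
    return {
        "price": int(price.group(1).replace(".", "")) if price else None,
        "zone": zone.group(0) if zone else None,
        "phone": phone.group(0) if phone else None,
    }
```

Even in this reduced form, each new listing style needs another alternative in a pattern, which hints at why the approach became hard to maintain.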
UC1 - IE Engine

New requirements: extract the structure of
the house

  Number of rooms, garage (box), garden(s), external
  spaces, number of bathrooms, kitchen,
  etc.

  Track finer-grained zones
Sample text


“ven 26 Dic APPIA via grottaferrata metro 2°
piano assolato ingresso salone americana
cucina camera cameretta bagno soppalco
posto auto € 295.000”
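
One simple way to recover the house structure from a listing like this is a dictionary lookup over known terms. The sketch below is an assumption about how that could work, not the annotators used in the project; the `FEATURES` mapping and its English feature names are hypothetical.

```python
import re

# Hypothetical mapping from Italian listing terms to house-structure features.
FEATURES = {
    "salone": "living_room",
    "cucina": "kitchen",
    "camera": "bedroom",
    "cameretta": "small_bedroom",
    "bagno": "bathroom",
    "soppalco": "mezzanine",
    "posto auto": "parking_space",
}


def house_structure(text):
    """Count whole-word occurrences of each known feature term in a listing."""
    lowered = text.lower()
    counts = {}
    for term, feature in FEATURES.items():
        hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
        if hits:
            counts[feature] = len(hits)
    return counts
```

Matching on word boundaries keeps "camera" and "cameretta" separate, so each room type is counted on its own.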
UC1 - ContentAnnotator

 From the XML produced by the crawler, only the
 estate listings must be extracted

 A simple parser gets each node containing
 an estate listing (whose text is itself
 unstructured)

 A ContentAnnotation is created over the
 document
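
The idea can be sketched as follows. The crawler's XML layout is not shown in the slides, so the `<zone name="...">` and `<listing>` elements here are assumptions, and `ContentAnnotation` is a plain Python stand-in with begin and end offsets rather than a real UIMA type.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    begin: int  # offset of the first character of the listing
    end: int    # offset just past the last character
    zone: str


def annotate_content(xml_text):
    """Build a plain-text document from listing nodes and annotate each one."""
    root = ET.fromstring(xml_text)
    parts, annotations, offset = [], [], 0
    for zone in root.iter("zone"):
        for listing in zone.iter("listing"):
            text = (listing.text or "").strip()
            if not text:
                continue
            annotations.append(
                ContentAnnotation(offset, offset + len(text), zone.get("name", ""))
            )
            parts.append(text)
            offset += len(text) + 1  # +1 for the newline that joins the parts
    return "\n".join(parts), annotations
```

Downstream annotators can then work only on the spans covered by a `ContentAnnotation` and ignore everything else in the crawled file.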
ContentAnnotation
UC1 - Entities
UC1 - ZoneAnnotation
UC1 - Consuming extracted information
The previous version of the IE engine
produced XML files that had to be
parsed again to store the structured data in the
DB

With UIMA, a CAS Consumer at the end of
the analysis pipeline can write the
extracted information to the DB automatically
UIMA - CAS Consumer
An Analysis Engine responsible for consuming the
information contained in the CAS

Can write extracted information to:

  DBMS

  Lucene index

  Filesystem

  ...
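
As a rough illustration of the consumer role, the sketch below writes annotations to a database with Python's `sqlite3` module. It does not use the UIMA API: annotations are assumed to be simple `(type, begin, end)` tuples, and the table layout is hypothetical.

```python
import sqlite3


def consume(connection, doc_text, annotations):
    """Store each (type, begin, end) annotation together with its covered text."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS annotation "
        "(type TEXT, begin_offset INTEGER, end_offset INTEGER, covered_text TEXT)"
    )
    rows = [(kind, begin, end, doc_text[begin:end]) for kind, begin, end in annotations]
    connection.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)
```

Because the consumer runs at the end of the pipeline, no intermediate XML file has to be written and parsed again before the data reaches the database.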
UC1 - Analysis Graphs
UC2 - Monitor of EU announcements

Monitor various sources which provide
announcements and tenders

Automate the long process of monitoring such
sources and automatically extract useful
common information from the announcement
texts
UC2 - Blocks
Different input texts
UC2 - Domain annotations

Language

Abstract

Activity

Beneficiary

Budget

Expiration date

Funding type

Geographic region

Sector

Subject

Title

Tags
UC2 - Domain entities


First and most important is an entity that
represents the entire tender or
announcement

The domain annotations ultimately fill
the properties of this entity
UC2 - Simple first

Each annotator first checks:

   whether some metadata was already extracted during navigation

   for the most common pattern used to state
   information inside such announcements

e.g.: “Budget: 200000$” or “Financial information: ......”

Such patterns are common across different languages
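
The "simple first" strategy can be sketched as a lookup that prefers metadata collected during navigation and only then falls back to "Label: value" lines. The label lists and field names below are assumptions for illustration; a real list would cover more labels and languages.

```python
import re

# Hypothetical label variants per field.
LABELS = {
    "budget": ["budget", "financial information", "bilancio"],
    "deadline": ["deadline", "expiration date", "scadenza"],
}


def extract_labeled_fields(text, metadata=None):
    """Prefer crawl-time metadata, then fall back to 'Label: value' lines."""
    found = dict(metadata or {})
    for field, labels in LABELS.items():
        if field in found:
            continue  # already known from navigation, no need to search the text
        alternatives = "|".join(re.escape(label) for label in labels)
        match = re.search(
            rf"^\s*(?:{alternatives})\s*:\s*(.+)$",
            text,
            re.IGNORECASE | re.MULTILINE,
        )
        if match:
            found[field] = match.group(1).strip()
    return found
```

More expensive linguistic analysis only needs to run for the fields that this first pass leaves empty.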
UC2 - AbstractAnnotator
 The abstract is usually in the first part of the
 document

 We use a Tokenizer and a Tagger to get Tokens (with
 PoS tags) and Sentences

 We use a dictionary of “good” words and linguistic
 patterns

 We search the first sentences of the document
 for the objectives of the announcement
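
A much reduced version of this heuristic is sketched below. It swaps the Tokenizer and Tagger for plain regular expressions and uses no PoS tags or linguistic patterns, so it only shows the "first sentences plus good words" part. The word list and the `max_sentences` default are hypothetical.

```python
import re

# Hypothetical dictionary of "good" words that signal an objective.
GOOD_WORDS = {"aims", "objective", "objectives", "purpose", "goal", "supports"}


def find_abstract(text, max_sentences=3):
    """Return the first early sentence that contains a 'good' word, if any."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences[:max_sentences]:
        tokens = {token.lower() for token in re.findall(r"\w+", sentence)}
        if tokens & GOOD_WORDS:
            return sentence
    return None
```

Limiting the search to the first few sentences reflects the observation that the abstract usually sits at the start of the document.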
UC2 - ExpirationDateAnnotator


  A DateAnnotator runs first

  Iterate over the DateAnnotations

  Get the sentence that covers each DateAnnotation

  Check whether terms or patterns like “the
  deadline is ...” appear near the DateAnnotation
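
These steps can be sketched without UIMA as follows. The date format (`dd/mm/yyyy`), the naive sentence splitter, and the list of deadline terms are all assumptions made for the example; "near" is simplified to "in the same sentence".

```python
import re

# Hypothetical date format and deadline vocabulary.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
DEADLINE_TERMS = ("deadline", "expires", "no later than", "closing date")


def find_expiration_date(text):
    """Return the first date whose covering sentence mentions a deadline term."""
    sentences = list(re.finditer(r"[^.!?]+[.!?]?", text))
    for date in DATE.finditer(text):
        for sentence in sentences:
            covers = sentence.start() <= date.start() and date.end() <= sentence.end()
            if covers and any(term in sentence.group(0).lower() for term in DEADLINE_TERMS):
                return date.group(0)
    return None
```

Running the generic date step first means this annotator only has to decide which of the already found dates is the expiration date.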
UC2 - BandoEntity
UIMA & Search Engines

 Decorate documents with automatically
 extracted metadata to improve search
 experience

   relevance

   results

   clustering
Information Retrieval and Named Entities
UIMA & Search Engines
 “Push” scenario:

    documents are sent to UIMA, which extracts metadata and
    writes it to the index with a CAS Consumer

 “Pull” scenario:

    documents are sent to Lucene, which asks UIMA to extract
    metadata for them and then writes it to the
    index itself

 “On demand” scenario:

    metadata is extracted only on demand, each time a
    document is retrieved or shown
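
The difference between the three scenarios is mainly who triggers the analysis and when. The toy sketch below makes that explicit with an in-memory index. Neither Lucene nor UIMA is used: `Index` and `extract_metadata` are hypothetical stand-ins, and the "analysis" just collects capitalized words.

```python
def extract_metadata(text):
    """Stand-in for the analysis pipeline: collects capitalized words."""
    return sorted({word.strip(".,") for word in text.split() if word[:1].isupper()})


class Index:
    """Stand-in for a search index; an optional enricher enables 'pull' mode."""

    def __init__(self, enricher=None):
        self.docs = {}
        self.enricher = enricher

    def add(self, doc_id, text, metadata=None):
        if metadata is None and self.enricher is not None:
            metadata = self.enricher(text)  # "pull": the index asks for metadata
        self.docs[doc_id] = {"text": text, "metadata": metadata}


def push(index, doc_id, text):
    """'Push': analysis runs first, then a consumer writes to the index."""
    index.add(doc_id, text, extract_metadata(text))


def on_demand(index, doc_id):
    """'On demand': metadata is computed when the document is retrieved."""
    doc = index.docs[doc_id]
    return {"text": doc["text"], "metadata": extract_metadata(doc["text"])}
```

Push and pull pay the analysis cost once at indexing time, while on demand pays it on every retrieval but stores nothing extra in the index.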
UIMA - tutorial


create a Type System

create an Analysis Engine descriptor

create a simple Annotator
Assignment

Named Entity Recognition

  sport: person, player, coach, team,
  competition

  videogames: person, videogame character,
  videogame, software house, hardware
  requirement

Precision & Recall test
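
For the test, precision is the fraction of predicted annotations that are correct, and recall is the fraction of expected annotations that were found. A minimal sketch, assuming annotations are compared by exact match on hypothetical `(begin, end, type)` tuples:

```python
def precision_recall(predicted, gold):
    """Exact-match scoring over (begin, end, type) annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Exact matching is the strictest choice: an annotation with the right span but the wrong type, or the right type but a shifted span, counts as an error for both measures.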
