1. Apache UIMA - hands-on code
Gestione delle Informazioni su Web - 2010/2011
Tommaso Teofili
tommaso [at] apache [dot] org
2. Use Cases - Agenda
UC1 : Real estate market analysis
UC2 : Automatic information extraction from tenders
UIMA & search engines
Tutorial
Assignment
3. UC1 : Source
An online announcement site for sellers and
buyers
General purpose (cars, real estate, hi-fi, etc.)
Local scope (Rome and nearby)
4. UC1 - Goals
Track real estate market in order to:
Make smart decisions
Predict how things will go in the (near) future
The text of estate listings is unstructured
Aggregate queries for statistical analysis need structured information
7. UC1 - Crawler
A specialized crawler extracts data from the source
Estate listing data is stored in files, grouped by zone, in a directory on a managed machine
Site navigation is defined with one XML file per city zone
The crawler downloads page fragments twice a week
The free text extracted from the estate listings is saved as XML, grouped by zone
8. UC1 - Crawler
Issues :
Cookies must be enabled
Some HTTP headers are required
Fixed sleep intervals had to be added between requests
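The three crawler issues above can be sketched in a few lines. This is only an illustration in Python using the standard library, not the crawler used in the project: the header values, the 5-second pause and the function names build_opener and fetch_all are all assumptions.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header values: the slides do not say which headers the site required.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0", "Accept-Language": "it-IT,it;q=0.8"}


def build_opener(headers=DEFAULT_HEADERS):
    """Return an opener that keeps cookies across requests and sends fixed headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(headers.items())
    return opener


def fetch_all(urls, fetch, pause_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, sleeping a fixed interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # fixed pause before every request except the first
        pages.append(fetch(url))
    return pages
```

A real run would pass something like `lambda url: opener.open(url).read()` as `fetch`; keeping `fetch` and `sleep` as parameters makes the pacing logic easy to test without network access.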
10. UC1 - Information Extraction Engine
Goal : extract price, zone and telephone number
The first version used huge regular expressions
Hard to maintain and inefficient
Poor extraction quality
11. UC1 - IE Engine
New requirements: extract the structure of the house
Number of rooms, garage ("box"), garden(s), external spaces, number of bathrooms, kitchen, etc.
Track more fine-grained zones
12. Sample text
“ven 26 Dic APPIA via grottaferrata metro 2°
piano assolato ingresso salone americana
cucina camera cameretta bagno soppalco
posto auto e 295.000”
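As a rough illustration of what the IE engine has to recover from a listing like the one above, here is a minimal Python sketch. It is not the UIMA annotators from the project: the regular expressions, the keyword table FEATURE_TERMS and the rule "the first all-uppercase word is the zone" are assumptions made only for this example.

```python
import re

# Sample listing from the slide (the floor marker is written as "°" here).
SAMPLE = ("ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso "
          "salone americana cucina camera cameretta bagno soppalco posto auto e 295.000")

# Prices written with "." as thousands separator, e.g. 295.000
PRICE_RE = re.compile(r"\b\d{1,3}(?:\.\d{3})+\b")
# Assumption: the zone is the first word written entirely in capitals.
ZONE_RE = re.compile(r"\b[A-Z]{3,}\b")
# Hypothetical keyword table mapping Italian terms to feature names.
FEATURE_TERMS = {
    "salone": "living_room",
    "cucina": "kitchen",
    "camera": "bedroom",
    "cameretta": "small_bedroom",
    "bagno": "bathroom",
    "soppalco": "mezzanine",
    "posto auto": "parking_space",
}


def extract_listing(text):
    """Return price, zone and counted house features found in one listing."""
    price_match = PRICE_RE.search(text)
    zone_match = ZONE_RE.search(text)
    lowered = text.lower()
    features = {}
    for term, name in FEATURE_TERMS.items():
        count = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if count:
            features[name] = count
    return {
        "price": int(price_match.group().replace(".", "")) if price_match else None,
        "zone": zone_match.group() if zone_match else None,
        "features": features,
    }
```

Small, separate rules like these are easier to maintain than one huge regular expression, which is the motivation the slides give for moving to dedicated annotators.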
13. UC1 - ContentAnnotator
Only the estate listings must be extracted from the XML produced by the crawler
A simple parser retrieves each node containing an estate listing (whose text is itself unstructured)
Create a ContentAnnotation over the document
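The idea of the ContentAnnotator can be sketched outside UIMA as follows. The XML layout (a zone element containing listing elements) and the ContentAnnotation class below are assumptions for illustration; the real crawler output and the real UIMA type are not shown in the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    """Span (begin inclusive, end exclusive) of one listing in the document text."""
    begin: int
    end: int


def annotate_listings(xml_text, tag="listing"):
    """Collect the text of every listing node and annotate its span."""
    root = ET.fromstring(xml_text)
    parts, annotations, offset = [], [], 0
    for node in root.iter(tag):
        text = (node.text or "").strip()
        if not text:
            continue  # skip empty nodes
        annotations.append(ContentAnnotation(offset, offset + len(text)))
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline that joins the listings
    return "\n".join(parts), annotations
```

Downstream annotators can then work on one annotated span at a time instead of on the whole crawled file.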
17. UC1 - Consuming extracted information
The previous version of the IE engine produced XML files that had to be parsed again to store structured data in the DB
With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB
18. UIMA - CAS Consumer
Analysis Engine responsible for consuming
information contained inside the CAS
Can write extracted information to:
DBMS
Lucene index
Filesystem
...
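To make the DBMS case concrete, here is a minimal sketch of the consuming step in Python with SQLite. It does not use the UIMA API: the consume function, the listing table and its columns are assumptions chosen to match the fields named in UC1 (zone, price, telephone number).

```python
import sqlite3


def consume(records, connection):
    """Write extracted records (dicts with zone, price, phone) to the database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS listing (zone TEXT, price INTEGER, phone TEXT)"
    )
    connection.executemany(
        "INSERT INTO listing (zone, price, phone) VALUES (:zone, :price, :phone)",
        records,
    )
    connection.commit()


def average_price_by_zone(connection):
    """Aggregate query of the kind the market analysis needs."""
    cursor = connection.execute(
        "SELECT zone, AVG(price) FROM listing GROUP BY zone ORDER BY zone"
    )
    return cursor.fetchall()
```

Once the structured fields are in a table, the aggregate queries mentioned in the UC1 goals become plain SQL, with no intermediate XML to parse again.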
21. UC2 - Monitor of EU announcements
Monitor various sources which provide announcements and tenders
Automate the long process of monitoring such sources and automatically extract useful common information from the announcements’ texts
26. UC2 - Domain annotations
Language
Abstract
Activity
Beneficiary
Budget
Expiration date
Funding type
Geographic region
Sector
Subject
Title
Tags
27. UC2 - Domain entities
First and most important is an entity that represents the entire tender or announcement
The domain annotations ultimately fill the properties of this entity
28. UC2 - Simple first
Each annotator first checks:
whether some metadata was extracted during navigation
for the most common patterns used to state information in such announcements
e.g.: “Budget: 200000$” or “Financial information: ......”
Such patterns are common across different languages
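The "simple first" strategy can be sketched as a lookup that tries crawl metadata before falling back to a "Label: value" line in the text. This is an illustrative Python sketch, not the project's annotators; find_field, the metadata dictionary and the regular expression are assumptions.

```python
import re

# One "Label: value" pair per line, e.g. "Budget: 200000$"
LABEL_VALUE_RE = re.compile(
    r"^\s*(?P<label>[A-Za-z ]+?)\s*:\s*(?P<value>.+?)\s*$", re.MULTILINE
)


def find_field(label, text, metadata=None):
    """Return the value for a label: crawl metadata first, then a labelled line."""
    key = label.lower()
    if metadata and key in metadata:
        return metadata[key]
    for match in LABEL_VALUE_RE.finditer(text):
        if match.group("label").strip().lower() == key:
            return match.group("value")
    return None  # more expensive analysis would start here
```

For example, `find_field("Budget", "Call for proposals\nBudget: 200000$")` returns the string "200000$"; supporting other languages would mean adding the corresponding labels.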
29. UC2 - AbstractAnnotator
The abstract is usually in the first part of the document
We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences
We use a dictionary of “good” words and linguistic patterns
We search the first sentences of the document for the objectives of the announcement
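A much simplified version of this idea is sketched below in Python. A regular-expression sentence splitter stands in for the Tokenizer and Tagger, PoS tags and linguistic patterns are left out, and the GOOD_WORDS dictionary is invented for the example.

```python
import re

# Hypothetical dictionary of "good" words that hint at an objective.
GOOD_WORDS = {"aim", "aims", "objective", "objectives", "support", "promote", "fund"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, first_n=3, good_words=GOOD_WORDS):
    """Keep the sentences among the first ones that contain a 'good' word."""
    selected = []
    for sentence in split_sentences(text)[:first_n]:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & good_words:
            selected.append(sentence)
    return " ".join(selected)
```

Restricting the search to the first few sentences reflects the observation that the abstract usually sits at the beginning of the document.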
30. UC2 - ExpirationDateAnnotator
A DateAnnotator is executed first
Iterate over the DateAnnotations
Get the sentences wrapping such DateAnnotations
Check if some terms or patterns like “the deadline is ...” appear near a DateAnnotation
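The steps above can be sketched as follows. This Python example does not use UIMA: a date regular expression stands in for the DateAnnotator, a naive splitter stands in for sentence detection, and the DEADLINE_TERMS list is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    begin: int
    end: int


DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")  # stand-in for the DateAnnotator
DEADLINE_TERMS = ("deadline", "expires", "no later than", "closing date")


def date_annotations(text):
    return [Annotation(m.start(), m.end()) for m in DATE_RE.finditer(text)]


def sentence_annotations(text):
    # Naive sentences: runs of text ending at '.', '!' or '?'
    return [Annotation(m.start(), m.end()) for m in re.finditer(r"[^.!?]+[.!?]?", text)]


def expiration_dates(text):
    """Return dates whose wrapping sentence contains a deadline term."""
    sentences = sentence_annotations(text)
    found = []
    for date in date_annotations(text):
        for sentence in sentences:
            if sentence.begin <= date.begin and date.end <= sentence.end:
                covered = text[sentence.begin:sentence.end].lower()
                if any(term in covered for term in DEADLINE_TERMS):
                    found.append(text[date.begin:date.end])
                break  # a date belongs to exactly one sentence
    return found
```

Running a generic date annotator first and then filtering by context keeps the date recognition reusable for other annotators.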
34. UIMA & Search Engines
“Push” scenario:
documents are sent to UIMA, which extracts metadata and writes it to the index with a CAS Consumer
“Pull” scenario:
documents are sent to Lucene, which asks UIMA to extract metadata for it and then writes it to the index itself
“On demand” scenario:
metadata is extracted only on demand, each time a document is retrieved/shown...
35. UIMA - tutorial
create a Type System
create an Analysis Engine descriptor
create a simple Annotator