Text Mining and SEASR
1. Introduction to SEASR and Text Mining
UIUC/NCSA
Feb 4, 2009
Loretta Auvil
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
3. SEASR: Reach + Relevance + Reuse + Repeatability
SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems
– Semantics-driven environment for SOA interoperability
– Encourages sharing and participation for building communities
– Modular construction allows flows to be modified and configured, encouraging reusability within and across domains
– Enables mashup and integration of tools
– Data-intensive flows can be executed on a simple desktop or a large cluster without modification
– Computation can be created for distributed execution on servers where the content lives
– Gives users control over trust and compliance with the copyright licenses required by content
– Relies on the standardized Resource Description Framework (RDF) to define components and flows
5. Workbench
• Web-based UI
• Components and flows are retrieved from the server
• Additional locations of components and flows can be added to the server
• Create flows using a graphical drag-and-drop interface
• Change property values
• Execute the flow
7. SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
– Citation Analysis uses the JUNG network-importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
– Zotero export to Fedora through SEASR
– Saves results from SEASR Analytics to a collection
• Launch MONK Processing
– MONK DB Ingestion Workflow
10. SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10-second moving window of audio (see the sketch below)
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of audio via spectral analysis
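As an illustration of the moving-window step, here is a minimal Python sketch that computes one feature (RMS energy) per 10-second window; NEMA's actual feature extractors are not specified in these slides, so the feature choice is an assumption.

# Minimal sketch of windowed audio feature extraction (hypothetical;
# the slides do not specify NEMA's actual feature extractors).
import numpy as np

def windowed_rms(samples, sample_rate, window_sec=10.0, hop_sec=10.0):
    """Compute RMS energy for each moving window of audio."""
    win = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    features = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        features.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(features)

# Example: 60 seconds of synthetic audio at 22,050 Hz -> 6 windows
audio = np.random.randn(60 * 22050).astype(np.float32)
print(windowed_rms(audio, 22050))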
12. SEASR @ Work – DISCUS
• On-demand usage of analytics while surfing
– While navigating, request analytics to be performed on a page
– Text extraction and cleaning
• Summarization and keyword extraction
– List the important terms on the page being analyzed
– Provide relevant short summaries
• Visual maps
– Provide a visual representation of the key concepts
– Show the graph of relations between concepts
20. Some Examples
• Authorship Analysis (JUNG network-importance algorithms rank the authors in the citation network)
• Author Centrality Analysis
– Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
• Author Degree Analysis
– Uses AuthorDegreeDistributionAnalysis, which ranks each author by the number of coauthors
• Author HITS Analysis
– The *hubness* of a node is the degree to which it links to other important authorities; the *authoritativeness* of a node is the degree to which it is pointed to by important hubs
• Readability
– Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
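The slides use JUNG (a Java library); the following Python sketch reproduces the same ideas with networkx on a toy graph, plus the published Flesch-Kincaid grade formula. The graph data are hypothetical.

# JUNG analogue in Python: betweenness, degree, and HITS with networkx
import networkx as nx

# Toy citation/coauthorship graph (hypothetical data)
G = nx.DiGraph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),
    ("Dave", "Carol"), ("Carol", "Alice"),
])

# Betweenness centrality: rank authors by the number of
# shortest paths that pass through them
print(nx.betweenness_centrality(G))

# Degree: rank authors by their number of coauthors/citations
print(dict(G.degree()))

# HITS: hubness = linking to important authorities;
# authoritativeness = being linked to by important hubs
hubs, authorities = nx.hits(G, max_iter=500)
print(hubs)
print(authorities)

# Flesch-Kincaid grade level (standard published formula)
def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

print(flesch_kincaid_grade(words=100, sentences=5, syllables=140))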
22. Text Mining Definition
Many definitions appear in the literature:
• “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data”
• An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
• What is “previously unknown” information?
– Strict definition
• Information that not even the writer knows
– Lenient definition
• Rediscovering the information that the author encoded in the text
23. Text Mining Process
• Text Preprocessing
– Syntactic text analysis
– Semantic text analysis
• Features Generation (see the sketch after this list)
– Bag of words
– N-grams
• Feature Selection
– Simple counting
– Statistics
– Selection based on POS
• Text/Data Mining
– Classification (supervised learning)
– Clustering (unsupervised learning)
– Information extraction
• Analyzing Results
– Visual exploration, discovery, and knowledge extraction
– Query-based question answering
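A minimal sketch of the features-generation step (bag of words and n-grams), using scikit-learn as one common choice; the slides do not prescribe a library.

# Bag-of-words and n-gram feature generation with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the lord of the rings", "the king saw the rabbit"]

# Bag of words: each word becomes a dimension, values are counts
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(X.toarray())

# N-grams: unigrams and bigrams as features
ngrams = CountVectorizer(ngram_range=(1, 2))
ngrams.fit(docs)
print(ngrams.get_feature_names_out())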
24. Text Characteristics (1)
• Large textual databases
– Enormous wealth of textual information on the Web
– Publications are electronic
• High dimensionality
– Consider each word/phrase as a dimension
• Noisy data
– Spelling mistakes
– Abbreviations
– Acronyms
• Text collections are very dynamic
– Web pages are constantly being generated (and removed)
– Web pages are generated from database queries
• Not well-structured text
– Email/chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
– Speech
25. Text Characteristics (2)
• Dependency
– Relevant information is a complex conjunction of words/phrases
– Order of words in the query matters
• hot dog stand in the amusement park
• hot amusement stand in the dog park
• Ambiguity
– Word ambiguity
• Pronouns (he, she, …)
• Synonyms (buy, purchase)
• Words with multiple meanings (bat – related to baseball or a mammal)
– Semantic ambiguity
• The king saw the rabbit with his glasses. (multiple readings)
• Authority of the source
– IBM is more likely to be an authoritative source than my distant second cousin
27. Syntactic Analysis
• Tokenization
– A text document is represented by the words it contains (and their occurrences)
– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Lemmatization/Stemming
– Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
– Reduces dimensionality
– Identifies a word by its root
– e.g., flying, flew → fly
• Stop words
– Identifies the most common words, which are unlikely to help with text mining
– e.g., “the”, “a”, “an”, “you”
• Parsing / Part-of-Speech (POS) tagging
– Generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph
– Finds the corresponding POS for each word
– e.g., John (noun) gave (verb) the (det) ball (noun)
– Shallow parsing: analysis of a sentence that identifies the constituents (noun groups, verbs, ...) but does not specify their internal structure or their role in the main sentence
– Deep parsing: more sophisticated syntactic, semantic, and contextual processing, e.g., to extract or construct an answer
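A sketch of these syntactic steps with NLTK, one possible toolkit (the slides are library-agnostic); it assumes the punkt, stopwords, wordnet, and averaged_perceptron_tagger NLTK data packages are installed.

# Tokenization, stop-word removal, stemming/lemmatization, POS tagging
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "John gave the ball while the birds were flying"

# Tokenization: represent the document by the words it contains
tokens = nltk.word_tokenize(text)

# Stop-word removal: drop very common, uninformative words
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.lower() not in stops]

# Stemming vs. lemmatization: reduce words to a root or headword
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in content])                    # flying -> fli
print([lemmatizer.lemmatize(t, pos="v") for t in content])   # flying -> fly

# POS tagging: find the part of speech of each word
print(nltk.pos_tag(tokens))  # e.g. John/NNP gave/VBD the/DT ball/NN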
28. Semantic Analysis: Information Extraction
• Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output it in a predetermined format
29. Information Extraction
Information Type – State of the art (accuracy)
• Entities (90–98%): an object of interest, such as a person or organization
• Attributes (80%): a property of an entity, such as its name, alias, descriptor, or type
• Facts (60–70%): a relationship held between two or more entities, such as the position of a person in a company
• Events (50–60%): an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
30. Information Extraction Approaches
• Terminology (name) lists
– This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
– This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas); see the sketch below
• Use of characteristic patterns
– This works fairly well for novel entities
– Rules can be created by hand or learned via machine-learning or statistical algorithms
– Rules capture local patterns that characterize entities from instances of annotated training data
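A minimal sketch of the internal-format approach for dates: a hand-written rule that recognizes the DD/MM/YY pattern.

# Rule-based recognition of dates by their internal format
import re

DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

text = "The meeting on 04/02/09 was rescheduled to 11/02/09."
for match in DATE_PATTERN.finditer(text):
    day, month, year = match.groups()
    print(f"date entity: {match.group(0)} (day={day}, month={month}, year={year})")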
31. Information Extraction
Relation (Event) Extraction
• Identify (and tag) the relation between two entities:
– A person is_located_at a location (news)
– A gene codes_for a protein (biology)
• Relations require more information
– Identification of the two entities and their relationship
– Predicted relation accuracy compounds the per-element accuracies:
• Pr(E1) × Pr(E2) × Pr(R) ≈ 0.93 × 0.93 × 0.93 ≈ 0.80
• Information in relations is less local
– Contextual information is a problem: the right word may not be explicitly present in the sentence
– Events involve more relations and are even harder
32. Semantic Analytics
Named Entity (NE) Tagging
Mayor [Rex Luthor]NE:Person announced [today]NE:Time the establishment of a new research facility in [Alderwood]NE:Location. It will be known as [Boynton Laboratory]NE:Organization.
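The slides do not name a specific tagger; as an illustration, spaCy's pretrained pipeline produces this style of annotation (assumes the en_core_web_sm model is installed).

# Named entity tagging with a pretrained spaCy pipeline
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mayor Rex Luthor announced today the establishment of a "
          "new research facility in Alderwood. It will be known as "
          "Boynton Laboratory.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Rex Luthor PERSON, today DATE, ...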
35. Semantic Analysis
Semantic Role Analysis
[Mayor Rex Luthor]ACTOR [announced]ACTION [today]WHEN [the establishment of a new research facility]OBJECT [in Alderwood]WHERE. [It]OBJECT will be [known as]ACTION [Boynton Laboratory]COMPL.
36. Semantic Analysis
Concept-Relation Extraction
[Figure: concept-relation graph. "Rex Luthor" (person) is the actor (who) of the action "announce"; the action's time (when) is "today" and its object (what) is the event "establ."; the event's location (where) is "Alderwood" (location) and its object (what) is "Boynton Lab" (organization).]
38. Template Extraction
Source text:
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson …
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. “The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.” …
Extracted template:
<Facility>Finsbury Park Mosque</Facility>
<Country>England</Country>
<Country>France</Country>
<Country>England</Country>
<Country>Belgium</Country>
<Country>United States</Country>
<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>
<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <City>London</City>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
39. Streaming Text: Knowledge Extraction
• Leverages earlier work on information extraction from text streams
• Information extraction is the process of using advanced, automated machine-learning approaches
– to identify entities in text documents
– and to extract this information along with the relationships these entities may have in the text documents
[Figure: the visualization demonstrates information extraction of names, places, and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.]
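A minimal sketch of the co-occurrence rule just described: two entities are linked when they appear within a fixed window of words (the window size, token list, and entity set here are hypothetical).

# Relation candidates from entity co-occurrence within a word window
from itertools import combinations

def cooccurring_pairs(tokens, entities, window=10):
    """Yield entity pairs that co-occur within `window` tokens."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in entities]
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and abs(i - j) <= window:
            yield (a, b)

tokens = "Rex Luthor opened the Boynton Laboratory in Alderwood".split()
entities = {"Luthor", "Boynton", "Alderwood"}
print(set(cooccurring_pairs(tokens, entities)))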
40. Semantic Analysis
• Word Sense Disambiguation
– Context-based or proximity-based
– Very accurate
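One standard context-based disambiguator is the Lesk algorithm, shown here via NLTK's implementation (the slides do not name an algorithm; this is an illustrative choice, and it requires the wordnet data package).

# Context-based word sense disambiguation with the Lesk algorithm
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "the batter hit the ball with the bat"
sense = lesk(word_tokenize(sentence), "bat")
print(sense, "-", sense.definition() if sense else "no sense found")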
41. Ontological Association (WordNet)
• WordNet: as of 2006, the database contains about 150,000 words organized in over 115,000 synsets, for a total of 207,000 word-sense pairs
• Search for "dog":
– n: dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
– n: frump, dog (a dull unattractive unpleasant girl or woman)
– n: dog (informal term for a man)
– n: cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
– n: frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork, usually smoked; often served on a bread roll)
– n: pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
– n: andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
– v: chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)
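The same lookup can be reproduced programmatically with NLTK's WordNet interface (requires the wordnet data package):

# Enumerate WordNet synsets for "dog"
from nltk.corpus import wordnet as wn

for synset in wn.synsets("dog"):
    print(synset.name(), synset.pos(), "-", synset.definition())
# e.g. dog.n.01 n - a member of the genus Canis ...
#      frank.n.02 n - a smooth-textured sausage ...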
42. Feature Selection
• Reduce dimensionality
– Learners have difficulty addressing tasks with high dimensionality
• Remove irrelevant features
– Not all features help!
– Remove features that occur in only a few documents
– Reduce features that occur in too many documents
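A minimal sketch of this document-frequency filtering with scikit-learn's min_df/max_df cutoffs, one common implementation:

# Feature selection by document frequency
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the dog ran", "a rare xylophone"]

# min_df drops terms in too few documents; max_df drops terms in too many
vec = CountVectorizer(min_df=2, max_df=0.5)
vec.fit(docs)
print(vec.get_feature_names_out())  # keeps mid-frequency terms: "dog", "sat"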
43. Text Mining: General Application Areas
• Information Retrieval
– Indexing and retrieval of textual documents
– Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
– Extraction of partial knowledge in the text
• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
– Predict a class for each text document
• Clustering
– Generating collections of similar text documents
44. Text Mining: Supervised vs. Unsupervised
• Supervised learning (classification)
– Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– Split into training data and test data for the model-building process
– New data are classified based on the model built with the training data
– Techniques: Bayesian classification, decision trees, neural networks, instance-based methods, support vector machines
• Unsupervised learning (clustering)
– Class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
51. Text Mining: Applications
• Email: spam filtering
• News feeds: discover what is interesting
• Medical: identify relationships and link information from different medical fields
• Homeland security
• Marketing: discover distinct groups of potential buyers and make suggestions for other products
• Industry: identify groups of competitors' web pages
• Job seeking: identify parameters in searching for jobs
52. Text Mining: Classification Definition
• Given: a collection of labeled records
– Each record contains a set of features (attributes) and the true class (label)
– Create a training set to build the model
– Create a testing set to test the model
• Find: a model for the class as a function of the values of the features
• Goal: assign a class (as accurately as possible) to previously unseen records
• Evaluation: what is good classification?
– Correct classification
• The known label of a test example is identical to the class predicted by the model
– Accuracy ratio
• The percentage of test-set examples that are correctly classified by the model
– A distance measure between classes can also be used
• e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
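A minimal sketch of the build/test/evaluate loop with scikit-learn and a Naive Bayes classifier; the documents and labels are toy data.

# Train on labeled records, predict on held-out test records, report accuracy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

docs = ["goal scored in the final", "touchdown in the last quarter",
        "striker signs new contract", "quarterback throws long pass",
        "penalty kick wins the match", "coach praises the offensive line"]
labels = ["football", "american", "football", "american", "football", "american"]

# Split labeled records into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=0)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Accuracy ratio: fraction of test examples whose predicted class
# matches the known label
print(accuracy_score(y_test, model.predict(X_test)))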
53. Text Mining: Clustering Definition
• Given: a set of documents and a similarity measure among documents
• Find: clusters such that
– Documents in one cluster are more similar to one another
– Documents in separate clusters are less similar to one another
• Goal: finding a correct grouping of the documents
• Similarity measures:
– Euclidean distance if attributes are continuous
– Other problem-specific measures
• e.g., how many words two documents have in common
• Evaluation: what is good clustering?
– Produce high-quality clusters with
• high intra-class similarity
• low inter-class similarity
– The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
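A minimal sketch of document clustering and the intra/inter-class quality idea, using k-means on TF-IDF vectors and the silhouette score as one common proxy for high intra-cluster and low inter-cluster similarity:

# Cluster documents with k-means and score cluster quality
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = ["stocks rallied on wall street", "the market closed higher",
        "rain expected over the weekend", "storms and heavy rain forecast"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)                       # cluster assignment per document
print(silhouette_score(X, km.labels_))  # higher = tighter, better-separated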