Text Mining and SEASR
1. Introduction to SEASR and Text Mining
UIUC/NCSA
Feb 4, 2009
Loretta Auvil
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
3. SEASR: Reach + Relevance + Reuse + Repeatability
SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems
– Semantics-driven environment for SOA interoperability
– Encourages sharing and participation for building communities
– Modular construction allows flows to be modified and configured, encouraging reusability within and across domains
– Enables mashup and integration of tools
– Data-intensive flows can be executed on a simple desktop or a large cluster without modification
– Computation can be created for distributed execution on servers where the content lives
– Gives users control over trust and compliance with the copyright licenses required by content
– Relies on the standardized Resource Description Framework (RDF) to define components and flows
5. Workbench
• Web-based UI
• Components and flows are retrieved from the server
• Additional locations of components and flows can be added to the server
• Create flows using a graphical drag-and-drop interface
• Change property values
• Execute the flow
7. SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
– Citation Analysis uses the JUNG network-importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
– Zotero export to Fedora through SEASR
– Saves results from SEASR Analytics to a collection
• Launch MONK Processing
– MONK DB Ingestion Workflow
10. SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10-second moving window of audio (see the sketch below)
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of audio via spectral analysis
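As an illustration of the moving-window step, here is a minimal Python sketch that computes one feature (RMS energy) per 10-second window; NEMA's actual feature extractors are not specified in these slides, so the feature choice is an assumption.

# Minimal sketch of windowed audio feature extraction (hypothetical;
# the slides do not specify NEMA's actual feature extractors).
import numpy as np

def windowed_rms(samples, sample_rate, window_sec=10.0, hop_sec=10.0):
    """Compute RMS energy for each moving window of audio."""
    win = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    features = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        features.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(features)

# Example: 60 seconds of synthetic audio at 22,050 Hz -> 6 windows
audio = np.random.randn(60 * 22050).astype(np.float32)
print(windowed_rms(audio, 22050))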
12. SEASR @ Work – DISCUS
• On-demand usage of analytics while surfing
– While navigating, request analytics to be performed on a page
– Text extraction and cleaning
• Summarization and keyword extraction
– List the important terms on the page being analyzed
– Provide relevant short summaries
• Visual maps
– Provide a visual representation of the key concepts
– Show the graph of relations between concepts
20. Some Examples
• Authorship Analysis (JUNG network-importance algorithms rank the authors in the citation network)
• Author Centrality Analysis
– Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
• Author Degree Analysis
– Uses AuthorDegreeDistributionAnalysis, which ranks each author by the number of coauthors
• Author HITS Analysis
– The *hubness* of a node is the degree to which it links to other important authorities; the *authoritativeness* of a node is the degree to which it is pointed to by important hubs
• Readability
– Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
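The slides use JUNG (a Java library); the following Python sketch reproduces the same ideas with networkx on a toy graph, plus the published Flesch-Kincaid grade formula. The graph data are hypothetical.

# JUNG analogue in Python: betweenness, degree, and HITS with networkx
import networkx as nx

# Toy citation/coauthorship graph (hypothetical data)
G = nx.DiGraph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),
    ("Dave", "Carol"), ("Carol", "Alice"),
])

# Betweenness centrality: rank authors by the number of
# shortest paths that pass through them
print(nx.betweenness_centrality(G))

# Degree: rank authors by their number of coauthors/citations
print(dict(G.degree()))

# HITS: hubness = linking to important authorities;
# authoritativeness = being linked to by important hubs
hubs, authorities = nx.hits(G, max_iter=500)
print(hubs)
print(authorities)

# Flesch-Kincaid grade level (standard published formula)
def flesch_kincaid_grade(words, sentences, syllables):
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

print(flesch_kincaid_grade(words=100, sentences=5, syllables=140))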
22. Text Mining Definition
Many definitions appear in the literature:
• “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data”
• An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
• What is “previously unknown” information?
– Strict definition
• Information that not even the writer knows
– Lenient definition
• Rediscovering the information that the author encoded in the text
23. Text Mining Process
• Text Preprocessing
– Syntactic text analysis
– Semantic text analysis
• Features Generation (see the sketch after this list)
– Bag of words
– N-grams
• Feature Selection
– Simple counting
– Statistics
– Selection based on POS
• Text/Data Mining
– Classification (supervised learning)
– Clustering (unsupervised learning)
– Information extraction
• Analyzing Results
– Visual exploration, discovery, and knowledge extraction
– Query-based question answering
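A minimal sketch of the features-generation step (bag of words and n-grams), using scikit-learn as one common choice; the slides do not prescribe a library.

# Bag-of-words and n-gram feature generation with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the lord of the rings", "the king saw the rabbit"]

# Bag of words: each word becomes a dimension, values are counts
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(X.toarray())

# N-grams: unigrams and bigrams as features
ngrams = CountVectorizer(ngram_range=(1, 2))
ngrams.fit(docs)
print(ngrams.get_feature_names_out())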
24. Text Characteristics (1)
• Large textual databases
– Enormous wealth of textual information on the Web
– Publications are electronic
• High dimensionality
– Consider each word/phrase as a dimension
• Noisy data
– Spelling mistakes
– Abbreviations
– Acronyms
• Text collections are very dynamic
– Web pages are constantly being generated (and removed)
– Web pages are generated from database queries
• Not well-structured text
– Email/chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
– Speech
25. Text Characteristics (2)
• Dependency
– Relevant information is a complex conjunction of words/phrases
– Order of words in the query matters
• hot dog stand in the amusement park
• hot amusement stand in the dog park
• Ambiguity
– Word ambiguity
• Pronouns (he, she, …)
• Synonyms (buy, purchase)
• Words with multiple meanings (bat – related to baseball or a mammal)
– Semantic ambiguity
• The king saw the rabbit with his glasses. (multiple readings)
• Authority of the source
– IBM is more likely to be an authoritative source than my distant second cousin
27. Syntactic Analysis
• Tokenization
– A text document is represented by the words it contains (and their occurrences)
– e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
– Highly efficient
– Makes learning far simpler and easier
– Order of words is not that important for certain applications
• Lemmatization/Stemming
– Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
– Reduces dimensionality
– Identifies a word by its root
– e.g., flying, flew → fly
• Stop words
– Identifies the most common words, which are unlikely to help with text mining
– e.g., “the”, “a”, “an”, “you”
• Parsing / Part-of-Speech (POS) tagging
– Generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph
– Finds the corresponding POS for each word
– e.g., John (noun) gave (verb) the (det) ball (noun)
– Shallow parsing: analysis of a sentence that identifies the constituents (noun groups, verbs, ...) but does not specify their internal structure or their role in the main sentence
– Deep parsing: more sophisticated syntactic, semantic, and contextual processing, e.g., to extract or construct an answer
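A sketch of these syntactic steps with NLTK, one possible toolkit (the slides are library-agnostic); it assumes the punkt, stopwords, wordnet, and averaged_perceptron_tagger NLTK data packages are installed.

# Tokenization, stop-word removal, stemming/lemmatization, POS tagging
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "John gave the ball while the birds were flying"

# Tokenization: represent the document by the words it contains
tokens = nltk.word_tokenize(text)

# Stop-word removal: drop very common, uninformative words
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.lower() not in stops]

# Stemming vs. lemmatization: reduce words to a root or headword
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in content])                    # flying -> fli
print([lemmatizer.lemmatize(t, pos="v") for t in content])   # flying -> fly

# POS tagging: find the part of speech of each word
print(nltk.pos_tag(tokens))  # e.g. John/NNP gave/VBD the/DT ball/NN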
28. Semantic Analysis: Information Extraction
• Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output it in a predetermined format
29. Information Extraction
Information Type – State of the art (accuracy)
• Entities (90–98%): an object of interest, such as a person or organization
• Attributes (80%): a property of an entity, such as its name, alias, descriptor, or type
• Facts (60–70%): a relationship held between two or more entities, such as the position of a person in a company
• Events (50–60%): an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
30. Information Extraction Approaches
• Terminology (name) lists
– This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
– This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas); see the sketch below
• Use of characteristic patterns
– This works fairly well for novel entities
– Rules can be created by hand or learned via machine-learning or statistical algorithms
– Rules capture local patterns that characterize entities from instances of annotated training data
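A minimal sketch of the internal-format approach for dates: a hand-written rule that recognizes the DD/MM/YY pattern.

# Rule-based recognition of dates by their internal format
import re

DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

text = "The meeting on 04/02/09 was rescheduled to 11/02/09."
for match in DATE_PATTERN.finditer(text):
    day, month, year = match.groups()
    print(f"date entity: {match.group(0)} (day={day}, month={month}, year={year})")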
31. Information Extraction
Relation (Event) Extraction
• Identify (and tag) the relation between two entities:
– A person is_located_at a location (news)
– A gene codes_for a protein (biology)
• Relations require more information
– Identification of the two entities and their relationship
– Predicted relation accuracy compounds the per-element accuracies:
• Pr(E1) × Pr(E2) × Pr(R) ≈ 0.93 × 0.93 × 0.93 ≈ 0.80
• Information in relations is less local
– Contextual information is a problem: the right word may not be explicitly present in the sentence
– Events involve more relations and are even harder
32. Semantic Analytics
Named Entity (NE) Tagging
Mayor [Rex Luthor]NE:Person announced [today]NE:Time the establishment of a new research facility in [Alderwood]NE:Location. It will be known as [Boynton Laboratory]NE:Organization.
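The slides do not name a specific tagger; as an illustration, spaCy's pretrained pipeline produces this style of annotation (assumes the en_core_web_sm model is installed).

# Named entity tagging with a pretrained spaCy pipeline
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mayor Rex Luthor announced today the establishment of a "
          "new research facility in Alderwood. It will be known as "
          "Boynton Laboratory.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Rex Luthor PERSON, today DATE, ...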
35. Semantic Analysis
Semantic Role Analysis
[Mayor Rex Luthor]ACTOR [announced]ACTION [today]WHEN [the establishment of a new research facility]OBJECT [in Alderwood]WHERE. [It]OBJECT will be [known as]ACTION [Boynton Laboratory]COMPL.
36. Semantic Analysis
Concept-Relation Extraction
[Figure: concept-relation graph. "Rex Luthor" (person) is the actor (who) of the action "announce"; the action's time (when) is "today" and its object (what) is the event "establ."; the event's location (where) is "Alderwood" (location) and its object (what) is "Boynton Lab" (organization).]
38. Template Extraction
Source text:
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson …
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. “The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.” …
Extracted template:
<Facility>Finsbury Park Mosque</Facility>
<Country>England</Country>
<Country>France</Country>
<Country>England</Country>
<Country>Belgium</Country>
<Country>United States</Country>
<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>
<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <City>London</City>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
39. Streaming Text: Knowledge Extraction
• Leverages earlier work on information extraction from text streams
• Information extraction is the process of using advanced, automated machine-learning approaches
– to identify entities in text documents
– and to extract this information along with the relationships these entities may have in the text documents
[Figure: the visualization demonstrates information extraction of names, places, and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.]
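A minimal sketch of the co-occurrence rule just described: two entities are linked when they appear within a fixed window of words (the window size, token list, and entity set here are hypothetical).

# Relation candidates from entity co-occurrence within a word window
from itertools import combinations

def cooccurring_pairs(tokens, entities, window=10):
    """Yield entity pairs that co-occur within `window` tokens."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in entities]
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and abs(i - j) <= window:
            yield (a, b)

tokens = "Rex Luthor opened the Boynton Laboratory in Alderwood".split()
entities = {"Luthor", "Boynton", "Alderwood"}
print(set(cooccurring_pairs(tokens, entities)))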
40. Semantic Analysis
• Word Sense Disambiguation
– Context-based or proximity-based
– Very accurate
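One standard context-based disambiguator is the Lesk algorithm, shown here via NLTK's implementation (the slides do not name an algorithm; this is an illustrative choice, and it requires the wordnet data package).

# Context-based word sense disambiguation with the Lesk algorithm
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "the batter hit the ball with the bat"
sense = lesk(word_tokenize(sentence), "bat")
print(sense, "-", sense.definition() if sense else "no sense found")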
41. Ontological Association (WordNet)
• WordNet: as of 2006, the database contains about 150,000 words organized in over 115,000 synsets, for a total of 207,000 word-sense pairs
• Search for "dog":
– n: dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
– n: frump, dog (a dull unattractive unpleasant girl or woman)
– n: dog (informal term for a man)
– n: cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
– n: frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork, usually smoked; often served on a bread roll)
– n: pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
– n: andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
– v: chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)
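The same lookup can be reproduced programmatically with NLTK's WordNet interface (requires the wordnet data package):

# Enumerate WordNet synsets for "dog"
from nltk.corpus import wordnet as wn

for synset in wn.synsets("dog"):
    print(synset.name(), synset.pos(), "-", synset.definition())
# e.g. dog.n.01 n - a member of the genus Canis ...
#      frank.n.02 n - a smooth-textured sausage ...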
42. Feature Selection
• Reduce dimensionality
– Learners have difficulty addressing tasks with high dimensionality
• Remove irrelevant features
– Not all features help!
– Remove features that occur in only a few documents
– Reduce features that occur in too many documents
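A minimal sketch of this document-frequency filtering with scikit-learn's min_df/max_df cutoffs, one common implementation:

# Feature selection by document frequency
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the dog ran", "a rare xylophone"]

# min_df drops terms in too few documents; max_df drops terms in too many
vec = CountVectorizer(min_df=2, max_df=0.5)
vec.fit(docs)
print(vec.get_feature_names_out())  # keeps mid-frequency terms: "dog", "sat"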
43. Text Mining: General Application Areas
• Information Retrieval
– Indexing and retrieval of textual documents
– Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
– Extraction of partial knowledge in the text
• Web Mining
– Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
– Predict a class for each text document
• Clustering
– Generating collections of similar text documents
44. Text Mining: Supervised vs. Unsupervised
• Supervised learning (classification)
– Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– Split into training data and test data for the model-building process
– New data are classified based on the model built with the training data
– Techniques: Bayesian classification, decision trees, neural networks, instance-based methods, support vector machines
• Unsupervised learning (clustering)
– Class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
51. Text Mining: Applications
• Email: spam filtering
• News feeds: discover what is interesting
• Medical: identify relationships and link information from different medical fields
• Homeland security
• Marketing: discover distinct groups of potential buyers and make suggestions for other products
• Industry: identify groups of competitors' web pages
• Job seeking: identify parameters in searching for jobs
52. Text Mining: Classification Definition
• Given: a collection of labeled records
– Each record contains a set of features (attributes) and the true class (label)
– Create a training set to build the model
– Create a testing set to test the model
• Find: a model for the class as a function of the values of the features
• Goal: assign a class (as accurately as possible) to previously unseen records
• Evaluation: what is good classification?
– Correct classification
• The known label of a test example is identical to the class predicted by the model
– Accuracy ratio
• The percentage of test-set examples that are correctly classified by the model
– A distance measure between classes can also be used
• e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
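A minimal sketch of the build/test/evaluate loop with scikit-learn and a Naive Bayes classifier; the documents and labels are toy data.

# Train on labeled records, predict on held-out test records, report accuracy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

docs = ["goal scored in the final", "touchdown in the last quarter",
        "striker signs new contract", "quarterback throws long pass",
        "penalty kick wins the match", "coach praises the offensive line"]
labels = ["football", "american", "football", "american", "football", "american"]

# Split labeled records into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=0)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Accuracy ratio: fraction of test examples whose predicted class
# matches the known label
print(accuracy_score(y_test, model.predict(X_test)))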
53. Text Mining: Clustering Definition
• Given: a set of documents and a similarity measure among documents
• Find: clusters such that
– Documents in one cluster are more similar to one another
– Documents in separate clusters are less similar to one another
• Goal: finding a correct grouping of the documents
• Similarity measures:
– Euclidean distance if attributes are continuous
– Other problem-specific measures
• e.g., how many words two documents have in common
• Evaluation: what is good clustering?
– Produce high-quality clusters with
• high intra-class similarity
• low inter-class similarity
– The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
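A minimal sketch of document clustering and the intra/inter-class quality idea, using k-means on TF-IDF vectors and the silhouette score as one common proxy for high intra-cluster and low inter-cluster similarity:

# Cluster documents with k-means and score cluster quality
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = ["stocks rallied on wall street", "the market closed higher",
        "rain expected over the weekend", "storms and heavy rain forecast"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)                       # cluster assignment per document
print(silhouette_score(X, km.labels_))  # higher = tighter, better-separated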