Introduction to SEASR and Text Mining

UIUC/NCSA
Feb 4, 2009

Loretta Auvil
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

The SEASR Picture

SEASR: Reach + Relevance + Reuse + Repeatability




SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems
 –  Semantic-driven environment for SOA interoperability
 –  Encourages sharing and participation for building communities
 –  Modular construction allows flows to be modified and configured to encourage reusability within and across domains
 –  Enables mashup and integration of tools
 –  Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification
 –  Computation can be created for distributed execution on servers where the content lives
 –  User accessibility to control trust and compliance with required copyright license of content
 –  Relies on the standardized Resource Description Framework (RDF) to define components and flows

Knowledge Discovery in Data

Workbench
•  Web-based UI
•  Components and flows are retrieved from server
•  Additional locations of components and flows can be added to server
•  Create flow using a graphical drag and drop interface
•  Change property values
•  Execute the flow

Community Hub

SEASR @ Work – Zotero
•  Plugin to Firefox
•  Zotero manages the collection
•  Launch SEASR Analytics
   –  Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
   –  Zotero Export to Fedora through SEASR
   –  Saves results from SEASR Analytics to a Collection
•  Launch MONK Processing
   –  MONK DB Ingestion Workflow

SEASR @ Work – Fedora
[Figure labels: Interactive Web Application; Web Service]

SEASR @ Work – Entity Mash-up
•  Entity Extraction with OpenNLP
•  Locations viewed on Google Map
•  Dates viewed on Simile Timeline

SEASR @ Work – Audio Analysis
•  NEMA: Executes a SEASR flow for each run
   –  Loads audio data
   –  Extracts features for every 10 sec moving window of audio
   –  Loads and applies the models
   –  Sends results back to the WebUI
•  NESTER: Annotation of Audio via Spectral Analysis
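
The moving-window step in the NEMA flow can be illustrated with a minimal sketch. The slides give only the window length (10 sec); the hop size of 5 sec, the function names, and the stand-in feature (mean absolute amplitude) are assumptions made for illustration and are not taken from NEMA or SEASR.

```python
from typing import Iterator, List, Sequence


def moving_windows(samples: Sequence[float], sample_rate: int,
                   window_sec: float = 10.0,
                   hop_sec: float = 5.0) -> Iterator[Sequence[float]]:
    """Yield successive fixed-length windows of audio samples.

    hop_sec (how far the window advances each step) is an assumed value.
    """
    window = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    # Stop when a full window no longer fits in the remaining samples.
    for start in range(0, len(samples) - window + 1, hop):
        yield samples[start:start + window]


def mean_abs_amplitude(window: Sequence[float]) -> float:
    """Hypothetical stand-in for a real audio feature."""
    return sum(abs(x) for x in window) / len(window)


def extract_features(samples: Sequence[float], sample_rate: int) -> List[float]:
    """Compute one feature value per moving window."""
    return [mean_abs_amplitude(w) for w in moving_windows(samples, sample_rate)]
```

Each feature value would then be passed to the loaded models, one prediction per window.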

SEASR @ Work – MONK
Executes flows for each analysis requested
  –  Predictive modeling using Naïve Bayes
  –  Predictive modeling using Support Vector Machines (SVM)

SEASR @ Work – DISCUS
•  On-demand usage of analytics while surfing
      –  While navigating, request analytics to be performed on page
      –  Text extraction and cleaning
•  Summarization and keyword extraction
      –  List the important terms on the page being analyzed
      –  Provide relevant short summaries
•  Visual maps
      –  Provide a visual representation of the key concepts
      –  Show the graph of relations between concepts

SEASR and UIMA: Emotion Tracking

Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

SEASR Text Analytics Goals
Address the scholarly text analytics needs by:
•  Efficiently managing distributed literary and historical textual assets
•  Structuring extracted information to facilitate knowledge discovery
•  Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question-answering
•  Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
•  Devising algorithms for question answering and inference
•  Developing a UI for effective visual knowledge discovery that separates query logic from application logic
•  Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
•  Developing an interaction UI for effective visual data exploration
•  Enabling text analytics through SEASR components

The Zotero Picture
[Figure: The Web and the Zotero Store]

The Zotero + SEASR Picture
[Figure: The Web and the Zotero Store]

Your Zotero Collection

The SEASR Analytics

The Value Added

Some Examples

•  Authorship Analysis (JUNG network importance algorithms rank the authors
   in the citation network)
  •  Author Centrality Analysis
       –  Uses Betweenness Centrality, which ranks each author in the coauthor graph by the
          number of shortest paths that pass through them
  •  Author Degree Analysis
       –  Uses AuthorDegreeDistributionAnalysis, which ranks each author by the number of coauthors
  •  Author HITS Analysis
       –  The *hubness* of a node is the degree to which a node links to other important
          authorities. The *authoritativeness* of a node is the degree to which a node is pointed
          to by important hubs.

•  Readability
  •  Flesch-Kincaid readability test
     (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
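
The author degree ranking described above can be sketched in a few lines. This is a plain-Python illustration, not the JUNG API (JUNG is a Java library) and not the SEASR component itself; the function names and the sample author names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(papers):
    """Map each author to the number of distinct coauthors.

    papers is a list of author lists, one list per publication.
    """
    neighbors = defaultdict(set)
    for authors in papers:
        for name in authors:
            neighbors[name]  # register the author even if they have no coauthors
        for a, b in combinations(set(authors), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(coauthors) for author, coauthors in neighbors.items()}


def rank_by_degree(papers):
    """Return (author, degree) pairs, highest degree first, ties by name."""
    degrees = coauthor_degrees(papers)
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical collection of three publications.
papers = [["Ann", "Bob", "Cy"], ["Ann", "Dee"], ["Eve"]]
ranking = rank_by_degree(papers)
```

Betweenness and HITS need the whole graph rather than per-node counts, which is why the flow delegates them to a graph library.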
SEASR Flow

Text Mining Definition
Many definitions in the literature
•  “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”
•  An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
•  What is “previously unknown” information?
   –  Strict definition
      •  Information that not even the writer knows
   –  Lenient definition
      •  Rediscover the information that the author encoded in the text


Text Mining Process
•  Text Preprocessing
      –  Syntactic Text Analysis
      –  Semantic Text Analysis
•  Features Generation
      –  Bag of Words
      –  Ngrams
•  Feature Selection
      –  Simple Counting
      –  Statistics
      –  Selection based on POS
•  Text/Data Mining
      –  Classification - Supervised Learning
      –  Clustering - Unsupervised Learning
      –  Information Extraction
•  Analyzing Results
      –  Visual Exploration, Discovery and Knowledge Extraction
      –  Query-based – question answering

Text Characteristics (1)
•  Large textual data base
    –  Enormous wealth of textual information on the Web
    –  Publications are electronic
•  High dimensionality
    –  Consider each word/phrase as a dimension
•  Noisy data
    –  Spelling mistakes
    –  Abbreviations
    –  Acronyms
•  Text messages are very dynamic
    –  Web pages are constantly being generated (removed)
    –  Web pages are generated from database queries
•  Not well structured text
    –  Email/Chat rooms
         •  “r u available ?”
         •  “Hey whazzzzzz up”
    –  Speech

Text Characteristics (2)
•  Dependency
    –  Relevant information is a complex conjunction of words/phrases
    –  Order of words in the query
        •  hot dog stand in the amusement park
        •  hot amusement stand in the dog park
•  Ambiguity
    –  Word ambiguity
        •  Pronouns (he, she …)
        •  Synonyms (buy, purchase)
        •  Words with multiple meanings (bat – it is related to baseball or mammal)
    –  Semantic ambiguity
        •  The king saw the rabbit with his glasses. (multiple meanings)
•  Authority of the source
    –  IBM is more likely to be an authorized source than my distant second cousin
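
The word-order example above can be checked directly. The sketch below is an illustration with hypothetical function names: it shows that the two queries have identical bags of words and only differ once adjacent word pairs (bigrams) are considered.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring order."""
    return Counter(text.lower().split())


def bigrams(text):
    """List adjacent word pairs, which preserve local word order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


query_a = "hot dog stand in the amusement park"
query_b = "hot amusement stand in the dog park"

same_bag = bag_of_words(query_a) == bag_of_words(query_b)   # True
same_bigrams = bigrams(query_a) == bigrams(query_b)          # False
```

A representation that drops word order cannot tell these two queries apart, which is the dependency problem the slide describes.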

Text Preprocessing
•  Syntactic analysis
   –  Tokenization
   –  Lemmatization
   –  POS tagging
   –  Shallow parsing
   –  Custom literary tagging
•  Semantic analysis
   –  Information Extraction
         •  Named Entity tagging
   –  Semantic Category (unnamed entity) tagging
   –  Co-reference resolution
   –  Ontological association (WordNet, VerbNet)
   –  Semantic Role analysis
   –  Concept-Relation extraction

Syntactic Analysis
•  Tokenization
      –  Text document is represented by the words it contains (and their occurrences)
      –  e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
      –  Highly efficient
      –  Makes learning far simpler and easier
      –  Order of words is not that important for certain applications
•  Lemmatization/Stemming
      –  Involves the reduction of corpus words to their respective headwords (i.e. lemmas)
      –  Reduce dimensionality
      –  Identifies a word by its root
      –  e.g., flying, flew → fly
•  Stop words
      –  Identifies the most common words that are unlikely to help with text mining
      –  e.g., “the”, “a”, “an”, “you”
•  Parsing / Part of Speech (POS) tagging
      –  Generates a parse tree (graph) for each sentence
      –  Each sentence is a stand alone graph
      –  Find the corresponding POS for each word
      –  e.g., John (noun) gave (verb) the (det) ball (noun)
      –  Shallow Parsing
             •  analysis of a sentence which identifies the constituents (noun groups, verbs,...), but does not specify their internal structure, nor their role in the main sentence
      –  Deep Parsing
             •  more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
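
The first three steps above (tokenization, stop-word removal, lemmatization) can be sketched as follows. The stop-word list reuses the slide's examples, and the lemma table is a hypothetical lookup holding only the slide's "flying, flew → fly" example; a real system would use a full dictionary or morphological analyzer. The regular expression is a simplification that keeps only runs of ASCII letters.

```python
import re
from collections import Counter

# Illustrative stop-word list from the slide; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "you"}

# Hypothetical lemma lookup table containing only the slide's example.
LEMMAS = {"flying": "fly", "flew": "fly"}


def tokenize(text):
    """Split text into lowercase word tokens (letters only)."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize, drop stop words, map words to lemmas, and count occurrences."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(LEMMAS.get(t, t) for t in tokens)


counts = preprocess("The bird flew; a bird is flying")
```

Here `counts` records two occurrences each of "bird" and "fly" and one of "is", so the two inflected verb forms collapse into a single dimension.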
Semantic Analysis: Information Extraction
 •  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
 •  Extract the relevant information and ignore non-relevant information (important!)
 •  Link related information and output in a predetermined format

Information Extraction

Information Type                                                                        State of the art (Accuracy)
Entities: an object of interest such as a person or organization.                      90-98%
Attributes: a property of an entity such as its name, alias, descriptor, or type.      80%
Facts: a relationship held between two or more entities such as Position of a
   Person in a Company.                                                                 60-70%
Events: an activity involving several entities such as a terrorist act, airline
   crash, management change, new product introduction.                                 50-60%

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Information Extraction Approaches
•  Terminology (name) lists
   –  This works very well if the list of names and name expressions is stable and available
•  Tokenization and morphology
   –  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
•  Use of characteristic patterns
   –  This works fairly well for novel entities
   –  Rules can be created by hand or learned via machine learning or statistical algorithms
   –  Rules capture local patterns that characterize entities from instances of annotated training data
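
The first two approaches above can be illustrated with a minimal sketch: a lookup against a fixed name list and a regular expression that recognizes dates by their internal DD/MM/YY format. The name list (borrowed from the deck's example sentence) and the function names are illustrative assumptions; real lists would be far larger and matching would respect token boundaries.

```python
import re

# Recognizes dates by internal format: two-digit day, month, and year.
DATE_PATTERN = re.compile(r"\b(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/(\d{2})\b")

# Hypothetical terminology list for illustration.
KNOWN_NAMES = ["Boynton Laboratory", "Alderwood"]


def find_dates(text):
    """Return all substrings that look like DD/MM/YY dates."""
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]


def find_names(text, names=KNOWN_NAMES):
    """Return the listed names that occur verbatim in the text."""
    return [name for name in names if name in text]


dates = find_dates("Opened on 04/02/09, not 31/13/09.")
names = find_names("A new research facility in Alderwood.")
```

The pattern rejects "31/13/09" because 13 is not a valid month, which shows how format rules alone can filter candidates without any training data.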

Information Extraction
Relation (Event) Extraction
•  Identify (and tag) the relation between two entities:
   –  A person is_located_at a location (news)
   –  A gene codes_for a protein (biology)
•  Relations require more information
   –  Identification of two entities & their relationship
   –  Predicted relation accuracy
      •  Pr(E1)*Pr(E2)*Pr(R) ~= (.93) * (.93) * (.93) = .80
•  Information in relations is less local
   –  Contextual information is a problem: right word may not be explicitly present in the sentence
   –  Events involve more relations and are even harder
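
The accuracy estimate above is a product of three probabilities, which implicitly treats the two entity identifications and the relation identification as independent. A short sketch of that arithmetic (function names are hypothetical):

```python
def relation_accuracy(p_entity1, p_entity2, p_relation):
    """Probability that both entities and the relation are all identified correctly,
    assuming the three steps succeed independently."""
    return p_entity1 * p_entity2 * p_relation


def chained_accuracy(p_step, n_steps):
    """Accuracy of n independent steps that each succeed with probability p_step."""
    return p_step ** n_steps


estimate = relation_accuracy(0.93, 0.93, 0.93)  # about 0.804, i.e. the slide's .80
```

Because every additional component multiplies in another factor below 1, `chained_accuracy` falls as `n_steps` grows, which is one way to read the remark that events, involving more relations, are even harder.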

Semantic Analytics
Named Entity (NE) Tagging

[NE:Person Mayor Rex Luthor] announced [NE:Time today] the establishment of a new research facility in [NE:Location Alderwood]. It will be known as [NE:Organization Boynton Laboratory].

Semantic Analysis
Semantic Category (unnamed entity, UNE) Tagging

Mayor Rex Luthor announced today the establishment of a [UNE:Organization new research facility] in Alderwood. It will be known as Boynton Laboratory.

Semantic Analysis
Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a [UNE:Organization new research facility] in Alderwood. It will be known as Boynton Laboratory.

Semantic Analysis
Semantic Role Analysis

            ACTOR           ACTION   WHEN               OBJECT
       Mayor Rex Luthor announced today the establishment

                                  WHERE           OBJECT
       of a new research facility in Alderwood.      It will be

      ACTION           COMPL
       known as Boynton Laboratory
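Semantic Analysis
Concept-Relation Extraction

[Figure: concept-relation graph with the following nodes and labeled links]
   announce (action) -- actor (who) -- Rex Luthor (person)
   announce (action) -- time (when) -- today (time)
   announce (action) -- object (what) -- establ. (event)
   establ. (event) -- object (what) -- Boynton Lab (organiz.)
   establ. (event) -- location (where) -- Alderwood (location)
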
IE – Template Extraction - Steps





                       </VerbGroup> …
Template Extraction

Source text:
(c) 2001, Chicago Tribune.
Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/
Distributed by Knight Ridder/Tribune Information Services.
By Stephen J. Hedges and Cam Simpson
…….
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.
"The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited." …

Extracted entities and templates:
<Facility>Finsbury Park Mosque</Facility>
<Country>England</Country>
<Country>France </Country>
<Country>England</Country>
<Country>Belgium</Country>
<Country>United States</Country>
<Person>Abu Hamza al-Masri</Person>
<City>London</City>
<PersonPositionOrganization>
   <OFFLEN OFFSET="3576" LENGTH="33" />
   <Person>Abu Hamza al-Masri</Person>
   <Position>chief cleric</Position>
   <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>
<PersonArrest>
   <OFFLEN OFFSET="3814" LENGTH="61" />
   <Person>Abu Hamza al-Masri</Person>
   <Location>London</Location>
   <Date>1999</Date>
   <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>

Streaming Text: Knowledge Extraction
•  Leveraging some earlier work on information extraction from text streams

Information extraction
•  process of using advanced automated machine learning approaches
•  to identify entities in text documents
•  extract this information along with the relationships these entities may have in the text documents

The visualization above demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
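
The co-occurrence rule in the caption can be sketched as below. This is an illustration rather than the system's actual implementation: the window size of 10 words, the input format (token position plus entity name), and the function name are all assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(entity_positions, window=10):
    """Count entity pairs whose mentions lie within `window` words of each other.

    entity_positions is a list of (token_index, entity_name) tuples, as an
    entity extractor might produce for one article.
    """
    pairs = Counter()
    for (i, a), (j, b) in combinations(sorted(entity_positions), 2):
        if a != b and abs(j - i) <= window:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Hypothetical mention positions for one article.
mentions = [(0, "Rex Luthor"), (12, "Alderwood"), (20, "Boynton Laboratory")]
relations = cooccurrences(mentions, window=10)
```

With a window of 10, only Alderwood and Boynton Laboratory (8 words apart) are linked; widening the window to 12 also links Rex Luthor and Alderwood, which shows how strongly the window size shapes the resulting graph.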

Semantic Analysis
•  Word Sense Disambiguation
      –  Context based or proximity based
      –  Very accurate

Ontological Association (WordNet)
•  WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
•  Search for dog
    –  n dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
    –  n frump, dog (a dull unattractive unpleasant girl or woman)
    –  n dog (informal term for a man)
    –  n cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
    –  n frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
    –  n pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
    –  n andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
    –  v chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)

Feature Selection
•  Reduce Dimensionality
  –  Learners have difficulty addressing tasks with high dimensionality
•  Irrelevant Features
  –  Not all features help!
  –  Remove features that occur in only a few documents
  –  Reduce features that occur in too many documents
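
The two document-frequency rules above can be sketched as a simple filter. The thresholds (`min_docs=2`, `max_fraction=0.9`) and the function names are assumptions chosen for illustration, and the sketch simply drops overly common terms, which is one simple way to "reduce" them.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return df


def select_features(documents, min_docs=2, max_fraction=0.9):
    """Keep terms that are neither too rare nor too common across documents."""
    df = document_frequency(documents)
    limit = max_fraction * len(documents)
    return sorted(term for term, count in df.items() if min_docs <= count <= limit)


docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "ran"],
    ["the", "bird"],
]
kept = select_features(docs)
```

In this toy corpus "the" appears in every document and "sat", "dog", and "bird" in only one each, so only "cat" and "ran" survive as features.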

Text Mining: General Application Areas
•  Information Retrieval
   –  Indexing and retrieval of textual documents
   –  Finding a set of (ranked) documents that are relevant to the query
•  Information Extraction
   –  Extraction of partial knowledge in the text
•  Web Mining
   –  Indexing and retrieval of textual documents and extraction of partial knowledge using the web
•  Classification
   –  Predict a class for each text document
•  Clustering
   –  Generating collections of similar text documents

Text Mining: Supervised vs. Unsupervised
 •  Supervised learning (Classification)
    –  Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    –  Split into training data and test data for model building process
    –  New data is classified based on the model built with the training data
    –  Techniques
        •  Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
 •  Unsupervised learning (Clustering)
    –  Class labels of training data are unknown
    –  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
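
The training/test split mentioned for supervised learning can be sketched as follows. The 25% test fraction, the fixed seed, and the function name are assumptions for illustration; the slides do not prescribe a ratio.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle labeled records and split them into (training, test) lists."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled records: (document id, class label).
records = [(i, "sports" if i % 2 else "news") for i in range(20)]
train, test = train_test_split(records)
```

The model is built only from `train`; `test` is held back so that the evaluation uses records the model has not seen.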

Results: Social Network (Tom in Red)

Results: Timeline

Results: Maps

Text Mining: T2K and ThemeWeaver

Text Mining: Themescape and ThemeRiver
 •  Visualizing Relationships Between Documents
                                    Images from Pacific Northwest Laboratory

Gather – Analyze – Present


Text Mining: Applications
•  Email: Spam filtering
•  News Feeds: Discover what is interesting
•  Medical: Identify relationships and link information from different medical fields
•  Homeland Security
•  Marketing: Discover distinct groups of potential buyers and make suggestions for other products
•  Industry: Identifying groups of competitors' web pages
•  Job Seeking: Identify parameters in searching for jobs

Text Mining: Classification Definition
•  Given: Collection of labeled records
    –  Each record contains a set of features (attributes), and the true class (label)
    –  Create a training set to build the model
    –  Create a testing set to test the model
•  Find: Model for the class as a function of the values of the features
•  Goal: Assign a class (as accurately as possible) to previously unseen records
•  Evaluation: What Is Good Classification?
    –  Correct classification
        •  Known label of test example is identical to the predicted class from the model
    –  Accuracy ratio
        •  Percent of test set examples that are correctly classified by the model
    –  Distance measure between classes can be used
        •  e.g., classifying “football” document as a “basketball” document is not as bad as classifying it as “crime”
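
The two evaluation measures above can be compared in a short sketch. The class-distance values are hypothetical numbers chosen only to encode "football is closer to basketball than to crime"; the function names are likewise illustrative.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test examples whose predicted class equals the known label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical distances between classes (smaller means a milder mistake).
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}


def mean_error_distance(true_labels, predicted_labels, distance=CLASS_DISTANCE):
    """Average class distance of the mistakes; correct predictions cost 0."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += distance[frozenset([t, p])]
    return total / len(true_labels)


truth = ["football", "football", "crime", "basketball"]
mild = ["basketball", "football", "crime", "basketball"]   # football -> basketball
severe = ["crime", "football", "crime", "basketball"]      # football -> crime
```

Both prediction lists have the same accuracy (three of four correct), but `mean_error_distance` is lower for `mild` than for `severe`, which captures the distinction the accuracy ratio misses.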

Text Mining: Clustering Definition
•  Given: Set of documents and a similarity measure among documents
•  Find: Clusters such that
    –  Documents in one cluster are more similar to one another
    –  Documents in separate clusters are less similar to one another
•  Goal:
    –  Finding a correct set of documents
•  Similarity Measures:
    –  Euclidean distance if attributes are continuous
    –  Other problem-specific measures
         •  e.g., how many words are common in these documents
•  Evaluation: What Is Good Clustering?
    –  Produce high quality clusters with
         •  high intra-class similarity
         •  low inter-class similarity
    –  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
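
The two similarity measures named above can be written down directly. This is a minimal sketch with hypothetical function names; whitespace tokenization is a simplifying assumption.

```python
import math


def euclidean_distance(x, y):
    """Distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def common_word_count(doc_a, doc_b):
    """Number of distinct words that two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
shared = common_word_count("the cat sat", "the dog sat down")
```

Note the opposite directions: a smaller Euclidean distance means more similar documents, while a larger shared-word count does, so a clustering routine has to know which convention its measure follows.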

SEASR

Meandre Workbench

Future Work
•  Enhancements to Semantic Analysis
  –  Use of Ontological Association (WordNet, VerbNet)
  –  Improve co-referencing
  –  Improve fact extraction
•  Visual exploration tools


(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 

Último (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

Text Mining and SEASR

  • 1. Introduction to SEASR and Text Mining
 UIUC/NCSA
 Feb 4, 2009
 Loretta Auvil
 National Center for Supercomputing Applications
 University of Illinois at Urbana-Champaign

  • 3. SEASR: Reach + Relevance + Reuse + Repeatability
 SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems
 –  Semantic-driven environment for SOA interoperability
 –  Encourages sharing and participation for building communities
 –  Modular construction allows flows to be modified and configured to encourage reusability within and across domains
 –  Enables mashup and integration of tools
 –  Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification
 –  Computation can be created for distributed execution on servers where the content lives
 –  User accessibility to control trust and compliance with required copyright license of content
 –  Relies on the standardized Resource Description Framework (RDF) to define components and flows

  • 5. Workbench
 •  Web-based UI
 •  Components and flows are retrieved from server
 •  Additional locations of components and flows can be added to server
 •  Create flow using a graphical drag and drop interface
 •  Change property values
 •  Execute the flow

  • 7. SEASR @ Work – Zotero
 •  Plugin to Firefox
 •  Zotero manages the collection
 •  Launch SEASR Analytics
 –  Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
 –  Zotero Export to Fedora through SEASR
 –  Saves results from SEASR Analytics to a Collection
 •  Launch MONK Processing
 –  MONK DB Ingestion Workflow

  • 8. SEASR @ Work – Fedora
 Interactive Web Application
 Web Service

  • 9. SEASR @ Work – Entity Mash-up
 •  Entity Extraction with OpenNLP
 •  Locations viewed on Google Map
 •  Dates viewed on Simile Timeline

  • 10. SEASR @ Work – Audio Analysis
 •  NEMA: Executes a SEASR flow for each run
 –  Loads audio data
 –  Extracts features for every 10 sec moving window of audio
 –  Loads and applies the models
 –  Sends results back to the WebUI
 •  NESTER: Annotation of Audio via Spectral Analysis

  • 11. SEASR @ Work – MONK
 Executes flows for each analysis requested
 –  Predictive modeling using Naïve Bayes
 –  Predictive modeling using Support Vector Machines (SVM)

  • 12. SEASR @ Work – DISCUS
 •  On-demand usage of analytics while surfing
 –  While navigating, request analytics to be performed on page
 –  Text extraction and cleaning
 •  Summarization and key word extraction
 –  List the important terms on the page being analyzed
 –  Provide relevant short summaries
 •  Visual maps
 –  Provide a visual representation of the key concepts
 –  Show the graph of relations between concepts

  • 14. SEASR Text Analytics Goals
 Address the scholarly text analytics needs by:
 •  Efficiently managing distributed literary and historical textual assets
 •  Structuring extracted information to facilitate knowledge discovery
 •  Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question-answering
 •  Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
 •  Devising algorithms for question answering and inference
 •  Developing a UI for effective visual knowledge discovery, with query logic separate from application logic
 •  Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
 •  Developing an interaction UI for effective visual data exploration
 •  Enabling the text analytics through SEASR components

  • 16. The Zotero + SEASR Picture
 [Diagram; recoverable labels: The WEB, Zotero Store]

  • 20. Some Examples
 •  Authorship Analysis (JUNG network importance algorithms to rank the authors in the citation network)
 •  Author Centrality Analysis
 –  Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
 •  Author Degree Analysis
 –  Uses AuthorDegreeDistributionAnalysis, which ranks each author on the number of coauthors
 •  Author HITS Analysis
 –  The *hubness* of a node is the degree to which a node links to other important authorities. The *authoritativeness* of a node is the degree to which a node is pointed to by important hubs.
 •  Readability
 –  Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
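 The degree-based ranking can be illustrated with a minimal Python sketch. It is not the JUNG API or the SEASR component itself: the function names, the input shape (one author list per paper), and the author names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(papers):
    """Map each author to the number of distinct coauthors."""
    neighbors = defaultdict(set)
    for authors in papers:
        for a in authors:
            neighbors[a]  # touching the key registers solo authors with degree 0
        for a, b in combinations(set(authors), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}


def rank_by_degree(papers):
    """Authors sorted by coauthor count (descending), ties broken by name."""
    degrees = coauthor_degrees(papers)
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical citation collection: each inner list is one paper's authors.
papers = [["Ada", "Ben", "Cy"], ["Ada", "Dee"], ["Dee"]]
ranking = rank_by_degree(papers)
print(ranking)
```

 Betweenness centrality and HITS need the same coauthor graph but rank by path structure and by link structure rather than by raw neighbor counts.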
  • 22. Text Mining Definition
 Many definitions in the literature
 •  “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”
 •  An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
 •  What is “previously unknown” information?
 –  Strict definition
 •  Information that not even the writer knows
 –  Lenient definition
 •  Rediscover the information that the author encoded in the text


  • 23. Text Mining Process
 •  Text Preprocessing
 –  Syntactic Text Analysis
 –  Semantic Text Analysis
 •  Features Generation
 –  Bag of Words
 –  Ngrams
 •  Feature Selection
 –  Simple Counting
 –  Statistics
 –  Selection based on POS
 •  Text/Data Mining
 –  Classification - Supervised Learning
 –  Clustering - Unsupervised Learning
 –  Information Extraction
 •  Analyzing Results
 –  Visual Exploration, Discovery and Knowledge Extraction
 –  Query-based – question answering
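
 As an illustration of the feature generation step, the sketch below builds n-gram counts from whitespace-separated tokens. The function names and the lowercasing and splitting choices are assumptions for this example, not part of SEASR.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


bigrams = ngram_counts("the hot dog stand in the dog park", 2)
unigrams = ngram_counts("the hot dog stand in the dog park", 1)
```

 With n = 1 this reduces to a bag of words; larger n keeps some word order, at the cost of many more dimensions.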

  • 24. Text Characteristics (1)
 •  Large textual database
 –  Enormous wealth of textual information on the Web
 –  Publications are electronic
 •  High dimensionality
 –  Consider each word/phrase as a dimension
 •  Noisy data
 –  Spelling mistakes
 –  Abbreviations
 –  Acronyms
 •  Text messages are very dynamic
 –  Web pages are constantly being generated (removed)
 –  Web pages are generated from database queries
 •  Not well structured text
 –  Email/Chat rooms
 •  “r u available ?”
 •  “Hey whazzzzzz up”
 –  Speech

  • 25. Text Characteristics (2)
 •  Dependency
 –  Relevant information is a complex conjunction of words/phrases
 –  Order of words in the query
 •  hot dog stand in the amusement park
 •  hot amusement stand in the dog park
 •  Ambiguity
 –  Word ambiguity
 •  Pronouns (he, she …)
 •  Synonyms (buy, purchase)
 •  Words with multiple meanings (bat – it is related to baseball or mammal)
 –  Semantic ambiguity
 •  The king saw the rabbit with his glasses. (multiple meanings)
 •  Authority of the source
 –  IBM is more likely to be an authorized source than my distant cousin

  • 26. Text Preprocessing
 •  Syntactic analysis
 –  Tokenization
 –  Lemmatization
 –  POS tagging
 –  Shallow parsing
 –  Custom literary tagging
 •  Semantic analysis
 –  Information Extraction
 –  Named Entity tagging
 –  Semantic Category (unnamed entity) tagging
 –  Co-reference resolution
 –  Ontological association (WordNet, VerbNet)
 –  Semantic Role analysis
 –  Concept-Relation extraction
  • 27. Syntactic Analysis
 •  Tokenization
 –  Text document is represented by the words it contains (and their occurrences)
 –  e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
 –  Highly efficient
 –  Makes learning far simpler and easier
 –  Order of words is not that important for certain applications
 •  Lemmatization/Stemming
 –  Involves the reduction of corpus words to their respective headwords (i.e. lemmas)
 –  Reduce dimensionality
 –  Identifies a word by its root
 –  e.g., flying, flew → fly
 •  Stop words
 –  Identifies the most common words that are unlikely to help with text mining
 –  e.g., “the”, “a”, “an”, “you”
 •  Parsing / Part of Speech (POS) tagging
 –  Generates a parse tree (graph) for each sentence
 –  Each sentence is a stand-alone graph
 –  Find the corresponding POS for each word
 –  e.g., John (noun) gave (verb) the (det) ball (noun)
 •  Shallow Parsing
 –  Analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
 •  Deep Parsing
 –  More sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
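 Tokenization and stop-word removal can be sketched in a few lines of Python. This is a simplified illustration: the regular expression, the function names, and the use of the slide's four example stop words as the whole stop list are assumptions, and no lemmatization is attempted.

```python
import re
from collections import Counter

# Stop list taken from the slide's examples; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "you"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text, stop_words=STOP_WORDS):
    """Count each remaining word after dropping stop words; order is discarded."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


tokens = tokenize("Lord of the rings")
bag = bag_of_words("The Lord of the rings")
```

 The bag keeps only word occurrences, which is what makes the representation cheap to compute and easy to feed to a learner.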
  • 28. Semantic Analysis: Information Extraction
 •  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
 •  Extract the relevant information and ignore non-relevant information (important!)
 •  Link related information and output in a predetermined format

  • 29. Information Extraction
 Information Type: State of the art (Accuracy)
 •  Entities (90-98%): an object of interest such as a person or organization.
 •  Attributes (80%): a property of an entity such as its name, alias, descriptor, or type.
 •  Facts (60-70%): a relationship held between two or more entities such as Position of a Person in a Company.
 •  Events (50-60%): an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction.
 “Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
  • 30. Information Extraction Approaches
 •  Terminology (name) lists
 –  This works very well if the list of names and name expressions is stable and available
 •  Tokenization and morphology
 –  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
 •  Use of characteristic patterns
 –  This works fairly well for novel entities
 –  Rules can be created by hand or learned via machine learning or statistical algorithms
 –  Rules capture local patterns that characterize entities from instances of annotated training data
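
 The first two approaches can be sketched with Python's standard `re` module. This is a hand-written illustration under stated assumptions: the pattern, the function names, and the sample sentence are made up for the example, and the date check only validates day and month ranges.

```python
import re

# DD/MM/YY with basic range checks on day (01-31) and month (01-12).
DATE_PATTERN = re.compile(r"\b(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/([0-9]{2})\b")


def find_dates(text):
    """Return every DD/MM/YY string recognized by its internal format."""
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]


def find_names(text, names):
    """Return (name, offset) for each listed name found in text, in order of appearance."""
    hits = []
    for name in names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((m.start(), name))
    return [(name, start) for start, name in sorted(hits)]


# Hypothetical sentence and name list, for illustration only.
text = "Rex Luthor visited Alderwood on 04/02/09 and 31/12/08, not 45/13/09."
dates = find_dates(text)
names = find_names(text, ["Alderwood", "Rex Luthor"])
```

 A name list finds only what it already contains, which is why the slide reserves characteristic patterns and learned rules for novel entities.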

  • 31. Information Extraction
 Relation (Event) Extraction
 •  Identify (and tag) the relation between two entities:
 –  A person is_located_at a location (news)
 –  A gene codes_for a protein (biology)
 •  Relations require more information
 –  Identification of two entities & their relationship
 –  Predicted relation accuracy
 •  Pr(E1)*Pr(E2)*Pr(R) ~= (.93) * (.93) * (.93) = .80
 •  Information in relations is less local
 –  Contextual information is a problem: right word may not be explicitly present in the sentence
 –  Events involve more relations and are even harder
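
 The predicted relation accuracy on this slide is a plain product of the three component accuracies, which treats the two entity extractions and the relation step as independent. A short check of that arithmetic, with a hypothetical helper name:

```python
from math import prod


def relation_accuracy(*component_accuracies):
    """Product of component accuracies, assuming the components err independently."""
    return prod(component_accuracies)


estimate = relation_accuracy(0.93, 0.93, 0.93)
print(round(estimate, 2))  # 0.8
```

 The same product explains why events, which chain several relations, are expected to be harder still: every extra component multiplies in another factor below 1.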

  • 32. Semantic Analytics
 Named Entity (NE) Tagging
 Mayor Rex Luthor [NE:Person] announced today [NE:Time] the establishment of a new research facility in Alderwood [NE:Location]. It will be known as Boynton Laboratory [NE:Organization].

  • 33. Semantic Analysis
 Semantic Category (unnamed entity, UNE) Tagging
 Mayor Rex Luthor announced today the establishment of a new research facility [UNE:Organization] in Alderwood. It will be known as Boynton Laboratory.

  • 34. Semantic Analysis
 Co-reference Resolution for entities and unnamed entities
 Mayor Rex Luthor announced today the establishment of a new research facility [UNE:Organization] in Alderwood. It will be known as Boynton Laboratory.

  • 35. Semantic Analysis
 Semantic Role Analysis
 Mayor Rex Luthor [ACTOR] announced [ACTION] today [WHEN] the establishment of a new research facility [OBJECT] in Alderwood [WHERE]. It [OBJECT] will be known as [ACTION] Boynton Laboratory [COMPL].
  • 36. Semantic Analysis
 Concept-Relation Extraction
 [Concept-relation graph; recoverable nodes: today (time), Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), Alderwood (location); recoverable edge labels: time (when), actor (who), object (what), loc (where)]
  • 38. Template Extraction
 Source text:
 (c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson …
 The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.
 ``The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited.'' …
 Extracted entities:
 <Facility>Finsbury Park Mosque</Facility>
 <Country>England</Country>
 <Country>France</Country>
 <Country>England</Country>
 <Country>Belgium</Country>
 <Country>United States</Country>
 <Person>Abu Hamza al-Masri</Person>
 <City>London</City>
 Extracted templates:
 <PersonPositionOrganization>
 <OFFLEN OFFSET="3576" LENGTH="33" />
 <Person>Abu Hamza al-Masri</Person>
 <Position>chief cleric</Position>
 <Organization>Finsbury Park Mosque</Organization>
 </PersonPositionOrganization>
 <PersonArrest>
 <OFFLEN OFFSET="3814" LENGTH="61" />
 <Person>Abu Hamza al-Masri</Person>
 <Location>London</Location>
 <Date>1999</Date>
 <Reason>his alleged involvement in a Yemen bomb plot</Reason>
 </PersonArrest>
  • 39. Streaming Text: Knowledge Extraction
 •  Leveraging some earlier work on information extraction from text streams
 •  Information extraction
 –  Process of using advanced automated machine learning approaches
 –  to identify entities in text documents
 –  to extract this information along with the relationships these entities may have in the text documents
 The visualization on this slide demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
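
 The co-occurrence rule in the caption can be sketched as follows. This is an assumption-laden illustration rather than the streaming component itself: entities are taken to be single tokens that have already been identified, the window is measured as a difference in token positions, and the function name and sample sentence are hypothetical.

```python
from collections import Counter


def cooccurrences(tokens, entities, window):
    """Count unordered entity pairs whose positions differ by at most `window` tokens."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in entities]
    pairs = Counter()
    for idx, (i, a) in enumerate(positions):
        for j, b in positions[idx + 1:]:
            if j - i > window:
                break  # positions are sorted, so later entities are even farther away
            if a != b:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs


tokens = "Luthor opened a lab in Alderwood while Boynton stayed away from Alderwood".split()
relations = cooccurrences(tokens, {"Luthor", "Alderwood", "Boynton"}, window=5)
```

 A wider window links more entities but also adds more accidental pairs, so the window size is a tuning choice.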

  • 40. Semantic Analysis
 •  Word Sense Disambiguation
 –  Context based or proximity based
 –  Very accurate

  • 41. Ontological Association (WordNet)
 •  WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
 •  Search for dog
 –  n dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
 –  n frump, dog (a dull unattractive unpleasant girl or woman)
 –  n dog (informal term for a man)
 –  n cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
 –  n frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
 –  n pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
 –  n andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
 –  v chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)

  • 42. Feature Selection
 •  Reduce Dimensionality
 –  Learners have difficulty addressing tasks with high dimensionality
 •  Irrelevant Features
 –  Not all features help!
 –  Remove features that occur in only a few documents
 –  Reduce features that occur in too many documents
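
 One simple way to apply both rules is to filter terms by document frequency. The sketch below drops terms on either side of two thresholds; the function name and the threshold values are assumptions, and dropping overly common terms outright is only one reading of "reduce" (down-weighting them is another).

```python
from collections import Counter


def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most max_df_ratio of them."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    limit = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= limit)


# Hypothetical tokenized documents.
docs = [
    ["the", "king", "saw", "rabbit"],
    ["the", "rabbit", "ran"],
    ["the", "king", "slept"],
    ["the", "bat", "flew"],
]
kept = select_features(docs)
```

 Here "the" is removed for appearing in every document, and the words seen only once are removed as too rare to generalize from.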

  • 43. Text Mining: General Application Areas
 •  Information Retrieval
 –  Indexing and retrieval of textual documents
 –  Finding a set of (ranked) documents that are relevant to the query
 •  Information Extraction
 –  Extraction of partial knowledge in the text
 •  Web Mining
 –  Indexing and retrieval of textual documents and extraction of partial knowledge using the web
 •  Classification
 –  Predict a class for each text document
 •  Clustering
 –  Generating collections of similar text documents

  • 44. Text Mining: Supervised vs. Unsupervised
 •  Supervised learning (Classification)
 –  Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
 –  Split into training data and test data for model building process
 –  New data is classified based on the model built with the training data
 –  Techniques
 •  Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
 •  Unsupervised learning (Clustering)
 –  Class labels of training data are unknown
 –  Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
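
 As a concrete instance of the supervised case, here is a minimal multinomial Naive Bayes classifier in plain Python. It is a teaching sketch, not the MONK or SEASR implementation: the class name, the add-one smoothing, and the tiny training set are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled training data; new documents are classified with the fitted model.
train_docs = [["goal", "match", "team"], ["team", "score", "win"],
              ["police", "arrest", "court"], ["court", "trial", "police"]]
train_labels = ["sports", "sports", "crime", "crime"]
model = NaiveBayes().fit(train_docs, train_labels)
```

 Calling `model.predict(["team", "win", "match"])` scores each class by its log prior plus the smoothed log likelihood of the words and returns the higher-scoring label.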

  • 49. Text Mining: Themescape and ThemeRiver
 Visualizing Relationships Between Documents
 •  Images from Pacific Northwest Laboratory
  • 51. Text Mining: Applications
 •  Email: Spam filtering
 •  News Feeds: Discover what is interesting
 •  Medical: Identify relationships and link information from different medical fields
 •  Homeland Security
 •  Marketing: Discover distinct groups of potential buyers and make suggestions for other products
 •  Industry: Identifying groups of competitors' web pages
 •  Job Seeking: Identify parameters in searching for jobs

  • 52. Text Mining: Classification Definition
 •  Given: Collection of labeled records
 –  Each record contains a set of features (attributes), and the true class (label)
 –  Create a training set to build the model
 –  Create a testing set to test the model
 •  Find: Model for the class as a function of the values of the features
 •  Goal: Assign a class (as accurately as possible) to previously unseen records
 •  Evaluation: What Is Good Classification?
 –  Correct classification
 •  Known label of test example is identical to the predicted class from the model
 –  Accuracy ratio
 •  Percent of test set examples that are correctly classified by the model
 –  Distance measure between classes can be used
 •  e.g., classifying “football” document as a “basketball” document is not as bad as classifying it as “crime”
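
 Both evaluation ideas fit in a few lines of Python. The accuracy function follows the definition above directly; the class-distance table is entirely hypothetical, chosen only to show how a "football" versus "basketball" mistake can be penalized less than a "football" versus "crime" mistake.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test examples whose predicted class equals the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical class distances: the two sports are close, "crime" is far from both.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}


def mean_cost(true_labels, predicted_labels, distance=CLASS_DISTANCE):
    """Average distance between known and predicted class (0 for a correct prediction)."""
    costs = [0.0 if t == p else distance[frozenset([t, p])]
             for t, p in zip(true_labels, predicted_labels)]
    return sum(costs) / len(costs)


truth = ["football", "football", "crime", "basketball"]
predicted = ["football", "basketball", "crime", "crime"]
acc = accuracy(truth, predicted)
cost = mean_cost(truth, predicted)
```

 Two models with the same accuracy can have different mean cost, which is the point of adding a distance measure between classes.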

  • 53. Text Mining: Clustering Definition
 •  Given: Set of documents and a similarity measure among documents
 •  Find: Clusters such that
 –  Documents in one cluster are more similar to one another
 –  Documents in separate clusters are less similar to one another
 •  Goal:
 –  Finding a correct set of documents
 •  Similarity Measures:
 –  Euclidean distance if attributes are continuous
 –  Other problem-specific measures
 •  e.g., how many words are common in these documents
 •  Evaluation: What Is Good Clustering?
 –  Produce high quality clusters with
 •  high intra-class similarity
 •  low inter-class similarity
 –  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
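
 The two kinds of similarity measure named above can be sketched as follows. The function names are assumptions; the Jaccard ratio is an addition not mentioned on the slide, included only as one common way to normalize a shared-word count by document size.

```python
import math


def euclidean_distance(x, y):
    """Distance between two equal-length numeric feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def shared_words(doc_a, doc_b):
    """Number of distinct words the two token lists have in common."""
    return len(set(doc_a) & set(doc_b))


def jaccard_similarity(doc_a, doc_b):
    """Shared distinct words divided by all distinct words (0.0 for two empty documents)."""
    union = set(doc_a) | set(doc_b)
    return len(set(doc_a) & set(doc_b)) / len(union) if union else 0.0


doc_a = "hot dog stand".split()
doc_b = "dog park stand".split()
```

 Euclidean distance is a dissimilarity (smaller means closer), while the word-overlap measures are similarities (larger means closer), so a clustering routine has to be told which direction it is working in.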

  • 55. Future Work
 •  Enhancements to Semantic Analysis
 –  Use of Ontological Association (WordNet, VerbNet)
 –  Improve co-referencing
 –  Improve fact extraction
 •  Visual exploration tools