Information Extraction
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com
Text  Mining  Course	
•  1) Introduction to Text Mining
•  2) Introduction to NLP
•  3) Named Entity Recognition and Disambiguation
•  4) Opinion Mining and Sentiment Analysis
•  5) Information Extraction
•  6) NewsReader and Visualisation
•  7) Guest Lecture and Q&A
Outline	
1.  What is Information Extraction
2.  Main goals of Information Extraction
3.  Information Extraction Tasks and Subtasks
4.  MUC conferences
5.  Main domains of Information Extraction
6.  Methods for Information Extraction
o  Cascaded finite-state transducers
o  Regular expressions and patterns
o  Supervised learning approaches
o  Weakly supervised and unsupervised approaches
7.  How far we are with IE
What  is  IE?	
•  Late 1970s within NLP field
•  Automatically find and extract limited, relevant
parts of texts
•  Merge information from many pieces of text
What  is  IE?	
•  Quite often in specialized domains
•  Move from unstructured/semi-structured data to
structured data
o  Schemas
o  Relations (as a database)
o  Knowledge base
o  RDF triples
What  is  IE?	
Unstructured text
•  Natural language sentences
•  Historically, NLP systems have been designed to process this type of data
•  The meaning → linguistic analysis and natural language understanding
What  is  IE?	
Semi-structured text
•  The physical layout helps the interpretation
•  Processing halfway between linguistic features ↔ positional features
What  is  IE?
Main  goals  of  IE	
•  Fill a predefined “template” from raw text
•  Extract who did what to whom and when?
o  Event extraction
•  Organize information so that it is useful to people
•  Put information in a form that allows further
inferences by computers
o  Big data
IE.  Task  &  Subtasks	
•  Named Entity Recognition
o  Detection → Mr. Smith eats bitterballen  [Mr. Smith] : ENTITY
o  Classification → Mr. Smith eats bitterballen  [Mr. Smith] : PERSON
•  Event extraction
o  The thief broke the door with a hammer
•  CAUSE_HARM → Verb: break
Agent: the thief
Patient: the door
Instrument: a hammer
•  Coreference resolution
o  [Mr. Smith] eats bitterballen. Besides this, [he] only drinks Belgian beer.
IE.  Task  &  Subtasks	
•  Relationship extraction
o  Bill works for IBM → PERSON works for ORGANISATION
•  Terminology extraction
o  Finding relevant single- or multi-word terms in a given corpus
•  Some concrete examples
o  Extracting earnings, profits, board members, headquarters from company
reports
o  Searching on the WWW for e-mails for advertising (spamming)
o  Learn drug-gene product interactions from biomedical research papers
IE  Tasks  &  Subtasks	
•  Apple mail
MUC  conferences	
•  Message Understanding Conference (MUC), held
between 1987 and 1998.
•  Domain specific texts + training examples + template
definition
•  Precision, Recall and F1 as evaluation
•  Domains
o  MUC-1 (1987), MUC-2 (1989): Naval operations messages.
o  MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
o  MUC-5 (1993): Joint ventures and microelectronics domain.
o  MUC-6 (1995): News articles on management changes.
o  MUC-7 (1998): Satellite launch reports.
MUC  conferences	
Bridgestone  Sports  Co.  said  Friday  it  has  set  up  a  joint  venture  in  
Taiwan  with  a  local  concern  and  a  Japanese  trading  house  to  produce  
golf  clubs  to  be  shipped  to  Japan.	
The  joint  venture,  Bridgestone  Sports  Taiwan  Co.,  capitalized  at  20  
million  new  Taiwan  dollars,  will  start  production  in  January  1990  with  
production  of  20,000  iron  and  “metal  wood”  clubs  a  month.	
Example  from  MUC5
Main  domains  of  IE	
•  Terrorist events
•  Joint ventures
•  Plane crashes
•  Disease outbreaks
•  Seminar announcements
•  Biological and medical domain
Outline	
1.  What is Information Extraction
2.  Main goals of Information Extraction
3.  Information Extraction Tasks and Subtasks
4.  MUC conferences
5.  Main domains of Information Extraction
6.  Methods for Information Extraction
o  Cascaded finite-state transducers
o  Regular expressions and patterns
o  Supervised learning approaches
o  Weakly supervised and unsupervised approaches
7.  How far we are with IE
Methods  for  IE	
•  Cascaded finite-state transducers
o  Rule based
o  Regular expressions
•  Learning based approaches
o  Traditional classifiers
•  Naïve Bayes, MaxEnt, SVM …
o  Sequence label models
•  HMM, CMM, CRF
•  Unsupervised approaches
•  Hybrid approaches
Cascaded finite-state
transducers	
•  Emerging idea from MUC participants and
approaches
•  Decompose the task into small sub-tasks
•  One element is read at a time from a sequence
o  Depending on the type, a certain transition is produced in the automaton
to a new state
o  Some states are considered final (the input matches a certain pattern)
•  Can be defined as a regular expression
Cascaded finite-state
transducers	
Finite  Automaton  for  noun  groups	
=>  John’s  interesting  book  with  a  nice  cover
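A minimal sketch of such an automaton in Python, assuming a tiny illustrative POS tag set (DET, POSS, ADJ, NOUN) rather than the exact states of the slide's diagram:

```python
# A toy finite automaton over POS tags that accepts simple noun groups such as
# "John's interesting book" (POSS ADJ NOUN) or "a nice cover" (DET ADJ NOUN).
# States and tag set are illustrative, not the lecture's.
TRANSITIONS = {
    ("START", "DET"): "MOD",     # a, the
    ("START", "POSS"): "MOD",    # John's
    ("START", "ADJ"): "MOD",
    ("START", "NOUN"): "FINAL",
    ("MOD", "ADJ"): "MOD",       # any number of adjectives
    ("MOD", "NOUN"): "FINAL",
    ("FINAL", "NOUN"): "FINAL",  # noun-noun compounds
}

def is_noun_group(pos_tags):
    """True if the whole tag sequence is accepted (ends in a final state)."""
    state = "START"
    for tag in pos_tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:
            return False
    return state == "FINAL"

print(is_noun_group(["POSS", "ADJ", "NOUN"]))  # John's interesting book -> True
print(is_noun_group(["DET", "ADJ", "NOUN"]))   # a nice cover -> True
print(is_noun_group(["ADJ", "DET", "NOUN"]))   # -> False
```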
Cascaded finite-state
transducers	
•  Earlier stages recognize smaller linguistic objects
o  Usually domain independent
•  Later stages build on top of the previous ones
o  Usually domain dependent
•  Typical IE systems
1.  Complex words
2.  Basic phrases
3.  Complex phrases
4.  Domain events
5.  Merging structures
Cascaded finite-state
transducers	
•  Complex words
o  Multiwords: “set up” “trading house”
o  NE: “Bridgestone Sports Co”
•  Basic Phrases
o  Syntactic chunking
•  Noun groups (head noun + all modifiers)
•  Verb groups
Cascaded finite-state
transducers
Cascaded finite-state
transducers	
•  Complex phrases
o  Complex noun and verb groups on the basis of syntactic information
•  The attachment of appositives to their head noun group
o  “The joint venture, Bridgestone Sports Taiwan Co.,”
•  The construction of measure phrases
o  “20,000 iron and ‘metal wood’ clubs a month”
Cascaded finite-state
transducers	
•  Domain events
o  Recognize events and match with “fillers” detected in previous steps
o  Requires domain specific patterns
•  To recognize phrases of interest
•  To define what the roles are
o  Patterns can also be defined as finite-state machines or regular
expressions
•  <Company/ies><Set-up><Joint-Venture> with <Company/ies>
•  <Company><Capitalized> at <Currency>
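As an illustration (not the lecture's actual implementation), a toy pattern of this kind can be run over the chunk types produced by the earlier stages; the chunk sequence and labels below are hand-made assumptions:

```python
# Hand-made chunk sequence approximating the MUC-5 Bridgestone sentence,
# as it might look after the "complex words" and "basic phrases" stages.
chunks = [
    ("COMPANY", "Bridgestone Sports Co."),
    ("VG", "said"),
    ("OTHER", "Friday it"),
    ("SET-UP", "has set up"),
    ("JOINT-VENTURE", "a joint venture"),
    ("OTHER", "in Taiwan with"),
    ("COMPANY", "a local concern and a Japanese trading house"),
]

def match_tie_up(chunks):
    """Toy version of <Company> ... <Set-up> <Joint-Venture> ... <Company>."""
    types = [t for t, _ in chunks]
    for i, t in enumerate(types):
        if t == "SET-UP" and i + 1 < len(types) and types[i + 1] == "JOINT-VENTURE":
            before = [text for t2, text in chunks[:i] if t2 == "COMPANY"]
            after = [text for t2, text in chunks[i + 2:] if t2 == "COMPANY"]
            if before and after:
                return {"event": "TIE-UP", "entity_1": before[0], "entity_2": after[0]}
    return None

print(match_tie_up(chunks))
# {'event': 'TIE-UP', 'entity_1': 'Bridgestone Sports Co.',
#  'entity_2': 'a local concern and a Japanese trading house'}
```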
Cascaded finite-state
transducers
Regular  Expressions	
•  1950s: Stephen Kleene
•  A string pattern that describes/matches a set of
strings
•  A regular expression consists of:
o  Characters
o  Operation symbols
•  Boolean (and/or)
•  Grouping (for defining scopes)
•  Quantification
Regular  Expressions	
Character    Description
a            The character a
.            Any single character
[abc]        Any character in the brackets (OR): 'a' or 'b' or 'c'
[^abc]       Any character not in the brackets: any symbol that is not 'a', 'b' or 'c'
*            Quantifier. Matches the preceding element ZERO or more times
+            Quantifier. Matches the preceding element ONE or more times
?            Matches the previous element zero or one time
|            Choice (OR). Matches one of the expressions (before or after the |)
Regular  Expressions	
①  .at → ???
Regular  Expressions	
①  .at → hat cat bat xat …
②  [hc]at → hat cat
③  [^b]at → all matched by .at except “bat”
④  [^hc]at → all matched by .at except “hat” and “cat”
⑤  s.* → s sssss ssbsd2ck3e
Regular  Expressions	
①  .at → hat cat bat xat …
②  [hc]at → hat cat
③  [^b]at → all matched by .at except “bat”
④  [^hc]at → all matched by .at except “hat” and “cat”
⑤  s.* → s sssss ssbsd2ck3e
⑥  [hc]*at → hat cat hhat chat cchhat at …
⑦  cat|dog → cat dog
⑧  ….
⑨  ….
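The same examples can be checked quickly with Python's re module; a minimal sanity-check script (not part of the original slides):

```python
import re

# Quick check of the patterns above against a small word list.
words = ["hat", "cat", "bat", "xat", "at", "chat", "dog"]

print([w for w in words if re.fullmatch(r".at", w)])      # ['hat', 'cat', 'bat', 'xat']
print([w for w in words if re.fullmatch(r"[hc]at", w)])   # ['hat', 'cat']
print([w for w in words if re.fullmatch(r"[^b]at", w)])   # ['hat', 'cat', 'xat']
print([w for w in words if re.fullmatch(r"[hc]*at", w)])  # ['hat', 'cat', 'at', 'chat']
print([w for w in words if re.fullmatch(r"cat|dog", w)])  # ['cat', 'dog']
```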
Using  Regular  
Expressions	
•  Typically extracting information from automatically
generated webpages is easy
o  Wikipedia
•  To know the country for a given city
o  Amazon webpage
•  From a list of hits
o  Weather forecast webpages
o  DBpedia
Using  Regular  
Expressions
Using  Regular  
Expressions	
•  Some “unstructured” pieces of information keep
some structure and are easy to capture by means
of regular expressions
o  Phone numbers
o  What else?
o  …
o  ...
Using  Regular  
Expressions	
•  Some “unstructured” pieces of information keep
some structure and are easy to capture by means
of regular expressions
o  Phone numbers
o  E-mails
o  URL Websites
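A few deliberately simplified sketches of such patterns in Python; real phone number, e-mail and URL formats are far messier, so treat these as illustrations only:

```python
import re

# Illustrative (simplified) patterns; not production-quality validators.
PHONE = re.compile(r"\+?\d{1,3}[ -]?\(?\d{1,4}\)?[ -]?\d{3}[ -]?\d{2,4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL   = re.compile(r"https?://[^\s]+")

text = ("Contact me at +31 20 598 9898 or ruben.izquierdobevia@vu.nl, "
        "or see http://rubenizquierdobevia.com for details.")

print(PHONE.findall(text))   # ['+31 20 598 9898']
print(EMAIL.findall(text))   # ['ruben.izquierdobevia@vu.nl']
print(URL.findall(text))     # ['http://rubenizquierdobevia.com']
```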
Using  Regular  
Expressions	
•  Also to detect relations and fill events
•  Higher level regular expressions make use of
“objects” detected by lower level patterns
•  Some NLP information may help (pos tags, phrases,
semantic word categories)
o  Crime-Victim can use things matched by “noun-group”
•  Prefiller: [pos: V, type-of-verb: KILL] WordNet MCR
•  Filler: [phrase: NOUN-GROUP]
Using  Regular  
Expressions	
•  Extracting relations between entities
o  Which PERSON holds what POSITION in what ORGANIZATION
•  [PER], [POSITION] of [ORG]
Entities:
  PER: Jose Mourinho
  POSITION: trainer
  ORG: Chelsea
Relation:
  (Jose Mourinho, trainer, Chelsea)
Using  Regular  
Expressions	
•  Extracting relations between entities
o  Which PERSON holds what POSITION in what ORGANIZATION
•  [PER], [POSITION ] of [ORG]
•  [ORG] (named, appointed,…) [PER] Prep [POSITION]
o  Nokia has appointed Rajeev Suri as President
o  Where an ORGANIZATION is located
•  [ORG] headquarters in [LOC]
o  NATO headquarters in Brussels
•  [ORG][LOC] (division, branch, headquarters…)
o  KFOR Kosovo headquarters
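A minimal sketch of such relation patterns in Python, assuming an earlier NER step has already wrapped mentions in bracketed tags; the tag format and the small POSITION trigger list are illustrative assumptions:

```python
import re

# Assumed input format: mentions pre-tagged as [PER:...], [ORG:...], [LOC:...].
PER = r"\[PER:(?P<per>[^\]]+)\]"
ORG = r"\[ORG:(?P<org>[^\]]+)\]"
LOC = r"\[LOC:(?P<loc>[^\]]+)\]"
POSITION = r"(?P<pos>trainer|president|chairman|CEO)"  # illustrative trigger list

# "[PER], [POSITION] of [ORG]"
HOLDS_POSITION = re.compile(PER + r",\s+" + POSITION + r"\s+of\s+" + ORG, re.I)
# "[ORG] headquarters in [LOC]"
LOCATED_IN = re.compile(ORG + r"\s+headquarters\s+in\s+" + LOC, re.I)

s1 = "[PER:Jose Mourinho], trainer of [ORG:Chelsea], gave a press conference."
s2 = "[ORG:NATO] headquarters in [LOC:Brussels] hosted the meeting."

m = HOLDS_POSITION.search(s1)
print(m.group("per"), "/", m.group("pos"), "/", m.group("org"))  # Jose Mourinho / trainer / Chelsea
m = LOCATED_IN.search(s2)
print(m.group("org"), "located in", m.group("loc"))              # NATO located in Brussels
```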
Extracting  relations  with  
patterns
•  Hearst 1992
•  What does Gelidium mean?
•  “Αγαρ ισ α συβστανχε πρεπαρεδ φροµ α µιξτυρε οφ ρεδ αλγαε, συχη ασ
Gelidium, φορ λαβορατορψ ορ ινδυστριαλ υσε”
Extracting  relations  with  
patterns
•  Hearst 1992
•  What does Gelidium mean?
•  “Agar is a substance prepared from a mixture of red
algae, such as Gelidium, for laboratory or industrial
use”
•  How do you know?
Extracting  relations  with  
patterns
•  Hearst 1992: Automatic Acquisition of Hyponyms (IS-A)
X à Gelidium (sub-type) Y à red algae (super-type)
X à IS-A à Y
•  “Y such as X”
•  “Y, such as X”
•  “X or other Y”
•  “X and other Y”
•  “Y including X”
•  ….
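A rough sketch of Hearst-style hyponym extraction with these patterns; the crude three-word noun-phrase regex and the stopword trimming are simplifying assumptions (a real system would use chunked noun groups):

```python
import re

# Each Hearst pattern yields (hyponym X, hypernym Y) pairs, i.e. X IS-A Y.
# NP is a deliberately crude "up to three words" matcher.
NP = r"[A-Za-z][\w-]*(?:\s+[\w-]+){0,2}"
PATTERNS = [
    re.compile(r"(?P<y>%s),?\s+such\s+as\s+(?P<x>%s)" % (NP, NP)),        # "Y(,) such as X"
    re.compile(r"(?P<x>%s)\s+(?:and|or)\s+other\s+(?P<y>%s)" % (NP, NP)), # "X and/or other Y"
    re.compile(r"(?P<y>%s)\s+including\s+(?P<x>%s)" % (NP, NP)),          # "Y including X"
]
STOP = {"a", "an", "the", "of", "from", "such"}

def trim(np):
    """Drop leading stopwords that the crude NP regex picks up."""
    words = np.split()
    while words and words[0].lower() in STOP:
        words.pop(0)
    return " ".join(words)

def extract_isa(sentence):
    return [(trim(m.group("x")), trim(m.group("y")))
            for p in PATTERNS for m in p.finditer(sentence)]

s = "Agar is a substance prepared from a mixture of red algae, such as Gelidium."
print(extract_isa(s))   # [('Gelidium', 'red algae')]
```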
Extracting  relations  with  
patterns
Hand-built patterns
•  Positive
o  Tend to be high-precision
o  Can be adapted to specific domains
•  Negative
o  Hand-built patterns are usually low-recall
o  A lot of work to think of all possible patterns
o  Need to create a lot of patterns for every relation
Learning-based
Approaches	
•  Statistical techniques and machine learning
algorithms
o  Automatically learn patterns and models for new domains
•  Some types
o  Supervised learning of patterns and rules
o  Supervised Learning for relation extraction
o  Supervised learning of Sequential Classifier Methods
o  Weakly supervised and unsupervised
Supervised  Learning  of  
Patterns and Rules
•  Aiming to reduce the knowledge engineering
bottleneck of creating an IE system for a new domain
•  AutoSlog and PALKA → first IE pattern learning
systems
o  AutoSlog: syntactic templates, lexico-syntactic patterns and manual
review
•  Learning algorithms → generate rules from
annotated text
o  LIEP (Huffman 1996): syntactic paths, role fillers. Patterns that work well in
training are kept
o  (LP)2 uses tagging rules and correction rules
Supervised  Learning  of  
Patterns and Rules
•  Relational learning methods
o  RAPIER: rules for pre-filler, filler, and post-filler component. Each
component is a pattern that consists of words, POS tags, and semantic
classes.
Supervised  Learning  for  
relation  extraction  (I)	
•  Design a supervised machine learning framework
•  Decide what relations we are interested in
•  Choose what entities are relevant
•  Find (or create) labeled data
o  Representative corpus
o  Label the entities in the corpus (Automatic NER)
o  Hand-label relations between these entities
o  Split into train + dev + test
•  Train, improve and evaluate
Supervised  Learning  for  
relation  extraction  (II)	
•  Relation extraction as a classification problem
•  2 classifiers
o  To decide if two entities are related
o  To decide the class for a pair of related entities
•  Why 2?
o  Faster training by eliminating most pairs
o  Appropriate feature sets for each task
•  Find all pairs of NE (restricted to the sentence)
o  For every pair
1.  Are the two entities related? (classifier 1)
•  No → END
•  Yes → guess the class (classifier 2); a sketch of this two-step set-up follows below
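A minimal sketch of the two-classifier set-up, assuming scikit-learn; extract_features is a stand-in for the feature extractor described on the next slides (only head words here), and train/predict show where each classifier is used:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def extract_features(pair):
    """Placeholder feature builder; a fuller sketch appears after the feature slides.
    pair = (tokens, (m1_start, m1_end), (m2_start, m2_end))."""
    tokens, m1, m2 = pair
    return {"head_m1": tokens[m1[1] - 1], "head_m2": tokens[m2[1] - 1]}

# Classifier 1: related vs. not related (binary).
related_clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# Classifier 2: relation type, trained only on related pairs.
type_clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))

def train(pairs, labels):
    """labels[i] is None for unrelated pairs, otherwise the relation type."""
    feats = [extract_features(p) for p in pairs]
    related_clf.fit(feats, [lab is not None for lab in labels])
    related = [(f, lab) for f, lab in zip(feats, labels) if lab is not None]
    type_clf.fit([f for f, _ in related], [lab for _, lab in related])

def predict(pair):
    feats = extract_features(pair)
    if not related_clf.predict([feats])[0]:   # step 1: unrelated -> stop
        return None
    return type_clf.predict([feats])[0]       # step 2: guess the class
```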
Supervised  Learning  for  
relation  extraction  (III)	
•  Are the two entities related?
•  What is the type of relation?
Supervised  Learning  for  
relation  extraction  (IV)	
“[American Airlines], a unit of AMR, immediately
matched the move, spokesman [Tim Wagner] said”
•  What features?
o  Head words of entity mentions and combination
•  Airlines Wagner Airlines-Wagner
o  Bag-of-words in the two entity mentions
•  American, Airlines, Tim, Wagner, American Airlines, Tim Wagner
o  Words/bigrams in particular positions to the left and right
•  M2#-1: spokesman M2#+1: said
o  Bag-of-words (or bigrams) between the 2 mentions
•  a, AMR, of, immediately, matched, move, spokesman, the, unit
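A sketch of some of these lexical features in Python, for a tokenized sentence plus the (start, end) token spans of the two mentions; the feature names are illustrative, not a fixed inventory:

```python
def extract_features(tokens, m1, m2):
    """m1, m2 are (start, end) token spans of the two entity mentions."""
    s1, e1 = m1
    s2, e2 = m2
    f = {}
    f["head_m1"] = tokens[e1 - 1]                       # e.g. "Airlines"
    f["head_m2"] = tokens[e2 - 1]                       # e.g. "Wagner"
    f["head_pair"] = f["head_m1"] + "-" + f["head_m2"]  # "Airlines-Wagner"
    for w in tokens[s1:e1] + tokens[s2:e2]:             # bag of words in the two mentions
        f["bow_mention=" + w] = 1
    if s2 > 0:
        f["m2_left"] = tokens[s2 - 1]                   # word before mention 2: "spokesman"
    if e2 < len(tokens):
        f["m2_right"] = tokens[e2]                      # word after mention 2: "said"
    for w in tokens[e1:s2]:                             # bag of words between the mentions
        f["bow_between=" + w] = 1
    return f

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
print(extract_features(tokens, (0, 2), (14, 16)))
```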
Supervised  Learning  for  
relation  extraction  (V)	
“[American Airlines], a unit of AMR, immediately
matched the move, spokesman [Tim Wagner] said”
•  What features?
o  Named entity types
•  M1: ORG M2: PERSON
o  Entity level (Name, Nominal (NP), Pronoun)
•  M1: NAME (“it” or “he” would be PRONOUN)
•  M2: NAME (“the company” would be NOMINAL)
o  Basic chunk sequence from one entity to the other
•  NP NP PP VP NP NP
o  Constituency path on the parse tree
•  NP ↑ NP ↑ S ↑ S ↓ NP
Supervised  Learning  for  
relation  extraction  (VI)	
“[American Airlines], a unit of AMR, immediately
matched the move, spokesman [Tim Wagner] said”
•  What features?
•  Trigger lists
o  For family → parent, wife, husband… (WordNet)
•  Gazetteers
o  List of countries…
•  ….
•  ….
•  …
Supervised  Learning  for  
relation  extraction  (VII)	
•  Decide your algorithm
o  MaxEnt, Naïve Bayes, SVM
•  Train the system on the training data
•  Tune it on the dev set
•  Test on the held-out test set
o  Traditional Precision, Recall and F-score
Sequential  Classifier  
Methods	
•  IE as a classification problem using sequential
learning models.
•  A classifier is induced from annotated data to
sequentially scan a text from left to right and
decide what piece of text must be extracted or not
•  Decide what you want to extract
•  Represent the annotated data in a proper way
Sequential  Classifier  
Methods
Sequential  Classifier  
Methods	
•  Typical steps for training
o  Get the annotated training data
o  Represent the data in IOB
o  Design feature extractors
o  Decide the algorithm to use
o  Train the models
•  Testing steps
o  Get the test documents
o  Extract features
o  Run the sequence models
o  Extract the recognized entities
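A small sketch of the IOB encoding step mentioned in the list above; entities are given as token spans, and the spans and types below are made up for the example:

```python
# IOB (inside/outside/begin) encoding: one label per token.
def to_iob(tokens, entities):
    """entities: list of (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return list(zip(tokens, tags))

tokens = "Mr. Smith works for IBM in Amsterdam".split()
entities = [(0, 2, "PER"), (4, 5, "ORG"), (6, 7, "LOC")]
for token, tag in to_iob(tokens, entities):
    print(token, tag)
# Mr. B-PER / Smith I-PER / works O / for O / IBM B-ORG / in O / Amsterdam B-LOC
```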
Sequential  Classifier  
Methods	
•  Algorithms
o  HMM
o  CMM
o  CRF
•  Features
o  Words (current, previous, next)
o  Other linguistic information (PoS, chunks…)
o  Task specific features (NER…)
•  Word shapes: abstract representation for words
Sequential  Classifier  
Methods	
•  Algorithms
o  HMM
o  SVM
o  CRF
•  Features
o  Words (current, previous, next)
o  Other linguistic information (PoS, chunks…)
o  Task specific features (NER…)
•  Word shapes: abstract representation for words
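One common way to build word shapes is to map character classes and optionally collapse repeats; a minimal sketch:

```python
import re

# Map uppercase -> X, lowercase -> x, digits -> d; optionally collapse runs.
def word_shape(word, collapse=True):
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    if collapse:
        shape = re.sub(r"(.)\1+", r"\1", shape)   # e.g. "Xxxxx" -> "Xx"
    return shape

print(word_shape("Bridgestone"))   # Xx
print(word_shape("MUC-7"))         # X-d
print(word_shape("20,000"))        # d,d
```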
Weakly  supervised  and  
unsupervised  	
•  Manual annotation is also “expensive”
o  IE is quite domain specific → hard to reuse
•  AutoSlog-Ts:
o  Just needs 2 sets of documents: relevant/irrelevant
o  Syntactic templates + relevance according to relevant set
•  Ex-Disco (Yangarber et al. 2000)
o  No need for a preclassified corpus
o  They use a small set of patterns to decide relevant/irrelevant
Weakly  supervised  and  
unsupervised  	
•  OpeNER:
•  European project dealing with entity recognition,
sentiment analysis and opinion mining mainly in
hotel reviews (also restaurants, attractions, news)
•  Double propagation
o  Method to automatically gather opinion words and targets
•  From a large raw hotel corpus
•  Providing a set of seeds and patterns
Weakly  supervised  and  
unsupervised  	
•  Seed list
•  + → good, nice
•  - → bad, ugly
•  Patterns
•  a [EXP] [TAR]
•  the [EXP] [TAR]
•  Polarity patterns
•  = (same polarity): [EXP] and [EXP]    [EXP], [EXP]
•  ! (opposite polarity): [EXP] but [EXP]
Weakly  supervised  and  
unsupervised  	
•  Propagation method
o  1) Get new targets using the seed expressions and the
patterns
•  a nice [TAR]    a bad [TAR]    the ugly [TAR]
•  Output → new targets (hotel, room, location)
o  2) Get new expressions using the previous targets and the
patterns
•  a [EXP] hotel    the [EXP] location
•  Output → new expressions (expensive, cozy, perfect…)
o  Keep running 1 and 2 to get new EXP and TAR (a toy sketch follows below)
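A toy sketch of this propagation loop on a hand-made corpus; the phrases, seeds, and fixed number of iterations are all illustrative:

```python
# Alternate between: known EXP -> new TAR (step 1) and known TAR -> new EXP (step 2),
# using the "a/the [EXP] [TAR]" patterns on a tiny hand-made corpus.
corpus = [
    "a nice hotel", "a bad room", "the ugly location",
    "a cozy hotel", "the expensive location", "a perfect room",
]

expressions = {"good", "nice", "bad", "ugly"}   # seed opinion words
targets = set()

for _ in range(5):                              # keep running steps 1 and 2
    for phrase in corpus:
        words = phrase.split()
        if len(words) == 3 and words[0] in ("a", "the"):
            det, exp, tar = words
            if exp in expressions:              # step 1: known EXP -> new TAR
                targets.add(tar)
            if tar in targets:                  # step 2: known TAR -> new EXP
                expressions.add(exp)

print(sorted(targets))       # ['hotel', 'location', 'room']
print(sorted(expressions))   # now also 'cozy', 'expensive', 'perfect'
```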
Weakly  supervised  and  
unsupervised  	
•  Polarity guessing
o  Apply the polarity patterns to guess the polarity
•  = a nice(+) and cozy(?) → cozy(+)
•  ! clean(+) but expensive(?) → expensive(-)
https://github.com/opener-project/opinion-domain-lexicon-acquisition
Outline	
1.  What is Information Extraction
2.  Main goals of Information Extraction
3.  Information Extraction Tasks and Subtasks
4.  MUC conferences
5.  Main domains of Information Extraction
6.  Methods for Information Extraction
o  Cascaded finite-state transducers
o  Regular expressions and patterns
o  Supervised learning approaches
o  Weakly supervised and unsupervised approaches
7.  How far we are with IE
How  good  is  IE
How  good  is  IE	
•  Some progress has been made
•  Still, the 60% barrier seems difficult to break
•  Most errors on entities and event coreference
•  Propagation errors
o  Entity recognition → ~90% accuracy
o  One event → 4 entities
o  0.9⁴ ≈ 0.66 → event accuracy around 60 – 65%
•  A lot of knowledge is implicit or “common world
knowledge”
How  good  is  IE	
Information Type    Accuracy
Entities            90 – 98%
Attributes          80%
Relations           60 – 70%
Events              50 – 60%
•  Very optimistic numbers for well-established tasks
•  The numbers go down for specific/new tasks
Information Extraction
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com
