Relationship Extraction from Text

   Extending the Espresso Method for Greater Recall


                  Derek Springer
         UCLA Computer Science Department
                November 19, 2009
Related Works

• Ganapathi, Swathi. “Relationship Extraction from Text:
  Comparison and Experimental Evaluation of the State-of-
  the-Art.” UCLA comp exam. March 2009.
• Chu, A., Sakurai, S., Cárdenas, A. F., "Automatic Detection
  of Treatment Relationships in Patent Retrieval." 2008 CIKM
  Patent Information Retrieval Workshop. October 2008.
Related Works, cont'd

• Girju, R. "Automatic Detection of Causal Relations for
  Question Answering." In the proceedings of the 41st Annual
  Meeting of the Association for Computational Linguistics
  (ACL 2003). Workshop on "Multilingual Summarization and
  Question Answering - Machine Learning and Beyond".
  2003.
• Pantel, Patrick and Pennacchiotti, Marco. "Espresso:
  Leveraging Generic Patterns for Automatically Harvesting
  Semantic Relations." In Proceedings of Conference on
  Computational Linguistics / Association for Computational
  Linguistics (COLING/ACL-06). pp. 113-120.
  Sydney, Australia. 2006.
Relationship Extraction

• The task of recognizing the assertion of a
  particular relationship between two or more
  entities in text.
• Can aid in the development of
  standalone, intelligent, automated and adaptable
  user-specific content retrieval systems.
• We focus on extracting treatment relationships
  → A (subject) used to treat B (object).
Goals and Contributions

• Extended state-of-the-art Espresso relationship
  extraction system originally implemented by
  Ganapathi.
• Did an in-depth experimental evaluation of the
  developed system while comparing it to prior
  work (Chu, Ganapathi).
• Future goal is to use the system developed here
  as a plug-in relationship feature extractor for
  iScore.
Integration Into iScore

• iScore presents additional articles based on an
  aggregate score of “interestingness.”
• We believe filtering articles based on
  relationships can improve the results of iScore.
• We hypothesize that extending the Espresso
  system implemented by Swathi Ganapathi will
  improve the ability of a system such as iScore to
  utilize relationship extraction as a feature.
Comparison Criteria

• Performance: Want system to have high
  precision and recall
• Minimal Supervision: Want system to require
  little to no human supervision
• Breadth: Want system to extract relations from
  varying corpus sizes, domains and formats.
• Generality: Want system to extract wide variety
  of relation types without losing its edge in any of
  the above criteria.
The Espresso Algorithm

• General purpose algorithm which can be used to
  extract a wide variety of binary relations.
• Requires minimal supervision. Only input is a
  small seed set of known relations.
• Detects relationships sentence by sentence, so it
  works well across different kinds of corpora.
• On tests conducted by the creators of the
  algorithm, Espresso generated balanced
  precision and recall.
The Espresso Method
Extending Espresso

Ganapathi's Implementation            37.8%
Extension                             91.2%
Ganapathi's Implementation

• Ganapathi's approach uses lexico-syntactic
  patterns of the form NP1 VP NP2 (Verb category
  in Table 1).
• The VP contains a treatment verb or pattern, and the
  two NPs contain the subject and object.
• This structure is very common, accounting for
  37.8% of all relationships.
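
The NP1 VP NP2 pattern above can be sketched in a few lines. This is only an illustration, not Ganapathi's code: it assumes a shallow parser has already split the sentence into labelled chunks, and the verb list and the function name `extract_np_vp_np` are hypothetical.

```python
# Assumed input: a list of (chunk_label, chunk_text) pairs from a shallow parser.
# The verb list is a made-up example, not the one learned by the system.
TREATMENT_VERBS = {"treat", "treats", "treated", "cure", "cures", "relieves"}

def extract_np_vp_np(chunks, verbs=TREATMENT_VERBS):
    """Return (subject, verb phrase, object) for every NP VP NP window
    whose verb phrase contains one of the given verbs."""
    triples = []
    # Slide a window of three consecutive chunks over the sentence.
    for (l1, t1), (l2, t2), (l3, t3) in zip(chunks, chunks[1:], chunks[2:]):
        if (l1, l2, l3) == ("NP", "VP", "NP"):
            if any(word.lower() in verbs for word in t2.split()):
                triples.append((t1, t2, t3))
    return triples

sentence = [("NP", "Ibuprofen"), ("VP", "is used to treat"), ("NP", "headache")]
print(extract_np_vp_np(sentence))
```

A sentence whose verb phrase has no treatment verb, such as "X moved to Y", yields no triple under this sketch.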
Extension

• A large number of relationships remain outside this
  pattern and may provide fruitful results.
• The extension expands the implementation to include:
  - Noun+Prep, e.g. "X settlement with Y"
  - Verb+Prep, e.g. "X moved to Y"
  - Infinitive, e.g. "X plans to acquire Y"
  - Modifier, e.g. "X is Y winner"
• Retrieves 91.2% of common relationships.
Test Corpora

• Patent Corpus: Developed by Shige
   o 50,000 drug patent documents from 2008 from Class 424 & 514 of
     the U.S. Patents Classification: “drug, bio-affecting and body
     treating compositions” and their subclasses.
   o Patents were pre-filtered to those containing the keywords
     “diabetes”, “metastatic”, “cancer”, “tuberculosis”, “lung”, “bronchitis”,
     “coronary artery”
   o All sentences from each document added to a sentence table in the
     schema
• PubMed Corpus: Developed by Gustavo
   o Composed of medical abstracts from PubMed
   o Each abstract was parsed and all of its sentences were stored as
     individual tuples in the sentence table
Performance Measures
Seed Treatment Relationships

•   (Xanax, Anxiety)           •   (Glycoside, Depression)
•   (Ambien, Insomnia)         •   (Ibuprofen, Arthritis)
•   (Effexor, Depression)      •   (Ibuprofen, Headache)
•   (Paxil, Depression)        •   (Tylenol, Fever)
•   (Lexapro, Depression)      •   (Tylenol, Headache)
•   (Caffeine, Depression)     •   (Antibody, Inflammation)
•   (Zoloft, Depression)       •   (Ibuprofen, Inflammation)
•   (Imipramine, Depression)   •   (Surgery, Glaucoma)
Procedure

1. Re-tag original data set to incorporate extended
   relationship types.
2. Re-run Ganapathi's baseline Espresso
   implementation to compare against updated data
   set.
3. Run extended Espresso implementation to
   compare against updated data set.
Experiment #1: Extraction on Drug Patent Corpus
• Drug Patent corpus used.
• Algorithm was run with seed relations and 12 verbs were extracted as
  being relevant (verbs with rπ greater than 0.2).
• These treatment verbs were used to create a test set of 120
  sentences, i.e. 10 sentences for each relevant treatment verb.
• 358 candidate relations were extracted, and we calculated the ri
  score for each.
• 208 relations had an ri score greater than the threshold; manual
  tagging confirmed 126 of these as correct.
• Of the original 358 relations, manual tagging determined that 213
  were correct treatment relations.
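
The counts above are enough to work out precision, recall and F-score under the standard definitions. The slide's own results table did not survive in this text, so the numbers below are simply what those definitions give for 208 extracted, 126 correct among them, and 213 correct overall; the function name is hypothetical.

```python
def precision_recall_f1(num_extracted, num_correct_extracted, num_correct_total):
    """Standard precision, recall and balanced F-score from raw counts."""
    precision = num_correct_extracted / num_extracted
    recall = num_correct_extracted / num_correct_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the Experiment #1 slide.
p, r, f = precision_recall_f1(208, 126, 213)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.606 recall=0.592 f1=0.599
```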
Experiment #1 Results
Experiment #2: Number of Relationships and Performance
• Drug Patent corpus used.
• Test the performance of the system under
  smaller and larger data loads.
• Started with the initial set of 120 sentences obtained
  from the Drug Patent corpus (10 sentences for each
  verb, 12 verbs as in Experiment #1).
• Increased the number of sentences per verb by 10
  at each step, giving sentence sets of 240 and
  360 sentences.
Experiment #2 Results
Experiment #2 Analysis

• Performance of the system and the number of
  relationships are inversely related.
• ri scores are inversely affected by the max pmi across
  all relationship instances, so having more relationship
  instances in a set may lower the ri of every relationship
  in it.
• more relationships => chance of a greater max pmi =>
  lowered ri for all relationship instances.
• Not a concern in practice → articles likely won't have
  200 relations of the same type.
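
The effect of the max pmi on ri can be seen in a small sketch. It assumes the instance reliability has the form given in the Espresso paper cited earlier (Pantel and Pennacchiotti, 2006): the average over patterns of pmi divided by the max pmi, weighted by pattern reliability rπ. The function name and all numbers are invented for illustration and do not come from the experiments.

```python
def instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi):
    """ri for one instance: average over all patterns of
    (pmi(instance, pattern) / max_pmi) * r_pi(pattern)."""
    total = sum(
        (pmi / max_pmi) * pattern_reliability[pattern]
        for pattern, pmi in pmi_by_pattern.items()
    )
    return total / len(pattern_reliability)

# Made-up pmi values and pattern reliabilities for a single instance.
pmi_by_pattern = {"treats": 2.0, "relieves": 1.0}
pattern_reliability = {"treats": 0.8, "relieves": 0.4}

small_set = instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi=2.0)
large_set = instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi=4.0)
print(small_set, large_set)
```

The instance's own pmi values are unchanged between the two calls. Only the max pmi doubles, and the score is halved as a result, which matches the explanation on the slide.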
Experiment #3: Extraction on PubMed Corpus
• PubMed corpus used.
• Want to test the performance of the system on a corpus of a
  different type and size.
• Algorithm was run with input seed relations on this corpus
  and the 10 verbs with the highest rπ values were extracted.
• We constructed a test set of 80 sentences (8
  sentences for every relevant verb).
• We then extracted a total of 162 relations from this test set
  and calculated their ri scores.
• The average ri score was used as the threshold value.
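
Using the average ri score as the cut-off can be sketched as follows. The relation names and scores are placeholders, not data from the PubMed run, and the strict "greater than" comparison is an assumption carried over from the wording of Experiment #1.

```python
from statistics import mean

def filter_by_mean_score(scored_relations):
    """Keep relations whose score is strictly above the mean score.
    scored_relations is a list of (relation, ri_score) pairs."""
    threshold = mean(score for _, score in scored_relations)
    kept = [rel for rel, score in scored_relations if score > threshold]
    return threshold, kept

# Placeholder relations with invented scores.
scored = [
    (("drug_a", "disease_x"), 0.75),
    (("drug_b", "disease_y"), 0.5),
    (("drug_c", "disease_z"), 0.25),
]
threshold, kept = filter_by_mean_score(scored)
print(threshold, kept)
```

A mean threshold adapts to each test set, but it also means roughly half of the candidates are rejected regardless of their absolute quality.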
Experiment #3 Results
Comparison Over Both Corpora
Experiment #3 Analysis

• Performance is worse on PubMed corpus.
• Patent corpus dealt with drugs and cures for diseases.
• Therefore, there was an abundance of treatment type
  relations in patent corpus.
• PubMed had more general medical data and only
  contained abstracts => less info.
• Therefore, there were fewer treatment relations in
  PubMed which affected performance.
Comparison with Previous Work




               * signifies our contribution
Analysis

• F-score of Ganapathi's version of Espresso fell
  nearly 10% → due to lower recall, as predicted.
• Results of extension over the re-tagged data are
  on par with Ganapathi's original results.
• Given that Ganapathi's system dropped nearly
  10% on the same data, this suggests the
  extension is more general-purpose than the
  original version.
Success

• Recall of system is more important than
  precision, especially when it comes to using
  relationships as a feature in iScore.
• Method is almost completely automated.
• Easily expanded to extract other relationship types by
  changing the input seed relations.
• Initial results seem insignificant, but analysis indicates
  that extended system has the potential to be a general-
  purpose relationship extraction feature.
Future Work

• Development of a relationship feature extractor
  for iScore.
• Relations will have to be syntactically and
  semantically compared with relations present in
  other articles and the best article matches will be
  returned as “interesting” choices for a user.
• Optimizations: algorithm design
  improvements, database connection
  optimizations and parallelization.
