Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources
Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar
July 26, 2005
Semantic Web Vision
Background and Motivation
Data sources shown: InterPro, MIPS, Swissprot
INDUS (INtelligent Data Understanding System)
Goal: knowledge discovery from large, distributed, semantically heterogeneous data
Outline
Semantically Heterogeneous Data
Data sources need to be made self-describing by specifying the relevant meta data.
D1:
Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number
Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS … | TPR TPR_REGION TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase
P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS … | RGS PROT_KIN_DOM PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
D2:
Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN … | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
P32589 | SSE1 | STPFGLDLGN NNSVLAVARN … | 692 | HSP70 | 16.01 protein binding
Meta Data
Schema for protein data in D1:
Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: Motifs
EC Number: EC Hierarchy
Attribute value hierarchy
An attribute value hierarchy (AVH) is a partial-order ontology over the values of an attribute of the data.
Example: MIPS Funcat Hierarchy
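A minimal sketch of how an AVH could be held in code, assuming a simple child-to-parent map. Only the codes 16.01 and 16.19.01 appear in the slides; the intermediate nodes "16" and "16.19" and the names `PARENT`, `ancestors`, and `is_a` are assumptions made for illustration.

```python
# Hypothetical fragment of an attribute value hierarchy (AVH), stored as
# child -> parent links. Only 16.01 and 16.19.01 come from the slides; the
# intermediate nodes "16" and "16.19" are assumed here.
PARENT = {
    "16.01": "16",
    "16.19": "16",
    "16.19.01": "16.19",
}


def ancestors(value):
    """Return the chain of ancestors of `value`, nearest first."""
    chain = []
    while value in PARENT:
        value = PARENT[value]
        chain.append(value)
    return chain


def is_a(value, other):
    """True if `value` equals `other` or lies below it in the hierarchy."""
    return value == other or other in ancestors(value)
```

The partial order is the `is_a` relation: two values are comparable only when one lies on the other's ancestor chain, so 16.01 and 16.19.01 are unrelated even though both fall under 16.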
Making data sources self-describing: ontology-extended data source
Ontology-extended data source = Data + Schema + Ontology
Schema:
Accession Number: MIPS ID
Gene: Gene ID
Length: Positive Integer
Prosite Motifs: Motifs
MIPS Funcat: MIPS Hierarchy
Data:
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding
User view
Diagram labels: Data Sources of Interest (MIPS, Swissprot), User Schema, User Ontology, User View
User schema:
PID: Swissprot ID
Source: Species String
Protein: AA String
Structural Class: SCOP
GO Function: GO Hierarchy
Mappings
Mappings at schema level
D1 schema: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy
D2 schema: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs
DU (user) schema: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String
Mappings:
Protein ID : D1 ≡ PID : DU
Accession Number AN : D2 ≡ PID : DU
Protein Sequence : D1 ≡ AA Composition : DU
AA Sequence : D2 ≡ AA Composition : DU
EC Number : D1 ≡ GO Function : DU
MIPS Funcat : D2 ≡ GO Function : DU
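The schema-level equivalences can be read as attribute renamings from each source into the user schema. The sketch below assumes a plain dictionary representation; `SCHEMA_MAPPINGS` and `to_user_view` are hypothetical names, not part of INDUS. It renames attributes only: the values of EC Number and MIPS Funcat still need the ontology-level mappings before they become GO terms.

```python
# Attribute-level mappings from each source schema to the user schema DU,
# copied from the equivalences on the slide. The dictionary layout and the
# function name are illustrative assumptions.
SCHEMA_MAPPINGS = {
    "D1": {
        "Protein ID": "PID",
        "Protein Sequence": "AA Composition",
        "EC Number": "GO Function",
    },
    "D2": {
        "Accession Number AN": "PID",
        "AA Sequence": "AA Composition",
        "MIPS Funcat": "GO Function",
    },
}


def to_user_view(source, record):
    """Rename mapped attributes of `record`; drop attributes with no mapping."""
    mapping = SCHEMA_MAPPINGS[source]
    return {mapping[attr]: value for attr, value in record.items() if attr in mapping}


d1_row = {
    "Protein ID": "P35626",
    "Protein Name": "Beta-adrenergic receptor kinase 2",
    "EC Number": "2.7.1.126",
}
d2_row = {"Accession Number AN": "P07278", "Gene": "BCY1", "MIPS Funcat": "16.19.01"}

user_rows = [to_user_view("D1", d1_row), to_user_view("D2", d2_row)]
```

After the call, both rows share the attribute names PID and GO Function, so they can be queried together from the user's point of view.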
Mappings at ontology level (between the hierarchies of D1 and DU)
EC 2.7.1.126 : D1 ≡ GO 0047696 : DU
EC 2.7.1 : D1 [?] GO 0047696 : DU
EC 2.7.1.126 : D1 [?] GO 0004672 : DU
Integration ontology
Sample Query
Outline
Learning classifiers from data
Standard learning algorithms assume centralized access to data.
Learning: Data (labeled examples) -> Learner -> Classifier (hypothesis)
Classification: Unlabeled examples -> Classifier -> Class
Human and yeast protein training data
PID | Source | Structural Classes | Sequence | GO Function
P39708 | Yeast | Mainly alpha; Alpha beta | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
Q01574 | Yeast | Mainly alpha | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
Q12797 | Human | Not Known | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
P35626 | Human | Mainly beta; Few Secondary Structures | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity
The columns PID through Sequence are the attributes (features, variables), GO Function is the class (label), and the rows are the examples (instances, cases).
Probabilistic models for protein function classification
PID | Sequence | GO Function
P39708 | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
Q01574 | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
Q12797 | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
P35626 | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity
Most probable class of c(S) is:
Learning classifiers from data revisited
Learning = Information extraction + Hypothesis generation
Information extraction = Sufficient statistics gathering
Learner: holds the partial hypothesis $h_i$; statistical query formulation sends the query $s(D, h_i \to h_{i+1})$ to the data D
Data D: returns the answer $s(D, h_i \to h_{i+1})$
Hypothesis generation: $h_{i+1} \leftarrow R(h_i, s(D, h_i \to h_{i+1}))$
Sufficient statistics for learning classifiers
Naïve Bayes learning as information gathering and hypothesis generation
Sufficient statistics: count(AminoAcid, Class) and count(Class)
Naïve Bayes learner <-> Query answering engine <-> Data
For each $a_i$ and each $c_j$, the query answering engine returns Counts($a_i$ | $c_j$) and Counts($c_j$)
Compute: $P(c_j)$ and $P(a_i | c_j)$
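A minimal sketch of the two steps named on the slide: gather the counts, then turn them into probabilities. It assumes relative-frequency estimates with no smoothing and takes count(Class) to be the number of examples per class; the function names and the three toy sequences with labels "c1" and "c2" are made up for illustration.

```python
from collections import Counter


def sufficient_statistics(examples):
    """Return count(Class) and count(AminoAcid, Class) for labeled sequences."""
    class_counts = Counter()
    aa_counts = Counter()
    for sequence, label in examples:
        class_counts[label] += 1
        for aa in sequence:
            aa_counts[(aa, label)] += 1
    return class_counts, aa_counts


def estimate(class_counts, aa_counts):
    """Turn counts into P(c_j) and P(a_i | c_j) by relative frequency."""
    n_examples = sum(class_counts.values())
    prior = {c: k / n_examples for c, k in class_counts.items()}
    # Total number of amino acid occurrences observed in each class.
    totals = Counter()
    for (_, c), k in aa_counts.items():
        totals[c] += k
    likelihood = {(aa, c): k / totals[c] for (aa, c), k in aa_counts.items()}
    return prior, likelihood


# Toy labeled sequences (hypothetical, far shorter than real proteins).
examples = [("MAQ", "c1"), ("MAD", "c1"), ("VSS", "c2")]
class_counts, aa_counts = sufficient_statistics(examples)
prior, likelihood = estimate(class_counts, aa_counts)
```

The learner in `estimate` never touches the sequences themselves, only the counts, which is what lets the counting step be delegated to a query answering engine.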
Learning classifiers from distributed data
Learning = Information extraction from distributed data + Hypothesis generation
Learner: holds the partial hypothesis $h_i$; statistical query formulation sends the query $s(D, h_i \to h_{i+1})$ to the query answering engine
Query answering engine: query decomposition into $q_1, q_2, \dots, q_K$ for the data sources $D_1, D_2, \dots, D_K$, followed by answer composition into the answer $s(D, h_i \to h_{i+1})$
Hypothesis generation: $h_{i+1} \leftarrow R(h_i, s(D, h_i \to h_{i+1}))$
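For count queries, decomposition and composition can be sketched very simply, assuming each source holds a disjoint set of examples with the same attributes (an assumption not stated on the slide): the same query goes to every source and the answers are added. The names `local_counts` and `compose` and the toy data are hypothetical.

```python
from collections import Counter


def local_counts(examples):
    """Answer the count query q_k against one data source."""
    counts = Counter()
    for sequence, label in examples:
        counts[("class", label)] += 1
        for aa in sequence:
            counts[(aa, label)] += 1
    return counts


def compose(answers):
    """Answer composition: counts from disjoint sets of examples add up."""
    total = Counter()
    for answer in answers:
        total.update(answer)  # Counter.update adds counts rather than replacing
    return total


# Two toy sources holding different examples (hypothetical data).
d1 = [("MAQ", "c1"), ("VSS", "c2")]
d2 = [("MAD", "c1")]

combined = compose([local_counts(d1), local_counts(d2)])
```

Under that assumption the composed answer equals the answer that would be obtained from the pooled data, so the learner produces the same hypothesis without the data leaving its source.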
Learning classifiers from semantically heterogeneous data sources
Learner: holds the partial hypothesis $h_i$ and works in the user ontology O; statistical query formulation sends the query $s(D, h_i)$
Query decomposition into $q_1, q_2, \dots, q_K$ for the ontology-extended data sources $(D_1, O_1), (D_2, O_2), \dots, (D_K, O_K)$, followed by answer composition into the answer $s(D, h_i)$
Mappings $M(O_1 \dots O_K, O)$ from $O_1 \dots O_K$ to O
Hypothesis generation: $h_{i+1} \leftarrow R(h_i, s(D, h_i))$
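A rough sketch of how the mappings enter the counting step: each source translates its own class labels into user-ontology terms before the counts are combined. Only the EC 2.7.1.126 to GO 0047696 equivalence is stated in the slides; the MIPS entry, the `TO_USER` layout, and the function name are assumptions for illustration, and only equivalence mappings are handled.

```python
from collections import Counter

# Value-level mappings M(O_k, O) into the user ontology. Only the
# EC 2.7.1.126 entry is stated on the slides; the MIPS entry is an
# assumption made for this example.
TO_USER = {
    "D1": {"EC 2.7.1.126": "GO 0047696"},
    "D2": {"MIPS 16.01": "GO 0005515"},
}


def class_counts(source, labels):
    """Count the class labels of one source after translating them to user terms."""
    mapping = TO_USER[source]
    return Counter(mapping[label] for label in labels)


# Each source answers in the user's vocabulary, so the answers can be added.
answer = class_counts("D1", ["EC 2.7.1.126", "EC 2.7.1.126"]) + class_counts(
    "D2", ["MIPS 16.01"]
)
```

A label with no mapping raises a `KeyError` here; handling values that map only to a more general or more specific user term is outside this sketch.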
Outline
Ontology-based information integration in INDUS
Capabilities of INDUS
INDUS Tools
INDUS Users: Domain Ontologists
INDUS Users: Data Providers
INDUS Users: Domain Experts
INDUS Users: Domain Scientists
INDUS
Related work
Outline
Summary
Work in progress
http://www.cild.iastate.edu/software/indus.html

Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

  • 1. Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar. July 26, 2005
  • 4. INDUS (INtelligent Data Understanding System). Goal: knowledge discovery from large, distributed, semantically heterogeneous data
  • 6. Semantically Heterogeneous Data. Data sources need to be made self-describing by specifying the relevant meta data.
      D1:
        Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number
        Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS … | TPR TPR_REGION TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase
        P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS … | RGS PROT_KIN_DOM PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
      D2:
        Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat
        P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN … | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
        P32589 | SSE1 | STPFGLDLGN NNSVLAVARN … | 692 | HSP70 | 16.01 protein binding
  • 8. Attribute value hierarchy. An attribute value hierarchy (AVH) is a partial-order ontology over the values of an attribute of the data. Example: the MIPS Funcat hierarchy
  • 9. Making data sources self-describing. Ontology-extended data source = Data + Schema + Ontology.
      Schema: Accession Number: MIPS ID; Gene: Gene ID; Length: Positive Integer; Prosite Motifs: Motifs; MIPS Funcat: MIPS Hierarchy
      Data:
        P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP.)
        P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding
  • 12. Mappings at schema level
      D1: Protein ID: Swissprot ID; Protein Name: String; Protein Sequence: AA String; Prosite Motifs: AA String; EC Number: EC Hierarchy
      D2: Accession No AN: MIPS ID; Gene: Gene ID; AA Sequence: AA String; Length: Pos Integer; MIPS Funcat: MIPS Hierarchy; Pfam Motifs: Motifs
      DU: PID: Swissprot ID; Protein: AA String; GO Function: GO Hierarchy; Source: Species String
  • 13. Mappings at schema level (schemas of D1, D2 and DU as on slide 12)
      Protein ID : D1 ≡ PID : DU
      Accession Number AN : D2 ≡ PID : DU
  • 14. Mappings at schema level (schemas of D1, D2 and DU as on slide 12)
      Protein ID : D1 ≡ PID : DU
      Accession Number AN : D2 ≡ PID : DU
      Protein Sequence : D1 ≡ AA Composition : DU
      AA Sequence : D2 ≡ AA Composition : DU
  • 15. Mappings at schema level (schemas of D1, D2 and DU as on slide 12)
      Protein ID : D1 ≡ PID : DU
      Accession Number AN : D2 ≡ PID : DU
      Protein Sequence : D1 ≡ AA Composition : DU
      AA Sequence : D2 ≡ AA Composition : DU
      EC Number : D1 ≡ GO Function : DU
      MIPS Funcat : D2 ≡ GO Function : DU
  • 16. Mappings at ontology level (figure: the ontologies of D1 and DU)
  • 17. Mappings at ontology level: EC 2.7.1.126 : D1 ≡ GO 0047696 : DU
  • 18. Mappings at ontology level: EC 2.7.1 : D1 ≥ GO 0047696 : DU
  • 19. Mappings at ontology level: EC 2.7.1.126 : D1 ≤ GO 0004672 : DU
  • 23. Learning classifiers from data. Learning: labeled examples (data) → learner → classifier (hypothesis). Classification: unlabeled examples → classifier → class. Standard learning algorithms assume centralized access to data
  • 24. Human and yeast protein training data. Each row is an example (instance, case); GO Function is the class (label); the other columns are the attributes (features, variables).
      PID | Source | Structural Classes | Sequence | GO Function
      P39708 | Yeast | Mainly alpha; Alpha beta | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
      Q01574 | Yeast | Mainly alpha | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
      Q12797 | Human | Not Known | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
      P35626 | Human | Mainly beta; Few Secondary Structures | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity
  • 26. Learning classifiers from data revisited. Learning = information extraction + hypothesis generation, where information extraction = sufficient statistics gathering. The learner holds a partial hypothesis $h_i$ and formulates a statistical query $s(D, h_i \to h_{i+1})$ against the data $D$; the answer to that query feeds hypothesis generation: $h_{i+1} \leftarrow R(h_i, s(D, h_i \to h_{i+1}))$
  • 28. Naïve Bayes learning as information gathering and hypothesis generation. Sufficient statistics: count(AminoAcid, Class) and count(Class). For each attribute value $a_i$ and each class $c_j$, the Naïve Bayes learner asks the query answering engine for the counts Counts($a_i$|$c_j$) and Counts($c_j$) over the data, then computes $P(c_j)$ and $P(a_i \mid c_j)$
  • 29. Learning classifiers from distributed data: information extraction from distributed data + hypothesis generation. The learner holds a partial hypothesis $h_i$ and formulates a statistical query $s(D, h_i \to h_{i+1})$. The query answering engine decomposes it into sub-queries q1, q2, …, qK for the data sources D1, D2, …, DK (query decomposition) and combines their answers into the answer $s(D, h_i \to h_{i+1})$ (answer composition). Hypothesis generation: $h_{i+1} \leftarrow R(h_i, s(D, h_i \to h_{i+1}))$
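  • 30. Learning classifiers from semantically heterogeneous data sources. Each data source Dk comes with its own ontology Ok: (D1, O1), (D2, O2), …, (DK, OK). Queries are posed in terms of an ontology O, with mappings M(O1 … OK, O) from O1 … OK to O. The learner holds a partial hypothesis $h_i$ and formulates a statistical query $s(D, h_i)$, which is decomposed into sub-queries q1, q2, …, qK (query decomposition); their answers are combined into the answer $s(D, h_i)$ (answer composition). Hypothesis generation: $h_{i+1} \leftarrow R(h_i, s(D, h_i))$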
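The decomposition described on slides 28 and 29 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the INDUS implementation: the function names (`local_counts`, `compose`, `naive_bayes_estimates`), the toy records and class labels, and the Laplace smoothing are all invented for the example. Each source answers the count query locally, the answers are added together, and the learner computes its probability estimates from the combined counts alone.

```python
from collections import Counter

def local_counts(records):
    """Answer the count query at one source: (amino acid, class) counts and class counts."""
    joint, classes = Counter(), Counter()
    for sequence, label in records:
        classes[label] += 1
        for amino_acid in sequence:
            joint[(amino_acid, label)] += 1
    return joint, classes

def compose(answers):
    """Answer composition: counts from sources holding disjoint examples simply add up."""
    joint, classes = Counter(), Counter()
    for source_joint, source_classes in answers:
        joint.update(source_joint)      # Counter.update adds counts
        classes.update(source_classes)
    return joint, classes

def naive_bayes_estimates(joint, classes, alphabet, alpha=1.0):
    """Hypothesis generation: P(c) and P(a|c) from counts (smoothing is an assumption)."""
    n = sum(classes.values())
    prior = {c: classes[c] / n for c in classes}
    cond = {}
    for c in classes:
        total = sum(joint[(a, c)] for a in alphabet)
        for a in alphabet:
            cond[(a, c)] = (joint[(a, c)] + alpha) / (total + alpha * len(alphabet))
    return prior, cond

# Toy, made-up data: two sources holding disjoint sets of labeled sequence fragments.
d1 = [("MAQR", "kinase"), ("MADL", "kinase")]
d2 = [("VSSL", "binding")]
alphabet = sorted(set("".join(seq for seq, _ in d1 + d2)))

distributed = compose([local_counts(d1), local_counts(d2)])
centralized = local_counts(d1 + d2)
prior, cond = naive_bayes_estimates(*distributed, alphabet)
```

In this sketch the learner never sees the raw sequences, only counts, and the composed counts are identical to those computed over the pooled data.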

Editor's notes

  1. INDUS: a federated, query-centric approach to the problem of knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources. Learning algorithms that can be decomposed into information gathering (obtained by answering queries) and hypothesis generation can be easily linked to INDUS. INDUS makes possible the exchange of data and findings between scientists or institutions working on related problems (e.g., in bioinformatics).
  2. A design tailored for building predictive models with machine learning algorithms from distributed, semantically heterogeneous, autonomous data sources.
  3. INDUS: a federated, query-centric approach to the problem of knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources. Learning algorithms that can be decomposed into information gathering (obtained by answering queries) and hypothesis generation can be easily linked to INDUS. INDUS makes possible the exchange of data and findings between scientists or institutions working on related problems (e.g., in bioinformatics).
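The schema-level and ontology-level mappings from the slides can be illustrated with a small Python sketch of how records from semantically heterogeneous sources might be translated into one user view. It is a toy under stated assumptions, not INDUS code: the names `SCHEMA_MAP_D1`, `SCHEMA_MAP_D2`, `ONTOLOGY_MAP` and `to_user_view` are hypothetical. The attribute equivalences and the EC 2.7.1.126 ≡ GO 0047696 entry come from the slides; the MIPS 16.01 entry is an illustrative assumption.

```python
# Schema-level mappings: source attribute -> user-view attribute (from the slides).
SCHEMA_MAP_D1 = {"Protein ID": "PID", "EC Number": "GO Function"}
SCHEMA_MAP_D2 = {"Accession Number AN": "PID", "MIPS Funcat": "GO Function"}

# Ontology-level mappings: source value -> user-ontology value.
# Only the EC 2.7.1.126 entry is stated on the slides; the MIPS entry is an
# assumption made for this example.
ONTOLOGY_MAP = {
    "EC Number": {"2.7.1.126": "GO 0047696"},
    "MIPS Funcat": {"16.01": "GO 0005515"},
}

def to_user_view(record, schema_map):
    """Rename mapped attributes, translate their values, and drop unmapped attributes."""
    view = {}
    for attribute, value in record.items():
        if attribute not in schema_map:
            continue
        value_map = ONTOLOGY_MAP.get(attribute, {})
        # Values without an ontology mapping (e.g. identifiers) pass through unchanged.
        view[schema_map[attribute]] = value_map.get(value, value)
    return view

d1_record = {
    "Protein ID": "P35626",
    "Protein Name": "Beta-adrenergic receptor kinase 2",
    "EC Number": "2.7.1.126",
}
d2_record = {"Accession Number AN": "P32589", "Gene": "SSE1", "MIPS Funcat": "16.01"}

user_rows = [
    to_user_view(d1_record, SCHEMA_MAP_D1),
    to_user_view(d2_record, SCHEMA_MAP_D2),
]
```

Once both records share the attribute names and value vocabulary of the user view, count queries such as those used by Naïve Bayes can be answered over them uniformly.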