Data Integration: what I haven’t yet achieved
Neil Saunders

MATHEMATICS, INFORMATICS AND STATISTICS
www.csiro.au
My main project

Ludwig colorectal cancer study

Data integration 2 of 21
Multiple “omics” platforms

• exon expression
• methylation
• copy number
We want to “integrate” these data

but what does that mean?

Integration can mean “portals”

Integration can mean “visualization”

Integration can mean “correlation”
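
One reading of "correlation" is to take each gene in turn and correlate its measurements from two platforms across the same samples. The following is a minimal sketch of that idea, assuming two hypothetical genes x samples arrays (`copy_number`, `expression`) whose rows and columns are already in the same order. The function name and the toy values are illustrative only and are not taken from any package named in this talk.

```python
import numpy as np

def per_gene_correlation(copy_number, expression):
    """Pearson correlation between two platforms, one value per gene (row).

    Both inputs are hypothetical genes x samples arrays whose rows and
    columns are assumed to be in the same order.
    """
    cn = np.asarray(copy_number, dtype=float)
    ex = np.asarray(expression, dtype=float)
    if cn.shape != ex.shape:
        raise ValueError("platform matrices must have the same shape")
    # Centre each gene across samples.
    cn_c = cn - cn.mean(axis=1, keepdims=True)
    ex_c = ex - ex.mean(axis=1, keepdims=True)
    num = (cn_c * ex_c).sum(axis=1)
    den = np.sqrt((cn_c ** 2).sum(axis=1) * (ex_c ** 2).sum(axis=1))
    # Genes with no variation on either platform get NaN rather than an error.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(den > 0, num / den, np.nan)

# Toy data: 3 genes x 4 samples (made-up values for illustration only).
copy_number = np.array([[2.0, 3.0, 4.0, 5.0],
                        [2.0, 2.0, 2.0, 2.0],
                        [4.0, 3.0, 2.0, 1.0]])
expression = np.array([[1.0, 2.0, 3.0, 4.0],
                       [5.0, 6.0, 7.0, 8.0],
                       [1.0, 2.0, 3.0, 4.0]])
r = per_gene_correlation(copy_number, expression)
```

In the toy data the first gene rises on both platforms, the second has constant copy number (so its correlation is undefined), and the third moves in opposite directions.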

What do we think integration means?

A + B + C

More information when combined than when separate
What’s already “out there”? PubMed
[Figure: PubMed search for "data integration"; articles per 100 000 by year, 2002 to 2010]
What’s already “out there”? CiteULike

http://www.citeulike.org/user/neils/tag/integration

Buzz-word compliant

Quote from integIRTy paper

"These methods can be roughly grouped into four categories: stepwise, regression-based, correlation-based and latent variable models"

integIRTy: a method to identify genes altered in cancer by accounting for multiple mechanisms of regulation using item response theory
Bioinformatics, Vol. 28, No. 22 (15 November 2012), pp. 2861-2869

Regression: SIM

Integrated analysis of DNA copy number and gene expression microarray data using gene sets
BMC Bioinformatics 2009, 10:203

Correlation: DR-Integrator

[Figure: plot with axes "Chr" (chromosomes 1 to 22) and "Correlation" (0 to 1), labelled by sample ID]

Latent variable: iCluster

(file under impractical)

Basics that are never explained 1/2

Integration across groups or description of samples?

Basics that are never explained 2/2

Genes x Samples
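
The "Genes x Samples" layout can be made concrete with a short sketch: each platform is a matrix with one row per gene and one column per sample, and before anything can be combined the matrices have to be cut down to the genes and samples they share. This is an illustration under stated assumptions, not a method from the talk. The `(genes, samples, matrix)` tuple layout, the function name, and the gene and sample labels are all hypothetical.

```python
import numpy as np

def align_platforms(platforms):
    """Restrict each genes x samples matrix to shared genes and samples.

    `platforms` maps a platform name to (genes, samples, matrix), where
    `matrix` has shape (len(genes), len(samples)). This layout is an
    assumption made for the sketch.
    """
    gene_sets = [set(g) for g, _, _ in platforms.values()]
    sample_sets = [set(s) for _, s, _ in platforms.values()]
    genes = sorted(set.intersection(*gene_sets))
    samples = sorted(set.intersection(*sample_sets))
    aligned = {}
    for name, (g, s, m) in platforms.items():
        # Look up where each shared gene/sample sits in this platform.
        rows = [list(g).index(x) for x in genes]
        cols = [list(s).index(x) for x in samples]
        aligned[name] = np.asarray(m, dtype=float)[np.ix_(rows, cols)]
    return genes, samples, aligned

# Made-up labels and values, for illustration only.
platforms = {
    "expression": (["GENE_A", "GENE_B", "GENE_C"], ["S1", "S2", "S3"],
                   [[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
    "methylation": (["GENE_B", "GENE_C", "GENE_D"], ["S3", "S1"],
                    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]),
}
genes, samples, aligned = align_platforms(platforms)
```

After alignment every matrix has the same row and column order, so row i and column j refer to the same gene and sample on every platform.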

Conclusions 1/3

We’re not the first people doing this...
...but it’s becoming a “hot topic”

Conclusions 2/3

Room for improvement in software, much of which is:

• Poorly written
• Poorly documented
• Difficult to implement

Conclusions 3/3

Too much for one individual!

CSIRO Mathematics, Informatics and Statistics
Neil Saunders
t +61 2 9325 3144
e Neil.Saunders@csiro.au
w Mathematics, Informatics and Statistics web

