Data-knowledge transition zones within the biomedical research ecosystem
Maryann E. Martone, Ph. D.
University of California, San Diego
• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the
neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging and tracking the neuroscience resource landscape since < 2008
BD2K: Big Data to Knowledge
• BD2K - a trans-NIH initiative established to enable biomedical research as a
digital research enterprise, to facilitate discovery and support new knowledge,
and to maximize community engagement.
• BD2K aims to develop the new approaches, standards, methods, tools,
software, and competencies that will enhance the use of biomedical Big Data
by:
– Facilitating broad use of biomedical digital assets by making them
discoverable, accessible, and citable
– Conducting research and developing the methods, software, and tools
needed to analyze biomedical Big Data
– Enhancing training in the development and use of methods and tools
necessary for biomedical Big Data science
– Supporting a data ecosystem that accelerates discovery as part of a digital
enterprise
http://bd2k.nih.gov/
How many resources are there?
How do resources get added to the NIF?
•NIF curators
•Nomination by the community
•Semi-automated text mining pipelines
NIF Registry
Requires no special skills
Manual and semi-automated updates
•NIF Data Federation
•DISCO interop
•Requires some programming skill
•Open Source Brain < 2 hr
•Automated update via NIF DISCO dashboard
Low barrier to entry; incremental refinement (Marenco et al., 2010; 2014)
Registry vs Federation: Metadata about resource vs
metadata/data in database
What resources are available for GRM1?
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice
THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY
[Figure: Resource Registry entries by resource type (Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource) across years]
Anita Bandrowski and Burak Ozyurt
Population, Coverage and Linkage of Resource Registry
• Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
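The percentages on this slide can be re-derived from the counts it reports. The sketch below is only an arithmetic check: the helper name `share` is hypothetical, and the manual-review denominator is taken as exactly 200 even though the slide says "~200".

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Counts copied from the slide above.
text_mined_stale = share(373, 570)  # dated resources not updated in the last 2 years
manual_stale = share(38, 200)       # manual review; denominator assumed to be exactly 200
manual_gone = share(7, 200)         # no longer in service

print(text_mined_stale, manual_stale, manual_gone)
```

The two methods give very different staleness estimates (about 65% versus about 19%), which is why the manual review matters.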
What happens to these resources?
The Registry provides a persistent identifier and metadata
record for what once existed but no longer does
Keeping content up to date
Connectome
Tractography
Epigenetics
•New tags come into existence
•New resource types come into existence, e.g., mobile apps
•Resources add new types of content
•Change name
•Change scope
•> 7000 updates to the registry last year
It’s a challenge to keep the registry up to date;
sitemaps, curation, ontologies, community review
DATA FEDERATION
NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data
resources, providing deep query of the contents and unified views
250 sources
> 800 M records
What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD); gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
• E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies
NIF: A search engine for data
NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene
Ontology, Chebi, Protein Ontology
• Available as services through NIF and BioPortal
NIFSTD
[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
NIF uses ontologies to enhance search
and discovery but is not constrained by
them
Find clinical trials that have data available?
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
– It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
– Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
– Much value has been added to individual data sets
Facets and filters: Progressive refinement of search
[Figure: progressive refinement of the query "Addiction" by index, category, source and facet/filter. Labels shown: Registry, Data, Literature, Gene, Gemma, Geo, Integrated Expression, Organism, Expression level]
More effective to start with a general query and use
the navigation to refine search
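The idea of starting broad and narrowing with facets can be sketched in a few lines. This is a minimal illustration, not NIF's implementation: the records, field names and the helpers `refine` and `facet_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical search hits; field names and values are illustrative only.
RECORDS = [
    {"index": "Data", "category": "Expression", "source": "Gemma", "organism": "mouse"},
    {"index": "Data", "category": "Expression", "source": "Geo", "organism": "rat"},
    {"index": "Data", "category": "Gene", "source": "Gemma", "organism": "mouse"},
    {"index": "Registry", "category": "Software", "source": "Registry", "organism": None},
    {"index": "Literature", "category": "Article", "source": "Literature", "organism": None},
]

def refine(records, **filters):
    """Keep only the records whose fields equal every given filter value."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

def facet_counts(records, field):
    """Count how many records fall under each value of one facet."""
    return Counter(r[field] for r in records if r[field] is not None)

# Start general, inspect the facets, then narrow step by step.
data_hits = refine(RECORDS, index="Data")
category_facet = facet_counts(data_hits, "category")
gemma_expression = refine(data_hits, category="Expression", source="Gemma")
```

Each refinement step operates on the previous result set, so the facet counts always describe what is left to choose from.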
Concept Mapper: Alignment and weighting
Find:gene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum
Find:gene Anatomy:cerebellum
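To make the two query forms concrete, here is a small sketch that splits such a query into the concept to find, concept-scoped terms and free keywords. It is an assumption about how the syntax could be read, not the Concept Mapper's actual parser; `parse_query` and the dictionary keys are hypothetical.

```python
def parse_query(query: str) -> dict:
    """Split a query into Find: concepts, concept-scoped terms and plain keywords."""
    parsed = {"find": [], "scoped": {}, "keywords": []}
    for token in query.split():
        prefix, sep, value = token.partition(":")
        if not sep or not value:
            # No "prefix:value" shape, so treat the token as a free keyword.
            parsed["keywords"].append(token)
        elif prefix.lower() == "find":
            # Find:<concept> asks for sources with a column mapped to that concept.
            parsed["find"].append(value)
        else:
            # <Concept>:<term> restricts the term to columns mapped to that concept.
            parsed["scoped"].setdefault(prefix.lower(), []).append(value)
    return parsed

keyword_form = parse_query("Find:gene cerebellum")
scoped_form = parse_query("Find:gene Anatomy:cerebellum")
```

Under this reading, the first query matches "cerebellum" anywhere in a source, while the second only matches it in columns mapped to anatomy.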
“Data trails”: Linking data and analysis tools
Query across Registry and Federation
• Registry and Federation were treated separately, even though Federation comprises views of Registry entries
• Experimenting with new combined index
SciCrunch: A “social network” for resources
• NIF is a general search engine across all of neuroscience
• Very powerful for discovery and general browsing
• Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
• Specialized for their domain
• Restrict the particular sources
• Organize the data according to their needs
• Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
Put dkNET here
http://dknet.org
Autogenerated snippets
Where can I find validated antibodies against CART?
[Figure: number of records per data category on a log scale from 1 to 10,000,000,000. Categories: Software, Protocols, Phenotype, Pathways, Multimedia, Molecule, Microarray, Images, Gene, Drugs, Dataset, Clinical Trials, Brain Activation Foci, Atlas, Annotation]
All databases in the SciCrunch
Federation become immediately
available through More Resources
Breaking down silos: Community enrichment
It’s like a Mendeley for
resources!
SciCrunch
Shared
Resources
Undiagnosed
Disease Program
Phenotype RCN
One Mind for
Research
Consortia-Pedia
Faster Cures
Model Organism
Databases
Community
Outreach
Shared curation; shared expertise
Resource Identification Portal
Aging
Neuroscience
dkNET
Phenotypes
NSF Earthcube
Making use of community
[Figure: the index, category, source and facet/filter refinement applied within a community portal, contrasting community resources with all SciCrunch data. Labels shown: Gene, Gemma, Geo, Integrated Expression, Literature, Organism, Expression level]
Brings expertise of community to understanding how to work
with data
KNOWLEDGE TO DATA: GAP ANALYSIS
Looking across the ecosystem: Where are the data?
Data Sources
Bringing knowledge to data: Gap analysis
[Figure: number of data sources per brain region (Forebrain, Midbrain, Hindbrain); legend bins: 0, 1-10, 11-100, >101]
Revealing biases in the dataspace
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis.
However, for several reasons tail of the caudate activity can easily be missed.
•One reason is limitations in the normalization algorithms, that typically are
optimized to maximize accuracy for cortical rather than subcortical
structures. ...
•A second reason is that standard neuroimaging atlases such as the Harvard-
Oxford structural atlas used with neuroimaging analysis programs such as
FreeSurfer truncate the caudate at the body, and completely exclude the
tail...
•A final reason is that the tail of the caudate is close to the hippocampus, and
could be misidentified as such especially in tasks involving learning and
memory.
Therefore, the tail of the caudate may be recruited in additional cognitive
tasks, but yet not have been properly identified and reported in the
neuroimaging literature”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front
Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
Importance of comprehensive indices: For how many proteins are there antibodies?
[Figure legend: number of search results per gene, binned as 0, 1-10, 11-100, 101-1000, 1001+]
Human, protein coding genes (Entrez Gene) vs # of search results from the antibodyregistry.org
Antibodyregistry.org; Trish Whetzel and Anita Bandrowski
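The legend bins used in this figure can be reproduced with a simple lookup. This is a sketch of the binning step only: the gene names and counts below are invented placeholders, not results from antibodyregistry.org, and `bin_label` is a hypothetical helper.

```python
from collections import Counter

# Inclusive bin edges taken from the legend: 0, 1-10, 11-100, 101-1000, 1001+.
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"), (101, 1000, "101-1000")]

def bin_label(count: int) -> str:
    """Map a per-gene search result count to its legend bin."""
    if count < 0:
        raise ValueError("count must be non-negative")
    for low, high, label in BINS:
        if low <= count <= high:
            return label
    return "1001+"

# Placeholder counts for illustration only.
hits_per_gene = {"GENE_A": 0, "GENE_B": 7, "GENE_C": 250, "GENE_D": 4000, "GENE_E": 0}
histogram = Counter(bin_label(n) for n in hits_per_gene.values())
```

Genes that land in the "0" bin are the gaps this kind of comprehensive index is meant to expose.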
“The Data Homunculus”
Data-Knowledge Mismatch
Dutowski et al., 2013:
Nature Biotechnology
The scourge of neuroanatomical nomenclature
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
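A count like "terms used in more than one database" comes down to grouping terms by a normalized key. The sketch below is an assumption about how such a tally could be made: the database names, term lists and synonym table are hypothetical, and matching here ignores case and surrounding whitespace.

```python
from collections import defaultdict

def shared_terms(term_sets, synonyms=None):
    """Return the normalized terms that occur in more than one database."""
    synonyms = synonyms or {}
    seen = defaultdict(set)
    for database, terms in term_sets.items():
        for term in terms:
            key = term.strip().lower()
            key = synonyms.get(key, key)  # map a known synonym to its preferred label
            seen[key].add(database)
    return {term: dbs for term, dbs in seen.items() if len(dbs) > 1}

# Hypothetical term lists for two connectivity databases.
TERMS = {
    "db_a": ["Hippocampus", "Ammon's horn", "Caudate nucleus"],
    "db_b": ["hippocampus", "Cornu ammonis", "Putamen"],
}
SYNONYMS = {"ammon's horn": "cornu ammonis"}

exact = shared_terms(TERMS)
with_synonyms = shared_terms(TERMS, SYNONYMS)
```

As on the slide, adding a synonym table raises the number of matches, but most terms still remain unique to a single database.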
6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative
Approaches for Cell-Type Classification in the Brain
“The mammalian brain contains a vast number of cells. These cells are
generally grouped within broad classes (e.g., neurons or glia) but it is
currently unknown exactly how many classes exist.”
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Transition Zones: Neurons and their properties
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
• Neurolex.org
• Semantic MediaWiki
– > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
[Figure: bar chart of the number (axis 0 to 300) of total red links, easy fixes and hard fixes for soma location, dendrite location and axon location]
Social networks and community sites let us learn things from the collective behavior of contributors, which shows the limits in our knowledge and our knowledge representations
Domain Knowledge
Ontologies
Atlases/Maps
Annotation
Claims, assertions
Registries
Derived data
Models and
simulations
Analyses
Data
Databases Data sets
Literature
Search and Discovery
We cannot shoe-horn everything into a single representation or system; instead, we need to figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
SciCrunch: Creating a Data and Resource Discovery Environment
BD2K: Creating a Data Discovery Index
• BioCADDIE
– Dr. Lucila Ohno-Machado PI
– FORCE11: Community engagement piece
• What should a data discovery index do?
– Task Forces
– Pilot projects
• How should it be built?
http://biocaddie.org
BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
BD2K-K2BD: Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also accounting of what happens to
data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
What have we learned: Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Currently very difficult to track data as it moves across the landscape
– Makes it difficult to learn from combined efforts
Working with and extending ontologies: Neurolex.org
http://neurolex.org Larson et al, Frontiers in Neuroinformatics, in press
•Semantic MediaWiki
•Provide a simple interface for defining the concepts required
•Lightweight semantics: sets of triples
•Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
•Community based:
•Anyone can contribute their terms, concepts, things
•Anyone can edit
•Anyone can link
•Accessible: searched by Google
•Growing into a significant knowledge base for neuroscience
Demo D03
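"Sets of triples" can be illustrated with a few subject, predicate, object statements and a wildcard lookup. This is a minimal sketch: the property names are illustrative and are not claimed to be Neurolex's actual property vocabulary, and `Triple` and `match` are hypothetical names.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Illustrative statements; the predicates are assumed names, not Neurolex's schema.
TRIPLES = [
    Triple("Cerebellum Purkinje cell", "is a", "Neuron"),
    Triple("Cerebellum Purkinje cell", "soma located in", "Purkinje cell layer of cerebellar cortex"),
    Triple("Cerebellum Purkinje cell", "neurotransmitter released", "GABA"),
    Triple("Cerebellum granule cell", "is a", "Neuron"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

neurons = [t.subject for t in match(TRIPLES, predicate="is a", obj="Neuron")]
purkinje_facts = match(TRIPLES, subject="Cerebellum Purkinje cell")
```

Because every statement has the same three-part shape, anyone can add a fact without changing a schema, which is what keeps the semantics lightweight.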
Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
[Figure: data set GSE13732 as it is analyzed, curated and mirrored by different resources]
Stable identifiers and annotations allow us
to “track” data as it moves
Data flows throughout the ecosystem...value is added
But…even our standards need standards
GSE13732
E-GEOD-13732
GEO:GSE13732
Standard identifier format for all data federation sources; text mining to deal with inconsistencies
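The three spellings above refer to the same GEO series, and mapping them to one form is a small normalization step. The sketch below assumes `GEO:GSE<number>` as the canonical form, which is a choice made here for illustration; the function name and pattern are hypothetical and cover only these three variants.

```python
import re

# Accepts "GSE13732", "GEO:GSE13732" and the "E-GEOD-13732" spelling.
_GEO_SERIES = re.compile(r"^(?:GEO:\s*GSE|E-GEOD-|GSE)(\d+)$", re.IGNORECASE)

def normalize_geo_series(raw: str) -> str:
    """Rewrite a known GEO series spelling as GEO:GSE<number> (assumed canonical form)."""
    found = _GEO_SERIES.match(raw.strip())
    if found is None:
        raise ValueError(f"not a recognized GEO series identifier: {raw!r}")
    return f"GEO:GSE{found.group(1)}"

variants = ["GSE13732", "E-GEOD-13732", "GEO:GSE13732"]
canonical = {normalize_geo_series(v) for v in variants}
```

Anything the pattern does not recognize raises an error rather than being guessed at, so unexpected spellings surface for curation.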
Same data: different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic
morphine; DRG with respect to saline, so direction of change is
opposite in the 2 databases
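The sign flip described above can be made explicit by recording which condition each result is measured against. This is a sketch under stated assumptions: the `Result` fields, the gene name and the fold-change values are all hypothetical, and changes are assumed to be log ratios so that swapping the baseline negates the value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    gene: str
    condition: str          # condition being measured
    baseline: str           # condition it is compared against
    log_fold_change: float  # log ratio of condition over baseline

def relative_to(result: Result, baseline: str) -> Result:
    """Re-express a two-condition comparison against the requested baseline."""
    if result.baseline == baseline:
        return result
    if result.condition == baseline:
        # Swapping numerator and denominator negates a log ratio.
        return Result(result.gene, result.baseline, baseline, -result.log_fold_change)
    raise ValueError(f"{baseline!r} is not part of this comparison")

# Hypothetical values: one source reports relative to chronic morphine, the other to saline.
gemma_style = Result("GENE_X", "saline", "chronic morphine", -1.2)
drg_style = Result("GENE_X", "chronic morphine", "saline", 1.2)
aligned = relative_to(gemma_style, "saline")
```

Only after both results share a baseline does comparing "up" versus "down" across the two databases make sense.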
Chronic vs acute morphine in striatum
• Analysis:
•1370 statements from Gemma regarding gene expression as a function of chronic morphine
•617 were consistent with DRG, so over half of the claims of the paper were not confirmed in this analysis
•Results for 1 gene were opposite in DRG and Gemma
•45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data
has gone and what has been done with it
How many do we use?
These resources themselves need to be citable
Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
– Machine-processable (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
– Software/databases
– Antibodies
– Genetically modified organisms
Launched February 2014: > 30 journals participating
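"Machine-processable" means a script can pull the identifiers out of a methods section without human help. The sketch below is one way that could look: the regular expression is an assumed, simplified pattern rather than the official RRID syntax, and the identifiers in the sample sentence are made-up placeholders.

```python
import re

# Assumed shape: "RRID:" followed by a letter prefix, "_" or "-", then an accession.
_RRID = re.compile(r"RRID:\s*([A-Za-z]+[_\-][A-Za-z0-9_\-]+)")

def find_rrids(text: str) -> list:
    """Return the identifiers that follow each RRID: tag, in order of appearance."""
    return _RRID.findall(text)

# Placeholder identifiers for illustration only.
methods = (
    "Sections were stained with an anti-X antibody (RRID:AB_0000001) and "
    "analyzed with an imaging package (RRID:SCR_000001); see also RRID: nif-0000-00001."
)
rrids = find_rrids(methods)
```

Because the tag sits in the article text itself, the same scan works on any journal that follows the convention, which is the point of keeping the format uniform across publishers.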
What studies have used...?
•>200 articles have appeared to date
•>30 journals
•Data set being made available to community
•> 650 RRIDs
•~10% disappeared after copyediting
•5% were in error
Database available at: https://www.force11.org/node/5635
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers

Reproducibility in human cognitive neuroimaging: a community-­driven data sha...
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
 

Último

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 

Último (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Data-knowledge transition zones within the biomedical research ecosystem

  • 1. Data-knowledge transition zones within the biomedical research ecosystem Maryann E. Martone, Ph. D. University of California, San Diego
  • 2. • NIF is an initiative of the NIH Blueprint consortium of institutes – What types of resources (data, tools, materials, services) are available to the neuroscience community? – How many are there? – What domains do they cover? What domains do they not cover? – Where are they? • Web sites • Databases • Literature • Supplementary material – Who uses them? – Who creates them? – How can we find them? – How can we make them better in the future? http://neuinfo.org • PDF files • Desk drawers NIF has been surveying, cataloging and tracking the neuroscience resource landscape since < 2008
  • 3. BD2K: Big Data to Knowledge • BD2K - a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement. • BD2K aims to develop the new approaches, standards, methods, tools, software, and competencies that will enhance the use of biomedical Big Data by: – Facilitating broad use of biomedical digital assets by making them discoverable, accessible, and citable – Conducting research and developing the methods, software, and tools needed to analyze biomedical Big Data – Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science – Supporting a data ecosystem that accelerates discovery as part of a digital enterprise http://bd2k.nih.gov/
  • 4. How many resources are there?
  • 5. How do resources get added to the NIF? •NIF curators •Nomination by the community •Semi-automated text mining pipelines NIF Registry Requires no special skills Manual and semi-automated updates •NIF Data Federation •DISCO interop •Requires some programming skill •Open Source Brain < 2 hr •Automated update via NIF DISCO dashboard Low barrier to entry; incremental refinement (Marenco et al., 2010; 2014)
  • 6. Registry vs Federation: Metadata about resource vs metadata/data in database
  • 7. What resources are available for GRM1? With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
  • 8. THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY
  • 9. Database Software Application Data Analysis Service Topical Portal Core Facility Ontology Software Resource Years: Anita Bandrowski and Burak Ozyurt Population, Coverage and Linkage of Resource Registry
  • 10. • Automated text mining is used to look for “web page last updated” or copyright dates – Identified for 570 resources – 373 were not updated within the last 2 years (65%) • Manual review of ~200 resources – 38 not updated within the past 2 years (~20%) – 8 migrated to new addresses or institutions – 7 are no longer in service (~3%) – 3 were deemed no longer appropriate What happens to these resources? The Registry provides a persistent identifier and metadata record for what once existed but no longer does
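The staleness check described on this slide can be sketched roughly as follows. This is a minimal illustration, not the NIF pipeline: the regular expressions, the function names `latest_year` and `is_stale`, and the year-level granularity are all assumptions; the slide only says that text mining looks for "web page last updated" notices or copyright dates.

```python
import re
from datetime import date

# Hypothetical patterns for "last updated" notices and copyright lines.
UPDATED_RE = re.compile(
    r"last\s+(?:updated|modified).{0,30}?\b((?:19|20)\d{2})\b", re.I
)
COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|copyright).{0,20}?\b((?:19|20)\d{2})\b"
    r"(?:\s*[-–]\s*((?:19|20)\d{2})\b)?",
    re.I,
)


def latest_year(page_text):
    """Return the most recent year found in update or copyright notices, or None."""
    years = [int(m.group(1)) for m in UPDATED_RE.finditer(page_text)]
    for m in COPYRIGHT_RE.finditer(page_text):
        # A copyright range such as 2005-2012 contributes both years.
        years.extend(int(g) for g in m.groups() if g)
    return max(years) if years else None


def is_stale(page_text, today, max_age_years=2):
    """True if the newest detected year is more than max_age_years old.

    Returns None when no date could be found on the page.
    """
    year = latest_year(page_text)
    if year is None:
        return None
    return today.year - year > max_age_years


if __name__ == "__main__":
    print(is_stale("Last updated: March 3, 2011", date(2014, 6, 1)))
    # Share of dated resources flagged on the slide: 373 of 570.
    print(f"{373 / 570:.0%}")
```

A check like this only sees what a page declares about itself, which may be why the manual review on the slide found a much lower stale rate (~20%) than the automated pass (65%).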
  • 11. Keeping content up to date Connectome Tractography Epigenetics •New tags come into existence •New resource types come into existence, e.g., Mobile apps •Resources add new types of content •Change name •Change scope •> 7000 updates to the registry last year It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
  • 13. NIF data federation NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views 250 sources > 800 M records
  • 14. What do you mean by data? Databases come in many shapes and sizes • Primary data: – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL) • Secondary data – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS) • Tertiary data – Claims and assertions about the meaning of data • E.g., gene upregulation/downregulation, brain activation as a function of task • Registries: – Metadata – Pointers to data sets or materials stored elsewhere • Data aggregators – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede • Single source – Data acquired within a single context, e.g., Allen Brain Atlas Researchers are producing a variety of information artifacts using a multitude of technologies
  • 15. NIF: A search engine for data
  • 16. NIF Information Framework: Query and alignment • Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology • Available as services through NIF and BioPortal NIFSTD Organism NS Function Molecule Investigation Subcellular structure Macromolecule Gene Molecule Descriptors Techniques Reagent Protocols Cell Resource Instrument Dysfunction Quality Anatomical Structure NIF uses ontologies to enhance search and discovery but is not constrained by them
  • 17. Find clinical trials that have data available?
  • 18. Current challenge: With so much available, how do I find what I need? • “What genes are upregulated by chronic morphine?” – It depends • Most often use cases require connecting a researcher to relevant data sets and appropriate tools – Depending upon the data and tools, the answers may differ • Many databases have tool bases and workflows that they support – Much value has been added to individual data sets
  • 19. Facets and filters: Progressive refinement of search Facet/Filter Source Category Index Query Addiction Registry Data Gene Gemma Gene Organism Expression level Geo Integrated Expression Literature More effective to start with a general query and use the navigation to refine search
  • 20. Concept Mapper: Alignment and weighting Find:gene cerebellum = find all sources with a column mapped to gene that also contain the keyword cerebellum; Find:gene Anatomy:cerebellum
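The two example queries on this slide suggest a small prefix syntax. The sketch below shows one way such a query string might be split into its parts. It is an illustration under assumptions, not the Concept Mapper implementation: `ParsedQuery` and `parse_query` are hypothetical names, and reading `Anatomy:cerebellum` as "restrict the term to a column mapped to Anatomy" is an interpretation of the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedQuery:
    # Concepts that a source must have a mapped column for (from "Find:").
    find_columns: list = field(default_factory=list)
    # Concept -> term that must appear in the column mapped to that concept.
    column_terms: dict = field(default_factory=dict)
    # Free-text keywords matched anywhere in the source.
    keywords: list = field(default_factory=list)


def parse_query(query):
    """Split a query such as 'Find:gene Anatomy:cerebellum' into its parts."""
    parsed = ParsedQuery()
    for token in query.split():
        prefix, sep, value = token.partition(":")
        if not sep or not value:
            # No usable prefix: treat the token as a plain keyword.
            parsed.keywords.append(token)
        elif prefix.lower() == "find":
            parsed.find_columns.append(value.lower())
        else:
            parsed.column_terms[prefix.lower()] = value.lower()
    return parsed


if __name__ == "__main__":
    print(parse_query("Find:gene cerebellum"))
    print(parse_query("Find:gene Anatomy:cerebellum"))
```

Under this reading, the first query keeps `cerebellum` as a keyword matched anywhere, while the second pins it to an anatomy column, which would narrow the set of matching sources.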
  • 21. “Data trails”: Linking data and analysis tools
  • 22. Query across Registry and Federation • Registry and Federation were treated separately, even though Federation comprises views of Registry entries • Experimenting with new combined index
  • 23. SciCrunch: A “social network” for resources • NIF is a general search engine across all of neuroscience • Very powerful for discovery and general browsing • Can perform analytics across the spectrum of biomedical resources • Many communities want to create more focused portals • Specialized for their domain • Restrict the particular sources • Organize the data according to their needs • Use their own branding • How do we create a system that satisfies community needs without creating another silo?
  • 25. Where can I find validated antibodies against CART?
  • 26. [Chart; log-scale axis ticks: 1, 100, 10,000, 1,000,000, 100,000,000, 10,000,000,000; categories: SOFTWARE, PROTOCOLS, PHENOTYPE, PATHWAYS, MULTIMEDIA, MOLECULE, MICROARRAY, IMAGES, GENE, DRUGS, DATASET, CLINICAL TRIALS, BRAIN ACTIVATION FOCI, ATLAS, ANNOTATION] All databases in the SciCrunch Federation become immediately available through More Resources
  • 27. Breaking down silos: Community enrichment It’s like a Mendeley for resources!
  • 28. SciCrunch Shared Resources Undiagnosed Disease Program Phenotype RCN One Mind for Research Consortia-Pedia Faster Cures Model Organism Databases Community Outreach Shared curation; shared expertise Resource Identification Portal Aging Neuroscience dkNET Phenotypes NSF Earthcube
  • 29. Making use of community Facet/Filter Source Category Index Community Community Community resources SciCrunch data (all) Gene Gemma Gene Organism Expression level Geo Integrated Expression Literature Brings expertise of community to understanding how to work with data
  • 30. KNOWLEDGE TO DATA: GAP ANALYSIS
  • 31. Looking across the ecosystem: Where are the data? Data Sources Bringing knowledge to data: Gap analysis
  • 33. SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186 Adult mouse brain connectivity matrix: revenge of the midbrain
  • 34. The tale of the tail “Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed. •One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ... •A second reason is that standard neuroimaging atlases such as the Harvard- Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail... •A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory. Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature” Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
  • 35. Importance of comprehensive indices: For how many proteins are there antibodies? 0 1-10 11-100 101-1000 1001+ Human, protein coding genes (Entrez Gene) vs # of search results from the antibodyregistry.org Antibodyregistry.org; Trish Whetzel and Anita Bandrowski
  • 37. Data-Knowledge Mismatch Dutowski et al., 2013: Nature Biotechnology
  • 38. The scourge of neuroanatomical nomenclature •NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions •Brain Architecture Management System (rodent) •Temporal lobe.com (rodent) •Connectome Wiki (human) •Brain Maps (various) •CoCoMac (primate cortex) •UCLA Multimodal database (Human fMRI) •Avian Brain Connectivity Database (Bird) •Total: 1800 unique brain terms (excluding Avian) •Number of exact terms used in > 1 database: 42 •Number of synonym matches: 99 •Number of 1st order partonomy matches: 385
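The "exact terms used in > 1 database" count can be illustrated with a small sketch. The term lists below are invented toy data and the function names are hypothetical; the only normalization applied is case and whitespace folding, which is an assumption about what "exact" means here. Synonym and partonomy matching would need an ontology and is not attempted.

```python
from collections import Counter


def normalize(term):
    """Fold case and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def shared_terms(databases):
    """Return the sorted terms that occur in more than one database.

    databases: dict mapping a database name to an iterable of brain-region terms.
    """
    counts = Counter()
    for terms in databases.values():
        # Use a set so a term repeated within one database is counted once.
        counts.update({normalize(t) for t in terms})
    return sorted(t for t, n in counts.items() if n > 1)


if __name__ == "__main__":
    toy = {  # invented example data, not taken from the NIF sources
        "db_a": ["Hippocampus", "CA1", "Dentate gyrus"],
        "db_b": ["hippocampus", "Field CA1"],
        "db_c": ["Dentate  Gyrus"],
    }
    print(shared_terms(toy))
    # Proportion reported on the slide: 42 of 1800 unique terms.
    print(f"{42 / 1800:.1%}")
```

In the toy data, "CA1" and "Field CA1" do not match exactly even though they plausibly name the same region, which is the kind of gap the synonym and partonomy counts on the slide address.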
  • 39. 6 parcellation schemes of mouse prefrontal cortex based on Nissl alone Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
  • 40. How many neuron types are there? NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain “The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia) but it is currently unknown exactly how many classes exist.”
  • 41. Location of Cell Soma Location of dendrites Location of local axon arbor Transition Zones: Neurons and their properties
  • 42. Analysis of Red Links in the Neuron Registry • INCF Project – Neuron Registry • Neurolex.org • Semantic MediaWiki – > 30 experts worldwide – Fill out neuron pages in Neurolex Wiki [Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location and axon location; axis from 0 to 300] Social networks and community sites let us learn things from the collective behavior of contributors and show limits in our knowledge and our knowledge representations
  • 43. Domain Knowledge Ontologies Atlases/Maps Annotation Claims, assertions Registries Derived data Models and simulations Analyses Data Databases Data sets Literature Search and Discovery Cannot try to shoe-horn everything into a single representation or system, but figure out how information (data + knowledge) can flow between them; Knowledge is fluid and will continually update SciCrunch: Creating a Data and Resource Discovery Environment
  • 44. BD2K: Creating a Data Discovery Index • BioCADDIE – Dr. Lucila Ohno- Machado PI – FORCE11: Community engagement piece • What should a data discovery index do? – Task Forces – Pilot projects • How should it be built? http://biocaddie.org BIOMEDICAL AND HEALTH CARE DATA DISCOVERY AND INDEXING ENGINE CENTER
  • 45. NIF team (past and present) Jeff Grethe, UCSD, Co Investigator, Interim PI Amarnath Gupta, UCSD, Co Investigator Anita Bandrowski, NIF Project Leader Gordon Shepherd, Yale University Perry Miller Luis Marenco Rixin Wang David Van Essen, Washington University Erin Reid Paul Sternberg, Cal Tech Arun Rangarajan Hans Michael Muller Yuling Li Giorgio Ascoli, George Mason University Sridevi Polavarum Fahim Imam Larry Lui Andrea Arnaud Stagg Jonathan Cachat Jennifer Lawrence Svetlana Sulima Davis Banks Vadim Astakhov Xufei Qian Chris Condit Mark Ellisman Stephen Larson Willie Wong Tim Clark, Harvard University Paolo Ciccarese Karen Skinner, NIH, Program Officer (retired) Jonathan Pollock, NIH, Program Officer And my colleagues in Monarch, dkNet, 3DVC, Force 11
  • 46. BD2K-K2BD: Data Discovery Index • Accounting of what is available – Comprehensive resource registry – UPC’s for research resources • Information framework – Major concepts contained in data, but also accounting of what happens to data as it flows through the ecosystem (provenance) • Community-based portals into shared data resources – Share expertise – Metrics of trust – Shared curation and upkeep • Two way validation of knowledge to data
  • 47. Registry vs Federation: Metadata about resource vs metadata/data in database With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
  • 48. What have we learned: Grabbing the long tail of small data • NIF is in a unique position to ask questions against the data resource landscape • The data space is not uniform • Data “flows” from one resource to the next – Data is reinterpreted, reanalyzed or added to • Currently very difficult to track data as it moves across the landscape – Makes it difficult to learn from combined efforts
  • 50. Working with and extending ontologies: Neurolex.org http://neurolex.org Larson et al., Frontiers in Neuroinformatics, in press •Semantic MediaWiki •Provide a simple interface for defining the concepts required •Lightweight semantics: sets of triples •Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework •Community based: •Anyone can contribute their terms, concepts, things •Anyone can edit •Anyone can link •Accessible: searched by Google •Growing into a significant knowledge base for neuroscience Demo D03
  • 51. Neuron Lexicon: Gauging the state of knowledge in neuroscience • Led by Dr. Gordon Shepherd • > 30 worldwide experts • Simple set of properties • Consistent naming scheme • Integrated with Structural Lexicon • Used for annotation in other resources, e.g., NeuroElectro
  • 52. Analyzed Curated GSE13732 Analyzed Mirrored Stable identifiers and annotations allow us to “track” data as it moves Data flows throughout the ecosystem...value is added
  • 53. Analyzed Curated GSE13732 Analyzed Mirrored But…even our standards need standards GSE13732 E-GEOD-13732 GEO:GSE13732 Standard identifier format for all data federation sources; text mining to deal with inconsistencies
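The three spellings on this slide all point at the same GEO series. A minimal sketch of mapping them to one form follows; the canonical form `GEO:GSE<number>` and the function name `canonical_geo_series` are assumptions chosen for illustration, since the slide does not say which format the federation standardized on.

```python
import re

# Variants shown on the slide: GSE13732, GEO:GSE13732, E-GEOD-13732.
_PATTERNS = [
    re.compile(r"^(?:GEO:)?GSE(\d+)$", re.I),
    re.compile(r"^E-GEOD-(\d+)$", re.I),
]


def canonical_geo_series(identifier):
    """Map known spellings of a GEO series accession to one canonical string.

    Returns None if the identifier matches none of the known variants.
    """
    text = identifier.strip()
    for pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            return f"GEO:GSE{int(match.group(1))}"
    return None


if __name__ == "__main__":
    for raw in ["GSE13732", "E-GEOD-13732", "GEO:GSE13732"]:
        print(raw, "->", canonical_geo_series(raw))
```

Rule-based normalization like this covers known variants only; the slide notes that text mining is also used to deal with the remaining inconsistencies.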
  • 54. Same data: different analysis • Gemma: Gene ID + Gene Symbol • DRG: Gene name + Probe ID • Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases Chronic vs acute morphine in striatum • Analysis: •1370 statements from Gemma regarding gene expression as a function of chronic morphine •617 were consistent with DRG, so over half of the claims of the paper were not confirmed in this analysis •Results for 1 gene were opposite in DRG and Gemma •45 did not have enough information provided in the paper to make a judgment NIF is working to make it easier to find where data has gone and what has been done with it
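Because the two databases report change against opposite baselines, a direct comparison has to invert one side first. The sketch below shows that step on invented toy data. The function name, the `"up"`/`"down"` encoding, and the assumption that both sources have already been joined on a shared gene key are illustrative and not taken from the analysis on the slide.

```python
FLIP = {"up": "down", "down": "up"}


def compare_claims(gemma, drg, baselines_inverted=True):
    """Count agreement between two sets of expression-direction claims.

    gemma, drg: dicts mapping a gene key to "up" or "down".
    baselines_inverted: if True, flip the DRG direction before comparing,
    since the two sources use opposite reference conditions.
    """
    result = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, direction in gemma.items():
        other = drg.get(gene)
        if other is None:
            result["missing"] += 1
            continue
        if baselines_inverted:
            other = FLIP[other]
        key = "consistent" if other == direction else "opposite"
        result[key] += 1
    return result


if __name__ == "__main__":
    gemma = {"GeneA": "up", "GeneB": "down", "GeneC": "up"}  # toy data
    drg = {"GeneA": "down", "GeneB": "down"}                # toy data
    print(compare_claims(gemma, drg))
    # Share of Gemma statements consistent with DRG on the slide: 617 of 1370.
    print(f"{617 / 1370:.0%}")
```

Without the flip, GeneA in the toy data would look like a contradiction when it is actually the same finding stated against a different baseline.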
  • 55. How many do we use? These resources themselves need to be citable
  • 56. Resource Identification Initiative: Linking resources to literature • Have authors supply appropriate identifiers for key resources used within a study such that they are: – Machine processible (i.e., unique identifier that resolves to a single resource) – Outside of the paywall – Uniform across journals and publishers • Pilot project: SciCrunch portal serving identifiers for – Software/databases – Antibodies – Genetically modified organisms Launched February 2014: > 30 journals participating
  • 57. What studies have used...? •>200 articles have appeared to date •>30 journals •Data set being made available to community •> 650 RRID’s •~10% disappeared after copyediting •5% were in error Database available at: https://www.force11.org/node/5635
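Identifiers of this kind are machine processible, so finding them in article text can be as simple as a pattern match. The sketch below is an assumption-laden illustration: the regular expression is a simplified guess at the `RRID:<prefix>_<id>` shape, the accession numbers in the example are placeholders rather than real registry entries, and `extract_rrids` is a hypothetical helper, not part of any SciCrunch tool.

```python
import re

# Simplified pattern: "RRID:" followed by a letter prefix, an underscore,
# and an identifier body. Real RRIDs may need a more permissive rule.
RRID_RE = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9_:-]+)")


def extract_rrids(text):
    """Return the distinct RRIDs found in text, in order of first appearance."""
    seen = []
    for match in RRID_RE.finditer(text):
        # Drop trailing separators picked up from surrounding punctuation.
        rrid = "RRID:" + match.group(1).rstrip(":-")
        if rrid not in seen:
            seen.append(rrid)
    return seen


if __name__ == "__main__":
    methods = (
        "Sections were stained with an anti-CART antibody (RRID:AB_0000001) "
        "and analyzed with imaging software (RRID: SCR_0000002). "
        "RRID:AB_0000001 was diluted 1:500."
    )
    print(extract_rrids(methods))
```

A scan like this run before and after copyediting is one way the loss reported on the slide (~10% of identifiers disappearing) could be detected.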
  • 58. Neurolex: > 1 million triples Dr. Yi Zeng: Chinese neural knowledge base NIF Cell Graph This is your brain on computers