A Deep Survey of the Digital Resource Landscape: Perspectives from the Neuroscience Information Framework
1. A Deep Survey of the Digital
Resource Landscape:
Perspectives from the Neuroscience
Information Framework
Maryann E. Martone, Ph.D.
University of California, San Diego
2. • NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the
neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
3. The Neuroscience Information
Framework
• NIF has developed a
production technology
platform for researchers to:
– Discover
– Share
– Analyze
– Integrate
neuroscience-relevant
information
• Since 2008, NIF has
assembled the largest
searchable catalog of
neuroscience data and
resources on the web
• Cost-effective and
innovative strategy for
managing data assets
“This unique data depository serves as a model
for other Web sites to provide research data.”
- Choice Reviews Online
NIF is poised to capitalize on the new tools
and emphasis on big data and open
science
4. http://neuinfo.org
June 10, 2013 dkCOIN Investigator's Retreat
The Neuroscience Information Framework: Discovery and
utilization of web-based resources for neuroscience
• A portal for finding and using
neuroscience resources
A consistent framework for
describing resources
Provides simultaneous
search of multiple types of
information, organized by
category
Supported by an expansive
ontology for neuroscience
Utilizes advanced
technologies to search the
“hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database
Federation
Registry
5. Part 1: Surveying the
resource landscape
•NIF Registry: A catalog
of neuroscience-
relevant resources
•> 6000 currently
listed
•>2200 databases
•And we are finding
more every day
6. How do resources get added to the
NIF Registry?
•NIF curators
•Nomination by the
community
•Semi-automated text mining
pipelines
NIF Registry
Requires no special
skills
Site map available for
local hosting
•NIF Data Federation
•DISCO interop
•Requires some
programming skill
Bandrowski et al., 2012
7. NIF Registry
• Extended over time
– Parent resource
– Supporting agency
– Grant numbers
– Accessibility
– Related to
– Organism
– Disease or condition
– Last updated
Timeline: First catalog, SFN Neuroscience Database Gateway (~2003); NIF 0.5 (2006); NIF 1.0+ (2008)
Simple metadata model:
Name, description, type, URL, other names, keywords, unique identifier
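The field list on this slide can be sketched as a record type. This is only an illustration, not NIF's actual schema: the class name `RegistryRecord`, the Python field names, and the example values are all assumptions derived from the labels above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RegistryRecord:
    """Hypothetical record mirroring the slide's field list (not NIF's real schema)."""

    # Simple (original) metadata model
    name: str
    description: str
    resource_type: str
    url: str
    identifier: str
    other_names: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    # Fields added as the model was extended over time
    parent_resource: Optional[str] = None
    supporting_agency: Optional[str] = None
    grant_numbers: List[str] = field(default_factory=list)
    accessibility: Optional[str] = None
    related_to: List[str] = field(default_factory=list)
    organism: Optional[str] = None
    disease_or_condition: Optional[str] = None
    last_updated: Optional[str] = None

    def missing_extended_fields(self) -> List[str]:
        """Return the names of extended scalar fields that are still unset."""
        extended = ["parent_resource", "supporting_agency", "accessibility",
                    "organism", "disease_or_condition", "last_updated"]
        return [name for name in extended if getattr(self, name) is None]


# Made-up entry used only to exercise the sketch.
example = RegistryRecord(
    name="Example Brain Database",
    description="Illustrative entry only.",
    resource_type="Database",
    url="http://example.org",
    identifier="example-0001",
    organism="Mouse",
)
```

A helper such as `missing_extended_fields` is one way a curator could see which of the later-added fields still need attention for a given entry.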
8. Resource Curation
• NIF Registry is hosted
on the Semantic MediaWiki
platform NeuroLex
– Community can
add, review, edit
without special
privileges
– Searchable by Google
– Integrated with NIF
ontologies
– Graph structure
http://neurolex.org
10. Keeping the Registry Current
– NIF employs an automated link checker
– Last analysis: 478/6100 invalid URLs (~8%)
– 199 could not be located at another university or location and appear to be out of service (~3%)
– Bigger issue: Many resources are no longer updated or maintained
[Figure: resources added and resources last updated, by year, 1996-2014]
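An automated link check of the kind mentioned on this slide could look roughly like the following. This is a sketch under assumptions, not NIF's checker: the function names, the use of HEAD requests, and the rule that any status outside 200-399 counts as invalid are all choices made for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if it cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def check_links(urls: Iterable[str],
                fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each URL to True (valid) or False (unreachable or error status)."""
    return {url: 200 <= fetch(url) < 400 for url in urls}


def invalid_fraction(results: Dict[str, bool]) -> float:
    """Fraction of checked URLs that were invalid."""
    if not results:
        return 0.0
    return sum(1 for ok in results.values() if not ok) / len(results)
```

The `fetch` parameter lets the checker be exercised without network access, for example `check_links(urls, fetch=lambda u: 404)`. Some servers reject HEAD requests, so a production checker would probably fall back to GET before declaring a link invalid.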
11. • Automated text mining is used to look for “web page last
updated” or copyright dates
– Identified for 570 resources; manual review suggested that the results
were accurate although we can’t guarantee that the date itself is
accurate
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources identified by 3DVC for their
catalog
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
Tracking the fate of digital resources
Yuling Li, Paul Sternberg, Cal Tech
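The text-mining step described on this slide can be approximated with a few regular expressions. The sketch below is an assumption about how such a rule might work, not the pipeline used by the authors: the marker phrases, the 40-character window, and the function names are all illustrative.

```python
import re
from typing import Optional

# Assumed marker phrases; real pages vary widely and would need more rules.
MARKER = re.compile(r"last\s+(?:updated|modified)|copyright|©", re.IGNORECASE)
YEAR = re.compile(r"\b((?:19|20)\d{2})\b")


def latest_year_near_marker(text: str, window: int = 40) -> Optional[int]:
    """Return the most recent year found shortly after any marker phrase."""
    years = []
    for match in MARKER.finditer(text):
        snippet = text[match.end():match.end() + window]
        years.extend(int(year) for year in YEAR.findall(snippet))
    return max(years) if years else None


def is_stale(text: str, today_year: int, max_age_years: int = 2) -> Optional[bool]:
    """True if the page's date is older than max_age_years; None if no date found."""
    year = latest_year_near_marker(text)
    if year is None:
        return None
    return today_year - year > max_age_years
```

Working at year granularity is a simplification. As the slide notes, a rule like this can find the stated date reliably without guaranteeing that the date itself is accurate.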
12. Keeping content up
to date
Connectome
Tractography
Epigenetics
•New tags come into
existence
•New resource types come
into existence, e.g., Mobile
apps
•Resources add new types of
content
•Change name
•Change scope
•> 7000 updates to the
registry last year
It’s a challenge to keep the registry up to date;
sitemaps, curation, ontologies, community review
13. Ontology provides a human-centric
model for search and data integration
14. Last updated...
• Some neglected
resources are still
valuable
– Complete data sets
– Rare data
• Software may still be
usable
• Some databases, however, may
only be of historical
interest
– “all metalloproteins
found in PDB”
Are all databases and data sets equally valuable?
15. • The NIF Registry has created a linked data
graph of web-accessible resources
• Maintained on a community wiki
platform
• Provides data on the fluidity of the
resource landscape
– New resources continue to be created and
found
– Relatively few disappear altogether
– Many more grow stale, although their value
may still be significant
– Maintaining up to date curation requires
frequent updating
Summary
NIF Registry provides insight into the state of digital
resources on the web
16. Part 2: Surveying the data
landscape
•The NIF data federation performs deep search over
the content of over 200 databases
•New databases are added at a rate of 25-40 per year
•Latest update: Open Source Brain; ingest
completed in 2 hours
•Databases chosen on a variety of criteria:
•Early: testing different types of resources
•Thematic areas
•Volunteers
17. Data Federation Growth
[Figure: number of federated databases and number of federated records (millions), June 2008 to May 2013]
NIF searches the largest collation of
neuroscience-relevant data on the web
DISCO
18. Data Ingestion Architecture
Current
Planned
DISCO Dashboard Functions
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd
Yale University
19. DISCO Dashboard
• Management of registry resources
through a single administrative
dashboard
• Associated discovery pipeline
• Tools to manage data updates
• Change tracking
• Globally unique identifier creation
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd
Yale University
21. What are the connections of the
hippocampus?
Hippocampus OR “Cornu Ammonis” OR
“Ammon’s horn”
Query expansion: Synonyms
and related concepts
Boolean queries
Data sources
categorized by
“data type” and
level of nervous
system
Common views
across multiple
sources
Tutorials for using
full resource when
getting there from
NIF
Link back to
record in
original source
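The query expansion shown on this slide can be sketched as a small function that ORs a term with its synonyms. The synonym table and function names here are hypothetical. A real system would draw synonyms from an ontology rather than a hard-coded dictionary.

```python
from typing import Dict, List

# Toy synonym table keyed by lowercase term (illustrative only).
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Return the term OR-ed with its known synonyms as a Boolean query string."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(alternative) for alternative in alternatives)


print(expand_query("Hippocampus"))
# Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table is returned unchanged, so the expansion never narrows a search.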
22. Results are organized within a common
framework
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases
23. NIF Semantic Framework: NIFSTD ontology
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene
Ontology, ChEBI, Protein Ontology
NIFSTD
Organism
NS Function
Molecule
Investigation
Subcellular
structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction Quality
Anatomical
Structure
24. Use of Ontologies
• Controlled vocabulary for describing type of resource and
content
– Database, Image, Diabetes
• Entity-mapping of database and data content
• Data integration across sources
• Search: Mixture of mapped content and string-based
search
– Different parts of the infrastructure use the vocabularies in
different ways
– Utilize synonyms, parents, children to refine search
– Increasing use of other relationships and logical inferencing
• Generation of semantic content (i.e. RDF, Linked Data)
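Using parents and children to refine a search, as listed above, can be sketched with a toy part-of hierarchy. The table below is illustrative and is not taken from NIFSTD. The function names are assumptions.

```python
from typing import Dict, List, Set

# Toy part-of hierarchy, child -> parent (illustrative labels only).
PARENT: Dict[str, str] = {
    "CA1": "Hippocampus",
    "CA3": "Hippocampus",
    "Dentate gyrus": "Hippocampal formation",
    "Hippocampus": "Hippocampal formation",
}


def children_of(term: str, parent: Dict[str, str] = PARENT) -> List[str]:
    """Direct children of term, sorted for stable output."""
    return sorted(child for child, p in parent.items() if p == term)


def descendants_of(term: str, parent: Dict[str, str] = PARENT) -> Set[str]:
    """All terms below term; adding them to a query broadens the search."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children_of(current, parent):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def ancestors_of(term: str, parent: Dict[str, str] = PARENT) -> List[str]:
    """Chain of parents above term, nearest first; guards against cycles."""
    chain: List[str] = []
    seen = {term}
    while term in parent and parent[term] not in seen:
        term = parent[term]
        chain.append(term)
        seen.add(term)
    return chain
```

Searching for a region together with its descendants would retrieve records annotated only at a finer level, while the ancestor chain gives a fallback when a specific term returns nothing.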
27. The scourge of neuroanatomical nomenclature:
Importance of NIF semantic framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
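The exact-match and synonym-match counts above can be illustrated with a small comparison over term sets. This is a sketch with made-up data and hypothetical function names. It does not reproduce the analysis behind the numbers on the slide, and it leaves out partonomy matching.

```python
from collections import Counter, defaultdict
from typing import Dict, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(term.lower().split())


def shared_exact_terms(databases: Dict[str, Set[str]]) -> Set[str]:
    """Normalized terms that appear in more than one database."""
    counts: Counter = Counter()
    for terms in databases.values():
        counts.update({normalize(term) for term in terms})
    return {term for term, n in counts.items() if n > 1}


def synonym_matches(databases: Dict[str, Set[str]],
                    preferred: Dict[str, str]) -> Set[str]:
    """Concepts found in more than one database under more than one surface form."""
    dbs_by_concept = defaultdict(set)
    forms_by_concept = defaultdict(set)
    for db, terms in databases.items():
        for term in terms:
            form = normalize(term)
            concept = preferred.get(form, form)
            dbs_by_concept[concept].add(db)
            forms_by_concept[concept].add(form)
    return {concept for concept in dbs_by_concept
            if len(dbs_by_concept[concept]) > 1
            and len(forms_by_concept[concept]) > 1}


# Made-up term lists for three hypothetical databases.
databases = {
    "A": {"Hippocampus", "Striatum"},
    "B": {"hippocampus", "Ammon's horn"},
    "C": {"Cornu Ammonis"},
}
preferred = {"ammon's horn": "cornu ammonis"}
```

With this toy data only "hippocampus" is shared exactly, while "cornu ammonis" is recovered only through the synonym table, which is the effect the slide attributes to the semantic framework.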
30. What have we learned: Grabbing the
long tail of small data
• NIF can be used to survey the
data landscape
• Analysis of NIF shows multiple
databases with similar scope
and content
• Many contain partially
overlapping data
• Data “flows” from one
resource to the next
– Data is reinterpreted, reanalyzed or
added to
• Is duplication good or bad?
31. What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for
reanalysis, e.g., microarray data sets
from GEO; brain images from XNAT;
microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the
meaning of data
• E.g., gene
upregulation/downregulation,
brain activation as a function of
task
• Registries:
– Metadata
– Pointers to data sets or
materials stored elsewhere
• Data aggregators
– Aggregate data of the same
type from multiple
sources, e.g., Cell Image
Library, SUMSdb, Brede
• Single source
– Data acquired within a single
context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies
32. NIF Analytics: The Neuroscience Landscape
NIF is in a unique position to answer questions about the neuroscience
landscape
Where are the data?
Striatum
Hypothalamus
Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source
Vadim Astakhov, Kepler Workflow Engine
33. Whither neuroscience information?
∞
What is easily machine
processable and accessible
What is potentially knowable
What is known:
Literature, images, human
knowledge
Unstructured;
Natural language
processing, entity
recognition, image
processing and
analysis;
communication
34. Open world meets closed world
We know a lot about some things and less about others; some
of NIF’s sources are comprehensive; others are highly biased
But...NIF has > 900,000
antibodies, 250,000 model
organisms, and 3 million microarray
records
35. Diseases of nervous system
What drives discovery?
The combination of ontologies, diverse data and analytics lets us look at
the current landscape in interesting ways
Neurodegenerative
Seizure disorders
Neoplastic disease of nervous system
NIH RePORTER
NIF data federated sources
36. Embracing duplication: Data Mashups
•NIF queries across 3 of approximately 10 fMRI databases
•Two resources, Brede and SUMSdb, curated activation foci from the literature
•~300 PMIDs were common between Brede and SUMSdb
•PMID serves as a unique identifier for an article
•Same information; value added
Data is additive
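Because the PMID identifies an article uniquely, records from two sources can be joined on it. The sketch below uses invented PMIDs and coordinates. The data layout and function names are assumptions, not the actual Brede or SUMSdb formats.

```python
from typing import Dict, List, Set, Tuple

Focus = Tuple[float, float, float]  # hypothetical (x, y, z) activation coordinate
Source = Dict[str, List[Focus]]     # PMID -> foci curated from that article


def common_pmids(a: Source, b: Source) -> Set[str]:
    """PMIDs curated by both sources."""
    return set(a) & set(b)


def mash_up(a: Source, b: Source) -> Dict[str, Dict[str, List[Focus]]]:
    """Combine both sources per PMID, keeping track of where each record came from."""
    merged: Dict[str, Dict[str, List[Focus]]] = {}
    for source_name, source in (("a", a), ("b", b)):
        for pmid, foci in source.items():
            merged.setdefault(pmid, {})[source_name] = foci
    return merged


# Invented records for two hypothetical sources.
source_a: Source = {"100": [(1.0, 2.0, 3.0)], "200": [(4.0, 5.0, 6.0)]}
source_b: Source = {"200": [(4.0, 5.0, 6.0), (7.0, 8.0, 9.0)], "300": [(0.0, 0.0, 0.0)]}
```

Keeping the per-source records side by side, rather than deduplicating them, preserves whatever each curation effort added for the same article.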
37. Same data: different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic
morphine; DRG with respect to saline, so direction of change is
opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis:
•1370 statements from Gemma regarding gene expression as a function of
chronic morphine
•617 were consistent with DRG; over half of the claims of the paper were not
confirmed in this analysis
•Results for 1 gene were opposite in DRG and Gemma
•45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make life easier
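The baseline problem described above (one source reporting change relative to chronic morphine, the other relative to saline) can be handled by re-expressing both against the same baseline before comparing. The sketch below uses invented gene names and hypothetical function names. It assumes genes have already been matched across the two identifier schemes, which this example does not attempt.

```python
from typing import Dict

FLIP = {"up": "down", "down": "up"}


def relative_to(direction: str, baseline: str, target_baseline: str) -> str:
    """Re-express a two-condition change against target_baseline.

    With only two conditions, swapping the baseline flips the direction.
    """
    return direction if baseline == target_baseline else FLIP[direction]


def compare(first: Dict[str, str], second: Dict[str, str],
            first_baseline: str, second_baseline: str) -> Dict[str, int]:
    """Tally agreement once both sources are expressed against second_baseline."""
    tally = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, direction in first.items():
        other = second.get(gene)
        if other is None:
            tally["missing"] += 1
        elif relative_to(direction, first_baseline, second_baseline) == other:
            tally["consistent"] += 1
        else:
            tally["opposite"] += 1
    return tally


# Invented statements: gene -> direction of change.
first = {"GeneA": "down", "GeneB": "up", "GeneC": "up"}
second = {"GeneA": "up", "GeneB": "up"}
result = compare(first, second, "chronic morphine", "saline")
```

Recording the baseline alongside every statement is the kind of relatively simple standard the slide calls for. Without it, the two raw directions for "GeneA" would look contradictory even though they agree.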
38. Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs relational
vs RDF
– Effective search across semantically diverse sources
• NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics
– Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource
services
NIF provides a strategy and set of tools applicable to all
biomedical science
39. Where is the Neuroscience in NIF?
• Search semantics
• Ranking
• Resources supported by NIH Blueprint Institutes are
more thoroughly covered
• Data types, e.g., Brain activation foci
40. Building a Uniform Resource Layer
Discoverability
Accessibility
Web of Data
Data specified via simple semantics
Data in a usable form
Semantically-enabled search
Enhanced semantics
Standardized representation
Linked Open Data - RDF
Data resources simply described
Automated data harvesting technologies
Common resource registry
A production data (resource)
catalog and underlying technology
platform for researchers to
discover, share, access, analyze, and
integrate biomedical information
41. Community Built Uniform Resource
Layer
SciCrunch
NIF
Neuroscience
MONARCH
Animal Models
Community
Services
dkCOIN
Shared
Resources
Undiagnosed
Disease Program
Phenotype RCN
3D Virtual Cell
National Institute
on Aging
One Mind for
Research
BIRN
International
Neuroinformatics
Coordinating
Facility
Model Organism
Databases
Community
Outreach
DELSA
Varied
(not just a data catalog)
42. Each project shares resources and adds
unique value to the resource layer
•3DVC: Focus on models and simulation
•Gene Ontology: Focus on
bioinformatics tools
•National Institute on Aging: Aging-
related data sets
•Monarch: Phenotype-Genotype; deep
semantic data integration
•One Mind for Research: Biospecimen
repositories
•NeuroGateway: Computational
resources
•FORCE11: Tools for next-gen publishing
and e-scholarship
SciCrunch
SciCrunch is actively supporting multiple
communities; multiple communities are
enriching and improving SciCrunch
43. Customized portals and rankings
SciCrunch
NIF
Neuroscience
MONARCH
Animal Models
Community
Services
dkCOIN
Shared
Resources
Undiagnosed
Disease Program
Phenotype RCN
3D Virtual Cell
National Institute
on Aging
One Mind for
Research
BIRN
International
Neuroinformatics
Coordinating
Facility
Model Organism
Databases
Community
Outreach
DELSA
Varied
dkCOIN
Ontology
SciCrunch
Shared
Resources
44. Community
database:
beginning
Community
database:
End
Register your resource to NIF!
“How do I share my
data/tool?”
“There is no database
for my data”
1
2
3
4
Institutional
repositories
Cloud
INCF: Global
infrastructure
Government
Education
Industry
NIF is designed to leverage existing investments in resources and infrastructure
Tool repositories
45. Collaboration, competition, coordination, cooperation
• The diversity and dynamism of biomedical data will always make
data integration challenging
• The overall data space is vast: No one group or individual
can do everything
– Cooperation and coordination are essential
• Creating a core resource registry and data catalog allows
the entire community to track resources, work together to
keep it updated, promote cross-fertilization, and build
better resources
46. NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
Editor's notes
Lists all NIF resources registered at levels 2+ in the DISCO server. Shows their DISCO services and the location of DISCO files. Controls to filter, sort and page all resources.