Applied semantic technology and linked data

APPLIED LINKED DATA
AND SEMANTIC
TECHNOLOGY
Expanding a Neurobiology Dataset

Today we are discussing…
• What is the use case and who requested it?
• How do you import and normalize thousands of RDF
•
•
•
•

triples worth of gene data?
How do we enrich the normalized gene data with parallel
research data sets?
Creating instance pages without knowing exactly what will
be displayed on them.
Demonstration of the initial use cases
Question and answer session

Why?
• Prototype: How do we assemble the data mine and

refine the authoring tools?

How do we expand this to the research
community?
• How do we expand ownership of the data to research

professionals?
• How do we build systems in a way that research
professionals can author and link the data?
• How do we publish these new relationships to the wider
research community?

What is the Allen Institute for Brain
Science?
• Launched in 2003 with seed funding from founder and

philanthropist Paul G. Allen.
• Serving the scientific community is at the center of our mission
to accelerate progress toward understanding the brain and
neurological systems.
• The Allen Institute's multidisciplinary staff includes
neuroscientists, molecular biologists, informaticists, and
engineers.

“The Allen Institute for Brain Science is an
independent 501(c)(3) nonprofit medical
research organization dedicated to accelerating
the understanding of how the human brain
works.”

Human Brain Map
• Open, public online access
• A detailed, interactive three-

•
•

•

•

dimensional anatomic atlas of the
"normal" human brain
Data from multiple human brains
Genomic analysis of every brain
structure, providing a quantitative
inventory of which genes are
turned on where
High-resolution atlases of key brain
structures, pinpointing where
selected genes are expressed
down to the cellular level
Navigation and analysis tools for
accessing and mining the data

Biological Linked Data Map
• Open, public online access
• Data from multiple RDF data
•
•

•

•

stores
Complete import pipeline using
LDIF framework
Outlines of each imported
instance embedding inline wiki
properties and providing views of
imported properties from original
RDF datasets
Charting tools that „pivot‟ SPARQL
queries providing several views of
each query
Navigation and composition tools
for accessing and mining the data

Where did we get the data?
• KEGG : Kyoto Encyclopedia of Genes and Genomes
• “KEGG GENES is a collection of gene catalogs for all complete genomes

generated from publicly available resources, mostly NCBI RefSeq.”

• Diseasome
• “The Diseasome website is a disease/disorder relationships explorer and

a sample of an innovative map-oriented scientific work. Built by a team of
researchers and engineers, it uses the Human Disease Network data set.”

• DrugBank
• “The DrugBank database is a unique bioinformatics and cheminformatics

resource that combines detailed drug data with comprehensive drug target
information.”

• SIDER
• “SIDER contains information on marketed medicines and their recorded

adverse drug reactions. The information is extracted from public
documents and package inserts.”

New ontology map for import
•

Genes
•
•
•

•

Diseases
•
•

•

DrugBank : 4,772
KEGG : 2,482
SIDER : 924

Effects
•

•

Diseasome : 4,213
KEGG : 459

Drugs
•
•
•

•

DrugBank : 4,553
Diseasome : 3,919
KEGG : 9,841

SIDER : 1,737

Pathways
•

KEGG : 28,442

We chose to intentionally simplify the
ontology due to disagreements between
researchers about entity relationships and
subclasses.

Importing and mapping the Linked Data
•

R2R
•

•
•

•

32,900 instances were converted to the
wiki ontology.

Networked
Storage

Local
Storage

Download

583,746 properties mapped
Pathways were ignored for wiki
ontology import, but are available within
the triple store KEGG Pathway graph.

SIEVE
•

20,849 instances available in wiki
ontology after SILK normalization

•

Instance merging effected drugs,
genes, and diseases across datasets.

• Triple Store SPARQL Update

R2R
Mapping
Engine

Maps Entities to
New Ontology

Import to
Wiki

Sieve
Mapping
Engine

Normalizes Entities
across data sources

Normalize
Entities

Triple
Store

Available with
SPARQL Queries

LDIF: LINKED DATA
INTEGRATION
FRAMEWORK

Linked Data challenges
• Data sources that overlap in content may:
• Use a wide range of different RDF vocabularies
• Use different identifiers for the same real-world entity
• Provide conflicting values for the same properties
• Implications
• Queries become hand crafted for a specific RDF data set – no
different than using a proprietary API.
• Individual, improvised and manual merging techniques for data
sets.
• Integrating public datasets with internal databases poses

the same problems

Linked Data Integration Framework
• LDIF normalizes the Linked Data from multiple sources

into a clean, local target representation while keeping
track of data provenance.
1

Collect data: Managed download and update

2

Translate data into a single, target vocabulary

3

Resolve identifier aliases into local target URIs

4

Cleanse data and resolve conflicting values

5

Output to local file system or triple store

LDIF Pipeline
1

Collect data

2

Translate data

3

Supported Data Formats

Resolve
identities

4

Cleanse data

5

Output data

•
•
•

RDF Files (Multiple Formats
SPARQL Endpoints
Crawling Linked Data

Component Stack

LDIF Pipeline
1

Collect data

2

Translate data

Sources use a wide range of different
RDF vocabularies
dbpedia-owl:City

schema:Place

R2R

location:City

fb:location.citytown

3

Resolve
identities

4

Cleanse data

5

Output data

Component Stack

LDIF Pipeline
1

Collect data

2

Sources use different identifiers for the
same entity

Translate data

London, England
London, MA, USA
London, TN, USA
London, TX, USA

SILK

London

3

London =
London, England

Resolve
identities

4

Cleanse data

5

Output data

Component Stack

LDIF Pipeline
1

Collect data

2

Translate data

3

Sources provide different values for the
same property
London, England
has a population
of 8.174M people

London, England
has a population
of 9.2M people

SILK

rdfs:population:
8.174M

Resolve
identities

4

Cleanse data

5

Output data

Component Stack

LDIF Pipeline
1

Collect data

2

Translate data

3

Supported Output Formats
•
•
•

N-Quads
N-Triples
SPARQL Update Stream

Resolve
identities

4

Cleanse data

5

Output data

Provenance tracking using Named Graphs

Component Stack

Normalized Linked Data is not always
pretty.

SELECT DISTINCT ?group1 ?item1 ?group2 ?item2 {
GRAPH ?G {
?target drugbank:geneName "{{{1}}}" ;
drugbank:geneName ?geneName ;
.
?drug drugbank:target ?target ;
drugbank:genericName ?item2 ;
drugbank:affectedOrganism ?group2 ;
.
}
GRAPH ?G1 {
?siderDrug sider:drugName ?item2 ;
rdfs:label ?group1 ;
sider:sideEffect ?effect;
.
?effect rdfs:label ?item1 .
}
}

Semantic MediaWiki
Semantic MediaWiki is a full-fledged framework, in
conjunction with many spinoff extensions, that can turn a
wiki into a powerful and flexible knowledge management
system. All data created within SMW can easily be
published via the Semantic Web, allowing other systems to
use this data seamlessly.

Four initial templates for each instance by
category
1. Custom infobox within outline

template
•

Visible inline properties

2. Outline template providing instance

information
3. Widget template displaying dynamic

charts or third party services
•

Donut charts and AIBS gene feed

4. Broad table SPARQL queries

showing instance relationships
5. Hidden inline properties for other

extensions

Creating instance wiki pages
• The Triple Store now contained tens of

thousands of recognized category
instances. Creating the pages require a
bot.

Create List of Page
Names

1.0

RDF Data

Download

1. Fetch the RDF dumps from an active

D2R server
2. Use regex to fetch the rdf:label property

that was mapped by R2R as an instance
name
3. Open category specific text file of wiki

markup (page of template includes)
4. Contact Neurowiki and request a new

page from the list of names with the
category content

Sanitize
Script

2.0

Create CSV

Category
Page Names

Text of Wiki
markup for page
instance

Read Open

3.0
Create MediaWiki Page
MediaWiki
Gateway rb
Framework

REST
interface

4.0

Neurowiki
Instance
Page

Final application stack
JavaScript View Layer (High Charts / Sproutcore / JQuery)

Semantic MediaWiki
Triple Store
(Virtuoso)

Relational Database
(MySQL)

LDIF

AIBS REST API
(Gene Heat Map Data)

AIBS

Diseasome

DrugBank

SIDER

KEGG

NEUROWIKI

How are base entities like Calcium
represented?
1. The wiki page and

corresponding template
components are rendered.

Drug Search

1.0
Wiki Page

Aggregate
Page of
Components

2. Relations are pulled from the

normalized data store of linked
data.

2.0
Calcium
Relations

Neurobase
Data Stores

3. The JavaScript components are

3.0
Selected
Widget for
Display

populated via a data feed

How are base entities like Calcium
represented?
• Because so many

organisms contain
calcium the
mappings to
affected species
were never created
to conserve space
in the data store.

Drug and Disease Class Ratios of Calcium
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class

What are the dangers of Propofol?
1. Propofol DrugBank relations are
Drug Search

Neurobase
Data Stores

rendered in corresponding
JavaScript components.

1.0
Propofol
Relations

2.0

Aggregate
Components

2. The Diseasome disease

relations show classes of illness
Propofol affects.

Propofol
Disease
Relations

3. An aggregate of SIDER side
3.0
Propofol Side
Effects

effects are rendered in relation
to Propofol and disease classes.

What are the dangers of Propofol?

Which drugs are used in Chemotherapy?
1.

2.

Disease
Search

DrugBank and AIBS relations to
genes affected by both the disease
and drug.

3.

SIDER side effects related to the
gene, disease, and drug.

4.

DrugBank drug glossary definition
specifying various forms of Cancer
treatment.

Neurobase
Data Stores

1.0
Disease
Relations

Diseasome disease relations
normalized by LDIF.

Aggregate
Components

2.0
Gene Drug
Relations

3.0
Drug Side
Effects

4.0
Drug Info Box

Which Drugs are used in Chemotherapy?

Drug and Disease Class Ratios of AR

Drug and Side Effect Ratios of AR
Inner Circle: Drugs by Affected Species, Outer Circle: Side Effect Ratios of Drugs

Drug and Disease Class Ratios of Nilutamide

Drug and Disease Class Ratios of Bicalutamide

Expanding the Prototype
• Semantic MediaWiki query construction
• Could this be done in SPARQL?
• Authoring SILK / R2R mappings for the LDIF Pipeline
• Extremely difficult and the editors are not intuitive
• How do you get data owners to fuse the sets and create

the data store themselves?
• Tested with Aura Wiki prototype

• Expand authoring provenance
• How do we ensure new data / links comes from an authoritative
source?

Today we discussed…
• The Allen Institute for Brain Science (AIBS)
• Four similar research data sets to interlink with the AIBS
•
•
•
•

•

data set
An import pipeline named Link Data Integration
Framework (LDIF)
The interlinking process for 5 concurrent research data
sets (AIBS, DrugBank, Diseasome, KEGG, SIDER)
A prototype neurobiology authoring platform.
Creating instance pages to display the new connections.
Demonstration of the initial use cases.

QUESTIONS?
COMMENTS?

THANK YOU.

Applied semantic technology and linked data

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (18)

Similar a Applied semantic technology and linked data

Similar a Applied semantic technology and linked data (20)

Último

Último (20)

Applied semantic technology and linked data

Notas del editor