SemanticCampLondon, 16th February 2008

Automatically indexing science using natural-language processing, RDF and SPARQL

Andrew Walkingshaw, Nick Day, Peter Corbett, Jim Downing, Joe Townsend, Peter Murray-Rust

February 16, 2008

Outline: Gathering data; Extracting (meta)data; Using the data; Thanks

Data sources

• Supplemental and experimental data
• Journals
• Self-archived papers (e.g. arXiv)
• Mainstream journalism
• Blogs

Supplemental data: CrystalEye

• http://wwmm.ch.cam.ac.uk/crystaleye/
• Repository for crystallographic data

Journals and arXiv

• “Traditional” journal articles
• Titles and abstracts...

Journalism and blogs

• Unstructured text with little semantics;
• ... hence Google Scholar, Web of Science, etc.

Semi-structured data: Golem

• We’ve got a lot of chemical data as CML
• http://en.wikipedia.org/wiki/Chemical_Markup_Language
• ... but we still need to get data out of that and into a more useful form
• hence Golem: http://www.lexical.org.uk/science/golem/
• GRDDLish strategy for extracting data from CML files: identify dialect-specific concepts with XPath expressions and XSLT stylesheets
• upshot: we can extract JSON objects from CML files.
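
The XPath-to-JSON idea can be sketched in a few lines. This is only an illustration of the approach, not Golem's actual API: the `CONCEPTS` dictionary, the `extract_observables` helper, the `dictRef` values and the sample CML fragment are all invented here, and the sketch uses the limited XPath subset in Python's standard `xml.etree.ElementTree` rather than full XPath plus XSLT.

```python
import json
import xml.etree.ElementTree as ET

# CML namespace; the sample document below is invented for illustration.
NS = {"cml": "http://www.xml-cml.org/schema"}

SAMPLE_CML = """\
<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <property dictRef="iucr:cell_length_a">
      <scalar units="units:ang">5.43</scalar>
    </property>
    <property dictRef="iucr:cell_length_b">
      <scalar units="units:ang">6.10</scalar>
    </property>
  </molecule>
</cml>
"""

# Hypothetical "dialect dictionary": concept name -> path that locates it.
CONCEPTS = {
    "cell_length_a": ".//cml:property[@dictRef='iucr:cell_length_a']/cml:scalar",
    "cell_length_b": ".//cml:property[@dictRef='iucr:cell_length_b']/cml:scalar",
}


def extract_observables(cml_text, concepts=CONCEPTS):
    """Return {concept: {"value": float, "units": str}} for concepts found."""
    root = ET.fromstring(cml_text)
    found = {}
    for name, path in concepts.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            found[name] = {"value": float(node.text), "units": node.get("units")}
    return found


if __name__ == "__main__":
    # Print the extracted observables as a JSON object.
    print(json.dumps(extract_observables(SAMPLE_CML), indent=2))
```

Keeping the paths in a dictionary means a new CML dialect only needs a new set of expressions, not new code.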

Free text: OSCAR3

• http://oscar3-chem.sourceforge.net/
• Natural-language parser for documents about chemistry
• Dark magic: don’t ask me how it works!
• ... but it can be run as a Jetty webservice so as long as it does, I’m happy
• Author’s blog: http://wwmm.ch.cam.ac.uk/blogs/corbett/

Getting the data in

• Everything (more or less) talks RSS nowadays...
• RSS 0.91, RSS 1.0 (which one?), Atom, etc etc etc.
• Thankfully: feedparser (http://feedparser.org/)
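
As a rough sketch of why feedparser helps: one call hides the differences between the RSS flavours and Atom, and every entry comes back with the same dictionary-like interface. The helper names `entry_text` and `read_feed` are made up for this example, and the `parse` argument exists only so the function can be tried without a network connection.

```python
def entry_text(entry):
    """Join an entry's title and summary into one block of free text."""
    parts = [entry.get("title", ""), entry.get("summary", "")]
    return "\n".join(p for p in parts if p)


def read_feed(source, parse=None):
    """Return (link, text) pairs for every entry in a feed.

    `source` is whatever the parser accepts (for feedparser: a URL,
    a file name or a string of XML).
    """
    if parse is None:
        import feedparser  # third-party: pip install feedparser
        parse = feedparser.parse
    parsed = parse(source)
    return [(e.get("link", ""), entry_text(e)) for e in parsed.entries]
```

Calling `read_feed(url)` with a real feed URL then gives the free text that the later steps work on, whatever format the feed was published in.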

Serializing metadata

• RDF – using:
  • Dublin Core terms
  • A homebrew ontology based on the IUCr’s CIF data format
  • and another homebrew ontology for OSCAR annotations
  • (it’d be good to standardise these, but to be honest, not many people are doing this sort of thing)
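
To make the Dublin Core part concrete, here is a minimal sketch that writes N-Triples by hand using only the standard library. In practice an RDF library such as rdflib would do this job; the `describe_entry` and `nt_literal` helpers and the example URI are assumptions for illustration, and the homebrew ontologies are left out because their terms are not given in these slides.

```python
DCTERMS = "http://purl.org/dc/terms/"


def nt_literal(text):
    """Quote a string as an N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def describe_entry(uri, title, contributors):
    """Return N-Triples lines describing one entry with Dublin Core terms."""
    lines = ["<%s> <%stitle> %s ." % (uri, DCTERMS, nt_literal(title))]
    for name in contributors:
        lines.append(
            "<%s> <%scontributor> %s ." % (uri, DCTERMS, nt_literal(name))
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(describe_entry("http://example.org/entry/1",
                         "A new salt", ["A. N. Author"]))
```

Storing each contributor as a plain literal is a simplification; a real store might use URIs for people instead.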

The process

• For each feed in a list of feeds:
  • If it’s supplying CML data, set Golem on each entry, get the observables out, and turn them into triples; run OSCAR3 over the title and/or abstract
  • If it’s not, extract the free text from each entry, send it to the OSCAR web service, and assign triples based on the chemical entities OSCAR finds
• Upload the RDF to your triple store
  • (I’m using the Talis platform, so that’s just curl)
• And...
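
The loop above might be sketched like this. Everything specific is an assumption: the five callables (`fetch_entries`, `is_cml`, `golem_triples`, `oscar_triples`, `upload`) are hypothetical stand-ins for the feed reader, Golem, the OSCAR web service and the triple store, and uploading once per feed is just one possible choice.

```python
def free_text(entry):
    """Join whichever of title and summary the entry has."""
    return " ".join(entry[k] for k in ("title", "summary") if entry.get(k))


def index_feeds(feeds, fetch_entries, is_cml, golem_triples, oscar_triples,
                upload):
    """Run the per-feed loop and return the number of triples uploaded.

    All five callables are hypothetical stand-ins supplied by the caller.
    """
    total = 0
    for feed in feeds:
        triples = []
        for entry in fetch_entries(feed):
            if is_cml(entry):
                # CML entry: structured observables via the Golem stand-in.
                triples.extend(golem_triples(entry))
            # In both branches the free text goes to the OSCAR stand-in.
            triples.extend(oscar_triples(free_text(entry)))
        if triples:
            upload(triples)  # e.g. an HTTP POST to the triple store
            total += len(triples)
    return total
```

Passing the tools in as arguments keeps the control flow visible and lets each step be swapped or tested on its own.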

SPARQL is great.

Just post queries at a SPARQL endpoint:

    authortemplate = '''
    PREFIX dc: <http://purl.org/dc/terms/>
    PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
    DESCRIBE ?file WHERE { ?file dc:contributor some author . }
    '''
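
A hedged sketch of what "just post queries" could look like with the standard library follows. The endpoint URL is a placeholder, the `author_request` and `sparql_string` helpers are invented for this example, and it assumes contributors are stored as plain string literals, so the "some author" slot becomes a quoted string. The unused `ce:` prefix is dropped for brevity.

```python
from string import Template
from urllib.parse import urlencode
from urllib.request import Request

AUTHOR_QUERY = Template("""\
PREFIX dc: <http://purl.org/dc/terms/>
DESCRIBE ?file WHERE { ?file dc:contributor $author . }
""")


def sparql_string(text):
    """Quote text as a SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def author_request(endpoint, author):
    """Build (but do not send) a POST request carrying the query."""
    query = AUTHOR_QUERY.substitute(author=sparql_string(author))
    body = urlencode({"query": query}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/rdf+xml",
        },
    )
```

Sending it is then `urllib.request.urlopen(author_request(endpoint, name))`. Escaping the author name before substitution keeps a stray quote in a name from breaking the query.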

SPARQL isn’t (entirely) great.

• Scientists shouldn’t have to know this stuff.
• So we need to build a front end which your average senior academic might be able to use...
• (i.e. it’s got to look like a website.)

What queries do we want?

• What experimental data is an author responsible for?
• What chemical entities are in some data?
• Where is a given chemical entity talked about?
• So we can build a web app around these queries.
• django + rdflib + sparql + Talis Platform

Demo!

And here it is.

Thanks to...

• Talis (http://n2.talis.com/) for access to their platform
• and to the RSC and IUCr for their support of CrystalEye.
