1. Automatically indexing science using natural-language processing, RDF and SPARQL
Andrew Walkingshaw, Nick Day, Peter Corbett, Jim Downing, Joe Townsend, Peter Murray-Rust
February 16, 2008
• Gathering data
• Extracting (meta)data
• Using the data
• Thanks
6. Data sources
• Supplemental and experimental data
• Journals
• Self-archived papers (e.g. arXiv)
• Mainstream journalism
• Blogs
8. Supplemental data: CrystalEye
• http://wwmm.ch.cam.ac.uk/crystaleye/
• Repository for crystallographic data
10. Journals and arXiv
• “Traditional” journal articles
• Titles and abstracts...
12. Journalism and blogs
• Unstructured text with little semantics;
• ... hence Google Scholar, Web of Science, etc.
18. Semi-structured data: Golem
• We’ve got a lot of chemical data as CML
• http://en.wikipedia.org/wiki/Chemical_Markup_Language
• ... but we still need to get data out of that and into a more useful form
• hence Golem: http://www.lexical.org.uk/science/golem/
• GRDDLish strategy for extracting data from CML files: identify dialect-specific concepts with XPath expressions and XSLT stylesheets
• upshot: we can extract JSON objects from CML files.
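A minimal sketch of the XPath-to-JSON idea, not Golem's actual API: a table maps concept names to XPath expressions, and each match becomes a JSON field. The concept names, the dictRef value, and the sample CML document are hypothetical; only the Python standard library is used.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical concept table: concept name -> XPath that locates its value.
CONCEPTS = {
    "cell_length_a": ".//cml:property[@dictRef='iucr:_cell_length_a']/cml:scalar",
    "formula": ".//cml:formula",
}

# Hypothetical CML fragment, for illustration only.
SAMPLE_CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <formula concise="C 2 H 6 O 1"/>
    <property dictRef="iucr:_cell_length_a">
      <scalar dataType="xsd:double" units="units:angstrom">5.43</scalar>
    </property>
  </molecule>
</cml>"""


def extract(cml_text, concepts=CONCEPTS):
    """Return a JSON-serialisable dict of the concepts found in a CML string."""
    root = ET.fromstring(cml_text)
    found = {}
    for name, xpath in concepts.items():
        node = root.find(xpath, CML_NS)
        if node is None:
            continue  # concept not present in this document
        text = (node.text or "").strip()
        # Use the element text when there is some, otherwise its attributes.
        found[name] = text if text else dict(node.attrib)
    return found


if __name__ == "__main__":
    print(json.dumps(extract(SAMPLE_CML), indent=2))
```

A different CML dialect would only need a different concept table; the extraction loop stays the same.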
23. Free text: OSCAR3
• http://oscar3-chem.sourceforge.net/
• Natural-language parser for documents about chemistry
• Dark magic: don’t ask me how it works!
• ... but it can be run as a Jetty webservice, so as long as it does, I’m happy
• Author’s blog: http://wwmm.ch.cam.ac.uk/blogs/corbett/
26. Getting the data in
• Everything (more or less) talks RSS nowadays...
• RSS 0.91, RSS 1.0 (which one?), Atom, etc etc etc.
• Thankfully: feedparser (http://feedparser.org/)
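A rough sketch of how feedparser hides the differences between feed formats. The helper names (entry_text, read_feed) are made up for illustration. feedparser is a third-party package, so it is imported only inside the function that needs it.

```python
def entry_text(entry):
    """Join an entry's title and summary into one free-text string."""
    parts = [entry.get("title", ""), entry.get("summary", "")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def read_feed(url):
    """Return (link, text) pairs for every entry in an RSS or Atom feed."""
    import feedparser  # third-party: pip install feedparser

    parsed = feedparser.parse(url)  # same call for RSS 0.91, RSS 1.0, Atom, ...
    return [(e.get("link", ""), entry_text(e)) for e in parsed.entries]
```

The text returned for each entry is what would be handed on to a free-text parser such as OSCAR3.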
31. Serializing metadata
• RDF – using:
  • Dublin Core terms
  • A homebrew ontology based on the IUCr’s CIF data format
  • and another homebrew ontology for OSCAR annotations
  • (it’d be good to standardise these, but to be honest, not many people are doing this sort of thing)
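To make the Dublin Core part concrete, here is a hand-rolled sketch that writes title and contributor statements as N-Triples. It is an illustration only: a real pipeline would more likely use an RDF library such as rdflib. The subject URI is hypothetical, and it assumes contributors are stored as plain string literals.

```python
# Dublin Core terms namespace.
DC = "http://purl.org/dc/terms/"


def literal(value):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"%s"' % escaped


def describe(subject, title, contributors):
    """Return N-Triples giving a resource's title and its contributors."""
    lines = ["<%s> <%stitle> %s ." % (subject, DC, literal(title))]
    for name in contributors:
        lines.append("<%s> <%scontributor> %s ." % (subject, DC, literal(name)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical subject URI and metadata.
    print(describe("http://example.org/entry/1", "A crystal structure", ["A. Author"]))
```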
37. The process
• For each feed in a list of feeds:
  • If it’s supplying CML data, set Golem on each entry, get the observables out, and turn them into triples; run OSCAR3 over the title and/or abstract
  • If it’s not, extract the free text from each entry, send it to the OSCAR web service, and assign triples based on the chemical entities OSCAR finds
• Upload the RDF to your triple store
  • (I’m using the Talis platform, so that’s just curl)
• And...
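The loop above can be sketched roughly as follows. Golem and OSCAR are passed in as plain callables, so nothing here reflects their real interfaces. The entry layout and the predicate names (ex:observable/..., ex:mentions) are hypothetical, and the upload step is left out.

```python
def index_feeds(feeds, golem, oscar):
    """Turn feed entries into (subject, predicate, object) triples.

    feeds: iterable of (is_cml, entries) pairs; each entry is a dict with
           'id' and 'text', plus 'cml' when the feed supplies CML.
    golem: callable taking a CML string, returning a dict of observables.
    oscar: callable taking free text, returning a list of chemical entities.
    """
    triples = []
    for is_cml, entries in feeds:
        for entry in entries:
            subject = entry["id"]
            if is_cml:
                # Structured data: one triple per observable Golem extracts.
                for name, value in golem(entry["cml"]).items():
                    triples.append((subject, "ex:observable/" + name, value))
            # Free text (title/abstract or body) goes to OSCAR in both cases.
            for entity in oscar(entry["text"]):
                triples.append((subject, "ex:mentions", entity))
    return triples
```

Keeping the two tools behind callables means the same loop can be exercised with stub functions before pointing it at a real OSCAR web service.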
38. SPARQL is great.
Just post queries at a SPARQL endpoint:
authortemplate='''
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file WHERE { ?file dc:contributor some author . }
'''
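One way the template might be filled in and posted, sketched with the standard library only. It assumes the author is stored as a plain string literal, which the slide does not state. The endpoint URL is a placeholder and the function names are hypothetical. The request follows the SPARQL protocol's form-encoded POST with a query parameter.

```python
import urllib.parse
import urllib.request

# Same shape as the slide's template, with %s where the author goes.
AUTHOR_TEMPLATE = """
PREFIX dc: <http://purl.org/dc/terms/>
DESCRIBE ?file WHERE { ?file dc:contributor %s . }
"""


def sparql_literal(value):
    """Quote a string as a SPARQL literal (assumes a plain string literal)."""
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def author_query(name):
    """Fill the template with an escaped author name."""
    return AUTHOR_TEMPLATE % sparql_literal(name)


def build_request(endpoint, query):
    """Build a form-encoded POST request carrying the query."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/rdf+xml",
        },
    )


if __name__ == "__main__":
    # Placeholder endpoint; urllib.request.urlopen(req) would send the query.
    req = build_request("http://example.org/sparql", author_query("A. Author"))
    print(req.get_method(), req.full_url)
```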
41. SPARQL isn’t (entirely) great.
• Scientists shouldn’t have to know this stuff.
• So we need to build a front end which your average senior academic might be able to use...
• (i.e. it’s got to look like a website.)
46. What queries do we want?
• What experimental data is an author responsible for?
• What chemical entities are in some data?
• Where is a given chemical entity talked about?
• So we can build a web app around these queries.
• Django + rdflib + SPARQL + Talis Platform
47. Demo!
And here it is.
49. Thanks to...
• Talis (http://n2.talis.com/) for access to their platform
• and to the RSC and IUCr for their support of CrystalEye.