OCTOBER 13-16, 2015 • AUSTIN, TX
Searching the Stuff of Life: BioSolr
Matt Pearce & Alan Woodward
Senior Developers, Flax – www.flax.co.uk
•Building open source search applications since 2001
•Independent, honest advice and analysis
•Expert design & development, Apache Solr committers
•UK Authorized Partner of
•Test-driven relevancy and performance tuning
•Custom training & mentoring
•The European Bioinformatics Institute
•Part of the European Molecular Biology Laboratory
•Based on the Wellcome Genome Campus in Hinxton, Cambridge, UK.
•Maintains the world’s most comprehensive range of freely available and up-to-date
molecular databases, serving millions of researchers – indexing over 1 billion items.
•BioSolr project involves two teams from EMBL-EBI:
•Protein Data Bank in Europe (PDBe)
•Samples, Phenotypes and Ontologies (SPOt)
The genesis of BioSolr
•Grant Ingersoll visits the Wellcome Campus in July ’13
•Around 90 people attend
•Show of hands indicates 75% using Lucene/Solr
•Sameer Velankar of EMBL-EBI identifies grant funding
•Flax and EMBL-EBI apply successfully to the BBSRC
BioSolr
•One year BBSRC funded project from September 2014
•“to significantly advance the state of the art with regard to indexing and querying biomedical data
with freely available open source software”
•Outputs:
•Workshops
•Papers & presentations
•Software (Open Source, of course!)
•Documentation
•Inputs: from the PDBe & SPOt teams
BioSolr
•Tom Winch
•Working on site with Sameer Velankar & the PDBe team
•facet.contains, XJoin, federated search
•Matt Pearce
•Working on site with Tony Burdett & the SPOt team
•Indexing ontologies
BioSolr & PDBe - Introduction
•Protein Data Bank in Europe (PDBe)
•facet.contains (autosuggest)
•https://issues.apache.org/jira/browse/SOLR-1387
•In Solr 5.1
•XJoin (searching external sources)
•https://issues.apache.org/jira/browse/SOLR-7341
•Federated search
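As a rough illustration of the facet.contains autosuggest idea, the sketch below builds the request parameters for a facet query that returns only the facet values containing the text typed so far. The facet.contains and facet.contains.ignoreCase parameters are the ones added by SOLR-1387; the core URL and the field name molecule_name are made-up placeholders, not taken from the PDBe setup.

```python
from urllib.parse import urlencode

def autosuggest_url(base_url, field, typed_text, limit=10):
    """Build a Solr select URL that facets on `field`, keeping only
    facet values that contain `typed_text` (case-insensitively)."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                        # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),
        ("facet.contains", typed_text),
        ("facet.contains.ignoreCase", "true"),
        ("facet.limit", str(limit)),
    ]
    return base_url + "/select?" + urlencode(params)

# Hypothetical core URL and field name, for illustration only.
print(autosuggest_url("http://localhost:8983/solr/pdbe", "molecule_name", "kinase"))
```

Setting rows to 0 keeps the response small, since an autosuggest box only needs the facet values and their counts.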
XJoin concepts
•The problem – you have data in an external data store which is not suitable
for indexing in Solr
•The data may be from a live source, for example.
•You need to match data from your search results against data from one or
more of these external sources for display or analysis.
XJoin implementation
•XJoin is implemented as a Solr search component.
•There should be one configured instance per external source.
•XJoinResultsFactory interface defines the search behaviour:
•Communicates with the external source to carry out the query
•Configured in solrconfig.xml, along with presets
•Returns results as an XJoinResults object
•Results are keyed by a string ID, defined in the configuration
•User is required to provide the implementation of this interface for each
external source being used
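The idea behind the results factory can be sketched outside Solr. The Python below is only an analogy for the Java interface, not its actual signature: a hypothetical factory function queries an external source and returns results keyed by a string ID, and those results are then matched against the IDs of the search hits. All names and data are invented for illustration.

```python
def external_scores_factory(params):
    """Stand-in for an external source: returns results keyed by ID.
    A real implementation would call a web service or database here."""
    data = {"1abc": {"score": 0.9}, "2xyz": {"score": 0.4}, "9zzz": {"score": 0.1}}
    min_score = params.get("min_score", 0.0)
    return {k: v for k, v in data.items() if v["score"] >= min_score}

def join_external(solr_docs, factory, params, id_field="id"):
    """Pair each search hit with its external result, if one exists."""
    external = factory(params)
    return [(doc, external[doc[id_field]])
            for doc in solr_docs if doc[id_field] in external]

# Hypothetical search hits.
hits = [{"id": "1abc", "title": "Kinase"}, {"id": "2xyz", "title": "Lysozyme"}]
for doc, extra in join_external(hits, external_scores_factory, {"min_score": 0.5}):
    print(doc["id"], doc["title"], extra["score"])
```

Keying the external results by ID is what makes the match cheap: each hit needs a single dictionary lookup.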
XJoin configuration
•Configure the XJoinSearchComponent in solrconfig.xml with details of your
XJoinResultsFactory implementations
•Add the search component to the search request handler
•Needs to be in both first-components and last-components sections.
XJoin results handling
•Uses a query parser based on TermsQParserPlugin
•Allows the same methods as TermsQParserPlugin
•Enables the use of multiple external sources, joined using Boolean
operators.
•Results are returned in a separate block
•Similar to how highlights are returned.
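Because the XJoin query parser is described as being based on TermsQParserPlugin, the stock terms parser gives a feel for the kind of filter involved. The sketch below builds a standard {!terms f=...} filter query from a set of externally supplied IDs. It does not use the XJoin parser itself, and the field name and IDs are placeholders.

```python
def terms_filter(field, ids):
    """Build a filter query string for Solr's standard terms query parser,
    which takes a comma-separated list of terms by default."""
    unique_ids = sorted(set(ids))             # de-duplicate, stable order
    if any("," in i for i in unique_ids):
        raise ValueError("an ID contains the separator character")
    return "{!terms f=%s}%s" % (field, ",".join(unique_ids))

# Hypothetical IDs returned by an external source.
print(terms_filter("id", ["2xyz", "1abc", "2xyz"]))
```

The resulting string would be passed as an fq parameter, so that only documents whose ID matches an external result are returned.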
Federated search - introduction
•Problem: we need to search data sets split across multiple locations (even
different countries)
•Records may contain different fields.
•Similar to pre-SolrCloud distributed search
Federated search challenges: result counts
•The same document may appear in more than one shard, so results need to be
de-duplicated when they are aggregated.
•Also applies to facet counts.
One solution:
•Shards return all document IDs rather than the number found.
•Aggregator builds a set of unique documents from the returned results.
•Simple for small result sets but inefficient for large sets.
•Estimate the result count, using statistical methods:
•If two shards always return similar counts, overlap likely to be high;
•If they don’t, overlap will be small so dataset can be treated as independent and
number found added to total.
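A minimal sketch of the first option, assuming each shard returns its full list of document IDs: the aggregator takes the union of the IDs to get an exact count. The statistical estimator is not reproduced here. The second function only gives the bounds that hold whatever the overlap is: the largest single count (complete overlap) and the sum of the counts (no overlap). The shard contents are invented.

```python
def unique_count(shard_id_lists):
    """Exact result count: size of the union of the IDs from every shard."""
    seen = set()
    for ids in shard_id_lists:
        seen.update(ids)
    return len(seen)

def count_bounds(shard_counts):
    """Lower and upper bounds on the true count when only per-shard totals
    are known: complete overlap gives the max, no overlap gives the sum."""
    return max(shard_counts), sum(shard_counts)

# Hypothetical shards; d2 and d3 appear in two of them.
shards = [["d1", "d2", "d3"], ["d2", "d3", "d4"], ["d5"]]
print(unique_count(shards))
print(count_bounds([len(s) for s in shards]))
```

The set-based count needs every ID to be shipped to the aggregator, which is why it becomes inefficient for large result sets.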
Federated search challenges: merging document sets
•Problem: documents are not unique across data sets.
•Default behaviour is to use the first instance of a document, ignoring others.
•Datasets may contain different fields – we need all versions.
One solution:
•Use a custom MergeStrategy to build the ID list.
•Cannot use a Grouper – prevents result grouping in the query.
•Scoring is also a challenge!
Federated search challenges: merging document data
•Problem: document data may not be the same across data sets.
A solution:
•Merge documents together into a single, composite document.
•Potentially use an aggregation schema describing merge process.
•Merge strategy must be capable of merging disparate field types.
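One way to picture the composite document, as a sketch rather than the project's merge strategy: documents sharing an ID are folded together, and every field becomes a list of the distinct values seen across the data sets. The field names and records are invented, and a real aggregation schema could choose a different rule per field.

```python
def merge_documents(docs, id_field="id"):
    """Merge documents sharing an ID into one composite document.
    Every other field becomes a list of distinct values, in first-seen order."""
    merged = {}
    for doc in docs:
        target = merged.setdefault(doc[id_field], {id_field: doc[id_field]})
        for field, value in doc.items():
            if field == id_field:
                continue
            values = value if isinstance(value, list) else [value]
            bucket = target.setdefault(field, [])
            for v in values:
                if v not in bucket:
                    bucket.append(v)
    return list(merged.values())

# Hypothetical records for the same entry held in two data sets.
shard_a = [{"id": "1abc", "title": "Kinase", "organism": "Homo sapiens"}]
shard_b = [{"id": "1abc", "title": "Kinase", "resolution": 1.8},
           {"id": "2xyz", "title": "Lysozyme"}]
for doc in merge_documents(shard_a + shard_b):
    print(doc)
```

Turning every field into a list sidesteps the question of which version wins, at the cost of making single-valued fields multi-valued.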
BioSolr & SPOt – Indexing ontologies
Washington, N. & Lewis, S. (2008) Ontologies: Scientific
Data Sharing Made Easy. Nature Education 1(3):5
Indexing ontologies – the problem
•You have a collection of documents annotated with ontology references.
•You want to search both the documents and the associated ontology data.
•This may include associated nodes – “has location”, “is part of”, etc.
•Faceting the ontology references would be nice!
•(especially if the facets can be presented in a tree)
Approach 1
Keep the data separate
[Diagram: a Documents Indexer writes to a documents index; a separate Ontology Indexer writes to an ontology index.]
Approach 1 - steps
•Index the documents with the node annotations but no further
detail.
•Index the ontology in its own core.
•Search the documents, then cross-match against the ontology.
•BUT requires multiple calls, doesn’t allow searching both
cores at the same time.
Approach 2
Add some ontology data to your documents
[Diagram: the Documents Indexer draws on the Ontology and writes to a single documents index.]
Approach 2 – step 1
•Index the documents with the node annotations.
•While indexing, look up the node references, with their labels
and synonyms.
•Easier to include the ontology references in your search.
•Can boost fields as required.
•Faster to search
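The enrichment step can be sketched as a lookup at index time. The ontology here is a small in-memory dictionary and the field names are made up; a real indexer would load the ontology from a file or URL.

```python
# Hypothetical in-memory ontology, keyed by node URI.
ONTOLOGY = {
    "http://example.org/onto/0001": {"label": "heart", "synonyms": ["cardiac organ"]},
    "http://example.org/onto/0002": {"label": "lung", "synonyms": []},
}

def enrich(doc, annotation_field="annotation_uri"):
    """Return a copy of doc with label and synonym fields added for the
    ontology node it references; unknown references are left alone."""
    enriched = dict(doc)
    node = ONTOLOGY.get(doc.get(annotation_field))
    if node is not None:
        enriched[annotation_field + "_label"] = node["label"]
        enriched[annotation_field + "_synonyms"] = list(node["synonyms"])
    return enriched

print(enrich({"id": "s1", "annotation_uri": "http://example.org/onto/0001"}))
```

With the label and synonyms stored on the document, a single query can match either the document text or the ontology terms, and the added fields can be boosted independently.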
Approach 2 – step 2
•Expand the ontology data being stored.
•Include single (or multi) level parent and child nodes, with their
labels.
•Use dynamic fields to store additional relationships.
•Dynamic fields allow searches across specific relationship
types (“is part of”, “has location”, etc.)
•BUT requires additional Solr look-ups to be fully dynamic
•(using /admin/luke to look through the current schema for dynamic fields).
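A sketch of how relationship types might map onto dynamic fields. The naming convention (a prefix, the relationship name, and a _rel_labels suffix) is an assumption for illustration and would need a matching dynamic field pattern in the schema. The second function mimics what a client does after asking Solr which fields exist: it picks out the relationship fields by their suffix.

```python
import re

def relation_fields(relations, prefix="annotation", suffix="_rel_labels"):
    """Map relationship names to dynamic-field style names, for example
    'is part of' becomes 'annotation_is_part_of_rel_labels'."""
    fields = {}
    for relation, labels in relations.items():
        slug = re.sub(r"[^a-z0-9]+", "_", relation.lower()).strip("_")
        fields[prefix + "_" + slug + suffix] = list(labels)
    return fields

def relation_field_names(all_field_names, suffix="_rel_labels"):
    """Pick the relationship fields out of a list of field names."""
    return sorted(n for n in all_field_names if n.endswith(suffix))

# Hypothetical relationships for one annotated node.
doc_fields = relation_fields({"is part of": ["cardiovascular system"],
                              "has location": ["thorax"]})
print(doc_fields)
print(relation_field_names(list(doc_fields) + ["id", "title"]))
```

Because the set of relationship types depends on the ontology, the search screen cannot hard-code these names, which is the reason for the extra look-up.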
Approach 2 – search screen
•General search box
•Options to include child and parent labels (one step
removed)
•Dynamically-generated additional relationship search
options
Approach 2 – search results
Approach 2 – final result
•We have developed an UpdateProcessor to do this as part of the update
chain.
•The user defines the ontology location (file, URL) and the field to
reference.
•Aims for convention over configuration for remaining properties.
•Field names and ontology annotation details all customisable.
•A similar plugin for Elasticsearch has also been developed.
Aside: facet trees
•Additional search component to return facets from ontology references in
tree form.
•Extends FacetComponent
•Takes initial facets from results, searches hierarchical references to build
the tree during facet generation.
•Avoids multiple calls to Solr from client-side to build the tree.
•A second collection can be used to search the ontologies...
•… BUT it must be part of the same Solr instance
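The tree-building idea, sketched in plain Python rather than as a Solr component: start from the flat facet counts in the result set, then walk the parent/child references to nest them. The node names and the shape of the output are assumptions, and the hierarchy is assumed to contain no cycles.

```python
def build_facet_tree(counts, children, roots):
    """Build nested facet nodes from flat facet counts.
    counts:   {node_id: facet count from the result set}
    children: {node_id: [child ids]} taken from the ontology
    roots:    ids of the top-level nodes
    'total' is the node's own count plus the totals of its children; it is
    a sum of facet counts, not a count of distinct documents."""
    def build(node_id):
        kids = [build(c) for c in children.get(node_id, [])]
        own = counts.get(node_id, 0)
        return {"id": node_id, "count": own,
                "total": own + sum(k["total"] for k in kids),
                "children": kids}
    return [build(r) for r in roots]

# Hypothetical facet counts and hierarchy.
counts = {"heart": 3, "lung": 2}
children = {"anatomy": ["organ"], "organ": ["heart", "lung"]}
tree = build_facet_tree(counts, children, ["anatomy"])
print(tree[0]["total"])
```

A node listed under two parents in the children map is built once under each of them, so it appears more than once in the output.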
Aside: facet trees - challenges
•Nodes with multiple parents may appear more
than once…
•… not yet found a solution for this!
•Default behaviour is to return entire tree.
•Not always useful – may need to drill through
several layers to get to useful entries.
•Solution: prune the tree!
Aside: facet trees - pruning
•Multiple pruning options available:
•Simple pruning: remove layers with no useful
information.
•Datapoint pruning: return the highest counts at
top-level, remainder in “Other” section.
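Both pruning styles can be sketched on a small tree of nested nodes. The exact rules are an interpretation, not the component's implementation: simple pruning is taken to mean collapsing any node that has no hits of its own and only one child, and datapoint pruning is taken to mean lifting the top N nodes by count to the top level and grouping the rest under "Other".

```python
def simple_prune(node):
    """Collapse chains: a node with no hits of its own and exactly one
    child is replaced by that child."""
    kids = [simple_prune(c) for c in node["children"]]
    if node["count"] == 0 and len(kids) == 1:
        return kids[0]
    return {**node, "children": kids}

def flatten(node):
    """Yield every node in the tree that has hits of its own."""
    if node["count"] > 0:
        yield node
    for child in node["children"]:
        yield from flatten(child)

def datapoint_prune(roots, top_n):
    """Return the top_n nodes by count, with the remainder under 'Other'."""
    nodes = sorted((n for r in roots for n in flatten(r)),
                   key=lambda n: n["count"], reverse=True)
    top = [{"id": n["id"], "count": n["count"]} for n in nodes[:top_n]]
    rest = [{"id": n["id"], "count": n["count"]} for n in nodes[top_n:]]
    return top + ([{"id": "Other", "children": rest}] if rest else [])

# Hypothetical facet tree.
tree = {"id": "anatomy", "count": 0, "children": [
    {"id": "organ", "count": 0, "children": [
        {"id": "heart", "count": 3, "children": []},
        {"id": "lung", "count": 2, "children": []},
        {"id": "liver", "count": 1, "children": []}]}]}

print(simple_prune(tree)["id"])
print([n["id"] for n in datapoint_prune([tree], 2)])
```

Simple pruning keeps the hierarchy but removes empty levels; datapoint pruning gives up the hierarchy in exchange for surfacing the most populated entries first.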
Approach 3
•Search the ontology, and cross-match with the documents.
•Allow SPARQL queries over the ontology index.
•Enables complex searches over ontology relationships.
•(SPARQL is a semantic query language)
Adding Apache Jena
•We use Apache Jena to provide TDB-querying with SPARQL.
•Jena uses Solr to search specified text fields – set via configuration files.
•Uses its own Triple Store for other fields and relationship queries.
•Return the reference URI in the returned fields to cross-match with
documents.
•Use a filter query to choose the matched documents.
•Not that different from XJoin!
Generic ontology indexer
•Stand-alone application to index different ontologies
•Allows separate configuration for each ontology
•Plugins may be used:
•At item level, to add external data to some/all nodes;
•At ontology level, to push the ontology into MongoDB, etc.
•Cross-pollination with the EBI's Ontology Lookup Service site (currently in
beta)
Conclusions – what have we achieved?
•Searching across multiple external datasets (XJoin)
•See SOLR-7341 (and please up-vote!)
•Searching across multiple Solr nodes across different campuses.
•Indexing ontologies – both with and without document data.
•Enriching documents with ontology data.
•(…and facet trees)
Get Involved!
•Check out the github page: https://github.com/flaxsearch/BioSolr
•Vote for XJoin: https://issues.apache.org/jira/browse/SOLR-7341
Upcoming events
•7th December: SWAT4LS tutorial session in Cambridge, UK
•Will cover ontology indexing and search
•http://www.swat4ls.org/workshops/cambridge2015/
•3rd/4th February: Workshop at the EBI campus, Hinxton, UK
•http://www.ebi.ac.uk/pdbe/about/events/open-source-search-bioinformatics
Thank you for listening!
Matt Pearce
matt@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @flaxsearch

The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456KiaraTiradoMicha
 

Último (20)

A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 

BioSolr - Searching the stuff of life - Lucene/Solr Revolution 2015

  • 1. OCTOBER 13-16, 2015 • AUSTIN, TX
  • 2. Searching the Stuff of Life: BioSolr Matt Pearce & Alan Woodward Senior Developers, Flax – www.flax.co.uk
  • 3. •Building open source search applications since 2001 •Independent, honest advice and analysis •Expert design & development, Apache Solr committers •UK Authorized Partner of •Test-driven relevancy and performance tuning •Custom training & mentoring
  • 4.
  • 5. •The European Bioinformatics Institute •Part of the European Molecular Biology Laboratory •Based on the Wellcome Genome Campus in Hinxton, Cambridge, UK. •Maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, serving millions of researchers – indexing over 1 billion items. •BioSolr project involves two teams from EMBL-EBI: •Protein Data Bank in Europe (PDBe) •Samples, Phenotypes and Ontologies (SPOt)
  • 6. The genesis of BioSolr •Grant Ingersoll visits the Wellcome Campus in July ’13 •Around 90 people attend •Show of hands indicates 75% using Lucene/Solr •Sameer Velankar of EMBL-EBI identifies grant funding •Flax and EMBL-EBI apply successfully to the BBSRC
  • 7. BioSolr •One year BBSRC funded project from September 2014 •“to significantly advance the state of the art with regard to indexing and querying biomedical data with freely available open source software” •Outputs: •Workshops •Papers & presentations •Software (Open Source, of course!) •Documentation •Inputs: from the PDBe & SPOt teams
  • 8. BioSolr •Tom Winch •Working on site with Sameer Velankar & the PDBe team •Facet.contains, Xjoin, Federated Search •Matt Pearce •Working on site with Tony Burdett & the SPOt team •Indexing ontologies
  • 9. BioSolr & PDBe - Introduction •Protein Data Bank (PDBe) •facet.contains (autosuggest) •https://issues.apache.org/jira/browse/SOLR-1387 •In Solr 5.1 •Xjoin (searching external sources) •https://issues.apache.org/jira/browse/SOLR-7341 •Federated search
  • 10. Xjoin concepts •The problem – you have data in an external data store which is not suitable for indexing in Solr •The data may be from a live source, for example. •You need to match data from your search results against data from one or more of these external sources for display or analysis.
  • 11. Xjoin implementation •Xjoin is implemented as a Solr search component. •There should be one configured instance per external source. •XJoinResultsFactory interface defines the search behaviour: •Communicates with the external source to carry out the query •Configured in solrconfig.xml, along with presets •Returns results as an XjoinResults object •Results are keyed by a string ID, defined in the configuration •User is required to provide the implementation of this interface for each external source being used
  • 12. Xjoin configuration •Configure the XJoinSearchComponent in solrconfig.xml with details of your XJoinResultsFactory implementations •Add the search component to the search request handler •Needs to be in both first-components and last-components sections.
  • 13. Xjoin results handling •Uses a query parser based on TermsQParserPlugin •Allows the same methods as TermsQParserPlugin •Enables the use of multiple external sources, joined using Boolean operators. •Results are returned in a separate block •Similar to how highlights are returned.
  • 14. Federated search - introduction •Problem: we need to search data sets split across multiple locations (even different countries) •Records may contain different fields. •Similar to pre-SolrCloud distributed search
  • 15. Federated search challenges: result counts •The same document may appear in more than one shard, so they need to be aggregated. •Also applies to facet counts. One solution: •Shards return all document IDs rather than the number found. •Aggregator builds a set of unique documents from the returned results. •Simple for small result sets but inefficient for large sets. •Estimate the result count, using statistical methods: •If two shards always return similar counts, overlap likely to be high; •If they don’t, overlap will be small so dataset can be treated as independent and number found added to total.
  • 16. Federated search challenges: merging document sets •Problem: documents are not unique across data sets. •Default behaviour is to use the first instance of a document, ignoring others. •Datasets may contain different fields – we need all versions. One solution: •Use a custom MergeStrategy to build the ID list. •Cannot use a Grouper – prevents result grouping in the query. •Scoring is also a challenge!
  • 17. Federated search challenges: merging document data •Problem: document data may not be the same across data sets. A solution: •Merge documents together into a single, composite document. •Potentially use an aggregation schema describing merge process. •Merge strategy must be capable of merging disparate field types.
  • 18. BioSolr & SPOt – Indexing ontologies Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5
  • 19. Indexing ontologies – the problem •You have a collection of documents annotated with ontology references. •You want to search both the documents and the associated ontology data. •This may include associated nodes – “has location”, “is part of”, etc. •Faceting the ontology references would be nice! •(especially if the facets can be presented in a tree)
  • 20. Approach 1: Keep the data separate (diagram labels: documents, Documents Indexer, ontology, Ontology Indexer)
  • 21. Approach 1 - steps •Index the documents with the node annotations but no further detail. •Index the ontology in its own core. •Search the documents, then cross-match against the ontology. •BUT requires multiple calls, doesn’t allow searching both cores at the same time.
  • 22. Approach 2: Add some ontology data to your documents (diagram labels: Documents, Indexer, Ontology, documents)
  • 23. Approach 2 – step 1 •Index the documents with the node annotations. •While indexing, look up the node references, with their labels and synonyms. •Easier to include the ontology references in your search. •Can boost fields as required. •Faster to search
  • 24. Approach 2 – step 2 •Expand the ontology data being stored. •Include single (or multi) level parent and child nodes, with their labels. •Use dynamic fields to store additional relationships. •Dynamic fields allow searches across specific relationship types (“is part of”, “has location”, etc.) •BUT requires additional Solr look-ups to be fully dynamic •(using /admin/luke to look through the current schema for dynamic fields).
  • 25. Approach 2 – search screen •General search box •Options to include child and parent labels (one step removed) •Dynamically-generated additional relationship search options
  • 26. Approach 2 – search results
  • 27. Approach 2 – final result •We have developed an UpdateProcessor to do this as part of the update chain. •The user defines the ontology location (file, URL) and the field to reference. •Aims for convention over configuration for remaining properties. •Field names and ontology annotation details all customisable. •A similar plugin for ElasticSearch has also been developed.
  • 28. Aside: facet trees •Additional search component to return facets from ontology references in tree form. •Extends FacetComponent •Takes initial facets from results, searches hierarchical references to build the tree during facet generation. •Avoids multiple calls to Solr from client-side to build the tree. •A second collection can be used to search the ontologies... •… BUT it must be part of the same Solr instance
  • 29. Aside: facet trees - challenges •Nodes with multiple parents may appear more than once… •… not yet found a solution for this! •Default behaviour is to return entire tree. •Not always useful – may need to drill through several layers to get to useful entries. •Solution: prune the tree!
  • 30. Aside: facet trees - pruning •Multiple pruning options available: •Simple pruning: remove layers with no useful information. •Datapoint pruning: return the highest counts at top-level, remainder in “Other” section.
  • 31. Approach 3 •Search the ontology, and cross-match with the documents. •Allow SPARQL queries over the ontology index. •Enables complex searches over ontology relationships. •(SPARQL is a semantic query language)
  • 33. Adding Apache Jena •We use Apache Jena to provide TDB-querying with SPARQL. •Jena uses Solr to search specified text fields – set via configuration files. •Uses its own Triple Store for other fields and relationship queries. •Return the reference URI in the returned fields to cross-match with documents. •Use a filter query to choose the matched documents. •Not that different from Xjoin!
  • 34. Generic ontology indexer •Stand-alone application to index different ontologies •Allows separate configuration for each ontology •Plugins may be used: •At item level, to add external data to some/all nodes; •At ontology level, to push the ontology into MongoDB, etc. •Cross-pollination with the EBI's Ontology Lookup Service site (currently in beta)
  • 35. Conclusions – what have we achieved? •Searching across multiple external datasets (xjoin) •See SOLR-7341 (and please up-vote!) •Searching across multiple Solr nodes across different campuses. •Indexing ontologies – both with and without document data. •Enriching documents with ontology data. •(…and facet trees)
  • 36. Get Involved! •Check out the github page: https://github.com/flaxsearch/BioSolr •Vote for Xjoin: https://issues.apache.org/jira/browse/SOLR-7341
  • 37. Upcoming events •7th December: SWAT4LS tutorial session in Cambridge, UK •Will cover ontology indexing and search •http://www.swat4ls.org/workshops/cambridge2015/ •3rd/4th February: Workshop at the EBI campus, Hinxton, UK •http://www.ebi.ac.uk/pdbe/about/events/open-source-search-bioinformatics
  • 38. Thank you for listening! Matt Pearce matt@flax.co.uk www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @flaxsearch
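
The federated search slides (15 and 17) describe two aggregation steps: counting unique documents when shards return document IDs instead of a number found, and merging versions of the same document into one composite document. The following is a minimal sketch of both ideas in Python. It is not the BioSolr implementation; the function names, shard names, IDs and field values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List


def unique_result_count(shard_ids: Dict[str, Iterable[str]]) -> int:
    """Count distinct document IDs across all shard responses.

    Each shard returns the IDs it matched (not just a count), so a
    document present in several shards is only counted once.
    """
    seen = set()
    for ids in shard_ids.values():
        seen.update(ids)
    return len(seen)


def merge_documents(versions: List[dict]) -> Dict[str, list]:
    """Merge versions of one document into a composite document.

    Every field becomes a list holding each distinct value seen,
    in first-seen order, so fields present in only one data set
    are kept alongside fields shared by all of them.
    """
    merged: Dict[str, list] = {}
    for doc in versions:
        for field, value in doc.items():
            values = value if isinstance(value, list) else [value]
            bucket = merged.setdefault(field, [])
            for v in values:
                if v not in bucket:
                    bucket.append(v)
    return merged


# Hypothetical shard responses: "2xyz" appears in both shards.
responses = {
    "shard_a": ["1abc", "2xyz", "3def"],
    "shard_b": ["2xyz", "4ghi"],
}

# Hypothetical versions of the same document with different fields.
versions = [
    {"id": "2xyz", "title": "Lysozyme"},
    {"id": "2xyz", "organism": ["Gallus gallus"]},
]

if __name__ == "__main__":
    print(unique_result_count(responses))
    print(merge_documents(versions))
```

As the slide notes, building the full ID set is simple for small result sets but becomes inefficient for large ones, which is why the statistical estimate is offered as an alternative. The merge shown here treats every field as multi-valued; a real merge strategy would need rules for disparate field types.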

Editor's notes

  1. Test-driven relevancy tuning is done via Quepid
  2. recruitment government e-commerce news & media bioinformatics consulting law
  3. PDBe is the European resource for the collection, organisation and dissemination of data on biological macromolecular structures… collate, maintain and provide access to the global repositories of macromolecular structure data. The SPOT team has 3 subteams: Mouse Informatics, Functional Genomics Production, and the Gene Ontology Editorial Office. The FGPT subteam develops ontologies such as the Experimental Factor Ontology (EFO) and the Cell Line Ontology, delivers ontology tooling, and provides curation tools for the Gene Expression Atlas and BioSamples databases.
  4. Grant met heads of SPOT and PDBE teams, did Solr presentation BBSRC = Biotechnology and Biological Sciences Research Council
  5. BBSRC = Biotechnology and Biological Sciences Research Council
  6. PDBe is the European resource for the collection, organisation and dissemination of data on biological macromolecular structures… collate, maintain and provide access to the global repositories of macromolecular structure data actively involved in an effort to integrate data from major biomedical resources at EMBL-EBI and across the world. This "Structure Integration with Function, Taxonomy and Sequence" (SIFTS) initiative integrates data from a number of resources and is used by major global sequence, structure and protein-family resources.
  7. Sources: FASTA (fast- ay) – search tool for protein databases PHMMER – search protein sequences against a protein sequence database
  8. Distributed search, but very distributed: EMBL has member states across Europe, plus associate member countries (Australia, Argentina).
  9. Ontologies generally hierarchical – root node, child nodes, etc. Additional relationships between nodes at any level though, so not strictly a tree structure.
  10. SPARQL is an RDF query language for searching across triple stores. Triple stores are databases where each item represents a subject, a predicate and an object – eg. “the heart is part of the human body”
  11. Annotations vary between ontologies as to which data is a synonym, definition, and so on. Need Solr config to tell it which annotation applies for the current ontology.
  12. SWAT4LS – Semantic web applications and tools for life sciences EBI workshop – hands on, interactive, understanding different search technologies for biomedical data
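
Note 10 describes a triple store as a database in which each item is a subject, a predicate and an object. A minimal Python sketch of that idea follows, assuming an in-memory list and a hypothetical `match` helper in which `None` acts as a wildcard, loosely like a variable in a SPARQL pattern. Only the first triple comes from the note; the others are invented for illustration, and none of this reflects how Apache Jena stores or queries data.

```python
from typing import List, Optional, Tuple

# A triple is (subject, predicate, object).
Triple = Tuple[str, str, str]

TRIPLES: List[Triple] = [
    ("heart", "is part of", "human body"),      # example from note 10
    ("left ventricle", "is part of", "heart"),  # invented for illustration
    ("heart", "has location", "thorax"),        # invented for illustration
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples matching the pattern; None matches anything."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


if __name__ == "__main__":
    # "What is part of the heart?"
    print(match(TRIPLES, predicate="is part of", obj="heart"))
    # "Everything stated about the heart as subject."
    print(match(TRIPLES, subject="heart"))
```

Relationship queries such as “is part of” or “has location” then reduce to fixing the predicate and leaving the subject or object open.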