Federating Research Profiling Data

Clinical and Translational Science Institute
Accelerating Research to Improve Health
This project was supported by NIH/NCRR UCSF-CTSI Grant Number UL1 TR000004. Its contents are solely
the responsibility of the authors and do not necessarily represent the official views of the NIH.
Data Harvesting and Indexing
• LOD are acquired from SPARQL-compatible sites through a multi-threaded
harvester program.
• Per-site harvesting times vary significantly, from a current low of 8
minutes to a current high of 4+ hours. Factors in this variability
include both scale (e.g., number of persons represented) and
endpoint implementation (e.g., Loki data are served by a Teiid data
federation layer coupled to a D2RQ bridge).
• Additional LOD are harvested through platform-specific multi-threaded
crawlers (one thread per site). Current versions of VIVO
and Profiles support direct access to RDF characterizations,
allowing data collection from sites not yet making SPARQL
endpoints available, while avoiding the need to screen-scrape
HTML. In the one case of HTML-only data (Stanford’s CAP) we
use a DOM parsing library to extract data.
• Harvested data are cached locally in a relational database to support
indexing experiments without the need to harvest data repeatedly.
Harvested data are enhanced where possible with supplemental
metadata from MEDLINE, including abstracts, keywords, MeSH
terms, chemicals and genes.
• The resulting aggregated text is then processed with a UMLS
concept extractor and the resulting concept codes are added to the
record. Shared publications then support both true multi-site
federated search and concept-driven visualization.
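The harvest-then-cache pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harvester actually used by CTSAsearch: the function name `harvest_sites`, the table layout, and the injected `fetch` callable (which would issue the SPARQL or RDF request for one site) are all hypothetical, and SQLite stands in for whichever relational database is used.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def harvest_sites(sites, fetch, db_path=":memory:"):
    """Harvest each site on its own thread and cache the raw payloads.

    sites: mapping of site name -> endpoint URL (hypothetical values).
    fetch: callable(url) -> str, for example a function that sends a
           SPARQL query to the endpoint and returns the serialized result.
    Returns the open SQLite connection holding the cached rows.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS harvest ("
        "site TEXT PRIMARY KEY, payload TEXT, error TEXT)"
    )

    def fetch_one(item):
        name, url = item
        try:
            return name, fetch(url), None
        except Exception as exc:
            # A slow or broken site is recorded, not allowed to stop the rest.
            return name, None, str(exc)

    # One worker per site, since per-site harvest times vary widely.
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        results = list(pool.map(fetch_one, sites.items()))

    # All database writes happen on the calling thread.
    conn.executemany("INSERT OR REPLACE INTO harvest VALUES (?, ?, ?)", results)
    conn.commit()
    return conn
```

Because the payloads are stored locally, later indexing experiments can read from the `harvest` table instead of contacting the sites again.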
Federating Research Profiling Data
David Eichmann, PhD, University of Iowa, Iowa City
Eric Meeks, Clinical and Translational Science Institute, UCSF
CTSAsearch
ORNG: Open Research Networking Gadgets
Introduction
Research profiling systems have achieved notable
adoption by research institutions.
• Multi-site search of research profiling systems has
substantially evolved since the first deployment of
systems such as DIRECT2Experts.
• CTSAsearch is a federated search engine using VIVO-
compliant Linked Open Data (LOD) published by members of
the NIH-funded Clinical and Translational Science (CTSA)
consortium and other interested parties.
• Fifty-seven institutions are currently included, spanning six
distinct platforms and three continents (North America, Europe
and Australia).
• In aggregate, CTSAsearch has data on 150-300 thousand
unique researchers and their 10 million publications. The public
interface is available at http://research.icts.uiowa.edu/polyglot.
Cross-linking Metadata
• Almost all research profiling sites currently provide only
internal links. In the case of non-institutional co-authors,
either no information is provided or stub profiles are
generated containing only an author name generated from
the citation.
• We cross-correlate publications to assert that two person URIs refer
to the same individual if they share one or more publications with the
same PMID or DOI, have the same family name, and either have the same
first name or one first name is a single initial that matches the first
letter of the other.
• We currently cross-link co-author data from ProfilesRNS to their respective
home institution profiles through the CrossLinks project.
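The matching rule for person URIs can be written down as a small predicate. The sketch below is an illustration only: the record layout (dictionaries with `family`, `first`, `pmids`, and `dois` keys) and the normalization steps (case folding, stripping a trailing period from initials) are assumptions, not details taken from the CTSAsearch implementation.

```python
def first_names_match(a, b):
    """True if the first names are equal, or one is a single initial
    matching the first letter of the other."""
    a = a.strip().rstrip(".").lower()
    b = b.strip().rstrip(".").lower()
    if not a or not b:
        return False
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def same_person(p1, p2):
    """Apply the rule: shared PMID or DOI, same family name,
    and compatible first names."""
    shares_publication = bool(p1["pmids"] & p2["pmids"]) or bool(p1["dois"] & p2["dois"])
    same_family = p1["family"].strip().lower() == p2["family"].strip().lower()
    return shares_publication and same_family and first_names_match(p1["first"], p2["first"])


# Hypothetical records from two different profiling sites.
site_a = {"family": "Smith", "first": "Jane", "pmids": {"111"}, "dois": set()}
site_b = {"family": "smith", "first": "J.", "pmids": {"111", "222"}, "dois": set()}
print(same_person(site_a, site_b))
```

Requiring a shared publication before comparing names keeps the rule conservative: two researchers named "J. Smith" at different sites are not merged unless they appear on the same paper.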
Conclusion
• CTSAsearch and CrossLinks demonstrate that substantial value can
be added to the existing research networking landscape through
federation of these data.
• This better reflects the larger collaborative networks that
our researchers comprise, and provides a better user
experience through seamless inter-site navigation.
Figure: Profiling system counts by platform
Figure: Co-authorships between 313 researchers with publications involving ontology
Figure: External Collaborators links out to co-author pages in other profiling systems
1. Linked Open Data from many research profiling sources are
harvested and processed by the University of Iowa.
2. A SPARQL endpoint at Iowa is used by UCSF to capture a
subset of data representing cross-institutional co-authorships.
3. Research profiling installations that support ORNG query
UCSF at run time to retrieve co-authorship data as JSON-LD.
Data flow and key
• Our future work in this area will
include enhanced ability to
interconnect these systems and to
visualize the resulting aggregated
information space.
• CrossLinks interrogates the CTSAsearch SPARQL endpoint
(http://marengo.info-science.uiowa.edu:2020), then provides real-time JSON-LD,
supporting cross-site linking (with thumbnail images), and effectively creating
a single inter-institutional information space.
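To make the JSON-LD step concrete, the sketch below turns a list of co-author rows (as might come back from a SPARQL query) into a JSON-LD document. It is a hedged illustration: the function name, the row keys (`uri`, `name`, `thumbnail`), and the choice of FOAF terms are assumptions made for this example, and the vocabulary and document shape that CrossLinks actually serves are not specified here.

```python
import json


def coauthors_to_jsonld(person_uri, rows):
    """Build a JSON-LD document listing external co-authors of person_uri.

    rows: iterable of dicts with hypothetical keys 'uri', 'name' and an
          optional 'thumbnail' image URL.
    """
    coauthors = []
    for row in rows:
        node = {
            "@id": row["uri"],
            "@type": "foaf:Person",
            "foaf:name": row["name"],
        }
        if row.get("thumbnail"):
            # Only emit an image link when a thumbnail is available.
            node["foaf:img"] = {"@id": row["thumbnail"]}
        coauthors.append(node)
    return {
        "@context": {"foaf": "http://xmlns.com/foaf/0.1/"},
        "@id": person_uri,
        "foaf:knows": coauthors,
    }


# Hypothetical URIs for illustration only.
doc = coauthors_to_jsonld(
    "http://profiles.a.example/person/1",
    [
        {"uri": "http://vivo.b.example/person/2", "name": "J. Smith",
         "thumbnail": "http://vivo.b.example/person/2/thumb.jpg"},
        {"uri": "http://profiles.c.example/person/3", "name": "A. Jones"},
    ],
)
print(json.dumps(doc, indent=2))
```

Each co-author node carries the URI of the profile at the home institution, which is what lets a hosting site render an outbound link directly to that profile.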
