2. Outline
• Introduction of the CORE system
• Three phases:
• Metadata and content harvesting
• Semantic Enrichment
• Providing services
• Supporting research in mining databases of scientific
publications (DiggiCORE)
2/41
3. CORE objectives
• To provide a platform for the delivery of Open Access content
aggregated from multiple sources and to deliver a wide range of
services on top of this aggregation.
• A nation-wide aggregation system that will improve the discovery
of publications stored in British Open Access Repositories (OARs).
3/41
12. Why do we need aggregations?
“Each individual repository is of limited value for research: the real
power of Open Access lies in the possibility of connecting and tying
together repositories, which is why we need interoperability. In
order to create a seamless layer of content through connected
repositories from around the world, Open Access relies on
interoperability, the ability for systems to communicate with each
other and pass information back and forth in a usable format.
Interoperability allows us to exploit today's computational power so
that we can aggregate, data mine, create new tools and
services, and generate new knowledge from repository content.”
[COAR manifesto]
12/41
15. Aggregations need access to content, not just metadata!
• Certain metadata types can be created only at the level of the
aggregation
• Certain metadata can change over time
• Ensuring content:
• accessibility
• availability
• validity
• quality
• …
15/41
16. Semantic similarity and duplicates detection
• Cosine similarity calculated on TF-IDF vectors extracted from full-texts
[Knoth et al, COLING 2010; Knoth et al, IMMM 2011]
16/41
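The TF-IDF/cosine approach mentioned above can be sketched in a few lines of pure Python. This is a minimal illustration of the general technique, not CORE's actual implementation; the function names and the exact TF and IDF weighting scheme are assumptions for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (term -> weight) for a list of tokenised documents.

    Uses raw term frequency normalised by document length and
    idf = log(n / df) -- one of several common weighting schemes.
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {term: (count / len(doc)) * math.log(n / df[term])
               for term, count in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two full-texts are then flagged as duplicates (or near-duplicates) when their cosine similarity exceeds a chosen threshold.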
17. Semantic similarity and duplicates detection
• Heuristics to reduce the number of pairwise comparisons (and to cope with the query-length problem)
• Cross-language linking tests [Knoth et al, NTCIR-9 CrossLink 2011;
Knoth et al, IJCNLP CLIA 2011]
17/41
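One common heuristic for cutting down the number of comparisons, shown here as an assumed illustration rather than CORE's actual method, is to build an inverted index over each document's rarest terms and only compare documents that collide on at least one of them:

```python
from collections import Counter, defaultdict
from itertools import combinations

def candidate_pairs(docs, top_k=5):
    """Return document-index pairs that share at least one rare term.

    Instead of scoring all n*(n-1)/2 pairs, index each document under its
    top_k least-frequent (most discriminative) terms; only documents that
    collide on some indexed term become candidates for full comparison.
    """
    df = Counter(t for doc in docs for t in set(doc))
    index = defaultdict(set)
    for i, doc in enumerate(docs):
        # rarest terms first: these are the most discriminative
        rare = sorted(set(doc), key=lambda t: df[t])[:top_k]
        for term in rare:
            index[term].add(i)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Only the returned candidate pairs then need the (more expensive) full cosine-similarity computation.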
18. Information extraction, citation parsing and target recognition
• ParsCit tool (based on CRF) for extraction of reference sections
• Levenshtein distance used for target detection
18/41
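The slide's idea of matching a parsed reference string against candidate target records via edit distance can be sketched as below. The `best_target` helper and its threshold are illustrative assumptions, not CORE's actual matching logic; only the Levenshtein distance itself is standard.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute),
    computed with the standard two-row dynamic-programming scheme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_target(reference, candidates, max_dist=10):
    """Pick the candidate title closest to the parsed reference string,
    or None if even the best match is too far away."""
    scored = [(levenshtein(reference.lower(), c.lower()), c) for c in candidates]
    dist, title = min(scored)
    return title if dist <= max_dist else None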
21. Who should be supported by aggregations?
The following user groups (divided according to the level of
abstraction of the information they need):
• Raw data access.
• Transaction information access.
• Analytical information access.
21/41
22. Who should be supported by aggregations?
• The following user groups (divided according to the level of
abstraction of the information they need):
• Raw data access. Developers, DLs, DL researchers, companies …
• Transaction information access. Researchers, students, life-long learners …
• Analytical information access. Funders, government, business intelligence …
22/41
23. Should a single aggregation system support all three user types?
This can be realised by more than one system, provided that the dataset is the same!
23/41
24. CORE applications
• CORE Portal
• CORE Mobile
• CORE Plugin
• CORE API
• Repository Analytics
24/41
25. Who should be supported by aggregations?
• The following user groups (divided according to the level of
abstraction of the information they need):
• Raw data access. Developers, DLs, DL researchers, companies … → CORE API
• Transaction information access. Researchers, students, life-long learners … → CORE Portal, CORE Mobile, CORE Plugin
• Analytical information access. Funders, government, business intelligence … → Repository Analytics
25/41
26. CORE Applications
CORE API – Enables external systems and services to interact with the
CORE repository.
• Search service
• PDF and plain-text service
• Similarity service
• Classification service
• Citation service
26/41
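To make the API idea concrete, here is a sketch of how an external system might assemble a request against a CORE-style REST search endpoint. The base URL, endpoint path, and parameter names (`q`, `page`, `pageSize`, `apiKey`) are all hypothetical placeholders, not the actual CORE API specification; consult the real API documentation for the correct endpoints and keys.

```python
from urllib.parse import urlencode

# Hypothetical base URL -- illustrative only, not the real CORE endpoint.
BASE = "https://core.example.org/api/search"

def build_search_url(query, page=1, page_size=10, api_key="YOUR_KEY"):
    """Assemble a search request URL for a CORE-style REST API.

    All parameter names here are assumptions made for this sketch.
    """
    params = {"q": query, "page": page, "pageSize": page_size, "apiKey": api_key}
    return BASE + "?" + urlencode(params)
```

An external service would fetch the resulting URL and parse the response to integrate CORE search results into its own workflow.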
27. CORE Applications
CORE Portal – Allows searching and navigating scientific publications
aggregated from Open Access repositories
27/41
34. CORE statistics
• Content
• 7M records
• 230 repositories
• 402k full-texts
• 1TB of data
• 40GB index
• 35 million RDF triples in the CORE LOD repository
• Started: February 2011
• Budget: £140k
34/41
35. Outline
• Introduction of the CORE system
• Three phases:
• Metadata and content harvesting
• Semantic Enrichment
• Providing services
• Supporting research in mining databases of scientific
publications (DiggiCORE)
35/41
36. DiggiCORE objective
Software for exploration and analysis of very large and
fast-growing amounts of research publications stored
across Open Access Repositories (OAR).
36/41
38. DiggiCORE objectives
Allow researchers to use this platform to analyse
publications.
Why?
• To identify patterns in the behaviour of research
communities
• To detect trends in research disciplines
• To gain new insights into the citation behaviour of researchers
• To discover features that distinguish papers with high impact
38/41
39. Summary
• The rapid growth of OA content provides a great opportunity for
text-mining.
• Aggregations need to aggregate content, not just metadata.
• Aggregations should serve the needs of different user groups
including researchers who need access to data. CORE aims to
support them.
• We can have many services that are part of the infrastructure,
but they should all work with the same data.
39/41
The idea is to give you an overview of CORE and how it makes use of text-mining, not a comprehensive description of one method.
Content – the story of why I started to think about CORE. CORE is not a cross-repository search engine. Wide range of services (not focused only on people looking for content) – will explain later. Focusing on British repositories, but becoming international.
Our main focus is British Open Access repositories, but because of the collaboration with Europeana we have to go international.
All text mining takes place at this phase
Currently 99% of CORE data comes through metadata harvesting. The combination with other techniques has more potential.
The use of content is one of the relatively unique features of CORE
Alternative tools: TeamBeam, the Mendeley tool.
I will give an overview of the system (not a comprehensive description of all text mining services)