British Oceanographic Data Centre's Published Data Library
3. Objectives
- Delivery of meaningful collections of data
- Data delivered by the PDL must be:
- Fixed to the checksum level
- Discoverable
- Usable with confidence without referral to any additional material (Context!)
- Assured availability for the foreseeable future
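As an illustration of what fixing a dataset "to the checksum level" could look like in practice, the sketch below computes a digest for a file and compares it with a value recorded at publication time. The choice of SHA-256 and the function names are assumptions made for this example; the slides do not say which checksum algorithm or tooling the PDL uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum, algorithm="sha256"):
    """True if the file still matches the checksum recorded when it was published."""
    return file_checksum(path, algorithm) == recorded_checksum.lower()
```

Any change to the file, even a single byte, produces a different digest, which is what makes a checksum a practical test that a published dataset is still the one originally cited.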
4. Objectives
- PDL is not a replacement for BODC’s “traditional” data serving
- It is a parallel system
- It is tailored to suit the needs of the academic publishing community
- It does not support “data behind the graph”
5. Design
- Datasets assigned a DOI through DataCite
- NERC affiliated to DataCite through British Library
- doi prefix = 10.5285
- suffix = a UUID
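The prefix-plus-UUID scheme can be sketched as follows. This is a minimal illustration, not BODC's actual minting code: the function names are hypothetical, the use of a random (version 4) UUID is an assumption, and the UUID in the usage note is a placeholder rather than a real dataset identifier.

```python
import uuid

PDL_DOI_PREFIX = "10.5285"  # prefix stated on the slide


def make_doi(suffix=None):
    """Join the prefix to a UUID suffix; generate a random UUID if none is given.

    Assumption: a version 4 UUID is used when minting. The slides only say
    that the suffix is "a UUID".
    """
    suffix_uuid = uuid.UUID(str(suffix)) if suffix is not None else uuid.uuid4()
    return f"{PDL_DOI_PREFIX}/{suffix_uuid}"


def split_doi(doi):
    """Split a DOI into (prefix, UUID), rejecting other prefixes or bad suffixes."""
    prefix, _, suffix = doi.partition("/")
    if prefix != PDL_DOI_PREFIX:
        raise ValueError(f"unexpected DOI prefix: {prefix!r}")
    return prefix, uuid.UUID(suffix)  # raises ValueError if not a valid UUID
```

For example, `make_doi("12345678-1234-5678-1234-567812345678")` yields a string of the form `10.5285/<uuid>`. Because the suffix carries no meaning, the identifier never needs to change if a dataset is renamed or reorganised.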
6. Design
- DOIs resolve to an HTML landing page
- Landing page contains metadata concerning dataset
- Landing page links off to usage metadata & data
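The resolution step can be sketched like this, assuming the public DOI resolver at https://doi.org/ and Python's standard library. The function names are hypothetical, and the second function makes a network request, so it only works for a DOI that is actually registered.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

DOI_RESOLVER = "https://doi.org/"  # assumption: the public DOI resolver


def resolver_url(doi):
    """Build the resolver URL for a DOI, percent-encoding unsafe characters."""
    return DOI_RESOLVER + quote(doi, safe="/")


def landing_page_url(doi, timeout=10):
    """Follow the resolver's redirects and return the final URL.

    For a PDL dataset this would be expected to be the HTML landing page.
    """
    request = Request(resolver_url(doi), headers={"Accept": "text/html"})
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

Citing the DOI rather than the landing page URL means the citation stays valid even if the landing page later moves, since only the resolver's record has to be updated.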
9. Current status
- Descriptive pages live
- DOI catalogue live – 8 datasets w/ DOIs
- https://www.bodc.ac.uk/data/published_data_library/catalogue
11. Current status
- Descriptive pages live
- DOI catalogue live – 8 datasets w/ DOIs
- DOI landing pages live
- e.g. https://www.bodc.ac.uk/data/published_data_library/catalogue/41479c42_4dfb_4da9_be97_4c532ce13922/
13. Current status
- Landing pages contain human & machine readable metadata
- HTML & RDFa
- Updates noted as hAtom entries
- Fields documented in cookbook
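To show how machine-readable metadata embedded as RDFa might be read back out of a landing page, here is a rough sketch using only Python's standard library. The sample HTML, the `dc:` property names and the class name are all assumptions for illustration; the real field names are those documented in the cookbook, which is not reproduced here. The parser also ignores most of RDFa (prefix declarations, `about`, `typeof`, nested elements), so a full RDFa processor would be needed for real use.

```python
from html.parser import HTMLParser


class RDFaPropertyParser(HTMLParser):
    """Collect (property, value) pairs from elements with an RDFa `property` attribute.

    The value is the `content` attribute when present, otherwise the text
    up to the next closing tag.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open_property = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property")
        if prop is None:
            return
        if attributes.get("content") is not None:
            self.pairs.append((prop, attributes["content"]))
        else:
            self._open_property = prop
            self._text = []

    def handle_data(self, data):
        if self._open_property is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_property is not None:
            self.pairs.append((self._open_property, "".join(self._text).strip()))
            self._open_property = None


# Hypothetical landing-page fragment, not copied from a real PDL page.
SAMPLE = """
<div>
  <h1 property="dc:title">Example dataset title</h1>
  <span property="dc:identifier" content="doi:10.5285/12345678-1234-5678-1234-567812345678"></span>
  <p property="dc:publisher">Example publisher</p>
</div>
"""

parser = RDFaPropertyParser()
parser.feed(SAMPLE)
metadata = parser.pairs
```

The same page therefore serves both audiences: a browser renders the visible text, while a harvester reads the `property` attributes.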
15. Future work
- Currently PDL is entirely hand-coded
- Design documented for an RDBMS back office
- Tables in place
- Population started - ironing out issues
- Followed by middleware to create web views
16. Future work
- Links with IODE POD
- Test dataset to be ingested by AGU
- Prove concept of POD
- Link from a BODC landing page to POD repository
18. Geoscience Data Journal, Wiley-Blackwell and the Royal Meteorological Society
- supported by NERC – in particular the British Atmospheric Data Centre
- partnership formed between Royal Meteorological Society & academic publishers Wiley-Blackwell
- develop a mechanism for the formal publication of data in the Open Access Geoscience Data Journal
- builds on JISC-funded OJIMS (Overlay Journal Infrastructure for Meteorological Sciences) project
- parallels work done by the NERC Science Information Strategy Data Citation and Publication project
- brings all the NERC environmental data centres together
19. GDJ Rationale and Incentives
- Publishing a dataset in a data journal will provide academic credit to data scientists without diverting effort from their primary work on ensuring data quality.
- Funders want to get the best possible science for their money. Publication in a data journal ensures that the dataset is uploaded to a trusted repository where it will be backed up, archived and curated, and so won’t be vulnerable to bit-rot or to being lost or stored on obsolete media.
- The peer-review process also reassures the funder that the published dataset is of good quality and that the experiment was carried out appropriately.
- Data journals will be a good starting point for researchers outside the immediate field who want to know what sort of data is available and how to access it.
- Data publication will help show transparency in the scientific process, improving public accountability.
- Opportunities to form partnerships with other organisations that share the goal of data publication, to exploit common activities and achieve wider community buy-in. Examples include the CODATA-ICSTI Task Group on Data Citation Standards and Practices, DataCite and others.