DSpace at ILRI: A semi-technical overview of “CGSpace”
1. A semi-technical overview of “CGSpace”
DSpace at ILRI
Alan Orth
KAINET ‘Open Data and Open Science’ Workshop
Nairobi, Kenya, 18 June 2015
2. History of DSpace at ILRI
● 2009: ILRI launches Mahider (“repository” in
Amharic)
● 2010: Other CGIAR centers and programs join
our platform and share hard / soft costs
● 2011: Rebranded as “CGSpace”
● 2015: 9 CGIAR centers, ~50,000 items, ~250k
hits/month
4. How we use DSpace
● Content people embedded in each department
help capture results (presentations, papers,
brochures, etc)
● Primary location for institutional outputs!
● No posting PDFs on corporate website!
● Integrate with website and blogs via RSS feeds
● Direct ALL traffic to DSpace!
● For data sets, videos, etc we make a metadata-only
accession with a link to eg YouTube
5. DSpace hierarchies
● Communities, sub-communities, and collections
● Tempting to model after organization hierarchy!
● (we did)
● … but organization hierarchies change!
7. Metadata
● Standard Dublin Core is available
● No AGROVOC
● You can create custom controlled vocabularies in
arbitrary namespaces, eg: cg.subject.ilri
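A field name such as cg.subject.ilri follows the same schema.element.qualifier pattern as Dublin Core fields like dc.title. A minimal Python sketch of splitting such names follows; the MetadataField type and the parse_field helper are illustrative names, not part of DSpace.

```python
from typing import NamedTuple, Optional


class MetadataField(NamedTuple):
    """One metadata field name, split into its dotted parts."""
    schema: str
    element: str
    qualifier: Optional[str]


def parse_field(name: str) -> MetadataField:
    """Split 'schema.element[.qualifier]' into its parts (hypothetical helper)."""
    parts = name.split(".")
    # Accept exactly two or three non-empty parts.
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"not a schema.element[.qualifier] name: {name!r}")
    qualifier = parts[2] if len(parts) == 3 else None
    return MetadataField(parts[0], parts[1], qualifier)


if __name__ == "__main__":
    print(parse_field("cg.subject.ilri"))
    print(parse_field("dc.title"))
```

Here "cg" is the custom namespace, "subject" the element, and "ilri" the qualifier that scopes the vocabulary to one center.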
10. Search engine optimization (SEO)
Help Google Scholar consume your content!
● XML sitemaps
● Consistent domain name, eg: cgspace.cgiar.org
● Persistent links for resources
● Website speed and HTTPS both a plus
● Sign up for Google Webmaster Tools to submit
sitemap, control indexing, see stats, etc
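To show what a search engine actually receives, here is a rough Python sketch of an XML sitemap in the sitemaps.org format. DSpace produces its own sitemaps, so this only illustrates the shape of the file; build_sitemap is a hypothetical helper and the item URL pattern in the example is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Assumed item URL on the consistent domain name; illustrative only.
    print(build_sitemap(["https://cgspace.cgiar.org/handle/10568/67073"]))
```

Each entry is a plain URL, which is why a consistent domain name and persistent links matter: the addresses in the sitemap must keep working.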
12. Importance of persistent links
● Website addresses change…
● mahider.ilri.org -> cgspace.cgiar.org
● But resources stay the same!
http://hdl.handle.net/10568/67073
● “Handle” service from handle.net
● Everything under prefix 10568 is CGSpace
● Default DSpace handle prefix is 123456789!
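The handle URL above splits into a prefix (the repository) and a suffix (the item). A small Python sketch, assuming handle URLs of the form shown on this slide; split_handle is a hypothetical helper name.

```python
from urllib.parse import urlparse

CGSPACE_PREFIX = "10568"        # from the slide: everything under 10568 is CGSpace
DEFAULT_PREFIX = "123456789"    # DSpace's out-of-the-box prefix


def split_handle(url: str):
    """Return (prefix, suffix) from a URL like http://hdl.handle.net/10568/67073."""
    path = urlparse(url).path.strip("/")
    prefix, sep, suffix = path.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"no prefix/suffix handle in {url!r}")
    return prefix, suffix


if __name__ == "__main__":
    prefix, suffix = split_handle("http://hdl.handle.net/10568/67073")
    print(prefix, suffix)
    if prefix == DEFAULT_PREFIX:
        print("Warning: still using the default DSpace handle prefix")
```

The check against the default prefix is the practical point: a repository still minting handles under 123456789 has not registered its own prefix with handle.net.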
14. Getting data INTO DSpace
● Day-to-day submission is manual, by a small
army of editors
● One-time batch uploads of items from other
systems in CSV format (InMagic!)
● OAI-PMH for metadata only
● OAI-ORE for metadata + bitstreams (eg, from
another DSpace or Sharepoint, etc)
● SWORD (haven't tried)
● REST API (DSpace 5+, haven't tried)
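For the one-time CSV route, most of the work is mapping the old system's columns onto metadata fields. The Python sketch below shows the idea only: the legacy column names in FIELD_MAP are hypothetical, and the "||" separator for repeated values is an assumption, so check the DSpace batch import documentation for the columns your version requires.

```python
import csv
import io

# Hypothetical mapping from legacy export columns to metadata fields.
FIELD_MAP = {
    "Title": "dc.title",
    "Authors": "dc.contributor.author",
    "Year": "dc.date.issued",
}


def to_batch_csv(rows, separator="||"):
    """Convert legacy records (dicts) to CSV text with metadata field headers."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(FIELD_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        record = {}
        for legacy_name, field in FIELD_MAP.items():
            value = row.get(legacy_name, "")
            # Repeated values (eg several authors) are joined into one cell.
            if isinstance(value, (list, tuple)):
                value = separator.join(value)
            record[field] = value
        writer.writerow(record)
    return out.getvalue()


if __name__ == "__main__":
    sample = [{"Title": "Placeholder title",
               "Authors": ["Author, A.", "Author, B."],
               "Year": "2015"}]
    print(to_batch_csv(sample))
```

Using the csv module rather than string joins matters here because author names usually contain commas and must be quoted.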
15. Getting data OUT OF DSpace
● REST API for structured JSON or XML
● OAI-PMH for metadata
● OAI-ORE for metadata + bitstreams (PDFs, etc)
● RSS feeds for websites / blogs
● XML sitemaps for search engines*
*Google discontinued the use of OAI for discovering
site content in 2008!
http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html
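As a rough illustration of the OAI-PMH route, the Python sketch below builds a ListRecords request and pulls titles out of a response. The verb, metadataPrefix, and XML namespaces come from the OAI-PMH standard; the base URL and the sample response are placeholders, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def titles(xml_text):
    """Extract every dc:title found inside the records of a response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iterfind(".//oai:record//dc:title", NS)]


# Placeholder response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # The base URL is an assumption; use your repository's OAI endpoint.
    print(list_records_url("https://example.org/oai/request"))
    print(titles(SAMPLE))
```

This returns metadata only; bitstreams such as PDFs need OAI-ORE or the REST API, as listed above.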
19. Skills needed in your organization
Besides content people(!)...
● Prioritize Linux systems administration
experience (Tomcat, httpd, PostgreSQL, DNS,
SSH, git)
● General: computer science background
● Web developers a diverse bunch...
● Java development experience doesn't hurt
20. Extra considerations
● Item mapping
● Maintenance tasks (background batch jobs)
● Backups of assetstore and PostgreSQL!
● Altmetrics tracks social media mentions
● Separate production / development
environments
● CGSpace server is $80/month
● ~20GB of PDFs, ~8GB of Solr data
21. Getting help
● “DSpace Tech” mailing list
● “dspace” tag on StackOverflow website
● a.orth@cgiar.org
Editor's notes
Introduce self as computer scientist, apologize for limited knowledge of library stuff. How we do things plus lessons learned.
Mention search engine stumbling and parsing vs consuming structured content