CKAN is an open-source data management solution for open data. It provides a platform for publishing and exposing metadata through an API and front-end interface. Major governments and communities use CKAN to organize large numbers of datasets. While it has advantages like organizing data in a structured way and providing APIs, its data model does not work for all use cases and there are no strict guidelines for dataset publishing. Extensions allow additional functionality and it can be deployed in various ways.
3. My experience with CKAN
● PublicData.eu portal
o Crowd-sourcing CSV2RDF mappings
● LODStats
o Version 1: crawling datahub.io (CKAN)
o Version 2: CKAN aggregator for data.gov, publicdata.eu and datahub.io
o Version 2: Crawled all three portals and published the data on datahub.io
5. Why CKAN?
● An open source platform
o Relatively easy to deploy
o Provides a rich set of features for free
● Data management
● Community involvement
6. Who uses CKAN?
● All major open governments
o Canada (open.canada.ca): 244,238 datasets
o The U.S. (data.gov): 131,348 datasets
o Europe (publicdata.eu): 47,863 datasets
● And some other communities:
o Semantic Web community (datahub.io): 9,509 datasets
8. CKAN Pros/Cons
● Pros
o Organizes your data in a structured way
o Has an extension to support DCAT (only for datasets)
o Provides an API to digest your data
● Cons
o The data model does not work for all use cases (DBpedia)
o No strict guidelines for dataset publishing
9. CKAN functionality
● Publishing metadata
● Exposing metadata (API/front-end)
● Access control for users/organizations
● Additional functionality via plugins
10. CKAN extensions/plugins
● Data preview and visualization
● CKAN + DCAT
● Extension that adds the Disqus commenting system to CKAN
● Simple API dataset hits counter
Full list is available at: http://extensions.ckan.org/
11. CKAN deployment
● From source
● OS package (e.g. as a Debian package)
● Docker image
Official guide: http://docs.ckan.org/en/latest/maintaining/installing/index.html
13. CKAN API
● Well documented
● Covers everything you can do with the web interface
o You can write your own web interface
● Various API clients
o ckanclient (Python) - official
o Ruby, PHP, Java, Node.js, Perl, R
https://github.com/ckan/ckan/wiki/CKAN-API-Clients
14. CKAN API methods
● Retrieving data
● Creating new data
● Updating existing data
● Deleting existing data
● Data is: packages, resources, groups, tags, users etc.
http://docs.ckan.org/en/latest/api/index.html
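The write operations listed above can be sketched as HTTP POST requests against the action API. The snippet below is a minimal sketch, not an official client: the helper name `build_action_request`, the placeholder API key and the dataset fields are assumptions made for illustration. It relies on the action API accepting a JSON body and reading the API key from the `Authorization` header. The request is only built, not sent.

```python
import json
import urllib.request


def build_action_request(base_url, action, payload, api_key=None):
    """Build (but do not send) a POST request for a CKAN action.

    `build_action_request` is a hypothetical helper, not part of CKAN.
    """
    url = "{}/api/3/action/{}".format(base_url.rstrip("/"), action)
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Write actions need authentication; the key goes into the
        # Authorization header.
        headers["Authorization"] = api_key
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# Assumed example values: a placeholder key and a made-up dataset.
request = build_action_request(
    "http://demo.ckan.org",
    "package_create",
    {"name": "my-example-dataset", "title": "My example dataset"},
    api_key="YOUR-API-KEY",
)
# Sending it would be: urllib.request.urlopen(request)
```

The same helper would cover updating and deleting by swapping the action name (for example `package_update` or `package_delete`) and the payload.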
15. CKAN API: Examples
● Get package list
o http://demo.ckan.org/api/3/action/package_list
o Disabled for data.gov
● Get one package
o http://demo.ckan.org/api/3/action/package_show?id=adur_district_spending
● ckan.logic.action.get.organization_show
o api/3/action/organization_show?id=...
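The read examples above can be reproduced from Python with the standard library alone. This is a minimal sketch under stated assumptions: the helper names `action_url`, `unwrap` and `call_action` are invented for this illustration, and the code assumes the usual action API response envelope with `success` and `result` keys.

```python
import json
import urllib.parse
import urllib.request


def action_url(base_url, action, **params):
    """Return the URL of a CKAN API v3 action with optional query parameters."""
    url = "{}/api/3/action/{}".format(base_url.rstrip("/"), action)
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url


def unwrap(response_text):
    """Return the "result" part of an action API response, or raise on failure."""
    body = json.loads(response_text)
    if not body.get("success"):
        raise RuntimeError("CKAN action failed: {}".format(body.get("error")))
    return body["result"]


def call_action(base_url, action, **params):
    """Perform the GET request (needs network access to the portal)."""
    with urllib.request.urlopen(action_url(base_url, action, **params)) as resp:
        return unwrap(resp.read().decode("utf-8"))


# Builds the "Get one package" URL from the slide without touching the network.
print(action_url("http://demo.ckan.org", "package_show", id="adur_district_spending"))
```

With network access, `call_action("http://demo.ckan.org", "package_list")` would return the list of package names, provided the portal has not disabled that action.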
16. Use Case: LODStats
● Aggregate CKAN instances via API
● Keep only the relevant datasets
● Build an application on top of it
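The aggregation and filtering steps above can be sketched as follows. This is an illustration under assumptions, not the LODStats code: it pages through `package_search` results (useful where `package_list` is disabled), takes the HTTP call as an injected `search(start, rows)` callable, and treats a dataset as relevant when one of its resources declares an RDF-like format. The `RDF_FORMATS` list and all function names are hypothetical.

```python
def iter_packages(search, rows=100):
    """Yield every dataset from one portal, page by page.

    `search(start, rows)` is an injected callable (an assumption of this
    sketch) returning the "result" dict of a package_search call, i.e. a
    dict with "count" and "results" keys.
    """
    start = 0
    while True:
        page = search(start, rows)
        results = page["results"]
        if not results:
            break
        for package in results:
            yield package
        start += len(results)
        if start >= page["count"]:
            break


# Assumed list of format labels; real portals use many more spellings.
RDF_FORMATS = {"rdf", "rdf+xml", "turtle", "ttl", "n-triples", "nt", "sparql"}


def has_rdf_resource(package):
    """True if at least one resource declares an RDF-like format."""
    return any(
        (res.get("format") or "").lower() in RDF_FORMATS
        for res in package.get("resources", [])
    )


def aggregate(portals):
    """Merge relevant datasets from several portals, keyed by (portal, name)."""
    merged = {}
    for portal_name, search in portals.items():
        for package in iter_packages(search):
            if has_rdf_resource(package):
                merged[(portal_name, package["name"])] = package
    return merged
```

Injecting `search` keeps the paging and filtering logic independent of any particular HTTP client, so it can be exercised with canned data before pointing it at a live portal.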
17. Use Case: CSV2RDF
● Integrated with a particular CKAN instance
● Aggregates all CSV files from the instance
● Provides an interface for CSV2RDF conversion
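Collecting the CSV files of an instance comes down to scanning the resources of every package. The sketch below is an assumption about how this could be done, not the CSV2RDF implementation; `csv_resources` is a hypothetical helper that works on package dicts as returned by the API.

```python
def csv_resources(packages):
    """Yield (dataset name, resource URL) for every CSV resource."""
    for package in packages:
        for res in package.get("resources", []):
            fmt = (res.get("format") or "").strip().lower()
            url = res.get("url") or ""
            # The "format" field is filled in inconsistently across
            # portals, so the URL suffix is checked as a fallback.
            if fmt == "csv" or url.lower().endswith(".csv"):
                yield package["name"], url
```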
18. Thank you for your attention!
Presented by Ivan Ermilov.
LinkedIn: https://www.linkedin.com/in/iermilov
Email: iermilov@informatik.uni-leipzig.de
Skype: earthquakesan
Editor's notes
What is CKAN? In two words.
Who am I? -)
PhD student @AKSW, University of Leipzig
URZ (university data center)
I hope the presentation will be interesting for all of you, and I’m looking forward to the discussion.
I want to briefly introduce our research group.
We are relatively big, with 40+ PhD students and research assistants.
Our group is divided into subgroups working on different topics, as you can see from the group roster, such as “Semantic Abstraction”, “Emergent Semantics”, “Machine Learning” etc.
Projects started in LOD2 project
A common misconception about CKAN is that it stores files for you.
It can indeed be extended to store files.
But it was originally designed to store METADATA, not the data itself.
Open source
Open source solutions generally offer rather scarce documentation, and even a small deviation from a typical scenario requires a specialist. In most cases such problems can be resolved through the project's mailing lists. Customizing a CKAN instance (if no suitable plugin is available) requires a programmer.
Data management
CKAN enables organizations and individuals to publish metadata about their datasets through a web front-end. This is an easy task that does not require much effort.
Community involvement
CKAN has two main kinds of users: individual users and organizations. On governmental portals registration is usually closed, because only government offices should be able to publish data. Registered users can comment on datasets and receive updates via various interfaces (more about that later).
CKAN has been adopted by all the major open government portals (data.gov, for instance, previously ran on the Socrata data platform).
Why? For the reasons I mentioned before. Also important is that CKAN supports a multi-tier architecture, where local CKAN instances (for cities, for instance) can be aggregated by a regional CKAN instance. I will use the publicdata.eu portal as an example of how this can be achieved.
On this slide I depicted a general overview of the CKAN architecture.
As any web application, it consists of a back-end and a front-end.
Organization example:
We at the AKSW group have an organization on the datahub.io portal, where we publish our datasets (to support dissemination).
CKAN has a flexible architecture, where new functionality can be added via extensions.
We’ve already seen what CKAN provides out-of-the-box
Simple API dataset hits counter:
Store a counter for calls to the “show” API command for a given dataset.
CKAN + DCAT exposes dataset information as RDF.
All the package/resource fields are mapped to the DCAT RDF vocabulary, which is a W3C Recommendation.
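To make the idea of the field mapping concrete, the sketch below turns a CKAN package dict into a handful of DCAT triples. It is a simplified illustration, not the code of the DCAT extension: the function name `package_to_triples`, the choice of fields, and the way distribution URIs are built are assumptions, and literals are kept as plain strings instead of typed RDF literals.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def package_to_triples(package, dataset_uri):
    """Return (subject, predicate, object) tuples for one CKAN package dict.

    Simplified illustration: only a few fields are mapped.
    """
    triples = [(dataset_uri, RDF_TYPE, DCAT + "Dataset")]
    if package.get("title"):
        triples.append((dataset_uri, DCT + "title", package["title"]))
    if package.get("notes"):
        triples.append((dataset_uri, DCT + "description", package["notes"]))
    for tag in package.get("tags", []):
        triples.append((dataset_uri, DCAT + "keyword", tag["name"]))
    for res in package.get("resources", []):
        # Hypothetical URI scheme for distributions.
        dist_uri = "{}/resource/{}".format(dataset_uri, res["id"])
        triples.append((dataset_uri, DCAT + "distribution", dist_uri))
        triples.append((dist_uri, RDF_TYPE, DCAT + "Distribution"))
        if res.get("url"):
            triples.append((dist_uri, DCAT + "accessURL", res["url"]))
    return triples
```

Each CKAN package becomes a `dcat:Dataset` and each resource a `dcat:Distribution` linked from it, which is the core of the DCAT data model.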
CKAN is relatively easy to deploy.
The most complex installation is the one from source: it requires manually installing CKAN itself into a Python virtual environment and setting up Apache Solr (a full-text index built on Lucene). Installing CKAN from source is only strictly necessary if you want to write your own extension or modify the source code for some reason.
The second option, installation from an operating system package, can be a good choice if you wish to run only one CKAN instance per server (or virtual machine). The drawback of the packages is that they are not very well maintained; in other words, you may have to wait a long time for an update.
The third option is relatively new and by far the most suitable for large-scale deployment, or if you need several CKAN instances per server/VM. The Docker image is assembled from the source code, and the latest image is available on Docker Hub. If it is not available, you can build it yourself. The overhead here is needing a person who can work with Docker. -)
The environment we prefer at AKSW is the latest long-term support (LTS) release of Ubuntu Server.
PublicData.eu is an initiative to make a one-stop portal for data in Europe.
Aggregation was not part of the initial CKAN functionality.
A dedicated harvest extension was developed for this purpose.
Local governments can therefore deploy their own CKAN instances, which can then be aggregated.
You need a good API for your metadata to support the creation of cool applications on top of the data.