Linked data migrational framework
In collaboration with
NANYANG TECHNOLOGICAL UNIVERSITY
Wee Kim Wee
School of Communication & Information
K6299 – Critical Inquiry in Knowledge Management
Proposal for Designing a Linked Data Migrational Framework for Singapore
Government Data Sets
Under the guidance of
Dr. Khoo Soo Guan, Christopher (Assoc Prof)
Mr. Soy Boom Lim (Manager, iDA Singapore)
Submitted by
SESAGIRI RAAMKUMAR ARAVIND (G1101761F)
THANGAVELU MUTHU KUMAAR (G1101765E)
KALEESWARAN SUDARSAN (G1001065F)
Page 1 of 9
Introduction
“The Internet is becoming the town square for the global village of tomorrow.” This quote from Bill
Gates, Chairman of Microsoft, aptly describes today's business world, in which the internet is the
dominant medium for connecting resources across geographies and enables a high volume of transactions
with ease. The challenge now lies in enabling machines to read and understand data on the internet, so
that chains of transactions that were previously manual, because the traditional WWW presents content
in a human-readable format, can be carried out intelligently. This idea was formulated in the concept of
the Semantic Web, in which content is defined with semantics (Berners-Lee, Hendler & Lassila, 2001).
Building on this concept, the Linked Data principles were released to guide individuals, enterprises and
public bodies in publishing their data in a common standard, RDF (Resource Description Framework), to
form a web of data (Berners-Lee, 2006). Standardised data representation provides more scope for
interlinking data sets across domains, creating avenues for multi-point usage and knowledge discovery
through intelligent software applications built on top of it.
The large-scale application of Linked Data explored here is the eGovernment (eGov) initiatives of the
US, the UK and many other nations to publish their Open Government Data (OGD) on governance and public
affairs, for transparency and value co-creation, so as to empower people with appropriate knowledge.
The recent Open Government Partnership [1] mandates nations to publish their OGD in linked data format.
Many nations have started to publish their data as linked data, the latest being Brazil's data portal
data.gov.br [2]. The start of the Linked Data movement spurred the release of new data sets, as
highlighted by the LOD cloud [3] maintained through the CKAN [4] registry. The US and UK governments
have realised the benefits by releasing selected data sets in linked data format on the portals
data.gov [5] and data.gov.uk [6] respectively. Well-defined relationships between these data sets,
together with ready-made applications, support the public's daily activities related to transport,
business and other needs. Existing applications include Numberhood [7], FixMyTransport [8], BIS Research
Funding Explorer [9], SemaPlorer [10] and the “Linking Wildland Fire and Government Budget” mashup [11].
[1] Open Government Partnership http://www.state.gov/g/ogp/
[2] Brazil Data Portal data.gov.br
[3] LOD cloud diagram shows datasets that have been published in Linked Data format, by contributors to
the Linking Open Data community project and other individuals and organisations
http://richard.cyganiak.de/2007/10/lod/
[4] Comprehensive Knowledge Archive Network http://ckan.net/
[5] data.gov
[6] data.gov.uk
[7] http://www.Numberhood.net
[8] http://www.fixmytransport.com/
[9] http://consulting.talis.com/case-study/bis-research-funding-explorer/
The current OGD scenario in Singapore does not make use of Linked Data standards. This proposal aims to
suggest a framework for migrating from the existing system of data publishing. As a starting point, a
study of the current OGD ecosystem in Singapore is being carried out. iDA [12] maintains the portal
data.gov.sg [13], which handles data collated from different government agencies (Chee Hean, 2011). The
data portal aims to meet the Singapore public's data needs and to establish a co-creative environment.
The data is provided in structured and unstructured formats such as txt, Excel, PDF, XML, web pages and
maps, and also through agency-specific Application Programming Interfaces (APIs) and web services.
There are multiple endpoints for data consumption. Prominent examples include data.gov.sg, OneMap
API [14], Singapore Statistics [15], mytransport.sg [16] and Integrated Land Information Services [17].
There is some redundancy in data across the different sources in the current OGD ecosystem, with limited
interlinking and re-use capabilities. The vocabularies used by the agencies are agency-specific, with
limited standardisation of commonly used terms. Building a mash-up application that leverages data
across agencies is therefore complex. The study so far indicates scope for applying linked data, since
linked data requires standardised data representation at the source level and a common interface at the
publication level, with the data sets linked by interconnected vocabularies.
Fig 1: Linked Data implementation over current DGS (DATA.GOV.SG) Ecosystem
[10] http://www.uni-koblenz-landau.de/koblenz/fb4/institute/IFI/AGStaab/Research/systeme/semap
[11] http://logd.tw.rpi.edu/demo/linking_wildland_fire_and_government_budget
[12] Infocomm Development Authority of Singapore (iDA) http://www.ida.gov.sg/home/index.aspx
[13] data.gov.sg
[14] http://www.onemap.sg
[15] http://www.singstat.gov.sg/
[16] http://mytransport.sg
[17] http://www.inlis.gov.sg/layout/homepage.aspx#
Objectives of the Proposal
The current study aims to build a linked data migrational framework that iDA and Singapore Government
agencies could use to publish their data sets to the public as linked data. A multi-step methodology
will be devised, with clearly defined activities and deliverables at each step, based on the current
ecosystem of data.gov.sg and other OGD publishing portals in Singapore. Geographical and statistical
data have been selected to illustrate each step in the framework.
The framework will be built from the metadata and specifications provided by iDA and government
agencies. The current study focuses on linking the internal data sets. Additionally, it aims to provide
recommendations on a few use-cases that leverage external linked data. The framework as a whole will be
validated with geographical and statistical data provided by the Singapore Land Authority (SLA) and the
Department of Statistics (DOS).
Other objectives of the study are as follows:
1.) Explore case studies on the implementation of Linked Open Government Data (LOGD).
2.) Prepare an inventory by assessing different linked data tools, technical frameworks and processes.
3.) Provide recommendations for linked data implementation according to the nature of the government
agency.
4.) Build an Ontology Network model (Haase, Rudolph, Wang et al., 2006) to unify vocabularies from
different agency domains.
5.) Build a proof-of-concept (POC) application based on the devised methodology to validate its
applicability. This objective is subject to the availability of sufficient time and infrastructure.
The migrational framework will be useful for iDA in formulating its Linked Data implementation strategy
in the near future, as the government body intends to make data.gov.sg the cornerstone portal for OGD
publication. The common output interface suggested by the framework will showcase the potential of
unifying the different endpoints provided by the agencies, thereby simplifying access and facilitating
the creation of applications that integrate data from disparate sources. The ontology network suggested
by the framework will help the agencies standardise vocabulary across domains, so that they can better
understand their data and its relation to data from other agencies.
The framework can also be used by enterprises and individuals to understand the steps, tools and
processes involved in releasing their data to the WWW as linked data.
Literature Review
The Semantic Web facilitates a web of data [18] that works on top of the URI [19], RDF [20],
Ontology [21] and SPARQL [22] concepts. Resources and values are identified and described in a common
standard, RDF, based on a modelled ontology that specifies the relationships (Berners-Lee, Hendler &
Lassila, 2001). The LOD2 [23] initiative aims to build a LOD stack of products, frameworks and processes
to accelerate the implementation of linked data across the globe. W3C has set up two committees [24] to
provide best practices and recommendations for governments publishing their OGD in standardised linked
data format. Bizer, Heath, Idehen & Berners-Lee (2008), Villazón-Terrazas, Vilches-Blázquez, Corcho &
Gómez-Pérez (2011) and Hyland & Wood (2011) provide cookbooks and guidelines for converting OGD to
Linked Data format. These are helpful in understanding the general steps and tools required to convert
and publish OGD as Linked Data. However, governments that are new to Linked Data publication need a
migrational framework tailored to the local OGD ecosystem. Such a customised framework could be used by
the government steering committee to expedite the migration to Linked Open Government Data (LOGD) format.
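To make the RDF model described above concrete, the following minimal Python sketch expresses one
statistic as subject-predicate-object triples and serialises them in N-Triples syntax. The example.org
namespace, the class and property names and the population figure are placeholders invented for
illustration; they are not taken from data.gov.sg or from any agency vocabulary.

```python
# Minimal sketch: RDF statements as (subject, predicate, object) triples.
# All URIs under http://example.org/ and the value 1000 are hypothetical.

BASE = "http://example.org/sg/"  # assumed namespace, not an official one
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def uri(term):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + term + ">"


def typed_literal(value, datatype):
    """Render a typed literal, e.g. "1000"^^<...#integer>."""
    return '"{}"^^{}'.format(value, uri(datatype))


def to_ntriples(triples):
    """Serialise already-rendered terms, one statement per line."""
    return "\n".join("{} {} {} .".format(s, p, o) for s, p, o in triples)


zone = uri(BASE + "zone/ExampleZone")
triples = [
    # The resource is typed against a (hypothetical) ontology class.
    (zone, uri(RDF_TYPE), uri(BASE + "def/PlanningZone")),
    # A statistical value is attached through a (hypothetical) property.
    (zone, uri(BASE + "def/residentPopulation"), typed_literal(1000, XSD_INTEGER)),
]

ntriples = to_ntriples(triples)
print(ntriples)
```

Because every resource and property is a URI, a second data set that refers to the same zone URI links
to this one without any further integration step.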
Methodology
Prior to this proposal, the project team held discussions with staff from iDA, SLA and NIIT (the IT
vendor supporting the DGS [25] platform) to gain a basic understanding of the current architecture and
to identify the DGS components that could accommodate changes as part of this study. Primary data will
be provided by iDA and SLA. The data sets selected for the study are listed in Table 1.1. These
seemingly disparate data sets can be connected to give context-specific knowledge about each site, so
that prospective tenderers can gain insights into consumer and locality trends based on the
demographics.
[18] Linked Data and Web of Data http://www.youtube.com/watch?v=GKfJ5onP5SQ
[19] Uniform Resource Identifiers (URIs) are short strings that identify resources in the web:
documents, images, downloadable files, services, electronic mailboxes, and other resources. They make
resources available under a variety of naming schemes and access methods such as HTTP, FTP, and
Internet mail addressable in the same simple way http://www.w3.org/Addressing/
[20] RDF is a standard model for data interchange on the Web. RDF has features that facilitate data
merging even if the underlying schemas differ, and it specifically supports the evolution of schemas
over time without requiring all the data consumers to be changed http://www.w3.org/RDF/
[21] Ontologies or vocabularies define the concepts and relationships (also referred to as “terms”)
used to describe and represent an area of concern. http://www.w3.org/standards/semanticweb/ontology
[22] SPARQL is an RDF query language; its name is an acronym that stands for SPARQL Protocol and RDF
Query Language. http://www.w3.org/TR/rdf-sparql-query/
[23] LOD2 Project http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html
[24] http://www.w3.org/2011/gld/charter and http://www.w3.org/egov/
[25] DGS – Data.gov.sg data store
Data set                            Agency                      Category                    Data type
----------------------------------  --------------------------  --------------------------  ---------
Resident Population by DGP Zone/    Department of Statistics    Population and Household    Textual
Subzone and Age Group, Type of                                  Characteristics
Dwelling, Ethnic Group
Sites Sold by URA - Details         Urban Redevelopment         Housing and Urban           Textual
                                    Authority (URA)             Planning
Table 1.1: Primary datasets used for the study
Only the latest year's data, rather than the entire data sets, will be used for the study. The secondary
data for the study will be extracted from LOGD statistical and geospatial data sets on the portal
thedatahub.org for building the framework. The migrational framework will be customised to the current
architecture of DGS, because its steps will be devised from an understanding of the different layers in
DGS; even so, the framework will be generic enough to apply to other cases. The project team will
conduct interviews with iDA support staff to collect specification documents and insights relevant to
the current architecture of DGS.
The framework will be formulated through context-specific integration of the different approaches put
forth by LOGD activists, researchers and practitioners. The steps in the framework will be sequential,
each comprising sub-steps that cover its constituent activities. For example, object modelling of the
different data objects in the selected data sets is a step that precedes the RDF modelling and
Ontology/Vocabulary building steps. The steps will be substantiated with sample implementations using
the primary data. Suggestions from the W3C LOGD steering groups [24] will be taken into account in
formulating the framework. The tools identified as part of the inventory will be used for framework
activities such as RDF creation, RDF storage and ontology re-use/modelling.
Difficulties and Issues
Agencies do not provide raw data to iDA. Aggregated report data is split into X dimensions representing
columns, Y dimensions representing rows and data points representing cells. These fields are provided in
an XML file and sent to iDA on a periodic basis. There is no separate master data file. The hierarchy in
master data dimensions is not explicitly set or provided. Therefore, a mechanism to identify the master
data and the relationship between different levels in the master data dimensions needs to be devised. This
mechanism may not serve as a generic transformation applicable for all agencies due to the implicit nature
of data representation in the files.
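The report layout described above (X dimensions as columns, Y dimensions as rows, data points as cells)
can be sketched as follows. The XML element and attribute names, the labels and the values are all
hypothetical, since the actual file format sent to iDA is not specified in this proposal. The
`infer_parent` helper shows only one possible heuristic for recovering an implicit hierarchy, and it
assumes that child labels embed the parent label before a separator, which real agency files may not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical report file: the element/attribute names are assumptions.
SAMPLE = """
<report name="ExampleReport">
  <xdims><x id="x1" label="Age 0-4"/><x id="x2" label="Age 5-9"/></xdims>
  <ydims><y id="y1" label="Zone A"/><y id="y2" label="Zone A / Subzone 1"/></ydims>
  <cells>
    <cell x="x1" y="y1" value="10"/>
    <cell x="x2" y="y1" value="12"/>
    <cell x="x1" y="y2" value="4"/>
  </cells>
</report>
"""


def parse_report(xml_text):
    """Flatten the pivoted report into one observation per cell."""
    root = ET.fromstring(xml_text)
    x_labels = {x.get("id"): x.get("label") for x in root.iter("x")}
    y_labels = {y.get("id"): y.get("label") for y in root.iter("y")}
    observations = []
    for cell in root.iter("cell"):
        observations.append({
            "column": x_labels[cell.get("x")],
            "row": y_labels[cell.get("y")],
            "value": int(cell.get("value")),
        })
    return observations


def infer_parent(label, separator=" / "):
    """Heuristic only: treat the text before the last separator as the parent."""
    if separator in label:
        return label.rsplit(separator, 1)[0]
    return None


observations = parse_report(SAMPLE)
for obs in observations:
    print(obs["row"], "|", obs["column"], "|", obs["value"],
          "| parent:", infer_parent(obs["row"]))
```

Each flattened observation can then be turned into RDF statements, while the dimension labels collected
along the way form a candidate master data list for review.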
The data conversion to RDF will not be done at the agency level; instead, it will be done on top of the
data model in the iDA data store. This leads to data duplication, as the data is converted into a
separate RDF copy for the Linked Data implementation.
There is no master data management system in place right now that standardises the dimension values
across agencies. Standardisation is required to link common data in the data sets used in the study. This
might be a complex task due to the different versions of master data values in a single data set and also
across data sets.
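A minimal sketch of the standardisation step, assuming that variants of a dimension value differ only in
case, spacing and punctuation. The master list, the example.org URIs and the sample labels are
hypothetical; real master data would need agency input, and values that differ in more than surface
form would need a curated mapping.

```python
import re


def normalise(label):
    """Lower-case and collapse punctuation/whitespace so surface variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


# Hypothetical master list: normalised label -> canonical URI.
MASTER = {
    "zone a": "http://example.org/sg/zone/zone-a",
    "zone b": "http://example.org/sg/zone/zone-b",
}


def link_value(label, master=MASTER):
    """Return the canonical URI for a dimension value, or None if unmatched."""
    return master.get(normalise(label))


# Two data sets using different spellings of the same dimension values.
dataset_one = ["Zone A", "ZONE  B"]
dataset_two = ["zone-a", "Zone C"]

links = {label: link_value(label) for label in dataset_one + dataset_two}
unmatched = [label for label, target in links.items() if target is None]

print(links)
print("needs manual review:", unmatched)
```

Values that resolve to the same URI become the link between the data sets; unmatched values are
reported rather than guessed.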
The current OGD ecosystem of Singapore provides users with multiple endpoints such as APIs, web services
and files. Providing a common endpoint in the form of a Linked Data API would mean building different
wrappers over these endpoints. The diagram below, from Bizer, Heath, Idehen & Berners-Lee (2008),
illustrates the different approaches to implementing linked data over existing systems.
Fig 2: Different Linked Data Implementation Approaches
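The wrapper idea can be illustrated with a short sketch in which records arriving from two kinds of
endpoint, a JSON API response and a CSV file, are passed through one shared conversion function. The
payloads, field names and example.org URIs are invented for illustration and do not reflect the actual
OneMap, SingStat or data.gov.sg interfaces.

```python
import csv
import io
import json

BASE = "http://example.org/sg/"  # hypothetical namespace


def records_from_json(text):
    """Wrapper for an API endpoint that returns a JSON array of objects."""
    return json.loads(text)


def records_from_csv(text):
    """Wrapper for a file endpoint that serves CSV with a header row."""
    return list(csv.DictReader(io.StringIO(text)))


def to_triples(records, id_field, kind):
    """Common output: one (subject, predicate, object) triple per non-id field."""
    triples = []
    for record in records:
        subject = BASE + kind + "/" + str(record[id_field])
        for field, value in record.items():
            if field != id_field:
                triples.append((subject, BASE + "def/" + field, str(value)))
    return triples


# Hypothetical payloads standing in for two different endpoints.
api_payload = '[{"id": "s1", "name": "Example Site"}]'
file_payload = "id,population\nz1,1000\n"

triples = (to_triples(records_from_json(api_payload), "id", "site")
           + to_triples(records_from_csv(file_payload), "id", "zone"))

for triple in triples:
    print(triple)
```

Only the small `records_from_*` functions are endpoint-specific in this sketch; everything downstream
of them is shared, which is what makes a single output interface feasible.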
Schedule
The schedule for the study is covered in the embedded Gantt chart.
Gantt Chart-iDA Linked Data Project.xlsx
Proposed Report Outline
The proposed final report will be structured in the following format.
1. Abstract
2. Introduction
   a. Introduction to Linked Data and its relevance to Open Government Data and eGov
   b. Overview of SG OGD Ecosystem
3. Literature Review
   a. Government Linked Data Implementation Cookbooks, Guidelines and Recommendations
      i. URI formulation
      ii. RDF creation
      iii. Ontology Formulation
      iv. Publication and Exploitation
4. Migrational Framework
   a. Multi-step methodology
      i. Formulation and Description
      ii. Examples
5. Implementation Results and Observations
   a. POC details
   b. Description of issues faced in implementation
6. Limitations
7. Conclusion and Recommendations
A few new sections and sub-sections may be added to the final report.
Dissemination of Results
The migrational framework will be published as a report, subject to review by the NTU supervisor,
followed by submission to iDA. The researchers plan to publish the report as a conference paper later
in the year.
References
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34.
Berners-Lee, T. (2006). Linked Data. Available: http://www.w3.org/DesignIssues/LinkedData.html. Last
accessed 11th Jan 2012.
Chee Hean, T. (2011). Keynote Address by Mr Teo Chee Hean, Deputy Prime Minister, Coordinating
Minister for National Security and Minister for Home Affairs at the e-Gov Global Exchange 2011.
Available: http://www.ida.gov.sg/News%20and%20Events/20110620114104.aspx?getPagetype=21.
Last accessed 11th Jan 2012.
Bizer, C., Heath, T., Idehen, K., & Berners-Lee, T. (2008). Linked Data: Evolving the Web into a Global
Data Space. (J. Hendler & F. Van Harmelen, Eds.) Proceedings of the 17th International Conference on
World Wide Web (WWW '08) (Vol. 1, p. 1265). ACM Press.
Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., & Gómez-Pérez, A. (2011). Methodological
guidelines for publishing government linked data. In D. Wood (Ed.), Linking Government Data
(chapter 2, pp. 27-49). New York, NY: Springer.
Hyland, B., & Wood, D. (2011). The joy of data - a cookbook for publishing linked government data on
the web. In D. Wood (Ed.), Linking Government Data (chapter 1, pp. 3-26). New York, NY: Springer.
Haase, P., Rudolph, S., Wang, Y., Brockmans, S., Palma, R., Euzenat, J., & d'Aquin, M. (2006,
November). Networked Ontology Model. Technical Report, NeOn project deliverable D1.1.1.