Open Government systems are changing the way that many governments around the world interact with their citizenry. Transparency and innovation are both enhanced by government openness and especially by government data sharing. Further, the linked data approach, using maturing semantic web technologies, has been shown to be very valuable in creating "mashups" in which government datasets can be combined in new and innovative ways, and turned into "live" infographics. In this talk, presented to the computer science department at PUC-RIO, we describe the research aspects of RPI's ongoing work in this area.
RPI Research in Linked Open Government Systems
1. Linked Open Government Data http://logd.tw.rpi.edu Jim Hendler Tetherless World Professor of Computer and Cognitive Science Assistant Dean of Information Technology and Web Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/~hendler @jahendler (twitter)
5. Government Data Sharing January 1, 2009 “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” --- President Obama Putting Govt Data online - Data.gov.uk beta May 21, 2009 January 19, 2010 data.gov.uk online May 21, 2010 data.gov online data.gov relaunch with semantic web featured June 30, 2009 December 8, 2009 “Open Government Directive” released 2009 2010 … 57 Data Sets ~6000 Data Sets ~2000 Data Sets >305,000 Data Sets
11. Linked Open Data goes beyond govt http://linkeddata.org/ Government Data is currently over ½ the cloud in size (~17B triples), 10s of thousands of links to other data (within and without)
15. Adding some Web magic Web Analytics Social Data Networks External Links
16. Linking GDP of the US and China GDP of China (Billion Chinese Yuan) GDP of the US (Billion Dollars) [Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn
17. Linking GDP of the US and China GDP of China (Billion Chinese Yuan) GDP of the US (Billion Dollars) [Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn This mashup was built in less than 4 hours, including conversion of data, web interface, and visualization!
29. Simple Example: EPA Toxic Release Data

Facility ID   …   Latitude    Longitude     ST:val
…             …   40.416944   -75.935       42
…             …   42.955383   -85.480074    26
…             …   43.1698     -88.01829     55
…             …   38.87025    -77.00905     14
…             …   …           …             …

This looks like it could be state identifiers. Look for possible state identifiers:
-Names: “Pennsylvania”, “Michigan”, “Wisconsin”
-Abbr: “PA”, “MI”, “WI”
-FIPS: “42”, “26”, “55”
75% match state identifiers. If this meets our threshold, then recommend interpreting as state and integrating with linked data on the web.
Federal Information Processing Standards (FIPS) 14 is “Guam”, which is not a US state.
The table shows sample data from an EPA Toxic Release dataset. We focus on state (ST) for our example. Note that our heuristic looks for full state names, state abbreviations, or appropriate FIPS codes. Guam is colored red because it is not technically a state. Other items in this class include DC, Puerto Rico, American Samoa, the US Native American tribal entities, and certain municipalities (such as New York City) that are large enough to have their own codes.
Explanation: Why is this hard? If we see a database column with the number “36” in it, we have no way to tell what it represents. But if it is in a list of values that are all two digits with a maximum under 60, they may be codes (or they may be ages, so we also have to look for other clues). Even more confusing, Albany has no separate FIPS code (it is 36, for New York State), but Manhattan has its own FIPS code of 36061, so telling states from municipalities can be hard, and we cannot reject a column just because some entries are not in the right range.
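The state-identifier check described in these notes can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the lookup sets contain only the three states shown on the slide, whereas a real lookup would need every state name, abbreviation, and FIPS code.

```python
# Illustrative subset only (assumption): just the three states named on the slide.
STATE_NAMES = {"pennsylvania", "michigan", "wisconsin"}
STATE_ABBRS = {"PA", "MI", "WI"}
STATE_FIPS = {"42", "26", "55"}


def looks_like_state(value):
    """Return True if value matches a known state name, abbreviation, or FIPS code."""
    text = str(value).strip()
    return (
        text.lower() in STATE_NAMES
        or text.upper() in STATE_ABBRS
        or text.zfill(2) in STATE_FIPS  # pad single-digit codes such as "6" to "06"
    )


def state_match_fraction(values):
    """Fraction of the column's values that can be read as state identifiers."""
    values = list(values)
    if not values:
        return 0.0
    return sum(looks_like_state(v) for v in values) / len(values)


# ST:val entries from the slide's sample table; "14" is the non-state code.
st_column = ["42", "26", "55", "14"]
print(state_match_fraction(st_column))  # 0.75
```

Because the score is computed over the whole column, a single out-of-class entry such as "14" lowers the fraction without causing the column to be rejected outright.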
Computational Center for Nanotechnology Innovations (CCNI)
* Our test achieved a rate of “checking” triples for heuristic matches of 65k triples/second/process. (The exact numbers are intentionally not mentioned in the slides. Specifically, it took 3m20s to make recommendations for 209M triples using 16 processes on the Opteron blade cluster at the CCNI.)
* Our two heuristics can be summed up as follows:
(1) If the column header (property name) looks like it could be about states, and if at least 75% of the values in that column (object values for that property) could be interpreted as states, then recommend that the column be considered as specifying states (the property has range state).
(2) If the column header (property name) looks like it could be about latitudes (resp. longitudes), and if at least 75% of the values in the column (object values for that property) could be interpreted as latitudes (resp. longitudes), then recommend that the column be considered as specifying latitudes (resp. longitudes) (the property has range latitude (resp. longitude)).
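The two-part shape of these heuristics (a header test followed by a 75% value test) can be sketched as below for the latitude and longitude case. The header keywords, the function names, and the numeric ranges used to decide whether a value "could be interpreted as" a latitude or longitude are all assumptions made for illustration; the notes do not say how the original system performs either test. A state entry would follow the same pattern, pairing header keywords with a state-identifier predicate.

```python
THRESHOLD = 0.75  # the 75% cut-off stated in the notes


def is_latitude(value):
    """Assumed test: parses as a number between -90 and 90."""
    try:
        return -90.0 <= float(value) <= 90.0
    except (TypeError, ValueError):
        return False


def is_longitude(value):
    """Assumed test: parses as a number between -180 and 180."""
    try:
        return -180.0 <= float(value) <= 180.0
    except (TypeError, ValueError):
        return False


# Hypothetical header keywords paired with a value predicate for each range.
RANGES = {
    "latitude": (("lat",), is_latitude),
    "longitude": (("lon", "lng"), is_longitude),
}


def recommend_range(header, values, ranges=RANGES, threshold=THRESHOLD):
    """Return a recommended range name for the column, or None if no heuristic fires."""
    header_lc = header.lower()
    values = list(values)
    if not values:
        return None
    for range_name, (keywords, predicate) in ranges.items():
        # Part 1: the header must look like it is about this kind of value.
        if not any(keyword in header_lc for keyword in keywords):
            continue
        # Part 2: at least `threshold` of the values must be interpretable.
        matched = sum(1 for v in values if predicate(v))
        if matched / len(values) >= threshold:
            return range_name
    return None


print(recommend_range("Latitude", ["40.416944", "42.955383", "43.1698", "38.87025"]))
print(recommend_range("Longitude", ["-75.935", "-85.480074", "-88.01829", "n/a"]))
print(recommend_range("Facility ID", ["40.416944", "42.955383"]))
```

Requiring the header test first keeps numeric columns such as identifiers or measurements from being recommended merely because their values happen to fall in a plausible range.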