The LinkedGov extension for Google Refine




                      @danpaulsmith
What is LinkedGov?
         A community project
               aiming to
      make public data more usable

              Cleaning
           Improving access
              Enriching
               Linking
Data flow

  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components
       (Data is stored as machine-readable data)
         -> Cleaning tasks
         -> Question site
         -> data.linkedgov.org

What is Google Refine?

 “A power tool for working with messy data”
              “cleaning it up”,
              “transforming it”,
               “extending it”,
              “and linking it”


Spreadsheet software

Spreadsheet software      Google Refine

Single-cell editing       Bulk-editing
Create & input data       Use & transform existing data
Document-based            Data-based
                          Allows extensions to be installed




Transposition, multi-valued cells,
  clustering, faceting, filtering




What does the LinkedGov extension
               do?




   Image courtesy of http://download.chip.eu
Typing wizards




Date & time   Measurements   Geolocations   Addresses




Other wizards




Columns to rows   Rows to columns   Blank values   Codes and symbols




Cleaning




Enriching




What a machine understands
               before
                       (CSV, TSV, Excel)

      Column Column Column Column Column Column Column
Row   number   word   number   word   date   number   number
Row   number   word   number   word   date   number   number
Row   number   word   number   word   date   number   number
Row   number   word   number   word   date   number   number
Row   number   word   number   word   date   number   number




What a machine understands
              after
                 (machine-readable format)

          Temp     Name    Gas/hour  Postcode  Date  Water/hour  Height
Building  Celsius  string  kWh       Postcode  date  m3          metres
Building  Celsius  string  kWh       Postcode  date  m3          metres
Building  Celsius  string  kWh       Postcode  date  m3          metres
Building  Celsius  string  kWh       Postcode  date  m3          metres
Building  Celsius  string  kWh       Postcode  date  m3          metres

The power of linking


  Latitude & longitude    NHS geo data
  Postcodes               GP Surgery address data
  Dates                   NHS events data
  Measurements            GP Surgery energy use data

Data flow

  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components
       (Data exists as linked data)
         -> Cleaning tasks
         -> Question site
         -> data.linkedgov.org

Cleaning tasks




Data flow

  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components
       (Data exists as linked data)
         -> Cleaning tasks
         -> Question site
         -> data.linkedgov.org

Question
     site




Data flow

  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components
       (Data exists as linked data)
         -> Cleaning tasks
         -> Question site
         -> data.linkedgov.org

data.linkedgov.org




Feedback & questions



  http://linkedgov.org - Website

  http://wiki.linkedgov.org - Wiki

  @LinkedGov - Twitter

  #linkedgov - IRC (Freenode.net)





Editor's notes

  1. Me. Recent graduate. Have been building interfaces and visualisations for last two years on government projects themed on transparency, big data, open data and linked machine-readable data.This is a presentation on an interface I’ve been building for LinkedGov recently.
  2. When you’re looking for public data – it can be quite hard to find(you need to create accounts, arrive at broken download links, searches fail due to a lack of metadata). Once you’ve found the data – it can be in the wrong format(so you then begin the time consuming process of converting that data into a format you can work with). Then once you’ve started working with the data – you can find it to be mysterious and lacking in explanation. So! LinkedGov makes life easier by:1. Cleaning data (spelling mistakes, formats…). 2. Improving access (format of choice, API’s, high quality metadata). 3. Enriches data – (labels and descriptions for the data at a fine-grained level, uses online vocabularies to describe what the data contains). 4. Links datasets to each other.
  3. The purple block here is Google Refine, with which data is imported. The imported data is then cleaned and enriched by the LinkedGov extension. The final step of the import process is to store the data in LinkedGov’s database in a machine-understandable format. With the data stored, we can then do a few things: 1. Create “cleaning tasks” for the community that help fix errors in the data. 2. Power a “question site” that lets non-technical users form queries against datasets. 3. Power a technical search site aimed at developers that helps them find the data they want.
  4. Free. Open source. Runs in the web browser.
  5. This is what Refine looks like. A little bit like spreadsheet software: you have columns and rows. Though you don’t have any toolbars allowing you to edit the style, insert charts, generate reports… That’s because…
  6. Refine has some key differences from spreadsheet software. Spreadsheet software focuses on single-cell editing and inputting of data; Refine focuses on editing hundreds of rows & columns at the same time. ------ Spreadsheet software is largely for creating and capturing data; Refine is for users to reshape and transform existing data. ------ Spreadsheet software is very document-based, allowing you to style the data, use multiple pages or insert media; Refine is data-based, only allowing you to alter the structure and values of the data. ------ Refine also allows people to build extensions for it!
  7. However, cleaning and transforming data *is* complicated. A non-technical person will get confused. Google Refine is designed for programmers and frequent data-wranglers… It would be useful if the people who create or own the data were able to clean it themselves (they, after all, should know the most about it).
  8. Hides the technical stuff! Instead, asks the user questions about their data… Creates clean, formatted, machine-readable data.
  9. So what are we asking the user? We ask them “can you spot any of these things in your data?”. Why do we ask these things? These four types of data are a good starting ground for linking datasets as they are common across most datasets. --------- If multiple datasets cover the same time span, you can compare them to see if anything connects. If multiple datasets contain the same measurements (e.g. kilowatt-hours), it’s a good starting point for seeing whether any of them relate. If multiple datasets contain latitude and longitude values, you can gather and compare data spatially and begin to plot things on maps, which everybody seems to love. If multiple datasets contain postcodes, and any of them match, you automatically have a number of different types of information for each postcode. --------- These questions come in the form of “wizards”, which lead the user through a small number of tasks: asking them to select a column and specify how the data is currently formatted, and then they press “Done”!
  10. There are also a few other wizards. The “columns to rows” & “rows to columns” wizards help the user reshape their data in a way that helps us store it. These are currently the most problematic wizards with regard to the wording, and to conveying the benefit of or reason behind asking the user to do this. The “blank values” wizard blanks out any values in the data that represent “NULL” values. Each dataset has its own conventions: I’ve come across dashes, full stops and words like “missing” or “none”. The “codes and symbols” wizard asks the user to replace any codes or symbols with what they actually mean. For example, in some NHS data, a column was filled with lots of A’s, C’s, D’s and P’s; after googling about, I found out that they actually meant Active, Closed, Dormant and Proposed. Having their actual meaning present in the data is obviously a lot more helpful to people trying to use it.
  11. So, this is what Refine looks like before the extension has been installed… and after the LinkedGov extension is installed. The main addition to the interface is a new panel called the “Typing” panel, which houses the wizards. So, I’ll just walk you through a couple of wizards… Imagine I have some dates in my data and I click on the Date & Time wizard…
  12. The wizard appears and it asks me to select any columns that contain dates… So I select two columns “open date” and “close date” by clicking on their headers…
  13. We ask the user to specify each date part for each column, as the values could be in any combination: year-month-day, year-month, day-month, month-day… You can see the column contains a day, month and year, but in a mixture of formats. You have words, dashes and slashes as separators… which the user doesn’t have to worry about. They then press “Finish” and the magic happens. The values are all formatted properly using the ISO standard, they are also linked to an online definition and breakdown of that specific date, and finally they are stored as machine-readable linked data.
  14. This is the measurements wizard. I select the “Avg. Temp” column. It then asks me to search for a measurement type by typing into a text box, which searches an online database of measurements. I click “Finish” after I’ve found the right measurement, “Celsius”, and then the measurements are stored using their online definition, which comes bundled with Wikipedia-like information such as alternative names, a description or related measurements (e.g. centimeters, meters, kilometers). So not only is the measurement being stored as an actual measurement, but because we’re using an online database to define it, it comes bundled with a lot of other relevant and potentially useful information for the end user.
  15. Here’s an example of what a machine understands about the data before and after using our extension. After saving a file in spreadsheet software, a machine, at best, only understands that the data is a bunch of columns and rows, containing numbers, words and dates. The ability for machines to understand the data is the magic that powers the question site, the dataset directory and makes linking datasets together a breeze.
  16. After using the wizards, machines are able to understand a little bit more about the data. Now machines have a more in-depth understanding of what the data actually means. The guesswork and inaccuracy are removed when searching and querying the data.
  17. An example of how datasets can link… The red dataset contains latitude/longitudes. The blue dataset contains postcodes and latitude/longitudes. The green dataset contains postcodes and dates. And the orange dataset contains dates and measurements… All four datasets can be linked together by those linkable values. When you’re able to start linking datasets together like this, NEW information is created from a NEWLY acquired sense of UNDERSTANDING of those datasets.
  18. So that’s what the LinkedGov extension is and does. I’ll briefly finish off with what happens to the machine-readable data. Cleaning tasks can now be created for the community – asking them to use their expertise and judgement to correct problematic data. For example, a column may contain cryptic codes that represent types of NHS walk-in-clinics. So a task may be to decode one of these values and replace it with what it actually means.
  19. Here’s a screenshot of an example task. It’s asking the user to try to fix a value that contains two dashes instead of a decimal point. The user has the options to say “Yes I can fix this”, “Refer this to an expert”, “It’s actually fine” etc.
  20. The question site
  21. The question site is aimed at non-technical users. It allows them to form queries to retrieve data, without requiring any knowledge of query languages. They form the question in a human-readable way, using a mixture of selectable question fragments together with free text input. An example: Give me ALL … GP SURGERIES … in … LONDON…
  22. And finally, the data site.
  23. The data site is targeted at the developer community and is powered by the enriching parts of the data, such as: their metadata; what types of data are actually in the datasets (postcodes, dates, measurements); and what they could potentially link to…
  24. So that’s where we are so far. Feedback & questions?
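
Note 10 mentions the “columns to rows” wizard without showing what the reshaping looks like. The following is a minimal Python sketch of that kind of transposition, not the extension's own code. The function name, its parameters and the sample record are all invented for illustration.

```python
def columns_to_rows(rows, id_column, value_columns, name_field, value_field):
    """Turn one wide row into several narrow rows, one per value column.

    Hypothetical helper: the names and signature are assumptions made for
    this sketch, not part of Google Refine or the LinkedGov extension.
    """
    narrow = []
    for row in rows:
        for column in value_columns:
            narrow.append({
                id_column: row[id_column],
                name_field: column,       # the old column header becomes a value
                value_field: row[column],  # the old cell becomes its own row
            })
    return narrow


# Invented sample data: one row per surgery, one column per year.
wide = [{"surgery": "Example Surgery", "2010": 14, "2011": 17}]
long_rows = columns_to_rows(wide, "surgery", ["2010", "2011"], "year", "visits")
print(long_rows)
```

Each year column becomes its own row, so every row describes a single observation, which is generally an easier shape to store and to link.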
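
The “blank values” and “codes and symbols” wizards in note 10 can be pictured as two small substitutions over a column. This is a rough Python sketch under that assumption. The placeholder list and the A/C/D/P meanings come from the note, while the function names and the sample column are made up.

```python
# Placeholders mentioned in note 10; a real dataset may use others.
NULL_MARKERS = {"-", ".", "missing", "none"}

# Code meanings from the NHS example in note 10.
STATUS_CODES = {"A": "Active", "C": "Closed", "D": "Dormant", "P": "Proposed"}


def blank_out(value, markers=NULL_MARKERS):
    """Return None when the value is only a null placeholder, else the trimmed value."""
    text = value.strip()
    if text == "" or text.lower() in markers:
        return None
    return text


def decode(value, codes=STATUS_CODES):
    """Replace a known code with its meaning; leave blanks and unknown values alone."""
    if value is None:
        return None
    return codes.get(value, value)


# Invented sample column mixing codes, placeholders and an unknown code.
raw_status = ["A", " - ", "P", "Missing", "D", "X"]
cleaned = [decode(blank_out(v)) for v in raw_status]
print(cleaned)
```

Unknown codes such as "X" are left as they are rather than guessed at, which is the sort of value a community cleaning task could pick up later.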
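
Note 13 describes dates arriving with words, dashes and slashes as separators and leaving in a single ISO format. A minimal Python sketch of that idea follows, assuming the user has already said the column is day-month-year and that “the ISO standard” means ISO 8601. The format list and function name are assumptions for illustration, and the wizard itself may work quite differently.

```python
from datetime import datetime

# Hypothetical day-month-year layouts; the wizard asks the user for the
# date parts of each column instead of guessing them.
DAY_MONTH_YEAR_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%d %B %Y", "%d %b %Y"]


def to_iso_date(value, formats=DAY_MONTH_YEAR_FORMATS):
    """Return the value as a YYYY-MM-DD string, or None if no layout matches."""
    text = value.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


# Invented sample values for an "open date" column.
open_dates = ["03/11/2011", "3-11-2011", "3 November 2011", "not recorded"]
iso_dates = [to_iso_date(v) for v in open_dates]
print(iso_dates)
```

Month names are parsed according to the current locale, so the word-based layouts assume English month names. Values that match no layout come back as None instead of being silently altered.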
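
The chain of links in note 17 (latitude/longitude, then postcode, then date) can be sketched as three joins on shared values. The Python below is only an illustration of that chain. Every record, field name and the `join` helper are invented, and real linked data would be joined by shared identifiers in a triple store instead of by Python dictionaries.

```python
def join(left, right, key):
    """Inner-join two lists of dicts on a shared key (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# All records below are invented for illustration.
red = [{"latlong": (51.5074, -0.1278), "surgery": "Example GP Surgery"}]
blue = [{"latlong": (51.5074, -0.1278), "postcode": "AB1 2CD"}]
green = [{"postcode": "AB1 2CD", "date": "2011-11-03", "event": "Example NHS event"}]
orange = [{"date": "2011-11-03", "energy_use_kwh": 120.5}]

# red -> blue via latitude/longitude, -> green via postcode, -> orange via date.
# Exact coordinate matching is a simplification; real data would need a tolerance.
linked = join(join(join(red, blue, "latlong"), green, "postcode"), orange, "date")
print(linked)
```

The single linked row carries a surgery name alongside an energy reading, two facts that no individual dataset held together.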