We are surrounded by data
2013-02-06, Toronto Data Science Group




We are surrounded by MESSY data




 - Multiple standards and formats
        Structured vs. unstructured
        Field naming and format vary
 - Human error (misspellings, entry errors, etc.)
 - Non-normalized inputs (free-text entries, the "other" option)
 - Incomplete data (laziness)
 - ...

Lack of




 - Time
 - Skills
 - Software

OpenRefine, the




 - Swiss army knife for data manipulation!

 - glue step between your IT systems




What's OpenRefine?
(formerly Google Refine, formerly Gridworks)




 - A cross-platform web application that runs locally

 - A community-based project hosted on GitHub

 - With two distributions and multiple extensions

 - Something between a spreadsheet and SQL

Three use cases




1. Data Cleaning


2. ETL (Extract Transform Load) Prototyping


3. Data extension (reconciliation & linked data)




#1 Data Cleaning




 - Graphical interface
 - Facet option
 - Cluster similar records
 - Support for three languages:
     - GREL, Jython, Clojure (plus regular expressions)




Facet example
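The screenshot for this slide did not survive. As a rough stand-in, a text facet can be thought of as a count of each distinct value in a column. The sketch below assumes a hypothetical column of city names; `text_facet` is an illustrative helper, not an OpenRefine function.

```python
from collections import Counter


def text_facet(values):
    """Count how often each distinct value occurs, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical column with inconsistent spellings (assumed data).
cities = ["Toronto", "toronto", "Toronto", "Montreal", "Toronto ", "Montréal"]
facet = text_facet(cities)

for value, count in facet:
    # repr() makes trailing spaces visible, which is the point of a facet.
    print(f"{value!r}: {count}")
```

Listing the values this way exposes the variants ("toronto", "Toronto " with a trailing space) that need cleaning.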




Cluster example
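The screenshot for this slide is missing as well. One way to group similar records is key collision: normalize each value to a "fingerprint" and cluster values that share it. The sketch below is a simplified approximation of that idea, not OpenRefine's actual implementation; the sample names are assumed.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that superficial variants share one key."""
    text = value.strip().lower()
    # Fold accented characters to their ASCII base letters.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort and de-duplicate the remaining tokens.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint and keep groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical messy values (assumed data).
names = [
    "Toronto Data Science Group",
    "toronto data science group ",
    "Group, Toronto Data Science",
    "Montréal",
    "Montreal",
    "Ottawa",
]
clusters = cluster(names)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into a single canonical value; "Ottawa" has no variants and is left out.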




#2 ETL Prototyping
(Extract, Transform, Load)




  Extract & Load
  Support:
  - tabular (csv, xls)
  - hierarchical (xml, json)

  Transform
  - Understand your data
  - Test the transformations that need to be done
  - Undo / Redo
  - Export transformations in JSON format
  - Automate using the Python or Ruby extension

History and JSON export
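The screenshot for this slide is not available. To give a feel for the idea, the sketch below parses a small, made-up operation history and lists its steps. The field names (`op`, `description`, `columnName`, and so on) are assumptions about the general shape of an export, not copied from a real project.

```python
import json

# Illustrative history only: structure and field names are assumed.
exported_history = """
[
  {
    "op": "core/text-transform",
    "description": "Trim whitespace in column city",
    "columnName": "city",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column city to City",
    "oldColumnName": "city",
    "newColumnName": "City"
  }
]
"""

operations = json.loads(exported_history)

# Summarize each recorded step as (operation id, human-readable description).
summary = [(step["op"], step["description"]) for step in operations]

for number, (op, description) in enumerate(summary, start=1):
    print(f"{number}. {op}: {description}")
```

Because the history is plain JSON, it can be stored under version control and reviewed like any other prototype of a transformation pipeline.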




#3 Extend your Data
(reconciliation & linked data)




 - Cross between OpenRefine projects (vlookup)
 - Fetch URLs and call web services (API)
 - Reconcile against:
     - RDF files & local SPARQL endpoints
     - Online databases
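The "cross between projects (vlookup)" item amounts to a keyed lookup from one table into another. A minimal sketch in plain Python, assuming two hypothetical tables held as lists of dicts; `build_lookup` and `extend` are illustrative names, not OpenRefine functions.

```python
def build_lookup(rows, key_column):
    """Index rows by the value in key_column; the first match wins."""
    index = {}
    for row in rows:
        index.setdefault(row[key_column], row)
    return index


def extend(rows, lookup, key_column, source_column, target_column):
    """Copy source_column from the matching lookup row into each row."""
    extended = []
    for row in rows:
        match = lookup.get(row[key_column])
        new_row = dict(row)
        # Rows without a match get None, like an empty cell.
        new_row[target_column] = match[source_column] if match else None
        extended.append(new_row)
    return extended


# Hypothetical tables (assumed data).
members = [
    {"name": "Alice", "city": "Toronto"},
    {"name": "Bob", "city": "Montreal"},
    {"name": "Carol", "city": "Ottawa"},
]
cities = [
    {"city": "Toronto", "province": "Ontario"},
    {"city": "Montreal", "province": "Quebec"},
]

lookup = build_lookup(cities, "city")
result = extend(members, lookup, "city", "province", "province")
for row in result:
    print(row)
```

The lookup only works on exact key matches, which is one reason to clean and cluster the key column first.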




Reconciliation example
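The screenshot for this slide is missing. Conceptually, reconciliation matches free-text names against entities with stable identifiers. The sketch below swaps in a simple local fuzzy match using Python's `difflib`; it is not a reconciliation service, and the entity ids, labels, and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def reconcile(name, candidates, threshold=0.8):
    """Return (id, label, score) for candidates similar enough to name."""
    scored = []
    for candidate in candidates:
        score = SequenceMatcher(
            None, name.lower(), candidate["label"].lower()
        ).ratio()
        if score >= threshold:
            scored.append((score, candidate))
    # Best match first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(c["id"], c["label"], round(s, 2)) for s, c in scored]


# Hypothetical reference entities with made-up identifiers.
entities = [
    {"id": "city/1", "label": "Toronto"},
    {"id": "city/2", "label": "Montreal"},
    {"id": "city/3", "label": "Ottawa"},
]

for raw in ["Toronto", "Torono", "Vancouver"]:
    print(raw, "->", reconcile(raw, entities))
```

A misspelled "Torono" still resolves to the Toronto entity, while "Vancouver" returns no candidates and would be left for manual review.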








Thanks!

Martin Magdinier: martin.magdinier@gmail.com, @magdmartin
OpenRefine: http://openrefine.org, @OpenRefine



