SlideShare una empresa de Scribd logo
1 de 17
Descargar para leer sin conexión
refinepro.com - @RefinePro – martin@refinepro.com 1
Enable domain experts explore,
normalize and enrich their data via a
self service data preparation platform
Toronto OpenRefine Meet Up Nov 17, 2015
refinepro.com - @RefinePro – martin@refinepro.com 2
Garbage In – Garbage Out
refinepro.com - @RefinePro – martin@refinepro.com 3
Analytics need
clean data to be
reliable
Legacy data
need to be
migrated to a
new system
Data must be
reconciled
against a master
data set
Data projects needs access to
reliable data quickly
refinepro.com - @RefinePro – martin@refinepro.com 4
Data Processing Pipeline
refinepro.com - @RefinePro – martin@refinepro.com 5
60 to 80% of data analysis
is spent on the process of
cleaning, transformation and integration
Data Processing Pipeline
refinepro.com - @RefinePro – martin@refinepro.com 6
• Duplicate value & Typos
• Multi value cells
• Data in the wrong field
• Missing / Partial Values
• Encoding Errors
• Change format (text, number, date)
• Flat to relational data set
• Schema alignment
• Transpose rows and columns
• Join data-set
• Enrichment from other sources
(MDM, API calls)
Data Quality & Integration &
Is Time Consuming
refinepro.com - @RefinePro – martin@refinepro.com 7
• Which field should it contains?
• What format should it follow?
• What geographical scope should it support?
• Enforce data integrity rules (eg. postal code vs city)?
–
Individual and business unit needs are
unique
What is clean data?
How do you know define a clean address?
refinepro.com - @RefinePro – martin@refinepro.com 8
• New economy of machine learning, predictive and data
enrichment service:
• Geocoding and address cleaning
• Name recognition and extraction
• Churn prediction
• …
Those services come with an API first approach requiring
technical skills
Data Service have an
API first approach
refinepro.com - @RefinePro – martin@refinepro.com 9
DBA
ETL
Data Science
Spreadsheet User
Data Visualization / Interpretation
User Base
Understand
the Data
(Business Skills)
Know How To
Transform Data
(Technical Skills)
Today's data environment
challenge traditional
technologies
Excel doesn't
scale or
automate well
IT can't pace
with the volume
of requests
refinepro.com - @RefinePro – martin@refinepro.com 10
Frequency
- number
of use
case
Profiling
Preparation
Discovery
Data Wrangling
1 32
Sense Making
Data Exploration
Is the data useful?
What Can I do with it?
OpenRefine in the Data Quality
& Integration Pipeline
refinepro.com - @RefinePro – martin@refinepro.com 11
Frequency
- number
of use
case
Profiling
Preparation
Discovery
Data Wrangling
1 32
Personal ETL &
Analysis
Prototype
One time migration
Sense Making
Data Exploration
Is the data useful?
What Can I do with it?
OpenRefine in the Data Quality
& Integration Pipeline
refinepro.com - @RefinePro – martin@refinepro.com 12
Frequency
- number
of use
case
Profiling
Preparation
Discovery
Data Wrangling
1 32
Big Data
Real -Time
Processing
Enterprise ETL
Personal ETL &
Analysis
Prototype
One time migration
Sense Making
Data Exploration
Is the data useful?
What Can I do with it?
OpenRefine in the Data Quality
& Integration Pipeline
refinepro.com - @RefinePro – martin@refinepro.com 13
Understand
the Data
(Business Skills)
Know How To
Transform Data
(Technical Skills)
Frequency
- number
of use
case
Profiling
Preparation
Discovery
Data Wrangling
1 2 3
OpenRefine in the Data Quality
& Integration Pipeline
refinepro.com - @RefinePro – martin@refinepro.com 14
OpenRefine Functionality
XLS, CSV, JSON,
XML Input &
Output Support
Point & Click
Cluster &
Deduplication
Filter &
Sort
Transpose Custom Query
Language
Enrich data via
APIs
Join, Merge
& Reconcile
Split to rows
and columns
Undo /
Redo
refinepro.com - @RefinePro – martin@refinepro.com 15
Training
Cloud & on-
premise
hosting
Integration &
Custom
Development
RefinePro helps teams and
organization to scale OpenRefine
refinepro.com - @RefinePro – martin@refinepro.com 16
Contact and Link
OpenRefine
website: http://openrefine.org
Twitter @OpenRefine
RefinePro
website: http://refinepro.com
Twitter: @RefinePro
Martin Magdinier
twitter: @magdmartin
Linkedin: https://ca.linkedin.com/in/magdinier/en
refinepro.com - @RefinePro – martin@refinepro.com 17
Enable domain experts explore,
normalize and enrich their data via a
self service data preparation platform
Toronto OpenRefine Meet Up Nov 17, 2015

Más contenido relacionado

La actualidad más candente

Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developersISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developersSpazioDati
 
Talis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineTalis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineLeigh Dodds
 
Smarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformSmarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformOntotext
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataOntotext
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandOntotext
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataTrieu Nguyen
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesStitch Fix Algorithms
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarSpazioDati
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APISpazioDati
 
TXDHC OpenRefine Training
TXDHC OpenRefine TrainingTXDHC OpenRefine Training
TXDHC OpenRefine TrainingLiz Grumbach
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandasAkshitaKanther
 
Fast Data processing with RFX
Fast Data processing with RFXFast Data processing with RFX
Fast Data processing with RFXTrieu Nguyen
 
High quality Linked Data generation for librarians
High quality Linked Data generation for librariansHigh quality Linked Data generation for librarians
High quality Linked Data generation for librariansandimou
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Mobcoder
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsOntotext
 

La actualidad más candente (20)

Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developersISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
 
Talis Platform: A Linked Data Engine
Talis Platform: A Linked Data EngineTalis Platform: A Linked Data Engine
Talis Platform: A Linked Data Engine
 
Smarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing PlatformSmarter content with a Dynamic Semantic Publishing Platform
Smarter content with a Dynamic Semantic Publishing Platform
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on Demand
 
Slide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big dataSlide 2 collecting, storing and analyzing big data
Slide 2 collecting, storing and analyzing big data
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch Seminar
 
Using entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion APIUsing entity extraction extension with OpenRefine and Dandelion API
Using entity extraction extension with OpenRefine and Dandelion API
 
Tracking data lineage at Stitch Fix
Tracking data lineage at Stitch FixTracking data lineage at Stitch Fix
Tracking data lineage at Stitch Fix
 
TXDHC OpenRefine Training
TXDHC OpenRefine TrainingTXDHC OpenRefine Training
TXDHC OpenRefine Training
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandas
 
Fast Data processing with RFX
Fast Data processing with RFXFast Data processing with RFX
Fast Data processing with RFX
 
Beautiful Research Data (Structured Data and Open Refine)
Beautiful Research Data (Structured Data and Open Refine)Beautiful Research Data (Structured Data and Open Refine)
Beautiful Research Data (Structured Data and Open Refine)
 
High quality Linked Data generation for librarians
High quality Linked Data generation for librariansHigh quality Linked Data generation for librarians
High quality Linked Data generation for librarians
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got Semantics
 

Destacado

Introduction to OpenRefine
Introduction to OpenRefineIntroduction to OpenRefine
Introduction to OpenRefineHeather Myers
 
20130626 OpenRefine Introduction
20130626 OpenRefine Introduction20130626 OpenRefine Introduction
20130626 OpenRefine IntroductionMartin Magdinier
 
Employing Google Refine to publish Linked Data
Employing Google Refine to publish Linked DataEmploying Google Refine to publish Linked Data
Employing Google Refine to publish Linked DataFadi Maali
 
OpenRefine Class Tutorial
OpenRefine Class TutorialOpenRefine Class Tutorial
OpenRefine Class TutorialAshwin Dinoriya
 
Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)GOKb Project
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryInside Analysis
 
Sources et ressources dans le domaine culturelle
Sources et ressources dans le domaine culturelleSources et ressources dans le domaine culturelle
Sources et ressources dans le domaine culturelleAntoine Courtin
 
BID CE workshop 1 session 10 - presentation - Open Refine
BID CE workshop 1   session 10 - presentation - Open RefineBID CE workshop 1   session 10 - presentation - Open Refine
BID CE workshop 1 session 10 - presentation - Open RefineAlberto González-Talaván
 
Building a successful agile data transformation stack
Building a successful agile data transformation stackBuilding a successful agile data transformation stack
Building a successful agile data transformation stackMartin Magdinier
 
OpenRefine - pomocník líného PPCčkáře
OpenRefine - pomocník líného PPCčkářeOpenRefine - pomocník líného PPCčkáře
OpenRefine - pomocník líného PPCčkářeKarel Rujzl
 
Google Webmaster Tools a SEO - Lukáš Pokorný
Google Webmaster Tools a SEO - Lukáš PokornýGoogle Webmaster Tools a SEO - Lukáš Pokorný
Google Webmaster Tools a SEO - Lukáš PokornýJsmeMarketing
 
Open refine reconciliation service api (dc python 2013_03_05)
Open refine reconciliation service api (dc python 2013_03_05)Open refine reconciliation service api (dc python 2013_03_05)
Open refine reconciliation service api (dc python 2013_03_05)Alison Rowland
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineBertram Ludäscher
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceBertram Ludäscher
 
School Of Data - mapping opencorporates networks using openrefine and Gephi
School Of Data - mapping opencorporates networks using openrefine and GephiSchool Of Data - mapping opencorporates networks using openrefine and Gephi
School Of Data - mapping opencorporates networks using openrefine and GephiTony Hirst
 
OpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for LibrariansOpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for Librarianstfmorris
 

Destacado (20)

20130206 open refine
20130206  open refine20130206  open refine
20130206 open refine
 
Intro to open refine
Intro to open refineIntro to open refine
Intro to open refine
 
Introduction to OpenRefine
Introduction to OpenRefineIntroduction to OpenRefine
Introduction to OpenRefine
 
20130626 OpenRefine Introduction
20130626 OpenRefine Introduction20130626 OpenRefine Introduction
20130626 OpenRefine Introduction
 
Employing Google Refine to publish Linked Data
Employing Google Refine to publish Linked DataEmploying Google Refine to publish Linked Data
Employing Google Refine to publish Linked Data
 
OpenRefine Class Tutorial
OpenRefine Class TutorialOpenRefine Class Tutorial
OpenRefine Class Tutorial
 
Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)Building the Global Open Knowledgebase (ER&L 2013)
Building the Global Open Knowledgebase (ER&L 2013)
 
OpenRefine Tutorial
OpenRefine TutorialOpenRefine Tutorial
OpenRefine Tutorial
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data Discovery
 
Sources et ressources dans le domaine culturelle
Sources et ressources dans le domaine culturelleSources et ressources dans le domaine culturelle
Sources et ressources dans le domaine culturelle
 
BID CE workshop 1 session 10 - presentation - Open Refine
BID CE workshop 1   session 10 - presentation - Open RefineBID CE workshop 1   session 10 - presentation - Open Refine
BID CE workshop 1 session 10 - presentation - Open Refine
 
Building a successful agile data transformation stack
Building a successful agile data transformation stackBuilding a successful agile data transformation stack
Building a successful agile data transformation stack
 
OpenRefine - pomocník líného PPCčkáře
OpenRefine - pomocník líného PPCčkářeOpenRefine - pomocník líného PPCčkáře
OpenRefine - pomocník líného PPCčkáře
 
LOD2 Webinar Series: Zemanta / Open refine
LOD2 Webinar Series: Zemanta / Open refine LOD2 Webinar Series: Zemanta / Open refine
LOD2 Webinar Series: Zemanta / Open refine
 
Google Webmaster Tools a SEO - Lukáš Pokorný
Google Webmaster Tools a SEO - Lukáš PokornýGoogle Webmaster Tools a SEO - Lukáš Pokorný
Google Webmaster Tools a SEO - Lukáš Pokorný
 
Open refine reconciliation service api (dc python 2013_03_05)
Open refine reconciliation service api (dc python 2013_03_05)Open refine reconciliation service api (dc python 2013_03_05)
Open refine reconciliation service api (dc python 2013_03_05)
 
Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefine
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & Provenance
 
School Of Data - mapping opencorporates networks using openrefine and Gephi
School Of Data - mapping opencorporates networks using openrefine and GephiSchool Of Data - mapping opencorporates networks using openrefine and Gephi
School Of Data - mapping opencorporates networks using openrefine and Gephi
 
OpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for LibrariansOpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for Librarians
 

Similar a Toronto OpenRefine MeetUp Nov 2015

Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Precisely
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Precisely
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar Revolution Analytics
 
Arun Mathew Thomas_resume
Arun Mathew Thomas_resumeArun Mathew Thomas_resume
Arun Mathew Thomas_resumeARUN THOMAS
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfData Science Council of America
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...Jürgen Ambrosi
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraFrom Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraMolly Alexander
 
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackBig Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackPrecisely
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataVoltDB
 
Hadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyHadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyDataWorks Summit
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02email2jl
 

Similar a Toronto OpenRefine MeetUp Nov 2015 (20)

Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro...
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
 
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
Engineering Machine Learning Data Pipelines Series: Big Data Quality - Cleans...
 
Joe C
Joe CJoe C
Joe C
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
 
Arun Mathew Thomas_resume
Arun Mathew Thomas_resumeArun Mathew Thomas_resume
Arun Mathew Thomas_resume
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
 
End User Informatics
End User InformaticsEnd User Informatics
End User Informatics
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav MisraFrom Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
From Foundation to Mastery – Building a Mature Analytics Roadmap - Manav Misra
 
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big HaystackBig Data Matching - How to Find Two Similar Needles in a Really Big Haystack
Big Data Matching - How to Find Two Similar Needles in a Really Big Haystack
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
Revolution Analytics Podcast
Revolution Analytics PodcastRevolution Analytics Podcast
Revolution Analytics Podcast
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time Data
 
Hadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyHadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata Company
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
Creatinganext generationbigdataarchitecture-141204150317-conversion-gate02
 

Último

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 

Último (20)

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 

Toronto OpenRefine MeetUp Nov 2015

  • 1. refinepro.com - @RefinePro – martin@refinepro.com 1 Enable domain experts explore, normalize and enrich their data via a self service data preparation platform Toronto OpenRefine Meet Up Nov 17, 2015
  • 2. refinepro.com - @RefinePro – martin@refinepro.com 2 Garbage In – Garbage Out
  • 3. refinepro.com - @RefinePro – martin@refinepro.com 3 Analytics need clean data to be reliable Legacy data need to be migrated to a new system Data must be reconciled against a master data set Data projects needs access to reliable data quickly
  • 4. refinepro.com - @RefinePro – martin@refinepro.com 4 Data Processing Pipeline
  • 5. refinepro.com - @RefinePro – martin@refinepro.com 5 60 to 80% of data analysis is spent on the process of cleaning, transformation and integration Data Processing Pipeline
  • 6. refinepro.com - @RefinePro – martin@refinepro.com 6 • Duplicate value & Typos • Multi value cells • Data in the wrong field • Missing / Partial Values • Encoding Errors • Change format (text, number, date) • Flat to relational data set • Schema alignment • Transpose rows and columns • Join data-set • Enrichment from other sources (MDM, API calls) Data Quality & Integration & Is Time Consuming
  • 7. refinepro.com - @RefinePro – martin@refinepro.com 7 • Which field should it contains? • What format should it follow? • What geographical scope should it support? • Enforce data integrity rules (eg. postal code vs city)? – Individual and business unit needs are unique What is clean data? How do you know define a clean address?
  • 8. refinepro.com - @RefinePro – martin@refinepro.com 8 • New economy of machine learning, predictive and data enrichment service: • Geocoding and address cleaning • Name recognition and extraction • Churn prediction • … Those services come with an API first approach requiring technical skills Data Service have an API first approach
  • 9. refinepro.com - @RefinePro – martin@refinepro.com 9 DBA ETL Data Science Spreadsheet User Data Visualization / Interpretation User Base Understand the Data (Business Skills) Know How To Transform Data (Technical Skills) Today's data environment challenge traditional technologies Excel doesn't scale or automate well IT can't pace with the volume of requests
  • 10. refinepro.com - @RefinePro – martin@refinepro.com 10 Frequency - number of use case Profiling Preparation Discovery Data Wrangling 1 32 Sense Making Data Exploration Is the data useful? What Can I do with it? OpenRefine in the Data Quality & Integration Pipeline
  • 11. refinepro.com - @RefinePro – martin@refinepro.com 11 Frequency - number of use case Profiling Preparation Discovery Data Wrangling 1 32 Personal ETL & Analysis Prototype One time migration Sense Making Data Exploration Is the data useful? What Can I do with it? OpenRefine in the Data Quality & Integration Pipeline
  • 12. refinepro.com - @RefinePro – martin@refinepro.com 12 Frequency - number of use case Profiling Preparation Discovery Data Wrangling 1 32 Big Data Real -Time Processing Enterprise ETL Personal ETL & Analysis Prototype One time migration Sense Making Data Exploration Is the data useful? What Can I do with it? OpenRefine in the Data Quality & Integration Pipeline
  • 13. refinepro.com - @RefinePro – martin@refinepro.com 13 Understand the Data (Business Skills) Know How To Transform Data (Technical Skills) Frequency - number of use case Profiling Preparation Discovery Data Wrangling 1 2 3 OpenRefine in the Data Quality & Integration Pipeline
  • 14. refinepro.com - @RefinePro – martin@refinepro.com 14 OpenRefine Functionality XLS, CSV, JSON, XML Input & Output Support Point & Click Cluster & Deduplication Filter & Sort Transpose Custom Query Language Enrich data via APIs Join, Merge & Reconcile Split to rows and columns Undo / Redo
  • 15. refinepro.com - @RefinePro – martin@refinepro.com 15 Training Cloud & on- premise hosting Integration & Custom Development RefinePro helps teams and organization to scale OpenRefine
  • 16. refinepro.com - @RefinePro – martin@refinepro.com 16 Contact and Link OpenRefine website: http://openrefine.org Twitter @OpenRefine RefinePro website: http://refinepro.com Twitter: @RefinePro Martin Magdinier twitter: @magdmartin Linkedin: https://ca.linkedin.com/in/magdinier/en
  • 17. refinepro.com - @RefinePro – martin@refinepro.com 17 Enable domain experts explore, normalize and enrich their data via a self service data preparation platform Toronto OpenRefine Meet Up Nov 17, 2015