SlideShare a Scribd company logo
1 of 126
Data wrangling with open
source tools
Tony Hirst
Dept of Communication & Systems
The Open University, UK
Premises
“I take data
from wherever I
can get it”
1
“Appropriate
everything”
2
Conversations
with data
3
Visual
Conversations
with data3
(Accession Plot)
@mediaczar
If a picture’s worth a
thousand words,
maybe it should take
as long to read?
Most learning
analytics won’t be
performed by
learning analytics
researchers
How can we help
people fashion
their own tools to
support data
conversations?
Recipes
site:open.ac.uk
Have a
conversation
with the data…
Ask the right
questions…
xkcd.com/1138
Sometimes a question
makes most sense in
the context of
questions previously
asked and answers
previously received
DATA
USERS
Educators
Learners
Planners
Marketers
Policymakers
Researchers
Press
NGOs
“
D
E
V
E
L
O
P
E
R
S
”
Have
dashboard,
so what?
A tools and
issues
based view
DATA
TOOLS
USERS
PROBLEMS
Example – Google Fusion Tables
Fusion Table
https://www.google.com/fusiontables/DataSource?docid=1VKG7iCbFlsEYJzTuQppf4xoIqq1ABxWTdW6O_7o#rows:id=1
http://is.gd/qhuaoA
Walkthrough
http://blog.ouseful.info/2012/11/16/a-quick-look-at-gcsealevel-certificate-awards-market-share-by-examination-board/
http://is.gd/f9YAbG
DATA
TOOLS
USERS
PROBLEMS
Access/obtain data
Make sense of data
Ask specific questions of data
Communicate in a data-centric way
Load data
Clean data
Merge/enrich data
DATA
Issues
TOOLS
DATA
Other
TOOLS
Issues
TOOLS
“Tool based
programming”
A barrier to access
(for the tool user) is
data format
JSON XMLCSVXLS
TSV
.db
HTML
PDF DOCTXT
GLUE LOGIC(Glue code)
=importHTML(URL, “table”, N)
HTML
QUERYABLE
DATA
Try it…
Example Page
http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment
http://is.gd/7Vbg6n
Google Spreadsheets as a database
Explorer
https://views.scraperwiki.com/run/google_spreadsheet_query/
http://is.gd/jiMJoh
Walkthrough
http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/
http://is.gd/qJHihu
=importCSV(URL, N)
HTML
INTERACTIVE
DASHBOARD
Google Charts
Google Chart
Visualization API
https://code.google.com/apis/ajax/playground/
http://is.gd/TTHIUh
Google
Visualisation
API
googleVis
(R)
https://developers.facebook.com/
docs/reference/api/examples/
http://is.gd/7cRnvS
A barrier to access
(for the tool user) is
data shape
A barrier to access
(for the tool user) is
data cleanliness
Questions of
identity
The Open University
Open University
OU
Open Uni
Open University, UK
NORMALISATION/RECONCILIATION
Reconciliation to
a canonical name
and/orto a
unique identifier
A stumbling block
(for the data user)
is data enrichment
A stumbling block
(for the data user)
is joining datasets
A stumbling block
(for the data user)
is joining partially
matched data
Rolling your own
interactive data
exploration tools
R Shiny
Apps
ui.R server.R
RCharts
Many chart tools
do the work for
you if the data is
in the right shape
DATA
TOOLS
USERS PROBLEMS
Justask…
ask.SchoolOfData.org
blog.ouseful.info
@psychemedia

More Related Content

What's hot

Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
Christoph Trattner
 
Cognitive Models in Recommender Systems
Cognitive Models in Recommender SystemsCognitive Models in Recommender Systems
Cognitive Models in Recommender Systems
Christoph Trattner
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 

What's hot (9)

Library Connect Webinar - Calculating sharing metrics: Possible approaches
Library Connect Webinar - Calculating sharing metrics: Possible approaches Library Connect Webinar - Calculating sharing metrics: Possible approaches
Library Connect Webinar - Calculating sharing metrics: Possible approaches
 
Investigating Performance
Investigating PerformanceInvestigating Performance
Investigating Performance
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
Library Connect Webinar - The secret life of articles: From download metrics ...
Library Connect Webinar - The secret life of articles: From download metrics ...Library Connect Webinar - The secret life of articles: From download metrics ...
Library Connect Webinar - The secret life of articles: From download metrics ...
 
Data science as a science
Data science as a scienceData science as a science
Data science as a science
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Cognitive Models in Recommender Systems
Cognitive Models in Recommender SystemsCognitive Models in Recommender Systems
Cognitive Models in Recommender Systems
 
Investigating Performance
Investigating PerformanceInvestigating Performance
Investigating Performance
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 

Viewers also liked (7)

Funding entrepreneurship cj cornell-december3rd 2012
Funding entrepreneurship cj cornell-december3rd 2012Funding entrepreneurship cj cornell-december3rd 2012
Funding entrepreneurship cj cornell-december3rd 2012
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.
 
Practical Crowdfunding for Arizona Entrepreneurs - Fall 2013
Practical Crowdfunding for Arizona Entrepreneurs - Fall 2013Practical Crowdfunding for Arizona Entrepreneurs - Fall 2013
Practical Crowdfunding for Arizona Entrepreneurs - Fall 2013
 
Propel Arizona Crowdfunding Essentials - for NACET
Propel Arizona Crowdfunding Essentials - for NACETPropel Arizona Crowdfunding Essentials - for NACET
Propel Arizona Crowdfunding Essentials - for NACET
 
How My Comic Book Obsession Birthed a New Functional Testing Tool
How My Comic Book Obsession Birthed a New Functional Testing ToolHow My Comic Book Obsession Birthed a New Functional Testing Tool
How My Comic Book Obsession Birthed a New Functional Testing Tool
 
Propel Arizona: Crowdfunding for Communities
Propel Arizona:  Crowdfunding for CommunitiesPropel Arizona:  Crowdfunding for Communities
Propel Arizona: Crowdfunding for Communities
 
Functional Programming With Python (EuroPython 2008)
Functional Programming With Python (EuroPython 2008)Functional Programming With Python (EuroPython 2008)
Functional Programming With Python (EuroPython 2008)
 

Similar to Lasi datawrangling

Educational Transformation with Media
Educational Transformation with MediaEducational Transformation with Media
Educational Transformation with Media
TerryKH2006
 

Similar to Lasi datawrangling (20)

Big Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday LearningBig Data in Learning Analytics - Analytics for Everyday Learning
Big Data in Learning Analytics - Analytics for Everyday Learning
 
Data Models And Details About Open Data
Data Models And Details About Open DataData Models And Details About Open Data
Data Models And Details About Open Data
 
Educational Transformation with Media
Educational Transformation with MediaEducational Transformation with Media
Educational Transformation with Media
 
The evolution of research on social media
The evolution of research on social mediaThe evolution of research on social media
The evolution of research on social media
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
 
Data visualization and digital humanities research
Data visualization and digital humanities researchData visualization and digital humanities research
Data visualization and digital humanities research
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
Strata 2012: Big Data and Bibliometrics
Strata 2012: Big Data and BibliometricsStrata 2012: Big Data and Bibliometrics
Strata 2012: Big Data and Bibliometrics
 
Ebi
EbiEbi
Ebi
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
 
020610
020610020610
020610
 
LuisValeroInterests
LuisValeroInterestsLuisValeroInterests
LuisValeroInterests
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
The crowd and the library
The crowd and the libraryThe crowd and the library
The crowd and the library
 
Using technologies to promote projects
Using technologies to promote projectsUsing technologies to promote projects
Using technologies to promote projects
 
Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...Making our mark: the important role of social scientists in the ‘era of big d...
Making our mark: the important role of social scientists in the ‘era of big d...
 
Bibliotheek & Onderzoek 2.0?
Bibliotheek & Onderzoek 2.0?Bibliotheek & Onderzoek 2.0?
Bibliotheek & Onderzoek 2.0?
 
Big Data and the Future of Publishing
Big Data and the Future of PublishingBig Data and the Future of Publishing
Big Data and the Future of Publishing
 

More from Tony Hirst

Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Tony Hirst
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalism
Tony Hirst
 

More from Tony Hirst (20)

15 in 20 research fiesta
15 in 20 research fiesta15 in 20 research fiesta
15 in 20 research fiesta
 
Dev8d jupyter
Dev8d jupyterDev8d jupyter
Dev8d jupyter
 
Ili 16 robot
Ili 16 robotIli 16 robot
Ili 16 robot
 
Jupyternotebooks ou.pptx
Jupyternotebooks ou.pptxJupyternotebooks ou.pptx
Jupyternotebooks ou.pptx
 
Virtual computing.pptx
Virtual computing.pptxVirtual computing.pptx
Virtual computing.pptx
 
ouseful-parlihacks
ouseful-parlihacksouseful-parlihacks
ouseful-parlihacks
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriate
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriate
 
Robotlab jupyter
Robotlab   jupyterRobotlab   jupyter
Robotlab jupyter
 
Fco open data in half day th-v2
Fco open data in half day  th-v2Fco open data in half day  th-v2
Fco open data in half day th-v2
 
Notes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopNotes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 Workshop
 
Community Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireCommunity Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wire
 
Residential school 2015_robotics_interest
Residential school 2015_robotics_interestResidential school 2015_robotics_interest
Residential school 2015_robotics_interest
 
Data Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXData Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKX
 
Week4
Week4Week4
Week4
 
A Quick Tour of OpenRefine
A Quick Tour of OpenRefineA Quick Tour of OpenRefine
A Quick Tour of OpenRefine
 
Conversations with data
Conversations with dataConversations with data
Conversations with data
 
Data reuse OU workshop bingo
Data reuse OU workshop bingoData reuse OU workshop bingo
Data reuse OU workshop bingo
 
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalism
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 

Lasi datawrangling

Editor's Notes

  1. I am not a journalist, but it seems to me that a large part of your work, and indeed a large part of the work of a scientist or an analyst, is in asking the right questions of a source, and knowing how to frame those questions.The data journalist knows how to ask questions of data.
  2. Also – high incidence of crime around police stations (no location, so police station used as default location); Russell Square as a murder hotspot.
  3. Another nice example of this, and one used by many advocates of data visualisation, is the famous example of Anscombe’s quartet, for sets of two dimensional data with some interesting properties.
  4. For example, many of the “classic” summary statistics for the corresponding columns in these data sets are to all intents and purposes the same.
  5. But when we look at the datasets as a set of scatterplots, we see how the data tells very different stories.
  6. People learn the skills they need, as they need them.