Talk at a Data Journalism BootCamp organised by ICFJ, World Bank Group and African Media Initiative in New Delhi to a group of 60 journalists, coders and social sector folks. Other amazing sessions included those from Govind Ethiraj of IndiaSpend, Andrew from BBC, Parul from Google, Nasr from HacksHacker, Thej from DataMeet and David from Code for Africa. http://delhi.dbootcamp.org/
1. Getting comfortable with data
@ritvvijparrikh, Data Designer, pykih.com
d|Bootcamp, http://delhi.dbootcamp.org, September 5, 2014
2. About me
I help organisations make sense (visual or otherwise) of data.
2005 University: Neural Net based Market Prediction Stock Market Data
2006 Software Developer at Amdocs
2011
Design Lead
Amdocs Sales Team to AT&T
Product Manager at samhita.org
Founded TracksGiving - analytics for charities
2013 Founded pykih - data visualisation
Telecom data for AT&T
Donation data
FirstPost
Journalism++ Cologne / Datawrapper
Microsoft
visual.ly
NarendraModi.in
5. Article in English
WHO does not marvel at the
prospect of India going to the polls?
Starting on April 7th, illiterate
villagers and destitute slum-dwellers
will have an equal say alongside
Mumbai’s millionaires in picking their
government. Almost 815m citizens
are eligible to cast their ballots in
nine phases of voting over five
weeks—the largest collective
democratic act in history.
But who does not also deplore the
fecklessness and venality of India’s
politicians? The country is teeming
with problems, but a decade under a
coalition led by the Congress party
has left it rudderless. Growth…
6. English article on Genomics
Assumption: Unknown domain
Genomics is a discipline in
genetics that applies recombinant
DNA, DNA sequencing methods,
and bioinformatics to sequence,
assemble, and analyze the
function and structure of genomes
(the complete set of DNA within a
single cell of an organism).[1][2]
Advances in genomics have
triggered a revolution in discovery-based
research to understand
even the most complex biological
systems such as brain.[3] The field
includes efforts to determine the
entire DNA sequence of organisms
and fine-scale genetic mapping.
The field also includes studies of
intragenomic phenomena …
10. Objectives
• What is data made up of?
• Data File Formats
• Where is the story-worthy data?
• Data Types
• Properties of Data
• Insights / Recipes for stories
• Data Aggregation
• Basic Spreadsheet Functions
11. About me
Data on Glass Manufacturing Factory Floor in German Language
Unknown domain. Unknown language.
We still modelled the data correctly.
13. Where does data come from?
Human actions / experiences
Wind
14. Where does data come from?
Human actions / experiences | Documented Data
Wind | Documenting
15. Where does data come from?
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
16. Where does data come from?
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
17. Where does data come from?
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
Vote | Election Commission | Political Change
18. Where does data come from?
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
Vote | Election Commission | Political Change
State Dept. Wires | Wikileaks | Backdoor Foreign Policy
19. What has changed?
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
Vote | Election Commission | Political Change
State Dept. Wires | Wikileaks | Backdoor Foreign Policy
20. Technology
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
Vote | Election Commission | Political Change
State Dept. Wires | Wikileaks | Backdoor Foreign Policy
21. Struggling with
Human actions / experiences | Documented Data | Insights
Grasp
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
Vote | Election Commission | Political Change
State Dept. Wires | Wikileaks | Backdoor Foreign Policy
22. Struggling with
Human actions / experiences | Documented Data | Insights
Story
Grasp
Wind | Documenting | Sea Travel
What am I doing | Twitter | Sentiment about Budget
Vote | Election Commission | Political Change
State Dept. Wires | Wikileaks | Backdoor Foreign Policy
23. What is data made up of?
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
Domain: Human context
Meta data: How is it stored
24. Data Comprehension
Human actions / experiences | Documented Data | Insights
Wind | Documenting | Sea Travel
Domain: Human context
Meta data: How is it stored
Grammar of the Data
26. How is the data stored?
A format is a predefined structure in which 1s and 0s are stored so that software can read them.
27. How is the data stored?
Data and Formatting: Designed for Humans
Data: Designed for Machine
28. Machine readable data is for us
Data and Formatting: Designed for Humans
Data: Designed for Machine
Our objective is to discover the story in the data. Formatting will unnecessarily get in the way.
29. Tabular v/s Document data
Designed for Humans
Designed for Machine
Tabular
Document
30. Scraping / API Integration
Designed for Humans
Designed for Machine
Tabular
Scrape / API
Document
New Terms: PDF Scraping. Web Scraping. API Integration.
Developer
31. Machine readable Tabular data formats
Designed for Humans
Designed for Machine
Tabular
Scrape / API
Document
32. * separated values files
The | (pipe) acts as a delimiter, allowing us to identify columns.
New lines help identify rows.
Extend this concept, and you get:
Comma Separated Value files
Pipe Separated Value files
Semicolon Separated Value files
Tab Separated Value files …
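A minimal Python sketch of the delimiter idea above, using the standard csv module. The sample rows and the helper name parse_delimited are invented for illustration; they are not from the talk.

```python
import csv
import io


def parse_delimited(text, delimiter):
    """Split text into rows (by new line) and columns (by delimiter)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample data, invented for illustration only.
pipe_text = "state|year|value\nGoa|2011|10\nKerala|2011|20\n"
comma_text = pipe_text.replace("|", ",")
tab_text = pipe_text.replace("|", "\t")

# Same table, three different delimiters.
pipe_rows = parse_delimited(pipe_text, "|")
comma_rows = parse_delimited(comma_text, ",")
tab_rows = parse_delimited(tab_text, "\t")

for row in pipe_rows:
    print(row)
```

Only the delimiter argument changes between the three calls; the rows and columns that come out are identical.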
33. FYI - Data Formats
Designed for Humans
Designed for Machine
Tabular
Scrape / API
Document
35. Whom was this created for?
Designed for Machine
Document
Designed for Humans
Tabular
36. Whom was this created for?
Designed for Humans
Designed for Machine
Tabular
Document
Horizontal
37. Machine readable data is for us
Data and Formatting: Designed for Humans
Data: Designed for Machine
Our objective is to discover the story in the data. Formatting will unnecessarily get in the way.
Recap
39. Let’s dive into basics
Where to find story-worthy data
Human actions / experiences | Documented Data | Insights
40. Where is all the story worthy data sitting?
• data.gov.in
• RBI.org.in
• mospi.nic.in
• planningcommission.nic.in
• unicef.org/statistics
• indiabudget.nic.in
• ncrb.nic.in
• mha.nic.in
• dise.in
• World Bank
• Oxfam
• IMF
• World Health Organisation
• …
41. It could also be in…
• data.gov.in
• RBI.org.in
• mospi.nic.in
• planningcommission.nic.in
• unicef.org/statistics
• indiabudget.nic.in
• ncrb.nic.in
• mha.nic.in
• dise.in
• World Bank
• Oxfam
• IMF
• World Health Organisation
• …
• Tweets
• Stock Market
• Politicians’ speeches
• Other news articles
• WikiLeaks
• Police FIR reports
• Survey
• Blogs
• Cell phone tower logs
42. Let’s dive into basics
Grammar of the Data
Human actions / experiences | Documented Data | Insights
48. Formatting
Things you do to make the data more human-readable.
Data | Formatted Data
3 | 3%
3.03 | $3.03
34950683 | 3,49,50,683
34950683 | 34.950683 Million
Rounding up | 35 Million
50. Formatting is for presentation purposes only
Stay away from tools that change the stored value instead of only its presentation, e.g. when rounding or applying a currency format.
51. What if Formatting is not used for presentation?
Things you do to make the data more human-readable.
Data | Data type | Formatted Data | Data type
3 | Number | 3% | String
3.03 | Float | $3.03 | String
34950683 | Number | 3,49,50,683 | String
34950683 | Number | 34.950683 Million | String
Rounding up | Number | 35 Million | String
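A small Python sketch of the same point: formatting produces strings, so keep the number and format only when displaying. The variable names and the helper indian_grouping are assumptions made for this example, not part of the talk.

```python
def indian_grouping(n):
    """Format a non-negative integer with Indian digit grouping (3,49,50,683)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


population = 34950683  # stored value: a number

# Presentation only: build display strings, leave the number untouched.
grouped = indian_grouping(population)
as_millions = f"{population / 1_000_000:.6f} Million"
rounded = f"{round(population / 1_000_000)} Million"
price = f"${3.03:.2f}"

# Arithmetic still works on the stored number, not on the strings.
doubled = population * 2

print(grouped, "|", as_millions, "|", rounded, "|", price)
```

Every formatted value here is a string; adding or averaging them would need the original number.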
52. Properties of Data
Quantitative
• Things you ADD, e.g. number of sandwiches
Qualitative
• Things that tell you ATTRIBUTES, e.g. staleness of sandwich, veg or non-veg
53. Properties of Data …
Quantitative
• e.g. number of sandwiches
• Always a number
Qualitative
• e.g. staleness of sandwich, veg or non-veg
• May or may not be a number, e.g. number of days ago it was manufactured
54. Properties of Data …
Quantitative
• e.g. number of sandwiches
• Always a number
• Objective: ADD
Qualitative
• e.g. staleness of sandwich, veg or non-veg
• May or may not be a number, e.g. number of days ago it was manufactured
• Objective: Quality / Health
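A minimal Python sketch of the distinction, assuming a made-up list of sandwich orders: the quantitative field is summed, the qualitative field is used to group.

```python
from collections import defaultdict

# Hypothetical sandwich orders, invented for illustration only.
orders = [
    {"kind": "veg", "count": 3},
    {"kind": "non-veg", "count": 2},
    {"kind": "veg", "count": 5},
]

# Quantitative: a number you can ADD.
total = sum(order["count"] for order in orders)

# Qualitative: an attribute you group by, not something you add.
count_by_kind = defaultdict(int)
for order in orders:
    count_by_kind[order["kind"]] += order["count"]

print(total, dict(count_by_kind))
```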
55. Properties of Data …
Geospatial
Terms
• Countries
• States / Regions
• Districts / Counties
• Taluka
• Cities
• Latitude Longitude
Need for Standardisation
• India = Bharat = Republic of India = Hindustan
Standards
• ISO2 codes (ISO 3166-1 alpha-2)
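A minimal Python sketch of standardisation, assuming a hand-built alias table (COUNTRY_ALIASES and to_iso2 are names made up for this example). It maps the spellings above to the two-letter code IN.

```python
# Hand-built alias table; a real project would need a much fuller list.
COUNTRY_ALIASES = {
    "india": "IN",
    "bharat": "IN",
    "republic of india": "IN",
    "hindustan": "IN",
}


def to_iso2(name):
    """Return the two-letter country code for a name, or None if unknown."""
    return COUNTRY_ALIASES.get(name.strip().lower())


names = ["India", "Bharat", "Republic of India", " hindustan "]
codes = [to_iso2(name) for name in names]
print(codes)
```

Once every spelling resolves to the same code, rows from different sources can be joined on that code.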
56. Properties of Data …
Timeseries
Terms
• Year
• Month - Year
• Date
• Date / Time
• Time
• Day of the Week
• Hour
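A short Python sketch, assuming a single timestamp, of how the terms above are all views of the same value. The date is the day of this talk; the time of day is invented.

```python
from datetime import datetime

# Date of the talk; the 10:30 time of day is made up for illustration.
stamp = datetime(2014, 9, 5, 10, 30)

parts = {
    "year": stamp.year,
    "month_year": (stamp.month, stamp.year),
    "date": stamp.date().isoformat(),
    "date_time": stamp.isoformat(),
    "time": stamp.time().isoformat(),
    "day_of_week": stamp.weekday(),  # Monday is 0, Sunday is 6
    "hour": stamp.hour,
}

for name, value in parts.items():
    print(name, value)
```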
57. Properties of data …
Exercise
Sentiment | Qualitative
Number of tweets | Quantitative
Day | Timeseries
58. Properties of Data …
Source: http://www.bbc.co.uk/news/business-15748696
59. Properties of Data …
Health of Economy | Qualitative
Size of Economy | Quantitative
Countries | Geospatial
Years | Timeseries
Source: http://www.bbc.co.uk/news/business-15748696
60. Properties of Data …
Health of Economy | Qualitative
Size of Economy | Quantitative
Countries | Geospatial
Years | Timeseries
Debt | Relational
61. Properties of Data …
Relational data
Joe - friends - Ram - friends - Zoe
India - exports - US (one arrow each way)
Goa, Pune, B’lore, Hubli (linked cities)
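A minimal Python sketch of relational data stored as a list of pairs, using the friends example from the slide. The helper name neighbours is an assumption for this example.

```python
# Each pair is one relationship: who is friends with whom.
friendships = [("Joe", "Ram"), ("Ram", "Zoe")]


def neighbours(edges, node):
    """Return everyone directly linked to node, in either direction."""
    linked = set()
    for a, b in edges:
        if a == node:
            linked.add(b)
        elif b == node:
            linked.add(a)
    return linked


print(sorted(neighbours(friendships, "Ram")))
print(sorted(neighbours(friendships, "Joe")))
```

The data is about the links between things rather than about the things themselves, which is what makes it relational.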
64. Properties of Data …
Hierarchical data
Source http://www.pykih.com/data-journalism/election-counting-day-app-for-firstpost
65. Properties of Data …
Hierarchical data is any data that has a tree structure.
Journalist
• CEO - VP - Managers - …
• Prime Minister - Cabinet - …
• Country - State - City - Zipcode
66. Properties of Data …
Hierarchical data is any data that has a tree structure.
Journalist
• CEO - VP - Managers - …
• Prime Minister - Cabinet - …
Developer
• Product Hierarchy
• Distribution of funds
• Flow of Ganga into various tributaries
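A minimal Python sketch of a Country - State - City tree as nested dictionaries. The small sample tree and the helper name paths are assumptions chosen for illustration.

```python
# A tiny Country - State - City tree; an empty dict marks a leaf.
tree = {
    "India": {
        "Maharashtra": {"Pune": {}, "Mumbai": {}},
        "Goa": {"Panaji": {}},
    }
}


def paths(node, prefix=()):
    """Return every root-to-leaf path in the tree as a ' - ' joined string."""
    result = []
    for name, children in node.items():
        current = prefix + (name,)
        if children:
            result.extend(paths(children, current))
        else:
            result.append(" - ".join(current))
    return result


for path in paths(tree):
    print(path)
```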
68. I have $10 to spend in a day
Is that a lot? Is that a little?
69. Data when compared makes sense
Everything is relative
Fact: December 2012, iPhone division revenue for the quarter was $24.4 B.
70. Data when compared makes sense
Everything is relative
Fact: December 2012, iPhone division revenue for the quarter was $24.4 B.
Story: December 2012, Microsoft’s entire revenue for the same quarter was $20.9 B.
71. Comparisons must have a baseline
What is the common denominator?
Source: http://www.statista.com/chart/2628/police-firearms-discharges/
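A minimal Python sketch of putting counts on a common denominator, here a rate per 100,000 people. The regions and all numbers are invented for illustration; they are not taken from the chart cited above.

```python
def rate_per_100k(count, population):
    """Express a raw count as a rate per 100,000 people."""
    return count * 100_000 / population


# Hypothetical (count, population) pairs, invented for illustration only.
regions = {
    "Region A": (500, 10_000_000),
    "Region B": (200, 1_000_000),
}

rates = {name: rate_per_100k(c, p) for name, (c, p) in regions.items()}

# Region A has the larger raw count, Region B the larger rate.
highest_count = max(regions, key=lambda name: regions[name][0])
highest_rate = max(rates, key=rates.get)
print(rates, highest_count, highest_rate)
```

The raw count and the rate can point to different regions, which is why the baseline has to be stated.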
72. Let’s dive into basics
Recipes for stories
Human actions / experiences | Documented Data | Insights
73. India gives maximum citizenship to people from ____?
I would assume it is Bangladesh or Nepal.
But since Bangladesh’s base is higher…
it should be Bangladesh.
74. India gives maximum citizenship to people from ____?
Source: http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=1153
76. India gives maximum citizenship to people from ____?
Source: http://factchecker.in/pakistanis-get-maximum-indian-citizenship/
77. What did we do?
Hypothesis Testing
Source: http://factchecker.in/category/fact-check/
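A minimal Python sketch of the guess-then-check pattern. The country labels and counts are placeholders invented for illustration, not the real citizenship figures, and check_hypothesis is a hypothetical helper name.

```python
# Placeholder counts, invented for illustration; not the real figures.
citizenships_granted = {"Country A": 120, "Country B": 300, "Country C": 80}


def check_hypothesis(counts, guess):
    """Compare a guessed top entry against what the data actually shows."""
    actual = max(counts, key=counts.get)
    return guess == actual, actual


confirmed, actual = check_hypothesis(citizenships_granted, "Country A")
print(confirmed, actual)
```

State the guess first, then let the data confirm or overturn it.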
78. Often you have data but no hypothesis…
In such a case, you will explore
the data set to find patterns and insights.
Census Dashboard - http://www.pykih.com/data-journalism/india-census
82. Two is better than one
If you plot crime in UP across last 10 years,
all you get is a LINE chart.
83. Two is better than one
If you plot crime in UP across last 10 years,
all you get is a LINE chart.
+
Political parties ruling UP in same period
=
Story
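A minimal Python sketch of combining two data sets on a shared key (the year). The crime counts and party labels are invented placeholders, not real figures for UP.

```python
from collections import defaultdict

# Placeholder data, invented for illustration only.
crime_by_year = {2010: 100, 2011: 120, 2012: 90, 2013: 95}
party_by_year = {2010: "Party X", 2011: "Party X", 2012: "Party Y", 2013: "Party Y"}

# Join the two data sets on year.
combined = [
    (year, crime_by_year[year], party_by_year[year])
    for year in sorted(crime_by_year)
]

# Average yearly count under each ruling party.
counts_by_party = defaultdict(list)
for _, count, party in combined:
    counts_by_party[party].append(count)

average_by_party = {
    party: sum(counts) / len(counts) for party, counts in counts_by_party.items()
}
print(combined)
print(average_by_party)
```

Either data set alone is a single line or list; joined on year, they allow a comparison across ruling periods.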
86. Data is simply documented human actions / experience.
Focus on understanding the Grammar behind data.
87. Fun fact: The word pykih came to us
in a CAPTCHA. That’s the day we
decided that till we do good work it
does not matter what we are called.
We are at @pykih