DataKind Singapore
DataLearn: Post-DataDive Sharing
23 July 2015
Thanks to our host!
Share photos
& tweets
#DataLearn
#data4good
@DataKindSG
Agenda
1. DataKind Singapore updates
2. DataDive Overview
3. DataDive Sharing: Earth Hour
4. DataDive Sharing: HOME
5. Small group discussions on data
handling best practices (if time allows)
DataKind™ harnesses
the power of data
science in the service of
humanity.
Updates
Project Accelerator coming up on 5 Sept!
- Please help spread the word to any social change organizations you know
who do good work.
- Sign up form here: http://goo.gl/forms/0TbDySVFi7
- Sign up by Friday, Aug 14.
Other data4good stuff
http://unitedwehack.com/
- August 15-16, 24 hour hackathon organized with UN Women
- “A Hackathon to Promote Financial Literacy and Economic Empowerment for
Women Around the Globe.”
- Access to partner APIs
http://blog.datalook.io/openimpact/
- Until August 31
- “DataLook is a directory of reusable data-driven projects for social good. Join
our replication marathon and bring a project to your city.”
What is a DataDive?
Get to know the
participating
organizations
Select and learn
about a problem
and the data
Determine a
specific task or
team that you
can work on
Data Dive in!
Repeat!
Coordinate with the team’s
Data Ambassadors & Project
Managers!
Contribute
→ code
→ analysis
→ final presentation
Results
DataDive Retrospective
- Took place over weekend of 23 - 25 Apr
- More than 70 participants
- 2 non-profit organizations
- Earth Hour
- HOME
- Intro to the orgs on Friday and socialize,
working through Saturday, final
presentations on Sunday at 1pm.
DataDive Key Learnings
- Full involvement from partner orgs is important, and we need to
emphasize this from the very beginning
- Trello will be mandatory to avoid duplication of effort
- Grant access to data on Friday night and help participants to
start setting up data and tools, so that they can start right away
on Saturday morning
- Remind people that final presentations will be given from a shared
Google Presentation, and set a hard deadline for getting content in, so
that there is time to vet it
Tweet Data Analysis - EARTH HOUR
Main Objectives
- Identify influencers on twitter
- Sentiment Analysis
- Word Cloud Analysis
Tweets Analysis Prep - EARTH HOUR
1. Understanding the data
anon_id | created_at | Text No Comma | coordinates | lang | RT_count | fav_count | reply_to_user_id | place
83937561 | 27/03/2015 | RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange Performance #cdnpoli #climate #gls15 http://t.co/0DD7S6oy7h | | en | 152 | 0 | |
84233936 | 28/03/2015 | RT @earthhour: It's not about which country you're from; it's about what planet we're from. Join us for #EarthHour! 28th March; 8.30pm. | | en | 360 | 493 | 84233936 |
80696055 | 27/03/2015 | @nenshi will you be joining the #earthhour party? Plz retweet to encourage lights out? #earthhourcalgary #yourpower http://t.co/68TblYiW2Y | -114.0664914;.... | en | 9 | 4 | | [-114.059111818;....
Tweet Data Analysis Prep - EARTH HOUR
2. Identify preliminary tasks to support other analysis
- Identify which tweets are retweets
- Identify which tweets contain which EH hashtags
- For retweets, identify which user is being retweeted
Tweets Analysis Prep - EARTH HOUR
3. Creating additional variables (I)
- Turn data into a table (tblTweets)
- Create binary variable to identify retweets
... | Text No Comma | ... | is_retweet
... | RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange Performance #cdnpoli #climate #gls15 http://t.co/0DD7S6oy7h | ... | =IF(ISNUMBER(SEARCH("RT @",[@[Text No Comma]])),1,0)
What does the formula do?
→ Check if “RT @” is found in the tweet text
Case 1: String is found
1. SEARCH returns start character of string
2. ISNUMBER evaluates to true as search
returned a number
3. IF returns 1 as isnumber is true
Case 2: String is not found
1. SEARCH returns a #VALUE! error
2. ISNUMBER will evaluate to false as
search did not return a number
3. IF returns 0 as isnumber is false
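For readers who prefer code to spreadsheets, the same check can be sketched in Python. This is only an illustration, not what the team ran during the DataDive; the function name is_retweet is borrowed from the column name. Excel's SEARCH is case-insensitive, so the text is lowercased before the comparison.

```python
def is_retweet(text: str) -> int:
    """Return 1 if the tweet text contains "RT @" (any case), else 0."""
    # Mirrors =IF(ISNUMBER(SEARCH("RT @", text)), 1, 0); SEARCH ignores case.
    return 1 if "rt @" in text.lower() else 0


tweets = [
    "RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange",
    "@nenshi will you be joining the #earthhour party?",
]
flags = [is_retweet(t) for t in tweets]
print(flags)  # [1, 0]
```

Like the formula, this is a plain substring test, so it also flags text where "RT @" appears by accident (for example inside "heart @someone").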
Tweets Analysis Prep - EARTH HOUR
4. Secondary use for the is_retweet variable
- Understand lasting impact of campaign and event
Tweets Analysis Prep - EARTH HOUR
5. Creating additional variables (II)
- The motto of EH was “use your power to change climate change”
- EH # crawled: #EarthHour, #climatechange, #yourpower, #useyourpower
- Create binary variables for each hashtag
... | Text No Comma | ... | EarthHour | ...
... | RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange Performance #cdnpoli #climate #gls15 http://t.co/0DD7S6oy7h | ... | =IF(ISNUMBER(SEARCH("#"&tblTweets[[#Headers],[EarthHour]],[@[Text No Comma]])),1,0) | ...
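A rough Python equivalent of the per-hashtag columns, assuming the four crawled hashtags listed above. The names HASHTAGS and hashtag_flags are made up for this sketch. As in the Excel formula, the match is a case-insensitive substring search on "#" plus the column header.

```python
HASHTAGS = ["EarthHour", "climatechange", "yourpower", "useyourpower"]


def hashtag_flags(text: str) -> dict:
    """Return one 0/1 flag per hashtag, keyed by the hashtag name."""
    lowered = text.lower()
    return {tag: int("#" + tag.lower() in lowered) for tag in HASHTAGS}


row = hashtag_flags(
    "RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for "
    "#ClimateChange Performance #cdnpoli #climate #gls15"
)
print(row)
```

Because the "#" is part of the search string, "#useyourpower" does not set the yourpower flag, and "#climate" alone does not set the climatechange flag.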
Tweets Analysis Prep - EARTH HOUR
6. Secondary use for binary hashtag variables
Tweets Analysis Prep - EARTH HOUR
Main Takeaways
- Coding knowledge (e.g. R / Python) is not required to
contribute during a Data Dive
- Preparatory tasks can yield useful insights as well
- Excel can be helpful but may not be the most suitable
tool for large data files
Identify Influencers - EARTH HOUR
Problem
Identify influencers
Solution
Analyse tweet data to identify most retweeted users
Identify influencers - EARTH HOUR
1. Creating additional variables
- extract which user is being retweeted
... | Text No Comma | ... | is_retweet | original_tweeter
... | RT @earthhour: It's not about which country you're from; it's about what planet we're from. Join us for #EarthHour! 28th March; 8.30pm. | ... | 1 | =IF([@[is_retweet]]=0,"original",RIGHT(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1),LEN(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1))-3)) → @earthhour
Identify influencers - EARTH HOUR
1. Creating additional variables
What does the formula do in this example?
=IF([@[is_retweet]]=0,"original",RIGHT(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1),
LEN(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1))-3))
[@[is_retweet]] = 1
[@[Text No Comma]] = “RT @earthhour: It's not about which country you're from; it's about what planet
we're from. Join us for #EarthHour! 28th March; 8.30pm.”
The formula can be broken down into a few parts:
1. Check if it is a retweet - if it is, go to Point 2, otherwise mark it as “original”
2. Find the first occurrence of “:” in the text and return its position minus 1 (a “:” is appended to the text so that FIND always succeeds)
3. Keep the leftmost [Point 2] characters of the tweet text (here: “RT @earthhour”)
4. From the length of the string in [Point 3], subtract 3 (for “RT ”)
5. Keep the rightmost [Point 4] characters of the string in [Point 3] (here: “@earthhour”)
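The five points above can be sketched in Python as follows. This is an illustrative translation of the Excel formula, not code from the DataDive, and the function name original_tweeter is simply borrowed from the column name. It carries the formula's assumption that a retweet starts with "RT ".

```python
def original_tweeter(text: str, is_retweet: int) -> str:
    """Return the retweeted handle, or "original" for non-retweets."""
    # Point 1: non-retweets are marked "original".
    if is_retweet == 0:
        return "original"
    # Points 2-3: everything before the first ":"; appending ":" guarantees
    # a match even when the text contains no colon.
    prefix = (text + ":").split(":", 1)[0]
    # Points 4-5: drop the first 3 characters ("RT ").
    return prefix[3:]


tweet = ("RT @earthhour: It's not about which country you're from; "
         "it's about what planet we're from.")
print(original_tweeter(tweet, 1))  # @earthhour
```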
Identify influencers - EARTH HOUR
1. Create additional variables
2. Select all tweets in dataset with retweets >=500
3. Check for extraction errors (if re-tweets > followers, manual investigation)
4. Check for parsing errors (if length of text >= 180 char, marked as error)
5. From remaining set: For users who were retweeted, assess profile
information, number of followers, country where available
Identify influencers - EARTH HOUR
6. Outcome: Users who were most retweeted
User who is retweeted | Nr. RT >=500 | Total RT Count | User info | Nr. of Followers | Country / Region
@earthhour | 5 | 46,947 | EarthHour | 143,000 | Global
@LeoDiCaprio | 5 | 7,627 | Leonardo DiCaprio - Actor, WWF Ambassador | 12,800,000 | US
@AstroSamantha | 2 | 2,750 | Sam Cristoforetti - Italian Astronaut on ISS | 510,000 | Italy
Sentiment Analysis - Tweets
WHAT IS SENTIMENT ANALYSIS
● Sentiment analysis aims to determine the attitude of a speaker or a writer
with respect to some topic or the overall contextual polarity of a
document.
WHAT WE USED FOR SENTIMENT ANALYSIS
● We used the Python package VADER, a lexicon and rule-based sentiment
analysis tool that is specifically attuned to sentiments expressed in social
media, and works well on texts from other domains.
● More information on VADER can be found in
○ http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
○ https://github.com/cjhutto/vaderSentiment
● Installed VADER through pip
Sentiment Analysis - Tweets
HOW TO USE VADER IN YOUR CODE
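# Import as used at the time of the DataDive (2015 release of the package).
# Newer releases expose SentimentIntensityAnalyzer().polarity_scores(text) instead.
from vaderSentiment import sentiment as vaderSentiment

sentences = [
    "VADER is smart, handsome, and funny.",  # positive sentence example
    "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
]
for sentence in sentences:
    print(sentence)
    # vs is a dict with the keys 'neg', 'neu', 'pos' and 'compound'
    vs = vaderSentiment(sentence)
    print("\t" + str(vs))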
Sentiment Analysis - Tweets
HOW VADER ANALYZES SOME OF THE INPUTS
VADER is smart, handsome, and funny.
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
VADER is smart, handsome, and funny!
{'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!
{'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
A really bad, horrible book.
{'neg': 0.791, 'neu': 0.209, 'pos': 0.0, 'compound': -0.8211}
Sentiment Analysis - Tweets
HOW VADER ANALYZES SOME OF THE INPUTS
At least it isn't a horrible book.
{'neg': 0.0, 'neu': 0.637, 'pos': 0.363, 'compound': 0.431}
:) and :D
{'neg': 0.0, 'neu': 0.124, 'pos': 0.876, 'compound': 0.7925}
(empty string)
{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
Today sux
{'neg': 0.714, 'neu': 0.286, 'pos': 0.0, 'compound': -0.3612}
Today SUX!
{'neg': 0.779, 'neu': 0.221, 'pos': 0.0, 'compound': -0.5461}
Sentiment Analysis - Tweets
HOW WE PERFORMED SENTIMENT ANALYSIS ON EARTH HOUR DATA
● DEFINE RANGES FOR COMPOUND VALUE TO CATEGORIZE THE
SENTIMENT OF TWEET
VERY NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE AND VERY
POSITIVE
● OPEN FILE
● READ A RECORD
● PARSE AND EXTRACT TWEET
● PASS TWEET TEXT TO VADER METHOD
Sentiment Analysis - Tweets
HOW WE PERFORMED SENTIMENT ANALYSIS ON EARTH HOUR DATA
● PARSE OUTPUT TO EXTRACT COMPOUND
● BASED ON COMPOUND VALUE , DETERMINE SENTIMENT OF THE TWEET
● STORE THE CATEGORICAL VALUE OF SENTIMENT IN A VARIABLE
● ADD A DUMMY VARIABLE WITH VALUE OF 1
● GO TO STEP 3 TILL EOF
● WRITE THE OUTPUT TO A FILE
Sentiment Analysis - Tweets
HOW WE PREPARED CHARTS FOR SENTIMENT ANALYSIS
● OPEN THE OUTPUT FILE CREATED AFTER APPLYING VADER
● READ RECORDS INTO DATAFRAMES OF PANDAS (A powerful Python
data analysis toolkit)
● PERFORM GROUPING (think of as GROUPBY in SQL) AND SUMMARIZE
THE DUMMY VARIABLE
● PRESENT THE OUTPUT BY PIE CHARTS (using Python package
MATPLOTLIB)
Sentiment Analysis - Tweets
1) Using all valid records
- Apply Python package:
-vaderSentiment
- Categorize sentiment scores:
-Very positive: >0.55
-Positive: >=0.10 to <=.54
-Neutral: >= (-0.10) to <=0.09
-Negative: >=-0.55 to <=-0.11
-Very negative: <=-0.56
Input: Cleaned Hashtag Tweets
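A minimal sketch of the categorization and counting steps, using the cut-offs listed above. The function name categorize is hypothetical. The listed ranges leave small gaps (for example between 0.54 and 0.55), so this sketch assumes each category simply extends down to the next lower bound; the team's own code may have handled those edge values differently. Counter stands in here for the pandas grouping of the dummy variable.

```python
from collections import Counter


def categorize(compound: float) -> str:
    """Map a VADER compound score to one of five sentiment categories."""
    if compound > 0.55:
        return "very positive"
    if compound >= 0.10:
        return "positive"
    if compound >= -0.10:
        return "neutral"
    if compound >= -0.55:
        return "negative"
    return "very negative"


# Compound scores taken from the VADER examples shown earlier.
scores = [0.8316, 0.8439, 0.431, 0.0, -0.3612, -0.5461, -0.8211]
counts = Counter(categorize(s) for s in scores)
print(counts)
```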
Sentiment Analysis - Tweets
Breakdowns for Tweets / Re-tweets
Questions?
1. Which keywords are most retweeted?
2. Is there a relation among the topics containing each of the four
keywords of Earth Hour?
3. Which words represent different sentiments?
Data Cleaning
Looking at English tweets.
Tweeted messages include
non-meaningful characters.
# remove retweet entities RT/via
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove <br>
some_txt = gsub("<br>", "", some_txt)
# remove @people
some_txt = gsub("@\\w+", "", some_txt)
# remove html links
some_txt = gsub("http(s?)(://)(.*)[.|/|_](.*)+", "", some_txt)
some_txt = gsub("htt(.*)", "", some_txt)
# replace smiley
some_txt = gsub(": ", " ", some_txt)
Create WordCloud
library(wordcloud)
library(tm)
# import the data
lords <- Corpus(DirSource("./Wordcloud_Input/"))
# transform and prepare the data for the word cloud
lords <- tm_map(lords, stripWhitespace)
lords <- tm_map(lords, content_transformer(tolower))
lords <- tm_map(lords, removeWords, c("amp"))
lords <- tm_map(lords, stemDocument)
# Word cloud graph
wordcloud(lords, scale=c(3,0.3), max.words=100, random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
R library
Load input file
Further text
processing
Graph
options
Word Cloud Analysis
Using Unique Tweets Only
- Remove the four EH hashtags
- Apply R package: wordcloud
- This package works on word-stems
Input: Cleaned Hashtag Tweets
Word Cloud Analysis
Comparison (Unique Tweets only):
#earthhour #climatechange
Putting in Sentiment Analysis
Word Cloud (only unique tweets)
Country assignment to subscribers - EARTH HOUR
Problem
Find out country of ActiveCampaign subscribers
Solution
Merge information from different columns, such as city,
country, lat-long
Parallelized human cleaning process (PHCP)
"id","email","sdate","cdate","city","state_province","zip","country1","country2","country3","location","language","address","age","discover"
"100011","email@0accounts.com","2014-01-26 15:35:28","2014-01-26 02:35:28","","","","Netherlands","","","","","","",""
"100012","email@163.com","2012-11-29 02:01:46","2012-11-28 13:01:46","","","","China","","","","","","",""
"100013","email@student.hcmus.edu.vn","2015-03-21 20:32:55","2015-03-21 07:32:55","ho chi minh","","","Viet Nam","","","","","","",""
"100014","email@gmail.com","2014-03-14 14:59:15","2014-03-14 01:59:15","Coimbatore","","","","","","","","","",""
"100015","email@QQ.COM","2013-09-27 10:25:29","2013-09-26 21:25:29","","","","China, Guangdong Province, Foshan City","","","","","","",""
"100016","email@yahoo.com.tw","2012-11-29 02:41:50","2012-11-28 13:41:50","","","","Taipei","","","","","","",""
"100017","email@gmail.com","2013-03-12 11:36:39","2013-03-11 22:36:39","","","","12`3123","","","","","","",""
ActiveCampaign crawler and conversion from JSON
Result
# of subscribers: 321,704
# of subs. with country: 188,462 → 58.6%
Campaign response rate comparison of users -
EARTH HOUR
Problem
Rank users based on historical campaign response
data
Solution
Use expected open rates to compare users with
different number of campaigns received
Raw data
JSON with user-level campaign response from ActiveCampaign API
Unopen list:
"344": {"subscriberid": "451267", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email@gmail.com"},
"0": {"subscriberid": "439666", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email2@gmail.com"},
"346": {"subscriberid": "451324", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email3@yahoo.com"},
"347": {"subscriberid": "451330", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email4@yahoo.com"}
Open list:
{"open_list": [], "campaign_id": "90"}
{"open_list": [{"times": "304", "tstamp": "2014-03-07 13:49:26", "subscriberid": "395746", "email": "somebody@earthhour.org"}], "campaign_id": "89"}
{"open_list": [], "campaign_id": "20"}
{"open_list": [{"times": "2", "tstamp": "2013-01-22 15:00:20", "subscriberid": "14604", "email": "someone@earthhour.org"}], "campaign_id": "8"}
Link list:
{"0": {"info": [{"subscriberid": "5", "orgname": "", "times": "1", "phone": "", "tstamp": "2015-03-29 22:58:27", "email": "email@gmail.com"}, {"subscriberid": "8", "orgname": "", "times": "1", "phone": "", "tstamp": "2015-03-29 23:03:03", "email": "puikwan.lee@gmail.com"}], "a_unique": "2", "tracked": "1", "link": "https://github.com/DataKind-SG", "a_total": "2", "id": "26", "name": ""}, "result_output": "json", "result_message": "Success: Something is returned", "result_code": 1}}
Open rates
Why is the raw open rate a bad estimate?
Campaign counts
Users’ chance of engagement varies; they receive different numbers of emails
[Figure: timeline from Jan 2014 to Jan 2015 of the campaigns received by Bridget and Alan]
Who is better?
Response rate and campaign count together define how interested a user is
Alan: Response rate = 50%, Campaign count = 2
Bridget: Response rate = 40%, Campaign count = 5
Uncertainty
To make users comparable suppose both users receive the same count
Alan: Response rate = 20-80%, Campaign count = 5 (outcomes of the 3 additional campaigns unknown: ? ? ?)
Bridget: Response rate = 40%, Campaign count = 5
Expected open rate
Calculate expected open rate based on distribution of open rate
Using evidence of open count
We already know Alan opened one e-mail and did not open another
• OpenCount >= 1
• UnopenCount >= 1 → UnopenCount = CampaignCount - OpenCount
Generalize to N campaigns
Expected open rate is weighted average of the conditional expected values
• M maximum campaign count any user received in the dataset
• j number of campaigns that Alan received
Score component examples
Final scoring
0.75 * Click-through score + 0.25 * Open score
Best user
Distribution of score
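The exact expected-open-rate formula did not survive in this transcript, so the sketch below swaps in a simpler, well-known stand-in: the posterior mean of a rate under a uniform prior, (successes + 1) / (campaigns + 2). It is not the weighted average described above, only an illustration of the same idea that a rate based on few campaigns deserves less trust. Only the 0.75 / 0.25 weights come from the slides; the function names are hypothetical.

```python
def smoothed_rate(successes: int, campaigns: int) -> float:
    """Stand-in estimate: posterior mean of the rate under a uniform prior."""
    return (successes + 1) / (campaigns + 2)


def final_score(click_score: float, open_score: float) -> float:
    """Final scoring from the slides: 0.75 * click-through + 0.25 * open."""
    return 0.75 * click_score + 0.25 * open_score


alan_open = smoothed_rate(1, 2)      # opened 1 of 2 campaigns
bridget_open = smoothed_rate(2, 5)   # opened 2 of 5 campaigns
print(alan_open, bridget_open)
print(final_score(click_score=0.2, open_score=alan_open))
```

With this stand-in, a user who opened 1 of 1 campaigns scores about 0.67 rather than 1.0, while a user who opened 9 of 10 scores about 0.83, so the user with more evidence ranks higher.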
Globe Visualisation - EARTH HOUR
Problem
Make a sexy visualisation for eye candy
Solution
Pre-canned javascript package!
Globe Visualisation
http://datakind-sg.github.io/
- Height of each bar is related to number of Twitter followers of Earth Hour,
and color is related to temperature change from 1970’s to 1990’s
- WebGL Globe is used: https://www.chromeexperiments.com/globe
- Essentially, you just need to pass in a JSON formatted array with the
following form:
[lat1, long1, height1, colour1, lat2, long2, height2, colour2, …]
- You can use the code as a template: https://github.com/DataKind-SG/datakind-sg.github.io
Globe Visualisation - Colour
- The temperature data is available from the IPCC: http://www.ipcc-data.org/cgi-bin/ddc_nav/
- This is a (bad) proxy for degree of climate change at each lat/long, and
Earth Hour suggested a broader measure.
- The temp difference between the 1970’s and 1990’s was scaled to be
between 0 and 3 so that blue corresponds to the biggest decrease in
temp during that period, and red corresponds to the biggest increase in
temp.
- As homework for you… change the color map so that there isn’t green in
the middle.
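The rescaling to the 0 to 3 range can be sketched as a plain linear min-max scaling. This assumes the scaling was linear, which the slides do not state; the function name and the sample temperature differences are made up for illustration.

```python
def scale_to_range(values, lo=0.0, hi=3.0):
    """Linearly rescale values so that min maps to lo and max maps to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # All values equal: place everything in the middle of the range.
        return [(lo + hi) / 2 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


temp_diff = [-0.5, 0.0, 0.25, 1.0]  # made-up temperature differences
colours = scale_to_range(temp_diff)
print(colours)
```

The biggest decrease ends up at 0 (blue) and the biggest increase at 3 (red).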
Globe Visualisation - Height
- The height of each bar is ~ log(1 + n), where n is the number of Twitter
followers at the lat/long (with the exact locations rounded to the nearest
quarter degree in order to bucket).
- So the difficult part is finding the lat/long.
- Twitter profiles have free text for the location, and this needs to be
converted into a lat/long.
- Geocoding hack: try to match with given list of:
cities http://download.geonames.org/export/dump/
or countries https://developers.google.com/public-data/docs/canonical/countries_csv
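Once follower locations have a lat/long, the bucketing, bar height and flat array layout described above could be put together roughly like this. The names quarter_degree, globe_series and colour_for are hypothetical, and the follower points are invented; the real code is in the repository linked earlier.

```python
import json
import math
from collections import Counter


def quarter_degree(x: float) -> float:
    """Round a coordinate to the nearest quarter degree."""
    return round(x * 4) / 4


def globe_series(follower_points, colour_for):
    """Build the flat [lat, long, height, colour, ...] list for WebGL Globe."""
    counts = Counter(
        (quarter_degree(lat), quarter_degree(lon)) for lat, lon in follower_points
    )
    flat = []
    for (lat, lon), n in sorted(counts.items()):
        # Height is log(1 + n), where n is the follower count in the bucket.
        flat.extend([lat, lon, math.log(1 + n), colour_for(lat, lon)])
    return flat


points = [(1.3507, 103.9512), (1.30, 103.90), (51.45, -2.58)]  # invented
flat = globe_series(points, colour_for=lambda lat, lon: 1.5)
print(json.dumps(flat))
```

The colour callback would look up the scaled temperature difference for the bucket; a constant is used here to keep the sketch self-contained.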
Geocoding hack
- Code is here https://github.com/oliverxchen/geovis, quasi-pseudo code below.
- We’ll look at what happens to a few examples of inputs:
- "Üt: 10.253083,-67.585859"
- "01.350750086, 103.951196586"
- "Bristol, UK"
- "between sky and earth"
- "CALAMBA, laguna, philippines"
- "Singapura"
- "Xalapa, Veracruz, Mexico"
Geocoding hack
The program is basically just a big loop through all of the free text locations and
applying the following in sequence.
A) standardize the string (change to lower case, replace multiple spaces with single spaces)
- "Üt: 10.253083,-67.585859"
- "01.350750086, 103.951196586"
- "between sky and earth"
- "bristol, uk"
- "calamba, laguna, philippines"
- "singapura"
- "xalapa, veracruz, mexico"
Geocoding hack
B) if the string starts with "Üt:", it is usually followed by an actual lat/long, which can be
used directly
In the example strings,
"Üt: 10.253083,-67.585859" is mapped to [10.253083,-67.585859]
C) split remaining strings by commas
- ["01.350750086", "103.951196586"]
- ["between sky and earth"]
- ["bristol", "uk"]
- ["calamba", "laguna", "philippines"]
- ["singapura"]
- ["xalapa", "veracruz", "mexico"]
Geocoding hack
D) if single string after split and there’s no match yet, try to match with country list
- ["singapura"] is matched to "singapore"
- ["between sky and earth"] is not mapped
E) if two strings after split and there’s no match yet, try to parse to a lat/long:
- ["01.350750086", "103.951196586"] is mapped to [1.350750086, 103.951196586]
- ["bristol", "uk"]: float parse fails
Geocoding hack (cont.)
F) if there isn’t a match yet
try to match zeroth string to list of cities
- ["bristol", "uk"] is mapped to "bristol, united states" (Whoops!)
- ["between sky and earth"] is not mapped
- ["calamba", "laguna", "philippines"] is mapped to "calamba, philippines"
- ["xalapa", "veracruz", "mexico"] is not mapped
Geocoding hack (cont.)
G) if there still isn’t a match yet
try to match the last string to list of countries
- ["between sky and earth"] is not mapped
- ["xalapa", "veracruz", "mexico"] is mapped to "mexico"
H) if still no match, you’re out of luck
- ["between sky and earth"]
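Steps A to H can be sketched as a single function. This is a simplified illustration, not the code from the geovis repository: it uses exact matching against two tiny hypothetical lookup tables (COUNTRIES, CITY_TO_COUNTRY) instead of fuzzy matching against the full GeoNames and country lists, so a spelling such as "singapura" would not match here.

```python
import re

# Hypothetical, tiny reference lists for illustration only.
COUNTRIES = {"singapore", "mexico", "philippines"}
CITY_TO_COUNTRY = {"bristol": "united states", "calamba": "philippines"}


def parse_lat_long(parts):
    """Return (lat, long) if the first two parts parse as floats, else None."""
    try:
        return float(parts[0]), float(parts[1])
    except (ValueError, IndexError):
        return None


def geocode(raw):
    # A) standardize: lower case, collapse runs of whitespace
    s = re.sub(r"\s+", " ", raw.strip().lower())
    # B) an "Üt:" prefix is usually followed by a usable lat/long
    if s.startswith("üt:"):
        coords = parse_lat_long(s[len("üt:"):].split(","))
        if coords:
            return coords
    # C) split by commas
    parts = [p.strip() for p in s.split(",")]
    # D) single string: try the country list
    if len(parts) == 1 and parts[0] in COUNTRIES:
        return parts[0]
    # E) two strings: try to parse as lat/long
    if len(parts) == 2:
        coords = parse_lat_long(parts)
        if coords:
            return coords
    # F) zeroth string against the city list
    if parts[0] in CITY_TO_COUNTRY:
        return parts[0] + ", " + CITY_TO_COUNTRY[parts[0]]
    # G) last string against the country list
    if parts[-1] in COUNTRIES:
        return parts[-1]
    # H) out of luck
    return None


for example in ["Üt: 10.253083,-67.585859", "Bristol, UK", "between sky and earth"]:
    print(example, "->", geocode(example))
```

The sketch reproduces the "Whoops!" case on purpose: "Bristol, UK" resolves to the first Bristol in the city table because the second part of the string is never consulted in step F.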
Geocoding hack (cont.)
- To deal with typos and ‘interesting’ spellings, match to cities and countries is done
using a Jaro-Winkler measure (similar to Levenshtein edit distance, but higher
weight on letters early in the word).
- Largest Jaro-Winkler value is used and needs to be above a threshold to be
considered a match.
- Python package is python-levenshtein
- Other logic to use previous results if standardized strings match
- Many improvements are possible! Eg:
- non-uniqueness of city names is not handled
- splitting on not just commas
- etc.
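The "largest score above a threshold" logic can be illustrated with the standard library. Note the swap: this sketch uses difflib.SequenceMatcher, which is a different similarity measure from Jaro-Winkler, purely so that it runs without extra packages. The threshold of 0.75 and the candidate list are assumptions for the example, not values from the talk.

```python
import difflib


def best_match(name, candidates, threshold=0.75):
    """Return the most similar candidate, or None if below the threshold."""
    if not candidates:
        return None
    scored = [
        (difflib.SequenceMatcher(None, name, c).ratio(), c) for c in candidates
    ]
    score, match = max(scored)
    return match if score >= threshold else None


countries = ["singapore", "mexico", "philippines"]
print(best_match("singapura", countries))
print(best_match("between sky and earth", countries))
```

With these inputs the first call returns "singapore" and the second returns None, mirroring steps D and H above.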
How many cases does HOME receive each year, segmented
by domestic and non-domestic workers?
- Problems faced
- Dates had to be
reformatted
while reading
into R
- Some outlier
cases had to be
removed (e.g.,
year > 2016)
- Demo on how plot
was created
Across nationalities, what’s the proportion of workers placed
by each agency?
- Problems faced
- Some country
data was in free
text and had to
be integrated
- Demo on how plot
was created
HOME: Worker Salary Data
What’s the salary range for each nationality?
● Overall
● Domestic Workers
● Non-domestic Workers
HOME: Worker Salary Data
Key Challenges:
● Missing salary
● Inconsistent units (per hour/day/week/month) or missing units
● Different currencies (SGD, USD, MYR)
● Ranges given (eg 6-8)
● Converting hours/days/weeks to months
How it was handled:
● Using Excel, divide into two columns: values and units
● Standardize to monthly salary in third column
● Discovered on the second day that a filter hadn’t been applied
correctly, so the columns became misaligned… quick fix was
applied, but this should be checked.
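Doing the standardization in code rather than in filtered spreadsheet columns avoids the misalignment problem described above. The sketch below is an illustration only: the conversion factors in UNITS_PER_MONTH (8 hours a day, 26 working days a month) are assumptions made up for this example, ranges such as "6-8" are replaced by their midpoint, and currency conversion is left out.

```python
# Assumed conversion factors (hypothetical): units of work per month.
UNITS_PER_MONTH = {"hour": 8 * 26, "day": 26, "week": 52 / 12, "month": 1}


def monthly_salary(value: str, unit: str):
    """Return the salary per month, or None if value or unit is unusable."""
    if not value or unit not in UNITS_PER_MONTH:
        return None
    try:
        # A range such as "6-8" becomes its midpoint; a single value is kept.
        parts = [float(p) for p in value.split("-")]
    except ValueError:
        return None
    amount = sum(parts) / len(parts)
    return amount * UNITS_PER_MONTH[unit]


print(monthly_salary("6-8", "hour"))
print(monthly_salary("20", "day"))
print(monthly_salary("", "month"))
```

Returning None for missing values or units keeps those records visible for manual review instead of silently turning them into zeros.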
HOME: Worker Salary Data
How did we create this?
HOME: Worker Salary Data
Key Lessons
One graph may not be sufficient to convey
the intended information.
Be careful of unknowingly comparing
apples to oranges.
Postal Code Problem - HOME
- Problem
- Postal Codes were not present in every record.
- Solution
- Use Google maps API, OneMap API, OpenStreetMap API
to map address to Postal Code
Postal Code Problem - HOME
- Method
- Retrieve Postal code using 3 APIs.
- Each API returned more than 1 Postal Code for one
address, as each address could map to different Postal
codes.
- Eg {"postalcode": ["089057", "088542", "079716", "088541", "079911", "079120"], "address": "fuji xerox"}
{"postalcode": ["039801", "039594", "039797", "039596", "039799"], "address": "raffles"}
{"postalcode": ["310031", "310035"], "address": "toa payoh"}
Postal Code Problem - HOME
- Problem
- The 3 APIs may or may not return the same set of Postal
codes.
- Solution
- Use polling method to decide which Postal code to pick.
Polling/Voting Algorithm
1. Collect all zips across all data sources.
2. Weigh each zip by the number of data sources it appears in.
3. Select the highest weighted zip.
4. Select at random if there are multiple highest weighted zips.
5. Sensor integration.
Polling/Voting Algorithm (Precisely)
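1. Let $A := \bigcup_i A_i$ be the set of all unique zips, where $A_i$ is the set of zips from data source $i$.
2. For each $a \in A$, compute the weight $w_a = \sum_i A_i(a)$, where $A_i(a)$ is 1 if $a \in A_i$ and 0 otherwise.
3. Select the zip $a^*$ where $\forall a : w_{a^*} \geq w_a$.
Example:
1. $A_1 := \{44, 34, 26, 17\}$, $A_2 := \{34, 45, 17\}$, $A_3 := \{17\}$
2. $A = \{44, 34, 26, 17, 45\}$
3. $w_{44} = 1$, $w_{34} = 2$, $w_{26} = 1$, $w_{45} = 1$, $w_{17} = 3$
4. $a^* = 17$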
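The voting scheme is short enough to sketch directly. The function name pick_zip is hypothetical and this is not the team's code; each data source contributes at most one vote per zip, and ties are broken at random as described in the informal version of the algorithm.

```python
import random
from collections import Counter


def pick_zip(sources, rng=random):
    """Return the zip found in the most data sources (random on ties)."""
    weights = Counter()
    for zips in sources:
        weights.update(set(zips))  # one vote per source per zip
    if not weights:
        return None
    top = max(weights.values())
    winners = sorted(z for z, w in weights.items() if w == top)
    return rng.choice(winners)


sources = [["44", "34", "26", "17"], ["34", "45", "17"], ["17"]]
print(pick_zip(sources))  # 17
```

Passing a seeded random.Random instance as rng makes the tie-breaking reproducible.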
Issues
1. Due to privacy concerns, we never saw the full address.
2. This means that we have no feel for the data.
3. Potential for errors.
Postal Code Problem - HOME
- Map based visualization
- Problem : Couldn’t find geo data to highlight districts in
Singapore.
- Solution : Find the center lat/long of each district and
show the results with a marker.
- Tools : Used leaflet.js for map visualization.
- Geo data for the map came from OpenStreetMap.
Postal Code Problem - HOME
source - https://www.ura.gov.sg/realEstateIIWeb/resources/misc/list_of_postal_districts.htm
Postal Code Problem - HOME
- Number of abuses per district
Example DC.js plot.
Interactive Data Visualization
Problem:
• HOME may need to do analysis in
future to see if the situation has
improved/changed
Solution:
• Build an interactive data
visualization tool to support self-
serviced investigations
Tools:
• Used DC.js for data visualization
Interactive Data Visualization
5 easy steps to use DC.js
Interactive Data Visualization
Filter by age and salary range
Data anonymization
Problems:
• a lot of sensitive data:
• first order: name, home address, passport number, birthday,
contact number, FIN
• second order: current/previous employer info, case created
by, agency contact info
• HOME data had a lot of free text fields that had various levels of private
information:
• “Do you want me to treat you like <NAME>?!”
• “On <exact date>...”
• “His friend, <NAME>, application….”
Data anonymization
• real anonymization:
• un-anonymized data should not leave its usual working environment
• un-anonymized data should be only handled by authorized users
• this requires a highly portable & easy to use utility:
• python - what about Windows?
• R - don’t get me started…
• compiled CLI utility: so many things can go wrong (apart from which
OS, arch)
browsers are pretty bloated SW products; you can do video editing with it.
https://github.com/DataKind-SG/HOME
Thanks to our supporters!
Data handling best practices
• Break up into small groups for discussions on below topics, and appoint a
spokesperson to tell the larger group your thoughts
• Excel is a versatile tool that many people can use, but it has its
drawbacks. Should future DataLearns cover coding to replace Excel?
Should we discourage using Excel or would that discourage some
people from participating?
• In the heat of the DataDive, it’s easy to forget to document the steps
that were taken in data cleaning and transformations. Any ideas on the
fastest, most painless way to document?
• Open for any suggestions!
DataKind SG sharing of our first DataDive

Más contenido relacionado

La actualidad más candente

How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace How To Become a Data Scientist in Iran Marketplace
How To Become a Data Scientist in Iran Marketplace Mohamadreza Mohtat
 
Data science syllabus
Data science syllabusData science syllabus
Data science syllabusanoop bk
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and AnalyticsSrinath Perera
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data sciencebhavesh lande
 
What is Data Science actually is?
What is Data Science actually is?What is Data Science actually is?
What is Data Science actually is?Rupak Roy
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkBas Geerdink
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceMark West
 
Introduction to Machine Learning
Introduction to Machine Learning Introduction to Machine Learning
Introduction to Machine Learning Rupak Roy
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
The Other 99% of a Data Science Project
The Other 99% of a Data Science ProjectThe Other 99% of a Data Science Project
The Other 99% of a Data Science ProjectEugene Mandel
 
Frontiers of Open Data Science Research
Frontiers of Open Data Science ResearchFrontiers of Open Data Science Research
Frontiers of Open Data Science Researchodsc
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Simplilearn
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceData Science Thailand
 
Exploring What a Typical Data Science Project Looks Like
Exploring What a Typical Data Science Project Looks LikeExploring What a Typical Data Science Project Looks Like
Exploring What a Typical Data Science Project Looks LikeProduct School
 
Cognitive Point of View from World of Watson 2016
Cognitive Point of View from World of Watson 2016Cognitive Point of View from World of Watson 2016
Cognitive Point of View from World of Watson 2016diannepatricia
 
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
DataKind SG sharing of our first DataDive

  • 1. DataKind Singapore DataLearn: Post-DataDive Sharing 23 July 2015 Thanks to our host! Share photos & tweets #DataLearn #data4good @DataKindSG
  • 2. Agenda 1. DataKind Singapore updates 2. DataDive Overview 3. DataDive Sharing: Earth Hour 4. DataDive Sharing: HOME 5. Small group discussions on data handling best practices (if time allows)
  • 3. Agenda 1. DataKind Singapore updates 2. DataDive Overview 3. DataDive Sharing: Earth Hour 4. DataDive Sharing: HOME 5. Data handling best practices
  • 4. DataKind™ harnesses the power of data science in the service of humanity.
  • 5. Updates Project Accelerator coming up on 5 Sept! - Please help spread the word to any social change organizations you know who do good work. - Sign up form here: http://goo.gl/forms/0TbDySVFi7 - Sign up by Friday, Aug 14.
  • 6. Other data4good stuff http://unitedwehack.com/ - August 15-16, 24 hour hackathon organized with UN Women - “A Hackathon to Promote Financial Literacy and Economic Empowerment for Women Around the Globe.” - Access to partner APIs http://blog.datalook.io/openimpact/ - Until August 31 - “DataLook is a directory of reusable data-driven projects for social good. Join our replication marathon and bring a project to your city.”
  • 7. Agenda 1. DataKind Singapore updates 2. DataDive Overview 3. DataDive Sharing: Earth Hour 4. DataDive Sharing: HOME 5. Data handling best practices
  • 8. What is a DataDive?
  • 9. Get to know the participating organizations. Select and learn about a problem and the data. Determine a specific task or team that you can work on. Data Dive in! Repeat! Coordinate with the team’s Data Ambassadors & Project Managers! Contribute results: → code → analysis → final presentation
  • 10. DataDive Retrospective - Took place over weekend of 23 - 25 Apr - More than 70 participants - 2 non-profit organizations - Earth Hour - HOME - Intro to the orgs on Friday and socialize, working through Saturday, final presentations on Sunday at 1pm.
  • 11. DataDive Key Learnings - Full involvement from partner orgs is important, and we need to emphasize this from the very beginning - Trello will be mandatory to avoid duplication of effort - Grant access to data on Friday night and help participants to start setting up data and tools, so that they can start right away on Saturday morning - Remind people that final presentations will be from Google Presentation and set a hard deadline for getting content in, so that there is time to vet
  • 12. Agenda 1. DataKind Singapore updates 2. DataDive Overview 3. DataDive Sharing: Earth Hour 4. DataDive Sharing: HOME 5. Data handling best practices
  • 13. Tweet Data Analysis - EARTH HOUR Main Objectives - Identify influencers on twitter - Sentiment Analysis - Word Cloud Analysis
  • 14. Tweets Analysis Prep - EARTH HOUR 1. Understanding the data
    anon_id | created_at | Text No Comma | coordinates | lang | RT_count | fav_count | reply_to_user_id | place
    83937561 | 27/03/2015 | RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange Performance #cdnpoli #climate #gls15 http://t.co/0DD7S6oy7h | | en | 152 | 0 | |
    84233936 | 28/03/2015 | RT @earthhour: It's not about which country you're from; it's about what planet we're from. Join us for #EarthHour! 28th March; 8.30pm. | | en | 360 | 493 | 84233936 |
    80696055 | 27/03/2015 | @nenshi will you be joining the #earthhour party? Plz retweet to encourage lights out? #earthhourcalgary #yourpower http://t.co/68TblYiW2Y | -114.0664914;.... | en | 9 | 4 | | [-114.059111818;....
  • 15. Tweet Data Analysis Prep - EARTH HOUR 2. Identify preliminary tasks to support other analysis - Identify which tweets are retweets - Identify which tweets contain which EH hashtags - For retweets, identify which user is being retweeted
  • 16. Tweets Analysis Prep - EARTH HOUR 3. Creating additional variables (I) - Turn data into a table (tblTweets) - Create binary variable to identify retweets
      Text No Comma: RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange Performance #cdnpoli #climate #gls15 http://t.co/0DD7S6oy7h
      is_retweet: =IF(ISNUMBER(SEARCH("RT @",[@[Text No Comma]])),1,0)
      What does the formula do? → It checks whether "RT @" is found in the tweet text.
      Case 1: String is found 1. SEARCH returns the start position of the string 2. ISNUMBER evaluates to TRUE because SEARCH returned a number 3. IF returns 1 because ISNUMBER is TRUE
      Case 2: String is not found 1. SEARCH returns a #VALUE! error 2. ISNUMBER evaluates to FALSE because SEARCH did not return a number 3. IF returns 0 because ISNUMBER is FALSE
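For readers who prefer code to spreadsheet formulas, the retweet flag above can be sketched in Python. This is a minimal equivalent, not what the team used at the DataDive; the function name `is_retweet` is only borrowed from the column name, and the match is made case-insensitive because Excel's SEARCH ignores case.

```python
def is_retweet(text: str) -> int:
    """Return 1 if "RT @" occurs anywhere in the tweet text, else 0.

    Mirrors =IF(ISNUMBER(SEARCH("RT @", text)), 1, 0); SEARCH is
    case-insensitive, so the comparison is done on lower-cased text.
    """
    return 1 if "rt @" in text.lower() else 0


print(is_retweet("RT @earthhour: Join us for #EarthHour!"))   # retweet
print(is_retweet("@nenshi will you be joining the party?"))   # not a retweet
```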
  • 17. Tweets Analysis Prep - EARTH HOUR 4. Secondary use for the is_retweet variable - Understand lasting impact of campaign and event
  • 18. Tweets Analysis Prep - EARTH HOUR 5. Creating additional variables (II) - The motto of EH was “use your power to change climate change” - EH hashtags crawled: #EarthHour, #climatechange, #yourpower, #useyourpower - Create binary variables for each hashtag
      Text No Comma: RT @JimHarris: SHAME: Canada Ranks LAST Among OECD Countries for #ClimateChange Performance #cdnpoli #climate #gls15 http://t.co/0DD7S6oy7h
      EarthHour: =IF(ISNUMBER(SEARCH("#"&tblTweets[[#Headers],[EarthHour]],[@[Text No Comma]])),1,0)
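The per-hashtag columns can be sketched the same way in Python. This is an illustrative equivalent of the formula above, assuming the four crawled hashtags as column names; `hashtag_flags` is a hypothetical helper name. Like SEARCH, it does a case-insensitive substring test, so "#earthhourcalgary" also counts as "#EarthHour".

```python
HASHTAGS = ["EarthHour", "climatechange", "yourpower", "useyourpower"]


def hashtag_flags(text: str) -> dict:
    """Return a 0/1 flag per hashtag, using a case-insensitive substring test."""
    lowered = text.lower()
    return {tag: int("#" + tag.lower() in lowered) for tag in HASHTAGS}


tweet = ("@nenshi will you be joining the #earthhour party? Plz retweet to "
         "encourage lights out? #earthhourcalgary #yourpower")
print(hashtag_flags(tweet))
```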
  • 19. Tweets Analysis Prep - EARTH HOUR 6. Secondary use for binary hashtag variables
  • 20. Tweets Analysis Prep - EARTH HOUR Main Takeaways - Coding knowledge (e.g. R / Python) is not required to contribute during a Data Dive - Preparatory tasks can yield useful insights as well - Excel can be helpful but may not be the most suitable tool for large data files
  • 21. Identify Influencers - EARTH HOUR Problem Identify influencers Solution Analyse tweet data to identify most retweeted users
  • 22. Identify influencers - EARTH HOUR 1. Creating additional variables - extract which user is being retweeted
      Text No Comma: RT @earthhour: It's not about which country you're from; it's about what planet we're from. Join us for #EarthHour! 28th March; 8.30pm.
      is_retweet: 1
      original_tweeter: =IF([@[is_retweet]]=0,"original",RIGHT(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1),LEN(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1))-3)) → @earthhour
  • 23. Identify influencers - EARTH HOUR 1. Creating additional variables What does the formula do in this example?
      =IF([@[is_retweet]]=0,"original",RIGHT(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1),LEN(LEFT([@[Text No Comma]],FIND(":",[@[Text No Comma]]&":")-1))-3))
      [@[is_retweet]] = 1
      [@[Text No Comma]] = "RT @earthhour: It's not about which country you're from; it's about what planet we're from. Join us for #EarthHour! 28th March; 8.30pm."
      The formula can be broken down into a few parts: 1. Check if it is a retweet - if it is, go to Point 2, otherwise mark it as "original" 2. Find the first occurrence of ":" in the text and return its position minus 1 (a ":" is appended to the text so that FIND always succeeds) 3. Starting from the left of the tweet text, keep the first [Point 2] characters, here "RT @earthhour" 4. From the length of the string in [Point 3], subtract 3 (for "RT ") 5. Starting from the right of the string in [Point 3], keep the last [Point 4] characters, here "@earthhour"
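The five points above translate almost line for line into Python. The sketch below is an illustrative port of the Excel formula, not code from the DataDive; `original_tweeter` is named after the column, and it assumes, as the formula does, that a retweet starts with "RT ".

```python
def original_tweeter(text: str) -> str:
    """Return the retweeted user's handle, or "original" for non-retweets."""
    if "rt @" not in text.lower():          # Point 1: is it a retweet?
        return "original"
    head = text.split(":", 1)[0]            # Points 2-3: text before the first ":"
    return head[3:]                         # Points 4-5: drop the leading "RT "


tweet = ("RT @earthhour: It's not about which country you're from; "
         "it's about what planet we're from.")
print(original_tweeter(tweet))
```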
  • 24. Identify influencers - EARTH HOUR 1. Create additional variables 2. Select all tweets in dataset with retweets >=500 3. Check for extraction errors (if re-tweets > followers, manual investigation) 4. Check for parsing errors (if length of text >= 180 char, marked as error) 5. From remaining set: For users who were retweeted, assess profile information, number of followers, country where available
  • 25. Identify influencers - EARTH HOUR 6. Outcome: Users who were most retweeted
      Columns: User who is retweeted | Nr. RT >= 500 | Total RT Count | User info | Nr. of Followers | Country / Region
      - @earthhour | 5 | 46,947 | EarthHour | 143,000 | Global
      - @LeoDiCaprio | 5 | 7,627 | Leonardo DiCaprio - Actor, WWF Ambassador | 12,800,000 | US
      - @AstroSamantha | 2 | 2,750 | Sam Cristoforetti - Italian Astronaut on ISS | 510,000 | Italy
  • 26. Sentiment Analysis - Tweets WHAT IS SENTIMENT ANALYSIS ● Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. WHAT WE USED FOR SENTIMENT ANALYSIS ● We used the Python package VADER, a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains. ● More information on VADER can be found in ○ http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf ○ https://github.com/cjhutto/vaderSentiment ● Installed VADER via pip
  • 27. Sentiment Analysis - Tweets HOW TO USE VADER IN YOUR CODE
      # Import path of the 2015 release used at the DataDive; later releases
      # expose SentimentIntensityAnalyzer().polarity_scores(text) instead.
      from vaderSentiment import sentiment as vaderSentiment
      sentences = [
          "VADER is smart, handsome, and funny.",  # positive sentence example
          "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
      ]
      for sentence in sentences:
          vs = vaderSentiment(sentence)  # dict with neg, neu, pos and compound scores
          print(sentence + "\n\t" + str(vs))
  • 28. Sentiment Analysis - Tweets HOW VADER ANALYZES SOME OF THE INPUTS
      VADER is smart, handsome, and funny. → {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
      VADER is smart, handsome, and funny! → {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
      VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!! → {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
      A really bad, horrible book. → {'neg': 0.791, 'neu': 0.209, 'pos': 0.0, 'compound': -0.8211}
  • 29. Sentiment Analysis - Tweets HOW VADER ANALYZES SOME OF THE INPUTS
      At least it isn't a horrible book. → {'neg': 0.0, 'neu': 0.637, 'pos': 0.363, 'compound': 0.431}
      :) and :D → {'neg': 0.0, 'neu': 0.124, 'pos': 0.876, 'compound': 0.7925}
      "" (empty string) → {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
      Today sux → {'neg': 0.714, 'neu': 0.286, 'pos': 0.0, 'compound': -0.3612}
      Today SUX! → {'neg': 0.779, 'neu': 0.221, 'pos': 0.0, 'compound': -0.5461}
  • 30. Sentiment Analysis - Tweets HOW WE PERFORMED SENTIMENT ANALYSIS ON EARTH HOUR DATA
      1. DEFINE RANGES FOR COMPOUND VALUE TO CATEGORIZE THE SENTIMENT OF TWEET: VERY NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE AND VERY POSITIVE
      2. OPEN FILE
      3. READ A RECORD
      4. PARSE AND EXTRACT TWEET
      5. PASS TWEET TEXT TO VADER METHOD
  • 31. Sentiment Analysis - Tweets HOW WE PERFORMED SENTIMENT ANALYSIS ON EARTH HOUR DATA
      6. PARSE OUTPUT TO EXTRACT COMPOUND
      7. BASED ON COMPOUND VALUE, DETERMINE SENTIMENT OF THE TWEET
      8. STORE THE CATEGORICAL VALUE OF SENTIMENT IN A VARIABLE
      9. ADD A DUMMY VARIABLE WITH VALUE OF 1
      10. GO TO STEP 3 TILL EOF
      11. WRITE THE OUTPUT TO A FILE
  • 32. Sentiment Analysis - Tweets HOW WE PREPARED CHARTS FOR SENTIMENT ANALYSIS ● OPEN THE OUTPUT FILE CREATED AFTER APPLYING VADER ● READ RECORDS INTO DATAFRAMES OF PANDAS (A powerful Python data analysis toolkit) ● PERFORM GROUPING (think of as GROUPBY in SQL) AND SUMMARIZE THE DUMMY VARIABLE ● PRESENT THE OUTPUT BY PIE CHARTS (using Python package MATPLOTLIB)
  • 33. Sentiment Analysis - Tweets 1) Using all valid records - Input: Cleaned Hashtag Tweets - Apply Python package: vaderSentiment - Categorize sentiment scores: Very positive: > 0.55; Positive: >= 0.10 to <= 0.54; Neutral: >= -0.10 to <= 0.09; Negative: >= -0.55 to <= -0.11; Very negative: <= -0.56
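The categorization and counting steps can be sketched as follows. This is a hedged illustration rather than the team's script: the ranges on the slide are stated to two decimals and leave exactly 0.55 unassigned, so the sketch assumes scores are rounded to two decimals and that 0.55 counts as very positive. The sample compound scores are the ones shown on the VADER example slides, and the `Counter` stands in for summing the dummy variable per category.

```python
from collections import Counter


def categorize(compound: float) -> str:
    """Map a VADER compound score to one of five sentiment labels."""
    c = round(compound, 2)      # assumption: ranges apply to 2-decimal scores
    if c >= 0.55:               # assumption: 0.55 itself is "very positive"
        return "very positive"
    if c >= 0.10:
        return "positive"
    if c >= -0.10:
        return "neutral"
    if c >= -0.55:
        return "negative"
    return "very negative"


scores = [0.8316, 0.431, 0.0, -0.3612, -0.8211]
labels = [categorize(s) for s in scores]
counts = Counter(labels)        # equivalent to summing a dummy variable of 1
print(counts)
```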
  • 34. Sentiment Analysis - Tweets Breakdowns for Tweets / Re-tweets
  • 35. Questions? 1. Which keywords are most retweeted? 2. Is there a relation among the topics containing each of the four keywords of Earth Hour? 3. Which words represent different sentiments?
  • 36. Data Cleaning Looking at English tweets. Tweeted messages include non-meaningful characters.
      # remove retweet entities RT/via
      some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
      # remove <br>
      some_txt = gsub("<br>", "", some_txt)
      # remove @people
      some_txt = gsub("@\\w+", "", some_txt)
      # remove html links
      some_txt = gsub("http(s?)(://)(.*)[.|/|_](.*)+", "", some_txt)
      some_txt = gsub("htt(.*)", "", some_txt)
      # replace smiley
      some_txt = gsub(": ", " ", some_txt)
  • 37. Create WordCloud
      # R libraries
      library(wordcloud)
      library(tm)
      # import the data (load input file)
      lords <- Corpus(DirSource("./Wordcloud_Input/"))
      # transform and prepare the data for the word cloud (further text processing)
      lords <- tm_map(lords, stripWhitespace)
      lords <- tm_map(lords, content_transformer(tolower))
      lords <- tm_map(lords, removeWords, c("amp"))
      lords <- tm_map(lords, stemDocument)
      # Word cloud graph (graph options)
      wordcloud(lords, scale=c(3,0.3), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
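For comparison, the same kind of cleaning can be sketched with Python's `re` module. This is not a line-for-line port of the R code: the link pattern `http\S+` is a swapped-in assumption that removes only the link token, and a final whitespace clean-up is added. `clean_tweet` is a hypothetical name.

```python
import re


def clean_tweet(text: str) -> str:
    """Strip retweet markers, <br> tags, @mentions and links from a tweet."""
    text = re.sub(r"(RT|via)((?:\b\W*@\w+)+)", "", text)  # retweet entities
    text = text.replace("<br>", "")                        # stray HTML breaks
    text = re.sub(r"@\w+", "", text)                       # remaining @people
    text = re.sub(r"http\S+", "", text)                    # links (assumed pattern)
    return re.sub(r"\s+", " ", text).strip()               # tidy whitespace


print(clean_tweet("RT @earthhour: Join us for #EarthHour! <br> http://t.co/0DD7S6oy7h"))
```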
  • 38. Word Cloud Analysis Using Unique Tweets Only - Remove the four EH hashtags - Apply R package: wordcloud - This package works on word-stems Input: Cleaned Hashtag Tweets
  • 39. Word Cloud Analysis Comparison (Unique Tweets only): #earthhour #climatechange
  • 40. Putting in Sentiment Analysis Word Cloud (only unique tweets)
  • 41. Country assignment to subscribers - EARTH HOUR Problem Find out country of ActiveCampaign subscribers Solution Merge information from different columns, such as city, country, lat-long
  • 42. Parallelized human cleaning process (PHCP) ActiveCampaign crawler and conversion from JSON:
      "id","email","sdate","cdate","city","state_province","zip","country1","country2","country3","location","language","address","age","discover"
      "100011","email@0accounts.com","2014-01-26 15:35:28","2014-01-26 02:35:28","","","","Netherlands","","","","","","",""
      "100012","email@163.com","2012-11-29 02:01:46","2012-11-28 13:01:46","","","","China","","","","","","",""
      "100013","email@student.hcmus.edu.vn","2015-03-21 20:32:55","2015-03-21 07:32:55","ho chi minh","","","Viet Nam","","","","","","",""
      "100014","email@gmail.com","2014-03-14 14:59:15","2014-03-14 01:59:15","Coimbatore","","","","","","","","","",""
      "100015","email@QQ.COM","2013-09-27 10:25:29","2013-09-26 21:25:29","","","","China, Guangdong Province, Foshan City","","","","","","",""
      "100016","email@yahoo.com.tw","2012-11-29 02:41:50","2012-11-28 13:41:50","","","","Taipei","","","","","","",""
      "100017","email@gmail.com","2013-03-12 11:36:39","2013-03-11 22:36:39","","","","12`3123","","","","","","",""
      Result: # of subscribers: 321,704; # of subscribers with country: 188,462 → 58.6%
  • 43. Campaign response rate comparison of users - EARTH HOUR Problem Rank users based on historical campaign response data Solution Use expected open rates to compare users with different number of campaigns received
  • 44. Raw data JSON with user-level campaign response from ActiveCampaign API
      Unopen list:
      "344": {"subscriberid": "451267", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email@gmail.com"},
      "0": {"subscriberid": "439666", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email2@gmail.com"},
      "346": {"subscriberid": "451324", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email3@yahoo.com"},
      "347": {"subscriberid": "451330", "orgname": "", "times": "0", "phone": "", "tstamp": "2014-03-28 00:51:14", "email": "email4@yahoo.com"}
      Open list:
      {"open_list": [], "campaign_id": "90"}
      {"open_list": [{"times": "304", "tstamp": "2014-03-07 13:49:26", "subscriberid": "395746", "email": "somebody@earthhour.org"}], "campaign_id": "89"}
      {"open_list": [], "campaign_id": "20"}
      {"open_list": [{"times": "2", "tstamp": "2013-01-22 15:00:20", "subscriberid": "14604", "email": "someone@earthhour.org"}], "campaign_id": "8"}
      Link list:
      {"0": {"info": [{"subscriberid": "5", "orgname": "", "times": "1", "phone": "", "tstamp": "2015-03-29 22:58:27", "email": "email@gmail.com"}, {"subscriberid": "8", "orgname": "", "times": "1", "phone": "", "tstamp": "2015-03-29 23:03:03", "email": "email2@gmail.com"}], "a_unique": "2", "tracked": "1", "link": "https://github.com/DataKind-SG", "a_total": "2", "id": "26", "name": ""}, "result_output": "json", "result_message": "Success: Something is returned", "result_code": 1}
  • 45. Open rates Why is the raw open rate a bad estimate?
  • 46. Campaign counts Users’ chance of engagement varies; they receive different numbers of emails. (Figure: timeline from Jan 2014 to Jan 2015 showing the campaigns received by Bridget and by Alan.)
  • 47. Who is better? Response rate and campaign count together define how interested a user is. Alan: response rate = 50%, campaign count = 2. Bridget: response rate = 40%, campaign count = 5.
  • 48. Uncertainty To make users comparable, suppose both users receive the same campaign count. Alan: response rate = 20-80%, campaign count = 5 (the outcomes of his three additional campaigns are unknown: ? ? ?). Bridget: response rate = 40%, campaign count = 5.
  • 49. Expected open rate Calculate expected open rate based on distribution of open rate
  • 50. Using evidence of open count We already know Alan opened one e-mail and did not open another • OpenCount >= 1 • UnopenCount >= 1 → UnopenCount = CampaignCount - OpenCount
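The 20-80% range quoted for Alan follows directly from the evidence above, and can be checked with a few lines of Python. This is only a sketch of the bounds, not of the expected-open-rate calculation itself; `open_rate_bounds` is a hypothetical helper, and it assumes each additional campaign is either opened or not.

```python
def open_rate_bounds(open_count: int, campaign_count: int, target_count: int):
    """Lowest and highest possible open rate if the user had received
    target_count campaigns, given what was observed so far."""
    unseen = target_count - campaign_count       # campaigns with unknown outcome
    low = open_count / target_count              # none of the unseen ones opened
    high = (open_count + unseen) / target_count  # all of the unseen ones opened
    return low, high


# Alan opened 1 of 2 campaigns; compare him at Bridget's count of 5.
print(open_rate_bounds(open_count=1, campaign_count=2, target_count=5))
```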
  • 51. Generalize to N campaigns Expected open rate is weighted average of the conditional expected values • M maximum campaign count any user received in the dataset • j number of campaigns that Alan received
  • 53. Final scoring Score = 0.75 * Click-through score + 0.25 * Open score. (Figure: distribution of score, with the best user marked.)
  • 54. Globe Visualisation - EARTH HOUR Problem Make a sexy visualisation for eye candy Solution Pre-canned javascript package!
  • 55. Globe Visualisation http://datakind-sg.github.io/ - Height of each bar is related to number of Twitter followers of Earth Hour, and color is related to temperature change from 1970’s to 1990’s - WebGL Globe is used: https://www.chromeexperiments.com/globe - Essentially, you just need to pass in a JSON formatted array with the following form: [lat1, long1, height1, colour1, lat2, long2, height2, colour2, …] - You can use the code as a template: https://github.com/DataKind-SG/datakind-sg.github.io
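Producing the flat array described above takes only a few lines. The sketch below assumes the points are already available as (lat, long, height, colour) tuples; `to_globe_array` is a hypothetical helper, not part of the WebGL Globe package or the team's repository.

```python
import json


def to_globe_array(points) -> str:
    """Flatten (lat, long, height, colour) tuples into one JSON array string."""
    flat = []
    for lat, lon, height, colour in points:
        flat.extend([lat, lon, height, colour])
    return json.dumps(flat)


print(to_globe_array([(1.25, 103.75, 0.5, 2), (51.5, -2.5, 0.25, 1)]))
```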
  • 56. Globe Visualisation - Colour - The temperature data is available from the IPCC: http://www.ipcc-data.org/cgi-bin/ddc_nav/ - This is a (bad) proxy for degree of climate change at each lat/long, and Earth Hour suggested a broader measure. - The temp difference between the 1970’s and 1990’s was scaled to be between 0 and 3 so that blue corresponds to the biggest decrease in temp during that period, and red corresponds to the biggest increase in temp. - As homework for you… change the color map so that there isn’t green in the middle.
  • 57. Globe Visualisation - Height - The height of each bar is ~ log(1 + n), where n is the number of Twitter followers at the lat/long (with the exact locations rounded to the nearest quarter degree in order to bucket). - So the difficult part is finding the lat/long. - Twitter profiles have free text for the location, and this needs to be converted into a lat/long. - Geocoding hack: try to match with given list of: cities http://download.geonames.org/export/dump/ or countries https://developers.google.com/public-data/docs/canonical/countries_csv
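The bucketing and height calculation can be sketched as below. This is an illustration under assumptions: the proportionality constant in "~ log(1 + n)" is taken to be 1, and `quarter_degree` and `bar_heights` are hypothetical helper names. Python's `round` rounds exact halves to the nearest even integer, which only matters for coordinates lying exactly between two buckets.

```python
import math
from collections import Counter


def quarter_degree(x: float) -> float:
    """Round a coordinate to the nearest quarter degree."""
    return round(x * 4) / 4


def bar_heights(follower_locations) -> dict:
    """Count followers per quarter-degree cell and return log(1 + n) per cell."""
    counts = Counter(
        (quarter_degree(lat), quarter_degree(lon))
        for lat, lon in follower_locations
    )
    return {cell: math.log(1 + n) for cell, n in counts.items()}


heights = bar_heights([(1.3507, 103.9512), (1.29, 103.85), (1.30, 103.80)])
print(heights)
```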
  • 58. Geocoding hack - Code is here https://github.com/oliverxchen/geovis, quasi-pseudo code below. - We’ll look at what happens to a few examples of inputs: - "Üt: 10.253083,-67.585859" - "01.350750086, 103.951196586" - "Bristol, UK" - "between sky and earth" - "CALAMBA, laguna, philippines" - "Singapura" - "Xalapa, Veracruz, Mexico"
  • 59. Geocoding hack The program is basically just a big loop through all of the free text locations and applying the following in sequence. A) standardize the string (change to lower case, replace multiple spaces with single spaces) - "Üt: 10.253083,-67.585859" - "01.350750086, 103.951196586" - "between sky and earth" - "bristol, uk" - "calamba, laguna, philippines" - "singapura" - "xalapa, veracruz, mexico"
  • 60. Geocoding hack B) if the string starts with "Üt:", it is usually followed by an actual lat/long, which can be used directly. In the example strings, "Üt: 10.253083,-67.585859" is mapped to [10.253083,-67.585859] C) split remaining strings by commas - ["01.350750086", "103.951196586"] - ["between sky and earth"] - ["bristol", "uk"] - ["calamba", "laguna", "philippines"] - ["singapura"] - ["xalapa", "veracruz", "mexico"]
  • 61. Geocoding hack D) if single string after split and there’s no match yet, try to match with country list - ["singapura"] is matched to "singapore" - ["between sky and earth"] is not mapped E) if two strings after split and there’s no match yet, try to parse to a lat/long: - ["01.350750086", "103.951196586"] is mapped to [1.350750086, 103.951196586] - ["bristol", "uk"]: float parse fails
  • 62. Geocoding hack (cont.) F) if there isn’t a match yet try to match zeroth string to list of cities - ["bristol", "uk"] is mapped to "bristol, united states" (Whoops!) - ["between sky and earth"] is not mapped - ["calamba", "laguna", "philippines"] is mapped to "calamba, philippines" - ["xalapa", "veracruz", "mexico"] is not mapped
  • 63. Geocoding hack (cont.) G) if there still isn’t a match yet try to match the last string to list of countries - ["between sky and earth"] is not mapped - ["xalapa", "veracruz", "mexico"] is mapped to "mexico" H) if still no match, you’re out of luck - ["between sky and earth"]
  • 64. Geocoding hack (cont.) - To deal with typos and ‘interesting’ spellings, match to cities and countries is done using a Jaro-Winkler measure (similar to Levenshtein edit distance, but higher weight on letters early in the word). - Largest Jaro-Winkler value is used and needs to be above a threshold to be considered a match. - Python package is python-levenshtein - Other logic to use previous results if standardized strings match - Many improvements are possible! Eg: - non-uniqueness of city names is not handled - splitting on not just commas - etc.
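Steps A to H can be condensed into a short, runnable sketch. It is not the code from the geovis repository: the gazetteers are hypothetical miniature stand-ins for the GeoNames and Google lists, the 0.75 threshold is an assumption, and the standard library's `difflib.SequenceMatcher` ratio is swapped in for the Jaro-Winkler measure so that the example has no third-party dependencies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical miniature gazetteers (the real lists are far larger).
COUNTRIES = ["singapore", "mexico", "philippines", "united kingdom"]
CITIES = {"bristol": "united states", "calamba": "philippines"}
THRESHOLD = 0.75  # assumed similarity cut-off


def best_match(name, candidates):
    """Return the most similar candidate, or None if below the threshold."""
    score, winner = max((SequenceMatcher(None, name, c).ratio(), c) for c in candidates)
    return winner if score >= THRESHOLD else None


def geocode(raw):
    text = re.sub(r"\s+", " ", raw.strip().lower())       # A) standardize
    if text.startswith("üt:"):                             # B) embedded lat/long
        lat, lon = text[3:].split(",")
        return [float(lat), float(lon)]
    parts = [p.strip() for p in text.split(",")]           # C) split on commas
    if len(parts) == 1:                                     # D) single string: country
        country = best_match(parts[0], COUNTRIES)
        if country:
            return country
    if len(parts) == 2:                                     # E) two strings: lat/long
        try:
            return [float(parts[0]), float(parts[1])]
        except ValueError:
            pass
    city = best_match(parts[0], CITIES)                     # F) zeroth string: city
    if city:
        return f"{city}, {CITIES[city]}"
    return best_match(parts[-1], COUNTRIES)                 # G) last string, else H) None


for example in ["Üt: 10.253083,-67.585859", "Bristol, UK", "Singapura"]:
    print(example, "->", geocode(example))
```

As on the slides, "Bristol, UK" lands in the wrong country because non-unique city names are not handled.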
  • 65. Agenda 1. DataKind Singapore updates 2. DataDive Overview 3. DataDive Sharing: Earth Hour 4. DataDive Sharing: HOME 5. Data handling best practices
  • 66. How many cases does HOME receive each year, segmented by domestic and non-domestic workers? - Problems faced - Dates had to be reformatted while reading into R - Some outlier cases had to be removed (e.g., year > 2016) - Demo on how plot was created
  • 67. Across nationalities, what’s the proportion of workers placed by each agency? - Problems faced - Some country data was in free text and had to be integrated - Demo on how plot was created
  • 68. HOME: Worker Salary Data What’s the salary range for each nationality? ● Overall ● Domestic Workers ● Non-domestic Workers
  • 69. HOME: Worker Salary Data Key Challenges: ● Missing salary ● Inconsistent units (per hour/day/week/month) or missing units ● Different currencies (SGD, USD, MYR) ● Ranges given (eg 6-8) ● Converting hours/days/weeks to months How it was handled: ● Using Excel, divide into two columns: values and units ● Standardize to monthly salary in third column ● Discovered on the second day that a filter hadn’t been applied correctly, so the columns became misaligned… quick fix was applied, but this should be checked.
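The value/unit standardization done in Excel could also be scripted, which makes the conversion rules explicit and repeatable. The sketch below is illustrative only: the conversion factors (8 hours per day, 26 working days per month) and the use of the midpoint for ranges such as "6-8" are assumptions, not rules stated by the team, and currency conversion is left out. `to_monthly` is a hypothetical helper.

```python
HOURS_PER_DAY = 8     # assumption, not from the slides
DAYS_PER_MONTH = 26   # assumption, not from the slides

PER_MONTH = {
    "month": 1.0,
    "week": 52 / 12,
    "day": DAYS_PER_MONTH,
    "hour": HOURS_PER_DAY * DAYS_PER_MONTH,
}


def to_monthly(value: str, unit: str):
    """Convert a salary value (or range like "6-8") to a monthly figure.

    Returns None when the value or the unit is missing or unknown.
    """
    value = value.strip()
    if not value or unit not in PER_MONTH:
        return None
    low, _, high = value.partition("-")
    amount = (float(low) + float(high)) / 2 if high else float(low)  # midpoint for ranges
    return amount * PER_MONTH[unit]


print(to_monthly("6-8", "hour"))
print(to_monthly("500", "month"))
```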
  • 70. HOME: Worker Salary Data How did we create this?
  • 71. HOME: Worker Salary Data Key Lessons One graph may not be sufficient to convey the intended information. Be careful of unknowingly comparing apples to oranges.
  • 72. Postal Code Problem - HOME - Problem - Postal Codes were not present in every record. - Solution - Use Google maps API, OneMap API, OpenStreetMap API to map address to Postal Code
  • 73. Postal Code Problem - HOME - Method - Retrieve the postal code using 3 APIs. - Each API returned more than 1 postal code for one address, as each address could map to different postal codes. - E.g. {"postalcode": ["089057", "088542", "079716", "088541", "079911", "079120"], "address": "fuji xerox"}, {"postalcode": ["039801", "039594", "039797", "039596", "039799"], "address": "raffles"}, {"postalcode": ["310031", "310035"], "address": "toa payoh"}
  • 77. Postal Code Problem - HOME - Problem - The 3 APIs may or may not return the same set of postal codes. - Solution - Use a polling method to decide which postal code to pick.
  • 78. Polling/Voting Algorithm 1. Collect all zips across all data sources. 2. Weigh each zip by the number of data sources in which it appears. 3. Select the highest-weighted zip. 4. Select randomly if there are multiple highest-weighted zips. 5. Sensor integration.
  • 79. Polling/Voting Algorithm (Precisely) 1. Let $A := \bigcup_i A_i$ be the set of all unique zips, where $A_i$ is the set of zips from data source $i$. 2. For each $a \in A$, compute the weight $w_a = \sum_i \mathbf{1}_{A_i}(a)$, i.e. the number of data sources that contain $a$. 3. Select the zip $a^*$ such that $\forall a : w_{a^*} \ge w_a$. Example: 1. $A_1 := (44, 34, 26, 17)$, $A_2 := (34, 45, 17)$, $A_3 := (17)$ 2. $A = (44, 34, 26, 17, 45)$ 3. $w_{44} = 1$, $w_{34} = 2$, $w_{26} = 1$, $w_{45} = 1$, $w_{17} = 3$ 4. $a^* = 17$
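The voting rule is short enough to sketch in full. This is a minimal illustration of the algorithm as stated on the slides, not the team's implementation; `pick_zip` is a hypothetical name, and ties are broken with `random.choice` as step 4 of the informal description suggests.

```python
import random
from collections import Counter


def pick_zip(sources, rng=random):
    """Pick the zip that appears in the largest number of data sources.

    sources: one list of candidate zips per API / data source.
    """
    # Each source votes at most once per zip (set removes repeats in a source).
    weights = Counter(z for source in sources for z in set(source))
    if not weights:
        return None
    top = max(weights.values())
    winners = [z for z, w in weights.items() if w == top]
    return rng.choice(winners)  # random pick among tied winners


print(pick_zip([[44, 34, 26, 17], [34, 45, 17], [17]]))
```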
  • 80. Issues 1. Due to privacy concerns, we never saw the full address. 2. This means that we have no feel for the data. 3. Potential for errors.
  • 81. Postal Code Problem - HOME - Map based visualization - Problem : Couldn’t find geo data to highlight districts in Singapore. - Solution : Find the center lat/long of each district and show the results with a marker. - Tools : Used leaflet.js for map visualization. - Geo data for the map came from OpenStreetMap.
  • 82. Postal Code Problem - HOME source - https://www.ura.gov.sg/realEstateIIWeb/resources/misc/list_of_postal_districts.htm
  • 83. Postal Code Problem - HOME - Number of abuses per district
  • 84. Example DC.js plot. Interactive Data Visualization Problem: • HOME may need to do analysis in future to see if the situation has improved/changed Solution: • Build an interactive data visualization tool to support self- serviced investigations Tools: • Used DC.js for data visualization
  • 85. Interactive Data Visualization 5 easy steps to use DC.js
  • 86. Interactive Data Visualization Filter by age and salary range
  • 87. Data anonymization Problems: • a lot of sensitive data: • first order: name, home address, passport number, birthday, contact number, FIN • second order: current/previous employer info, case created by, agency contact info • HOME data had a lot of free text fields that had various levels of private information: • “Do you want me to treat you like <NAME>?!” • “On <exact date>...” • “His friend, <NAME>, application….”
  • 88. Data anonymization • real anonymization: • un-anonymized data should not leave its usual working environment • un-anonymized data should only be handled by authorized users • this requires a highly portable & easy to use utility: • python - what about Windows? • R - don’t get me started… • compiled CLI utility: so many things can go wrong (apart from which OS, arch) • browsers are pretty bloated SW products; you can do video editing with them. https://github.com/DataKind-SG/HOME
  • 89. Thanks to our supporters!
  • 90. Agenda 1. DataKind Singapore updates 2. DataDive Overview 3. DataDive Sharing: Earth Hour 4. DataDive Sharing: HOME 5. Data handling best practices
  • 91. Data handling best practices • Break up into small groups for discussions on below topics, and appoint a spokesperson to tell the larger group your thoughts • Excel is a versatile tool that many people can use, but it has its drawbacks. Should future DataLearns cover coding to replace Excel? Should we discourage using Excel or would that discourage some people from participating? • In the heat of the DataDive, it’s easy to forget to document the steps that were taken in data cleaning and transformations. Any ideas on the fastest, most painless way to document? • Open for any suggestions!