The path to be a Data Scientist
Poo Kuan Hoong, Ph.D
Senior Manager Data Science,
Nielsen Malaysia
Disclaimer: The views and opinions expressed in these slides are those of
the author and do not necessarily reflect the official policy or position
of Nielsen Malaysia. The analyses shown in these slides are examples only.
They should not be used in real-world analytic products, as they are based
on very limited and dated open-source information. Assumptions made within
the analyses do not reflect the position of Nielsen Malaysia.
Agenda
• What is a data scientist?
• What kinds of companies employ data scientists?
• What are the key functions of a data scientist?
• What type of work does a data scientist do?
• General Aptitude to be a data scientist
• What skillsets are needed to be a data scientist?
• What is data science?
• Where do I begin?
• MDEC National Big App Challenge 3.0 Knowledge Sharing
Data Scientist
The term "data scientist" has been around for years, and the various
advanced analytics specialties that fall under it are even older.
However, with the recent explosion of data, the term has come to describe
a convergence of disciplines, which has led to its soaring popularity.
What are the job titles?
• Data Scientist
• Data Engineer
• Big Data Engineer
• Machine Learning Scientist
• Business Analytics Specialist
• Data Visualization Developer
• BI Solutions Architect/ BI Specialist
• Operations Research Analyst
• Analytics Manager
• Machine Learning Engineer
• Statistician
• Business Intelligence (BI) Engineer
Why the Global Need?
• Abundance of data
• Availability of affordable compute resources
• Internet of Things (IoT) sensor data
950 Data Analyst (India)
8,411 Data Scientist (US)
808 Data Analyst (UK)
1,188 Data Manager (US)
81 Data Analyst (Australia)
“Malaysia needs 1,500 data scientists by 2020” (The Star, Friday, 24 April 2015)
80 in April 2015; 1,500 by 2020
Key functions of a data scientist
• Devising business strategies from the insights
• Descriptive and predictive analytics
• Data mining and analysis design
• Understanding the business problem
Customer churn - why do customers change operators?
• The top 3 reasons why subscribers change providers:
• They want a new handset
• They believe they pay too much for calls/data
• Providers do not offer additional loyalty benefits
Machine Learning Framework
• Initialization Step: Data Collection, Data Preprocessing, Attributes selection (Attribute 1, Attribute 2, Attribute 3)
• Learn Step: Algorithm, Training Model, Score Model
• Apply Step: Apply Data/Test Data, Predicting Output
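The three steps of this framework can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the customer records, the attribute names (monthly bill, handset age, complaints) and the nearest-centroid algorithm are all hypothetical choices made for the example, not the data or method used in the slides.

```python
# Hypothetical toy records: (monthly_bill, handset_age_months, complaints), churned (1) or not (0).
RECORDS = [
    ((80.0, 30, 3), 1), ((75.0, 28, 2), 1), ((90.0, 36, 4), 1), ((85.0, 26, 3), 1),
    ((30.0, 6, 0), 0), ((35.0, 8, 1), 0), ((25.0, 4, 0), 0), ((40.0, 10, 0), 0),
]

def select_attributes(row, keep):
    """Attribute selection: keep only the chosen column indices."""
    return tuple(row[i] for i in keep)

def fit_scaler(rows):
    """Preprocessing: record the (min, max) of every column."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(row, bounds):
    """Preprocessing: min-max scale one row using the recorded bounds."""
    return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, (lo, hi) in zip(row, bounds))

def train(rows, labels):
    """Learn step: a nearest-centroid model is one mean point per class."""
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        model[label] = tuple(sum(col) / len(members) for col in zip(*members))
    return model

def predict(model, row):
    """Apply step: return the class whose centroid is closest to the row."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], row))

def score(model, rows, labels):
    """Score the model: fraction of rows predicted correctly."""
    return sum(predict(model, r) == l for r, l in zip(rows, labels)) / len(rows)

# Initialization step: collect data, select attributes, split, preprocess.
keep = (0, 1)  # assumed choice: monthly bill and handset age
features = [select_attributes(x, keep) for x, _ in RECORDS]
labels = [y for _, y in RECORDS]
train_idx, test_idx = [0, 1, 2, 4, 5, 6], [3, 7]
bounds = fit_scaler([features[i] for i in train_idx])
X_train = [scale(features[i], bounds) for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [scale(features[i], bounds) for i in test_idx]
y_test = [labels[i] for i in test_idx]

# Learn step: train the model, then score it on held-out test data.
model = train(X_train, y_train)
accuracy = score(model, X_test, y_test)

# Apply step: predict the output for a new, unseen customer.
new_customer = scale(select_attributes((78.0, 27, 1), keep), bounds)
prediction = predict(model, new_customer)
print(accuracy, prediction)
```

The scaler is fitted on the training rows only, so that the test and apply data do not leak into the preprocessing.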
Models comparison
• The receiver operating characteristic (ROC) curve illustrates the
performance of a binary classifier system as its discrimination
threshold is varied.
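As a rough sketch of how such a curve can be traced, the snippet below sweeps the threshold over a set of classifier scores and records the false positive rate and true positive rate at each setting. The labels and scores are made-up values for illustration, and the area under the curve (a common single-number summary for comparing models) is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs,
    one per distinct threshold, from the strictest threshold down."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical true labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(round(area, 3))
```

A model whose curve sits closer to the top-left corner (larger area) separates the two classes better; an area of 0.5 corresponds to random guessing.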
Market Basket Analysis
• Where should detergents be placed in the store to maximize sales?
• Are bleach products purchased when detergents and orange juice are bought together?
• Is cola typically purchased with bananas?
• Does the brand of cola make a difference?
• How are the demographics of the neighbourhood affecting what customers are buying?
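Questions like these are commonly answered with three association-rule measures: support, confidence and lift. The sketch below computes them over a handful of invented transactions; the baskets and item names are assumptions made for the example, not real sales data.

```python
# Hypothetical shopping baskets, one set of items per transaction.
TRANSACTIONS = [
    {"detergent", "bleach", "orange juice"},
    {"detergent", "orange juice"},
    {"cola", "bananas"},
    {"detergent", "bleach"},
    {"cola", "crisps"},
    {"detergent", "orange juice", "bleach"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the share that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to how often the consequent appears anyway.
    Values above 1 suggest the items are bought together more than chance."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Is bleach purchased when detergent and orange juice are bought together?
bleach_conf = confidence({"detergent", "orange juice"}, {"bleach"}, TRANSACTIONS)
bleach_lift = lift({"detergent", "orange juice"}, {"bleach"}, TRANSACTIONS)

# Is cola typically purchased with bananas?
cola_conf = confidence({"cola"}, {"bananas"}, TRANSACTIONS)

print(round(bleach_conf, 3), round(bleach_lift, 3), round(cola_conf, 3))
```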
Data Scientist
• Common sense
• Curious mind
• Clear and simple thinking
• Love of solving puzzles
• Good listening, writing and communication skills
• Maths & Stats
• Business sense
I have 4 red, 18 black and 8 brown socks in my sock drawer. If it is
completely dark and I cannot see the colour of the socks that I am
picking, how many socks do I need to take from the drawer to be sure
that I have at least one pair of socks that are the same colour?
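One way to reason about this puzzle is the pigeonhole principle: in the worst case each sock drawn is a different colour for as long as that is possible, so the guarantee arrives one sock later. The helper below is a small sketch of that reasoning; the function name and its `same` parameter are introduced here for illustration only.

```python
def socks_needed(drawer, same=2):
    """Smallest number of socks to draw blindly so that at least `same`
    of them are guaranteed to share a colour (pigeonhole principle).

    Worst case: draw (same - 1) socks of every colour first, limited by
    how many socks of that colour exist, then one more sock.
    Assumes at least one colour has `same` or more socks.
    """
    worst_case = sum(min(count, same - 1) for count in drawer.values())
    return worst_case + 1

drawer = {"red": 4, "black": 18, "brown": 8}
answer = socks_needed(drawer)
print(answer)  # 4: three colours, so a fourth sock must repeat one of them
```

The counts of each colour do not matter for a single pair; only the number of colours does.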
Data Science
• Data science is an evolutionary step in interdisciplinary fields like
business analysis that incorporate computer science, modeling, statistics,
analytics, and mathematics.
• At its core, data science involves using automated methods to analyze
massive amounts of data and to extract knowledge from them.
• Drawing insight from a piece of data involves understanding how it fits
into the larger picture of an organization.
Massive Open Online Course (MOOC)
• MSC Malaysia MyProCert (SRI) – Data Science Massive Open Online Courses (MOOC)
• The Center of Applied Data Science (MDEC & HRDF)
• Johns Hopkins University – Data Science Specialization
• University of Washington - Data Science at Scale Specialization
• Data Analyst Nanodegree - Udacity
• CSCI E-109 Data Science (Harvard Extension School)
• Machine Learning - Stanford University
BDA Undergraduate & Postgraduate Programme
Undergraduate
• Multimedia University – Bachelor of Computer Science (Data Science Specialization)
• Sunway University - BSc (Hons) Information Systems (Business Analytics)
• Universiti Teknologi Malaysia (UTM), International Islamic University
Malaysia, Monash University, Universiti Teknologi MARA (UiTM) &
Universiti Teknologi PETRONAS (UTP).
Postgraduate
• Big Data Analytics Post Graduate Programme
Kaggle
• Real data sets and real problems, in unprocessed form.
• Going through past competitions is recommended.
• Read the forums for particular competitions to find discussions and
tips/hints that will be useful for solving future problems.
• https://www.kaggle.com/
UC Irvine Machine Learning Repository
• 360 data sets as a service to the machine learning community
http://archive.ics.uci.edu/ml/
Open data
• Open data from various countries
• Malaysia - http://www.data.gov.my/
• Singapore - https://data.gov.sg/
• June 4th – June 5th 2016, Berjaya Times Square
• The themes for AHKL2016 were as follows:
1. Big Data Analytics (powered by MDEC). Access to 65 million rows of real datasets sponsored by iProperty.com Malaysia
2. O2O Commerce (powered by MOLWallet MOLPay)
3. Smart Living (powered by TIME Internet)
PropertySenze
• B2B business model
• Provide machine learning and AI
services to customers
• Visual Search
• Personalized customer experience
BUSINESS MODEL
Big Data becomes Smart Data
1. PropertySenze contracts with property sites and property developers to generate analytics and visual search
2. PropertySenze’s machine learning algorithm lets users search for and buy properties similar to those they see on the sites, in user-generated photos and in user-uploaded images
3. Enhanced search experience and personalized results for users
4. Improved platform that recognizes properties for retrieval purposes or instant purchases
5. Analytics at the fingertips for both buyers and sellers
6. Improved user experience that leads to more engagement and sale transactions
7. PropertySenze verifies all transactions and charges commission fees every month
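The slides do not say how the visual search works internally. A common approach to "find similar items from an image" is to turn each image into a numeric feature vector and rank listings by cosine similarity to the query; the sketch below illustrates only that ranking step. The listing names and the three-dimensional vectors are hypothetical stand-ins for features that a real system would extract from photos.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical image feature vectors for three property listings.
LISTINGS = {
    "condo_a": [0.9, 0.1, 0.3],
    "terrace_b": [0.2, 0.8, 0.5],
    "condo_c": [0.8, 0.2, 0.4],
}

def most_similar(query, listings, top_k=2):
    """Return the names of the top_k listings most similar to the query vector."""
    ranked = sorted(
        listings,
        key=lambda name: cosine_similarity(query, listings[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical feature vector of a photo uploaded by a user.
query_vector = [0.85, 0.15, 0.35]
matches = most_similar(query_vector, LISTINGS)
print(matches)
```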
Hackathon: Tips
• Have a well-rounded team with no more than one
server-side developer with relevant experience,
one good designer and one amazing storyteller
• Understand the expected outcomes of the
hackathon
• Develop something whose benefits everyone can
see
• Have an impressive aim or objective
• Start promoting your product during the
hackathon
• Nail the demo 100%. The pitch is the product's
chance to shine