1. How to crack down
BIG DATA?
Required Skills for Data Scientist
2. hello!
I AM
DAVID
HUANG
I am here because I
want to find more people who love
data science.
You can find me at:
tawei.huang1@gmail.com
My Experience
• Data Scientist Intern, Yoctol
• Data & Strategy Intern , Chocolabs
• Summer Intern Student, Institute of
Mathematics, Academia Sinica
My Education Background
• Master in Statistics, NTU
• BSc in Quantitative Finance, NTHU
• Research Student, PKU
3. “Big data is a big trend, but it is very
difficult to hire a data scientist.
It’s also hard to find a job in TW XD”
4. 1. Who is a Data
Scientist?
The skill sets you need to be a
data scientist.
5. In a big data project, we need these people!
Data Backend Engineer
Develop and operate backend systems related to data access, collection, processing, and storage.
Database Architect
Architect and design database solutions for the enterprise, and lead the effort on database performance and optimization.
Data Analyst / Data Scientist
Use advanced quantitative analysis, data mining techniques, and strong industry acumen to interpret, connect, and predict data to deliver insight and recommendations for decisions.
Domain Expert
Assist the data team in understanding the domain problem and knowledge.
6. Data Analyst / Machine Learning
Lots of people say that they are different, but I think “every data
analyst should be a data scientist, and the converse holds!”
Explanatory
Analytics
Theory-based statistical testing of causal
hypotheses (commonly seen in economics)
Strength of relationship in statistical model
Data analyst
Predictive
Analytics
Empirical method for predicting new
observations (in statistical / math / CS ways)
Ability to accurately predict new individuals
Data scientist
Both fields are important for discovering knowledge.
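The two views can be contrasted on one dataset. This is only a minimal sketch, assuming simulated data and a one-variable linear regression; the helper names (`fit_line`, `slope_t_stat`, `holdout_rmse`) and the numbers are illustrative and do not come from the slides. The explanatory view asks how strong the relationship is (a t-statistic on the slope), while the predictive view asks how well the fitted model predicts observations it has not seen (hold-out error).

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def slope_t_stat(xs, ys):
    """Explanatory view: t-statistic for the hypothesis 'slope = 0'."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return b / se_b

def holdout_rmse(xs, ys, n_train):
    """Predictive view: fit on the first n_train points, score on the rest."""
    a, b = fit_line(xs[:n_train], ys[:n_train])
    errs = [(y - (a + b * x)) ** 2 for x, y in zip(xs[n_train:], ys[n_train:])]
    return math.sqrt(sum(errs) / len(errs))

# Simulated data (an assumption for illustration): y = 1 + 2x + noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + random.gauss(0, 1) for x in xs]

t_stat = slope_t_stat(xs, ys)
rmse = holdout_rmse(xs, ys, n_train=150)
print(f"slope t-statistic: {t_stat:.1f}")
print(f"hold-out RMSE: {rmse:.2f}")
```

A large t-statistic supports the hypothesized relationship; a small hold-out RMSE says the model predicts new individuals well. A model can do well on one criterion and poorly on the other.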
8. The Data Scientist Venn Diagram
Math &
Statistics
Hacking
Skills
Domain
Expertise
Machine
Learning
Research
Program
Unicorn
First become a
(1) researcher,
(2) machine learner,
(3) programmer,
and then find your own
way to be a data scientist.
9. Skill Sets for Data Scientist – Math & Stat
Mathematics & Statistics
Multivariable Calculus
Linear Algebra
Probability Theory
Statistics / Math Statistics
Convex Optimization
Discrete Analysis
Basic Knowledge
Regression Analysis / GLM
Experimental Design
Causal Inference
Multivariate Analysis
Biz Analytics & Data Mining
Data Mining
Machine Learning
Deep Learning (ANN/CNN)
Machine Learning
Time Series Analysis
Forecasting
1
10. Skill Sets for Data Scientist – Programming
Programming Skills
Python
(Scripting Language)
R
(Statistical Software)
Matlab
(Super Fast but Expensive)
Programming Skill
SQL & Relational Algebra
NoSQL / Cassandra / etc.
HDFS / Map Reduce
Hadoop and Hive /Pig
Spark & Scala
Database Querying
A little bit of Java
Data Structure & Algorithm
Data Munging (python!)
Data Viz (d3.js / Tableau)
Software Engineering
2
1. D3.js visualization: http://goo.gl/cVlTX7
2. Spark MLlib: http://goo.gl/VNMQ97
11. Skill Sets for Data Scientist – Business Sense
Business Professionalism
Hypothesis Thinking
Pyramid Principles
BizPro is a good choice!
Logical Thinking
To be honest, the crucial truth is: “this part is very important, but it is
the least important of the three skill sets!”
Presentation & Presence
Communication Skill
Upward Management
Communication Skill
I think this is the niche for business school students. Specific
knowledge about marketing, financial analysis, etc. helps a lot.
3
12. My Learning Path for you – Math matters!
Calculus
Linear Algebra
Probability Theory
Math Statistics
Freshman - Junior
1
Programming
C / Java / R
2
Financial Market
Marketing
Management
3
Advanced Statistics
Data Mining
Econometrics
Senior
R Programming
Matlab (Basic)
Competitions
Advanced Finance
Macroeconomics
Statistical Learning
Compressed Sensing
Current
Python & SQL
Hadoop & Spark
BizPro Training
Logical Thinking
Marketing Analytics
13. 2. Master in Data
Science for free!
How to become the data unicorn
without any tuition fee
14. Data Scientist 101: Johns Hopkins MOOC
The Coursera Specializations offered by Johns Hopkins University give a
very good general exposure to the world of data science.
Executive Data Science
I think this specialization is designed for those who don’t want to become a
data scientist but may work in a data-driven company.
URL: https://goo.gl/ZNBF7N
Data Science
I think this specialization is designed for those who don’t have a very strong
academic background but want to become a data scientist.
URL: https://goo.gl/8OzBhe
Difficulty
Difficulty
15. Basic Math: Calculus & Linear Algebra
Calculus and linear algebra are fundamental tools for data scientists and
statisticians. Having a solid foundation will help a lot.
Calculus I & II, NTHU
This course gives you a solid foundation in Euclidean space and
multivariable calculus, which is very important for a data scientist.
URL: http://ocw.nthu.edu.tw/ocw/index.php?page=course&cid=7&
Linear Algebra, NCTU
A data scientist usually thinks of data in matrix form. The concepts
of vector algebra help a lot in high-dimensional data analysis.
URL: http://goo.gl/KFdJTT
Difficulty
Difficulty
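To make "data in matrix form" concrete, here is a minimal sketch in plain Python, assuming a toy dataset where each row is one observation and each column is one feature. The helper names (`transpose`, `matmul`, `covariance`) and the numbers are made up for illustration. It computes the sample covariance matrix as the centered data matrix transposed times itself, divided by n - 1.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def covariance(data):
    """Sample covariance matrix: Xc^T Xc / (n - 1), where Xc is column-centered."""
    n = len(data)
    means = [sum(col) / n for col in transpose(data)]
    centered = [[v - m for v, m in zip(row, means)] for row in data]
    xtx = matmul(transpose(centered), centered)
    return [[v / (n - 1) for v in row] for row in xtx]

# Toy data matrix (illustrative): 3 observations, 2 features.
data = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
]
cov = covariance(data)
print(cov)
```

In practice a library such as NumPy would handle these operations; the point here is only that one matrix expression summarizes how every pair of features moves together.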
16. Advanced Math: Convex Optimization
This is a very advanced topic we will use when doing machine learning.
However, I don’t think every data scientist should understand this field.
Convex Optimization, Stanford
This course should benefit anyone who uses or will use scientific computing
or optimization in engineering or related work (e.g., machine learning,
finance, operational research).
URL: http://stanford.edu/class/ee364a/
MOOC: https://goo.gl/KBQ473
Difficulty
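For a taste of why this topic shows up in machine learning, here is a minimal sketch of gradient descent on a convex function. The function, the starting point, the step size, and the names (`grad_descent`, `f`, `grad_f`) are all assumptions chosen for illustration, not material from the course. Because the function is convex, following the negative gradient with a small enough step reaches the single global minimum.

```python
def grad_descent(grad, start, step=0.1, iters=200):
    """Repeatedly move against the gradient with a fixed step size."""
    point = list(start)
    for _ in range(iters):
        g = grad(point)
        point = [p - step * gi for p, gi in zip(point, g)]
    return point

def f(p):
    """Illustrative convex function with its minimum at (1, -2)."""
    x, y = p
    return (x - 1.0) ** 2 + 2.0 * (y + 2.0) ** 2

def grad_f(p):
    """Gradient of f."""
    x, y = p
    return [2.0 * (x - 1.0), 4.0 * (y + 2.0)]

minimum = grad_descent(grad_f, start=[5.0, 5.0])
print(minimum, f(minimum))
```

Fitting many models (least squares, logistic regression) amounts to minimizing a convex loss in this way, only with more variables.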
17. Basic Stat: Probability & Math Statistics
If you don’t have a background in probability & math statistics, you can’t learn any advanced
data analytics method. Please learn it!
Probability, NTHU
URL: http://goo.gl/G4MhIj
Math Statistics, NTHU
URL: http://goo.gl/nQ2cE2
Difficulty
Difficulty
18. Stat Method: Advanced Methods
These three fields are core data analytics methods. You will find them
everywhere, like in econometrics, machine learning, and so on.
Regression Analysis, NTHU
URL: http://goo.gl/YQBAla
Difficulty
Multivariate Analysis, NTHU
URL: http://goo.gl/934GKd
Difficulty
Experimental Design, NTHU
URL: http://goo.gl/ED9HMr
Difficulty
19. Data Mining: Illinois & Stanford MOOC
Data mining is one of the most powerful tools for business analytics. It can be
applied to user behavior data, questionnaire design, and financial markets.
Data Mining, UIUC
The Data Mining Specialization teaches data mining techniques for both structured data
which conform to a clearly defined schema, and unstructured data which exist in the form of
natural language text.
URL: https://goo.gl/Tyzm6Z
Difficulty
Mining Massive Dataset, Stanford
Introduce the participant to modern distributed file systems and MapReduce, including what
distinguishes good MapReduce algorithms from good algorithms in general. The rest of the
course is devoted to algorithms for extracting models and information from large datasets.
URL: https://goo.gl/NYyxy9
Difficulty
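The MapReduce idea mentioned above can be sketched on a single machine. This is a minimal sketch, assuming the classic word-count task and in-memory lists instead of a distributed file system; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative and are not the API of Hadoop or any other framework.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

def word_count(docs):
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    groups = shuffle(pairs)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

# Toy documents (illustrative).
docs = ["big data is big", "data science loves data"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines, and the shuffle step moves data over the network, which is why an algorithm that is good in general is not automatically a good MapReduce algorithm.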
21. Machine Learning: Stanford / NTU MOOC
Machine learning is the science of getting computers to act without being
explicitly programmed.
Machine Learning, Stanford
This course provides a broad introduction to machine learning, data mining,
and statistical pattern recognition.
URL: https://www.coursera.org/learn/machine-learning
Difficulty
Machine Learning, NTU
The students shall enjoy a story-like flow moving from "When Can Machines
Learn" to "Why", "How" and beyond. (Very tough course!)
URL: https://www.coursera.org/course/ntumlone
Difficulty
22. 3. What I’ve done
in practice!
23. SOP for a Data Analytics Project
Data Task Formulation
Data Collection
Data Cleaning
Data Exploration
Data Modeling
Define Purpose
Model Selection
Performance Evaluation
Model Deployment
Initial Phase: 90% Efforts
Middle Phase: 90% Professions
Final Phase: 90% Domain
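The collection, cleaning, exploration, modeling, and evaluation steps can be sketched as a chain of small functions. This is only a minimal sketch: the records, the cleaning rules, the mean-value baseline model, and every function name are hypothetical and chosen for illustration; they are not the author's actual project code.

```python
from statistics import mean

def collect():
    """Data collection: hypothetical raw (user_id, daily_views) records."""
    return [("u1", 12), ("u2", None), ("u3", 7), ("u4", 30), ("u5", -1), ("u6", 9)]

def clean(records):
    """Data cleaning: drop missing and impossible (negative) values."""
    return [(u, v) for u, v in records if v is not None and v >= 0]

def explore(records):
    """Data exploration: a few summary statistics."""
    values = [v for _, v in records]
    return {"n": len(values), "mean": mean(values), "max": max(values)}

def fit_baseline(train):
    """Data modeling: a baseline that always predicts the training mean."""
    m = mean(v for _, v in train)
    return lambda user_id: m

def evaluate(model, test):
    """Performance evaluation: mean absolute error on held-out records."""
    return mean(abs(model(u) - v) for u, v in test)

raw = collect()
tidy = clean(raw)
summary = explore(tidy)
train, test = tidy[:3], tidy[3:]
model = fit_baseline(train)
mae = evaluate(model, test)
print(summary, mae)
```

Keeping each step as its own function makes it easy to swap in a better model later and compare it against the baseline with the same evaluation code.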
24. 25,054,386 vc
Monthly View Counts
751,631,580 values
Lots of user behavior!
1,785,244 users
Monthly Active Users