The “definition” of a Data Scientist says that one should know Math and Statistics, have domain or business-specific knowledge, and know how to put it all into code. Nobody knows to what extent this knowledge should be present in a single unicorn. One thing is for sure: it grows over time. Knowing how to implement and use ML models as repeatable tasks is what separates statisticians and researchers from the Data Scientists who help businesses improve their performance. That’s where the art of coding comes in.
1. THE CODING PORTION OF DATA SCIENCE
MILOS MILOVANOVIC
milos@thingsolver.com
CTO & Co-Founder
ENLIGHTEN
YOUR DATA
November, 2019
www.thingsolver.com
2. WHY DO I CARE ABOUT THE TOPIC?
There is one problem with your topic - you are not a Data Scientist!
- Valentina Djordjevic / Head of Data Science @ THINGS SOLVER
Building Data Products
As a business owner, I need to ensure that our products work and improve our clients’ business processes.
Technical Lead
As a CTO, I need to ensure that my colleagues have the necessary skill set and that our technology runs smoothly.
Data Engineering
As a Data Engineer, I work with Data Scientists on productizing and optimizing ML workflows.
3. CAN I BE A DATA SCIENTIST WITHOUT CODING?
Common algorithms are already known, coded and optimized.
Explicit coding is being replaced with drag-and-drop interfaces.
Data science is becoming more automated with options like Google’s Cloud AutoML, DataRobot, ...
Basic knowledge of Python and/or R helps us to tackle our ML tasks with common algorithms.
VS.
https://www.glassdoor.com/research/data-scientist-personas/ *
4. LEARNING DATA SCIENCE
- Robert Chang, Data Scientist @ Airbnb
Link: https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7
[Venn diagram: Computer Science, Math & Statistics, and Business Knowledge, with overlaps labeled Machine Learning, Software Development, and Traditional Research, and DATA SCIENCE at the center]
5. LEARNING DATA SCIENCE
- Robert Chang, Data Scientist @ Airbnb
Link: https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7
[Venn diagram: Computer Science, Math & Statistics, and Business Knowledge, with overlaps labeled Machine Learning, Software Development, and Traditional Research, and DATA SCIENCE at the center]
6. WHAT DO ACADEMIC INSTITUTIONS TEACH US?
WHAT DO WE NEED TO LEARN IN REAL LIFE?
Import Data
Build Features
Modeling
Model Evaluation
Problem Formulation
ETL
Productizing
Integration with Business Processes
Debugging
Context!
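The classroom-style steps listed on this slide (Import Data, Build Features, Modeling, Model Evaluation) can be sketched as four small functions. This is only an illustrative sketch, not code from the talk: the toy CSV data, the function names and the choice of one-feature least squares are all assumptions made to keep the example self-contained.

```python
import csv
import io
from statistics import mean

# Hypothetical toy data standing in for a real data source (an assumption).
RAW_CSV = """area_m2,rooms,price
50,2,100
70,3,140
90,4,180
110,5,220
"""


def import_data(text):
    """Import Data: parse CSV text into a list of rows with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def build_features(rows):
    """Build Features: use area as the single feature and price as the target."""
    xs = [row["area_m2"] for row in rows]
    ys = [row["price"] for row in rows]
    return xs, ys


def fit_model(xs, ys):
    """Modeling: one-feature least squares, y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def evaluate(model, xs, ys):
    """Model Evaluation: mean absolute error of the fitted line.

    Evaluated on the training data here for brevity; a real evaluation
    would use held-out data.
    """
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in zip(xs, ys))


if __name__ == "__main__":
    rows = import_data(RAW_CSV)
    xs, ys = build_features(rows)
    model = fit_model(xs, ys)
    print("model:", model, "MAE:", evaluate(model, xs, ys))
```

The remaining items on the slide (Problem Formulation, ETL, Productizing, Integration with Business Processes, Debugging, Context) are exactly the parts that a sketch like this leaves out.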
7. BUT WE CANNOT BLAME ACADEMIA...
The Data Science and ML fields are expanding too fast for academic institutions to keep pace.
Time and resource restrictions in a Master’s degree limit the content that can be taught.
Data Science and ML include too many fields to be completely taught over a 4-5 year degree program.
8. WHAT DOES KAGGLE TEACH US?
Step 1: Join a Competition
Step 2: Build and Submit Your Model
Step 3: Watch the Leaderboard and Win Prizes
Incredible Datasets
Impressive Kernels
Huge learning base
Cleaned (and labeled) Data
Runtime Environment
Preparation Kernels
Kaggle Stars
Benchmark against other solutions
* https://www.kaggle.com/challenge-yourself
9. KAGGLE VS REALITY
Problems are already formulated!
Datasets are prepared!
Data is labeled!
Lack of Decision Process!
15. BUILDING A ROBUST AND OPTIMAL SYSTEM
Building a model in your Jupyter Notebook
VS
Building a model in a live and robust environment
16. BUILDING A ROBUST AND OPTIMAL SYSTEM
Building a model in your Jupyter Notebook
VS
Building a model in a live and robust environment
Pipeline Automation
Scale
Monitoring
Integrate
Quality Assurance
Build & Deployment
17. BUILDING A ROBUST AND OPTIMAL SYSTEM
Building a model in your Jupyter Notebook
VS
DATA PRODUCT
Pipeline Automation: Efficient Code
Scale: Distributed Code
Monitoring: Logging
Integrate: API Design
Quality Assurance: Unit Testing
Build & Deployment: Pluggable Packaging
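Two of the practices named on this slide, logging and unit testing, can be sketched around a single scoring function. This is a minimal sketch under stated assumptions: the `score` function, its linear form, the logger name and the test cases are hypothetical and are not taken from the talk or from any particular product.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scoring")  # hypothetical logger name


def score(features, weights, bias=0.0):
    """Return a linear score for one input row.

    Raises ValueError on mismatched input instead of failing silently,
    so that problems surface in logs and tests rather than in production.
    """
    if len(features) != len(weights):
        logger.error("expected %d features, got %d", len(weights), len(features))
        raise ValueError("feature/weight length mismatch")
    result = bias + sum(f * w for f, w in zip(features, weights))
    logger.info("scored %d features -> %.3f", len(features), result)
    return result


def test_score_returns_weighted_sum():
    # 1.0 * 0.5 + 2.0 * 0.25 = 1.0
    assert score([1.0, 2.0], [0.5, 0.25]) == 1.0


def test_score_rejects_wrong_length():
    try:
        score([1.0], [0.5, 0.25])
    except ValueError:
        return
    raise AssertionError("expected ValueError for mismatched lengths")


if __name__ == "__main__":
    test_score_returns_weighted_sum()
    test_score_rejects_wrong_length()
    print("all tests passed")
```

Moving logic out of notebook cells and into plain functions like this is what makes it possible to test, log and package the code at all; the test functions above follow the naming convention that test runners such as pytest discover automatically.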
18. BUILDING A ROBUST AND OPTIMAL SYSTEM
Building a model in your Jupyter Notebook
VS
Pipeline Automation
Scale
Monitoring: Logging
Integrate: API Design
CI / CD
Build & Deployment: Pluggable Packaging
19. IS THE MODEL WORTH THE INVESTMENT?
PIPELINE EFFICIENCY
$
20. Data Science is more than pure analytics:
ITERATIVE
INTERCONNECTED
ADAPTIVE
PROCESSES AUTOMATION
21. Data Science is more than pure analytics:
ITERATIVE
INTERCONNECTED
ADAPTIVE
PROCESSES AUTOMATION
LEARNING