Artur Suchwalko “What are common mistakes in Data Science projects and how to avoid them?”

What are common mistakes
in Data Science projects?
(and how to avoid them?)
Artur Suchwałko, Ph.D., QuantUp
AI & Big Data 2018, March 10, 2018, Lviv, Ukraine

Real-world Data Science projects

• Kaggle competitions and real Data Science projects are two quite
different disciplines
• Once a data frame is prepared, the rest is easy
• What is done incorrectly and can be corrected?
• Analysis of a business problem
• Data
• Process
• Methods, models
• Hardware, software
• People
(Everything is based on practical experience: 20 years, 100 projects, 3,000
hours of workshops.
For the majority of topics I could add quotes from talks.)

Analysis of a business problem

No. We don’t want to build a model of
production and storage in our factory
Problem:
• We’d like just to optimize cutting a log (a trunk of a dead tree) into
planks
• Let’s do it in the simplest way. Why should we waste time and
money?
• The others can do it. Why do you make it complicated?!?
Solution:
• To build the production and storage model
• Otherwise you will optimize log cutting in a different sawmill
• or something completely different

Solution of a wrong analytical problem
Problem:
• Stating the wrong problem and solving it can decrease the predictive
ability of a model
• Similarly, removing so-called false predictors (leaks from the future)
also decreases it
• But we never want pure predictive power. Usually the business
wants actionability and real value
Solution:
• Focus on what influences your business

Preparation of a development sample is not
very important
Problem:
• Let’s take a sample and model!
• Preparation of the development sample decides if the model will fit
the reality we model or not
• The data, and thus the sample, is generated (or influenced) by a
process that must be well known and understood
Solution:
• Think it over really carefully.

We have Big Data. We need to implement
Big Data solutions
Problem:
• If you can email your data or fit it on a pendrive, you don’t
have Big Data!
• Many Data Science tasks for millions of records can be completed
using (powerful) laptops
• Decisions are data-driven or not. It’s not about the size of the data but
about the way decisions are made
Solution:
• Be (more than) sure that we need Big Data technologies for storing
and processing
• During the PoC / prototype stage, don’t use Big Data tools
• Important: this is not valid for some problems

Use social media data
Problem:
• It’s a tremendous effort if you don’t use an off-the-shelf solution
• Usually business value is not big
Solution:
• Be sure that the effort will be rewarded

Let’s build a model in one week
Problem:
• It’s possible (in theory)
• If you don’t analyze the process thoughtfully and don’t detect false
predictors then the model will not work in production
• We will be really happy to see how well it performs on our
development sample
Solution:
• Take enough time
• Be sure that the process is correct
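
One way to look for false predictors, as a minimal sketch: check how well each feature alone predicts the target. A feature that is almost perfect on its own is often a leak from the future and deserves a closer look at the process that produced it. The field names (`collection_status`, `default`), the toy rows, and the 0.95 threshold are illustrative assumptions, not taken from the talk.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, target):
    """Accuracy of predicting `target` from `feature` alone,
    using a majority vote within each feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    # For each feature value, the majority class is predicted correctly.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)


def suspicious_features(rows, features, target, threshold=0.95):
    """Features that predict the target suspiciously well on their own."""
    return [
        f for f in features
        if single_feature_accuracy(rows, f, target) >= threshold
    ]


# Hypothetical development sample: `collection_status` is only known
# AFTER a default has happened, so it leaks the target.
rows = [
    {"region": "north", "collection_status": "none", "default": 0},
    {"region": "south", "collection_status": "none", "default": 0},
    {"region": "north", "collection_status": "sent_to_collection", "default": 1},
    {"region": "south", "collection_status": "sent_to_collection", "default": 1},
    {"region": "north", "collection_status": "none", "default": 0},
    {"region": "south", "collection_status": "sent_to_collection", "default": 1},
]

flagged = suspicious_features(rows, ["region", "collection_status"], "default")
print(flagged)
```

This is only a heuristic computed on the development sample itself: a feature with many distinct values can look perfect by chance, and a flagged feature still has to be checked against the business process.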

There is too little time to complete the task /
model
Problem:
• Data problems
• Stuck in preprocessing
• The implementation takes too long
• Too little experience
Solution:
• Prepare a full product as soon as possible, e.g.:
• one that cuts through all the functionality end to end, e.g. a scoring application with a
simple / dummy model
• full code for building the model, but using simpler methods
• improve it in the next iterations
• Use CRISP-DM / a checklist to support your memory
• Usually you can start implementation from the first product version
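
The "full product with a dummy model" idea can be sketched as follows. The whole path (raw record, preprocessing, scoring, output) exists from day one, and only the model is a placeholder to be replaced in later iterations. `DummyModel`, `preprocess`, `score`, and the field names are hypothetical names chosen for this illustration.

```python
class DummyModel:
    """Placeholder model: always predicts the base rate seen in training."""

    def fit(self, rows, target):
        self.base_rate = sum(row[target] for row in rows) / len(rows)
        return self

    def predict_proba(self, features):
        # Ignores the features on purpose; a real model replaces this later.
        return self.base_rate


def preprocess(raw_record):
    """Turn a raw record (strings from a form or file) into typed features."""
    return {"income": float(raw_record["income"]), "age": int(raw_record["age"])}


def score(model, raw_record):
    """End-to-end scoring step of the application."""
    features = preprocess(raw_record)
    return {"score": model.predict_proba(features)}


# Hypothetical training data with a binary target `churned`.
train_rows = [
    {"income": 3000.0, "age": 25, "churned": 0},
    {"income": 5200.0, "age": 41, "churned": 1},
    {"income": 4100.0, "age": 33, "churned": 0},
    {"income": 2800.0, "age": 52, "churned": 0},
]

model = DummyModel().fit(train_rows, target="churned")
result = score(model, {"income": "4200", "age": "37"})
print(result)
```

Because the application only calls `fit` and `predict_proba`, a better model with the same two methods can be swapped in without touching the rest of the product.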

The way you prepare the result (a model, a data
product) doesn’t matter
Problem:
• I want a model. It must work. I don’t care how you’ll build it. Just
build it!
• The process is crucial
• If it is wrong then the analysis is not fully reproducible
• We take on technical debt
• and sooner or later we will be forced to pay it back
Solution:
• Build models in a fully reproducible way
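
A minimal sketch of one ingredient of reproducibility: every random step uses an explicit seed, and each run records the seed, the parameters, and a fingerprint of the input data. The function names, the manifest layout, and the toy data are assumptions made for this example; a real project would also pin library versions and keep the code in a VCS.

```python
import hashlib
import json
import random


def data_fingerprint(rows):
    """SHA-256 of the input data, so a run can be tied to its exact input."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_run(rows, params, seed):
    """Draw a development sample with a dedicated, seeded generator
    and return it together with a manifest describing the run."""
    rng = random.Random(seed)  # no hidden global random state
    sample = rng.sample(rows, k=params["sample_size"])
    manifest = {
        "seed": seed,
        "params": params,
        "data_sha256": data_fingerprint(rows),
    }
    return sample, manifest


rows = [{"id": i, "value": i * i} for i in range(100)]
params = {"sample_size": 10}

sample_a, manifest_a = build_run(rows, params, seed=2018)
sample_b, manifest_b = build_run(rows, params, seed=2018)

# Same data, parameters, and seed give exactly the same sample.
print(sample_a == sample_b, manifest_a == manifest_b)
```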

Implementation – I’m sure it’ll work out
somehow
Problem:
• Implementations without planned tests usually fail
• What is really painful is that it takes time to realize that they have failed (a
model works and generates risk)
Solution:
• Plan both implementation and tests

AI. We desperately need AI!
Problem:
• We don’t need it
• Predictive modeling is not AI!
• It happens that full control over a model is more important than
predictive power
Solution:
• Let’s think about what we’d like to achieve and how to do it
• Data-driven decision making is more important

A model just learns everything it is exposed
to
Problem:
• You need to promise self-learning to sell a service / a software
• But it will not learn automatically if not fed by suitable data
• In many situations you don’t have such data to design a feedback loop
Solution:
• Analyze a process that generates the data for the development sample
• Put aside a “not touched” sample
• The model will be trained on a sample and refined on an ongoing
basis
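
Putting aside a “not touched” sample can be sketched like this. Assigning records by a hash of their identifier (an assumption of this example, as are the `customer_id` field and the 20% share) keeps the assignment stable when new data arrives, so a record never moves between the development sample and the untouched one.

```python
import hashlib


def is_untouched(record_id, holdout_percent=20):
    """Deterministically assign a record to the untouched sample."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_percent


def split_records(records, id_field="customer_id"):
    """Split records into (development, untouched) lists."""
    development, untouched = [], []
    for record in records:
        if is_untouched(record[id_field]):
            untouched.append(record)
        else:
            development.append(record)
    return development, untouched


records = [{"customer_id": i} for i in range(1000)]
development, untouched = split_records(records)
print(len(development), len(untouched))
```

The untouched sample is then used only for the final check of the model, never for feature selection or tuning.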

Start modeling from using Deep Learning!
Problem:
• But everybody uses it…
• No!!!
• Many problems are too simple for DL
• In particular, the problems with data in a data frame
Solution:
• Random Forest, xgboost

If we have 3000 classes then let’s build a
BIG classifier
Problem:
• For example when we’d like to recommend bank products
• A random classifier for such a problem has an error of 2999/3000 = 99.97% (not 50%)
• Usually the dataset is too small
Solution:
• It’s good to use a simpler method (usually)
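
The baseline quoted above follows from simple arithmetic: a classifier that guesses uniformly among K classes is right with probability 1/K, so its error is 1 - 1/K. A short check, with `random_guess_error` being a helper name introduced only for this illustration:

```python
def random_guess_error(n_classes):
    """Expected error of a classifier guessing uniformly among n_classes."""
    return 1 - 1 / n_classes


binary_error = random_guess_error(2)
many_class_error = random_guess_error(3000)

print(f"2 classes:    {binary_error:.2%}")
print(f"3000 classes: {many_class_error:.2%}")
```

So a model for 3000 classes has to be compared against a 99.97% error baseline, not against the 50% familiar from binary problems.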

You can do calculations using a laptop
Problem:
• Sometimes yes, you can
• But usually you cannot
• Usually it doesn’t make any sense – a human’s time is more expensive
than a machine’s time
Solution:
• It is good to invest some money in hardware
• or use AWS from Amazon (or something similar)

Commercial software is excellent
Problem:
• Users often say that it is excellent until it is bought
• The problems appear later
Solution:
• Test it in conditions similar to those in which it will be used
• Think seriously about using open source

Free software is excellent (and it’s free!)
Problem:
• It’s free – in terms of a buying cost
• It’s not simply excellent – the cost is the necessity of having qualified people
on board and of developing software
• Inconvenient problems do happen
Solution:
• Use it as it should be used
• i.e. write clear and clean code, use additional tools, e.g. VCS
• Make sure the team has the skills needed

All companies have Data Science teams.
Let’s build one for us!
Problem:
• It’s possible to build a team. It will take a lot of time and lots of
money.
• If the results are wasted, the people will leave
• They need to have fun working on projects
• If I need a plank then do I really need to buy a sawmill?
Solution:
• Be sure that:
• we know how to use their results
• it will give value to the business
• PoC can be outsourced. The first data science project can be
outsourced.

A student or a freshman is enough to give
profits from deep analytics to business
Problem:
• If someone can cut with a scalpel, will we call him a surgeon?
• Why is someone who can (technically) build a model from a data
frame called a Data Scientist?
• Data Scientist is a profession – experience matters!
• People without experience usually don’t give any business value to
a company. Even after spending a year working with data (!)
Solution:
• Hire experienced people, especially at the beginning of a DS journey
• let them teach the freshmen
• But what if you don’t have experienced people?
• Invest time, effort, and money in your team. Let a more business-oriented
analyst control the team

The team will learn everything from online
courses
Problem:
• I’ll give each of you $20 (OK, even $50): go and learn everything online
• It’s true. The team will learn some things
• But not the most important ones
• A good hands-on training cannot be substituted
Solution:
• Learning by doing (and applying)
• Control and stimulate learning
• Buy knowledge

Summary
• To avoid mistakes it is good to ask ourselves these questions (and
answer them), e.g.:
• What business problem are we solving?
• What business value can we get from the results?
• What could be lost in translation from business into analytics?
• Do we have adequate and representative data?
• What process generates them? What are they influenced by?
• What is the model-building process?
• What analytical tools should be used? Could we apply simpler
approaches?
• How do we control all the risks?
• It is good to do it repeatedly
• It’s best to involve someone experienced
• It’s beneficial to educate the receivers of the results

Contact
• During the conference!
• After the conference: artur [at] quantup [dot] eu
