SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
What are common mistakes
in Data Science projects?
(and how to avoid them?)
Artur Suchwałko, Ph.D., QuantUp
AI & Big Data 2018, March 10, 2018, Lviv, Ukraine
Real-world Data Science projects
Real-world Data Science projects
• Kaggle competitions and real Data Science projects are two quite
different disciplines
• When a data frame is prepared then it’s easy
• What is done not correctly and can be corrected?
• Analysis of a business problem
• Data
• Process
• Methods, models
• Hardware, sofware
• People
(Everything based on practical experience: 20 years, 100 projects, 3,000
hours of workshops.
For the majority of topics I could add quotes from talks.)
Analysis of a business problem
No. We don’t want to build a model of
production and storage in our factory
Problem:
• We’d like just to optimize cutting a log (a trunk of a dead tree) into
planks
• Let’s do it in the simplest way. Why should we waste time and
money?
• The others can do it. Why do you make it complicated?!?
Solution:
• To build the production and storage model
• Otherwise you will optimize log cutting in a different sawmill
• or something completely different
Solution of a wrong analytical problem
Problem:
• Stating of a wrong problem and solving it can decrease predictive
ability of a model
• Similarly, removing so called false predictors (leaks from future)
• But we never want to have pure predictive power. Usually business
wants actionability and real value
Solution:
• Focus on what influences your busines
Data
Preparation of a development sample is not
very important
Problem:
• Let’s take a sample and model!
• Preparation of the development sample decides if the model will fit
the reality we model or not
• The data and thus the sample is generated (or influenced) by a
process that must be well known and understoo
Solution:
• Think it over really carefully.
We have Big Data. We need to implement
Big Data solutions
Problem:
• If you can email your data or fit it in a pendrive it means you don’t
have Big Data!
• Many Data Science tasks for millions of records can be completed
using (powerful) laptops
• Decisions are data-driven or not. It’s not about data magnitude but
about way the decisions are taken
Solution:
• Be (more than) sure that we need Big Data technologies for storing
and processing
• During PoC / prototype stage don’t use Big Data tools
• Important: Not valid for some problems
Use social media data
Problem:
• It’s a tremendous effort if you don’t use an off-the-shelf solution
• Usually business value is not big
Solution:
• Be sure that the effort will be rewarded
Process
Let’s build a model in one week
Problem:
• It’s possible (in theory)
• If you don’t analyze the process thoughtfully and don’t detect false
predictors then the model will not work in production
• We will be really happy to see how well it performs on our
development sample
Solution:
• Take enough time
• Be sure that the process is correct
There is too short time to complete the task /
model
Problem:
• Data problems
• Stucked in preprocessing
• The implementation takes too long
• Too short experience
Solution:
• Prepare a full product as soon as possible, e.g.:
• cutting out all the functionalities, e.g. a scoring application with a
simple / dummy model
• a full code for building the model but using simpler methods
• improve it in the next iterations
• Using CRISP-DM / checklist to support your memory
• Usually you can start implementation from the first product version
Way you prepare the result (a model, a data
product) doesn’t matter
Problem:
• I want a model. It must work. I don’t care how you’ll build it. Just
build it!
• The process is crucial
• If it is wrong then the analysis is not fully reproducible
• We take a technical debt
• and sooner or later we will be forced to pay it back
Solution:
• Build models in a fully reproducible way
Implementation – I’m sure it’ll work out
somehow
Problem:
• Implementation without planned tests usually fail
• What is really painful, it takes time to realize that they failed (a
model works and generates risk)
Solution:
• Plan both, implementation and tests
Methods & models
AI. We desperately need AI!
Problem:
• We don’t need
• Predictive modeling is not AI!
• It happens that full control over a model is more important than
predictive power
Solution:
• Let’s think what we’d like to achieve and how to do this
• Data-driven decision making is more important
A model just learns everything it is exposed
to
Problem:
• You need to promise self-learning to sell a service / a software
• But it will not learn automatically if not fed by suitable data
• In many situations you don’t have such data to design a feedback loop
Solution:
• Analyze a process that generates the data for the development sample
• Put aside a “not touched” sample
• The model will be taught using a sample and refined in an ongoing
way
Start modeling from using Deep Learning!
Problem:
• But everybody uses it…
• No!!!
• Many problems are too simple for DL
• In particular, the problems with data in a data frame
Solution:
• Random Forest, xgboost
If we have 3000 classes then let’s build a
BIG classifier
Problem:
• For example when we’d like to recommend bank products
• Such a random classifier has error 2999/3000 = 99.97% (not 50%)
• Usually the dataset is too small
Solution:
• It’s good to use a simpler method (usually)
Hardware & software
You can do calculations using a laptop
Problem:
• Sometimes yes, you can
• But usually you cannot
• Usually it doesn’t make any sense – human’s time is more expensive
that machine’s time
Solution:
• It is good to invest some money in hardware
• or use AWS from Amazon (or something similar)
Commercial software is excellent
Problem:
• Users often tell that it is excellent unless bought
• The problems appear later
Solution:
• Test it in similar conditions it will be used
• Think seriously about using open source
Free software is excellent (and it’s free!)
Problem:
• It’s free – in terms of a buying cost
• It’s not just excellent – the cost is neccessity to have qualified people
onboard and to develop software
• There happen inconvenient problems
Solution:
• Use as it should be used
• i.e. write clear and clean code, use additional tools, e.g. VCS
• Take care of the team to have the skills needed
People
All companies have Data Science teams.
Let’s build one for us!
Problem:
• It’s possible to build a team. It will take a lot of time and lots of
money.
• If the results will be wasted then the people will leave
• They need to have fun working on projects
• If I need a plank then do I really need to buy a sawmill?
Solution::
• Be sure that:
• we know how to use their results
• it will give value to the business
• PoC can be outsourced. The first data science project can be
outsourced.
A student or a freshman is enough to give
profits from deep analytics to business
Problem:
• If someone can cut with a scalpel then will we call him a surgeon?
• Why someone who can build (technically) a model having a data
frame is called a Data Scientist?
• Data Scientist is a profession – experience matters!
• People without experience usually don’t give any business value for
a company. Even after spending a year working with data (!)
Solution:
• Hire experienced people, especially in the beginning of a DS journey
• let them teach the freshmen
• But what is you don’t have experienced people?
• Invest time, effort, and money in your team. Let a more business
analyst control the team
The team will learn everything on online
courses
Problem:
• I give each of you $20 (ok, even $50) and learn everything online
• It’s true. The team will learn some things
• But not the most important ones
• A good hands-on training cannot be substituted
Solution:
• Learning by doing (and applying)
• Control and stimulate learning
• Buy knowledge
Summary
Summary
• To avoid mistakes it is good to ask ourselves these questions (and
answer them), e.g.:
• What business problem are we solving?
• What will be business value we can get from the results?
• What could be lost in translation fro business into analytics?
• Do we have adequate and representative data?
• What process does generate them? What are they influenced by?
• What is model building process?
• What analytical tools should be used? Could we apply simpler
approaches?
• How do we control all the risk?
• It is good to do it repeatedly
• It’s best to involve someone experienced
• It’s beneficial to educate the receivers of the results
Contact
Contact
• During the conference!
• After the conference: artur [at] quantup [dot] eu

Más contenido relacionado

La actualidad más candente

A Rapid Introduction to Rapid Software Testing
A Rapid Introduction to Rapid Software TestingA Rapid Introduction to Rapid Software Testing
A Rapid Introduction to Rapid Software Testing
TechWell
 
Problem solving section 1
Problem solving section 1Problem solving section 1
Problem solving section 1
dwyer1an
 
The 7 step problem solving methodology
The 7 step problem solving methodologyThe 7 step problem solving methodology
The 7 step problem solving methodology
quest_pune
 
Data scientist enablement dse 400 week 5 roadmap
Data scientist enablement   dse 400   week 5 roadmapData scientist enablement   dse 400   week 5 roadmap
Data scientist enablement dse 400 week 5 roadmap
Dr. Mohan K. Bavirisetty
 

La actualidad más candente (20)

Testing in the Wild
Testing in the WildTesting in the Wild
Testing in the Wild
 
Startup Operating Systems
Startup Operating SystemsStartup Operating Systems
Startup Operating Systems
 
Herman- Pieter Nijhof - Where Do Old Testers Go?
Herman- Pieter Nijhof - Where Do Old Testers Go?Herman- Pieter Nijhof - Where Do Old Testers Go?
Herman- Pieter Nijhof - Where Do Old Testers Go?
 
A Rapid Introduction to Rapid Software Testing
A Rapid Introduction to Rapid Software TestingA Rapid Introduction to Rapid Software Testing
A Rapid Introduction to Rapid Software Testing
 
Hiring a developer: step by step debugging
Hiring a developer: step by step debuggingHiring a developer: step by step debugging
Hiring a developer: step by step debugging
 
hypothesis driven development
hypothesis driven developmenthypothesis driven development
hypothesis driven development
 
Witness wednesdays informing agile software development with continuous user...
Witness wednesdays  informing agile software development with continuous user...Witness wednesdays  informing agile software development with continuous user...
Witness wednesdays informing agile software development with continuous user...
 
Better products faster: let's bring the user into the userstory // TAPOST_201...
Better products faster: let's bring the user into the userstory // TAPOST_201...Better products faster: let's bring the user into the userstory // TAPOST_201...
Better products faster: let's bring the user into the userstory // TAPOST_201...
 
Managing Data Science by David Martínez Rego
Managing Data Science by David Martínez RegoManaging Data Science by David Martínez Rego
Managing Data Science by David Martínez Rego
 
Problem solving section 1
Problem solving section 1Problem solving section 1
Problem solving section 1
 
The 7 step problem solving methodology
The 7 step problem solving methodologyThe 7 step problem solving methodology
The 7 step problem solving methodology
 
Principal as agent of change
Principal as agent of change Principal as agent of change
Principal as agent of change
 
The Pragmatic Agilist: estimating, improving quality, and communication with...
The Pragmatic Agilist: estimating, improving quality, and communication  with...The Pragmatic Agilist: estimating, improving quality, and communication  with...
The Pragmatic Agilist: estimating, improving quality, and communication with...
 
Binary crosswords
Binary crosswordsBinary crosswords
Binary crosswords
 
Problem solving skills
Problem solving skillsProblem solving skills
Problem solving skills
 
Write code and find a job
Write code and find a jobWrite code and find a job
Write code and find a job
 
eLearning Guild Online Forum - Application of the Thiagi Four-Door Model for ...
eLearning Guild Online Forum - Application of the Thiagi Four-Door Model for ...eLearning Guild Online Forum - Application of the Thiagi Four-Door Model for ...
eLearning Guild Online Forum - Application of the Thiagi Four-Door Model for ...
 
Using Problem Solving Skills To Get A Job
Using Problem Solving Skills To Get A JobUsing Problem Solving Skills To Get A Job
Using Problem Solving Skills To Get A Job
 
Data scientist enablement dse 400 week 5 roadmap
Data scientist enablement   dse 400   week 5 roadmapData scientist enablement   dse 400   week 5 roadmap
Data scientist enablement dse 400 week 5 roadmap
 
Dancing for a product release
Dancing for a product releaseDancing for a product release
Dancing for a product release
 

Similar a Artur Suchwalko “What are common mistakes in Data Science projects and how to avoid them?"

ACC presentation for QA Club Kiev
ACC presentation for QA Club KievACC presentation for QA Club Kiev
ACC presentation for QA Club Kiev
Nikita Knysh
 
SOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHub
SOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHubSOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHub
SOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHub
DevOpsDays Tel Aviv
 

Similar a Artur Suchwalko “What are common mistakes in Data Science projects and how to avoid them?" (20)

"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
 
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
"Startups, comment gérer une équipe de développeurs" par Laurent Cerveau
 
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 
Lecture13-Product-Development-PartI-Feb25-2018.pptx
Lecture13-Product-Development-PartI-Feb25-2018.pptxLecture13-Product-Development-PartI-Feb25-2018.pptx
Lecture13-Product-Development-PartI-Feb25-2018.pptx
 
ACC presentation for QA Club Kiev
ACC presentation for QA Club KievACC presentation for QA Club Kiev
ACC presentation for QA Club Kiev
 
Life in the tech trenches (2015)
Life in the tech trenches (2015)Life in the tech trenches (2015)
Life in the tech trenches (2015)
 
CTO Crunch avec Julien Simon, Viadeo
CTO Crunch avec Julien Simon, ViadeoCTO Crunch avec Julien Simon, Viadeo
CTO Crunch avec Julien Simon, Viadeo
 
SOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHub
SOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHubSOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHub
SOLVING MLOPS FROM FIRST PRINCIPLES, DEAN PLEBAN, DagsHub
 
Rex Sprint 0 - how build the data model with 2 BA and 3 IT architects
Rex Sprint 0 - how build the data model with 2 BA and 3 IT architectsRex Sprint 0 - how build the data model with 2 BA and 3 IT architects
Rex Sprint 0 - how build the data model with 2 BA and 3 IT architects
 
Roadmap
RoadmapRoadmap
Roadmap
 
Adam Ochs - Office 365 Roadmap
Adam Ochs - Office 365 RoadmapAdam Ochs - Office 365 Roadmap
Adam Ochs - Office 365 Roadmap
 
Mini-Training: Using root-cause analysis for problem management
Mini-Training: Using root-cause analysis for problem managementMini-Training: Using root-cause analysis for problem management
Mini-Training: Using root-cause analysis for problem management
 
Adopting innovation
Adopting innovationAdopting innovation
Adopting innovation
 
Shipping code is not the problem, deciding what to ship it is!
Shipping code is not the problem, deciding what to ship it is!Shipping code is not the problem, deciding what to ship it is!
Shipping code is not the problem, deciding what to ship it is!
 
FDS Unit I_PPT.pptx
FDS Unit I_PPT.pptxFDS Unit I_PPT.pptx
FDS Unit I_PPT.pptx
 
Product Management in the Era of Data Science
Product Management in the Era of Data ScienceProduct Management in the Era of Data Science
Product Management in the Era of Data Science
 
Adopting innovation
Adopting innovationAdopting innovation
Adopting innovation
 
Building Startups and Minimum Viable Products (NDC2013)
Building Startups and Minimum Viable Products (NDC2013)Building Startups and Minimum Viable Products (NDC2013)
Building Startups and Minimum Viable Products (NDC2013)
 
Cracking the Coding Interview (Oct 2012)
Cracking the Coding Interview (Oct 2012)Cracking the Coding Interview (Oct 2012)
Cracking the Coding Interview (Oct 2012)
 
20180324 zen and the art of programming
20180324 zen and the art of programming20180324 zen and the art of programming
20180324 zen and the art of programming
 

Más de Lviv Startup Club

Más de Lviv Startup Club (20)

Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...
Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...
Artem Bykovets: 4 Вершники апокаліпсису робочих стосунків (+антидоти до них) ...
 
Dmytro Khudenko: Challenges of implementing task managers in the corporate an...
Dmytro Khudenko: Challenges of implementing task managers in the corporate an...Dmytro Khudenko: Challenges of implementing task managers in the corporate an...
Dmytro Khudenko: Challenges of implementing task managers in the corporate an...
 
Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...
Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...
Sergii Melnichenko: Лідерство в Agile командах: ТОП-5 основних психологічних ...
 
Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...
Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...
Mariia Rashkevych: Підвищення ефективності розроблення та реалізації освітніх...
 
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
Mykhailo Hryhorash: What can be good in a "bad" project? (UA)
 
Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)
Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)
Oleksii Kyselov: Що заважає ПМу зростати? Розбір практичних кейсів (UA)
 
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
Yaroslav Osolikhin: «Неідеальний» проєктний менеджер: People Management під ч...
 
Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...
Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...
Mariya Yeremenko: Вплив Генеративного ШІ на сучасний світ та на особисту ефек...
 
Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...
Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...
Petro Nikolaiev & Dmytro Kisov: ТОП-5 методів дослідження клієнтів для успіху...
 
Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...
Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...
Maksym Stelmakh : Державні електронні послуги та сервіси: чому бізнесу варто ...
 
Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)
Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)
Alexander Marchenko: Проблеми росту продуктової екосистеми (UA)
 
Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...
Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...
Oleksandr Grytsenko: Save your Job або прокачай скіли до Engineering Manageme...
 
Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)
Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)
Yuliia Pieskova: Фідбек: не лише "як", але й "коли" і "навіщо" (UA)
 
Nataliya Kryvonis: Essential soft skills to lead your team (UA)
Nataliya Kryvonis: Essential soft skills to lead your team (UA)Nataliya Kryvonis: Essential soft skills to lead your team (UA)
Nataliya Kryvonis: Essential soft skills to lead your team (UA)
 
Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...
Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...
Volodymyr Salyha: Stakeholder Alchemy: Transforming Analysis into Meaningful ...
 
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...
Anna Chalyuk: 7 інструментів та принципів, які допоможуть зробити вашу команд...
 
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
Oksana Smilka: Цінності, цілі та (де) мотивація (UA)
 
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
Yaroslav Rozhankivskyy: Три складові і три передумови максимальної продуктивн...
 
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
Andrii Skoromnyi: Чому не працює методика "5 Чому?" – і яка є альтернатива? (UA)
 
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
Maryna Sokyrko & Oleksandr Chugui: Building Product Passion: Developing AI ch...
 

Último

Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 

Último (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 

Artur Suchwalko “What are common mistakes in Data Science projects and how to avoid them?"

  • 1. What are common mistakes in Data Science projects? (and how to avoid them?) Artur Suchwałko, Ph.D., QuantUp AI & Big Data 2018, March 10, 2018, Lviv, Ukraine
  • 3. Real-world Data Science projects • Kaggle competitions and real Data Science projects are two quite different disciplines • When a data frame is prepared then it’s easy • What is done not correctly and can be corrected? • Analysis of a business problem • Data • Process • Methods, models • Hardware, sofware • People (Everything based on practical experience: 20 years, 100 projects, 3,000 hours of workshops. For the majority of topics I could add quotes from talks.)
  • 4. Analysis of a business problem
  • 5. No. We don’t want to build a model of production and storage in our factory Problem: • We’d like just to optimize cutting a log (a trunk of a dead tree) into planks • Let’s do it in the simplest way. Why should we waste time and money? • The others can do it. Why do you make it complicated?!? Solution: • To build the production and storage model • Otherwise you will optimize log cutting in a different sawmill • or something completely different
  • 6. Solution of a wrong analytical problem Problem: • Stating of a wrong problem and solving it can decrease predictive ability of a model • Similarly, removing so called false predictors (leaks from future) • But we never want to have pure predictive power. Usually business wants actionability and real value Solution: • Focus on what influences your busines
  • 8. Preparation of a development sample is not very important Problem: • Let’s take a sample and model! • Preparation of the development sample decides if the model will fit the reality we model or not • The data and thus the sample is generated (or influenced) by a process that must be well known and understoo Solution: • Think it over really carefully.
  • 9. We have Big Data. We need to implement Big Data solutions Problem: • If you can email your data or fit it in a pendrive it means you don’t have Big Data! • Many Data Science tasks for millions of records can be completed using (powerful) laptops • Decisions are data-driven or not. It’s not about data magnitude but about way the decisions are taken Solution: • Be (more than) sure that we need Big Data technologies for storing and processing • During PoC / prototype stage don’t use Big Data tools • Important: Not valid for some problems
  • 10. Use social media data Problem: • It’s a tremendous effort if you don’t use an off-the-shelf solution • Usually business value is not big Solution: • Be sure that the effort will be rewarded
  • 12. Let’s build a model in one week Problem: • It’s possible (in theory) • If you don’t analyze the process thoughtfully and don’t detect false predictors then the model will not work in production • We will be really happy to see how well it performs on our development sample Solution: • Take enough time • Be sure that the process is correct
  • 13. There is too short time to complete the task / model Problem: • Data problems • Stucked in preprocessing • The implementation takes too long • Too short experience Solution: • Prepare a full product as soon as possible, e.g.: • cutting out all the functionalities, e.g. a scoring application with a simple / dummy model • a full code for building the model but using simpler methods • improve it in the next iterations • Using CRISP-DM / checklist to support your memory • Usually you can start implementation from the first product version
  • 14. Way you prepare the result (a model, a data product) doesn’t matter Problem: • I want a model. It must work. I don’t care how you’ll build it. Just build it! • The process is crucial • If it is wrong then the analysis is not fully reproducible • We take a technical debt • and sooner or later we will be forced to pay it back Solution: • Build models in a fully reproducible way
  • 15. Implementation – I’m sure it’ll work out somehow Problem: • Implementation without planned tests usually fail • What is really painful, it takes time to realize that they failed (a model works and generates risk) Solution: • Plan both, implementation and tests
  • 17. AI. We desperately need AI! Problem: • We don’t need • Predictive modeling is not AI! • It happens that full control over a model is more important than predictive power Solution: • Let’s think what we’d like to achieve and how to do this • Data-driven decision making is more important
  • 18. A model just learns everything it is exposed to Problem: • You need to promise self-learning to sell a service / a software • But it will not learn automatically if not fed by suitable data • In many situations you don’t have such data to design a feedback loop Solution: • Analyze a process that generates the data for the development sample • Put aside a “not touched” sample • The model will be taught using a sample and refined in an ongoing way
  • 19. Start modeling from using Deep Learning! Problem: • But everybody uses it… • No!!! • Many problems are too simple for DL • In particular, the problems with data in a data frame Solution: • Random Forest, xgboost
  • 20. If we have 3000 classes then let’s build a BIG classifier Problem: • For example when we’d like to recommend bank products • Such a random classifier has error 2999/3000 = 99.97% (not 50%) • Usually the dataset is too small Solution: • It’s good to use a simpler method (usually)
  • 22. You can do calculations using a laptop Problem: • Sometimes yes, you can • But usually you cannot • Usually it doesn’t make any sense – human’s time is more expensive that machine’s time Solution: • It is good to invest some money in hardware • or use AWS from Amazon (or something similar)
  • 23. Commercial software is excellent Problem: • Users often tell that it is excellent unless bought • The problems appear later Solution: • Test it in similar conditions it will be used • Think seriously about using open source
  • 24. Free software is excellent (and it’s free!) Problem: • It’s free – in terms of a buying cost • It’s not just excellent – the cost is neccessity to have qualified people onboard and to develop software • There happen inconvenient problems Solution: • Use as it should be used • i.e. write clear and clean code, use additional tools, e.g. VCS • Take care of the team to have the skills needed
  • 26. All companies have Data Science teams. Let’s build one for us! Problem: • It’s possible to build a team. It will take a lot of time and lots of money. • If the results will be wasted then the people will leave • They need to have fun working on projects • If I need a plank then do I really need to buy a sawmill? Solution:: • Be sure that: • we know how to use their results • it will give value to the business • PoC can be outsourced. The first data science project can be outsourced.
  • 27. A student or a freshman is enough to give profits from deep analytics to business Problem: • If someone can cut with a scalpel then will we call him a surgeon? • Why someone who can build (technically) a model having a data frame is called a Data Scientist? • Data Scientist is a profession – experience matters! • People without experience usually don’t give any business value for a company. Even after spending a year working with data (!) Solution: • Hire experienced people, especially in the beginning of a DS journey • let them teach the freshmen • But what is you don’t have experienced people? • Invest time, effort, and money in your team. Let a more business analyst control the team
  • 28. The team will learn everything on online courses Problem: • I give each of you $20 (ok, even $50) and learn everything online • It’s true. The team will learn some things • But not the most important ones • A good hands-on training cannot be substituted Solution: • Learning by doing (and applying) • Control and stimulate learning • Buy knowledge
  • 30. Summary • To avoid mistakes it is good to ask ourselves these questions (and answer them), e.g.: • What business problem are we solving? • What will be business value we can get from the results? • What could be lost in translation fro business into analytics? • Do we have adequate and representative data? • What process does generate them? What are they influenced by? • What is model building process? • What analytical tools should be used? Could we apply simpler approaches? • How do we control all the risk? • It is good to do it repeatedly • It’s best to involve someone experienced • It’s beneficial to educate the receivers of the results
  • 32. Contact • During the conference! • After the conference: artur [at] quantup [dot] eu