Closing the Loop
Evaluating Big Data Analysis
Karolina Alexiou
About
The speaker
● ETH graduate
● Joined Teralytics in September 2013
● Data Scientist/Software Engineer
The talk (takeaways)
● Point out how evaluation can improve your project
● Suggest concrete steps to build an evaluation
framework
The value of evaluation
Data analysis can be fun and exploratory, BUT:
“If you torture the data long enough,
it will confess to anything.”
-Ronald Coase, economist
The value of evaluation
Without feedback on the results of the data analysis (that is, without
closing the loop), I don’t know whether my fancy algorithm is better
than a naive one.
How to measure?
Strategy
People-driven
● Get a 2nd opinion on your methodology
Data-driven
● Get another data source to verify results (ground truth)
● Convert ground truth and your output to the same
format
● Compare using a meaningful metric
● Store & visualize results
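The data-driven steps above can be sketched as a small evaluation loop. This is only an illustration under stated assumptions, not the framework used in the talk: the names `EvaluationResult`, `evaluate`, and `mean_absolute_error`, the dictionary-based "common format", and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed common format: a mapping from a key (e.g. a timestamp) to a value.
Series = Dict[str, float]


@dataclass
class EvaluationResult:
    run_label: str    # which version of the analysis produced the output
    metric_name: str
    score: float
    n_compared: int   # number of keys present in both sources


def evaluate(
    run_label: str,
    output: Series,
    ground_truth: Series,
    metric: Callable[[List[float], List[float]], float],
    metric_name: str,
    history: List[EvaluationResult],
) -> EvaluationResult:
    """Compare output with ground truth on their shared keys and record the score."""
    shared = sorted(set(output) & set(ground_truth))
    if not shared:
        raise ValueError("output and ground truth have no keys in common")
    estimated = [output[k] for k in shared]
    reference = [ground_truth[k] for k in shared]
    result = EvaluationResult(
        run_label, metric_name, metric(estimated, reference), len(shared)
    )
    history.append(result)  # stored so runs can be compared and plotted later
    return result


def mean_absolute_error(estimated: List[float], reference: List[float]) -> float:
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(reference)


# Hypothetical sample data: one ground-truth series and two competing outputs.
history: List[EvaluationResult] = []
truth = {"08:00": 100.0, "08:03": 80.0, "08:06": 60.0}
naive = {"08:00": 90.0, "08:03": 90.0, "08:06": 90.0}
fancy = {"08:00": 98.0, "08:03": 84.0, "08:06": 63.0, "08:09": 70.0}

evaluate("naive", naive, truth, mean_absolute_error, "mae", history)
evaluate("fancy", fancy, truth, mean_absolute_error, "mae", history)
```

Keeping every run in `history` is what makes it possible to tell whether the fancy algorithm actually beats the naive one, rather than assuming it does.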
General evaluation framework
General evaluation framework
Statistical significance?
Teralytics Case Study: Congestion
Estimation
Ongoing project: use cellular data to
estimate traffic and congestion on Swiss roads
Our estimates: mean speed on a highway at
a given time and location
Ground truth
● Complex algorithm with lots of knobs and subproblems
● How to know we’re changing things for the better?
● Collect ground truth regarding road traffic in Switzerland
-> sensor data available from 3rd party site
● Write a hackish script to log in to the website and fetch sensor
data that matches our highway locations
● Instant sense of purpose :)
Same format
Not just a data architecture problem.
● Our algorithm’s speed estimations are fancy averages
of distance/time_needed_for_distance (journey speed)
● Sensor data reports instantaneous speed.
● Sensors are probably going to report higher speeds
systematically (bias).
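One standard reason for such a bias can be shown with a small worked example: averaging instantaneous speeds at a point gives a value at least as high as total distance divided by total travel time. The sketch below assumes two hypothetical vehicles crossing the same 1 km segment. The function names are illustrative, and the slides do not say that this is the exact mechanism in the project.

```python
def spot_mean_speed(speeds_kmh):
    """Arithmetic mean of instantaneous speeds, as a point sensor might average them."""
    return sum(speeds_kmh) / len(speeds_kmh)


def journey_speed(speeds_kmh, segment_km=1.0):
    """Total distance divided by total travel time over the same segment."""
    total_distance_km = segment_km * len(speeds_kmh)
    total_time_h = sum(segment_km / v for v in speeds_kmh)
    return total_distance_km / total_time_h


speeds = [60.0, 120.0]                  # hypothetical vehicles, km/h
sensor_view = spot_mean_speed(speeds)   # 90 km/h
journey_view = journey_speed(speeds)    # about 80 km/h
```

The slower vehicle spends more time on the segment, so it weighs more heavily in the journey speed. The two numbers therefore differ even with perfect measurements, which is why the two sources need to be converted to the same format before comparing them.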
Comparing against metric
● Group data every 3 minutes
● Metric: percentage of data points where the
difference between ground truth and
estimate is below 7%
● Other options
○ linear correlation of time-series of speed
○ cross-correlation to find optimal time shift
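A minimal pandas sketch of the metric and of the time-shift search follows. It rests on assumptions the slides do not state: the difference is taken relative to the ground truth, values within a 3-minute bin are averaged, and the time shift is found by computing the Pearson correlation at each candidate lag. The function names and sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def within_tolerance_share(estimate: pd.Series, truth: pd.Series,
                           bin_size: str = "3min", tolerance: float = 0.07) -> float:
    """Percentage of time bins where |estimate - truth| / truth is below the tolerance."""
    est = estimate.resample(bin_size).mean()
    ref = truth.resample(bin_size).mean()
    both = pd.concat({"est": est, "ref": ref}, axis=1).dropna()
    rel_diff = (both["est"] - both["ref"]).abs() / both["ref"]
    return 100.0 * (rel_diff < tolerance).mean()


def best_lag(estimate: pd.Series, truth: pd.Series, max_lag: int = 5) -> int:
    """Shift (in bins) applied to the estimate that maximises correlation with truth.

    A negative result means the estimate has to be moved earlier to line up.
    """
    scores = pd.Series({
        lag: truth.corr(estimate.shift(lag))
        for lag in range(-max_lag, max_lag + 1)
    })
    return int(scores.idxmax())  # idxmax skips lags whose correlation is NaN


# Hypothetical minute-level data covering four 3-minute bins.
idx = pd.date_range("2014-01-01 08:00", periods=12, freq="min")
truth = pd.Series(100.0, index=idx)
estimate = pd.Series(np.repeat([102.0, 95.0, 110.0, 100.0], 3), index=idx)
share = within_tolerance_share(estimate, truth)  # 3 of 4 bins are within 7%

# Hypothetical binned series where the estimate trails the truth by two bins.
bins = pd.date_range("2014-01-01 08:00", periods=40, freq="3min")
ref = pd.Series(np.sin(np.arange(40) / 3.0), index=bins)
late = ref.shift(2)
lag = best_lag(late, ref)
```

A single percentage is easy to track across runs, while the lag search helps separate "the estimate is wrong" from "the estimate is right but shifted in time".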
Pitfalls of comparison
● Overfitting to ground truth
● Correlation may be statistically insignificant
Need proper methodology (training set/testing
set) & adequate amounts of ground truth
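Both pitfalls can be addressed with a short sketch: hold out part of the ground truth that is never used for tuning, and check whether a correlation could plausibly arise by chance. The chronological split and the permutation test below are one possible approach, not the method used in the project. All names and the synthetic data are hypothetical. Note also that a permutation test ignores autocorrelation in time series, so its p-values tend to be optimistic.

```python
import numpy as np


def time_split(values, holdout_fraction=0.25):
    """Split chronologically ordered data into a tuning part and a held-out part."""
    cut = len(values) - int(len(values) * holdout_fraction)
    return values[:cut], values[cut:]


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of random pairings whose |correlation| reaches the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        if abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Synthetic stand-ins for ground truth and estimate (not real measurements).
noise = np.random.default_rng(1).normal(0.0, 3.0, 200)
truth = 90.0 + 30.0 * np.sin(np.arange(200) / 10.0)
estimate = truth + noise

# Tune knobs on the first part only; report the score on the held-out part.
truth_tune, truth_test = time_split(truth)
est_tune, est_test = time_split(estimate)
p_value = permutation_p_value(est_test, truth_test)
```

The less ground truth there is in the held-out part, the larger the p-value for the same correlation, which is one way to see why adequate amounts of ground truth matter.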
Visualization
● Instant feedback on
what is working and
what is not.
● Insights
○ on assumptions
○ on quality of data sources
○ presence of time shift
Lessons learned
Ground truth isn’t easy to get
● No API - web scraping
● May be biased
● May have to create it yourself
Lessons learned
Use the right tools
● The output of a Big Data analysis problem is of a more manageable size ->
no need to overengineer; Python is a good fit for the job
● Need to be able to handle missing data, add constraints,
average, and interpolate -> use an existing library (pandas) with useful abstractions
● Crucial to be able to pinpoint what goes wrong -> interactivity (IPython),
logging
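As a rough illustration of the pandas abstractions mentioned above, the sketch below applies a plausibility constraint, averages readings into regular bins, and interpolates missing bins. The function name, the thresholds, and the sample readings are assumptions made for this example only.

```python
import numpy as np
import pandas as pd


def clean_speed_series(raw: pd.Series, bin_size: str = "3min",
                       max_gap_bins: int = 2,
                       min_speed: float = 0.0,
                       max_speed: float = 200.0) -> pd.Series:
    """Put speed readings on a regular grid, mask implausible values, fill gaps."""
    # Constraint: readings outside the plausible range become missing values.
    plausible = raw.where((raw >= min_speed) & (raw <= max_speed))
    # Average: one mean value per bin; bins without valid readings stay NaN.
    binned = plausible.resample(bin_size).mean()
    # Interpolate: fill at most max_gap_bins consecutive bins per gap,
    # and only between valid values (longer gaps are filled only partially).
    return binned.interpolate(method="time", limit=max_gap_bins, limit_area="inside")


# Hypothetical minute-level readings with a gap and one implausible spike.
idx = pd.date_range("2014-01-01 08:00", periods=15, freq="min")
raw = pd.Series(
    [100, 100, 100, np.nan, np.nan, np.nan, 80, 999, 80, 70, 70, 70, 60, 60, 60],
    index=idx, dtype=float,
)
clean = clean_speed_series(raw)
```

Here the spike of 999 is dropped before averaging, and the empty bin at 08:03 is filled with the value halfway between its neighbours.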
Lessons learned
Use the right workflow
● Run the whole thing at once for timely feedback
● Always visualize -> large CSVs are hard to make sense
of (false sense of security)
● Iterative development pays off & is sped up by
automated evaluation :)
Action Points
Ask questions
● Is there some part of my data analysis where my
results are unverified?
● Am I using the right tools to evaluate?
● Is overengineering getting in the way of quick & timely
feedback?
Action Points
Make a plan
● What ground truth can I get or create?
● How can I make sure I am comparing apples to apples?
● How should I compare my data to the ground truth
(metric, comparison method)?
● What’s the best visualization to show correlation?
Recommended Reading
● Excellent abstractions for data
cleaning & transformation
● Good performance
● Portable data formats
● Increases productivity
● Plus IPython for easy exploration of
the data (more insight, what
went wrong, etc.)
It takes some time to learn to use the
full power of pandas - so get your
data scientists to learn it asap. :)
Recommended Reading
● Even new companies have
“legacy” code (code that is
blocking change)
● Acknowledges the imperfection
of the real world (even if design
is good, problems may arise)
● Acknowledges the value of
quick feedback in dev
productivity
● Case-by-case scenarios to
unblock yourself and be able to
evaluate your code
Recommended Reading
Thanks
I would like to thank my colleagues for making
good decisions, in particular
● Valentin for introducing pandas to Teralytics
● Nima for organizing the collection of ground truth on
several projects
● Laurent for insisting on testing & best practices
Questions?
We are hiring :)
Looking for Machine Learning/Big Data experts
Experience with pandas is a plus
Just send your CV to recruiting@teralytics.net
Bonus Recommended Reading
Evaluation of impact of
charity organizations is a
hard, unsolved problem
involving data
● transparency
● more motivation to
give

Más contenido relacionado

La actualidad más candente

Data Science Salon: Building a Data Science Culture
Data Science Salon: Building a Data Science CultureData Science Salon: Building a Data Science Culture
Data Science Salon: Building a Data Science CultureFormulatedby
 
Lifecycle of a Data Science Project
Lifecycle of a Data Science ProjectLifecycle of a Data Science Project
Lifecycle of a Data Science ProjectDigital Vidya
 
Anchormen corne versloot
Anchormen corne verslootAnchormen corne versloot
Anchormen corne verslootBigDataExpo
 
Data science team, a practice to setup
Data science team, a practice to setupData science team, a practice to setup
Data science team, a practice to setupOmid Mogharian
 
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoH2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoSri Ambati
 
Microsoft jeroen ter heerdt
Microsoft jeroen ter heerdtMicrosoft jeroen ter heerdt
Microsoft jeroen ter heerdtBigDataExpo
 
Data Science Salon: Digital Transformation: The Data Science Catalyst
Data Science Salon: Digital Transformation: The Data Science CatalystData Science Salon: Digital Transformation: The Data Science Catalyst
Data Science Salon: Digital Transformation: The Data Science CatalystFormulatedby
 
The Five Data Questions
The Five Data QuestionsThe Five Data Questions
The Five Data Questionscrystalpullen
 
Operationalizing Data Science: The Right Architecture and Tools
Operationalizing Data Science: The Right Architecture and ToolsOperationalizing Data Science: The Right Architecture and Tools
Operationalizing Data Science: The Right Architecture and ToolsVMware Tanzu
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseFormulatedby
 
Five Pitfalls when Operationalizing Data Science and a Strategy for Success
Five Pitfalls when Operationalizing Data Science and a Strategy for SuccessFive Pitfalls when Operationalizing Data Science and a Strategy for Success
Five Pitfalls when Operationalizing Data Science and a Strategy for SuccessVMware Tanzu
 
data scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st centurydata scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st centuryFrank Kienle
 
1645 track 1 bress_using his laptop
1645 track 1 bress_using his laptop1645 track 1 bress_using his laptop
1645 track 1 bress_using his laptopRising Media, Inc.
 
H2O World - Learning How Humans and Non-Humans Interact with Digital Ads
H2O World - Learning How Humans and Non-Humans Interact with Digital AdsH2O World - Learning How Humans and Non-Humans Interact with Digital Ads
H2O World - Learning How Humans and Non-Humans Interact with Digital AdsSri Ambati
 
Big data expo - machine learning in the elastic stack
Big data expo - machine learning in the elastic stack Big data expo - machine learning in the elastic stack
Big data expo - machine learning in the elastic stack BigDataExpo
 
Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!Edureka!
 
Top career opportunities in data science
Top career opportunities in data scienceTop career opportunities in data science
Top career opportunities in data scienceTanyaAgarwal71
 
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...Formulatedby
 

La actualidad más candente (20)

Vikrant data scientist
Vikrant data scientistVikrant data scientist
Vikrant data scientist
 
Data Science Salon: Building a Data Science Culture
Data Science Salon: Building a Data Science CultureData Science Salon: Building a Data Science Culture
Data Science Salon: Building a Data Science Culture
 
Lifecycle of a Data Science Project
Lifecycle of a Data Science ProjectLifecycle of a Data Science Project
Lifecycle of a Data Science Project
 
Data science
Data scienceData science
Data science
 
Anchormen corne versloot
Anchormen corne verslootAnchormen corne versloot
Anchormen corne versloot
 
Data science team, a practice to setup
Data science team, a practice to setupData science team, a practice to setup
Data science team, a practice to setup
 
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoH2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
 
Microsoft jeroen ter heerdt
Microsoft jeroen ter heerdtMicrosoft jeroen ter heerdt
Microsoft jeroen ter heerdt
 
Data Science Salon: Digital Transformation: The Data Science Catalyst
Data Science Salon: Digital Transformation: The Data Science CatalystData Science Salon: Digital Transformation: The Data Science Catalyst
Data Science Salon: Digital Transformation: The Data Science Catalyst
 
The Five Data Questions
The Five Data QuestionsThe Five Data Questions
The Five Data Questions
 
Operationalizing Data Science: The Right Architecture and Tools
Operationalizing Data Science: The Right Architecture and ToolsOperationalizing Data Science: The Right Architecture and Tools
Operationalizing Data Science: The Right Architecture and Tools
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
Five Pitfalls when Operationalizing Data Science and a Strategy for Success
Five Pitfalls when Operationalizing Data Science and a Strategy for SuccessFive Pitfalls when Operationalizing Data Science and a Strategy for Success
Five Pitfalls when Operationalizing Data Science and a Strategy for Success
 
data scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st centurydata scientist the sexiest job of the 21st century
data scientist the sexiest job of the 21st century
 
1645 track 1 bress_using his laptop
1645 track 1 bress_using his laptop1645 track 1 bress_using his laptop
1645 track 1 bress_using his laptop
 
H2O World - Learning How Humans and Non-Humans Interact with Digital Ads
H2O World - Learning How Humans and Non-Humans Interact with Digital AdsH2O World - Learning How Humans and Non-Humans Interact with Digital Ads
H2O World - Learning How Humans and Non-Humans Interact with Digital Ads
 
Big data expo - machine learning in the elastic stack
Big data expo - machine learning in the elastic stack Big data expo - machine learning in the elastic stack
Big data expo - machine learning in the elastic stack
 
Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!Is Data Scientist still the sexiest job of 21st century? Find Out!
Is Data Scientist still the sexiest job of 21st century? Find Out!
 
Top career opportunities in data science
Top career opportunities in data scienceTop career opportunities in data science
Top career opportunities in data science
 
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
Data Science Salon: Culture, Data Engineering and Hamburger Stands: Thoughts ...
 

Destacado

Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
demo_teralytics
demo_teralyticsdemo_teralytics
demo_teralyticsKuhan Wang
 
Build agile and elastic data pipeline
Build agile and elastic data pipelineBuild agile and elastic data pipeline
Build agile and elastic data pipelineDeba Chatterjee
 
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ..."Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...Dataconomy Media
 
Let's Build a Service Oriented Data Pipeline!
Let's Build a Service Oriented Data Pipeline!Let's Build a Service Oriented Data Pipeline!
Let's Build a Service Oriented Data Pipeline!Yasha Podeswa
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and FlinkBryan Bende
 
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...DataStax
 
Designing Data Pipelines Using Hadoop
Designing Data Pipelines Using HadoopDesigning Data Pipelines Using Hadoop
Designing Data Pipelines Using HadoopDataWorks Summit
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA SummitOpen Analytics
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
NATS vs HTTP
NATS vs HTTPNATS vs HTTP
NATS vs HTTPApcera
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsJohann Schleier-Smith
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiLev Brailovskiy
 
Esri 2016 User Conference - ArcGIS Online steps for success
Esri 2016 User Conference - ArcGIS Online steps for successEsri 2016 User Conference - ArcGIS Online steps for success
Esri 2016 User Conference - ArcGIS Online steps for successBern Szukalski
 
eMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleeMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleCraig Sullivan
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
Machine Learning system architecture – Microsoft Translator, a Case Study :  ...Machine Learning system architecture – Microsoft Translator, a Case Study :  ...
Machine Learning system architecture – Microsoft Translator, a Case Study : ...Vishal Chowdhary
 

Destacado (20)

Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
demo_teralytics
demo_teralyticsdemo_teralytics
demo_teralytics
 
Build agile and elastic data pipeline
Build agile and elastic data pipelineBuild agile and elastic data pipeline
Build agile and elastic data pipeline
 
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ..."Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
"Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & ...
 
Let's Build a Service Oriented Data Pipeline!
Let's Build a Service Oriented Data Pipeline!Let's Build a Service Oriented Data Pipeline!
Let's Build a Service Oriented Data Pipeline!
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
 
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
Webinar - How to Build Data Pipelines for Real-Time Applications with SMACK &...
 
Designing Data Pipelines Using Hadoop
Designing Data Pipelines Using HadoopDesigning Data Pipelines Using Hadoop
Designing Data Pipelines Using Hadoop
 
Luigi presentation OA Summit
Luigi presentation OA SummitLuigi presentation OA Summit
Luigi presentation OA Summit
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
NATS vs HTTP
NATS vs HTTPNATS vs HTTP
NATS vs HTTP
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time Applications
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Esri 2016 User Conference - ArcGIS Online steps for success
Esri 2016 User Conference - ArcGIS Online steps for successEsri 2016 User Conference - ArcGIS Online steps for success
Esri 2016 User Conference - ArcGIS Online steps for success
 
eMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleeMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype Cycle
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
Machine Learning system architecture – Microsoft Translator, a Case Study :  ...Machine Learning system architecture – Microsoft Translator, a Case Study :  ...
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
 
Evaluation of big data analysis
Evaluation of big data analysisEvaluation of big data analysis
Evaluation of big data analysis
 

Similar a Closing The Loop for Evaluating Big Data Analysis

Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...Lauren Cormack
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with PythonBenjamin Bengfort
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Benjamin Bengfort
 
Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Matthias Schuurmans
 
How to program your way into data science?
How to program your way into data science?How to program your way into data science?
How to program your way into data science?DeZyre
 
Webinar on Big Data Challenges : Presented by Raj Kasturi
Webinar on Big Data Challenges : Presented by Raj KasturiWebinar on Big Data Challenges : Presented by Raj Kasturi
Webinar on Big Data Challenges : Presented by Raj KasturioGuild .
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in MalaysiaAhmed Elmalla
 
Analytics Lessons Learnt
Analytics Lessons Learnt Analytics Lessons Learnt
Analytics Lessons Learnt Venkata Pingali
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2Roger Barga
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Building a successful data organization nov 2018
Building a successful data organization   nov 2018Building a successful data organization   nov 2018
Building a successful data organization nov 2018Alejandro Cantarero
 
Architecting for analytics
Architecting for analyticsArchitecting for analytics
Architecting for analyticsRob Winters
 
Landing a career in data science
Landing a career in data scienceLanding a career in data science
Landing a career in data scienceParul Pandey
 
How to become a data scientist
How to become a data scientist How to become a data scientist
How to become a data scientist Manjunath Sindagi
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Ali Alkan
 

Similar a Closing The Loop for Evaluating Big Data Analysis (20)

Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
GIAF UK Winter 2015 - Analytical techniques: A practical guide to answering b...
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)
 
Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22Behind The Scenes Data Science Coolblue 2018-03-22
Behind The Scenes Data Science Coolblue 2018-03-22
 
How to program your way into data science?
How to program your way into data science?How to program your way into data science?
How to program your way into data science?
 
Webinar on Big Data Challenges : Presented by Raj Kasturi
Webinar on Big Data Challenges : Presented by Raj KasturiWebinar on Big Data Challenges : Presented by Raj Kasturi
Webinar on Big Data Challenges : Presented by Raj Kasturi
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
Analytics Lessons Learnt
Analytics Lessons Learnt Analytics Lessons Learnt
Analytics Lessons Learnt
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
 
Demystifying Data Science
Demystifying Data ScienceDemystifying Data Science
Demystifying Data Science
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Building a successful data organization nov 2018
Building a successful data organization   nov 2018Building a successful data organization   nov 2018
Building a successful data organization nov 2018
 
Architecting for analytics
Architecting for analyticsArchitecting for analytics
Architecting for analytics
 
Landing a career in data science
Landing a career in data scienceLanding a career in data science
Landing a career in data science
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
How to become a data scientist
How to become a data scientist How to become a data scientist
How to become a data scientist
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
 

Más de Swiss Big Data User Group

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useSwiss Big Data User Group
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorSwiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesSwiss Big Data User Group
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseSwiss Big Data User Group
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexitySwiss Big Data User Group
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceSwiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketSwiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridSwiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseSwiss Big Data User Group
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computingSwiss Big Data User Group
 

Closing The Loop for Evaluating Big Data Analysis

  • 1. Closing the Loop Evaluating Big Data Analysis Karolina Alexiou
  • 2. About The speaker ● ETH graduate ● Joined Teralytics in September 2013 ● Data Scientist/Software Engineer The talk (takeaways) ● Point out how evaluation can improve your project ● Suggest concrete steps to build an evaluation framework
  • 3. The value of evaluation Data analysis can be fun and exploratory, BUT: “If you torture the data long enough, it will confess to anything.” -Ronald Coase, economist
  • 4. The value of evaluation Without feedback on the data analysis results, (=closing the loop) I don’t know whether my fancy algorithm is better than a naive one. How to measure?
  • 5. Strategy People-driven ● Get a 2nd opinion on your methodology Data-driven ● Get another data source to verify results (ground truth) ● Convert ground truth and your output to the same format ● Compare against meaningful metric ● Store & visualize results
  • 8. Teralytics Case Study: Congestion Estimation Ongoing project: Use of cellular data to estimate traffic/congestion on Swiss roads Our estimations: Mean speed on a highway at a given time and location
  • 9. Ground truth ● Complex algorithm with lots of knobs and subproblems ● How to know we’re changing things for the better? ● Collect ground truth regarding road traffic in Switzerland -> sensor data available from 3rd party site ● Write hackish script to log in to the website and fetch sensor data that matches our highway locations ● Instant sense of purpose :)
  • 10. Same format Not just a data architecture problem. ● Our algorithm’s speed estimations are fancy averages of distance/time_needed_for_distance (journey speed) ● Sensor data reports instantaneous speed. ● Sensors are probably going to report higher speeds systematically (bias).
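One possible source of the bias mentioned on this slide can be shown with a toy calculation. This is a minimal sketch under stated assumptions, not the method used in the project: it assumes a sensor averages the spot speeds of passing vehicles (arithmetic mean), while journey speed over a fixed segment is total distance divided by total travel time, which works out to the harmonic mean of the individual speeds. The vehicle speeds and the segment length are made up.

```python
from statistics import harmonic_mean, mean

# Hypothetical spot speeds (km/h) of vehicles that each drive the same segment.
speeds_kmh = [60.0, 80.0, 120.0]
segment_km = 2.0  # assumed segment length

# Sensor-style value: arithmetic mean of the instantaneous speeds.
sensor_speed = mean(speeds_kmh)

# Journey speed: total distance divided by total travel time.
total_time_h = sum(segment_km / v for v in speeds_kmh)
journey_speed = len(speeds_kmh) * segment_km / total_time_h

print(f"sensor-style mean speed: {sensor_speed:.1f} km/h")
print(f"journey speed:           {journey_speed:.1f} km/h")
print(f"harmonic mean of speeds: {harmonic_mean(speeds_kmh):.1f} km/h")
```

The arithmetic mean is never lower than the harmonic mean, so under these assumptions the sensor-style value sits at or above the journey speed even when both sources observe the same vehicles.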
  • 11. Comparing against metric ● Group data every 3 minutes ● Metric: Percentage of data where the difference between ground truth and estimation is <7% ● Other options ○ linear correlation of time-series of speed ○ cross-correlation to find optimal time shift
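The metric and the two alternatives on this slide could be sketched with pandas roughly as follows. The function names, the sample values and the choice to average within each 3-minute window are assumptions made for illustration; the slides do not say how the grouping or the time-shift search was implemented.

```python
import pandas as pd


def within_tolerance_share(truth, estimate, tolerance=0.07, window="3min"):
    """Share of time windows where |estimate - truth| / truth is below tolerance."""
    frame = pd.DataFrame({"truth": truth, "estimate": estimate})
    # Group both series into fixed time windows and average within each window.
    grouped = frame.resample(window).mean().dropna()
    relative_error = (grouped["estimate"] - grouped["truth"]).abs() / grouped["truth"]
    return float((relative_error < tolerance).mean())


def best_time_shift(truth, estimate, max_shift=3):
    """Shift (in samples) of the estimate that maximizes linear correlation.

    A negative result means the estimate lags behind the ground truth.
    """
    shifts = range(-max_shift, max_shift + 1)
    return max(shifts, key=lambda s: truth.corr(estimate.shift(s)))


# Hypothetical one-minute speed readings (km/h) for twelve minutes.
index = pd.date_range("2014-01-01 08:00", periods=12, freq="1min")
truth = pd.Series([100.0] * 3 + [90.0] * 3 + [60.0] * 3 + [80.0] * 3, index=index)
estimate = pd.Series([98.0] * 3 + [92.0] * 3 + [50.0] * 3 + [79.0] * 3, index=index)

share = within_tolerance_share(truth, estimate)
print(f"share of 3-minute windows within 7%: {share:.2f}")
print(f"linear correlation: {truth.corr(estimate):.3f}")
print(f"best shift of estimate (samples): {best_time_shift(truth, estimate)}")
```

In this made-up data three of the four windows fall within the 7% tolerance; the congested window, where the estimate is 50 km/h against a ground truth of 60 km/h, does not.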
  • 12. Pitfalls of comparison ● Overfitting to ground truth ● Correlation may be statistically insignificant Need proper methodology (training set/testing set) & adequate amounts of ground truth
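One way to guard against overfitting to the ground truth is to tune the algorithm on one part of it and report the metric only on a held-out part. The sketch below assumes the ground truth is a list of timestamped records and that a chronological split is acceptable; the function name and the sample records are hypothetical.

```python
def chronological_split(records, train_fraction=0.75):
    """Split timestamped records into a tuning set and a later held-out set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # Sort by timestamp so the held-out set is strictly later than the tuning set.
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical (minute, sensor speed in km/h) ground-truth records.
ground_truth = [(minute, 100.0 - 5.0 * minute) for minute in range(8)]
tuning_set, held_out_set = chronological_split(ground_truth)
print(len(tuning_set), len(held_out_set))
```

Knobs are adjusted only against `tuning_set`; the score on `held_out_set` then indicates whether an improvement generalizes.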
  • 13. Visualization ● Instant feedback on what is working and what is not. ● Insights ○ on assumptions ○ on quality of data sources ○ presence of time shift
  • 14. Lessons learned Ground truth isn’t easy to get ● No API - web scraping ● May be biased ● May have to create it yourself
  • 15. Lessons learned Use the right tools ● The output of a Big Data analysis problem is of more manageable size -> no need to overengineer, Python is a good fit for the job ● Need to be able to handle missing data, add constraints, average and interpolate -> use an existing library (pandas) with useful abstractions ● Crucial to be able to pinpoint what goes wrong -> interactivity (IPython), logging
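The operations listed on this slide (missing data, constraints, interpolation, averaging) map onto a few pandas calls. A minimal sketch with made-up sensor readings; the plausibility bounds of 0 to 200 km/h are an assumption for illustration only.

```python
import pandas as pd

index = pd.date_range("2014-01-01 08:00", periods=6, freq="1min")
# Hypothetical sensor speeds (km/h): two missing readings and one glitch (999).
raw = pd.Series([100.0, float("nan"), 90.0, 999.0, float("nan"), 60.0], index=index)

# Constraint: treat physically implausible speeds as missing.
plausible = raw.where(raw.between(0.0, 200.0))

# Fill the gaps by linear interpolation between neighbouring readings.
filled = plausible.interpolate()

# Average into 3-minute windows.
averaged = filled.resample("3min").mean()
print(averaged)
```

Whether interpolating across a gap is appropriate depends on how long the gap is and on how fast traffic conditions change, so this step deserves the same scrutiny as the algorithm being evaluated.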
  • 16. Lessons learned Use the right workflow ● Run the whole thing at once for timely feedback ● Always visualize -> large CSVs are hard to make sense of (false sense of security) ● Iterative development pays off & is sped up by automated evaluation :)
  • 17. Action Points Ask questions ● Is there some part of my data analysis where my results are unverified? ● Am I using the right tools to evaluate? ● Is overengineering getting in the way of quick & timely feedback?
  • 18. Action Points Make a plan ● What ground truth can I get or create? ● How can I make sure I am comparing apples to apples? ● How should I compare my data to the ground truth (metric, comparison method)? ● What’s the best visualization to show correlation?
  • 19. Recommended Reading ● Excellent abstractions for data cleaning & transformation ● Good performance ● Portable data formats ● Increases productivity ● +ipython for easy exploring of the data (more insight, what went wrong etc) It takes some time to learn to use the full power of pandas - so get your data scientists to learn it asap. :)
  • 20. Recommended Reading ● Even new companies have “legacy” code (code that is blocking change) ● Acknowledges the imperfection of the real world (even if design is good, problems may arise) ● Acknowledges the value of quick feedback in dev productivity ● Case-by-case scenarios to unblock yourself and be able to evaluate your code
  • 22. Thanks I would like to thank my colleagues for making good decisions, in particular ● Valentin for introducing pandas to Teralytics ● Nima for organizing the collection of ground truth on several projects ● Laurent for insisting on testing & best practices
  • 23. Questions? We are hiring :) Looking for Machine Learning/Big Data experts Experience with pandas is a plus Just send your CV to recruiting@teralytics.net
  • 24. Bonus Recommended Reading Evaluation of impact of charity organizations is a hard, unsolved problem involving data ● transparency ● more motivation to give