Data Scientists in large organizations
Spiros Antonatos
whoami
• Greek
• 7 years as a researcher
  – High performance computing, network security, social network analysis
• Specific role: between data scientists and engineers

2
My first experience with data science
• EGEE pan-European grid cluster, 2002
• Thousands of analytics jobs from CERN labs
• MPI jobs
• Power of around 10,000 CPUs
• My first submitted jobs were a particle simulation and a parallel version of Conway's Game of Life

3
The importance of data science

Source: IBM analytics, http://www-935.ibm.com/services/us/gbs/thoughtleadership/ninelevers/
4
The problem of “unicorn” data scientists

Statistical analysis
- Math
- Data Mining
- Machine Learning
- Graph mining
- Data Visualization

Computer Science
- Advanced/High performance computing
- Visualization

Database
- Data engineering
- Data warehousing

Domain expertise
- Finance
- Advertising
- Physics

5
Top daily activities
• Data cleaning (painful)
• Data processing (boring)
• Data modeling (starting to get fun)
• Statistical analysis, machine learning, data mining (yeaaahhh)
• Visualization (exciting)
• Report (back to painful stuff)

6
From data to actions

End users
Teams
Actions
Insights

Summaries and aggregations
Data Foundation
Data sources

7
Data sources - Data engineers
• Most data sources encountered contain one or more of:
  – Unclean data (for example, inconsistent formats)
  – Incomplete data (sampling)
  – Noise
• Data engineers capture, process and store data sources
• Hadoop, MapReduce, HBase, Cassandra, Python scripts
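
As an illustration of the "inconsistent formats" problem above, here is a minimal Python sketch that normalizes timestamps arriving in several layouts. The list of formats, the sample records and the function name `normalize_timestamp` are assumptions made for the example; they do not come from the talk.

```python
from datetime import datetime

# Assumed timestamp layouts; a real data source would need its own list.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y%m%d%H%M%S")

def normalize_timestamp(raw):
    """Return an ISO 8601 string, or None when no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue  # try the next layout
    return None  # flagged for inspection rather than guessed

# Hypothetical records showing the same instant in three layouts plus noise.
records = ["2014-03-01 10:15:00", "01/03/2014 10:15", "20140301101500", "n/a"]
cleaned = [normalize_timestamp(r) for r in records]
```

Returning `None` for unparseable values keeps the noise visible, so it can be counted and reported instead of being silently dropped.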

8
Data Foundation
• The basic foundation where all data and analytic results are stored
• Combined scientific and engineering effort
• Heavy data modeling driven by analytics requirements
• A good foundation means less time spent retrieving and querying data
• Summaries and aggregations are helpful for large-scale data
• If there is no data foundation, spend your initial effort building one
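
To make the point about summaries and aggregations concrete, the following Python sketch pre-aggregates raw call records into one row per subscriber per day, so later queries can read the small summary instead of the raw records. The record layout and the names `raw_calls` and `daily_summary` are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records: (caller, callee, date, duration in seconds).
raw_calls = [
    ("alice", "bob", "2014-03-01", 120),
    ("alice", "carol", "2014-03-01", 30),
    ("bob", "alice", "2014-03-01", 60),
    ("alice", "bob", "2014-03-02", 45),
]

def daily_summary(calls):
    """Aggregate call count and total duration per (caller, date)."""
    summary = defaultdict(lambda: {"calls": 0, "seconds": 0})
    for caller, _callee, date, seconds in calls:
        entry = summary[(caller, date)]
        entry["calls"] += 1
        entry["seconds"] += seconds
    return dict(summary)

summary = daily_summary(raw_calls)
```

At the scale of a large operator the same grouping would more likely run in a tool such as Hive or Spark; the idea of storing the aggregate next to the raw data stays the same.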

9
Validation
• Critical part of the analytics process
• Validating against the ground truth is not always feasible
• Finding representative training sets is hard
• Open source and social network data sometimes help with validation
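
When some ground truth is available, one common way to validate is to compare the predicted set with the known set. The talk does not name a metric; precision and recall are used here as an assumed choice, and the subscriber ids are invented for the example.

```python
def precision_recall(predicted, actual):
    """Compare a set of predicted positives with a set of known positives."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: subscribers flagged by a model vs. confirmed cases.
predicted_churners = {"s1", "s2", "s3", "s4"}
known_churners = {"s2", "s3", "s5"}
precision, recall = precision_recall(predicted_churners, known_churners)
```

Here two of the four flagged subscribers are confirmed, and two of the three confirmed cases were flagged.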

10
Engineering side
• A good data scientist needs to have a good engineering side
• Not at expert level, but enough to build prototypes
• Big teams have engineers side by side with data scientists
  – Engineers gain the domain expertise
  – Data scientists acquire engineering skills to facilitate the handover of their analytics processes
• Which leads to the question: what tools/languages/skills/methodologies should I learn?

11
Data Scientist Toolkit
• R, Python, Java
• Hadoop, HDFS, MapReduce, Spark
• HBase, Pig, Hive, Impala
• SQL, RDBMS
• SciPy, NumPy, scikit-learn
• D3.js, Tableau, Gephi
• SAS, Matlab, SPSS
• NoSQL, MongoDB, Cassandra
• Neo4j, FlockDB
• MS-Excel

Which tools should I learn?
As many as you can

Bold: my skillsets

12
But I know only R, will I have a hard time?
• Tricky question
• The window of opportunity for pure analysts is getting smaller
  – Company-specific statement
• Even paired with an engineer, knowledge transfer is hard if you are stubborn about one toolkit/technology/methodology
• The churn analysis example

13
Churning
• Apart from regular contract termination, customers leave the provider early
• Churn analysis tries to identify and quantify the reasons behind churning
• Variables for investigation
  – Call quality (calls being dropped)
  – Network coverage (bad 3G/4G quality in my area)
  – Prices and bundles
  – My friends left the provider
• Country and culture-specific problem

14
Churn analysis
• Billions of call and SMS records
• Millions of subscribers
• Thousands of contract cancellations (5-10% of total subscribers)
• Subscribers interact with a very small number of people (fewer than 5)
• Insight: canceling customers are 7x more likely to be linked (country: US)
• Action: identify churners' social groups, take actions to prevent them from leaving
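
The "linked churners" insight can be approximated by comparing churn rates between subscribers who have a churned contact and those who do not. The talk does not describe how its 7x figure was computed, so the sketch below is only one plausible measurement; the contact graph, the churner set and the function names are invented toy data that do not reproduce that figure.

```python
def churn_rate(subscribers, churners):
    """Fraction of the given subscribers that churned."""
    if not subscribers:
        return 0.0
    return len(subscribers & churners) / len(subscribers)

def linked_churn_lift(contacts, churners):
    """Churn rate of subscribers with a churned contact, divided by the
    churn rate of subscribers without one."""
    linked = {s for s, peers in contacts.items() if peers & churners}
    unlinked = set(contacts) - linked
    base = churn_rate(unlinked, churners)
    if base == 0.0:
        return None  # ratio is undefined when no unlinked subscriber churned
    return churn_rate(linked, churners) / base

# Toy contact graph: subscriber -> set of people they call or text.
contacts = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"}, "g": {"h"}, "h": {"g"},
}
churners = {"a", "b", "e"}
lift = linked_churn_lift(contacts, churners)
```

With billions of records the contact sets would be derived from the CDR database in a distributed job, but the comparison itself stays this simple.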

[Diagram labels: CDR database, Data, Insights]
15
Domain expertise
• Diverse opinions on whether data scientists should have domain expertise
• Domain expertise vs machine learning
• Opinions so far are divided
• Cases where non-experts outperform experts
• No point in worrying: most data scientists who join large companies do not have domain expertise

16
The importance of visualization
• All performed analyses should be accompanied by the appropriate visualization
• Do not get stuck on Excel / matplotlib graphs
• Add infographics, custom heatmaps and Google Maps to your skill arsenal

17
Visualization leads to great insights
• Understanding data through visualization
• Data scientists with expert visualization skills are rare
• Relying on professional UI/UX experts is not always the solution for data products
• Examples: spatial and SNA graph representation

18
Do not stand isolated from the business owners
• Use cases define the requirements of what you are trying to solve
• Isolation from use cases leads to generic models that do not fit real-life problems
• Sales people are paired with data scientists to address customer needs
• Data scientists can answer all the hard questions around data!
• There are cases where top sales people were data scientists or engineers
• Data scientists can even become CEOs of leading companies!

19
Sense of privacy
• Environments like telcos and social network companies deal with private and sensitive data
• Companies enforce security and privacy measures to prevent data leakage
• Dealing with massive amounts of data requires a great sense of responsibility
• Confidentiality protection ensures that specific individuals are not pinpointed
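
The talk does not say which protection techniques are used. As an assumed illustration only, the Python sketch below shows two common building blocks: replacing identifiers with a keyed hash, and suppressing aggregate rows that describe very few people. The function names, the key and the threshold of 5 are hypothetical, and these steps alone do not guarantee anonymity.

```python
import hashlib
import hmac

def pseudonymize(subscriber_id, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, subscriber_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

def suppress_small_groups(counts, minimum=5):
    """Drop aggregate rows that describe fewer than `minimum` people."""
    return {group: n for group, n in counts.items() if n >= minimum}

# Hypothetical usage: the key would be stored outside the analytics environment.
key = b"example-key-not-for-production"
token = pseudonymize("+30-210-0000000", key)
report = suppress_small_groups({"district A": 12, "district B": 3})
```

The same identifier always maps to the same token, so records can still be joined for analysis without exposing the original number.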

20
Thank you

21

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Data science meetup - Spiros Antonatos

  • 1. Data Scientists in large organizations
      • Spiros Antonatos
  • 2. whoami
      • Greek
      • 7 years as a researcher
          – High performance computing, network security, social network analysis
      • Specific role: between data scientists and engineers
  • 3. My first experience with data science
      • EGEE pan-European grid cluster, 2002
      • Thousands of analytics jobs from CERN labs
      • MPI jobs
      • Power of around 10,000 CPUs
      • My first submitted jobs were a particle simulation and a parallel version of Conway’s Game of Life
  • 4. The importance of data science
      • Source: IBM analytics, http://www-935.ibm.com/services/us/gbs/thoughtleadership/ninelevers/
  • 5. The problem of “unicorn” data scientists
      • Statistical analysis
          – Math
          – Data Mining
          – Machine Learning
          – Graph mining
          – Data Visualization
      • Computer Science
          – Advanced/High performance computing
          – Visualization
      • Database
          – Data engineering
          – Data warehousing
      • Domain expertise
          – Finance
          – Advertising
          – Physics
  • 6. Top daily activities
      • Data cleaning (painful)
      • Data processing (boring)
      • Data modeling (starting to get fun)
      • Statistical analysis, machine learning, data mining (yeaaahhh)
      • Visualization (exciting)
      • Report (back to painful stuff)
  • 7. From data to actions
      • End users, Teams, Actions, Insights, Summaries and aggregations, Data Foundation, Data sources
  • 8. Data sources - Data engineers
      • Most data sources encountered contain either:
          – Unclean data (for example, inconsistent formats)
          – Incomplete data (sampling)
          – Noise
      • Data engineers capture, process and store data sources
      • Hadoop, MapReduce, HBase, Cassandra, Python scripts
  • 9. Data Foundation
      • The basic foundation where all data and analytic results are stored
      • Combined scientific and engineering effort
      • Heavy data modeling driven by analytics requirements
      • A good foundation means less time spent retrieving and querying data
      • Summaries and aggregations are helpful for large-scale data
      • If there is no data foundation, spend your initial effort building one
  • 10. Validation
      • Critical part of the analytics process
      • Validating against the ground truth is not always feasible
      • Finding representative training sets is hard
      • Open source and social network data sometimes help with validation
  • 11. Engineering side
      • A good data scientist needs to have a good engineering side
      • Not at expert level, but up to the prototyping stage
      • Big teams have engineers side by side with data scientists
          – Engineers gain the domain expertise
          – Data scientists acquire engineering skills to facilitate the handover of their analytics processes
      • Which leads to the question: what tools/languages/skills/methodologies should I learn?
  • 12. Data Scientist Toolkit
      • R, Python, Java
      • Hadoop, HDFS, MapReduce, Spark
      • HBase, Pig, Hive, Impala
      • SQL, RDBMS
      • SciPy, NumPy, scikit-learn
      • D3.js, Tableau, Gephi
      • SAS, Matlab, SPSS
      • NoSQL, MongoDB, Cassandra
      • Neo4J, FlockDB
      • MS-Excel
      • Which tools should I learn? As many as you can
      • Bold: my skillsets
  • 13. But I know only R, will I have a hard time?
      • Tricky question
      • The window of opportunity for pure analysts is getting smaller
          – Company-specific statement
      • Even paired with an engineer, knowledge transfer is hard if you stubbornly stick to one toolkit/technology/methodology
      • The churn analysis example
  • 14. Churning
      • Apart from regular contract termination, customers leave the provider early
      • Churn analysis tries to identify and quantify the reasons behind churning
      • Variables for investigation
          – Call quality (calls being dropped)
          – Network coverage (bad 3G/4G quality in my area)
          – Prices and bundles
          – My friends left the provider
      • Country and culture-specific problem
  • 15. Churn analysis
      • Billions of call and SMS records
      • Millions of subscribers
      • Thousands of contract cancellations (5-10% of total subscribers)
      • Subscribers interact with a very small number of people (fewer than 5)
      • Insight: canceling customers are 7x more likely to be linked (country: US)
      • Action: identify churners’ social groups, take action to prevent them from leaving
      • (figure labels: CDR database, Data, Insights)
  • 16. Domain expertise
      • Diverse opinions on whether data scientists should have domain expertise
      • Domain expertise vs machine learning
      • Opinions so far are divided
      • Cases where non-experts outperform experts
      • No point in worrying: most data scientists who join large companies do not have domain expertise
  • 17. The importance of visualization
      • All performed analyses should be accompanied by the appropriate visualization
      • Do not get stuck on Excel / matplotlib graphs
      • Add infographics, custom heatmaps and Google Maps to your skill arsenal
  • 18. Visualization leads to great insights
      • Understanding data through visualization
      • Data scientists with expert visualization skills are rare
      • Relying on professional UI/UX experts is not always the solution for data products
      • Examples: spatial and SNA graph representation
  • 19. Do not stand isolated from the business owners
      • Use cases define the requirements of what you are trying to solve
      • Isolation from use cases leads to generic models that do not fit real-life problems
      • Sales people are paired with data scientists to address customer needs
      • Data scientists can answer all the hard questions around data!
      • Cases where top sales people were data scientists or engineers
      • Data scientists can even become CEOs of leading companies!
  • 20. Sense of privacy
      • Environments like telcos and social network companies deal with private and sensitive data
      • Companies enforce security and privacy measures to prevent data leakage
      • Dealing with massive amounts of data requires a great sense of responsibility
      • Confidentiality protection ensures that specific individuals are not pinpointed
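
The churn analysis slide states its insight (canceling customers are far more likely to be linked) without showing how such a figure could be computed. The following is only a minimal sketch of one possible reading, not the method used in the talk: it compares the share of churners who have at least one churned contact with the same share among subscribers who stayed. The function names, the record layout (caller, callee pairs) and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def build_contacts(call_records):
    """Map each subscriber to the set of subscribers they exchanged calls/SMS with."""
    contacts = defaultdict(set)
    for caller, callee in call_records:
        if caller != callee:  # ignore self-calls
            contacts[caller].add(callee)
            contacts[callee].add(caller)
    return contacts


def linked_share(group, contacts, churners):
    """Fraction of `group` with at least one contact who churned."""
    if not group:
        return 0.0
    linked = sum(1 for s in group if contacts.get(s, set()) & churners)
    return linked / len(group)


def churn_link_ratio(call_records, subscribers, churners):
    """How many times more likely churners are to be linked to a churner than stayers are."""
    contacts = build_contacts(call_records)
    churners = set(churners)
    stayers = set(subscribers) - churners
    churn_share = linked_share(churners, contacts, churners)
    stay_share = linked_share(stayers, contacts, churners)
    return churn_share / stay_share if stay_share else float("inf")


# Hypothetical toy data: eight subscribers, three of whom cancelled.
records = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("g", "h")]
subscribers = list("abcdefgh")
churners = {"a", "b", "c"}

ratio = churn_link_ratio(records, subscribers, churners)
print(f"Churners are {ratio:.1f}x more likely to be linked to a churner")
```

On real call detail records the contact map would have to be built from aggregated data in the data foundation rather than in memory, and the ratio says nothing about causality on its own.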

Editor's notes

  1. Demystify the “Big Data” role: the “I know Java” vs “I know programming” paradigm. GE has 600 data scientists.
  2. Alternative: Drew Conway’s Venn diagram (hacking skills, math & statistics, substantive expertise)
  3. Graph databases are exciting
  4. The churn analysis shown earlier was done by a non-expert
  5. Data artisans: Data artisans are employees who possess a blend of technical skills and business acumen that enables them to extract actionable insight from the huge volumes of data that exist--despite their lack of experience with it--demonstrating that businesses don’t always need a data scientist to interpret data effectively