SlideShare una empresa de Scribd logo
1 de 18
Descargar para leer sin conexión
© 2015 PayPal Inc. All rights reserved..
Data Science with Big Data
in a Corporate Environment
Tasks, Challenges, and Tradeoffs
Nachum Shacham
Principal Data Scientist
PayPal
11/10/2015
© 2015 PayPal Inc. All rights reserved.
Data Science From 10K ft
2
ML Algorithm Design
Optimization
Scores
Performance
Charts
Training
set
Test
set(s)
Model
Fitting
Validate, calculate KPIs
© 2015 PayPal Inc. All rights reserved.
Predictive Modeling Meets Big Data:Bridging the Gap
3
Data
Warehouse
Model(s) selection
and tuning
Data Munging
Deploy
Feature
Engineering
Observational
big-data
Data
Exploration
Validation
Data
Sourcing
Deep, wide, distributed
,sparse, partiallyunknown,
redundant,unreliable
• Unified  format  (“data  frame”)
• Well  understood
• Statistically  known  and  “well  behaved”
Data Science with Big Data: Quantitative change à Process adjustment
© 2015 PayPal Inc. All rights reserved. 4
Predictive Models for Serving Customers
Rewards
Tech
Support
Customer’s
future
needs
Unblanced
Precision/recall
latency
Personalization
Model
Organic
External
Cohort
Info
gathering
vs UX
Features
DATA
Metrics
Deplyment
© 2015 PayPal Inc. All rights reserved.
Data Exploration: Knowing What’s in the Data and Fitting the Pieces Together
5
Understaning Data Contents &
meaning
Big data visualization
Units
Language Join keys
Null values: codes, quantities
Correlations
Distributions and outliers
© 2015 PayPal Inc. All rights reserved. 6
Data Munging: Reshaping the Data à Decisions, Decisions
Unify values
Money, time
Dummy
variables
Sub
groups
categorical
variable
Missing values
Join keys do
not match
Unify null symbols
Impute / discard
Reformat
keys
Ignore
rows
DiscardTransform
Nullify
Irrelevant/
redundant
values
Transform
Remove
outliers
Standardize
Skewed
distributions
© 2015 PayPal Inc. All rights reserved.
•R
•data.table
•Dplyr
•Lubridate
•complete
•{t, l, s}apply
•{g}sub
•grep{l}
•ggplot2
7
Data Munging Packages & Functions
•Python
•pandas
•re
•Numpy
•Matplotlib
•csv
•datetime
© 2015 PayPal Inc. All rights reserved. 8
Feature Engineering
Transform raw data to better represent the problem to the model
Better features à Simpler models, better results
Manual Feature Manipulation
Feature learning
algorithms
Integrated in
model fitting
• Dummy  variables
• Text  integration
• Sub/super  sampling  
(unbalanced  sets)
© 2015 PayPal Inc. All rights reserved. 9
Model Tuning
Learning Rate
Parameter Grid
Ensemble: degree, type
Interpretability vs Accuacy
© 2015 PayPal Inc. All rights reserved.
Model Building Using H2O
10
Rich library of ML algorithms Auto detect feature
format
Interoperate with
ordinary R: split
work for complete
munging Runs through
parameter grid
Big Server
© 2015 PayPal Inc. All rights reserved.
• Who should judge the value of the results?
•Evaluation criteria
• What are the cost/value of FP, TP, FN, TP?
•FP: PR cost
•FN: Financialloss
•TP: Intended gain
• ContinuousModel evaluation in
production
11
Model Evaluation
© 2015 PayPal Inc. All rights reserved.
• Model results’ deployment destinations
•Humans (decision makers, analysts)
•Computers (recommendations)
•Databases (scores, leads)
• Latencies: hours to milliseconds
•From events to model
•From model output to action
12
Model Deployment
© 2015 PayPal Inc. All rights reserved.
• Multiple aligned plots
• Large number of data points on frame
• Statisticalplots
• Network plots
• Model fitting on plot data
13
Big Data Visualization
© 2015 PayPal Inc. All rights reserved.
• Detection of concept drift post deployment
•Changein feature set distribution
•Changein conditionalprobability P(y | X)
• Auto correction for concept drift
•Repeat training on time window of recent data
•Time window: fixed vs. variable size
• Changedetection
•CUSUM
•StatisticalProcess Control
14
Model Drift
© 2015 PayPal Inc. All rights reserved.
Transaction & Behavioral: Site Speed impact
15
NOT  ALL  SPEED  IMPROVEMENTS  
ARE  CREATED  EQUAL
load  time
Bounce  rate
© 2015 PayPal Inc. All rights reserved.
System Log Data:Query Cost Comparison
16
TCO CONSUMPTION EXEC  COST
© 2015 PayPal Inc. All rights reserved.
Summary: The “Extra” Steps Make The Difference
17
Details: the Critical Factor
Platform à
Performance
Decisions:
understand the
impact Delayed
Incomplete
Irrelevant
© 2015 PayPal Inc. All rights reserved.
THANK YOU
18

Más contenido relacionado

La actualidad más candente

Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
Domino Data Lab
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
The Hive
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureWizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in Azure
Databricks
 
FlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaFlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at Humana
Databricks
 

La actualidad más candente (20)

H2O World - H2O Deep Learning with Arno Candel
H2O World - H2O Deep Learning with Arno CandelH2O World - H2O Deep Learning with Arno Candel
H2O World - H2O Deep Learning with Arno Candel
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Dealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AIDealing with uncertainty in fintech using AI
Dealing with uncertainty in fintech using AI
 
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
Real-time Recommendations for Retail: Architecture, Algorithms, and DesignReal-time Recommendations for Retail: Architecture, Algorithms, and Design
Real-time Recommendations for Retail: Architecture, Algorithms, and Design
 
Before Kaggle
Before KaggleBefore Kaggle
Before Kaggle
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Machine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache SparkMachine Learning with Big Data using Apache Spark
Machine Learning with Big Data using Apache Spark
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
FrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and CheaplyFrugalML: Using ML APIs More Accurately and Cheaply
FrugalML: Using ML APIs More Accurately and Cheaply
 
Scaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With LuminaireScaling AutoML-Driven Anomaly Detection With Luminaire
Scaling AutoML-Driven Anomaly Detection With Luminaire
 
Wizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in AzureWizard Driven AI Anomaly Detection with Databricks in Azure
Wizard Driven AI Anomaly Detection with Databricks in Azure
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
FlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaFlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at Humana
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 

Destacado

Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1
GurinderG
 
EAP - Accelerating behavorial analytics at PayPal using Hadoop
EAP - Accelerating behavorial analytics at PayPal using HadoopEAP - Accelerating behavorial analytics at PayPal using Hadoop
EAP - Accelerating behavorial analytics at PayPal using Hadoop
DataWorks Summit
 
PayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on HadoopPayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on Hadoop
DataWorks Summit
 
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
Data Insight
 

Destacado (20)

Big Data: It's More Than Volume, Paypal
Big Data: It's More Than Volume, PaypalBig Data: It's More Than Volume, Paypal
Big Data: It's More Than Volume, Paypal
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
Driving Sales Execution with Compelling Messaging
Driving Sales Execution with Compelling MessagingDriving Sales Execution with Compelling Messaging
Driving Sales Execution with Compelling Messaging
 
Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1Big data – can it deliver speed and accuracy v1
Big data – can it deliver speed and accuracy v1
 
EAP - Accelerating behavorial analytics at PayPal using Hadoop
EAP - Accelerating behavorial analytics at PayPal using HadoopEAP - Accelerating behavorial analytics at PayPal using Hadoop
EAP - Accelerating behavorial analytics at PayPal using Hadoop
 
Big- Data and Risk Management - Ido Lustig, PayPal
Big- Data and Risk Management - Ido Lustig, PayPalBig- Data and Risk Management - Ido Lustig, PayPal
Big- Data and Risk Management - Ido Lustig, PayPal
 
H2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine UdellH2O World - Generalized Low Rank Models - Madeleine Udell
H2O World - Generalized Low Rank Models - Madeleine Udell
 
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
H2O World - Quora: Machine Learning Algorithms to Grow the World's Knowledge ...
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
 
H2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
H2O World - NCS Continuous Media Optimization w/H2O - Satya SatyamoorthyH2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
H2O World - NCS Continuous Media Optimization w/H2O - Satya Satyamoorthy
 
PayPal Real Time Analytics
PayPal  Real Time AnalyticsPayPal  Real Time Analytics
PayPal Real Time Analytics
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 
PayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on HadoopPayPal Behavioral Analytics on Hadoop
PayPal Behavioral Analytics on Hadoop
 
H2O World - A Look Under Progressive's Big Data Hood - Pawan Divakarla & Bria...
H2O World - A Look Under Progressive's Big Data Hood - Pawan Divakarla & Bria...H2O World - A Look Under Progressive's Big Data Hood - Pawan Divakarla & Bria...
H2O World - A Look Under Progressive's Big Data Hood - Pawan Divakarla & Bria...
 
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
Driving In-Store Sales with Real-Time Personalization - Cyril Nigg, Catalina ...
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...eCommerce and  ePayments markets in  Russia :  trends ,  analytics , perspect...
eCommerce and ePayments markets in Russia : trends , analytics , perspect...
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
 
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoH2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
 

Similar a H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

Similar a H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham (20)

ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
How to add Artificial Intelligence Capabilities to Existing Software Platforms
How to add Artificial Intelligence Capabilities to Existing Software PlatformsHow to add Artificial Intelligence Capabilities to Existing Software Platforms
How to add Artificial Intelligence Capabilities to Existing Software Platforms
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
 
10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe
 
How to Revamp your Legacy Applications For More Agility and Better Service - ...
How to Revamp your Legacy Applications For More Agility and Better Service - ...How to Revamp your Legacy Applications For More Agility and Better Service - ...
How to Revamp your Legacy Applications For More Agility and Better Service - ...
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Come fare business con i big data in concreto
Come fare business con i big data in concretoCome fare business con i big data in concreto
Come fare business con i big data in concreto
 
Extending BI with Big Data Analytics
Extending BI with Big Data AnalyticsExtending BI with Big Data Analytics
Extending BI with Big Data Analytics
 
Data Analytics in Digital Transformation
Data Analytics in Digital TransformationData Analytics in Digital Transformation
Data Analytics in Digital Transformation
 
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning SolutionsKamanja: Driving Business Value through Real-Time Decisioning Solutions
Kamanja: Driving Business Value through Real-Time Decisioning Solutions
 
The Data Lake: Empowering Your Data Science Team
The Data Lake: Empowering Your Data Science TeamThe Data Lake: Empowering Your Data Science Team
The Data Lake: Empowering Your Data Science Team
 
Hadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata CompanyHadoop 2015: what we larned -Think Big, A Teradata Company
Hadoop 2015: what we larned -Think Big, A Teradata Company
 
Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud ...
Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud ...Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud ...
Large Scale Graph Processing & Machine Learning Algorithms for Payment Fraud ...
 
ICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data ScienceICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data Science
 
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture MaturityADV Slides: How to Improve Your Analytic Data Architecture Maturity
ADV Slides: How to Improve Your Analytic Data Architecture Maturity
 
Oil and gas big data edition
Oil and gas  big data editionOil and gas  big data edition
Oil and gas big data edition
 
Hadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data WarehouseHadoop and Your Enterprise Data Warehouse
Hadoop and Your Enterprise Data Warehouse
 
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix WebinarFuture-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
Future-Proof Your Streaming Analytics Architecture- StreamAnalytix Webinar
 

Más de Sri Ambati

Más de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Último (20)

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

  • 1. © 2015 PayPal Inc. All rights reserved.. Data Science with Big Data in a Corporate Environment Tasks, Challenges, and Tradeoffs Nachum Shacham Principal Data Scientist PayPal 11/10/2015
  • 2. © 2015 PayPal Inc. All rights reserved. Data Science From 10K ft 2 ML Algorithm Design Optimization Scores Performance Charts Training set Test set(s) Model Fitting Validate, calculate KPIs
  • 3. © 2015 PayPal Inc. All rights reserved. Predictive Modeling Meets Big Data:Bridging the Gap 3 Data Warehouse Model(s) selection and tuning Data Munging Deploy Feature Engineering Observational big-data Data Exploration Validation Data Sourcing Deep, wide, distributed ,sparse, partiallyunknown, redundant,unreliable • Unified  format  (“data  frame”) • Well  understood • Statistically  known  and  “well  behaved” Data Science with Big Data: Quantitative change à Process adjustment
  • 4. © 2015 PayPal Inc. All rights reserved. 4 Predictive Models for Serving Customers Rewards Tech Support Customer’s future needs Unblanced Precision/recall latency Personalization Model Organic External Cohort Info gathering vs UX Features DATA Metrics Deplyment
  • 5. © 2015 PayPal Inc. All rights reserved. Data Exploration: Knowing What’s in the Data and Fitting the Pieces Together 5 Understaning Data Contents & meaning Big data visualization Units Language Join keys Null values: codes, quantities Correlations Distributions and outliers
  • 6. © 2015 PayPal Inc. All rights reserved. 6 Data Munging: Reshaping the Data à Decisions, Decisions Unify values Money, time Dummy variables Sub groups categorical variable Missing values Join keys do not match Unify null symbols Impute / discard Reformat keys Ignore rows DiscardTransform Nullify Irrelevant/ redundant values Transform Remove outliers Standardize Skewed distributions
  • 7. © 2015 PayPal Inc. All rights reserved. •R •data.table •Dplyr •Lubridate •complete •{t, l, s}apply •{g}sub •grep{l} •ggplot2 7 Data Munging Packages & Functions •Python •pandas •re •Numpy •Matplotlib •csv •datetime
  • 8. © 2015 PayPal Inc. All rights reserved. 8 Feature Engineering Transform raw data to better represent the problem to the model Better features à Simpler models, better results Manual Feature Manipulation Feature learning algorithms Integrated in model fitting • Dummy  variables • Text  integration • Sub/super  sampling   (unbalanced  sets)
  • 9. © 2015 PayPal Inc. All rights reserved. 9 Model Tuning Learning Rate Parameter Grid Ensemble: degree, type Interpretability vs Accuacy
  • 10. © 2015 PayPal Inc. All rights reserved. Model Building Using H2O 10 Rich library of ML algorithms Auto detect feature format Interoperate with ordinary R: split work for complete munging Runs through parameter grid Big Server
  • 11. © 2015 PayPal Inc. All rights reserved. • Who should judge the value of the results? •Evaluation criteria • What are the cost/value of FP, TP, FN, TP? •FP: PR cost •FN: Financialloss •TP: Intended gain • ContinuousModel evaluation in production 11 Model Evaluation
  • 12. © 2015 PayPal Inc. All rights reserved. • Model results’ deployment destinations •Humans (decision makers, analysts) •Computers (recommendations) •Databases (scores, leads) • Latencies: hours to milliseconds •From events to model •From model output to action 12 Model Deployment
  • 13. © 2015 PayPal Inc. All rights reserved. • Multiple aligned plots • Large number of data points on frame • Statisticalplots • Network plots • Model fitting on plot data 13 Big Data Visualization
  • 14. © 2015 PayPal Inc. All rights reserved. • Detection of concept drift post deployment •Changein feature set distribution •Changein conditionalprobability P(y | X) • Auto correction for concept drift •Repeat training on time window of recent data •Time window: fixed vs. variable size • Changedetection •CUSUM •StatisticalProcess Control 14 Model Drift
  • 15. © 2015 PayPal Inc. All rights reserved. Transaction & Behavioral: Site Speed impact 15 NOT  ALL  SPEED  IMPROVEMENTS   ARE  CREATED  EQUAL load  time Bounce  rate
  • 16. © 2015 PayPal Inc. All rights reserved. System Log Data:Query Cost Comparison 16 TCO CONSUMPTION EXEC  COST
  • 17. © 2015 PayPal Inc. All rights reserved. Summary: The “Extra” Steps Make The Difference 17 Details: the Critical Factor Platform à Performance Decisions: understand the impact Delayed Incomplete Irrelevant
  • 18. © 2015 PayPal Inc. All rights reserved. THANK YOU 18