Más contenido relacionado
La actualidad más candente (20)
Similar a H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham (20)
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
- 1. © 2015 PayPal Inc. All rights reserved..
Data Science with Big Data
in a Corporate Environment
Tasks, Challenges, and Tradeoffs
Nachum Shacham
Principal Data Scientist
PayPal
11/10/2015
- 2. © 2015 PayPal Inc. All rights reserved.
Data Science From 10K ft
2
ML Algorithm Design
Optimization
Scores
Performance
Charts
Training
set
Test
set(s)
Model
Fitting
Validate, calculate KPIs
- 3. © 2015 PayPal Inc. All rights reserved.
Predictive Modeling Meets Big Data:Bridging the Gap
3
Data
Warehouse
Model(s) selection
and tuning
Data Munging
Deploy
Feature
Engineering
Observational
big-data
Data
Exploration
Validation
Data
Sourcing
Deep, wide, distributed
,sparse, partiallyunknown,
redundant,unreliable
• Unified format (“data frame”)
• Well understood
• Statistically known and “well behaved”
Data Science with Big Data: Quantitative change à Process adjustment
- 4. © 2015 PayPal Inc. All rights reserved. 4
Predictive Models for Serving Customers
Rewards
Tech
Support
Customer’s
future
needs
Unblanced
Precision/recall
latency
Personalization
Model
Organic
External
Cohort
Info
gathering
vs UX
Features
DATA
Metrics
Deplyment
- 5. © 2015 PayPal Inc. All rights reserved.
Data Exploration: Knowing What’s in the Data and Fitting the Pieces Together
5
Understaning Data Contents &
meaning
Big data visualization
Units
Language Join keys
Null values: codes, quantities
Correlations
Distributions and outliers
- 6. © 2015 PayPal Inc. All rights reserved. 6
Data Munging: Reshaping the Data à Decisions, Decisions
Unify values
Money, time
Dummy
variables
Sub
groups
categorical
variable
Missing values
Join keys do
not match
Unify null symbols
Impute / discard
Reformat
keys
Ignore
rows
DiscardTransform
Nullify
Irrelevant/
redundant
values
Transform
Remove
outliers
Standardize
Skewed
distributions
- 7. © 2015 PayPal Inc. All rights reserved.
•R
•data.table
•Dplyr
•Lubridate
•complete
•{t, l, s}apply
•{g}sub
•grep{l}
•ggplot2
7
Data Munging Packages & Functions
•Python
•pandas
•re
•Numpy
•Matplotlib
•csv
•datetime
- 8. © 2015 PayPal Inc. All rights reserved. 8
Feature Engineering
Transform raw data to better represent the problem to the model
Better features à Simpler models, better results
Manual Feature Manipulation
Feature learning
algorithms
Integrated in
model fitting
• Dummy variables
• Text integration
• Sub/super sampling
(unbalanced sets)
- 9. © 2015 PayPal Inc. All rights reserved. 9
Model Tuning
Learning Rate
Parameter Grid
Ensemble: degree, type
Interpretability vs Accuacy
- 10. © 2015 PayPal Inc. All rights reserved.
Model Building Using H2O
10
Rich library of ML algorithms Auto detect feature
format
Interoperate with
ordinary R: split
work for complete
munging Runs through
parameter grid
Big Server
- 11. © 2015 PayPal Inc. All rights reserved.
• Who should judge the value of the results?
•Evaluation criteria
• What are the cost/value of FP, TP, FN, TP?
•FP: PR cost
•FN: Financialloss
•TP: Intended gain
• ContinuousModel evaluation in
production
11
Model Evaluation
- 12. © 2015 PayPal Inc. All rights reserved.
• Model results’ deployment destinations
•Humans (decision makers, analysts)
•Computers (recommendations)
•Databases (scores, leads)
• Latencies: hours to milliseconds
•From events to model
•From model output to action
12
Model Deployment
- 13. © 2015 PayPal Inc. All rights reserved.
• Multiple aligned plots
• Large number of data points on frame
• Statisticalplots
• Network plots
• Model fitting on plot data
13
Big Data Visualization
- 14. © 2015 PayPal Inc. All rights reserved.
• Detection of concept drift post deployment
•Changein feature set distribution
•Changein conditionalprobability P(y | X)
• Auto correction for concept drift
•Repeat training on time window of recent data
•Time window: fixed vs. variable size
• Changedetection
•CUSUM
•StatisticalProcess Control
14
Model Drift
- 15. © 2015 PayPal Inc. All rights reserved.
Transaction & Behavioral: Site Speed impact
15
NOT ALL SPEED IMPROVEMENTS
ARE CREATED EQUAL
load time
Bounce rate
- 16. © 2015 PayPal Inc. All rights reserved.
System Log Data:Query Cost Comparison
16
TCO CONSUMPTION EXEC COST
- 17. © 2015 PayPal Inc. All rights reserved.
Summary: The “Extra” Steps Make The Difference
17
Details: the Critical Factor
Platform à
Performance
Decisions:
understand the
impact Delayed
Incomplete
Irrelevant