During development of a machine learning model, about 80% of the time is spent on data preparation and on data quality issues, especially when data from structured and unstructured sources must be combined. Developing a smart generic data mart can reduce the go-to-production time for new ML models. We will share creative solutions to challenges we encountered in data transfer between the DWH and the Data Lake, as well as in data preprocessing and in the development, deployment, and orchestration of ML models using Python/PySpark scripts.
[DSC Europe 22] Smart approach in development and deployment process for various ML models - Danijel Ilievski & Milos Josifovic
1. Smart approach in development and
deployment process for various ML models
Jelena Pekez (Advanced Analytics Team Lead)
Miloš Josifović (Big Data Architect)
Danijel Ilievski (Senior ML Engineer)
2. Comtrade System Integration
Introduction
→Since 87% of models are never deployed, all steps should be planned at the
beginning of the Data Science Lifecycle (pipeline):
1. Manage
2. Develop
3. Deploy
4. Monitor
→The first goal is to reduce the go-to-production time for new ML models
by developing Smart Generic Data Mart(s).
→With Smart Data Mart(s) we can prototype an ML model and evaluate feasibility.
→The final goal is to generate Production Models and easily orchestrate them.
[Diagram: Data Science Lifecycle: Problem Formulation → Data mart design (ADS) → Data Preprocess → Modeling → Results Interpretation → Deployment → PROD. MODEL]
3. Comtrade System Integration
ADS smart development to support all future ML models
→Planning the Data Mart for the creation of the first ML model in a program takes considerable time:
• Collect at high-level all possible future use-cases
• Come up with all relevant and available data sources
• Customer activities the company is interested in
• Combine data from structured and unstructured data sources
• Extensive feature engineering (text processing, normalization, binning,…)
• Complying with GDPR regulation
• Define proper access rights on selected Data Mart(s)
• Resolving data quality issues at the very beginning will reduce endless reloads
For the next ML model, data scientists can spend more time on creative activities using the developed Analytical Data marts/Sets (ADS)
4. Comtrade System Integration
Smart generic data mart(s)
→Creating Multipurpose Data Marts:
• Generate list of target features and relevant target events
• Design it so new events can be easily added
• Eliminate data that have no business/use-case value
• Filter out system records - clean data
• Make the initial (starting) base table(s) - what is the definition of a customer?
• Aggregate data to different granularity levels to catch behavior trends
• Feature Engineering does indeed make a difference!
Generate quickly and easily new ML training datasets
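As a minimal, runnable sketch of "aggregate data to different granularity levels", the snippet below rolls the same base table up at two granularities using stdlib sqlite3. The table and column names here are illustrative, not the talk's actual schema.

```python
import sqlite3

# Hypothetical events table standing in for a data-mart source.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events (customer_id INT, month TEXT, amount REAL)')
conn.executemany('INSERT INTO events VALUES (?, ?, ?)', [
    (1, '2022-09', 10.0), (1, '2022-10', 30.0), (1, '2022-11', 20.0),
    (2, '2022-11', 5.0),
])

# Monthly granularity: one row per customer per month.
monthly = conn.execute(
    'SELECT customer_id, month, SUM(amount) FROM events '
    'GROUP BY customer_id, month ORDER BY customer_id, month').fetchall()

# 3-month trend granularity: one row per customer over the whole window,
# capturing behaviour trends (average and peak spend).
trend_3m = conn.execute(
    'SELECT customer_id, AVG(amount), MAX(amount) FROM events '
    "WHERE month BETWEEN '2022-09' AND '2022-11' "
    'GROUP BY customer_id ORDER BY customer_id').fetchall()
```

Keeping both granularities in the data mart is what lets a new training dataset be assembled quickly without re-reading the raw events.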
5. Comtrade System Integration
Data Science requires domain knowledge - it makes a big difference
→How much domain knowledge do I need? Depends.
→Domain knowledge is critical for data preparation, productization and orchestration
→Which data points add value?
→Domain knowledge is necessary in data pre-processing:
• Outlier detection, feature importance, model selection, model evaluation stage...
[Venn diagram: Data Science at the intersection of Domain Knowledge, Math/Stats & ML, and Computer Science]
You have to get the best of both worlds!
6. Comtrade System Integration
Control your data mart(s) in production
→Steps in data pipeline for data quality check:
• Missing data vs Loaded data - aggregations
• Duplicates – the same records were repeated
• Relative change threshold - increment or decrement in the number of records
• Statistical expected range
• Data drift – target variable distribution
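The checks above can be sketched as small, reusable functions. This is a hedged illustration in plain Python (the thresholds and function names are our own, not the talk's implementation):

```python
from statistics import mean, stdev

def check_relative_change(prev_count, curr_count, threshold=0.30):
    """Relative change threshold: flag loads whose row count moved
    more than `threshold` (e.g. 30%) versus the previous load."""
    change = abs(curr_count - prev_count) / prev_count
    return change <= threshold

def check_expected_range(history, new_value, n_sigma=3):
    """Statistical expected range: flag an aggregate that falls outside
    mean +/- n_sigma * stdev of its historical values."""
    m, s = mean(history), stdev(history)
    return m - n_sigma * s <= new_value <= m + n_sigma * s

def count_duplicates(records):
    """Duplicates check: count records that are exact repeats."""
    return len(records) - len(set(records))
```

In a real pipeline these would run as a step after each load, with failures blocking downstream model scoring.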
[Diagram: data pipeline with the quality checks above embedded]
7. Comtrade System Integration
Example of how a Generic Data Set can help us focus on
Data Science – Transfer between DWH and Data Lake
→Data on two platforms (DWH – SQL database, Data Lake – Hadoop)
→Data can be transferred among databases:
• Through SQL federation / DB link – with certain specifics/products compatibility
• Via Spark engine (PySpark) to Hadoop
→The aim is to simplify data transfer between platforms so
Data Scientists can do it on their own, without:
• Dealing with Spark jobs directly
• Managing Hadoop security (Kerberos, read-write permissions, etc.)
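The interface idea, a helper that hides the transfer plumbing from the Data Scientist, can be sketched as below. In the talk's setup the plumbing behind such a helper is a PySpark job plus Kerberos/Hadoop permissions; here a sqlite3-to-CSV dump stands in so the sketch stays runnable anywhere, and all names are illustrative:

```python
import csv
import os
import sqlite3

def transfer_table(conn, table, dest_dir):
    """Copy one table from the 'DWH' side to the 'lake' side as a flat file.
    The caller never touches Spark jobs or Hadoop security: one call, one table.
    (Illustrative stand-in: a CSV dump instead of a real PySpark write.)"""
    cur = conn.execute('SELECT * FROM ' + table)
    path = os.path.join(dest_dir, table + '.csv')
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header from column metadata
        writer.writerows(cur)
    return path
```

The design point is the narrow signature: the Data Scientist supplies a connection and a table name, and the helper owns authentication, file layout, and write permissions.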
8. Comtrade System Integration
Speed up writing SQL queries
→ADS [GENERATE SQL QUERY] Training/Scoring table
→Query automation for training table
→ Input for the Python script, and an example of the Python script:
SCHEMA | SOURCE     | VAR_IN                                                  | VAR_OUT               | FUNCTIONS         | PERIODS | ZERO_EXCLUDE
ADS    | DS_PAYMENT | TOTAL_PAYMENT_AMT                                       | TOTAL_PAYMENT_AMT     | [MAX, AVG/P]      | [3, 6]  | 1
ADS    | DS_PAYMENT | TOTAL_PAYMENT_CNT                                       | TOTAL_PAYMENT_CNT     | [SUM]             | [1]     | 1
ADS    | DS_PAYMENT | MAX_PAYMENT_AMT                                         | MAX_PAYMENT_AMT       | [MAX]             | [3]     | 1
ADS    | DS_PAYMENT | MIN_PAYMENT_AMT                                         | MIN_PAYMENT_AMT       | [MIN]             | [3]     | 1
ADS    | DS_PAYMENT | ADD_PAYMENT_CNT                                         | ADD_PAYMENT_CNT       | [AVG/P]           | [6]     | 1
ADS    | DS_USAGE   | USAGE_OUT_DUR                                           | USAGE_OUT_DUR         | [SUM]             | [1]     | 1
ADS    | DS_USAGE   | USAGE_OUT_DUR                                           | USAGE_OUT_DUR         | [AVG/P, MAX, MIN] | [3, 6]  | 1
ADS    | DS_USAGE   | USAGE_OUT_IN_PACK_DUR                                   | USAGE_OUT_IN_PACK_DUR | [SUM]             | [1]     | 1
ADS    | DS_USAGE   | NVL(USAGE_OUT_REG_INT_DUR,0) + NVL(USAGE_OUT_INT_DUR,0) | USAGE_OUT_INT_DUR     | [AVG/P]           | [6]     | 1
for i, line in enumerate(variables):
    for i2, k in enumerate(line[2]):        # functions
        for i3, kk in enumerate(line[3]):   # periods
            # 'zarez' means comma: skip it after the last generated column
            if (i == len(variables) - 1) and (i2 == len(line[2]) - 1) and (i3 == len(line[3]) - 1):
                zarez = ''
            else:
                zarez = ','
            # Builds an aggregation column, e.g. AVG(FIELD_NAME) AS NEW_FIELD_NAME
            divider = ''
            if 'AVG/P' == k.upper():
                func1 = 'SUM'
                func2 = '_AVG'
                divider = '/' + str(kk)
            elif ('SUM' == k.upper()) and (kk == '1'):
                func1 = 'SUM'
                func2 = ''
            else:
                func1 = k
                func2 = '_' + k
            query += (func1 + '(' + line[1] + '_' + str(kk) + 'M)' + divider
                      + ' AS ' + line[1] + func2 + '_' + str(kk) + 'M' + zarez + '\n')
…
for i, line in enumerate(variables):
    for i2, line2 in enumerate(line[3]):    # periods
        if (i == len(variables) - 1) and (i2 == len(line[3]) - 1):
            zarez = ''
        else:
            zarez = ','
        if line[4] == 1:
            # ZERO_EXCLUDE flag: ignore zero values of the source variable
            zero_rule = 'AND {varijabla} <> 0'.format(varijabla=line[0])
        else:
            zero_rule = ''
        query += ("CASE WHEN TIME_ID BETWEEN ADD_MONTHS('{datum_place}', {vreme2}) AND "
                  "'{datum_place}' {zero_rule} THEN {varijabla} ELSE NULL END AS "
                  "{varijabla2}_{vreme}M{zarez_place}".format(
                      varijabla=line[0], varijabla2=line[1], datum_place=datum,
                      vreme2=-1 * (int(line2) - 1), zero_rule=zero_rule,
                      vreme=line2, zarez_place=zarez)) + '\n'
query += ("FROM\n" ...
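Boiled down to a single metadata row, the generation step looks like this. This is a self-contained sketch with hypothetical input values (`row` and `datum` are made up; the row layout follows the table above: VAR_IN, VAR_OUT, FUNCTIONS, PERIODS, ZERO_EXCLUDE):

```python
# One metadata row -> one windowed CASE WHEN column for the training table.
row = ['TOTAL_PAYMENT_AMT', 'TOTAL_PAYMENT_AMT', ['MAX'], ['3'], 1]
datum = '2022-11-01'   # hypothetical scoring/reference date

period = row[3][0]
# ZERO_EXCLUDE flag appends a filter that ignores zero values.
zero_rule = 'AND {v} <> 0'.format(v=row[0]) if row[4] == 1 else ''
column = ("CASE WHEN TIME_ID BETWEEN ADD_MONTHS('{d}', {m}) AND '{d}' {z} "
          "THEN {v} ELSE NULL END AS {o}_{p}M").format(
              d=datum, m=-1 * (int(period) - 1), z=zero_rule,
              v=row[0], o=row[1], p=period)
```

The generated `column` restricts the source value to the trailing 3-month window and aliases it `TOTAL_PAYMENT_AMT_3M`, ready to be wrapped by the aggregation functions from the first loop.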
9. Comtrade System Integration
Develop phase - Devote more time to the creative side
→Improve ML traditional development processes:
• Benefit from pre-trained models (deep learning – mainly image recognition)
• Automated Machine learning (AutoML) – pretty good in supervised ML
→AutoML:
• Optimizes DS workload or compensates for a lack of experience
• Handles tasks like Feature Selection, Data Preprocessing, Hyperparameter Optimization,
and Model/Algorithm Selection
• Lets you focus more on the data side
• Is no silver bullet – it is more an exploration tool than an optimal-model generator
MLBox, Auto-Sklearn, TPOT, H2O AutoML, Auto Keras, Auto PyTorch, Google Cloud AutoML, DataRobot, etc.
10. Comtrade System Integration
Deploy phase - you don’t get any value out of a model sitting on someone’s computer
→Phase where model is transferred to a production environment.
→Same best-practice principles and design patterns for software also apply to ML models
→ML model should be deployed as part of existing data pipeline
→Output of ML model should be monitored for bias
→ML model in deploy phase:
• Registered in appropriate repository
• Passed testing
• Model artifacts are retained
→Validate model → Publish model → Deliver model
→Don’t update Python libraries before proper testing on the development environment 😊
11. Comtrade System Integration
Deploy phase – more than one ML model
→Model registry:
• Place for all trained/production-ready models (with version control)
• Alternative models as backup
• All model artifacts, model dependencies, evaluation metrics, documentation
• Which dataset was used for training / model lineage
• Logging of performance details of the model and comparisons with other models
• Tracking of models across the whole lifecycle (training, staging, and production)
→A model registry enables faster deployment of your models or retraining of the current ones
→Shared by multiple team members (team collaboration)
→Tie business rules to the output of the production model
→Consume the model through API integration
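To make the registry idea concrete, here is a toy file-based registry in stdlib Python. This is illustrative only (in practice a tool such as MLflow plays this role; the class and field names are our own), but it shows the three properties listed above: version control, retained artifacts, and dataset lineage per version:

```python
import json
import os
import time

class ModelRegistry:
    """Toy file-based model registry: each register() call stores a new
    version with its artifact, evaluation metrics, and training-dataset lineage."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def register(self, name, artifact_bytes, metrics, training_dataset):
        version = len(self.list_versions(name)) + 1      # simple version control
        vdir = os.path.join(self.root, name, str(version))
        os.makedirs(vdir)
        with open(os.path.join(vdir, 'model.bin'), 'wb') as f:
            f.write(artifact_bytes)                      # retain the model artifact
        meta = {'version': version,
                'metrics': metrics,                      # evaluation metrics
                'training_dataset': training_dataset,    # model lineage
                'registered_at': time.time(),
                'stage': 'staging'}                      # training/staging/production
        with open(os.path.join(vdir, 'meta.json'), 'w') as f:
            json.dump(meta, f)
        return version

    def list_versions(self, name):
        path = os.path.join(self.root, name)
        return sorted(os.listdir(path), key=int) if os.path.isdir(path) else []
```

Because every version keeps its metadata next to its artifact, rolling back to an alternative model is a directory lookup rather than a retraining job.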
DANIJEL
During deployment in large organizations we have to orchestrate more than one ML model, and it is best to keep in mind from the very beginning of the project that there will be more ML models in the future, so organize everything in a way that supports adding new models easily.
…
- From the very beginning, special focus in the Data Science Lifecycle should be on data quality and production.
Foundation for more models in the future:
The development of an analytical dataset for future model development can be viewed as a separate project.
JELENA
So if we go into more detail…
When developing a model, focus on data preparation
– Organize DB tables considering performance and optimization
Analyze the addition of columns and important data sources
Design the sources and target tables, organize the tables in terms of performance and logic, and keep a high-level view of what the use cases are.
JELENA
MENTION:
Organize DB tables considering performance and optimization
Feature Engineering - isn't about generating a higher quantity of new features. It's about the quality of the features created.
- DANIJEL OR JELENA
Domain knowledge cannot be optimized.
- Make an instruction file with field names and the action for handling nulls:
constant value, Max(), Min(), Mean(), nearby value, regression, delete record
- Domain knowledge will allow you to take the impact of your machine learning skills to a much higher level of significance.
--------------
--Random forests, for example, can handle heterogeneous data types right out of the box.
As a Data Scientist with domain knowledge you will have the answer to the question "Which data points add value?" and you just need to find them.
DANIJEL
MILOS
Benefits / suggestions:
Parallel execution
No temp data on the initial database
Fast transfer
Be careful about data types specified at the table level
DANIJEL
JELENA TO THE END
Efficiently automate all regular, manual, and tedious workloads of ML implementations
"Falls short" for Feature Engineering.
Can easily overfit (watch the label distribution, the number of outliers, etc.).