During development of a machine learning model, about 80% of the time is spent on data preparation and on data quality issues, especially when data from structured and unstructured sources must be combined. Developing a smart generic data mart can reduce the go-to-production time for new ML models. We will share creative solutions to challenges we encountered in data transfer between the DWH and the Data Lake, as well as in data preprocessing and in the development, deployment, and orchestration of ML models using Python/PySpark scripts.
[DSC Europe 22] Smart approach in development and deployment process for various ML models - Danijel Ilievski & Milos Josifovic
1. Smart approach in development and
deployment process for various ML models
Jelena Pekez (Advanced Analytics Team Lead)
Miloš Josifović (Big Data Architect)
Danijel Ilievski (Senior ML Engineer)
2. Comtrade System Integration
Introduction
→Since 87% of models are never deployed, all steps should be planned at the
beginning of the Data Science Lifecycle (pipeline):
1. Manage
2. Develop
3. Deploy
4. Monitor
→The first goal is to reduce the go-to-production time for new ML models
by developing Smart Generic Data Mart(s).
→With Smart Data Mart(s) we can prototype an ML model and evaluate feasibility.
→The final goal is to generate Production Models and easily orchestrate them.
[Diagram: Data Science Lifecycle: Problem Formulation → Data mart design (ADS) → Data Preprocess → Modeling → Results Interpretation → Deployment → PROD. MODEL]
3. Comtrade System Integration
ADS smart development to support all future ML models
→Planning the Data Mart for the creation of the first ML model in a program takes considerable time:
• Collect at high-level all possible future use-cases
• Come up with all relevant and available data sources
• Customer activities the company is interested in
• Combine data from structured and unstructured data sources
• Extensive feature engineering (text processing, normalization, binning,…)
• Complying with GDPR regulation
• Define proper access rights on selected Data Mart(s)
• Resolving data quality issues at the very beginning will reduce endless reloads
For the next ML model, data scientists can spend more time on creative activities using the developed Analytical Data marts/Sets (ADS)
4. Comtrade System Integration
Smart generic data mart(s)
→Creating Multipurpose Data Marts:
• Generate list of target features and relevant target events
• Design it so new events can be easily added
• Eliminate data that have no business/use-case value
• Filter out system records - clean data
• Make the initial (starting) base table(s) - what is the definition of a customer?
• Aggregate data to different granularity levels to catch behavior trends
• Feature Engineering does indeed make a difference!
Generate quickly and easily new ML training datasets
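As a minimal, runnable sketch of "aggregate data to different granularity levels", the snippet below rolls the same base table up at two granularities using stdlib sqlite3. The table and column names here are illustrative, not the talk's actual schema.

```python
import sqlite3

# Hypothetical events table standing in for a data-mart source.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE events (customer_id INT, month TEXT, amount REAL)')
conn.executemany('INSERT INTO events VALUES (?, ?, ?)', [
    (1, '2022-09', 10.0), (1, '2022-10', 30.0), (1, '2022-11', 20.0),
    (2, '2022-11', 5.0),
])

# Monthly granularity: one row per customer per month.
monthly = conn.execute(
    'SELECT customer_id, month, SUM(amount) FROM events '
    'GROUP BY customer_id, month ORDER BY customer_id, month').fetchall()

# 3-month trend granularity: one row per customer over the whole window,
# capturing behaviour trends (average and peak spend).
trend_3m = conn.execute(
    'SELECT customer_id, AVG(amount), MAX(amount) FROM events '
    "WHERE month BETWEEN '2022-09' AND '2022-11' "
    'GROUP BY customer_id ORDER BY customer_id').fetchall()
```

Keeping both granularities in the data mart is what lets a new training dataset be assembled quickly without re-reading the raw events.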
5. Comtrade System Integration
Data Science requires domain knowledge - it makes a big difference
→How much domain knowledge do I need? Depends.
→Domain knowledge is critical for data preparation, productization and orchestration
→Which data points add value?
→Domain knowledge is necessary in data pre-processing:
• Outlier detection, feature importance, model selection, model evaluation stage...
[Venn diagram: Data Science at the intersection of Domain Knowledge, Math/Stats & ML, and Computer Science]
You have to get the best of both worlds!
6. Comtrade System Integration
Control your data mart(s) in production
→Steps in data pipeline for data quality check:
• Missing data vs Loaded data - aggregations
• Duplicates – the same records were repeated
• Relative change threshold - increment or decrement in the number of records
• Statistical expected range
• Data drift – target variable distribution
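The checks above can be sketched as small, reusable functions. This is a hedged illustration in plain Python (the thresholds and function names are our own, not the talk's implementation):

```python
from statistics import mean, stdev

def check_relative_change(prev_count, curr_count, threshold=0.30):
    """Relative change threshold: flag loads whose row count moved
    more than `threshold` (e.g. 30%) versus the previous load."""
    change = abs(curr_count - prev_count) / prev_count
    return change <= threshold

def check_expected_range(history, new_value, n_sigma=3):
    """Statistical expected range: flag an aggregate that falls outside
    mean +/- n_sigma * stdev of its historical values."""
    m, s = mean(history), stdev(history)
    return m - n_sigma * s <= new_value <= m + n_sigma * s

def count_duplicates(records):
    """Duplicates check: count records that are exact repeats."""
    return len(records) - len(set(records))
```

In a real pipeline these would run as a step after each load, with failures blocking downstream model scoring.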
[Diagram: data pipeline with the quality checks above embedded]
7. Comtrade System Integration
Example of how a Generic Data Set can help us focus on
Data Science – Transfer between DWH and Data Lake
→Data on two platforms (DWH – SQL database, Data Lake – Hadoop)
→Data can be transferred among databases:
• Through SQL federation / DB link – with certain specifics/products compatibility
• Via Spark engine (PySpark) to Hadoop
→The aim is to simplify data transfer between platforms so
Data Scientists can do it on their own, without:
• Dealing with Spark jobs directly
• Managing Hadoop security (Kerberos, read-write permissions, etc.)
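The interface idea, a helper that hides the transfer plumbing from the Data Scientist, can be sketched as below. In the talk's setup the plumbing behind such a helper is a PySpark job plus Kerberos/Hadoop permissions; here a sqlite3-to-CSV dump stands in so the sketch stays runnable anywhere, and all names are illustrative:

```python
import csv
import os
import sqlite3

def transfer_table(conn, table, dest_dir):
    """Copy one table from the 'DWH' side to the 'lake' side as a flat file.
    The caller never touches Spark jobs or Hadoop security: one call, one table.
    (Illustrative stand-in: a CSV dump instead of a real PySpark write.)"""
    cur = conn.execute('SELECT * FROM ' + table)
    path = os.path.join(dest_dir, table + '.csv')
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header from column metadata
        writer.writerows(cur)
    return path
```

The design point is the narrow signature: the Data Scientist supplies a connection and a table name, and the helper owns authentication, file layout, and write permissions.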
8. Comtrade System Integration
Speed up writing SQL queries
→ADS [GENERATE SQL QUERY] Training/Scoring table
→Query automation for training table
→ Input for the Python script, and an example of the Python script:
SCHEMA | SOURCE     | VAR_IN                                                  | VAR_OUT               | FUNCTIONS         | PERIODS | ZERO_EXCLUDE
ADS    | DS_PAYMENT | TOTAL_PAYMENT_AMT                                       | TOTAL_PAYMENT_AMT     | [MAX, AVG/P]      | [3, 6]  | 1
ADS    | DS_PAYMENT | TOTAL_PAYMENT_CNT                                       | TOTAL_PAYMENT_CNT     | [SUM]             | [1]     | 1
ADS    | DS_PAYMENT | MAX_PAYMENT_AMT                                         | MAX_PAYMENT_AMT       | [MAX]             | [3]     | 1
ADS    | DS_PAYMENT | MIN_PAYMENT_AMT                                         | MIN_PAYMENT_AMT       | [MIN]             | [3]     | 1
ADS    | DS_PAYMENT | ADD_PAYMENT_CNT                                         | ADD_PAYMENT_CNT       | [AVG/P]           | [6]     | 1
ADS    | DS_USAGE   | USAGE_OUT_DUR                                           | USAGE_OUT_DUR         | [SUM]             | [1]     | 1
ADS    | DS_USAGE   | USAGE_OUT_DUR                                           | USAGE_OUT_DUR         | [AVG/P, MAX, MIN] | [3, 6]  | 1
ADS    | DS_USAGE   | USAGE_OUT_IN_PACK_DUR                                   | USAGE_OUT_IN_PACK_DUR | [SUM]             | [1]     | 1
ADS    | DS_USAGE   | NVL(USAGE_OUT_REG_INT_DUR,0) + NVL(USAGE_OUT_INT_DUR,0) | USAGE_OUT_INT_DUR     | [AVG/P]           | [6]     | 1
for i, line in enumerate(variables):
    for i2, k in enumerate(line[2]):        # functions
        for i3, kk in enumerate(line[3]):   # periods
            # 'zarez' means comma: skip it after the last generated column
            if (i == len(variables) - 1) and (i2 == len(line[2]) - 1) and (i3 == len(line[3]) - 1):
                zarez = ''
            else:
                zarez = ','
            # Builds an aggregation column, e.g. AVG(FIELD_NAME) AS NEW_FIELD_NAME
            divider = ''
            if 'AVG/P' == k.upper():
                func1 = 'SUM'
                func2 = '_AVG'
                divider = '/' + str(kk)
            elif ('SUM' == k.upper()) and (kk == '1'):
                func1 = 'SUM'
                func2 = ''
            else:
                func1 = k
                func2 = '_' + k
            query += (func1 + '(' + line[1] + '_' + str(kk) + 'M)' + divider
                      + ' AS ' + line[1] + func2 + '_' + str(kk) + 'M' + zarez + '\n')
…
for i, line in enumerate(variables):
    for i2, line2 in enumerate(line[3]):    # periods
        if (i == len(variables) - 1) and (i2 == len(line[3]) - 1):
            zarez = ''
        else:
            zarez = ','
        if line[4] == 1:
            # ZERO_EXCLUDE flag: ignore zero values of the source variable
            zero_rule = 'AND {varijabla} <> 0'.format(varijabla=line[0])
        else:
            zero_rule = ''
        query += ("CASE WHEN TIME_ID BETWEEN ADD_MONTHS('{datum_place}', {vreme2}) AND "
                  "'{datum_place}' {zero_rule} THEN {varijabla} ELSE NULL END AS "
                  "{varijabla2}_{vreme}M{zarez_place}".format(
                      varijabla=line[0], varijabla2=line[1], datum_place=datum,
                      vreme2=-1 * (int(line2) - 1), zero_rule=zero_rule,
                      vreme=line2, zarez_place=zarez)) + '\n'
query += ("FROM\n" ...
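Boiled down to a single metadata row, the generation step looks like this. This is a self-contained sketch with hypothetical input values (`row` and `datum` are made up; the row layout follows the table above: VAR_IN, VAR_OUT, FUNCTIONS, PERIODS, ZERO_EXCLUDE):

```python
# One metadata row -> one windowed CASE WHEN column for the training table.
row = ['TOTAL_PAYMENT_AMT', 'TOTAL_PAYMENT_AMT', ['MAX'], ['3'], 1]
datum = '2022-11-01'   # hypothetical scoring/reference date

period = row[3][0]
# ZERO_EXCLUDE flag appends a filter that ignores zero values.
zero_rule = 'AND {v} <> 0'.format(v=row[0]) if row[4] == 1 else ''
column = ("CASE WHEN TIME_ID BETWEEN ADD_MONTHS('{d}', {m}) AND '{d}' {z} "
          "THEN {v} ELSE NULL END AS {o}_{p}M").format(
              d=datum, m=-1 * (int(period) - 1), z=zero_rule,
              v=row[0], o=row[1], p=period)
```

The generated `column` restricts the source value to the trailing 3-month window and aliases it `TOTAL_PAYMENT_AMT_3M`, ready to be wrapped by the aggregation functions from the first loop.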
9. Comtrade System Integration
Develop phase - Devote more time to the creative side
→Improve ML traditional development processes:
• Benefit from pre-trained models (deep learning – mainly image recognition)
• Automated Machine learning (AutoML) – pretty good in supervised ML
→AutoML:
• Optimizes DS workload or compensates for a lack of experience
• Handles tasks like Feature Selection, Data Preprocessing, Hyperparameter Optimization,
and Model/Algorithm Selection
• Lets you focus more on the data side
• Is no silver bullet – it is more an exploration tool than an optimal-model generator
MLBox, Auto-Sklearn, TPOT, H2O AutoML, Auto Keras, Auto PyTorch, Google Cloud AutoML, DataRobot, etc.
10. Comtrade System Integration
Deploy phase - you don’t get any value out of a model sitting on someone’s computer
→Phase where model is transferred to a production environment.
→Same best-practice principles and design patterns for software also apply to ML models
→ML model should be deployed as part of existing data pipeline
→Output of ML model should be monitored for bias
→ML model in deploy phase:
• Registered in appropriate repository
• Passed testing
• Model artifacts are retained
→Validate model → Publish model → Deliver model
→Don’t update Python libraries before proper testing on the development environment 😊
11. Comtrade System Integration
Deploy phase – more than one ML model
→Model registry:
• Place for all trained/production-ready models (with version control)
• Alternative models as backup
• All model artifacts, model dependencies, evaluation metrics, documentation
• Which dataset was used for training / model lineage
• Logging of performance details of the model and comparisons with other models
• Tracking of models across the whole lifecycle (training, staging, and production)
→A model registry enables faster deployment of your models or retraining of the current ones
→Shared by multiple team members (team collaboration)
→Tie business rules to the output of the production model
→Consume the model through API integration
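To make the registry idea concrete, here is a toy file-based registry in stdlib Python. This is illustrative only (in practice a tool such as MLflow plays this role; the class and field names are our own), but it shows the three properties listed above: version control, retained artifacts, and dataset lineage per version:

```python
import json
import os
import time

class ModelRegistry:
    """Toy file-based model registry: each register() call stores a new
    version with its artifact, evaluation metrics, and training-dataset lineage."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def register(self, name, artifact_bytes, metrics, training_dataset):
        version = len(self.list_versions(name)) + 1      # simple version control
        vdir = os.path.join(self.root, name, str(version))
        os.makedirs(vdir)
        with open(os.path.join(vdir, 'model.bin'), 'wb') as f:
            f.write(artifact_bytes)                      # retain the model artifact
        meta = {'version': version,
                'metrics': metrics,                      # evaluation metrics
                'training_dataset': training_dataset,    # model lineage
                'registered_at': time.time(),
                'stage': 'staging'}                      # training/staging/production
        with open(os.path.join(vdir, 'meta.json'), 'w') as f:
            json.dump(meta, f)
        return version

    def list_versions(self, name):
        path = os.path.join(self.root, name)
        return sorted(os.listdir(path), key=int) if os.path.isdir(path) else []
```

Because every version keeps its metadata next to its artifact, rolling back to an alternative model is a directory lookup rather than a retraining job.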
DANIJEL
During deployment in large organizations we have to orchestrate more than one ML model, and it is best to keep in mind from the very beginning of the project that there will be more ML models in the future, so organize everything in a way that supports adding new models easily.
…
- From the very beginning, special focus in the Data Science Lifecycle should be on data quality and production.
Foundation for more models in the future:
The development of an analytical dataset for future model development can be viewed as a separate project.
JELENA
So if we go into more detail…
When developing a model, focus on data preparation
– Organize DB tables considering performance and optimization
Analyze the addition of columns and important data sources
Design the sources and target tables, organize the tables in terms of performance and logic, and keep a high-level view of what the use cases are.
JELENA
MENTION:
Organize DB tables considering performance and optimization
Feature Engineering - isn't about generating a higher quantity of new features. It's about the quality of the features created.
- DANIJEL OR JELENA
Domain knowledge cannot be optimized.
- Make an instruction file with field names and the action for handling nulls:
constant value, Max(), Min(), Mean(), nearby value, regression, delete record
- Domain knowledge will allow you to take the impact of your machine learning skills to a much higher level of significance.
--------------
--Random forests, for example, can handle heterogeneous data types right out of the box.
As a Data Scientist with domain knowledge you will have the answer to the question "Which data points add value?" and you just need to find them.
DANIJEL
MILOS
Benefits / suggestions:
Parallel execution
No temp data on the initial database
Fast transfer
Be careful about data types specified at the table level
DANIJEL
JELENA TO THE END
Efficiently automate all regular, manual, and tedious workloads of ML implementations
"Falls short" for Feature Engineering.
Can easily overfit (watch the label distribution, the number of outliers, etc.).