Data Ops: From the Lab to the Production Line, and How to Work with Data Scientists

The Data Scientist as Imagined (Not Quite Right)
Source: http://codewithmax.com/2018/03/06/basic-example-of-a-neural-network-with-tensorflow-and-keras/

The Data Scientist in Reality
Source: Sculley et al.: Hidden Technical Debt in Machine Learning Systems

When I Want to Do a "Data Science" Project
Data cleaning, data analysis, data validation
Data splitting, model training, model validation
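
The six project steps above can be sketched end to end in a few lines. This is only a minimal illustration using the Python standard library: the toy records, the sanity rules, and the one-feature least-squares model are all assumptions made for the example, not something taken from the talk.

```python
import random
import statistics

# Hypothetical toy records: (mileage in thousands of km, price in thousands).
RAW = [(10, 95.0), (20, 90.5), (35, 82.0), (50, 76.0), (None, 70.0),
       (65, 67.5), (80, 61.0), (95, 52.0), (110, 45.5), (-5, 99.0),
       (125, 38.0), (140, 30.5)]


def clean(rows):
    """Data cleaning: drop rows with missing values."""
    return [r for r in rows if None not in r]


def validate(rows):
    """Data validation: keep only rows that pass basic sanity rules (assumed rules)."""
    return [(x, y) for x, y in rows if x >= 0 and y > 0]


def split(rows, test_fraction=0.25, seed=0):
    """Data splitting: shuffle with a fixed seed, then hold out a test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def train(rows):
    """Model training: ordinary least squares for y = a * x + b."""
    xs, ys = zip(*rows)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def evaluate(model, rows):
    """Model validation: mean absolute error on held-out rows."""
    a, b = model
    return statistics.fmean(abs(a * x + b - y) for x, y in rows)


rows = validate(clean(RAW))
# Data analysis: a quick look at the cleaned data before modeling.
print("rows:", len(rows), "mean price:", round(statistics.fmean(y for _, y in rows), 1))
train_rows, test_rows = split(rows)
model = train(train_rows)
print("slope:", round(model[0], 3), "held-out MAE:", round(evaluate(model, test_rows), 2))
```

Everything here runs once, on one laptop, for one person, which is roughly what separates a project from the product discussed next.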

When I Want to Build a "Data Science" Product
Source: https://udn.com/news/story/11320/3222213

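Data cleaning, data analysis, data validation, data splitting
Model training, model validation, training at scale, model updates
Model deployment, model monitoring, model logging, model optimization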

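Data Science Workflow
• Consistent orchestration and environments
• Scalable team collaboration on modeling
• Continuously meeting requirements
• Improved iteration cycles; automated deployment
• Reproducible results
• Quality and performance monitoring; testing and monitoring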
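
One way to approach "reproducible results" is to route all randomness through a recorded seed and to derive a run ID from the full configuration. The sketch below assumes a hypothetical configuration dictionary and a stand-in `run_experiment` function; none of the keys or names come from the talk.

```python
import hashlib
import json
import random


def run_fingerprint(config):
    """Hash the full run configuration so identical setups get identical IDs."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run_experiment(config):
    """Stand-in for a training run: all randomness flows from config['seed']."""
    rng = random.Random(config["seed"])
    # Hypothetical "metric": a noisy score that depends only on the seeded RNG.
    score = round(0.8 + rng.uniform(-0.05, 0.05), 4)
    return {"run_id": run_fingerprint(config), "score": score}


# Hypothetical configuration; every key here is illustrative.
config = {"seed": 42, "data_version": "2018-10-01", "model": "knn", "k": 5}
first = run_experiment(config)
# Same settings, different key order: the fingerprint and the score should match.
second = run_experiment(dict(reversed(list(config.items()))))
print(first, first == second)
```

Storing the run ID next to the model artifact makes it possible to tell later whether two results came from the same setup.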

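Development + Operations
• Development + Operations = DevOps
• Users, developers, QA, and operations staff work together to solve software delivery problems.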

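Data + Operations
• Data + Operations = DataOps
• All data practitioners (data analysts, data scientists, data engineers, IT staff, and so on) work together to continuously deliver quality data to applications and business processes.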

Data + Operations
Source: https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7

Putting DataOps into Practice
Source: https://www.kubeflow.org/
Source: KubeCon Europe 2018
Source: https://blog.paperspace.com/ci-cd-for-machine-learning-ai/
Source: https://www.infuseai.io

Machine Learning Process
• Data sources: SQL DB, Cosmos DB, Datawarehouse, Data lake, Blob storage, …
• Stages: Prepare Data → Build & Train → Deploy

Machine Learning Problem Example
How much is this car worth?

Model Creation Is Typically Time-Consuming
• Which features? Mileage, Condition, Car brand, Year of make, Regulations, …
• Which algorithm? Gradient Boosted, Nearest Neighbors, SVM, Bayesian Regression, LGBM, …
• Which parameters? Parameter 1, Parameter 2, Parameter 3, Parameter 4, …
• Gradient Boosted (Criterion, Loss, Min Samples Split, Min Samples Leaf, Others) on Mileage, Car brand, Year of make → Model
• Nearest Neighbors (N Neighbors, Weights, Metric, P, Others) on Car brand, Year of make, Condition → Model
• Iterate
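
To give a feel for why this iteration is time-consuming, the sketch below counts how many candidate models a naive exhaustive search would have to train. The feature and parameter names follow the slide; the value grids are hypothetical, since the slide names the parameters but not their values.

```python
from itertools import combinations, product

FEATURES = ["mileage", "condition", "car_brand", "year_of_make", "regulations"]

# Hypothetical value grids, chosen only to make the count concrete.
ALGORITHMS = {
    "gradient_boosted": {
        "criterion": ["friedman_mse", "squared_error"],
        "loss": ["squared_error", "huber"],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 3, 5],
    },
    "nearest_neighbors": {
        "n_neighbors": [3, 5, 9],
        "weights": ["uniform", "distance"],
        "metric": ["minkowski"],
        "p": [1, 2],
    },
}


def feature_subsets(features):
    """Yield every non-empty subset of the feature list."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def candidates(features, algorithms):
    """Yield (feature subset, algorithm name, parameter dict) for every combination."""
    for subset in feature_subsets(features):
        for name, grid in algorithms.items():
            keys = list(grid)
            for values in product(*(grid[k] for k in keys)):
                yield subset, name, dict(zip(keys, values))


total = sum(1 for _ in candidates(FEATURES, ALGORITHMS))
print("candidate models to train:", total)
```

With these assumed grids there are 31 feature subsets and 48 parameter settings across the two algorithms, so 1,488 candidates, and that is before adding SVM, Bayesian Regression, or LGBM.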

Machine Learning Complexity
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Model Selection & Hyperparameter Tuning
• Dataset → Model Training Infrastructure
• Training Algorithm 1: Hyperparameter Values config 1 → Model 1; config 2 → Model 2; config 3 → Model 3
• Training Algorithm 2: Hyperparameter Values config 4 → Model 4
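
The picture above (two algorithms, four hyperparameter configurations, four models) can be sketched as a small grid search that trains each pair and keeps the model with the lowest validation error. The toy data and both "algorithms" below are assumptions chosen to keep the example in the standard library; a real setup would more likely use a library such as scikit-learn.

```python
import statistics

# Hypothetical toy data: (mileage in thousands of km, price in thousands).
TRAIN = [(10, 95), (30, 85), (50, 75), (70, 65), (90, 55), (110, 45), (130, 35)]
VALID = [(20, 90), (60, 70), (100, 50)]


def fit_knn(train, k):
    """Algorithm 1: k-nearest-neighbors regression on one feature."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return statistics.fmean(y for _, y in nearest)
    return predict


def fit_mean(train, trim):
    """Algorithm 2: predict a constant, the mean after trimming `trim` extremes per side."""
    ys = sorted(y for _, y in train)
    kept = ys[trim:len(ys) - trim] if trim else ys
    constant = statistics.fmean(kept)
    return lambda x: constant


def mean_abs_error(predict, rows):
    """Validation metric: mean absolute error."""
    return statistics.fmean(abs(predict(x) - y) for x, y in rows)


# Each (algorithm, hyperparameter config) pair yields one trained model.
GRID = [
    ("knn", fit_knn, {"k": 1}),
    ("knn", fit_knn, {"k": 2}),
    ("knn", fit_knn, {"k": 5}),
    ("trimmed_mean", fit_mean, {"trim": 1}),
]

results = []
for name, fit, config in GRID:
    model = fit(TRAIN, **config)
    results.append((mean_abs_error(model, VALID), name, config))

best_error, best_name, best_config = min(results, key=lambda r: r[0])
print("best:", best_name, best_config, "validation MAE:", best_error)
```

Every row in `GRID` costs one full training run, which is why the training infrastructure in the diagram matters as soon as the grid grows.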

Introducing Automated Machine Learning
• Inputs: Dataset, Optimization Metric, Constraints (Time/Cost)
• Automated ML → ML Model
• Accessible & Faster

Automated ML Accelerates Model Development
• Input: Enter data, Define goals, Apply constraints
• Intelligently test multiple models in parallel
• Output: Optimized model
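
The input/constraint/output shape of such a service can be illustrated with a budgeted search: a metric to optimize, a limit on trials and wall-clock time, and candidates evaluated in parallel batches. This sketch uses plain random search, not the collaborative filtering and Bayesian optimization the product is described as using, and the `score` function is a made-up stand-in for training and validating a model.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def score(config):
    """Hypothetical stand-in for 'train a model and return its validation error'."""
    # A made-up error surface whose minimum is at k=5, weight=0.5.
    return (config["k"] - 5) ** 2 + (config["weight"] - 0.5) ** 2


def sample_config(rng):
    """Draw one random candidate configuration (illustrative parameter names)."""
    return {"k": rng.randint(1, 20), "weight": round(rng.random(), 2)}


def automated_search(metric, max_trials, time_budget_s, batch_size=4, seed=0):
    """Try random configs in parallel batches until a trial or time limit is hit."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best = None
    trials = 0
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        while trials < max_trials and time.monotonic() < deadline:
            size = min(batch_size, max_trials - trials)
            batch = [sample_config(rng) for _ in range(size)]
            # pool.map preserves input order, so configs and errors line up.
            for config, error in zip(batch, pool.map(metric, batch)):
                if best is None or error < best[0]:
                    best = (error, config)
            trials += len(batch)
    return best, trials


best, trials = automated_search(score, max_trials=40, time_budget_s=2.0)
print("trials:", trials, "best error:", best[0], "config:", best[1])
```

Threads are enough here because the stand-in metric is trivial; CPU-bound training would more plausibly run in separate processes or on remote compute.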

Automated ML Customer Testimonials
• Press coverage from public preview: CNET, VentureBeat, PRNewswire
• “I quite like your AutoML function. It gives me good results compared to other libraries I tested before (tpot and auto-sklearn) that I believe was only looking at scores and often gave me models that over-trained my data. And of course the model from your suggested code is better.”
- Big oil company
• “I will start with AutoML and use the algorithm that AutoML recommends to further tune the model”
- Data Scientist
• “I actually enjoy being able to use AutoML in a Jupyter notebook. The DataRobot interface was nice for non-experts, but for someone like me, it felt a bit basic.”
- Data Scientist

Automated ML Capabilities
• Based on Microsoft Research
• Brain trained with several million experiments
• Collaborative filtering and Bayesian optimization
• Privacy preserving: No need to “see” the data

Automated ML Capabilities
• ML Scenarios: Classification & Regression, Forecasting
• Integration: Azure Machine Learning, Azure Notebooks, Jupyter Notebooks
• Data Type: Numeric, Text
• Languages: Python SDK for deployment and hosting for inference
• Training Compute: Local Machine, Remote Azure DSVM (Linux), Azure Batch AI, Databricks
• Transparency: View run history, model metrics
• Scale: Faster model training using multiple cores and parallel experiments

Feature Engineering
• Dropping high cardinality or no variance features
• Missing value imputation
• Generating additional features
• Transformations and encodings
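
Three of these steps (dropping uninformative columns, imputing missing values, and encoding) are sketched below on a handful of made-up car records. The 0.9 cardinality threshold, the median/mode imputation rule, and the one-hot encoding are assumptions for illustration; this is not how Automated ML implements them.

```python
import statistics

# Hypothetical rows for the car-price example; None marks a missing value.
ROWS = [
    {"vin": "A1", "brand": "Toyota", "mileage": 30, "wheels": 4},
    {"vin": "B2", "brand": "Honda", "mileage": None, "wheels": 4},
    {"vin": "C3", "brand": "Toyota", "mileage": 90, "wheels": 4},
    {"vin": "D4", "brand": None, "mileage": 60, "wheels": 4},
]


def drop_uninformative(rows, max_cardinality_ratio=0.9):
    """Drop columns with a single value (no variance) or nearly all-unique strings."""
    keep = []
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]
        distinct = set(values)
        no_variance = len(distinct) <= 1
        is_text = all(isinstance(v, str) for v in values)
        high_cardinality = is_text and len(distinct) / len(rows) >= max_cardinality_ratio
        if not (no_variance or high_cardinality):
            keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]


def impute(rows):
    """Fill missing numbers with the column median and missing text with the mode."""
    filled = [dict(r) for r in rows]
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        if all(isinstance(v, (int, float)) for v in present):
            fill = statistics.median(present)
        else:
            fill = statistics.mode(present)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def one_hot(rows, col):
    """Encoding: replace a text column with one 0/1 column per category."""
    categories = sorted({r[col] for r in rows})
    return [
        {**{k: v for k, v in r.items() if k != col},
         **{f"{col}={c}": int(r[col] == c) for c in categories}}
        for r in rows
    ]


features = one_hot(impute(drop_uninformative(ROWS)), "brand")
print(features)
```

In this toy data `vin` is dropped as high cardinality and `wheels` as zero variance, leaving `mileage` and an encoded `brand`.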

Model Explainability
• Feature importance as part of training
• Local feature importance for a given sample
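
Both kinds of importance can be illustrated with simple model-agnostic techniques. The sketch below uses permutation importance for the global view and a "replace one feature with its mean" comparison for the local view; these are swapped-in techniques for illustration, and the fitted model, data, and feature names are all hypothetical.

```python
import random
import statistics


# Hypothetical fitted model: price depends on mileage and age, not on color_code.
def predict(row):
    return 100.0 - 0.5 * row["mileage"] - 2.0 * row["age"] + 0.0 * row["color_code"]


rng = random.Random(0)
DATA = [
    {"mileage": rng.uniform(0, 150), "age": rng.uniform(0, 15), "color_code": rng.randint(0, 5)}
    for _ in range(200)
]
# Targets equal the model's own predictions, so the baseline error is zero.
TARGET = [predict(r) for r in DATA]


def mean_abs_error(rows, targets):
    return statistics.fmean(abs(predict(r) - t) for r, t in zip(rows, targets))


def permutation_importance(rows, targets, feature, seed=0):
    """Global importance: how much the error grows when one column is shuffled."""
    shuffled = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return mean_abs_error(permuted, targets) - mean_abs_error(rows, targets)


def local_importance(row, rows, feature):
    """Local importance: prediction change when one feature is set to its mean."""
    baseline = {**row, feature: statistics.fmean(r[feature] for r in rows)}
    return predict(row) - predict(baseline)


importance = {f: permutation_importance(DATA, TARGET, f) for f in DATA[0]}
print(importance)
print("local effect of mileage on first sample:", local_importance(DATA[0], DATA, "mileage"))
```

A feature the model ignores, such as `color_code` here, shows zero importance in both views, which is a quick sanity check on any explanation method.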
