TransmogrifAI
Automate Machine Learning Workflow with the power of Scala and
Spark at massive scale.
By: Chetan Khatri (@khatri_chetan)
Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France
About me
Lead - Data Science @ Accion labs India Pvt. Ltd.
Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark
HBase Connectors.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
Advisor - Data Science Lab, University of Kachchh, India.
M.Sc. - Computer Science from University of Kachchh, India.
Agenda
● What is TransmogrifAI ?
● Why you need TransmogrifAI ?
● Automation of the machine learning life cycle, from development to deployment.
○ Feature Inference
○ Transformation
○ Automated Feature validation
○ Automated Model Selection
○ Hyperparameter Optimization
● Type safety in Spark and TransmogrifAI.
● Example code: the Titanic Kaggle problem.
What is TransmogrifAI ?
● TransmogrifAI was open sourced by Salesforce.com, Inc. in June 2018.
● An end-to-end automated machine learning workflow library for structured
data, built on top of Scala and SparkML.
What is TransmogrifAI ?
● TransmogrifAI automates much of the machine learning model life cycle:
Feature Selection, Transformation, Automated Feature Validation,
Automated Model Selection, and Hyperparameter Optimization.
● It enforces compile-time type safety, modularity, and reuse.
● Through automation, it achieves accuracies close to hand-tuned models with
an almost 100x reduction in time.
Why you need TransmogrifAI ?
Features:
● AUTOMATION: Numerous Transformers and Estimators.
● MODULARITY AND REUSE: Enforces a strict separation between ML workflow definitions and data manipulation.
● COMPILE TIME TYPE SAFETY: Workflows are strongly typed, with code completion during development and fewer runtime errors.
● TRANSPARENCY: Model insights leverage stored feature metadata and lineage to help debug models.
Why you need TransmogrifAI ?
Use TransmogrifAI if you need a machine learning library to:
● Build production ready machine learning applications in hours, not months
● Build machine learning models without getting a Ph.D. in machine learning
● Build modular, reusable, strongly typed machine learning workflows
Read more in the documentation: https://transmogrif.ai/
Why Machine Learning is hard ?! Really! ...
For example, this may be using a linear
classifier when your true decision
boundaries are non-linear.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
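The mismatch between a linear classifier and a non-linear boundary can be shown on a toy case. This is only a sketch under assumptions: the XOR data, the weight grid, and the helper name `accuracy` are hypothetical and do not come from the talk or the referenced post. It searches a grid of linear rules on XOR, then repeats the search after adding one interaction feature.

```python
from itertools import product

# XOR: the label is 1 when exactly one input is 1 (a non-linear boundary).
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def accuracy(weights, bias, rows):
    """Fraction of rows that the linear rule (w . x + b > 0) classifies correctly."""
    hits = 0
    for row, label in zip(rows, labels):
        score = sum(w * x for w, x in zip(weights, row)) + bias
        hits += int((score > 0) == bool(label))
    return hits / len(rows)

# Hypothetical search space: every weight and bias from -2.0 to 2.0 in steps of 0.5.
grid = [i / 2 for i in range(-4, 5)]

# Best linear rule on the raw inputs (x1, x2).
best_linear = max(accuracy((w1, w2), b, points)
                  for w1, w2, b in product(grid, repeat=3))

# Same search after adding the interaction feature x1 * x2.
augmented = [(x1, x2, x1 * x2) for x1, x2 in points]
best_augmented = max(accuracy((w1, w2, w3), b, augmented)
                     for w1, w2, w3, b in product(grid, repeat=4))

print(best_linear, best_augmented)
```

No linear rule on the raw inputs gets all four points right, while the augmented inputs are linearly separable. Automated feature engineering aims to close exactly this kind of gap.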
Why Machine Learning is hard ?! Really! ...
Fast and effective debugging is the skill most required for
implementing modern machine learning pipelines.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
Real time Machine Learning takes time to Productionize
TransmogrifAI automates the entire ML life cycle to accelerate developer
productivity.
Under the Hood
Automated Feature Engineering
Automated Feature Selection
Automated Model Selection
Automated Feature Engineering
Automatic Derivation of new features based on existing features.
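Input columns: Email, Phone, Age, Subject, Zip Code, DOB, Gender
● Email -> Email is Spam
● Phone -> Country Code
● Age -> buckets [0-20], [21-30], [> 30]
● Subject -> Stop words, Top terms (TF-IDF), Detect Language
● Zip Code -> Average Income, House Price, School Quality, Shopping, Transportation
● Gender -> To Binary
● DOB -> Age, Day of Week, Week of Year, Quarter, Month, Year, Hour
All derived features are combined into a single Feature Vector.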
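A minimal sketch of this kind of derivation, using only the Python standard library. The record layout, the field names (`dob`, `gender`), and the function `derive_features` are assumptions made for illustration; this is not how TransmogrifAI implements it.

```python
from datetime import date

def derive_features(record, today):
    """Derive new features from one raw record (hypothetical field names)."""
    dob = record["dob"]
    # Age in whole years as of `today`.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age <= 20:
        bucket = "0-20"
    elif age <= 30:
        bucket = "21-30"
    else:
        bucket = ">30"
    return {
        "age": age,
        "age_bucket": bucket,
        "dob_day_of_week": dob.isoweekday(),       # 1 = Monday ... 7 = Sunday
        "dob_week_of_year": dob.isocalendar()[1],  # ISO week number
        "dob_quarter": (dob.month - 1) // 3 + 1,
        "dob_month": dob.month,
        "dob_year": dob.year,
        "gender_is_female": int(record["gender"].lower() == "f"),
    }

features = derive_features({"dob": date(1990, 5, 17), "gender": "F"},
                           today=date(2018, 10, 30))
print(features)
```

Each derived value becomes one slot of the feature vector, so a single date column can contribute several features.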
Automated Feature Engineering
● Analyze every feature column and compute descriptive statistics.
○ Number of nulls, number of empty strings, mean, max, min, standard deviation.
● Handle missing values / noisy values.
○ Ex. fillna with the mean, the median, or nearby values.
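import numpy as np
from sklearn.impute import SimpleImputer

# Fill every missing value with a sentinel.
patient_details = patient_details.fillna(-1)
# Fill a categorical column with a placeholder category.
data['City_Type'] = data['City_Type'].fillna('Z')
# Median imputation per column (SimpleImputer replaces the removed sklearn Imputer).
imp = SimpleImputer(missing_values=np.nan, strategy='median', copy=False)
data_total_imputed = imp.fit_transform(data_total)
# Mark zero values as missing (NaN).
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)
# Fill missing values with the column means.
dataset.fillna(dataset.mean(), inplace=True)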
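The two steps on this slide, descriptive statistics and imputation, can also be sketched without pandas. The column `ages` and the helpers `describe` and `impute_mean` are hypothetical names for illustration. The extra null-indicator column is one common way to keep track of which values were imputed.

```python
import statistics

def describe(values):
    """Descriptive statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present),
    }

def impute_mean(values):
    """Fill missing values with the column mean and keep a null-indicator column."""
    mean = statistics.mean(v for v in values if v is not None)
    filled = [mean if v is None else v for v in values]
    was_null = [int(v is None) for v in values]
    return filled, was_null

ages = [22, None, 38, 26, None, 34]  # hypothetical column with two missing values
stats = describe(ages)
filled, was_null = impute_mean(ages)
print(stats)
print(filled, was_null)
```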
Automated Feature Engineering
● Do features have acceptable ranges? Do they contain valid values?
● Could the feature be a leaker?
○ Is it usually filled out after the predicted field is?
○ Is it highly correlated with the predicted field?
● Does the feature contain outliers?
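The correlation check for leakage can be sketched as follows. The 0.95 threshold, the column names, and the helper `flag_leakers` are assumptions for illustration, not TransmogrifAI defaults. A feature that tracks the label almost perfectly is flagged for review.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def flag_leakers(features, label, threshold=0.95):
    """Names of features whose absolute correlation with the label is suspiciously high."""
    return [name for name, column in features.items()
            if abs(pearson(column, label)) >= threshold]

label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [25, 31, 47, 52, 38, 29],
    # Hypothetical field that is only filled in once the outcome is known.
    "closed_won_amount": [0, 120, 0, 90, 150, 0],
}
suspects = flag_leakers(features, label)
print(suspects)
```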
Automated Feature Selection / Data Pre-processing
● Data Type of Features, Automatic Data Pre-processing.
○ MinMaxScaler
○ Normalizer
○ Binarizer
○ Label Encoding
○ One Hot Encoding
● Auto Data Pre-Processing based on chosen ML Model.
● Algorithms like XGBoost specifically require dummy-encoded data, while
algorithms like decision trees don’t seem to care at all (sometimes)!
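Two of the listed transformations, min-max scaling and one-hot (dummy) encoding, written out in plain Python to show what they compute. The sample columns `fares` and `ports` are hypothetical. In practice a library implementation such as scikit-learn's `MinMaxScaler` would be used instead.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """One 0/1 column per distinct category, in sorted category order."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

fares = [10.0, 20.0, 30.0, 50.0]   # hypothetical numeric column
ports = ["S", "C", "S", "Q"]       # hypothetical categorical column

scaled = min_max_scale(fares)
categories, encoded = one_hot(ports)
print(scaled)
print(categories, encoded)
```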
Auto Data Pre-processing
● Numeric - Imputation, Track Null Value, Log Transformation for large range,
Scaling, Smart Binning.
● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy
Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category
Embedding.
● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis,
Language Detection.
● Temporal - Time Difference, Circular Statistics, Time Extraction(Day, Week,
Month, Year).
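Two of these transformations are easy to misread, so here is a small sketch. It assumes that "circular statistics" refers to the usual sine/cosine encoding of periodic values; the helper names are hypothetical. Encoding the hour on a circle keeps 23:00 close to 00:00, and a log transform compresses a large numeric range.

```python
import math

def circular_encode(value, period):
    """Map a periodic value (hour of day, day of week, ...) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circular_distance(a, b, period):
    """Straight-line distance between two encoded values."""
    (sin_a, cos_a), (sin_b, cos_b) = circular_encode(a, period), circular_encode(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# On the raw scale 23 and 0 are 23 apart; on the circle they are neighbours.
near = circular_distance(23, 0, period=24)
far = circular_distance(12, 0, period=24)

# Log transformation for a wide-ranging numeric feature; log1p also handles 0.
log_income = math.log1p(1_000_000)

print(near, far, log_income)
```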
Auto Selection of Best Model with Hyperparameter Tuning
● Machine Learning Model
○ Learning Rate
○ Epochs
○ Batch Size
○ Optimizer
○ Activation Function
○ Loss Function
● Search Algorithms to find best model and optimal hyper parameters.
○ Ex. Grid Search, Random Search, Bandit Methods
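Grid search and random search differ only in which candidates they try. The sketch below makes that concrete with a toy objective: `validation_loss` is a hypothetical stand-in for training a model and measuring its validation loss, and the grid values are made up.

```python
import itertools
import random

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for 'train a model, return its validation loss'."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2

def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    candidates = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
    return min(candidates, key=lambda params: validation_loss(**params))

def random_search(grid, n_trials, seed=0):
    """Evaluate only n_trials randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    trials = [{name: rng.choice(values) for name, values in grid.items()}
              for _ in range(n_trials)]
    return min(trials, key=lambda params: validation_loss(**params))

grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "batch_size": [16, 32, 64, 128],
}
best_grid = grid_search(grid)                  # 16 evaluations
best_random = random_search(grid, n_trials=5)  # 5 evaluations
print(best_grid, best_random)
```

Grid search is exhaustive, so its cost grows multiplicatively with every added hyperparameter. Random search trades the guarantee of finding the best grid point for a fixed evaluation budget.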
Examples - Hyper parameter tuning
XGBoost:
import numpy as np
import xgboost as xgb

# Booster hyperparameters, tuned by hand ('verbosity': 0 replaces the deprecated 'silent': 1).
params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 9,
          'eval_metric': 'logloss', 'eta': 0.02, 'verbosity': 0, 'nthread': 4,
          'subsample': 0.9, 'colsample_bytree': 0.9, 'scale_pos_weight': 1,
          'min_child_weight': 3, 'max_delta_step': 3}
num_rounds = 400
params['seed'] = 523264626346  # 0.85533
# Train on the labelled data, then predict probabilities for the test set.
dtrain = xgb.DMatrix(train, labels, missing=np.nan)
clf = xgb.train(params, dtrain, num_rounds)
dtest = xgb.DMatrix(test, missing=np.nan)
test_preds = clf.predict(dtest)
Examples - Hyper parameter tuning
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Three candidate tree ensembles, each with hand-picked hyperparameters.
rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry),
                            criterion=criterion, max_depth=max_depth,
                            n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
# GradientBoostingClassifier has no n_jobs parameter.
gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features,
                                 subsample=subsample)
ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion,
                           max_features=max_features, n_jobs=-1)
# 10-fold stratified cross-validation scored by F1; gbm and ext can be scored the same way.
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(rf, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')
# Exhaustive grid search over the random forest hyperparameters.
param_grid = {'n_estimators': [500, 1000, 1500, 2000, 2500, 3000],
              'criterion': ['gini', 'entropy'],
              'max_features': [15, 20, 25, 30],
              'max_depth': [4, 5, 6]}
gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1, verbose=2).fit(X_train[subspace], y_train)
gs_cv.best_params_
Ensemble Modeling
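from scipy.stats import rankdata

# ens holds one column of predictions per model; ens['XGB1'] is assumed to be filled the same way.
ens['XGB2'] = xgb2_pred['Disbursed']
ens['RF'] = rf_pred['Disbursed']
ens['FTRL'] = ftrl_pred['Disbursed']
# Convert each model's scores to ranks so that they are on a comparable scale.
ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
ens['RF_Rank'] = rankdata(ens['RF'], method='min')
ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
# Weighted blend on the rank scale.
ens['Final'] = (0.75 * ens['XGB_Rank'] + 0.25 * ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL_Rank']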
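Rank averaging can be written out without SciPy to show what is being blended. This is a sketch with made-up scores; `rank_min` mimics tie handling where tied scores share the lowest rank, and the weights are hypothetical.

```python
def rank_min(scores):
    """Rank scores from 1 upward; tied scores share the lowest rank."""
    ordered = sorted(scores)
    return [ordered.index(s) + 1 for s in scores]

def blend(rank_lists, weights):
    """Weighted average of several rank lists, row by row."""
    n_rows = len(rank_lists[0])
    return [sum(w * ranks[i] for w, ranks in zip(weights, rank_lists))
            for i in range(n_rows)]

# Hypothetical predicted probabilities from two models for four rows.
xgb_scores = [0.91, 0.15, 0.40, 0.40]
rf_scores = [0.80, 0.30, 0.20, 0.55]

xgb_rank = rank_min(xgb_scores)
rf_rank = rank_min(rf_scores)
final = blend([xgb_rank, rf_rank], weights=[0.75, 0.25])
print(xgb_rank, rf_rank, final)
```

Ranks discard the scale of each model's output, which is why scores from differently calibrated models can be averaged this way.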
Type Safety: Integration with Apache Spark and Scala
● Modular, Reusable, Strongly typed Machine learning workflow on top of
Apache Spark.
● Type safety in Apache Spark with the Dataset API.
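Structured Data in Apache Spark
Structured APIs in Spark: DataFrames and Datasets.
Unification of APIs in Apache Spark 2.0
● Before 2.0: DataFrame (untyped API) and Dataset (typed API) were separate.
● Since 2.0 (2016): a single Dataset API. DataFrame = Dataset[Row] is a type alias, alongside the typed Dataset[T].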
Why Dataset ?
● Strong typing.
● Ability to use powerful lambda functions.
● Spark SQL’s optimized execution engine (Catalyst, Tungsten).
● Can be constructed from JVM objects and manipulated using functional
transformations (map, filter, flatMap, etc.).
● A DataFrame is a Dataset organized into named columns.
● DataFrame is simply a type alias of Dataset[Row].
DataFrame API Code
import org.apache.spark.sql.functions.sum
import spark.implicits._  // enables toDF and the $"col" syntax

// Convert RDD -> DataFrame with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
// filter, groupBy, and sum via agg()
parsedDF.filter($"project" === "finance").
  groupBy($"sprint").
  agg(sum($"numStories").as("count")).
  limit(100).
  show(100)

Sample rows of parsedDF:
project  sprint  numStories
finance  3       20
finance  4       22
DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
  """SELECT sprint, sum(numStories) AS count
    |FROM audits
    |WHERE project = 'finance'
    |GROUP BY sprint
    |LIMIT 100""".stripMargin)
results.show(100)

Sample rows of the audits view:
project  sprint  numStories
finance  3       20
finance  4       22
Why Structured APIs?
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
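The RDD version spells out how the average is computed, where the DataFrame and SQL versions only state what is wanted. A plain-Python rendering of the same three steps, on made-up rows, makes that bookkeeping explicit. This is only an illustration of the logic, not Spark code.

```python
from collections import defaultdict

# Hypothetical (dept, age) rows.
rows = [("eng", 30), ("eng", 40), ("ops", 50)]

# map + reduceByKey: keep a running (sum of ages, count) pair per department.
totals = defaultdict(lambda: (0, 0))
for dept, age in rows:
    age_sum, count = totals[dept]
    totals[dept] = (age_sum + age, count + 1)

# final map: divide each sum by its count to get the average.
avg_age = {dept: age_sum / count for dept, (age_sum, count) in totals.items()}
print(avg_age)
```

Because the structured APIs only declare the result, Spark is free to choose and optimize the execution plan.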
Catalyst in Spark
SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
Dataset API in Spark 2.x
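import org.apache.spark.sql.Dataset
import spark.implicits._  // provides the encoder needed by .as[Employee]

val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects. Integers read from JSON are inferred as Long.
case class Employee(name: String, age: Long)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain objects with compiled lambda functions.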
Structured APIs in Apache Spark
                 SQL      DataFrames    Datasets
Syntax Errors    Runtime  Compile Time  Compile Time
Analysis Errors  Runtime  Runtime       Compile Time
With Datasets, analysis errors are caught before a job runs on the cluster.
Spark SQL API - Analysis Error example.
TransmogrifAI - Type Safety is Everywhere!
● Value operations
● Feature operations
● Transformation Pipelines (aka Workflows)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
// Typed feature operations
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)
// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
Example Code
Ref. https://github.com/fosscoder/transmogrifai-demo
A Case Story - Functional Flow - Spark as a SaaS
● User Interface: build the workflow
  - Source
  - Target
  - Transformations: filter, aggregation, joins, expressions, machine learning algorithms
● Store the workflow metadata in a document-based NoSQL store (e.g. MongoDB, via ReactiveMongo).
● A Scala / Spark job reads the metadata from the NoSQL store (e.g. MongoDB).
● Run on the cluster.
● Schedule using the Airflow SparkSubmit Operator.
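The idea of a job driven by stored workflow metadata can be sketched as follows. Everything here is hypothetical: the document layout, the step types, and the function `run_workflow` are invented for illustration and do not reflect the actual schema used in this case story. The rows are held in memory rather than read through Spark.

```python
import json

# Hypothetical workflow document, as it might be stored in a NoSQL collection.
workflow_json = """
{
  "source": "audits",
  "transformations": [
    {"type": "filter", "column": "project", "equals": "finance"},
    {"type": "aggregation", "group_by": "sprint", "sum": "numStories"}
  ]
}
"""

def run_workflow(workflow, rows):
    """Apply each stored transformation step to the rows, in order."""
    for step in workflow["transformations"]:
        if step["type"] == "filter":
            rows = [r for r in rows if r[step["column"]] == step["equals"]]
        elif step["type"] == "aggregation":
            totals = {}
            for r in rows:
                key = r[step["group_by"]]
                totals[key] = totals.get(key, 0) + r[step["sum"]]
            rows = [{step["group_by"]: k, step["sum"]: v} for k, v in totals.items()]
        else:
            raise ValueError(f"unknown step type: {step['type']}")
    return rows

rows = [
    {"project": "finance", "sprint": 3, "numStories": 20},
    {"project": "finance", "sprint": 4, "numStories": 22},
    {"project": "hr", "sprint": 3, "numStories": 5},
]
result = run_workflow(json.loads(workflow_json), rows)
print(result)
```

Keeping the workflow as data means that the same generic job can run any workflow built in the user interface.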
A Case Story - High Level Technical Architecture - Spark as a SaaS
● User Interface
● Middleware
● Akka HTTP web services
Apache Livy Configuration
Apache Livy Integration
Questions ?
Thank you!
Big Thanks to Scala.IO Organizers and Scala France Community!
@khatri_chetan
chetan.khatri@live.com
References
[1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows
on Spark from Salesforce Engineering
[online] https://transmogrif.ai
[2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator
[online] https://github.com/rssanders3/airflow-spark-operator-plugin
[3] Apache Spark - Unified Analytics Engine for Big Data
[online] https://spark.apache.org/
[4] Apache Livy
[online] https://livy.incubator.apache.org/
[5] Zayd's Blog - Why is machine learning 'hard'?
[online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
[6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
[online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
[7] Auto-Machine Learning: The Magic Behind Einstein
[online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.

  • 1. TransmogrifAI Automate Machine Learning Workflow with the power of Scala and Spark at massive scale. @khatri_chetanBy: Chetan Khatri Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France
  • 2. About me Lead - Data Science @ Accion labs India Pvt. Ltd. Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
  • 3. Agenda ● What is TransmogrifAI ? ● Why you need TransmogrifAI ? ● Automation of Machine learning life Cycle - from development to deployment. ○ Feature Inference ○ Transformation ○ Automated Feature validation ○ Automated Model Selection ○ Hyperparameter Optimization ● Type Safety in Spark, TransmogrifAI. ● Example: Code - Titanic kaggle problem.
  • 4. What is TransmogrifAI ? ● TransmogrifAI is open sourced by Salesforce.com, Inc. in June, 2018 ● An end to end automated machine learning workflow library for structured data build on top of Scala and SparkML. Build with
  • 5. What is TransmogrifAI ? ● TransmogrifAI helps extensively to automate Machine learning model life cycle such as Feature Selection, Transformation, Automated Feature validation, Automated Model Selection, Hyperparameter Optimization. ● It enforces compile-time type-safety, modularity, and reuse. ● Through automation, It achieves accuracies close to hand-tuned models with almost 100x reduction in time.
  • 6. Why you need TransmogrifAI ? AUTOMATION Numerous Transformers and Estimators. MODULARITY AND REUSE Enforces a strict separation between ML workflow definitions and data manipulation. COMPILE TIME TYPE SAFETY Workflow built are Strongly typed, code completion during development and fewer runtime errors. TRANSPARENCY Model insights leverage stored feature metadata and lineage to help debug models. Features
  • 7. Why you need TransmogrifAI? Use TransmogrifAI if you need a machine learning library to: ● Build production-ready machine learning applications in hours, not months ● Build machine learning models without getting a Ph.D. in machine learning ● Build modular, reusable, strongly typed machine learning workflows. For more, read the documentation: https://transmogrif.ai/
  • 8. Why is Machine Learning hard?! Really! ... For example, this may be using a linear classifier when your true decision boundaries are non-linear. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  • 9. Why is Machine Learning hard?! Really! ... fast and effective debugging is the skill that is most required for implementing modern day machine learning pipelines. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  • 10. Real-time Machine Learning takes time to productionize. TransmogrifAI automates the entire ML life cycle to accelerate developer productivity.
  • 11. Under the Hood Automated Feature Engineering Automated Feature Selection Automated Model Selection
  • 12. Automated Feature Engineering: automatic derivation of new features based on existing features. Raw features → derived features: ● Email → Email is Spam ● Phone → Country Code ● Age → [0-20], [21-30], [> 30] ● Subject → Stop words, Top terms (TF-IDF), Detect Language ● Zip Code → Average Income, House Price, School Quality, Shopping, Transportation ● DOB → Age, Day of Week, Week of Year, Quarter, Month, Year, Hour ● Gender → To Binary. All derived features are combined into the Feature Vector.
  • 13. Automated Feature Engineering ● Analyze every feature column and compute descriptive statistics. ○ Number of nulls, number of empty strings, mean, max, min, standard deviation. ● Handle missing / noisy values. ○ Ex. fillna with the mean (average) or nearby values.
        import numpy
        # Imputer is the pre-0.22 scikit-learn API; sklearn.impute.SimpleImputer replaces it
        from sklearn.preprocessing import Imputer

        patient_details = patient_details.fillna(-1)
        data['City_Type'] = data['City_Type'].fillna('Z')
        imp = Imputer(missing_values='NaN', strategy='median', axis=0, copy=False)
        data_total_imputed = imp.fit_transform(data_total)
        # mark zero values as missing or NaN
        dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
        # fill missing values with mean column values
        dataset.fillna(dataset.mean(), inplace=True)
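The per-column statistics and mean imputation described on this slide can be sketched without any library. This is a minimal illustration, not TransmogrifAI's implementation; the helper names `describe` and `impute_mean` and the sample `ages` column are hypothetical.

```python
from statistics import mean, pstdev


def describe(column):
    """Summarise one numeric column; None marks a missing value."""
    present = [v for v in column if v is not None]
    return {
        "nulls": len(column) - len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
        "std": pstdev(present),  # population standard deviation
    }


def impute_mean(column):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


# Hypothetical sample column with two missing entries.
ages = [22.0, None, 38.0, 26.0, None, 34.0]
stats = describe(ages)
filled = impute_mean(ages)
```

An automated pipeline would run the same kind of summary over every column and pick the fill strategy from the column's type.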
  • 14. Automated Feature Engineering ● Do the features have acceptable ranges and contain valid values? ● Could the feature be a leaker? ○ Is it usually filled out after the predicted field is? ○ Is it highly correlated with the predicted field? ● Does the feature contain outliers?
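One of the leakage checks above, "is it highly correlated with the predicted field?", can be sketched as a Pearson correlation test against the label. The threshold of 0.95, the function names, and the toy columns are assumptions made for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(vx * vy)


def flag_leakers(features, label, threshold=0.95):
    """Return names of features whose |correlation| with the label is suspiciously high."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, label)) >= threshold
    ]


# Hypothetical data: 'refund_issued' is only filled in after the outcome is known.
label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [22, 35, 58, 41, 29, 47],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
suspects = flag_leakers(features, label)
```

A flagged feature is only a candidate; whether it is a true leaker still depends on when the field gets populated.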
  • 15. Automated Feature Selection / Data Pre-processing ● Data type of features, automatic data pre-processing. ○ MinMaxScaler ○ Normalizer ○ Binarizer ○ Label Encoding ○ One Hot Encoding ● Auto data pre-processing based on the chosen ML model. ● An algorithm like XGBoost specifically requires dummy-encoded data, while an algorithm like a decision tree doesn't seem to care at all (sometimes)!
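Two of the listed pre-processing steps, min-max scaling and one-hot encoding, are small enough to sketch directly. The functions below are illustrative stand-ins, not the scikit-learn or TransmogrifAI versions, and the `fares` and `ports` columns are made up.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categorical values as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical numeric and categorical columns.
fares = [10.0, 20.0, 30.0]
ports = ["S", "C", "S", "Q"]

scaled = min_max_scale(fares)
categories, encoded = one_hot(ports)
```

Choosing between these by column type (numeric versus categorical) is the part an AutoML library automates.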
  • 16. Auto Data Pre-processing ● Numeric - Imputation, Track Null Value, Log Transformation for large range, Scaling, Smart Binning. ● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category Embedding. ● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection. ● Temporal - Time Difference, Circular Statistics, Time Extraction (Day, Week, Month, Year).
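The "circular statistics" item for temporal features is worth a small illustration: mapping a periodic value such as hour of day onto the unit circle keeps 23:00 and 00:00 close together, which a raw integer does not. This is a generic sketch of that encoding with hypothetical function names; how TransmogrifAI implements it is not shown in the slides.

```python
from math import sin, cos, pi, hypot


def circular_encode(value, period):
    """Map a periodic value (e.g. hour in a 24-hour period) to a point on the unit circle."""
    angle = 2 * pi * value / period
    return sin(angle), cos(angle)


def circular_distance(a, b, period):
    """Euclidean distance between the circular encodings of two values."""
    sa, ca = circular_encode(a, period)
    sb, cb = circular_encode(b, period)
    return hypot(sa - sb, ca - cb)


# 23:00 is one hour from midnight, noon is twelve hours away.
near = circular_distance(23, 0, 24)
far = circular_distance(12, 0, 24)
```

The two resulting columns (sine and cosine) replace the single raw hour column in the feature vector.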
  • 17. Auto Selection of Best Model with Hyperparameter Tuning ● Machine Learning Model ○ Learning Rate ○ Epochs ○ Batch Size ○ Optimizer ○ Activation Function ○ Loss Function ● Search algorithms to find the best model and optimal hyperparameters. ○ Ex. Grid Search, Random Search, Bandit Methods
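Grid search and random search differ only in how candidate settings are generated; both keep the candidate with the lowest validation loss. The sketch below uses a made-up `validation_loss` function as a stand-in for training and scoring a real model, so the numbers are purely illustrative.

```python
import itertools
import random


def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for 'train a model, score it on held-out data'."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2


def grid_search(grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = (
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    )
    return min(candidates, key=lambda params: validation_loss(**params))


def random_search(space, trials, seed=0):
    """Evaluate a fixed number of randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(choices) for name, choices in space.items()}
        for _ in range(trials)
    ]
    return min(candidates, key=lambda params: validation_loss(**params))


grid = {"learning_rate": [0.01, 0.1, 1.0], "batch_size": [32, 64, 128]}
best_grid = grid_search(grid)
best_random = random_search(grid, trials=4)
```

Grid search cost grows with the product of the list lengths, while random search cost is fixed by `trials`, which is why random search is often preferred for larger spaces.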
  • 18. Examples - Hyperparameter tuning, XGBoost:
        import numpy as np
        import xgboost as xgb

        params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 9,
                  'eval_metric': 'logloss', 'eta': 0.02, 'silent': 1, 'nthread': 4,
                  'subsample': 0.9, 'colsample_bytree': 0.9, 'scale_pos_weight': 1,
                  'min_child_weight': 3, 'max_delta_step': 3}
        num_rounds = 400
        params['seed'] = 523264626346  # 0.85533
        # train and labels are the prepared training features and targets
        dtrain = xgb.DMatrix(train, labels, missing=np.nan)
        clf = xgb.train(params, dtrain, num_rounds)
        dtest = xgb.DMatrix(test, missing=np.nan)
        test_preds = clf.predict(dtest)
  • 19. Examples - Hyperparameter tuning:
        from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                                      ExtraTreesClassifier)
        from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV

        # Random forest
        rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry),
                                    criterion=criterion, max_depth=max_depth,
                                    n_jobs=-1, oob_score=True)
        rf.fit(X_train, y_train)

        # Gradient boosting (GradientBoostingClassifier has no n_jobs parameter)
        gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features,
                                         subsample=subsample)
        cv = StratifiedKFold(n_splits=10)
        scores = cross_val_score(gbm, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')

        # Extra trees
        ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion,
                                   max_features=max_features, n_jobs=-1)
        cv = StratifiedKFold(n_splits=10)
        scores = cross_val_score(ext, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')

        # Exhaustive grid search over the random forest
        param_grid = {
            'n_estimators': [500, 1000, 1500, 2000, 2500, 3000],
            'criterion': ['gini', 'entropy'],
            'max_features': [15, 20, 25, 30],
            'max_depth': [4, 5, 6]
        }
        gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1,
                             verbose=2).fit(X_train[subspace], y_train)
        gs_cv.best_params_
  • 20. Ensemble Modeling
        from scipy.stats import rankdata

        # collect each model's predictions for the 'Disbursed' target
        ens['XGB2'] = xgb2_pred['Disbursed']
        ens['RF'] = rf_pred['Disbursed']
        ens['FTRL'] = ftrl_pred['Disbursed']
        # convert predictions to ranks and blend them with fixed weights
        ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
        ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
        ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
        ens['RF_Rank'] = rankdata(ens['RF'], method='min')
        ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
        ens['Final'] = (0.75 * ens['XGB_Rank'] + 0.25 * ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL']
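The rank-averaging idea behind this ensemble can be shown on toy numbers: each model's scores are turned into ranks (ties share the lowest rank, as with `method='min'`), and the ranks are blended with fixed weights. The helper names, weights, and scores below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rank_min(scores):
    """Ascending ranks starting at 1; tied scores share the lowest rank."""
    ordered = sorted(scores)
    return [ordered.index(s) + 1 for s in scores]


def blend(rank_lists, weights):
    """Weighted sum of ranks, computed row by row across models."""
    return [
        sum(w * r for w, r in zip(weights, ranks))
        for ranks in zip(*rank_lists)
    ]


# Hypothetical predicted probabilities from two models for three rows.
model_a = [0.2, 0.9, 0.5]
model_b = [0.3, 0.6, 0.7]

ranks_a = rank_min(model_a)
ranks_b = rank_min(model_b)
final = blend([ranks_a, ranks_b], weights=[0.75, 0.25])
```

Blending ranks rather than raw probabilities makes models with differently calibrated outputs comparable, at the cost of discarding the size of the score gaps.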
  • 21. Type Safety: Integration with Apache Spark and Scala ● Modular, reusable, strongly typed machine learning workflows on top of Apache Spark. ● Type safety in Apache Spark with the Dataset API.
  • 22. Structured Data in Apache Spark. Structured APIs in Spark: DataFrames, Datasets
  • 23. Unification of APIs in Apache Spark 2.0 ● Before: DataFrame (untyped API) and Dataset (typed API) were separate. ● Spark 2.0 (2016): a single Dataset API, with DataFrame = Dataset[Row] (alias) alongside Dataset[T].
  • 24. Why Dataset? ● Strong typing. ● Ability to use powerful lambda functions. ● Spark SQL's optimized execution engine (Catalyst, Tungsten). ● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.). ● A DataFrame is a Dataset organized into named columns. ● DataFrame is simply a type alias of Dataset[Row].
  • 25. DataFrame API Code
        // convert RDD -> DF with column names
        val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
        // filter, groupBy, sum, and then agg()
        parsedDF.filter($"project" === "finance")
          .groupBy($"sprint")
          .agg(sum($"numStories").as("count"))
          .limit(100)
          .show(100)

        project  sprint  numStories
        finance  3       20
        finance  4       22
  • 26. DataFrame -> SQL View -> SQL Query
        parsedDF.createOrReplaceTempView("audits")
        val results = spark.sql(
          """SELECT sprint, sum(numStories) AS count
             FROM audits
             WHERE project = 'finance'
             GROUP BY sprint
             LIMIT 100""")
        results.show(100)

        project  sprint  numStories
        finance  3       20
        finance  4       22
  • 27. Why Structured APIs?
        // DataFrame
        data.groupBy("dept").avg("age")
        // SQL
        select dept, avg(age) from data group by 1
        // RDD
        data.map { case (dept, age) => dept -> (age, 1) }
          .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
          .map { case (dept, (age, c)) => dept -> age / c }
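The RDD version on this slide has to carry a (sum, count) pair per key because an average cannot be merged directly, while a sum and a count can. The plain-Python sketch below mirrors that map / reduceByKey / map pattern on an in-memory list; it is an illustration of the logic, not Spark code, and `average_by_key` is a hypothetical name.

```python
from collections import defaultdict


def average_by_key(pairs):
    """Average the values per key by accumulating (sum, count), then dividing."""
    totals = defaultdict(lambda: (0, 0))
    for key, value in pairs:
        running_sum, count = totals[key]
        totals[key] = (running_sum + value, count + 1)  # the reduceByKey step
    return {key: s / c for key, (s, c) in totals.items()}  # the final map step


# Hypothetical (dept, age) records.
records = [("eng", 30), ("eng", 40), ("ops", 50)]
averages = average_by_key(records)
```

The DataFrame and SQL versions express the same intent in one line and leave the choice of execution strategy to the engine.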
  • 28. Catalyst in Spark: SQL AST / DataFrame / Datasets → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan → RDD
  • 29. Dataset API in Spark 2.x
        import org.apache.spark.sql.Dataset
        import spark.implicits._

        val employeesDF = spark.read.json("employees.json")
        // Convert data to domain objects.
        // JSON integers are inferred as Long, so age is declared as Long.
        case class Employee(name: String, age: Long)
        val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
        val filterDS = employeesDS.filter(p => p.age > 3)
    Type-safe: operate on domain objects with compiled lambda functions.
  • 30. Structured APIs in Apache Spark
                           SQL       DataFrames     Datasets
        Syntax errors      Runtime   Compile time   Compile time
        Analysis errors    Runtime   Runtime        Compile time
    Analysis errors are caught before a job runs on the cluster.
  • 31. Spark SQL API - Analysis Error example.
  • 32. Spark SQL API - Analysis Error example.
  • 33. TransmogrifAI - Type Safety is Everywhere! ● Value operations ● Feature operations ● Transformation pipelines (aka Workflows)
        // Typed value operations
        def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
        // Typed feature operations
        val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
        val tokens: Feature[TextList] = title.map(tokenize)
        // Transformation pipelines
        new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
  • 35. A Case Story - Functional Flow - Spark as a SaaS: User Interface: build workflow (Source, Target, Transformations: filter, aggregation, Joins, Expressions, Machine Learning Algorithms) → Store metadata of the workflow in a document-based NoSQL store, e.g. MongoDB → ReactiveMongo → Scala / Spark job reads metadata from NoSQL, e.g. MongoDB → Run on the cluster → Schedule using the Airflow SparkSubmit Operator
  • 36. A Case Story - High Level Technical Architecture - Spark as a SaaS: User Interface → Middleware: Akka HTTP Web Services
  • 43. Questions ? Thank you! Big Thanks to Scala.IO Organizers and Scala France Community! @khatri_chetan chetan.khatri@live.com
  • 44. References
        [1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark from Salesforce Engineering [online] https://transmogrif.ai
        [2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator [online] https://github.com/rssanders3/airflow-spark-operator-plugin
        [3] Apache Spark - Unified Analytics Engine for Big Data [online] https://spark.apache.org/
        [4] Apache Livy [online] https://livy.incubator.apache.org/
        [5] Zayd's Blog - Why is machine learning 'hard'? [online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
        [6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions [online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
        [7] Auto-Machine Learning: The Magic Behind Einstein [online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s