Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak

Distributed Machine and
Deep Learning at Scale with
Apache Ignite
Yuriy Babak

Yuriy Babak
Head of ML/DL framework
development at GridGain
Apache Ignite committer

2019 © GridGain Systems
Agenda
● Introduction of distributed ML/DL
● Small math background
● Apache Ignite ML overview
● External integrations

Apache Ignite

Distributed ML
with Apache Ignite

Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency

Common approach
Local calculations Result aggregationModel
Model update

Ignite partition-based distributed dataset
Partition Data Dataset Context Dataset Data
Upstream Cache Context Cache On-Heap
Learning Env
On-Heap
Durable Stateless Durable Recoverable
double[][] x = …
double[] y = ...
double[][] x = …
double[] y = ...
Partition Based Dataset StructuresSource Data

Let`s do the math!

Example: Gradient descent
• Model: Linear regression
f(x)
x

Example: Gradient Descent
• Loss function:

• Loss function: MSE
• Gradient:

• Loss function:
• Gradient:

Example: Random Forest
• Model: Random forest

• Loss function: min Gini coefficient

• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist (what shall we do?)

• Gradient: doesn’t exist
• What to do: sum histograms
=+

• What to do:
– One can sum histograms
– Summarizing histograms - commutative and associative operation
– So you can count the histograms using map-reduce!

• What to do: approximation using histogram partitioning
• Code: RandomForestTrainer, RandomForestClassifierTrainer

What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering

Ignite ML API
Vectorizer<Integer, BinaryObject, String, Double> vectorizer =
new BinaryObjectVectorizer<Integer>(
"feature1",
“feature2”
).labeled("label");
LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer();
LogisticRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);

• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Фреймворк для ансамблей
• Интеграция с Tensorflow
– как источник данных
– как менеджмент кластера

Preprocessing and Pipelines
Vectorizer<Integer, Vector, Integer, Double> vectorizer
= new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);
PipelineMdl<Integer, Vector> mdl =
new Pipeline<Integer, Vector, Integer, Double>()
.addVectorizer(vectorizer)
.addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
.withEncoderType(EncoderType.STRING_ENCODER)
.withEncodedFeature(1)
.withEncodedFeature(6))
.addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
.addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
.addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
.withP(1))
.addTrainer(new DecisionTreeClassificationTrainer(5, 0))
.fit(ignite, dataCache);

• Classical ML
– classification
– regression
– clustering
• Preprocessing
– string encoding
• Pipelines
• Ensemble Framework

Bagging, Boosting and Stacking
DatasetTrainer<LogisticRegressionModel, Double> trainer =
new LogisticRegressionSGDTrainer(...)...;
BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(trainer,
// ensemble size, subsample ration, feature vector size, features subspace dim
7, 0.7, 2, 2,
new onMajorityPredictionsAggregator());

• Classical ML
– classification
– regression
– clustering
• Preprocessing
– string encoding
• Pipelines
• Continuous learning

Online learning
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2,
vectorizer);

• Classical ML
– classification
– regression
– clustering
• Preprocessing
– string encoding
• Pipelines
• Continuous learning
• SQL in-place inference

IMDB with built-in ML
IgniteModelStorageUtil.saveModel(ignite, model, “titanik_model_tree”);
QueryCursor<List<?>> cursor = cache.query(new SqlFieldsQuery("select " +
"survived as truth, " +
"predict('titanik_model_tree', pclass, age, sibsp, parch, fare, case
sex when 'male' then 1 else 0 end) as prediction " +
"from titanik_train"))

Model import

Inference in Ignite ML
Request queue
Response queue
Inference
Service
Inference
Service
Inference
Gateway

Import / Inference API
AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 4, 4);
XGModelParser parser = new XGModelParser();
try (Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);
...
double prediction = mdl.predict(testObj).get();
}

TensorFlow on Apache Ignite
• Ignite Dataset
• IGFS Plugin
• Distributed Training
• More info here
#UnifiedAnalytics #SparkAISummit
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
>>> for _ in range(3):
>>> print(sess.run(next_obj))
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Apache Ignite with GridGain ML Python API
GridGain ML client library provides user applications the ability to work
with GridGain ML functionality using Py4J as an integration mechanism.
If you want to use ggml in your project, you may install it from PyPI:
$ pip install ggml

GridGain ML client library provides user applications the ability to work
with GridGain ML functionality using Py4J as an integration mechanism.
If you want to use ggml in your project, you may install it from PyPI:
$ pip install ggml
NB: available only for Apache Ignite master and for GG 8.8

from ggml.regression import LinearRegressionTrainer
trainer = LinearRegressionTrainer()
model = trainer.fit(x_train, y_train)
r2_score(y_test, model.predict(x_test))

Apache Ignite ML Tutorial
https://github.com/apache/ignite/
org.apache.ignite.examples.ml.tutorial

Distributed Machine and Deep Learning
at Scale with Apache Ignite
Links:
• http://ignite.apache.org/
• https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb
• https://github.com/gridgain/ml-python-api
• http://bit.ly/DataNativesOslo
Email:
● user@ignite.apache.org
● dev@ignite.apache.org
● ybabak@gridgain.com

Preprocessing and Pipelines

Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak

Similar a Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak (20)

Más de Dataconomy Media

Más de Dataconomy Media (20)

Último

Último (20)

Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak