Apache Hivemall @ Apache BigData North America '17, Miami (2017/5/16)
18.
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
SELECT
*,
rank() over (partition by class order by score desc)
as rank
FROM table
) t
WHERE rank <= 2
The RANK over() query does not finish within 24 hours ☹
for 20 million MOOC classes with an average of 1,000 students per class
19.
student class score
1 b 70
2 a 80
3 a 90
4 b 50
5 a 70
6 b 60
Top-k query processing
List top-2 students for each class
SELECT
each_top_k(
2, class, score,
class, student
) as (rank, score, class, student)
FROM (
SELECT * FROM table
DISTRIBUTE BY class SORT BY class
) t
EACH_TOP_K finishes in 2 hours ☺
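For the sample table above, the query's expected output lists the top-2 students per class:
rank score class student
1 90 a 3
2 80 a 2
1 70 b 1
2 60 b 6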
23. List of Supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight
Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice.
Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of a positive class.
Factorization Machines are a good fit for sparse, categorical features.
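As a first-try sketch following this advice (hedged: it assumes Hivemall's train_scw UDTF emits (feature, weight) pairs like the train_logregr example later in this deck, and a training table registered as TrainTable):
scala> :paste
// Hypothetical sketch: train_scw and the TrainTable view are assumptions,
// following the train_logregr pattern shown later in this deck
val scwModelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_scw(features, label) AS (feature, weight)
  |   FROM TrainTable
  | )
  | GROUP BY feature
""".stripMargin)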
42. ✓ XGBoost Integration
✓ Sparse vector support in RandomForests
✓ Docker support (for evaluation)
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
• Optimizer framework including ADAM
• L1/L2 regularization
Other new features in development
44. Copyright © 2016 NTT corp. All Rights Reserved.
Who am I
• Takeshi Yamamuro
• NTT corp. in Japan
• OSS activities
• Apache Hivemall PPMC
• Apache Spark contributor
46.
What's Spark?
• 1. Unified Engine
• supports end-to-end applications, e.g., MLlib and Streaming
• 2. High-level APIs
• easy to use, with rich optimizations
• 3. Integrates broadly
• with storage systems, libraries, ...
47.
• Hivemall wrapper for Spark
• Wrapper implementations for DataFrame/SQL
• + some utilities that make it easy to use in Spark
• The wrapper lets you...
• run most Hivemall functions in Spark
• try the examples easily on your laptop
• speed up some functions in Spark
What's Hivemall on Spark?
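A minimal setup sketch (hedged: the exact import path below is an assumption; the deck itself only references the HivemallOps object later on):
scala> import org.apache.spark.sql.hive.HivemallOps._  // hypothetical package path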
48.
• Hivemall already has many fascinating ML
algorithms and useful utilities
• + there are high barriers to adding newer algorithms to MLlib
Why Hivemall on Spark?
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
49.
• Most Hivemall functions are supported on Spark
v2.0 and v2.1
Current Status
- To compile for Spark v2.1
$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.1 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar
...
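One way to pick the built jar up in a plain spark-shell session (a sketch using the standard --jars flag; the deck itself uses the bundled <HIVEMALL_HOME>/bin/spark-shell script later):
$ spark-shell --jars target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar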
50.
• Most Hivemall functions are supported on Spark
v2.0 and v2.1
Current Status
- To compile for Spark v2.0
$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
51.
• 1. Fetch training and test data
• 2. Load the data in Spark
• 3. Build a model
• 4. Do predictions
4 Step Example
52.
• E2006 tfidf regression dataset
• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
1. Fetch training and test data
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
53.
2. Load data in Spark
// Download Spark-v2.1 and launch a spark-shell with Hivemall
$ <HIVEMALL_HOME>/bin/spark-shell
// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")
scala> trainDf.printSchema
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
54.
2. Load data in Spark (Detailed)
0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…
[Diagram: trainDf is split into Partition1, Partition2, Partition3, ..., PartitionN]
Loaded in parallel because bzip2 is splittable
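To confirm the parallel load, the partition count can be checked with a standard Spark API:
scala> trainDf.rdd.getNumPartitions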
55.
3. Build a model - DataFrame
scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
56.
3. Build a model - SQL
scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
| SELECT feature, AVG(weight) AS weight
| FROM (
| SELECT train_logregr(features, label)
| AS (feature, weight)
| FROM TrainTable
| )
| GROUP BY feature
""".stripMargin)
57.
4. Do predictions - DataFrame
scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache
// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sum($"weight" * $"value").as("margin"))
  .select($"rowid", sigmoid($"margin").as("predicted"))
58.
4. Do predictions - SQL
scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT t.rowid, sigmoid(SUM(t.value * m.weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  | ON t.feature = m.feature
  | GROUP BY t.rowid
""".stripMargin)
59.
• Top-K Join
• Join two relations and compute top-K for each group
Add New Features in Spark
scala> :paste
val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
  .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
  .withColumn(
    "rank",
    rank().over(Window.partitionBy($"group").orderBy($"score".desc))
  )
  .where($"rank" <= topK)
Use Spark vanilla APIs
60.
• Top-K Join
• Join two relations and compute top-K for each group
Add New Features in Spark
scala> :paste
val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
  .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
  .withColumn(
    "rank", rank().over(Window.partitionBy($"group").orderBy($"score".desc))
  )
  .where($"rank" <= topK)
1. Join leftDf with rightDf
2. Sort data in each group
3. Filter top-K data
Use Spark vanilla APIs
61.
• Top-K Join
• Join two relations and compute top-K for each group
Add New Features in Spark
scala> :paste
val topkDf = leftDf.top_k_join(
  lit(topK), rightDf, leftDf("group") === rightDf("group"),
  (leftDf("x") + rightDf("y")).as("score")
)
Use a fused API implemented in Hivemall
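The fused operator sorts and filters inside the join itself, so the full join result is never materialized; the following slides show how.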
62.
• How does top_k_join work?
Add New Features in Spark
[Diagram: leftDf (group, x) and rightDf (group, y) as the two input relations]
63.
• How does top_k_join work?
Add New Features in Spark
[Diagram: rows from leftDf (group, x) and rightDf (group, y) feed a K-length priority queue per group]
Compute the top-K rows by using a priority queue
64.
• How does top_k_join work?
Add New Features in Spark
[Diagram: a K-length priority queue per group; only the kept rows are joined]
Compute the top-K rows by using a priority queue
Only join the top-K rows of leftDf and rightDf
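The same idea can be sketched in plain Scala (illustrative only: Row2 and topKPerGroup are hypothetical names, and the real operator runs inside Spark's generated code):
import scala.collection.mutable
// Hypothetical helper; shows only the K-length priority-queue idea
case class Row2(group: String, score: Double)
def topKPerGroup(rows: Iterator[Row2], k: Int): Iterator[Row2] = {
  // One bounded queue per group; the queue head holds the smallest kept
  // score, so rows that cannot enter the current top-K are dropped early
  val minFirst = Ordering.by[Row2, Double](_.score).reverse
  val queues = mutable.Map.empty[String, mutable.PriorityQueue[Row2]]
  rows.foreach { r =>
    val q = queues.getOrElseUpdate(r.group, mutable.PriorityQueue.empty[Row2](minFirst))
    if (q.size < k) q.enqueue(r)
    else if (q.head.score < r.score) { q.dequeue(); q.enqueue(r) } // evict the smallest
  }
  // Emit each group's kept rows in descending score order
  queues.valuesIterator.flatMap(_.dequeueAll.reverseIterator)
}
// e.g., topKPerGroup(joinedRows.iterator, k = 2)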
65.
• Codegen'd top_k_join for fast processing
• Spark internally generates Java code from a built
physical plan, and compiles/executes it
Add New Features in Spark
[Figure: Spark Planner (Catalyst) overview]
66.
• Codegen'd top_k_join for fast processing
• Spark internally generates Java code from a built
physical plan, and compiles/executes it
Add New Features in Spark
[Figure: codegen turning a physical plan into Java code]
67.
• Codegen'd top_k_join for fast processing
• Spark internally generates Java code from a built
physical plan, and compiles/executes it
Add New Features in Spark
scala> topkDf.explain
== Physical Plan ==
*ShuffledHashJoinTopK 1, [group#10], [group#27]
:- Exchange hashpartitioning(group#10, 200)
: +- LocalTableScan [group#10, x#11]
+- Exchange hashpartitioning(group#27, 200)
+- LocalTableScan [group#27, y#28]
'*' at the head of a plan node marks a codegen'd plan
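To inspect the generated Java source itself, Spark 2.x ships a debug helper (standard Spark API; shown as a quick sketch):
scala> import org.apache.spark.sql.execution.debug._
scala> topkDf.debugCodegen()  // dumps the generated code per codegen'd subtree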
68.
• Benchmark Result
Add New Features in Spark
~13 times faster than vanilla APIs!!
69.
• Support more useful functions
• Spark itself only implements simple, basic functions,
for usability and maintainability reasons
• e.g.,
• flatten - flattens a nested schema into a flat one
• from_csv/to_csv - convert between CSV strings and structured data with schemas
• ...
• See more in the Hivemall user guide
• https://hivemall.incubator.apache.org/userguide/spark/misc/misc.html
Add New Features in Spark
70.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
71.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
72.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
73.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
Improve Some Functions in Spark
scala> :paste
df.withColumn(
  "rank",
  rank().over(Window.partitionBy($"key").orderBy($"score".desc))
)
.where($"rank" <= topK)
74.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
• Fixed the overhead issue for each_top_k
• See pr#353: "Implement EachTopK as a generator expression" in Spark
Improve Some Functions in Spark
scala> df.each_top_k(topK, "key", "score", "value")
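Implemented as a native generator expression, each_top_k runs inside Spark's own expression evaluation instead of going through the Hive UDTF bridge, which removes the per-call overhead described above.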
75.
• Spark has overheads to call Hive UD*Fs
• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
Improve Some Functions in Spark
~4 times faster than rank!!
76.
• XGBoost support is under development
• a fast implementation of gradient tree boosting
• widely used in Kaggle competitions
• This integration will let you...
• load built models and predict in parallel
• build multiple models in parallel
for cross-validation
Support 3rd Party Library in Spark