This document discusses Hivemall, a machine learning library for Apache Hive and Spark. It was developed by Makoto Yui as a personal research project to make machine learning easier for SQL developers. Hivemall implements various machine learning algorithms like logistic regression, random forests, and factorization machines as user-defined functions (UDFs) for Hive, allowing machine learning tasks to be performed using SQL queries. It aims to simplify machine learning by abstracting it through the SQL interface and enabling parallel and interactive execution on Hadoop.
2. A little about me
• 2015.04~ Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service at a Hadoop-as-a-service company
• 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology (AIST, 産業技術総合研究所), Japan
  • Developed Hivemall as a personal research project
• 2009.03 Ph.D. in Computer Science from NAIST
  • Majored in parallel data processing, not ML, at the time
• Visiting scholar at CWI, Amsterdam and Univ. Edinburgh
2016/09/09 HadoopCon 16, Taipei
3. Treasure Data
• Hiro Yoshikawa, CEO: open source business veteran
• Kaz Ota, CTO: founder of the world’s largest Hadoop group
• Sada Furuhashi, Chief Architect: invented Fluentd and MessagePack
• Today: 100+ employees, 30M+ funding
• 2015: New office in Seoul, Korea
• 2013: New office in Tokyo, Japan
• 2012: Founded in Mountain View, CA
Investors
• Jerry Yang, Yahoo! founder
• Bill Tai, angel investor
• Yukihiro Matsumoto, Ruby inventor
• Sierra Ventures (Tim Guleri): enterprise software
• Scale Ventures (Andy Vitus): B2B SaaS
5. Our technology users
Point: Microsoft Operations Management Suite and Google Cloud Platform (Kubernetes) use Fluentd for log collection.
15. List of supported algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice. Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines work well where features are sparse and categorical.
19. Industry use cases of Hivemall
• CTR prediction of ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo
Problem (minne.com): Recommending hot items is hard in a handcrafted-product market because each creator sells only a few one-off items, which soon go out of stock.
20. Industry use cases of Hivemall (continued)
• Value prediction of real estate
  • Algorithm: Regression
  • Livesense
21. Industry use cases of Hivemall (continued)
• User score calculation
  • Algorithm: Regression
  • Klout (klout.com), influencer marketing
  • bit.ly/klout-hivemall
36. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
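To make the table layout concrete, here is a minimal Python sketch of how one line of the underlying text file maps onto the three columns: fields are split on tabs and the features array is split on commas, as the table definition states. The helper names parse_row and split_feature are hypothetical, and the "name:value" feature encoding (with a missing value read as 1.0) is an assumption about the data, not something the slide specifies.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features) like the Hive table."""
    # Fields are tab-separated; the features array is comma-separated.
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")


def split_feature(feature):
    """Split a 'name:value' feature string (assumed format) into its parts."""
    name, _, value = feature.partition(":")
    # Assumption: a feature written without a value counts as 1.0.
    return name, float(value) if value else 1.0


# Hypothetical sample line, not taken from the E2006-tfidf dataset.
row = parse_row("1\t-3.5\t10:0.25,42:0.5\n")
pairs = [split_feature(f) for f in row[2]]
```

Keeping features as an array of strings means the same table shape can hold both numeric indices and named categorical features.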
40. How to use Hivemall - Training
CREATE TABLE lr_model AS
SELECT
feature,
avg(weight) as weight
FROM (
SELECT logress(features,label,..)
as (feature,weight)
FROM train
) t
GROUP BY feature
Training by logistic regression
• A map-only task learns a prediction model
• Map outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
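The averaging step of this query can be sketched outside Hive. The Python below only illustrates what GROUP BY feature with avg(weight) computes over the (feature, weight) pairs emitted by independent learners; it does not implement logress itself. The function name average_weights and the sample numbers are hypothetical.

```python
from collections import defaultdict


def average_weights(mapper_outputs):
    """Average the weights emitted for each feature across all learners."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rows in mapper_outputs:          # one list of rows per map task
        for feature, weight in rows:     # each row is (feature, weight)
            sums[feature] += weight
            counts[feature] += 1
    # Equivalent of: SELECT feature, avg(weight) ... GROUP BY feature
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Two hypothetical map tasks, each trained on its own split of the data.
mapper_outputs = [
    [("f1", 0.5), ("f2", -1.0)],
    [("f1", 1.0), ("f3", 0.25)],
]
model = average_weights(mapper_outputs)
```

A feature seen by only one learner keeps that learner's weight, since the average is taken over the rows that exist for it.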
41. How to use Hivemall - Training
CREATE TABLE news20b_cw_model1 AS
SELECT
feature,
voted_avg(weight) as weight
FROM
(SELECT
train_cw(features,label)
as (feature,weight)
FROM
news20b_train
) t
GROUP BY feature
Training of the Confidence Weighted (CW) classifier
• voted_avg votes on whether to use the negative or the positive weights for the average
• Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
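The voting idea can be illustrated with a short Python sketch. It reflects one reading of the slide: count the positive and negative weights for a feature, then average only the side that wins the vote. Tie-breaking and the handling of zero weights are assumptions made here and may differ from Hivemall's actual voted_avg.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: ties go to the negative side; zero weights are ignored.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# Weights from the slide: four positive votes against one negative vote,
# so only the positive weights are averaged (about 0.475).
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

Compared with a plain average (0.36 for these numbers), the voted average keeps a minority of opposite-sign weights from pulling the result toward zero.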
43. How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
t.rowid,
sigmoid(sum(m.weight)) as prob
FROM
testing_exploded t LEFT OUTER JOIN
lr_model m ON (t.feature = m.feature)
GROUP BY
t.rowid
Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model.
There is no need to load the entire model into memory.
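The logic of the prediction query can be mirrored in a few lines of Python. This is a sketch of the computation only: the dictionary stands in for the lr_model table, whereas the Hive join streams both tables and never holds the model in memory. As in the query, each matched feature contributes its weight once. The sample model and rows are hypothetical, and a row with no matching features scores 0.5 here, a simplification of SQL NULL handling.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """Sum the model weights of each row's features, then apply sigmoid."""
    totals = {}
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing.
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    # GROUP BY rowid, sigmoid(sum(weight))
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


# Hypothetical model and exploded test rows of (rowid, feature).
model = {"f1": 1.0, "f2": -1.0}
testing_exploded = [(1, "f1"), (1, "f2"), (2, "f1"), (2, "unseen")]
probs = predict(testing_exploded, model)
```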
44. Real-time prediction
• Batch training on Hadoop: machine learning turns feature vectors and labels into a prediction model
• The prediction model is exported to an RDBMS
• Online prediction on the RDBMS: the exported model maps a feature vector to a label
bit.ly/hivemall-rtp
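As a rough illustration of online prediction against an exported model, the sketch below loads a small weight table into an in-memory SQLite database and scores one request with a single lookup query. SQLite stands in for whichever RDBMS is actually used; the table name lr_model follows the earlier training slide, while the sample weights and the predict_online helper are hypothetical.

```python
import math
import sqlite3

# Stand-in for the RDBMS that receives the exported model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_model (feature TEXT PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO lr_model VALUES (?, ?)",
    [("f1", 1.0), ("f2", -1.0), ("f3", 0.5)],  # hypothetical exported weights
)


def predict_online(conn, features):
    """Score one request by summing the weights of its distinct features."""
    if not features:
        return 0.5  # no evidence: sigmoid(0)
    placeholders = ",".join("?" for _ in features)
    query = (
        "SELECT COALESCE(SUM(weight), 0.0) FROM lr_model "
        f"WHERE feature IN ({placeholders})"
    )
    (score,) = conn.execute(query, list(features)).fetchone()
    return 1.0 / (1.0 + math.exp(-score))


prob = predict_online(conn, ["f3", "unseen"])
```

Because feature is the primary key, each request touches only the rows for its own features, which keeps per-request latency low even when the model is large.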
55. Coming new features - already merged in master
✓ ChangeFinder
  • Efficient algorithm for finding change points and outliers in time-series data
J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.