1. Why Neural Net Field Aware Factorization Machines are able to break ground in digital behaviour prediction
Presenter: Gunjan Sharma
Co-Author: Varun Kumar Modi
2. About the Authors
Presenter: Gunjan Sharma
System Architect @ InMobi (3 years)
SE @ Facebook (2.5 years)
DPE @ Google (1 year)
Twitter Handle: @gunjan_1409
LinkedIn: https://www.linkedin.com/in/gunjan-sharma-a6794414/
Co-author: Varun Kumar Modi
Sr Research Scientist @ InMobi (5 years)
LinkedIn: https://www.linkedin.com/in/varun-modi-33800652/
3. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
4. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
5. InMobi is one of the largest advertising platforms at scale globally
InMobi reaches >2 billion MAU across the world - specialised in mobile in-app advertising
[World map: InMobi presence across North America, Latin America, EMEA, Africa, India + SEA, China, Japan, Korea, ANZ and APAC]
● Consolidation has taken place to clean up the ecosystem; few advertising platforms at scale exist
● A very limited number of players have a presence in Asia; InMobi is dominating
● Few players control each component of the chain; no presence of global players, except InMobi
6. Problem statement and why it matters
● What are the problems:
Use case 1 - Conversion ratio (CVR) prediction:
- CVR = install rate of users = probability of an install given a click
- Usage: CPM = CTR * CVR * CPI
Use case 2 - Video completion rate (VCR) prediction:
- Video completion rate of users watching advertising videos, given a click
● Why are they important:
○ Performance business - based on arbitrage, so the model directly determines the margin/profit of the business and the ability of a campaign to achieve significant scale => multi-million dollar businesses!
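As a worked illustration of the pricing formula above (the numbers and the usual ×1000 CPM convention are our assumptions, not from the talk): with CTR = 1%, CVR = 2% and CPI = $2.00, expected revenue per impression is 0.01 × 0.02 × $2.00 = $0.0004, i.e. an eCPM of $0.40.

```python
def ecpm(ctr, cvr, cpi):
    # Expected revenue per 1000 impressions; assumes the usual x1000 CPM convention.
    return ctr * cvr * cpi * 1000

print(ecpm(0.01, 0.02, 2.00))  # -> 0.4 (dollars per thousand impressions)
```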
7. Existing context and challenges
● Models traditionally used: Linear/Logistic Regression and tree-based models
● Both have their strengths and weaknesses when used in production
● What we need is an awesome model that sits somewhere in the middle and can bring in the best of both worlds

LR vs Tree-based:
- LR generalises for unseen combinations; tree-based models could not for our use cases
- LR can potentially underfit at times; tree-based models can potentially overfit at times
- LR requires less RAM; tree-based models can at times bloat RAM usage, especially with high-cardinality features
8. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
9. Why think of NN for CVR/VCR prediction
● Using cross features in LR wasn't cutting it for us.
● Plus at some point it starts to become cumbersome, both at training and prediction time.
● All the major predictions noted here follow a complex, non-linear curve.
● LR left much to be desired compared to tree-based models, for example, because its interaction terms are limited.
● We tried a couple of other promising models that were also not able to beat tree-based models.
We all agreed that Neural Nets are a suitable technology to find higher-order interactions between our features.
At the same time they have the power of generalising to unseen combinations.
10. Challenges Involved
● Traditionally NNs are more utilized for classification problems; we want to model our predictions as a regression problem.
● Most of the features are categorical, which means we need to use one-hot encoding.
● This causes NNs to produce very bad results, as they need a lot of data to train efficiently.
● Plus the cardinality of some features is very high, which makes life more troublesome.
● The model should be easy to productionise, both for training and serving.
● Spark isn't suited for custom NN architectures.
● The model should be as debuggable as possible, to be able to explain changes to the business.
● The resistance to using NNs for a long time came from the lack of understanding of their internals.
11. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
12. Consider the following dummy dataset
Publisher | Advertiser | Gender | CVR
ESPN      | Nike       | Male   | 0.01
CNBC      | Nike       | Male   | 0.0004
ESPN      | Adidas     | Female | 0.008
Sony      | Coke       | Female | 0.0005
Sony      | P&G        | Male   | 0.002
13. Factorization Machine (FM) - What are those
Publisher | Advertiser | Gender | CVR
ESPN      | Nike       | Male   | 0.01

One-hot encoded:
ESPN CNBC SONY | Adi Nike Coke P&G | Male Female
 1    0    0   |  0   1    0   0   |  1    0

X = (X0, X1, X2) = Publisher latent vector (PV)
Y = (Y0, Y1, Y2) = Advertiser latent vector (AV)
Z = (Z0, Z1, Z2) = Gender latent vector (GV)

PVᵀ·AV + AVᵀ·GV + GVᵀ·PV = pCVR
NOTE: All vectors are K-dimensional, where K is a hyperparameter of the algorithm
14. Factorization Machine (FM) - What are those
● A K-dimensional representation for every feature value
● Captures second-order interactions across all the features (AᵀB = |A|·|B|·cos(Θ))
● Essentially a combination of hyperbolas summed up to form the final prediction
● Works better than LR, but tree-based models are still more powerful.
● E.g.: predict a movie's revenue:
Features: Movie, City, Gender
Latent features: Horror, Comedy, Action, Romance
Second-order intuition:
● For every latent feature
● For every pair of original features
● How much does this latent feature affect revenue when considering this pair?
The final predicted revenue is a linear sum over all latent features
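A minimal numpy sketch of the FM second-order term described above (variable names are illustrative, not from the talk): each active feature value owns one K-dimensional latent vector, and the score is the sum of pairwise dot products.

```python
import numpy as np

# One K-dimensional latent vector per active feature value (K = 3 for illustration).
latent = {
    "pub=ESPN":    np.array([0.1, -0.2, 0.4]),   # PV
    "adv=Nike":    np.array([0.3,  0.1, -0.1]),  # AV
    "gender=Male": np.array([-0.2, 0.5, 0.2]),   # GV
}

def fm_second_order(vectors):
    # Sum of dot products over all unordered pairs: PV.AV + AV.GV + GV.PV.
    vs = list(vectors.values())
    return sum(vs[i] @ vs[j] for i in range(len(vs)) for j in range(i + 1, len(vs)))

score = fm_second_order(latent)  # raw score; a link function maps it to a rate
```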
15. Field aware Factorization Machine (FFM)
Publisher | Advertiser | Gender | CVR
ESPN      | Nike       | Male   | 0.01

One-hot encoded:
ESPN CNBC SONY | Adi Nike Coke P&G | Male Female
 1    0    0   |  0   1    0   0   |  1    0

XA = (XA0, XA1, XA2) = ESPN's latent vector for the Advertiser field (PV_A)
XG = (XG0, XG1, XG2) = ESPN's latent vector for the Gender field (PV_G)
YP = (YP0, YP1, YP2) = Nike's latent vector for the Publisher field (AV_P)
YG = (YG0, YG1, YG2) = Nike's latent vector for the Gender field (AV_G)
ZP = (ZP0, ZP1, ZP2) = Male's latent vector for the Publisher field (GV_P)
ZA = (ZA0, ZA1, ZA2) = Male's latent vector for the Advertiser field (GV_A)

PV_Aᵀ·AV_P + AV_Gᵀ·GV_A + GV_Pᵀ·PV_G = pCVR
NOTE: All vectors are K-dimensional, where K is a hyperparameter of the algorithm
16. Field aware Factorization Machine (FFM)
● We have a K-dimensional vector for every feature value, for every other feature type (field)
● Still second-order interactions, but with more degrees of freedom than FM
● Intuition: latent features interact with every other cross feature differently
Works significantly better than FM, but at certain cuts it was still not able to beat the tree-based model
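A sketch of the field-aware scoring on the same toy example (again with names of our choosing): every feature value keeps one latent vector per other field, and each pair is scored with the vectors aimed at the opposite field, matching PV_Aᵀ·AV_P + AV_Gᵀ·GV_A + GV_Pᵀ·PV_G above.

```python
import numpy as np

K = 3
rng = np.random.default_rng(0)
# latent[value][other_field] -> K-dim vector, e.g. PV_A = latent["pub=ESPN"]["adv"]
latent = {
    "pub=ESPN":    {"adv": rng.normal(size=K), "gender": rng.normal(size=K)},
    "adv=Nike":    {"pub": rng.normal(size=K), "gender": rng.normal(size=K)},
    "gender=Male": {"pub": rng.normal(size=K), "adv": rng.normal(size=K)},
}

def ffm_second_order(active, latent):
    # active: {field: value_key}; each pair uses the vectors aimed at the other field.
    items = list(active.items())
    score = 0.0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (fi, vi), (fj, vj) = items[i], items[j]
            score += latent[vi][fj] @ latent[vj][fi]
    return score

score = ffm_second_order({"pub": "pub=ESPN", "adv": "adv=Nike", "gender": "gender=Male"}, latent)
```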
17. Deep neural-net with Factorisation Machine: DeepFM
Sigmoid(FM + NeuralNet(PV :+ AV :+ GV)) = pCVR (where :+ denotes concatenation of the latent vectors)
18. DeepFM
● Now we are entering the neural net world
● This model is a combination of FM and NN, and the final prediction is the sum of the outputs of the 2 models
● Here we optimize the entire graph together.
● It performs better than taking the latent vectors from FM and then running them through a neural net as a secondary optimization (FNN)
● It performs better than FM, but not better than FFM
● Intuition: FM finds the second-order interactions, while the neural net uses the latent vectors to find the higher-order nonlinear interactions.
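A compact sketch of the DeepFM wiring as described above (layer shapes and names are our assumptions): the FM score and a small MLP over the concatenated latent vectors share the same embeddings and are summed before the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(x, layers):
    # layers: list of (W, b); ReLU between layers, linear single-unit output at the end.
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return float(x)  # final layer has a single unit

def deepfm(latent_vectors, fm_score, layers):
    deep_score = mlp(np.concatenate(latent_vectors), layers)  # NeuralNet(PV :+ AV :+ GV)
    return sigmoid(fm_score + deep_score)  # Sigmoid(FM + NN) = pCVR
```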
20. NFM
● In this architecture you run only the second-order features through the NN, instead of the raw latent vectors
● Intuition: the neural net takes the second-order interactions and uses them to find the higher-order nonlinear interactions
● Performs better than DeepFM, mostly attributed to 2 facts:
○ The size of the net is smaller, hence it converges faster.
○ The neural net can take the second-order interactions and convert them easily to higher-order interactions.
● But still not better than FFM
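In the published NFM, this "second-order features into the NN" step is bi-interaction pooling: a single K-dimensional vector that sums the elementwise products of every latent-vector pair. A sketch, assuming that reading:

```python
import numpy as np

def bi_interaction(vectors):
    # sum_{i<j} v_i * v_j (elementwise) == 0.5 * ((sum v)^2 - sum v^2), elementwise.
    V = np.stack(vectors)                       # (num_features, K)
    s = V.sum(axis=0)
    return 0.5 * (s * s - (V * V).sum(axis=0))  # (K,) - this vector feeds the MLP
```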
22. InMobi Spec: DeepFFM
● A simple upgrade to DeepFM
● Performs better than both DeepFM and FFM
● Training is slower
● The FFM part does the majority of the prediction heavy lifting, evidently due to faster gradient convergence.
● Intuition: take the latent vectors and run them through the NN for higher-order interactions, and use FFM for the second-order interactions.
24. InMobi Spec: NFFM
● A simple upgrade to NFM
● Does significantly better than all the other models.
● Converges faster than DeepFFM
● Intuition: take the second-order interactions from FFM and run them through the neural net to find higher-order nonlinear interactions.
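One plausible reading of the NFFM front end, sketched below (the talk does not show exact code): keep the field-aware pairwise elementwise products as K-dimensional features, concatenate them, and hand that to the MLP instead of collapsing each pair to a scalar.

```python
import numpy as np

def nffm_features(active, latent):
    # active: {field: value_key}; latent[value][other_field] -> K-dim vector (as in FFM).
    items = list(active.items())
    feats = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (fi, vi), (fj, vj) = items[i], items[j]
            feats.append(latent[vi][fj] * latent[vj][fi])  # elementwise, stays K-dim
    return np.concatenate(feats)  # (num_pairs * K,) -> input to the neural net
```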
25. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
26. Use case 1 - Results CVR
Accuracy function: (Σᵢ Wᵢ · |Yactᵢ - Ypredᵢ|) / Σᵢ Wᵢ

Accuracy % improvement over linear model (small dataset):
Model:       FFM | DeepFM | DeepFFM | NFFM
Improvement: 44% | 35%    | 48%     | 64%
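The accuracy function above is a weighted mean absolute error; a direct transcription (array names are ours):

```python
import numpy as np

def weighted_abs_error(w, y_act, y_pred):
    # (sum_i W_i * |Yact_i - Ypred_i|) / sum_i W_i
    w, y_act, y_pred = map(np.asarray, (w, y_act, y_pred))
    return float((w * np.abs(y_act - y_pred)).sum() / w.sum())
```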
27. Use case 1 - Results CVR
Accuracy % improvement over linear model:
Training Data Dates | Test Date | % Improvement
T1-T7               | T7        | 21%
T1-T7               | T8        | 14%
T2-T8               | T8        | 20%
T2-T8               | T9        | 14%

% Improvement over tree model:
Cut1 | 21.7%
Cut2 | 18.5%
28. Use case 2 - Results VCR
Error function (AEPV - Absolute Error Per View):
(Σᵢ (Viewsᵢ - Cmpltdᵢ) · |Ypredᵢ| + Cmpltdᵢ · |1 - Ypredᵢ|) / Σᵢ Viewsᵢ

% AEPV improvement by Country-OS cut over the last-7-day average model:
Cut  | Logistic Reg | Logistic Reg (2nd-order autoregressive features) | LR (GBT-based feature engineering) | NFFM
Cut1 | -3.71%       | 2.30%                                            | 2.51%                              | 3.00%
Cut2 | -2.16%       | 3.05%                                            | 4.48%                              | 28.83%
Cut3 | -0.31%       | -0.56%                                           | 5.65%                              | 12.47%
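A direct transcription of the AEPV error function defined above (array names are ours): non-completed views are penalised by the predicted rate, completed views by its complement.

```python
import numpy as np

def aepv(views, completed, y_pred):
    # (sum_i (Views_i - Cmpltd_i) * |Ypred_i| + Cmpltd_i * |1 - Ypred_i|) / sum_i Views_i
    views, completed, y_pred = map(np.asarray, (views, completed, y_pred))
    num = ((views - completed) * np.abs(y_pred) + completed * np.abs(1.0 - y_pred)).sum()
    return float(num / views.sum())
```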
29. Use case 2 - Results VCR
● LR with L2 regularisation
● 2nd-order features were selected based on an information gain criterion
● The GBT package in Spark MLlib was used (numTrees = 400, maxDepth = 8, sampling = 0.5, minInstancesPerNode = 10).
○ The training process was too slow, even with large enough resources.
○ XGBoost with Spark (tried later) was faster, and resulted in further improvements
● NFFM: increasing the number of layers up to 3 resulted in a further 20% improvement in the validation errors; no significant improvement after that
30. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
31. Building the full intuition
Factorisation machine:
● Handles categorical features and a sparse data matrix
● Extracts latent variables, e.g., identifying non-explicit segment profiles in the population
Field-aware:
● Dimensionality reduction (high-cardinality features to a K-dimensional representation)
● Increases degrees of freedom (compared to FM, in terms of field-specific values) to enable an exhaustive set of second-order interactions
Neural network:
● Explores and weights higher-order interactions - we went up to 3 layers of interaction successfully
● Generates the numerical prediction
● Trains the factors based on the performance of both the FM machine and the neural net (instead of training them separately, which causes the latent vectors to be limited by the power of FM alone)
32. Content
1) The problem and context
2) The Motivation
3) Building the model theory: piece by piece
4) Results of the 2 use cases
5) Understanding exactly why it works
6) Implementation at InMobi scale
33. Implementation details
● Hyperparameters are k, lambda, number of layers, number of nodes per layer, and activation functions
● Implemented in TensorFlow
● Adam optimizer
● L2 regularization; no dropouts
● No batch normalization
● 1 layer with 100 nodes performs well enough and saves compute
● ReLU activations (converge faster)
● k = 16 (try powers of 2)
● Weighted RMSE as the loss function for both use cases
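A hedged TensorFlow sketch of the weighted RMSE loss named above (the talk does not show its code; tensor names are illustrative):

```python
import tensorflow as tf

def weighted_rmse(y_true, y_pred, weights):
    # sqrt( sum(w * (y - yhat)^2) / sum(w) )
    se = weights * tf.square(y_true - y_pred)
    return tf.sqrt(tf.reduce_sum(se) / tf.reduce_sum(weights))

# e.g. minimised with tf.keras.optimizers.Adam(), matching the slide's optimizer choice
```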
34. Predicting for unseen feature values
ESPN | CNBC | SONY | UNKNOWN?
Each known publisher has field-aware latent vectors:
ESPN: XA = (XA0, XA1, XA2), XG = (XG0, XG1, XG2)
CNBC: YA = (YA0, YA1, YA2), YG = (YG0, YG1, YG2)
SONY: ZA = (ZA0, ZA1, ZA2), ZG = (ZG0, ZG1, ZG2)
● Average the latent feature interactions per feature for unknown values:
UNKNOWN vs Advertiser field: ((XA0+YA0+ZA0)/3, (XA1+YA1+ZA1)/3, (XA2+YA2+ZA2)/3)
UNKNOWN vs Gender field: ((XG0+YG0+ZG0)/3, (XG1+YG1+ZG1)/3, (XG2+YG2+ZG2)/3)
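A small numpy sketch of the averaging above, reusing the field-aware latent layout from the FFM sketch (names are ours): the UNKNOWN value's per-field vectors are the elementwise mean of the known values' vectors for that field.

```python
import numpy as np

def unknown_value_vectors(field_latents):
    # field_latents: {value: {other_field: K-dim vector}} for one field (e.g. Publisher).
    other_fields = next(iter(field_latents.values())).keys()
    return {f: np.mean([vecs[f] for vecs in field_latents.values()], axis=0)
            for f in other_fields}

# e.g. UNKNOWN publisher vs Advertiser field = (XA + YA + ZA) / 3, elementwise
```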
35. Implementing @ low-latency, high-scale
● MLeap: the MLeap framework supports models trained both in Spark and in TensorFlow. This lets us train tree-based models in Spark and NN-based models in TF.
● Offline training and challenges: we cannot train TF models on the YARN cluster, hence we use a GPU machine as a gateway to pull data from HDFS and train on the GPU.
● Online serving challenges: TF Serving has pretty low throughput and wasn't scaling to our QPS. Hence we use a local LRU cache with a decent TTL to scale TF Serving.
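A minimal sketch of the serving-side cache idea using cachetools (our library choice; the talk only names a "local LRU cache with TTL"): scores are memoised per feature key so most requests never reach TF Serving.

```python
from cachetools import TTLCache

cache = TTLCache(maxsize=1_000_000, ttl=300)  # LRU eviction + 5-min TTL; tune for staleness vs QPS

def predict(feature_key, tf_serving_call):
    # Serve a cached score while fresh; fall through to TF Serving on a miss.
    if feature_key not in cache:
        cache[feature_key] = tf_serving_call(feature_key)
    return cache[feature_key]
```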
36. Future research that we are currently pursuing...
● Hybrid binning NFFM
● Distributed training and serving
● Dropouts & batch normalization
● Methods to interpret the latent vectors (using methods like t-Distributed Stochastic Neighbour Embedding (t-SNE), etc.)