Can Deep Learning and Egocentric Vision for Visual Lifelogging help us eat better?
1. Can Deep Learning and Egocentric Vision for Visual Lifelogging help us eat better?
Petia Radeva
www.cvc.uab.es/~petia
Computer Vision at UB (CVUB), Universitat de Barcelona &
Medical Imaging Laboratory, Computer Vision Center
5. Rememory: Life-logging for MCI treatment
Project led by Dr. Maite Garolera of the Consorci Sanitari de Terrassa.
Goal: use episodic images to develop cognitive exercises and tools for memory reinforcement in people with MCI and Alzheimer's disease.
But episodic images serve for something more than reinforcing memory: they show the lifestyle of individuals!
8. Obesity in Catalunya
51% of the Catalan population aged 18 to 74 is overweight; 15% are obese.
Excess weight affects 62% of people without university studies vs. 36% of those with higher education.
9. The obesity pandemic
Overweight and obesity are risk factors for cancers and for cardiovascular and metabolic disorders, and among the leading causes of premature mortality worldwide.
In Europe, 4.2 million people die of chronic diseases (such as diabetes or cancer) linked to lack of physical activity and unhealthy diet.
Physical activity can increase lifespan by 1.5–3.7 years.
10. Which wearables do consumers plan to buy?
• 21M Fitbits were sold in 2015!
• The number of users is expected to double by 2018, to 81.7 million.
The Consumer Technology Association (CTA), formerly the Consumer Electronics Association (CEA), surveyed 1,001 US internet users. Source: eMarketer.
11. What are we missing in health applications?
Today, automatically measuring physical activity is not a problem. But what about food and nutrition?
12. But what about food and nutrition?
What are we missing in health applications?
State of the art: nutritional health apps are based on manual food diaries:
• SparkPeople
• Lose It!
• MyFitnessPal
• Cronometer
• FatSecret
20. White House wants the nation to get ready for AI
October, 2016
http://readwrite.com/2016/10/16/white-house-offers-artificial-intelligence-plan-cl1/
23. The learning process
Training data: {(x_i, y_i), i = 1, 2, …, n}.
Learning minimizes the total error between predictions f(x_i) and ground truth y_i:
f* = argmin_f Σ_i Error(f(x_i), y_i)
where Error is a measure of prediction quality (the loss); ideally this sum approximates an expectation over the data distribution.
A common loss function is the negative conditional log-likelihood, with the interpretation that f_i(x) estimates P(Y = i | X):
L(f(x), y) = −log f_y(x), where f_i(x) ≥ 0 and Σ_i f_i(x) = 1.
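This loss can be sketched in a few lines of Python; here the class probabilities f_i(x) come from a softmax over raw class scores (function names are illustrative):

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(scores, y):
    # Negative conditional log-likelihood: L(f(x), y) = -log f_y(x),
    # where f(x) = softmax(scores) satisfies f_i(x) >= 0 and sum_i f_i(x) = 1.
    probs = softmax(scores)
    return -math.log(probs[y])

# The higher the score of the true class, the lower the loss.
loss_good = nll_loss([5.0, 0.1, -2.0], y=0)
loss_bad = nll_loss([5.0, 0.1, -2.0], y=2)
```

Note that the loss is always non-negative and reaches zero only when the model puts all probability mass on the true class.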
24. The problem of image classification
Dual representation of images as points/vectors: each image of M rows by N columns by C channels (C = 3 for color images) can be considered as a vector/point in R^(M×N×C), and vice versa.
For example, a 32×32×3 image is a point in R^(32×32×3) = R^3072.
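The dual representation can be checked in a few lines of NumPy (the 32×32×3 shape matches the slide's example):

```python
import numpy as np

# A 32x32 RGB image is a point in R^(32*32*3) = R^3072, and vice versa.
image = np.random.rand(32, 32, 3)
vector = image.reshape(-1)            # image -> vector in R^3072
restored = vector.reshape(32, 32, 3)  # vector -> image, losslessly
```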
25. Linear classification
Given two classes, how do we learn a hyperplane in R^(32×32×3) that separates them?
To find the hyperplane that separates dogs from cats, we need to define:
• the score function,
• the loss function,
• and the optimization process.
26. Linear classification
How to project data into the feature space:
f(x) = W x + b
If x is an image of 32×32×3, then x ∈ R^3072, the matrix W is 3×3072, and the bias vector b is 3-dimensional:
(3×1) = (3×3072)(3072×1) + (3×1).
27. Linear classification
How to project data into the feature space:
f(x) = W x + b
If we have 3 classes, f(x) gives 3 scores, one per class:
(3×1) = (3×3072)(3072×1) + (3×1).
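The shapes above can be verified directly in NumPy (a minimal sketch; the random values are placeholders for learned weights):

```python
import numpy as np

def linear_scores(x, W, b):
    # Score function f(x) = W x + b: one score per class.
    return W @ x + b

x = np.random.rand(3072)     # a flattened 32x32x3 image
W = np.random.rand(3, 3072)  # one row of weights per class (3 classes)
b = np.random.rand(3)        # one bias per class

scores = linear_scores(x, W, b)  # shape (3,): 3 classes -> 3 scores
```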
29. Loss function and optimisation
Question: if you were to assign a single number to how unhappy you are with these scores, what would you do?
Question: given the score and the loss function, how do we find the parameters W?
Pipeline: input x_i (with label y_i) → score function f(x_i, W) → loss function L(f(x_i), y_i). Learning searches for the W that minimizes the loss.
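Finding W is typically done by gradient descent on the loss. A minimal sketch with a softmax score-plus-loss on a single toy example (dimensions and learning rate are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # shift for numerical stability
    return e / e.sum()

def loss_and_grad(W, x, y):
    # Score function f(x; W) = W x; loss L = -log softmax(W x)[y].
    p = softmax(W @ x)
    loss = -np.log(p[y])
    p[y] -= 1.0              # dL/dscores for softmax + negative log-likelihood
    grad = np.outer(p, x)    # dL/dW
    return loss, grad

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(3, 4))  # 3 classes, 4 input features
x, y = rng.normal(size=4), 1

# A few gradient-descent steps drive the loss down.
losses = []
for _ in range(100):
    loss, grad = loss_and_grad(W, x, y)
    losses.append(loss)
    W -= 0.1 * grad          # step against the gradient
```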
30. How is a CNN doing deep learning?
First layer: each unit computes a weighted sum of the input image pixels, y_j = Σ_i W_ji x_i (with weights W_j1, W_j2, …, W_jn per unit), i.e. y = W x.
Second layer: y = W₂(W₁x).
Stacking further fully connected layers composes these maps, y = W₃(W₂(W₁x)), up to the output layer.
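The layer composition can be sketched in NumPy (layer widths here are illustrative; the random weights stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3072)          # flattened 32x32x3 input image
W1 = rng.normal(size=(100, 3072))  # first fully connected layer
W2 = rng.normal(size=(100, 100))   # second layer
W3 = rng.normal(size=(10, 100))    # output layer: 10 class scores

# Each unit computes y_j = sum_i W_ji x_i; stacking layers composes the maps.
y = W3 @ (W2 @ (W1 @ x))

# Without a nonlinearity between layers, the composition collapses to a
# single linear map W3 W2 W1 (real networks insert e.g. a ReLU between layers).
y_collapsed = (W3 @ W2 @ W1) @ x
```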
31. Why is a CNN a neural network?
Modern CNNs: ~10M neurons.
Humans: ~5B neurons.
From: Fei-Fei Li, Andrej Karpathy & Justin Johnson
36. Example architecture
The trick is to train the weights such that when the network sees a picture of a truck, the last layer says “truck”.
Slide credit: Fei-Fei Li
37. Training a CNN
Training a CNN means learning all of its parameters: the convolutional matrices (filters) and the weights of the fully connected layers.
- Several millions of parameters!!!
38. 1001 benefits of CNNs
• Transfer learning: fine-tuning for object recognition
  - Replace and retrain the classifier on top of the ConvNet
  - Fine-tune the weights of the pre-trained network by continuing the backpropagation
• Feature extraction by CNN (e.g., the 4096 features of the last fully connected layer)
• Object detection
• Object segmentation
• Image similarity and matching by CNN
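The "replace and retrain the classifier" idea can be sketched without any deep learning library. In this toy NumPy stand-in, a frozen random projection plays the role of the pre-trained ConvNet's feature extractor (scaled down from the real 4096 features; all sizes and names are illustrative, not the method from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)

# Scaled-down stand-in for a pre-trained ConvNet: a FROZEN map to a
# feature space (playing the role of the CNN's 4096-d features).
D_in, D_feat, n_classes = 300, 512, 2
W_frozen = rng.normal(size=(D_feat, D_in)) / np.sqrt(D_in)

def extract_features(x):
    # "Feature extraction by CNN": forward pass through the frozen layers.
    return np.maximum(0.0, W_frozen @ x)

def train_new_head(xs, ys, lr=0.01, epochs=50):
    # Replace the classifier on top and retrain ONLY its weights;
    # W_frozen is never updated.
    feats = [extract_features(x) for x in xs]  # cache the fixed features once
    W_head = np.zeros((n_classes, D_feat))
    for _ in range(epochs):
        for f, y in zip(feats, ys):
            s = W_head @ f
            p = np.exp(s - s.max())
            p /= p.sum()
            p[y] -= 1.0                        # softmax + NLL gradient
            W_head -= lr * np.outer(p, f)      # update the head only
    return W_head

def predict(W_head, x):
    return int(np.argmax(W_head @ extract_features(x)))

# Toy "images": two classes drawn around different means.
xs = [rng.normal(loc=m, size=D_in) for m in (0.5, -0.5) for _ in range(5)]
ys = [0] * 5 + [1] * 5
W_head = train_new_head(xs, ys)
```

Caching the frozen features before the training loop mirrors what is done in practice: when only the head is retrained, the expensive ConvNet forward pass needs to run just once per image.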
39. Index
Healthy habits and food analysis
Deep learning
Automatic food analysis
Egocentric vision
40. Automatic food analysis
Can we automatically recognize food?
• Goal: to detect and classify every instance of a dish, in all of its variants, shapes and positions, in a large number of images.
The main problems that arise are:
• Complexity and variability of the data.
• Huge amounts of data to analyse.
43. Food recognition
Pipeline: image input → foodness map extraction → food detection CNN → food recognition CNN → food type recognition (e.g., apple, strawberry).
Results: TOP-1 74.7%, TOP-5 91.6%.
State of the art (Bossard, 2014): TOP-1 56.4%.
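The TOP-1/TOP-5 metrics reported above count a prediction as correct when the true class is among the model's k highest-scoring classes. A minimal sketch (the toy score matrix is illustrative):

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    # A prediction is correct at top-k if the true class is among the
    # k highest-scoring classes for that image.
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = [y in row for row, y in zip(topk, labels)]
    return sum(hits) / len(labels)

scores = np.array([[0.1, 0.7, 0.2],   # 3 images x 3 classes
                   [0.5, 0.3, 0.2],
                   [0.2, 0.2, 0.6]])
labels = [2, 0, 2]                    # true class per image

top1 = topk_accuracy(scores, labels, k=1)  # image 0 is missed at top-1
top2 = topk_accuracy(scores, labels, k=2)  # but recovered at top-2
```

By construction top-k accuracy is non-decreasing in k, which is why TOP-5 (91.6%) is always at least TOP-1 (74.7%).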
44. Demo
Herruzo, P., Bolaños, M. and Radeva, P. (2016). “Can a CNN Recognize Catalan Diet?”. In Proceedings of the 8th Intl Conf. for
Promoting the Application of Mathematics in Technical and Natural Sciences (AMiTaNS).
45. Food environment classification
Food-related categories: Bakery, Banquet hall, Bar, Butcher shop, Cafeteria, Candy store, Coffee shop, Dinette, Dining room, Food court, Galley, Ice cream parlor, Kitchen, Kitchenette, Market, Pantry, Picnic area, Restaurant, Restaurant kitchen, Restaurant patio, Supermarket.
Classification results:
0.92 – Food-related vs. Non-food-related
0.68 – 22 classes of food-related categories
46. Towards automatic image description
Bolaños, M., Peris, Á., Casacuberta, F., & Radeva, P. “VIBIKNet: Visual Bidirectional Kernelized Network for the VQA
Challenge” VQA Challenge, CVPR '16.
47. Two main questions
What do we eat?
• Automatic food recognition vs. food diaries
And how do we eat?
• Automatic eating pattern extraction: when, where, how, how long, with whom, in which context?
48. Index
Healthy habits and food analysis
Deep learning
Automatic food analysis
Egocentric vision
49. Wearable cameras and the life-logging trend
Shipments of wearable computing devices worldwide by
category from 2013 to 2015 (in millions)
51. Wealth of life-logging data… or the hell of life-logging data
Complete dataset of a day captured with SenseCam (more than 4,100 images).
The choice of device depends on:
1) where it is worn: a camera hung on the neck has the advantage of being considered more unobtrusive for the user; or
2) its temporal resolution: a camera with a low frame rate will capture less motion information, but we will need to process less data.
We chose SenseCam and Narrative: cameras hung on the neck or clipped to clothing that capture 2–4 frames per minute.
We propose an energy-based approach for motion-based event segmentation of life-logging sequences of low temporal resolution:
- The segmentation is reached by integrating different kinds of image features and classifiers into a graph-cut framework to ensure consistent sequence treatment.
52. Visual life-logging data
Events to be extracted from life-logging images:
- Activities he/she has done
- Interactions he/she has participated in
- Events he/she has taken part in
- Duties he/she has performed
- Environments and places he/she has visited, etc.
Dimiccoli, M., Bolaños, M., Talavera, E., Aghaei, M., Nikolov, S., and Radeva, P. (2015). “SR-Clustering: Semantic Regularized Clustering for Egocentric Photo Streams Segmentation”. Computer Vision and Image Understanding (CVIU), in press. Preprint: http://arxiv.org/abs/1512.07143
53. Egocentric vision progress
Bolaños, M., Dimiccoli, M. & Radeva, P. (2015). “Towards Storytelling from Visual Lifelogging: An Overview”. IEEE Transactions on Human-Machine Systems (THMS), in press. Preprint: http://arxiv.org/abs/1507.06120
54. Towards healthy habits
Towards visualizing summarized lifestyle data to ease the management of the user’s healthy habits (sedentary lifestyle, nutritional activity, etc.).
M. Aghaei, M. Dimiccoli, P. Radeva. “Extended Bag-of-Tracklets for Multi-Face Tracking in Egocentric Photo Streams”. Computer Vision and Image Understanding, Vol. 149, pp. 146–156, 2016. Special Issue on Assistive Computer Vision and Robotics, Elsevier. doi: 10.1016/j.cviu.2016.02.013
55. Conclusions
Healthy habits: one of the main health concerns for people, society, and governments.
Deep learning: a technology that is here to stay, and a new technological trend that directly affects our environment.
Food analysis and recognition: a new challenge with huge potential for applications. We need food databases of millions of images and thousands of categories, and there is a wide set of problems for food analysis: recognition, segmentation, habit characterization, image and video description, etc.
Egocentric vision and life-logging: a recent trend in Computer Vision and a largely unexplored technology with big potential to help people monitor and describe their behaviour and thus improve their lifestyle.
51% of the Catalan population aged 18 to 74 suffers from significant excess weight (15% are obese); this situation affects 62% of those with no studies or only primary education, and 36% of families with a university education.
“Deep learning: In recent years, some of the most impressive advancements in machine learning have been in the subfield of deep learning, also known as deep network learning. Deep learning uses structures loosely inspired by the human brain, consisting of a set of units (or “neurons”). Each unit combines a set of input values to produce an output value, which in turn is passed on to other neurons downstream. …”
Exponential Linear Units (ELU): all the benefits of ReLU, does not die, outputs closer to zero mean, but computation requires exp().