04 structured prediction and energy minimization part 1


Part 6: Structured Prediction and Energy
Minimization (1/2)

Sebastian Nowozin and Christoph H. Lampert

Colorado Springs, 25th June 2011



Prediction Problem

Prediction Problem

$y^* = f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} g(x, y)$

$g(x, y) = p(y|x)$, factor graphs/MRF/CRF,
$g(x, y) = -E(y; x, w)$, factor graphs/MRF/CRF,
$g(x, y) = \langle w, \psi(x, y) \rangle$, linear model (e.g. multiclass SVM),

→ difficulty: $\mathcal{Y}$ finite but very large



Prediction Problem (cont)

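Definition (Optimization Problem)
Given $(g, \mathcal{Y}, \mathcal{G}, x)$, with feasible set $\mathcal{Y} \subseteq \mathcal{G}$ over decision domain $\mathcal{G}$, and
given an input instance $x \in \mathcal{X}$ and an objective function $g : \mathcal{X} \times \mathcal{G} \to \mathbb{R}$,
find the optimal value
$\alpha = \sup_{y \in \mathcal{Y}} g(x, y)$,

and, if the supremum is attained, find an optimal solution $y^* \in \mathcal{Y}$ such that
$g(x, y^*) = \alpha$.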




The feasible set

Ingredients
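Decision domain $\mathcal{G}$,
typically simple ($\mathcal{G} = \mathbb{R}^d$, $\mathcal{G} = 2^V$, etc.)
Feasible set $\mathcal{Y} \subseteq \mathcal{G}$,
defining the problem-specific structure
Objective function $g : \mathcal{X} \times \mathcal{G} \to \mathbb{R}$.

Terminology
$\mathcal{Y} = \mathcal{G}$: unconstrained optimization problem,
$\mathcal{G}$ finite: discrete optimization problem,
$\mathcal{G} = 2^\Sigma$ for ground set $\Sigma$: combinatorial optimization problem,
$\mathcal{Y} = \emptyset$: infeasible problem.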




Example: Feasible Sets (cont)
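[Figure: chain factor graph over $Y_i$, $Y_j$, $Y_k$ with example states (+1), (-1), (-1); unary factors $h_i y_i$, $h_j y_j$, $h_k y_k$ and pairwise factors $J_{ij} y_i y_j$, $J_{jk} y_j y_k$]

Ising model with external field
Graph $G = (V, E)$
“External field”: $h \in \mathbb{R}^V$
Interaction matrix: $J \in \mathbb{R}^{V \times V}$
Objective, defined on $y_i \in \{-1, 1\}$
$g(y) = h_i y_i + h_j y_j + h_k y_k + \frac{1}{2} J_{ij} y_i y_j + \frac{1}{2} J_{jk} y_j y_k$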




Ising model with external field

$\mathcal{Y} = \mathcal{G} = \{-1, +1\}^V$
$g(y) = \frac{1}{2} \sum_{(i,j) \in E} J_{i,j} y_i y_j + \sum_{i \in V} h_i y_i$

Unconstrained
Objective function contains quadratic terms





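$\mathcal{G} = \{0, 1\}^{(V \times \{-1,+1\}) \cup (E \times \{-1,+1\} \times \{-1,+1\})}$,

$$\begin{aligned}
\mathcal{Y} = \{ y \in \mathcal{G} :\ & \forall i \in V : y_{i,-1} + y_{i,+1} = 1, \\
& \forall (i,j) \in E : y_{i,j,+1,+1} + y_{i,j,+1,-1} = y_{i,+1}, \\
& \forall (i,j) \in E : y_{i,j,-1,+1} + y_{i,j,-1,-1} = y_{i,-1} \},
\end{aligned}$$

$$\begin{aligned}
g(y) =\ & \frac{1}{2} \sum_{(i,j) \in E} J_{i,j} (y_{i,j,+1,+1} + y_{i,j,-1,-1}) \\
& - \frac{1}{2} \sum_{(i,j) \in E} J_{i,j} (y_{i,j,+1,-1} + y_{i,j,-1,+1}) \\
& + \sum_{i \in V} h_i (y_{i,+1} - y_{i,-1})
\end{aligned}$$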

Constrained, more variables
Objective function contains linear terms only
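To see that the constrained, linear formulation describes the same problem as the quadratic one, the sketch below encodes each spin configuration as indicator variables and compares both objective values. The toy instance and the function names (`encode`, `linear_objective`, `quadratic_objective`) are assumptions made for illustration.

```python
from itertools import product

SIGNS = (-1, +1)

def quadratic_objective(y, h, J, edges):
    """Ising objective on y in {-1,+1}^V."""
    return (0.5 * sum(J[i][j] * y[i] * y[j] for i, j in edges)
            + sum(h[i] * y[i] for i in range(len(y))))

def encode(y, edges):
    """Indicators z[(i, s)] = [y_i = s] and z[(i, j, s, t)] = [y_i = s and y_j = t]."""
    z = {(i, s): int(y[i] == s) for i in range(len(y)) for s in SIGNS}
    for i, j in edges:
        for s, t in product(SIGNS, repeat=2):
            z[(i, j, s, t)] = int(y[i] == s and y[j] == t)
    return z

def linear_objective(z, h, J, edges):
    """The same objective, written as a linear function of the indicators."""
    agree = sum(J[i][j] * (z[(i, j, +1, +1)] + z[(i, j, -1, -1)]) for i, j in edges)
    differ = sum(J[i][j] * (z[(i, j, +1, -1)] + z[(i, j, -1, +1)]) for i, j in edges)
    unary = sum(h[i] * (z[(i, +1)] - z[(i, -1)]) for i in range(len(h)))
    return 0.5 * agree - 0.5 * differ + unary

# Hypothetical toy instance: chain 0 - 1 - 2.
h = [0.3, -0.2, -1.0]
J = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
edges = [(0, 1), (1, 2)]

same = all(
    abs(quadratic_objective(y, h, J, edges)
        - linear_objective(encode(y, edges), h, J, edges)) < 1e-12
    for y in product(SIGNS, repeat=len(h))
)
print(same)
```

The encoding trades 3 spin variables for 6 node indicators plus 8 edge indicators, which is the "more variables" cost noted on the slide.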



Evaluating f : what do we want?
f (x) = argmax g (x, y )
y ∈Y

For evaluating f (x) we want an algorithm that
1. is general: applicable to all instances of the problem,
2. is optimal: provides an optimal y ∗ ,
3. has good worst-case complexity: for all instances the runtime and
space are acceptably bounded,
4. is integral: its solutions are restricted to Y,
5. is deterministic: its results and runtime are reproducible and depend
on the input data only.




wanting all of them → impossible




Giving up some properties

[Figure: a hard problem surrounded by the five properties: generality, optimality, worst-case complexity, integrality, determinism]

→ giving up one or more properties
allows us to design algorithms satisfying the remaining properties
might be sufficient for the task at hand



G: Generality





Giving up Generality

Identify an interesting and tractable subset of instances

[Figure: the tractable subset drawn inside the set of all instances]




Example: MAP Inference in Markov Random Fields

Although NP-hard in general, it is tractable...
with low tree-width (Lauritzen, Spiegelhalter, 1988)
with binary states, pairwise submodular interactions (Boykov, Jolly, 2001)
with binary states, pairwise interactions (only), planar graph structure (Globerson, Jaakkola, 2006)
with submodular pairwise interactions (Schlesinger, 2006)
with $P^n$-Potts higher order factors (Kohli, Kumar, Torr, 2007)
with perfect graph structure (Jebara, 2009)
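As an illustration of the low tree-width case, here is a sketch of exact MAP inference on a chain (tree-width 1) by max-sum dynamic programming. It assumes the model is given as score tables to be maximized; the function name `chain_map` and the toy tables are hypothetical.

```python
def chain_map(unary, pairwise):
    """Exact maximizer of sum_i unary[i][y_i] + sum_i pairwise[i][y_i][y_{i+1}].

    unary[i][s] scores state s of variable i; pairwise[i][s][t] scores the
    pair (y_i = s, y_{i+1} = t). Runtime is linear in the chain length.
    """
    best = [list(unary[0])]   # best[i][t]: best score of a prefix ending in state t
    back = []                 # back[i-1][t]: predecessor state achieving best[i][t]
    for i in range(1, len(unary)):
        row, arg = [], []
        for t in range(len(unary[i])):
            cands = [best[-1][s] + pairwise[i - 1][s][t]
                     for s in range(len(unary[i - 1]))]
            s_best = max(range(len(cands)), key=cands.__getitem__)
            row.append(cands[s_best] + unary[i][t])
            arg.append(s_best)
        best.append(row)
        back.append(arg)
    # Backtrack from the best final state.
    t = max(range(len(best[-1])), key=best[-1].__getitem__)
    value = best[-1][t]
    y = [t]
    for arg in reversed(back):
        t = arg[t]
        y.append(t)
    return list(reversed(y)), value

# Hypothetical toy chain: three binary variables, reward 1 for equal neighbors.
unary = [[0, 1], [0.5, 0], [0, 2]]
agree = [[1, 0], [0, 1]]
labels, value = chain_map(unary, [agree, agree])
print(labels, value)
```

The same recursion extends to trees and, with larger tables, to junction trees; the table size is what grows with tree-width.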




Binary Graph-Cuts

Energy function: unary and pairwise

$E(y; x, w) = \sum_{F \in \mathcal{F}_1} E_F(y_F; x, w_{t_F}) + \sum_{F \in \mathcal{F}_2} E_F(y_F; x, w_{t_F})$

Restriction 1 (wlog)

$E_F(y_i; x, w_{t_F}) \geq 0$

Restriction 2
(regular/submodular/attractive)

$E_F(y_i, y_j; x, w_{t_F}) = 0$, if $y_i = y_j$,
$E_F(y_i, y_j; x, w_{t_F}) = E_F(y_j, y_i; x, w_{t_F}) \geq 0$, otherwise.




Binary Graph-Cuts (cont)

Construct auxiliary undirected graph
One node $\{i\}_{i \in V}$ per variable
Two extra nodes: source s, sink t
Edges

Edge      Graph cut weight
{i, j}    $E_F(y_i = 0, y_j = 1; x, w_{t_F})$
{i, s}    $E_F(y_i = 1; x, w_{t_F})$
{i, t}    $E_F(y_i = 0; x, w_{t_F})$

[Figure: grid of variable nodes i, j, k, l, m, n, each joined to the source s by an edge {i, s} and to the sink t by an edge {i, t}]

Find linear s-t-mincut
Solution defines optimal binary labeling of the original energy minimization problem
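The construction can be sketched in a few lines of Python. The max-flow routine below is a plain Edmonds-Karp implementation chosen for brevity (any max-flow/min-cut solver would do; the slides do not prescribe one), and the input format and the name `min_cut_labeling` are assumptions for this example.

```python
from collections import deque

def min_cut_labeling(unary, pairwise):
    """Minimize sum_i unary[i][y_i] + sum_{(i,j)} pairwise[(i,j)] * [y_i != y_j].

    unary[i] = (E_i(0), E_i(1)) with both entries >= 0 (Restriction 1);
    pairwise[(i, j)] >= 0 is the symmetric disagreement cost (Restriction 2).
    Returns (labels, minimum energy).
    """
    n = len(unary)
    s, t = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i, (e0, e1) in enumerate(unary):
        cap[s][i] += e1          # edge {i, s} is cut when y_i = 1
        cap[i][t] += e0          # edge {i, t} is cut when y_i = 0
    for (i, j), c in pairwise.items():
        cap[i][j] += c           # undirected edge {i, j}
        cap[j][i] += c
    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * (n + 2)
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break                # no path left: the flow value equals the min cut
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck
    # Nodes still reachable from s lie on the source side and get label 0.
    labels = [0 if parent[i] != -1 else 1 for i in range(n)]
    return labels, flow

# Hypothetical toy chain with three variables.
unary = [(0.0, 2.0), (1.0, 0.5), (3.0, 0.0)]
pairwise = {(0, 1): 1.0, (1, 2): 1.0}
labels, energy = min_cut_labeling(unary, pairwise)
print(labels, energy)
```

The dense capacity matrix keeps the sketch short; for image-sized grids a sparse, specialized max-flow solver is the usual choice.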




Example: Figure-Ground Segmentation

Input image (http://pdphoto.org)





Color model log-odds





Independent decisions





$g(x, y, w) = \sum_{i \in V} \log p(y_i | x_i) + w \sum_{(i,j) \in E} C(x_i, x_j) I(y_i = y_j)$

Gradient strength
$C(x_i, x_j) = \exp(\gamma \| x_i - x_j \|^2)$
γ estimated from mean edge strength (Blake et al, 2004)
w ≥ 0 controls smoothing


[Figures: segmentation results for $w = 0$, small $w > 0$, medium $w > 0$, and large $w > 0$]
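The effect of the smoothing weight w can be reproduced on a tiny one-dimensional "image" by brute force. This is only a sketch under stated assumptions: the pixel values, the color model p(y_i = 1 | x_i) = x_i, and a negative γ (so that similar neighbors are coupled more strongly) are all invented for the example.

```python
import math
from itertools import product

def objective(y, log_p, x, w, gamma):
    """g(x, y, w) = sum_i log p(y_i|x_i) + w * sum_i C(x_i, x_{i+1}) [y_i = y_{i+1}]."""
    unary = sum(log_p[i][yi] for i, yi in enumerate(y))
    smooth = sum(math.exp(gamma * (x[i] - x[i + 1]) ** 2) * (y[i] == y[i + 1])
                 for i in range(len(y) - 1))
    return unary + w * smooth

def segment(log_p, x, w, gamma):
    """Exhaustive maximization; feasible only for a handful of pixels."""
    return max(product((0, 1), repeat=len(x)),
               key=lambda y: objective(y, log_p, x, w, gamma))

# Hypothetical 6-pixel signal and color model p(y=1|x) = x.
x = [0.1, 0.2, 0.9, 0.15, 0.8, 0.7]
log_p = [(math.log(1 - xi), math.log(xi)) for xi in x]
gamma = -2.0  # assumed negative

independent = segment(log_p, x, w=0.0, gamma=gamma)
smoothed = segment(log_p, x, w=100.0, gamma=gamma)
print(independent, smoothed)
```

With w = 0 each pixel follows its own color model; with a very large w the pairwise term dominates and the labeling collapses to a single region.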




General Binary Case
Is there a larger class of energies for which binary graph cuts are
applicable?
(Kolmogorov and Zabih, 2004), (Freedman and Drineas, 2005)

Theorem (Regular Binary Energies)
Let
$E(y; x, w) = \sum_{F \in \mathcal{F}_1} E_F(y_F; x, w_{t_F}) + \sum_{F \in \mathcal{F}_2} E_F(y_F; x, w_{t_F})$
be an energy function of binary variables containing only unary and pairwise
factors. The discrete energy minimization problem $\operatorname{argmin}_y E(y; x, w)$ is
representable as a graph cut problem if and only if all pairwise energy
functions $E_F$ for $F \in \mathcal{F}_2$ with $F = \{i, j\}$ satisfy

$E_{i,j}(0, 0) + E_{i,j}(1, 1) \leq E_{i,j}(0, 1) + E_{i,j}(1, 0)$.
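The regularity condition is easy to test factor by factor. A minimal sketch, assuming each pairwise factor is stored as a 2-by-2 table `E[a][b]`; the function names and example tables are illustrative.

```python
def is_regular(E):
    """Check E(0,0) + E(1,1) <= E(0,1) + E(1,0) for one 2x2 pairwise table."""
    return E[0][0] + E[1][1] <= E[0][1] + E[1][0]

def all_regular(pairwise_tables):
    """An energy is graph-cut representable iff every pairwise factor is regular."""
    return all(is_regular(E) for E in pairwise_tables)

attractive = [[0.0, 1.0], [1.0, 0.0]]   # penalizes disagreement: regular
repulsive = [[1.0, 0.0], [0.0, 1.0]]    # penalizes agreement: not regular
print(is_regular(attractive), is_regular(repulsive))
```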




Example: Class-independent Object Hypotheses
(Carreira and Sminchisescu, 2010)
PASCAL VOC 2009/2010 segmentation winner
Generate class-independent object hypotheses
Energy (almost) as before

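$g(x, y, w) = \sum_{i \in V} E_i(y_i) + w \sum_{(i,j) \in E} C(x_i, x_j) I(y_i = y_j)$

Fixed unaries

$E_i(y_i) = \begin{cases} \infty & \text{if } i \in V_{\text{fg}} \text{ and } y_i = 0 \\ \infty & \text{if } i \in V_{\text{bg}} \text{ and } y_i = 1 \\ 0 & \text{otherwise} \end{cases}$

Test all $w \geq 0$ using parametric max-flow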
(Picard and Queyranne, 1980), (Kolmogorov et al., 2007)



Example: Class-independent Object Hypotheses (cont)

Input image (http://pdphoto.org)




Example: Class-independent Object Hypotheses (cont)

CPMC proposal segmentations (Carreira and Sminchisescu, 2010)



G: Optimality





Giving up Optimality

Solving for y ∗ is hard, but is it necessary?
pragmatic motivation: in many applications a close-to-optimal solution is good enough
computational motivation: set of “good” solutions might be large and finding just one element can be easy

For machine learning models
modeling error: we always use the wrong model
estimation error: preference for y ∗ might be an artifact



Local Search

[Figure: iterates $y^0, y^1, y^2, y^3, \ldots, y^*$ in $\mathcal{Y}$, each shown with its neighborhood $N(y^0), N(y^1), N(y^2), N(y^3), \ldots, N(y^*)$]

$N_t : \mathcal{Y} \to 2^{\mathcal{Y}}$, neighborhood system
Optimization with respect to $N_t(y)$ must be tractable:

$y^{t+1} = \operatorname{argmax}_{y \in N_t(y^t)} g(x, y)$
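A generic local search loop can be sketched as follows. The neighborhood is passed in as a function returning a finite list of candidates; the single-flip neighborhood and the toy objective are assumptions made to keep the example runnable.

```python
def local_search(y0, g, neighborhood, max_iters=1000):
    """Move to the best neighbor until no neighbor improves the objective g."""
    y = y0
    for _ in range(max_iters):
        best = max(neighborhood(y), key=g)
        if g(best) <= g(y):
            return y          # local optimum with respect to the neighborhood
        y = best
    return y

def single_flip(y):
    """All binary vectors that differ from y in exactly one coordinate."""
    return [y[:i] + (1 - y[i],) + y[i + 1:] for i in range(len(y))]

def g(y):
    """Hypothetical objective: reward ones, penalize y_0 and y_1 both set."""
    return sum(y) - 3 * y[0] * y[1]

y_local = local_search((0, 0, 0), g, single_flip)
print(y_local, g(y_local))
```

The returned point is optimal only relative to the chosen neighborhood; larger neighborhoods give stronger guarantees at a higher cost per step.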




Example: Iterated Conditional Modes (ICM)
(Besag, 1986)

$g(x, y) = \log p(y|x)$
$y^* = \operatorname{argmax}_{y \in \mathcal{Y}} \log p(y|x)$
Neighborhoods $N_s(y) = \{ (y_1, \ldots, y_{s-1}, z_s, y_{s+1}, \ldots, y_S) \mid z_s \in \mathcal{Y}_s \}$

$y^{t+1} = \operatorname{argmax}_{y_1 \in \mathcal{Y}_1} \log p(y_1, y_2^t, \ldots, y_V^t | x)$
$y^{t+1} = \operatorname{argmax}_{y_2 \in \mathcal{Y}_2} \log p(y_1^t, y_2, y_3^t, \ldots, y_V^t | x)$
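A sketch of ICM for an Ising-type objective, where the single-variable maximization has a closed form. It assumes a symmetric J, edges listed once, and an objective of the form 1/2 Σ J_ij y_i y_j + Σ h_i y_i standing in for log p(y|x) up to a constant; the instance is made up.

```python
def icm(y0, h, J, edges, max_sweeps=100):
    """Coordinate-wise maximization: update one y_s at a time, others fixed."""
    y = list(y0)
    neighbors = {i: [] for i in range(len(y))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(max_sweeps):
        changed = False
        for s in range(len(y)):
            # The terms involving y_s are y_s * (h_s + 1/2 * sum_j J_sj y_j).
            field = h[s] + 0.5 * sum(J[s][j] * y[j] for j in neighbors[s])
            if field > 0:
                new = +1
            elif field < 0:
                new = -1
            else:
                new = y[s]    # tie: keep the current state
            if new != y[s]:
                y[s] = new
                changed = True
        if not changed:
            break             # no single-variable change improves the objective
    return y

# Hypothetical toy chain 0 - 1 - 2.
h = [0.3, -0.2, -1.0]
J = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
edges = [(0, 1), (1, 2)]
print(icm([+1, +1, +1], h, J, edges))
```

Each update can only increase the objective, so the sweep terminates, but in general the fixed point depends on the initialization and the update order.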




Neighborhood Size

ICM neighborhood Nt (y t ): all states reachable from y t by changing
a single variable (Besag, 1986)
Neighborhood size: in general, larger is better (VLSN, Ahuja, 2000)
Example: neighborhood along chains




Example: Block ICM
Block Iterated Conditional Modes (ICM)
(Kelm et al., 2006), (Kittler and Föglein, 1984)

$y^{t+1} = \operatorname{argmax}_{y_{C_1} \in \mathcal{Y}_{C_1}} \log p(y_{C_1}, y^t_{V \setminus C_1} | x)$
$y^{t+1} = \operatorname{argmax}_{y_{C_2} \in \mathcal{Y}_{C_2}} \log p(y_{C_2}, y^t_{V \setminus C_2} | x)$




Example: Multilabel Graph-Cut

Binary graph-cuts are not applicable to multilabel energy
minimization problems
(Boykov et al., 2001): two local search algorithms for multilabel
problems
Sequence of binary directed s-t-mincut problems
Iteratively improve multilabel solution




α-β Swap Neighborhood

Select two diﬀerent labels α and β
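Fix all variables i for which $y_i \notin \{\alpha, \beta\}$
Optimize over remaining i with $y_i \in \{\alpha, \beta\}$

$N_{\alpha,\beta} : \mathcal{Y} \times \mathbb{N} \times \mathbb{N} \to 2^{\mathcal{Y}}$,
$N_{\alpha,\beta}(y, \alpha, \beta) := \{ z \in \mathcal{Y} : z_i = y_i \text{ if } y_i \notin \{\alpha, \beta\}, \text{ otherwise } z_i \in \{\alpha, \beta\} \}$.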
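The swap neighborhood can be enumerated explicitly for small problems, which makes its size visible: 2^k labelings when k variables currently carry α or β. The sketch below does that and performs one swap move by brute force instead of a graph cut; the function names and the toy Potts-style energy are assumptions for illustration.

```python
from itertools import product

def swap_neighborhood(y, alpha, beta):
    """All z with z_i = y_i where y_i is neither alpha nor beta, else z_i in {alpha, beta}."""
    free = [i for i, yi in enumerate(y) if yi in (alpha, beta)]
    result = []
    for choice in product((alpha, beta), repeat=len(free)):
        z = list(y)
        for i, label in zip(free, choice):
            z[i] = label
        result.append(tuple(z))
    return result

def swap_move(y, alpha, beta, energy):
    """One alpha-beta swap by exhaustive search over the neighborhood."""
    return min(swap_neighborhood(y, alpha, beta), key=energy)

def energy(y):
    """Hypothetical chain energy: one unit per pair of unequal neighbors."""
    return sum(y[i] != y[i + 1] for i in range(len(y) - 1))

y = (0, 1, 2, 1, 3)
print(len(swap_neighborhood(y, 1, 2)), swap_move(y, 1, 2, energy))
```

Because the current labeling is itself a member of the neighborhood, a swap move never increases the energy.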




α-β-swap illustrated

[Figure panels: 5-label problem; α-β-swap]




α-β-swap derivation

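$$\begin{aligned}
y^{t+1} &= \operatorname{argmin}_{y \in N_{\alpha,\beta}(y^t, \alpha, \beta)} E(y; x) \\
&= \operatorname{argmin}_{y \in N_{\alpha,\beta}(y^t, \alpha, \beta)} \Big[ \sum_{i \in V} E_i(y_i; x) + \sum_{(i,j) \in E} E_{i,j}(y_i, y_j; x) \Big] \\
&= \operatorname{argmin}_{y \in N_{\alpha,\beta}(y^t, \alpha, \beta)} \Big[ \sum_{i \in V,\ y_i^t \notin \{\alpha,\beta\}} E_i(y_i^t; x) + \sum_{i \in V,\ y_i^t \in \{\alpha,\beta\}} E_i(y_i; x) \\
&\qquad + \sum_{(i,j) \in E,\ y_i^t \notin \{\alpha,\beta\},\ y_j^t \notin \{\alpha,\beta\}} E_{i,j}(y_i^t, y_j^t; x) + \sum_{(i,j) \in E,\ y_i^t \in \{\alpha,\beta\},\ y_j^t \notin \{\alpha,\beta\}} E_{i,j}(y_i, y_j^t; x) \\
&\qquad + \sum_{(i,j) \in E,\ y_i^t \notin \{\alpha,\beta\},\ y_j^t \in \{\alpha,\beta\}} E_{i,j}(y_i^t, y_j; x) + \sum_{(i,j) \in E,\ y_i^t \in \{\alpha,\beta\},\ y_j^t \in \{\alpha,\beta\}} E_{i,j}(y_i, y_j; x) \Big].
\end{aligned}$$

Constant: drop out (terms that depend only on the fixed labels $y^t$)
Unary: combine (terms with exactly one free variable)
Pairwise: binary pairwise (terms with two free variables)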



α-β-swap graph construction
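Directed graph $G' = (\mathcal{V}, \mathcal{E})$

$\mathcal{V} = \{\alpha, \beta\} \cup \{ i \in V : y_i \in \{\alpha, \beta\} \}$,

$$\begin{aligned}
\mathcal{E} =\ & \{ (\alpha, i, t_i^\alpha) : \forall i \in V : y_i \in \{\alpha, \beta\} \} \ \cup \\
& \{ (i, \beta, t_i^\beta) : \forall i \in V : y_i \in \{\alpha, \beta\} \} \ \cup \\
& \{ (i, j, n_{i,j}) : \forall (i,j), (j,i) \in E : y_i, y_j \in \{\alpha, \beta\} \}.
\end{aligned}$$

[Figure: terminal nodes α and β; nodes i, j, ..., k joined to α by edges $t_i^\alpha, t_j^\alpha, t_k^\alpha$, to β by edges $t_i^\beta, t_j^\beta, t_k^\beta$, and to each other by edges $n_{i,j}$]

Edge weights $t_i^\alpha$, $t_i^\beta$, and $n_{i,j}$

$n_{i,j} = E_{i,j}(\alpha, \beta; x)$
$t_i^\alpha = E_i(\alpha; x) + \sum_{(i,j) \in E,\ y_j \notin \{\alpha,\beta\}} E_{i,j}(\alpha, y_j; x)$
$t_i^\beta = E_i(\beta; x) + \sum_{(i,j) \in E,\ y_j \notin \{\alpha,\beta\}} E_{i,j}(\beta, y_j; x)$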



α-β-swap move

[Figure: a cut C through the auxiliary graph, separating terminal α from terminal β across the nodes i, j, ..., k]

Side of cut determines $y_i \in \{\alpha, \beta\}$
Iterate all possible $(\alpha, \beta)$ combinations
Semi-metric requirement on pairwise energies
$E_{i,j}(y_i, y_j; x) = 0 \Leftrightarrow y_i = y_j$
$E_{i,j}(y_i, y_j; x) = E_{i,j}(y_j, y_i; x) \geq 0$
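Whether a given pairwise table meets the semi-metric requirement can be checked directly. A small sketch, assuming the factor is stored as a square table `E[a][b]` over the label set; the example tables are hypothetical.

```python
def is_semimetric(E):
    """E[a][b] = 0 exactly when a = b, and E is symmetric and non-negative."""
    k = len(E)
    for a in range(k):
        for b in range(k):
            if (E[a][b] == 0) != (a == b):
                return False
            if E[a][b] != E[b][a] or E[a][b] < 0:
                return False
    return True

labels = range(4)
potts = [[0 if a == b else 1 for b in labels] for a in labels]
truncated_linear = [[min(abs(a - b), 2) for b in labels] for a in labels]
asymmetric = [[0 if a == b else (1 if a < b else 2) for b in labels] for a in labels]

print(is_semimetric(potts), is_semimetric(truncated_linear), is_semimetric(asymmetric))
```

Note that no triangle inequality is checked here: the two conditions listed on the slide are all that the test encodes.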



Example: Stereo Disparity Estimation

Infer depth from two images
Discretized multi-label problem
α-expansion solution close to optimal




Model Reduction

Energy minimization problem: many decisions to make jointly
Model reduction
1. Fix a subset of decisions
2. Optimize the smaller remaining model

Example: forcing yi = yj for pairs (i, j)
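For an Ising-type objective, forcing groups of variables to agree yields a smaller model of the same form. The sketch below assumes the objective g(y) = 1/2 Σ J_ij y_i y_j + Σ h_i y_i (maximized) and a hypothetical `group` assignment that ties variables together; names and numbers are illustrative.

```python
from itertools import product

def ising(y, h, J, edges):
    return (0.5 * sum(J[i][j] * y[i] * y[j] for i, j in edges)
            + sum(h[i] * y[i] for i in range(len(y))))

def reduce_model(h, J, edges, group):
    """Tie all variables with the same group index; return the reduced model."""
    m = max(group) + 1
    h_red = [0.0] * m
    J_red = {}
    const = 0.0
    for i, hi in enumerate(h):
        h_red[group[i]] += hi                  # fields of tied variables add up
    for i, j in edges:
        a, b = sorted((group[i], group[j]))
        if a == b:
            const += 0.5 * J[i][j]             # y_i * y_j = 1 inside a group
        else:
            J_red[(a, b)] = J_red.get((a, b), 0.0) + J[i][j]
    return h_red, J_red, const

def reduced_objective(z, h_red, J_red, const):
    return (const + sum(hr * za for hr, za in zip(h_red, z))
            + 0.5 * sum(c * z[a] * z[b] for (a, b), c in J_red.items()))

# Hypothetical 4-node chain, reduced to 2 merged variables.
h = [0.3, -0.2, -1.0, 0.4]
J = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
edges = [(0, 1), (1, 2), (2, 3)]
group = [0, 0, 1, 1]

h_red, J_red, const = reduce_model(h, J, edges, group)
best_full = max(ising(y, h, J, edges) for y in product((-1, 1), repeat=4))
best_reduced = max(reduced_objective(z, h_red, J_red, const)
                   for z in product((-1, 1), repeat=2))
print(best_reduced, best_full)
```

The reduced optimum can never exceed the full optimum, since the reduced model searches only labelings that respect the forced equalities.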




Example: Superpixels in Labeling Problems

Input image: 500-by-375 pixels (187,500 decisions)




Image with 149 superpixels (149 decisions)

