Introduction to Chainer:
A Flexible Framework for Deep Learning
2015-06-18 PFI/PFN Weekly Seminar
Seiya Tokui (Preferred Networks)
• Seiya Tokui @beam2d (Twitter, GitHub)
• Researcher at Preferred Networks
• Main focus: machine learning
– Learning to Hash (master's degree)
– Deep Learning, Representation Learning (current focus)
A Powerful, Flexible, and Intuitive Framework of Neural Networks
Today I will introduce:
• The features of Chainer
• How to use Chainer
• Some planned features
• (Slides in English, talk in Japanese)
Chainer: The Concept
Chainer is a framework of neural networks
• Official site:
• Repository:
• Provided as a Python library (PyPI: chainer)
• Main features
– Powerful: supports CUDA and multi-GPU computation
– Flexible: supports almost arbitrary architectures
– Intuitive: forward prop can be written as regular Python code
Elements of a neural network framework
• Multi-dimensional array implementations
• Layer implementations
– Called by various names (layers, modules, blocks, primitives, etc.)
– The smallest units of automatic differentiation
– Contain forward and backward implementations
• Optimizer implementations
• Other components (data loading scheme, training loop, etc.)
– These are also very important, though Chainer currently does not provide abstractions for them (future work)
Forward prop / Backprop
• Forward prop is how we want to process the input data
• Backprop computes the gradient of its output (the loss) with respect to the learnable parameters
• Given the backward procedures of all layers, backprop can be written as their combination (a.k.a. reverse-mode automatic differentiation)
[Figure: input → hidden → output, with the output and the ground truth fed to the loss function]
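As a rough illustration of the last bullet, the sketch below chains three hand-written scalar layers and calls their backward methods in reverse order. The class names Scale, Tanh, and SquaredError and the helper forward_backward are invented for this example; they are not part of Chainer.

```python
import math


class Scale:
    """Layer y = w * x with one learnable parameter w."""

    def __init__(self, w):
        self.w = w
        self.grad_w = 0.0

    def forward(self, x):
        self.x = x  # remember the input; backward needs it
        return self.w * x

    def backward(self, gy):
        self.grad_w = gy * self.x  # gradient for the parameter
        return gy * self.w         # gradient passed to the previous layer


class Tanh:
    """Parameter-free activation layer."""

    def forward(self, x):
        self.y = math.tanh(x)
        return self.y

    def backward(self, gy):
        return gy * (1.0 - self.y ** 2)


class SquaredError:
    """Loss (y - t) ** 2 for a scalar output y and ground truth t."""

    def forward(self, y, t):
        self.diff = y - t
        return self.diff ** 2

    def backward(self):
        return 2.0 * self.diff


def forward_backward(layers, loss_func, x, t):
    h = x
    for layer in layers:            # forward prop: input to output
        h = layer.forward(h)
    loss = loss_func.forward(h, t)
    g = loss_func.backward()
    for layer in reversed(layers):  # backprop: output back to input
        g = layer.backward(g)
    return loss


l1, l2 = Scale(0.5), Scale(-1.5)
loss = forward_backward([l1, Tanh(), l2], SquaredError(), x=2.0, t=1.0)
```

After the call, l1.grad_w and l2.grad_w hold the gradients of the loss for the two parameters; each layer only had to know its own local derivative.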
Backprop Implementation Paradigm (1)
Define-and-Run
• First, a computational graph is constructed. Then, it is repeatedly fed with minibatches to do forward/backward
• The computational graph can be seen as a program, and the forward/backward computation is done by its interpreter
◆ Caffe: the program is written in Prototxt
◆ Torch: the program is constructed by Lua scripts
◆ Theano-based frameworks: the program is constructed by Python scripts
Backprop Implementation Paradigm (2)
Define-and-Run (cont.)
• Pros
– (Almost) no need for memory management
– The computational graph can be implicitly optimized (cf. Theano)
• Cons
– The program is fixed within the training loop
– The interpreter must be capable of defining various forward computations, including control-flow statements like if and for
◆ Theano has dedicated functions for them (ifelse and scan), which are unintuitive and not Pythonic
– Network definition is hard to debug, since an error occurs during the forward computation, far apart from the network definition
Backprop Implementation Paradigm (3)
Define-by-Run
• The forward computation is written as regular program code with special variables and operators. Executing it performs the forward computation and constructs the graph at the same time (just by storing the order of operations).
• The graph is used for the backward computation.
• This paradigm enables us to use arbitrary control-flow statements in the forward computation
– No need for a mini language and its interpreter
• It also makes the forward computation intuitive and easy to debug
Backprop Implementation Paradigm (4)
Define-by-Run (cont.)
• The computational graph can be modified within each iteration
• Example: Truncated BPTT (BackProp Through Time)
– BPTT: Backprop on a recurrent net
– Truncated BPTT: Truncate the backprop at some time point
– Truncation is one type of modification of the computational graph
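To make the truncation concrete, the sketch below lists which time steps each backward pass would cover when the graph is cut every few steps. The function name truncated_bptt_windows and the parameter bprop_len are hypothetical names chosen for this example.

```python
def truncated_bptt_windows(seq_len, bprop_len):
    """Return (start, end) ranges of time steps, one per backward pass.

    Gradients flow only inside a range; the hidden state is still carried
    forward from one range to the next.
    """
    return [(start, min(start + bprop_len, seq_len))
            for start in range(0, seq_len, bprop_len)]


windows = truncated_bptt_windows(seq_len=10, bprop_len=4)
```

With full BPTT there would be a single range covering all ten steps; truncation trades gradient accuracy for bounded memory and computation per update.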
Features of Chainer
• Define-by-Run scheme
– Forward computation can contain any Python code
◆ if-else, for-else, break, continue, try-except-finally, list, dict, class, etc.
– Users can modify the graph within the loop
◆ E.g. truncation can be done by unchain_backward (which unchains the graph backward from some variable)
◆ See the tutorial on recurrent nets
• Predefined functions
• Supports GPU(s) via PyCUDA
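Example: Training a multi-layer perceptron in one page
Full code is in the tutorial and the example directory.
import chainer.functions as F
from chainer import FunctionSet, Variable, optimizers

# Model definition: three fully connected layers (784 -> 100 -> 100 -> 10)
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10))
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # register parameters and gradients

# Forward computation: returns the loss for inputs x and labels t
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop (n_epoch, N, and batchsize are defined elsewhere)
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(...)
        t = Variable(...)
        opt.zero_grads()      # reset gradients from the previous iteration
        loss = forward(x, t)
        loss.backward()       # backprop through the recorded graph
        opt.update()          # apply one SGD step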
Example: Recurrent net language model in one page
Full code is in the tutorial and the example directory.
# Model definition
model = FunctionSet(
    emb=F.EmbedID(1000, 100),
    x2h=F.Linear(100, 50),
    h2h=F.Linear(50, 50),
    h2y=F.Linear(50, 1000))
opt = optimizers.SGD()

# Forward computation of one step
def fwd1step(h, w, t):
    x = F.tanh(model.emb(w))
    h = F.tanh(model.x2h(x) + model.h2h(h))
    y = model.h2y(h)
    return h, F.softmax_cross_entropy(y, t)

# Full RNN forward computation
def forward(seq):
    h = Variable(...)  # init state
    loss = 0
    # Each word is paired with the next word, which serves as its target
    for curw, nextw in zip(seq, seq[1:]):
        x = Variable(curw)
        t = Variable(nextw)
        h, new_loss = fwd1step(h, x, t)
        loss += new_loss
    return loss
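The loop in forward relies on zip(seq, seq[1:]) to pair every word with its successor. A small stand-alone check, using made-up word IDs, shows what the pairs look like:

```python
seq = [3, 7, 7, 1, 4]  # hypothetical word IDs of one sentence

# (current word, next word) pairs; the next word is the prediction target.
pairs = list(zip(seq, seq[1:]))
```

A sequence of n words yields n - 1 training pairs, because the last word has no successor to predict.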
Chainer: How to Use It
Install Chainer
• Prepare a Python 2.7 environment with pip
– (Pyenv+)Anaconda is recommended
• Install Chainer just by
    pip install chainer
• If you want to use GPU(s), do:
– Install CUDA and the corresponding NVIDIA driver
– Install dependent packages by
    pip install chainer-cuda-deps
– You may have to update the six package
    pip install -U six
Run the MNIST example (quick start)
• Requires scikit-learn: pip install scikit-learn
• Clone the repository of Chainer:
    git clone
• Go to the example directory at examples/mnist
• Then, run python
– Run on GPU by passing --gpu=0
• Other examples can be executed similarly (some need manual preparation of datasets)
Read the documentation
• Read the documentation at
• It includes:
– Tutorial
– Reference manual
• All features covered in this talk are introduced in the tutorial, so please try it if you want to know the details.
Basic concepts (1)
• Essential part of Chainer: Variable and Function
• Variable is a wrapper of n-dimensional arrays (ndarray and GPUArray)
• Function is an operation on Variables
– Function application is memorized by the returned Variable(s)
– All operations for which you want to backprop must be done by Functions on Variables
• Making a Variable object is simple: just pass an array
    x = chainer.Variable(numpy.ndarray(...))
– The array is stored in the data attribute
Basic concepts (2)
• Example of the computational graph construction
    x = chainer.Variable(...)
    y = chainer.Variable(...)
    z = x**2 + 2*x*y + y
• The gradient of z(x, y) can be computed by z.backward()
• Results are stored in x.grad and y.grad
[Figure: computational graph of z, built from the nodes _ ** 2, 2 * _, _ * _, and _ + _]
Note: Split nodes are automatically inserted (they accumulate the gradients on backprop)
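For scalar x and y, the values that backprop should produce for this z can be written down by hand and checked numerically. The sketch below uses plain Python floats rather than Chainer Variables; the function names are chosen for this example only.

```python
def z(x, y):
    return x ** 2 + 2 * x * y + y


def analytic_grads(x, y):
    # x is used twice, so its gradient is the sum of both contributions
    # (this summation is the accumulation mentioned in the note above).
    dz_dx = 2 * x + 2 * y  # from x**2 and from 2*x*y
    dz_dy = 2 * x + 1      # from 2*x*y and from the trailing y
    return dz_dx, dz_dy


def numeric_grads(x, y, eps=1e-6):
    # Central differences, used only as an independent check.
    dz_dx = (z(x + eps, y) - z(x - eps, y)) / (2 * eps)
    dz_dy = (z(x, y + eps) - z(x, y - eps)) / (2 * eps)
    return dz_dx, dz_dy


gx, gy = analytic_grads(3.0, -2.0)
nx, ny = numeric_grads(3.0, -2.0)
```

At x = 3 and y = -2 the analytic gradients are 2 and 7, and the finite-difference estimates agree up to rounding error.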
Basic concepts (3)
• Chainer provides many functions in the chainer.functions subpackage
– This package is often abbreviated to F
• Parameterized functions are provided as classes
– Linear, Convolution2D, EmbedID, PReLU, BatchNormalization, etc.
– Their instances should be shared across all iterations
• Non-parameterized functions are provided as Python functions
– Activation functions, pooling, array manipulation, etc.
Basic concepts (4)
• Use FunctionSet to manage parameterized functions
– It is an object with Function attributes
– Easy to migrate functions onto GPU devices
– Easy to collect parameters and gradients (collect_parameters)
• Use Optimizer for numerical optimization
– Major algorithms are provided: SGD, MomentumSGD, AdaGrad, RMSprop, ADADELTA, Adam
– Some parameter/gradient manipulations are done via this class: weight decay, gradient clipping, etc.
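The parameter and gradient manipulations named above can be sketched generically on flat lists of floats. These helper functions are illustrative only; their names and signatures are assumptions for this example, not Chainer's Optimizer interface.

```python
import math


def weight_decay(params, grads, decay):
    """Add the L2-regularization term decay * p to each gradient."""
    return [g + decay * p for p, g in zip(params, grads)]


def clip_grads(grads, max_norm):
    """Rescale the gradients so that their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_update(params, grads, lr):
    """Plain SGD step: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


params = [1.0, -2.0]
grads = [3.0, 4.0]                          # global L2 norm is 5
clipped = clip_grads(grads, max_norm=1.0)   # direction kept, norm reduced to 1
new_params = sgd_update(params, clipped, lr=0.5)
```

Keeping these steps separate from the update rule is what lets the same manipulations be combined with any of the listed algorithms.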
Easy to debug!
• If the forward computation has a bug, then an error occurs immediately at the appropriate line of the forward definition
• Example
– This code has an inconsistency in the array sizes:
    x = Variable(np.ndarray((3, 4), dtype=np.float32))
    y = Variable(np.ndarray((3, 3), dtype=np.float32))
    a = x ** 2 + x
    b = a + y * 2  # <- an exception is raised at this line
    c = b + x * 2
– Since an exception is raised at the appropriate line, we can easily find the cause of the bug (this is one big difference from Define-and-Run)
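The same behaviour can be reproduced without any framework. The sketch below uses a hypothetical Shaped class that tracks only array shapes (no broadcasting) and raises as soon as two shapes disagree, so the Python traceback points at the offending statement of the forward code.

```python
class Shaped:
    """Hypothetical array stand-in that carries a shape and no data."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    def _elementwise(self, other):
        # Scalars are always compatible; arrays must have equal shapes.
        if isinstance(other, Shaped) and other.shape != self.shape:
            raise ValueError(
                "shape mismatch: %s vs %s" % (self.shape, other.shape))
        return Shaped(self.shape)

    __add__ = __mul__ = _elementwise

    def __pow__(self, exponent):
        return Shaped(self.shape)


def forward():
    x = Shaped((3, 4))
    y = Shaped((3, 3))
    a = x ** 2 + x
    b = a + y * 2  # the mismatch surfaces here, while this line executes
    c = b + x * 2
    return c


error_message = None
try:
    forward()
except ValueError as error:
    error_message = str(error)
```

Because the check runs while the forward code executes, the error is reported at the statement that combines the two arrays rather than later, inside a separate interpreter.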
Graph manipulation (1)
• Backward unchaining: y.unchain_backward()
– It purges the nodes backward from y
– It is useful for implementing truncated BPTT (see the PTB example)
[Figure: before unchaining, x → f → y → g → z; after y.unchain_backward(), y → g → z]
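The effect of unchaining can be imitated on a toy linear chain. The Node class below is a hypothetical stand-in whose unchain_backward simply drops the link to the node's creator; it only mimics the idea and is not Chainer's implementation.

```python
class Node:
    """Hypothetical variable node that remembers how it was produced."""

    def __init__(self, value, parent=None, local_grad=1.0):
        self.value = value
        self.parent = parent          # the input this node was computed from
        self.local_grad = local_grad  # derivative of value w.r.t. the parent
        self.grad = 0.0

    def unchain_backward(self):
        # Sketch: forget everything that lies before this node.
        self.parent = None


def f(v):
    return Node(2.0 * v.value, parent=v, local_grad=2.0)


def g(v):
    return Node(3.0 * v.value, parent=v, local_grad=3.0)


def backward(out):
    out.grad = 1.0
    node = out
    while node.parent is not None:
        node.parent.grad += node.grad * node.local_grad
        node = node.parent


x = Node(1.0)
y = f(x)              # x -> f -> y
z = g(y)              # y -> g -> z
y.unchain_backward()  # cut the graph behind y
backward(z)           # reaches y, but no longer x
```

After the cut, y still receives its gradient from z while x receives none, which is the behaviour truncated BPTT needs at a truncation point.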
Graph manipulation (2)
• Volatile variables: x = Variable(..., volatile=True)
– A volatile variable does not build a graph
– Volatility can be accessed directly via x.volatile
    x = Variable(..., volatile=True)
    y = f(x)
    y.volatile = False
    z = h(y)
[Figure: x → f → y → g → z]
CUDA support (1)
• Chainer supports CUDA computation
• Installation
– Install CUDA 6.5+
– Install CUDA-related packages by
    pip install chainer-cuda-deps
◆ The build of PyCUDA may fail if CUDA is installed in a non-standard path. In that case, you have to install PyCUDA from source with the appropriate configuration.
CUDA support (2)
• Call cuda.init() before any CUDA-related operations
• Convert numpy.ndarray into GPUArray by chainer.cuda.to_gpu
    data_gpu = chainer.cuda.to_gpu(data_cpu)
• A GPUArray object can be passed to the Variable constructor
    x = Variable(data_gpu)
• Most functions support GPU Variables
– Parameterized functions must be sent to the GPU beforehand by Function.to_gpu or FunctionSet.to_gpu
• Extract the results to host memory by chainer.cuda.to_cpu
• All examples support CUDA (pass --gpu=N, where N is the GPU ID)
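MLP example for CUDA
import chainer.functions as F
from chainer import cuda, FunctionSet, Variable, optimizers

cuda.init()  # must precede any other CUDA-related operation

# Model definition (to_gpu moves the parameters to the GPU)
model = FunctionSet(
    l1=F.Linear(784, 100),
    l2=F.Linear(100, 100),
    l3=F.Linear(100, 10)).to_gpu()
opt = optimizers.SGD()
opt.setup(model.collect_parameters())  # set up after the move to the GPU

# Forward computation (unchanged from the CPU version)
def forward(x, t):
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t)

# Training loop: each minibatch is sent to the GPU first
for epoch in xrange(n_epoch):
    for i in xrange(0, N, batchsize):
        x = Variable(cuda.to_gpu(...))
        t = Variable(cuda.to_gpu(...))
        opt.zero_grads()
        loss = forward(x, t)
        loss.backward()
        opt.update()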
CUDA support (3)
• Chainer also supports computation on multiple GPUs (easily!)
• Model parallel
– Send FunctionSets to the appropriate devices (to_gpu accepts a GPU ID)
    model_0 = FunctionSet(...).to_gpu(0)
    model_1 = FunctionSet(...).to_gpu(1)
– Copy Variable objects across GPUs with the copy function
    x_1 = F.copy(x_0, 1)
◆ This copy is tracked by the computational graph, so you don't need to deal with it on backprop
CUDA support (4)
• Chainer also supports computation on multiple GPUs
• Data parallel
– FunctionSet can be copied by copy.copy
    model = FunctionSet(...)
    model_0 = copy.copy(model).to_gpu(0)
    model_1 = model.to_gpu(1)
– Set up the optimizer only for the master model
– After data-parallel gradient computation, gather the gradients into the master model
– After the update, share the updated parameters across the model copies
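The gather, update, and share steps can be illustrated without GPUs. In this sketch each "replica" is just a copy of one scalar weight and the minibatch is split in half; the data, the loss, and all names are invented for the example.

```python
def grad_squared_error(w, batch):
    """Gradient of sum((w * x - t) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - t) * x for x, t in batch)


batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]  # (input, target)
master_w = 0.0
lr = 0.01

# Every replica starts from the master parameter and sees half of the data.
replica_ws = [master_w, master_w]
shards = [batch[:2], batch[2:]]
replica_grads = [grad_squared_error(w, shard)
                 for w, shard in zip(replica_ws, shards)]

# Gather: sum the per-replica gradients into the master model.
master_grad = sum(replica_grads)

# Update: only the master model runs the optimizer step.
master_w -= lr * master_grad

# Share: copy the updated parameter back to every replica.
replica_ws = [master_w for _ in replica_ws]
```

Because the loss is a sum over examples, the gathered gradient equals the gradient of the whole minibatch computed on a single device, so the update is the same as in the single-device case.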
Model Zoo support (in the near future)
• Model Zoo is a place where pretrained models are registered
– Provided by the BVLC Caffe team
– It contains the Caffe reference models
• We are planning to support the Caffe reference models in three weeks (the next minor release)
– Current design (it may be changed):
    f = CaffeFunction('path/to/model.caffemodel')
    x, t = Variable(...), Variable(...)
    y = f(inputs={'data': x, 'label': t}, outputs=['loss'])
– It emulates Caffe networks using Chainer's functions
Note: development process
• Schedule
– We are planning to release updates biweekly
– Updates are classified into three groups
◆ Revision: bug fixes and updates that do not add or modify interfaces
◆ Minor: updates that add or modify interfaces without breaking backward compatibility
◆ Major: updates that are not backward-compatible
• We are using the GitHub-flow process
• We welcome your PRs!
– Please send them to the master branch
Wrap up
• Chainer is a powerful, flexible, and intuitive framework of neural networks in Python
• It is based on the Define-by-Run scheme, which makes it intuitive and flexible
• Chainer is a very young and still immature project
– Its development started in mid-April (just two months ago)
– We will add many functionalities (especially more functions)
– We may add some abstraction of whole learning processes

