Data Pipelining – A New Approach to Data Science
Big Data Week, London, November 2015

Version v0.2
Copyright notice

This document is Copyright DataStax Inc. 2015. The document may be redistributed, in electronic or hardcopy format, on the understanding that the contents remain the property of DataStax Inc. The document may not, under any circumstance, be sold or distributed for compensation of any kind. All information is provided "as is"; there is no warranty, either implicit or explicit, that the information is correct or suitable for any purpose.
 
	
Table of Contents

1     DOCUMENT INFORMATION AND HISTORY
1.1   DOCUMENT INFORMATION
1.2   VERSION HISTORY
2     INTRODUCTION
3     REFERENCES
3.1   ADAM
3.2   SPARK NOTEBOOK
4     BUILDING & RUNNING THE HUMAN GENOME BIG DATA PIPELINE DEMO
4.1   GENERAL PRE-REQUISITES
4.2   GENERAL OS PRE-REQUISITES (UBUNTU)
4.3   OS-SPECIFIC PRE-REQUISITES
4.4   INSTALL DOCKER
4.5   CLONE THE PIPELINE GITHUB REPO
4.6   CREATE THE DEVOXX CONTAINER IMAGE
4.7   RUN THE DEVOXX IMAGE AS A CONTAINER
4.8   SETUP THE DEVOXX DEMO
4.9   RUN THE DEVOXX BIG DATA PIPELINING DEMO
4.10  ACCESS CASSANDRA FROM OUTSIDE THE CONTAINER
4.11  VISUALISE THE DATA USING AKKA AND A REST INTERFACE
 
	
1 DOCUMENT INFORMATION AND HISTORY

1.1 DOCUMENT INFORMATION

Document name:     Big Data Week London - Big Data Pipelining 0.2.docx
Document authors:  Simon Ambridge
Original date:     20/11/2015
Purpose:           A description of the steps required to build and run the Human Genome
                   Big Data Pipeline container created and distributed by Data Fellas, Belgium.

1.2 VERSION HISTORY

Version   Date         Changed By       Changes
0.1       20/11/2015   Simon Ambridge   Initial draft
0.2       26/11/2015   Simon Ambridge   First published draft (internal)
2 INTRODUCTION

The objective of the one-hour presentation at Big Data Week 2015 in London, in conjunction with these notes, is to introduce the audience to a demonstration environment that was used as the basis of a half-day workshop presented by a team led by Andy Petrella from Data Fellas at Devoxx in November 2015.

The takeaway at the end of this session will be a better understanding of how to build a data pipeline using modern distributed and scalable technologies, in conjunction with the Data Fellas Spark-Notebook, and of how that combination supports a reproducible data pipelining approach.

3 REFERENCES
 
	
3.1 ADAM

ADAM is described here:
http://www.lexjansen.com/pharmasug/2014/DS/PharmaSUG-2014-DS11.pdf
The Clinical Data Interchange Standards Consortium (CDISC) Analysis Data Model (ADaM).
3.2 SPARK NOTEBOOK

From an interview with Andy Petrella by Typesafe Inc.:

"Spark-Notebook from Data Fellas lets you use Apache Spark in your browser and is purposed with creating reproducible analysis using Scala, Apache Spark and other technologies.

Data science was originally focused on producing static products (reports, models, ...) based on samples of the available data at the time of analysis. Nowadays, the results need to be reactive with the data flow, requiring new data types. Also, the data sizes can be really big. This explains the rise of distributed computing and online analysis, the union of which could be thought of as the Reactive Data Science Pipeline. However, such a pipeline requires many skill sets, including data science, operations, software engineering, domain knowledge, and others.

At Data Fellas, we are building Shar3, the ultimate toolkit that aims to build Reactive data science pipelines by reducing the friction between the different building phases.

Shar3 is composed of notable OSS technologies like Apache Avro, Apache Mesos, Apache Cassandra, Apache Spark, Typesafe Reactive Platform (Scala, Akka, Play), Spark Notebook and more. These components were chosen with a strong focus on scalability and the capacity to reactively adapt to their ever-changing production environments."

https://www.typesafe.com/blog/scala-and-spark-notebook-the-next-generation-data-science-toolkit
https://github.com/andypetrella/spark-notebook
4 BUILDING & RUNNING THE HUMAN GENOME BIG DATA PIPELINE DEMO

4.1 GENERAL PRE-REQUISITES
 
	
Required machine spec: 3 cores, 5GB

Provision a host:

• Linux machine
  http://www.ubuntu.com/download/desktop
• Create a VM (e.g. Ubuntu)
  http://virtualboxes.org/images/ubuntu/
  http://www.osboxes.org/ubuntu/
4.2 GENERAL OS PRE-REQUISITES (UBUNTU)

Task duration: 5 minutes

These instructions are taken from https://docs.docker.com/installation/ubuntulinux/

Docker's apt repository contains Docker 1.5.0 and higher. To set apt to use packages from the new repository:

1. If you haven't already done so, log into your Ubuntu instance.

2. Open a terminal window.

3. Add the new gpg key as a privileged user (e.g. as root or via sudo).

$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

4. Identify your OS – for example on Ubuntu check the /etc/os-release file.

5. Open the docker.list file in your favourite editor and remove any existing entries. If the file doesn't exist, create it.
 
	
$ sudo vi /etc/apt/sources.list.d/docker.list

6. Add the appropriate entry for your OS (only the line matching your release), e.g. for Precise:

# Ubuntu Precise 12.04 (LTS)
deb https://apt.dockerproject.org/repo ubuntu-precise main
# Ubuntu Trusty 14.04 (LTS)
deb https://apt.dockerproject.org/repo ubuntu-trusty main
# Ubuntu Vivid 15.04
deb https://apt.dockerproject.org/repo ubuntu-vivid main
# Ubuntu Wily 15.10
deb https://apt.dockerproject.org/repo ubuntu-wily main

7. Save and close the /etc/apt/sources.list.d/docker.list file.

8. Update the apt package index:

$ sudo apt-get update

NB: if you get a GPG error such as

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY <some key>

fix it using:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <some key>

and re-run apt-get update.

9. Purge the old repo if it exists:

$ sudo apt-get purge lxc-docker*

10. Verify that apt is pulling from the right repository:

$ apt-cache policy docker-engine

Henceforth apt-get upgrade will pull from the correct repository.
 
	
4.3 OS-SPECIFIC PRE-REQUISITES

Task duration: configuration dependent

For example, for Ubuntu Precise, Docker requires kernel version 3.13 or later. If the kernel version is older than 3.13, it must be upgraded.

Check with the uname command:

$ uname -r

In my case the kernel version is OK.
4.4 INSTALL DOCKER

Task duration: 5 minutes

1. Update the package index:

$ sudo apt-get update

2. Install the docker package:

$ sudo apt-get install docker-engine

3. Start the docker daemon (it should already be running):

$ sudo service docker start

4. Check it is working correctly:

$ sudo docker run hello-world
 
	
5. Add the login user to the docker group to avoid having to use sudo for each command:

$ sudo useradd -G docker <myuserid>

or, if the user already exists:

$ sudo usermod -aG docker <myuserid>

6. Log out and back in – check docker can be run without sudo:

$ docker run hello-world
4.5 CLONE THE PIPELINE GITHUB REPO

Task duration: 2 mins
 
	
Clone the pipeline repo:

$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git
4.6 CREATE THE DEVOXX CONTAINER IMAGE

Task duration: 20 mins

At this point we only have the 'hello-world' container image locally.

1. Use the docker images command to list the available images:

$ docker images

Docker images are stored locally under /var/lib/docker.

2. Pull the Devoxx image from the Docker hub:

$ docker pull xtordoir/pipeline
 
	
Now check the available images again using the 'docker images' command.

We have used approximately 3GB of disk space in /var/lib/docker.
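If you want to confirm the space used on the host, you can check the size of Docker's storage directory directly (root is required to read it):

$ sudo du -sh /var/lib/docker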
  
	
  
	
  
4.7 RUN THE DEVOXX IMAGE AS A CONTAINER

Task duration: 5 mins

1. To run the Devoxx image as a container in the foreground, we will use the following command:

docker run -it -m 8g -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 -p 37979:7979 -p 38989:8989 xtordoir/pipeline bash
Where:

-it : for interactive processes (like a shell), you must use -i and -t together in order to allocate a tty for the container process.
-m : allow a maximum of 8GB of memory for the container.
-p : a list of host-to-container port mappings.
xtordoir/pipeline : the image to run.
bash : the command to run in the container.

2. For convenience, place the run command in a script, e.g.

$ vi ./run_docker.sh

Make it executable:

$ chmod 755 ./run_docker.sh
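A minimal run_docker.sh is simply the run command above wrapped in a script, split across lines for readability:

#!/bin/bash
# Run the Devoxx pipeline image in the foreground, with the full set of port mappings
docker run -it -m 8g \
  -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
  -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
  -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
  -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
  -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
  -p 37979:7979 -p 38989:8989 \
  xtordoir/pipeline bash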
  
	
  
3. Run the command or the script:

$ ./run_docker.sh

4. From now on we are inside the docker instance at the '#' prompt.

Ignore the warning message regarding lack of swap – this is what we want for Cassandra.

Creating a container has used additional space in /var/lib/docker on the host.

5. We can list running containers using 'docker ps'.
 
	
6. If we exit the container it continues to exist, but stops running, and 'docker ps' returns nothing.

We can see all containers, including stopped ones, with 'docker ps -a'.

7. We don't need the old copies of the hello-world container – we can delete them (with 'docker rm') using their container name or container ID.

8. We can re-start the stopped container using 'docker start':

$ docker start -a -i adoring_torvalds

9. We can attach to a running container using 'docker attach'.

10. To detach the tty without exiting the shell, use the escape sequence Ctrl-p + Ctrl-q.
 
	
4.8 SETUP THE DEVOXX DEMO

Task duration: 2 mins

Ensure that you are inside the container.

1. Setup the services:

$ cd pipeline
$ source devoxx-setup.sh      # ignore Cassandra errors

2. The cassandra.yaml file (at /root/apache-cassandra-2.2.0/conf/cassandra.yaml) is updated at container creation time with the necessary parameters, e.g. listen_address (localhost) and rpc_interface (eth0), needed to make Cassandra available outside the container. The rpc port (9160) and native transport port (9042) are also mapped in the container run command.
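For reference, the relevant cassandra.yaml settings amount to something like the following sketch (check the file inside the container for the exact values):

listen_address: localhost
rpc_interface: eth0
rpc_port: 9160
native_transport_port: 9042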
  
	
  
3. After a little time, all services should be up, e.g. Cassandra:

$ cqlsh `hostname`

NB: at this point the pipeline keyspace does not yet exist in Cassandra.
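A quick way to confirm that Cassandra is up but the demo keyspace has not yet been created is to list the keyspaces from that cqlsh session; 'pipeline' should be absent:

cqlsh> DESCRIBE KEYSPACES;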
  
	
  
4.9 RUN THE DEVOXX BIG DATA PIPELINING DEMO

Task duration: 30 mins

1. Check that the Spark Notebook is available in the browser:
 
	
http://localhost:39000/tree/pipeline    <-- Spark Notebook
http://localhost:36060/                 <-- Spark Master
http://localhost:34040/                 <-- Spark Worker

2. Run through the notebooks 'AdamToDataframe' and 'AggregateAndSaveToCassandra'. At the end of the second notebook the data will be available in Cassandra.
	
  
3. We use the Dataframe save method to save the data to Cassandra:

allPopulations
    .withColumn("population", lit("ALL"))
    .withColumnRenamed("refCnt", "ref_cnt")
    .withColumnRenamed("altCnt", "alt_cnt")
    .write.format("org.apache.spark.sql.cassandra")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .options(Map("keyspace" -> "pipeline", "table" -> "pop_allele_count"))
    .save()
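To verify the write from Spark rather than from cqlsh, the same connector can read the table back into a Dataframe. This is only a sketch, assuming a sqlContext is in scope as it is in the notebooks:

val saved = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "pipeline", "table" -> "pop_allele_count"))
    .load()                 // Dataframe backed by the Cassandra table
saved.filter("population = 'ALL' AND chromosome = '22'").show(10)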
  
	
  
However, we could instead have used the RDD saveToCassandra method:

a. Convert the Dataframes to RDDs:

val countsByPop = byPopulation.rdd.collect {
    case Row(pop: String, chr: String, start: Long,
             ref: String, alt: String, refcnt: Long,
             altcnt: Long) =>
      PopAlleleCount(pop, chr, start, ref, alt, refcnt, altcnt)
}

val countAll = allPopulations.rdd.collect {
    case Row(chr: String, start: Long, ref: String,
             alt: String, refcnt: Long, altcnt: Long) =>
      PopAlleleCount("ALL", chr, start, ref, alt, refcnt, altcnt)
}

b. Save the RDDs to Cassandra:

countsByPop.saveToCassandra("pipeline", "pop_allele_count",
    SomeColumns("population", "chromosome", "start", "ref", "alt", "refcnt", "altcnt"))

countAll.saveToCassandra("pipeline", "pop_allele_count",
    SomeColumns("population", "chromosome", "start", "ref", "alt", "refcnt", "altcnt"))
  
	
  
4. We can then see the processed data in Cassandra from cqlsh inside the container:

cqlsh> SELECT COUNT(*) FROM pop_allele_count;

We can run a typical query on the data:

cqlsh> SELECT * FROM pop_allele_count WHERE population = 'ALL' AND chromosome = '22' AND start >= 16500000 AND start < 16750000;
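For the range query above to work without ALLOW FILTERING, the table has to be keyed on population, chromosome and start in a way that supports equality on the first two and a range on start. The actual DDL is created by the demo notebooks; one plausible layout, given here only as an assumption, is:

CREATE KEYSPACE IF NOT EXISTS pipeline
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS pipeline.pop_allele_count (
    population  text,
    chromosome  text,
    start       bigint,
    ref         text,
    alt         text,
    refcnt      bigint,    -- the Dataframe path above renames these to ref_cnt/alt_cnt
    altcnt      bigint,
    PRIMARY KEY ((population, chromosome), start, ref, alt)
);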
  
	
  
	
  
	
  
 
	
4.10 ACCESS CASSANDRA FROM OUTSIDE THE CONTAINER

Task duration: 20 mins

Because we mapped the Cassandra ports when we started the container, we can also access the data in Cassandra from outside the container.

1. First we need to install a compatible version of Cassandra – we will use DataStax Community Edition 2.2.0, which we can get from http://downloads.datastax.com/community

2. Unzip and extract the downloaded archive:

$ gunzip Downloads/dsc-cassandra-2.2.0-bin.tar.gz
$ cd pipeline
$ tar xvf ~/Downloads/dsc-cassandra-2.2.0-bin.tar

3. Run cqlsh and use the IP address of the container.
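As a sketch, one way to do this is to ask Docker for the container's IP address on the host and point the freshly extracted cqlsh at it (adoring_torvalds is the example container name used earlier, and the extraction path is assumed to be ./dsc-cassandra-2.2.0):

$ docker inspect --format '{{ .NetworkSettings.IPAddress }}' adoring_torvalds
$ ./dsc-cassandra-2.2.0/bin/cqlsh <container IP address>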
  
	
  
	
  
	
  
 
	
4.11 VISUALISE THE DATA USING AKKA AND A REST INTERFACE

Task duration: 2 mins

1. In the terminal:

cd ~/pipeline/rest-api && sed -i s/${IP_eth0}/${IP_eth0}/ src/main/resources/application.conf && sbt run

This starts the REST service in Akka HTTP, reading the data from Cassandra and serving it as JSON.

2. Open the Rest Call notebook and execute it. This one uses Akka HTTP (as a client this time) to access the REST service and plot some data.
(There is currently a bug in this step, under investigation.)

3. Open the Rest Call (using HTML form) notebook. It is essentially the same as above, but presents the REST calls in an HTML form instead of as a query parameter in a String.
Click "Range query over a population for a chromosome" and then click "Change".
	
  
	
  

Más contenido relacionado

La actualidad más candente

DX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workDX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workPrincipled Technologies
 
Backup Navigator install and configuration guide
Backup Navigator install and configuration guideBackup Navigator install and configuration guide
Backup Navigator install and configuration guideAndrey Karpov
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark Cynthia Saracco
 
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Principled Technologies
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab Cynthia Saracco
 
Hpe Zero Downtime Administrator's Guide
Hpe Zero Downtime Administrator's GuideHpe Zero Downtime Administrator's Guide
Hpe Zero Downtime Administrator's GuideAndrey Karpov
 
Hpe data protector deduplication
Hpe data protector deduplicationHpe data protector deduplication
Hpe data protector deduplicationAndrey Karpov
 
Oracle 11gr2 on_rhel6_0 - Document from Red Hat Inc
Oracle 11gr2 on_rhel6_0 - Document from Red Hat IncOracle 11gr2 on_rhel6_0 - Document from Red Hat Inc
Oracle 11gr2 on_rhel6_0 - Document from Red Hat IncFilipe Miranda
 
Hpe Data Protector Disaster Recovery Guide
Hpe Data Protector Disaster Recovery GuideHpe Data Protector Disaster Recovery Guide
Hpe Data Protector Disaster Recovery GuideAndrey Karpov
 
Automate DG Best Practices
Automate DG  Best PracticesAutomate DG  Best Practices
Automate DG Best PracticesMohsen B
 
DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16
 DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16  DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16
DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16 Sunny U Okoro
 

La actualidad más candente (12)

DX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to workDX2000 from NEC lets you put big data to work
DX2000 from NEC lets you put big data to work
 
Backup Navigator install and configuration guide
Backup Navigator install and configuration guideBackup Navigator install and configuration guide
Backup Navigator install and configuration guide
 
Big Data: Working with Big SQL data from Spark
Big Data:  Working with Big SQL data from Spark Big Data:  Working with Big SQL data from Spark
Big Data: Working with Big SQL data from Spark
 
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
 
Db2 bp hadr_1111
Db2 bp hadr_1111Db2 bp hadr_1111
Db2 bp hadr_1111
 
Big Data: HBase and Big SQL self-study lab
Big Data:  HBase and Big SQL self-study lab Big Data:  HBase and Big SQL self-study lab
Big Data: HBase and Big SQL self-study lab
 
Hpe Zero Downtime Administrator's Guide
Hpe Zero Downtime Administrator's GuideHpe Zero Downtime Administrator's Guide
Hpe Zero Downtime Administrator's Guide
 
Hpe data protector deduplication
Hpe data protector deduplicationHpe data protector deduplication
Hpe data protector deduplication
 
Oracle 11gr2 on_rhel6_0 - Document from Red Hat Inc
Oracle 11gr2 on_rhel6_0 - Document from Red Hat IncOracle 11gr2 on_rhel6_0 - Document from Red Hat Inc
Oracle 11gr2 on_rhel6_0 - Document from Red Hat Inc
 
Hpe Data Protector Disaster Recovery Guide
Hpe Data Protector Disaster Recovery GuideHpe Data Protector Disaster Recovery Guide
Hpe Data Protector Disaster Recovery Guide
 
Automate DG Best Practices
Automate DG  Best PracticesAutomate DG  Best Practices
Automate DG Best Practices
 
DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16
 DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16  DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16
DB Develop 2 Oracle 12c, DB2, MYSQL, SQL Anywhere 16
 

Destacado

Lesson 12 | Primary | Sabbath School
Lesson 12 | Primary | Sabbath SchoolLesson 12 | Primary | Sabbath School
Lesson 12 | Primary | Sabbath Schooljespadill
 
Steel plant equipment in india
Steel plant equipment in indiaSteel plant equipment in india
Steel plant equipment in indiadealnity
 
The Mousy King a pippet play written by Konstantin Iliev
The Mousy King a pippet play written by Konstantin IlievThe Mousy King a pippet play written by Konstantin Iliev
The Mousy King a pippet play written by Konstantin IlievДесислава Тенева
 
Nanotecnologia en la seguridad
Nanotecnologia en la seguridadNanotecnologia en la seguridad
Nanotecnologia en la seguridadKevin Cataño S
 
Sothea paper_JSIDRE
Sothea paper_JSIDRESothea paper_JSIDRE
Sothea paper_JSIDRERFDMC/MRC
 
FR_Brochure_Adobe_Marketing_Cloud_2015.PDF
FR_Brochure_Adobe_Marketing_Cloud_2015.PDFFR_Brochure_Adobe_Marketing_Cloud_2015.PDF
FR_Brochure_Adobe_Marketing_Cloud_2015.PDFOlivier BINISTI
 
Joint Stock Companies & Limited Liability Companies
Joint Stock Companies & Limited Liability CompaniesJoint Stock Companies & Limited Liability Companies
Joint Stock Companies & Limited Liability CompaniesMelis Buhan Öncel
 
Lesson 12 | Cornerstone Connections | Sabbath School
Lesson 12 | Cornerstone Connections | Sabbath SchoolLesson 12 | Cornerstone Connections | Sabbath School
Lesson 12 | Cornerstone Connections | Sabbath Schooljespadill
 
A'Levels & B.COM
A'Levels & B.COMA'Levels & B.COM
A'Levels & B.COMUmair Karim
 
“Часто допускаемые ошибки при заведении SEO-кампании в SeoPult”
“Часто допускаемые ошибки при заведении SEO-кампании в SeoPult” “Часто допускаемые ошибки при заведении SEO-кампании в SeoPult”
“Часто допускаемые ошибки при заведении SEO-кампании в SeoPult” Cybermarketing, Moscow
 
Lesson 12 | Power points | Sabbath School
Lesson 12 | Power points | Sabbath SchoolLesson 12 | Power points | Sabbath School
Lesson 12 | Power points | Sabbath Schooljespadill
 
портфолио1
портфолио1портфолио1
портфолио1Sveta16
 
Análisis ensayo sobre ed. de ernesto sábato
Análisis ensayo sobre ed. de ernesto sábatoAnálisis ensayo sobre ed. de ernesto sábato
Análisis ensayo sobre ed. de ernesto sábatoVictoria Misandria
 
Lesson 12 | Kindergarten | Sabbath School
Lesson 12 | Kindergarten | Sabbath SchoolLesson 12 | Kindergarten | Sabbath School
Lesson 12 | Kindergarten | Sabbath Schooljespadill
 
Utilisation d'une communauté en ligne pour le développement d'un projet
Utilisation d'une communauté en ligne pour le développement d'un projetUtilisation d'une communauté en ligne pour le développement d'un projet
Utilisation d'une communauté en ligne pour le développement d'un projetomsrp
 

Destacado (20)

Lesson 12 | Primary | Sabbath School
Lesson 12 | Primary | Sabbath SchoolLesson 12 | Primary | Sabbath School
Lesson 12 | Primary | Sabbath School
 
Steel plant equipment in india
Steel plant equipment in indiaSteel plant equipment in india
Steel plant equipment in india
 
The Mousy King a pippet play written by Konstantin Iliev
The Mousy King a pippet play written by Konstantin IlievThe Mousy King a pippet play written by Konstantin Iliev
The Mousy King a pippet play written by Konstantin Iliev
 
Nanotecnologia en la seguridad
Nanotecnologia en la seguridadNanotecnologia en la seguridad
Nanotecnologia en la seguridad
 
Sothea paper_JSIDRE
Sothea paper_JSIDRESothea paper_JSIDRE
Sothea paper_JSIDRE
 
FR_Brochure_Adobe_Marketing_Cloud_2015.PDF
FR_Brochure_Adobe_Marketing_Cloud_2015.PDFFR_Brochure_Adobe_Marketing_Cloud_2015.PDF
FR_Brochure_Adobe_Marketing_Cloud_2015.PDF
 
Joint Stock Companies & Limited Liability Companies
Joint Stock Companies & Limited Liability CompaniesJoint Stock Companies & Limited Liability Companies
Joint Stock Companies & Limited Liability Companies
 
Lesson 12 | Cornerstone Connections | Sabbath School
Lesson 12 | Cornerstone Connections | Sabbath SchoolLesson 12 | Cornerstone Connections | Sabbath School
Lesson 12 | Cornerstone Connections | Sabbath School
 
A'Levels & B.COM
A'Levels & B.COMA'Levels & B.COM
A'Levels & B.COM
 
“Часто допускаемые ошибки при заведении SEO-кампании в SeoPult”
“Часто допускаемые ошибки при заведении SEO-кампании в SeoPult” “Часто допускаемые ошибки при заведении SEO-кампании в SeoPult”
“Часто допускаемые ошибки при заведении SEO-кампании в SeoPult”
 
Lực ma sát
Lực ma sátLực ma sát
Lực ma sát
 
Lesson 12 | Power points | Sabbath School
Lesson 12 | Power points | Sabbath SchoolLesson 12 | Power points | Sabbath School
Lesson 12 | Power points | Sabbath School
 
портфолио1
портфолио1портфолио1
портфолио1
 
Análisis ensayo sobre ed. de ernesto sábato
Análisis ensayo sobre ed. de ernesto sábatoAnálisis ensayo sobre ed. de ernesto sábato
Análisis ensayo sobre ed. de ernesto sábato
 
ACCA
ACCAACCA
ACCA
 
How We Lost Power
How We Lost PowerHow We Lost Power
How We Lost Power
 
Lesson 12 | Kindergarten | Sabbath School
Lesson 12 | Kindergarten | Sabbath SchoolLesson 12 | Kindergarten | Sabbath School
Lesson 12 | Kindergarten | Sabbath School
 
Utilisation d'une communauté en ligne pour le développement d'un projet
Utilisation d'une communauté en ligne pour le développement d'un projetUtilisation d'une communauté en ligne pour le développement d'un projet
Utilisation d'une communauté en ligne pour le développement d'un projet
 
Palacio de sal bolivia
Palacio de sal bolivia Palacio de sal bolivia
Palacio de sal bolivia
 
Introducción a la Seguridad de los Sistemas Operativos
Introducción a la Seguridad de los Sistemas OperativosIntroducción a la Seguridad de los Sistemas Operativos
Introducción a la Seguridad de los Sistemas Operativos
 

Similar a Big data week London Big data pipelining 0.2

InformationDrivenShort
InformationDrivenShortInformationDrivenShort
InformationDrivenShortDirk Ortloff
 
Tx2014 Feature and Highlights
Tx2014 Feature and Highlights Tx2014 Feature and Highlights
Tx2014 Feature and Highlights Heath Turner
 
WHITE PAPER▶ Software Defined Storage at the Speed of Flash
WHITE PAPER▶ Software Defined Storage at the Speed of FlashWHITE PAPER▶ Software Defined Storage at the Speed of Flash
WHITE PAPER▶ Software Defined Storage at the Speed of FlashSymantec
 
Oracle database edition-12c
Oracle database edition-12cOracle database edition-12c
Oracle database edition-12cAsha BG
 
Hadoop as an extension of DW
Hadoop as an extension of DWHadoop as an extension of DW
Hadoop as an extension of DWSidi yazid
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningPatrick Nicolas
 
Data Domain Architecture
Data Domain ArchitectureData Domain Architecture
Data Domain Architecturekoesteruk22
 
salesforce_apex_developer_guide
salesforce_apex_developer_guidesalesforce_apex_developer_guide
salesforce_apex_developer_guideBrindaTPatil
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Witsml core api_version_1.3.1
Witsml core api_version_1.3.1Witsml core api_version_1.3.1
Witsml core api_version_1.3.1Suresh Ayyappan
 
INN694-2014-OpenStack installation process V5
INN694-2014-OpenStack installation process V5INN694-2014-OpenStack installation process V5
INN694-2014-OpenStack installation process V5Fabien CHASTEL
 
Building a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneBuilding a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneMirko Calvaresi
 
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...Jaleel Ahmed Gulammohiddin
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
blue-infinity White Paper on JavaFX by Jan Stenvall
blue-infinity White Paper on JavaFX by Jan Stenvallblue-infinity White Paper on JavaFX by Jan Stenvall
blue-infinity White Paper on JavaFX by Jan Stenvallblue-infinity
 

Similar a Big data week London Big data pipelining 0.2 (20)

InformationDrivenShort
InformationDrivenShortInformationDrivenShort
InformationDrivenShort
 
Tx2014 Feature and Highlights
Tx2014 Feature and Highlights Tx2014 Feature and Highlights
Tx2014 Feature and Highlights
 
Cube_it!_software_report_for_IMIS
Cube_it!_software_report_for_IMISCube_it!_software_report_for_IMIS
Cube_it!_software_report_for_IMIS
 
WHITE PAPER▶ Software Defined Storage at the Speed of Flash
WHITE PAPER▶ Software Defined Storage at the Speed of FlashWHITE PAPER▶ Software Defined Storage at the Speed of Flash
WHITE PAPER▶ Software Defined Storage at the Speed of Flash
 
Oracle database edition-12c
Oracle database edition-12cOracle database edition-12c
Oracle database edition-12c
 
Hadoop as an extension of DW
Hadoop as an extension of DWHadoop as an extension of DW
Hadoop as an extension of DW
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learning
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
tr-4537
tr-4537tr-4537
tr-4537
 
Data Domain Architecture
Data Domain ArchitectureData Domain Architecture
Data Domain Architecture
 
salesforce_apex_developer_guide
salesforce_apex_developer_guidesalesforce_apex_developer_guide
salesforce_apex_developer_guide
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Bb sql serverdell
Bb sql serverdellBb sql serverdell
Bb sql serverdell
 
Witsml core api_version_1.3.1
Witsml core api_version_1.3.1Witsml core api_version_1.3.1
Witsml core api_version_1.3.1
 
INN694-2014-OpenStack installation process V5
INN694-2014-OpenStack installation process V5INN694-2014-OpenStack installation process V5
INN694-2014-OpenStack installation process V5
 
Building a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneBuilding a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and Lucene
 
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo...
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
blue-infinity White Paper on JavaFX by Jan Stenvall
blue-infinity White Paper on JavaFX by Jan Stenvallblue-infinity White Paper on JavaFX by Jan Stenvall
blue-infinity White Paper on JavaFX by Jan Stenvall
 

Último

Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Último (20)

Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Big data week London Big data pipelining 0.2

  • 1.     P a g e  1  |  17                                     Data  Pipelining  –  A  new  Approach  To   Datascience   Big  Data  Week,  London  November  20125                 Version  v0.2   Copyright  notice   This  document  is  Copyright  Datastax  Inc.  2015.  The  document  may  be  redistributed,  in  electronic  or  hardcopy   format,  on  the  understanding  that  the  contents  remain  the  property  of  Datastax  Inc.    The  document  may  not,   under  any  circumstance,  be  sold  or  distributed  for  compensation  of  any  kind.  All  information  is  provided  "as   is”,  there  is  no  warranty  that  the  information  is  correct  or  suitable  for  any  purpose,  either  implicit  or  explicit.          
  • 2.     P a g e  2  |  17           Table  of  Contents   1   DOCUMENT  INFORMATION  AND  HISTORY  ..............................................................................................  3   1.1   DOCUMENT  INFORMATION  .............................................................................................................................  3   1.2   VERSION  HISTORY  .........................................................................................................................................  3   2   INTRODUCTION  .......................................................................................................................................  3   3   REFERENCES  ............................................................................................................................................  3   3.1   ADAM  .......................................................................................................................................................  4   3.2   SPARK  NOTEBOOK  .........................................................................................................................................  4   4   BUILDING  &  RUNNING  THE  HUMAN  GENOME  BIG  DATA  PIPELINE  DEMO  ...............................................  4   4.1   GENERAL  PRE-­‐REQUISITES  ..............................................................................................................................  4   4.2   GENERAL  OS  PRE-­‐REQUISITES  (UBUNTU)  ..........................................................................................................  5   4.3   OS-­‐SPECIFIC  PRE-­‐REQUISITES  .........................................................................................................................  7   4.4   INSTALL  DOCKER  ...........................................................................................................................................  7   4.5   CLONE  THE  PIPELINE  GITHUB  REPO  ..................................................................................................................  8   4.6   CREATE  THE  DEVOXX  CONTAINER  IMAGE  ...........................................................................................................  9   4.7   RUN  THE  DEVOXX  IMAGE  AS  A  CONTAINER  .....................................................................................................  10   4.8   SETUP  THE  DEVOXX  DEMO  ...........................................................................................................................  13   4.9   RUN  THE  DEVOXX  BIG  DATA  PIPELINING  DEMO  ................................................................................................  13   4.10   ACCESS  CASSANDRA  FROM  OUTSIDE  THE  CONTAINER  ......................................................................................  16   4.11   VISUALISE  THE  DATA  USING  AKKA  AND  A  REST  INTERFACE  ...............................................................................  17        
  • 3.     P a g e  3  |  17           1 DOCUMENT  INFORMATION  AND  HISTORY     1.1 DOCUMENT  INFORMATION     Document  name:   Big  Data  Week  London  -­‐  Big  Data  Pipelining  0.2.Docx   Document  Authors:   Simon  Ambridge   Original  Date:   20/11/2015   Purpose:   A  description  of  the  steps  required  to  build  and  run  the  Human  Genome   Big  Data  Pipeline  container  created  and  distributed  by  Data  Fellas,   Belgium.     1.2 VERSION  HISTORY     Version   Date   Changed  By   Changes   0.1   20/11/2015   Simon  Ambridge   Initial  draft     0.2   26/11/2015   Simon  Ambridge   First  published  draft  (internal)                                             2 INTRODUCTION     The  objective  for  the  one-­‐hour  presentation  at  Big  Data  Week  2015  in  London,  in  conjunction  with   these  notes,  is  to  introduce  the  audience  to  a  demonstration  environment  that  was  used  as  the  basis   of  a  half-­‐day  workshop  presented  by  a  team  led  by  Andy  Petrella  from  Data  Fellas  at  Devoxx  in   November  2015.     The  takeaway  at  the  end  of  this  session  will  be  a  better  understanding  of  how  to  build  a  data   pipeline  using  modern  distributed  and  scalable  technologies,  in  conjunction  with  the  Data  Fellas   Spark-­‐Notebook,  that  demonstrates  how  to  make  use  of  a  reproducible  data  pipelining  approach.     3 REFERENCES    
  • 4.     P a g e  4  |  17           3.1 ADAM     Adam  is  described  here:   http://www.lexjansen.com/pharmasug/2014/DS/PharmaSUG-­‐2014-­‐DS11.pdf   The  Clinical  Data  Interchange  Standards  Consortium  (CDISC)  Analysis  Data  Model  (ADaM)     3.2 SPARK  NOTEBOOK     From  an  interview  with  Andy  Petrella  by  Typesafe  Inc.   “Spark-­‐Notebook  from  Data  Fellas  lets  you  use  Apache  Spark  in  your  browser  and  is   purposed  with  creating  reproducible  analysis using  Scala,  Apache  Spark  and  other   technologies.   Data  science  was  originally  focused  on  producing  static  products  (reports,  models,  …)  based   on  samples  of  the  available  data  at  the  time  of  analysis.  Nowadays,  the  results  need  to  be   reactive  with  the  data  flow,  requiring  new  data  types.  Also,  the  data  sizes  can  be  really  big. This  explains  the  rise  of  distributed  computing  and  online  analysis,  the  union  of  which  could   bethought  as  the  Reactive  Data  Science  Pipeline.  However,  such  a  pipeline  requires  many   skill  sets,  including  data  science,  operations  software  engineering,  domain  knowledge,  and   others. At  Data  Fellas,  we  are  building  Shar3,  the  ultimate  toolkit  that  aims  to  build  Reactive  data   science  pipelines  by  reducing  the  friction  between  the  different  building  phases.   Shar3  is  composed  of  notable  OSS  technologies  like  Apache  Avro,  Apache  Mesos,  Apache   Cassandra,  Apache  Spark,  Typesafe  Reactive  Platform  (Scala,  Akka,  Play),  Spark  Notebook   and  more.  These  components  were  chosen  with  a  strong  focus  on  scalability  and  the   capacity  to  reactively  adapt  to  their  ever-­‐changing  production  environments.”   https://www.typesafe.com/blog/scala-­‐and-­‐spark-­‐notebook-­‐the-­‐next-­‐generation-­‐data-­‐ science-­‐toolkit   https://github.com/andypetrella/spark-­‐notebook     4 BUILDING  &  RUNNING  THE  HUMAN  GENOME  BIG  DATA  PIPELINE  DEMO       4.1 GENERAL  PRE-­‐REQUISITES    
Required machine spec: 3 cores, 5GB RAM.

Provision a host:

• A physical Linux machine
  http://www.ubuntu.com/download/desktop

• Or create a VM (e.g. Ubuntu)
  http://virtualboxes.org/images/ubuntu/
  http://www.osboxes.org/ubuntu/

4.2 GENERAL OS PRE-REQUISITES (UBUNTU)

Task duration: 5 minutes

These instructions are taken from https://docs.docker.com/installation/ubuntulinux/

Docker's apt repository contains Docker 1.5.0 and higher. To set apt to use packages from the new repository:

1. If you haven't already done so, log into your Ubuntu instance.

2. Open a terminal window.

3. Add the new GPG key as a privileged user (e.g. as root or via sudo):

$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

4. Identify your OS release – for example, on Ubuntu check the /etc/os-release file (a quick check is sketched at the end of this section).

5. Open the docker.list file in your favourite editor and remove any existing entries. If the file doesn't exist, create it.
$ sudo vi /etc/apt/sources.list.d/docker.list

6. Add the appropriate entry for your OS – only the line that matches your release is needed, e.g.:

# Ubuntu Precise 12.04 (LTS)
deb https://apt.dockerproject.org/repo ubuntu-precise main
# Ubuntu Trusty 14.04 (LTS)
deb https://apt.dockerproject.org/repo ubuntu-trusty main
# Ubuntu Vivid 15.04
deb https://apt.dockerproject.org/repo ubuntu-vivid main
# Ubuntu Wily 15.10
deb https://apt.dockerproject.org/repo ubuntu-wily main

7. Save and close the /etc/apt/sources.list.d/docker.list file.

8. Update the apt package index:

$ apt-get update

NB if you get a GPG error such as:

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY <some key>

fix it using:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <some key>

and re-run apt-get update.

9. Purge the old lxc-docker package if it is installed:

$ sudo apt-get purge lxc-docker*

10. Verify that apt is pulling from the right repository:

$ apt-cache policy docker-engine

Henceforth apt-get upgrade will pull docker-engine from the correct repository.
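A quick way to carry out the check in step 4 – a minimal sketch, assuming a standard Ubuntu install (the codename shown determines which deb line you need in docker.list):

$ cat /etc/os-release        # look at NAME and VERSION, e.g. "Ubuntu" / "14.04.3 LTS, Trusty Tahr"
$ lsb_release -cs            # prints just the codename, e.g. "trusty"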
4.3 OS-SPECIFIC PRE-REQUISITES

Task duration: configuration dependent

For example, for Ubuntu Precise, Docker requires kernel version 3.13. If the kernel version is older than 3.13, it must be upgraded.

Check with the uname command:

$ uname -r

In my case the kernel version was already recent enough.

4.4 INSTALL DOCKER

Task duration: 5 minutes

1. Update the package index:

$ sudo apt-get update

2. Install the docker package:

$ sudo apt-get install docker-engine

3. Start the docker daemon (it should already be running):

$ sudo service docker start

4. Check it is working correctly:

$ sudo docker run hello-world
5. Add the login user to the docker group to avoid having to use sudo for each docker command:

$ sudo useradd -G docker <myuserid>

or, since the login user will normally already exist:

$ sudo usermod -aG docker <myuserid>

6. Log out and back in, then check that docker can be run without sudo:

$ docker run hello-world
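A minimal sanity check after the group change – a sketch only, which assumes you have logged out and back in so that the new group membership has been picked up:

$ docker --version          # confirm the client is installed
$ groups                    # 'docker' should now appear in the list
$ docker run hello-world    # should now work without sudo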
4.5 CLONE THE PIPELINE GITHUB REPO

Task duration: 2 minutes

Clone the pipeline repo:

$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git

4.6 CREATE THE DEVOXX CONTAINER IMAGE

Task duration: 20 minutes

At this point the only image we have locally is 'hello-world'.

1. Use the docker images command to list the available images:

$ docker images

Docker images are stored locally under /var/lib/docker.

2. Pull the Devoxx image from the Docker Hub:

$ docker pull xtordoir/pipeline
Now check the available images again with the 'docker images' command – the xtordoir/pipeline image should be listed, and approximately 3GB of disk space will have been used under /var/lib/docker.

4.7 RUN THE DEVOXX IMAGE AS A CONTAINER

Task duration: 5 minutes

1. To run the Devoxx image as a container in the foreground, we will use the following command:

docker run -it -m 8g -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 -p 37979:7979 -p 38989:8989 xtordoir/pipeline bash
Where:

-it : for interactive processes (like a shell) you must use -i and -t together in order to allocate a tty for the container process
-m : allow a maximum of 8GB of memory for the container
-p : a list of port mappings (host port:container port)
xtordoir/pipeline : the image to run
bash : the command to run in the container

2. For convenience, place the run command in a script, e.g. run_docker.sh (a sketch of such a script is shown at the end of this section):

$ vi ./run_docker.sh

Make it executable:

$ chmod 755 ./run_docker.sh

3. Run the command or the script:

$ ./run_docker.sh

4. From now on we are inside the docker container at the '#' prompt. Ignore the warning message about the lack of swap – this is what we want for Cassandra.

Creating the container uses additional space in /var/lib/docker on the host.

5. We can list running containers using 'docker ps'.
6. If we exit the container it continues to exist, but stops running, and 'docker ps' returns nothing. We can still see all containers, including stopped ones, with 'docker ps -a'.

7. We don't need the old copies of the hello-world container – we can delete them with 'docker rm', using either the container name or the container ID.

8. We can re-start the stopped container using 'docker start':

$ docker start -a -i adoring_torvalds

(adoring_torvalds is the auto-generated name of the container in this example – substitute the name or ID reported by 'docker ps -a'.)

9. We can attach to a running container using 'docker attach'.

10. To detach the tty without exiting the shell, use the escape sequence Ctrl-p followed by Ctrl-q.
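As referenced in step 2 above, a minimal sketch of the run_docker.sh wrapper script – it simply wraps the documented run command, using the same image and port mappings:

#!/bin/bash
# Run the Devoxx pipeline image interactively with 8GB of memory
# and the port mappings used throughout this document.
docker run -it -m 8g \
  -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
  -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
  -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
  -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
  -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
  -p 37979:7979 -p 38989:8989 \
  xtordoir/pipeline bash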
4.8 SETUP THE DEVOXX DEMO

Task duration: 2 minutes

Ensure that you are inside the container.

1. Set up the services:

$ cd pipeline
$ source devoxx-setup.sh        # ignore the Cassandra errors

2. The cassandra.yaml file (at /root/apache-cassandra-2.2.0/conf/cassandra.yaml) is updated at container creation time with the parameters – e.g. listen_address (localhost) and rpc_interface (eth0) – needed to make Cassandra available from outside the container. The RPC port (9160) and the native transport port (9042) are also mapped in the container run command.

3. After a little while all of the services should be up. For example, check Cassandra with cqlsh:

$ cqlsh `hostname`

NB: at this point the pipeline keyspace does not yet exist in Cassandra.
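A quick way to confirm that the services have started – a sketch only, which assumes the Cassandra installation inside the container lives under /root/apache-cassandra-2.2.0 (as implied by the cassandra.yaml path above):

$ ps -ef | egrep -i 'cassandra|spark' | grep -v grep       # the service JVMs should be running
$ /root/apache-cassandra-2.2.0/bin/nodetool status         # the single node should report UN (Up/Normal)
$ cqlsh `hostname` -e "DESCRIBE KEYSPACES;"                # the pipeline keyspace will not be listed yet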
4.9 RUN THE DEVOXX BIG DATA PIPELINING DEMO

Task duration: 30 minutes

1. Check that the Spark Notebook is available in the browser:

http://localhost:39000/tree/pipeline      <-- Spark Notebook
http://localhost:36060/                   <-- Spark Master
http://localhost:34040/                   <-- Spark Worker

2. Run through the notebooks 'AdamToDataframe' and 'AggregateAndSaveToCassandra'. At the end of the second notebook the data will be available in Cassandra.

3. We use the DataFrame save method to save the data to Cassandra:

allPopulations
  .withColumn("population", lit("ALL"))
  .withColumnRenamed("refCnt", "ref_cnt")
  .withColumnRenamed("altCnt", "alt_cnt")
  .write.format("org.apache.spark.sql.cassandra")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .options(Map("keyspace" -> "pipeline", "table" -> "pop_allele_count"))
  .save()

However, we could instead have used the RDD saveToCassandra method (this needs the Spark Cassandra connector implicits, i.e. import com.datastax.spark.connector._; PopAlleleCount is the case class defined elsewhere in the notebook to represent a row):

a. Convert the DataFrames to RDDs:

val countsByPop = byPopulation.rdd.collect {
  case Row(pop: String, chr: String, start: Long,
           ref: String, alt: String, refcnt: Long, altcnt: Long) =>
    PopAlleleCount(pop, chr, start, ref, alt, refcnt, altcnt)
}
val countAll = allPopulations.rdd.collect {
  case Row(chr: String, start: Long, ref: String,
           alt: String, refcnt: Long, altcnt: Long) =>
    PopAlleleCount("ALL", chr, start, ref, alt, refcnt, altcnt)
}

b. Save the RDDs to Cassandra:

countsByPop.saveToCassandra("pipeline", "pop_allele_count",
  SomeColumns("population", "chromosome", "start", "ref", "alt", "refcnt", "altcnt"))

countAll.saveToCassandra("pipeline", "pop_allele_count",
  SomeColumns("population", "chromosome", "start", "ref", "alt", "refcnt", "altcnt"))

4. We can then see the processed data in Cassandra from cqlsh inside the container:

cqlsh> SELECT COUNT(*) FROM pipeline.pop_allele_count;

And we can run a typical query on the data:

cqlsh> SELECT * FROM pipeline.pop_allele_count WHERE population = 'ALL' AND chromosome = '22' AND start >= 16500000 AND start < 16750000;
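The same checks can be run non-interactively from the container shell; a small sketch, assuming cqlsh is on the PATH as in section 4.8:

$ cqlsh `hostname` -e "SELECT COUNT(*) FROM pipeline.pop_allele_count;"
$ cqlsh `hostname` -e "SELECT * FROM pipeline.pop_allele_count WHERE population = 'ALL' AND chromosome = '22' AND start >= 16500000 AND start < 16750000 LIMIT 10;"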
4.10 ACCESS CASSANDRA FROM OUTSIDE THE CONTAINER

Task duration: 20 minutes

Because we mapped the Cassandra ports when we started the container, we can also access the data in Cassandra from outside the container.

1. First we need to install a compatible version of Cassandra – we will use DataStax Community Edition 2.2.0, which can be downloaded from http://downloads.datastax.com/community

2. Unzip and extract the downloaded archive:

$ gunzip Downloads/dsc-cassandra-2.2.0-bin.tar.gz
$ cd pipeline
$ tar xvf ~/Downloads/dsc-cassandra-2.2.0-bin.tar

3. Run cqlsh and use the IP address of the container (see the sketch below).
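A minimal sketch of step 3, run from the host – the container name and extracted path below are examples and will differ on your machine:

$ docker ps                                                        # note the name or ID of the running container
$ docker inspect --format '{{ .NetworkSettings.IPAddress }}' adoring_torvalds
172.17.0.2                                                         # example output
$ ~/pipeline/dsc-cassandra-2.2.0/bin/cqlsh 172.17.0.2 9042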
4.11 VISUALISE THE DATA USING AKKA AND A REST INTERFACE

Task duration: 2 minutes

1. In the terminal (inside the container), run:

$ cd ~/pipeline/rest-api && sed -i s/${IP_eth0}/${IP_eth0}/ src/main/resources/application.conf && sbt run

(Note: as written the shell expands both occurrences of ${IP_eth0}; the intent is presumably to substitute a placeholder in application.conf with the container's eth0 IP address, so you may need to quote or escape the first pattern.)

This starts the REST service, built with Akka HTTP, which reads the data from Cassandra and serves it as JSON.

2. Open the 'Rest Call' notebook and execute it. This one uses Akka HTTP (this time as a client) to access the REST service and plot some of the data. (There is currently a bug in this step, under investigation.)

3. Open the 'Rest Call (using HTML form)' notebook. It is essentially the same as above, but presents the REST calls as an HTML form instead of query parameters in a string. Click "Range query over a population for a chromosome" and then click "Change".
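Once the REST service is running it can also be exercised directly, for example with curl. The port and path below are hypothetical – check rest-api/src/main/resources/application.conf and the route definitions for the actual values (and use the corresponding mapped host port if calling from outside the container):

# hypothetical endpoint – substitute the real port and path from the rest-api configuration
$ curl "http://localhost:<port>/<path>?population=ALL&chromosome=22&start=16500000&end=16750000"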