Big Data Pipelining – build guide for the pipeline container from Data Fellas, to go with the slides from my Big Data Pipelining talk at Big Data Week and the Cassandra London meetup, 27th November 2015

Data Pipelining – A New Approach to Data Science

Big Data Week, London
November 2015

Version v0.2

Copyright notice

This document is Copyright Datastax Inc. 2015. The document may be redistributed, in electronic or hardcopy format, on the understanding that the contents remain the property of Datastax Inc. The document may not, under any circumstance, be sold or distributed for compensation of any kind. All information is provided "as is"; there is no warranty that the information is correct or suitable for any purpose, either implicit or explicit.

Table of Contents

1 DOCUMENT INFORMATION AND HISTORY
1.1 DOCUMENT INFORMATION
1.2 VERSION HISTORY
2 INTRODUCTION
3 REFERENCES
3.1 ADAM
3.2 SPARK NOTEBOOK
4 BUILDING & RUNNING THE HUMAN GENOME BIG DATA PIPELINE DEMO
4.1 GENERAL PRE-REQUISITES
4.2 GENERAL OS PRE-REQUISITES (UBUNTU)
4.3 OS-SPECIFIC PRE-REQUISITES
4.4 INSTALL DOCKER
4.5 CLONE THE PIPELINE GITHUB REPO
4.6 CREATE THE DEVOXX CONTAINER IMAGE
4.7 RUN THE DEVOXX IMAGE AS A CONTAINER
4.8 SETUP THE DEVOXX DEMO
4.9 RUN THE DEVOXX BIG DATA PIPELINING DEMO
4.10 ACCESS CASSANDRA FROM OUTSIDE THE CONTAINER
4.11 VISUALISE THE DATA USING AKKA AND A REST INTERFACE

1 DOCUMENT INFORMATION AND HISTORY

1.1 DOCUMENT INFORMATION

Document name: Big Data Week London - Big Data Pipelining 0.2.docx
Document Authors: Simon Ambridge
Original Date: 20/11/2015
Purpose: A description of the steps required to build and run the Human Genome Big Data Pipeline container created and distributed by Data Fellas, Belgium.

1.2 VERSION HISTORY

Version   Date         Changed By       Changes
0.1       20/11/2015   Simon Ambridge   Initial draft
0.2       26/11/2015   Simon Ambridge   First published draft (internal)

2 INTRODUCTION

The objective of the one-hour presentation at Big Data Week 2015 in London, in conjunction with these notes, is to introduce the audience to a demonstration environment that was used as the basis of a half-day workshop presented by a team led by Andy Petrella from Data Fellas at Devoxx in November 2015.

The takeaway at the end of this session will be a better understanding of how to build a data pipeline using modern distributed and scalable technologies, in conjunction with the Data Fellas Spark-Notebook, and of how to make use of a reproducible data pipelining approach.
3 REFERENCES

3.1 ADAM

ADAM is a genomics analysis platform from Big Data Genomics, built on Apache Spark, with Avro- and Parquet-based formats for genomic data. It is used in this demo to process human genome variant data. ADAM is described here:

https://github.com/bigdatagenomics/adam

3.2 SPARK NOTEBOOK

From an interview with Andy Petrella by Typesafe Inc.:

"Spark-Notebook from Data Fellas lets you use Apache Spark in your browser and is purposed with creating reproducible analysis using Scala, Apache Spark and other technologies. Data science was originally focused on producing static products (reports, models, …) based on samples of the available data at the time of analysis. Nowadays, the results need to be reactive with the data flow, requiring new data types. Also, the data sizes can be really big. This explains the rise of distributed computing and online analysis, the union of which could be thought of as the Reactive Data Science Pipeline. However, such a pipeline requires many skill sets, including data science, operations, software engineering, domain knowledge, and others.

At Data Fellas, we are building Shar3, the ultimate toolkit that aims to build Reactive data science pipelines by reducing the friction between the different building phases. Shar3 is composed of notable OSS technologies like Apache Avro, Apache Mesos, Apache Cassandra, Apache Spark, Typesafe Reactive Platform (Scala, Akka, Play), Spark Notebook and more. These components were chosen with a strong focus on scalability and the capacity to reactively adapt to their ever-changing production environments."

https://www.typesafe.com/blog/scala-and-spark-notebook-the-next-generation-data-science-toolkit
https://github.com/andypetrella/spark-notebook

4 BUILDING & RUNNING THE HUMAN GENOME BIG DATA PIPELINE DEMO

4.1 GENERAL PRE-REQUISITES

Required machine spec: 3 cores, 5GB RAM.

Provision a host:

• Linux machine
  http://www.ubuntu.com/download/desktop
• Create a VM (e.g. Ubuntu)
  http://virtualboxes.org/images/ubuntu/
  http://www.osboxes.org/ubuntu/

4.2 GENERAL OS PRE-REQUISITES (UBUNTU)

Task duration: 5 minutes

These instructions are taken from https://docs.docker.com/installation/ubuntulinux/

Docker's apt repository contains Docker 1.5.0 and higher. To set apt to use packages from the new repository:

1. If you haven't already done so, log into your Ubuntu instance.

2. Open a terminal window.

3. Add the new gpg key as a privileged user (e.g. as root or via sudo):

$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

4. Identify your OS – for example, on Ubuntu check the /etc/os-release file.
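
A quick way to do this (standard on Ubuntu, although the command is my addition rather than part of the original text) is simply to print the file:

$ cat /etc/os-release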

5. Open the docker.list file in your favourite editor and remove any existing entries. If the file doesn't exist, create it.

$ sudo vi /etc/apt/sources.list.d/docker.list

6. Add the appropriate entry for your OS, e.g. for Precise:

# Ubuntu Precise 12.04 (LTS)
deb https://apt.dockerproject.org/repo ubuntu-precise main

# Ubuntu Trusty 14.04 (LTS)
deb https://apt.dockerproject.org/repo ubuntu-trusty main

# Ubuntu Vivid 15.04
deb https://apt.dockerproject.org/repo ubuntu-vivid main

# Ubuntu Wily 15.10
deb https://apt.dockerproject.org/repo ubuntu-wily main

7. Save and close the /etc/apt/sources.list.d/docker.list file.

8. Update the apt package index:

$ sudo apt-get update

NB: if you get a GPG error such as:

The following signatures couldn't be verified because the public key is not available: NO_PUBKEY <some key>

fix it using:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <some key>

and then re-run apt-get update.

9. Purge the old repo if it exists:

$ sudo apt-get purge lxc-docker*

10. Verify that apt is pulling from the right repository:

$ apt-cache policy docker-engine

Henceforth apt-get upgrade will pull from the correct repository.

4.3 OS-SPECIFIC PRE-REQUISITES

Task duration: configuration dependent

For example, for Ubuntu Precise, Docker requires kernel version 3.13. If the kernel version is older than 3.13 it must be upgraded. Use the uname command to check:

$ uname -r

In my case the kernel version is OK.

4.4 INSTALL DOCKER

Task duration: 5 minutes

1. Update the package index:

$ sudo apt-get update

2. Install the docker package:

$ sudo apt-get install docker-engine

3. Start the docker daemon (it should already be running):

$ sudo service docker start

4. Check it is working correctly:

$ sudo docker run hello-world

5. Add the login user to the docker group, to avoid having to use sudo for each command:

$ sudo useradd -G docker <myuserid>

or, if the user already exists:

$ sudo usermod -aG docker <myuserid>

6. Log out and back in, then check that docker can be run without sudo:

$ docker run hello-world

4.5 CLONE THE PIPELINE GITHUB REPO

Task duration: 2 mins

Clone the pipeline repo:

$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git

4.6 CREATE THE DEVOXX CONTAINER IMAGE

Task duration: 20 mins

At this point we only have the 'hello-world' container locally.

1. Use the docker images command to list the available images:

$ docker images

Docker images are stored locally under /var/lib/docker.

2. Pull the Devoxx image from the Docker hub:

$ docker pull xtordoir/pipeline

Now check the available images using the 'docker images' command. Note that we've used approx. 3GB of disk space in /var/lib/docker.
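
To verify the disk usage yourself, one option (my suggestion, not part of the original steps) is:

$ sudo du -sh /var/lib/docker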

4.7 RUN THE DEVOXX IMAGE AS A CONTAINER

Task duration: 5 mins

1. To run the Devoxx image as a container in the foreground, we will use the following command:

docker run -it -m 8g -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 -p 37979:7979 -p 38989:8989 xtordoir/pipeline bash

Where:

-it : for interactive processes (like a shell) you must use -i and -t together, in order to allocate a tty for the container process
-m : allow a maximum of 8GB memory for the container
-p : a list of port mappings
xtordoir/pipeline : the image to run
bash : the command to run in the container
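
Once the container is running, a quick way to confirm that the mappings took effect (an optional check, not part of the original steps) is docker's port command, run from another terminal on the host:

$ docker port <container-name>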

2. For convenience, place the run command in a script, e.g.:

$ vi ./run_docker.sh

Make it executable:

$ chmod 755 ./run_docker.sh

3. Run the command or the script:

$ ./run_docker.sh

4. From now on we are inside the docker instance, at the '#' prompt. Ignore the warning message regarding lack of swap – this is what we want for Cassandra. Note that creating the container has used additional space in /var/lib/docker on the host.

5. We can list the running containers using 'docker ps'.

6. If we exit the container it continues to exist, but stops running, and 'docker ps' returns nothing. We can see all containers, running or stopped, with 'docker ps -a'.

7. We don't need the old copies of the hello-world container – we can delete them using their container name or container ID.
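
For example (substitute the names or IDs reported by 'docker ps -a'):

$ docker rm <container-name-or-id>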

8. We can re-start the stopped container using 'docker start':

$ docker start -a -i adoring_torvalds

9. We can attach to the running container using 'docker attach'.
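
For example, using the container name from the previous step:

$ docker attach adoring_torvalds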

10. To detach the tty without exiting the shell, use the escape sequence Ctrl-p + Ctrl-q.

4.8 SETUP THE DEVOXX DEMO

Task duration: 2 mins

Ensure that you are inside the container.

1. Setup the services:

$ cd pipeline
$ source devoxx-setup.sh    # ignore Cassandra errors

2. The cassandra.yaml file (at /root/apache-cassandra-2.2.0/conf/cassandra.yaml) is updated at container creation time with the parameters necessary to make Cassandra available outside the container, e.g. listen_address (localhost) and rpc_interface (eth0). The rpc port (9160) and native transport port (9042) are also mapped in the container run command.
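
For reference, the relevant section of cassandra.yaml then looks something like this (an illustrative fragment based on the parameters just described, not a verbatim copy of the file):

listen_address: localhost
rpc_interface: eth0
rpc_port: 9160
native_transport_port: 9042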

3. After a little time all services should be up, e.g. cassandra:

$ cqlsh `hostname`

NB: at this point the pipeline keyspace does not yet exist in Cassandra.
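
You can confirm this from the cqlsh prompt (an optional check, my addition) – the system keyspaces are listed, but pipeline is not yet among them:

cqlsh> DESCRIBE KEYSPACES;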

4.9 RUN THE DEVOXX BIG DATA PIPELINING DEMO

Task duration: 30 mins

1. Check that the Spark Notebook is available in the browser:

http://localhost:39000/tree/pipeline   <-- Spark Notebook
http://localhost:36060/                <-- Spark Master
http://localhost:34040/                <-- Spark Worker

2. Run through the notebooks 'AdamToDataframe' and 'AggregateAndSaveToCassandra'. At the end of the second notebook the data will be available in Cassandra.

3. We use the Dataframe save method to save the data to Cassandra:

allPopulations
  .withColumn("population", lit("ALL"))
  .withColumnRenamed("refCnt", "ref_cnt")
  .withColumnRenamed("altCnt", "alt_cnt")
  .write.format("org.apache.spark.sql.cassandra")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .options(Map("keyspace" -> "pipeline", "table" -> "pop_allele_count"))
  .save()

However, we could instead have used the RDD 'saveToCassandra' method:

a. Convert the Dataframes to RDDs:

val countsByPop = byPopulation.rdd.collect {
  case Row(pop: String, chr: String, start: Long, ref: String, alt: String, refcnt: Long, altcnt: Long) =>
    PopAlleleCount(pop, chr, start, ref, alt, refcnt, altcnt)
}

val countAll = allPopulations.rdd.collect {
  case Row(chr: String, start: Long, ref: String, alt: String, refcnt: Long, altcnt: Long) =>
    PopAlleleCount("ALL", chr, start, ref, alt, refcnt, altcnt)
}

b. Save the RDDs to Cassandra:

countsByPop.saveToCassandra("pipeline", "pop_allele_count",
  SomeColumns("population", "chromosome", "start", "ref", "alt", "refcnt", "altcnt"))

countAll.saveToCassandra("pipeline", "pop_allele_count",
  SomeColumns("population", "chromosome", "start", "ref", "alt", "refcnt", "altcnt"))

4. We can then see the processed data in Cassandra, from CQLSH inside the container:

cqlsh> SELECT COUNT(*) FROM pop_allele_count;

We can run a typical query on the data:

cqlsh> SELECT * FROM pop_allele_count WHERE population = 'ALL' AND chromosome = '22' AND start >= 16500000 AND start < 16750000;
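
The same table can also be read back into a Dataframe from the notebook via the Spark-Cassandra connector. A minimal sketch, assuming the sqlContext already available in the Spark Notebook (the keyspace, table and column names are the ones used above):

// Read the pop_allele_count table back into a Dataframe via the connector
val popCounts = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pipeline", "table" -> "pop_allele_count"))
  .load()

// The same range query as the CQL example, expressed against the Dataframe
popCounts
  .filter(popCounts("population") === "ALL" && popCounts("chromosome") === "22")
  .filter(popCounts("start") >= 16500000L && popCounts("start") < 16750000L)
  .show(10)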

4.10 ACCESS CASSANDRA FROM OUTSIDE THE CONTAINER

Task duration: 20 mins

Because we mapped the Cassandra ports when we started the container, we can also access the data in Cassandra from outside the container.

1. First we need to install a compatible version of Cassandra. We will use Datastax Community Edition 2.2.0, which we can get from http://downloads.datastax.com/community

2. Unzip and extract the downloaded archive:

$ gunzip Downloads/dsc-cassandra-2.2.0-bin.tar.gz
$ cd pipeline
$ tar xvf ~/Downloads/dsc-cassandra-2.2.0-bin.tar

3. Run cqlsh, using the IP address of the container.
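
For example (an illustrative sequence, not from the original text – the container name reported by 'docker ps' will vary): first find the container's IP address from the host, then point the newly-installed cqlsh at it:

$ docker inspect --format '{{ .NetworkSettings.IPAddress }}' <container-name>
$ ./dsc-cassandra-2.2.0/bin/cqlsh <container-ip> 9042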

4.11 VISUALISE THE DATA USING AKKA AND A REST INTERFACE

Task duration: 2 mins

1. In the terminal:

$ cd ~/pipeline/rest-api && sed -i "s/\${IP_eth0}/${IP_eth0}/" src/main/resources/application.conf && sbt run

This starts the REST service in Akka HTTP, reading data from Cassandra and serving it as JSON.
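
As a rough illustration of what such a service looks like, here is a minimal Akka HTTP sketch in Scala that reads the pop_allele_count table and serves rows as JSON. The keyspace, table and column names are the ones used earlier in this guide; the route path, port and all other details are illustrative assumptions, not the actual rest-api source:

// Minimal sketch: an Akka HTTP route serving Cassandra rows as JSON.
// Route path, port and JSON shape are illustrative, not the real rest-api code.
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object RestSketch extends App {
  implicit val system = ActorSystem("rest-sketch")
  implicit val materializer = ActorMaterializer()

  // Connect to the Cassandra instance running inside the container
  val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("pipeline")

  // GET /allele/<population>/<chromosome> returns the allele counts as JSON
  val route =
    path("allele" / Segment / Segment) { (population, chromosome) =>
      get {
        val rows = session.execute(
          "SELECT start, ref, alt, ref_cnt, alt_cnt FROM pop_allele_count " +
          "WHERE population = ? AND chromosome = ?",
          population, chromosome).all().asScala
        val json = rows.map { r =>
          s"""{"start":${r.getLong("start")},"ref":"${r.getString("ref")}",""" +
          s""""alt":"${r.getString("alt")}","ref_cnt":${r.getLong("ref_cnt")},""" +
          s""""alt_cnt":${r.getLong("alt_cnt")}}"""
        }.mkString("[", ",", "]")
        complete(HttpEntity(ContentTypes.`application/json`, json))
      }
    }

  Http().bindAndHandle(route, "0.0.0.0", 7979)
}

Port 7979 is used in the sketch simply because the container run command above maps 37979:7979.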

2. Open the Rest Call notebook and execute it. This one uses Akka HTTP (this time as a client) to access the REST service and plot some data. (There is currently a bug in this step, under investigation.)

3. Open the Rest Call (using HTML form) notebook. It is essentially the same as above, but presents the REST calls in an HTML form instead of as query parameters in a String. Click "Range query over a population for a chromosome" and then click "Change".