Talk outlining the CLoud Infrastructure for Microbial Bioinformatics (CLIMB) system given at the CLIMB Launch in July 2016. CLIMB is a UK national e-infrastructure providing Microbial Bioinformatics as a Service.
Introducing the CLoud Infrastructure For Microbial Bioinformatics System (CLIMB)
1. Introducing the CLoud
Infrastructure For Microbial
Bioinformatics System
CLIMB Launch, July 2016
Dr Thomas R Connor
Senior Lecturer
Cardiff University School of Biosciences
@tomrconnor ; connortr@cardiff.ac.uk
http://www.climb.ac.uk
2. The CLIMB Consortium Are
• Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) – Joint
PIs
• Professor Mark Achtman (Warwick), Professor Steve Busby FRS
(Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim Walsh
(Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is
• Dr Nick Loman (Birmingham site lead)*, Dr Chris Quince (Warwick) and Dr
Daniel Falush (Bath); MRC Research Fellows
• Simon Thompson (Birmingham, Project Technical/OpenStack lead),
• Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt Bull
(Cardiff Sysadmin), Radoslaw Poplawski (Birmingham sysadmin), Andy Smith
(Birmingham software development)
• Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick (Cardiff
Biosciences Research Computing team), Wayne Lawrence, Dr Christine
Kitchen, Professor Martyn Guest (Cardiff HPC team), Matt Ismail (Warwick
HPC lead)
* Principal bioinformaticians architecting and designing the system
3. Over the next few days you will hear mostly from
academic staff involved in the project. But what we
have achieved to date would not be possible without
the technical team
• Simon Thompson (Birmingham), Marius Bakke (Warwick)
• Radoslaw Poplawski (Birmingham sysadmin), Andy Smith
(Birmingham software development), Dr Matt Bull (Cardiff
Sysadmin)
• Matt Ismail (Warwick HPC lead), Kevin Munn and Dr Ian
Merrick (Cardiff Biosciences Research Computing team),
Wayne Lawrence, Dr Christine Kitchen, Professor Martyn
Guest (Cardiff HPC team), Simon Thompson (Swansea HPC
team)
4. Introducing the CLoud Infrastructure
for Microbial Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for microbial
bioinformatics
• ~£4M of hardware, capable
of supporting >1000
individual virtual servers
• Providing a core, national
cyberinfrastructure for
Microbial Bioinformatics
5. The Sequencing Iceberg
There are many biological analysis platforms available now that make
producing large, rich, complex datasets relatively cheap and easy.
However, the major costs and difficulties do not lie with the
generation of data; they lie with how we share, store and analyse the
data we generate:
• Informatics expertise
• User accessibility of software/hardware
• Appropriate compute capacity
• Software development
• Storage availability
• Network capacity
6. The rise and rise of biological shadow IT
• Everything we do is underpinned by
having access to compute and
storage capacity
• Joined Cardiff from the Sanger in
2012
• When I arrived at Cardiff, I had the
joy of working out how I was going
to do my research in a new place
• How to get scripts/software
installed
• Where to install scripts/software
• And I had to do this mostly on my
own
• So I built my own system
• Not an unusual story
7. Shadow IT is compounded by the
fact that biologists work in silos
• As a discipline we think in
terms of ‘labs’, ‘groups’ and
‘experiments’ being wet
work
• IT infrastructure is treated
the same way
• This means we develop
bespoke, local solutions to
informatics problems
• Because our software/data
is locally stored/setup – it is
often less portable than
wet lab methods /
approaches
8. Illustrating the size of the challenge
[Figure: global transmission in three waves (Wave 1, Wave 2, Wave 3), with dated lineage movements spanning the 1930s to the late 2000s]
Mutreja, Kim, Thomson, Connor et al, Nature, 2011
9. Illustrating the size of the challenge
320 samples; approx. 600-700GB of uncompressed data
• Sequence assembly: 1 CPU core and 4-8GB RAM per job; intermediate files of ~6GB per job; runtime 1+ hours/job
• Sequence mapping: 320 jobs; 1 CPU core and 4GB RAM per job; intermediate files of ~3GB per job; runtime 1 hour/job
• Phylogenomics: 1 job; 1+ cores and up to 128GB RAM; intermediate files ~2+GB; output file ~2GB; runtime 1-2 days
• Virulence and antimicrobial resistance screening: 320 jobs; single core and 100MB RAM; generates 10-20 small files per job; runtime 5 mins/job
• Bayesian modelling: 3 jobs; 1+ cores and up to 1GB RAM; CPU intensive; output file ~10GB; runtime 2 days/job; can use GPUs; written in Java
Together these workloads span larger-RAM jobs, HTC and HPC (see the sketch below).
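To make the scale concrete, here is a minimal back-of-the-envelope sizing sketch in Python based on the figures above. The 320-job count for assembly and the midpoint runtimes (36 hours for the 1-2 day phylogenomics job, 1 hour where the slide says "1+ hours") are assumptions, not stated values.

    # Rough sizing of the 320-sample analysis described above.
    # Values marked "assumed" are not stated on the slide.
    stages = {
        # name:                    (jobs, hours_per_job, gb_ram_per_job)
        "assembly":                (320, 1.0, 8),     # job count assumed; "1+ hours"
        "mapping":                 (320, 1.0, 4),
        "phylogenomics":           (1, 36.0, 128),    # midpoint of 1-2 days assumed
        "AMR/virulence screening": (320, 5 / 60, 0.1),
        "Bayesian modelling":      (3, 48.0, 1),
    }

    total_core_hours = sum(jobs * hours for jobs, hours, _ in stages.values())
    peak_ram_gb = max(ram for _, _, ram in stages.values())

    print(f"total: ~{total_core_hours:.0f} core-hours")  # ~850 core-hours
    print(f"peak RAM for one job: {peak_ram_gb} GB")     # 128 GB

A single workstation could grind through this serially, but the mix of hundreds of small HTC jobs and one 128GB-RAM job is exactly what makes a one-size-fits-all local system awkward.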
10. At the other end of the scale
The key is that, as microbiologists, we are likely to need a wide variety of systems for a
wide range of workloads
This does not normally fit well with “standard” local systems, as our workloads can
be disruptive or impossible to run
It is also hard to reproduce this across different systems
11. Basic Premise
• Wouldn’t it be great if there
were a single system that
microbiologists could use to
analyse and share data
• Data is more easily shared
when one uses a common
environment
• Software is more easily shared
on a common platform
• A common platform could also
make the hardware required
for complex analyses available
to all, easily
12. Thought process behind the
project
• Custom designed, properly
engineered, institution-wide
systems can work brilliantly
for enabling data and
software sharing
• Works brilliantly at the Sanger
• BUT how many other places
have a critical mass of
microbiologists to justify the
expense of having such a
system?
• The answer is relatively few, so
we thought a shared system
open to all was the logical
solution
13. How to achieve this -
Virtualisation
• Virtualisation is a way of running multiple, distinct
computers (could be desktops, workstations, servers)
on one physical piece of hardware
• Not a new concept, is a mainstay of enterprise
computing
• Is a way of sharing resources
• Is a way for businesses to cut costs by consolidating servers
• Is a way for businesses to increase reliability; these physical
pieces of hardware can be networked and the VMs they run
can be moved around as required
• Also provides a way for businesses to easily deploy and
maintain software
• Virtualisation answers a lot of the questions that are
posed in bioinformatics around reproducibility
14. So wouldn't it be great if…
• A user wants a server; the system spins up a
VM (a self-contained ‘module’ containing an
operating system and software) and slots it
onto one of the physical servers
• Other users from other institutes also want
systems; the system can then load those up
too, sharing a large server between users
• Because these VMs are on a common
system, they are then sharable between
users (see the sketch below)
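In OpenStack terms, "spinning up" a VM like this is a single API call. Below is a minimal sketch using the openstacksdk Python library; the cloud, image, flavour and network names are placeholders for illustration, not CLIMB's actual values.

    import openstack

    # Credentials come from a clouds.yaml entry; "climb" is a placeholder name
    conn = openstack.connect(cloud="climb")

    # Look up a base image, a flavour (VM size) and a network; names are illustrative
    image = conn.compute.find_image("ubuntu-16.04")
    flavor = conn.compute.find_flavor("climb.user")
    network = conn.network.find_network("private")

    # Ask the scheduler to slot a new VM onto one of the physical servers
    server = conn.compute.create_server(
        name="my-analysis-vm",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)  # blocks until ACTIVE
    print(server.status)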
15. Virtualisation and the cloud
• Virtualisation is the central premise
behind systems like AWS
• Underpins services from Google to
Netflix
• Ultimately enables service providers to
share out commodity servers as required
• Means they can sell access to small
slices of servers, and make lots of
money
• The idea behind CLIMB was to build on
this concept to provide a similar service
for the UK Medical Microbiology
community, without the profit, and
designed to meet the needs of
microbiologists
16. Why not use a commercial cloud?
• Bioinformatics workloads often
require lots of RAM, a good
number of CPUs and lots of
storage
• Commercial cloud providers are
not targeting this market, so
prices are very high
• An Amazon storage-optimised VM
with 244GB RAM and 32 vCPU
cores costs ~$3,000 per month
• Some configs are simply not
available
• Storage costs are also high – 1TB on
Amazon S3 costs $30/month (our
current costs are £3/month)
• In future these solutions might be
suitable, but at the moment they
are not cost effective and don’t
really meet the needs of
researchers
17. So - what we said in 2014: CLIMB
Aims
• Create a public/private cloud
for use by UK academics
• Create a set of standardised
cloud images that implement
key pipelines
• Create a storage repository
for data that are made
available online and within
our system, anywhere
(‘eduroam for microbial
genomics’)
• Provide access to other
databases from within the
system
18. 2014: Expected specifications
• 4 sites
• Connected over Janet
• Different sizes of VM available; personal, standard, large memory, huge
memory
• Able to support 1,000 VMs simultaneously
• 4PB of object storage across 4 sites (~2-3PB usable with erasure coding)
• 300TB of local high performance storage per site
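The gap between raw and usable object storage comes from erasure coding, which stores each object as k data chunks plus m parity chunks, so only k/(k+m) of the raw capacity holds data. A minimal sketch of the arithmetic, assuming an 8+3 profile (the slide gives only the outcome, not CLIMB's actual profile):

    # Usable capacity under k+m erasure coding; the 8+3 profile is an assumption
    raw_pb = 4.0
    k, m = 8, 3
    usable_pb = raw_pb * k / (k + m)
    print(f"~{usable_pb:.1f} PB usable from {raw_pb} PB raw")  # ~2.9 PB, within the quoted 2-3PB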
19. Where we are now
• CLIMB aims to become a one-stop-shop
for microbial bioinformatics
• Public/private cloud for use by those
with a .ac.uk, .nhs.uk, .gov.uk email
account
• Standardised cloud images that
implement key pipelines
• Storage repository for data/images
that are made available online and
within our system, anywhere
(‘eduroam for microbial genomics’)
• We will provide access to other
databases from within the system
• As well as providing a place to
support orphan databases and
tools
• Today we will be introducing you to
the first set of VMs on the system,
and how to gain access
• It has been a lot of work, but
hopefully it will be worth it
20. Actual System Outline
• 4 sites
• Connected over Janet
• Different sizes of VM available; personal, standard, large memory, huge memory
• Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM)
• ~7PB of object storage across 4 sites (~2-3PB usable)
• 400-500TB of local high performance storage per site
• A single system, with common log in, and between site data replication*
• System has been designed to enable the addition of extra nodes / Universities
21. CLIMB Overview
• 4 sites, running OpenStack
• Hardware procured in a
two stage process
• IBM/OCF provided
compute, Dell/Red Hat
provided storage
• Networks provided by
Brocade
• Now have a fairly clear
reference architecture that
would enable other nodes
to be added
23. Performance
[Chart: relative performance across ten benchmarks (beast, blastn, gunzip, muscle, nhmmer, phyml, prokka, snippy, velvetg, velveth) and their geometric mean; y-axis from 0 to 3]
• Generally extremely good
• Performance is quite consistent across workloads
• Compares well to both HPC systems and other cloud
systems
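The geometric mean is the usual way to summarise per-benchmark performance ratios, because unlike the arithmetic mean it is not dominated by one unusually fast or slow benchmark. A minimal sketch; the ratios below are invented placeholders, not the measured values from the chart:

    from math import prod

    # Hypothetical performance ratios (higher = faster than the reference system);
    # placeholder values, not CLIMB's measured results
    ratios = {"blastn": 1.1, "velvetg": 0.9, "prokka": 1.0, "snippy": 1.2}

    geo_mean = prod(ratios.values()) ** (1 / len(ratios))
    print(f"geometric mean: {geo_mean:.2f}")  # ~1.04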
24. What this means for you
• CLIMB doesn’t (really) provide small
machines; you have those in your
office, or you can buy them from
Amazon
• Our “personal” servers start at 4 CPU
cores and ~15GB of RAM
• Our “group” servers are 8 CPU cores
and ~60GB of RAM
• Other flavour sizes available on
request
• Means you get free, immediate access
to dedicated hardware to do your
analysis
• For comparison, a workstation with
similar spec to a “personal” server
retails at ~£1k, a server with similar
spec to a group server retails at ~£3k
25. Group quotas
• Each group gets a default allocation:
• RAM: 64GB × 10
• Instances: 10
• Volumes: 20
• Total disk: 10TB
• Total cores: 128
• Up to you to manage the allocation
• Allocation is for whole team
• It can be increased; requests for increases will be
considered by the management group
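These quotas map onto OpenStack's per-project compute and volume quota sets, so a group can check what it has. A sketch using openstacksdk's cloud layer; the cloud and project names are placeholders:

    import openstack

    conn = openstack.connect(cloud="climb")  # placeholder cloud name

    # Quotas are set per project/tenant, i.e. per group
    compute_q = conn.get_compute_quotas("my-project")  # placeholder project name
    volume_q = conn.get_volume_quotas("my-project")

    print("cores:", compute_q.cores)          # e.g. 128
    print("RAM (MB):", compute_q.ram)         # e.g. 655360 = 64GB x 10
    print("instances:", compute_q.instances)  # e.g. 10
    print("volumes:", volume_q.volumes)       # e.g. 20
    print("disk (GB):", volume_q.gigabytes)   # e.g. 10240 = 10TB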
26. System Access
• Following testing by a number of groups, we have developed
an initial service offering
• Adopting two models for access
• Access for registered users to the core OpenStack system via the Horizon
dashboard
• Access via our own launcher called Bryn (Welsh for “hill”)
27. User access
• All users are members of a
project
• Project owners are PIs
• Means a PI has to register
ahead of the group
• PI then chooses people to
invite
• Might seem a pain, but gives
clear lines of responsibility,
and enables us to better track
impact of the system
• Also creates a mechanism for
overseas collaborators to
access the system, and to
ensure that they are
contributing to UK research
outputs
28. System Status
• Warwick is online and you
will be mostly using this
today
• Birmingham is online and
fully available, but due to
a server fire (not CLIMB)
will be taken down this
weekend for some
datacentre maintenance
• Cardiff is online and
available, but hasn’t been
fully stress tested yet
• Swansea is awaiting final
configuration and
integration into Bryn
29. Other parts of the project
• We have a forum
• For anyone with an account, and for anonymous posting
• The forum also has links to tutorials
• CLIMB also has a Twitter account – please tweet us
your successes using CLIMB
• The Google group is dead; use the forum now
• Now that the system is mostly up, we will be
looking to deliver training events. Look out for
these on our website and twitter
30. About Today and the Future
• Today we are introducing the system
• We want you to use the system and we have put time into
setting up tools/images already
• But while CLIMB has/had money for hardware, there is no
real money for software development
• Means that either we need RCUK funds for developing tools
and resources, or we need you to share your software on
our system
• It is a complex system, so there might be a few teething problems
(this is the first time we will have had so many users
hammering the system at the same time), but it will provide
a resource that we expect will be of huge value in future
• In the next couple of months there will be a CLIMB paper
coming out; please cite it when you use the system
31. The CLIMB Consortium Are
• Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) –
Joint PIs
• Professor Mark Achtman (Warwick), Professor Steve Busby FRS
(Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim
Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is
• Dr Nick Loman (Birmingham site lead)*, Dr Chris Quince
(Warwick) and Dr Daniel Falush (Bath); MRC Research Fellows
• Simon Thompson (Birmingham, Project Technical/OpenStack lead),
• Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt
Bull (Cardiff Sysadmin), Radoslaw Poplawski (Birmingham sysadmin)
• Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick
(Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr
Christine Kitchen, Professor Martyn Guest (Cardiff HPC team), Matt
Ismail (Warwick HPC lead)
* Principal bioinformaticians who architected and designed the
system
32. The CLoud Infrastructure for
Microbial Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for microbial
bioinformatics
• ~£4M of hardware, capable
of supporting >1000
individual virtual servers
• Providing a core, national
cyberinfrastructure for
Microbial Bioinformatics
33. OpenStack terminology cheat
sheet
• Instances
• Each instance is a virtual server. The instance can have an external
IP address, and is accessed either via ssh or through the web
• Volumes
• A private disk, running on a large storage system. Volumes can be
treated like disk drives, and can be attached and detached from
running instances
• Snapshots
• Effectively a digital photo of everything in a volume at the point of
snapshotting. Can be used as a basis to create a new volume
containing the original data
• Project / tenant
• All VMs are part of a project/tenant; this is the level at which
quotas apply
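These pieces compose naturally in code. A minimal sketch with openstacksdk showing a volume created, attached to a running instance, and snapshotted; all names and sizes are illustrative, not CLIMB defaults:

    import openstack

    conn = openstack.connect(cloud="climb")  # placeholder cloud name

    # A volume is a private disk on the large shared storage system
    volume = conn.block_storage.create_volume(name="my-data", size=50)  # 50GB
    conn.block_storage.wait_for_status(volume, status="available")

    # Attach it to a running instance, like plugging in a disk drive
    server = conn.compute.find_server("my-analysis-vm")  # placeholder name
    conn.compute.create_volume_attachment(server, volume_id=volume.id)

    # A snapshot is a point-in-time copy that can seed a new volume
    snapshot = conn.block_storage.create_snapshot(volume_id=volume.id, name="my-data-snap")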
34. Parts of OpenStack / CLIMB
• Bryn
• Our interface for registering and spinning up VMs
• Horizon
• The OpenStack control panel (not recommended for anyone but power users)
• Keystone
• The OpenStack identity service
• Nova
• OpenStack compute service
• S3
• Amazon’s storage API
• Cinder
• OpenStack block storage service
• Glance
• OpenStack image service
• Neutron
• The OpenStack Network Service
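Since the object store is listed alongside Amazon's S3 API above, standard S3 clients should work against it. A minimal sketch with boto3, assuming an S3-compatible endpoint; the URL and credentials are placeholders, not CLIMB's real values:

    import boto3

    # Point a standard S3 client at the object store instead of Amazon
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.climb.example.org",  # placeholder endpoint
        aws_access_key_id="ACCESS_KEY",               # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    # Create a bucket, upload a read set, then list what is stored
    s3.create_bucket(Bucket="reads")
    s3.upload_file("sample1.fastq.gz", "reads", "sample1.fastq.gz")
    for obj in s3.list_objects_v2(Bucket="reads").get("Contents", []):
        print(obj["Key"], obj["Size"])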