9. HADOOP PROVISIONING ISSUES
Each cloud provider has a proprietary API
Create images for each provider
Network configuration
Service discovery
Resize, failover, member join support
10. OUR APPROACH – DETAILS
Build your Docker image
Install or pre-install Hadoop services with Ambari
Install Serf and dnsmasq
Build your cloud image
Use Ansible to create an image
Provision the cluster
11. BUILD DOCKER IMAGES
Create the Dockerfile
Have Docker.io build the image
Optionally pre-install services
Use Ambari
Push image to Docker.io
Licensing questions
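The build steps above can be sketched as a minimal Dockerfile; the base image, package names and repository name are illustrative assumptions, not the actual SequenceIQ setup.

```shell
# Generate a minimal Dockerfile for a pre-installed Hadoop image
# (base image and package names are illustrative assumptions)
cat > Dockerfile <<'EOF'
FROM centos:6
# Pre-install the Ambari server and agent
RUN yum install -y ambari-server ambari-agent
# Serf for membership gossip, dnsmasq for container-local DNS
RUN yum install -y serf dnsmasq
EOF

# A trusted build on Docker.io would then be triggered from this file:
#   docker build -t example/ambari-hadoop .
#   docker push example/ambari-hadoop
```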
12. BUILD CLOUD IMAGES
Use a Docker ready base image
Use Ansible to provision the image template
Pull the Docker images
Apply custom infrastructure
Use cloud provider specific playbooks
AWS EC2
Azure
13. ANSIBLE
Configuration as data
Simplest way to automate IT
Secure and agentless
Goal oriented
One playbook – multiple modules
We use it to “burn” cloud images/templates
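The image-burning idea above might look like the following; the playbook contents, host group and pre-pulled image name are assumptions, not the actual SequenceIQ playbooks.

```shell
# Write a small Ansible playbook that prepares a Docker-ready image
# template (host group and image name are illustrative assumptions)
cat > burn-image.yml <<'EOF'
- hosts: image-template
  become: yes
  tasks:
    - name: Install Docker on the template instance
      yum: name=docker state=present
    - name: Start Docker so images can be pre-pulled
      service: name=docker state=started
    - name: Pre-pull the Hadoop container image
      command: docker pull example/ambari-hadoop
EOF

# Agentless: Ansible only needs SSH access to the template instance
command -v ansible-playbook >/dev/null \
  && ansible-playbook -i inventory burn-image.yml \
  || echo "ansible-playbook not installed; playbook written to burn-image.yml"
```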
14. PROVISIONING – ISSUES
FQDN
/etc/hosts is read-only in Docker
Everybody needs to know everybody
DNS
Single point of failure
Dynamic cluster – nodes joining, leaving, failing
Routing
Cloud – ability to route containers between hosts
Collision free private IP range for Docker bridge
15. PROVISIONING – SOLUTION
FQDN
Use the -h and --dns Docker params
DNS
dnsmasq is running on each Docker container
Serf member-xxx events trigger dnsmasq reconfiguration
Routing
Docker bridge configuration – follows a convention
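The FQDN/DNS wiring above can be sketched as follows; the bridge address and hostname are illustrative, not the exact convention from the talk.

```shell
# Sketch of the FQDN/DNS solution described above
# (bridge address and hostname are illustrative assumptions)
DNS_IP=172.17.0.1          # dnsmasq reachable on the Docker bridge of each host
FQDN=amb0.mycorp.com       # fully qualified name handed to the container

# -h writes the FQDN into the container (works around read-only /etc/hosts),
# --dns routes every lookup through the Serf-managed dnsmasq:
CMD="docker run -d -h $FQDN --dns $DNS_IP example/ambari-hadoop"
echo "$CMD"
```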
16. SERF
Gossip based membership
Service discovery
Decentralized
Lightweight, fault tolerant
Highly available
DevOps friendly
Keep an eye on Consul, Open vSwitch, pipework
17. SERF – DECENTRALIZED SERVICE DISCOVERY
Gossip instead of heartbeat
LAN, WAN profiles
Provides membership information
Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
Query
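A member-join handler of the kind described above might look like this sketch; the dnsmasq config path and the record format are assumptions.

```shell
# Minimal sketch of a Serf member-join event handler that regenerates
# dnsmasq records (paths and record format are illustrative assumptions)
cat > member-join.sh <<'EOF'
#!/bin/sh
# Serf pipes one "name address role" line per joining member on stdin
while read name address role; do
  echo "address=/${name}/${address}" >> /etc/dnsmasq.d/hosts.conf
done
# Reload dnsmasq so the new members resolve immediately:
#   service dnsmasq restart
EOF
chmod +x member-join.sh

# The agent would be started with the handler attached, e.g.:
#   serf agent -event-handler "member-join=./member-join.sh"
```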
21. AWS EC2 – HADOOP CLUSTER
Use EC2 REST API to provision instances (from Dockerized image)
Start Docker containers
One Ambari server
N-1 Ambari agents connecting to server
Connect ambari-shell to the Ambari server
Define blueprint
Provision the cluster
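The steps above can be sketched end to end; the AMI id, instance count, image name and ambari-shell commands are placeholders, not the exact ones used.

```shell
# Write the user-data script each EC2 instance runs at boot
# (image name, FQDN and DNS address are illustrative assumptions)
cat > start-containers.sh <<'EOF'
#!/bin/bash
# First launch-index node runs the Ambari server container, the
# remaining N-1 run agents; Serf wires them together at startup.
docker run -d -h amb0.mycorp.com --dns 172.17.0.1 example/ambari-hadoop
EOF

# Provision N instances from the Dockerized image with that user-data
command -v aws >/dev/null && aws ec2 run-instances \
    --image-id ami-xxxxxxxx --count 4 --user-data file://start-containers.sh \
  || echo "aws CLI not available; skipping instance provisioning"

# Then ambari-shell posts the blueprint and builds the cluster, e.g.:
#   blueprint add --file multi-node.json
#   cluster build --blueprint multi-node
#   cluster create
```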
23. AWS EC2 - CLOUDFORMATION
Manually setting up a VPC is too complicated
Use CloudFormation
Manage the stack together
Template-based
Environments under version control
Customizable at runtime
No extra charge
"VpcId" : {
"Type" : "String",
"Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
},
"SubnetId" : {
"Type" : "String",
"Description" : "SubnetId of an existing subnet (for the primary
network) in your Virtual Private Cloud (VPC)"
},
"SecondaryIPAddressCount" : {
"Type" : "Number",
"Default" : "1",
"MinValue" : "1",
"MaxValue" : "5",
"Description" : "Number of secondary IP addresses to assign to the
network interface (1-5)",
"ConstraintDescription": "must be a number from 1 to 5."
},
"SSHLocation" : {
"Description" : "The IP address range that can be used to SSH to the
EC2 instances",
"Type": "String",
"MinLength": "9",
"MaxLength": "18",
"Default": "0.0.0.0/0",
"AllowedPattern": "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})/
(d{1,2})",
"ConstraintDescription": "must be a valid IP CIDR range of the form
x.x.x.x/x."
}
},
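A template with the parameters above would typically be launched from the AWS CLI; the stack name, template file and parameter values here are illustrative.

```shell
# Launch the CloudFormation stack from the template sketched above
# (stack name, template file and parameter values are illustrative)
STACK_NAME=hadoop-cluster

command -v aws >/dev/null && aws cloudformation create-stack \
    --stack-name "$STACK_NAME" \
    --template-body file://hadoop-vpc.json \
    --parameters ParameterKey=SSHLocation,ParameterValue=10.0.0.0/16 \
  || echo "aws CLI not available; skipping stack creation for $STACK_NAME"
```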
24. CLOUDBREAK
Cloudbreak is a powerful left surf break over a coral reef, a mile off the southwest of the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts provisioning and eases the management and monitoring of on-demand clusters.
Provisioning Hadoop has never been easier
25. CLOUDBREAK
Benefits
Elastic
Scalable
Blueprints
Flexible
Main REST resources
/template – specify a cluster infrastructure
/stack – creates a cloud infrastructure built from a template
/blueprint – describes a Hadoop cluster
/cluster – creates a Hadoop cluster
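The four REST resources above could be exercised roughly as follows; the host, payloads and field names are assumptions, not the exact Cloudbreak API.

```shell
# Hypothetical walk through the four Cloudbreak REST resources
# (host, payloads and field names are illustrative assumptions)
CB=https://cloudbreak.example.com

# With curl each call would be:
#   curl -X POST -H 'Content-Type: application/json' -d "$body" "$CB/template"
post() { echo "POST $CB$1 <- $2"; }   # stand-in that just prints the request

post /template  '{"cloud":"EC2","nodeCount":4}'        # cluster infrastructure
post /stack     '{"template":"t1"}'                    # cloud infrastructure
post /blueprint '{"name":"multi-node-hdfs-yarn"}'      # Hadoop cluster layout
post /cluster   '{"blueprint":"multi-node-hdfs-yarn"}' # create the cluster
```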
26. RESULTS AND ACHIEVEMENTS
Hadoop as a Service API
Available for EC2 and Azure cloud
OpenStack and bare metal support are coming soon
Open source under the Apache 2 license
Same goals as Apache Ambari Launchpad project
What's next?
27. HADOOP SERVICES - AS A SERVICE
Leverage YARN
Slider (Hoya) providers
HBase, Accumulo
SequenceIQ providers - Flume, Tomcat
YARN-1964
QoS for YARN – heuristic scheduler
Platform as a Service API
28. BANZAI PIPELINE
Banzai Pipeline is a surf reef break located
in Hawaii, off Ehukai Beach Park in
Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful
application development
platform for building on-
demand data and job pipelines
running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
29. THANK YOU
Get the code: https://github.com/sequenceiq
Read about: http://blog.sequenceiq.com
Facebook: http://facebook.com/sequenceiq
Twitter: http://twitter.com/sequenceiq
LinkedIn: http://linkedin.com/sequenceiq
Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE
Editor's notes
Thanks for coming – today we will talk about Docker-based Hadoop provisioning.
Quick introduction of who we are - Young startup, from Budapest, Hungary. Janos Matyas – CTO, open source contributor, Hadoop YARN evangelist.
Why we have started this at all – there are so many options.
We repeated the same steps over and over – and scripted them. Still, we felt that something was missing.
See bullet points
Been through many different approaches. Bare metal, cloud VM, so on – ended up using Docker.
Tested many provisioning frameworks – Ambari is the one.
Quick question – how many of you have used Docker before?
Docker is a container based virtualization framework. Unlike traditional virtualization Docker is fast, lightweight and easy to use. Docker allows you to create containers holding all the dependencies for an application. Each container is kept isolated from any other, and nothing gets shared.
I can run 5-6 containers – less overhead than 1 virtualbox. No SOCKS proxy, etc.
The ‘provisioning’ framework. No need to enter details, there were pretty good sessions about Ambari.
Blueprints 1.5.1 tech preview, 1.6 fully supported. Blueprint = stack definition + component layout.
REST API – we have created, open sourced Ambari client + shell (come and join the Ambari Meetup today at 3:30)
Now, the issues.
Do it again and again – for each cloud provider.
Create the image – but how do you know what’s the requirement, building an image each and every time?
Network – this is a big issue. EC2 has its API, Azure its own. OpenStack has a network-as-a-service component – Neutron. SDN – software-defined networking!!!
Everything is dynamic – how do you do service discovery?
Extra features – fully dynamic Hadoop cluster.
Will expand on these shortly.
Sounds too easy – lets get into details.
A Docker image is described by a Dockerfile – like a Vagrant file for virtualbox for example.
You want trusted build – use Docker.io
Faster provisioning – a 100+ node Hadoop cluster in less than 5 minutes? Come and join the Ambari meetup.
Licensing – Ganglia or Nagios (BSD and GPL); Hortonworks Hadoop – Apache 2.
Bigtop is coming…
Amazon Linux – Red Hat based – is now Docker-ready. OpenStack's Nova hypervisor supports Docker.
Apply the network and other infrastructure-related configuration.
Remember the licensing – use our Ansible script to build your cloud image, or modify it.
IT automation war – Ansible vs. Chef, Puppet.
Ansible configurations are simple data descriptions of your infrastructure (both human-readable and machine-parsable).
Needs only SSH.
Dev environment: use the default Docker bridge (easy).
Everything talks to everything else.
DNS – heavy management overhead.
-h for hostname, --dns to specify the DNS service to use
Convention: AMI launch index
Serf is a decentralized solution for cluster membership, failure detection and orchestration.
Serf vs. ZooKeeper, etcd, doozerd. The latter three have server nodes that require a quorum of nodes to operate – strong consistency.
Serf - eventual consistency
Most important thing is that gossip based – will expand shortly.
Decentralized – all nodes are equal.
Fire and forget
Waits for answer – limited response collection.
Custom event handlers
Tags – e.g. Ambari server, hostgroups, etc
Load increases – how does the cluster know there is a new member?
Running on each Docker container – updated by Serf events.
Amazon supports Docker natively.
Start N number of nodes. Pass our user-data script at startup.
Start the containers – they will know about each other using Serf.
Shell or REST API or Ambari UI.
You need security – strongly recommended use your VPC instead of default VPC.
Use different availability zones for maximum uptime.
Anyone who has set up a VPC knows – it can be scripted. It is harder to decommission / change / delete components than to add them.
Use CloudFormation.
This is an easy but still error-prone process – though it helps a lot.
We build an API on top, and automated the whole process.
We are not a Service Provider – this is an API.
Elastic – arbitrary number of nodes.
Scalable – follow your workload change.
Blueprints – supports different cluster blueprints
Flexible – Use your favorite cloud, bring your own Hadoop – one common API
One API – any size, anywhere.
Why we needed Cloudbreak – this is not the end of the story.
We wanted to have a Platform as a Service API.
We are YARN evangelists – wanted to run everything on YARN.
Community driven.
Heuristic scheduler.
A fully dynamic big data pipeline.
Build your pipeline, run dynamically / on demand. All pre-coded, zero coding, only configuration.
Data pipeline – run services on demand, short or long term. Started when needed, stopped when idle. Apply ETL on demand.
Job pipeline – all major ML libraries are supported (Mahout, MLlib), plus 44 other MR jobs (correlations, joins, summarizations, filtering, sort, sharding, shuffle)
Streaming pipeline – Spark based
Custom SDK – abstracts the complexity behind MR and Spark.
Subscribe to the Beta test.
Contribute.
We did contributions on several Apache and other open source projects.
A Babel of languages at SequenceIQ: Java and Scala are the default, and Groovy is used very often. Then Go – Docker + Serf – we had to learn Go to fix things. Ansible for IT automation.
We strongly suggest using Docker – we use it everywhere. CI/CD, cloud.
For a demo come and join the Ambari meetup.
Thanks for coming. Q&A. Join me after, or follow us through one of the social media channels listed.