2. Abstract - what you hoped to hear?
Enterprise data centers house numerous workloads. With Hadoop growing in
these data centers, IT departments need tools to avoid creating silos, while
maintaining SLAs, reporting and chargeback requirements. We present a
completely open source reference architecture including Apache Hadoop,
Linux cgroups and namespace isolation, Gluster and HTCondor. Topics to
be covered -
• Augmenting existing HDFS and MapReduce infrastructure with
dynamically provisioned resources
• Creating, growing and shrinking MapReduce infrastructure on demand for
user workloads
• Isolating workloads to enable multi-tenant access to resources
• Publishing of resource utilization and accounting information for ingest into
chargeback systems
3. Agenda
• Use cases
• High-level architecture diagram
• Demonstrations
• cgroups and namespaces
• Lessons learned
Feel free to ask questions along the way
4. Use cases
1. Augmenting infrastructure by elastically
extending Hadoop clusters
2. User self-service clusters
3. Consolidating many small clusters onto
hardware with existing services
a. Managing, upgrading, or testing multiple Hadoop
versions on shared infrastructure
6. Scheduler
• SLA
o Quota, Requirements (won't run w/o), Rank (order),
global and local limits (won't exceed)
• Reporting
o Resource usage by time & group / user
o Audit log
• Performance
o Requirements - minimum physical resources
o Local limits - available spindle or co-processor or
railgun
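As a rough illustration of how quota, Requirements and Rank interact, the sketch below refuses to exceed a group quota, filters machines with a hard requirements test, then picks the highest-ranked survivor. Every name here (`Machine`, `Job`, `match`, the attribute keys) is hypothetical and is not HTCondor's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Machine:
    name: str
    attrs: Dict[str, float]  # e.g. {"cpus": 8, "memory_mb": 16384}


@dataclass
class Job:
    group: str
    requirements: Callable[[Machine], bool]  # hard: won't run without these
    rank: Callable[[Machine], float]         # soft: higher is preferred


def match(job: Job, machines: List[Machine],
          usage: Dict[str, int], quota: Dict[str, int]) -> Optional[Machine]:
    """Return the best machine for the job, or None if nothing is allowed."""
    # Limit: a group never exceeds its quota of running jobs.
    if usage.get(job.group, 0) >= quota.get(job.group, 0):
        return None
    # Requirements: drop every machine that fails the hard test.
    candidates = [m for m in machines if job.requirements(m)]
    if not candidates:
        return None
    # Rank: order the machines that qualify and take the best one.
    return max(candidates, key=job.rank)


machines = [
    Machine("node1", {"cpus": 4, "memory_mb": 8192}),
    Machine("node2", {"cpus": 16, "memory_mb": 65536}),
    Machine("node3", {"cpus": 8, "memory_mb": 4096}),
]
job = Job(
    group="analytics",
    requirements=lambda m: m.attrs["memory_mb"] >= 8192,
    rank=lambda m: m.attrs["cpus"],
)
best = match(job, machines, usage={"analytics": 1}, quota={"analytics": 2})
blocked = match(job, machines, usage={"analytics": 2}, quota={"analytics": 2})
print(best.name if best else None, blocked)
```

Keeping requirements and rank as separate callables mirrors the split on the slide: one decides whether a job can run at all, the other only decides where it runs first.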
7. System
• SLA
o cgroups (memory, cpu, cpuacct, blkio)
• Isolation
o namespaces
o virtualization
• Reporting
o Resource usage per process and group
• Performance
o cpuset, numactl, numad
9. Use case one - augmenting infra.
[Diagram. Labels: NameNode, JobTracker, DataNode, TaskTracker, Gluster FS, Scheduler. Key entries include Infra. Service, Dynamic Service, Machine and Cluster.]
10. Use case two - self-service cluster
[Diagram. Labels: NameNode, JobTracker, DataNode, TaskTracker, Gluster FS, Scheduler. Key entries include Infra. Service, Dynamic Service, Machine and Cluster.]
11. Control Groups (cgroups)
• https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
• Why not virtualization?
o Virt is not for SLA, it's for isolation
• All processes must be in a group: use
systemd or roll your own
o Keep a close eye on systemd changes
o http://lwn.net/Articles/555920/
• Group depth and width
• Shares / weights are relative, not percentages
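A small worked example of why a share is not a percentage: what a group receives depends on which sibling groups are competing at that moment. This is a simplified model that assumes a single level of sibling groups under full CPU contention; the group names and share values are made up for illustration.

```python
from typing import Dict, Set


def cpu_fractions(shares: Dict[str, int], busy: Set[str]) -> Dict[str, float]:
    """Fraction of CPU each sibling group gets when only `busy` groups compete."""
    total = sum(shares[g] for g in busy)
    # Idle groups consume nothing, so their share is split among the busy ones.
    return {g: (shares[g] / total if g in busy else 0.0) for g in shares}


shares = {"hadoop": 1024, "web": 512, "batch": 512}

# All three groups busy: hadoop holds half of the total weight.
all_busy = cpu_fractions(shares, {"hadoop", "web", "batch"})

# batch goes idle: hadoop's slice grows although its share value is unchanged.
two_busy = cpu_fractions(shares, {"hadoop", "web"})

print(all_busy)
print(two_busy)
```

The same share value of 1024 yields one half of the CPU in the first case and two thirds in the second, which is why shares cannot be read as fixed percentages.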
12. Namespaces
• Mount
• PID
• Network
• Others
o UTS - uname() - nodename and domainname
o IPC - SysV IPC and POSIX message queues
o User
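One way to see which namespaces a process lives in is to read the symlinks under `/proc/<pid>/ns`, which Linux exposes as targets such as `uts:[4026531838]`. The sketch below is a minimal helper built on that assumption; the function names and the `proc_root` parameter are hypothetical and exist only to make the example testable.

```python
import os
from typing import Dict


def read_namespaces(pid: str = "self", proc_root: str = "/proc") -> Dict[str, str]:
    """Map each namespace type (e.g. 'mnt', 'pid', 'net') to its identifier."""
    ns_dir = os.path.join(proc_root, pid, "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each entry is a symlink; its target names the namespace instance.
        result[name] = os.readlink(os.path.join(ns_dir, name))
    return result


def shared_namespaces(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, bool]:
    """For each namespace type both processes report, say whether it is shared."""
    return {name: a[name] == b[name] for name in a if name in b}


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for ns_type, ident in read_namespaces().items():
        print(ns_type, ident)
```

Two processes are isolated from each other in a given namespace type exactly when their identifiers differ, so comparing the output for a tenant's process against the host's shows which isolation is in effect.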
13. Lessons learned
• Play nice
• Be flexible
• Cleanup is important and hard
• Resource tracking is hard
14. Play nice
• Don't assume you are the only scheduler on
the system, don't claim ownership of nodes,
cohabitate
• System integration helps (cgroups &
systemd)
15. Be flexible
• Use extensible data structures
o Obvious: CPU, Memory, Disk, Network
o Less obvious: GPU, co-processor, cache, spindle,
running services, licensing
• Might end up with an expression language to
evaluate policy
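A minimal sketch of the idea, assuming resources are described by an open-ended attribute dictionary and policy is a small expression over those attributes. The grammar here is a restricted subset of Python parsed with the standard `ast` module; the `evaluate` function and the attribute names are hypothetical, not any scheduler's real policy language.

```python
import ast
import operator
from typing import Any, Dict

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_CMP_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def evaluate(expr: str, attrs: Dict[str, Any]) -> Any:
    """Evaluate a small policy expression against a resource's attributes."""
    def walk(node: ast.AST) -> Any:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            # Unknown attributes become None, so policies written before a
            # new resource type existed still evaluate.
            return attrs.get(node.id)
        if isinstance(node, ast.BoolOp):
            # Generators keep the usual short-circuit behaviour.
            if isinstance(node.op, ast.And):
                return all(walk(v) for v in node.values)
            return any(walk(v) for v in node.values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            left, right = walk(node.left), walk(node.comparators[0])
            if left is None or right is None:
                return False  # comparing against a missing attribute fails
            return _CMP_OPS[type(node.ops[0])](left, right)
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


machine = {"cpus": 16, "memory_mb": 65536, "gpus": 2, "licenses": 0}
ok = evaluate("cpus >= 8 and gpus >= 1", machine)
no = evaluate("licenses > 0 or spindles >= 4", machine)  # spindles is unknown
print(ok, no)
```

Because attributes live in a plain dictionary, adding a less obvious resource such as a co-processor or a license count needs no schema change, only a new key and policies that mention it.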
16. Cleanup is important and hard
• You need to deallocate the resources you
allocate
o Kill all processes you spawned
o Clean up disk space you used
• Tracking processes used to be hard
o Processes can escape your watchful eye
o By uid / gid, by env cookies
o Now cgroups
• Tracking disk usage used to be nearly
impossible (inefficient)
o Now mount namespaces
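The cgroup approach to process cleanup can be sketched as follows: every process in a group is listed in that group's `cgroup.procs` file, so nothing spawned inside the group can escape the list. The helper names, the retry count and the delay are assumptions for illustration; the `kill` parameter exists only so the function can be exercised without signalling real processes.

```python
import os
import signal
import time
from typing import Callable, List


def pids_in_cgroup(cgroup_dir: str) -> List[int]:
    """Return the PIDs currently listed in the group's cgroup.procs file."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def kill_cgroup(cgroup_dir: str,
                kill: Callable[[int, int], None] = os.kill,
                max_passes: int = 10,
                delay: float = 0.1) -> bool:
    """Send SIGKILL to every process in the group until the group is empty.

    Returns True if the group ended up empty. The list is re-read on each
    pass to catch children forked between the read and the kill.
    """
    for _ in range(max_passes):
        pids = pids_in_cgroup(cgroup_dir)
        if not pids:
            return True
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited on its own
        time.sleep(delay)  # give the kernel a moment to reap the processes
    return not pids_in_cgroup(cgroup_dir)
```

In use, `cgroup_dir` would be the directory of the group created for a workload under the cgroup mount point, which varies by system and is not specified in the slides.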
17. Resource tracking is hard
• Similar to resource cleanup
• Keeping track of resources meant walking /proc and merging with getrusage()
• Far easier with cgroups
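With cgroups, per-group accounting reduces to reading a few files instead of walking /proc. The sketch below assumes the cgroup v1 file names `cpuacct.usage`, `memory.usage_in_bytes` and `memory.max_usage_in_bytes`; the function name, the record keys and the two directory parameters are hypothetical, since the mount layout differs between systems.

```python
import os
from typing import Dict


def _read_int(path: str) -> int:
    """Read a file that contains a single integer."""
    with open(path) as f:
        return int(f.read().strip())


def group_usage(cpuacct_dir: str, memory_dir: str) -> Dict[str, float]:
    """Collect one accounting record for a group (cgroup v1 file names)."""
    return {
        # cpuacct.usage is cumulative CPU time in nanoseconds.
        "cpu_seconds": _read_int(os.path.join(cpuacct_dir, "cpuacct.usage")) / 1e9,
        # Current and peak memory charged to the group, in bytes.
        "memory_bytes": _read_int(
            os.path.join(memory_dir, "memory.usage_in_bytes")),
        "memory_peak_bytes": _read_int(
            os.path.join(memory_dir, "memory.max_usage_in_bytes")),
    }
```

Sampling such a record periodically per user or group, and once more just before the group is removed, gives the usage data a charge-back system can ingest.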