1. Big Data Benchmarking with RDMA solutions
Oracle Open World 2013
© 2013 Mellanox Technologies
2. Leading Supplier of End-to-End Interconnect Solutions
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
• Host/Fabric Software
• ICs
• Switches/Gateways
• Adapter Cards
• Cables
[Figure: Virtual Protocol Interconnect. Tiers shown: Server / Compute, Switch / Gateway, Storage Front / Back-End. Link labels: 56G InfiniBand, 10/40/56GbE, 56G IB & FCoIB, 10/40/56GbE & FCoE, Fibre Channel]
3. Hadoop Framework
A scalable, fault-tolerant distributed system for data storage and processing
Hadoop has two main systems
• Hadoop Distributed File System (HDFS™): self-healing, high-bandwidth clustered storage.
• MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
Key values
• Flexibility – Store any data, run any analysis.
• Scalability – Start at 1 TB on 3 nodes; grow to petabytes on thousands of nodes.
• Economics – Cost per TB at a fraction of traditional options.
[Figure: Hadoop stack. Components shown: Hive, Pig, MapReduce, HBase, HDFS™ (Hadoop Distributed File System), disks]
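The "data programming abstraction" that MapReduce offers can be illustrated with a small in-memory sketch. This is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, run_job) are hypothetical, and everything runs in one process, whereas a real cluster spreads the same three phases across nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Run the user-supplied mapper over every input record."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all values by key, as happens between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Run the user-supplied reducer once per key, in sorted key order."""
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def word_count_mapper(line: str) -> Iterator[Pair]:
    """Emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: List[int]) -> int:
    """Sum the per-occurrence counts for one word."""
    return sum(counts)


def run_job(records, mapper, reducer) -> Dict[str, int]:
    """Chain map, shuffle, and reduce into one job."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


if __name__ == "__main__":
    lines = ["big data needs big clusters", "data moves over the network"]
    print(run_job(lines, word_count_mapper, word_count_reducer))
```

The user writes only the mapper and the reducer. The shuffle in the middle is the step that moves data between nodes on a real cluster, which is why the interconnect matters for job time.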
4. Three Areas for Acceleration
Data Analytics
• Explore inefficiencies in existing analytics frameworks and systems
• Accelerate data processing to deliver faster results
Storage
• Explore ways to refine the dominant file system
• Take advantage of direct-attached disks to accelerate data access
Distributed Storage
• Leverage popular distributed storage systems with Big Data applications
• Adapt existing storage systems for use with Big Data frameworks
5. I/O Offload Frees Up CPU for Application Processing
[Figure: CPU time split between user space and system space]
• Without RDMA: ~53% CPU utilization, ~47% CPU overhead/idle
• With RDMA and offload: ~88% CPU utilization, ~12% CPU overhead/idle
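As a rough worked example of the figures on this slide (~53% CPU utilization without RDMA, ~88% with RDMA and offload), the sketch below computes how much CPU is freed for the application. The function name offload_gain and its field names are hypothetical; the inputs are the approximate slide values, not measurements.

```python
def offload_gain(util_without_pct: float, util_with_pct: float) -> dict:
    """Compare the application's CPU share with and without offload (percent)."""
    return {
        # Percentage points of CPU returned to the application.
        "freed_points": util_with_pct - util_without_pct,
        # Ratio of application CPU share with offload to the share without.
        "relative_gain": util_with_pct / util_without_pct,
        # Whatever is not utilization is overhead or idle time.
        "overhead_without_pct": 100.0 - util_without_pct,
        "overhead_with_pct": 100.0 - util_with_pct,
    }


if __name__ == "__main__":
    gain = offload_gain(53.0, 88.0)
    print(f"CPU freed for the application: {gain['freed_points']:.0f} points")
    print(f"Relative gain: {gain['relative_gain']:.2f}x")
```

With the slide's approximate values, the application gains about 35 percentage points of CPU, roughly 1.66 times its share without offload.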
6. Unstructured Data Accelerator - UDA
Plug-in architecture
• Open-source, latest GA version 3.1 (6/10/2013)
• Google Code repository: https://code.google.com/p/uda-plugin/
Accelerates MapReduce Jobs
• Accelerated merge sort
Efficient Shuffle Provider
• Data transfer over RDMA
• Supports InfiniBand and Ethernet
Supported Hadoop Distributions
• Apache Hadoop 3.0 – in the main trunk!
• Apache Hadoop 2.0.3 – in the main trunk!
• Apache Hadoop 1.0.x; 1.1.x
• Cloudera Distribution Hadoop 3 & 4
• Hortonworks HDP 1.x
• GPHD 1.2
Supported Hardware
• ConnectX®-3 VPI
• SwitchX-2 based systems
[Figure: Hadoop stack. Components shown: Hive, Pig, MapReduce, HBase, HDFS™ (Hadoop Distributed File System), disks]
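For readers unfamiliar with the merge step that UDA accelerates: on the reduce side, sorted segments of map output are merged into a single key-ordered stream before reducing. The sketch below shows that k-way merge using only the Python standard library. It illustrates the general technique only. The slides do not describe UDA's internal implementation, and the names merge_sorted_segments and group_values are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int]


def merge_sorted_segments(segments: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """K-way merge of segments that are each already sorted by key."""
    # heapq.merge is lazy: it holds one record per segment at a time.
    return heapq.merge(*segments, key=itemgetter(0))


def group_values(merged: Iterable[Record]) -> Iterator[Tuple[str, List[int]]]:
    """Collect consecutive records that share a key, ready for a reducer."""
    for key, records in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in records]


if __name__ == "__main__":
    # Two map outputs, each sorted by key (made-up sample data).
    segment_a = [("apple", 1), ("cherry", 1)]
    segment_b = [("banana", 1), ("cherry", 2)]
    for key, values in group_values(merge_sorted_segments([segment_a, segment_b])):
        print(key, values)
```

Because the merge consumes segments as they arrive over the network, how fast the shuffle delivers data bounds how fast the merge can run.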
7. Double MapReduce Performance with UDA
2X Faster Job Completion! Increase the Value of Data!
[Figure: TeraSort* results, Vanilla vs. UDA**. Callouts: ~50% Disk Access; CPU Efficiency 2.5X; 54%]
*TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster.
**1TB data set; 20x dual X5670 (Westmere) machines; 10x HDD base; Vanilla: GPHD 1.2; UDA: GPHD 1.2 + UDA.
8. HDFS Acceleration; Joint Project With Ohio State University
HDFS is the Hadoop Distributed File System
• The underlying file system for HBase and other NoSQL databases
More Drives Need Higher Throughput
SSD Solutions Must Use Higher Throughput
• Bounded by 1GbE and 10GbE
[Figure: Hadoop stack. Components shown: Hive, Pig, MapReduce, HBase, HDFS™ (Hadoop Distributed File System), disks]
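The claim that drive throughput is "bounded by 1GbE and 10GbE" can be checked with simple arithmetic: compare the aggregate sequential throughput of a node's drives against the nominal line rate of its network link. The sketch below does this. The per-drive figure of 150 MB/s is an assumption chosen for illustration, not a number from the slides, and the link rates are nominal values that ignore protocol overhead.

```python
# Nominal link rates in Gb/s; real payload throughput is somewhat lower.
NOMINAL_LINK_GBPS = {"1GbE": 1, "10GbE": 10, "40GbE": 40, "56G InfiniBand": 56}


def link_mb_per_s(gbps: float) -> float:
    """Convert a nominal line rate in Gb/s to MB/s (1 Gb/s = 125 MB/s)."""
    return gbps * 1000 / 8


def bottleneck(drives: int, per_drive_mb_s: float, link_gbps: float) -> str:
    """Return 'network' if the drives can outrun the link, else 'storage'."""
    storage_mb_s = drives * per_drive_mb_s
    return "network" if storage_mb_s > link_mb_per_s(link_gbps) else "storage"


if __name__ == "__main__":
    drives, per_drive = 5, 150.0  # per-drive throughput is an assumed value
    for name, gbps in NOMINAL_LINK_GBPS.items():
        print(f"{name}: {link_mb_per_s(gbps):.0f} MB/s link, "
              f"bottleneck = {bottleneck(drives, per_drive, gbps)}")
```

Under these assumptions, five drives deliver about 750 MB/s in aggregate, far more than the 125 MB/s a 1GbE link can carry. Faster drives such as SSDs push the crossover point toward 10GbE and beyond.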
9. Unlocking the Power of SSDs in a Hadoop Environment
SSDs Become the De Facto Standard in HDFS Deployments
• Read capability is a critical factor for application performance
E-DFSIO, part of Intel’s HiBench test suite, profiles aggregated throughput on the cluster
• A 1GbE network impedes any performance benefit from SSD deployment
E-DFSIO, Showing the Power of SSD @ HDFS
11. Cloudera Certified – CDH3 and CDH4
Mellanox VPI Card
• MCX354A-FCBT
Mellanox Edge Switches
• MSX10xx; MSX60xx
12. Simple Building Block for Big Data Solution
E5-26x0 (Sandy Bridge) Machines
• Dual socket
• 4+ cores per socket
• 32GB+ of DRAM
Disk Drives
• At least 5 x 1TB, SAS, 10K RPM
Hadoop Configuration
• At least one Name Node + Job Tracker
• At least 4 Data Nodes
Installation
• Your selection of Hadoop distribution or other Big Data solution (such as Cassandra)
Networking
• ConnectX-3 VPI card: FDR, 40GbE and 10GbE
• SwitchX based systems: MSX6036F, MSX1036B and MSX1016
• Mellanox’s FDR, 40GbE and 10GbE cable solutions
http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Hadoop.pdf
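The minimum building block on this slide can be written down as a checklist in code. The sketch below is one way to do that: the ClusterSpec fields and the unmet_requirements function are hypothetical names, and only the thresholds (dual socket, 4+ cores per socket, 32GB+ DRAM, at least 5 drives, one Name Node, 4 Data Nodes) come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSpec:
    """A planned cluster, described with illustrative field names."""
    sockets_per_node: int
    cores_per_socket: int
    dram_gb_per_node: int
    drives_per_node: int
    name_nodes: int
    data_nodes: int


def unmet_requirements(spec: ClusterSpec) -> List[str]:
    """List the slide's minimum requirements that the spec does not meet."""
    checks = [
        (spec.sockets_per_node == 2, "dual-socket machines"),
        (spec.cores_per_socket >= 4, "4+ cores per socket"),
        (spec.dram_gb_per_node >= 32, "32GB+ of DRAM"),
        (spec.drives_per_node >= 5, "at least 5 drives per node"),
        (spec.name_nodes >= 1, "at least one Name Node + Job Tracker"),
        (spec.data_nodes >= 4, "at least 4 Data Nodes"),
    ]
    return [label for ok, label in checks if not ok]


if __name__ == "__main__":
    planned = ClusterSpec(sockets_per_node=2, cores_per_socket=8,
                          dram_gb_per_node=64, drives_per_node=5,
                          name_nodes=1, data_nodes=4)
    missing = unmet_requirements(planned)
    print("meets the building block" if not missing else f"missing: {missing}")
```

Drive type, adapter, and switch models are left out of the check because they are part numbers rather than thresholds.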
13. EMC 1000-Node Analytic Platform
Accelerates Industry's Hadoop Development
24 petabytes of physical storage
• Half of every written word since the inception of mankind
Mellanox VPI Solutions
Test Drive Your Big Data
2X Faster Hadoop Job Run-Time
Hadoop Acceleration
High Throughput, Low Latency, RDMA Critical for ROI
Editor's Notes
HDFS is the underlying file system for Hadoop. We have a project ongoing with OSU – stay tuned for the availability schedule.
Test configuration: 5 nodes, Apache Hadoop 1.1.2, E5-2670, 64GB DRAM. 1 Name Node, 4 Data Nodes. HDDs: 5x 1TB, 10K, per node. SSDs: 2x 960MB, PCIe Gen II x4. HiBench 2.2 test suite.
Hadoop Filesystem Agnostic API
A recipe for building a big data solution is available on the Mellanox web site. Everything is there: components, scripts, tradeoffs – USE IT, it works.
Ask your customers to log in to the AWB. It is for them to use and try; it is deployed over a Mellanox e2e FDR network utilizing UDA and UFM.