Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may
be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2010, Intel Corporation.
Optimizing Hadoop* Workloads
Nurcan Coskun, Ph.D.
Intel Software & Solutions Group
October 12, 2010
Acknowledgements to Jason Dai, Intel SSG, for
of the test results and optimization techniques
Legal Disclaimers
Disclaimers & Legal Notices
THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD
NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR
LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE
PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY
EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO
FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR
NONINFRINGEMENT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate
performance of Intel products as measured by those tests. Any difference in system hardware or software design or
configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance
of systems or components they are considering purchasing. For more information on performance tests and on the
performance of Intel products, visit Intel Performance Benchmark Limitations
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR
IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT
AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY
WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY,
OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED
IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE
FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the
absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future
definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The
information here is subject to change without notice. Do not finalize a design with this information. The products described in
this document may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to
obtain the latest specifications and before placing your product order. Copies of documents which have an order number and
are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's
Web Site http://www.intel.com/.
Why Optimize Hadoop Deployments?
Handle more data
At lower cost
In less time
With less power
Workload traits drive optimization approach
Where to Optimize?
• Hardware: equipment, settings
• Hadoop / HDFS: version, settings
• Software: OS, JVM, settings
Server Considerations
• 2-socket systems with Intel® Xeon® Processor 5600 series: sweet spot for performance, efficiency, cost
• 12-24 GB DDR3: CPU-intense workloads or HBase may require more
• 4-6 1 TB SATA HDDs, 7200 RPM: pure I/O workloads may require more
• 1-2 x 1 Gigabit Ethernet: channel bonding for increased throughput
• Energy-efficient components: Gold certified power supplies, efficient fans, low power
Processor Choice Matters
Faster
Handles More Data
More Energy Efficient
Processor Choice Impacts Speed
Data Source: Intel internal measurements. Hadoop 0.19.1 results as of September 20, 2009 and Hadoop 0.20.2 results as of August 8, 2010.
Hardware configurations are on slide 21. Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
may affect actual performance.
Last year: up to 36% faster
This year: up to 29% faster
Processor Choice Impacts Throughput
• Throughput = number of tasks completed per minute when the cluster is at 100% utilization
• Intel Xeon processor 5600 provides up to 30% more throughput than the 5500 series¹
Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010.
Hardware configurations are on slide 22. Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
may affect actual performance.
Turn on Intel® Hyper-threading Technology
Intel® Hyper-threading Technology increases performance for threaded applications, delivering greater throughput and responsiveness.
Up to 28% better performance¹
¹ Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010.
Hardware configurations are on slide 23. Performance tests and ratings are measured using specific computer systems and/or components and reflect
the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration
may affect actual performance.
[Chart: job running time (lower values are better) for Terasort and Wordcount, SMT OFF vs. SMT ON]
Memory & Storage
Memory
• Equip 1-3 GB of RAM per CPU core
• ECC memory is highly recommended¹ to detect and correct errors introduced during storage and transmission of data
Hard drives
• Run in AHCI mode with NCQ enabled to improve the performance of multiple simultaneous reads and writes
• Enable the hard drive’s write cache
¹ See the discussion on the mailing list: http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200705.mbox/%3C465C3065.09050501@dragonflymc.com%3E
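As a rough illustration of the 1-3 GB per core guideline, the sketch below computes the per-node RAM range it implies. The function name is hypothetical; the 8- and 12-core examples correspond to the X5570 and X5670 nodes described in the backup slides.

```python
def ram_range_gb(cores, low_gb_per_core=1, high_gb_per_core=3):
    """Return (min, max) RAM in GB for a node under the 1-3 GB per core guideline."""
    return cores * low_gb_per_core, cores * high_gb_per_core


for cores in (8, 12):
    low, high = ram_range_gb(cores)
    print(f"{cores} cores: {low}-{high} GB RAM")
```

Under this reading, the 24 GB fitted to the test nodes falls inside the suggested range for both core counts.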
Networking
• 1-2 x 1 Gigabit Ethernet per node
• Ensure multiple RX/TX queue support for multi-core processors
• Enable channel bonding to resolve network-bound workloads if needed
– E.g., improves sort workload by 30% in job running time¹
[Charts: two panels, "Sort – no channel bonding" and "Sort – channel bonding". Each panel plots over time: Map/Reduce tasks (bootstrap, map, shuffle, sort, reduce, idle), CPU utilization (idle, wait I/O, system, user), disk utilization, and network utilization, all on a 0-100% scale. Annotations: "100% network utilization without channel bonding" and "I/O improves substantially".]
¹ Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 24. Performance tests
and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured
by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Hadoop Disk Drives
Doubling disk drives → >2x speedup
OS
• Use a Linux* distribution based on kernel version 2.6.30 or later because of the optimizations included for energy and threading efficiency
– For example, energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux
• Optimize Linux* configurations
– Raise the Linux open file descriptor limit in /etc/security/limits.conf
• The default of 1024 is too low for Hadoop daemons; try increasing it to approximately 64,000
– In kernel 2.6.28, raise the epoll file descriptor limit in /etc/sysctl.conf
• The default of 128 is too low for Hadoop daemons; try increasing it to approximately 4096
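The two limits above could be written out roughly as follows. This is a minimal sketch, not a definitive configuration: the user name "hadoop" and the sysctl key fs.epoll.max_user_instances are assumptions (the slide names only the files and the target values), and the helper names are hypothetical.

```python
def limits_conf_lines(user, nofile):
    """Render soft and hard open-file limits in /etc/security/limits.conf syntax."""
    return [f"{user} soft nofile {nofile}", f"{user} hard nofile {nofile}"]


def sysctl_conf_line(key, value):
    """Render one 'key = value' entry in /etc/sysctl.conf syntax."""
    return f"{key} = {value}"


# Assumptions: the daemons run as a user named "hadoop", and the epoll limit
# the slide refers to is the fs.epoll.max_user_instances key.
print("\n".join(limits_conf_lines("hadoop", 64000)))
print(sysctl_conf_line("fs.epoll.max_user_instances", 4096))
```

The printed lines would be appended to the respective files; check which epoll keys your kernel version exposes before applying them.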
JVM
JVM (set in hadoop-env.sh)
• Prefer the Sun HotSpot Java Runtime Environment
• Prefer the 64-bit JVM, version 1.6 update 14 or later
• “-server” option
– Recommended for Hadoop framework processes (e.g., JobTracker, NameNode) in production deployments
• Specific GC-related options for framework processes
– E.g., the concurrent mark-sweep collector with parallel GC threads: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
• Also set the parameter java.net.preferIPv4Stack to true.
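One way these flags might end up in hadoop-env.sh is sketched below. The slide does not say which variables to set; HADOOP_NAMENODE_OPTS and HADOOP_JOBTRACKER_OPTS are assumed here as the per-daemon option variables, and the helper name is hypothetical.

```python
# JVM flags taken from the slide; the java.net.preferIPv4Stack parameter is
# passed as a -D system property.
FLAGS = [
    "-server",
    "-XX:+UseConcMarkSweepGC",
    "-XX:ParallelGCThreads=8",
    "-Djava.net.preferIPv4Stack=true",
]


def daemon_opts_export(var_name, flags):
    """Build a shell export line that prepends flags to an existing *_OPTS variable."""
    joined = " ".join(flags)
    return f'export {var_name}="{joined} ${var_name}"'


# Assumption: these two variables carry the NameNode and JobTracker JVM options.
for var in ("HADOOP_NAMENODE_OPTS", "HADOOP_JOBTRACKER_OPTS"):
    print(daemon_opts_export(var, FLAGS))
```

Keeping the existing value of each variable at the end of the line preserves any options already set elsewhere in the file.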
Data Compression
Choosing a proper codec for your I/O-intensive workloads
• Compress data wherever possible
• Reduces storage footprint
• Speeds up I/O-bound workloads
• Set mapred.output.compress and/or mapred.compress.map.output to true
• Consider the LZO format
• Terasort with LZO compression:
• 60% faster than uncompressed
• 56% faster than zlib
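The compression settings named above are ordinary Hadoop configuration properties. The sketch below renders them as a Hadoop-style configuration XML fragment. The two boolean properties come from the slide; the codec property and the LZO class name are assumptions based on the separately installed hadoop-lzo library, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def hadoop_properties_xml(props):
    """Render a dict as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


compression = {
    "mapred.output.compress": "true",
    "mapred.compress.map.output": "true",
    # Assumption: codec property and class name as used with hadoop-lzo.
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
}
print(hadoop_properties_xml(compression))
```

The resulting property elements would go into the cluster's MapReduce site configuration, or be set per job.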
Data Source: Intel internal measurements by using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 25. Performance tests
and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured
by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Hadoop Configuration Tuning
1. Increase DFS block size
• dfs.block.size
– HDFS file block size; use a larger block size (such as 128 MB or 256 MB) for large file systems.
– E.g., increasing the block size from 128 MB to 256 MB reduces Terasort running time by 7%
2. Supply enough handlers (HDFS)
• dfs.datanode.max.xcievers
– The maximum number of threads that can connect to a data node simultaneously; set a larger number (e.g., 2048) rather than the default value of 256.
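To see why a larger block size reduces per-task overhead, the sketch below counts the blocks needed for a hypothetical 1 TB input at the two sizes mentioned. It assumes dfs.block.size is given in bytes and that a job typically runs about one map task per block; the 1 TB figure and the function name are illustrative only.

```python
MB = 1024 * 1024


def block_count(file_bytes, block_mb):
    """Number of HDFS blocks needed for a file (integer ceiling division)."""
    block_bytes = block_mb * MB
    return -(-file_bytes // block_bytes)


one_tb = 1024 ** 4  # hypothetical 1 TB input file
for block_mb in (128, 256):
    blocks = block_count(one_tb, block_mb)
    print(f"dfs.block.size={block_mb * MB} ({block_mb} MB): {blocks} blocks")
```

Doubling the block size halves the number of blocks, and with it the number of map tasks that have to be scheduled and started.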
For More Detail – See Intel’s Recent Paper
Summary
• Tune and optimize Hadoop case by case
• Most Hadoop applications are data-intensive
– Tune your I/O-related application subsystem first
• Processor choice matters:
– X5670 (Westmere) shows 20-40% improvement for CPU-intensive workloads over X5570 (Nehalem)¹
– For I/O-intensive workloads, consider scaling HDDs with core count
• Performance tuning tips:
– Channel bonding can reduce the network bottleneck for I/O-intensive workloads
– Using a larger DFS block size decreases task overhead
– Enabling HT shows gains of up to 28% for CPU-intensive workloads²
– Using LZO can significantly improve TeraSort results
¹,² Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010.
Hardware configurations (speed test) are on slide 21. Hardware configurations (HyperThreading) are on slide 23. Performance tests and ratings are
measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those
tests. Any difference in system hardware or software design or configuration may affect actual performance.
Backup
Cluster Configurations Information
(Slide: “Processor Choice Impacts Speed”)
Source: Intel internal measurement as of September 19 2009 running Hadoop*, WordCount, and TeraSort
Intel® Xeon® X5460-based server
Processor: Dual-socket quad-core Intel® Xeon® X5460 3.16GHz
Processor Memory: 16GB (DDR2 FB-DIMM ECC 667MHz) RAM
Storage: 1 X 300GB 15K RPM SAS disk for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results
Network: 1 Gigabit Ethernet NIC
BIOS: BIOS version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology) disabled; both hardware prefetcher and
adjacent cache-line prefetch disabled
Intel® Xeon® X5570-based server
Processor: Dual-socket quad-core Intel® Xeon® X5570 2.93GHz
Processor Memory: 16GB (DDR3 ECC 1333MHz) RAM
Storage: 1 X 1TB 7200RPM SATA for system and log files, 4 X 1TB 7200RPM SATA for HDFS and intermediate results
Network: 1 Gigabit Ethernet NIC
BIOS: BIOS version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled; both hardware prefetcher and adjacent
cache-line prefetch enabled; SMT (Simultaneous MultiThreading) enabled. (Disabling hardware prefetcher and adjacent cache-line prefetch helps improve
Hadoop performance on the Xeon X5460 server according to our benchmarking.)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Results: WordCount single job running time was 407 seconds on the Intel® Xeon® processor 5500 series and 289 seconds on the Intel® Xeon® processor
5600 series. TeraSort single job running time was 2,541 seconds on the Intel Xeon processor 5500 series and 2,182 seconds on the Intel Xeon processor
5600 series.
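The running times reported here can be turned into percentage improvements with a short calculation. This is only a sketch of the arithmetic, with a hypothetical function name; the WordCount figure works out to roughly 29%, in line with the "up to 29% faster" claim on the "Processor Choice Impacts Speed" slide.

```python
def percent_less_time(old_seconds, new_seconds):
    """Percentage reduction in job running time going from old to new."""
    return 100.0 * (old_seconds - new_seconds) / old_seconds


# (Xeon 5500 series seconds, Xeon 5600 series seconds) from the results above.
runs = {"WordCount": (407, 289), "TeraSort": (2541, 2182)}
for job, (xeon_5500, xeon_5600) in runs.items():
    reduction = percent_less_time(xeon_5500, xeon_5600)
    print(f"{job}: {reduction:.1f}% less running time on the 5600 series")
```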
Hardware, cluster configuration, and settings were as follows:
(1 Namenode/JobTracker + 5 DataNode/TaskTracker, each has two port 1 GbE connectivity to a single GbE switch with channel bonding enabled.) Intel
Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3
RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system and log files with isolated partition). Both EIST
(Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-
Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel® Xeon® processor X5570 2.93
GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system and log files with
isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line
prefetch enabled. Intel Hyper-Threading Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system,
mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment Java HotSpot* 64-bit server virtual
machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
21
Cluster Configurations Information
(Slide: “Processor Choice Impacts Throughput”)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Results: Total completed tasks per minute of WordCount on the Intel® Xeon® processor 5500 series
was approximately 71.58, and on the Intel® Xeon® processor 5600 series was approximately 93.22.
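As a quick check of the throughput claim, the relative gain follows directly from the two task rates above. The function name is hypothetical; the calculation gives a gain of about 30%.

```python
def percent_more(baseline, improved):
    """Relative gain of improved over baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Completed WordCount tasks per minute: 5500 series vs. 5600 series.
gain = percent_more(71.58, 93.22)
print(f"{gain:.1f}% more tasks per minute on the 5600 series")
```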
Hardware, cluster configuration, and settings were as follows:
(1 Namenode/JobTracker + 5 DataNode/TaskTracker, each has two port 1 GbE connectivity to a
single GbE switch with channel bonding enabled.) Intel Xeon processor 5600 series servers: HP
ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB
DDR3 RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system
and log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo
mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-
Threading Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6
Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA
disks per node (All six for HDFS and intermediate results, sharing one for system and log files with
isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled.
Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading
Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file
system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE
Runtime Environment Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop
[hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
22
Cluster Configurations Information
(Slide: “Intel® Hyper-threading Technology”)
Source: Intel internal measurement as of August 8, 2010 based on the following cluster and server
configuration: 6 nodes (1 NameNode/JobTracker, 5 DataNode/TaskTracker) in each, configured with
2GbE connectivity to each server. Intel® Xeon® processor 5600 series servers: HP ProLiant* z6000
G6 Server 2 x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6
SATA disks per node (All six for HDFS and intermediate results, sharing one for file system and log
files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode
disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-
Threading Technology (Intel® HT Technology) requires a computer system with an Intel® processor
supporting Intel HT Technology and an Intel HT Technology-enabled chipset, BIOS, and operating
system. Performance will vary depending on the specific hardware and software you use.
See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on
which processors support Intel HT Technology.
23
Cluster Configurations Information
(Slide: “Networking”)
24
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Hardware, cluster configuration, and settings were as follows:
(1 Namenode/JobTracker + 5 DataNode/TaskTracker, each has two port 1 GbE connectivity to a single
GbE switch with channel bonding enabled.) Intel Xeon processor 5600 series servers: HP ProLiant*
z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3
RAM, 6 SATA disks per node (All six for HDFS and intermediate results, sharing one for system and
log files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode
disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading
Technology enabled. Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x
Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node
(All six for HDFS and intermediate results, sharing one for system and log files with isolated
partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both
hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology
enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system,
mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime
Environment Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop
[hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
Cluster Configurations Information
(Slide: “Data Compression”)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* TeraSort.
Results: TeraSort single job running time was 1477 seconds without compression, 1256 seconds with
default(zlib) compression, and 586 seconds with LZO compression.
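The relative gains from compression can be derived from these three timings. The sketch below, with hypothetical function names, reports both the reduction in running time and the speedup ratio. From these numbers the LZO run takes about 60% less time than the uncompressed run, while the comparison against zlib comes out nearer 53% than the 56% quoted on the "Data Compression" slide.

```python
def percent_less_time(old_seconds, new_seconds):
    """Percentage reduction in job running time going from old to new."""
    return 100.0 * (old_seconds - new_seconds) / old_seconds


def speedup(old_seconds, new_seconds):
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


LZO_SECONDS = 586
baselines = {"uncompressed": 1477, "zlib": 1256}  # TeraSort running times in seconds
for name, seconds in baselines.items():
    reduction = percent_less_time(seconds, LZO_SECONDS)
    ratio = speedup(seconds, LZO_SECONDS)
    print(f"LZO vs. {name}: {reduction:.1f}% less time, {ratio:.2f}x speedup")
```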
Hardware, cluster configuration, and settings were as follows: (1 NameNode/JobTracker + 32
DataNode/TaskTracker; each has 1 port 1 GbE connectivity to a single GbE switch) Intel Xeon
processor 5500 series servers: 2x Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3
RAM, 4 SATA disks per node (All 4 for HDFS and intermediate results, sharing 1 for system and log
files with isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode
disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading
Technology enabled. Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30.10 x86_64). Ext3
filesystem, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java
SE Runtime Environment Java HotSpot* 64-bit server virtual machine). Hadoop version 0.20.1.
25

 
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura IntelTDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
TDC2018SP | Trilha IA - Inteligencia Artificial na Arquitetura Intel
 
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
 
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
 
Efficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsEfficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® Graphics
 
Achieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital WorldAchieve Unconstrained Collaboration in a Digital World
Achieve Unconstrained Collaboration in a Digital World
 
Intel® Open Image Denoise in Unity*
Intel® Open Image Denoise in Unity*Intel® Open Image Denoise in Unity*
Intel® Open Image Denoise in Unity*
 
Lynn Comp - Intel Big Data & Cloud Summit 2013 (2)
Lynn Comp - Intel Big Data & Cloud Summit 2013 (2)Lynn Comp - Intel Big Data & Cloud Summit 2013 (2)
Lynn Comp - Intel Big Data & Cloud Summit 2013 (2)
 
High Performance Computing: The Essential tool for a Knowledge Economy
High Performance Computing: The Essential tool for a Knowledge EconomyHigh Performance Computing: The Essential tool for a Knowledge Economy
High Performance Computing: The Essential tool for a Knowledge Economy
 
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenhoO uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
 
8 intel network builders overview
8 intel network builders overview8 intel network builders overview
8 intel network builders overview
 
Accelerate Ceph performance via SPDK related techniques
Accelerate Ceph performance via SPDK related techniques Accelerate Ceph performance via SPDK related techniques
Accelerate Ceph performance via SPDK related techniques
 
Gary Brown (Movidius, Intel): Deep Learning in AR: the 3 Year Horizon
Gary Brown (Movidius, Intel): Deep Learning in AR: the 3 Year HorizonGary Brown (Movidius, Intel): Deep Learning in AR: the 3 Year Horizon
Gary Brown (Movidius, Intel): Deep Learning in AR: the 3 Year Horizon
 
In The Trenches Optimizing UE4 for Intel
In The Trenches Optimizing UE4 for IntelIn The Trenches Optimizing UE4 for Intel
In The Trenches Optimizing UE4 for Intel
 
Driving Industrial InnovationOn the Path to Exascale
Driving Industrial InnovationOn the Path to ExascaleDriving Industrial InnovationOn the Path to Exascale
Driving Industrial InnovationOn the Path to Exascale
 

Más de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Más de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Intel - Nurcan Coskun - Hadoop World 2010

  • 1. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice. Copyright © 2010, Intel Corporation. Optimizing Hadoop* Workloads Nurcan Coskun, Ph.D. Intel Software & Solutions Group October 12, 2010 Acknowledgements to Jason Dai, Intel SSG, for of the test results and optimization techniques
  • 2. Legal Disclaimers. Disclaimers & Legal Notices: THE INFORMATION IS FURNISHED FOR INFORMATIONAL USE ONLY, IS SUBJECT TO CHANGE WITHOUT NOTICE, AND SHOULD NOT BE CONSTRUED AS A COMMITMENT BY INTEL CORPORATION. INTEL CORPORATION ASSUMES NO RESPONSIBILITY OR LIABILITY FOR ANY ERRORS OR INACCURACIES THAT MAY APPEAR IN THIS DOCUMENT OR ANY SOFTWARE THAT MAY BE PROVIDED IN ASSOCIATION WITH THIS DOCUMENT. THIS INFORMATION IS PROVIDED "AS IS" AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THE USE OF THIS INFORMATION INCLUDING WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, COMPLIANCE WITH A SPECIFICATION OR STANDARD, MERCHANTABILITY OR NONINFRINGEMENT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. 
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site http://www.intel.com/.
  • 3. Why Optimize Hadoop Deployments? Handle more data, at lower cost, in less time, with less power.
  • 4. Workload traits drive optimization approach
  • 5. Where to Optimize? Hardware: equipment, settings. Hadoop / HDFS: version, settings. Software: OS, JVM, settings.
  • 6. Server Considerations. 2-socket systems with Intel® Xeon® Processor 5600 series: sweet spot for performance, efficiency, cost. 12-24 GB DDR3: CPU-intensive workloads or HBase may require more. 4-6 x 1 TB SATA HDD, 7200 RPM: pure I/O workloads may require more. 1-2 x 1 Gigabit Ethernet: channel bonding for increased throughput. Energy-efficient components: Gold certified power supplies, efficient fans, low power
  • 7. Processor Choice Matters: faster, handles more data, more energy efficient.
  • 8. Processor Choice Impacts Speed. Last year: up to 36% faster. This year: up to 29% faster. Data Source: Intel internal measurements. Hadoop 0.19.1 results as of September 20, 2009 and Hadoop 0.20.2 results as of August 8, 2010. Hardware configurations are on slide 21.
  • 9. Processor Choice Impacts Throughput • Throughput = number of tasks completed per minute when the cluster is at 100% utilization • Intel Xeon processor 5600 provides up to 30% more throughput than the 5500 series [1]. [1] Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 22.
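The throughput definition on slide 9 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the task counts below are hypothetical and are not taken from Intel's measurements.

```python
def throughput(tasks_completed: int, elapsed_minutes: float) -> float:
    """Tasks completed per minute, measured while the cluster is fully utilized."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return tasks_completed / elapsed_minutes


def relative_gain(new: float, baseline: float) -> float:
    """Fractional improvement of `new` over `baseline` (0.30 means 30% more)."""
    return new / baseline - 1.0


# Hypothetical numbers, chosen only to illustrate the arithmetic.
baseline = throughput(tasks_completed=1000, elapsed_minutes=10)
upgraded = throughput(tasks_completed=1300, elapsed_minutes=10)
gain = relative_gain(upgraded, baseline)
```

With these made-up inputs the gain works out to 30%, which is how a claim such as "up to 30% more throughput" would be read.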
  • 10. Turn on Intel® Hyper-Threading Technology. Intel® Hyper-Threading Technology increases performance for threaded applications, delivering greater throughput and responsiveness. Up to 28% better performance [1]. [Figure: job running time (lower values are better) for Terasort and Wordcount, SMT off vs. SMT on.] [1] Data Source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 23.
  • 11. Memory & Storage Memory • Equip 1 to 3 GB of RAM per CPU core • ECC memory is highly recommended (1) to detect and correct errors introduced during storage and transmission of data Hard drives • Run in AHCI mode with NCQ enabled to improve performance under multiple simultaneous reads and writes • Enable the hard drive’s write cache (1) See the discussion on the mailing list: http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200705.mbox/%3C465C3065.09050501@dragonflymc.com%3E
  • 12. Networking • 1-2 x 1 Gigabit Ethernet per node • Ensure multiple RX/TX queue support for multi-core processors • Enable channel bonding to relieve network-bound workloads if needed – E.g., improves the job running time of the sort workload by 30% (1) [Figure: two panels, “Sort – no channel bonding” and “Sort – channel bonding”, each plotting over time the Map/Reduce tasks (bootstrap, map, shuffle, sort, reduce, idle), CPU utilization (idle, wait I/O, system, user), disk utilization, and network utilization on a 0-100% scale. Annotations: “100% network utilization without channel bonding” and “I/O improves substantially”.] (1) Data source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 24.
  • 13. Hadoop Disk Drives • Doubling the disk drives yields a >2x speedup
  • 14. OS • Use a Linux* distribution based on kernel version 2.6.30 or later because of the optimizations it includes for energy and threading efficiency – For example, energy consumption at idle can be up to 60 percent (42 watts) higher per server with older versions of Linux • Optimize the Linux* configuration – Open file descriptor limit, set in /etc/security/limits.conf • The default of 1024 is too low for Hadoop daemons; increase it to approximately 64,000 – In kernel 2.6.28, epoll file descriptor limit, set in /etc/sysctl.conf • The default of 128 is too low for Hadoop daemons; increase it to approximately 4096
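Before editing the files named above, it can help to check what limits a process actually sees. The following is a minimal sketch using Python's standard `resource` module on Linux. The threshold is the approximate value from the slide. The `/proc` path for the epoll limit is an assumption about which kernel parameter the slide refers to (`fs.epoll.max_user_instances`), and it may not exist on a given kernel, so the helper returns `None` when the file is missing.

```python
import resource

RECOMMENDED_NOFILE = 64000  # approximate target from the slide

# Assumed location of the epoll instance limit; may be absent on many kernels.
EPOLL_LIMIT_PATH = "/proc/sys/fs/epoll/max_user_instances"


def nofile_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_sufficient(soft_limit, recommended=RECOMMENDED_NOFILE):
    """True if the soft limit is unlimited or at least the recommended value."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= recommended


def read_epoll_limit(path=EPOLL_LIMIT_PATH):
    """Return the epoll limit as an int, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


soft, hard = nofile_limits()
print(f"open files: soft={soft} hard={hard} sufficient={is_sufficient(soft)}")
print(f"epoll limit: {read_epoll_limit()}")
```

Run this as the user that starts the Hadoop daemons, since limits set in /etc/security/limits.conf are applied per user.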
  • 15. JVM (set in hadoop-env.sh) • Prefer the Sun HotSpot Java Runtime Environment • Prefer a 64-bit JVM, version 1.6 update 14 or later • “-server” option – Recommended for Hadoop framework processes (e.g., JobTracker, NameNode) in production deployments • GC-related options for framework processes – E.g., the concurrent mark-sweep collector with eight parallel GC threads: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC • Also set the parameter java.net.preferIPv4Stack to true
  • 16. Data Compression: choosing a proper codec for your I/O-intensive workloads • Compress data wherever possible • Reduces storage footprint • Speeds up I/O-bound workloads • Set mapred.output.compress and/or mapred.compress.map.output to true • Consider the LZO format • Terasort with LZO compression: • 60% faster than uncompressed • 56% faster than zlib Data source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations are on slide 25.
  • 17. Hadoop Configuration Tuning 1. Increase the DFS block size • dfs.block.size – the HDFS file block size; use a larger block size (such as 128 MB or 256 MB) for large file systems • E.g., increasing the block size from 128 MB to 256 MB reduces Terasort running time by 7% 2. Supply enough handlers (HDFS) • dfs.datanode.max.xcievers – the maximum number of threads that can be connected to a data node simultaneously; set a larger number (e.g., 2048) rather than the default value of 256
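The two HDFS settings above are normally written as `<property>` entries in the cluster's HDFS site configuration file. As an illustrative sketch only, the snippet below generates such entries with Python's standard XML library. The helper name `build_site_config` is hypothetical, and the values are the examples from the slide, not recommendations for every cluster. In Hadoop 0.20, `dfs.block.size` is expressed in bytes, so 256 MB is written as 268435456.

```python
import xml.etree.ElementTree as ET


def build_site_config(properties):
    """Build a Hadoop-style <configuration> element from a name/value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return root


# Example values taken from the slide.
hdfs_settings = {
    "dfs.block.size": 256 * 1024 * 1024,  # 256 MB, in bytes
    "dfs.datanode.max.xcievers": 2048,    # default is 256
}

root = build_site_config(hdfs_settings)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating the file from a small mapping keeps the tuned values in one place, which makes it easier to compare runs with different block sizes.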
  • 18. For More Detail – See Intel’s Recent Paper
  • 19. Summary • Tune and optimize Hadoop case by case • Most Hadoop applications are data-intensive • Tune your I/O-related application subsystem first • Processor choice matters: • X5670 (Westmere) shows a 20-40% improvement over X5570 (Nehalem) for CPU-intensive workloads (1) • For I/O-intensive workloads, consider scaling the number of HDDs with the core count • Performance tuning tips: • Channel bonding can reduce the network bottleneck for I/O-intensive workloads • Using a larger DFS block size decreases task overhead • Enabling HT shows gains of up to 28% for CPU-intensive workloads (2) • Using LZO can significantly improve TeraSort results (1, 2) Data source: Intel internal measurements using Hadoop 0.20.2 as of August 8, 2010. Hardware configurations (speed test) are on slide 21. Hardware configurations (Hyper-Threading) are on slide 23.
  • 20. Backup
  • 21. Cluster Configurations Information (Slide: “Processor Choice Impacts Speed”) Source: Intel internal measurement as of September 19, 2009 running Hadoop* WordCount and TeraSort.
Intel® Xeon® X5460-based server: Processor: dual-socket quad-core Intel® Xeon® X5460 3.16 GHz. Memory: 16 GB RAM (DDR2 FB-DIMM ECC 667 MHz). Storage: 1 x 300 GB 15K RPM SAS disk for system and log files, 4 x 1 TB 7200 RPM SATA for HDFS and intermediate results. Network: 1 Gigabit Ethernet NIC. BIOS: version S5000.86B.10.60.0091.100920081631; EIST (Enhanced Intel SpeedStep Technology) disabled; both hardware prefetcher and adjacent cache-line prefetch disabled.
Intel® Xeon® X5570-based server: Processor: dual-socket quad-core Intel® Xeon® X5570 2.93 GHz. Memory: 16 GB RAM (DDR3 ECC 1333 MHz). Storage: 1 x 1 TB 7200 RPM SATA for system and log files, 4 x 1 TB 7200 RPM SATA for HDFS and intermediate results. Network: 1 Gigabit Ethernet NIC. BIOS: version 4.6.3; both EIST (Enhanced Intel SpeedStep Technology) and Turbo mode disabled; both hardware prefetcher and adjacent cache-line prefetch enabled; SMT (Simultaneous MultiThreading) enabled. (Disabling the hardware prefetcher and adjacent cache-line prefetch helps improve Hadoop performance on the Xeon X5460 server according to our benchmarking.)
Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort. Results: WordCount single job running time was 407 seconds on the Intel® Xeon® processor 5500 series and 289 seconds on the Intel® Xeon® processor 5600 series. TeraSort single job running time was 2,541 seconds on the Intel Xeon processor 5500 series and 2,182 seconds on the Intel Xeon processor 5600 series.
Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker, each with two 1 GbE ports connected to a single GbE switch, with channel bonding enabled.
Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled.
Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel® Xeon® processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 22. Cluster Configurations Information (Slide: “Processor Choice Impacts Throughput”) Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort. Results: total completed WordCount tasks per minute was approximately 71.58 on the Intel® Xeon® processor 5500 series and approximately 93.22 on the Intel® Xeon® processor 5600 series.
Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker, each with two 1 GbE ports connected to a single GbE switch, with channel bonding enabled.
Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled.
Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 23. Cluster Configurations Information (Slide: “Intel® Hyper-threading Technology”) Source: Intel internal measurement as of August 8, 2010 based on the following cluster and server configuration: 6 nodes (1 NameNode/JobTracker and 5 DataNode/TaskTracker), with 2 GbE connectivity to each server.
Intel® Xeon® processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel® Xeon® processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for file system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled.
Intel® Hyper-Threading Technology (Intel® HT Technology) requires a computer system with an Intel® processor supporting Intel HT Technology and an Intel HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information, including details on which processors support Intel HT Technology.
  • 24. Cluster Configurations Information (Slide: “Networking”) Source: Intel internal measurement as of August 8, 2010 running Hadoop* WordCount and TeraSort.
Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 5 DataNode/TaskTracker, each with two 1 GbE ports connected to a single GbE switch, with channel bonding enabled.
Intel Xeon processor 5600 series servers: HP ProLiant* z6000 G6 Server with 2x Intel Xeon processor X5670 2.93 GHz (12 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel® Hyper-Threading Technology enabled.
Intel Xeon processor 5500 series servers: HP ProLiant z6000 G6 Server with 2x Intel Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 6 SATA disks per node (all six for HDFS and intermediate results, sharing one for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30 x86_64). Ext4 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Cloudera distribution of Hadoop [hadoop-0.20.2-CDH3 beta 2 (hadoop patch level 320)].
  • 25. Cluster Configurations Information (Slide: “Data Compression”) Source: Intel internal measurement as of August 8, 2010 running Hadoop* TeraSort. Results: TeraSort single job running time was 1477 seconds without compression, 1256 seconds with default (zlib) compression, and 586 seconds with LZO compression.
Hardware, cluster configuration, and settings were as follows: 1 NameNode/JobTracker + 32 DataNode/TaskTracker, each with one 1 GbE port connected to a single GbE switch.
Intel Xeon processor 5500 series servers: 2x Xeon processor X5570 2.93 GHz (8 cores per node), 24 GB DDR3 RAM, 4 SATA disks per node (all 4 for HDFS and intermediate results, sharing 1 for system and log files with an isolated partition). Both EIST (Enhanced Intel® SpeedStep Technology) and Turbo mode disabled. Both hardware prefetcher and adjacent cache-line prefetch enabled. Intel Hyper-Threading Technology enabled.
Software: Red Hat Enterprise Linux* 5.4 (with kernel 2.6.30.10 x86_64). Ext3 file system, mounted with “noatime,nodiratime” options. Sun JVM 1.6 (Java* version 1.6.0_14 Java SE Runtime Environment, Java HotSpot* 64-bit server virtual machine). Hadoop version 0.20.1.
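The TeraSort timings on this slide can be converted into relative reductions in running time. The sketch below assumes that "faster" means the percentage reduction in job running time, which the deck does not define explicitly, and the function name is illustrative. Under that assumption, LZO is about 60% faster than no compression, matching slide 16, and about 53% faster than zlib, which is slightly below the 56% quoted there.

```python
def time_reduction_pct(baseline_s: float, new_s: float) -> float:
    """Percentage reduction in running time relative to the baseline."""
    return (baseline_s - new_s) / baseline_s * 100.0


# TeraSort single-job running times in seconds, from slide 25.
timings = {"none": 1477, "zlib": 1256, "lzo": 586}

vs_none = time_reduction_pct(timings["none"], timings["lzo"])
vs_zlib = time_reduction_pct(timings["zlib"], timings["lzo"])
speedup = timings["none"] / timings["lzo"]

print(f"LZO vs. uncompressed: {vs_none:.1f}% less time ({speedup:.2f}x speedup)")
print(f"LZO vs. zlib:         {vs_zlib:.1f}% less time")
```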