Profiling the network performance of data transfers in Hadoop jobs
Team : Pramod Biligiri & Sayed Asad Ali
Abstract
We have attempted to reproduce existing research which shows that the Shuffle phase of 
Hadoop is network intensive and can constitute a bottleneck for many Hadoop jobs. We ran the 
Terasort and Ranked Inverted Index jobs on an Amazon Elastic MapReduce cluster. Our 
experiments show that the Shuffle phase can form a significant fraction (up to nearly 30%) of the 
time consumed in these jobs.
We do not have decisive results showing that the network is saturated during this phase. This is 
due to a) lack of precise documentation on the network capacity of EMR, and b) inconsistent 
results between our network benchmark tests and the results from the Hadoop jobs. See 
Section 6 for a detailed discussion of both these factors.
1. Introduction
Data intensive computing on large scale, commodity clusters is becoming commonplace. 
Hadoop[9] is a popular framework used in such computing environments. While performance 
analysis is traditionally focused on the algorithm, the processing unit, memory and disk, the rise 
of cluster computing adds the communication patterns of the algorithm and the underlying 
network capacity as factors to consider while evaluating performance.
In this project we profile the data transfers that happen between the different stages of a Hadoop 
job, with an aim to understand the utilization of network resources during the process. We hope 
to reproduce some well known results which show that network utilization is a bottleneck in 
MapReduce. We intend to focus on the shuffle phase of the MapReduce pipeline, and the 
many-to-many pattern of data movement therein.
1.1. Hadoop
Hadoop is a framework for distributed processing of large data sets across clusters of 
computers using simple programming models based on Google’s MapReduce [7]. Hadoop is 
open source and implemented in Java.
Hadoop can be characterized by the following distinct features:
● Designed for commodity hardware
● Fault tolerant
● Horizontally scalable
● Push computation to data
1.2. MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed 
algorithm on a cluster. In this model, a program consists of two phases: Map and Reduce. In the 
Map phase, each input record is processed to generate a (key, value) pair. In the Reduce phase, 
values associated with the same key are grouped together and an operation is applied on them 
to obtain the final results.
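As a minimal, single-process sketch of the model described above (not Hadoop's API), the following Python code runs a word-count job. The names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and the map function here emits one pair per word rather than one pair per record.

```python
from collections import defaultdict

def map_fn(record):
    """Map phase: emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce phase: apply an operation (here, a sum) to all values of one key."""
    return key, sum(values)

def run_job(records):
    """Run map, group the values by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

if __name__ == "__main__":
    print(run_job(["the quick fox", "the lazy dog"]))
```

In a real cluster the grouping step is distributed across nodes, which is what makes the data movement between the two phases relevant to this project.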
The following figure illustrates the flow of a MapReduce job:
1.3. Shuffle
The Shuffle is a phase where each reducer fetches its part of the sorted map outputs from all 
the mapper nodes. This phase results in an n-to-n communication among a set of n nodes.
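The n-to-n pattern can be illustrated with a small simulation. This is a sketch under assumptions: keys are assigned to reducers by a CRC32 hash, which stands in for Hadoop's partitioner, and the names `partition` and `shuffle` are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Pick the reducer for a key; CRC32 keeps the choice deterministic."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Group sorted map outputs by destination reducer.

    map_outputs: one list of (key, value) pairs per mapper node.
    Returns (reducer_inputs, transfers), where transfers[m][r] counts the
    pairs that reducer r fetches from mapper m.
    """
    reducer_inputs = [[] for _ in range(num_reducers)]
    transfers = [[0] * num_reducers for _ in map_outputs]
    for m, pairs in enumerate(map_outputs):
        for key, value in sorted(pairs):
            r = partition(key, num_reducers)
            reducer_inputs[r].append((key, value))
            transfers[m][r] += 1
    return reducer_inputs, transfers
```

With n mappers and n reducers, `transfers` is an n x n matrix, and every non-zero entry corresponds to one mapper-to-reducer flow over the network (or a local copy when both tasks share a node).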
2. Related Work:
We have studied a few papers which report that the shuffle phase is an expensive operation. As 
stated by the Orchestra paper[1]: 
“On average, the shuffle phase accounts for 33% of the running time in these jobs. In addition,
in 26% of the jobs with reduce tasks, shuffles account for more than 50% of the running time, 
and in 16% of jobs, they account for more than 70% of the running time. This confirms widely 
reported results that the network is a bottleneck in MapReduce”
More information from the Hedera paper[2] corroborates this:
“A data shuffle is an expensive but necessary operation for many MapReduce/Hadoop 
operations in which every host transfers a large amount of data to every other host participating 
in the shuffle. In this experiment, each host sequentially transfers 500MB to every other host 
using TCP (a 120GB shuffle).”
Furthermore, the VL2 paper[3] also establishes this observation:
“we consider an all-to-all data shuffle stress test: all servers simultaneously initiate TCP 
transfers to all other servers. This data shuffle pattern arises in large scale sorts, merges and 
join operations in the data center. We chose this test because, in our interactions with 
application developers, we learned that many use such operations with caution, because the 
operations are highly expensive in today’s data center network. However, data shuffles are 
required, and, if data shuffles can be efficiently supported, it could have large impact on the 
overall algorithmic and data storage strategy.”
 
3. Choice of Benchmarks:
3.1. Terasort
3.1.1. Why Terasort?
Terasort [4]  is a popular benchmark for Hadoop and is also shipped with most Hadoop 
distributions. This benchmark program sorts 1 terabyte of data. Each data item is 100 bytes in 
size. The first 10 bytes of a data item constitute its sort key.
Each record is laid out as:
<key 10 bytes><rowid 10 bytes><filler 78 bytes>\r\n
key    : random characters from ASCII 32-126
rowid  : an integer
filler : random characters from the set A-Z
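A record with this layout can be generated as follows. This is a sketch, assuming that the record ends with a carriage return and line feed (which brings the total to 100 bytes) and that the row id is right-aligned and space-padded to 10 bytes; the padding scheme and the name `make_record` are assumptions.

```python
import random
import string

KEY_CHARS = "".join(chr(c) for c in range(32, 127))  # ASCII 32-126

def make_record(row_id, rng):
    """Build one 100-byte record: 10-byte key, 10-byte row id, 78-byte filler, CRLF."""
    key = "".join(rng.choice(KEY_CHARS) for _ in range(10))
    rowid = str(row_id).rjust(10)  # assumed padding: right-aligned with spaces
    filler = "".join(rng.choice(string.ascii_uppercase) for _ in range(78))
    return (key + rowid + filler + "\r\n").encode("ascii")

if __name__ == "__main__":
    record = make_record(1, random.Random(0))
    print(len(record))  # 100
```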
The Terasort workload utilizes all aspects of the cluster (CPU, network, disk and memory) and 
also has a large amount of data to shuffle (240 GB). Moreover, this is representative of real world 
workloads, as mentioned in the VL2 paper[3]:
“we consider an all-to-all data shuffle stress test: all servers simultaneously initiate TCP 
transfers to all other servers. This data shuffle pattern arises in large scale sorts, merges and 
join operations in the data center. We chose this test because, in our interactions with application 
developers, we learned that many use such operations with caution, because the operations are 
highly expensive in today’s data center network. However, data shuffles are required, and, if data 
shuffles can be efficiently supported, it could have large impact on the overall algorithmic and 
data storage strategy.”
3.1.2. How it works?
The Map phase of Terasort partitions input keys into different buckets and then leverages 
Hadoop’s default sorting of Map output. Finally, the reducer only collects outputs from different 
maps and does not perform a computation-intensive task. Due to its simple application logic and 
usage of Hadoop’s default sorting mechanism, Terasort is considered a good benchmarking 
application.
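One common way to build such buckets is range partitioning with split points chosen from a sample of the keys. The sketch below illustrates that idea only; it is an assumption about the mechanism, not a description of Hadoop's code, and the function names are hypothetical.

```python
import bisect
import random

def choose_split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundaries from a sorted sample of the keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def bucket_for(key, split_points):
    """Return the bucket index; every key in bucket i sorts before bucket i + 1."""
    return bisect.bisect_right(split_points, key)

def terasort_like(keys, num_reducers, rng):
    """Route keys to ordered buckets ("map"), then sort each bucket ("reduce")."""
    sample = rng.sample(keys, min(len(keys), 100))
    splits = choose_split_points(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[bucket_for(key, splits)].append(key)
    # Buckets are ordered relative to each other, so concatenating the
    # individually sorted buckets yields a globally sorted result.
    return [key for bucket in buckets for key in sorted(bucket)]
```

Because the buckets are ordered, the reducers need no coordination with each other, which is consistent with the reducer doing little computation of its own.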
3.2. Ranked Inverted Index
3.2.1. Why Ranked Inverted Index?
This benchmark was chosen as it is mentioned in the Tarazu[4] paper as a Shuffle heavy 
workload. Also, a ranked inverted index is used often in text processing and information retrieval 
tasks and is therefore a commonly executed job. For a given text corpus, it generates, for each 
word, a list of the documents containing the word, in decreasing order of frequency:
word -> (count1 | file1), (count2 | file2), ...
count1 > count2 > ...
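The output format can be sketched in a few lines of single-process Python. This is an illustration only, not the benchmark's implementation; the name `ranked_inverted_index` is hypothetical, words are split on whitespace, and ties in the count are broken by file name, all of which are assumptions.

```python
from collections import Counter, defaultdict

def ranked_inverted_index(corpus):
    """corpus: dict mapping file name -> text.

    Returns a dict mapping each word to a list of (count, file) pairs,
    highest count first.
    """
    postings = defaultdict(list)
    for file_name, text in corpus.items():
        for word, count in Counter(text.lower().split()).items():
            postings[word].append((count, file_name))
    return {
        word: sorted(entries, key=lambda entry: (-entry[0], entry[1]))
        for word, entries in postings.items()
    }

if __name__ == "__main__":
    print(ranked_inverted_index({"a.txt": "x x y", "b.txt": "x y y y"}))
```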
 
4. Experimental Setup:
4.1. Configuration
We utilised three configurations as a testbed for our experiments. Two of these were configured 
on Amazon’s Elastic MapReduce (EMR) clusters and we used a cluster at SDSC as a learning 
testbed. However, Config 1 on EMR is the one we chose for a majority of our tests and our 
results are based on that.
Both EMR configurations have 1 NameNode and 10 DataNode/TaskTracker nodes.
| | Instance type | Memory | CPU | ECU | Disk | Network performance |
|---|---|---|---|---|---|---|
| Config 1 | m1.large | 7.5 GB | 64-bit | 4 | 2 x 420 GB | Moderate |
| Config 2 | m1.xlarge | 15 GB | 64-bit | 8 | 4 x 420 GB | High |
| SDSC | custom | 8 GB | 64-bit / Intel Xeon CPU 5140 @ 2.33 GHz, 4 cores | - | 2 x 1.5 TB | 1 Gb/s |
4.2. Network Test
Source 1 : our measurement with AppNeta pathtest[8], average : 753 Mb/s
Source 2 : “The available bandwidth is still 1 Gb/s, confirming anecdotal evidence that EC2 has 
full bisection bandwidth."[5]
Source 3 : “The median TCP/UDP throughput of medium instances are both close to 760 Mb/s." 
[6]
5. Results:
5.1. Terasort
5.1.1. Comparison of running Terasort on different Configurations
| | Total job time (min) | Map time (min) | Reduce time (min) | Shuffle average time (min) | Shuffle time % |
|---|---|---|---|---|---|
| Config 1 | 205 | 84 | 205 | 60 | 29.3 |
| SDSC | 166 | 60 | 90 | 36 | 21.7 |
| Config 2 | 86 | 40 | 75 | 22 | 25.5 |
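The shuffle percentages can be checked with a short calculation. This sketch assumes that the shuffle time percentage is the average shuffle time divided by the total job time, with both in minutes; the report does not state the formula. Under that assumption Config 2 comes out at about 25.6, slightly above the reported 25.5, presumably because the reported times are rounded.

```python
# Values copied from the table above (minutes).
RUNS = {
    "Config 1": {"total_min": 205, "shuffle_min": 60},
    "SDSC": {"total_min": 166, "shuffle_min": 36},
    "Config 2": {"total_min": 86, "shuffle_min": 22},
}

def shuffle_percent(total_min, shuffle_min):
    """Share of the total job time spent in the shuffle, in percent."""
    return 100.0 * shuffle_min / total_min

if __name__ == "__main__":
    for name, run in RUNS.items():
        print(name, round(shuffle_percent(**run), 1))  # 29.3, 21.7, 25.6
```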
5.1.2. CDF of Transferred Data
The CDF shows that network traffic happens in two distinct phases. First is the Map phase, 
during which there is steady traffic, although not at high rates. Approximately half of the 
total volume of 240 GB is transferred during this time.
Following the Map, the traffic reduces as the map outputs are sorted locally. Then the job enters 
the Shuffle phase, where the data is transferred to all the reducers. During this phase, the 
network traffic approaches the peak transfer rate we measured, as the subsequent sections show. 
This phase transfers the remaining half of the 240 GB of data.
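A CDF of this kind can be derived from per-interval traffic counts. The sketch below assumes the input is a list of bytes transferred in consecutive sampling intervals; the report does not describe how its CDF was produced, and the name `transfer_cdf` is hypothetical.

```python
from itertools import accumulate

def transfer_cdf(bytes_per_interval):
    """Turn per-interval byte counts into the cumulative fraction of total traffic."""
    total = sum(bytes_per_interval)
    if total == 0:
        return [0.0 for _ in bytes_per_interval]
    return [running / total for running in accumulate(bytes_per_interval)]

if __name__ == "__main__":
    # Hypothetical counts: steady traffic, a quiet interval, then a burst.
    print(transfer_cdf([10, 10, 0, 20]))  # [0.25, 0.5, 0.5, 1.0]
```

A flat stretch in the resulting curve corresponds to an interval with little or no traffic, such as a local sort between the map and shuffle phases.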
 
5.1.3. Network activity
The following figure shows the network transfer rates over the lifetime of the job. It shows that 
during the Shuffle phase the network traffic reaches 700 Mbps, which was the peak transfer rate 
as measured by one of our tests.
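Transfer rates like these can be computed from cumulative byte counters sampled over time. The following sketch assumes such counters are available, for example from a network interface; the report does not state how its rates were collected, and the function names are hypothetical.

```python
def rates_mbps(samples):
    """samples: list of (timestamp_seconds, cumulative_bytes) in time order.

    Returns the transfer rate in Mb/s for each consecutive pair of samples.
    """
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        rates.append((b1 - b0) * 8 / (t1 - t0) / 1e6)
    return rates

def utilisation(observed_mbps, capacity_mbps):
    """Fraction of an assumed link capacity that the observed rate represents."""
    return observed_mbps / capacity_mbps

if __name__ == "__main__":
    # 87.5 MB transferred in one second corresponds to 700 Mb/s.
    print(rates_mbps([(0, 0), (1, 87_500_000)]))  # [700.0]
    # 700 Mb/s against the 753 Mb/s pathtest measurement.
    print(round(utilisation(700, 753), 2))  # 0.93
```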
5.1.4. Disk activity
The figure shows that the map phase has a good mix of reads and writes. Once the map is 
done, around the 5100 s mark, the volume of data read drops markedly and write activity on 
the cluster increases. This pattern continues until around the 6900 s mark, where the shuffle 
starts and the pattern shifts back to a mix of reads and writes, but with markedly less read 
activity. 
The observed peak value is around 60 to 80 MB/s, which is well below the 100 MB/s that we 
measured with dd on the Amazon EC2 machines.
5.1.5. Memory Activity
The captured logs indicate that throughout the timeline of the job, the memory shows a 
somewhat consistent utilisation of nearly 4.5 GB on all boxes and never overshoots this mark. 
Since 7.5 GB of memory is available on each box, this indicates that memory was never 
stressed to capacity, which suggests that the bottleneck lies elsewhere.
5.1.6. CPU Utilisation
The graphs below show the CPU utilisation on the EC2 boxes over the lifetime of the job. As 
the figure indicates, the CPU at times reaches 100% utilisation, but the pattern in the initial 
portion of the graph is erratic, oscillating between full and partial utilisation. During the 
shuffle phase, however, CPU utilisation clearly drops below 50%, so the CPU is not the limiting 
factor in the Shuffle phase of a job.
 
5.2. Ranked Inverted Index (RII)
5.2.1. Data for an RII run on Config 1
| | Total job time (min) | Map time (min) | Reduce time (min) | Shuffle average time (min) | Shuffle time % |
|---|---|---|---|---|---|
| Config 1 | 12 | 5.5 | 11.5 | 3.5 | 27.14 |
5.2.2. CDF of data
The CDF shows that network traffic happens in three distinct phases. First is the Map phase, 
during which there is steady traffic, although not at high rates. Once the shuffle starts, the 
network traffic picks up: 13 GB of data is transferred across the network in a short duration, 
and this is where we see the saturation point of the network. The burst of traffic after this 
is the replication of the results to 3 nodes.
5.2.3. Network Activity
The following figure shows the network transfer rates over the lifetime of the job. It shows that 
during the Shuffle phase the network traffic reaches 1.5 Gbps, which is something we have not 
been able to explain as the maximum expected rate should have been in the range of 700 ­ 800 
Mbps.
5.2.4. Disk Activity
The figure shows that the map phase has a good mix of reads and writes. Once the map is 
done, around the 350 s mark, the volume of data read drops markedly and write activity on 
the cluster increases. This pattern continues until around the 550 s mark, where the shuffle 
starts and the pattern shifts back to a mix of reads and writes, but with markedly less read 
activity. 
The sporadic maximum read value is around 140 to 150 MB/s which does not stress the cluster 
as the consistent read rate is well below that.
5.2.5. Memory Activity
The captured logs indicate that throughout the timeline of the job, the memory shows a 
somewhat consistent utilisation of nearly 4.5 GB on all boxes and never overshoots this mark. 
Since 7.5 GB of memory is available on each box, this indicates that memory was never 
stressed to capacity, which suggests that the bottleneck lies elsewhere.
5.2.6. CPU Utilisation
The graphs below show the CPU utilisation on the EC2 boxes over the lifetime of the job. As 
the figure indicates, the CPU at times reaches 100% utilisation during the map phase, but the 
pattern in the initial portion of the graph is erratic, oscillating between full and partial 
utilisation. During the shuffle phase, however, CPU utilisation clearly drops below 50%, so the 
CPU is not the limiting factor in the shuffle phase of a job.
 
6. Unresolved Issues
6.1 Maximum network bandwidth of EMR
Amazon does not provide maximum network bandwidth rates for EMR. The network 
performance is only described in qualitative terms (Low, Moderate and High). We measured the 
network bandwidth between 2 nodes using the pathtest application from AppNeta [8]. We found a 
peak transfer rate of 753Mb/s.
We surveyed existing literature for the same, and found two conflicting versions:
1. “The available bandwidth is still 1 Gb/s, confirming anecdotal evidence that EC2 has full 
bisection bandwidth” (Opening Up Black Box Networks with CloudTalk, by Costin Raiciu et al. [5])
2. “The median TCP/UDP throughput of medium instances are both close to 760 Mb/s” (The 
Impact of Virtualization on Network Performance of Amazon EC2 Data Center, by Guohui Wang 
et al. [6])
Further, the Terasort job maxed out at 800 Mb/s during our runs, whereas the Ranked Inverted 
Index crossed 1Gb/s. We are not able to reconcile these results, and believe it needs further 
investigation.
6.2 Sorting phase on Ranked Inverted Index
The sorting phase of the Ranked Inverted Index lasts for a very short duration. In fact it is not 
noticeable on the network CDF graph. But from the Disk activity graph it can be seen that there 
is write-heavy activity between the 300 and 400 second marks, which corresponds to a period of low 
network activity. It should be investigated why the network transfer does not flatline during this 
period for this job, whereas it does in the case of Terasort.
7. Summary
We have shown that both for the Terasort and Ranked Inverted Index jobs, the shuffle phase of 
MapReduce can constitute a large fraction of the overall job runtime (nearly 30%). We infer that 
for this phase, the network is potentially a bottleneck, as there is low activity on CPU, disk and 
memory. A better understanding of EMR’s network performance can lead to a more conclusive 
result in this regard.
8. Future Work
Apart from the issues raised in Section 6, we see other avenues of investigation for this project.
The experiments should be run on different kinds of hardware, and different values for certain 
Hadoop parameters. The important Hadoop parameters to consider are: io.sort.mb, 
io.sort.factor, and fs.inmemory.size.mb.
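Such a parameter study could be organised as a grid of configurations. The sketch below only enumerates the combinations; the candidate values are hypothetical placeholders, not recommendations, and the name `configurations` is an assumption.

```python
from itertools import product

# Hypothetical candidate values; the report does not specify any.
PARAM_GRID = {
    "io.sort.mb": [100, 200],
    "io.sort.factor": [10, 50],
    "fs.inmemory.size.mb": [75, 150],
}

def configurations(grid):
    """Yield one dict per combination of parameter values in the grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

if __name__ == "__main__":
    for config in configurations(PARAM_GRID):
        print(config)
```

Each yielded dict would correspond to one job run, so the grid size (8 here) sets the number of experiments needed.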
Jobs should be investigated with and without the presence of Combiners, which help to reduce 
the amount of data shuffled. The number of map and reduce tasks can be varied to see if that 
has an impact on the results.
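The effect of a Combiner on the shuffle volume can be illustrated with a word-count style job. This is a single-process sketch, not Hadoop's Combiner API; it counts how many (key, value) pairs would leave the mappers with and without local pre-aggregation, and all function names are hypothetical.

```python
from collections import Counter

def map_words(text):
    """Map: emit (word, 1) for every word in one mapper's input."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def shuffled_pairs(texts, use_combiner):
    """Count the pairs that would be sent to reducers, one text per mapper."""
    count = 0
    for text in texts:
        pairs = map_words(text)
        if use_combiner:
            pairs = combine(pairs)
        count += len(pairs)
    return count

if __name__ == "__main__":
    inputs = ["a a a b", "a b b"]
    print(shuffled_pairs(inputs, use_combiner=False))  # 7
    print(shuffled_pairs(inputs, use_combiner=True))   # 4
```

The saving depends on how often keys repeat within a single mapper's output, so workloads with mostly unique keys, such as a sort, would see little benefit.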
Also, the locality of the tasks is an important factor to consider while evaluating Hadoop 
jobs, e.g. a rack-local or machine-local setup may perform better.
There is scope for extensive work to determine the topology and bandwidth expectations of 
Amazon’s EMR clusters. 
 
9. References
1. Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica. Managing Data 
Transfers in Computer Clusters with Orchestra. In SIGCOMM, 2011
2. M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow 
scheduling for data center networks. In NSDI, 2010
3. A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, 
and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009
4. Faraz Ahmad, Srimat T. Chakradhar, Anand Raghunathan, T.N. Vijaykumar. Tarazu: Optimizing 
MapReduce on Heterogeneous Clusters
5. Costin Raiciu et al. Opening Up Black Box Networks with CloudTalk
6. Guohui Wang et al. The Impact of Virtualization on Network Performance of Amazon EC2 Data 
Center
7. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
8. AppNeta pathtest - http://www.appneta.com/resources/pathtest-download.html
9. Apache Hadoop - http://hadoop.apache.org/

IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
 
Design architecture based on web
Design architecture based on webDesign architecture based on web
Design architecture based on web
 
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVI...
 

Shuffle phase as the bottleneck in Hadoop Terasort