IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. I (May-Jun. 2016), PP 30-34
www.iosrjournals.org
DOI: 10.9790/0661-1803013034
Performance Analysis of Hadoop Application for Heterogeneous Systems
Naresh E1, Poornesha B D2, Vijaya Kumar B P3
1 Research Scholar, CSE, Jain University, Bangalore, Karnataka, India
1, 2, 3 Department of ISE, MSRIT, MSR Nagar, Bangalore, Karnataka, 560054, India
Abstract: Apache Hadoop is open-source software that allows for the distributed processing of large data sets across nodes. This work involves non-functional testing of a Hadoop application in order to measure its observable performance. It also includes a study and observations of the performance of a Hadoop application on different machine architectures, namely Intel Core 2 Duo and Intel Core i3.
Keywords: Big Data, Split files, Functional testing, Hadoop throughput and performance, Mappers, Reducers.
I. Introduction
As technology evolves, a huge amount of data is generated every day by sensors, smartphones, mobile devices and web applications. Three dimensions characterize this data: Volume, Variety and Velocity (the 3V's) [1] [4]. Volume is the amount of data generated by multiple systems that needs to be processed and analyzed. Variety is the type of data generated, such as structured, semi-structured and unstructured data; analyzing these types of data is tedious, so scripting techniques are often adopted, and this involves a lot of effort. Velocity is the rate at which data is generated through digitalization, by real-time systems and mission-critical applications. Together these have led to the emergence of a new domain in organizations: Big Data. Testing this Big Data is a crucial task, and organizations struggle to handle it because of a lack of knowledge about what to test and how to test [5]. Organizations face many challenges in forming test strategies for structured and unstructured data and in setting up the environment; defects in the components of a Big Data model result in poor data quality in production and hit response times very badly [7].
In order to manage the accuracy of this data, testing types such as functional and non-functional testing are essential, along with error-free test data and environments. Functional testing covers the Hadoop map-reduce process and input data validation, to ensure that both input and output are of good quality, and validation of the extracted data before it is stored in downstream systems [1]. Non-functional testing covers the performance of the system, which plays a key role in acceptance of the model.
II. Functional Testing
The Hadoop distributed file system (HDFS) makes input data available to multiple nodes. Input data can be extracted from multiple sources, in structured or unstructured form, and loaded into HDFS by splitting it into multiple files; these files are then processed using map and reduce operations, and finally the generated data is extracted from HDFS and stored in downstream systems.
Because we are working with huge data and processing it on multiple nodes, there is a chance that quality data is missed or unrelated data is injected, leading to quality issues at each stage of the process. Data functional testing covers the data in three phases: validation of Hadoop pre-processing, validation of the Hadoop map-reduce process, and validation of the extracted data stored in downstream systems, as shown in the Hadoop Map-Reduce architecture below [2] [5].
Figure 1: Hadoop Map-Reduce Architecture
Figure 1 shows the Hadoop framework for distributed processing of large datasets using map and reduce on any node of the system. Initially, the pre-loaded input data is loaded into HDFS for processing; the map process splits the data and stores the intermediate key-value pairs produced by the mappers. These key-value pairs are shuffled based on the key before being sent to the reduce phase; the reduce phase performs the reduce operation and stores the generated output in the local file system [9]. A minimal illustrative job is sketched below.
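As a concrete illustration of the flow in Figure 1, the following is a minimal word-count style MapReduce job written against the standard Hadoop Java API. It is only an illustrative sketch under assumed class names and command-line input/output paths, not the application used in the experiments of Section IV.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  // Map phase: emit an intermediate (word, 1) pair for every token in the split.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // intermediate key-value pair from the mapper
        }
      }
    }
  }

  // Reduce phase: values arrive grouped (shuffled) by key; sum the counts per key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));   // final output written by the reducer
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);      // optional map-side aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}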
2.1 Validation of Hadoop Pre-processing
Data can be extracted from various sources and loaded into HDFS for processing. Many issues can arise during this process, such as data not being in the expected format, incorrect data captured from source systems, incorrect splitting of data files, and incorrect storage of data. Validation of Hadoop pre-processing includes:
- Comparing the extracted input file format with the source data format, to ensure the data is extracted in the correct format.
- Checking that the data captured in the input files matches the source data.
- Checking that proper splits are made and that they are compatible with the HDFS distributed file system.
- Checking that input files are large: Hadoop reads data in blocks, each typically a multiple of 64 MB, and it is good to have file sizes spanning thousands of such blocks [3] [12] [13] (a sketch of such a size check follows this list).
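As a rough sketch of the last check above, the HDFS Java API can be used to compare each input file's length against the cluster's block size before a job is started. The input directory and the threshold of 1000 blocks are assumptions made purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path("/user/hadoop/input");    // placeholder input directory
    long blockSize = fs.getDefaultBlockSize(inputDir); // e.g. 64 MB on older clusters
    for (FileStatus status : fs.listStatus(inputDir)) {
      long blocks = (status.getLen() + blockSize - 1) / blockSize;
      // Flag files that span only a few blocks; many small files hurt throughput.
      if (blocks < 1000) {                             // assumed threshold for illustration
        System.out.println(status.getPath() + " spans only " + blocks + " block(s)");
      }
    }
  }
}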
2.2 Validation of the Hadoop map-reduce process
Once the files are loaded into HDFS, it is the Hadoop map-reduce job's task to process the input files from different sources. Many issues can arise in this process, including jobs that run correctly on a single node but not on multiple nodes, incorrect aggregation of data in the reduce phase, incorrect output data format, coding errors in the map-reduce phase, and an inappropriate mapper spill size. Validation of the Hadoop map-reduce process includes:
- Tuning the number of map and reduce tasks appropriately [11].
- Validating the data processing and the output files generated; the output file format should be as per the specified requirement.
- Comparing the output file with the input source file formats.
- Verifying and validating the business logic configuration on a single node against multiple nodes [6].
- Consolidating and validating the data after the reduce phase.
- Validating the key-value pairs generated in the map and reduce phases and verifying the shuffling activity thoroughly [8] [12].
- Minimizing the size of the mapper output, which is sensitive to disk I/O, network I/O and memory during the shuffle phase: using minimal map output keys and values, compressing the mapper output and filtering out records on the mapper side instead of the reducer side give an observable performance gain [3].
Compressing the intermediate output of the mappers before transmitting it to the reducers reduces the disk I/O and network traffic caused by reading and moving files around [13]; a configuration sketch follows.
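A minimal sketch of this tuning, assuming Hadoop 2.x property names and that the Snappy codec is installed on the cluster, enables intermediate (map output) compression and sets the reduce task count on the job configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningSketch {
  public static Job configure(Configuration conf) throws Exception {
    // Compress intermediate mapper output before it is shuffled to the reducers.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf, "shuffle tuning sketch");
    // Tune task counts; a single reducer matches the setup used in Section IV.
    job.setNumReduceTasks(1);
    return job;
  }
}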
2.3 Validation of extracted data and storage in downstream systems
After data processing completes, output files are generated and moved to downstream systems for loading. A few issues can arise in this phase, such as incorrect transformation rules being applied, incorrect loading of files into downstream databases, and output file formats that are not compatible with the storage format. Validation of post-processing includes:
- Validation of the transformation rules before extracting the data.
- Checking for data corruption in the output files and validating data integrity in the target system.
- Comparing the storage database format with the output file format, and setting constraint rules in the target system [6].
III. Non-Functional Testing
This type of testing involves performance testing, failover testing and load testing, to identify any bottlenecks in the process.
3.1 Performance Testing
Any big data project deals with processing huge volumes of data, structured or unstructured, in a short time using multiple nodes. Performance suffers badly because of poor architecture design, low-quality coding and poor configuration of parameters, so such systems do not meet user expectations for huge and complex data. Apart from the above, performance also suffers due to improper input files, map-side work being pushed into the reduce phase (such as aggregation of data), redundant sorts and shuffle tasks [5], and communication delay between the map and reduce phases. These performance issues can be resolved by:
- Maintaining large input file sizes.
- Carefully designing the system architecture by considering the design constructs, and performing performance tests to identify bottlenecks.
- Carrying out performance tests with data volumes and infrastructure similar to production.
- Using a Hadoop performance monitoring tool to capture performance metrics such as job completion time, caching percentage, throughput, CPU utilization, memory utilization, the top processes consuming CPU, etc. (a sketch of capturing a few of these metrics directly from a job follows this list).
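Where a dedicated monitoring tool is not available, a few of these metrics can be read directly from a finished job. The sketch below measures job completion time around waitForCompletion and reads the standard Hadoop 2 TaskCounter counters; it is an illustrative helper, not part of the authors' setup.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobMetricsSketch {
  // Runs the job and prints completion time, total task CPU time and physical memory used.
  public static void runAndReport(Job job) throws Exception {
    long start = System.currentTimeMillis();
    boolean ok = job.waitForCompletion(true);
    long wallClockMs = System.currentTimeMillis() - start;      // job completion time
    long cpuMs = job.getCounters()
        .findCounter(TaskCounter.CPU_MILLISECONDS).getValue();  // total task CPU time
    long physMemBytes = job.getCounters()
        .findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).getValue();
    System.out.println("success=" + ok + " wallClockMs=" + wallClockMs
        + " cpuMs=" + cpuMs + " physicalMemoryBytes=" + physMemBytes);
  }
}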
3.2 Failover Testing
The Hadoop architecture consists of HDFS components, the job and task trackers, a name node and many data nodes. Components of Hadoop can become non-functional due to name node or data node failure, job or task tracker failures, etc. The architecture should be designed in such a way that it automatically recovers from the problem and resumes processing data.
Failover testing is important in a Hadoop implementation, and a few validations need to be performed: frequent checking of the edit logs; backup of the name node, which holds the metadata; no data loss during data node failures; operation not halting if any data node fails or while data nodes are being replaced; data replication onto new data nodes; and replication being initiated whenever a data node fails or data becomes corrupted. There should be a constant schedule for the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of the system, and metrics can be captured for RPO and RTO [6]. A sketch of a basic replication check follows.
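One small piece of such a validation can be scripted against the HDFS Java API: checking that every file under a directory still carries the expected replication factor after a data node is taken down. This is only a sketch; the directory and the expected factor of 3 are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheckSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/hadoop/input");   // placeholder directory
    short expected = 3;                          // assumed replication factor
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.getReplication() < expected) {
        // Under-replicated file: the name node should re-replicate it automatically.
        System.out.println(status.getPath() + " replication=" + status.getReplication());
      }
    }
  }
}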
3.3 Load Testing
It is necessary to measure the Hadoop system's behavior under both normal and peak load conditions, in order to identify the maximum operational capacity of the application, any bottlenecks in it, and which element causes the performance degradation. For big data analysis the Hadoop system should be capable of processing vast amounts of data in the form of files from different sources. Load testing can be carried out in three stages: physical testing, dynamic testing and cyclical testing [15].
Physical testing involves providing a constant input file load for a specified amount of time. Dynamic testing involves providing a variable input file load to the map-reduce operations. Cyclical testing consists of repeatedly unloading and loading input files for specified durations and conditions [10]. For each of these load tests, the performance of the system is measured using the metrics below:
- The average and peak response times of the map-reduce phase, calculated as the time taken to process the input files and store the computed data in downstream systems.
- CPU and memory utilization.
- The behavior of the map and reduce phases, measuring the response time of each.
- Throughput, i.e. the number and size of input files processed per second, the error rate, and how many concurrent users can work on the system (a simple load-driver sketch follows this list).
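A very simple driver for the physical (constant load) stage is sketched below: it repeats the same job over a fixed input set and reports the average and peak response times. The job-building callback is an assumption supplied by the caller; the dynamic and cyclical stages would vary or reload the input between runs.

import java.util.concurrent.Callable;
import org.apache.hadoop.mapreduce.Job;

public class PhysicalLoadSketch {
  // jobFactory must build a fresh, fully configured Job over the same fixed input
  // for every run, because a Job instance cannot be resubmitted.
  public static void run(Callable<Job> jobFactory, int runs) throws Exception {
    long total = 0, peak = 0;
    for (int i = 0; i < runs; i++) {
      Job job = jobFactory.call();
      long start = System.currentTimeMillis();
      job.waitForCompletion(false);
      long elapsed = System.currentTimeMillis() - start;   // per-run response time
      total += elapsed;
      peak = Math.max(peak, elapsed);
    }
    System.out.println("average ms=" + (total / runs) + " peak ms=" + peak);
  }
}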
IV. Analysis And Results
We conducted experiments with a 4 GHz Core 2 Duo processor, 4 GB of RAM and a 100 GB hard disk running Ubuntu 12.04 LTS, and with an Apple Intel Core i3 64-bit single-node system, in order to measure the performance of a single-node system in terms of the input file size processed per unit of time, i.e. the throughput of the Hadoop application [14]. It is calculated as the size of the input file processed divided by the total CPU time taken in milliseconds; a computation sketch follows this paragraph.
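A sketch of how this throughput figure can be computed for one run, under the definition above (input bytes divided by total CPU milliseconds), using the input size reported by HDFS and the CPU_MILLISECONDS task counter; it is an illustrative reconstruction, not the authors' measurement harness.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class ThroughputSketch {
  // Returns input bytes processed per millisecond of CPU time for a finished job.
  public static double throughput(Job job, Path input) throws Exception {
    FileSystem fs = FileSystem.get(job.getConfiguration());
    long inputBytes = fs.getContentSummary(input).getLength(); // total size of the input
    long cpuMs = job.getCounters()
        .findCounter(TaskCounter.CPU_MILLISECONDS).getValue(); // total CPU time in ms
    return (double) inputBytes / cpuMs;
  }
}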
The input to the Hadoop application consists of variable file sizes. At the beginning of the experiments we split a 1 GB file into 40, 20 and 10 files of equal size; we then proceeded with single files of larger sizes. The experiments were carried out with Hadoop configured with two mappers and one reducer on the Core 2 Duo and Intel Core i3 systems, as shown in Table 1 below.
Sl.No Files Total File Size Core 2 Duo Throughput Intel Core i3 Throughput
1 40 1GB 3452.5 4098.3
2 20 1GB 3664.6 5168.7
3 10 1GB 3579.1 5113.1
4 1 1GB 3890.3 5237.7
5 1 2GB 4391.6 5355.3
6 1 5GB 5384.9 5835.6
7 1 10GB 5540.5 5754.2
8 1 20GB 5371.4 5904.5
9 1 30GB 5716.5 6333.5
Table 1: Throughput for variable file sizes
From Table 1 we can observe that the throughput of the application is very low when the 1 GB file is split into multiple files of equal size (the first three experiments), and that the throughput varies across these small files; the throughput is good when the same multiple files (40, 20 or 10 files) are concatenated and given to the application as a single 1 GB file. From this analysis we observed that throughput increases as the size of the file increases, as shown in Figure 2 below.
Figure 2: Variable File Size versus Throughput
We further analyzed the rate of processing of records, defined as the percentage of records processed by one machine out of the total records processed by both machines in a unit amount of time; that is, the rate of processing of the Intel Core i3 is the share processed by the Intel Core i3 in unit time divided by the total processed by the Core 2 Duo and the Intel Core i3 together, as shown in Table 2 (a worked example follows Table 2).
Sl.No Files Total File Size Core 2 Duo (%) Intel Core i3 (%)
1 40 1GB 45.7 54.3
2 20 1GB 40.5 55.4
3 10 1GB 41.2 58.5
4 1 1GB 42.6 57.4
5 1 2GB 45.6 54.9
6 1 5GB 47.9 52.1
7 1 10GB 49.5 53.5
8 1 20GB 47.8 54.4
9 1 30GB 47.4 52.6
Table 2: Rate of Processing
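As a worked reading of this definition, the first row of Table 2 is consistent with each machine's share of the combined throughput reported in the first row of Table 1:

Intel Core i3 share: 4098.3 / (3452.5 + 4098.3) ≈ 0.543, i.e. about 54.3%
Core 2 Duo share: 3452.5 / (3452.5 + 4098.3) ≈ 0.457, i.e. about 45.7%

The remaining rows follow the same pattern only approximately, presumably because they are computed from record counts rather than from the rounded throughput figures in Table 1.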
We conducted an experiment with Hadoop on the Core 2 Duo and Intel Core i3 systems to process records in the same files of variable size, and found that processing of records on the Intel Core i3 is faster than on the Core 2 Duo, as shown in Figure 3.
Figure 3: Rate of Processing of different machines
V. Conclusions
Data quality challenges can be resolved by deploying a well-planned testing approach for both functional and non-functional requirements. Applying the right test strategies will improve performance and quality during pre-processing, processing and post-processing. As discussed in the sections above, we can achieve better observable performance from huge single files than from multiple files of variable size, and appropriate configuration of the mapper and reducer parameters also helps in achieving good performance of a Hadoop application. Big Data testing is a specialized stream; testers should build a good data analysis skill set for identifying quality issues in data, which helps in identifying defects early and reduces the cost of implementation.
Acknowledgements
We are indebted to the management of MSRIT, Bangalore for their excellent support in completing this work on time, and for providing the academic and research atmosphere at the institute. A special thanks to the authors mentioned in the references.
References
[1] Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, George Lapis, Thomas Deutsch, Understanding Big Data, McGraw-Hill Companies, 2012.
[2] Chuck Lam, Hadoop in Action, Manning Publications Co., 2010, pp. 21-25.
[3] Dryman, Carpediem, Hadoop Performance Tuning Best Practices, pp. 241-242, ACM, 2014.
[4] Paul Zikopoulos, Chris Eaton, et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Osborne Media, 2011.
[5] Vinaykumar Chandrashekar, Vinyas Shrinivas Shetty, Testing in a Big Data World, EuroSTAR, 2013.
[6] Mahesh Gudipati, Shanthi Rao, Naju D. Mohan, Naveen Kumar Gajja, Big Data: Testing Approach to Overcome Quality Challenges, Infosys, 2013.
[7] Fernando Almeida, Catalin Calistru, The Main Challenges and Issues of Big Data Management, April 2013.
[8] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, Xiao Qin, Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, IEEE, 2010.
[9] Michael Kopp, About the Performance of Map Reduce Jobs, http://apmblog.compuware.com/2012/01/25/about-the-performance-of-map-reduce-jobs/, 2012.
[10] Daniel Menascé, Load Testing of Web Sites, IEEE Internet Computing, vol. 6, no. 4, pp. 70-74, IEEE, 2002.
[11] Shrinivas B. Joshi, Apache Hadoop Performance-Tuning Methodologies and Best Practices, ICPE '12: Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, pp. 241-242.
[12] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., 2012.
[13] Todd Lipcon, 7 Tips for Improving MapReduce Performance, Apache Hadoop for the Enterprise, Cloudera, http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance, 2009.
[14] Performance Measurement of a Hadoop Cluster, https://www.amax.com/enterprise/pdfs/AMAX%20Emulex%20Hadoop%20Whitepaper.pdf, February 23, 2012.
[15] What Are Performance Metrics Used in Load Testing [online], Available: http://loadstorm.com/load-testing-metrics/, 2015.