8CS4-21: Big Data Analytics Lab
Credit: 2
Max. Marks: 50 (IA: 30, ETE: 20)
0L + 0T + 2P
End Term Exam: 2 Hours
List of Experiments:
1. Implement the following data structures in Java:
i) Linked Lists
ii) Stacks
iii) Queues
iv) Set
v) Map
2. Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
Hadoop mainly works in 3 different modes:
Standalone Mode
Pseudo-Distributed Mode
Fully-Distributed Mode
1. Standalone Mode
In Standalone Mode none of the daemons run: NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. (The JobTracker and TaskTracker handle processing in Hadoop 1; in Hadoop 2 the ResourceManager and NodeManager take their place.) Standalone Mode also means that Hadoop is installed on only a single system. By default, Hadoop runs in this Standalone Mode, also called Local Mode, and we mainly use it for learning, testing, and debugging. Hadoop runs fastest in this mode of the three. HDFS (Hadoop Distributed File System), the major storage component of Hadoop, is not used in this mode; Hadoop works directly on the local file system instead, much as Windows uses NTFS (New Technology File System) or FAT32 (File Allocation Table, with 32-bit table entries). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, or core-site.xml for the Hadoop environment. All processes run in a single JVM (Java Virtual Machine), so this mode is suitable only for small development purposes.
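A quick way to confirm a standalone installation is to run one of the bundled examples against the local file system. A sketch of the typical session (the example-jar path varies between Hadoop versions, so treat it as illustrative):

```shell
# Check that the installation is visible on the PATH
hadoop version

# Use the stock configuration files as sample input
mkdir input
cp $HADOOP_HOME/etc/hadoop/*.xml input

# Run the bundled word count example; in standalone mode this uses
# the local file system and a single JVM, no daemons required
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output

# Inspect the result
cat output/part-r-00000
```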
2. Pseudo-Distributed Mode (Single-Node Cluster)
In Pseudo-Distributed Mode we also use only a single node, but the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, that is, the NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager, run as separate processes in separate JVMs (Java Virtual Machines); because they run as different Java processes, this mode is called pseudo-distributed. One thing to remember is that since we are using a single-node setup, all the master and slave processes are handled by the one system: the NameNode and ResourceManager act as masters, while the DataNode and NodeManager act as slaves. The Secondary NameNode is also a master process; its purpose is to take periodic checkpoints (hourly by default) of the NameNode's metadata.
In this mode:
Hadoop is used both for development and for debugging purposes.
HDFS (Hadoop Distributed File System) is used for managing the input and output processes.
We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the environment.
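For a minimal pseudo-distributed setup, the commonly used property values look like this (the localhost:9000 address and the replication factor of 1 are conventional single-node choices, not mandated values):

```xml
<!-- core-site.xml: point the default filesystem at the local NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: one copy of each block, since there is only one DataNode -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```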
3. Fully Distributed Mode (Multi-Node Cluster)
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, that is, the NameNode and ResourceManager, and the rest run the slave daemons, that is, the DataNode and NodeManager. Here Hadoop runs on a cluster of machines or nodes, and the data is distributed across the different nodes. This is the production mode of Hadoop. To understand this mode in physical terms: when you download Hadoop as a tar or zip file, you normally install it on one system and run all the processes there; in fully distributed mode, however, we extract this tar or zip file on each of the nodes in the Hadoop cluster and use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which work as slaves.
3. Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting files
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the HDFS command-line utilities.
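These tasks map onto the `hadoop fs` command-line utility. A sketch of typical commands (the paths and file names here are examples only, and a running HDFS installation is assumed):

```shell
# Adding: create a directory in HDFS and copy a local log file into it
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put access.log /user/hduser/input/

# Retrieving: list the directory and copy a file back to the local disk
hadoop fs -ls /user/hduser/input
hadoop fs -get /user/hduser/input/access.log ./access-copy.log

# Deleting: remove a single file, then the whole directory
hadoop fs -rm /user/hduser/input/access.log
hadoop fs -rm -r /user/hduser/input
```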
4. Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
MapReduce is a programming model and an associated implementation for processing large data sets. Users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.
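The word-count logic can be sketched in plain Java, without the Hadoop classes, to show the two roles: the map function emits a (word, 1) pair per word, and the reduce function sums the counts per word. The class and method names here are illustrative, not Hadoop's API:

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for each word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+"))
            if (!word.isEmpty())
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Reduce" phase: merge all counts associated with the same word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : new String[] {"big data big analytics", "data lab"})
            intermediate.addAll(map(line));
        // Prints {analytics=1, big=2, data=2, lab=1}
        System.out.println(reduce(intermediate));
    }
}
```

In the real Hadoop program the same two steps are written as a `Mapper` and a `Reducer` class, and the framework performs the shuffle between them.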
5. Write a MapReduce program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
This data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has many fields, such as longitude, latitude, daily max/min temperature, and daily average temperature; for simplicity, we will focus on the main element, the temperature. We will use data from the National Centers for Environmental Information (NCEI), which holds a massive amount of historical weather data that we can use for analysis.
Hadoop MapReduce is a software framework for easily writing applications which process big amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The term MapReduce refers to the following two tasks that Hadoop programs perform:
The Map Task: This is the first task; it takes input data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
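The per-year maximum-temperature logic can be sketched in plain Java. The simplified `year,temperature` record format below is an assumption for illustration; real NCEI records carry many more fields, and a real Hadoop job would parse them in a `Mapper`:

```java
import java.util.*;

public class MaxTemperatureSketch {
    // Map: extract a (year, temperature) pair from one record.
    // Here each record is assumed to be "year,temperature".
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return new AbstractMap.SimpleEntry<>(fields[0].trim(),
                                             Integer.parseInt(fields[1].trim()));
    }

    // Reduce: keep the maximum temperature seen for each year.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> maxByYear = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            maxByYear.merge(p.getKey(), p.getValue(), Integer::max);
        return maxByYear;
    }

    public static void main(String[] args) {
        String[] records = {"1950,22", "1950,31", "1951,25", "1951,18"};
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String r : records)
            pairs.add(map(r));
        // Prints {1950=31, 1951=25}
        System.out.println(reduce(pairs));
    }
}
```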
6. Implement Matrix Multiplication with Hadoop MapReduce
MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:
Mapper: It takes raw data as input and organizes it into key/value pairs. For example, in a dictionary you search for the word "Data" and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
Reducer: It is responsible for processing the data in parallel and producing the final output.
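The one-step MapReduce matrix-multiplication algorithm can be simulated in plain Java: the mapper sends each matrix entry to every output cell (i,k) it contributes to, and the reducer pairs entries that share the inner index j and sums the products. The class and method names are illustrative, not Hadoop's API:

```java
import java.util.*;

public class MatrixMultiplyMR {
    // Map for A: entry A[i][j] contributes to every cell (i,k), k = 0..K-1.
    // The emitted value records the source matrix (0 = A), the inner index j,
    // and the entry itself; the key is the output cell "i,k".
    static void mapA(int[][] A, int K, Map<String, List<int[]>> shuffle) {
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < A[0].length; j++)
                for (int k = 0; k < K; k++)
                    shuffle.computeIfAbsent(i + "," + k, x -> new ArrayList<>())
                           .add(new int[] {0, j, A[i][j]});
    }

    // Map for B: entry B[j][k] contributes to every cell (i,k), i = 0..I-1.
    static void mapB(int[][] B, int I, Map<String, List<int[]>> shuffle) {
        for (int j = 0; j < B.length; j++)
            for (int k = 0; k < B[0].length; k++)
                for (int i = 0; i < I; i++)
                    shuffle.computeIfAbsent(i + "," + k, x -> new ArrayList<>())
                           .add(new int[] {1, j, B[j][k]});
    }

    // Reduce for one cell: pair A and B values sharing j, sum the products.
    static int reduce(List<int[]> values, int J) {
        int[] a = new int[J], b = new int[J];
        for (int[] v : values) {
            if (v[0] == 0) a[v[1]] = v[2];
            else           b[v[1]] = v[2];
        }
        int sum = 0;
        for (int j = 0; j < J; j++)
            sum += a[j] * b[j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] A = {{1, 2}, {3, 4}}, B = {{5, 6}, {7, 8}};
        Map<String, List<int[]>> shuffle = new TreeMap<>();
        mapA(A, 2, shuffle);
        mapB(B, 2, shuffle);
        // Prints the product cells: 0,0 -> 19, 0,1 -> 22, 1,0 -> 43, 1,1 -> 50
        for (Map.Entry<String, List<int[]>> cell : shuffle.entrySet())
            System.out.println(cell.getKey() + " -> " + reduce(cell.getValue(), 2));
    }
}
```

The `shuffle` map stands in for Hadoop's shuffle-and-sort phase, which groups all values with the same key before the reducer runs.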
7. Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data.
Pig is a high-level programming language useful for analyzing large data sets. Pig was the result of a development effort at Yahoo! In a MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model that data analysts are familiar with, so, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.
A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation by the Hadoop Pig execution environment. Underneath, the results of these transformations are a series of MapReduce jobs of which the programmer is unaware. So, in a way, Pig in Hadoop allows the programmer to focus on the data rather than on the nature of execution.
Pig Latin is a fairly rigid language that uses familiar keywords from data processing, e.g., JOIN, GROUP, and FILTER.
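A sketch of Pig Latin statements covering these operations, run in the Grunt shell. The file names and field schemas here are assumptions for illustration:

```
-- Load a hypothetical student file (id, name, marks); schema is assumed
students = LOAD 'student.txt' USING PigStorage(',')
           AS (id:int, name:chararray, marks:int);

-- Filter: keep students with passing marks
passed = FILTER students BY marks >= 40;

-- Group and aggregate: count students per mark value
by_marks = GROUP passed BY marks;
counts   = FOREACH by_marks GENERATE group AS marks, COUNT(passed) AS n;

-- Sort: order by marks, highest first
sorted = ORDER passed BY marks DESC;

-- Join: combine with a second relation on the id field
depts  = LOAD 'dept.txt' USING PigStorage(',') AS (id:int, dept:chararray);
joined = JOIN passed BY id, depts BY id;

-- Project: keep only the columns of interest
names = FOREACH joined GENERATE passed::name, depts::dept;
DUMP names;
```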
8. Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
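A sketch of the corresponding HiveQL statements. The database, table, and column names are made up for illustration, and note that CREATE INDEX applies only to Hive versions before 3.0, where indexes were removed:

```sql
-- Create and use a database
CREATE DATABASE IF NOT EXISTS college;
USE college;

-- Create a table over comma-separated data
CREATE TABLE student (id INT, name STRING, marks INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Alter the table: add a column
ALTER TABLE student ADD COLUMNS (dept STRING);

-- Create a view
CREATE VIEW toppers AS SELECT id, name FROM student WHERE marks >= 75;

-- Create an index (Hive 2.x and earlier only)
CREATE INDEX idx_marks ON TABLE student (marks)
  AS 'COMPACT' WITH DEFERRED REBUILD;

-- Drop everything in reverse order
DROP INDEX idx_marks ON student;
DROP VIEW toppers;
DROP TABLE student;
DROP DATABASE college;
```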
Program 1: Linked List in Java
LinkedList is a part of the Collection framework present in the java.util package. This class is an implementation of the linked-list data structure, a linear data structure where the elements are not stored in contiguous locations; every element is a separate object with a data part and an address part. The elements are linked using pointers and addresses, and each element is known as a node. Due to their dynamic nature and the ease of insertions and deletions, linked lists are often preferred over arrays. They also have a few disadvantages: nodes cannot be accessed directly; instead, we must start from the head and follow the links to reach the node we wish to access.
Create and use a linked list:
import java.util.*;

public class Test {
    public static void main(String args[])
    {
        LinkedList<String> ll = new LinkedList<String>();

        // Adding elements to the linked list
        ll.add("A");
        ll.add("B");
        ll.addLast("C");
        ll.addFirst("D");
        ll.add(2, "E");
        System.out.println(ll); // [D, A, E, B, C]

        // Removing elements from the linked list
        ll.remove("B");
        ll.remove(3);
        ll.removeFirst();
        ll.removeLast();
        System.out.println(ll); // [A]
    }
}
Performing Various Operations on LinkedList
1. Adding Elements: In order to add an element to a LinkedList, we can use the add() method. This method is overloaded to perform multiple operations based on different parameters. They are:
add(Object): This method adds an element at the end of the LinkedList.
add(int index, Object): This method adds an element at a specific index in the LinkedList.
2. Changing Elements: After adding the elements, if we
wish to change the element, it can be done using the set()
method. Since a LinkedList is indexed, the element which
we wish to change is referenced by the index of the
element. Therefore, this method takes an index and the
updated element which needs to be inserted at that index.
// Java program to change elements in a LinkedList
import java.util.*;

public class GFG {
    public static void main(String args[])
    {
        LinkedList<String> ll = new LinkedList<>();
        ll.add("Geeks");
        ll.add("Geeks");
        ll.add(1, "Geeks");
        System.out.println("Initial LinkedList " + ll); // [Geeks, Geeks, Geeks]

        ll.set(1, "For");
        System.out.println("Updated LinkedList " + ll); // [Geeks, For, Geeks]
    }
}
3. Removing Elements: In order to remove an element
from a LinkedList, we can use the remove() method.
This method is overloaded to perform multiple
operations based on different parameters. They are:
remove(Object): This method is used to simply remove
an object from the LinkedList. If there are multiple
such objects, then the first occurrence of the object is
removed.
remove(int index): Since a LinkedList is indexed, this
method takes an integer value which simply removes
the element present at that specific index in the
LinkedList. After removing the element, all the
elements are moved to the left to fill the space and the
indices of the objects are updated.
// Java program to remove elements in a LinkedList
import java.util.*;

public class GFG {
    public static void main(String args[])
    {
        LinkedList<String> ll = new LinkedList<>();
        ll.add("Geeks");
        ll.add("Geeks");
        ll.add(1, "For");
        System.out.println("Initial LinkedList " + ll); // [Geeks, For, Geeks]

        ll.remove(1);
        System.out.println("After the Index Removal " + ll); // [Geeks, Geeks]

        ll.remove("Geeks");
        System.out.println("After the Object Removal " + ll); // [Geeks]
    }
}
4. Iterating over the LinkedList: There are multiple ways to iterate through a LinkedList. The most common are the basic for loop combined with the get() method, which fetches the element at a specific index, and the enhanced for loop.
// Java program to iterate over the elements in a LinkedList
import java.util.*;

public class GFG {
    public static void main(String args[])
    {
        LinkedList<String> ll = new LinkedList<>();
        ll.add("Geeks");
        ll.add("Geeks");
        ll.add(1, "For");

        // Using the get() method and a for loop
        for (int i = 0; i < ll.size(); i++) {
            System.out.print(ll.get(i) + " ");
        }
        System.out.println();

        // Using the enhanced for loop
        for (String str : ll)
            System.out.print(str + " ");
    }
}