Sysfore Technologies
#117-120, First Floor, 4th Block, 80 Feet Road, Koramangala, Bangalore 560034
Hadoop and Big Data Analytics
Gartner defines Big Data as “high volume, velocity and variety information
assets that demand cost-effective, innovative forms of information processing
for enhanced insight and decision making”.
Big data is data that, by virtue of its velocity, volume, or variety (the three Vs),
cannot be easily stored or analysed with traditional methods.
The term covers every piece of data your organization has stored to date,
whether on-premises or in the cloud: paper records, digital files, and
structured and unstructured data alike.
There is a deluge of structured and unstructured data that is generated every
second. This is known as Big Data, which can be analysed to help customers
turn that data into insights. AWS provides a broad platform of managed
services, infrastructure and tools to tackle your next Big Data project. It
enables you to collect, store, process, analyse and visualize Big Data on the
cloud. It provides the hardware, software and infrastructure needed to
maintain and scale, so that you can focus on building your application.
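For a concrete (if hedged) illustration of that model: Amazon EMR is AWS's managed Hadoop service, and a small cluster can be launched programmatically with the boto3 SDK. The sketch below is minimal; the cluster name, instance types, region and S3 log bucket are placeholder assumptions, not values from this article.

```python
import boto3  # AWS SDK for Python

# Launch a small managed Hadoop cluster on Amazon EMR.
# Region, names, instance types and the S3 log bucket are
# illustrative placeholders -- substitute your own values.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="bigdata-analytics-demo",          # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",              # an EMR release bundling Hadoop
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                  # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://my-log-bucket/emr-logs/",   # hypothetical bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```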
Some of the common Big Data customer scenarios include Web & E-Tailing,
Telecommunications, Government, Healthcare & Life Sciences, Banking &
Financial Services, and Retail, where Big Data is continuously generated.
How is Big Data consumed by businesses: Businesses can gain a great deal of
insight into how their products are being used by analysing the Big Data they
generate. Big Data analytics is an area of rapidly growing diversity. Analysing
large data sets requires significant compute capacity that can vary in size based
on the amount of input data and the analysis required. This characteristic of
big data workloads is ideally suited to the pay-as-you-go cloud computing
model, where applications can easily scale up and down based on demand.
Using Big Data analytics gives you a clear picture of how your data is
being generated and consumed by your customers. It can be used for predictive
marketing and for planning business growth. It provides:
• Early key indicators that give insight into business trends and, in turn,
business fortunes.
• Analytics results that translate into business advantage.
• More precise analysis and results as more data becomes available.
Limitations of traditional analytics methods: Advances in technology have
resulted in huge volumes of data being generated every second. With
traditional methods, storing, processing and analysing that data to obtain
quality results is time-consuming, costly and ineffective.
• Only a limited amount of high-fidelity raw data is available for analysis.
• Storage is limited relative to the high volume of raw data that is
continuously generated.
• Moving data to the computation does not scale.
• Data is archived regularly to conserve space, which limits the amount of
data available to analytical tools.
• The perception that traditional data warehousing processes are too slow
and too limited in scalability.
• The need to converge data from multiple sources, both structured and
unstructured.
• The realization that time to information is critical to extracting value
from data sources that include mobile devices, RFID, the web and a growing
list of automated sensory technologies.
As requirements change, you can easily resize your environment (horizontally
or vertically) on AWS to meet your needs.
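To make the resizing point concrete, here is a hedged boto3 sketch that scales out the core instance group of a running EMR cluster; the cluster ID and target size are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Find the CORE instance group of a running cluster and grow it
# to 10 nodes. "j-XXXXXXXXXXXXX" is a placeholder cluster ID.
cluster_id = "j-XXXXXXXXXXXXX"
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{
        "InstanceGroupId": core["Id"],
        "InstanceCount": 10,   # new target size (scale out)
    }],
)
```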
In addition, there are at least four major developmental segments that
underscore the diversity found within Big Data analytics: MapReduce,
scalable databases, real-time stream processing and Big Data appliances.
Using Hadoop for Big Data analytics: There is a big difference between Big
Data and Hadoop. The former is an asset, often a complex and ambiguous one,
while the latter is a framework that accomplishes a set of goals and
objectives for dealing with that asset.
Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage
for any kind of data, enormous processing power and the ability to handle
virtually limitless concurrent tasks or jobs.
Hadoop is a framework that allows the processing of large data sets. It can
complete in minutes tasks that would take hours with a traditional RDBMS.
Hadoop has two main components:
• HDFS – Hadoop Distributed File System (for storage)
• MapReduce (for processing)
How the Hadoop Distributed File System works: The Hadoop Distributed File
System (HDFS) is the primary storage system used by Hadoop applications. An
HDFS cluster contains one or more data nodes. Incoming data is split into
segments (blocks) and distributed across the data nodes to support parallel
processing. Each segment is then replicated on multiple data nodes so that
processing can continue in the event of a node failure.
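As a toy illustration of that split-and-replicate idea (a simulation only, not HDFS code), the following Python sketch splits a file into fixed-size segments and places each segment's replicas on distinct data nodes; the 128 MB block size and replication factor of 3 mirror common HDFS defaults.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # HDFS's default replication factor

def place_blocks(file_size: int, nodes: list[str]) -> dict[int, list[str]]:
    """Split a file into blocks and assign each block's replicas
    to REPLICATION distinct data nodes, round-robin style.
    A toy stand-in for the NameNode's placement policy."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(nodes)
    placement = {}
    for block_id in range(num_blocks):
        # take the next REPLICATION distinct nodes from the ring
        placement[block_id] = [next(ring) for _ in range(REPLICATION)]
    return placement

# A 500 MB file across five data nodes -> 4 blocks, 3 replicas each.
layout = place_blocks(500 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4", "dn5"])
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```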
While HDFS protects against some types of failure, it is not entirely fault
tolerant. A single NameNode located on a single server is required. If this
server fails, the entire file system shuts down. A secondary NameNode
periodically backs up the primary. The backup data is used to restart the
primary but cannot be used to maintain operation.
HDFS is typically used in a Hadoop installation, yet other distributed file
systems are also supported. The Amazon S3 file system can be used but does
not maintain information on the location of data segments, reducing the ability
of Hadoop to survive server or rack failures. Other file systems, such as the
open-source CloudStore and the MapR file system, do maintain location
information.
Distributed processing is handled by MapReduce: The idea behind
MapReduce is that Hadoop can first map a large data set, and then perform a
reduction on that content for specific results. A reduce function can be thought
of as a kind of filter for raw data. The HDFS system then acts to distribute data
across a network or migrate it as necessary.
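The canonical example is word count. Hadoop Streaming allows the map and reduce functions to be ordinary programs that read stdin and write stdout; the self-contained Python sketch below imitates that flow locally, with a sort standing in for Hadoop's shuffle phase.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: pairs arrive sorted by key; sum the counts
    for each distinct word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local, single-process stand-in for the full framework:
    # map, then sort by key (Hadoop's shuffle does this), then reduce.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```

Run it with, for example, `echo "to be or not to be" | python wordcount.py`; on a real cluster the same mapper and reducer would run in parallel on many TaskTracker nodes.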
The MapReduce engine consists of one JobTracker and multiple TaskTrackers.
Client applications submit jobs to the JobTracker, which assigns each job to a
TaskTracker node. When HDFS or another location-aware file system is in use,
JobTracker takes advantage of knowing the location of each data segment. It
attempts to assign processing to the same node on which the required data
has been placed.
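A toy sketch of that locality preference (again a simulation, not JobTracker code): given the replica locations the file system reports for a block, the scheduler prefers a free node that already holds the data and falls back to a remote node only when it must.

```python
def schedule(block_replicas: list[str], free_nodes: set[str]) -> str:
    """Pick a node for a task over one data block.
    Prefer a free node that already stores a replica (data-local);
    otherwise fall back to any free node (the data must then move)."""
    for node in block_replicas:
        if node in free_nodes:
            return node            # data-local assignment
    return next(iter(free_nodes))  # remote assignment as a fallback

# Block 0 is replicated on dn1 and dn3; dn1 is busy, dn3 is free,
# so the task runs on dn3 without moving the data.
print(schedule(["dn1", "dn3"], {"dn2", "dn3", "dn4"}))  # -> dn3
```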
Apache Hadoop users typically build their own parallelized computing clusters
from commodity servers, each with dedicated storage in the form of a small
disk array or solid-state drive (SSD) for performance. These are commonly
referred to as “shared-nothing” architectures.
Big Data is getting bigger and more important: As more and more data is
collected, its analysis requires scalable, flexible, high-performing tools
that deliver analysis and insight in a timely fashion. Big Data analytics is
a growing field, and the need to parse large data sets from multiple sources
and to produce information in real time or near real time is gaining
importance. IT organizations are exploring various analytics technologies to
parse web-based data sources and extract value from the social networking
boom.