University of Notre Dame
Hadoop-Based Data Discovery
Team 5: Jaydeep Chakrabarty, Tom Torralbas, Brian Dondanville, Ben Ashkar, Dan Lash
MSBA 70750 Emerging Issues in Analytics
Professor: Don Kleinmuntz
11/01/2015
Table of Contents
Section 1. Executive Summary
Section 2. Introduction and Scope
Section 3. Current State
   Hadoop Core Components
   Hadoop 1.0 Limitations
Section 4. Future State
   YARN: A Hadoop 2.0 Evolution
      Advantages of YARN
      A Few Important Notes about YARN
   Introduction to the User-Friendly Face of Hadoop – Apache Spark
   Key Aspects of Spark
   Difference between Hadoop and Spark
   Emerging Technologies: Hadoop by 2020
Section 5. Description of Proposed Applications
   5.1. Overview of Operational Scenarios
      Data Agnostic
      Open Source
      Price
   5.2. Organizational Impacts
   5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)
Section 6. Summary of Impacts
   Retail Banking Fraud
   Insurance Data Processing
   Capital Markets and Investments
Section 7. Analysis of the Proposed System
Works Cited
Section 1. Executive Summary
Hadoop has emerged as a powerful game changer in how we manage and process data.
As Hadoop continues to evolve, it will be a significant contributor in the ability of data scientists
to unlock the trapped insights in big data. Big data discovery is already starting to reveal
significant insights in fields such as pharmaceuticals, finance, crime prevention, security,
insurance, and banking. As Hadoop becomes universally integrated and distributed, it will
undoubtedly drive continued breakthroughs in many data-intensive industries.
Hadoop is market disruptive. It drives down the cost of managing big data because it is
data and hardware agnostic. Hadoop can work with all types of data formats (storing them in
HDFS) and across most types of hardware. Additionally, Hadoop significantly increases the
speed, access, and manageability of large data sets, which lets users leverage data more
efficiently and supports near-real-time computations.
As the ability to capture, store, and process data continues to grow exponentially, our
ability to draw insight from this data will be critical. Hadoop is at the forefront and will continue
to unlock insights that we can leverage for better decision making across all industries.
Section 2. Introduction and Scope
In our analysis, we will discuss key technologies and components of Hadoop to lay a
foundation of core functions. This will pave the path for discussions about business uses, but
does not dive deeply into the technology. For further reading, we recommend the following
books, which may be found on Amazon:
1. Hadoop: The Definitive Guide by Tom White
2. Hadoop Cluster Deployment by Danil Zburivsky
The bulk of our research came from the books above and Forrester Research.
Unfortunately for our research, many companies are reluctant to discuss in detail how they use
Hadoop or its impacts for fear of losing a competitive advantage. As well, vendors continue to
sensationalize its impacts and functions, and we wanted to remain objective. Much of our
research therefore relies on studies conducted anonymously by Forrester Research and
discussed at a high enough level to protect IP and company identity.
A key component of harnessing the power of big data is being creative in how technology
can be used for a business's unique set of challenges. The examples given will help illustrate what's
possible, but given the flexibility of Hadoop and its open source nature, we know that there are
several use cases yet to be covered. We hope that the reader walks away from this research with
a fair understanding of how Hadoop works and is inspired to use the technology within their own
organization.
Section 3. Current State
Apache Hadoop was inspired by Google's MapReduce and Google File System papers, ideas
that were further developed at Yahoo!. It started as a large-scale distributed batch processing
infrastructure and was later refined into an affordable, scalable, and flexible platform for
working with very large data sets. At its very core, Hadoop is a data file system with an
ecosystem of processing tools. (Hopkins, 2014)
Hadoop Core Components
The figure below shows the core components of Hadoop.
Starting from the bottom of the diagram, let's explore the ecosystem. The ecosystem is made up
of technologies that provide improved capabilities in data access, governance, and analysis,
allowing Hadoop's core capabilities to integrate with a broad range of analytic solutions:
HDFS
A foundational component of the
Hadoop ecosystem is the Hadoop
Distributed File System (HDFS).
HDFS is the mechanism by which a
large amount of data can be
distributed over a cluster of
computers, and data is written once,
but read many times for processing.
HDFS provides the foundation for
other tools, such as HBase.
Map Reduce
Hadoop's main execution framework
is MapReduce, a programming model
for distributed, parallel data
processing, breaking jobs into
mapping phases and reduce phases
(thus the name). Developers write
MapReduce jobs for Hadoop, using
data stored in HDFS for fast data
access. Because of the way
MapReduce works, Hadoop
processes the data in a parallel
fashion, producing results faster.
HBase
A column-oriented NoSQL database
built on top of HDFS, HBase is used
for fast read/write access to large
amounts of data. HBase uses
Zookeeper for its management to
ensure that all of its components are
up and running.
Zookeeper
Zookeeper is Hadoop's distributed
coordination service. Designed to run
over a cluster of machines, it is a
highly available service used for the
management of Hadoop operations,
and many components of Hadoop
depend on it.
Pig
An abstraction over the complexity of
MapReduce programming, the Pig
platform includes an execution
environment and a scripting language
(Pig Latin) used to analyze Hadoop
data sets. Its compiler translates Pig
Latin into sequences of MapReduce
programs.
Hive
An SQL-like, high-level language used
to run queries on data stored in
Hadoop. Hive enables developers not
familiar with MapReduce to write data
queries that are translated into
MapReduce jobs in Hadoop.
Mahout
This is a machine-learning and data-
mining library that provides
MapReduce implementations for
popular algorithms used for
clustering, regression, and
statistical modeling.
Ambari
This is a project aimed at simplifying
Hadoop management by providing
support for provisioning, managing,
and monitoring Hadoop clusters.
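To make the MapReduce model described above concrete, here is a minimal word-count job
written for Hadoop Streaming, which lets any program that reads stdin and writes stdout serve
as a mapper or reducer. This is an illustrative sketch rather than an example from the report's
sources; file names and paths are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each word. Hadoop sorts mapper output
# by key before the reduce phase, so all lines for one word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Such a job would be submitted with the streaming JAR that ships with Hadoop, along the lines
of `hadoop jar hadoop-streaming.jar -input /logs -output /counts -mapper mapper.py -reducer
reducer.py` (paths hypothetical), with HDFS supplying the input splits and collecting the output.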
In the early days, big data processing was computing power intensive, requiring extensive
processing resources, storage, and parallelism. This meant that organizations had to spend a
considerable amount of money to build the infrastructure needed to support big data analytics.
Given the large price tag, only the largest Fortune 500 organizations could afford such an
infrastructure. And even with the large price tag, these traditional systems were slow and
difficult to navigate. Processing time was a significant hindrance in the ability to translate big
data into meaningful insights. (MapR, 2015)
Now, let’s look at some of Hadoop’s core functions in more detail. In Hadoop 1.0, there
was a tight coupling between cluster resource management and the MapReduce programming
model: the JobTracker, which is part of the MapReduce framework, also handled resource
management.
HDFS’s technical functions are based on the Google File System (GFS). Its implementation
addresses a number of problems that are present in other distributed file systems such as
Network File System (NFS). Specifically, the implementation of HDFS is able to store a very large
amount of data (terabytes or petabytes). HDFS is designed to spread data across a large number
of machines, and support much larger file sizes compared to distributed file systems such as NFS.
To store data reliably, and cope with malfunctioning or the loss of individual machines in
a cluster, HDFS uses data replication. HDFS supports only a limited set of operations on files —
writes, deletes, appends, and reads, but not updates. It assumes that the data will be written to
the HDFS once, and then read multiple times. HDFS is implemented as a block-structured file
system. As shown in the figure below, individual files are broken into blocks of a fixed size, which
are stored across a Hadoop cluster. A file can be made up of several blocks, and stored on
different DataNodes (individual machines in the cluster) which are chosen randomly on a block-
by-block basis. As a result, access to a file usually requires access to multiple DataNodes, which
means that HDFS supports file sizes far larger than a single disk in a server could hold.
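As a back-of-the-envelope illustration of this block model (not HDFS source code), the sketch
below splits a file into fixed-size blocks and assigns each block to a set of DataNodes. Real HDFS
placement is rack-aware rather than purely random; the 128 MB block size and replication
factor of 3 are common defaults, and the node names are hypothetical.

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # each block is stored on 3 DataNodes
datanodes = ["dn%02d" % i for i in range(1, 11)]  # hypothetical 10-node cluster

def place_blocks(file_size_bytes):
    """Return a block -> DataNodes map for a file of the given size."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # HDFS chooses replicas rack-aware; random choice is a simplification.
        placement[block_id] = random.sample(datanodes, REPLICATION)
    return placement

# A 1 GB file becomes 8 blocks, each replicated on 3 of the 10 nodes.
for block, nodes in place_blocks(1024 * 1024 * 1024).items():
    print("block %d -> %s" % (block, ", ".join(nodes)))
```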
One of the requirements for such a block-structured file system is the capability to store,
manage, and access file metadata (information about files and blocks) reliably, and to provide
fast access to the metadata store. Unlike HDFS files themselves (which are accessed in a write-
once and read-many model), the metadata structures can be modified by a large number of
clients concurrently. It is important that this information never gets out of sync. HDFS solves this
problem by introducing a dedicated server, called the NameNode, which stores all the metadata
for the file system across the cluster. As mentioned, the implementation of HDFS is based on a
master/slave architecture. On one hand, this approach greatly simplifies the overall HDFS
architecture. On the other, it also creates a single point of failure: losing the NameNode
effectively means losing all HDFS data. To mitigate this problem, Hadoop introduced
a Secondary NameNode.
In the MapReduce framework, a MapReduce job (a MapReduce application) is divided
into a number of tasks called mappers and reducers. Each task runs on one of the servers
(DataNodes) of the cluster, and each server has a limited number of predefined slots (map slots,
reduce slots) for running tasks concurrently. (Grover, 2015)
The JobTracker is responsible for both managing the cluster's resources and driving the
execution of the MapReduce job. It reserves and schedules slots for all tasks, configures, runs
and monitors each task, and if a task fails, allocates a new slot and reattempts the task. After a
task finishes, the job tracker cleans up temporary resources and releases the task's slot to make
it available for other jobs.
Hadoop 1.0 Limitations:
In this section we will present some of the limitations of the initial infrastructure. While
Hadoop 1.0 laid the foundation for a powerful data platform, subsequent infrastructure
optimizations paved the way for its success.
1. Scalability: JobTracker limits scalability by using a single server to handle the following tasks:
o Resource management
o Job and task scheduling
o Monitoring
Although there are many servers (DataNodes) available, the single JobTracker server limits
scalability once it is fully utilized.
2. Availability: In Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails,
all jobs must restart, bringing down the entire system.
3. Resource Utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots
and reduce slots. Resources become constrained when the map slots are full while the reduce
slots sit empty (or vice versa): server resources (DataNodes) reserved for reduce slots can sit
idle even when there is an immediate need for them as map slots. The sketch after this list
illustrates the problem.
4. Non-MapReduce Applications: In Hadoop 1.0, the JobTracker was tightly integrated with
MapReduce and only supported applications that obey the MapReduce programming framework,
limiting Hadoop's ability to run other kinds of applications. (solution, 2015)
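The following toy calculation (not real Hadoop code) illustrates the fixed-slot problem from
limitation 3: a TaskTracker with a map backlog cannot borrow its idle reduce slots.

```python
# Toy illustration of Hadoop 1.0's fixed-slot scheduling, with made-up numbers:
# a TaskTracker with 4 map slots and 4 reduce slots cannot lend idle reduce
# slots to a backlog of map tasks.
MAP_SLOTS, REDUCE_SLOTS = 4, 4
pending_map_tasks, pending_reduce_tasks = 10, 0

running_maps = min(pending_map_tasks, MAP_SLOTS)
running_reduces = min(pending_reduce_tasks, REDUCE_SLOTS)
idle_reduce_slots = REDUCE_SLOTS - running_reduces

print("map tasks waiting: %d" % (pending_map_tasks - running_maps))   # 6
print("reduce slots sitting idle: %d" % idle_reduce_slots)            # 4
```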
Section 4. Future State
YARN: A Hadoop 2.0 Evolution
The first generation of Hadoop provided affordable scalability and a flexible data
structure, but it was really only the first step in the journey. Its batch-oriented job processing and
consolidated resource management were limitations that drove the development of Yet Another
Resource Negotiator (YARN). YARN essentially became the architectural center of Hadoop, since
it allowed multiple data processing engines to handle data stored in one platform.
Advantages of YARN:
1. YARN manages resource utilization efficiently. There are no more fixed map and reduce slots;
YARN provides a central resource manager, so multiple applications can now run in Hadoop, all
sharing a common pool of resources.
2. YARN can even run applications that do not follow the MapReduce model. YARN decouples
MapReduce's resource management and scheduling capabilities from the data processing
component, enabling Hadoop to support more varied processing approaches and a broader array
of applications. For example, Hadoop clusters can now run interactive querying and streaming
data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce
to do what it does best: process data.
A Few Important Notes about YARN:
1. YARN is backward compatible. This means that existing MapReduce jobs can run on Hadoop 2.0
without any changes.
2. The JobTracker and TaskTracker are no longer needed in Hadoop 2.0. YARN splits the
JobTracker's two major functions, resource management and job scheduling/monitoring, into
two separate daemons (components):
o Resource Manager
o Node Manager (node specific)
(Readwrite, 2015)
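As a small, hedged illustration of that shared resource pool, the sketch below queries a YARN
ResourceManager's REST API, which reports cluster-wide metrics and the running applications of
every type. The host name is an assumption; 8088 is only the default ResourceManager web port.

```python
# Query a YARN ResourceManager's REST API for cluster-wide resource usage.
import requests

RM = "http://resourcemanager.example.com:8088"  # hypothetical host

metrics = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("apps running:", metrics["appsRunning"])
print("memory in use: %d of %d MB" % (metrics["allocatedMB"], metrics["totalMB"]))

# Applications of any type (MapReduce, Spark, etc.) appear in one list, since
# YARN schedules them all against the same cluster resources.
apps = requests.get(RM + "/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["applicationType"], app["name"])
```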
Introduction to the User-Friendly Face of Hadoop – Apache Spark
Spark is a fast cluster computing system that originated in UC Berkeley's AMPLab and was
developed with contributions from nearly 250 developers at 50 companies. Spark was created to
make data analytics faster, and to make analytics jobs both easier to write and faster to run.
Apache Spark is open source and available for free download, making it a user-friendly
face of the distributed big data programming framework. Spark follows a general execution
model that supports in-memory computing and optimization of arbitrary operator graphs, so
querying data becomes much faster than with disk-based engines like MapReduce.
Key Aspects of Spark
1. Speed
o Runs up to 100x faster than Hadoop MapReduce in memory.
o Provides in-memory computation for faster data processing than MapReduce.
o Data can be kept in main memory for fast lookup.
2. Runs Everywhere
o Spark runs on Hadoop, on Mesos, standalone, or in the cloud.
o It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
3. Ease of Use
o Write applications quickly in Java, Scala, Python, and R.
o Prebuilt machine learning with MLlib for classification, regression, clustering, chi-square
tests, and correlation.
o Offers over 80 high-level operators that make it easy to build parallel apps.
4. SQL Query Support
o Spark supports SQL queries, streaming data, and complex analytics out of the box.
o These capabilities can be combined into a single workflow, as the sketch below shows.
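As a brief sketch of what this looks like in practice, the snippet below uses the Spark 1.x-era
Python API (matching this report's timeframe) to cache a data set in memory and query it with
SQL in a single workflow. The HDFS path and schema are hypothetical.

```python
# Minimal PySpark sketch: read once from HDFS, keep the parsed data in
# memory, then query it with SQL. Path and column names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-demo")
sqlContext = SQLContext(sc)

# cache() keeps the DataFrame in memory, so repeated queries skip disk I/O.
events = sqlContext.read.json("hdfs:///data/events").cache()
events.registerTempTable("events")

top_pages = sqlContext.sql(
    "SELECT page, COUNT(*) AS visits FROM events "
    "GROUP BY page ORDER BY visits DESC LIMIT 10")
top_pages.show()
```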
[Figure: the Spark stack. Applications written in Java, Scala, or Python run on Spark through
components such as Spark SQL and Spark Streaming, drawing on data sources including HDFS,
HBase, Flume, Kafka, Twitter, and custom feeds.]
Difference between Hadoop and Spark
Hadoop is a parallel data processing framework that has traditionally been used to run
map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark was
designed to run on top of Hadoop as an alternative to the traditional batch map/reduce
model, one that supports real-time stream processing and fast interactive queries that
finish within seconds. Hadoop supports both traditional map/reduce and Spark. (Readwrite,
2015)

Spark
o Because Spark uses RAM rather than network and disk I/O, it is relatively fast compared to
Hadoop.
o Spark uses resilient distributed datasets (RDDs), a clever way of guaranteeing fault tolerance
that minimizes network I/O.
o Machine learning and data mining algorithms are built-in components of Spark.

Hadoop
o Hadoop stores data on disk and uses replication to achieve fault tolerance.
o Hadoop is relatively slow because it uses disk I/O and the network to read and write data.
o Hadoop needs a separate Apache tool, Mahout, for data mining.
Emerging Technologies: Hadoop by 2020
Hadoop Will Be Used for Over 10 Percent of Data Processing and Storage
According to the 2014 State of Database Technology Survey, 13 percent of respondents were
already using Hadoop in production or in pilots, indicating that adoption is on the upswing. By
2020, Hadoop will be used across the enterprise.
Hadoop Will Lead in Infrastructure Spending
Hadoop has the potential to completely reshape the IT infrastructure of many companies. The
technology may be playing a larger role on the IT road map than in the enterprise right now, but
this will flip in the future. By 2020, most enterprises will have IT strategies that leverage Hadoop,
and it will be their greatest infrastructure investment.
Hadoop Will Be Used for Critical Day-to-Day Operations
As Hadoop is used more and the capabilities of YARN become fully realized, more useful
opportunities leveraging technology like Apache Spark and Storm will emerge and quickly
increase its potential. Even now, real-time/operational analytics are the fastest moving part of
the Hadoop ecosystem, and by 2020, Hadoop will be relied on for day-to-day enterprise
operations.
Hadoop Will Advance the Internet of Things
The Internet of things is only possible with instant data processing and prescriptive analytics. As
more things enter the data ecosystem, the burden of processing will become greater and legacy
technology will not be able to keep up. Hadoop will be able to, and by 2020, it will be a mission-
critical foundation for many businesses tied to the Internet of things.
Hadoop Will Be Used for Processing and Storing Highly Sensitive Data
The lack of built-in security in Hadoop is an obstacle that enterprises face today, but new tools
are emerging that address this issue, connecting Kerberos and MapReduce components and
ensuring compatibility between data. By 2020, expect these issues to be resolved and highly
regulated organizations to be managing their secure data with Hadoop. (kdnuggets, 2015)
Section 5. Description of Proposed Applications
Now that we have explained some of the core functions of Hadoop, let’s take a look at
some of the business use cases. Hadoop's flexibility, cheap scalability, and ability to process
large amounts of data quickly make it well suited for analysis projects that were previously not
possible due to hardware limitations. According to a Forrester study, 36% of companies with
between 5,000 and 19,000 employees had already implemented or were planning to implement
a Hadoop-based solution. (Hopkins, 2014) Consider the following scenarios:
Scenario 1:
An e-commerce website receives millions of visits a day, with each customer generating 4 to 10
times that much data while traversing the site. Individual customers can be identified if they have
logged in previously, which allows the retailer to connect their paths to historical purchase
behavior. Customers may also have been touched by digital ads before visiting the site, perhaps
by 3 or 4 different messages. Capturing all of this data over months or a year would allow the
business to better understand which pages and ads lead to better conversion at a customer level.
In a traditional data warehouse, this data would be difficult to collect, connect, and analyze.
Hadoop, however, can capture these touch points as they occur and process them, giving the
business a real-time view of its customers' behavior.
(Fichera, 2014)
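To make this concrete, here is a hypothetical sketch of how such touch points could be counted
as they arrive, using Spark Streaming on a Hadoop cluster. The socket source, host name, and
the "customer_id,page" record format are all illustrative assumptions, not details from the
cited study.

```python
# Hypothetical sketch of Scenario 1: counting page views per customer in
# near real time with Spark Streaming. Host and record format are assumed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="clickstream")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

clicks = ssc.socketTextStream("collector.example.com", 9999)
pairs = clicks.map(lambda line: (line.split(",")[0], 1))   # (customer_id, 1)
views_per_customer = pairs.reduceByKey(lambda a, b: a + b)
views_per_customer.pprint()   # in practice, joined to purchase history

ssc.start()
ssc.awaitTermination()
```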
Scenario 2:
A financial services company is processing millions of transactions a minute; this velocity and
volume of data make Hadoop a perfect fit. Traditional extract, transform, and load (ETL)
environments are expensive to purchase from commercial vendors and grow as data sets expand.
Traditional ETL server environments require significant investments and time to scale. Using
Hadoop makes hardware a commodity that can easily be interchanged and expanded. The value
of Hadoop in this scenario is growing the data capacity quickly and cheaply, allowing all data to
live within a single environment. (Fichera, 2014)
Scenario 3:
Consider a manufacturing plant that has placed sensors on its assembly line. The sensors monitor
everything from the products to the machines performing the work, collecting thousands of data
points each minute. Leveraging Hadoop, the manufacturer is able to collect and analyze the data
in real time with a cheap and vast server farm. Having storage as a commodity allows it to
perform detailed analysis and predict when problems will occur with the machines or with
products. By making a small investment in hardware and building in-house expertise to manage
its Hadoop stack, the company keeps costs down by proactively fixing problems with its
assembly line. (Fichera, 2014)
The above examples illustrate the power of Hadoop; many of these use cases could
previously only be dreamed about. Performing at an enterprise level, this ecosystem of tools gives
companies willing to invest a distinct advantage over their competitors. Data that previously
could not be joined because it was unstructured, or could not be retained for long because of its
size, can now be handled by Hadoop. Unlocking the power of big data allows businesses to
perform data discovery like never before.
The table below, created by Forrester Research, further illustrates business needs and the
translated Hadoop ‘evolution’.
(Hopkins, 2014)
5.1. Overview of Operational Scenarios
Our research indicates that Hadoop has three distinct qualities that make it highly
desirable for data discovery in operational scenarios.
Data Agnostic
Due to its open source environment, Hadoop is highly customizable, and the variety of
prewritten software modules maintained by the community means that someone has
probably already thought of a way to incorporate your data type into the environment.
Hadoop is simply a means for processing data, and it can take in all types of data structures
that previously could not be joined. This quality makes it incredibly attractive as companies
capture unstructured data from a variety of sources. (Fichera, 2014)
Open Source
Originally a source of weakness due to slow development and limited support, the fact
that Hadoop is open to all means that it has evolved many capabilities. A growing
ecosystem keeps expanding Hadoop's functionality to meet new demands, and now that it
has become more mainstream, a growing number of support services are being sold by
vendors. An ever-growing list of functionality combined with professional support options
makes Hadoop enterprise ready. (Fichera, 2014)
Price
Hadoop is freely available for anyone to install and can be implemented by a growing
number of vendors on commodity hardware. This gives companies the option either to grow
expertise in-house or to farm the work out to a vendor, without paying licensing fees for
the software. (Fichera, 2014)
5.2. Organizational Impacts
The time needed to process data is shrinking from months or weeks to hours and
minutes; a WSJ survey found a time savings of 100:1, which translates into a competitive
advantage for financial services companies in particular. Among financial services companies,
the survey also found the following estimates of savings (note that every company is different;
the numbers below are directional):
1. Analyzing risk data decreased from 3 months to 3 hours.
2. Pricing calculations that once took 48 hours now take 20 minutes.
3. Behavioral analytics that took 72 hours now take 20 minutes.
4. Modeling automation grew from 150 models per year to 15,000.
5. An operational data store was built for $300,000 with Hadoop instead of $4M using a
traditional relational database.
(Bean, 2014)
While the processing numbers are impressive, the larger impact is on the business. The
low maintenance costs of Hadoop remove traditional barriers to data, such as long execution
times from IT. We're not pointing fingers at IT, but we all know of expensive projects that take
substantial amounts of time to execute, and longer still to start providing value. Giving
business stakeholders more access and fewer barriers lets them self-serve, quickly finding
what is important and executing on crucial business decisions. (Bean, 2014)
To further illustrate the cost savings of Hadoop, the WSJ survey found a 50:1 ratio for
traditional server costs versus a Hadoop server, thanks to inexpensive hardware; the table
below breaks down the cost of competing products.
(Bean, 2014)
Overall we firmly believe that Hadoop opens the field for what’s possible in using data to
guide business decisions. It does this quickly and cheaply which results in major impacts to the
business.
5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)
The Hadoop ecosystem continues to grow at a rapid pace. Below is an overview of the
current state of Hadoop, which maps to the table in section 5.1.
(Hopkins, 2014)
We can see that there are often multiple potential solution paths to a single business problem.
Section 6. Summary of Impacts
Some organizations are using Hadoop to create a competitive advantage over their
peers, so many are tight-lipped about what they're doing and the impact it has had on the
organization. That said, drawing on our online research and customer interviews, we will walk
through three scenarios in which Hadoop has created value for organizations.
Retail Banking Fraud
Problem:
Banking fraud involving applicant misrepresentation on new accounts is a common
problem, and can even involve internal record manipulation. Third-party services can
help screen applicants up front, but unfortunately some fraudulent applicants still receive
accounts. Once given an account, these applicants overdraw it in the hope that the balance will
be charged off, resulting in lost revenue for the banks. It is possible, however, to detect common
precursors to fraudulent activity and shut the accounts down before major charges occur. Doing
so requires a rapid response to massive amounts of data, which is where Hadoop can help.
Solution:
A Hadoop cluster was used to process vast amounts of financial transactions and, with the help
of smart algorithms, allowed the bank to detect potentially fraudulent activity. Detecting this
activity resulted in fewer write-offs, and in customers being placed into special high-risk
programs that helped them manage their finances. (Hortonworks, 2014)
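As an illustration only (the bank's actual algorithms are not described in the source), the sketch
below shows one plausible precursor check on HDFS-resident transaction data: flagging young
accounts whose withdrawals far outstrip their deposits. Paths, column names, and the threshold
are all hypothetical.

```python
# Hypothetical fraud-precursor screen over transactions stored in HDFS,
# using the Spark 1.x-era DataFrame API. Schema and threshold are assumed.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="fraud-screen")
sqlContext = SQLContext(sc)

txns = sqlContext.read.parquet("hdfs:///bank/transactions")

# Aggregate each new account's first-30-day deposits vs. withdrawals.
agg = (txns.filter(txns.days_since_open <= 30)
           .groupBy("account_id")
           .agg(F.sum(F.when(txns.type == "deposit", txns.amount)
                       .otherwise(0)).alias("deposits"),
                F.sum(F.when(txns.type == "withdraw", txns.amount)
                       .otherwise(0)).alias("withdrawals")))

# Flag accounts withdrawing more than 3x what they deposited (toy threshold).
flagged = agg.filter(agg.withdrawals > 3 * agg.deposits)
flagged.show()
```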
Insurance Data Processing
Many insurers now reward customers for good driving behavior. Customers will place tracking
devices within their vehicles and allow the company to collect geo-location data, which lets the
insurer assess their driving habits. This produces a large volume of data across many customers,
which becomes a processing nightmare in traditional data environments. In this example, 75%
of the data had to be discarded after being collected, reducing visibility into long-term trends
for each customer. The data that was processed also had to be sampled, introducing errors into
the final results.
Solution:
Apache Hadoop was deployed, which allowed the insurer to store data far more economically.
The larger data set allowed the insurer to make far better pricing decisions, because it could
keep data for longer and process it with ease. These two advancements allowed data scientists
to train better models and deploy more advanced processing algorithms. (Hortonworks, 2014)
Capital Markets and Investments
A well-known provider of financial market data collected a massive feed of 50GB of server log
data each day, which would then be queried at a rate of 35,000 times per second. The feed was
typically used to process recent data from the past year, though thirty percent of queries were
for older data. Unfortunately, their servers could only hold 10 years' worth of data, which led to
less informed decisions; and since the servers were aging and built on older infrastructure, they
were barely able to keep up with their 12-millisecond SLA.
Solution:
Older infrastructure was replaced with cheaper, modern Hadoop servers. The new servers
allowed for affordable long-term storage, and using Apache HBase lowered response latency.
The infrastructure refresh led to better performance at a lower cost.
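As a hedged sketch of the HBase access pattern (the provider's actual schema is not public), the
snippet below stores and reads time-keyed market ticks through happybase, a Python client for
HBase's Thrift gateway. Host, table, and column names are hypothetical, and an HBase Thrift
server must be running.

```python
# Hypothetical low-latency lookup pattern: recent market ticks keyed by
# "symbol:timestamp" in HBase, accessed via the happybase Thrift client.
import happybase

conn = happybase.Connection("hbase.example.com")   # hypothetical Thrift gateway
ticks = conn.table("market_ticks")                 # hypothetical table

# Write one tick. HBase stores raw bytes; the row key encodes symbol plus
# timestamp so a scan over one symbol's time range stays cheap.
ticks.put(b"IBM:20151101093000", {b"q:price": b"140.25", b"q:size": b"300"})

# Point read: single-row lookups like this are what keep latency low.
print(ticks.row(b"IBM:20151101093000"))

# Range scan: all IBM ticks for one trading day.
for key, data in ticks.scan(row_start=b"IBM:20151101", row_stop=b"IBM:20151102"):
    print(key, data[b"q:price"])
```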
The above three scenarios illustrate the power of Hadoop in real-life settings. To
summarize our research findings: a Hadoop implementation stands to save an organization
money and lets it act more quickly on its data. It does this by being flexible in its integrations
and hardware requirements. The ability to mimic traditional enterprise data warehouse (EDW)
environments, for instance an SQL database, makes it a no-brainer for organizations that want
to save money in the long run and have the flexibility to embrace emerging technologies, all
while not paying the standard licensing fees of traditional systems. (Hortonworks, 2014)
Section 7. Analysis of the Proposed System
Throughout our analysis we have praised Hadoop and its ecosystem of products. We will
now take a more critical look at the proposed systems and provide an evaluative summary.
Organizational, cultural, and data governance problems are not solved simply by switching
to Hadoop. A business that fails to embrace advanced analytics and typically relies on its data
simply for reporting will not overcome that hurdle by gaining a new platform. Before transitioning
to Hadoop, the organization should think critically about the business problems it is attempting
to solve and then evaluate how a new platform would help support those goals. For instance,
moving to a predictive sales model and leveraging Hadoop is a smart move, but the groundwork
of understanding what makes a good model, and what the data requirements are, is the first
step; moving to a new platform is not. Without the necessary groundwork, a transition is likely to fail to
deliver on its entire value proposition. (Evelson, 2016)
While it may seem obvious, some organizations may think that by moving to a new
platform, they’ll change stakeholders’ opinions about how to leverage data and end long
standing political divides. Disconnects between IT and business stakeholders are infamous and
are a real detriment to the business. However, an organization that fails to see the value of data
prior to the switch will more than likely still have these problems post-implementation. A clear
example of the alternative is Tesla's use of big data to help inform its vehicle production. Being
data driven from day one allowed the company to embrace emerging technologies, and having a
culture built around optimization and data has been a key to its success. This competitive
advantage has left long-standing companies like Ford in the dust for product development. The
best practices around leading and maintaining a data-driven culture still apply and cannot be
replaced simply by a new platform. (Evelson, 2016)
Organizations that have struggled with data governance may exacerbate their problems
by switching to a new platform. If the correct operations are not put in place in a traditional data
environment for ensuring data quality, this leaves little hope that they’ll be solved once there’s
more data. A finely tuned analytics department takes discipline and cannot be accomplished
simply by switching to a new platform. (Evelson, 2016)
Well-built existing reporting and data environments are something to be respected and
admired. CIOs may be in a rush to transition to Hadoop, but if an existing system is providing
substantial value, a slower transition may be more appropriate. New access to data can also be
a sensitive transition for many companies: with more access to data, or new data in general,
will come more requests and new questions. CIOs should consider how they will handle these
requests prior to a planned transition. Internal resources tend to be constrained, so increasing
them is typically recommended. The flexibility of Hadoop to meet emerging needs means BI
departments need to be able to accommodate these requests and be nimble in their execution.
For instance, the ability to integrate a number of new data sources won't add much value for
the organization without someone to do the work. Given the cost savings in software and
infrastructure, a CIO should consider using that money to increase head count. (Evelson, 2016)
For most companies, implementing Hadoop will not be optional; it will be a requirement
to compete. Organizations that can think of new and creative business solutions leveraging big
data will win, and we see Hadoop as a flexible platform that can support those initiatives. As
Hadoop continues to evolve, it is able to tap into the best and brightest technologies. Significant
leaps forward are visible today in computer-assisted steering (and driving), fraud prevention
calls and mobile alerts, severe weather prediction and response, stabilization of complex
financial markets, and facial recognition. Hadoop data processing is central to all of these
evolving technologies, with incredible implications.
Works Cited
Bean, R. (2014, January 27). Financial Services Companies Firms See Results from Big Data
Push. Wall Street Journal. Retrieved September 1, 2015, from
http://blogs.wsj.com/cio/2014/01/27/financial-services-companies-firms-see-results-from-big-data-push
Evelson, B. (2016). Brief: Reasons To (Or Not To) Modernize BI Platforms With Hadoop.
Forrester Research.
Fichera, R. (2014). Building The Foundation For Customer Insight: Hadoop Infrastructure
Architecture. Forrester Research.
Hopkins, B. (2014). Hadoop Ecosystem Overview, Q4 2014. Forrester Research.
Hortonworks. (2014). Hadoop Accelerates Earnings Growth in Banking and Insurance.
Hortonworks.
solution, A. T. (2015). Is Apache Spark Going to Replace Hadoop? Website.

More Related Content

What's hot

Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |EdurekaHadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |EdurekaEdureka!
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction EMC
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753pradip patel
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)JonySaini2
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performanceijcsa
 
Büyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri BilimiBüyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri BilimiAnkara Big Data Meetup
 
XA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within HadoopXA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within Hadoopbalajiganesan03
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopNushrat
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1Donghan Kim
 
Whitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest MindsWhitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest MindsHappiest Minds Technologies
 
Improving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoopImproving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoopeSAT Journals
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dipayan Dev
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"Anita de Waard
 

What's hot (20)

Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |EdurekaHadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
 
Büyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri BilimiBüyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri Bilimi
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 
XA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within HadoopXA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within Hadoop
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache Hadoop
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
 
Whitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest MindsWhitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest Minds
 
Improving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoopImproving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoop
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"
 

Similar to Hadoop Based Data Discovery

Hadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapterHadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapterShiva Achari
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in DelhiAPTRON
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxlorainedeserre
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxBHANU281672
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947CMR WORLD TECH
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesJyrki Määttä
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfDIVYA370851
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 

Similar to Hadoop Based Data Discovery (20)

HDFS
HDFSHDFS
HDFS
 
Hadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapterHadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapter
 
paper
paperpaper
paper
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in Delhi
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Actian DataFlow Whitepaper
Actian DataFlow WhitepaperActian DataFlow Whitepaper
Actian DataFlow Whitepaper
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Big data
Big dataBig data
Big data
 

Hadoop Based Data Discovery

  • 1. University of Notre Dame Hadoop-Based Data Discovery Team 5: Jaydeep Chakrabarty, Tom Torralbas, Brian Dondanville, Ben Ashkar, Dan Lash MSBA 70750 Emerging Issues in Analytics Professor: Don Kleinmuntz 11/01/2015
  • 2. 2 Table of Contents Section 1. Executive Summary....................................................................................................... 3 Section 2. Introduction and Scope................................................................................................. 3 Section 3. Current State................................................................................................................. 4 Hadoop Core Components..............................................................................................................4 Hadoop 1.0 Limitations:.................................................................................................................7 Section 4. Future State................................................................................................................... 8 YARN a Hadoop 2.0 Evolution.........................................................................................................8 Advantages of YARN:..................................................................................................................9 Few Important Notes about YARN:..............................................................................................9 Introduction to the User Friendly Face of Hadoop – Apache Spark...................................................9 Key Aspects of the Spark................................................................................................................9 Difference between Hadoop & Spark............................................................................................11 Emerging Technologies: Hadoop by 202........................................................................................11 Section 5. Description of Proposed Applications........................................................................ 12 5.1. Overview of Operational Scenarios........................................................................................13 Data Agnostic...........................................................................................................................14 Open Source ............................................................................................................................14 Price........................................................................................................................................14 5.2. Organizational Impacts..........................................................................................................14 5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)................................15 6. Summary of Impacts ................................................................................................................ 16 Retail Banking Fraud.................................................................................................................16 Insurance Data Processing.........................................................................................................17 Capital Markets and Investments...............................................................................................17 7. Analysis of the Proposed System............................................................................................. 18 Works Cited .................................................................................................................................. 19
  • 3. 3 Section 1. Executive Summary Hadoop has emerged as a powerful game changer in how we manage and process data. As Hadoop continues to evolve, it will be a significant contributor in the ability of data scientists to unlock the trapped insights in big data. Big data discovery is already starting to reveal significant insights in fields such as pharmaceuticals, finance, crime prevention, security, insurance, and banking. As Hadoop continues to evolve and become universally integrated and distributed, this will undoubtedly lead to continued breakthroughs in many data intensive industries. Hadoop is market disruptive. It drives down the cost of managing big data because it is data and hardware agnostic. Hadoop can work with all types of data formats (converting them to HDFS), and can work across most types of hardware. Additionally, Hadoop significantly increases the speed, access, and management of large data sets which allows users to leverage data sets more efficiently, allowing for near real time computations. As the ability to capture, store, and process data continues to grow exponentially, our ability to draw insight from this data will be critical. Hadoop is at the forefront and will continue to unlock insights that we can leverage for better decision making across all industries. Section 2. Introduction and Scope In our analysis, we will discuss key technologies and components of Hadoop to lay a foundation of core functions. This will pave the path for discussions about business uses, but does not dive deeply into the technology. For further reading, we recommend the following books, which may be found on Amazon: 1. Hadoop: The Definitive Guide by Tom White 2. Hadoop Cluster Deployment by Danil Zburivsky The bulk of our research came from the books above and Forester Research. Unfortunately for our research, many companies are reluctant to discuss in detail how they use Hadoop or its impacts for fear of loosing a competitive advantage. As well, vendors continue to sensationalize its impacts and functions, and we wanted to remain objective. Much of our research relies on existing research that was anonymously conducted by Forrester Research and discussed at a high enough level to protect IP and company identity. A key component of harnessing the power of big data is being creative in how technology can be used for a business’unique set of challenges.Theexamples givenwillhelp illustratewhat’s
  • 4. 4 possible, but given the flexibility of Hadoop and its open source nature, we know that there are several use cases yet to be covered. We hope that the reader walks away from this research with a fair understanding of how Hadoop works and is inspired to use the technology within their own organization. Section 3. Current State Apache Hadoop was inspired by Google’s MapReduce and Google File System papers that were further developed at Yahoo!. It started as a large-scale distributed batch processing infrastructure, and was later refined to meet the needs of an affordable, scalable and flexible data structure, which could be used for working with very largedata sets. At its very core, Hadoop is a data file system with an ecosystem of processing tools. (Hopkins, 2014) Hadoop Core Components Figure below shows the core components of Hadoop. Starting from the bottom of the diagram, lets explore the ecosystem. The ecosystem is made of technologies that provide improved capabilities in data access, governance, and analysis. This allows Hadoop’s core capabilities, to integrate with a broad range of analytic solutions: HDFS A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers, and data is written once, but read many times for processing. HDFS provides the foundation for other tools, such as HBase. Map Reduce Hadoop's main execution framework is MapReduce, a programming model for distributed, parallel data processing, breaking jobs into mapping phases and reduce phases (thus the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the nature of how MapReduce works, Hadoop processes the data in a parallel fashion, resulting in faster results.
In the early days, big data processing was computing-power intensive, requiring extensive processing resources, storage, and parallelism. This meant that organizations had to spend a considerable amount of money to build the infrastructure needed to support big data analytics. Given the large price tag, only the largest Fortune 500 organizations could afford such an infrastructure. And even with the large price tag, these traditional systems were slow and difficult to navigate. Processing time was a significant hindrance to the ability to translate big data into meaningful insights. (MapR, 2015)

HBase
A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure that all of its components are up and running.

Zookeeper
Zookeeper is Hadoop's distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.

Pig
An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs.

Hive
An SQL-like, high-level language used to run queries on data stored in Hadoop. Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop.

Mahout
A machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression testing, and statistical modeling.

Ambari
A project aimed at simplifying Hadoop management by providing support for provisioning, managing, and monitoring Hadoop clusters.
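As one example from the component list above, HBase's fast row-level access is visible in just a few lines of Python. This sketch uses the third-party happybase client, which talks to HBase through its Thrift gateway; the host name, table, and column family are illustrative assumptions.

    # Fast row-level read/write against HBase from Python via happybase.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # assumed Thrift server
    table = connection.table("customer_events")             # assumed table name

    # Write a cell under the "profile" column family...
    table.put(b"customer-42", {b"profile:last_page": b"/checkout"})

    # ...and read the row back by key with low latency.
    row = table.row(b"customer-42")
    print(row[b"profile:last_page"])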
Now, let's look at some of Hadoop's core functions in more detail. In Hadoop 1.0, cluster resource management was tightly coupled to the MapReduce programming model: the JobTracker, which manages the cluster's resources, is itself part of the MapReduce framework.

HDFS's technical functions are based on the Google File System (GFS). Its implementation addresses a number of problems that are present in other distributed file systems such as the Network File System (NFS). Specifically, HDFS is able to store a very large amount of data (terabytes or petabytes). It is designed to spread data across a large number of machines and to support much larger file sizes than distributed file systems such as NFS. To store data reliably, and to cope with the malfunction or loss of individual machines in a cluster, HDFS uses data replication.

HDFS supports only a limited set of operations on files: writes, deletes, appends, and reads, but not updates. It assumes that data will be written to HDFS once, and then read multiple times. HDFS is implemented as a block-structured file system. As shown in the figure below, individual files are broken into blocks of a fixed size, which are stored across a Hadoop cluster. A file can be made up of several blocks, which are stored on different DataNodes (individual machines in the cluster) chosen randomly on a block-by-block basis. As a result, access to a file usually requires access to multiple DataNodes, which means that HDFS supports file sizes far larger than a single disk in a server could hold.

One of the requirements for such a block-structured file system is the capability to store, manage, and access file metadata (information about files and blocks) reliably, and to provide fast access to the metadata store. Unlike HDFS files themselves (which are accessed in a write-once, read-many model), the metadata structures can be modified by a large number of clients concurrently, and it is important that this information never gets out of sync. HDFS solves this problem with a dedicated server, called the NameNode, which stores all the metadata for the file system across the cluster. As mentioned, the implementation of HDFS is based on a master/slave architecture. On one hand, this approach greatly simplifies the overall HDFS architecture. On the other, it also creates a single point of failure: losing the NameNode effectively means losing all HDFS data. To mitigate this problem, Hadoop implemented a Secondary NameNode.
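These mechanics are visible in everyday use. The sketch below drives the standard hdfs dfs shell commands from Python; the file paths are illustrative, and the hdfs fsck call prints the block layout and DataNode locations that the NameNode has recorded for the file.

    # Illustrative HDFS interaction via the standard shell commands.
    import subprocess

    # Copy a local file into the cluster; HDFS splits it into fixed-size
    # blocks and replicates each block across DataNodes.
    subprocess.run(["hdfs", "dfs", "-put", "clickstream.log",
                    "/data/clickstream.log"], check=True)

    # Read it back, as often as needed, from any client. Note there is no
    # "update" operation: the model is write-once, read-many.
    subprocess.run(["hdfs", "dfs", "-cat", "/data/clickstream.log"], check=True)

    # Ask the NameNode how the file is laid out: which blocks, on which nodes.
    subprocess.run(["hdfs", "fsck", "/data/clickstream.log",
                    "-files", "-blocks", "-locations"], check=True)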
In the MapReduce framework, a MapReduce job (a MapReduce application) is divided into a number of tasks called mappers and reducers. Each task runs on one of the servers (DataNodes) of the cluster, and each server has a limited number of predefined slots (map slots and reduce slots) for running tasks concurrently. (Grover, 2015)

The JobTracker is responsible for both managing the cluster's resources and driving the execution of the MapReduce job. It reserves and schedules slots for all tasks; configures, runs, and monitors each task; and, if a task fails, allocates a new slot and reattempts the task. After a task finishes, the JobTracker cleans up temporary resources and releases the task's slot to make it available for other jobs.

Hadoop 1.0 Limitations:

In this section we present some of the limitations of the initial infrastructure. While 1.0 laid the foundation for a powerful data platform, subsequent infrastructure optimizations paved the way for its success.

1. Scalability: The JobTracker limits scalability by using a single server to handle the following tasks:
o Resource management
o Job and task scheduling
o Monitoring
Although there are many servers (DataNodes) available, this single server limits scalability once it is fully utilized.
2. Availability: In Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails, all jobs must restart, bringing down the entire system.
3. Resource Utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots and reduce slots. Resources become constrained when the map slots are full while the reduce slots sit empty (and vice versa): server resources (DataNodes) reserved for reduce slots can sit idle even when there is an immediate need for them as map slots.
4. Non-MapReduce Applications: In Hadoop 1.0, the JobTracker was tightly integrated with MapReduce and only supported applications that obey the MapReduce programming framework, limiting its ability to integrate with other applications. (solution, 2015)

Section 4. Future State

YARN: A Hadoop 2.0 Evolution

The first generation of Hadoop provided affordable scalability and a flexible data structure, but it was really only the first step in the journey. Its batch-oriented job processing and consolidated resource management were limitations that drove the development of Yet Another Resource Negotiator (YARN). YARN essentially became the architectural center of Hadoop, since it allows multiple data processing engines to handle data stored in one platform.
Advantages of YARN:

1. YARN efficiently manages resource utilization. There are no more fixed map and reduce slots; YARN provides a central resource manager, so multiple applications can run in Hadoop, all sharing a common pool of resources (see the sketch after these notes).
2. YARN can even run applications that do not follow the MapReduce model. YARN decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce to do what it does best: process data.

A Few Important Notes about YARN:

1. YARN is backward compatible. Existing MapReduce jobs can run on Hadoop 2.0 without any change.
2. The JobTracker and TaskTracker are no longer needed in Hadoop 2.0; they have disappeared entirely. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into two separate daemons (components):
o Resource Manager
o Node Manager (node specific) (Readwrite, 2015)
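As a sketch of what a shared resource pool means in practice, the ResourceManager in Hadoop 2.x exposes a REST API that reports cluster capacity and every running application, whatever its type. The host, port, and unauthenticated access below are illustrative assumptions.

    # Observe YARN's shared resources via the ResourceManager REST API.
    import requests

    RM = "http://resourcemanager-host:8088"  # assumed ResourceManager address

    # Cluster-wide metrics: memory available vs. total, across all applications.
    metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
    print(metrics["availableMB"], "MB free of", metrics["totalMB"])

    # Running applications; note they need not be MapReduce jobs.
    apps = requests.get(f"{RM}/ws/v1/cluster/apps",
                        params={"states": "RUNNING"}).json()
    for app in (apps.get("apps") or {}).get("app", []):
        print(app["name"], app["applicationType"])  # e.g. MAPREDUCE, SPARK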
Introduction to the User-Friendly Face of Hadoop – Apache Spark

Spark is a fast cluster computing system developed at UC Berkeley's AMPLab with contributions from nearly 250 developers across 50 companies. Spark was created to make data analytics both easier to write and faster to run. Apache Spark is open source and available for free download, making it a user-friendly face of the distributed big data programming framework. Spark follows a general execution model that supports in-memory computing and the optimization of arbitrary operator graphs, so querying data becomes much faster than on disk-based engines like MapReduce.

Key Aspects of Spark

1. Speed
o Runs up to 100x faster than Hadoop MapReduce in memory.
o Provides in-memory computation for increased speed and data processing over MapReduce.
o Information is maintained in main memory for fast lookup.
2. Runs Everywhere
o Spark runs on Hadoop, Mesos, standalone, or in the cloud.
o It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
3. Ease of Use
o Write applications quickly in Java, Scala, Python, and R.
o Prebuilt machine learning with MLlib for classification, regression, clustering, chi-square tests, and correlation.
o Offers over 80 high-level operators that make it easy to build parallel apps.
4. SQL Query Support
o Spark supports SQL queries, streaming data, and complex analytics out of the box.
o These capabilities can be combined into a single workflow.

[Figure: the Spark stack. Java, Scala, and Python APIs sit over Spark core, Spark SQL, and Spark Streaming, drawing on sources such as HDFS, HBase, Flume, Kafka, Twitter, and custom feeds.]
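The in-memory and SQL aspects above are easiest to see in a few lines of PySpark. The file path and column name are illustrative assumptions; the local[*] master simply runs Spark on the current machine for experimentation.

    # Cache a data set in memory, then query it with plain SQL.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("spark-demo")
             .getOrCreate())

    # Load once and cache: repeated queries avoid re-reading from disk.
    events = spark.read.json("events.json").cache()
    events.createOrReplaceTempView("events")

    # Spark SQL runs this on the same optimized execution engine.
    spark.sql("""
        SELECT page, COUNT(*) AS visits
        FROM events
        GROUP BY page
        ORDER BY visits DESC
        LIMIT 10
    """).show()

    spark.stop()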
Difference between Hadoop & Spark

Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs: long-running jobs that take minutes or hours to complete. Spark was designed to run on top of Hadoop, and it is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and for fast interactive queries that finish within seconds. Hadoop supports both traditional map/reduce and Spark. (Readwrite, 2015)

Spark
o Because Spark uses RAM rather than network and disk I/O, it is relatively fast compared to Hadoop.
o Spark uses resilient distributed datasets (RDDs), a clever way of guaranteeing fault tolerance that minimizes network I/O.
o Machine learning and data mining algorithms ship as a component of Spark (MLlib).

Hadoop
o Hadoop stores data on disk and uses replication to achieve fault tolerance.
o Hadoop is relatively slow because it uses disk I/O and the network to read and write data.
o Hadoop needs a separate Apache tool, Mahout, for data mining.

Emerging Technologies: Hadoop by 2020

Hadoop Will Be Used for Over 10 Percent of Data Processing and Storage
According to the 2014 State of Database Technology Survey, 13 percent of respondents are already using Hadoop in production or pilots, indicating that it is on the upswing. By 2020, Hadoop will be used across the enterprise.

Hadoop Will Lead in Infrastructure Spending
Hadoop has the potential to completely reshape the IT infrastructure of many companies. The technology may be playing a larger role on the IT road map than in the enterprise right now, but this will flip in the future. By 2020, most enterprises will have IT strategies that leverage Hadoop, and it will be their greatest infrastructure investment.

Hadoop Will Be Used for Critical Day-to-Day Operations
As Hadoop is used more and the capabilities of YARN become fully realized, more useful opportunities leveraging technologies like Apache Spark and Storm will emerge and quickly increase its potential. Even now, real-time/operational analytics are the fastest moving part of the Hadoop ecosystem, and by 2020, Hadoop will be relied on for day-to-day enterprise operations.

Hadoop Will Advance the Internet of Things
The Internet of Things is only possible with instant data processing and prescriptive analytics. As more things enter the data ecosystem, the burden of processing will become greater, and legacy technology will not be able to keep up. Hadoop will, and by 2020 it will be a mission-critical foundation for many businesses tied to the Internet of Things.

Hadoop Will Be Used for Processing and Storing Highly Sensitive Data
The lack of built-in security in Hadoop is an obstacle that enterprises face today, but new tools are emerging that address this issue, connecting Kerberos and MapReduce components and ensuring compatibility between data. By 2020, expect these issues to be resolved and highly regulated organizations to be managing their secure data with Hadoop. (kdnuggets, 2015)

Section 5. Description of Proposed Applications

Now that we have explained some of the core functions of Hadoop, let's look at some of the business use cases. The flexibility of Hadoop, along with its ability to scale cheaply and to process large amounts of data quickly, makes it well suited for analysis projects that were previously not possible due to hardware limitations. According to a Forrester study, 36% of companies with 5,000 to 19,000 employees had already implemented, or were planning to implement, a Hadoop-based solution. (Hopkins, 2014) Consider the following scenarios:

Scenario 1:
An e-commerce website receives millions of visits a day, with each customer generating 4-10x that amount of data as they traverse the site. Individual customers may be identified if they have logged in previously, which allows the retailer to connect their paths to historical purchase behavior. Customers may also have been touched by digital ads before visiting the site, perhaps by 3 or 4 different messages. Capturing all of this data over months or a year would allow the business to better understand which pages and ads lead to better conversion at a customer level. In a traditional data warehouse, this data would be difficult to collect, connect, and analyze. Hadoop, however, is able to capture these touch points in real time and process them, giving the business a real-time view of their customers' behavior. (Fichera, 2014)

Scenario 2:
A financial services company is processing millions of transactions a minute; the velocity and volume of the data make Hadoop a perfect fit. Traditional extract, transform, and load (ETL) environments are expensive to purchase from commercial vendors and grow as data sets expand, and traditional ETL server environments require significant investments and time to scale. Using Hadoop makes hardware a commodity that can easily be interchanged and expanded. The value of Hadoop in this scenario is growing the data capacity quickly and cheaply, allowing all data to live within a single environment. (Fichera, 2014)
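Scenarios like the first two hinge on processing events as they arrive rather than in nightly batches. Below is a minimal sketch of that pattern using Spark Streaming's micro-batch (DStream) API; the socket source, host, and port are illustrative stand-ins for a production feed such as Kafka or Flume.

    # Count page views in 5-second micro-batches as clickstream events arrive.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "clickstream")   # local mode for illustration
    ssc = StreamingContext(sc, 5)                  # 5-second micro-batches

    # Each incoming line is assumed to be one page-view event (a page URL).
    lines = ssc.socketTextStream("event-host", 9999)
    page_counts = (lines.map(lambda page: (page, 1))
                        .reduceByKey(lambda a, b: a + b))
    page_counts.pprint()                           # print each batch's counts

    ssc.start()
    ssc.awaitTermination()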
Scenario 3:
Consider a manufacturing plant that has placed sensors on its assembly line. The sensors monitor everything from the products to the machines performing the work, collecting thousands of data points each minute. Leveraging Hadoop, the manufacturer is able to collect and analyze the data in real time with a cheap and vast server farm. Having storage as a commodity allows them to perform detailed analysis and predict when problems will occur with the machines or with the products. By making a small investment in hardware and then building in-house expertise to manage their Hadoop stack, the company keeps costs down by proactively fixing problems on the assembly line. (Fichera, 2014)

The above examples illustrate the power of Hadoop; many of these use cases were only dreamed about previously. Performing at an enterprise level, this ecosystem of tools gives companies willing to make the investment a distinct advantage over their competitors. Hadoop handles data that previously could not be joined because it was unstructured, or could not be retained for long because of its size. Unlocking the power of big data allows businesses to perform data discovery like never before. The table below, created by Forrester Research, further illustrates business needs and the corresponding Hadoop 'evolution'. (Hopkins, 2014)

5.1. Overview of Operational Scenarios
Our research indicates that Hadoop has three distinct qualities that make it highly desirable for data discovery in operational scenarios.

Data Agnostic
Because of its open source environment, Hadoop is highly customizable, and the variety of prewritten software modules maintained by the community means that someone has probably already thought of a way to incorporate your data type into the environment. Hadoop is simply a means for processing data, and it can take all types of data structures that previously could not be joined. This quality makes it incredibly attractive as companies capture unstructured data from a variety of sources. (Fichera, 2014)

Open Source
Open source was originally a source of weakness, due to slow development and limited support, but the fact that Hadoop is open to all means that it has evolved many capabilities. There is a growing ecosystem expanding Hadoop's functionality to meet new demands, and now that it has become more mainstream, a growing number of support services are being sold by vendors. An ever-growing list of functionality combined with professional support options makes Hadoop enterprise ready. (Fichera, 2014)

Price
Hadoop is freely available for anyone to install and can be implemented by a growing number of vendors on commodity hardware. This gives companies the option to either grow expertise in-house or farm the work out to a vendor, without paying licensing fees for the software. (Fichera, 2014)

5.2. Organizational Impacts

According to a WSJ survey, data processing time is being scaled down from months or weeks to hours and minutes, a time savings of 100:1. This time savings results in a competitive advantage for financial services companies in particular. Among financial services companies, the survey also found the following estimates of savings; it should be noted that each company is different and the numbers below are directional:

1. Analyzing risk data decreased from 3 months to 3 hours.
2. Pricing calculations that once took 48 hours now take 20 minutes.
3. Behavioral analytics that took 72 hours now take 20 minutes.
4. Modeling automation grew from 150 models per year to 15,000.
5. An operational data store was built for $300,000 with Hadoop instead of $4M using a traditional relational database.
(Bean, 2014)

While the processing numbers are impressive, the larger impact is on the business. The low maintenance costs of Hadoop remove traditional barriers to data, for instance the time it takes IT to execute. We're not pointing fingers at IT, but we all know of expensive projects that take substantial amounts of time to execute, and take longer still to start providing value. Giving business stakeholders more access and fewer barriers allows them to self-serve, quickly finding what is important and executing on crucial business decisions. (Bean, 2014)

To further illustrate the cost savings of Hadoop, the WSJ survey found a 50:1 ratio for traditional server costs versus a Hadoop server. This is due to inexpensive hardware; the table below is a breakdown of the cost of competing products. (Bean, 2014)

Overall, we firmly believe that Hadoop opens the field for what's possible in using data to guide business decisions. It does this quickly and cheaply, which results in major impacts to the business.

5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)

The Hadoop ecosystem continues to grow at a rapid pace. Below is an overview of the current state of Hadoop, which matches the table in Section 5.1.
(Hopkins, 2014)

We can see that there are often multiple potential solution paths to a single business problem.

6. Summary of Impacts

Some organizations are using Hadoop to create a competitive advantage among their peers, so many are tight-lipped about what they're doing and the impact it has had on the organization. That said, from our online research and customer interviews, we will walk through three scenarios in which Hadoop has created value for organizations.

Retail Banking Fraud

Problem:
Banking fraud involving applicant misrepresentation on new accounts is a common problem, and it can even be caused by internal record manipulation. Third-party services can help screen applicants up front, but, unfortunately, some still receive accounts. Once given an account, these applicants will overdraw it in hopes of the balance being charged off, resulting in lost revenue for the banks. It is possible, however, to detect common precursors to fraudulent activity and shut the accounts down before major charges occur. This requires a rapid response to massive amounts of data, which is where Hadoop can help.
Solution:
A Hadoop cluster was used to process vast amounts of financial transactions and, with the help of smart algorithms, allowed the bank to detect potential fraudulent activity. Detecting this activity resulted in fewer write-offs, and customers could be placed into special high-risk programs that helped them manage their finances. (Hortonworks, 2014)

Insurance Data Processing

Problem:
Many insurers now reward customers for good driving behavior. Customers place tracking devices in their vehicles and allow the company to collect geolocation data, which lets it assess how good their driving habits are. This results in a large volume of data being collected across many customers, which becomes a processing nightmare in traditional data environments. In this example, 75% of the data had to be discarded after being collected, reducing visibility into long-term trends for each customer. Moreover, the data that was processed had to be sampled, which introduced errors into the final results.

Solution:
Apache Hadoop was deployed, which allowed the insurer to store data far more economically. The larger data set allowed the insurer to make far better decisions about whom to charge and how much, because it could keep data longer and process it with ease. These two advancements allowed data scientists to train better models and deploy more advanced processing algorithms. (Hortonworks, 2014)

Capital Markets and Investments

Problem:
A well-known provider of financial market data collected a massive feed of 50GB of server log data each day, which would then be queried at a rate of 35,000 times per second. This feed was typically used for processing recent data from within the past year, though thirty percent of queries were for older data. Unfortunately, their servers could only hold 10 years' worth of data, which led to less informed decisions; as well, since the servers were aging and based on older infrastructure, they were barely able to keep up with their 12-millisecond SLA.

Solution:
The older infrastructure was replaced with cheaper, modern Hadoop servers. The new servers allowed for affordable long-term storage, and using Apache HBase lowered query latency. The infrastructure refresh led to better performance at a cheaper rate.

The above three scenarios illustrate the power of Hadoop in real-life settings. To summarize the findings from our research: an implementation stands to save the organization money and allows it to act more quickly on its data. It does this by being flexible with integrations and hardware requirements. The ability to mimic traditional EDW environments, for instance an SQL database, makes it a no-brainer for organizations that want to save money in the long run and have the flexibility to embrace emerging technologies, all while not paying the standard licensing fees of traditional systems. (Hortonworks, 2014)
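To make the retail banking scenario a bit more concrete, here is a minimal sketch of rule-based precursor flagging as a Spark job over transactions stored in HDFS. The schema, threshold values, and paths are our own illustrative assumptions, not the bank's actual algorithm.

    # Flag accounts whose activity matches simple hypothetical fraud precursors.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("fraud-precursors").getOrCreate()

    # Assumed columns: account_id, amount (signed), ts (timestamp).
    txns = spark.read.parquet("/data/transactions")

    flagged = (txns.groupBy("account_id")
                   .agg(F.sum("amount").alias("net_flow"),
                        F.count("*").alias("txn_count"))
                   # Hypothetical precursor: heavy net overdrawing spread
                   # across an unusually large number of transactions.
                   .where((F.col("net_flow") < -1000) & (F.col("txn_count") > 20)))

    flagged.write.mode("overwrite").parquet("/data/accounts_to_review")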
7. Analysis of the Proposed System

Throughout our analysis we have praised Hadoop and its ecosystem of products. We will now take a more critical look at the proposed systems and provide an evaluative summary.

Organizational, cultural, and data governance problems are not solved simply by switching to Hadoop. A business that fails to embrace advanced analytics, and typically relies on its data simply for reporting, will not overcome that hurdle by gaining a new platform. Before transitioning to Hadoop, the organization should think critically about the business problems it is attempting to solve and then evaluate how a new platform would help support those goals. For instance, moving to a predictive sales model and leveraging Hadoop is a smart move, but the groundwork of understanding what makes a good model and what the data requirements are is the first step, not moving to a new platform. Without the necessary groundwork, a transition is likely to fail to deliver on its entire value proposition. (Evelson, 2016)

While it may seem obvious, some organizations may think that by moving to a new platform, they'll change stakeholders' opinions about how to leverage data and end long-standing political divides. Disconnects between IT and business stakeholders are infamous and a real detriment to the business. However, an organization that fails to see the value of data prior to the switch will more than likely still have these problems post-implementation. A clear example of the opposite is Tesla's use of big data to help inform their vehicle production. Being data driven from day one allowed the company to embrace emerging technologies, and a culture built around optimization and data has been a key to their success. This competitive advantage has left long-standing companies like Ford in the dust in product development. The best practices around leading and maintaining a data-driven culture still apply and cannot be replaced by a new platform. (Evelson, 2016)

Organizations that have struggled with data governance may exacerbate their problems by switching to a new platform. If the correct operations for ensuring data quality are not put in place in a traditional data environment, there is little hope that they will appear once there is more data. A finely tuned analytics department takes discipline, which cannot be achieved simply by switching platforms. (Evelson, 2016)

Well-built existing reporting and data environments are something to be respected and admired. CIOs may be in a rush to transition to Hadoop, but if an existing system is providing substantial value, a slower transition may be more appropriate. New access to data can also be a sensitive transition for many companies: with more access to data, or new data in general, will come more requests and new questions. CIOs should consider how they will handle these requests prior to a planned transition. Internal resources tend to be tightly constrained, so increasing them is typically recommended. The flexibility of Hadoop to meet emerging needs means BI departments need to be able to accommodate these requests and be nimble in their execution. For instance, having the ability to integrate a number of new data sources without
someone to do the integration won't add much value for the organization. Given the cost savings in software and infrastructure, a CIO should consider using that money to increase head count. (Evelson, 2016)

For most companies, implementing Hadoop will not be an option but a requirement to compete. Organizations that are able to think of new and creative business solutions leveraging big data will win, and we see Hadoop as a flexible platform that can support those initiatives. As Hadoop continues to evolve, it is able to tap into the best and brightest technologies. Significant leaps forward are visible today in computer-assisted steering (and driving), fraud prevention calls and mobile alerts, severe weather prediction and response, the stabilization of complex financial markets, and facial recognition. Hadoop data processing is central to all of these evolving technologies, with incredible implications.

Works Cited

Bean, R. (2014, January 27). Financial Services Companies Firms See Results from Big Data Push. Retrieved September 1, 2015, from Wall Street Journal: http://blogs.wsj.com/cio/2014/01/27/financial-services-companies-firms-see-results-from-big-data-push

Evelson, B. (2016). Brief Reasons To (Or Not To) Modernize BI Platforms With Hadoop. Forrester Research.

Fichera, R. (2014). Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture. Forrester Research.

Hopkins, B. (2014). Hadoop Ecosystem Overview, Q4 2014. Forrester Research.

Hortonworks. (2014). Hadoop Accelerates Earnings Growth in Banking and Insurance. Hortonworks.

solution, A. T. (2015). Is Apache Spark Going to Replace Hadoop? Website.