University of Notre Dame
Hadoop-Based Data Discovery
Team 5: Jaydeep Chakrabarty, Tom Torralbas, Brian Dondanville, Ben Ashkar, Dan Lash
MSBA 70750 Emerging Issues in Analytics
Professor: Don Kleinmuntz
11/01/2015
Table of Contents
Section 1. Executive Summary
Section 2. Introduction and Scope
Section 3. Current State
   Hadoop Core Components
   Hadoop 1.0 Limitations
Section 4. Future State
   YARN: A Hadoop 2.0 Evolution
      Advantages of YARN
      A Few Important Notes about YARN
   Introduction to the User-Friendly Face of Hadoop – Apache Spark
   Key Aspects of Spark
   Difference between Hadoop and Spark
   Emerging Technologies: Hadoop by 2020
Section 5. Description of Proposed Applications
   5.1. Overview of Operational Scenarios
      Data Agnostic
      Open Source
      Price
   5.2. Organizational Impacts
   5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)
Section 6. Summary of Impacts
   Retail Banking Fraud
   Insurance Data Processing
   Capital Markets and Investments
Section 7. Analysis of the Proposed System
Works Cited
Section 1. Executive Summary
Hadoop has emerged as a powerful game changer in how we manage and process data.
As Hadoop continues to evolve, it will be a significant contributor in the ability of data scientists
to unlock the trapped insights in big data. Big data discovery is already starting to reveal
significant insights in fields such as pharmaceuticals, finance, crime prevention, security,
insurance, and banking. As Hadoop becomes universally integrated and distributed, it will
undoubtedly drive continued breakthroughs in many data-intensive industries.
Hadoop is market disruptive. It drives down the cost of managing big data because it is
data and hardware agnostic. Hadoop can work with all types of data formats (storing them in
HDFS) and across most types of hardware. Additionally, Hadoop significantly increases the
speed, access, and manageability of large data sets, which lets users leverage data more
efficiently and supports near-real-time computations.
As the ability to capture, store, and process data continues to grow exponentially, our
ability to draw insight from this data will be critical. Hadoop is at the forefront and will continue
to unlock insights that we can leverage for better decision making across all industries.
Section 2. Introduction and Scope
In our analysis, we will discuss key technologies and components of Hadoop to lay a
foundation of core functions. This will pave the path for discussions about business uses, but
does not dive deeply into the technology. For further reading, we recommend the following
books, which may be found on Amazon:
1. Hadoop: The Definitive Guide by Tom White
2. Hadoop Cluster Deployment by Danil Zburivsky
The bulk of our research came from the books above and Forrester Research.
Unfortunately for our research, many companies are reluctant to discuss in detail how they use
Hadoop or its impacts for fear of losing a competitive advantage. As well, vendors continue to
sensationalize its impacts and functions, and we wanted to remain objective. Much of our
research therefore relies on studies conducted anonymously by Forrester Research and
discussed at a high enough level to protect IP and company identity.
A key component of harnessing the power of big data is being creative in how technology
can be used for a business's unique set of challenges. The examples given will help illustrate what's
possible, but given the flexibility of Hadoop and its open source nature, we know that there are
several use cases yet to be covered. We hope that the reader walks away from this research with
a fair understanding of how Hadoop works and is inspired to use the technology within their own
organization.
Section 3. Current State
Apache Hadoop was inspired by Google's MapReduce and Google File System papers, ideas
that were further developed at Yahoo!. It started as a large-scale distributed batch processing
infrastructure and was later refined into an affordable, scalable, and flexible platform for
working with very large data sets. At its very core, Hadoop is a data file system with an
ecosystem of processing tools. (Hopkins, 2014)
Hadoop Core Components
The figure below shows the core components of Hadoop.
Starting from the bottom of the diagram, let's explore the ecosystem. The ecosystem is made up
of technologies that provide improved capabilities in data access, governance, and analysis,
allowing Hadoop's core capabilities to integrate with a broad range of analytic solutions:
HDFS
A foundational component of the
Hadoop ecosystem is the Hadoop
Distributed File System (HDFS).
HDFS is the mechanism by which a
large amount of data can be
distributed over a cluster of
computers, and data is written once,
but read many times for processing.
HDFS provides the foundation for
other tools, such as HBase.
Map Reduce
Hadoop's main execution framework
is MapReduce, a programming model
for distributed, parallel data
processing, breaking jobs into
mapping phases and reduce phases
(thus the name). Developers write
MapReduce jobs for Hadoop, using
data stored in HDFS for fast data
access. Because of the way
MapReduce works, Hadoop
processes the data in a parallel
fashion, producing results faster.
HBase
A column-oriented NoSQL database
built on top of HDFS, HBase is used
for fast read/write access to large
amounts of data. HBase uses
Zookeeper for its management to
ensure that all of its components are
up and running.
Zookeeper
Zookeeper is Hadoop's distributed
coordination service. Designed to run
over a cluster of machines, it is a
highly available service used for the
management of Hadoop operations,
and many components of Hadoop
depend on it.
Pig
An abstraction over the complexity of
MapReduce programming, the Pig
platform includes an execution
environment and a scripting language
(Pig Latin) used to analyze Hadoop
data sets. Its compiler translates Pig
Latin into sequences of MapReduce
programs.
Hive
An SQL-like, high-level language used
to run queries on data stored in
Hadoop. Hive enables developers not
familiar with MapReduce to write data
queries that are translated into
MapReduce jobs in Hadoop.
Mahout
This is a machine-learning and data-
mining library that provides
MapReduce implementations for
popular algorithms used for
clustering, regression, and
statistical modeling.
Ambari
This is a project aimed at simplifying
Hadoop management by providing
support for provisioning, managing,
and monitoring Hadoop clusters.
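To make the MapReduce model described above concrete, here is a minimal word-count job
written for Hadoop Streaming, which lets any program that reads stdin and writes stdout serve
as a mapper or reducer. This is an illustrative sketch rather than an example from the report's
sources; file names and paths are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each word. Hadoop sorts mapper output
# by key before the reduce phase, so all lines for one word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Such a job would be submitted with the streaming JAR that ships with Hadoop, along the lines
of `hadoop jar hadoop-streaming.jar -input /logs -output /counts -mapper mapper.py -reducer
reducer.py` (paths hypothetical), with HDFS supplying the input splits and collecting the output.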
In the early days, big data processing was computing power intensive, requiring extensive
processing resources, storage, and parallelism. This meant that organizations had to spend a
considerable amount of money to build the infrastructure needed to support big data analytics.
Given the large price tag, only the largest Fortune 500 organizations could afford such an
infrastructure. And even with the large price tag, these traditional systems were slow and
difficult to navigate. Processing time was a significant hindrance in the ability to translate big
data into meaningful insights. (MapR, 2015)
Now, let’s look at some of Hadoop’s core functions in more detail. In Hadoop 1.0, there
was a tight coupling between cluster resource management and the MapReduce programming
model: the JobTracker, which is part of the MapReduce framework, also handled resource
management.
HDFS’s technical functions are based on the Google File System (GFS). Its implementation
addresses a number of problems that are present in other distributed file systems such as
Network File System (NFS). Specifically, the implementation of HDFS is able to store a very large
amount of data (terabytes or petabytes). HDFS is designed to spread data across a large number
of machines, and support much larger file sizes compared to distributed file systems such as NFS.
To store data reliably, and cope with malfunctioning or the loss of individual machines in
a cluster, HDFS uses data replication. HDFS supports only a limited set of operations on files —
writes, deletes, appends, and reads, but not updates. It assumes that the data will be written to
the HDFS once, and then read multiple times. HDFS is implemented as a block-structured file
system. As shown in the figure below, individual files are broken into blocks of a fixed size, which
are stored across a Hadoop cluster. A file can be made up of several blocks, and stored on
different DataNodes (individual machines in the cluster) which are chosen randomly on a block-
by-block basis. As a result, access to a file usually requires access to multiple DataNodes, which
means that HDFS supports file sizes far larger than a single disk in a server could hold.
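As a back-of-the-envelope illustration of this block model (not HDFS source code), the sketch
below splits a file into fixed-size blocks and assigns each block to a set of DataNodes. Real HDFS
placement is rack-aware rather than purely random; the 128 MB block size and replication
factor of 3 are common defaults, and the node names are hypothetical.

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # each block is stored on 3 DataNodes
datanodes = ["dn%02d" % i for i in range(1, 11)]  # hypothetical 10-node cluster

def place_blocks(file_size_bytes):
    """Return a block -> DataNodes map for a file of the given size."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # HDFS chooses replicas rack-aware; random choice is a simplification.
        placement[block_id] = random.sample(datanodes, REPLICATION)
    return placement

# A 1 GB file becomes 8 blocks, each replicated on 3 of the 10 nodes.
for block, nodes in place_blocks(1024 * 1024 * 1024).items():
    print("block %d -> %s" % (block, ", ".join(nodes)))
```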
One of the requirements for such a block-structured file system is the capability to store,
manage, and access file metadata (information about files and blocks) reliably, and to provide
fast access to the metadata store. Unlike HDFS files themselves (which are accessed in a write-
once and read-many model), the metadata structures can be modified by a large number of
clients concurrently. It is important that this information never gets out of sync. HDFS solves this
problem by introducing a dedicated server, called the NameNode, which stores all the metadata
for the file system across the cluster. As mentioned, the implementation of HDFS is based on a
master/slave architecture. On one hand, this approach greatly simplifies the overall HDFS
architecture. On the other, it also creates a single point of failure: losing the NameNode
effectively means losing all HDFS data. To mitigate this problem, Hadoop introduced
a Secondary NameNode.
In the MapReduce framework, a MapReduce job (a MapReduce application) is divided
into a number of tasks called mappers and reducers. Each task runs on one of the servers
(DataNodes) of the cluster, and each server has a limited number of predefined slots (map slots,
reduce slots) for running tasks concurrently. (Grover, 2015)
The JobTracker is responsible for both managing the cluster's resources and driving the
execution of the MapReduce job. It reserves and schedules slots for all tasks, configures, runs
and monitors each task, and if a task fails, allocates a new slot and reattempts the task. After a
task finishes, the job tracker cleans up temporary resources and releases the task's slot to make
it available for other jobs.
Hadoop 1.0 Limitations:
In this section we will present some of the limitations of the initial infrastructure. While
Hadoop 1.0 laid the foundation for a powerful data platform, subsequent infrastructure
optimizations paved the way for its success.
1. Scalability: JobTracker limits scalability by using a single server to handle the following tasks:
o Resource management
o Job and task scheduling
o Monitoring
Although there are many servers (DataNodes) available, the single JobTracker server limits
scalability once it is fully utilized.
2. Availability: In Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails,
all jobs must restart, bringing down the entire system.
3. Resource Utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots
and reduce slots. Resources become constrained when the map slots are full while the reduce
slots sit empty (or vice versa): server resources (DataNodes) reserved for reduce slots can sit
idle even when there is an immediate need for them as map slots. The sketch after this list
illustrates the problem.
4. Non-MapReduce Applications: In Hadoop 1.0, the JobTracker was tightly integrated with
MapReduce and only supported applications that obey the MapReduce programming framework,
limiting Hadoop's ability to run other kinds of applications. (solution, 2015)
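The following toy calculation (not real Hadoop code) illustrates the fixed-slot problem from
limitation 3: a TaskTracker with a map backlog cannot borrow its idle reduce slots.

```python
# Toy illustration of Hadoop 1.0's fixed-slot scheduling, with made-up numbers:
# a TaskTracker with 4 map slots and 4 reduce slots cannot lend idle reduce
# slots to a backlog of map tasks.
MAP_SLOTS, REDUCE_SLOTS = 4, 4
pending_map_tasks, pending_reduce_tasks = 10, 0

running_maps = min(pending_map_tasks, MAP_SLOTS)
running_reduces = min(pending_reduce_tasks, REDUCE_SLOTS)
idle_reduce_slots = REDUCE_SLOTS - running_reduces

print("map tasks waiting: %d" % (pending_map_tasks - running_maps))   # 6
print("reduce slots sitting idle: %d" % idle_reduce_slots)            # 4
```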
Section 4. Future State
YARN: A Hadoop 2.0 Evolution
The first generation of Hadoop provided affordable scalability and a flexible data
structure, but it was really only the first step in the journey. Its batch-oriented job processing and
consolidated resource management were limitations that drove the development of Yet Another
Resource Negotiator (YARN). YARN essentially became the architectural center of Hadoop, since
it allowed multiple data processing engines to handle data stored in one platform.
Advantages of YARN:
1. YARN manages resource utilization efficiently. There are no more fixed map and reduce slots;
YARN provides a central resource manager, so multiple applications can now run in Hadoop, all
sharing a common pool of resources.
2. YARN can even run applications that do not follow the MapReduce model. YARN decouples
MapReduce's resource management and scheduling capabilities from the data processing
component, enabling Hadoop to support more varied processing approaches and a broader array
of applications. For example, Hadoop clusters can now run interactive querying and streaming
data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce
to do what it does best: process data.
A Few Important Notes about YARN:
1. YARN is backward compatible. This means that existing MapReduce jobs can run on Hadoop 2.0
without any changes.
2. The JobTracker and TaskTracker are no longer needed in Hadoop 2.0. YARN splits the
JobTracker's two major functions, resource management and job scheduling/monitoring, into
two separate daemons (components):
o Resource Manager
o Node Manager (node specific)
(Readwrite, 2015)
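As a small, hedged illustration of that shared resource pool, the sketch below queries a YARN
ResourceManager's REST API, which reports cluster-wide metrics and the running applications of
every type. The host name is an assumption; 8088 is only the default ResourceManager web port.

```python
# Query a YARN ResourceManager's REST API for cluster-wide resource usage.
import requests

RM = "http://resourcemanager.example.com:8088"  # hypothetical host

metrics = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("apps running:", metrics["appsRunning"])
print("memory in use: %d of %d MB" % (metrics["allocatedMB"], metrics["totalMB"]))

# Applications of any type (MapReduce, Spark, etc.) appear in one list, since
# YARN schedules them all against the same cluster resources.
apps = requests.get(RM + "/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["applicationType"], app["name"])
```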
Introduction to the User-Friendly Face of Hadoop – Apache Spark
Spark is a fast cluster computing system that originated in UC Berkeley's AMPLab and was
developed with contributions from nearly 250 developers at 50 companies. Spark was created to
make data analytics faster, and to make analytics jobs both easier to write and faster to run.
Apache Spark is open source and available for free download, making it a user-friendly
face of the distributed big data programming framework. Spark follows a general execution
model that supports in-memory computing and optimization of arbitrary operator graphs, so
querying data becomes much faster than with disk-based engines like MapReduce.
Key Aspects of Spark
1. Speed
o Runs up to 100x faster than Hadoop MapReduce in memory.
o Provides in-memory computation for faster data processing than MapReduce.
o Data can be kept in main memory for fast lookup.
2. Runs Everywhere
o Spark runs on Hadoop, on Mesos, standalone, or in the cloud.
o It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
3. Ease of Use
o Write applications quickly in Java, Scala, Python, and R.
o Prebuilt machine learning with MLlib for classification, regression, clustering, chi-square
tests, and correlation.
o Offers over 80 high-level operators that make it easy to build parallel apps.
4. SQL Query Support
o Spark supports SQL queries, streaming data, and complex analytics out of the box.
o These capabilities can be combined into a single workflow, as the sketch below shows.
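As a brief sketch of what this looks like in practice, the snippet below uses the Spark 1.x-era
Python API (matching this report's timeframe) to cache a data set in memory and query it with
SQL in a single workflow. The HDFS path and schema are hypothetical.

```python
# Minimal PySpark sketch: read once from HDFS, keep the parsed data in
# memory, then query it with SQL. Path and column names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-demo")
sqlContext = SQLContext(sc)

# cache() keeps the DataFrame in memory, so repeated queries skip disk I/O.
events = sqlContext.read.json("hdfs:///data/events").cache()
events.registerTempTable("events")

top_pages = sqlContext.sql(
    "SELECT page, COUNT(*) AS visits FROM events "
    "GROUP BY page ORDER BY visits DESC LIMIT 10")
top_pages.show()
```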
[Figure: the Spark stack. Applications written in Java, Scala, or Python run on Spark through
components such as Spark SQL and Spark Streaming, drawing on data sources including HDFS,
HBase, Flume, Kafka, Twitter, and custom feeds.]
Difference between Hadoop and Spark
Hadoop is a parallel data processing framework that has traditionally been used to run
map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark was
designed to run on top of Hadoop as an alternative to the traditional batch map/reduce
model, one that supports real-time stream processing and fast interactive queries that
finish within seconds. Hadoop supports both traditional map/reduce and Spark. (Readwrite,
2015)

Spark
o Because Spark uses RAM rather than network and disk I/O, it is relatively fast compared to
Hadoop.
o Spark uses resilient distributed datasets (RDDs), a clever way of guaranteeing fault tolerance
that minimizes network I/O.
o Machine learning and data mining algorithms are built-in components of Spark.

Hadoop
o Hadoop stores data on disk and uses replication to achieve fault tolerance.
o Hadoop is relatively slow because it uses disk I/O and the network to read and write data.
o Hadoop needs a separate Apache tool, Mahout, for data mining.
Emerging Technologies: Hadoop by 2020
Hadoop Will Be Used for Over 10 Percent of Data Processing and Storage
According to the 2014 State of Database Technology Survey, 13 percent of respondents were
already using Hadoop in production or in pilots, indicating that adoption is on the upswing. By
2020, Hadoop will be used across the enterprise.
Hadoop Will Lead in Infrastructure Spending
Hadoop has the potential to completely reshape the IT infrastructure of many companies. The
technology may be playing a larger role on the IT road map than in the enterprise right now, but
this will flip in the future. By 2020, most enterprises will have IT strategies that leverage Hadoop,
and it will be their greatest infrastructure investment.
Hadoop Will Be Used for Critical Day-to-Day Operations
As Hadoop is used more and the capabilities of YARN become fully realized, more useful
opportunities leveraging technology like Apache Spark and Storm will emerge and quickly
increase its potential. Even now, real-time/operational analytics are the fastest moving part of
the Hadoop ecosystem, and by 2020, Hadoop will be relied on for day-to-day enterprise
operations.
Hadoop Will Advance the Internet of Things
The Internet of things is only possible with instant data processing and prescriptive analytics. As
more things enter the data ecosystem, the burden of processing will become greater and legacy
technology will not be able to keep up. Hadoop will be able to, and by 2020, it will be a mission-
critical foundation for many businesses tied to the Internet of things.
Hadoop Will Be Used for Processing and Storing Highly Sensitive Data
The lack of built-in security in Hadoop is an obstacle that enterprises face today, but new tools
are emerging that address this issue, connecting Kerberos and MapReduce components and
ensuring compatibility between data. By 2020, expect these issues to be resolved and highly
regulated organizations to be managing their secure data with Hadoop. (kdnuggets, 2015)
Section 5. Description of Proposed Applications
Now that we have explained some of the core functions of Hadoop, let’s take a look at
some of the business use cases. Hadoop's flexibility, cheap scalability, and ability to process
large amounts of data quickly make it well suited for analysis projects that were previously not
possible due to hardware limitations. According to a Forrester study, 36% of companies with
between 5,000 and 19,000 employees had already implemented or were planning to implement
a Hadoop-based solution. (Hopkins, 2014) Consider the following scenarios:
Scenario 1:
An e-commerce website receives millions of visits a day, with each customer generating 4 to 10
times that much data while traversing the site. Individual customers can be identified if they have
logged in previously, which allows the retailer to connect their paths to historical purchase
behavior. Customers may also have been touched by digital ads before visiting the site, perhaps
by 3 or 4 different messages. Capturing all of this data over months or a year would allow the
business to better understand which pages and ads lead to better conversion at a customer level.
In a traditional data warehouse, this data would be difficult to collect, connect, and analyze.
Hadoop, however, can capture these touch points as they occur and process them, giving the
business a real-time view of its customers' behavior.
(Fichera, 2014)
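To make this concrete, here is a hypothetical sketch of how such touch points could be counted
as they arrive, using Spark Streaming on a Hadoop cluster. The socket source, host name, and
the "customer_id,page" record format are all illustrative assumptions, not details from the
cited study.

```python
# Hypothetical sketch of Scenario 1: counting page views per customer in
# near real time with Spark Streaming. Host and record format are assumed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="clickstream")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

clicks = ssc.socketTextStream("collector.example.com", 9999)
pairs = clicks.map(lambda line: (line.split(",")[0], 1))   # (customer_id, 1)
views_per_customer = pairs.reduceByKey(lambda a, b: a + b)
views_per_customer.pprint()   # in practice, joined to purchase history

ssc.start()
ssc.awaitTermination()
```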
Scenario 2:
A financial services company is processing millions of transactions a minute; this velocity and
volume of data make Hadoop a perfect fit. Traditional extract, transform, and load (ETL)
environments are expensive to purchase from commercial vendors and grow as data sets expand.
Traditional ETL server environments require significant investments and time to scale. Using
Hadoop makes hardware a commodity that can easily be interchanged and expanded. The value
of Hadoop in this scenario is growing the data capacity quickly and cheaply, allowing all data to
live within a single environment. (Fichera, 2014)
Scenario 3:
Consider a manufacturing plant that has placed sensors on its assembly line. The sensors monitor
everything from the products to the machines performing the work, collecting thousands of data
points each minute. Leveraging Hadoop, the manufacturer is able to collect and analyze the data
in real time with a cheap and vast server farm. Having storage as a commodity allows it to
perform detailed analysis and predict when problems will occur with the machines or with
products. By making a small investment in hardware and building in-house expertise to manage
its Hadoop stack, the company keeps costs down by proactively fixing problems with its
assembly line. (Fichera, 2014)
The above examples illustrate the power of Hadoop; many of these use cases could
previously only be dreamed about. Performing at an enterprise level, this ecosystem of tools gives
companies willing to invest a distinct advantage over their competitors. Data that previously
could not be joined because it was unstructured, or could not be retained for long because of its
size, can now be handled by Hadoop. Unlocking the power of big data allows businesses to
perform data discovery like never before.
The table below, created by Forrester Research, further illustrates business needs and the
translated Hadoop ‘evolution’.
(Hopkins, 2014)
5.1. Overview of Operational Scenarios
Our research indicates that Hadoop has three distinct qualities that make it highly
desirable for data discovery in operational scenarios.
Data Agnostic
Due to its open source environment, Hadoop is highly customizable, and the variety of
prewritten software modules maintained by the community means that someone has
probably already thought of a way to incorporate your data type into the environment.
Hadoop is simply a means for processing data, and it can take in all types of data structures
that previously could not be joined. This quality makes it incredibly attractive as companies
capture unstructured data from a variety of sources. (Fichera, 2014)
Open Source
Originally a source of weakness due to slow development and limited support, the fact
that Hadoop is open to all means that it has evolved many capabilities. A growing
ecosystem keeps expanding Hadoop's functionality to meet new demands, and now that it
has become more mainstream, a growing number of support services are being sold by
vendors. An ever-growing list of functionality combined with professional support options
makes Hadoop enterprise ready. (Fichera, 2014)
Price
Hadoop is freely available for anyone to install and can be implemented by a growing
number of vendors on commodity hardware. This gives companies the option either to grow
expertise in-house or to farm the work out to a vendor, without paying licensing fees for
the software. (Fichera, 2014)
5.2. Organizational Impacts
The time needed to process data is shrinking from months or weeks to hours and
minutes; a WSJ survey found a time savings of 100:1, which translates into a competitive
advantage for financial services companies in particular. Among financial services companies,
the survey also found the following estimates of savings (note that every company is different;
the numbers below are directional):
1. Analyzing risk data decreased from 3 months to 3 hours.
2. Pricing calculations that once took 48 hours now take 20 minutes.
3. Behavioral analytics that took 72 hours now take 20 minutes.
4. Modeling automation grew from 150 models per year to 15,000.
5. An operational data store was built for $300,000 with Hadoop instead of $4M using a
traditional relational database.
(Bean, 2014)
While the processing numbers are impressive, the larger impact is on the business. The
low maintenance costs of Hadoop remove traditional barriers to data, such as long execution
times from IT. We're not pointing fingers at IT, but we all know of expensive projects that take
substantial amounts of time to execute, and longer still to start providing value. Giving
business stakeholders more access and fewer barriers lets them self-serve, quickly finding
what is important and executing on crucial business decisions. (Bean, 2014)
To further illustrate the cost savings of Hadoop, the WSJ survey found a 50:1 ratio for
traditional server costs versus a Hadoop server, thanks to inexpensive hardware; the table
below breaks down the cost of competing products.
(Bean, 2014)
Overall we firmly believe that Hadoop opens the field for what’s possible in using data to
guide business decisions. It does this quickly and cheaply which results in major impacts to the
business.
5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)
The Hadoop ecosystem continues to grow at a rapid pace. Below is an overview of the
current state of Hadoop, which maps to the table in section 5.1.
(Hopkins, 2014)
We can see that there are often multiple potential solution paths to a single business problem.
Section 6. Summary of Impacts
Some organizations are using Hadoop to create a competitive advantage over their
peers, so many are tight-lipped about what they're doing and the impact it has had on the
organization. That said, drawing on our online research and customer interviews, we will walk
through three scenarios in which Hadoop has created value for organizations.
Retail Banking Fraud
Problem:
Banking fraud involving applicant misrepresentation on new accounts is a common
problem, and can even involve internal record manipulation. Third-party services can
help screen applicants up front, but unfortunately some fraudulent applicants still receive
accounts. Once given an account, these applicants overdraw it in the hope that the balance will
be charged off, resulting in lost revenue for the banks. It is possible, however, to detect common
precursors to fraudulent activity and shut the accounts down before major charges occur. Doing
so requires a rapid response to massive amounts of data, which is where Hadoop can help.
Solution:
A Hadoop cluster was used to process vast amounts of financial transactions and, with the help
of smart algorithms, allowed the bank to detect potentially fraudulent activity. Detecting this
activity resulted in fewer write-offs, and in customers being placed into special high-risk
programs that helped them manage their finances. (Hortonworks, 2014)
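As an illustration only (the bank's actual algorithms are not described in the source), the sketch
below shows one plausible precursor check on HDFS-resident transaction data: flagging young
accounts whose withdrawals far outstrip their deposits. Paths, column names, and the threshold
are all hypothetical.

```python
# Hypothetical fraud-precursor screen over transactions stored in HDFS,
# using the Spark 1.x-era DataFrame API. Schema and threshold are assumed.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="fraud-screen")
sqlContext = SQLContext(sc)

txns = sqlContext.read.parquet("hdfs:///bank/transactions")

# Aggregate each new account's first-30-day deposits vs. withdrawals.
agg = (txns.filter(txns.days_since_open <= 30)
           .groupBy("account_id")
           .agg(F.sum(F.when(txns.type == "deposit", txns.amount)
                       .otherwise(0)).alias("deposits"),
                F.sum(F.when(txns.type == "withdraw", txns.amount)
                       .otherwise(0)).alias("withdrawals")))

# Flag accounts withdrawing more than 3x what they deposited (toy threshold).
flagged = agg.filter(agg.withdrawals > 3 * agg.deposits)
flagged.show()
```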
Insurance Data Processing
Many insurers now reward customers for good driving behavior. Customers will place tracking
devices within their vehicles and allow the company to collect geo-location data, which lets the
insurer assess their driving habits. This produces a large volume of data across many customers,
which becomes a processing nightmare in traditional data environments. In this example, 75%
of the data had to be discarded after being collected, reducing visibility into long-term trends
for each customer. The data that was processed also had to be sampled, introducing errors into
the final results.
Solution:
Apache Hadoop was deployed, which allowed the insurer to store data far more economically.
The larger data set allowed the insurer to make far better pricing decisions, because it could
keep data for longer and process it with ease. These two advancements allowed data scientists
to train better models and deploy more advanced processing algorithms. (Hortonworks, 2014)
Capital Markets and Investments
A well-known provider of financial market data collected a massive feed of 50GB of server log
data each day, which would then be queried at a rate of 35,000 times per second. The feed was
typically used to process recent data from the past year, though thirty percent of queries were
for older data. Unfortunately, their servers could only hold 10 years' worth of data, which led to
less informed decisions; and since the servers were aging and built on older infrastructure, they
were barely able to keep up with their 12-millisecond SLA.
Solution:
Older infrastructure was replaced with cheaper, modern Hadoop servers. The new servers
allowed for affordable long-term storage, and using Apache HBase lowered response latency.
The infrastructure refresh led to better performance at a lower cost.
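As a hedged sketch of the HBase access pattern (the provider's actual schema is not public), the
snippet below stores and reads time-keyed market ticks through happybase, a Python client for
HBase's Thrift gateway. Host, table, and column names are hypothetical, and an HBase Thrift
server must be running.

```python
# Hypothetical low-latency lookup pattern: recent market ticks keyed by
# "symbol:timestamp" in HBase, accessed via the happybase Thrift client.
import happybase

conn = happybase.Connection("hbase.example.com")   # hypothetical Thrift gateway
ticks = conn.table("market_ticks")                 # hypothetical table

# Write one tick. HBase stores raw bytes; the row key encodes symbol plus
# timestamp so a scan over one symbol's time range stays cheap.
ticks.put(b"IBM:20151101093000", {b"q:price": b"140.25", b"q:size": b"300"})

# Point read: single-row lookups like this are what keep latency low.
print(ticks.row(b"IBM:20151101093000"))

# Range scan: all IBM ticks for one trading day.
for key, data in ticks.scan(row_start=b"IBM:20151101", row_stop=b"IBM:20151102"):
    print(key, data[b"q:price"])
```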
The above three scenarios illustrate the power of Hadoop in real-life settings. To
summarize our research findings: a Hadoop implementation stands to save an organization
money and lets it act more quickly on its data. It does this by being flexible in its integrations
and hardware requirements. The ability to mimic traditional enterprise data warehouse (EDW)
environments, for instance an SQL database, makes it a no-brainer for organizations that want
to save money in the long run and have the flexibility to embrace emerging technologies, all
while not paying the standard licensing fees of traditional systems. (Hortonworks, 2014)
Section 7. Analysis of the Proposed System
Throughout our analysis we have praised Hadoop and its ecosystem of products. We will
now take a more critical look at the proposed systems and provide an evaluative summary.
Organizational, cultural, and data governance problems are not solved simply by switching
to Hadoop. A business that fails to embrace advanced analytics and typically relies on its data
simply for reporting will not overcome that hurdle by gaining a new platform. Before transitioning
to Hadoop, the organization should think critically about the business problems it is attempting
to solve and then evaluate how a new platform would help support those goals. For instance,
moving to a predictive sales model and leveraging Hadoop is a smart move, but the groundwork
of understanding what makes a good model, and what the data requirements are, is the first
step; moving to a new platform is not. Without the necessary groundwork, a transition is likely to fail to
deliver on its entire value proposition. (Evelson, 2016)
While it may seem obvious, some organizations may think that by moving to a new
platform, they’ll change stakeholders’ opinions about how to leverage data and end long
standing political divides. Disconnects between IT and business stakeholders are infamous and
are a real detriment to the business. However, an organization that fails to see the value of data
prior to the switch will more than likely still have these problems post-implementation. A clear
example of the alternative is Tesla's use of big data to help inform its vehicle production. Being
data driven from day one allowed the company to embrace emerging technologies, and having a
culture built around optimization and data has been a key to its success. This competitive
advantage has left long-standing companies like Ford in the dust for product development. The
best practices around leading and maintaining a data-driven culture still apply and cannot be
replaced simply by a new platform. (Evelson, 2016)
Organizations that have struggled with data governance may exacerbate their problems
by switching to a new platform. If the correct operations are not put in place in a traditional data
environment for ensuring data quality, this leaves little hope that they’ll be solved once there’s
more data. A finely tuned analytics department takes discipline and cannot be accomplished
simply by switching to a new platform. (Evelson, 2016)
Well-built existing reporting and data environments are something to be respected and
admired. CIOs may be in a rush to transition to Hadoop, but if an existing system is providing
substantial value, a slower transition may be more appropriate. New access to data can also be
a sensitive transition for many companies: with more access to data, or new data in general,
will come more requests and new questions. CIOs should consider how they will handle these
requests prior to a planned transition. Internal resources tend to be constrained, so increasing
them is typically recommended. The flexibility of Hadoop to meet emerging needs means BI
departments need to be able to accommodate these requests and be nimble in their execution.
For instance, the ability to integrate a number of new data sources won't add much value for
the organization without someone to do the work. Given the cost savings in software and
infrastructure, a CIO should consider using that money to increase head count. (Evelson, 2016)
For most companies, implementing Hadoop will not be optional; it will be a requirement
to compete. Organizations that can think of new and creative business solutions leveraging big
data will win, and we see Hadoop as a flexible platform that can support those initiatives. As
Hadoop continues to evolve, it is able to tap into the best and brightest technologies. Significant
leaps forward are visible today in computer-assisted steering (and driving), fraud prevention
calls and mobile alerts, severe weather prediction and response, stabilization of complex
financial markets, and facial recognition. Hadoop data processing is central to all of these
evolving technologies, with incredible implications.
Works Cited
Bean, R. (2014, January 27). Financial Services Companies Firms See Results from Big Data
Push. Wall Street Journal. Retrieved September 1, 2015, from
http://blogs.wsj.com/cio/2014/01/27/financial-services-companies-firms-see-results-from-big-data-push
Evelson, B. (2016). Brief: Reasons To (Or Not To) Modernize BI Platforms With Hadoop.
Forrester Research.
Fichera, R. (2014). Building The Foundation For Customer Insight: Hadoop Infrastructure
Architecture. Forrester Research.
Hopkins, B. (2014). Hadoop Ecosystem Overview, Q4 2014. Forrester Research.
Hortonworks. (2014). Hadoop Accelerates Earnings Growth in Banking and Insurance.
Hortonworks.
solution, A. T. (2015). Is Apache Spark Going to Replace Hadoop? Website.

More Related Content

What's hot

Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |EdurekaHadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |EdurekaEdureka!
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction EMC
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753pradip patel
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopIRJET Journal
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)JonySaini2
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performanceijcsa
 
Büyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri BilimiBüyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri BilimiAnkara Big Data Meetup
 
XA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within HadoopXA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within Hadoopbalajiganesan03
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopNushrat
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1Donghan Kim
 
Whitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest MindsWhitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest MindsHappiest Minds Technologies
 
Improving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoopImproving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoopeSAT Journals
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dipayan Dev
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"Anita de Waard
 

What's hot (20)

Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |EdurekaHadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
Hadoop Training For Beginners | Hadoop Tutorial | Big Data Training |Edureka
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction   White Paper: Hadoop in Life Sciences — An Introduction
White Paper: Hadoop in Life Sciences — An Introduction
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753
 
Survey Paper on Big Data and Hadoop
Survey Paper on Big Data and HadoopSurvey Paper on Big Data and Hadoop
Survey Paper on Big Data and Hadoop
 
Hadoop framework thesis (3)
Hadoop framework thesis (3)Hadoop framework thesis (3)
Hadoop framework thesis (3)
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
 
Büyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri BilimiBüyük Veri, Hadoop Ekosistemi ve Veri Bilimi
Büyük Veri, Hadoop Ekosistemi ve Veri Bilimi
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 
XA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within HadoopXA Secure | Whitepaper on data security within Hadoop
XA Secure | Whitepaper on data security within Hadoop
 
Finding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache HadoopFinding URL pattern with MapReduce and Apache Hadoop
Finding URL pattern with MapReduce and Apache Hadoop
 
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
제3회 사내기술세미나-hadoop(배포용)-dh kim-2014-10-1
 
Whitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest MindsWhitepaper: Extract value from Facebook Data - Happiest Minds
Whitepaper: Extract value from Facebook Data - Happiest Minds
 
Improving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoopImproving performance of apriori algorithm using hadoop
Improving performance of apriori algorithm using hadoop
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"Some Ideas on Making Research Data: "It's the Metadata, stupid!"
Some Ideas on Making Research Data: "It's the Metadata, stupid!"
 

Similar to Hadoop Based Data Discovery

Hadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapterHadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapterShiva Achari
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in DelhiAPTRON
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxlorainedeserre
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxBHANU281672
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947CMR WORLD TECH
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesJyrki Määttä
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfDIVYA370851
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 

Similar to Hadoop Based Data Discovery (20)

HDFS
HDFSHDFS
HDFS
 
Hadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapterHadoop essentials by shiva achari - sample chapter
Hadoop essentials by shiva achari - sample chapter
 
paper
paperpaper
paper
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Hadoop Training in Delhi
Hadoop Training in DelhiHadoop Training in Delhi
Hadoop Training in Delhi
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947Non geeks-big-data-playbook-106947
Non geeks-big-data-playbook-106947
 
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best PracticesNon-geek's big data playbook - Hadoop & EDW - SAS Best Practices
Non-geek's big data playbook - Hadoop & EDW - SAS Best Practices
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Big data abstract
Big data abstractBig data abstract
Big data abstract
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Actian DataFlow Whitepaper
Actian DataFlow WhitepaperActian DataFlow Whitepaper
Actian DataFlow Whitepaper
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Big data
Big dataBig data
Big data
 

Hadoop Based Data Discovery

  • 1. University of Notre Dame Hadoop-Based Data Discovery Team 5: Jaydeep Chakrabarty, Tom Torralbas, Brian Dondanville, Ben Ashkar, Dan Lash MSBA 70750 Emerging Issues in Analytics Professor: Don Kleinmuntz 11/01/2015
  • 2. 2 Table of Contents Section 1. Executive Summary....................................................................................................... 3 Section 2. Introduction and Scope................................................................................................. 3 Section 3. Current State................................................................................................................. 4 Hadoop Core Components..............................................................................................................4 Hadoop 1.0 Limitations:.................................................................................................................7 Section 4. Future State................................................................................................................... 8 YARN a Hadoop 2.0 Evolution.........................................................................................................8 Advantages of YARN:..................................................................................................................9 Few Important Notes about YARN:..............................................................................................9 Introduction to the User Friendly Face of Hadoop – Apache Spark...................................................9 Key Aspects of the Spark................................................................................................................9 Difference between Hadoop & Spark............................................................................................11 Emerging Technologies: Hadoop by 202........................................................................................11 Section 5. Description of Proposed Applications........................................................................ 12 5.1. Overview of Operational Scenarios........................................................................................13 Data Agnostic...........................................................................................................................14 Open Source ............................................................................................................................14 Price........................................................................................................................................14 5.2. Organizational Impacts..........................................................................................................14 5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)................................15 6. Summary of Impacts ................................................................................................................ 16 Retail Banking Fraud.................................................................................................................16 Insurance Data Processing.........................................................................................................17 Capital Markets and Investments...............................................................................................17 7. Analysis of the Proposed System............................................................................................. 18 Works Cited .................................................................................................................................. 19
  • 3. 3 Section 1. Executive Summary Hadoop has emerged as a powerful game changer in how we manage and process data. As Hadoop continues to evolve, it will be a significant contributor in the ability of data scientists to unlock the trapped insights in big data. Big data discovery is already starting to reveal significant insights in fields such as pharmaceuticals, finance, crime prevention, security, insurance, and banking. As Hadoop continues to evolve and become universally integrated and distributed, this will undoubtedly lead to continued breakthroughs in many data intensive industries. Hadoop is market disruptive. It drives down the cost of managing big data because it is data and hardware agnostic. Hadoop can work with all types of data formats (converting them to HDFS), and can work across most types of hardware. Additionally, Hadoop significantly increases the speed, access, and management of large data sets which allows users to leverage data sets more efficiently, allowing for near real time computations. As the ability to capture, store, and process data continues to grow exponentially, our ability to draw insight from this data will be critical. Hadoop is at the forefront and will continue to unlock insights that we can leverage for better decision making across all industries. Section 2. Introduction and Scope In our analysis, we will discuss key technologies and components of Hadoop to lay a foundation of core functions. This will pave the path for discussions about business uses, but does not dive deeply into the technology. For further reading, we recommend the following books, which may be found on Amazon: 1. Hadoop: The Definitive Guide by Tom White 2. Hadoop Cluster Deployment by Danil Zburivsky The bulk of our research came from the books above and Forester Research. Unfortunately for our research, many companies are reluctant to discuss in detail how they use Hadoop or its impacts for fear of loosing a competitive advantage. As well, vendors continue to sensationalize its impacts and functions, and we wanted to remain objective. Much of our research relies on existing research that was anonymously conducted by Forrester Research and discussed at a high enough level to protect IP and company identity. A key component of harnessing the power of big data is being creative in how technology can be used for a business’unique set of challenges.Theexamples givenwillhelp illustratewhat’s
  • 4. 4 possible, but given the flexibility of Hadoop and its open source nature, we know that there are several use cases yet to be covered. We hope that the reader walks away from this research with a fair understanding of how Hadoop works and is inspired to use the technology within their own organization. Section 3. Current State Apache Hadoop was inspired by Google’s MapReduce and Google File System papers that were further developed at Yahoo!. It started as a large-scale distributed batch processing infrastructure, and was later refined to meet the needs of an affordable, scalable and flexible data structure, which could be used for working with very largedata sets. At its very core, Hadoop is a data file system with an ecosystem of processing tools. (Hopkins, 2014) Hadoop Core Components Figure below shows the core components of Hadoop. Starting from the bottom of the diagram, lets explore the ecosystem. The ecosystem is made of technologies that provide improved capabilities in data access, governance, and analysis. This allows Hadoop’s core capabilities, to integrate with a broad range of analytic solutions: HDFS A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers, and data is written once, but read many times for processing. HDFS provides the foundation for other tools, such as HBase. Map Reduce Hadoop's main execution framework is MapReduce, a programming model for distributed, parallel data processing, breaking jobs into mapping phases and reduce phases (thus the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the nature of how MapReduce works, Hadoop processes the data in a parallel fashion, resulting in faster results.
In the early days, big data processing was computing-power intensive, requiring extensive processing resources, storage, and parallelism. This meant that organizations had to spend a considerable amount of money to build the infrastructure needed to support big data analytics. Given the large price tag, only the largest Fortune 500 organizations could afford such an infrastructure. And even with the large price tag, these traditional systems were slow and difficult to navigate. Processing time was a significant hindrance to the ability to translate big data into meaningful insights. (MapR, 2015)

HBase
A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure that all of its components are up and running.

Zookeeper
Zookeeper is Hadoop's distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.

Pig
An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs.

Hive
An SQL-like, high-level language used to run queries on data stored in Hadoop. Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop.

Mahout
A machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression testing, and statistical modeling.

Ambari
A project aimed at simplifying Hadoop management by providing support for provisioning, managing, and monitoring Hadoop clusters.
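As one example from the component list above, HBase's fast row-level access is visible in just a few lines of Python. This sketch uses the third-party happybase client, which talks to HBase through its Thrift gateway; the host name, table, and column family are illustrative assumptions.

    # Fast row-level read/write against HBase from Python via happybase.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # assumed Thrift server
    table = connection.table("customer_events")             # assumed table name

    # Write a cell under the "profile" column family...
    table.put(b"customer-42", {b"profile:last_page": b"/checkout"})

    # ...and read the row back by key with low latency.
    row = table.row(b"customer-42")
    print(row[b"profile:last_page"])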
Now, let's look at some of Hadoop's core functions in more detail. In Hadoop 1.0, cluster resource management was tightly coupled to the MapReduce programming model: the JobTracker, which manages the cluster's resources, is itself part of the MapReduce framework.

HDFS's technical functions are based on the Google File System (GFS). Its implementation addresses a number of problems that are present in other distributed file systems such as the Network File System (NFS). Specifically, HDFS is able to store a very large amount of data (terabytes or petabytes). It is designed to spread data across a large number of machines and to support much larger file sizes than distributed file systems such as NFS. To store data reliably, and to cope with the malfunction or loss of individual machines in a cluster, HDFS uses data replication.

HDFS supports only a limited set of operations on files: writes, deletes, appends, and reads, but not updates. It assumes that data will be written to HDFS once, and then read multiple times. HDFS is implemented as a block-structured file system. As shown in the figure below, individual files are broken into blocks of a fixed size, which are stored across a Hadoop cluster. A file can be made up of several blocks, which are stored on different DataNodes (individual machines in the cluster) chosen randomly on a block-by-block basis. As a result, access to a file usually requires access to multiple DataNodes, which means that HDFS supports file sizes far larger than a single disk in a server could hold.

One of the requirements for such a block-structured file system is the capability to store, manage, and access file metadata (information about files and blocks) reliably, and to provide fast access to the metadata store. Unlike HDFS files themselves (which are accessed in a write-once, read-many model), the metadata structures can be modified by a large number of clients concurrently, and it is important that this information never gets out of sync. HDFS solves this problem with a dedicated server, called the NameNode, which stores all the metadata for the file system across the cluster. As mentioned, the implementation of HDFS is based on a master/slave architecture. On one hand, this approach greatly simplifies the overall HDFS architecture. On the other, it also creates a single point of failure: losing the NameNode effectively means losing all HDFS data. To mitigate this problem, Hadoop implemented a Secondary NameNode.
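These mechanics are visible in everyday use. The sketch below drives the standard hdfs dfs shell commands from Python; the file paths are illustrative, and the hdfs fsck call prints the block layout and DataNode locations that the NameNode has recorded for the file.

    # Illustrative HDFS interaction via the standard shell commands.
    import subprocess

    # Copy a local file into the cluster; HDFS splits it into fixed-size
    # blocks and replicates each block across DataNodes.
    subprocess.run(["hdfs", "dfs", "-put", "clickstream.log",
                    "/data/clickstream.log"], check=True)

    # Read it back, as often as needed, from any client. Note there is no
    # "update" operation: the model is write-once, read-many.
    subprocess.run(["hdfs", "dfs", "-cat", "/data/clickstream.log"], check=True)

    # Ask the NameNode how the file is laid out: which blocks, on which nodes.
    subprocess.run(["hdfs", "fsck", "/data/clickstream.log",
                    "-files", "-blocks", "-locations"], check=True)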
In the MapReduce framework, a MapReduce job (a MapReduce application) is divided into a number of tasks called mappers and reducers. Each task runs on one of the servers (DataNodes) of the cluster, and each server has a limited number of predefined slots (map slots and reduce slots) for running tasks concurrently. (Grover, 2015)

The JobTracker is responsible for both managing the cluster's resources and driving the execution of the MapReduce job. It reserves and schedules slots for all tasks; configures, runs, and monitors each task; and, if a task fails, allocates a new slot and reattempts the task. After a task finishes, the JobTracker cleans up temporary resources and releases the task's slot to make it available for other jobs.

Hadoop 1.0 Limitations:

In this section we present some of the limitations of the initial infrastructure. While 1.0 laid the foundation for a powerful data platform, subsequent infrastructure optimizations paved the way for its success.

1. Scalability: The JobTracker limits scalability by using a single server to handle the following tasks:
o Resource management
o Job and task scheduling
o Monitoring
Although there are many servers (DataNodes) available, this single server limits scalability once it is fully utilized.
2. Availability: In Hadoop 1.0, the JobTracker is a single point of failure. If the JobTracker fails, all jobs must restart, bringing down the entire system.
3. Resource Utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots and reduce slots. Resources become constrained when the map slots are full while the reduce slots sit empty (and vice versa): server resources (DataNodes) reserved for reduce slots can sit idle even when there is an immediate need for them as map slots.
4. Non-MapReduce Applications: In Hadoop 1.0, the JobTracker was tightly integrated with MapReduce and only supported applications that obey the MapReduce programming framework, limiting its ability to integrate with other applications. (solution, 2015)

Section 4. Future State

YARN: A Hadoop 2.0 Evolution

The first generation of Hadoop provided affordable scalability and a flexible data structure, but it was really only the first step in the journey. Its batch-oriented job processing and consolidated resource management were limitations that drove the development of Yet Another Resource Negotiator (YARN). YARN essentially became the architectural center of Hadoop, since it allows multiple data processing engines to handle data stored in one platform.
Advantages of YARN:

1. YARN efficiently manages resource utilization. There are no more fixed map and reduce slots; YARN provides a central resource manager, so multiple applications can run in Hadoop, all sharing a common pool of resources (see the sketch after these notes).
2. YARN can even run applications that do not follow the MapReduce model. YARN decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce to do what it does best: process data.

A Few Important Notes about YARN:

1. YARN is backward compatible. Existing MapReduce jobs can run on Hadoop 2.0 without any change.
2. The JobTracker and TaskTracker are no longer needed in Hadoop 2.0; they have disappeared entirely. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into two separate daemons (components):
o Resource Manager
o Node Manager (node specific) (Readwrite, 2015)
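As a sketch of what a shared resource pool means in practice, the ResourceManager in Hadoop 2.x exposes a REST API that reports cluster capacity and every running application, whatever its type. The host, port, and unauthenticated access below are illustrative assumptions.

    # Observe YARN's shared resources via the ResourceManager REST API.
    import requests

    RM = "http://resourcemanager-host:8088"  # assumed ResourceManager address

    # Cluster-wide metrics: memory available vs. total, across all applications.
    metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
    print(metrics["availableMB"], "MB free of", metrics["totalMB"])

    # Running applications; note they need not be MapReduce jobs.
    apps = requests.get(f"{RM}/ws/v1/cluster/apps",
                        params={"states": "RUNNING"}).json()
    for app in (apps.get("apps") or {}).get("app", []):
        print(app["name"], app["applicationType"])  # e.g. MAPREDUCE, SPARK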
Introduction to the User-Friendly Face of Hadoop – Apache Spark

Spark is a fast cluster computing system developed at UC Berkeley's AMPLab with contributions from nearly 250 developers across 50 companies. Spark was created to make data analytics both easier to write and faster to run. Apache Spark is open source and available for free download, making it a user-friendly face of the distributed big data programming framework. Spark follows a general execution model that supports in-memory computing and the optimization of arbitrary operator graphs, so querying data becomes much faster than on disk-based engines like MapReduce.

Key Aspects of Spark

1. Speed
o Runs up to 100x faster than Hadoop MapReduce in memory.
o Provides in-memory computation for increased speed and data processing over MapReduce.
o Information is maintained in main memory for fast lookup.
2. Runs Everywhere
o Spark runs on Hadoop, Mesos, standalone, or in the cloud.
o It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
3. Ease of Use
o Write applications quickly in Java, Scala, Python, and R.
o Prebuilt machine learning with MLlib for classification, regression, clustering, chi-square tests, and correlation.
o Offers over 80 high-level operators that make it easy to build parallel apps.
4. SQL Query Support
o Spark supports SQL queries, streaming data, and complex analytics out of the box.
o These capabilities can be combined into a single workflow.

[Figure: the Spark stack. Java, Scala, and Python APIs sit over Spark core, Spark SQL, and Spark Streaming, drawing on sources such as HDFS, HBase, Flume, Kafka, Twitter, and custom feeds.]
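The in-memory and SQL aspects above are easiest to see in a few lines of PySpark. The file path and column name are illustrative assumptions; the local[*] master simply runs Spark on the current machine for experimentation.

    # Cache a data set in memory, then query it with plain SQL.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("spark-demo")
             .getOrCreate())

    # Load once and cache: repeated queries avoid re-reading from disk.
    events = spark.read.json("events.json").cache()
    events.createOrReplaceTempView("events")

    # Spark SQL runs this on the same optimized execution engine.
    spark.sql("""
        SELECT page, COUNT(*) AS visits
        FROM events
        GROUP BY page
        ORDER BY visits DESC
        LIMIT 10
    """).show()

    spark.stop()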
Difference between Hadoop & Spark

Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs: long-running jobs that take minutes or hours to complete. Spark was designed to run on top of Hadoop, and it is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and for fast interactive queries that finish within seconds. Hadoop supports both traditional map/reduce and Spark. (Readwrite, 2015)

Spark
o Because Spark uses RAM rather than network and disk I/O, it is relatively fast compared to Hadoop.
o Spark uses resilient distributed datasets (RDDs), a clever way of guaranteeing fault tolerance that minimizes network I/O.
o Machine learning and data mining algorithms ship as a component of Spark (MLlib).

Hadoop
o Hadoop stores data on disk and uses replication to achieve fault tolerance.
o Hadoop is relatively slow because it uses disk I/O and the network to read and write data.
o Hadoop needs a separate Apache tool, Mahout, for data mining.

Emerging Technologies: Hadoop by 2020

Hadoop Will Be Used for Over 10 Percent of Data Processing and Storage
According to the 2014 State of Database Technology Survey, 13 percent of respondents are already using Hadoop in production or pilots, indicating that it is on the upswing. By 2020, Hadoop will be used across the enterprise.

Hadoop Will Lead in Infrastructure Spending
Hadoop has the potential to completely reshape the IT infrastructure of many companies. The technology may be playing a larger role on the IT road map than in the enterprise right now, but this will flip in the future. By 2020, most enterprises will have IT strategies that leverage Hadoop, and it will be their greatest infrastructure investment.

Hadoop Will Be Used for Critical Day-to-Day Operations
As Hadoop is used more and the capabilities of YARN become fully realized, more useful opportunities leveraging technologies like Apache Spark and Storm will emerge and quickly increase its potential. Even now, real-time/operational analytics are the fastest moving part of the Hadoop ecosystem, and by 2020, Hadoop will be relied on for day-to-day enterprise operations.

Hadoop Will Advance the Internet of Things
The Internet of Things is only possible with instant data processing and prescriptive analytics. As more things enter the data ecosystem, the burden of processing will become greater, and legacy technology will not be able to keep up. Hadoop will, and by 2020 it will be a mission-critical foundation for many businesses tied to the Internet of Things.

Hadoop Will Be Used for Processing and Storing Highly Sensitive Data
The lack of built-in security in Hadoop is an obstacle that enterprises face today, but new tools are emerging that address this issue, connecting Kerberos and MapReduce components and ensuring compatibility between data. By 2020, expect these issues to be resolved and highly regulated organizations to be managing their secure data with Hadoop. (kdnuggets, 2015)

Section 5. Description of Proposed Applications

Now that we have explained some of the core functions of Hadoop, let's look at some of the business use cases. The flexibility of Hadoop, along with its ability to scale cheaply and to process large amounts of data quickly, makes it well suited for analysis projects that were previously not possible due to hardware limitations. According to a Forrester study, 36% of companies with 5,000 to 19,000 employees had already implemented, or were planning to implement, a Hadoop-based solution. (Hopkins, 2014) Consider the following scenarios:

Scenario 1:
An e-commerce website receives millions of visits a day, with each customer generating 4-10x that amount of data as they traverse the site. Individual customers may be identified if they have logged in previously, which allows the retailer to connect their paths to historical purchase behavior. Customers may also have been touched by digital ads before visiting the site, perhaps by 3 or 4 different messages. Capturing all of this data over months or a year would allow the business to better understand which pages and ads lead to better conversion at a customer level. In a traditional data warehouse, this data would be difficult to collect, connect, and analyze. Hadoop, however, is able to capture these touch points in real time and process them, giving the business a real-time view of their customers' behavior. (Fichera, 2014)

Scenario 2:
A financial services company is processing millions of transactions a minute; the velocity and volume of the data make Hadoop a perfect fit. Traditional extract, transform, and load (ETL) environments are expensive to purchase from commercial vendors and grow as data sets expand, and traditional ETL server environments require significant investments and time to scale. Using Hadoop makes hardware a commodity that can easily be interchanged and expanded. The value of Hadoop in this scenario is growing the data capacity quickly and cheaply, allowing all data to live within a single environment. (Fichera, 2014)
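Scenarios like the first two hinge on processing events as they arrive rather than in nightly batches. Below is a minimal sketch of that pattern using Spark Streaming's micro-batch (DStream) API; the socket source, host, and port are illustrative stand-ins for a production feed such as Kafka or Flume.

    # Count page views in 5-second micro-batches as clickstream events arrive.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "clickstream")   # local mode for illustration
    ssc = StreamingContext(sc, 5)                  # 5-second micro-batches

    # Each incoming line is assumed to be one page-view event (a page URL).
    lines = ssc.socketTextStream("event-host", 9999)
    page_counts = (lines.map(lambda page: (page, 1))
                        .reduceByKey(lambda a, b: a + b))
    page_counts.pprint()                           # print each batch's counts

    ssc.start()
    ssc.awaitTermination()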
Scenario 3:
Consider a manufacturing plant that has placed sensors on its assembly line. The sensors monitor everything from the products to the machines performing the work, collecting thousands of data points each minute. Leveraging Hadoop, the manufacturer is able to collect and analyze the data in real time with a cheap and vast server farm. Having storage as a commodity allows them to perform detailed analysis and predict when problems will occur with the machines or with the products. By making a small investment in hardware and then building in-house expertise to manage their Hadoop stack, the company keeps costs down by proactively fixing problems on the assembly line. (Fichera, 2014)

The above examples illustrate the power of Hadoop; many of these use cases were only dreamed about previously. Performing at an enterprise level, this ecosystem of tools gives companies willing to make the investment a distinct advantage over their competitors. Hadoop handles data that previously could not be joined because it was unstructured, or could not be retained for long because of its size. Unlocking the power of big data allows businesses to perform data discovery like never before. The table below, created by Forrester Research, further illustrates business needs and the corresponding Hadoop 'evolution'. (Hopkins, 2014)

5.1. Overview of Operational Scenarios
Our research indicates that Hadoop has three distinct qualities that make it highly desirable for data discovery in operational scenarios.

Data Agnostic
Because of its open source environment, Hadoop is highly customizable, and the variety of prewritten software modules maintained by the community means that someone has probably already thought of a way to incorporate your data type into the environment. Hadoop is simply a means for processing data, and it can take all types of data structures that previously could not be joined. This quality makes it incredibly attractive as companies capture unstructured data from a variety of sources. (Fichera, 2014)

Open Source
Open source was originally a source of weakness, due to slow development and limited support, but the fact that Hadoop is open to all means that it has evolved many capabilities. There is a growing ecosystem expanding Hadoop's functionality to meet new demands, and now that it has become more mainstream, a growing number of support services are being sold by vendors. An ever-growing list of functionality combined with professional support options makes Hadoop enterprise ready. (Fichera, 2014)

Price
Hadoop is freely available for anyone to install and can be implemented by a growing number of vendors on commodity hardware. This gives companies the option to either grow expertise in-house or farm the work out to a vendor, without paying licensing fees for the software. (Fichera, 2014)

5.2. Organizational Impacts

According to a WSJ survey, data processing time is being scaled down from months or weeks to hours and minutes, a time savings of 100:1. This time savings results in a competitive advantage for financial services companies in particular. Among financial services companies, the survey also found the following estimates of savings; it should be noted that each company is different and the numbers below are directional:

1. Analyzing risk data decreased from 3 months to 3 hours.
2. Pricing calculations that once took 48 hours now take 20 minutes.
3. Behavioral analytics that took 72 hours now take 20 minutes.
4. Modeling automation grew from 150 models per year to 15,000.
5. An operational data store was built for $300,000 with Hadoop instead of $4M using a traditional relational database.
(Bean, 2014)

While the processing numbers are impressive, the larger impact is on the business. The low maintenance costs of Hadoop remove traditional barriers to data, for instance the time it takes IT to execute. We're not pointing fingers at IT, but we all know of expensive projects that take substantial amounts of time to execute, and take longer still to start providing value. Giving business stakeholders more access and fewer barriers allows them to self-serve, quickly finding what is important and executing on crucial business decisions. (Bean, 2014)

To further illustrate the cost savings of Hadoop, the WSJ survey found a 50:1 ratio for traditional server costs versus a Hadoop server. This is due to inexpensive hardware; the table below is a breakdown of the cost of competing products. (Bean, 2014)

Overall, we firmly believe that Hadoop opens the field for what's possible in using data to guide business decisions. It does this quickly and cheaply, which results in major impacts to the business.

5.3. Strategic Sourcing/Vendor Management (Hadoop Ecosystem Overview)

The Hadoop ecosystem continues to grow at a rapid pace. Below is an overview of the current state of Hadoop, which matches the table in Section 5.1.
(Hopkins, 2014)

We can see that there are often multiple potential solution paths to a single business problem.

6. Summary of Impacts

Some organizations are using Hadoop to create a competitive advantage among their peers, so many are tight-lipped about what they're doing and the impact it has had on the organization. That said, from our online research and customer interviews, we will walk through three scenarios in which Hadoop has created value for organizations.

Retail Banking Fraud

Problem:
Banking fraud involving applicant misrepresentation on new accounts is a common problem, and it can even be caused by internal record manipulation. Third-party services can help screen applicants up front, but, unfortunately, some still receive accounts. Once given an account, these applicants will overdraw it in hopes of the balance being charged off, resulting in lost revenue for the banks. It is possible, however, to detect common precursors to fraudulent activity and shut the accounts down before major charges occur. This requires a rapid response to massive amounts of data, which is where Hadoop can help.
Solution:
A Hadoop cluster was used to process vast amounts of financial transactions and, with the help of smart algorithms, allowed the bank to detect potential fraudulent activity. Detecting this activity resulted in fewer write-offs, and customers could be placed into special high-risk programs that helped them manage their finances. (Hortonworks, 2014)

Insurance Data Processing

Problem:
Many insurers now reward customers for good driving behavior. Customers place tracking devices in their vehicles and allow the company to collect geolocation data, which lets it assess how good their driving habits are. This results in a large volume of data being collected across many customers, which becomes a processing nightmare in traditional data environments. In this example, 75% of the data had to be discarded after being collected, reducing visibility into long-term trends for each customer. Moreover, the data that was processed had to be sampled, which introduced errors into the final results.

Solution:
Apache Hadoop was deployed, which allowed the insurer to store data far more economically. The larger data set allowed the insurer to make far better decisions about whom to charge and how much, because it could keep data longer and process it with ease. These two advancements allowed data scientists to train better models and deploy more advanced processing algorithms. (Hortonworks, 2014)

Capital Markets and Investments

Problem:
A well-known provider of financial market data collected a massive feed of 50GB of server log data each day, which would then be queried at a rate of 35,000 times per second. This feed was typically used for processing recent data from within the past year, though thirty percent of queries were for older data. Unfortunately, their servers could only hold 10 years' worth of data, which led to less informed decisions; as well, since the servers were aging and based on older infrastructure, they were barely able to keep up with their 12-millisecond SLA.

Solution:
The older infrastructure was replaced with cheaper, modern Hadoop servers. The new servers allowed for affordable long-term storage, and using Apache HBase lowered query latency. The infrastructure refresh led to better performance at a cheaper rate.

The above three scenarios illustrate the power of Hadoop in real-life settings. To summarize the findings from our research: an implementation stands to save the organization money and allows it to act more quickly on its data. It does this by being flexible with integrations and hardware requirements. The ability to mimic traditional EDW environments, for instance an SQL database, makes it a no-brainer for organizations that want to save money in the long run and have the flexibility to embrace emerging technologies, all while not paying the standard licensing fees of traditional systems. (Hortonworks, 2014)
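To make the retail banking scenario a bit more concrete, here is a minimal sketch of rule-based precursor flagging as a Spark job over transactions stored in HDFS. The schema, threshold values, and paths are our own illustrative assumptions, not the bank's actual algorithm.

    # Flag accounts whose activity matches simple hypothetical fraud precursors.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("fraud-precursors").getOrCreate()

    # Assumed columns: account_id, amount (signed), ts (timestamp).
    txns = spark.read.parquet("/data/transactions")

    flagged = (txns.groupBy("account_id")
                   .agg(F.sum("amount").alias("net_flow"),
                        F.count("*").alias("txn_count"))
                   # Hypothetical precursor: heavy net overdrawing spread
                   # across an unusually large number of transactions.
                   .where((F.col("net_flow") < -1000) & (F.col("txn_count") > 20)))

    flagged.write.mode("overwrite").parquet("/data/accounts_to_review")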
7. Analysis of the Proposed System

Throughout our analysis we have praised Hadoop and its ecosystem of products. We will now take a more critical look at the proposed systems and provide an evaluative summary.

Organizational, cultural, and data governance problems are not solved simply by switching to Hadoop. A business that fails to embrace advanced analytics, and typically relies on its data simply for reporting, will not overcome that hurdle by gaining a new platform. Before transitioning to Hadoop, the organization should think critically about the business problems it is attempting to solve and then evaluate how a new platform would help support those goals. For instance, moving to a predictive sales model and leveraging Hadoop is a smart move, but the groundwork of understanding what makes a good model and what the data requirements are is the first step, not moving to a new platform. Without the necessary groundwork, a transition is likely to fail to deliver on its entire value proposition. (Evelson, 2016)

While it may seem obvious, some organizations may think that by moving to a new platform, they'll change stakeholders' opinions about how to leverage data and end long-standing political divides. Disconnects between IT and business stakeholders are infamous and a real detriment to the business. However, an organization that fails to see the value of data prior to the switch will more than likely still have these problems post-implementation. A clear example of the opposite is Tesla's use of big data to help inform their vehicle production. Being data driven from day one allowed the company to embrace emerging technologies, and a culture built around optimization and data has been a key to their success. This competitive advantage has left long-standing companies like Ford in the dust in product development. The best practices around leading and maintaining a data-driven culture still apply and cannot be replaced by a new platform. (Evelson, 2016)

Organizations that have struggled with data governance may exacerbate their problems by switching to a new platform. If the correct operations for ensuring data quality are not put in place in a traditional data environment, there is little hope that they will appear once there is more data. A finely tuned analytics department takes discipline, which cannot be achieved simply by switching platforms. (Evelson, 2016)

Well-built existing reporting and data environments are something to be respected and admired. CIOs may be in a rush to transition to Hadoop, but if an existing system is providing substantial value, a slower transition may be more appropriate. New access to data can also be a sensitive transition for many companies: with more access to data, or new data in general, will come more requests and new questions. CIOs should consider how they will handle these requests prior to a planned transition. Internal resources tend to be tightly constrained, so increasing them is typically recommended. The flexibility of Hadoop to meet emerging needs means BI departments need to be able to accommodate these requests and be nimble in their execution. For instance, having the ability to integrate a number of new data sources without
someone to do the integration won't add much value for the organization. Given the cost savings in software and infrastructure, a CIO should consider using that money to increase head count. (Evelson, 2016)

For most companies, implementing Hadoop will not be an option but a requirement to compete. Organizations that are able to think of new and creative business solutions leveraging big data will win, and we see Hadoop as a flexible platform that can support those initiatives. As Hadoop continues to evolve, it is able to tap into the best and brightest technologies. Significant leaps forward are visible today in computer-assisted steering (and driving), fraud prevention calls and mobile alerts, severe weather prediction and response, the stabilization of complex financial markets, and facial recognition. Hadoop data processing is central to all of these evolving technologies, with incredible implications.

Works Cited

Bean, R. (2014, January 27). Financial Services Companies Firms See Results from Big Data Push. Retrieved September 1, 2015, from Wall Street Journal: http://blogs.wsj.com/cio/2014/01/27/financial-services-companies-firms-see-results-from-big-data-push

Evelson, B. (2016). Brief Reasons To (Or Not To) Modernize BI Platforms With Hadoop. Forrester Research.

Fichera, R. (2014). Building The Foundation For Customer Insight: Hadoop Infrastructure Architecture. Forrester Research.

Hopkins, B. (2014). Hadoop Ecosystem Overview, Q4 2014. Forrester Research.

Hortonworks. (2014). Hadoop Accelerates Earnings Growth in Banking and Insurance. Hortonworks.

solution, A. T. (2015). Is Apache Spark Going to Replace Hadoop? Website.