Stream Meets Batch for Smarter Analytics
WHITE PAPER
Abstract
This white paper focuses on dealing with Big Data problems in real time. It discusses how the traditional batch paradigm and the real-time paradigm can work together to deliver smarter, quicker, and better insights on large volumes of data. It also covers additions to existing solutions that handle low-latency use cases, and guides you in picking the right strategy and technology stack to address real-time Big Data analytics problems.
Impetus Technologies, Inc.
www.impetus.com
Table of Contents
Introduction
The archive data analytics platform
    Emerging use cases of batch processing
    Downsides of batch processing
Live data analytics platform
The stream processing system
    Benefits of stream processing
    Interesting use cases of live data analytics
Integration of archive data and live data analysis
Smarter data ingestion
Adaptive analysis
Case study: Auto categorize news articles
Summary
Introduction
Evolution in digital media and technologies has led to exponential growth in the volume of data produced by mankind. The data has grown to exabytes and is expanding daily. As digital technology touches every aspect of our lives, data is being generated from social media posts, e-mails, digital pictures, online videos, sensor data for climate information, GPS signals, cell phone data, browsing data, transactional data of online shoppers, and more. We categorize such data as Big Data.
Enterprises are identifying smarter ways to extract valuable information from this data. This information can be used to predict market trends, optimize business processes, create effective campaigns, and improve user services. Analysis of the data falls into two broad classes: real-time and historic. Together, both kinds of analysis provide a 360-degree view of valuable information.
A significant amount of work has been done in the area of historic, or batch, analytics, with the evolution of solutions built on the Hadoop platform or similar platforms such as R Analytics and HPCC. However, enterprises lag behind in real-time analysis of Big Data, and even more so in combining the two. The real challenge is dealing with historic information to find insights, and then using those insights effectively with real-time data.
This paper primarily focuses on Big Data real-time processing strategies that enable existing platforms to handle low-latency use cases, empowering businesses to gain quick insights and, in turn, maximize Return on Investment (ROI).
The Archive Data Analytics Platform
Batch, or archive, data processing is the most widely used approach for analyzing big volumes of data. In batch processing, the data is aggregated into a single entity called a batch, or a job. The biggest caveat of batching is that it does not give you partial results of the analysis; for results, you have to wait until the batch processing is done. Batch analysis is best suited for deeper analysis that requires a full view of the data. Consider an e-commerce website that needs to recommend products matching each user's taste in order to maximize sales; this requires a pass over the complete purchase history, as sketched below.
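To make the trade-off concrete, here is a minimal Python sketch of such a batch job, assuming a hypothetical CSV file of (order_id, product) purchase records: the job must read the entire dataset before it can emit a single recommendation, which is exactly what makes batch analysis deep but high latency.

```python
import csv
from collections import defaultdict
from itertools import combinations

def batch_recommendations(path):
    """One batch job: read the *entire* purchase history before emitting
    anything; no partial results are available mid-run."""
    baskets = defaultdict(set)            # order_id -> products bought together
    with open(path, newline="") as f:
        for order_id, product in csv.reader(f):
            baskets[order_id].add(product)

    co_counts = defaultdict(int)          # (product_a, product_b) -> frequency
    for products in baskets.values():
        for a, b in combinations(sorted(products), 2):
            co_counts[(a, b)] += 1

    # "Customers who bought A also bought B", ranked by co-occurrence
    return sorted(co_counts.items(), key=lambda kv: -kv[1])

# Hypothetical usage over a CSV of (order_id, product) rows:
# for pair, count in batch_recommendations("purchases.csv")[:10]:
#     print(pair, count)
```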
Emerging use cases of Batch processing
• Deeper analytics
• Classification of data
• Clustering of data
• Recommendations on user tastes
Downsides of Batch processing
• High latency results
• Bigger hardware requirements
• Limited ad-hoc capabilities
Live Data Analytics Platform
Time is the key. Analytics solutions for domains like defense, credit card fraud
detection, intelligence, law enforcement, online trading and security need to
quickly analyze, identify, and react to threat patterns by continuously processing the enormous amounts of data generated from network logs, e-mails, social media feeds, sensor data, web feeds, and many other sources. For such applications, a timely response is the key to the business; high-latency information is of no use.
Enterprises need a revolutionary upgrade in their capabilities to extract,
transform, analyze and quickly respond to the huge volume of data coming in
real time. Today, many enterprises are struggling to manage and analyze
massive and growing volumes of data in real time.
Lately, a few technologies and tools have emerged to meet the challenges of analyzing high volumes of data in real time or near real time. This section discusses a few of the existing approaches and their downsides:
1. Relational database management systems: RDBMSs have been available for years for OLTP as well as data warehouse classes of applications, but they do not scale or perform well for high-volume streaming data because of indexing limitations.
2. Main memory databases: Modified versions of DBMSs target the same set of functionalities as traditional DBMSs but with higher throughput, by keeping data in main memory instead of on disk. Like traditional systems, they also fall short of Big Data requirements.
3. Rule engines: Sales and marketing teams have 'repurposed' rule engines for real-time Big Data applications. Their downside is the lack of a suitable storage system, and hence the need for separate infrastructure for data persistence.
The Stream Processing System
The stream processing system is a completely new paradigm, well suited for handling continuous data. It offers higher scalability, performance, and flexibility than traditional approaches. Conventional systems run ad hoc queries over stored, static data, whereas a stream processing system runs static, standing queries over continuous, unbounded data. A stream processing system is to continuous unbounded data what a DBMS is to structured stored data.
The stream processing platform consists of three major components: input connectors, ETL components, and output connectors. Various types of incoming data from multiple sources are pulled into the platform through the input connectors. In the next stage, the data is cleansed, filtered, transformed, clustered, classified, or correlated, and the resulting information is used for notifications, reporting, and analysis. A toy version of this pipeline is sketched below.
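As an illustration only, the three components can be mimicked with Python generators; the event shape and field names here are invented for the sketch, and a real deployment would run on a dedicated stream processing engine.

```python
import json

def input_connector(source):
    """Input connector: pull raw events from a source (here, JSON lines)."""
    for line in source:
        yield json.loads(line)

def etl(events):
    """ETL stage: cleanse, filter, and transform each event as it arrives."""
    for event in events:
        if "user" not in event:                        # cleanse: drop malformed records
            continue
        event["user"] = event["user"].strip().lower()  # normalize the field
        yield event

def output_connector(events):
    """Output connector: push results to notification or reporting sinks."""
    for event in events:
        tag = "ALERT" if event.get("priority") == "high" else "LOG"
        print(tag, event)

# Data flows through the three stages continuously, one event at a time.
raw = ['{"user": " Alice ", "priority": "high"}', '{"malformed": true}']
output_connector(etl(input_connector(raw)))
```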
Benefits of stream processing
• Online accumulation
• Real-time analytics
• Live BI capabilities
• Smart ingestion into data warehouse (details in next section)
Interesting use cases of live data analytics
Fraud detection – Millions of real-time credit card transactions can be analyzed with predictive algorithms to detect and prevent fraud; a small sketch follows these examples. Text in insurance claim documents can also be analyzed to identify probable fraud cases.
Patient health monitoring – An analytical solution can capture streams of data coming from medical equipment that monitors a patient's heart rate, blood pressure, sugar levels, and temperature, and predict whether an infection or complication may occur.
Omni-channel retail – Data from various independent sources can be analyzed to enhance the shoppers' experience through product recommendations, customized campaigns, and location-based offerings.
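A minimal sketch of the fraud detection idea, assuming a purely hypothetical velocity rule (a card transacting more than five times inside a sliding sixty-second window); a production system would feed many such features into a predictive model.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_TXNS_PER_WINDOW = 5            # hypothetical velocity threshold

recent = defaultdict(deque)        # card_id -> recent transaction timestamps

def looks_fraudulent(card_id, ts):
    """Flag a card transacting unusually often within a sliding time window;
    evaluated the moment each transaction arrives on the stream."""
    window = recent[card_id]
    window.append(ts)
    while ts - window[0] > WINDOW_SECONDS:   # evict timestamps outside window
        window.popleft()
    return len(window) > MAX_TXNS_PER_WINDOW

# Six transactions on the same card within ten seconds -> the sixth is flagged
print([looks_fraudulent("card-42", t) for t in range(0, 12, 2)])
# [False, False, False, False, False, True]
```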
Integration of Archive Data and Live Data Analysis
Both archive data analysis and live data analysis can handle their own class of
use cases, and they complement each other. At times, enterprises require close
integration of both platforms to get a full 360-degree view of the
This section focuses on the benefits of integrating these two classes of
platforms.
Smarter Data Ingestion
Recently, an interesting trend has been observed in Big Data repositories: much of the data stored in a data warehouse is of little or no business use and will never appear in a business report. This has been called the 'Big Data fetish' problem. To overcome it, it is essential to identify what should be stored, and to store only what is relevant to the business.
Streaming systems can be used to address the Big Data fetish problem. Data coming from various sources can be cleansed, extracted, transformed, filtered, and normalized in the streaming system. The processed data can then be persisted in the data warehouse for deeper analytics. This approach reduces the overall cost of data storage by a significant amount.
An example can be seen in e-mail or SMS processing use cases such as Lawyers.com, where significant storage optimization can be achieved by identifying spam and corrupted messages before dumping them into the data store. This can be achieved in the stream, as sketched below.
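A minimal sketch of stream-side ingestion filtering, with invented spam rules and an in-memory stand-in for the data warehouse sink; only messages that survive cleansing are persisted, which is what shrinks the storage footprint.

```python
import re

SPAM_PATTERNS = [re.compile(p, re.I) for p in (r"win a prize", r"free \$\$\$")]

def is_spam(message: str) -> bool:
    """Hypothetical rule set; a real deployment would use a trained filter."""
    return any(p.search(message) for p in SPAM_PATTERNS)

def is_corrupt(message: str) -> bool:
    return not message or not message.isprintable()

def ingest(stream, persist):
    """Smart ingestion: cleanse and filter in the stream, persist only what
    is useful for later batch analytics."""
    kept = dropped = 0
    for message in stream:
        if is_corrupt(message) or is_spam(message):
            dropped += 1
            continue
        persist(message.strip())       # e.g. append to the data warehouse
        kept += 1
    return kept, dropped

# Usage with an in-memory 'warehouse' list as the persistence sink:
warehouse = []
print(ingest(["Hello", "WIN A PRIZE now", ""], warehouse.append))  # -> (1, 2)
```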
Adaptive Analysis
Live data analytics platforms and archive data analytics platforms can exchange data between them. They can also be used smartly to exchange or share intelligence, which improves the effectiveness, accuracy, and quality of analysis on both sides. Exchange of intelligence can be achieved in two ways, as illustrated by the sketch after this list:
1. Archive-to-live exchange: Deeper analytics algorithms, such as recommendation, classification, clustering, statistical, and pattern-finding algorithms, are applied over huge volumes of data accumulated over long periods of time; for instance, generating a classification model over historic e-mails, or computing item-similarity models for recommendations. The generated model can later be used by a corresponding component in the stream to extract quick, real-time insights from continuous data; for instance, an incoming e-mail or document stream can be classified or categorized in real time. In this scenario, deeper analysis helps the live stream in its decision making.
2. Live-to-archive exchange: The stream validates unbounded incoming data using models generated by the batch processing platform. If conflicts go beyond a threshold level, the stream processing platform can signal the batch platform that it is time to update or rebuild the model. For instance, if we are categorizing incoming documents against the Wikipedia categorization model, and the percentage of documents falling into the default or unidentified category goes beyond a threshold level, the stream can signal the batch processing platform to rebuild the categorization model using a new set of documents. In this scenario, the archive platform is assisted by the live platform for better quality of analytics.
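The following sketch illustrates both exchanges, assuming scikit-learn and an invented two-document training corpus: the batch side trains a text classifier, the stream side scores each arriving document with it, and a high rate of low-confidence ('unknown') results triggers the rebuild signal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Archive-to-live: the batch platform trains a model over historic documents.
historic_docs = ["stocks rally on strong earnings",        # invented training data
                 "home team wins the championship game"]
labels = ["finance", "sports"]
model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(historic_docs, labels)

UNKNOWN_THRESHOLD = 0.6        # hypothetical confidence cut-off

def classify_stream(docs):
    """Live side: score each arriving document with the batch-built model;
    live-to-archive: too many 'unknown' results signal a model rebuild."""
    unknown = 0
    for doc in docs:
        proba = model.predict_proba([doc])[0]
        if proba.max() < UNKNOWN_THRESHOLD:
            label, unknown = "unknown", unknown + 1
        else:
            label = model.classes_[proba.argmax()]
        print(doc, "->", label)
    if unknown > 0.3 * len(docs):
        print("signal: batch platform should rebuild the categorization model")

classify_stream(["markets slide on inflation data",
                 "striker scores twice in the final"])
```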
Case Study: Auto Categorize News Articles
This section describes how the integration concepts explained above can be applied in a real-world use case. Consider the example of auto categorizing news article streams or feeds coming from different data sources. The flow of this use case is as follows:
Incoming news article streams and feeds are first cleansed and parsed to extract meaningful data, with garbage data thrown away. In the second stage, the extracted data is pushed into the batch processing platform for deeper analytics. At the same time, the stream processing platform categorizes the news articles using the model generated by the batch processing platform. If the percentage of articles falling into a default or unknown category crosses a threshold limit, the stream platform can trigger the batch platform to regenerate the model. Once the batch platform is done with model generation, the updated model is pushed to the stream platform, which resumes categorizing documents in real time.
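A compact sketch of this control loop follows; the `Model` class is a hypothetical stand-in for the batch-generated categorization model, and the threshold value is arbitrary.

```python
DEFAULT_RATE_LIMIT = 0.2          # hypothetical trigger threshold

class Model:
    """Stand-in for the categorization model generated by the batch platform."""
    def __init__(self, corpus):
        self.known = {w for doc in corpus for w in doc.lower().split()}
    def categorize(self, article):
        return "known" if set(article.lower().split()) & self.known else "default"

def run_pipeline(feed, historic_corpus):
    model = Model(historic_corpus)          # initial batch-generated model
    archive, processed, defaults = [], 0, 0
    for article in feed:
        article = article.strip()           # stage 1: cleanse and parse
        if not article:
            continue                        # garbage is thrown away
        processed += 1
        archive.append(article)             # stage 2: push to the batch platform
        if model.categorize(article) == "default":
            defaults += 1                   # article fell into the default category
        if defaults / processed > DEFAULT_RATE_LIMIT:
            model = Model(archive)          # stream asks batch to regenerate...
            processed = defaults = 0        # ...and resumes with the new model
    return model

run_pipeline(["Stocks rally on earnings", "  ", "Telescope photographs nebula"],
             ["stocks fall on weak earnings"])
```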