Streaming Data and Stream Processing with Apache Kafka
1. ESSENTIAL ELEMENTS IN A REAL-TIME STREAMING ANALYTICS PLATFORM
OPEN SOURCE | LOW LATENCY | ELASTIC SCALING | PRE-BUILT OPERATORS |
DATA INTEGRATION | DATA VISUALIZATION | FUTURE-PROOF
A GUIDE TO WHAT TO LOOK FOR IN A HIGH PERFORMANCE
REAL-TIME STREAMING ANALYTICS (RTSA) PLATFORM
2. Introduction
The rise of the Internet of Things (IoT) and the exponential growth
of potentially valuable, fast-moving, real-time data are now
well documented. The continuous stream of data generated
by sensors, machines, vehicles, mobile phones, social media
networks, and other real-time sources is compelling
organizations to imagine what they could do with this data if they
could gain insight into it.
Here’s a quick look at where all the data is coming from and why
it’s growing so astronomically:
IoT sensors are everywhere
- We currently live in a world with over 7.2 billion active SIM
  cards, which means more mobile devices than there are people. [1]
- Gartner predicts that by 2020, 25 billion connected things will
  be in use. [2]
Social engagement is exploding
- About 500 million tweets are generated per day, or roughly
  200 billion per year. [3]
- The average life of a tweet is 18 minutes, so time is of the
  essence when capitalizing on sentiment or consumer behavior.
The world is increasingly going digital
- By 2019, global IP traffic will reach 2.0 zettabytes per year. [4]
- At least 80 percent of that data will be unstructured.
As the speed of business increases, and more information about
what people are doing, thinking, feeling, saying, and buying
becomes available in real time, applications that can produce
actionable insights are becoming the new imperative for
organizations to keep pace. The question is no longer whether an
organization needs to use this data but rather how quickly it can
begin to capitalize on the insights that are potentially
available.
This paper explores the top seven must-have features in a
Real-Time Streaming Analytics (RTSA) platform in order to help
you choose a platform that meets the needs of your organization.
1. Mobile devices now outnumber humans: Report. Aaron Mamiit, Tech Times | October 8,
   http://www.techtimes.com/articles/17431/20141008/mobile-devices-now-outnumber-humans-report.htm
2. Gartner Says 4.9 Billion Connected "Things" Will Be in Use in 2015. http://www.gartner.com/newsroom/id/2905717
3. http://www.internetlivestats.com/twitter-statistics/#trend
4. The Zettabyte Era: Trends and Analysis.
   http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/VNI_Hyperconnectivity_WP.html
3. What Must It Do?
Maximizing the potential of data means analyzing data in motion
as well as data at rest, from the moment it is created through to
after it is stored. An RTSA platform therefore needs to handle
both. This also explains the growing interest in approaches such
as the Lambda architecture, which integrates batch and streaming
data processing within a common architecture under a common
development, runtime, and administration paradigm.
A real-time streaming platform must also meet the needs of data
scientists, developers and data center operations teams without
requiring extensive custom code or brittle integration of many
third-party components.
Overall, a stream processing solution must address many
challenges:
- Process massive amounts of streaming events
- Perform and scale as data volumes increase in size and
  complexity
- Rapidly integrate with existing infrastructure and data sources
- Allow fast time-to-market for application development and
  deployment in the face of ever-changing requirements
- Provide developer support and agile development
- Allow live data discovery and monitoring, continuous query
  processing, and automated alerts and reactions
- Provide data visualization
- Be built on an architecture that allows flexibility and
  customization as requirements change and new technologies
  become available
Let’s take a closer look at some of the capabilities you must look
for in an RTSA platform.
4. Open Source
Innovators in the streaming analytics space recognize the need
for both new architectures and capabilities to support streaming
analytics, while incorporating open standards and community
innovation. The current trend in platform development is to look
for ways to leverage open source.
Why leverage Open Source?
Which is likely to be better: software created by a limited
number of developers, or software created by thousands of
developers around the globe? Open source is, of course, the
latter, and it allows countless developers and users to
contribute innovative new features and enhancements to the
code.
In general, open source software tends to get closer to what
users want because those users are also often the developers.
It's not a matter of the vendor deciding what users want but the
users and developers making it what they want.
Similarly, companies can customize open source software to suit
their needs, if they choose to.
Additionally, open source allows business users to free
themselves from the severe vendor lock-in that limits users of
proprietary packages. Vendor lock-in leaves users at the mercy
of the vendor's vision, requirements, dictates, prices, priorities
and timetable.
With open source, users are in control: they make their own
decisions and do what they want with the software. They also
have a worldwide community of developers and users at their
disposal for help with that.
However…
Most importantly, your RTSA platform needs to meet the needs
of your business. Therefore, we are not recommending that you
hire an army of platform-level developers to create your own
platform from open source code, but rather that you look for
one that is built on open source code.
5. Future-proof
Technology is, and always will be, a moving target, which is why
future proofing is important. Future proofing refers to the ability
of something to remain valuable well into the future, without
quickly becoming obsolete.
If you are ready to start exploring the insights available in
streaming data, but are worried about investing in new
technology that could rapidly be surpassed or become outdated,
future proofing is the promise you need in a platform.
The technological advances related to real-time streaming
analytics are moving and changing as rapidly as data itself.
One of the major obstacles that stands in the way of enterprises
deciding to move forward—to select a vendor and to jump into
the rapidly moving river of real-time streaming analytics—is the
absence of a future-proof guarantee. This is especially true right
now as the impact of innovation in this space is expected to
continue, with several new open source projects already
underway.
The ideal real-time streaming analytics platform would have an
architecture with a common abstraction layer below the user
interface that allows the selection of one or more streaming
engines and allows you to change engines as your goals,
requirements and strategies change. This abstraction layer would
also streamline upgrades and embed new technologies into the
existing system, based on relevance and credibility.
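As a rough illustration of such an abstraction layer, the Python sketch below defines the business logic once and runs it on two interchangeable engines. All names in it (`StreamEngine`, `PerEventEngine`, `MicroBatchEngine`, `run_pipeline`) are hypothetical stand-ins invented for this example; they are not the API of any real streaming engine or vendor platform.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class StreamEngine(ABC):
    """Hypothetical engine interface that the abstraction layer targets."""

    @abstractmethod
    def run(self, events: Iterable[int], operator: Callable[[int], int]) -> List[int]:
        ...


class PerEventEngine(StreamEngine):
    """Stand-in for a discrete-event engine: handles one event at a time."""

    def run(self, events, operator):
        return [operator(e) for e in events]


class MicroBatchEngine(StreamEngine):
    """Stand-in for a micro-batch engine: groups events before processing."""

    def __init__(self, batch_size: int = 2):
        self.batch_size = batch_size

    def run(self, events, operator):
        results, batch = [], []
        for e in events:
            batch.append(e)
            if len(batch) == self.batch_size:
                results.extend(operator(x) for x in batch)
                batch = []
        results.extend(operator(x) for x in batch)  # flush the final partial batch
        return results


def run_pipeline(engine: StreamEngine, events: Iterable[int]) -> List[int]:
    """Business logic is written once and does not know which engine runs it."""
    return engine.run(events, lambda reading: reading * 2)


per_event = run_pipeline(PerEventEngine(), [1, 2, 3])
micro_batch = run_pipeline(MicroBatchEngine(batch_size=2), [1, 2, 3])
print(per_event == micro_batch)  # True: same results, different engine
```

Because `run_pipeline` depends only on the interface, swapping engines is a one-argument change, which is the property the abstraction layer is meant to provide.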
6. Low Latency
Latency refers to the time required for a system to respond to
input. In streaming data analytics, it is typically measured as
the number of milliseconds or seconds it takes to ingest,
process, analyze and respond to an incoming event or data
point.
Latency depends on the architecture of the underlying stream
processing engine. Some popular micro-batch architectures such
as Spark Streaming may not be able to produce a response in
less than 500 milliseconds, whereas discrete event processing
architectures such as Apache Storm could provide responses in
less than 10 milliseconds. Actual figures vary with the data
size of each event and the complexity of the calculations
performed.
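The difference between the two architectures can be sketched with a simplified latency model. The Python below assumes that a micro-batch engine only starts processing an event once the event's batch interval has closed, and that queueing and scheduling overhead can be ignored. The function names and the numbers are illustrative assumptions, not benchmarks of any particular engine.

```python
def per_event_latency_ms(processing_ms: int) -> int:
    """Discrete-event model: an event is processed as soon as it arrives."""
    return processing_ms


def micro_batch_latency_ms(arrival_ms: int, batch_interval_ms: int, processing_ms: int) -> int:
    """Micro-batch model: an event waits until its batch closes, then is processed."""
    batch_close_ms = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
    return (batch_close_ms - arrival_ms) + processing_ms


arrivals_ms = [0, 120, 250, 499]  # hypothetical arrival times within one batch
micro = [micro_batch_latency_ms(a, batch_interval_ms=500, processing_ms=5) for a in arrivals_ms]
direct = [per_event_latency_ms(processing_ms=5) for _ in arrivals_ms]

print(micro)   # [505, 385, 255, 6]
print(direct)  # [5, 5, 5, 5]
```

In this model the batch interval, not the processing time, dominates the response time of a micro-batch engine, which is why the choice of architecture matters for latency-sensitive use cases.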
Many use cases do not need low-latency responses and work
just fine with many seconds between input and response. For
example, if a real-time dashboard shows a business executive
sales numbers that are updated every hour or every few hours,
a few seconds of delay in calculation would not be a problem.
In a different case, however, millions of internet advertising
decisions may need to be made every second, and the decision
on which ad or offer to display to each user may need to be
sent back to a web server in less than 20 milliseconds. The
stream processing engine chosen must then support extremely
high-volume, high-speed processing with latency guarantees for
each event processed that meet the SLA of the use case.
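One way to check measured latencies against such an SLA is sketched below in Python. The nearest-rank percentile and the 99th-percentile default are assumptions made for this example; setting `pct=100` gives the strict per-event check described above. The sample latencies are made up.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(latencies_ms: Sequence[float], sla_ms: float = 20.0, pct: float = 99.0) -> bool:
    """True when the chosen percentile of observed latencies is within the SLA."""
    return percentile(latencies_ms, pct) <= sla_ms


samples_ms = [4, 6, 5, 7, 9, 12, 8, 6, 5, 35]  # hypothetical measurements

print(meets_sla(samples_ms))          # False: the slowest event took 35 ms
print(meets_sla(samples_ms, pct=90))  # True: 90% of events finished within 12 ms
```

The example shows why averages are not enough for this kind of use case: nine of the ten events are fast, yet a single slow event breaks a per-event guarantee.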
When choosing an enterprise-wide platform that must support
many groups or departments with a variety of use cases, the
technology chosen needs to support the strictest requirements
among all the use cases the platform will serve. A multi-engine
platform that supports both discrete-event processing and
micro-batch processing architectures may be the best choice
for supporting a wide variety of use cases.
7. Data Integration with Lambda Architecture
The huge increase in types and sources of data has made
it necessary to quickly blend and correlate disparate data to
create actionable insights. A combination of real-time and batch
processing is needed to meet the new demands. The ideal
platform will enable an integration of static data and real-time
streaming data as prescribed by the Lambda Architecture.
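A minimal Python sketch of this idea follows: a batch view computed over stored data is merged at query time with a view that is updated incrementally from the stream. The function and class names, the store identifiers and the amounts are all hypothetical, and a real deployment would back each layer with a dedicated system rather than in-memory counters.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_batch_view(historical_events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Batch layer: recomputed periodically over all stored (at-rest) events."""
    view: Counter = Counter()
    for key, amount in historical_events:
        view[key] += amount
    return dict(view)


class SpeedLayer:
    """Speed layer: updated per event for data the batch view has not yet absorbed."""

    def __init__(self) -> None:
        self.view: Counter = Counter()

    def on_event(self, key: str, amount: int) -> None:
        self.view[key] += amount


def query(key: str, batch_view: Dict[str, int], speed_layer: SpeedLayer) -> int:
    """Serving step: merge the batch and real-time views into one answer."""
    return batch_view.get(key, 0) + speed_layer.view.get(key, 0)


batch_view = build_batch_view([("store-1", 100), ("store-2", 40), ("store-1", 60)])
speed = SpeedLayer()
speed.on_event("store-1", 25)  # events that arrived after the last batch run
speed.on_event("store-3", 10)

print(query("store-1", batch_view, speed))  # 185
```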
Beyond merely enriching streaming data with static data, the
platform should ideally also be able to orchestrate workflows
across batch systems and streaming systems. One example is
building a predictive model from training data in batch mode
and then moving the trained model into the streaming system to
score streaming data.
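The train-in-batch, score-on-stream workflow can be sketched in Python as follows. The "model" here is a deliberately simple threshold on distance from the training mean, chosen only to keep the example short; the names `ThresholdModel`, `train_in_batch` and `score_stream` are hypothetical, and any real predictive model could take its place.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Iterable, Iterator, Sequence, Tuple


@dataclass(frozen=True)
class ThresholdModel:
    """Toy model: flags values far from the center seen during training."""
    center: float
    spread: float
    z_limit: float = 3.0

    def is_anomaly(self, value: float) -> bool:
        return abs(value - self.center) > self.z_limit * self.spread


def train_in_batch(training_data: Sequence[float]) -> ThresholdModel:
    """Batch step: fit the model once over stored training data."""
    return ThresholdModel(center=mean(training_data), spread=pstdev(training_data))


def score_stream(model: ThresholdModel, stream: Iterable[float]) -> Iterator[Tuple[float, bool]]:
    """Streaming step: score each incoming value with the already-trained model."""
    for value in stream:
        yield value, model.is_anomaly(value)


model = train_in_batch([10.0, 12.0, 11.0, 9.0, 13.0])
scored = list(score_stream(model, [11.5, 30.0, 8.0]))
print(scored)  # [(11.5, False), (30.0, True), (8.0, False)]
```

The point of the sketch is the hand-off: the model object produced by the batch step is the only thing the streaming step needs.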
There’s a growing set of technologies out there that excel in
specific aspects:
Hadoop MapReduce, Storm and Spark for massively parallel
processing; NiFi, Kafka, Flume and ActiveMQ, along with traditional
messaging and queuing software, for real-time data movement;
Mesos and YARN for resource management.
We encourage you to find a platform that provides an easily
usable, UI-driven abstraction over the stack of complex
technologies used in Big Data platforms. The platform should
make the experience easy for developers, analysts and business
users, making them more productive and letting them focus on
business needs rather than infrastructure issues.
8. Rapid Application Development
An enterprise-ready streaming analytics system should include
intuitive visual tools for quickly creating streaming applications.
Building these applications should not involve cumbersome coding.
Instead, developers should be able to create organization-specific
business logic that can operate on any data source by working
in a graphical user interface and selecting predefined functions
and operators to integrate, pre-process and analyze the data.
The ideal RTSA platform will include a wide range of
data-processing operators including but not limited to the following:
- Source connectors for high-speed ingestion engines
- Connectors for data storage systems like Hadoop and other
  indexing stores
- Simple filters
- Complex multi-stream correlations
- Aggregation functions
- Statistical models
- Time window operators, both fixed and sliding
- Predictive analytics functions
In addition to such a rich array of pre-built operators, look for
features that support the developer life cycle, such as automatic
promotion from development to test to production, the ability
to track and maintain multiple application versions, and easy
upload and management of custom code and extensions for
tailored business logic.
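Two of the operators listed above, fixed and sliding time windows, can be sketched in Python as follows. The sketch assumes events arrive as `(timestamp_seconds, value)` pairs in timestamp order and uses a sum as the aggregation; the function names are invented for this example, and a production engine would also have to deal with late or out-of-order events.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, value)


def fixed_window_sums(events: Iterable[Event], size_s: int) -> Dict[int, float]:
    """Fixed (tumbling) windows: each event falls into exactly one window."""
    sums: Dict[int, float] = defaultdict(float)
    for ts, value in events:
        window_start = (ts // size_s) * size_s
        sums[window_start] += value
    return dict(sums)


def sliding_window_sums(events: Iterable[Event], size_s: int) -> Iterator[Tuple[int, float]]:
    """Sliding window: after each event, sum the values from the last size_s seconds."""
    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that are size_s seconds old or older.
        while window[0][0] <= ts - size_s:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total


events = [(0, 1.0), (4, 2.0), (9, 3.0), (12, 4.0), (21, 5.0)]

print(fixed_window_sums(events, size_s=10))
# {0: 6.0, 10: 4.0, 20: 5.0}
print(list(sliding_window_sums(events, size_s=10)))
# [(0, 1.0), (4, 3.0), (9, 6.0), (12, 9.0), (21, 9.0)]
```

A fixed window emits one result per interval, whereas a sliding window can emit an updated result with every event, which is the trade-off to weigh when picking between the two.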
9. Elastic Scaling
Ideally, you want a platform that the operations team can easily
scale out or scale down by visually changing the resource
allocation for various tasks. The newly configured resources
should join the cluster seamlessly and absorb changes in load
or traffic without any interruption to the real-time streaming
data processing.
A clustered, linear scale-out architecture for the stream
processing system, based on commodity hardware, is a
prerequisite for these scenarios in today's Big Data era, where
millions of events per second may need to be processed with
low-latency response.
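Under the linear scale-out assumption, capacity planning reduces to simple arithmetic, sketched below in Python. The per-worker throughput, the headroom percentage and the function name `workers_needed` are hypothetical values and names chosen for illustration; real per-node capacity has to be measured for each workload.

```python
def workers_needed(events_per_second: int, per_worker_capacity: int, headroom_pct: int = 20) -> int:
    """Workers required if throughput scales linearly, with spare headroom for bursts."""
    load_with_headroom = events_per_second * (100 + headroom_pct)
    capacity = 100 * per_worker_capacity
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, -(-load_with_headroom // capacity))


current_workers = workers_needed(500_000, per_worker_capacity=50_000)
target_workers = workers_needed(2_000_000, per_worker_capacity=50_000)

print(current_workers)                   # 12
print(target_workers)                    # 48
print(target_workers - current_workers)  # 36 workers to add as load grows fourfold
```

With linear scaling, a fourfold increase in load calls for a fourfold increase in workers; a platform that scales sub-linearly would need more than this estimate.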
10. Data Visualization
Attractive and easy-to-understand pictorial illustrations of data
with graphs, charts and dashboards are a big factor in increasing
adoption of any platform among the user community.
A strong RTSA platform will provide data visualization tools that
allow you to:
- Visualize live streaming data with add-ons like thresholds,
  alerts, range markers, etc.
- Dynamically create custom charts or dashboards
- See trending and comparisons
- Easily build custom UI extensions