Fast Big Data and Streaming
Analytics: Discerning Differences and
Selecting the Best Approach
Lynn Langit
April 2015
TABLE OF CONTENTS

Executive summary
Introduction
High-Volume Real-time Data Analytics
Streaming Analytics Project Design
  Architectural Considerations
Component Selection
  Overall Architecture
  Enterprise-grade Streaming Engine
  Ease of Use and Development
Creating the Proof of Concept
Management/DevOps
Key Findings
Summary Comparisons
About Lynn Langit
Executive summary
An increasing number of big data projects include one or more streaming components. The architectures
of these streaming solutions are complex, often containing multiple data repositories (data at rest),
streaming pipelines (data in motion), and other processing components (such as real- or near-real time
analytics). In addition to performing analytics in real time (or near-real-time), streaming platforms can
interact with external systems (databases, files, message buses, etc.) for data enrichment or for storing
data after processing. Architects must consider many types of big data streaming solutions that can
include open source projects as well as commercial products. Now they also have a next-generation
reference architecture for big data, also known as fast big data.
This report will help IT decision makers at all levels understand the common technologies,
implementation patterns, and architectural approaches for creating streaming big data solutions. It will
also describe each solution’s tradeoffs and determine which approach best fits specific needs.
Key takeaways from this report include:
• Use commercial streaming solutions to implement complex Event Stream Processing (ESP)
projects. Organizations should match the solution’s complexity to the component’s maturity.
• Teams that have already implemented pure open source Hadoop solutions are most capable of
adding pure open source streaming solutions. An organization should match its team’s skill level
to solution complexity and component maturity.
• Organizations should test solutions at production levels of load during the proof-of-concept phase
and determine whether they will host the solution on premises, in the cloud, or as a hybrid project.
• Organizations should select tools, or plan to code, the appropriate type of visualization solution.
Introduction
Data continues to grow in variety, volume, and velocity. Enterprises are handling the first two
components, with Hadoop emerging as the de facto big data technology of choice. However, they now
realize that processing and gaining insights from the increased velocity of data creation yields greater
value as it enables better operational efficiency (cost reduction) and personalized customer offerings
(revenue growth). Streaming analytic solutions are the technology that supports processing fast big data.
Note that the term “fast big data” still includes “big data,” so streaming solutions must also adhere to the
tenets of big data: (near) linear scalability, fault tolerance, distributed computing, no single point of
failure, security, operability, ease of use etc. These abilities become more critical for fast big data. Fast big
data also requires seamless integration with external systems, in-memory processing, advanced
partitioning schemes, etc. The fast big data architecture stack must integrate seamlessly in the
enterprise’s existing data processing methodology.
Contrasting streaming to traditional analytic data architecture determines whether a business can benefit
from processing fast big data.
• The traditional workflow first entails ingest (usually via batch processing), then store-and-process,
and finally query. This workflow provides results in hours or days.
• The action-based architecture of streaming, which is based on a pipeline of steps for continuously
ingesting, transforming, processing, alerting, and then storing data, provides insights and takes
action in seconds, minutes, or hours.
The following definitions are helpful in understanding solution architectures designed to support one or
more streaming data components.
• Event-stream processing (ESP). Streaming data is continuously ingested into one or more
data systems via a flow or stream. By contrast, non-streaming data is ingested via individual
record processing (insert, update, or delete) or batched record processing. Considerations for
streaming include:
o The size and duration of each streaming window or chunk of data.
o The stream’s volume and velocity; is it predictable and regular or variable and prone to
spikes?
o The design must account for the sources, sizes, and types of data in the streams.
o In general, ESP solutions contain many input streams that can include different types and
volumes of data. Common types of stream data include sensor data, transactional data,
web clickstream, or log data. Internet of Things (IoT) data is increasing the demand for
streaming architectures as well. All streaming applications also require access to data at
rest for enrichment or to provide context—such as customer data, purchase history, and
support history.
o Along with multiple data input streams, ESP solutions are often implemented to answer
many mission-critical business questions and are placed into the operational data stream
requiring fault tolerance.
o This type of solution can include many data-pipeline-processing steps that vary from simple
aggregation to complex machine learning processes. An example is using predictive
analysis of live and stored stream data to process all airline engine sensor data for all
flights for an airline, with the business goals of improved flight safety and reduced engine
maintenance.
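The windowing consideration above can be made concrete with a minimal tumbling-window aggregator. This is an illustrative sketch only; the event format (timestamp, key) and the 10-second window size are assumptions, not anything prescribed by a particular platform.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s):
    """Group (timestamp_s, key) events into fixed, non-overlapping
    windows and count occurrences of each key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_size_s) * window_size_s
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Clickstream-style events: (epoch seconds, page)
events = [(0, "home"), (3, "home"), (7, "cart"), (12, "home")]
print(tumbling_window_counts(events, window_size_s=10))
# → {0: {'home': 2, 'cart': 1}, 10: {'home': 1}}
```

In a real ESP system the stream is unbounded, so windows would be emitted and discarded as they close rather than accumulated in memory.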
High-Volume Real-time Data Analytics
Since Hadoop is the focal point of the big data ecosystem, emerging fast big data platforms must be evaluated
based on their interaction with Hadoop. Including Apache Hadoop components in streaming solutions is
common, so the following definitions of major components are helpful.
• Apache Hadoop core. The core services of Hadoop are the Hadoop Distributed File System
(HDFS) and YARN (Yet Another Resource Negotiator). This separation of HDFS and YARN has
enabled the emergence of streaming platforms native to Hadoop. MapReduce (which previously
provided cluster management services in place of YARN) is now a user-side library, completely
separated from YARN, and is not relevant to streaming platforms.
• Apache Flume. A distributed service for efficiently collecting, aggregating, and moving
streaming event data.
• Apache Kafka. A distributed, highly scalable publish/subscribe messaging system. It maintains
feeds of messages in topics. Kafka is one of the most popular message buses in the big data
ecosystem. Though not part of core Hadoop, it is very widely used by the open source community.
• Apache Zookeeper. A centralized coordination service, commonly used in the Hadoop
ecosystem, that maintains configuration information and naming and provides distributed
synchronization and group services.
• Apache Storm and Spark Streaming. These streaming data processing libraries are defined
and differentiated in the body of this report. Storm and Spark Streaming predate YARN. Storm
uses its own scheduler rather than YARN's, and while Spark Streaming has YARN integration, it
is not designed to work exclusively with HDFS.
• Commercial native Hadoop streaming platforms. Some commercial platforms run
natively in YARN and leverage all the Hadoop semantics, operability, and other features.
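Kafka's core semantics — an append-only feed of messages that multiple consumers read independently by tracking their own offsets — can be modeled in a few lines. This is a toy in-memory sketch for intuition only, not the real Kafka client API; the class and method names are illustrative.

```python
class Topic:
    """Toy model of a Kafka topic: an append-only log of messages.
    Each consumer tracks its own read offset, so multiple consumers
    can read the same feed independently (publish/subscribe)."""
    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the appended message

    def consume(self, offset):
        """Return (messages, next_offset) from the given offset onward."""
        return self.log[offset:], len(self.log)

topic = Topic()
topic.publish({"sensor": "hvac-1", "temp_c": 21.5})
topic.publish({"sensor": "hvac-2", "temp_c": 23.0})

messages, offset_a = topic.consume(0)      # consumer A reads the feed
print(len(messages), offset_a)             # → 2 2
later, offset_a = topic.consume(offset_a)  # nothing new yet
print(later)                               # → []
```

Real Kafka adds partitioning, replication, and durable storage on top of this basic log-plus-offset model.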
The following figure shows a subset of Apache Hadoop’s components. Selecting the appropriate Hadoop
components for a solution is a key consideration in architecting the streaming solution.
Architectural view of typical Hadoop components
Streaming Analytics Project Design
The common phases of streaming data projects are: architecture, component selection, proof of concept,
and management/DevOps (or moving the solution to production). Although design phases follow familiar,
standard architectural patterns, a closer examination of the tasks performed in each phase is useful in
understanding solution design at a deeper level.
Because the landscape of streaming data solutions is changing rapidly as more open source libraries and
commercial products become available and because many technical teams lack experience creating these
types of solutions, following best practices is essential.
Architectural Considerations
The architecture phase includes sub-phases of design. The figure below shows a simple diagram of the
fast big data pipeline.
A Fast Big Data Pipeline
Phase 1: Scalable Ingestion. Identify all of the streaming data sources for ingestion. The critical
requirements of ingestion are fault tolerance and scalability. These include handling fault tolerance with
no data loss, no loss of application state, and in-order processing, with no manual intervention or
dependence on components external to the platform. The platform should have connectors to various
sources, as well as the capability to add new future sources.
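The no-data-loss requirement can be sketched as an ordered ingest loop that retries transient sink failures instead of dropping records. The sink callback and retry count here are illustrative assumptions, not a specific platform's mechanism.

```python
def ingest_with_retry(records, sink_append, max_retries=3):
    """Deliver each record, in order, to a durable sink. Retry on
    transient failure so no data is lost; raise if a record cannot be
    delivered, rather than silently dropping it."""
    for record in records:
        for attempt in range(max_retries):
            try:
                sink_append(record)
                break
            except IOError:
                if attempt == max_retries - 1:
                    raise  # surface the failure; never drop data

# Demo with a sink that fails twice before recovering.
durable, failures = [], {"count": 2}
def flaky_sink(record):
    if failures["count"] > 0:
        failures["count"] -= 1
        raise IOError("transient outage")
    durable.append(record)

ingest_with_retry(["e1", "e2", "e3"], flaky_sink)
print(durable)  # → ['e1', 'e2', 'e3']
```

Production platforms also persist a write-ahead buffer so records survive a crash of the ingest process itself, which this sketch does not attempt.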
Phase 2: Real-time ETL. Ask, “What is the quality of the incoming data?” For example, will
duplicate records appear in a stream? If so, which records will need to be de-duplicated, and how will
that transformation be accomplished? Will the team write code to de-duplicate the data, or will a
commercial product with data manipulation capabilities do so?
Another set of considerations involves compliance. Are timestamps required on data for compliance
purposes? Does the streaming platform have connectors to various external sources for data
enrichment? If error checking depends on the order or context of the data, then fault tolerance and
in-order processing with no data loss become critical.
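The de-duplication question is often answered with a bounded "seen ids" window, so memory stays fixed even on an unbounded stream. This is a minimal sketch; the event-id format and window size are assumptions for illustration.

```python
from collections import OrderedDict

def dedupe_stream(events, window=1000):
    """Drop duplicate events by id, remembering only the most recent
    `window` ids so memory stays bounded on an unbounded stream."""
    seen = OrderedDict()
    for event_id, payload in events:
        if event_id in seen:
            continue  # duplicate within the window: skip it
        seen[event_id] = True
        if len(seen) > window:
            seen.popitem(last=False)  # evict the oldest remembered id
        yield event_id, payload

stream = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]
print(list(dedupe_stream(stream)))
# → [(1, 'a'), (2, 'b'), (3, 'c')]
```

The tradeoff is explicit: a duplicate arriving after its id has been evicted from the window will pass through, which is why window sizing depends on how far apart duplicates can occur in the stream.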
Phase 3: Real-time Analytics. “What are the most complex analytics that need to be performed and
can the business SLA be met to have the job finish in an hour, one minute, or one second, as needed?”
Examples of analytics computations involve dimensional cube computations, aggregates, reconciliation,
etc.
Fast big data analytics are computation-intensive and affect latency. Performance, scalability, and fault-
tolerance are of utmost importance in these use cases. Spikes in data have a great impact on analytics, so
automatic scaling (rather than hand-coded scalability) is important.
Analytics require that intermediate results be retained so fault tolerance is critical. A fault tolerant
platform must ensure that there is no loss of events, application state, or application data. As with many
other considerations around streaming architectures, a choice exists between custom coding fault
tolerance on a per-application basis or letting the streaming platform provide that capability.
Another consideration is whether to run some or all of the solution on a public cloud where offerings for
streaming data and persisting it differ considerably. Commercial solutions can run only on-premises, only
on a particular cloud, or both in the cloud and on-premises.
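The point about retaining intermediate results can be made concrete with a checkpointable running aggregate: a restarted worker restores the last snapshot and resumes instead of recomputing from the start of the stream. The JSON snapshot format here is an illustrative assumption, not any platform's actual mechanism.

```python
import json

class RunningAggregate:
    """A running sum whose intermediate state can be checkpointed
    and restored, modeling fault-tolerant stateful processing."""
    def __init__(self, total=0.0, processed=0):
        self.total, self.processed = total, processed

    def add(self, value):
        self.total += value
        self.processed += 1

    def checkpoint(self):
        return json.dumps({"total": self.total, "processed": self.processed})

    @classmethod
    def restore(cls, snapshot):
        state = json.loads(snapshot)
        return cls(state["total"], state["processed"])

agg = RunningAggregate()
for v in [10, 20, 30]:
    agg.add(v)
snap = agg.checkpoint()          # persisted durably before a crash
resumed = RunningAggregate.restore(snap)
resumed.add(40)                  # processing continues after restart
print(resumed.total, resumed.processed)  # → 100.0 4
```

Doing this per application is exactly the custom-coded fault tolerance the report describes; a platform that checkpoints operator state automatically removes this burden from the developer.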
Phase 4: Alerts and Actions. Address the business need around notification when events or
activities occur. Alerts can be provided as part of a visual dashboard, via SMTP (email), SMS (text), or any
other message bus. Taking action is usually an automated business process that occurs without human
intervention, based on rules or policy that must be formulated. For example, for “smart building” sensor
data, at what point should the local building maintenance team be alerted, and for which types of
event thresholds?
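A minimal threshold-rule evaluator sketches how such alert policies might be expressed. The rule format, metric names, and channel names are hypothetical examples, not taken from any product.

```python
def check_thresholds(reading, rules):
    """Evaluate a sensor reading against alert rules and return the
    alerts to dispatch (e.g. to email, SMS, or a message bus)."""
    alerts = []
    for metric, (low, high, channel) in rules.items():
        value = reading.get(metric)
        if value is not None and not (low <= value <= high):
            alerts.append((channel, f"{metric}={value} outside [{low}, {high}]"))
    return alerts

# Hypothetical smart-building policy: alert the maintenance team
# when HVAC temperature leaves the 18-26 C comfort band.
rules = {"temp_c": (18, 26, "maintenance-sms")}
print(check_thresholds({"sensor": "hvac-1", "temp_c": 31.0}, rules))
# → [('maintenance-sms', 'temp_c=31.0 outside [18, 26]')]
```

Automated actions (such as scheduling a technician) would subscribe to the same alert channel rather than being coded into the evaluator, keeping rules and actions independently changeable.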
Phase 5: Visualization and Distribution. Design for both storage and integration of processes;
result data must be in formats consumable by end-user groups. For example, will manufacturing line
metrics be incorporated into a dashboard, a phone application, or a wearable device? Will the developers
create these visualizations or will the organization integrate commercial visualization solutions? A
streaming platform that has connectors to various external systems and provides the ability to integrate with
a visualization technology, or provides its own, helps in this phase.
Component Selection
The next stage in creating a streaming solution is selecting the particular components for the streaming
data architectures. Gigaom suggests that the component selection be evaluated with these considerations:
• The overall architecture of the platform, both in terms of simplicity and reliability
• The enterprise-grade capabilities of the core streaming engine
• Ease of application development and the ability to process data in motion and data at rest
• Management and operation of the platform once it is in production
Following are key questions to consider that will assist with component selection along with a summary
table of the most critical features compared across representative technologies.
Overall Architecture
• Is production operability natively built into the platform? Streaming analytic solutions
run in operational capacity and have high operational requirements, including fault tolerance,
scalability, security, native integration with external sources, web services, CLI, visual dashboard,
etc. A platform that supports these features natively offers time-to-solution advantages over a
platform that does not.
• Does the platform depend on external components? The number of external components
a streaming platform depends on impacts operability. Each added component introduces another
possible point of failure, and may require additional expertise. Stitching together disparate
components makes architecture more complex and possibly more brittle.
• What is the general attitude toward using open source software? When selecting
streaming components, evaluate the level of developer talent that is available. For some teams,
particularly those that already have deep expertise in developing and deploying Hadoop-based
solutions into production, adding streaming functionality via open source libraries may be a good
fit. This is because those teams are already familiar with patterns for working with Apache
Hadoop libraries.
However, for other, more traditional enterprise teams, using pure open source technologies can
result in hidden project costs such as resource hours to set up, test, and implement the solution.
In some cases, underestimating the knowledge required to set up, configure, integrate, tune, and
test can derail an entire project, as the complexity becomes overwhelming.
• What is the true cost of creating, deploying, and managing a big data streaming
solution? Are there hidden costs? When weighing commercial solutions against open source,
organizations must compare the licensing costs, support costs, and development resources required for
both. Typically, commercial products have an upfront licensing fee that includes support and
require less engineering expertise, while open source has no licensing fees but requires support
fees, services fees, and internal development resources. Organizations running in a commercial
cloud must analyze their monthly bills carefully for potential cost savings.
Enterprise-grade Streaming Engine
The core of any selection process is ensuring that the platform will meet the business needs for scalability,
high-availability, performance, and security. The many dimensions to consider vary based on use case.
Here are the most common areas of consideration.
• What is the forecasted data volume and variety of sources for ingestion? The available
solution components vary widely in their ability to ingest at scale from multiple data sources. The
ingestion-volume requirements alone could drive a decision to a particular commercial product or
set of libraries, or some combination of both, because there are known upper limits to the
solutions. The number and scope of data sources requires enterprise-quality adapters to handle
the variety of data types. The platform must be based on a very common programming language,
such as Java, to enable re-use of current code within the streaming platform. A Java-based
commercial product designed and tested for that kind of scale, with fault tolerance and a large
number of connectors, would be a far better fit.
• What are the use case requirements for event processing regarding data processing
guarantees and event order? Fast big data is comprised of a series of events that occur over
time. Architectural decisions on how a streaming platform processes those events will have a
direct impact on performance, latency, and scalability. Many fast big data use cases require that
event order be maintained. For example, some predictive analytics use cases must compare event
order to determine what might happen next. Streaming platform architectural decisions can
impact the ability to guarantee the event order, and whether an event will be processed exactly
one time, at most one time, or at least one time. Below, we look at three architectural methods of
processing event streams and the implications of each method.
1. Event-at-a-time. Apache Storm uses this method, which processes each event individually and
uses an acknowledgement signal for each. It can impact performance and the cost of hardware.
(Diagram source: DataTorrent)
2. Micro-batching. Apache Spark Streaming uses this method, which processes tiny groups of
events. This method cannot provide a guarantee of in-order processing. (Diagram source:
DataTorrent)
3. Streaming event window processing. This method processes windows in the stream and
differs from micro-batching by relying on a more lightweight process (markers in the stream)
rather than batching. This approach is a non-blocking, high-performance implementation that
can guarantee event order and provide only-once, at-least-once, and at-most-once event
processing with no data loss. (Diagram source: DataTorrent)
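The practical difference between these guarantees can be shown in a few lines: under at-least-once delivery an event may be redelivered after a failure, and unless processing is idempotent the event will be applied twice. Tracking processed event ids, as below, turns at-least-once delivery into an effectively exactly-once result. The event tuples are illustrative assumptions.

```python
def process_at_least_once(deliveries, processed_ids, totals):
    """Apply events that may be redelivered (at-least-once delivery).
    Skipping ids already applied makes the *effect* exactly once,
    even though delivery itself was not."""
    for event_id, key, amount in deliveries:
        if event_id in processed_ids:
            continue  # redelivery of an already-applied event
        processed_ids.add(event_id)
        totals[key] = totals.get(key, 0) + amount

totals, seen = {}, set()
# Event 101 is delivered twice, e.g. replayed after a failure.
deliveries = [(101, "acct-a", 50), (102, "acct-a", 25), (101, "acct-a", 50)]
process_at_least_once(deliveries, seen, totals)
print(totals)  # → {'acct-a': 75}
```

Without the id check, the replayed event 101 would inflate the total to 125, which is the data-quality risk behind the at-least-once column in the comparison table later in this report.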
What are the fault tolerance requirements of the use case? Fast big data use cases are typically
implemented in an operational environment. A streaming application, unlike a batch application, does
not have an end. It runs continuously 24/7. An organization must know if the platform it has selected to
process incoming data streams meets its requirements for fault tolerance. Does the ESP system manage
the application fault tolerance or does the engineer need to hand-code it? In the event of a failure, does
the ESP guarantee the events will be processed in the order in which they are generated?
Will a use case and business logic change over time? As an organization learns more about the
customer or operational aspects of the streaming application, it will typically want to change or
supplement the current business logic of its application. For example, in a financial services fraud
detection application, it may want to add another algorithm for detecting fraud, or change an alert
process. Since transactions occur continuously, the streaming application needs to handle being updated
without impacting the analysis of the data flow. This can be thought of as an extension of fault tolerance.
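The requirement to change business logic without stopping the stream can be sketched as a rule registry that is swapped between events while processing continues. The fraud rules here are hypothetical examples; real platforms perform such updates with considerably more machinery (versioning, state migration).

```python
class FraudPipeline:
    """A pipeline whose detection rules can be replaced while events
    keep flowing, modeling a dynamic application update."""
    def __init__(self, rules):
        self.rules = list(rules)

    def update_rules(self, rules):
        self.rules = list(rules)  # takes effect for the next event

    def score(self, txn):
        return any(rule(txn) for rule in self.rules)

large = lambda txn: txn["amount"] > 10_000
pipeline = FraudPipeline([large])
print(pipeline.score({"amount": 50_000, "country": "US"}))  # → True

# Later, add a rule without stopping the stream of transactions.
foreign = lambda txn: txn["country"] not in ("US", "CA")
pipeline.update_rules([large, foreign])
print(pipeline.score({"amount": 100, "country": "FR"}))     # → True
```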
Ease of Use and Development
Development process. In addition to the complexity of setting up the development environment,
when an organization begins creating a POC, it should also understand the tradeoffs involved in the details
of solution implementation. These details include items such as selection of programming language. For
example, will coding be done in a proprietary vendor-created language or in a general-purpose language,
such as Java? What features of the solution (for example event guarantees, parallel partitioning, and fault
tolerant capabilities) will be manually coded?
Commercial vendors, such as DataTorrent, Informatica, and others, include visual data pipeline creation
tools for rapid prototyping. This can be a significant factor in successful pipeline POC creation. For
example, when a time-sensitive POC must be built (due to competitive pressure, regulatory changes, or
some other business concern), the ability to use visual tools can significantly reduce the time needed to
create a prototype.
Data visualization. Another decision-point is how output data, insights, and action will be presented to
the project stakeholders for validation.
• What type of dash-boarding solution will be used? Will alerts be generated to demonstrate the
viability of the project?
• Will the developer team be expected to create visualizations for results, or will one or more
visualization tools be used instead?
• If using a commercial visualization solution, such as Tableau, what is the complexity of connecting
the output data to it and of creating visualizations that make sense to the subject matter experts?
• How is integration with processes, such as indexing or dimensional attribute processing, achieved?
This is another factor that separates commercial products from pure open source libraries.
• What steps are involved in optimizing real-time analytics? Can dimensional pre-calculations or some
type of indexing be added or adjusted? And, again, is optimization manual or automated via tools?
Creating the Proof of Concept
The first consideration in the proof-of-concept phase of a streaming project is revalidating the first few business
questions the solution should address and what action the organization would like to take as a result of
the insights it creates. For example, in an IoT smart-building sensor data-streaming project, typical
questions are: Which building sections or rooms seem to be outside of normal for HVAC and is there any
related data (changes in weather, number of people in the area, etc.) that normally correlate to such a
change? Or does this seem to be an anomaly that requires investigation by a technician? If a technician is
required, automatically schedule the appointment.
Next, begin the proof of concept (POC), which will be driven by the data sources and the methods of making
sense of these streams. Developers will be eager to build a POC at this point. Understanding developer
knowledge of the systems being proposed is key. Another approach is to ask candidate vendors, “What is
the developer story when creating applications on different types of streaming data systems?”
Management/DevOps
A new set of priorities and decisions arises after the POC has been validated and planning begins for
production deployment. Top among these concerns is considering what enterprise-level services are
necessary to make the solution operate and meet SLA requirements. These often include processes and
tools for monitoring and maintaining security, availability, scalability, and recoverability. Next is deciding
whether to build or buy. Which open source projects and which products best meet these enterprise
service requirements? Is the best solution an all-in-one or an amalgamation of various open source
libraries and vendor tools?
How are production solutions monitored, scaled, and managed? Does the streaming platform
have rich and deep web services and other integration methods to enable ease of integration into
monitoring systems? Commercial solutions include strong support for integration with monitoring
systems. Open source projects may not have strong monitoring integration points. Is the set of streams
spiky? Is dynamic or even automated scaling needed? Commercial products may include such auto-
scaling capabilities, including the ability to add operations into, or remove operations from, a running
pipeline.
What steps are involved in optimizing the ingest pipeline? Is it a manual process? Does code
have to be re-written, tested, and deployed or does the solution include tooling that can automate this?
Key Findings
The addition of a streaming platform to a big data stack adds significant complexity to big data solutions.
Given this, a solid understanding of technology choices around streaming data solutions is essential for
designing and delivering solutions that provide business value to the organization.
• Consider use of commercial streaming solutions for complex ESP projects. Match the
solution complexity to component maturity. Consider the volume, velocity, and variety of
both data and data streams. Consider the true costs of building on top of pure open source Hadoop
projects versus buying vendor tools and solutions that include Hadoop plus enterprise services.
• Teams that have already implemented pure open source Hadoop solutions are most
capable of adding pure open source streaming solutions. Match a team’s skill level
to solution complexity and component maturity. Know whether the team has production
experience working with Hadoop solutions and has deployed Hadoop solutions with low time to
market. What programming languages do they use? What are their expectations about integrating
this new project into their current development environment, and what are their requirements around
monitoring and scaling? Do they expect to work with tools, or are they comfortable working with
libraries from the command line?
• Select tools or plan for coding libraries to perform the types of analytics required.
Match the types of analysis expected on the event stream data. Organizations must
understand whether they will be performing simple aggregations or anticipate using machine
learning algorithms or some other type of predictive component. Know whether a team should
write this logic or whether the organization will purchase a solution that integrates or contains
these components. Figure out how much of the analytics must be returned in real-time.
• Test solutions at production levels of load during the proof-of-concept phase.
Determine whether to host the solution on premises, in the cloud, or as a hybrid
project. Examine different vendors’ cloud solutions. Is the solution tested at scale on any or all
clouds? Know the potential service costs.
• Select tools, or plan to code, the appropriate type of visualization solution. If the
organization plans to build its own visualization solution, the technical team must have the talent
to create what is envisioned. If not, what is the cost of hiring, re-training, or getting additional
help to implement this part of the solution?
Summary Comparisons
Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase

Core Streaming Engine Capabilities
In-memory compute engine | Yes | Yes | Yes | Yes | Yes
Native Hadoop architecture | No | Yes | Yes | No | No
Sub-second event processing | Yes | No | Yes | Yes | Yes
Extremely linear scalability (billions of events/second) | No | No | Yes | No | No
Stream partitioning for parallel processing | No | No | Yes | No | No
Auto-scaling input and event processing | No | No | Yes | Yes | Yes
Event processing guarantees | Only once, at least once, at most once | Only once, at least once, at most once | Only once, at least once, at most once | At most once | Only once, at most once
Event order guaranteed | No | No | Yes | No | No
End-to-end stateful fault tolerance | No | No | Yes | No | No
Incremental recovery | No | No | Yes | No | No
Dynamic application updates | No | No | Yes | No | No
Data loss potential | Yes | Yes | No | Yes | Yes
Complete separation of business logic, event acknowledgement, and fault tolerance | No | No | Yes | No | No

Streaming Application Development Tools
Native application programming language | Java | Scala, Java API | Java | Proprietary Streams Query Language | Proprietary StreamSQL, EventFlow, Open Source
Pre-built connectors | <10 | <10 | >75 | No | No
Graphical application builder | No | No | Yes (beta) | Yes | Yes
Visual real-time dashboard | No | No | Yes (beta) | No | Yes

Operations and Management Tools
Fully functional GUI-based management | No | No | Yes | Yes | Yes
Ease of integration with monitoring systems | No | No | Yes | Yes | Yes
Ease of integration with external systems via REST APIs | No | No | Yes | Yes | Yes
Simple install and upgrade | No | No | Yes | No | No
About Lynn Langit
Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more
than 15 years. Over the past 4 years, she’s been working as an independent architect using these
technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn has done
POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace Clouds. She has
done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB,
Neo4j and many other database systems. In addition to building solutions, Lynn also partners with all
major cloud vendors, providing early technical feedback on their Big Data and Cloud offerings.
She is a Google Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is
also a Cloudera certified instructor (for MapReduce Programming).
Before re-entering the consulting world 4 years ago, Lynn spent more than 10 years as a Microsoft
Certified instructor and a Microsoft vendor, followed by 4 years as a Microsoft employee. She has published 3 books
on SQL Server Business Intelligence and has most recently worked with the SQL Azure team at Microsoft.
She continues to write and screencast, and hosts a Big Data channel on YouTube
(http://www.youtube.com/SoCalDevGal) with over 150 technical videos on Cloud and Big Data
topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).

 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

About Streaming Data Solutions for Hadoop

TABLE OF CONTENTS

Executive summary
Introduction
High-Volume Real-time Data Analytics
Streaming Analytics Project Design
Architectural Considerations
Component Selection
Overall Architecture
Enterprise-grade Streaming Engine
Ease of Use and Development
Creating the Proof of Concept
Management/DevOps
Key Findings
Summary Comparisons
About Lynn Langit
Executive summary

An increasing number of big data projects include one or more streaming components. The architectures of these streaming solutions are complex, often containing multiple data repositories (data at rest), streaming pipelines (data in motion), and other processing components (such as real-time or near-real-time analytics). In addition to performing analytics in real time (or near-real time), streaming platforms can interact with external systems (databases, files, message buses, etc.) for data enrichment or for storing data after processing.

Architects must consider many types of big data streaming solutions, which can include open source projects as well as commercial products. They also now have a next-generation reference architecture for big data, also known as fast big data. This report will help IT decision makers at all levels understand the common technologies, implementation patterns, and architectural approaches for creating streaming big data solutions. It will also describe each solution's tradeoffs and help determine which approach best fits specific needs. Key takeaways from this report include:

• Use commercial streaming solutions to implement complex Event Stream Processing (ESP) projects. Organizations should match the solution's complexity to the component's maturity.
• Teams that have already implemented pure open source Hadoop solutions are the most capable of adding pure open source streaming solutions. An organization should match its team's skill level to solution complexity and component maturity.
• Organizations should test solutions at production levels of load during the proof-of-concept phase and determine whether they will host the solution on premises, in the cloud, or as a hybrid project.
• Organizations should select tools, or plan for coding, appropriate types of visualization solutions.
Introduction

Data continues to grow in variety, volume, and velocity. Enterprises are handling the first two components, with Hadoop emerging as the de facto big data technology of choice. However, they now realize that processing and gaining insights from the increased velocity of data creation yields greater value, as it enables better operational efficiency (cost reduction) and personalized customer offerings (revenue growth). Streaming analytic solutions are the technology that supports processing fast big data.

Note that the term "fast big data" still includes "big data," so streaming solutions must also adhere to the tenets of big data: (near) linear scalability, fault tolerance, distributed computing, no single point of failure, security, operability, ease of use, etc. These abilities become more critical for fast big data. Fast big data also requires seamless integration with external systems, in-memory processing, advanced partitioning schemes, etc. The fast big data architecture stack must integrate seamlessly into the enterprise's existing data processing methodology.

Contrasting streaming with traditional analytic data architecture determines whether a business can benefit from processing fast big data.

• The traditional workflow first entails ingest (usually via batch processing), then store-and-process, and finally query. This workflow provides results in hours or days.
• The action-based architecture of streaming, which is based on a pipeline of steps for continuously ingesting, transforming, processing, alerting on, and then storing data, provides insights and takes actions in seconds, minutes, or hours.

The following definitions are helpful in understanding solution architectures designed to support one or more streaming data components.

• Event-stream processing (ESP).
Streaming data is continuously ingested into one or more data systems via a flow or stream. In contrast, non-streaming data is ingested via individual record processing (insert, update, or delete) or batched record processing. Considerations for streaming include:

o The size and duration of each streaming window or chunk of data.
o The stream's volume and velocity: is it predictable and regular, or variable and prone to spikes?
o The sources, sizes, and types of data in the streams, all of which the design must account for.
o In general, ESP solutions contain many input streams that can include different types and volumes of data. Common types of stream data include sensor data, transactional data, web clickstream data, or log data. Internet of Things (IoT) data is increasing the demand for streaming architectures as well. All streaming applications also require access to data at rest for enrichment or to provide context, such as customer data, purchase history, and support history.
o Along with multiple data input streams, ESP solutions are often implemented to answer many mission-critical business questions and are placed into the operational data stream, requiring fault tolerance.
o This type of solution can include many data-pipeline-processing steps that vary from simple aggregation to complex machine learning processes. An example is using predictive analysis of live and stored stream data to process all airline engine sensor data for all flights for an airline, with the business goals of improved flight safety and reduced engine maintenance.

High-Volume Real-time Data Analytics

Since Hadoop is the focal point of the big data ecosystem, emerging fast big data platforms must be evaluated based on their interaction with Hadoop. Including Apache Hadoop components in streaming solutions is common, so the following definitions of major components are helpful.

• Apache Hadoop core. The core services of Hadoop are the Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator). The separation of HDFS and YARN has enabled the emergence of streaming platforms native to Hadoop. MapReduce (which previously provided cluster management services in place of YARN) is now a user-side library, completely separated from YARN, and is not relevant to streaming platforms.
• Apache Flume.
A distributed service for efficiently collecting, aggregating, and moving streaming event data.
• Apache Kafka. A distributed, highly scalable publish/subscribe messaging system. It maintains feeds of messages in topics. Kafka is one of the most popular message buses in the big data ecosystem. Though not part of core Hadoop, it is very widely used by the open source community.
• Apache ZooKeeper. A centralized coordinator that maintains configuration information and naming and provides distributed synchronization and group services; it is commonly used in a Hadoop ecosystem.
• Apache Storm and Spark Streaming. These streaming data processing libraries are defined and differentiated in the body of this report. Storm and Spark Streaming predate YARN. Storm uses its own scheduler rather than YARN's, and while Spark Streaming has YARN integration, it is not designed to work exclusively with HDFS.
• Commercial native Hadoop streaming platforms. Some commercial platforms run natively in YARN and leverage all the Hadoop semantics, operability, and other features.

The following figure shows a subset of Apache Hadoop's components. Selecting the appropriate Hadoop components for a solution is a key consideration in architecting the streaming solution.

[Figure: Architectural view of typical Hadoop components]
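Kafka's topic-based publish/subscribe model can be illustrated with a small sketch: an append-only log of messages per topic, with each subscriber tracking its own read offset. This is a toy model of the concept only (the MiniLog class and its methods are invented here), not the Apache Kafka client API:

```python
from collections import defaultdict

class MiniLog:
    """Toy illustration of Kafka-style topics: an append-only log per
    topic, with each consumer tracking its own read offset.
    A conceptual sketch only -- not the Apache Kafka client API."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only message list
        self.offsets = defaultdict(int)   # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        """Return all messages this consumer has not yet seen."""
        key = (consumer, topic)
        start = self.offsets[key]
        messages = self.topics[topic][start:]
        self.offsets[key] = len(self.topics[topic])
        return messages

log = MiniLog()
log.publish("clicks", {"user": "a", "page": "/home"})
log.publish("clicks", {"user": "b", "page": "/cart"})

# Two consumers read the same topic independently, each at its own pace.
dashboard = log.consume("dashboard", "clicks")   # sees the first two events
log.publish("clicks", {"user": "a", "page": "/buy"})
archiver = log.consume("archiver", "clicks")     # sees all three events
```

Because messages stay in the log after being read, many consumers (a dashboard, an archiver, an analytics job) can process the same stream at different speeds without interfering with one another.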
Streaming Analytics Project Design

The common phases of streaming data projects are: architecture, component selection, proof of concept, and management/DevOps (moving the solution to production). Although design phases follow familiar, standard architectural patterns, a closer examination of the tasks performed in each phase is useful in understanding solution design at a deeper level. Because the landscape of streaming data solutions is changing rapidly as more open source libraries and commercial products become available, and because many technical teams lack experience creating these types of solutions, following best practices is essential.

Architectural Considerations

The architecture phase includes sub-phases of design. The figure below shows a simple diagram of the fast big data pipeline.

[Figure: A Fast Big Data Pipeline]

Phase 1: Scalable Ingestion. Identify all of the streaming data sources for ingestion. The critical requirements of ingestion are fault tolerance and scalability. These include handling fault tolerance with no data loss, no loss of application state, and in-order processing, with no manual intervention or dependence on components external to the platform. The platform should have connectors to various sources, as well as the capability to add new sources in the future.

Phase 2: Real-time ETL. "What is the quality of the incoming data?" For example, might duplicate records appear in a stream? If so, which will need to be de-duplicated? How will that transformation be accomplished? Will the team write code to de-duplicate the data, or will a commercial product with data manipulation capabilities do that?

Another set of considerations involves compliance. Are timestamps required on data for compliance requirements? Does the streaming platform have connectors to various external sources for data
enrichment? If error checking depends on the order of data or needs the context of data, then fault tolerance and in-order processing with no data loss become critical.

Phase 3: Real-time Analytics. "What are the most complex analytics that need to be performed, and can the business SLA be met to have the job finish in an hour, one minute, or one second, as needed?" Examples of analytics computations involve dimensional cube computations, aggregates, reconciliation, etc. Fast big data analytics are computation-intensive and affect latency. Performance, scalability, and fault tolerance are of utmost importance in these use cases. Spikes in data have a great impact on analytics, so scaling automatically (rather than hand-coding such scalability) is important. Analytics require that intermediate results be retained, so fault tolerance is critical. A fault-tolerant platform must ensure that there is no loss of events, application state, or application data. As with many other considerations around streaming architectures, a choice exists between custom-coding fault tolerance on a per-application basis or letting the streaming platform provide that capability.

Another consideration is whether to run some or all of the solution on a public cloud, where offerings for streaming data and persisting it differ considerably. Commercial solutions can run only on-premises, only on a particular cloud, or both in the cloud and on-premises.

Phase 4: Alerts and Actions. Address the business need for notification when events or activities occur. Alerts can be provided as part of a visual dashboard, via SMTP (email), SMS (text), or any other message bus. Taking action is usually an automated business process that occurs without human intervention, based on rules or policy that must be formulated.
For example, for "smart building" data from sensors, at what point should the local building maintenance team be alerted, and for which types of event thresholds?

Phase 5: Visualization and Distribution. Design for both storage and integration of processes; result data must be in formats consumable by end-user groups. For example, will manufacturing line metrics be incorporated into a dashboard, a phone application, or a wearable device? Will the developers create these visualizations, or will the organization integrate commercial visualization solutions? A streaming platform that has connectors to various external systems and provides the ability to integrate with a visualization technology, or provides its own, helps in this phase.
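The five phases above can be sketched end-to-end over a single windowed chunk of data: ingest a window, de-duplicate it (ETL), aggregate per zone (analytics), and flag zones that breach a threshold (alerts), with the result ready to feed a dashboard (visualization). The sensor fields, the 28 °C threshold, and the helper names are invented for illustration; a real pipeline would run these steps continuously inside a streaming platform:

```python
# Hypothetical sensor readings: (event_id, zone, temperature_celsius)
ALERT_THRESHOLD = 28.0  # assumed "outside of normal" HVAC temperature

def deduplicate(events, seen=None):
    """Phase 2 (real-time ETL): drop duplicate event ids from the stream."""
    seen = set() if seen is None else seen
    out = []
    for event in events:
        if event[0] not in seen:
            seen.add(event[0])
            out.append(event)
    return out

def window_average(events):
    """Phase 3 (analytics): aggregate one window of readings per zone."""
    totals, counts = {}, {}
    for _, zone, temp in events:
        totals[zone] = totals.get(zone, 0.0) + temp
        counts[zone] = counts.get(zone, 0) + 1
    return {zone: totals[zone] / counts[zone] for zone in totals}

def alerts(averages):
    """Phase 4 (alerts/actions): flag zones whose average exceeds threshold."""
    return sorted(zone for zone, avg in averages.items() if avg > ALERT_THRESHOLD)

# One ingested window (Phase 1), containing a duplicated event id 2:
window = [(1, "lobby", 21.0), (2, "server-room", 29.5),
          (2, "server-room", 29.5), (3, "server-room", 30.1)]

clean = deduplicate(window)       # 3 unique events survive the ETL step
summary = window_average(clean)   # per-zone averages for this window
hot_zones = alerts(summary)       # Phase 5 would push this to a dashboard
```

The point of the sketch is the shape of the pipeline, not the logic inside each step: every phase consumes the previous phase's output, and each window flows through all phases within the latency budget.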
Component Selection

The next stage in creating a streaming solution is selecting the particular components for the streaming data architecture. Gigaom suggests that component selection be evaluated with these considerations:

• The overall architecture of the platform, both in terms of simplicity and reliability
• The enterprise-grade capabilities of the core streaming engine
• Ease of application development and the ability to process data in motion and data at rest
• Management and operation of the platform once it is in production

Following are key questions to consider that will assist with component selection, along with a summary table of the most critical features compared across representative technologies.

Overall Architecture

• Is production operability natively built into the platform? Streaming analytic solutions run in an operational capacity and have high operational requirements, including fault tolerance, scalability, security, native integration with external sources, web services, a CLI, a visual dashboard, etc. A platform that supports these features natively offers time-to-solution advantages over a platform that does not.
• Does the platform depend on external components? The number of external components a streaming platform depends on impacts operability. Each added component introduces another possible point of failure and may require additional expertise. Stitching together disparate components makes the architecture more complex and possibly more brittle.
• What is the general attitude toward using open source software? When selecting streaming components, evaluate the level of developer talent that is available. For some teams, particularly those that already have deep expertise in developing and deploying Hadoop-based solutions into production, adding streaming functionality via open source libraries may be a good fit.
This is because those teams are already familiar with patterns for working with Apache Hadoop libraries. However, for other, more traditional enterprise teams, using pure open source technologies can result in hidden project costs, such as resource hours to set up, test, and implement the solution.
In some cases, underestimating the knowledge required to set up, configure, integrate, tune, and test can derail an entire project as the complexity becomes overwhelming.

• What is the true cost of creating, deploying, and managing a big data streaming solution? Are there hidden costs? When weighing commercial solutions against open source, organizations must compare the licensing costs, support costs, and development resources required for both. Typically, commercial products have an upfront licensing fee that includes support and require less engineering expertise, while open source has no licensing fees but requires support fees, services fees, and internal development resources. Organizations running in a commercial cloud must analyze their monthly bills carefully for potential cost savings.

Enterprise-grade Streaming Engine

The core of any selection process is ensuring that the platform will meet the business needs for scalability, high availability, performance, and security. The many dimensions to consider vary based on use case. Here are the most common areas of consideration.

• What is the forecasted data volume and variety of sources for ingestion? The available solution components vary widely in their ability to ingest at scale from multiple data sources. The ingestion-volume requirements alone could drive a decision toward a particular commercial product or set of libraries, or some combination of both, because there are known upper limits to the solutions. The number and scope of data sources require enterprise-quality adapters to handle the variety of data types. The platform must be based on a very common programming language, such as Java, to enable re-use of current code within the streaming platform.
A Java-based commercial product designed and tested for that kind of scale, with fault tolerance and a large number of connectors, would be a far better fit.

• What are the use case requirements for event processing regarding data processing guarantees and event order? Fast big data is comprised of a series of events that occur over time. Architectural decisions on how a streaming platform processes those events will have a direct impact on performance, latency, and scalability. Many fast big data use cases require that event order be maintained. For example, some predictive analytics use cases must compare event order to determine what might happen next. Streaming platform architectural decisions can impact the ability to guarantee event order, and whether an event will be processed exactly
one time, at most one time, or at least one time. Below, we look at three architectural methods of processing event streams and the implications of each method.

1. Event-at-a-time. Apache Storm uses this method, which processes each event individually and uses an acknowledgement signal for each. This can impact performance and the cost of hardware. (Source: DataTorrent)
2. Micro-batching. Apache Spark Streaming uses this method, which processes tiny groups of events. This method cannot provide a guarantee of in-order processing. (Source: DataTorrent)
3. Streaming event window processing. This method processes windows in the stream and differs from micro-batching by relying on a more lightweight process (markers in the stream) rather than batching. This approach is a non-blocking, high-performance implementation that can guarantee event order and provide only-once, at-least-once, and at-most-once event processing with no data loss. (Source: DataTorrent)

• What are the fault tolerance requirements of the use case? Fast big data use cases are typically implemented in an operational environment. A streaming application, unlike a batch application, does not have an end; it runs continuously, 24/7. An organization must know whether the platform it has selected to process incoming data streams meets its requirements for fault tolerance. Does the ESP system manage the application fault tolerance, or does the engineer need to hand-code it? In the event of a failure, does the ESP guarantee the events will be processed in the order in which they are generated?
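The practical difference between these processing guarantees can be seen in a small simulation: a source that replays events after a failure gives at-least-once delivery (duplicates appear), while a consumer that remembers processed event ids converts that into effectively only-once processing. This is a conceptual sketch of the semantics, not how any of the platforms above implement them internally:

```python
def deliver(events, fail_after=None):
    """Simulate a source that replays everything after a crash:
    at-least-once delivery duplicates the events seen before the failure."""
    delivered = []
    if fail_after is not None:
        delivered.extend(events[:fail_after])   # processed, then crash: acks lost
        delivered.extend(events)                # source replays from its checkpoint
    else:
        delivered.extend(events)
    return delivered

def process_only_once(delivered):
    """Idempotent consumer: remembering processed ids turns at-least-once
    delivery into effectively only-once processing."""
    seen, results = set(), []
    for event_id, value in delivered:
        if event_id in seen:
            continue            # duplicate from the replay -- ignore it
        seen.add(event_id)
        results.append(value)
    return results

events = [(1, "a"), (2, "b"), (3, "c")]
replayed = deliver(events, fail_after=2)   # events 1 and 2 arrive twice: 5 messages
naive_total = len(replayed)                # counting naively double-counts the replay
safe = process_only_once(replayed)         # only-once effect despite the replay
```

At-most-once delivery is the opposite trade: if the replay step were skipped after the crash, nothing would be duplicated, but the unacknowledged events would simply be lost.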
• Will the use case and business logic change over time? As an organization learns more about the customer or operational aspects of the streaming application, it will typically want to change or supplement the current business logic of its application. For example, in a financial services fraud detection application, it may want to add another algorithm for detecting fraud or change an alert process. Since transactions occur continuously, the streaming application needs to handle being updated without impacting the analysis of the data flow. This can be thought of as an extension of fault tolerance.

Ease of Use and Development

Development process. In addition to the complexity of setting up the development environment, when an organization begins creating a POC, it should also understand the tradeoffs involved in the details of solution implementation. These details include items such as the selection of programming language. For example, will coding be done in a proprietary vendor-created language or in a general-purpose language, such as Java? Which features of the solution (for example, event guarantees, parallel partitioning, and fault-tolerance capabilities) will be manually coded?

Commercial vendors, such as DataTorrent, Informatica, and others, include visual data pipeline creation tools for rapid prototyping. This can be a significant factor in successful pipeline POC creation. For example, when a time-sensitive POC must be built (due to competitive pressure, regulatory changes, or some other business concern), the ability to use visual tools can significantly reduce the time to create a prototype.

Data visualization. Another decision point is how output data, insights, and actions will be presented to the project stakeholders for validation.

• What type of dashboarding solution will be used?
Will alerts be generated to demonstrate the viability of the project?
• Will the developer team be expected to create visualizations for results, or will one or more visualization tools be used instead?
• If using a commercial visualization solution, such as Tableau, what is the complexity of connecting the output data to it and of creating visualizations that make sense to the subject matter experts?
• How is integration with processes, such as indexing or dimensional attribute processing, achieved? This is another factor that separates commercial products from pure open source libraries.
• What steps are involved in optimizing real-time analytics? Can dimensional pre-calculations or some type of indexing be added or adjusted? And, again, is optimization manual or automated via tools?

Creating the Proof of Concept

The first consideration in the proof-of-concept phase of a streaming project is revalidating the first few business questions the solution should address and what action the organization would like to take as a result of the insights it creates. For example, in an IoT smart-building sensor data-streaming project, typical questions are: Which building sections or rooms seem to be outside of normal for HVAC, and is there any related data (changes in weather, number of people in the area, etc.) that normally correlates to such a change? Or does this seem to be an anomaly that requires investigation by a technician? If a technician is required, automatically schedule the appointment.

Next, begin the proof of concept (POC), which will be driven by the data sources and the methods of making sense of these streams. Developers will be eager to build a POC at this point. Understanding developer knowledge of the systems being proposed is key. Another approach is to ask candidate vendors, "What is the developer story when creating applications on different types of streaming data systems?"
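For the smart-building questions above, the anomaly check at the heart of such a POC can be prototyped in a few lines: compare each new reading against a rolling baseline of recent readings and flag values that sit several standard deviations away. The window size and threshold here are arbitrary illustrative choices, and a production version would also join in the correlated data (weather, occupancy) mentioned above:

```python
import statistics
from collections import deque

def make_anomaly_detector(window_size=20, threshold=3.0):
    """Flag a reading as anomalous when it sits more than `threshold`
    standard deviations from a rolling baseline of recent readings.
    A sketch only; window_size and threshold are illustrative choices."""
    history = deque(maxlen=window_size)

    def check(reading):
        if len(history) < window_size:
            history.append(reading)
            return False                      # still building the baseline
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history) or 1e-9
        is_anomaly = abs(reading - mean) / stdev > threshold
        if not is_anomaly:
            history.append(reading)           # only normal readings extend baseline
        return is_anomaly

    return check

check = make_anomaly_detector(window_size=10, threshold=3.0)
normal = [21.0, 21.2, 20.8, 21.1, 20.9, 21.0, 21.3, 20.7, 21.1, 21.0]
warmup = [check(r) for r in normal]   # baseline-building readings are not flagged
steady = check(21.2)                  # near the baseline: not an anomaly
spike = check(35.0)                   # far outside the baseline: an anomaly
```

A positive result would then drive the downstream action from the POC scenario, such as automatically scheduling a technician visit.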
Management/DevOps

A new set of priorities and decisions arises after the POC has been validated and planning begins for production deployment. Top among these concerns is considering what enterprise-level services are necessary to make the solution operate and meet SLA requirements. These often include processes and tools for monitoring and maintaining security, availability, scalability, and recoverability.

Next is deciding whether to build or buy. Which open source projects and which products best meet these enterprise service requirements? Is the best solution an all-in-one or an amalgamation of various open source libraries and vendor tools?

How are production solutions monitored, scaled, and managed? Does the streaming platform have rich and deep web services and other integration methods to enable ease of integration into monitoring systems? Commercial solutions include strong support for integration with monitoring systems; open source projects may not have strong monitoring integration points.

Is the set of streams spiky? Is dynamic or even automated scaling needed? Commercial products may include such auto-scaling capabilities, including the ability to add operations into, or remove operations from, a running pipeline.

What steps are involved in optimizing the ingest pipeline? Is it a manual process? Does code have to be re-written, tested, and deployed, or does the solution include tooling that can automate this?
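The auto-scaling question can be made concrete with a toy policy: track the backlog per partition and scale the partition count out or in against a target. The thresholds and the doubling/halving strategy are invented for illustration only; commercial platforms implement such policies internally:

```python
def scale_decision(queue_lag, partitions, target_lag_per_partition=1000,
                   min_partitions=1, max_partitions=32):
    """Toy auto-scaling policy for a spiky stream: grow the number of
    parallel partitions when backlog per partition exceeds a target,
    shrink when it falls well below. A sketch of the idea only."""
    lag_per_partition = queue_lag / partitions
    if lag_per_partition > target_lag_per_partition:
        return min(partitions * 2, max_partitions)       # scale out
    if lag_per_partition < target_lag_per_partition / 4 and partitions > min_partitions:
        return max(partitions // 2, min_partitions)      # scale in
    return partitions                                    # hold steady

quiet = scale_decision(queue_lag=500, partitions=4)      # 125 per partition: scale in
spike = scale_decision(queue_lag=40_000, partitions=4)   # 10,000 per partition: scale out
steady = scale_decision(queue_lag=2_000, partitions=4)   # 500 per partition: hold
```

Even in this toy form, the policy illustrates why automated scaling matters for spiky streams: decisions must be made continuously against live metrics, which is tedious and error-prone to operate by hand.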
Key Findings

The addition of a streaming platform to a big data stack adds significant complexity to big data solutions. Given this, a solid understanding of the technology choices around streaming data solutions is essential for designing and delivering solutions that provide business value to the organization.

• Consider use of commercial streaming solutions for complex ESP projects. Match the solution complexity to component maturity. Consider the volume, velocity, and variety of both data and data streams. Consider the true costs of building on top of pure open source Hadoop projects versus buying vendor tools and solutions that include Hadoop plus enterprise services.
• Teams that have already implemented pure open source Hadoop solutions are the most capable of adding pure open source streaming solutions. Match a team's skill level to solution complexity and component maturity. Know whether the team has production experience working with Hadoop solutions and has deployed Hadoop solutions with low time to market. Know what programming languages they use, what their expectations are about integrating this new project into their current development environment, and what their requirements are around monitoring and scaling. Do they expect to work with tools, or are they comfortable working with libraries from the command line?
• Select tools, or plan for coding libraries, to perform the types of analytics required. Match the types of analysis expected on the event stream data. Organizations must understand whether they will be performing simple aggregations or anticipate using machine learning algorithms or some other type of predictive component. Know whether a team should write this logic or whether the organization will purchase a solution that integrates or contains these components. Figure out how much of the analytics must be returned in real time.
• Test solutions at production levels of load during the proof-of-concept phase. Determine whether to host the solution on premises, in the cloud, or as a hybrid project. Examine different vendors' cloud solutions. Has the solution been tested at scale on the target cloud(s)? Know the potential service costs.

• Select tools, or plan to code, an appropriate type of visualization solution. If the organization plans to build its own visualization solution, the technical team must have the talent to create what is envisioned. If not, what is the cost of hiring, retraining, or getting additional help to implement this part of the solution?
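Testing at production load during the POC can start with something as simple as a synthetic event generator that measures sustained throughput through a pipeline entry point. A rough sketch (the event shape, batch sizes, and the `process` callback are illustrative assumptions; a real test would target the actual ingest endpoint):

```python
import json
import time

def generate_load(events_per_batch, batches, process):
    """Push synthetic JSON events through `process` and report throughput."""
    sent = 0
    start = time.perf_counter()
    for _ in range(batches):
        for i in range(events_per_batch):
            event = json.dumps({"id": sent, "ts": time.time(), "value": i})
            process(event)  # stand-in for the pipeline's ingest call
            sent += 1
    elapsed = time.perf_counter() - start
    return sent, sent / elapsed  # total events, events/second

received = []
total, rate = generate_load(1000, 5, received.append)
print(total)  # 5000 synthetic events pushed through the pipeline stand-in
```

Comparing the achieved rate against the expected production event volume, on each candidate cloud, is what turns "tested at scale" from a vendor claim into a measured result.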
Summary Comparisons

Core Streaming Engine Capabilities

| Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase |
| --- | --- | --- | --- | --- | --- |
| In-memory compute engine | Yes | Yes | Yes | Yes | Yes |
| Native Hadoop architecture | No | Yes | Yes | No | No |
| Sub-second event processing | Yes | No | Yes | Yes | Yes |
| Extremely linear scalability (billion(s) of events/second) | No | No | Yes | No | No |
| Stream partitioning for parallel processing | No | No | Yes | No | No |
| Auto-scaling input and event processing | No | No | Yes | Yes | Yes |
| Event processing guarantees | Only once; at least once; at most once | Only once; at least once; at most once | Only once; at least once; at most once | At most once | Only once; at most once |
| Event order guaranteed | No | No | Yes | No | No |
| End-to-end stateful fault tolerance | No | No | Yes | No | No |
| Incremental recovery | No | No | Yes | No | No |
| Dynamic application updates | No | No | Yes | No | No |
| Data loss potential | Yes | Yes | No | Yes | Yes |
| Complete separation of business logic, event acknowledgement, and fault tolerance | No | No | Yes | No | No |
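The delivery guarantees compared above have practical consequences for application code: an at-least-once engine may redeliver events after a failure, so downstream logic must deduplicate (or be idempotent) to approximate only-once results. A minimal sketch, independent of any particular engine (the event tuples and function name are illustrative assumptions):

```python
def consume(events):
    """Sum event amounts under at-least-once delivery.

    `events` may contain redeliveries after a failure; tracking processed
    ids makes the aggregation idempotent, approximating only-once results.
    """
    seen = set()
    total = 0
    for event_id, amount in events:
        if event_id in seen:
            continue            # redelivered event: already counted
        seen.add(event_id)
        total += amount
    return total

# Event 2 is redelivered after a simulated failure:
stream = [(1, 10), (2, 5), (2, 5), (3, 7)]
print(consume(stream))  # prints 22, not 27 -- the duplicate is ignored
```

Engines that guarantee only-once processing end to end absorb this burden into the platform; with at-least-once or at-most-once engines, the application team owns either deduplication logic or the tolerance for lost and duplicated events.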
Streaming Application Development Tools

| Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase |
| --- | --- | --- | --- | --- | --- |
| Native application programming language | Java | Scala | Java API | Java; proprietary streams query language | Proprietary StreamSQL; EventFlow |
| Open source pre-built connectors | <10 | <10 | >75 | No | No |
| Graphical application builder | No | No | Yes (beta) | Yes | Yes |
| Visual real-time dashboard | No | No | Yes (beta) | No | Yes |

Operations and Management Tools

| Capability | Storm | Spark Streaming | DataTorrent RTS | IBM InfoStreams | TIBCO StreamBase |
| --- | --- | --- | --- | --- | --- |
| Fully functional GUI-based management | No | No | Yes | Yes | Yes |
| Ease of integration with monitoring systems | No | No | Yes | Yes | Yes |
| Ease of integration with external systems via REST APIs | No | No | Yes | Yes | Yes |
| Simple install and upgrade | No | No | Yes | No | No |
About Lynn Langit

Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more than 15 years. Over the past 4 years, she has worked as an independent architect using these technologies, mostly in the biotech, education, manufacturing, and facilities verticals. Lynn has done POCs and has helped teams build solutions on the AWS, Azure, Google, and Rackspace clouds. She has worked with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB, Neo4j, and many other database systems. In addition to building solutions, Lynn partners with all major cloud vendors, providing early technical feedback on their Big Data and Cloud offerings. She is a Google Developer Expert (Cloud), a Microsoft MVP (SQL Server), and a MongoDB Master. Lynn is also a Cloudera certified instructor (for MapReduce programming). Prior to re-entering the consulting world 4 years ago, Lynn spent more than 10 years as a Microsoft Certified instructor and Microsoft vendor, and then 4 years as a Microsoft employee. She has published 3 books on SQL Server Business Intelligence and most recently worked with the SQL Azure team at Microsoft. She continues to write and screencast, and hosts a Big Data channel on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 technical videos on Cloud and Big Data topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).