IBM Stream au Hadoop User Group
Modern Data Stack France • 2 likes • 2,827 views
IBM Stream au Hadoop User Group
1. Big Data
Jerome Chailloux, Big Data Specialist (jerome.chailloux@fr.ibm.com)
© 2011 IBM Corporation
2. Imagine the Possibilities of Analyzing All Available Data
Faster, more comprehensive, less expensive:
- Real-time fraud & risk detection
- Understand and act on customer sentiment
- Traffic flow optimization
- Accurate and timely threat detection
- Predict and act on intent to purchase
- Low-latency network analysis
3. Where is this data coming from?
- Every day, the New York Stock Exchange captures 1 TB of trade information.
- 12 TB of tweets are created each day.
- 5 billion mobile phones were in use in 2010; only 12% were smartphones.
- Every second of HD video generates more than 2,000 times as many bytes as required to store a single page of text.
- More than 30M networked sensors, growing at a rate of more than 30% per year.
What is your business doing with it?
(Source: McKinsey & Company, May 2011)
4. Why is Big Data important?
The gap between the data AVAILABLE to an organization and the data an organization can PROCESS is a missed opportunity. Organizations are able to process less and less of the available data, and enterprises are "more blind" to new opportunities.
5. What does a Big Data platform do?
- Analyze a variety of information: novel analytics on a broad set of mixed information that could not be analyzed before.
- Analyze information in motion: streaming data analysis; large-volume data bursts and ad hoc analysis.
- Analyze extreme volumes of information: cost-efficiently process and analyze petabytes of information; manage and analyze high volumes of structured, relational data.
- Discover & experiment: ad hoc analytics, data discovery, and experimentation.
- Manage & plan: enforce data structure, integrity, and control to ensure consistency for repeatable queries.
6. Complementary Approaches for Different Use Cases
Traditional approach (structured, analytical, logical):
- Data Warehouse: structured, repeatable, linear
- Traditional sources: transaction data, internal app data, mainframe data, OLTP system data, ERP data
- Examples: monthly sales reports, profitability analysis, customer surveys
New approach (creative, holistic thought, intuition):
- Hadoop / Streams: unstructured, exploratory, iterative
- New sources: web logs, social data, text data (emails), sensor data (images), RFID
- Examples: brand sentiment, product strategy, maximum asset utilization
The two sides are connected through enterprise integration.
7. IBM Big Data Strategy: Move the Analytics Closer to the Data
New analytic applications (BI/reporting, exploration/visualization, functional apps, industry apps, predictive analytics, content analytics) drive the requirements for a big data platform. The IBM Big Data Platform comprises visualization & discovery, application development, systems management, accelerators, a Hadoop system, stream computing, a data warehouse, and information integration & governance, in order to:
- Integrate and manage the full variety, velocity, and volume of data
- Apply advanced analytics to information in its native form
- Visualize all available data for ad hoc analysis
- Provide a development environment for building new analytic applications
- Optimize and schedule workloads
- Provide security and governance
8. Most Client Use Cases Combine Multiple Technologies
- Pre-processing: ingest and analyze unstructured data types and convert them to structured data.
- Combine structured and unstructured analysis: augment the data warehouse with additional external sources, such as social media.
- Combine high-velocity and historical analysis: analyze and react to data in motion; adjust models with deep historical analysis.
- Reuse structured data for exploratory analysis: experimentation and ad hoc analysis with structured data.
9. IBM is in a lead position to exploit the Big Data opportunity
February 2012: "The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012."
IBM differentiation:
- Embracing open source
- Data in motion (Streams) and data at rest (Hadoop/BigInsights)
- Tight integration with other Information Management products
- Bundled, scalable analytics technology
- Hardened Apache Hadoop for enterprise readiness
10. IBM's unique strengths in Big Data
- Big Data in real time: ingest, analyze, and act on massive volumes of streaming data. Faster AND more cost-effective for specific use cases (10x the volume of data on the same hardware).
- Fit-for-purpose analytics: analyzes a variety of data types in their native format: text, geospatial, time series, video, audio, and more.
- Enterprise class: open source enhanced for reliability, performance, and security; high-performance warehouse software and appliances.
- Ease of use: end-user, admin, and development UIs.
- Integration: integration into your IM architecture; pre-integrated analytic applications.
11. Stream Computing: What is it good for?
Analyze all your data, all the time, just in time.
- What if you could get IMMEDIATE insight?
- What if you could analyze MORE kinds of data?
- What if you could do it with exceptional performance?
Inputs (traditional data, sensor events, signals) flow through streaming analysis to produce analytic results: alerts, threat prevention, more context, logging, active response, and feeds into storage and warehousing.
12. What is Stream Processing?
Relational databases and warehouses find information stored on disk; stream computing analyzes data before you store it. Databases find the needle in the haystack; Streams finds the needle as it's blowing by.
13. Without Streams / With Streams
Without Streams, developers hand-build everything:
- Intensive scripting and embedded SQL
- File / storage management by hand; record management embedded in application code
- Data buffering and locality, security, high availability
- Dynamic application composition
- Application management (checkpointing, performance optimization, monitoring, workload management, error and event handling)
- Applications tied to specific hardware and infrastructure
- Multithreading / multiprocessing, debugging
- Migration from development to production
- Integration of best-of-breed commercial tools, code reusability, source/target interfaces
With Streams, the Streams Processing Language provides a productive and reusable development environment, and the Streams runtime provides the application infrastructure.
"TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language." – Alex Philp, TerraEchos
14.
Streams
15. How Streams Works
Continuous ingestion, continuous analysis: operators filter/sample, transform, annotate, correlate, and classify data as it flows.
The infrastructure provides services for scheduling analytics across hardware hosts and for establishing streaming connectivity.
Scale is achieved by partitioning applications into software components and by distributing them across stream-connected hardware hosts. Where appropriate, elements can be fused together for lower communication latency.
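The continuous filter/transform/annotate flow described above can be sketched as a chain of lazy operators. This is a minimal Python illustration (not SPL, and not the IBM runtime); the operator and attribute names are hypothetical:

```python
# Minimal Python sketch (not SPL) of a continuous pipeline of stream
# operators: each operator consumes tuples lazily and yields results
# downstream, so data is analyzed as it flows, before it is stored.

def filter_op(stream, predicate):
    """Drop tuples that do not satisfy the predicate."""
    for t in stream:
        if predicate(t):
            yield t

def transform_op(stream, fn):
    """Apply a function to every tuple."""
    for t in stream:
        yield fn(t)

def annotate_op(stream, key, fn):
    """Add a derived attribute to each tuple."""
    for t in stream:
        t = dict(t)
        t[key] = fn(t)
        yield t

# Source: an (in principle unbounded) sequence of sensor readings.
readings = ({"id": i, "temp": 15 + i} for i in range(10))

pipeline = annotate_op(
    transform_op(
        filter_op(readings, lambda t: t["temp"] > 20),
        lambda t: {**t, "temp_f": t["temp"] * 9 / 5 + 32},
    ),
    "alert", lambda t: t["temp"] > 23,
)

results = list(pipeline)  # only materialized here, for demonstration
```

In a real Streams deployment the equivalent operators would be deployed as processing elements across hosts; here the chaining of generators only illustrates the dataflow idea.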
16. Scalable Stream Processing
Streams programming model: construct a graph.
- A mathematical concept (not a line, bar, or pie chart!), also called a network; familiar: for example, a tree structure is a graph.
- Consisting of operators and the streams that connect them: the vertices (or nodes) and edges of the mathematical graph.
- A directed graph: the edges have a direction (arrows).
Streams runtime model: distributed processes.
- A single operator or multiple operators form a Processing Element (PE).
- Compiler and runtime services make it easy to deploy PEs on one machine, or across multiple hosts in a cluster when scaled-up processing is required.
- All links and data transport are handled by runtime services: automatically, with manual placement directives where required.
17. InfoSphere Streams Objects: Runtime View
- Instance: runtime instantiation of InfoSphere Streams executing across one or more hosts; a collection of components and services.
- Processing Element (PE): the fundamental execution unit that is run by the Streams instance; can encapsulate a single operator or many "fused" operators.
- Job: a deployed Streams application executing in an instance; consists of one or more PEs.
18. InfoSphere Streams Objects: Development View
- Operator: the fundamental building block of the Streams Processing Language; operators process data from streams and may produce new streams.
- Stream: an infinite sequence of structured tuples; can be consumed by operators on a tuple-by-tuple basis or through the definition of a window.
- Tuple: a structured list of attributes and their types (e.g., height: 640, width: 480, data: ...); each tuple on a stream has the form dictated by its stream type.
- Stream type: specification of the name and data type of each attribute in the tuple.
- Window: a finite, sequential group of tuples, based on count, time, attribute value, or punctuation marks.
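The count-based windows mentioned above can be sketched in plain Python (not SPL; sizes and stream contents here are illustrative). A sliding count window emits the last N tuples each time a new tuple arrives once it is full; a tumbling window emits N tuples and then starts over:

```python
from collections import deque

def sliding_count_window(stream, size):
    """Yield the current window (the last `size` tuples) once full,
    each time a new tuple arrives: a count-based sliding window."""
    window = deque(maxlen=size)
    for t in stream:
        window.append(t)
        if len(window) == size:
            yield list(window)

def tumbling_count_window(stream, size):
    """Collect `size` tuples, emit them as one batch, then start over."""
    batch = []
    for t in stream:
        batch.append(t)
        if len(batch) == size:
            yield batch
            batch = []

prices = [10, 11, 12, 13, 14]
sliding = list(sliding_count_window(prices, 3))
tumbling = list(tumbling_count_window(prices, 3))
```

With the five prices above, the sliding window emits three overlapping windows, while the tumbling window emits a single complete batch (the last two tuples never fill a batch).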
19. What is Streams Processing Language?
- Designed for stream computing: define a streaming-data flow graph; rich set of data types to define tuple attributes.
- Declarative: operator invocations name the input and output streams; referring to streams by name is enough to connect the graph.
- Procedural support: full-featured C++/Java-like language; custom logic in operator invocations; expressions in attribute assignments and parameter definitions.
- Extensible: user-defined data types; custom functions written in SPL or a native language (C++ or Java); custom operators written in SPL; user-defined operators written in C++ or Java.
20. Some SPL Terms
- An operator represents a class of manipulations of tuples from one or more input streams to produce tuples on one or more output streams (e.g., Aggregate).
- A stream connects to an operator on a port; an operator defines input and output ports (e.g., an Employee Salary Info stream feeding an Aggregate that produces Statistics).
- An operator invocation is a specific use of an operator, with specific assigned input and output streams and locally specified parameters, logic, etc.
- Many operators have one input port and one output port; others have zero input ports (source adapters, e.g., TCPSource), zero output ports (sink adapters, e.g., FileSink), multiple output ports (e.g., Split), or multiple input ports (e.g., Join).
- A composite operator is a collection of operators: an encapsulation of a subgraph of primitive (non-composite) operators and composite operators (nested); similar to a macro in a procedural language.
21.
Composite Operators
Every graph is encoded as a composite
– A composite is a graph of one or more operators
– A composite may have input and output ports
– Source code construct only
• Nothing to do with operator fusion (PEs)

    composite Main {
        graph
            stream … { }
            stream … { }
            . . .
    }

Each stream declaration in the composite
– Invokes a primitive operator or
– another composite operator
An application is a main composite
– No input or output ports
– Data flows in and out, but not on streams within a graph
– Streams may be exported to and imported from other applications running in the same instance
[Diagram: application (logical view) — operators connected by Streams 1–5]
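A composite with input and output ports can be sketched as below (operator and attribute names are hypothetical); once defined, it is invoked like any other operator.

```spl
// Sketch of a composite operator with one typed input and one typed
// output port. Names and the tagging logic are illustrative only.
composite CleanAndTag(input stream<rstring msg> In;
                      output stream<rstring msg, rstring tag> Out) {
    graph
        // The stream declared with the output port's name feeds that port
        stream<rstring msg, rstring tag> Out = Functor(In) {
            output Out : tag = "clean";
        }
}

// Usage inside another graph (Raw is an existing stream of the right type):
//   stream<rstring msg, rstring tag> Tagged = CleanAndTag(Raw);
```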
22.
Anatomy of an Operator Invocation
Operators share a common structure
– <> are sections to fill in

Syntax:
    stream<stream-type> stream-name = MyOperator(input-stream; …) {
        logic   logic ;
        window  windowspec ;
        param   parameters ;
        output  output ;
        config  configuration ;
    }

Reading an operator invocation
– Declare a stream stream-name
– with attributes from stream-type
– that is produced by MyOperator
– from the input(s) input-stream
– MyOperator behavior defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output

Example:
    stream<rstring item> Sale = Join(Bid; Ask) {
        window Bid : sliding, time(30);
               Ask : sliding, count(50);
        param  match : Bid.item == Ask.item && Bid.price >= Ask.price;
        output Sale  : item = Bid.item;
    }

For the example:
– Declare the stream Sale with the attribute item, which is a raw string
– Join Bid and Ask streams with
– sliding windows of 30 seconds on Bid, and 50 tuples of Ask
– When items are equal, and Bid price is greater than or equal to Ask price
– Output the item value on the Sale stream
23.
Streams V2.0 Data Types
(any)
– (primitive)
• boolean, enum, timestamp, blob
• (numeric)
  – (integral): (signed) int8, int16, int32, int64; (unsigned) uint8, uint16, uint32, uint64
  – (floatingpoint): (float) float32, float64, float128; (decimal) decimal32, decimal64, decimal128
  – (complex): complex32, complex64, complex128
• (string): rstring, ustring
– (composite)
• tuple
• (collection): list, set, map
24.
Stream and Tuple Types
Stream type (often called “schema”)
– Definition of the structure of the data flowing through the stream
Tuple type definition
– tuple<sequence of attributes>, e.g., tuple<uint16 id, rstring name>
• Attribute: a type and a name
• Nesting: any attribute may be another tuple type
Stream type is a tuple type
– stream<sequence of attributes>, e.g., stream<uint16 id, rstring name>
Indirect stream type definitions
– Fully defined within the output stream declaration
    stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…) {…}
– Reference a tuple type
    CallInfo = tuple<uint32 callerNum, … rstring endTime, list<uint32> mastIDs>;
    stream<CallInfo> InternationalCalls = Op(…) {…}
– Reference another stream
    stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…) {…}
    stream<Calls> RoamingCalls = Op(…) {…}
25.
Collection Types
list: array with bounds-checking, e.g., [0, 17, age-1, 99]
– Random access: can access any element at any time
– Ordered, zero-based indexing: first element is someList[0]
set: unordered collection, e.g., {"cats", "yeasts", "plankton"}
– No duplicate element values
map: key-to-value mappings, e.g., {"Mon":0, "Sat":99, "Sun":-1}
– Unordered
Use type constructors to specify element type
– list<type>, set<type>, e.g., list<uint16>, set<rstring>
– map<key-type, value-type>, e.g., map<rstring[3], int8>
Can be nested to any number of levels
– map<int32, list<tuple<ustring name, int64 value>>>
– {1 : [{"Joe", 117885}, {"Fred", 923416}], 2 : [{"Max", 117885}], -1 : []}
Bounded collections optimize performance
– list<int32>[5]: at most 5 (32-bit) integer elements
– Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters
26.
The Functor Operator
Transforms input tuples into output tuples
– One input port
– One or more output ports
May filter tuples
– Parameter filter
– A boolean expression
– If true, emit output tuple; if false, do not
Arbitrary attribute assignments
– Full-blown expressions
– Including function calls
– Drop, add, transform attributes
– Omitted attributes auto-assigned
Custom logic supported
– logic clause
– May include state
– Applies to filter and assignments

Example:
    stream<rstring name, uint32 age, uint64 salary> Person = Op(…) {}

    stream<rstring name, uint32 age, rstring login,
           tuple<boolean young, boolean rich> info> Adult = Functor(Person) {
        param  filter : age >= 21u;
        output Adult  : login = lower(name),
                        info = {young = (age < 30u), rich = (salary > 100000ul)};
    }

[Diagram: Person (name, age, salary) → Functor → Adult (name, age, login, info)]
27.
The FileSink Operator
Writes tuples to a file
Has a single input port
– No output port: data goes to a file, not a Streams stream
Selected Parameters
– file
• Mandatory
• Base for relative paths is the data subdirectory
• Directories must already exist
– flush
• Flush the output buffer after a given number of tuples
– format
• csv: comma-separated values
• txt, line, binary, block

Example:
    () as Sink = FileSink(StreamIn) {
        param
            file   : "/tmp/people.dat";
            format : csv;
            flush  : 20u;
    }
28.
Communication Between Streams Applications
Streams jobs exchange data with the outside world
– Source- and Sink-type operators
– Can also be used between Streams jobs (e.g., TCPSource/Sink)
Streams jobs can exchange data with each other
– Within one Streams instance
– Supports Dynamic Application Composition
– By name or based on properties (tags)
– One job exports a stream; another imports it
– Implemented using two new pseudo-operators: Export and Import
[Diagram: Job 1 (source → operator → sink) exports a stream that Job 2 (operator → sink) imports]
29.
Application Design – Dynamic Stream Properties
API available for toolkit development
Can add/modify/delete
– Exported stream properties
– Imported stream subscription expression
Dynamic Job Flow Control Bus Pattern
– Operators within jobs interpret control stream tuples
– Rewire the flow of data from job to job
[Diagram: an exported control stream carries flow-control tuples such as [A,B,C] that route the data stream across Jobs A, B, C, and D]
30.
Application Design – Dynamic Stream Properties (continued)
[Diagram: same flow-control bus; a second control tuple [A,C,D] reroutes the data stream from Job A to Jobs C and D, bypassing Job B]
31.
Application Design – Multi-job Design
Streams Instance: stream1
Job: imagefeeder
– DirectoryScan (filename) → ImageSource (timestamp + file metadata) → Functor (file metadata) → Export
– Exported stream properties: name = "Feed", type = "Image", write = "ok"
Job: imagewriter
– Import → ImageSink → FileSink
– Import subscription: type == "Image" && write == "ok"
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
32.
Application Design – Multi-job Design (continued)
[Diagram: same two jobs; the exported stream now carries Image + File metadata from the Functor to the imagewriter job]
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
33.
Application Design – Multi-job Design (continued)
Job: greyscaler
– Imports with subscription name == "Feed", applies a Greyscale operator, and re-exports with properties name = "Grey", type = "Image", write = "ok"
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
34.
Application Design – Multi-job Design (continued)
Further jobs join the same instance in the same way: Job resizer, Job facial scan, Job Alerter
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
35.
Application Design – Multi-job Design (continued)
[Diagram: the complete multi-job graph — imagefeeder, greyscaler, resizer, facial scan, Alerter, and imagewriter — wired together entirely through exported stream properties and import subscriptions]
Application / Job Decomposition
– Dynamic Job Submission + Stream Import / Export
36.
Two Styles of Export/Import
Publish and subscribe (recommended approach):
– The exporting application publishes a stream with certain properties
– The importing application subscribes to an exported stream with properties satisfying a specified condition
Point to point:
– The importing application names a specific stream of a specific exporting application
Dynamic publish and subscribe:
– Export properties and Import subscription expressions can be altered during the execution of a job
– Allows dynamic data flows
– Alter the flow of data based on the data (history, trends, etc.)

Example (exporting application):
    () as ImageStream = Export(ImagesIn) {
        param properties : {
            streamName = "ImageFeed",
            dataType   = "IplImage",
            writeImage = "true"};
    }

Example (importing application):
    stream<IplImage image, rstring filename, rstring directory> ImagesIn = Import() {
        param subscription :
            dataType == "IplImage" && writeImage == "true";
    }
37.
Parallelization Patterns – Introduction
Problem Statement
– A series of operations is to be performed on a piece of data (a tuple)
– How to improve the performance of these operations?
Key Question
– Reduce latency?
• For a single piece of data
– Increase throughput?
• For the entire data flow
Three possible design patterns
– Serial Path (Pipeline)
– Parallel Operators (Task Parallelization)
– Parallel Paths (Data Parallelization)
38.
Parallelization Patterns – Pipeline, Task
Pipeline (serial path): A → B → C → D
– Base pattern: inherent in the graph paradigm
– Results arrive at D in time T(A) + T(B) + T(C)
Parallel operators (task parallelization): A, B, and C run side by side, merged by M before D
– Process the tuple in operators A, B, and C at the same time
– Requires a merger (e.g., Barrier) before operator D
– Results arrive at D in time Max(T(A), T(B), T(C)) + T(M)
– Use when the tuple latency requirement < T(A) + T(B) + T(C)
– Complexity of the merger depends on the behavior of operators A, B, and C
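The task-parallel pattern can be sketched in SPL as below. Beacon, Functor, and Barrier are Standard Toolkit operators; the schema and the three computations are illustrative assumptions, not from the deck.

```spl
// Sketch of task parallelization: three independent computations on the
// same tuple, merged by a Barrier before the downstream sink.
composite TaskParallel {
    graph
        stream<uint64 id, float64 x> In = Beacon() {
            param iterations : 100u;
            output In : id = IterationCount(), x = 1.0;
        }
        // A, B, and C each consume every tuple from In concurrently
        stream<uint64 id, float64 a> A = Functor(In) { output A : a = x * 2.0; }
        stream<uint64 id, float64 b> B = Functor(In) { output B : b = x + 1.0; }
        stream<uint64 id, float64 c> C = Functor(In) { output C : c = x * x; }
        // Barrier waits for one tuple on each port, then emits a merged tuple
        stream<uint64 id, float64 a, float64 b, float64 c> M = Barrier(A; B; C) { }
        () as D = FileSink(M) {
            param file : "merged.csv"; format : csv;
        }
}
```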
39.
Parallelization Patterns – Parallel Pipelines
Parallel pipelines (data parallelization): several A → B → C pipelines feeding D
– Migration step from the pipeline pattern
– Can improve throughput
• Especially good for variable-size data / processing time
Design Decisions
– Are there latency and/or throughput requirements?
– Do the operators perform filtering, feature extraction, transformation?
– Is there an execution order requirement?
– Is there a tuple order requirement?
Recommended: Pipeline, and Parallel Pipelines when possible
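A data-parallel sketch in SPL: ThreadedSplit distributes tuples across replicated branches and Union merges them. ThreadedSplit and Union are Standard Toolkit operators; the schema and branch work are hypothetical, and note that Union does not preserve tuple order across branches.

```spl
// Sketch of data parallelization (parallel paths) with two replicas.
composite ParallelPaths {
    graph
        stream<uint64 id> In = Beacon() {
            param iterations : 1000u;
            output In : id = IterationCount();
        }
        // Distribute tuples across two replicated pipelines
        (stream<uint64 id> P0; stream<uint64 id> P1) = ThreadedSplit(In) {
            param bufferSize : 100u;
        }
        stream<uint64 id> R0 = Functor(P0) { }  // replicated work, branch 0
        stream<uint64 id> R1 = Functor(P1) { }  // replicated work, branch 1
        // Merge the branches; arrival order is not guaranteed
        stream<uint64 id> Out = Union(R0; R1) { }
        () as Sink = FileSink(Out) {
            param file : "out.csv"; format : csv;
        }
}
```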
40.
Application Design – Multi-tier Design
N-tier design
– Number and purpose of tiers result from the application design
– Create well-defined interfaces between the tiers
Supports several overarching concepts
– Incremental development / testing
– Application / Job / Operator reuse
– Modular programming practices
Each tier in these examples may be made up of one or more jobs (programs)
[Diagram: example tiers — Ingestion → Transport Adaptation → Reduction / Transformation → Processing / Analytics → Transport Adaptation]
41.
Application Design – High Availability
HA application design pattern
– Source job exports its stream, enriched with a tuple ID
– Jobs 1 & 2 process in parallel, and export final streams
– Sink job imports the streams, discards duplicates, alerts on missing tuples
[Diagram: a Source feeds redundant copies of Job 1 and Job 2 spread across host pools of x86 hosts; a single Sink merges the results]
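The sink side of this pattern can be sketched in SPL. The subscription property, stream names, and schema are hypothetical; DeDuplicate is a Standard Toolkit operator, though its exact parameter set should be checked against the product documentation.

```spl
// Sketch of the HA sink job: import the redundant result streams and
// discard tuples whose ID has already been seen.
composite HASink {
    graph
        // Both redundant jobs export with the same (assumed) property,
        // so one Import subscription receives both copies
        stream<uint64 tupleId, rstring result> Merged = Import() {
            param subscription : kind == "haResult";
        }
        // Drop the second copy of each tuple, keyed on the tuple ID
        stream<uint64 tupleId, rstring result> Unique = DeDuplicate(Merged) {
            param key : tupleId;
        }
        () as Out = FileSink(Unique) {
            param file : "results.csv"; format : csv;
        }
}
```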
42.
Application Design – High Availability (continued)
[Diagram: the same HA pattern, showing an alternative placement of the redundant Job 1 and Job 2 copies across the host pools of x86 hosts]
43.
IBM InfoSphere Streams 3.0
Agile Development Environment
– Eclipse IDE
– Streams Live Graph
– Streams Debugger
– Over 50 samples
Distributed Runtime Environment
– Clustered runtime for massive scalability
– RHEL v5.x and v6.x, CentOS v6.x
– x86 & Power multicore hardware
– Ethernet & InfiniBand
Sophisticated Analytics with Toolkits & Adapters
– Toolkits: Front Office, Database, Text, Financial, Standard, Internet, BigData (HDFS, DataExplorer), User-defined, …
– Advanced analytics: Mining, Geospatial, Timeseries, Messaging
44.
Toolkits and Operators to Speed and Simplify Development
Standard Toolkit
– Relational operators: Filter, Sort, Functor, Join, Punctor, Aggregate
– Adapter operators: FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink
– Utility operators: Custom, Beacon, Throttle, Delay, Barrier, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Pair, Gate, JavaOp, …
– Contains the default operators shipped with the product
Internet Toolkit
– InetSource: HTTP, HTTPS, FTP, FTPS, RSS, file
Database Toolkit
– ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend
– Supports: DB2 LUW, IDS, solidDB, Netezza, Oracle, SQL Server, MySQL
Other toolkits: Financial, Data Mining, Big Data, Text
User-Defined Toolkits
– Extend the language by adding user-defined operators and functions
45.
User-Defined Toolkits
Streams supports toolkits
– Reusable sets of operators and functions
– What can be included in a toolkit?
• Primitive and composite operators
• Native and SPL functions
• Types
• Tools, documentation, samples, data, etc.
– Versioning is supported
– Define dependencies on other versioned assets (toolkits, Streams)
– Create cross-domain and domain-specific accelerators
46.
47.
A quick peek inside …
InfoSphere Streams Instance – Single Host
Management Services & Applications on one host:
– Streams Web Service (SWS)
– Streams Application Manager (SAM)
– Streams Resource Manager (SRM)
– Authorization and Authentication Service (AAS)
– Scheduler
– Recovery DB
– Name Server
– Host Controller
– Processing Element Container
– File System
48.
A quick peek inside …
InfoSphere Streams Instance – Multi-host, Management Services on a separate node
Management host:
– Streams Web Service (SWS), Streams Application Manager (SAM), Streams Resource Manager (SRM), Authorization and Authentication Service (AAS), Scheduler, Recovery DB, Name Server
Shared File System
Application Hosts (one or more), each running:
– Host Controller
– Processing Element Container
49.
A quick peek inside …
InfoSphere Streams Instance – Multi-host, Management Services on multiple hosts
Management hosts (services spread across several nodes):
– Streams Web Service (SWS), AAS, Recovery DB
– Streams Application Manager (SAM), Scheduler
– Streams Resource Manager (SRM), Name Server
Mixed Management / Application host:
– Host Controller, Processing Element Container
Shared File System
Application Hosts, each running:
– Host Controller, Processing Element Container