HUG (Hadoop User Group) meetup of January 29, 2015, at HP.
Slide decks of the three talks below:
#1: Processing unstructured data (videos, images, …) with Haven for Hadoop,
#2: Apache Flink: Fast and Reliable Large-scale Data Processing,
#3: Case study: a Hadoop project in the HR domain with Capgemini.
Document vectorization: making unstructured information comparable, new opportunities for an employment-sector player
38. What is Flink?
Collection programming APIs for batch and real-time streaming analysis,
backed by a very robust execution backend:
• true streaming capabilities,
• a custom memory manager,
• native iteration execution,
• and a cost-based optimizer.
39. The case for Flink
Performance and ease of use
• Exploits in-memory execution and pipelining; language-embedded logical APIs
Unified batch and real-time streaming
• Batch and Stream APIs on top of a streaming engine
A runtime that "just works" without tuning
• C++-style memory management inside the JVM
Predictable and dependable execution
• Bird's-eye view of what runs and how, and what failed and why
40. Example: WordCount
case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
env.readTextFile(...)
   .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
   .groupBy("word").sum("frequency").print()
env.execute()

Flink has mirrored Java and Scala APIs that offer the same functionality, including by-name addressing.
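Since the DataSet operators mirror Scala collection methods, the same WordCount logic can be sketched on a plain local collection with no cluster or Flink dependency. This is an illustrative analogue, not Flink code; the `wordCount` helper name and the sample input are ours.

```scala
// Plain-Scala analogue of the slide's WordCount: flatMap / groupBy /
// per-group aggregation, exactly as in the DataSet API, but on a local Seq.
object CollectionWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                 // ~ DataSet.flatMap
      .groupBy(identity)                     // ~ DataSet.groupBy("word")
      .map { case (w, ws) => (w, ws.size) }  // ~ sum("frequency")

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or", "not to be")))
    // → Map(to -> 2, be -> 2, or -> 1, not -> 1)
}
```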
41. Example: Window WordCount
case class Word(word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.fromSocketStream(...)
lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Count.of(100)).every(Count.of(10))
  .groupBy("word").sum("frequency").print()
env.execute()
42. Defining windows
Trigger policy
• When to trigger the computation on the current window
Eviction policy
• When data points should leave the window
• Defines the window width/size
E.g., count-based policy:
• evict when #elements > n
• start a new window every n-th element
Built-in: Count, Time, and Delta policies
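The trigger/eviction interplay can be sketched in plain Scala (this is a simulation of the count-based policies, not the Flink API; `slidingCounts` and its parameters are illustrative names):

```scala
// Count-based window sketch: eviction caps the window at `size` elements,
// the trigger fires the word-count computation every `slide`-th element.
object CountWindowSketch {
  def slidingCounts(events: Seq[String], size: Int, slide: Int): Seq[Map[String, Int]] = {
    val results = scala.collection.mutable.Buffer[Map[String, Int]]()
    var window = Vector.empty[String]
    var sinceTrigger = 0
    for (e <- events) {
      window = (window :+ e).takeRight(size)  // eviction: drop oldest when #elements > size
      sinceTrigger += 1
      if (sinceTrigger == slide) {            // trigger: compute on the current window
        results += window.groupBy(identity).map { case (w, ws) => (w, ws.size) }
        sinceTrigger = 0
      }
    }
    results.toSeq
  }

  def main(args: Array[String]): Unit =
    println(slidingCounts(Seq("a", "b", "a", "c", "a", "b"), size = 4, slide = 2))
}
```

With `size = 4` and `slide = 2`, a result is emitted every second element over (at most) the last four elements, mirroring `.window(Count.of(100)).every(Count.of(10))` on the previous slide.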
43. Flink API in a nutshell
map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...
All Hadoop input formats are supported
The API is similar for data sets and data streams, with slightly different operator semantics
Window functions for data streams
Counters, accumulators, and broadcast variables
44. Flink stack
[Diagram: the Flink stack. The Scala API, Java API, Python API (upcoming), Graph API (Gelly), and Apache MRQL sit on a Common API, served by the Flink Optimizer and the Flink Stream Builder. The Flink local runtime executes in an embedded environment (Java collections), a local environment (for debugging), or a remote environment (regular cluster execution), as single-node execution or on a standalone or YARN cluster, optionally via Apache Tez. Data storage: HDFS, files, S3, JDBC, Flume, RabbitMQ, Kafka, HBase, …]
45. Technology inside Flink
Technology inspired by compilers + MPP databases + distributed systems
For ease of use, reliable performance, and scalability

case class Path(from: Long, to: Long)

val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) => Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}

[Diagram: pre-flight (client): cost-based optimizer, type extraction stack. Master: task scheduling, recovery metadata. Workers: memory manager, out-of-core algorithms, data serialization stack, streaming network stack, real-time streaming, ...]
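The transitive-closure iteration above can be simulated in plain Scala as a fixed-point loop over an in-memory edge set (an illustrative analogue, not Flink's `iterate` API; the `closure` helper is ours):

```scala
// Plain-Scala version of the slide's iteration: repeatedly join known
// paths with edges, union with the existing paths, deduplicate (Set),
// and stop at a fixed point or after `maxIterations` rounds.
object TransitiveClosure {
  case class Path(from: Long, to: Long)

  def closure(edges: Set[Path], maxIterations: Int = 10): Set[Path] = {
    var paths = edges
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val next = for {
        p <- paths
        e <- edges
        if p.to == e.from        // ~ .where("to").equalTo("from")
      } yield Path(p.from, e.to)
      val updated = paths ++ next  // ~ .union(paths).distinct()
      changed = updated != paths
      paths = updated
      i += 1
    }
    paths
  }

  def main(args: Array[String]): Unit =
    println(closure(Set(Path(1L, 2L), Path(2L, 3L), Path(3L, 4L))))
}
```

On the chain 1→2→3→4 the fixed point adds the paths 1→3, 2→4, and 1→4.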
46. Notable runtime features
1. Pipelined data transfers
2. Management of memory
3. Native iterations
4. Program optimization
48. Staged (batch) execution
[Diagram: a grep example over the text "Romeo, Romeo, where art thou Romeo?". Stage 1: load the log and create/cache it. Subsequent stages: grep the log for matches (search for str1, str2, str3 in Grep 1–3). Caching is in-memory, spilling to disk if needed.]
50. Pipelining in Flink
Currently the default mode of operation
• Much better performance in many cases – no need to materialize large data sets
• Supports both batch and real-time streaming
In the future, pluggable:
• Batch will use a combination of blocking and pipelining
• Streaming will use pipelining
• Interactive will use blocking
52. Memory management in Flink

public class WC {
  public String word;
  public int count;
}

[Diagram: a pool of empty memory pages. Managed memory (sorting, hashing, caching; shuffling, broadcasts) draws pages from the pool; user-code objects live in unmanaged memory.]
Flink contains its own memory management stack. Memory is
allocated, de-allocated, and used strictly using an internal buffer pool
implementation. To do that, Flink contains its own type extraction and
serialization components.
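The idea of serializing records into fixed-size pages rather than keeping them as JVM objects can be illustrated with a small sketch (this is not Flink's actual memory-segment code; `PagedRecords`, the page size, and the record layout are our assumptions for the demo):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Records like WC(word, count) are written into a fixed-size byte page:
// [word length: Int][word bytes][count: Int]. The runtime, not the GC,
// controls layout, and a full page signals when to spill or allocate.
object PagedRecords {
  val PageSize = 64  // tiny page for the demo; Flink uses much larger segments

  def writeRecord(page: ByteBuffer, word: String, count: Int): Boolean = {
    val bytes = word.getBytes(StandardCharsets.UTF_8)
    if (page.remaining() < 4 + bytes.length + 4) false  // page full
    else {
      page.putInt(bytes.length).put(bytes).putInt(count)
      true
    }
  }

  def readRecord(page: ByteBuffer): (String, Int) = {
    val len = page.getInt
    val bytes = new Array[Byte](len)
    page.get(bytes)
    (new String(bytes, StandardCharsets.UTF_8), page.getInt)
  }

  def main(args: Array[String]): Unit = {
    val page = ByteBuffer.allocate(PageSize)
    writeRecord(page, "flink", 3)
    writeRecord(page, "hadoop", 1)
    page.flip()  // switch from writing to reading
    println(readRecord(page))
    println(readRecord(page))
  }
}
```

Operating on serialized bytes is what lets sorting and hashing work directly on pages and go out-of-core gracefully.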
53. Configuring Flink
Per job
• Parallelism
System config
• Total JVM heap size (-Xmx)
• % of the total JVM size for the Flink runtime
• Memory for network buffers (soon not needed)
That's all you need. The system will not throw an OOM exception at you.
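In configuration-file terms, the three knobs above might look like the following flink-conf.yaml fragment (key names as remembered from Flink's 0.8-era documentation; verify against your version's docs, and the values here are arbitrary):

```yaml
taskmanager.heap.mb: 4096                  # total JVM heap per TaskManager (-Xmx)
taskmanager.memory.fraction: 0.7           # share of the heap given to Flink's managed memory
taskmanager.network.numberOfBuffers: 2048  # memory for network buffers
parallelism.default: 8                     # default parallelism, overridable per job
```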
54. Benefits of managed memory
More reliable and stable performance (fewer GC effects, easy to spill to disk)
58. Iterate natively with deltas
[Diagram: delta iteration. An initial solution and an initial workset feed the iteration; each step joins the workset with other data sets (X, Y, A, B), produces a delta set, merges the deltas into the partial solution, and the deltas replace the workset for the next step, until the iteration result is reached.]
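The delta-iteration pattern can be sketched in plain Scala for min-label propagation (connected components) on an undirected edge list. This simulates the solution-set/workset mechanics, not Flink's `iterateDelta` API; `components` and its internals are illustrative names:

```scala
// Delta iteration: keep a solution (vertex -> component label) and a
// workset of vertices whose label changed; each step sends changed labels
// to neighbors, keeps only strictly smaller labels as deltas, merges them
// into the solution, and the deltas become the next workset.
object DeltaIterationSketch {
  def components(vertices: Set[Long], edges: Set[(Long, Long)]): Map[Long, Long] = {
    val neighbors: Map[Long, Set[Long]] =
      (edges ++ edges.map(_.swap))
        .groupBy(_._1).map { case (v, es) => (v, es.map(_._2)) }

    var solution: Map[Long, Long] = vertices.map(v => v -> v).toMap  // initial solution
    var workset: Map[Long, Long] = solution                          // initial workset
    while (workset.nonEmpty) {
      val candidates = for {
        (v, label) <- workset.toSeq
        n <- neighbors.getOrElse(v, Set.empty[Long])
      } yield (n, label)
      val deltas = candidates
        .groupBy(_._1).map { case (v, ls) => (v, ls.map(_._2).min) }
        .filter { case (v, l) => l < solution(v) }  // only real improvements
      solution = solution ++ deltas  // merge deltas into the partial solution
      workset = deltas               // deltas replace the workset
    }
    solution
  }
}
```

Because only changed vertices re-enter the workset, work shrinks every round, which is exactly why native delta iterations beat re-running a bulk iteration over the full solution.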
62. Flink roadmap for 2015
Unify batch and streaming
Machine learning library and Mahout
Graph processing library improvements
Interactive programs and Zeppelin
Logical queries and SQL
And many more
63. Thank you for your invitation
Check out the project website: http://flink.apache.org
news@flink.apache.org and other mailing lists
Twitter: @ApacheFlink
Feedback & contributions welcome