Data intensive applications with Apache Flink - Simone Robutti, Radicalbit

"Data intensive applications with Apache Flink" by Simone Robutti, Machine Learning Engineer @ Radicalbit

In the last 10 years, the IT industry has seen a complete revolution in the perceived value that computing has for businesses and in how engineers think about applications: in several application domains, the need for data has outgrown the capacity of commodity hardware, and the need for information has outpaced traditional processing technologies and approaches.

In this talk we'll introduce Apache Flink, a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It is an open source project that builds on proven approaches as well as innovative algorithms. We will go in depth on how this tool can be used to implement data-intensive applications, in particular the present tools and future prospects for using machine learning algorithms in a distributed context.

Simone Robutti, 27, is a Machine Learning Engineer at Radicalbit. He earned a Master’s degree at Università degli studi di Milano with a thesis on SVM for datasets with noisy labels. Since then his interests have shifted towards the engineering side of Machine Learning and Big Data: implementation, deployment, portability and maintainability of ML-intensive systems. Right now his focus at Radicalbit is Flink and its Machine Learning library, FlinkML.

Milan – July 13 2016
Data Intensive Applications with Apache Flink
Simone Robutti
Machine Learning Engineer at Radicalbit
@SimoneRobutti

Agenda
1. Brief Introduction to Apache Flink
○ Why
○ What
○ How
2. Machine Learning on Flink
○ Present landscape
○ Future of the Ecosystem
3. Closing notes on Radicalbit (shameless plug ahead)

100% Buzzword-free guaranteed
Big Data
Machine Intelligence
Web-scale
400x
It’s like the human brain
Exactly-once
Exactly-once

Why Flink (and not Spark/Storm/Samza...)
Because it’s a production-ready, streaming-first, low-latency, fault-tolerant, high-throughput processing engine

Flink: what is it?
From Flink’s Documentation

Connectors and integrations

Flink’s Runtime
From Flink’s Documentation

Flink’s DataFlow
From Flink’s Documentation
Written by the user through DataSet/DataStream API
Compiled and optimized in the client

Flink’s DataFlow
From Flink’s Documentation
The compiled job is translated to distributed tasks by
the master and executed by workers
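
The two DataFlow slides above describe a split between declaring a job and running it. A minimal sketch of that idea in plain Python follows. It does not use Flink's API: the `Plan` class, `run_task` and `execute` are hypothetical names introduced only for illustration, threads stand in for workers, and the optimization step mentioned on the slide is skipped entirely.

```python
from concurrent.futures import ThreadPoolExecutor


class Plan:
    """A lazily built chain of record-level operators (hypothetical, not Flink's API)."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def map(self, fn):
        # Record a transformation; nothing runs yet.
        return Plan(self.steps + (("map", fn),))

    def filter(self, pred):
        # Record a filter; nothing runs yet.
        return Plan(self.steps + (("filter", pred),))


def run_task(steps, partition):
    """One 'worker' task: apply every recorded step to its own partition."""
    out = []
    for record in partition:
        keep = True
        for kind, fn in steps:
            if kind == "map":
                record = fn(record)
            elif not fn(record):
                keep = False
                break
        if keep:
            out.append(record)
    return out


def execute(plan, records, parallelism=2):
    """The 'master' role: split the input and hand one task to each worker."""
    partitions = [records[i::parallelism] for i in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = pool.map(run_task, [plan.steps] * parallelism, partitions)
    # Sort only to make the merged output deterministic for this example.
    return sorted(r for part in results for r in part)


# Declaring the plan does no work; execute() is what actually runs it.
plan = Plan().map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
result = execute(plan, list(range(10)))
```

The point of the sketch is only the separation of concerns: the user-facing calls build a description of the job, and a separate step decides how to parallelize and run it.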

Machine Learning on Flink

ML on Flink
Ready and awesome for parallel ML
Work in progress for distributed ML

Flink for Model Evaluation Pipelines
Composable, modular Flink Operator
[Diagram labels: Source, Data Preparation, Evaluation, Post-processing, Sink, Source]
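
As a rough illustration of what a composable evaluation pipeline can look like, here is a sketch in plain Python using generators. It is not Flink code. The stage order (source, data preparation, evaluation, post-processing, sink) is an assumption read off the slide's diagram labels, and every function name, the sample records and `toy_model` are hypothetical.

```python
def source():
    # Hypothetical input records: (id, raw feature string).
    yield from [(1, " 0.2 "), (2, "0.9"), (3, " 0.5")]


def prepare(records):
    # Data preparation: parse raw strings into floats.
    for rec_id, raw in records:
        yield rec_id, float(raw.strip())


def evaluate(records, model):
    # Evaluation: apply an already-trained model to each record.
    for rec_id, feature in records:
        yield rec_id, model(feature)


def postprocess(records, threshold=0.5):
    # Post-processing: turn scores into labels.
    for rec_id, score in records:
        yield rec_id, "positive" if score >= threshold else "negative"


def sink(records):
    # Sink: collect results; a real job would write to an external system.
    return list(records)


def toy_model(x):
    # Stand-in for a trained model; purely illustrative.
    return x * x


# Each stage only depends on the shape of the records it receives,
# so stages can be swapped or reordered independently.
labels = sink(postprocess(evaluate(prepare(source()), toy_model)))
```

Because each stage is a function from a stream of records to a stream of records, a stage can be tested in isolation or replaced (for example, swapping `toy_model` for a different scorer) without touching the others.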

Evaluation with Flink-JPMML
Small library that implements basic model eval.
[Diagram labels: Source Operator, Data Preparation, Flink-JPMML Operator, model.pmml, Source Operator, Sink Operator]

ML on Flink
“I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.”
- Yann LeCun

FlinkML
What: Out-of-the-box workhorse algorithms (ALS,
SVM, LinReg, LogReg …)
Status: early phase, slow development

FlinkML
Pro: available out of the box, written with Flink API
Cons: reinvents the wheel, only a few algorithms,
no model persistence

Samsara
What: Linear algebra framework
Status: mature

Samsara
Pro: generic algorithms with platform-specific
bindings, skilled community
Cons: covers only a few use cases

SAMOA
What: Online learning algorithm framework (VHT,
AMR, …)
Status: early phase, complicated relationship with
the industry

SAMOA
Pro: many powerful generic online learning
algorithms, backed by academics (MOA, Weka)
Cons: not production ready, academic focus

ML on Flink: the future of the ecosystem

Apache Beam
Programming model for data processing pipelines
● Streaming first, batch as a bounded stream
● Layered API: What, Where, When, How
● Platform agnostic: same program, different
runners
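
The "batch as a bounded stream" idea can be sketched without any framework. The example below is plain Python, not Beam's API: `running_count` is a hypothetical function, and the point is only that the same processing logic can consume a finite collection or an endless source.

```python
import itertools


def running_count(events):
    """Emit (key, count so far) for every event; works on any iterable."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


# Batch: a bounded stream, consumed to the end.
batch = ["a", "b", "a"]
batch_result = list(running_count(batch))

# Streaming: an unbounded source; we only look at a finite prefix.
unbounded = itertools.cycle(["a", "b"])
stream_prefix = list(itertools.islice(running_count(unbounded), 4))
```

The logic in `running_count` never asks whether its input ends, which is the property that lets one program serve both the batch and the streaming case.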

Apache Beam - Runners
● Flink
● Spark (Partial)
● Google Cloud Dataflow
● Plain Java
● Gearpump (WIP)
● Apex (WIP)

BeamML: a runner-agnostic ML library

FlinkML Roadmap
● More algorithms!
● Evaluation framework
● Persistence/export
● Online Learning Framework

Proteus
Online Learning Platform - based on Flink
Source: Proteus’ website

The role of Radicalbit

Contributions
● Cassandra Connector
● Scala API extensions
● FlinkML (Linear Algebra Framework, MinHash)
● Akka Connector

Our vision
Flink can become the ideal choice to build real-time decision-heavy applications with high data throughput
To achieve this:
● Ambitious applications (aim for real-time services)
● Reliable distributed online learning (Proteus?)
● A Pipelining Framework (experiment fast, increase testability and modularity)

Q&A

THANKS!
Simone Robutti
Mail: simone.robutti@radicalbit.io
Medium: @simone.robutti
Twitter: @SimoneRobutti, @weareradicalbit
