© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect
Flink PMC member
@MartonBalassi | mbalassi@cloudera.com
The Flink - Apache Bigtop integration
Outline
• Short introduction to Bigtop
• An even shorter intro to Flink
• From Flink source to linux packages
• Implementing BigPetStore
• From linux packages to Cloudera parcels
• Summary
Short introduction to Bigtop
What is Bigtop?
Apache project for standardizing testing, packaging and integration of
leading big data components.
Components as building blocks
And many more …
Dependency hell
[Figure: a tangle of dependencies between hdfs, zookeeper, hbase, kafka, spark, …, mapred, oozie, hive, etc.]
Build all the Things!!!
Early value added
• Bigtop has been around since the 0.20 days of Hadoop
• Provides a common foundation for proper integration of a growing number of Hadoop family components
• The foundation provides a solid base for validating applications running on top of the stack(s)
• Neutral packaging and deployment/config
Early mission accomplished
• Foundation for commercial Hadoop distros/services
• Leveraged by app providers
…
Adding more components
…
New focus and target groups
• Going way beyond just building debs/rpms
• Data engineers vs distro builders
• Enhance Operations/Deployment
• Reference implementations & tutorials
An even shorter intro to Flink
The Flink stack
• DataStream API: stream processing
• DataSet API: batch processing
• Runtime: distributed streaming data flow
• Libraries
Streaming and batch as first class citizens.
Flink in the wild
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki" data integration & distribution platform
See talks by at
• Runs their fork of Flink on 1000+ nodes
From Flink source to linux packages
The Bigtop component build
• Bigtop builds the component (potentially after patching it)
• Lays out the files in a Linux-distro-friendly way (/etc/flink/conf, …)
• Adds users, groups and systemd services for the components
• Sets up the paths and alternatives for convenient access
• Builds the debs/rpms and takes care of the dependencies
http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
Implementing BigPetStore
BigPetStore Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet and Table APIs
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
BigPetStore
• Blueprints for Big Data applications
• Consists of:
  • Data Generators
  • Examples using tools in the Big Data ecosystem to process data
  • Build system and tests for integrating tools and multiple JVM languages
• Part of the Bigtop project
BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
Data generation
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON
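import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()
// Generate transactions per customer; the stores are shipped to the generator as a broadcast set
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map { t => t.setDateTime(t.getDateTime + startTime); t }
transactions.writeAsText(output)
env.execute() // sinks are lazy, nothing runs until execute() is called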
ETL with the DataSet API
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
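import org.apache.flink.api.scala._
import org.apache.flink.api.scala.utils._ // provides zipWithUniqueId

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
// Assign a unique numeric id to every distinct product: (id, product)
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId
// Join on the product to turn (customer id, product) into (customer id, product id)
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct
customerAndProductPairs.writeAsCsv(output)
env.execute() // sinks are lazy, nothing runs until execute() is called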
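The shape of this ETL step can be tried without a cluster. The following is a minimal sketch on plain Scala collections, not the deck's Flink job: the `Transaction` case class and the object name are hypothetical stand-ins for `FlinkTransaction`, and `zipWithIndex` replaces Flink's `zipWithUniqueId`, so the ids here are consecutive, which Flink does not guarantee.

```scala
object EtlSketch {
  // Hypothetical, simplified stand-in for the deck's FlinkTransaction.
  final case class Transaction(customerId: Int, products: Seq[String])

  // Returns the distinct (customer id, product id) pairs.
  def customerProductPairs(transactions: Seq[Transaction]): Set[(Int, Long)] = {
    // Give every distinct product a numeric id, in order of first appearance.
    val productIndex: Map[String, Long] =
      transactions.flatMap(_.products).distinct.zipWithIndex
        .map { case (product, i) => (product, i.toLong) }
        .toMap

    // Replace each product by its id; the Set removes duplicate pairs.
    transactions
      .flatMap(t => t.products.map(p => (t.customerId, productIndex(p))))
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val txs = Seq(
      Transaction(1, Seq("kibble", "leash")),
      Transaction(2, Seq("kibble")),
      Transaction(1, Seq("kibble"))
    )
    customerProductPairs(txs).foreach(println)
  }
}
```

The lookup map plays the role of the join against `productsWithIndex`; on a cluster the join is the better fit, because the product index may not fit into the memory of a single task.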
ETL with Table API
• Read the dirty JSON
• SQL style queries (SQL coming in Flink 1.1)
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)
val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]
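Read as intent, the query counts the transactions per store and keeps the store or stores with the highest count. A minimal sketch of that reading on plain Scala collections follows; it is an interpretation, not the Table API semantics. The fields of `StoreCount` and the `StoreTx` input type are assumptions made for illustration.

```scala
object BestStoresSketch {
  // Hypothetical input row: one record per transaction.
  final case class StoreTx(storeId: Int, storeName: String)
  // Assumed shape of the result type named on the slide.
  final case class StoreCount(storeId: Int, storeName: String, count: Int)

  def bestStores(txs: Seq[StoreTx]): Seq[StoreCount] = {
    // Step 1: transaction count per store (the storeTransactionCount table).
    val counts = txs
      .groupBy(t => (t.storeId, t.storeName))
      .map { case ((id, name), group) => StoreCount(id, name, group.size) }
      .toSeq

    if (counts.isEmpty) Seq.empty
    else {
      // Step 2: keep only the stores whose count equals the maximum.
      val max = counts.map(_.count).max
      counts.filter(_.count == max).sortBy(_.storeId)
    }
  }

  def main(args: Array[String]): Unit = {
    val txs = Seq(StoreTx(1, "A"), StoreTx(2, "B"), StoreTx(1, "A"))
    bestStores(txs).foreach(println)
  }
}
```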
A little recommender theory
[Figure: the User-Item matrix R (users U by items I) approximated by the User factors P and the Item factors Q, with User side information and Item side information alongside]
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
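The prediction rule can be made concrete with a few lines of plain Scala. This is a minimal sketch under assumptions: the item factors Q are held as a map from item id to factor vector, and the object and method names are illustrative, not taken from the deck's code.

```scala
object TopKSketch {
  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Scores every item against one user's row of P and keeps the k best.
  // itemFactors maps an item id to its factor vector (one column of Q).
  def topK(userRow: Array[Double],
           itemFactors: Map[Int, Array[Double]],
           k: Int): Seq[(Int, Double)] =
    itemFactors.toSeq
      .map { case (item, q) => (item, dot(userRow, q)) }
      .sortBy { case (_, score) => -score } // highest score first
      .take(k)

  def main(args: Array[String]): Unit = {
    // Toy data: 2 latent factors, 3 items.
    val userRow = Array(1.0, 0.5)
    val q = Map(
      10 -> Array(0.2, 0.4),
      20 -> Array(0.9, 0.1),
      30 -> Array(0.0, 1.0)
    )
    topK(userRow, q, 2).foreach(println)
  }
}
```

Only the user's row of P and the matrix Q are needed at prediction time, which is why R never has to be materialized.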
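Matrix factorization with FlinkML
• Read the (customer, product) pairs
• Write P and Q to file
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment
// Every observed (customer, product) pair gets the rating 1.0
val input = env.readCsvFile[(Int, Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(input)
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
env.execute() // sinks are lazy, nothing runs until execute() is called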
Recommendation with the DataStream API
• Give the TopK recommendation for a user
• (Could be optimized)
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();
env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()                // send every element to all parallel instances
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();
env.execute();                  // the streaming job starts only on execute()
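The pipeline suggests a two-stage TopK: each parallel instance scores the user against its own slice of Q, and a final step merges the partial lists. The sketch below shows that idea on plain Scala collections. It is an assumption about what `PartialTopK` and `GlobalTopK` do, based on their names only, and all identifiers in it are illustrative.

```scala
object DistributedTopKSketch {
  type Scored = (Int, Double) // (item id, score)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Local step: what one parallel instance would compute on its slice of Q.
  def partialTopK(user: Array[Double],
                  slice: Map[Int, Array[Double]],
                  k: Int): Seq[Scored] =
    slice.toSeq
      .map { case (item, q) => (item, dot(user, q)) }
      .sortBy(s => -s._2)
      .take(k)

  // Global step: merge the partial lists and keep the k best overall.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    partials.flatten.sortBy(s => -s._2).take(k)

  def main(args: Array[String]): Unit = {
    val user = Array(1.0, 2.0)
    val slices = Seq(
      Map(1 -> Array(1.0, 0.0), 2 -> Array(0.0, 1.0)),
      Map(3 -> Array(2.0, 2.0), 4 -> Array(0.5, 0.5))
    )
    val partials = slices.map(slice => partialTopK(user, slice, 2))
    globalTopK(partials, 2).foreach(println)
  }
}
```

Merging is safe because every item in the global top k is also in the top k of the slice that holds it, so no candidate is lost in the local step.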
From linux packages to Cloudera parcels
Why parcels?
• We have linux packages, why a new format?
• Cloudera Manager needs to update parcels without root privileges
• A big, single bundle for the whole ecosystem
• Plays well with the CM services and monitoring
• Package signing
https://github.com/cloudera/cm_ext
Managing the Flink parcel from CM
Next steps – Flink operations
• Flink does not offer a HistoryServer yet
  • Running on YARN is inconvenient like this
  • Follow [FLINK-4136] for resolution
• The stand-alone cluster mode runs multiple jobs in the same JVM
  • In practice users fire up clusters per job
  • Alibaba has a multitenant fork, the aim is to contribute it
https://www.youtube.com/watch?v=_Nw8NTdIq9A
Next steps – CM services, monitoring
Summary
• Flink is a dataflow engine with batch and streaming as first class citizens
• Bigtop offers unified packaging, testing and integration
• BigPetStore gives you a blueprint for a range of apps
• It is straightforward to build a CM parcel based on Bigtop
Big thanks to
• Clouderans supporting the project:
Sean Owen
Alexander Bartfeld
Justin Kestelyn
• The BigPetStore folks:
Suneel Marthi
Ronald J. Nowling
Jay Vyas
• Bigtop people answering my silly questions:
Konstantin Boudnik
Roman Shaposhnik
Nate D'Amico
• Squirrels pushing the integration:
Robert Metzger
Fabian Hueske
Check out the code
github.com/mbalassi/bigpetstore-flink
github.com/mbalassi/flink-parcel
Feel free to give me feedback.
Come to Flink Forward
Thank you
@MartonBalassi
mbalassi@cloudera.com

More Related Content

What's hot

Cloud stack networking shapeblue technical deep dive
Cloud stack networking   shapeblue technical deep diveCloud stack networking   shapeblue technical deep dive
Cloud stack networking shapeblue technical deep diveShapeBlue
 
The road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceThe road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceSean Cohen
 
OpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateOpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateStephen Gordon
 
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...OpenStack Korea Community
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...NETWAYS
 
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...NETWAYS
 
Cloud stack overview
Cloud stack overviewCloud stack overview
Cloud stack overviewhowie YU
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stackNitin Mehta
 
Contrail Virtual Execution Platform
Contrail Virtual Execution PlatformContrail Virtual Execution Platform
Contrail Virtual Execution PlatformNETWAYS
 
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...CloudOps2005
 
Using next gen storage in Cloudstack
Using next gen storage in CloudstackUsing next gen storage in Cloudstack
Using next gen storage in CloudstackShapeBlue
 
High Availability in OpenStack Cloud
High Availability in OpenStack CloudHigh Availability in OpenStack Cloud
High Availability in OpenStack CloudQiming Teng
 
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld
 
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...Sungjin Kang
 
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...Cloud Native Day Tel Aviv
 
MetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, AntwerpMetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, AntwerpNicolas Trangez
 
Cf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusionCf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusionmcollinsCF
 
Introduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage SubsystemIntroduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage Subsystembuildacloud
 

What's hot (20)

Cloud stack networking shapeblue technical deep dive
Cloud stack networking   shapeblue technical deep diveCloud stack networking   shapeblue technical deep dive
Cloud stack networking shapeblue technical deep dive
 
LinuxTag 2013
LinuxTag 2013LinuxTag 2013
LinuxTag 2013
 
The road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as serviceThe road to enterprise ready open stack storage as service
The road to enterprise ready open stack storage as service
 
OpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community UpdateOpenStack Toronto: Juno Community Update
OpenStack Toronto: Juno Community Update
 
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
[OpenStack Day in Korea 2015] Track 3-1 - OpenStack Storage Infrastructure & ...
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
 
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad H...
 
Cloud stack overview
Cloud stack overviewCloud stack overview
Cloud stack overview
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stack
 
Contrail Virtual Execution Platform
Contrail Virtual Execution PlatformContrail Virtual Execution Platform
Contrail Virtual Execution Platform
 
Geode on Docker
Geode on DockerGeode on Docker
Geode on Docker
 
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
Kubernetes on Bare Metal at the Kitchener-Waterloo Kubernetes and Cloud Nativ...
 
Using next gen storage in Cloudstack
Using next gen storage in CloudstackUsing next gen storage in Cloudstack
Using next gen storage in Cloudstack
 
High Availability in OpenStack Cloud
High Availability in OpenStack CloudHigh Availability in OpenStack Cloud
High Availability in OpenStack Cloud
 
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
VMworld 2013: vSphere Data Protection (VDP) Technical Deep Dive and Troublesh...
 
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
[OpenStack Day in Korea] Keynote#2 - Bringing OpenStack to the Enterprise Dat...
 
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
Muli Ben-Yehuda, Stratoscale - The Road to a Hyper-Converged OpenStack, OpenS...
 
MetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, AntwerpMetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
MetalK8s 2.x 'Moonshot' - LOADays 2019, Antwerp
 
Cf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusionCf Summit East 2018 Scaling ColdFusion
Cf Summit East 2018 Scaling ColdFusion
 
Introduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage SubsystemIntroduction to CloudStack Storage Subsystem
Introduction to CloudStack Storage Subsystem
 

Similar to The Flink - Apache Bigtop integration

Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021InfluxData
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaGrant Henke
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSWeaveworks
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesWeaveworks
 
Warsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime FabricWarsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime FabricPatryk Bandurski
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdWeaveworks
 
Anypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptxAnypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptxAkshata Sawant
 
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptxMuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptxSteve Clarke
 
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston MeetupOpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston Meetupragss
 
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...DevOps.com
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherNETWAYS
 
Building managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummitBuilding managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummitmatsunota
 
Java mission control and java flight recorder
Java mission control and java flight recorderJava mission control and java flight recorder
Java mission control and java flight recorderWolfgang Weigend
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps WorkshopWeaveworks
 

Similar to The Flink - Apache Bigtop integration (20)

Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKS
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
 
OpenStack Murano
OpenStack MuranoOpenStack Murano
OpenStack Murano
 
Warsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime FabricWarsaw MuleSoft Meetup - Runtime Fabric
Warsaw MuleSoft Meetup - Runtime Fabric
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Galera Cluster 4 for MySQL 8 Release Webinar slides
Galera Cluster 4 for MySQL 8 Release Webinar slidesGalera Cluster 4 for MySQL 8 Release Webinar slides
Galera Cluster 4 for MySQL 8 Release Webinar slides
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and Linkerd
 
Anypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptxAnypoint Tools and MuleSoft Automation (DRAFT).pptx
Anypoint Tools and MuleSoft Automation (DRAFT).pptx
 
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptxMuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
MuleSoft Meetup #9 - Anypoint Tools and MuleSoft Automation (FINAL).pptx
 
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston MeetupOpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
3 Reasons to Select Time Series Platforms for Cloud Native Applications Monit...
 
intro-kafka
intro-kafkaintro-kafka
intro-kafka
 
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius SchumacherOSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
OSDC 2018 | Highly Available Cloud Foundry on Kubernetes by Cornelius Schumacher
 
Building managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummitBuilding managedprivatecloud kvh_vancouversummit
Building managedprivatecloud kvh_vancouversummit
 
Java mission control and java flight recorder
Java mission control and java flight recorderJava mission control and java flight recorder
Java mission control and java flight recorder
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 

Recently uploaded

Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMchpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMNanaAgyeman13
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Crystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxCrystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxachiever3003
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptbibisarnayak0
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectErbil Polytechnic University
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxNiranjanYadav41
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 

Recently uploaded (20)

Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
young call girls in Green Park🔝 9953056974 🔝 escort Service

The Flink - Apache Bigtop integration

  • 1. © Cloudera, Inc. All rights reserved. Marton Balassi | Solutions Architect | Flink PMC member | @MartonBalassi | mbalassi@cloudera.com | The Flink - Apache Bigtop integration
  • 2. Outline
      • Short introduction to Bigtop
      • An even shorter intro to Flink
      • From Flink source to linux packages
      • Implementing BigPetStore
      • From linux packages to Cloudera parcels
      • Summary
  • 3. Short introduction to Bigtop
  • 4. What is Bigtop? Apache project for standardizing testing, packaging and integration of leading big data components.
  • 5. Components as building blocks. And many more …
  • 6. Dependency hell: hdfs, zookeeper, hbase, kafka, spark, …, mapred, oozie, hive, etc. Build all the Things!!!
  • 7. Early value added
      • Bigtop has been around since the 0.20 days of Hadoop
      • Provides a common foundation for proper integration of a growing number of Hadoop family components
      • The foundation provides a solid base for validating applications running on top of the stack(s)
      • Neutral packaging and deployment/config
  • 8. Early mission accomplished
      • Foundation for commercial Hadoop distros/services
      • Leveraged by app providers …
  • 9. Adding more components …
  • 10. New focus and target groups
      • Going way beyond just building debs/rpms
      • Data engineers vs distro builders
      • Enhance Operations/Deployment
      • Reference implementations & tutorials
  • 11. An even shorter intro to Flink
  • 12. The Flink stack: Libraries on top of the DataStream API (stream processing) and the DataSet API (batch processing), both running on the Runtime (distributed streaming data flow). Streaming and batch as first class citizens.
  • 13. Flink in the wild (company logos not captured in this transcript)
      • 30 billion events daily
      • 2 billion events in 10 1Gb machines
      • Picked Flink for "Saiki" data integration & distribution platform
      • See talks by … at …
      • Runs their fork of Flink on 1000+ nodes
  • 14. From Flink source to linux packages
  • 15. The Bigtop component build
      • Bigtop builds the component (potentially after patching it)
      • Lays out the files in a linux distro friendly way (/etc/flink/conf, …)
      • Adds users, groups and systemd services for the components
      • Sets up the paths and alternatives for convenient access
      • Builds the debs/rpms and takes care of the dependencies
      http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
  • 16. Implementing BigPetStore
  • 17. BigPetStore Outline
      • BigPetStore model
      • Data generator with the DataSet API
      • ETL with the DataSet and Table APIs
      • Matrix factorization with FlinkML
      • Recommendation with the DataStream API
  • 18. BigPetStore
      • Blueprints for Big Data applications
      • Consists of:
          • Data Generators
          • Examples using tools in the Big Data ecosystem to process data
          • Build system and tests for integrating tools and multiple JVM languages
      • Part of the Bigtop project
  • 19. BigPetStore model
      • Customers visiting pet stores generating transactions, location based
      Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
  • 20. Data generation
      • Use RJ Nowling’s Java generator classes
      • Write transactions to JSON
        import org.apache.flink.api.scala._

        val env = ExecutionEnvironment.getExecutionEnvironment
        val (stores, products, customers) = getData()
        val startTime = getCurrentMillis()

        // Generate transactions per customer; the stores are shipped to each task as a broadcast set
        val transactions = env.fromCollection(customers)
          .flatMap(new TransactionGenerator(products))
          .withBroadcastSet(stores, "stores")
          // Offset each transaction's timestamp by the start time
          .map{t => t.setDateTime(t.getDateTime + startTime); t}

        transactions.writeAsText(output)
  • 21. ETL with the DataSet API
      • Read the dirty JSON
      • Output (customer, product) pairs for the recommender
        import org.apache.flink.api.scala._
        import org.apache.flink.api.scala.utils._  // provides zipWithUniqueId

        val env = ExecutionEnvironment.getExecutionEnvironment
        val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

        // Assign a unique id to every distinct product: (id, product)
        val productsWithIndex = transactions.flatMap(_.getProducts)
          .distinct
          .zipWithUniqueId

        // Join (customerId, product) with (id, product) on the product, keep (customerId, id)
        val customerAndProductPairs = transactions
          .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
          .join(productsWithIndex).where(_._2).equalTo(_._2)
          .map(pair => (pair._1._1, pair._2._1))
          .distinct

        customerAndProductPairs.writeAsCsv(output)
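The ETL step on slide 21 can be tried without a cluster. The following is a minimal sketch that uses plain Scala collections in place of Flink DataSets; the `Transaction` case class and its fields are hypothetical stand-ins for the BigPetStore types, and `zipWithIndex` yields dense indices whereas Flink's `zipWithUniqueId` only guarantees uniqueness.

```scala
// Plain Scala collections stand in for Flink DataSets here.
// Transaction and its fields are hypothetical, not the BigPetStore classes.
object EtlSketch {
  final case class Transaction(customerId: Int, products: Seq[String])

  // Assign an index to every distinct product name, in order of first appearance.
  def indexProducts(transactions: Seq[Transaction]): Map[String, Long] =
    transactions.flatMap(_.products).distinct.zipWithIndex
      .map { case (product, idx) => (product, idx.toLong) }
      .toMap

  // Emit distinct (customerId, productIndex) pairs for the recommender.
  def customerProductPairs(transactions: Seq[Transaction]): Seq[(Int, Long)] = {
    val index = indexProducts(transactions)
    transactions
      .flatMap(t => t.products.map(p => (t.customerId, index(p))))
      .distinct
  }
}
```

Calling `EtlSketch.customerProductPairs` on a handful of transactions returns each (customer, product index) combination once, which is the input shape the matrix factorization step expects.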
  • 22. ETL with Table API
      • Read the dirty JSON
      • SQL style queries (SQL coming in Flink 1.1)
        val env = ExecutionEnvironment.getExecutionEnvironment
        val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
        val table = transactions.map(toCaseClass(_)).toTable

        val storeTransactionCount = table.groupBy('storeId)
          .select('storeId, 'storeName, 'storeId.count as 'count)

        val bestStores = table.groupBy('storeId)
          .select('storeId.max as 'max)
          .join(storeTransactionCount)
          .where("count = max")
          .select('storeId, 'storeName, 'storeId.count as 'count)
          .toDataSet[StoreCount]
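Reading the intent of slide 22 as "count transactions per store, then keep the stores with the highest count" (an interpretation, not something the slide states), the same aggregation can be sketched on plain Scala collections. The `StoreCount` fields and the `(storeId, storeName)` row shape are assumptions for illustration.

```scala
// Plain Scala sketch of the aggregation; not the Flink Table API.
// StoreCount's fields and the (storeId, storeName) row shape are assumed.
object BestStoresSketch {
  final case class StoreCount(storeId: Int, storeName: String, count: Int)

  // One row per transaction; count the rows of each store.
  def transactionCounts(rows: Seq[(Int, String)]): Seq[StoreCount] =
    rows.groupBy(identity).toSeq
      .map { case ((id, name), group) => StoreCount(id, name, group.size) }
      .sortBy(_.storeId)

  // Keep every store whose count equals the maximum count.
  def bestStores(rows: Seq[(Int, String)]): Seq[StoreCount] = {
    val counts = transactionCounts(rows)
    if (counts.isEmpty) Seq.empty
    else {
      val max = counts.map(_.count).max
      counts.filter(_.count == max)
    }
  }
}
```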
  • 23. A little recommender theory
      • Diagram: user-item matrix R (users U, items I), user factors P, item factors Q, user side information, item side information
      • R is potentially huge, approximate it with P∗Q
      • Prediction is TopK(user’s row ∗ Q)
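The prediction rule on slide 23, TopK(user's row ∗ Q), can be sketched as follows. This is a minimal illustration on in-memory arrays, assuming Q is available as a map from item id to factor vector; the names `dot` and `topK` are made up for this sketch.

```scala
// In-memory sketch of TopK(user's row * Q); names and data layout are assumptions.
object TopKSketch {
  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Score every item against the user's row of P and keep the ids of the k best.
  // Ties are broken by the smaller item id to keep the result deterministic.
  def topK(userRow: Array[Double], q: Map[Int, Array[Double]], k: Int): Seq[Int] =
    q.toSeq
      .map { case (itemId, factors) => (itemId, dot(userRow, factors)) }
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
}
```

Only the user's row of P and the item factors Q are needed at prediction time, so the full matrix R never has to be materialized.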
  • 24. Matrix factorization with FlinkML
      • Read the (customer, product) pairs
      • Write P and Q to file
        import org.apache.flink.api.scala._
        import org.apache.flink.ml.recommendation.ALS

        val env = ExecutionEnvironment.getExecutionEnvironment
        // Treat every (customer, product) pair as a rating of 1.0
        val input = env.readCsvFile[(Int,Int)](inputFile)
          .map(pair => (pair._1, pair._2, 1.0))

        val model = ALS()
          .setNumFactors(numFactors)
          .setIterations(iterations)
          .setLambda(lambda)

        model.fit(input)

        // factorsOption holds the user and item factors once fit has run
        val (p, q) = model.factorsOption.get
        p.writeAsText(pOut)
        q.writeAsText(qOut)
  • 25. Recommendation with the DataStream API
      • Give the TopK recommendation for a user
      • (Could be optimized)
        StreamExecutionEnvironment env = StreamExecutionEnvironment
          .getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
          .map(new GetUserVector())
          .broadcast()
          .map(new PartialTopK())
          .keyBy(0)
          .flatMap(new GlobalTopK())
          .print();
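The pipeline on slide 25 splits the ranking into a partial step per parallel instance and a global merge. A minimal sketch of that two-phase idea on plain Scala collections follows; it assumes each partition already holds (itemId, score) pairs, and the function names only mirror the slide's `PartialTopK` and `GlobalTopK` classes, whose actual implementation is not shown in the deck.

```scala
// Two-phase top-k on plain collections; an illustration of the idea only,
// not the PartialTopK/GlobalTopK classes from the slide.
object DistributedTopKSketch {
  type Scored = (Int, Double) // (itemId, score)

  // Highest score first; ties are broken by the smaller item id.
  private def best(items: Seq[Scored], k: Int): Seq[Scored] =
    items.sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1)).take(k)

  // Each parallel instance keeps only its local k best candidates.
  def partialTopK(partition: Seq[Scored], k: Int): Seq[Scored] = best(partition, k)

  // Merging the local winners yields the global top k.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    best(partials.flatten, k)
}
```

Because every global winner is also among the k best of its own partition, merging the partial results gives the same answer as ranking all items in one place while shipping at most k candidates per partition.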
  • 26. From linux packages to Cloudera parcels
  • 27. Why parcels?
      • We have linux packages, why a new format?
      • Cloudera Manager needs to update parcel without root privileges
      • A big, single bundle for the whole ecosystem
      • Plays well with the CM services and monitoring
      • Package signing
      https://github.com/cloudera/cm_ext
  • 28. Managing the Flink parcel from CM
  • 29. Next steps – Flink operations
      • Flink does not offer a HistoryServer yet
          This makes running on YARN inconvenient
          Follow [FLINK-4136] for resolution
      • The stand-alone cluster mode runs multiple jobs in the same JVM
          In practice users fire up clusters per job
          Alibaba has a multitenant fork, aim is to contribute
      https://www.youtube.com/watch?v=_Nw8NTdIq9A
  • 30. Next steps – CM services, monitoring
  • 31. Summary
  • 32. Summary
      • Flink is a dataflow engine with batch and streaming as first class citizens
      • Bigtop offers unified packaging, testing and integration
      • BigPetStore gives you a blueprint for a range of apps
      • It is straightforward to build a CM parcel based on Bigtop
  • 33. Big thanks to
      • Clouderans supporting the project: Sean Owen, Alexander Bartfeld, Justin Kestelyn
      • The BigPetStore folks: Suneel Marthi, Ronald J. Nowling, Jay Vyas
      • Bigtop people answering my silly questions: Konstantin Boudnik, Roman Shaposhnik, Nate D'Amico
      • Squirrels pushing the integration: Robert Metzger, Fabian Hueske
  • 34. Check out the code: github.com/mbalassi/bigpetstore-flink and github.com/mbalassi/flink-parcel. Feel free to give me feedback.
  • 35. Come to Flink Forward
  • 36. Thank you. @MartonBalassi | mbalassi@cloudera.com