distributed stream processing
@humbertostreb
Samza overview
An open-source distributed stream processing framework created by LinkedIn
- sub-second latency
- handles large amounts of state
- fault tolerance
- no messages are ever lost
- partitioned and distributed at every level
- processor isolation
- pluggable
Architecture
Streaming: Kafka
Execution: YARN
Processing: Samza
Kafka
- The stream may be sharded into one or more partitions.
- Each partition is independent from the others, and is replicated
across multiple machines.
- Each partition consists of a sequence of messages in a fixed order.
- Each message has an offset, which indicates its position in that
sequence.
- A Samza job can start consuming the sequence of messages from
any starting offset.
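As a rough illustration of the points above, here is a minimal in-memory sketch of a partitioned, offset-addressed stream. It is not Kafka's API: the class name `PartitionedStream`, its methods, and the CRC32-based partitioning are assumptions made only for this example, and replication is left out.

```python
from zlib import crc32


class PartitionedStream:
    """Toy stand-in for a partitioned stream (hypothetical, not a Kafka client)."""

    def __init__(self, num_partitions):
        # One independent, append-only list per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Assumption: hash the key so the same key always lands in the same partition.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        # The offset is simply the message's position in the partition's sequence.
        return partition, len(log) - 1

    def read_from(self, partition, offset):
        # A consumer may start at any offset and then reads forward in fixed order.
        return list(enumerate(self.partitions[partition]))[offset:]


stream = PartitionedStream(num_partitions=4)
partition, _ = stream.append("user-1", "login")
stream.append("user-1", "click")
stream.append("user-1", "logout")
print(stream.read_from(partition, 1))
```

Reading from offset 1 skips the first message of that partition and returns the remaining `(offset, message)` pairs in order.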
YARN
- ResourceManager
- NodeManager
- ApplicationMaster
Streams
A stream is composed of immutable messages of a similar type or
category
- when a job consumes more than one stream, the next message is chosen
round-robin by default, but this can be overridden
- streams can be prioritised through configuration
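The round-robin default mentioned above can be pictured with a small sketch. This is not Samza's chooser interface: `RoundRobinChooser`, `update`, and `choose` are hypothetical names for this example, and the real implementation may differ in detail.

```python
from collections import deque


class RoundRobinChooser:
    """Cycles through input streams that have pending messages (illustrative only)."""

    def __init__(self):
        self.queues = {}      # stream name -> deque of pending messages
        self.order = deque()  # rotation order of stream names

    def update(self, stream, message):
        # Buffer a message that has arrived on the given stream.
        if stream not in self.queues:
            self.queues[stream] = deque()
            self.order.append(stream)
        self.queues[stream].append(message)

    def choose(self):
        # Visit each stream at most once per call, rotating so that the
        # stream just visited moves to the back of the line.
        for _ in range(len(self.order)):
            stream = self.order[0]
            self.order.rotate(-1)
            if self.queues[stream]:
                return stream, self.queues[stream].popleft()
        return None  # nothing buffered on any stream


chooser = RoundRobinChooser()
chooser.update("clicks", "c1")
chooser.update("clicks", "c2")
chooser.update("views", "v1")
print(chooser.choose(), chooser.choose(), chooser.choose())
```

A prioritising chooser would replace the rotation with a lookup that always drains the higher-priority stream first.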
Job
A job is code that performs a logical
transformation on a set of input
streams to append output messages to
a set of output streams.
Partitions
Each stream is broken into one or
more partitions. Each partition in
the stream is a totally ordered
sequence of messages.
Task
A job is scaled by breaking it into
multiple tasks. The task is the unit of
parallelism of the job, just as the
partition is to the stream. Each task
consumes data from one partition for
each of the job’s input streams.
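One way to picture this mapping is the sketch below, which assumes that task i reads partition i of every input stream and that the number of tasks equals the largest partition count. The function name and the `task-N` naming are made up for the example.

```python
def assign_partitions_to_tasks(input_streams):
    """Map task i to partition i of each input stream.

    input_streams: dict of stream name -> number of partitions.
    Returns a dict of task name -> list of (stream, partition) pairs.
    """
    # Assumption: one task per partition index, up to the largest partition count.
    num_tasks = max(input_streams.values(), default=0)
    tasks = {}
    for i in range(num_tasks):
        tasks[f"task-{i}"] = [
            (stream, i)
            for stream, count in sorted(input_streams.items())
            if i < count  # streams with fewer partitions simply have no partition i
        ]
    return tasks


print(assign_partitions_to_tasks({"clicks": 3, "views": 2}))
```

With three `clicks` partitions and two `views` partitions, the sketch yields three tasks, the last of which reads only from `clicks`.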
Containers
Containers are the unit of physical
parallelism, and a container is
essentially a Unix process (or Linux
cgroup). Each container runs one or
more tasks.
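The relationship between tasks and containers can be sketched as a simple grouping. The round-robin distribution and the function name are assumptions for illustration; the slides do not say how tasks are actually distributed.

```python
def group_tasks_into_containers(task_names, container_count):
    """Spread tasks over a fixed number of containers (illustrative only)."""
    containers = [[] for _ in range(container_count)]
    for index, task in enumerate(sorted(task_names)):
        # Assumption: tasks are dealt out to containers in round-robin order.
        containers[index % container_count].append(task)
    return containers


tasks = [f"task-{i}" for i in range(5)]
print(group_tasks_into_containers(tasks, container_count=2))
```

Adding containers changes how many processes share the work, while the number of tasks (logical parallelism) stays the same.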
SamzaContainer startup steps
1 - Get last checkpointed offset for each input stream partition
2 - Create a “reader” thread for every input stream partition
3 - Start metrics reporters to report metrics
4 - Start a checkpoint timer to save your task’s input stream offsets
every so often
5 - Start a window timer to trigger your task’s window method, if it is
defined
6 - Instantiate and initialize your StreamTask once for each input
stream partition
7 - Start an event loop that takes messages from the input stream reader
threads, and gives them to your StreamTasks
8 - Notify lifecycle listeners during each one of these steps
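A few of these steps (1, 4, 6 and 7) can be sketched in a single-threaded toy loop. This is not SamzaContainer: `run_container`, `CollectingTask`, and the count-based checkpoint interval are assumptions for the example, and reader threads, metrics, the window timer, and lifecycle listeners are omitted.

```python
class CollectingTask:
    """Hypothetical task that just remembers what it processed."""

    def __init__(self):
        self.seen = []

    def process(self, message):
        self.seen.append(message)


def run_container(partitions, last_checkpoint, task_factory, checkpoint_every=2):
    """partitions: dict of partition id -> list of messages.
    last_checkpoint: dict of partition id -> next offset to read.
    Returns (latest checkpoint, task instances)."""
    # Step 1: resume each partition from its last checkpointed offset.
    offsets = {p: last_checkpoint.get(p, 0) for p in partitions}
    # Step 6: one task instance per input partition.
    tasks = {p: task_factory() for p in partitions}
    checkpoint = dict(offsets)
    processed = 0
    # Step 7: event loop handing each message to the task that owns its partition.
    progressed = True
    while progressed:
        progressed = False
        for p, log in partitions.items():
            if offsets[p] < len(log):
                tasks[p].process(log[offsets[p]])
                offsets[p] += 1
                processed += 1
                progressed = True
                # Step 4: checkpoint periodically (here every N messages, not on a timer).
                if processed % checkpoint_every == 0:
                    checkpoint = dict(offsets)
    return checkpoint, tasks


partitions = {0: ["a", "b", "c"], 1: ["x"]}
checkpoint, _ = run_container(partitions, {}, CollectingTask)
print(checkpoint)
```

Restarting with an earlier checkpoint replays the messages after it, which is why processing in this model is at-least-once.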
Checkpointing
Samza writes checkpoints to a separate Kafka topic called
__samza_checkpoint_<job-name>_<job-id>
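As a small sketch of how such a checkpoint topic could be used, the code below builds the topic name in the format shown above and recovers the latest checkpoint per task from an append-only log. The helper names and the `(task, offsets)` record layout are assumptions for the example, not Samza's actual message format.

```python
def checkpoint_topic(job_name, job_id):
    # Follows the naming pattern given on the slide.
    return f"__samza_checkpoint_{job_name}_{job_id}"


def latest_checkpoints(checkpoint_log):
    """checkpoint_log: list of (task_name, offsets dict) in the order written.

    Returns the most recent offsets for each task.
    """
    latest = {}
    for task_name, offsets in checkpoint_log:
        # Later entries overwrite earlier ones, so the last write wins.
        latest[task_name] = offsets
    return latest


log = [
    ("task-0", {"clicks": 10}),
    ("task-1", {"clicks": 7}),
    ("task-0", {"clicks": 25}),
]
print(checkpoint_topic("page-views", "1"))
print(latest_checkpoints(log))
```

On restart, a container would read the whole log once and resume each task from its last recorded offsets.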
State Management
- fast access to state through a local database
- fault tolerance by sending a local store's
writes to a replicated changelog, combined with
checkpointing
- RocksDB (key-value) is supported out of
the box
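The changelog idea can be sketched with a dictionary standing in for the local store and a list standing in for the replicated changelog. `ChangelogBackedStore` and its methods are hypothetical names; this does not use RocksDB or any Samza API.

```python
class ChangelogBackedStore:
    """Local key-value store whose writes are mirrored to a changelog (toy model)."""

    def __init__(self, changelog):
        self.changelog = changelog  # assumption: a durable, replicated log in real life
        self.local = {}             # fast local reads and writes

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.local.pop(key, None)
        self.changelog.append((key, None))  # None acts as a tombstone

    def get(self, key):
        return self.local.get(key)

    @classmethod
    def restore(cls, changelog):
        # After a failure, rebuild the local state by replaying the changelog.
        store = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store


changelog = []
store = ChangelogBackedStore(changelog)
store.put("user-1", 3)
store.put("user-1", 4)
recovered = ChangelogBackedStore.restore(changelog)
print(recovered.get("user-1"))
```

Reads never leave the machine, and only writes pay the cost of being appended to the log.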
Event Loop
- synchronous tasks run on a single thread by default, but this can be
configured
- asynchronous tasks are always invoked on a single thread, while their
callbacks can be triggered from a different thread.
Samza will make sure that checkpointing is automatically performed
only after the async calls have completed.
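The checkpointing guarantee can be illustrated with a tracker that only lets the checkpoint advance past offsets whose async calls have finished. This is a sketch of the invariant, not of Samza's internal mechanism; `AsyncOffsetTracker` and its methods are invented for the example.

```python
class AsyncOffsetTracker:
    """Tracks in-flight async messages for one partition (illustrative only)."""

    def __init__(self):
        self.in_flight = set()  # offsets dispatched but not yet completed
        self.next_offset = 0    # offset following the highest one dispatched

    def dispatch(self, offset):
        # Called when a message is handed to an asynchronous task.
        self.in_flight.add(offset)
        self.next_offset = max(self.next_offset, offset + 1)

    def complete(self, offset):
        # Called from the task's callback, possibly on another thread in a real system.
        self.in_flight.discard(offset)

    def safe_checkpoint(self):
        # Everything below the oldest in-flight offset is known to be done.
        return min(self.in_flight) if self.in_flight else self.next_offset


tracker = AsyncOffsetTracker()
for offset in (0, 1, 2):
    tracker.dispatch(offset)
tracker.complete(1)
print(tracker.safe_checkpoint())
```

Even though offset 1 has completed, the safe checkpoint stays at 0 until the call for offset 0 finishes, so no unfinished message is skipped on restart.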
Metrics
Samza has its own library to expose metrics, with counters, gauges and
timers.
Metrics can be exposed through JMX, a Kafka topic, and so on.
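The three metric types can be pictured with a tiny registry. This is not Samza's metrics library: `MetricsRegistry` and its method names are assumptions for the example, and a reporter (JMX, Kafka topic) would be whatever periodically publishes the snapshot.

```python
class MetricsRegistry:
    """Minimal counters, gauges and timers (hypothetical API)."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}
        self.timers = {}

    def inc(self, name, amount=1):
        # Counter: a value that only accumulates.
        self.counters[name] = self.counters.get(name, 0) + amount

    def set_gauge(self, name, value):
        # Gauge: the latest observed value of something.
        self.gauges[name] = value

    def record_time(self, name, seconds):
        # Timer: a series of measured durations.
        self.timers.setdefault(name, []).append(seconds)

    def snapshot(self):
        # What a reporter would publish; timers are summarised as an average.
        averages = {n: sum(v) / len(v) for n, v in self.timers.items()}
        return {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timers_avg_s": averages,
        }


registry = MetricsRegistry()
registry.inc("messages-processed")
registry.set_gauge("buffered-messages", 5)
registry.record_time("process-duration", 1.0)
print(registry.snapshot())
```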
Security
Samza provides no security.
All security is implemented in the stream system or in the environment
in which the Samza containers run.
Links
https://www.infoq.com/presentations/samza-linkedin
http://es.slideshare.net/martinkleppmann/samza-at-linkedin-taking-stream-processing-to-the-next-level
thanks

Apache samza
