Lambda architecture

A brief description of the fundamental concepts and of how this new architectural approach works for Big Data problems and even real-time systems.

Lambda Architecture
A solution for Big Data

Mario A. Santini

A solution born at Twitter

Nathan Marz
Author of Big Data:
http://www.manning.com/marz/

When is big "big"?
● OpenStreetMap.org: ~1.5 M users, ~2.2 nodes (http://j.mp/OSM-stats)
● Wikipedia: 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook: 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter: 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
– Monitoring systems
– Any near real-time system

[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query]

Batch Layer
● Stores an immutable input data set
● Continuously recomputes the batch views
● Simple & distributed
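The batch layer bullets can be sketched in a few lines of Python. This is a minimal illustration, not how Hadoop does it: `PageView`, `MasterDataset` and `compute_batch_view` are hypothetical names, and an in-memory list stands in for a distributed store. It shows an append-only data set and a batch view that is always recomputed from all the data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """One raw, immutable fact (hypothetical record type)."""
    url: str
    timestamp: int


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def scan(self):
        # Return a snapshot so a running batch job sees a stable input.
        return tuple(self._records)


def compute_batch_view(records):
    """Recompute the view from scratch: page views per URL over all data."""
    return dict(Counter(r.url for r in records))


master = MasterDataset()
for url, ts in [("/home", 1), ("/about", 2), ("/home", 3)]:
    master.append(PageView(url, ts))

batch_view = compute_batch_view(master.scan())
print(batch_view)  # {'/home': 2, '/about': 1}
```

Because the view is a pure function of the stored records, a buggy view can be discarded and rebuilt by running the function again.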

Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the batch layer
● A trivial read-only database:
– Quick
– Very simple
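As a rough sketch of the serving layer's role, assuming a batch view is a plain dictionary: the hypothetical `ServingLayer` class below exposes read-only lookups and is refreshed only when the batch layer hands it a new view. A real serving database such as ElephantDB works on distributed files, which this toy example does not attempt to model.

```python
from types import MappingProxyType


class ServingLayer:
    """Read-only access to the latest batch view (illustrative only)."""

    def __init__(self):
        self._view = MappingProxyType({})

    def load(self, batch_view):
        # Called by the batch layer: index a copy and swap it in whole,
        # so readers never see a partially updated view.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({"/home": 2, "/about": 1})
print(serving.get("/home"))     # 2
print(serving.get("/missing"))  # 0
```

There are no random writes to support, only whole-view replacement, which is what keeps this kind of database simple.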

Batch Layer + Serving Layer
● Robust and fault tolerant
● Scalable
● General
● Extensible
● Allow ad hoc queries
● Minimal maintenance
● Debuggable

What's missing?
While the batch layer computes the query on the full
data set, a pretty big chunk of data has just arrived
and been stored.
Should we wait a couple of hours to query this
data?

Speed Layer
[Diagram: New Data → Speed Layer → Near real-time views → Query]

All together now!
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query; New Data → Speed Layer → Near real-time views → Query]

How should all this mess work?
● All new data are sent to both the batch layer and the speed layer (data are raw, immutable and append-only)
● The batch layer continuously precomputes the query functions over the whole data set to produce the batch views
● The serving layer indexes the batch views
● As a result, the data in the batch views are a couple of hours old

How should all this mess work?
● The speed layer processes only the new data
● It uses a fast read/write database and incremental processing algorithms
● It produces the near real-time views
● Queries are resolved by merging the results from the real-time and batch views
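The speed layer and the final merge step can be sketched as follows. This is an illustration under stated assumptions, not a Storm topology: `SpeedLayer` and `query` are hypothetical names, the views are simple per-URL counters, and the `reset` method assumes that a finished batch run has absorbed everything the speed layer has seen so far.

```python
from collections import Counter


class SpeedLayer:
    """Incrementally maintained view over recent data only (sketch)."""

    def __init__(self):
        self._realtime_view = Counter()

    def on_new_record(self, url):
        # Incremental update: touch one counter, no recomputation.
        self._realtime_view[url] += 1

    def get(self, url):
        return self._realtime_view[url]

    def reset(self):
        # Assumed to be called once a batch run has absorbed these records.
        self._realtime_view.clear()


def query(url, batch_view, speed_layer):
    """Resolve a query by merging the batch and real-time results."""
    return batch_view.get(url, 0) + speed_layer.get(url)


batch_view = {"/home": 2, "/about": 1}   # a couple of hours old
speed = SpeedLayer()
for url in ["/home", "/contact"]:        # arrived after the batch run
    speed.on_new_record(url)

print(query("/home", batch_view, speed))     # 3
print(query("/contact", batch_view, speed))  # 1
```

Counts merge by simple addition; other query functions need their own merge rule, which is part of the cost of running two layers.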

Batch Layer - tools
● Hadoop
– YARN: a framework for job scheduling and cluster management
– MapReduce: a way to process huge amounts of data in parallel, based on YARN
– HDFS: a distributed file system with high-throughput access to application data
– And even more...
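To give a feel for the MapReduce model named above, here is a single-process word count in plain Python. It does not use any Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` are made up for this sketch, and on a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key."""
    return key, sum(values)


lines = ["big data is big", "data is immutable"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'immutable': 1}
```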

Serving Layer – tools
● ElephantDB
– Read-only database, very small, very fast
● Here we need anything that has the same features
● Cloudera Impala

Speed Layer - tools
● Storm project
– Very fast distributed computation system
● Apache HBase
● MongoDB

5. Lambda query = function(allData);

6. [Diagram: Input data → Lambda Architecture → query]

19. Query - tools ● Cloudera Impala
