Hadoop Ecosystem
Components of a Big Data Architecture
Orchestration
• Most big data solutions consist of repeated data
processing operations, encapsulated in workflows that:
– transform source data,
– move data between multiple sources and sinks,
– load the processed data into an analytical data store,
– or push the results straight to a report or dashboard.
Orchestration
• In the data pipeline example below, the orchestration-based solution has a central
orchestration flow. All of the state transition rules are managed centrally in
a tool (e.g. Oozie, Activiti, Azkaban).
• Each service sends the event/data back to the central brain, which guides the
process to the next step.
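The central-brain idea can be sketched in a few lines. This is only an illustration, not how Oozie or Azkaban are configured: the step names (`transform`, `load`, `report`), the event names and the `TRANSITIONS` table are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical pipeline steps; each one reports an event name back to the orchestrator.
def transform(ctx: dict) -> str:
    ctx["rows"] = [r.strip().lower() for r in ctx["raw"]]
    return "transformed"

def load(ctx: dict) -> str:
    ctx["store"] = list(ctx["rows"])
    return "loaded"

def report(ctx: dict) -> str:
    ctx["report"] = f"{len(ctx['store'])} rows loaded"
    return "done"

STEPS: Dict[str, Callable[[dict], str]] = {
    "transform": transform,
    "load": load,
    "report": report,
}

# All state transition rules live in one central place (assumed names).
TRANSITIONS: Dict[str, Optional[str]] = {
    "transformed": "load",
    "loaded": "report",
    "done": None,
}

def orchestrate(ctx: dict, first_step: str = "transform") -> List[str]:
    """Run steps until the transition table says there is no next step."""
    history: List[str] = []
    step: Optional[str] = first_step
    while step is not None:
        event = STEPS[step](ctx)    # the service sends its event back to the central brain
        history.append(step)
        step = TRANSITIONS[event]   # the brain decides which step runs next
    return history

context = {"raw": [" A ", "b", "C "]}
run_order = orchestrate(context)
```

Changing the order of the pipeline only requires editing `TRANSITIONS`; the steps themselves do not know about each other.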
Choreography
• In choreography, a set of decoupled microservices each know
what data to expect and what to provide, without a
central brain or conductor.
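A minimal sketch of the same pipeline in choreographed form, assuming a hypothetical in-memory `EventBus` and made-up topic names. The bus only delivers events; the workflow emerges from what each service subscribes to and publishes.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class EventBus:
    """Hypothetical in-memory bus: it delivers events and holds no workflow rules."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)

bus = EventBus()
trace: List[str] = []

# Each service knows only the event it expects and the event it provides.
def transformer(payload: dict) -> None:
    trace.append("transformer")
    bus.publish("data.transformed", {"rows": [r.upper() for r in payload["rows"]]})

def loader(payload: dict) -> None:
    trace.append("loader")
    bus.publish("data.loaded", {"count": len(payload["rows"])})

def reporter(payload: dict) -> None:
    trace.append(f"reporter:{payload['count']}")

bus.subscribe("data.received", transformer)
bus.subscribe("data.transformed", loader)
bus.subscribe("data.loaded", reporter)

bus.publish("data.received", {"rows": ["a", "b"]})
```

There is no transition table here: adding a new consumer of `data.loaded` means subscribing one more handler, without touching the existing services.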
λ Lambda architecture
• First proposed by Nathan Marz, the lambda architecture addresses the latency of
batch queries over very large data sets by creating two paths for data flow.
• All data coming into the system goes through these two paths:
– A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.
– A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.
– The batch layer feeds into a serving layer that indexes the batch view for efficient querying.
– The speed layer updates the serving layer with incremental updates based on the most recent data.
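The two paths can be sketched with in-memory structures. This is a simplified illustration under assumed names (`ingest`, `run_batch`, `query`, page-view events). In particular, the speed layer here counts exactly, whereas a real one may trade accuracy for latency.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp, page): a hypothetical event shape

master_dataset: List[Event] = []  # batch layer: raw data, append-only
batch_view: Counter = Counter()   # serving layer: results of the last batch run
speed_view: Counter = Counter()   # speed layer: increments since the last batch run

def ingest(event: Event) -> None:
    """Every incoming event goes down both paths."""
    master_dataset.append(event)  # cold path: keep the raw event
    speed_view[event[1]] += 1     # hot path: update the real-time view

def run_batch() -> None:
    """Recompute the batch view from all raw data, then reset the speed view."""
    batch_view.clear()
    batch_view.update(page for _, page in master_dataset)
    speed_view.clear()

def query(page: str) -> int:
    """Answer from the batch view plus the most recent increments."""
    return batch_view[page] + speed_view[page]

ingest((1, "home"))
ingest((2, "home"))
run_batch()
ingest((3, "home"))
ingest((4, "docs"))
```

After this sequence `query("home")` combines two events from the batch view with one from the speed view.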
λ Lambda architecture
• A drawback of the lambda architecture is its complexity.
• Processing logic appears in two different places, the cold and hot paths, using different frameworks.
– This leads to duplicate computation logic and the complexity of managing the architecture for both paths.
• Examples of projects implementing the lambda architecture
– Generic: Twitter Summingbird
• https://github.com/twitter/summingbird
– Dedicated to machine learning: Cloudera Oryx 2
• http://oryx.io/
λ Lambda architecture: strengths
• Immutability - retaining master data
– With timestamped events
– Appended versus overwritten events
• Attempt to beat CAP
• Pre-computed views for
– further processing
– faster ad-hoc querying
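The immutability point can be illustrated with an append-only log of timestamped events from which views are derived. This is a sketch only: the `Event` fields, the `record` helper and the example addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    timestamp: int
    user: str
    email: str

event_store: List[Event] = []  # master data: events are appended, never overwritten

def record(event: Event) -> None:
    event_store.append(event)

def emails_as_of(as_of: int) -> Dict[str, str]:
    """Pre-computable view: the latest email per user at a given time."""
    view: Dict[str, str] = {}
    for e in sorted(event_store, key=lambda e: e.timestamp):
        if e.timestamp <= as_of:
            view[e.user] = e.email
    return view

record(Event(1, "ana", "ana@old.example"))
record(Event(5, "ana", "ana@new.example"))  # a change is a new event, not an update
```

Because the old event is retained, the state at any earlier timestamp can still be rebuilt, and a buggy view can be recomputed from the master data.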
λ Lambda architecture: weaknesses
• Two analytics systems to support
• Operational complexity
• By the time a scheduled job is run, 90% of the data is stale
• Many moving parts: KV store, real-time platform, batch technologies
• Running similar code and reconciling queries in dual systems
• Analytics logic changes must be applied to both systems
Kappa Architecture - Where Everything Is A Stream
• The kappa architecture was proposed by Jay Kreps as
an alternative to the lambda architecture.
• It has the same basic goals as the lambda architecture,
but with an important distinction:
– All data flows through a single path, using a stream
processing system.
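A minimal sketch of the single-path idea, assuming a replayable in-memory list stands in for the stream (in practice this would be something like a Kafka topic). The event fields and the two `count_*` functions are hypothetical. When the logic changes, the same log is replayed through the new code instead of maintaining a separate batch path.

```python
from typing import Callable, Dict, Iterable, List

# Append-only, replayable stream of events (stand-in for a real log).
event_log: List[dict] = [
    {"user": "ana", "action": "view"},
    {"user": "bob", "action": "view"},
    {"user": "ana", "action": "purchase"},
]

State = Dict[str, int]

def process(stream: Iterable[dict], update: Callable[[State, dict], None]) -> State:
    """The one and only processing path: fold events into a view."""
    state: State = {}
    for event in stream:
        update(state, event)
    return state

def count_all(state: State, event: dict) -> None:
    """First version of the logic: count every action per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1

def count_purchases(state: State, event: dict) -> None:
    """Revised logic: count purchases only."""
    if event["action"] == "purchase":
        state[event["user"]] = state.get(event["user"], 0) + 1

view_v1 = process(event_log, count_all)
view_v2 = process(event_log, count_purchases)  # reprocessing = replaying the log
```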
Kappa Architecture:
• strengths
– a single solution for all processing,
– technology-independent,
– simpler than the Lambda architecture.
• weaknesses
– no separation between needs,
– requires the team to build up new skills (learning curve).
• Kappa architecture is used by companies like LinkedIn.
SMACK architecture
• The SMACK architecture (for Spark, Mesos, Akka, Cassandra, Kafka)
– is quite different from the Lambda or Kappa architectures, since it
is defined by a list of solutions rather than by a data flow pattern.
– It is therefore necessary to understand the advantages and weaknesses
of each solution before validating the implementation of a use case.
– Kafka is sometimes replaced by Kinesis in the cloud (Amazon AWS).
• Spark - fast and general engine for distributed, large-scale data processing
• Mesos - cluster resource management system that provides efficient resource isolation and sharing across distributed applications
• Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM
• Cassandra - distributed, highly available database designed to handle large amounts of data across multiple datacenters
• Kafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds
SMACK architecture
• strengths
– a minimum of solutions capable of handling a very large number of problems,
– mature Big Data solutions,
– scalability of solutions,
– a single management solution (Mesos),
– compatible with batch, real time, Lambda, ...
• weaknesses
– integrating new needs means integrating new frameworks,
– complex architecture.
• The SMACK architecture is used by companies like TupleJump or ING.
Microservices Architecture
• The microservice architecture is often described as a
container-oriented architecture.
• It is not a complete architecture, nor is it specific to Big
Data.
Microservices Architecture vs SOA
• Microservices are the natural evolution of service oriented architectures (SOA)
• Differences between microservices and SOA
– In a microservices architecture, services are small, independent, and loosely coupled.
– Each service is a separate codebase, which can be managed by a small development team.
– Services can be deployed independently.
– Services are responsible for persisting their own data or external state. This differs from the traditional model, where a separate data layer handles data persistence.
– Services communicate with each other by using well-defined APIs.
– Internal implementation details of each service are hidden from other services.
– Services don't need to share the same technology stack, libraries, or frameworks.
Microservices Architecture: Orchestration
• Docker and its ecosystem are great for managing images and running containers on a specific host.
• Kubernetes provides orchestration, service discovery, and load balancing together in one package.
Criteria for selecting an architecture
Architecture  | Main criterion                       | Use case
Hadoop        | Store data at a low cost             | Data Lake
Lambda        | Build a complete view of the data    | Chain of treatment / valuation of the data
Kappa         | Provide a fresh vision of the data   | Business data for users
SMACK         | Deal with data at a low cost         | Data Analysis (Machine Learning)
Microservices | Scalability (elasticity), decoupling | Smart Cities
Smart Traffic - IoT Reference Architecture
Data sources
• All big data solutions start with one or more data
sources. Examples include:
– Application data stores, such as relational databases.
– Static files produced by applications, such as web server
log files.
– Real-time data sources, such as IoT devices.
Data storage
• Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in
various formats.
– Data lake (Azure Data Lake Store, S3, HDFS (Cloudera, Hortonworks))
– NoSQL store (Cassandra, HBase, Neo4j, MongoDB)
– Database as a Service (DBaaS)
• Oracle Database as a Service,
• Azure Storage (Microsoft Azure Cloud SQL Database)
Batch processing
• Because the data sets are so large, often a big data solution must process data files
using long-running batch jobs to filter, aggregate, and otherwise prepare the data
for analysis. Usually these jobs involve reading source files, processing them, and
writing the output to new files. Options include running U-SQL jobs in Azure Data
Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop
cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
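The read, filter, aggregate, write pattern described above can be sketched on a small scale with local CSV files. This is an illustration only; the file names, the column names (`sensor`, `status`, `value`) and the `run_batch_job` function are hypothetical, and a real job would run on an engine such as Hive or Spark.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def run_batch_job(source_dir: Path, output_file: Path) -> None:
    """Read every source file, filter bad rows, aggregate, write a new file."""
    totals = defaultdict(float)
    for path in sorted(source_dir.glob("*.csv")):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                if row["status"] != "ok":                     # filter
                    continue
                totals[row["sensor"]] += float(row["value"])  # aggregate
    with output_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sensor", "total"])
        for sensor in sorted(totals):
            writer.writerow([sensor, totals[sensor]])

# Two small source files standing in for a large data set.
workdir = Path(tempfile.mkdtemp())
(workdir / "part-1.csv").write_text("sensor,status,value\ns1,ok,1.5\ns2,error,9\n")
(workdir / "part-2.csv").write_text("sensor,status,value\ns1,ok,2.5\ns2,ok,4\n")

result_path = workdir / "out" / "totals.csv"
result_path.parent.mkdir()
run_batch_job(workdir, result_path)
result_rows = result_path.read_text().splitlines()
```

The output goes to a new file in a separate directory, so the source files stay untouched and the job can be rerun safely.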
Real-time message ingestion
• If the solution includes real-time sources, the architecture must include a way to capture and store
real-time messages for stream processing. This might be a simple data store, where incoming messages are
dropped into a folder for processing. However, many solutions need a message ingestion store to act as a
buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing
semantics. This portion of a streaming architecture is often referred to as stream buffering. Options
include Azure Event Hubs, Azure IoT Hub, and Kafka.
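A toy version of such a buffer, assuming a hypothetical `MessageBuffer` class with per-consumer offsets. It is not the API of Event Hubs, IoT Hub or Kafka. It only shows why a buffer with explicit commits gives redelivery after a failure and lets several consumers read independently.

```python
from typing import Dict, List

class MessageBuffer:
    """Hypothetical append-only buffer with one read offset per consumer."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, message: str) -> int:
        """Store a message and return its position in the buffer."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        """Return uncommitted messages for this consumer without advancing it."""
        start = self._offsets.get(consumer, 0)
        return self._messages[start:start + max_messages]

    def commit(self, consumer: str, count: int) -> None:
        """Mark `count` messages as processed by this consumer."""
        self._offsets[consumer] = self._offsets.get(consumer, 0) + count

buffer = MessageBuffer()
for reading in ["t=1 speed=40", "t=2 speed=42", "t=3 speed=39"]:
    buffer.append(reading)

first_poll = buffer.poll("analytics", max_messages=2)
retry_poll = buffer.poll("analytics", max_messages=2)  # nothing committed yet: same messages
buffer.commit("analytics", len(first_poll))
next_poll = buffer.poll("analytics")
archive_poll = buffer.poll("archiver")                 # another consumer starts from the beginning
```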
Stream processing
• After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream
data is then written to an output sink. Azure Stream Analytics provides a managed
stream processing service based on perpetually running SQL queries that operate
on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
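Filtering and aggregating a stream is often done per time window. The sketch below computes a tumbling-window average per sensor. The event shape `(timestamp, sensor, value)` and the function name are assumptions, and it runs over a finite list for demonstration, whereas a real stream processor would emit results continuously over an unbounded stream.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Reading = Tuple[float, str, float]  # (timestamp in seconds, sensor, value): assumed shape

def tumbling_window_avg(
    events: Iterable[Reading], window_seconds: int
) -> Dict[Tuple[int, str], float]:
    """Average value per (window start, sensor), ignoring invalid readings."""
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for ts, sensor, value in events:
        if value < 0:  # filter: drop invalid readings
            continue
        window_start = int(ts // window_seconds) * window_seconds
        key = (window_start, sensor)
        sums[key] += value  # aggregate
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

readings = [
    (0.5, "s1", 10.0),
    (3.0, "s1", 20.0),
    (9.9, "s1", -1.0),   # invalid, filtered out
    (12.0, "s1", 30.0),
]
averages = tumbling_window_avg(readings, window_seconds=10)
```

The resulting dictionary is what would be written to the output sink, one entry per window and sensor.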
Analytical data store
• Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The
analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI)
solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that
provides a metadata abstraction over data files in the distributed data store. Azure SQL Data Warehouse provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.
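A Kimball-style warehouse organizes data as fact tables joined to dimension tables. The sketch below uses an in-memory SQLite database purely as a stand-in for a real analytical store. The table names, columns and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes of each sensor.
    CREATE TABLE dim_sensor (sensor_id INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: one measurement per sensor per day.
    CREATE TABLE fact_reading (
        sensor_id INTEGER REFERENCES dim_sensor(sensor_id),
        day TEXT,
        vehicles INTEGER
    );
    INSERT INTO dim_sensor VALUES (1, 'Lyon'), (2, 'Lyon'), (3, 'Nice');
    INSERT INTO fact_reading VALUES
        (1, '2024-01-01', 120),
        (2, '2024-01-01', 80),
        (3, '2024-01-01', 50);
    """
)

# A typical analytical query: aggregate facts grouped by a dimension attribute.
vehicles_by_city = conn.execute(
    """
    SELECT d.city, SUM(f.vehicles)
    FROM fact_reading AS f
    JOIN dim_sensor AS d ON d.sensor_id = f.sensor_id
    GROUP BY d.city
    ORDER BY d.city
    """
).fetchall()
```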
Analysis and reporting
• The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data
modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these
scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data exploration,
you can use Microsoft R Server, either standalone or with Spark.
