Selective Data Replication with
Geographically Distributed Hadoop
Brett Rudenstein
April 16, 2015
Brussels, Belgium
Dimensions of Scalability
For distributed storage systems
 Ability to support ever-increasing
requirements for
- Space: more data
- Objects: more files
- Load: more clients
 RAM is limiting HDFS scale
 Other dimensions of scalability
- Geographic Scalability:
scaling across multiple data centers
 Scalability as a universal measure
of a distributed system
Geographic Scalability
Scaling file system across multiple data centers
 Running Hadoop in multiple Data Centers
- Distributed across the world
- As a single cluster
Main Steps
Four main stages to the goal
 STAGE I: The role of the Coordination Engine
 STAGE II: Replicated Virtual Namespace
- Active-active
 STAGE III: Geographically-distributed Hadoop
- File system running in multiple data centers
- Disaster recovery, load balancing, self-healing, simultaneous data ingest
 STAGE IV: Selective data replication
- Heterogeneous storage Zones
The Requirements
Requirements
File System with hardware components distributed over the WAN
 Operated and perceived by users as a single system
- Unified file system view independent of where the data is physically stored
 Strict Consistency
- Everybody sees the same data
- Seamless file level replication
 Continuous Availability
- All components are Active
- Disaster recovery
 Geographic Scalability: Support for multiple Data Centers
Architecture Principles
Strict consistency of metadata with fast data ingest
1. Synchronous replication of metadata between data centers
- Using Coordination Engine
- Provides strict consistency of the namespace
2. Asynchronous replication of data over the WAN
- Data replicated in the background
- Allows fast LAN-speed data creation
Coordination Engine
For Replicating Consistent State
Coordination Engine
Determines the order of operations in the system
 Coordination Engine ensures the order of events submitted to the engine by
multiple proposers
- Anybody can Propose
- Engine chooses a single Agreement every time and guarantees:
• Learners observe the agreements in the same order they were chosen
• An agreement triggers a corresponding application action
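The ordering guarantee can be illustrated with a minimal in-process sketch. It assumes a single sequencer rather than a real distributed engine, and the names `CoordinationEngine`, `propose`, and `add_learner` are hypothetical, not taken from any product API.

```python
from typing import Callable, List, Tuple

# (sequence number, proposer, event); a hypothetical agreement record
Agreement = Tuple[int, str, str]


class CoordinationEngine:
    """Toy engine: orders proposals and notifies every learner in that order."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._learners: List[Callable[[Agreement], None]] = []

    def add_learner(self, callback: Callable[[Agreement], None]) -> None:
        self._learners.append(callback)

    def propose(self, proposer: str, event: str) -> int:
        # Choose a single agreement for this proposal.
        agreement = (self._next_seq, proposer, event)
        self._next_seq += 1
        # Every learner observes agreements in the order they were chosen.
        for learn in self._learners:
            learn(agreement)
        return agreement[0]


engine = CoordinationEngine()
log_a: List[Agreement] = []
log_b: List[Agreement] = []
engine.add_learner(log_a.append)
engine.add_learner(log_b.append)
engine.propose("dc1", "create /user/a.log")
engine.propose("dc2", "delete /tmp/job.xml")
```

Both learner logs end up identical, which is the property the application relies on when it turns each agreement into an action.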
Central Coordination
Simple coordination without fault tolerance
 Easy to Coordinate
- Single NameNode as an example of a
Central Coordination Engine (No HA)
- Performance and availability
bottleneck
- Single point of failure
Distributed Coordination Engine
Fault-tolerant coordination using multiple acceptors
 Distributed Coordination Engine operates on participating nodes
- Roles: Proposer, Learner, and Acceptor
- Each node can combine multiple roles
 Distributed coordination
- Proposing nodes submit events as
proposals to a quorum of acceptors
- Acceptors agree on the order of each
event in the global sequence of events
- Learners learn agreements in the same
deterministic order
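A quorum check can be sketched as follows. The slides do not state the quorum size, so this assumes a simple majority, which is the common choice in Paxos-style protocols; the function names are hypothetical.

```python
from typing import Set


def majority(n_acceptors: int) -> int:
    """Smallest number of acceptors that forms a majority (assumed quorum rule)."""
    return n_acceptors // 2 + 1


def is_accepted(acks: Set[str], acceptors: Set[str]) -> bool:
    """True when acknowledgements from known acceptors reach a majority."""
    return len(acks & acceptors) >= majority(len(acceptors))


acceptors = {"a1", "a2", "a3"}
accepted = is_accepted({"a1", "a2"}, acceptors)
rejected = is_accepted({"a1"}, acceptors)
```

Any two majorities share at least one acceptor, which is why the failure of a minority of acceptors does not stop the engine from reaching agreements.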
Consensus Algorithms
Consensus is the process of agreeing on one result among a group of participants
 Coordination Engine guarantees the same state of the learners at a given GSN
- Each agreement is assigned a unique Global Sequence Number (GSN)
- GSNs form a monotonically increasing number series – the order of agreements
- Learners have the same initial state, apply the same deterministic agreements in the same deterministic order
- GSN represents “logical” time in coordinated systems
 Paxos is a consensus algorithm
proven to tolerate a variety of failures
- Quorum-based Consensus
- Deterministic State Machine
- Leslie Lamport:
Part-Time Parliament (1990)
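A learner acting as a deterministic state machine might look like the sketch below. It assumes GSNs start at 1 and are contiguous (the slides only say they increase monotonically), and the `Learner` class and its `create`/`delete` operations are illustrative.

```python
from typing import Dict, Set, Tuple


class Learner:
    """Applies agreements strictly in GSN order, buffering any that arrive early."""

    def __init__(self) -> None:
        self.state: Set[str] = set()  # e.g. the set of existing paths
        self.applied_gsn = 0          # logical time reached so far
        self._pending: Dict[int, Tuple[str, str]] = {}

    def deliver(self, gsn: int, op: str, path: str) -> None:
        self._pending[gsn] = (op, path)
        # Apply every agreement that is next in the global sequence.
        while self.applied_gsn + 1 in self._pending:
            next_op, next_path = self._pending.pop(self.applied_gsn + 1)
            if next_op == "create":
                self.state.add(next_path)
            elif next_op == "delete":
                self.state.discard(next_path)
            self.applied_gsn += 1


agreements = [(1, "create", "/a"), (2, "create", "/b"), (3, "delete", "/a")]
in_order, out_of_order = Learner(), Learner()
for gsn, op, path in agreements:
    in_order.deliver(gsn, op, path)
for gsn, op, path in [agreements[2], agreements[0], agreements[1]]:
    out_of_order.deliver(gsn, op, path)
```

Both learners finish in the same state even though the second one received the agreements in a different arrival order.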
Coordinated Replication of
HCFS Namespace
Replicated Virtual Namespace
Coordination Engine provides equivalence of multiple namespace replicas
 Coordinated Virtual Namespace controlled by Fusion Node
- Is a client that acts as a proxy to other client interactions
- Reads are not coordinated
- Writes (Open, Close, Append, etc.) are coordinated
 The namespace events are consistent with each other
- Each Fusion server maintains a log of changes that would occur in the namespace
- Any Fusion Node can initiate an update, which is propagated to all other Fusion Nodes
 Coordination Engine establishes the global order of namespace updates
- Fusion servers ensure deterministic updates in the same deterministic order to
underlying file system
- Systems that start from the same state and apply the same updates are equivalent
Strict Consistency Model
One-Copy Equivalence as known in replicated databases
 Coordination Engine sequences file open and close proposals into the
global sequence of agreements
- Applied to individual replicated folder namespace in the order of
their Global Sequence Number
 Fusion Replicated Folders have identical states when they reach the
same GSN
 One-copy equivalence
- Folders may have different states at a given moment of “clock” time
as the rate of consuming agreements may vary
- Provides same state in logical time
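One-copy equivalence can be illustrated with two folder replicas that consume the same agreements at different rates. This is a simplified sketch: `ReplicatedFolder` and its `history` map are hypothetical names, and the only operations modelled are file open and close.

```python
from typing import Dict, List, Tuple


class ReplicatedFolder:
    """Consumes a shared agreement sequence and records its state at each GSN."""

    def __init__(self, agreements: List[Tuple[int, str, str]]) -> None:
        self._agreements = list(agreements)
        self._cursor = 0
        self.gsn = 0
        self.state: Dict[str, str] = {}              # path -> "open" | "closed"
        self.history: Dict[int, Dict[str, str]] = {0: {}}

    def consume(self, count: int) -> None:
        batch = self._agreements[self._cursor:self._cursor + count]
        for gsn, op, path in batch:
            self.state[path] = "open" if op == "open" else "closed"
            self.gsn = gsn
            self.history[gsn] = dict(self.state)     # snapshot at this logical time
        self._cursor += len(batch)


agreements = [(1, "open", "/logs/a"), (2, "close", "/logs/a"), (3, "open", "/logs/b")]
fast, slow = ReplicatedFolder(agreements), ReplicatedFolder(agreements)
fast.consume(3)  # this replica is up to date
slow.consume(1)  # this replica lags behind in "clock" time
differ_now = fast.state != slow.state
same_at_gsn_1 = fast.history[1] == slow.history[1]
slow.consume(2)  # the lagging replica catches up
```

At the same wall-clock moment the two replicas differ, but their states at GSN 1 match, and they converge once the slower replica reaches the same GSN.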
Fusion
Geographically Distributed HCFS
Scaling Hadoop Across Data Centers
Continuous Availability and Disaster Recovery over the WAN
 The system should appear, act, and be operated as a single cluster
- Instant and automatic replication of data and metadata
 Parts of the cluster on different data centers should have equal roles
- Data could be ingested or accessed through any of the centers
 Data creation and access should typically be at LAN speed
- A job executed in one data center should run as fast as if there were no other data centers
 Failure scenarios: the system should provide service and remain consistent
- Any Fusion node can fail and the system still provides replication
- Fusion nodes can fail simultaneously in two or more data centers and the system still provides replication
- WAN Partitioning does not cause a data center outage
- RPO (recovery point objective) is kept as low as possible because replication is continuous rather than periodic
Foreign File Replication
File is created in the client’s data center and replicated to the other data centers asynchronously
 Fusion workflow
1. Client makes a request to create a file
2. Fusion coordinates File Open to other
clusters involved (membership)
3. File is added to underlying storage
4. IHC server pulls data from the cluster and
pushes it to remote clusters
5. Fusion coordinates File Close to other
clusters involved (membership)
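The five steps can be sketched with in-memory stand-ins for the clusters. Everything here is hypothetical (`Cluster`, `create_file`, `run_ihc`); in particular, draining the transfer queue after File Close is an assumption made to show that data moves in the background while metadata is coordinated up front.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class Cluster:
    """Stand-in for one data center: a namespace plus underlying storage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.namespace: Dict[str, str] = {}  # path -> "open" | "closed"
        self.storage: Dict[str, bytes] = {}  # path -> file contents


Transfer = Tuple[Cluster, Cluster, str]


def create_file(origin: Cluster, members: List[Cluster], path: str,
                data: bytes, transfers: Deque[Transfer]) -> None:
    # Step 2: coordinate File Open with every cluster in the membership.
    for cluster in members:
        cluster.namespace[path] = "open"
    # Step 3: the file is added to the origin's underlying storage.
    origin.storage[path] = data
    # Step 4: queue the background copy to each remote cluster.
    for cluster in members:
        if cluster is not origin:
            transfers.append((origin, cluster, path))
    # Step 5: coordinate File Close with every cluster in the membership.
    for cluster in members:
        cluster.namespace[path] = "closed"


def run_ihc(transfers: Deque[Transfer]) -> None:
    """Pull data from the source cluster and push it to the remote one."""
    while transfers:
        src, dst, path = transfers.popleft()
        dst.storage[path] = src.storage[path]


dc1, dc2 = Cluster("dc1"), Cluster("dc2")
queue: Deque[Transfer] = deque()
create_file(dc1, [dc1, dc2], "/user/a.log", b"payload", queue)
visible_before_copy = "/user/a.log" in dc2.namespace
data_before_copy = "/user/a.log" in dc2.storage
run_ihc(queue)
```

In this sketch the remote namespace already lists the file before its bytes arrive, which mirrors the split between coordinated metadata and asynchronously replicated data.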
Inter Hadoop Communication Service
 Uses HCFS API and communicates directly with underlying storage systems
- Isilon
- MAPR
- HDFS
- S3
 NameNode and DataNode operations are unchanged
Multi–Data Center Installation
Do I need so many replicas?
Features
Active/Active
Selective Data Replication
Selective Data Replication
Three main use cases for restricting data replication
 “Saudi Arabia” case – Data must never leave a specific data center
- This is needed to protect data from being replicated outside of a specific geo-location, a
country, or a facility, e.g., customer data from a branch in Saudi Arabia of a global bank must
never leave the country due to local regulations.
- Virtual namespace: replicate only the metadata whose supporting data is also replicated
 “/tmp” case – Data created in a directory by a native client should remain native
- Transient data of a job running on a DC does not need to be replicated elsewhere as it is
deleted upon job completion and nobody else needs it.
 “Ingest Only” case – Data directly ingested into cluster at data origin
- Data replicates to all other data centers
- A temporarily network-partitioned cluster can still ingest data
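SDR Implementation Example
[Figure: a virtually replicated namespace rooted at / with directories user/, tmp/, and public/, holding cs-2015-01.log, cs-2015-02.log, shv-2015-03.txt, and job-2015-04.xml; file data is selectively replicated across data centers dc1, dc2, and dc3.]
Data center labels shown for each file:
- cs-2015-01.log: dc1, dc1, dc2, dc3
- shv-2015-03.txt: dc1, dc2, dc2
- job-2015-04.xml: dc3, dc3, dc3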
Heterogeneous Storage Zones
Virtual Data Centers representing different types of block storage
 Storage Types: Hard Drive, SSD, RAM
 Virtual data center is a zone of similarly configured Data Nodes
 Example:
- Z1 archival zone: DataNodes with dense hard drive storage
- Z2 active data zone: DataNodes with high-performance SSDs
- Z3 real-time access zone: lots of RAM and cores, short-lived hot data
 SDR policy defines three directories:
- /archive – replicated only on Z1
- /active-data – replicated on Z2 and Z1
- /real-time – replicated everywhere
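The example policy can be expressed as a prefix lookup. This is a sketch only: the `POLICY` table and `zones_for` function are hypothetical, and returning an empty set for paths outside the three directories is an assumption, since the slides do not say what happens to them.

```python
from typing import Dict, Optional, Set

ZONES = {"Z1", "Z2", "Z3"}

# Directory -> zones that hold its data, as listed in the example policy.
POLICY: Dict[str, Set[str]] = {
    "/archive": {"Z1"},
    "/active-data": {"Z1", "Z2"},
    "/real-time": set(ZONES),
}


def zones_for(path: str) -> Set[str]:
    """Return the zones for a path, using the longest matching policy directory."""
    best: Optional[str] = None
    for prefix in POLICY:
        # Match the directory itself or anything beneath it, not look-alike names.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    # Assumption: paths with no matching directory are not replicated by policy.
    return set(POLICY[best]) if best is not None else set()


archive_zones = zones_for("/archive/2014/cs.log")
active_zones = zones_for("/active-data/table/part-0")
realtime_zones = zones_for("/real-time/stream/window-1")
```

Matching on a full path component keeps a directory such as /archived from being treated as part of /archive.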
Simplified WAN configurations
Reduced operational complexity
 Fast network protocols can keep up
with demanding network replication
 Hadoop clusters do not require
direct communication with each
other.
- No n × m communication among DataNodes across data centers
- Reduced firewall / SOCKS complexities
 Reduced Attack Surface
Thank You.
Questions?
Come visit WANdisco at Booth 11
Selective Data Replication with Geographically Distributed Hadoop
Brett Rudenstein

Más contenido relacionado

La actualidad más candente

Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentalsits_skm
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Data Guarantees and Fault Tolerance in Streaming Systems
Data Guarantees and Fault Tolerance in Streaming SystemsData Guarantees and Fault Tolerance in Streaming Systems
Data Guarantees and Fault Tolerance in Streaming SystemsDataWorks Summit
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 

La actualidad más candente (20)

HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
 
Hadoop HDFS
Hadoop HDFSHadoop HDFS
Hadoop HDFS
 
Hadoop
Hadoop Hadoop
Hadoop
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Data Guarantees and Fault Tolerance in Streaming Systems
Data Guarantees and Fault Tolerance in Streaming SystemsData Guarantees and Fault Tolerance in Streaming Systems
Data Guarantees and Fault Tolerance in Streaming Systems
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
HDFS Tiered Storage
HDFS Tiered StorageHDFS Tiered Storage
HDFS Tiered Storage
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 

Destacado

HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemKonstantin V. Shvachko
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...DataWorks Summit/Hadoop Summit
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Hortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks
 
A secure cloud computing based framework for big data information management ...
A secure cloud computing based framework for big data information management ...A secure cloud computing based framework for big data information management ...
A secure cloud computing based framework for big data information management ...Nexgen Technology
 
A secure cloud computing based framework for big information management syste...
A secure cloud computing based framework for big information management syste...A secure cloud computing based framework for big information management syste...
A secure cloud computing based framework for big information management syste...Pawan Arya
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionMurtaza Doctor
 
VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...
VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...
VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...VMworld
 
Sînică Alboaie - Programming for cloud computing Flows of asynchronous messages
Sînică Alboaie - Programming for cloud computing Flows of asynchronous messagesSînică Alboaie - Programming for cloud computing Flows of asynchronous messages
Sînică Alboaie - Programming for cloud computing Flows of asynchronous messagesCodecamp Romania
 
Programming Languages For The Cloud
Programming Languages For The CloudProgramming Languages For The Cloud
Programming Languages For The CloudTed Leung
 
Mysql data replication
Mysql data replicationMysql data replication
Mysql data replicationTuấn Ngô
 
Spurious correlation (updated)
Spurious correlation (updated)Spurious correlation (updated)
Spurious correlation (updated)jemille6
 
Hadoop first ETL on Apache Falcon
Hadoop first ETL on Apache FalconHadoop first ETL on Apache Falcon
Hadoop first ETL on Apache FalconDataWorks Summit
 
A framework for secure healthcare systems based on big data analytics in mobi...
A framework for secure healthcare systems based on big data analytics in mobi...A framework for secure healthcare systems based on big data analytics in mobi...
A framework for secure healthcare systems based on big data analytics in mobi...ijasa
 
A Framework for Cloud Computing Adoption in South African Government
A Framework for Cloud Computing Adoption in South African GovernmentA Framework for Cloud Computing Adoption in South African Government
A Framework for Cloud Computing Adoption in South African GovernmentGovCloud Network
 

Destacado (20)

HDFS for Geographically Distributed File System
HDFS for Geographically Distributed File SystemHDFS for Geographically Distributed File System
HDFS for Geographically Distributed File System
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
 
A secure cloud computing based framework for big data information management ...
A secure cloud computing based framework for big data information management ...A secure cloud computing based framework for big data information management ...
A secure cloud computing based framework for big data information management ...
 
A secure cloud computing based framework for big information management syste...
A secure cloud computing based framework for big information management syste...A secure cloud computing based framework for big information management syste...
A secure cloud computing based framework for big information management syste...
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
 
Data Replication - Synchronization Tool for TCIA
Data Replication - Synchronization Tool for TCIAData Replication - Synchronization Tool for TCIA
Data Replication - Synchronization Tool for TCIA
 
Big Data Applications
Big Data ApplicationsBig Data Applications
Big Data Applications
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
The EDW Ecosystem
 
VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...
VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...
VMworld 2013: VMware vSphere Replication: Technical Walk-Through with Enginee...
 
Sînică Alboaie - Programming for cloud computing Flows of asynchronous messages
Sînică Alboaie - Programming for cloud computing Flows of asynchronous messagesSînică Alboaie - Programming for cloud computing Flows of asynchronous messages
Sînică Alboaie - Programming for cloud computing Flows of asynchronous messages
 
Programming Languages For The Cloud
Programming Languages For The CloudProgramming Languages For The Cloud
Programming Languages For The Cloud
 
Mysql data replication
Mysql data replicationMysql data replication
Mysql data replication
 
Spurious correlation (updated)
Spurious correlation (updated)Spurious correlation (updated)
Spurious correlation (updated)
 
Hadoop first ETL on Apache Falcon
Hadoop first ETL on Apache FalconHadoop first ETL on Apache Falcon
Hadoop first ETL on Apache Falcon
 
A framework for secure healthcare systems based on big data analytics in mobi...
A framework for secure healthcare systems based on big data analytics in mobi...A framework for secure healthcare systems based on big data analytics in mobi...
A framework for secure healthcare systems based on big data analytics in mobi...
 
A Framework for Cloud Computing Adoption in South African Government
A Framework for Cloud Computing Adoption in South African GovernmentA Framework for Cloud Computing Adoption in South African Government
A Framework for Cloud Computing Adoption in South African Government
 

Similar a Selective Data Replication with Geographically Distributed Hadoop

Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsKonstantin V. Shvachko
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating systemMoeez Ahmad
 
Chapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.pptChapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.pptsirajmohammed35
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfsamaghorab
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory SystemsAnkit Gupta
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Alluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio, Inc.
 
Talon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategyTalon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategySaptarshi Chatterjee
 
Distributed database
Distributed databaseDistributed database
Distributed databasesanjay joshi
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory SystemsArush Nagpal
 
Distributed database
Distributed databaseDistributed database
Distributed databasesanjay joshi
 
Ch16 OS
Ch16 OSCh16 OS
Ch16 OSC.U
 

Similar a Selective Data Replication with Geographically Distributed Hadoop (20)

Coordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed SystemsCoordinating Metadata Replication: Survival Strategy for Distributed Systems
Coordinating Metadata Replication: Survival Strategy for Distributed Systems
 
Document 22.pdf
Document 22.pdfDocument 22.pdf
Document 22.pdf
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating system
 
Distributed D B
Distributed  D BDistributed  D B
Distributed D B
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
Chapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.pptChapter 6-Consistency and Replication.ppt
Chapter 6-Consistency and Replication.ppt
 
Lec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdfLec+3-Introduction-to-Distributed-Systems.pdf
Lec+3-Introduction-to-Distributed-Systems.pdf
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory Systems
 
Hadoop
HadoopHadoop
Hadoop
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Alluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory Speed
 
Talon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategyTalon systems - Distributed multi master replication strategy
Talon systems - Distributed multi master replication strategy
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory Systems
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Unit 1
Unit 1Unit 1
Unit 1
 
OSCh16
OSCh16OSCh16
OSCh16
 
Ch16 OS
Ch16 OSCh16 OS
Ch16 OS
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements




Selective Data Replication with Geographically Distributed Hadoop

  • 1. Selective Data Replication with Geographically Distributed Hadoop
    Brett Rudenstein, April 16, 2015, Brussels, Belgium
  • 2. Dimensions of Scalability
    For distributed storage systems
      - Ability to support ever-increasing requirements for
          - Space: more data
          - Objects: more files
          - Load: more clients
      - RAM is limiting HDFS scale
      - Other dimensions of scalability
          - Geographic Scalability: scaling across multiple data centers
      - Scalability as a universal measure of a distributed system
  • 3. Geographic Scalability
    Scaling file system across multiple data centers
      - Running Hadoop in multiple Data Centers
          - Distributed across the world
          - As a single cluster
  • 4. Main Steps
    Four main stages to the goal
      - STAGE I: The role of the Coordination Engine
      - STAGE II: Replicated Virtual Namespace
          - Active-active
      - STAGE III: Geographically-distributed Hadoop
          - File system running in multiple data centers
          - Disaster recovery, load balancing, self-healing, simultaneous data ingest
      - STAGE IV: Selective data replication
          - Heterogeneous storage Zones
  • 6. Requirements
    File System with hardware components distributed over the WAN
      - Operated and perceived by users as a single system
          - Unified file system view independent of where the data is physically stored
      - Strict Consistency
          - Everybody sees the same data
          - Seamless file level replication
      - Continuous Availability
          - All components are Active
          - Disaster recovery
      - Geographic Scalability: Support for multiple Data Centers
  • 7. Architecture Principles
    Strict consistency of metadata with fast data ingest
      1. Synchronous replication of metadata between data centers
          - Using Coordination Engine
          - Provides strict consistency of the namespace
      2. Asynchronous replication of data over the WAN
          - Data replicated in the background
          - Allows fast LAN-speed data creation
  • 9. Coordination Engine
    Determines the order of operations in the system
      - Coordination Engine ensures the order of events submitted to the engine by multiple proposers
          - Anybody can Propose
          - Engine chooses a single Agreement every time and guarantees:
              • Learners observe the agreements in the same order they were chosen
              • An agreement triggers a corresponding application action
  • 10. Central Coordination
    Simple coordination without fault tolerance
      - Easy to Coordinate
          - Single NameNode as an example of a Central Coordination Engine (No HA)
          - Performance and availability bottleneck
          - Single point of failure
  • 11. Distributed Coordination Engine
    Fault-tolerant coordination using multiple acceptors
      - Distributed Coordination Engine operates on participating nodes
          - Roles: Proposer, Learner, and Acceptor
          - Each node can combine multiple roles
      - Distributed coordination
          - Proposing nodes submit events as proposals to a quorum of acceptors
          - Acceptors agree on the order of each event in the global sequence of events
          - Learners learn agreements in the same deterministic order
  • 12. Consensus Algorithms
    Consensus is the process of agreeing on one result among a group of participants
      - Coordination Engine guarantees the same state of the learners at a given GSN
          - Each agreement is assigned a unique Global Sequence Number (GSN)
          - GSNs form a monotonically increasing number series: the order of agreements
          - Learners have the same initial state, apply the same deterministic agreements in the same deterministic order
          - GSN represents "logical" time in coordinated systems
      - PAXOS is a consensus algorithm proven to tolerate a variety of failures
          - Quorum-based Consensus
          - Deterministic State Machine
          - Leslie Lamport: Part-Time Parliament (1990)
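The ordering guarantee described on slides 9 to 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WANdisco's implementation: a single in-process `Sequencer` stands in for the quorum-based engine (no Paxos rounds are modelled), and the names `Sequencer`, `Learner`, `Agreement`, and the `("create", path)` / `("delete", path)` operations are hypothetical. The sketch shows learners applying agreements strictly in GSN order, even when agreements are delivered out of order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Agreement:
    gsn: int     # Global Sequence Number assigned by the engine
    op: tuple    # hypothetical operation, e.g. ("create", "/user/a.log")


class Sequencer:
    """Stand-in for the Coordination Engine: one GSN per accepted proposal.

    A real engine reaches each agreement through a quorum of acceptors;
    a single counter is used here purely for illustration.
    """

    def __init__(self):
        self._next_gsn = 1

    def propose(self, op):
        agreement = Agreement(self._next_gsn, op)
        self._next_gsn += 1
        return agreement


@dataclass
class Learner:
    """Applies agreements in GSN order, buffering any that arrive early."""
    namespace: set = field(default_factory=set)
    applied_gsn: int = 0
    pending: dict = field(default_factory=dict)

    def deliver(self, agreement):
        self.pending[agreement.gsn] = agreement
        # Apply every consecutive agreement that is now available.
        while self.applied_gsn + 1 in self.pending:
            nxt = self.pending.pop(self.applied_gsn + 1)
            self._apply(nxt.op)
            self.applied_gsn = nxt.gsn

    def _apply(self, op):
        kind, path = op
        if kind == "create":
            self.namespace.add(path)
        elif kind == "delete":
            self.namespace.discard(path)


engine = Sequencer()
agreements = [
    engine.propose(("create", "/user/a.log")),
    engine.propose(("delete", "/user/a.log")),
    engine.propose(("create", "/user/b.log")),
]

in_order, shuffled = Learner(), Learner()
for a in agreements:
    in_order.deliver(a)
for a in reversed(agreements):   # same agreements, reversed delivery order
    shuffled.deliver(a)

# Both learners started from the same state and applied the same
# agreements in GSN order, so their namespaces are identical.
print(in_order.namespace == shuffled.namespace)
```

Buffering by GSN is one simple way to get a deterministic apply order; the slides do not say how the real learners handle early arrivals.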
  • 14. Replicated Virtual Namespace
    Coordination Engine provides equivalence of multiple namespace replicas
      - Coordinated Virtual Namespace controlled by Fusion Node
          - Is a client that acts as a proxy to other client interactions
          - Reads are not coordinated
          - Writes (Open, Close, Append, etc…) are coordinated
      - The namespace events are consistent with each other
          - Each fusion server maintains a log of changes that would occur in the namespace
          - Any Fusion Node can initiate an update, which is propagated to all other Fusion Nodes
      - Coordination Engine establishes the global order of namespace updates
          - Fusion servers ensure deterministic updates in the same deterministic order to underlying file system
          - Systems, which start from the same state and apply the same updates, are equivalent
  • 15. Strict Consistency Model
    One-Copy Equivalence as known in replicated databases
      - Coordination Engine sequences file open and close proposals into the global sequence of agreements
          - Applied to individual replicated folder namespace in the order of their Global Sequence Number
      - Fusion Replicated Folders have identical states when they reach the same GSN
      - One-copy equivalence
          - Folders may have different states at a given moment of "clock" time as the rate of consuming agreements may vary
          - Provides same state in logical time
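One-copy equivalence can be illustrated with a small replay function. The sketch below is an assumption-laden toy: the `replay` helper, the `(GSN, operation, path)` tuples, and the example paths are hypothetical, and only the open and close operations named on the slide are modelled. Two folder replicas that have consumed different numbers of agreements differ at the same wall-clock instant, yet agree once they reach the same GSN.

```python
def replay(agreements, up_to_gsn):
    """Return the folder state after applying agreements with GSN <= up_to_gsn."""
    state = {}
    for gsn, kind, path in sorted(agreements):
        if gsn > up_to_gsn:
            break
        if kind == "open":
            state[path] = "open"      # file is being written
        elif kind == "close":
            state[path] = "closed"    # file is complete
    return state


# Global sequence of agreements: (GSN, operation, path).
agreements = [
    (1, "open", "/repl/cs-2015-01.log"),
    (2, "close", "/repl/cs-2015-01.log"),
    (3, "open", "/repl/cs-2015-02.log"),
]

# At the same wall-clock instant, dc1 has consumed three agreements
# while dc2 has consumed only two.
dc1_now = replay(agreements, up_to_gsn=3)
dc2_now = replay(agreements, up_to_gsn=2)
states_differ_in_clock_time = dc1_now != dc2_now

# Once dc2 also reaches GSN 3, the two replicas are identical.
dc2_later = replay(agreements, up_to_gsn=3)
same_state_in_logical_time = dc1_now == dc2_later

print(states_differ_in_clock_time, same_state_in_logical_time)
```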
  • 17. Scaling Hadoop Across Data Centers
    Continuous Availability and Disaster Recovery over the WAN
      - The system should appear, act, and be operated as a single cluster
          - Instant and automatic replication of data and metadata
      - Parts of the cluster in different data centers should have equal roles
          - Data could be ingested or accessed through any of the centers
      - Data creation and access should typically be at LAN speed
          - Running time of a job executed in one data center should be as if there were no other centers
      - Failure scenarios: the system should provide service and remain consistent
          - Any Fusion node can fail and the system still provides replication
          - Fusion nodes can fail simultaneously in two or more data centers and the system still provides replication
          - WAN Partitioning does not cause a data center outage
          - RPO is as low as possible due to continuous, as opposed to periodic, replication
  • 18. Foreign File Replication
    File is created in the client's data center and replicated to the others asynchronously
      - Fusion workflow
          1. Client makes a request to create a file
          2. Fusion coordinates File Open to other clusters involved (membership)
          3. File is added to underlying storage
          4. IHC server pulls data from the cluster and pushes it to remote clusters
          5. Fusion coordinates File Close to other clusters involved (membership)
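The five-step workflow above can be sketched as a small simulation. Everything here is hypothetical scaffolding (the `Cluster` class, `create_file`, `drain`, and the in-memory dictionaries); it does not use any Fusion or Hadoop API. A queue models the background nature of the data transfer; the exact interleaving of steps 4 and 5 in the real product is not specified by the slide, so the ordering below is an assumption.

```python
from collections import deque


class Cluster:
    """Toy cluster: separate metadata (namespace) and data (storage)."""

    def __init__(self, name):
        self.name = name
        self.namespace = {}   # path -> "open" | "closed"
        self.storage = {}     # path -> bytes


def create_file(path, data, origin, members, transfer_queue):
    # Steps 1-2: the create request is coordinated, so every member
    # cluster records the file as open in its namespace.
    for cluster in members:
        cluster.namespace[path] = "open"
    # Step 3: data is written to the origin's underlying storage only.
    origin.storage[path] = data
    # Step 4: transfers to remote clusters are queued, not done inline.
    for cluster in members:
        if cluster is not origin:
            transfer_queue.append((path, origin, cluster))
    # Step 5: the file close is coordinated to all member clusters.
    for cluster in members:
        cluster.namespace[path] = "closed"


def drain(transfer_queue):
    """Stand-in for the IHC server: pull from the origin, push to the remote."""
    while transfer_queue:
        path, source, target = transfer_queue.popleft()
        target.storage[path] = source.storage[path]


dc1, dc2 = Cluster("dc1"), Cluster("dc2")
queue = deque()
path = "/user/cs-2015-01.log"
create_file(path, b"payload", origin=dc1, members=[dc1, dc2], transfer_queue=queue)

metadata_visible_remotely = path in dc2.namespace   # metadata is coordinated
data_present_before_drain = path in dc2.storage     # data has not arrived yet
drain(queue)
data_present_after_drain = path in dc2.storage

print(metadata_visible_remotely, data_present_before_drain, data_present_after_drain)
```

The point of the split is that the client in dc1 only waits for the coordinated metadata operations and the local write, which is how the design keeps data creation at LAN speed.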
  • 19. Inter Hadoop Communication Service
      - Uses HCFS API and communicates directly with underlying storage systems
          - Isilon
          - MAPR
          - HDFS
          - S3
      - NameNode and DataNode operations are unchanged
  • 20. Multi–Data Center Installation
    Do I need so many replicas?
  • 22. Selective Data Replication
    Three main use cases for restricting data replication
      - "Saudi Arabia" case: Data must never leave a specific data center
          - This is needed to protect data from being replicated outside of a specific geo-location, a country, or a facility, e.g., customer data from a branch in Saudi Arabia of a global bank must never leave the country due to local regulations.
          - Virtual namespace: only replicated metadata that has its supporting data replicated
      - "/tmp" case: Data created in a directory by a native client should remain native
          - Transient data of a job running in a DC does not need to be replicated elsewhere as it is deleted upon job completion and nobody else needs it.
      - "Ingest Only" case: Data directly ingested into cluster at data origin
          - Data replicates to all other data centers
          - A temporarily network-partitioned cluster can still ingest data
  • 23. SDR Implementation Example
    [Figure: a virtually replicated namespace (root / with user/, tmp/, and public/; files cs-2015-01.log, cs-2015-02.log, shv-2015-03.txt, job-2015-04.xml) shown next to selectively replicated data, with per-file replica locations across dc1, dc2, and dc3]
  • 24. Heterogeneous Storage Zones
    Virtual Data Centers representing different types of block storage
      - Storage Types: Hard Drive, SSD, RAM
      - Virtual data center is a zone of similarly configured Data Nodes
      - Example:
          - Z1 archival zone: DataNodes with dense hard drive storage
          - Z2 active data zone: DataNodes with high-performance SSDs
          - Z3 real-time access zone: lots of RAM and cores, short-lived hot data
      - SDR policy defines three directories:
          - /archive: replicated only on Z1
          - /active-data: replicated on Z2 and Z1
          - /real-time: replicated everywhere
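The three-directory SDR policy on this slide maps naturally to a prefix lookup. The sketch below is a hypothetical illustration, not the product's configuration format: `SDR_POLICY`, `zones_for`, and the fallback of replicating everywhere for paths outside the policy are all assumptions; only the directory names and zone assignments come from the slide.

```python
ALL_ZONES = frozenset({"Z1", "Z2", "Z3"})

# Directory -> zones that hold data replicas, as listed on the slide.
SDR_POLICY = {
    "/archive": frozenset({"Z1"}),
    "/active-data": frozenset({"Z1", "Z2"}),
    "/real-time": ALL_ZONES,
}


def zones_for(path, policy=SDR_POLICY, default=ALL_ZONES):
    """Return the zones that should store data for `path`.

    The longest matching policy directory wins. Paths outside every
    policy directory fall back to `default` (an assumption, not taken
    from the slides).
    """
    best = None
    for directory in policy:
        if path == directory or path.startswith(directory + "/"):
            if best is None or len(directory) > len(best):
                best = directory
    return policy[best] if best is not None else default


for example in ("/archive/2014/cs.log", "/active-data/orders.db", "/real-time/feed"):
    print(example, sorted(zones_for(example)))
```

Matching on `directory + "/"` rather than a bare string prefix keeps a path such as `/archived/x` from being treated as part of `/archive`.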
  • 25. Simplified WAN configurations
    Reduced operational complexity
      - Fast network protocols can keep up with demanding network replication
      - Hadoop clusters do not require direct communication with each other.
          - No n × m communication among DataNodes across data centers
          - Reduced firewall / SOCKS complexities
      - Reduced Attack Surface
  • 26. Thank You. Questions?
    Come visit WANdisco at Booth 11

Editor's notes

  1. It is no secret that RAM is the limiting factor for NameNode (NN) scalability and, as a result, for HDFS as a whole
  2. The goal has been achieved, or good progress is being made
  3. Consensus algorithms are the core of a distributed Coordination Engine (CE)
  4. Double determinism (deterministic agreements applied in a deterministic order) is important for equivalent evolution of the systems
  5. This is unlike a multi-cluster architecture, where clusters run independently in each data center and mirror data between them
  6. Replica reporting and transfer sequence:
      - DC1 DataNodes report replicas to native GeoNodes
      - DC1 GeoNode submits Foreign Replica Report (FRR) proposal
      - FRR agreement is executed by all foreign GeoNodes: they learn about foreign locations
      - DC1 GeoNode schedules replica transfer from native to foreign DC2 DataNode
      - DC2 DataNode reports new replica to DC2 GeoNodes
      - DC2 GeoNode schedules replication of the new replica within DC2