SlideShare una empresa de Scribd logo
1 de 31
Qiyuan Gong
Intel SSG/OTC
2018
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
2
Qiyuan Gong
Software Engineer, Intel OTC (Open Source Technology
Center)
Working on: Deep Learning on Hadoop (Canceled)
SSM (Smart Storage Management for Big Data)
Ph.D on Data anonymization (related to GDPR)
qiyuan.gong@intel.com
https://www.linkedin.com/in/qiyuangong/
Intel® Software contributions to big data and analytics. From
open enterprise-ready software platforms to analytics building
blocks, runtime optimizations, tools, benchmarks and use
cases, Intel Software makes big data and analytics faster,
easier, and more insightful.
https://software.intel.com/en-us/bigdata
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Outlines
• Background & Motivation
• Basic Design & Experience
• Evaluation
• Example & Demo
3
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Background & Motivation
• HDFS (Hadoop Disturbed File System) is widely used in production
4
Source: https://www.edureka.co/blog/hadoop-ecosystem?utm_campaign=social-media-edureka-sep-
ss&utm_medium=crosspost&utm_source=quora
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Background & Motivation
• HDFS is architected to operate efficiently at
scale for normal hardware failures within a
datacenter
• Disk, node or rack fails will not affect HDFS
• But HDFS is not designed to handle
datacenter failures
 If data center is down, we lost everything (meta
and data)
5
Source: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-
hdfs/HdfsDesign.html
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Background & Motivation
• Disaster Recovery is not equal to backup
• Essential
• Location Separated – High latency & limited
bandwidth
• No Data loss - Backup & recovery
• High Availability – Switch to standby data center
• Additional
• Performance - Higher is better
• Resource Usage - Don’t use up all resource/
bandwidth
• …
• But let’s focus on backup
6
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Backup HDFS from Primary (local) to standby (remote)
• Entire/Base backup – first backup, focus on performance, no tricks (compression and
encryption)
• Incremental backup – lots of fun & pain
7
Primary
Datacenter
HDFS
Standby
Datacenter
HDFS
Entire
Backup
Incremental Backup
Quit similar to version
control system, such as Git
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Incremental Backup
• Diff detection - What is changed?
• Find out (incremental) difference
• Building diff application tasks/list
• Diff application - How to apply changes to
remote?
• Apply diff content to Standby Cluster
• Key points
• Precise diff (algorithm)
• Performance (concurrency) & Resource usage
(CPU, memory and bandwidth)
• Corner cases – merge failed with N conflicts
8
git fetch (patch) from
remote
git
rebase/merge
Primar
y
Standb
y
Diff Detection
Diff Application
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• DistCp’s Advantages
• Build-in tool/command in Hadoop
• Track changes with Snapshot
• Snapshot report for diff detection (Create,
Delete, Rename, Modify)
• Copy from Snapshot (Snapshot is a
reference which helps keeping deleted
blocks)
• Apply changes with MapReduce
• Copy contents concurrently in block level
• Old version in file level
9
Source: https://community.hortonworks.com/articles/60546/hdfs-snapshots-1-overview.html
Source: http://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• DistCp does great jobs in most cases
10
distcp -update -diff s0 s1 <sourceDir> <targetDir>
Source: http://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-
hadoop/
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Pain points of DistCp:
 Snapshot is not perfect
– Cannot take snapshot on root directory
– No details about Modify (need to compare files)
– Corner cases
 MapReduce is too heavy to incremental backup
– 1 container (2 GB memory & 1 core) for 10KB append
– Long launch time
 Requires administrator (failed tasks, snapshot management etc)
 No resource usage & bandwidth control
11
Source: http://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-
hadoop/
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• We want to address above problems in this way:
 No Snapshot – Track changes with log (Inotify/editlog) post-
processing
– Track changes in details (Create, Delete, Rename, Meta, Close,
Append, Truncate) v.s. Snapshot (Create, Delete, Rename and
Modify)
– Build lineage from Inotify (subset of editlog)
– Translate Events into tasks, e.g.,
– Append means copy content from primary to remote
– Delete means delete file in remote
– …
12
Append 10KB to to A
Append 100KB to
A
Rename from A to B
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
 No MapReduce – Long running services
– Server
– Tracking changes
– Scheduling tasks
– Resource and bandwidth management
– Agents
– Applying changes to remote
13
Namenode
monitoring
Copy Task
Copy Task
Copy Task
Server Copy Task
Copy Task
Copy Task
Agent
Tracking Diff
Scheduling
Tasks
Agent
Copy Task
Copy Task
Copy Task
Agent
Sometimes, they don’t even have
MapReduce
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Advantages of SSM DR (Disaster Recovery)
• Diff Detection: Log post-processing
• Near real-time diff detection
• More details than Snapshot, e.g., Modify
• Diff & Lineage is mergeable
• Limited Impact to Namenode
• Getting Inotify has impact to Namenode
• Getting EditLog has no impact
• Diff Application: Long running Services
• Light-weighted for Cluster/Data center
• Auto scale & bandwidth control
• Can be Containerized
14
Append 10KB
Append
100KB
Append 20KB
Append
130KB
Append
10MB
Append 100MB
Delete
Delete file if exists
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Background & Motivation
• Pain points (painful experience)
• Corner cases
• No more Locality
• How to handle it?
• Corner cases
• Handle general cases, such as rename
• Let tasks fail, then trigger a directly compare and copy
• No Locality
• High speed network
• Place agent near datanodes
15
Append 10KB to to A
Append 100KB to
A
Rename from A to B
Append 110KB to
A
Rename from A to B
–put command
ERROR: Can not read, file A does not exist on
primary
Append 110KB
to B
Rename from A to
B
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Build Disaster Recovery Based on SSM (Smart Storage Management, Open source on
Github)
16
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Reuse SSM modules and metadata
• SSM’s Metadata for diff detection
 SSM’s modules (Agents and actions) for diff application
• Features
• Smart management based on Rules
• Near real time incremental backup
• Less resource requirement
• High Availability & Transparent read/backup (TODO)
17
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Evaluation - Test Environment
HDFS Clusters
 Primary HDFS 4 Nodes
 Secondary HDFS 4 Nodes
SSM
 SSM Server
 3 SSM Agents installed on Primary datanodes
Testing Methodology
 Compare SSM DR with distcp
– 10K small files – 1MB
– 1K large files – 100MB
Hardware configuration
Processor Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
DRAM 256GB DDR4 16x16GB @ 2133MHz
Network 10GbE
Disks 5 x SATA(2TB) ST2000NM0011
Software configuration
OS CentOS 7.3.1611 x86_64
Hadoop 2.7.3
Metastore Mysql 5.7.19
Java JDK1.7.0_141
18
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Workload-2 1MB Files Batch Backup
• Workload:
• 10000 files * (1MB) and 60-concurrency
• Result:
• SSM DR will performer 9.2% better than DistCp
0M 10M 50M 100M
Average Time (s)
Lower is better
SSM DR Distcp
9.2%
19
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Resource Usage
• Workload:
• Batch backup files (10000 *1MB)
• Result:
• SSM DR requires less resource
than DistCp on working nodes
• 46% less CPU
• 41% less memory
20
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Workload-2 100MB Files Batch Backup
• Workload:
• 1000 files * (100MB) and 60 concurrency
• Result:
• SSM DR performs 11.4% better than DistCp
0
100
200
300
400
500
600
700
800
900
1000
0M 10M 50M 100M
Average Time (s)
Lower is better
SSM DR Distcp
11.4%
21
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Resource Usage
• Workload:
• Batch backup files (1000 *100MB)
• Result:
• SSM DR requires less resource than
DistCp on working nodes
• 26% less CPU
• 46% less memory
22
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
SSM Disaster Recovery: Example and Demo
• Rule
 file: every 500ms | path matches "/src1/*" | sync –dest hdfs://namenode:9000/dest1/
 file: every 2s | path matches "/src/*.txt" | sync –dest hdfs://namenode3:9000/dest/
23
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Thanks
• Q&A
• More details
• https://issues.apache.org/jira/browse/HDFS-7343
• https://events.static.linuxfound.org/sites/events/files/slides/ApacheBigDataEurope2016-SSM.pdf
• https://github.com/Intel-bigdata/SSM
24
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Legal Disclaimer & Optimization Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit
www.intel.com/benchmarks.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of
Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are
reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific
instruction sets covered by this notice.
Notice revision #20110804
25
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Namespace Impact on SSM’s Performance
• Workload:
• Batch backup files
• (0M, 10M, 50M, 100M) files in
namespace
• Result:
• Namespace has very limited
impact on SSM Disaster
Recovery’s performance
Run #1 Run #2 Run #3 Run #4 Run #5 Run #6 Run #7 Run #8
Average Running Time (s)
10000 * 1MB
1000 *
100MB
27
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Diff Detection
• We found: Log based works better than we even thought
• Nearly real-time Aync based on incremental diffs
• Limited or no impact to namenode
28
Methods View Advantage Dis-advantage
Directly Scan Result view Easy to develop 1. Long laterncy
2. Network overhead
3. Impact to Namenode
4. Not real-time
HDFS Snapshot Result view Easy to develop 1. A few constrains caused by
snapshot
2. Impact to namenode
3. Not real-time
Log based Continually View 1. Real time async is
possible
2. Less impact to
namenode
Hard to develop and maintain
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Diff Detection
• We found: Log based works better than we even thought
• Nearly real-time Aync based on incremental diffs
• Limited or no impact to namenode
29
Methods View Advantage Dis-advantage
Directly Scan Result view Easy to develop 1. Long laterncy
2. Network overhead
3. Impact to Namenode
4. Not real-time
HDFS Snapshot Result view Easy to develop 1. A few constrains caused by
snapshot
2. Impact to namenode
3. Not real-time
Log based Continually View 1. Real time async is
possible
2. Less impact to
namenode
Hard to develop and maintain
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Diff Application
• We found: Task based is works better
• Light-weighted: multiple tasks in one container
• Hard to develop is not a question any more: lots of distributed frameworks, e.g., Akka
30
Methods Advantage Dis-advantage
Container based (e.g.,
map-reduce)
Easy to develop 1. More resource
2. Less parallelism in the same cluster
Task based 1. Less resource usage
2. More parallelism (multiple
tasks in single container)
1. In some cases, agents needed
2. Hard to develop
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Optimization Notice
Basic Design & Experience
• Our light-weighted DR basic design
• Diff Detection without Snapshot
• Long running service monitoring Editlog
or Inotify
• Directly Scan to handle corner cases and
namespace mis-match
• Diff Application without Map-reduce
• Agents based on Akka (can be replaced
by other frameworks)
• Master dispatch diff application tasks
• Slaves apply diff to remote (multiple tasks
in one container)
31
Light-weighted for Cluster/Data center:
1. Limited Impact to HDFS (HDFS event don’t know DR e
2. Less resource requirement
3. Can be Containerized (in future)

Más contenido relacionado

La actualidad más candente

Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesCitiusTech
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop securitybigdatagurus_meetup
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera, Inc.
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersCloudera, Inc.
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 

La actualidad más candente (20)

HDFS Overview
HDFS OverviewHDFS Overview
HDFS Overview
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best Practices
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera cluster
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Oracle GoldenGate
Oracle GoldenGate Oracle GoldenGate
Oracle GoldenGate
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 

Similar a Light-weighted HDFS disaster recovery

Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Alluxio, Inc.
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAlluxio, Inc.
 
QATCodec: past, present and future
QATCodec: past, present and futureQATCodec: past, present and future
QATCodec: past, present and futureboxu42
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryDatabricks
 
Intel Technologies for High Performance Computing
Intel Technologies for High Performance ComputingIntel Technologies for High Performance Computing
Intel Technologies for High Performance ComputingIntel Software Brasil
 
What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?Michelle Holley
 
The Importance of Fast, Scalable Storage for Today’s HPC
The Importance of Fast, Scalable Storage for Today’s HPCThe Importance of Fast, Scalable Storage for Today’s HPC
The Importance of Fast, Scalable Storage for Today’s HPCIntel IT Center
 
NFF-GO (YANFF) - Yet Another Network Function Framework
NFF-GO (YANFF) - Yet Another Network Function FrameworkNFF-GO (YANFF) - Yet Another Network Function Framework
NFF-GO (YANFF) - Yet Another Network Function FrameworkMichelle Holley
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red_Hat_Storage
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachDataWorks Summit
 
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...Intel® Software
 
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
 Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive... Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...Databricks
 
HP flash optimized storage - webcast
HP flash optimized storage - webcastHP flash optimized storage - webcast
HP flash optimized storage - webcastCalvin Zito
 
Optimizing Lustre and GPFS with DDN
Optimizing Lustre and GPFS with DDNOptimizing Lustre and GPFS with DDN
Optimizing Lustre and GPFS with DDNinside-BigData.com
 
Harnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop SeriesHarnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop SeriesCloudera, Inc.
 
Unity Optimization Tips, Tricks and Tools
Unity Optimization Tips, Tricks and ToolsUnity Optimization Tips, Tricks and Tools
Unity Optimization Tips, Tricks and ToolsIntel® Software
 
New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13EDB
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...inside-BigData.com
 

Similar a Light-weighted HDFS disaster recovery (20)

Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
QATCodec: past, present and future
QATCodec: past, present and futureQATCodec: past, present and future
QATCodec: past, present and future
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Intel Technologies for High Performance Computing
Intel Technologies for High Performance ComputingIntel Technologies for High Performance Computing
Intel Technologies for High Performance Computing
 
What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?What are latest new features that DPDK brings into 2018?
What are latest new features that DPDK brings into 2018?
 
The Importance of Fast, Scalable Storage for Today’s HPC
The Importance of Fast, Scalable Storage for Today’s HPCThe Importance of Fast, Scalable Storage for Today’s HPC
The Importance of Fast, Scalable Storage for Today’s HPC
 
NFF-GO (YANFF) - Yet Another Network Function Framework
NFF-GO (YANFF) - Yet Another Network Function FrameworkNFF-GO (YANFF) - Yet Another Network Function Framework
NFF-GO (YANFF) - Yet Another Network Function Framework
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
Dynamic Resolution Techniques for Intel® Processor Graphics | SIGGRAPH 2018 T...
 
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
 Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive... Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
Optimizing Apache Spark Throughput Using Intel Optane and Intel Memory Drive...
 
HP flash optimized storage - webcast
HP flash optimized storage - webcastHP flash optimized storage - webcast
HP flash optimized storage - webcast
 
Optimizing Lustre and GPFS with DDN
Optimizing Lustre and GPFS with DDNOptimizing Lustre and GPFS with DDN
Optimizing Lustre and GPFS with DDN
 
Harnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop SeriesHarnessing the Power of Apache Hadoop Series
Harnessing the Power of Apache Hadoop Series
 
Unity Optimization Tips, Tricks and Tools
Unity Optimization Tips, Tricks and ToolsUnity Optimization Tips, Tricks and Tools
Unity Optimization Tips, Tricks and Tools
 
New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
Performance Comparison of Intel Enterprise Edition Lustre and HDFS for MapRed...
 
DDN Product Update from SC13
DDN Product Update from SC13DDN Product Update from SC13
DDN Product Update from SC13
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Último (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Light-weighted HDFS disaster recovery

  • 2. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 2 Qiyuan Gong Software Engineer, Intel OTC (Open Source Technology Center) Working on: Deep Learning on Hadoop (Canceled) SSM (Smart Storage Management for Big Data) Ph.D on Data anonymization (related to GDPR) qiyuan.gong@intel.com https://www.linkedin.com/in/qiyuangong/ Intel® Software contributions to big data and analytics. From open enterprise-ready software platforms to analytics building blocks, runtime optimizations, tools, benchmarks and use cases, Intel Software makes big data and analytics faster, easier, and more insightful. https://software.intel.com/en-us/bigdata
  • 3. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Outlines • Background & Motivation • Basic Design & Experience • Evaluation • Example & Demo 3
  • 4. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Background & Motivation • HDFS (Hadoop Disturbed File System) is widely used in production 4 Source: https://www.edureka.co/blog/hadoop-ecosystem?utm_campaign=social-media-edureka-sep- ss&utm_medium=crosspost&utm_source=quora
  • 5. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Background & Motivation • HDFS is architected to operate efficiently at scale for normal hardware failures within a datacenter • Disk, node or rack fails will not affect HDFS • But HDFS is not designed to handle datacenter failures  If data center is down, we lost everything (meta and data) 5 Source: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop- hdfs/HdfsDesign.html
  • 6. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Background & Motivation • Disaster Recovery is not equal to backup • Essential • Location Separated – High latency & limited bandwidth • No Data loss - Backup & recovery • High Availability – Switch to standby data center • Additional • Performance - Higher is better • Resource Usage - Don’t use up all resource/ bandwidth • … • But let’s focus on backup 6
  • 7. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Backup HDFS from Primary (local) to standby (remote) • Entire/Base backup – first backup, focus on performance, no tricks (compression and encryption) • Incremental backup – lots of fun & pain 7 Primary Datacenter HDFS Standby Datacenter HDFS Entire Backup Incremental Backup Quit similar to version control system, such as Git
  • 8. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Incremental Backup • Diff detection - What is changed? • Find out (incremental) difference • Building diff application tasks/list • Diff application - How to apply changes to remote? • Apply diff content to Standby Cluster • Key points • Precise diff (algorithm) • Performance (concurrency) & Resource usage (CPU, memory and bandwidth) • Corner cases – merge failed with N conflicts 8 git fetch (patch) from remote git rebase/merge Primar y Standb y Diff Detection Diff Application
  • 9. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • DistCp’s Advantages • Build-in tool/command in Hadoop • Track changes with Snapshot • Snapshot report for diff detection (Create, Delete, Rename, Modify) • Copy from Snapshot (Snapshot is a reference which helps keeping deleted blocks) • Apply changes with MapReduce • Copy contents concurrently in block level • Old version in file level 9 Source: https://community.hortonworks.com/articles/60546/hdfs-snapshots-1-overview.html Source: http://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/
  • 10. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • DistCp does great jobs in most cases 10 distcp -update -diff s0 s1 <sourceDir> <targetDir> Source: http://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache- hadoop/
  • 11. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Pain points of DistCp:  Snapshot is not perfect – Cannot take snapshot on root directory – No details about Modify (need to compare files) – Corner cases  MapReduce is too heavy to incremental backup – 1 container (2 GB memory & 1 core) for 10KB append – Long launch time  Requires administrator (failed tasks, snapshot management etc)  No resource usage & bandwidth control 11 Source: http://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache- hadoop/
  • 12. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • We want to address above problems in this way:  No Snapshot – Track changes with log (Inotify/editlog) post- processing – Track changes in details (Create, Delete, Rename, Meta, Close, Append, Truncate) v.s. Snapshot (Create, Delete, Rename and Modify) – Build lineage from Inotify (subset of editlog) – Translate Events into tasks, e.g., – Append means copy content from primary to remote – Delete means delete file in remote – … 12 Append 10KB to to A Append 100KB to A Rename from A to B
  • 13. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience  No MapReduce – Long running services – Server – Tracking changes – Scheduling tasks – Resource and bandwidth management – Agents – Applying changes to remote 13 Namenode monitoring Copy Task Copy Task Copy Task Server Copy Task Copy Task Copy Task Agent Tracking Diff Scheduling Tasks Agent Copy Task Copy Task Copy Task Agent Sometimes, they don’t even have MapReduce
  • 14. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Advantages of SSM DR (Disaster Recovery) • Diff Detection: Log post-processing • Near real-time diff detection • More details than Snapshot, e.g., Modify • Diff & Lineage is mergeable • Limited Impact to Namenode • Getting Inotify has impact to Namenode • Getting EditLog has no impact • Diff Application: Long running Services • Light-weighted for Cluster/Data center • Auto scale & bandwidth control • Can be Containerized 14 Append 10KB Append 100KB Append 20KB Append 130KB Append 10MB Append 100MB Delete Delete file if exists
  • 15. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Background & Motivation • Pain points (painful experience) • Corner cases • No more Locality • How to handle it? • Corner cases • Handle general cases, such as rename • Let tasks fail, then trigger a directly compare and copy • No Locality • High speed network • Place agent near datanodes 15 Append 10KB to to A Append 100KB to A Rename from A to B Append 110KB to A Rename from A to B –put command ERROR: Can not read, file A does not exist on primary Append 110KB to B Rename from A to B
  • 16. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Build Disaster Recovery Based on SSM (Smart Storage Management, Open source on Github) 16
  • 17. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Reuse SSM modules and metadata • SSM’s Metadata for diff detection  SSM’s modules (Agents and actions) for diff application • Features • Smart management based on Rules • Near real time incremental backup • Less resource requirement • High Availability & Transparent read/backup (TODO) 17
  • 18. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Evaluation - Test Environment HDFS Clusters  Primary HDFS 4 Nodes  Secondary HDFS 4 Nodes SSM  SSM Server  3 SSM Agents installed on Primary datanodes Testing Methodology  Compare SSM DR with distcp – 10K small files – 1MB – 1K large files – 100MB Hardware configuration Processor Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz DRAM 256GB DDR4 16x16GB @ 2133MHz Network 10GbE Disks 5 x SATA(2TB) ST2000NM0011 Software configuration OS CentOS 7.3.1611 x86_64 Hadoop 2.7.3 Metastore Mysql 5.7.19 Java JDK1.7.0_141 18
  • 19. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Workload-2 1MB Files Batch Backup • Workload: • 10000 files * (1MB) and 60-concurrency • Result: • SSM DR will performer 9.2% better than DistCp 0M 10M 50M 100M Average Time (s) Lower is better SSM DR Distcp 9.2% 19
  • 20. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Resource Usage • Workload: • Batch backup files (10000 *1MB) • Result: • SSM DR requires less resource than DistCp on working nodes • 46% less CPU • 41% less memory 20
  • 21. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Workload-2 100MB Files Batch Backup • Workload: • 1000 files * (100MB) and 60 concurrency • Result: • SSM DR performs 11.4% better than DistCp 0 100 200 300 400 500 600 700 800 900 1000 0M 10M 50M 100M Average Time (s) Lower is better SSM DR Distcp 11.4% 21
  • 22. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Resource Usage • Workload: • Batch backup files (1000 *100MB) • Result: • SSM DR requires less resource than DistCp on working nodes • 26% less CPU • 46% less memory 22
  • 23. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice SSM Disaster Recovery: Example and Demo • Rule  file: every 500ms | path matches "/src1/*" | sync –dest hdfs://namenode:9000/dest1/  file: every 2s | path matches "/src/*.txt" | sync –dest hdfs://namenode3:9000/dest/ 23
  • 24. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Thanks • Q&A • More details • https://issues.apache.org/jira/browse/HDFS-7343 • https://events.static.linuxfound.org/sites/events/files/slides/ApacheBigDataEurope2016-SSM.pdf • https://github.com/Intel-bigdata/SSM 24
  • 25. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Legal Disclaimer & Optimization Notice Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 25
  • 26.
  • 27. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Namespace Impact on SSM’s Performance • Workload: • Batch backup files • (0M, 10M, 50M, 100M) files in namespace • Result: • Namespace has very limited impact on SSM Disaster Recovery’s performance Run #1 Run #2 Run #3 Run #4 Run #5 Run #6 Run #7 Run #8 Average Running Time (s) 10000 * 1MB 1000 * 100MB 27
  • 28. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Diff Detection • We found: Log based works better than we even thought • Nearly real-time Aync based on incremental diffs • Limited or no impact to namenode 28 Methods View Advantage Dis-advantage Directly Scan Result view Easy to develop 1. Long laterncy 2. Network overhead 3. Impact to Namenode 4. Not real-time HDFS Snapshot Result view Easy to develop 1. A few constrains caused by snapshot 2. Impact to namenode 3. Not real-time Log based Continually View 1. Real time async is possible 2. Less impact to namenode Hard to develop and maintain
  • 29. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Diff Detection • We found: Log based works better than we even thought • Nearly real-time Aync based on incremental diffs • Limited or no impact to namenode 29 Methods View Advantage Dis-advantage Directly Scan Result view Easy to develop 1. Long laterncy 2. Network overhead 3. Impact to Namenode 4. Not real-time HDFS Snapshot Result view Easy to develop 1. A few constrains caused by snapshot 2. Impact to namenode 3. Not real-time Log based Continually View 1. Real time async is possible 2. Less impact to namenode Hard to develop and maintain
  • 30. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Diff Application • We found: Task based is works better • Light-weighted: multiple tasks in one container • Hard to develop is not a question any more: lots of distributed frameworks, e.g., Akka 30 Methods Advantage Dis-advantage Container based (e.g., map-reduce) Easy to develop 1. More resource 2. Less parallelism in the same cluster Task based 1. Less resource usage 2. More parallelism (multiple tasks in single container) 1. In some cases, agents needed 2. Hard to develop
  • 31. Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Basic Design & Experience • Our light-weighted DR basic design • Diff Detection without Snapshot • Long running service monitoring Editlog or Inotify • Directly Scan to handle corner cases and namespace mis-match • Diff Application without Map-reduce • Agents based on Akka (can be replaced by other frameworks) • Master dispatch diff application tasks • Slaves apply diff to remote (multiple tasks in one container) 31 Light-weighted for Cluster/Data center: 1. Limited Impact to HDFS (HDFS event don’t know DR e 2. Less resource requirement 3. Can be Containerized (in future)

Notas del editor

  1. our small file feature will address this in future
  2. our small file feature will address this in future
  3. https://www.edureka.co/blog/hadoop-ecosystem
  4. In most cases, Datanodes are hosted in the same cluster https://issues.apache.org/jira/browse/HDFS-5442
  5. DistCp as an exmaple
  6. DistCp as an exmaple
  7. Copy list
  8. Copy list
  9. Log is always behind it true event
  10. https://events.static.linuxfound.org/sites/events/files/slides/ApacheBigDataEurope2016-SSM.pdf
  11. https://events.static.linuxfound.org/sites/events/files/slides/ApacheBigDataEurope2016-SSM.pdf
  12. Because SSM requires much less resources than DistCp. So, it will
  13. CPU: SSM 6.44493046 DistCp 12.1571644 Memory: SSM 5290 DistCp 9100
  14. CPU: SSM 4.94799801 DistCp 6.74204535 Memory: SSM 5394 DistCp 10044
  15. our small file feature will address this in future
  16. DistCp as an exmaple
  17. DistCp as an exmaple
  18. DistCp as an exmaple
  19. DistCp as an exmaple