Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of remote direct memory access (RDMA) over InfiniBand and RoCE. However, with the introduction of non-volatile memory (NVM), these designs, along with the default execution models such as MapReduce and Directed Acyclic Graph (DAG), need to be re-assessed to explore opportunities for further performance gains.
In this context, we propose an accelerated execution framework (NVMD) for MapReduce and DAG that leverages the benefits of NVM and RDMA. NVMD introduces novel features for MapReduce and DAG, such as a hybrid push-and-pull shuffle mechanism and dynamic adaptation to network congestion. The design has been incorporated into Apache Hadoop and Tez. Performance results show that NVMD can achieve up to 3.65x and 3.18x improvement for Hadoop and Tez, respectively. In this talk, we will also present an NVM-aware HDFS design and its benefits for MapReduce, Spark, and HBase.
Speaker: Shashank Gugnani, PhD Student, Ohio State University
Big data processing meets non-volatile memory: opportunities and challenges
1. Big Data Processing meets Non-volatile Memory: Opportunities and Challenges
Talk at DataWorks Summit San Jose 2018
by
Shashank Gugnani, Dhabaleswar K. (DK) Panda, Xiaoyi Lu
The Ohio State University
E-mail: gugnani.2@osu.edu, panda@cse.ohio-state.edu, luxi@cse.ohio-state.edu
2. Big Data Management and Processing on Modern Clusters
• Substantial impact on designing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline): HDFS, MapReduce, Spark
3. Big Data Processing with Apache Big Data Analytics Stacks
• Major components included:
– MapReduce (Batch)
– Spark (Iterative and Interactive)
– HBase (Query)
– HDFS (Storage)
– RPC (Inter-process communication)
• Underlying Hadoop Distributed File System (HDFS) used by MapReduce, Spark, HBase, and many others
• Model scales, but the high amount of communication and I/O can be further optimized!
[Figure: Apache Big Data Analytics Stacks, with User Applications running over MapReduce, Spark, and HBase on top of Hadoop Common (RPC) and HDFS]
4. Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
[Figure: High-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth), multi-/many-core processors, accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip), and SSD/NVMe-SSD/NVRAM storage; example systems include SDSC Comet, TACC Stampede, and clouds]
5. The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x
– Plugins available for Cloudera (CDH) and Hortonworks (HDP) distributions
• RDMA for Apache HBase
• RDMA for Memcached
• Available for InfiniBand, RoCE, and Ethernet
• Support for Singularity and Docker
• User base: 285 organizations from 34 countries
• More than 26,650 downloads from the project site
• http://hibd.cse.ohio-state.edu
• Significant performance improvement with 'RDMA+DRAM'; what about RDMA + NVM?
7. Non-Volatile Memory (NVM) and NVMe-SSD
[Figures: 3D XPoint from Intel & Micron; Samsung NVMe SSD; performance of PMC Flashtec NVRAM [*]]
• Non-Volatile Memory (NVM) provides byte-addressability with persistence
• The huge explosion of data in diverse fields requires fast analysis and storage
• NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications
• Storage technology is moving rapidly towards NVM
[*] http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency/
8. NVRAM Emulation based on DRAM
• Popular methods employed by recent works to emulate the NVRAM performance model over DRAM
• Two ways:
– Emulate byte-addressable NVRAM over DRAM
– Emulate a block-based NVM device over DRAM
[Figure: Two emulation stacks. Block-based: the application issues open/read/write/close to the Virtual File System, backed by a PCMDisk block device (RAM-Disk + delay) over DRAM. Byte-addressable: the application uses mmap/memcpy/msync or pmem_memcpy_persist (DAX) through a Persistent Memory Library, which adds clflush + delay to load/store on DRAM]
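As a concrete illustration of the byte-addressable approach, the following minimal C sketch emulates an NVRAM write over DRAM by flushing the touched cache lines and busy-waiting to model the slower write latency. The function name and delay constant are illustrative assumptions, not the emulation code used in this work.

#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

#define CACHE_LINE      64
#define NVM_WRITE_DELAY 300   /* assumed extra cycles per flushed cache line */

/* Copy into a DRAM region, then flush and delay per cache line to mimic NVRAM writes. */
static void emulated_nvram_write(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    for (uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHE_LINE - 1);
         p < (uintptr_t)dst + len; p += CACHE_LINE) {
        _mm_clflush((void *)p);                 /* evict the line toward "persistence" */
        uint64_t start = __rdtsc();
        while (__rdtsc() - start < NVM_WRITE_DELAY)
            ;                                   /* model the slower NVM write */
    }
    _mm_sfence();                               /* order the flushes before returning */
}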
9. Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
10. Design Scope (NVRAM)
• Use as block device or memory?
– Block device: software overhead with I/O system calls
– Memory: usage requires application re-design
• Slower than DRAM
– Write performance ~10x worse
– Read performance ~1.5x worse
• Cost
– Costs more than DRAM
How can you modify existing systems to take advantage of the non-volatility of NVRAM?
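To make the block-device-vs-memory trade-off concrete, here is a small hedged C sketch contrasting the two access paths: the conventional system-call path, and a DAX mapping with user-level persistence via libpmem (PMDK). The file paths and helper names are illustrative assumptions, not code from this work.

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <libpmem.h>

/* (1) Block-device style: every update pays system-call and copy overhead. */
static int write_via_syscalls(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    fsync(fd);                         /* durability goes through the kernel */
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* (2) Memory style: map once, then plain stores plus user-level flushes. */
static int write_via_dax(const char *path, const void *buf, size_t len)
{
    size_t mapped_len;
    int is_pmem;
    void *dst = pmem_map_file(path, len, PMEM_FILE_CREATE, 0600,
                              &mapped_len, &is_pmem);
    if (dst == NULL)
        return -1;
    memcpy(dst, buf, len);
    pmem_persist(dst, len);            /* no system call on the data path */
    return pmem_unmap(dst, mapped_len);
}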
11. Design Scope (NVM for RDMA)
• D-to-D over RDMA: communication buffers for client and server are allocated in DRAM (common)
• D-to-N over RDMA: communication buffers for the client are allocated in DRAM; the server uses NVM
• N-to-D over RDMA: communication buffers for the client are allocated in NVM; the server uses DRAM
• N-to-N over RDMA: communication buffers for client and server are allocated in NVM
[Figure: HDFS-RDMA client/server pairs (RDMADFSClient, RDMADFSServer) with DRAM or NVM communication buffers on either end of the CPU/PCIe/NIC RDMA path, one diagram per scheme]
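A hedged sketch of what the NVM side of these schemes involves in practice: a communication buffer carved out of a DAX-mapped persistent-memory file (via libpmem) and registered with the verbs library so the NIC can DMA into it directly. The helper name and access flags chosen here are illustrative assumptions, not the NVFS/NRCIO code.

#include <infiniband/verbs.h>
#include <libpmem.h>
#include <stddef.h>

/* Allocate an NVM-backed RDMA communication buffer and register it with the NIC. */
struct ibv_mr *register_nvm_buffer(struct ibv_pd *pd, const char *pmem_path,
                                   size_t len, void **buf_out)
{
    size_t mapped_len;
    int is_pmem;
    void *buf = pmem_map_file(pmem_path, len, PMEM_FILE_CREATE, 0600,
                              &mapped_len, &is_pmem);
    if (buf == NULL)
        return NULL;

    /* register the NVM region for local writes and remote RDMA access */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, mapped_len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL) {
        pmem_unmap(buf, mapped_len);
        return NULL;
    }
    *buf_out = buf;
    return mr;
}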
12. NVRAM-aware Communication in NRCIO
[Figures: NRCIO Send/Recv over NVRAM; NRCIO RDMA_Read over NVRAM]
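The essential receiver-side step that makes such communication NVRAM-aware is persisting the received bytes before the upper layer is notified. The following simplified C fragment sketches that step with libibverbs and libpmem; it assumes the receive buffer was posted from a registered NVM region and is not the NRCIO implementation.

#include <infiniband/verbs.h>
#include <libpmem.h>

/* nvm_buf: NVM-resident receive buffer previously registered with ibv_reg_mr() */
int wait_recv_and_persist(struct ibv_cq *cq, void *nvm_buf)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);        /* poll for one receive completion */
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;

    /* flush the received bytes out of the CPU caches into NVM before acknowledging */
    pmem_persist(nvm_buf, wc.byte_len);
    return (int)wc.byte_len;
}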
13. NRCIO Send/Recv Emulation over PCM
• Comparison of communication latency using NRCIO send/receive semantics over InfiniBand QDR network and PCM memory
• High communication latencies due to slower writes to non-volatile persistent memory
• NVRAM-to-Remote-NVRAM (NVRAM-NVRAM) => ~10x overhead vs. DRAM-DRAM
• DRAM-to-Remote-NVRAM (DRAM-NVRAM) => ~8x overhead vs. DRAM-DRAM
• NVRAM-to-Remote-DRAM (NVRAM-DRAM) => ~4x overhead vs. DRAM-DRAM
[Chart: IB Send/Recv emulation over PCM; latency (us) vs. message size (1 byte to 2 MB) for DRAM-DRAM, NVRAM-DRAM, DRAM-NVRAM, NVRAM-NVRAM, and DRAM-SSD with msync per line, per 4K page, and per 8K page]
14. NRCIO RDMA-Read Emulation over PCM
[Chart: IB RDMA_READ emulation over PCM; latency (us) vs. message size (1 byte to 2 MB) for DRAM-DRAM, NVRAM-DRAM, DRAM-NVRAM, NVRAM-NVRAM, and DRAM-SSD with msync per line, per 4K page, and per 8K page]
• Communication latency with NRCIO RDMA-Read over InfiniBand QDR + PCM memory
• Communication overheads for large messages due to slower writes into NVRAM from remote memory; similar to Send/Receive
• RDMA-Read outperforms Send/Receive for large messages, as observed for DRAM-DRAM
15. Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
16. Opportunities of Using NVRAM+RDMA in HDFS
• Files are divided into fixed-sized blocks
– Blocks divided into packets
• NameNode: stores the file system namespace
• DataNode: stores data blocks in local storage devices
• Uses block replication for fault tolerance
– Replication enhances data-locality and read throughput
• Communication and I/O intensive
• Java Sockets based communication
• Data needs to be persistent, typically on SSD/HDD
[Figure: HDFS architecture with Client, NameNode, and DataNodes]
17. Design Overview of NVM and RDMA-aware HDFS (NVFS)
• Design Features
– RDMA over NVM
– HDFS I/O with NVM: Block Access and Memory Access
– Hybrid design: NVM with SSD as a hybrid storage for HDFS I/O
– Co-Design with Spark and HBase: cost-effectiveness, use-case
[Figure: NVM and RDMA-aware HDFS (NVFS) DataNode architecture: RDMA Sender, Receiver, and Replicator modules; an NVFS-BlkIO writer/reader path and an NVFS-MemIO path over NVM, with SSDs as hybrid storage; co-designed with Hadoop MapReduce, Spark, and HBase applications and benchmarks]
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, 24th International Conference on Supercomputing (ICS), June 2016
18. Proposed Design (NVFS-BlkIO)
• RDMA communication buffers allocated in NVRAM via the D-to-N approach
– Buffers used in a round-robin manner
• Data received in NVRAM buffers is encapsulated in Java I/O streams
– Written to NVM through HDFS file I/O APIs
• mmap files during HDFS I/O
[Figure: NVFS-BlkIO at the DataNode (RDMADFSServer): RDMADFSClients write and read blocks Blk1..BlkN through NVM communication buffers, with both block access and memory access paths]
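A minimal C sketch of the two mechanics this slide combines: picking NVRAM communication buffers in a round-robin manner, and pushing the received data out through ordinary file I/O (HDFS block files in the actual design). The structure and function names are illustrative assumptions, not the NVFS code.

#include <stdio.h>
#include <stddef.h>

#define NUM_BUFS 16

struct comm_buf_pool {
    void  *buf[NUM_BUFS];   /* NVRAM-backed, RDMA-registered buffers */
    size_t buf_size;
    int    next;            /* round-robin cursor */
};

/* pick the next communication buffer for an incoming packet */
static void *next_comm_buf(struct comm_buf_pool *pool)
{
    void *b = pool->buf[pool->next];
    pool->next = (pool->next + 1) % NUM_BUFS;
    return b;
}

/* persist a received packet to the block file via ordinary file I/O */
static int write_packet_to_block(FILE *block_file, const void *pkt, size_t len)
{
    return fwrite(pkt, 1, len, block_file) == len ? 0 : -1;
}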
19. Proposed Design (NVFS-MemIO): Flat Architecture
• Shared buffers for communication and storage
• Buffers allocated in NVM; managed in a linked list
– Data packets added to the tail (first free buffer)
• Does not maintain HDFS block structures
– Fast write, suboptimal read for large number of blocks
• Needs all the buffers to be registered in the communication library
[Figure: NVFS-MemIO flat architecture: RDMADFSClients write and read blocks Blk1..BlkN into a single NVM buffer list; used buffers hold (block, packet, data, next) entries, followed by the free buffers]
20. Proposed Design (NVFS-MemIO): Block Architecture
• Re-usable communication buffers
• Separate storage buffers in NVM
• Maintains HDFS block structures
• Overlapping between communication and storage
– Efficient write
• Constant time (amortized) retrieval of blocks
– Faster read over Flat Architecture
• Modularity
[Figure: NVFS-MemIO block architecture: RDMADFSClients write and read through separate NVM communication buffers; storage buffers keep a per-block linked list of packets (Pkt1..PktM), indexed by a hash table over Blk1..BlkN]
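To make the flat-vs-block contrast concrete, here is a hedged C sketch of the block architecture's lookup structure: a hash table keyed by block ID pointing to per-block packet lists, giving amortized constant-time block retrieval where the flat design would scan one long list. The types and hashing scheme are illustrative assumptions, not the NVFS code.

#include <stdlib.h>
#include <stddef.h>

#define NBUCKETS 1024

struct packet {                 /* one data packet stored in an NVM buffer */
    void *data;
    size_t len;
    struct packet *next;
};

struct block {                  /* per-block packet list, hashed by block_id */
    long block_id;
    struct packet *head, *tail;
    struct block *next;         /* chaining within a hash bucket */
};

static struct block *table[NBUCKETS];

/* constant-time (amortized) retrieval of a block's packet list */
static struct block *lookup_block(long block_id)
{
    for (struct block *b = table[block_id % NBUCKETS]; b; b = b->next)
        if (b->block_id == block_id)
            return b;
    return NULL;
}

/* append a newly received packet to its block's list */
static void append_packet(struct block *b, void *data, size_t len)
{
    struct packet *p = malloc(sizeof(*p));
    p->data = data;
    p->len = len;
    p->next = NULL;
    if (b->tail)
        b->tail->next = p;
    else
        b->head = p;
    b->tail = p;
}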
21. Evaluation with Hadoop MapReduce
[Charts: TestDFSIO average throughput (MBps) for Write and Read with HDFS (56Gbps), NVFS-BlkIO (56Gbps), and NVFS-MemIO (56Gbps), on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes)]
• TestDFSIO on SDSC Comet (32 nodes)
– Write: NVFS-MemIO gains by 4x over HDFS
– Read: NVFS-MemIO gains by 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
– Write: NVFS-MemIO gains by 4x over HDFS
– Read: NVFS-MemIO gains by 2x over HDFS
22. Evaluation with HBase
[Charts: YCSB throughput (ops/s) vs. cluster size : number of records (8:800K, 16:1600K, 32:3200K) for HDFS (56Gbps) and NVFS (56Gbps), for HBase 100% insert and HBase 50% read, 50% update]
• YCSB 100% Insert on SDSC Comet (32 nodes)
– NVFS-BlkIO gains by 21% by storing only WALs to NVM
• YCSB 50% Read, 50% Update on SDSC Comet (32 nodes)
– NVFS-BlkIO gains by 20% by storing only WALs to NVM
23. Opportunities to Use NVRAM+RDMA in MapReduce
• Map and Reduce Tasks carry out the total job execution
– Map tasks read from HDFS, operate on it, and write the intermediate data to local disk (persistent)
– Reduce tasks get these data by shuffle from NodeManagers, operate on it, and write to HDFS (persistent)
• Communication and I/O intensive; the Shuffle phase uses HTTP over Java Sockets; I/O operations typically take place on SSD/HDD
[Figure: MapReduce data flow, highlighting bulk data transfer in the shuffle and disk operations for intermediate and output data]
24. Opportunities to Use NVRAM in MapReduce-RDMA Design
[Figure: MapReduce-RDMA data flow: map tasks read input files, then spill and merge intermediate data; reduce tasks shuffle over RDMA, merge in memory, and write output files]
• All operations are in-memory
• Opportunities exist to improve the performance with NVRAM
25. NVRAM-Assisted Map Spilling in MapReduce-RDMA
[Figure: Same MapReduce-RDMA data flow, with the map-side spill and merge of intermediate data redirected to NVRAM; shuffle remains over RDMA]
Minimizes the disk operations in Spill phase
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?, PDSW-DISCS, held with SC 2016
M. W. Rahman, N. S. Islam, X. Lu, and D. K. Panda, NVMD: Non-Volatile Memory Assisted Design for Accelerating MapReduce and DAG Execution Frameworks on HPC Systems, IEEE BigData 2017
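A hedged C sketch of the spill idea: the sorted map-output buffer is copied to a DAX-mapped spill file with libpmem instead of being written to local disk. The path and function name are illustrative assumptions, not the NVMD implementation.

#include <libpmem.h>
#include <stddef.h>

/* Copy a map task's sorted output buffer into an NVRAM-backed spill file. */
static int spill_to_nvram(const char *spill_path, const void *map_output, size_t len)
{
    size_t mapped_len;
    int is_pmem;
    void *dst = pmem_map_file(spill_path, len, PMEM_FILE_CREATE, 0600,
                              &mapped_len, &is_pmem);
    if (dst == NULL)
        return -1;

    /* copy and persist in one call, avoiding the disk I/O of the spill phase */
    pmem_memcpy_persist(dst, map_output, len);
    return pmem_unmap(dst, mapped_len);
}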
26. Comparison with Sort and TeraSort
• RMR-NVM achieves 2.37x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit 55% compared to MR-IPoIB, 28% compared to RMR
• RMR-NVM achieves 2.48x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit 51% compared to MR-IPoIB, 31% compared to RMR
27. Evaluation of Intel HiBench Workloads
• We evaluate different HiBench workloads with Huge data sets on 8 nodes
• Performance benefits for Shuffle-intensive workloads compared to MR-IPoIB:
– Sort: 42% (25 GB)
– TeraSort: 39% (32 GB)
– PageRank: 21% (5 million pages)
• Other workloads:
– WordCount: 18% (25 GB)
– KMeans: 11% (100 million samples)
28. Evaluation of PUMA Workloads
• We evaluate different PUMA workloads on 8 nodes with 30 GB data size
• Performance benefits for Shuffle-intensive workloads compared to MR-IPoIB:
– AdjList: 39%
– SelfJoin: 58%
– RankedInvIndex: 39%
• Other workloads:
– SeqCount: 32%
– InvIndex: 18%
29. Presentation Outline
• NRCIO: NVM-aware RDMA-based Communication and I/O Schemes
• NRCIO for Big Data Analytics
• NVMe-SSD based Big Data Analytics
• Conclusion and Q&A
30. Overview of NVMe Standard
• NVMe is the standardized interface for PCIe SSDs
• Built on 'RDMA' principles
– Submission and completion I/O queues
– Similar semantics as RDMA send/recv queues
– Asynchronous command processing
• Up to 64K I/O queues, with up to 64K commands per queue
• Efficient small random I/O operation
• MSI/MSI-X and interrupt aggregation
[Figure: NVMe command processing (source: NVMExpress.org)]
31. Overview of NVMe-over-Fabric
• Remote access to flash with NVMe over the network
• RDMA fabric is of most importance
– Low latency makes remote access feasible
– 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
[Figure: NVMf architecture: NVMe I/O submission and completion queues mapped onto RDMA send/recv queues across the RDMA fabric; low latency overhead compared to local I/O]
32. Overview of SPDK
• High-performance storage library over NVMe
• Significant performance improvement over POSIX NVMe driver
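A condensed sketch of the user-space I/O path that gives SPDK its advantage over the POSIX driver path: allocate a DMA-able buffer, submit a write on a dedicated I/O queue pair, and poll for the completion instead of taking an interrupt. It assumes the controller and namespace handles were already obtained via spdk_nvme_probe(), and omits error handling; it is an illustration, not the evaluation code used here.

#include <stdbool.h>
#include <spdk/env.h>
#include <spdk/nvme.h>

static bool io_done;

/* completion callback invoked from spdk_nvme_qpair_process_completions() */
static void write_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    io_done = true;
}

void write_one_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    /* per-thread I/O queue pair, backed by an NVMe submission/completion queue */
    struct spdk_nvme_qpair *qpair =
        spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);

    /* pinned, DMA-able payload buffer (covers one logical block) */
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);

    spdk_nvme_ns_cmd_write(ns, qpair, buf, 0 /* lba */, 1 /* lba_count */,
                           write_complete, NULL, 0);

    while (!io_done)
        spdk_nvme_qpair_process_completions(qpair, 0);   /* busy-poll, no interrupt */

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
}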
33. Design Challenges with NVMe-SSD
• QoS
– Hardware-assisted QoS
• Persistence
– Flushing buffered data
• Performance
– Consider flash related design aspects
– Read/Write performance skew
– Garbage collection
• Virtualization
– SR-IOV hardware support
– Namespace isolation
• New software systems
– Disaggregated Storage with NVMf
– Persistent Caches
34. Evaluation with RocksDB
[Charts: RocksDB latency (us) with POSIX vs. SPDK backends, for Insert, Overwrite, and Random Read, and for Write Sync and Read Write]
• 20%, 33%, 61% improvement for Insert, Write Sync, and Read Write
• Overwrite: Compaction and flushing in background
– Low potential for improvement
• Read: Performance much worse; additional tuning/optimization required
35. Evaluation with RocksDB
[Charts: RocksDB throughput (ops/sec) with POSIX vs. SPDK backends, for Write Sync and Read Write, and for Insert, Overwrite, and Random Read]
• 25%, 50%, 160% improvement for Insert, Write Sync, and Read Write
• Overwrite: Compaction and flushing in background
– Low potential for improvement
• Read: Performance much worse; additional tuning/optimization required
36. QoS-aware SPDK Design
[Charts: Scenario 1 bandwidth (MB/s) over time for high- and medium-priority jobs with SPDK WRR vs. the OSU design; job bandwidth ratios for synthetic application scenarios 2 to 5 with SPDK-WRR, OSU-Design, and the desired ratio]
• Synthetic application scenarios with different QoS requirements
– Comparison using SPDK with Weighted Round Robin NVMe arbitration
• Near desired job bandwidth ratios
• Stable and consistent bandwidth
S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs (Under Review)
37. Conclusion and Future Work
• Exploring NVM-aware RDMA-based Communication and I/O Schemes for Big Data Analytics
• Proposed a new library, NRCIO (work-in-progress)
• Re-design HDFS storage architecture with NVRAM
• Re-design RDMA-MapReduce with NVRAM
• Design Big Data analytics stacks with NVMe and NVMf protocols
• Results are promising
• Further optimizations in NRCIO
• Co-design with more Big Data analytics frameworks
38. Thank You!
{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/