NVM Express™ in Linux*
The next step for storage

Keith Busch (NVMe architecture and implementation)
Frank Ober (Solution results on NVMe)
CPU vs. Storage Performance Gap

[Chart: relative performance on a log scale (1 to 1,000,000), 1990-2020, illustrating the widening gap between CPU and storage performance.]
Switching to SSDs

SAS and SATA SSDs:
• Drop-in replacement for HDDs
• Immediate latency benefit

But legacy software and transports prevent unlocking the media's true potential.
To Maximize IOPS…

With the legacy stack, software and protocol overhead make peak IOPS hardware-resource intensive:
• 100% CPU utilization across 10 CPUs
• 16 SSDs
PCIe* Storage Standardization (since 2009)
NVM Express™ and Linux*

Integrated into the mainline Linux* kernel since 3.3 (March 2012). Backports to earlier Linux* kernels are supported by various OS vendors.
What difference does a standard make?
| Feature | AHCI | NVMe |
|---|---|---|
| Maximum queue depth | 1 command queue, 32 commands | 65536 queues, 65536 commands per queue |
| MMIO accesses | 6 reads+writes per non-queued command; 9 reads+writes per queued command | 2 writes per command |
| Interrupts and steering | Single interrupt | 2048 MSI-X interrupts with CPU affinity |
| Parallelism | Single synchronization lock to issue a command | Per-CPU, free of lock contention |
| Command transfer efficiency | Each command requires two serialized host DRAM fetches | One 64B DMA fetch |
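The "2 writes/command" and "one 64B DMA fetch" rows can be made concrete. Below is a minimal, hypothetical C sketch of NVMe submission, assuming the NVMe 1.x queue-entry layout; the function, parameters, and tagging scheme are illustrative, not a real driver. The host builds one 64-byte entry in host memory and performs a single MMIO write to the submission-queue tail doorbell; the second per-command write is the completion-queue head doorbell after completion.

```c
#include <stdint.h>

/* 64-byte NVMe submission queue entry (NVMe 1.x layout) */
struct nvme_cmd {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t cid;                  /* command identifier */
    uint32_t nsid;                 /* namespace ID */
    uint64_t rsvd;
    uint64_t mptr;                 /* metadata pointer */
    uint64_t prp1, prp2;           /* physical region page pointers */
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};

/* Queue one NVM Read and ring the doorbell; the controller then DMAs
 * the 64B entry from host memory -- no per-field register pokes. */
static void nvme_submit_read(volatile uint32_t *sq_tail_doorbell,
                             struct nvme_cmd *sq, uint16_t *tail,
                             uint16_t depth, uint64_t buf_phys,
                             uint64_t slba, uint16_t nblocks)
{
    struct nvme_cmd c = {0};
    c.opcode = 0x02;               /* NVM command set: Read */
    c.cid    = *tail;              /* simple tag: reuse the slot index */
    c.nsid   = 1;
    c.prp1   = buf_phys;           /* physical address of the data buffer */
    c.cdw10  = (uint32_t)slba;     /* starting LBA, low 32 bits */
    c.cdw11  = (uint32_t)(slba >> 32);
    c.cdw12  = nblocks - 1;        /* number of blocks, 0's based */

    sq[*tail] = c;                 /* the one 64B entry the device fetches */
    *tail = (uint16_t)((*tail + 1) % depth);
    *sq_tail_doorbell = *tail;     /* MMIO write #1: submit */
    /* MMIO write #2 happens later: CQ head doorbell after completion. */
}
```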
Optimized per-CPU queuing
Optimizing for NUMA:

When CPUs outnumber hardware queues, neighboring CPUs share a queue (see the sketch below).
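A minimal sketch of the sharing idea, assuming CPUs are numbered node-by-node so that contiguous CPU ranges fall on the same NUMA node. This is not the kernel's actual blk-mq mapping code, just the shape of the policy:

```c
/* Map a CPU to a hardware submission queue. With enough queues, each
 * CPU gets its own; otherwise contiguous CPU ranges (ideally within
 * one NUMA node) share a queue. Illustrative only. */
static unsigned int cpu_to_hw_queue(unsigned int cpu, unsigned int nr_cpus,
                                    unsigned int nr_hw_queues)
{
    if (nr_hw_queues >= nr_cpus)
        return cpu;                                   /* private queue */

    unsigned int cpus_per_queue =
        (nr_cpus + nr_hw_queues - 1) / nr_hw_queues;  /* ceiling division */
    return cpu / cpus_per_queue;                      /* neighbors share */
}
```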
The cost of poor NUMA choices

[Chart: observed performance loss off hardware spec for randomly scheduled workloads, plotted for 1, 2, 4, 8, and 32 nodes (y-axis 0% to 35%); 2 NVMe SSDs per node, 4k random read, no CPU pinning.]

Measured by Intel and SGI on an SGI UV300 system with 32 Intel Xeon E7 v2 processors and 64 Intel SSD Data Center Family P3700 1.6TB drives, using 100% 4k random reads. SGI public reference: http://blog.sgi.com/reinventing-compute-storage-landscape/
Case study: scaling upward with more hardware (SGI*)

NUMA penalty: more than 30% of performance lost without tuning.

Intel and SGI solutions:
• irqbalance, numactl, libnuma, and custom CPU-to-queue mapping (see the sketch below)
• Up to 30 million random-read IOPS (shown at SC'14), with linear performance scaling as hardware is added
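As an illustration of the libnuma piece, the sketch below pins the calling thread and its I/O buffer to one NUMA node. Node 0 is a stand-in; in practice the device's local node can be read from /sys/block/nvme0n1/device/numa_node.

```c
/* Build with: gcc -o pin pin.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }

    int node = 0;   /* illustrative; use the SSD's node from sysfs */

    numa_run_on_node(node);                        /* pin thread to node */
    void *buf = numa_alloc_onnode(1 << 20, node);  /* 1MB node-local buffer */
    if (!buf)
        return 1;

    /* ... open /dev/nvme0n1 and issue I/O through buf here ... */

    numa_free(buf, 1 << 20);
    return 0;
}
```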
Storage Stack Comparison

SAS vs. NVMe: latency and CPU utilization reduced by more than 50%*:
• NVMe: 2.8us, 9,100 cycles
• SAS: 6.0us, 19,500 cycles
(At the processor's 3.3GHz clock, 9,100 cycles ≈ 2.8us and 19,500 cycles ≈ 5.9us, so the cycle counts match the latencies.)

* Measured by Intel on an Intel® Core™ i5-2500K 3.3GHz 6MB L3 cache quad-core desktop processor running Linux*
To Maximize IOPS…

With NVMe, the same peak IOPS takes far less hardware than with AHCI:
• 100% utilization of 3.5 CPUs (previously 10 CPUs)
• 2 SSDs (previously 16 SSDs)
The importance of reducing S/W latency

* Measured by Intel on an Intel® Core™ i5-2500K 3.3GHz 6MB L3 cache quad-core desktop processor running Linux*
Looking ahead: removing interrupts

I/O completion latency in usec, split into its hardware and operating-system shares, for 4KB and 512B transfers under three completion models:

| Completion model | Transfer size | Hardware | Operating system | Total |
|---|---|---|---|---|
| C-State | 4KB | 6.21 | 4.57 | 10.78 |
| C-State | 512B | 6.67 | 3.33 | 10.00 |
| Async | 4KB | 4.91 | 4.10 | 9.01 |
| Async | 512B | 5.01 | 2.63 | 7.64 |
| Sync | 4KB | 1.47 | 2.91 | 4.38 |
| Sync | 512B | 1.42 | 1.48 | 2.90 |
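The "Sync" case amounts to polling the completion queue rather than sleeping on an interrupt. NVMe makes this cheap: every completion entry carries a phase tag bit that the controller inverts on each pass through the queue, so the host detects a new entry with a plain memory read. A minimal, hypothetical sketch of such a poll loop (driver-internal logic, not a user-space API):

```c
#include <stdint.h>

/* 16-byte NVMe completion queue entry; bit 0 of `status` is the phase tag */
struct nvme_cqe {
    uint32_t result;      /* command-specific result (DW0) */
    uint32_t rsvd;        /* DW1 */
    uint16_t sq_head;     /* controller's view of the SQ head */
    uint16_t sq_id;
    uint16_t cid;         /* command identifier being completed */
    uint16_t status;      /* phase tag + status code */
};

/* Spin until the controller posts a new entry. `phase` starts at 1 and
 * flips every time the queue wraps; an entry is new when its phase tag
 * equals the expected phase. Spinning burns a CPU but removes the
 * interrupt and context switch from the latency path. */
static struct nvme_cqe *nvme_poll_cq(volatile struct nvme_cqe *cq,
                                     uint16_t *head, uint16_t depth,
                                     uint8_t *phase)
{
    for (;;) {
        volatile struct nvme_cqe *cqe = &cq[*head];
        if ((cqe->status & 1) == *phase) {        /* newly posted entry */
            if (++(*head) == depth) {             /* wrap around */
                *head = 0;
                *phase ^= 1;                      /* expected phase flips */
            }
            return (struct nvme_cqe *)cqe;
        }
        /* busy-wait; a real driver would bound or pause this loop */
    }
}
```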
NVM Express™ in Linux*
Modern NoSQL Databases for SSD and Flash

Frank Ober
http://communities.intel.com/people/FrankOber
@fxober / #IntelSSD
A short taxonomy of NoSQL databases…

| Type | Speed | Usage | Players |
|---|---|---|---|
| Key-value databases | Fastest | Operational | Memcached, Redis, Aerospike; cloud players use DynamoDB* (Amazon), LevelDB (Google), RocksDB* (Facebook) |
| Column-based (Big Table) | Faster | Analytics | Big Table*, Cassandra*, HBase* (Hadoop) |
| Document databases | Faster | Web documents (JSON) | MongoDB (WiredTiger* shipped in v3.0), Couchbase (ForestDB* releases in 2015) |
| Graph databases | Fast | Social graphs | Neo4j… |
Aerospike: an in-memory, flash-optimized NoSQL database
Environment

Clients:
• 3 client machines (Dell 620 dual-socket systems) to spread the load
• Dual 10Gbit networks

Server: Dell R730xd
• One primary server (a second system was used for the replication tests)
• Dual CPU socket, rack-mountable; Dell A03 board, product name 0599V5
• CPUs: 2 x Intel® Xeon® E5-2699 v3 @ 2.30GHz (max frequency 4GHz); 18 cores / 36 logical processors per CPU, 36 cores / 72 logical processors total
• Memory: 128GB DDR4 DRAM installed
• BIOS: Dell* 1.0.4, 8/28/2014
• Network adapters: Intel® Ethernet Converged Network Adapter X520-DA2 (dual-port 10G PCIe add-in card); 1 embedded 1G adapter for management; 2 x 10Gb ports for the workload
• Storage adapters: none
• Internal drives and volumes:
  - / (root): Intel SSD Data Center Family S3500, 480GB
  - /dev/nvme0n1 through /dev/nvme3n1: 4 x Intel SSD Data Center Family P3700, 1.6TB each, x4 PCIe add-in cards; 6.4TB of raw capacity for Aerospike database namespaces
• Software: Aerospike Community Edition, version 3.5.8
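For context, a minimal sketch of how these four devices might appear as a raw-device namespace in aerospike.conf; the namespace name and all sizing and tuning values here are illustrative, not the configuration used in this test:

```
# aerospike.conf fragment (illustrative values)
namespace ssd {
    replication-factor 2
    memory-size 64G                 # primary index stays in DRAM
    storage-engine device {
        device /dev/nvme0n1         # the four P3700 raw block devices
        device /dev/nvme1n1
        device /dev/nvme2n1
        device /dev/nvme3n1
        write-block-size 128K
    }
}
```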
Aerospike results

These results on NVM are fast partly because of the small record sizes. Record size also affects network usage and cluster cost, so be careful with replication and object sizes. A 50/50 read/write mix brings the numbers down considerably.

https://communities.intel.com/community/itpeernetwork/blog/2015/02/17/reaching-one-million-database-transactions-per-second-aerospike-intel-ssd
| Record size | Aerospike client threads | Total TPS | % below 1ms (reads) | % below 1ms (writes) | Std dev of read latency (ms) | Std dev of write latency (ms) | Approx. database size |
|---|---|---|---|---|---|---|---|
| 1k | 576 | 1,124,875 | 97.16 | 99.9 | 0.79 | 0.35 | 100G |
| 2k | 448 | 875,446 | 97.33 | 99.57 | 0.63 | 0.18 | 200G |
| 4k | 384 | 581,272 | 97.22 | 99.85 | 0.63 | 0.05 | 400G |
| 1k (replication) | 512 | 1,003,471 | 96.11 | 98.98 | 0.87 | 0.30 | 200G |
iostat measurements:

| Record size | Read MB/sec | Write MB/sec | Avg queue size on SSD | Avg drive latency (ms) | CPU busy % |
|---|---|---|---|---|---|
| 1k | 418 | 29 | 31 | 0.11 | 93 |
| 2k | 547 | 43 | 27 | 0.13 | 81 |
| 4k | 653 | 52 | 20 | 0.16 | 52 |
| 1k (replication) | 396 | 51 | 30 | 0.13 | 94 |

Results measured by Intel and Aerospike. For tests and configurations, see slide 22.
TCO Opportunity of In-Memory vs. In-NVM

| Storage type | Cost per GB | Transactions/socket (1k records) | Memory capacity |
|---|---|---|---|
| DRAM only | $10-15+ (DDR4) | Up to ~1.6 million TPS (1 socket) | 192GB - 768GB |
| SSD configuration | $1-3+ (PCIe SSD, retail channel) | Up to ~600k TPS per node (1 socket) | 4 x 2TB = 8TB (10# SFF NVMe servers) |

Roughly 3x lower transactions per second (~600k vs. ~1.6M), yet about 5x lower price per GB with NVM (~$2 vs. ~$12.5 at the midpoints of the ranges). Capacity is higher and cost is much lower, allowing you to do more per unit of rack.

Costs measured by Intel from a U.S.-based internet retailer.
Now let's look at NoSQL web document stores, and Couchbase 4.0…
B+ Tree structured database indexing

• Not well suited to indexing long keys, whether variable- or fixed-length: entire key strings are stored in non-leaf nodes, causing significant space overhead. (For example, with 64-byte keys and 8-byte pointers in a 4KB node, the fanout is only about 56, versus about 340 with 4-byte keys.)
• Tree depth grows quickly as more data is loaded
• I/O performance degrades significantly as the data size gets bigger
Introducing ForestDB: moving beyond the B+ Tree

Notation:
• Ki: i-th smallest key in the node
• Pi: pointer corresponding to Ki
• Vi: value corresponding to Ki
• f: fanout degree

[Diagram: a B+Tree with a root node and index (non-leaf) nodes holding (K, P) key/pointer pairs, and leaf nodes holding key/value pairs K1 V1 … Kf Vf.]

Source: ©2015 Couchbase Inc.
What ForestDB aims to achieve:

• A fast, flatter, scalable index structure for variable- or fixed-length long keys
• Targets both SSD and HDD
• Less storage space overhead
• Reduced write amplification
• Efficiency regardless of key pattern: works for keys that share a common prefix and keys that do not
• Compaction of large index or DB files is still slow, however…

Source: ©2015 Couchbase Inc.
HB+Trie (Hierarchical B+Tree-based Trie)

• A trie (prefix tree) whose nodes are B+Trees
• A key is split into a list of fixed-size chunks (sub-strings of the key): e.g. the variable-length key a83jgls83jgo29a… becomes the 4-byte chunks a83j, gls8, 3jgo, …
• Lookup descends one B+Tree per chunk: search the root B+Tree using Chunk1, the next level using Chunk2, then Chunk3, and so on until the document is reached (see the sketch below)
• Supports lexicographically ordered traversal

Source: ©2015 Couchbase Inc.
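A compact sketch of that per-chunk descent; btree_lookup() and the entry/btree types are hypothetical stand-ins for ForestDB's internals, shown only to make the algorithm concrete:

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4   /* fixed chunk size, e.g. 4 bytes */

struct btree;                       /* opaque B+Tree node of the HB+Trie */
struct entry {
    int is_doc;                     /* entry points at a document? */
    const void *doc;
    const struct btree *child;      /* next-level B+Tree otherwise */
};

/* Hypothetical helper: exact-match lookup of one fixed-size chunk. */
extern const struct entry *btree_lookup(const struct btree *t,
                                        const char *chunk, size_t n);

/* Descend one B+Tree per key chunk until a document is found. */
static const void *hbtrie_find(const struct btree *root,
                               const char *key, size_t len)
{
    const struct btree *node = root;
    for (size_t off = 0; off < len; off += CHUNK) {
        char chunk[CHUNK] = {0};
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        memcpy(chunk, key + off, n);        /* e.g. "a83j", "gls8", "3jgo" */

        const struct entry *e = btree_lookup(node, chunk, CHUNK);
        if (!e)
            return NULL;                    /* key not present */
        if (e->is_doc)
            return e->doc;                  /* unique suffix reached */
        node = e->child;                    /* descend: longer shared prefix */
    }
    return NULL;
}
```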
Lab Configuration

• Intel® Xeon® processor E5-2697 v3 @ 2.60GHz
• Cores: 28 (56 hardware threads)
• RAM: 65G
• Storage:
  - SATA SSD: Intel DC S3710 1.2TB (~$1/GB)
  - NVMe™ SSD: Intel DC P3700 1.6TB (~$2.5/GB)
• ForestDB: https://github.com/couchbase/forestdb
• ForestDB benchmark: https://github.com/couchbaselabs/ForestDB-Benchmark

Source: ©2015 Couchbase Inc.
Testing Scenarios

• Key/value store (used in the data server layer)
• Index simulation (the first place ForestDB will arrive)
• Throughput testing (parallel benchmark)

Source: ©2015 Couchbase Inc.
Summary

| Metric | K/V store, SATA | K/V store, NVMe | Indexing, SATA | Indexing, NVMe | Parallel throughput, SATA | Parallel throughput, NVMe | Benefit |
|---|---|---|---|---|---|---|---|
| Read throughput | 16678 | 25302 | 13987 | 20341 | 30755 | 47345 | Up to 50% |
| Write throughput | 4170 | 6325 | 3497 | 5209 | 7282 | 63946 | Up to 9x |
| 95% read latency | 1.745 | 1.136 | 1.853 | 1.254 | 4.0 | 7.5 | Some work needed |
| 95% write latency | 264 | 188 | 276 | 216 | 1934 | 270 | Awesome |

Results measured by Intel and Couchbase Inc. For tests and configurations, see slide 30.
Legal Disclaimer
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or
incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725,
or go to http://www.intel.com/design/literature.htm.
This document may contain information on products in the design phase of development.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual
performance. Consult other sources of information to evaluate performance as you consider your purchase.
For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or
model. Any difference in system hardware or software design or configuration may affect actual performance.
Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its
customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark
data are accurate and reflect performance of systems available for purchase.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2015 Intel Corporation. All rights reserved.
Experience NVM as a complement to DRAM