The Forefront of the Development for NVDIMM on Linux Kernel
2020/12/3
Yasunori Goto (Fujitsu Limited.)
Ruan Shiyang (Nanjing Fujitsu Nanda Software Technology Co., Ltd.)
Copyright 2020 FUJITSU LIMITED
Agenda
• Summary of the current status of NVDIMM development (Yasunori)
  • Basics of NVDIMM on Linux
  • Issues of Filesystem-DAX (Direct Access mode)
• Deep dive into solving the issues of Filesystem-DAX (Ruan)
  • Support reflink & dedupe for fsdax
  • Fix NVDIMM-based reverse mapping
• Conclusion
Self introduction
• Yasunori Goto
  • I have worked on Linux and related OSS since 2002
    • Development of the memory hotplug feature of the Linux kernel
    • Technical support for Linux kernel troubles
    • etc.
  • Currently leader of the Fujitsu Linux kernel development team
  • In the last few years, I have mainly worked on NVDIMM
    • Some enhancements for RAS of NVDIMM
      • For the fault location feature
      • For the fault prediction feature
“The ideal and reality of NVDIMM RAS”
https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version
Basics of NVDIMM on Linux
Characteristics of Non-Volatile DIMM (NVDIMM)
• Persistent memory device which can be inserted into a DIMM slot like DRAM
  • The CPU can read/write NVDIMM directly
  • It keeps data persistent even if the system is powered down or rebooted
  • Though its access latency is higher than that of DRAM, it is faster than NVMe
  • It has a larger capacity than DRAM
• Use cases
  • In-memory database
  • Hierarchical storage, distributed storage
  • Etc.
• Famous product
  • Intel Data Center Persistent Memory Module (DCPMM)
[Figure: memory/storage hierarchy: CPU cache, DRAM, NVDIMM, NVMe, SSD (SAS/SATA), HDD; axes: capacity, latency cost]
Impact of NVDIMM
• Page cache
  • It was created for SLOW I/O storage, but NVDIMM is fast enough without a page cache
• System call (switch to kernel mode)
  • Applications can read/write NVDIMM directly
  • Even a system call may be a waste of time due to switching between kernel mode and user mode
• sync()/fsync()/msync()
  • If a user process executes a CPU cache flush instruction, that is enough to make the data persistent
Traditional I/O layers become redundant for NVDIMM
A new interface is expected for NVDIMM!
NVDIMM is difficult for traditional software
• Much software still assumes that memory is VOLATILE
What will be necessary for software to use NVDIMM?
• Need to prepare for sudden power-down
  • The CPU cache is still volatile
  • If the system powers down suddenly, some data may not be stored
• Need data structure compatibility on NVDIMM
  • Software should not change data structures in NVDIMM
  • If the structures are changed, a software update will be the cause of a disaster
• Need to detect / correct corrupted data
  • If the data is broken, software needs to detect it and correct it
• Need data area management
  • Software needs to be assigned not only free areas, but also used areas so that it can reuse its data
  • In addition, the kernel must assign the area to the suitable process with an authority check
Conflict of requirements
• Filesystem is still useful
  • The filesystem gives many solutions for the previous considerations
    • Format compatibility of the filesystem
    • Data correction
      • Journaling, or CoW
    • Region management
    • Authority check
    • etc.
  • It is useful for current software
• Filesystem is too slow
  • The traditional I/O stack is too heavy
    • Calling system calls
    • Page cache
    • Block device layer
    • etc.
  • A CPU cache flush is enough to make data persistent
  • A new access interface to NVDIMM is needed for next-generation software
Interfaces of NVDIMM (1/3)
• Because of the previous reasons, Linux provides several interfaces for applications
  • Storage Access (green)
    • Applications can access NVDIMM with the traditional I/O interface, like SSD/HDD
    • So, applications can use this mode without any modification
  • Filesystem DAX (blue)
    • The page cache is skipped when you use read()/write() on Filesystem-DAX
    • Applications can access the NVDIMM area directly if they call mmap() for a file
    • Needs filesystem support
      • xfs, ext4…
    • This mode is suitable for modifying current applications for NVDIMM
[Figure: NVDIMM software stack. User layer: Traditional App, New App with Filesystem DAX, New App with Device DAX, PMDK, management command. Kernel layer: traditional filesystem, DAX-enabled filesystem, BTT driver, pmem driver, device dax driver (/dev/dax), sysfs, ACPI driver. Below: NVDIMM, ACPI.]
Interfaces of NVDIMM (2/3)
• Because of the previous reasons, Linux provides several interfaces for applications
  • Device DAX (red)
    • Applications can access the NVDIMM area directly if they call mmap() for /dev/dax
    • /dev/dax allows only open(), mmap(), and close()
      • In other words, you cannot use read()/write() or any other system call
    • For innovative new applications with NVDIMM
  • PMDK (purple) is provided
    • Set of libraries and tools for Filesystem DAX and Device DAX
[Figure: same NVDIMM software stack diagram as in (1/3)]
Interfaces of NVDIMM (3/3)
• NVDIMM is shown as device files, like storage
  • Storage Access: /dev/pmem##.#s
  • Filesystem DAX: /dev/pmem##.#
  • Device DAX: /dev/dax##.#
  • The ndctl command creates these devices when it creates a namespace
• Note: /dev/dax##.# is a character device
  • Since you cannot use read()/write() on /dev/dax##.#, you cannot use the dd command for backup
• Example of Filesystem DAX
  • You can make a filesystem on /dev/pmem##.#s or /dev/pmem##.#
$ sudo ndctl create-namespace -e "namespace0.0" -m fsdax -f
{
"dev":"namespace0.0",
"mode":"memory",
"size":"5.90 GiB (6.34 GB)",
"uuid":"dc47d0d7-7d8f-473e-9db4-1c2e473dbc8f",
"blockdev":"pmem0",
"numa_node":0
}
$ sudo ls /dev/pmem*
/dev/pmem0
PMDK (1/2)
• Set of libraries and tools for Filesystem DAX and Device DAX
• Transaction support
  • libpmemlog: for log files
  • libpmemobj: for general-purpose pmem allocation with transactions
  • libpmemblk: to create same-size blocks and update them atomically
• Low-level support
  • libpmem: for local persistent memory
  • librpmem: for remote access to persistent memory
• Support for volatile memory usage
  • libmemkind
  • libvmemcache
• libpmemkv: C, C++, Java, JS, Ruby
• Language bindings: C, C++, PCJ, LLPJ, Python
  • In development:
    • PCJ – Persistent Collection for Java
    • LLPJ – Low-Level Persistent Java Library
    • Python bindings
• Tools
  • daxio: back up a device dax namespace
  • pmempool: pool management
• You can use PMDK not only on Linux, but also on Windows
Note
• Not all components are included in this figure
• To be precise, libmemkind is not a member of PMDK, but it is a recommended library
PMDK (2/2)
Typical libraries and a tool
• libpmem
  • The low-layer library
  • Calls mmap() to use NVDIMM, calls the suitable CPU cache flush instruction, etc.
PMDK (2/2)
Typical libraries and a tool
• libpmemobj
  • The high-layer library which supports transactions on objects on DAX
  • For general use cases
  • This is the highly recommended library in PMDK
  • Users need to understand how to use its transactions
PMDK (2/2)
Typical libraries and a tool
• daxio
  • /dev/dax (Device DAX) is a character device, so you cannot use dd for backup
  • The daxio command is provided instead
New libraries of PMDK (1/2)
• libpmem2
  • New low-layer library
    • Introduces a new concept, “GRANULARITY”
      • PMEM2_GRANULARITY_PAGE: for traditional SSD/HDD
      • PMEM2_GRANULARITY_CACHE_LINE: for persistent memory (the case where the process needs to flush the CPU cache to make data persistent)
      • PMEM2_GRANULARITY_BYTE: for persistent memory (the case where the platform supports CPU cache persistency)
    • Introduces new functions to get the unsafe shutdown status and bad blocks
      • This library internally uses the library of the ndctl command to get this information
  • Its interface is different from the old libpmem
New libraries of PMDK (2/2)
• librpma
  • The second library for RDMA
    • The first library is librpmem, but it remains in experimental status because it has “no user”
    • librpma is created based on user requirements
  • Currently, the official release of the main library is planned for Q1 2021
https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf
(*) The above table is quoted from this presentation
Filesystem-DAX is still in experimental status
• Filesystem-DAX is a highly anticipated interface…
  • The way to manage NVDIMM is almost the same as with a traditional filesystem
    • Operators can use traditional commands to manage the NVDIMM area
  • If an application calls mmap() for “a file” on Filesystem DAX, it can access NVDIMM directly
  • In contrast, Device DAX requires pool management by PMDK tools
    • Otherwise, one piece of software needs to possess the whole namespace (/dev/dax)
• Unlike Device DAX, Filesystem-DAX is still in experimental status
  • The “experimental” message is shown when the filesystem is mounted with the DAX option
  • There are some issues in the kernel layer
  • The “experimental status” has lasted for a few years…
• Why is it still experimental?
• What are the obstacles?
Issues of Filesystem-DAX
• What is solved, and what are the current issues
Why is Filesystem-DAX experimental?
• In summary, there are 2 big reasons
1. Filesystem DAX combines storage and memory characteristics
  • This causes corner-case issues in Filesystem-DAX
2. Additional features were required, but they are/were difficult to implement
  • Configure DAX on/off for each inode (directory or file)
  • Coexistence with CoW filesystems
Corner Case Issue (1/3)
The most important problem is “updating metadata” of the file
[Figure: a file consists of metadata and data]
• In Filesystem-DAX, we expected that an application can make data persistent with only a CPU cache flush, as I said…
• However, this also means there is no chance for the kernel/filesystem to update metadata
  • The update time of the file may not be correct
  • If an application writes some data to a file on Filesystem DAX while a user removes some blocks of the file with truncate(2), the kernel cannot coordinate them
    • Data of the file may be lost
  • If data is transferred by DMA/RDMA to a page which is allocated on Filesystem DAX, a similar problem may occur
Corner Case Issue (2/3)
• Current status of the metadata update problems
  • General write access
    • Solved by introducing the new MAP_SYNC flag of mmap()
    • A page fault occurs on write access, and then the kernel updates the metadata
    • PMDK specifies this flag
  • DMA/RDMA data access, kernel/driver layer
    • Solved by making truncate() wait until the RDMA finishes
  • DMA/RDMA data access, user process layer (e.g. InfiniBand, video (V4L2))
    • Not solved
    • truncate() cannot wait for the completion of the transfer, because it may take too long
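The MAP_SYNC usage mentioned on this slide can be sketched as follows. This is only a minimal illustration, assuming a Linux system whose headers provide MAP_SYNC and MAP_SHARED_VALIDATE (the fallback values below are the generic Linux ones and are an assumption for other architectures). The helper name map_pmem_file() is hypothetical. When the file is not on a DAX filesystem, the MAP_SYNC mapping is expected to fail, and the sketch falls back to a plain shared mapping, in which case the caller still needs msync()/fsync().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallback values from the generic Linux headers, used only when the
 * installed libc headers do not define them (assumption: an architecture
 * that uses the generic values, such as x86-64). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Hypothetical helper: map the first `len` bytes of `fd` for read/write.
 * *synced is set to 1 if the MAP_SYNC mapping succeeded (CPU cache flushes
 * are then sufficient for persistence), or to 0 if the helper fell back to
 * a plain MAP_SHARED mapping (msync()/fsync() is still required).
 * Returns NULL if both attempts fail. */
void *map_pmem_file(int fd, size_t len, int *synced)
{
    assert(len > 0 && synced != NULL);

    /* MAP_SHARED_VALIDATE makes the kernel reject unknown flags instead of
     * silently ignoring them, so a success here really means MAP_SYNC. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *synced = 1;
        return p;
    }

    /* Typically the file is not on a DAX mount, or the kernel is too old. */
    *synced = 0;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

PMDK performs this kind of probing internally, so applications that use libpmem usually do not need to call mmap() with MAP_SYNC themselves.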
Corner Case Issue (3/3)
• Current status of the metadata update problems (same table as in (2/3))
• Workaround: “On Demand Paging (ODP)”
  • In ODP, the driver/hardware usually does not map the pages of the DMA/RDMA area for the application
  • It maps the pages when the application accesses them
  • The kernel/driver can coordinate metadata at that time
  • Newer Mellanox cards have this feature
DAX on/off for each inode (directory or file) (1/4)
• Expected use cases
  • Need for more fine-grained settings
    • Users may want to change the DAX mode for each file
  • Performance tuning
    • Since the write latency of NVDIMM is a bit higher than that of RAM, users may want to use the page cache by turning DAX off
  • Workaround when Filesystem-DAX has a bug
  • Changing the DAX attribute from an application
    • Configuration is always painful for administrators
    • If an application can detect and change it, it will be helpful for them
DAX on/off for each inode (directory or file) (2/4)
• What is/was the problem
  • If the filesystem changes the DAX attribute, it needs to switch its methods between those for a DAX file and those for a normal file, but the old methods may still be executing
  • Data in the page cache must be moved silently when the DAX attribute is turned off
    • These problems were very difficult
[Figure: a normal file (inode, struct address_space, page caches on DRAM) uses the methods for a normal file, such as xfs_vm_writepages() and xfs_iomap_set_page_dirty(); a DAX file (inode, struct address_space, no page cache) uses the methods for a DAX file, such as xfs_dax_writepages() and noop_set_page_dirty(). The methods need to change when the attribute changes.]
DAX on/off for each inode (directory or file) (3/4)
• How was it solved?
  • The DAX attribute is changed only when the inode cache is NOT loaded in memory
    • The filesystem can load suitable methods for each attribute when it reloads the inode into memory
    • Page caches of the file are also dropped
  • Users can use this feature with the new mount option: # mount … -o dax=inode
  • The DAX attribute is changed by command
    • DAX on : $ xfs_io -c 'chattr +x' <file or directory>
    • DAX off: $ xfs_io -c 'chattr -x' <file or directory>
[Figure: a normal file and a DAX file are switched by dropping the inode cache (echo 3 > /proc/sys/vm/drop_caches) and reloading the inode]
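An application can also read the per-inode DAX attribute that the xfs_io chattr commands toggle. The following is a minimal sketch, assuming the Linux FS_IOC_FSGETXATTR ioctl and the FS_XFLAG_DAX bit from <linux/fs.h>; the function name inode_dax_flag() is hypothetical. Note that this reports the attribute stored on the inode, which may differ from whether the file is currently operating in DAX mode, because the change takes effect only after the inode is reloaded.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Fallback for older kernel headers (value taken from linux/fs.h). */
#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000
#endif

/* Hypothetical helper: report the per-inode DAX attribute of `path`.
 * Returns 1 if FS_XFLAG_DAX is set, 0 if it is clear, and -1 if the file
 * cannot be opened or the filesystem does not support the ioctl. */
int inode_dax_flag(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct fsxattr fsx;
    memset(&fsx, 0, sizeof(fsx));
    int rc = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
    close(fd);
    if (rc < 0)
        return -1;

    return (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0;
}
```

A caller would typically check this value on a file or its parent directory before deciding whether to rely on CPU cache flushes alone.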
DAX on/off for each inode (directory or file) (4/4)
• Note
  • All applications which use the target file must close it to change the DAX attribute
  • The filesystem postpones changing the DAX attribute until the inode cache and page cache of the file are dropped
    • Currently, the administrator may still need to run drop_caches to achieve it
    • Because the inode cache may remain due to a race condition
# echo 3 > /proc/sys/vm/drop_caches
This operation affects all page caches and slabs on the system!
We are trying to solve this problem
• We made patches to evict the inode and dentry cache as soon as possible when the DAX attribute is changed
  • Set I_DONTCACHE and DCACHE_DONTCACHE for it
  • Sync the inode before evicting it
Coexisting with CoW (reflink/dedupe) filesystem (1/2)
• The Copy on Write feature of filesystems (xfs: reflink/dedupe, btrfs)
  • If the same data block exists in different files, the filesystem can merge it into one shared block
  • So far, it was enough if only the filesystem managed such blocks
  • In Filesystem-DAX, it is NOT enough
[Figure: offset 100 of File A and offset 200 of File B refer to the same block]
Coexisting with CoW (reflink/dedupe) filesystem (2/2)
• Problems
  • The memory management layer also needs to understand merged blocks
    • Reverse mapping for multiple files is necessary
    • A merged block has only one struct page (page->index saves the offset in the file)
    • How can the offsets of multiple files be saved?
  • Need to enhance iomap and Filesystem DAX for CoW
    • iomap is a newer interface that replaces the block I/O layer (in other words, struct buffer_head)
    • It can lock and submit I/O for multiple pages at once
    • Filesystem-DAX depends on the iomap feature, but it does not support CoW
    • Though XFS already uses iomap, btrfs also needs to use it
Ruan-san will talk about how to solve them
Deep dive to solve the issues of Filesystem-DAX
• Support reflink/dedupe for fsdax*
• Improve NVDIMM-based reverse mapping
* Filesystem-DAX
Self-introduction
• Ruan Shiyang
  • A software engineer at Fujitsu Nanda
  • Experience in embedded development
  • Currently focusing on Linux filesystems and persistent memory
Background
• fsdax is “EXPERIMENTAL”
• reflink and fsdax cannot work together
Create a reflink-featured XFS and mount it with the “dax” option
$ mkfs.xfs -m reflink=1 [...] && mount -o dax [...]
Then an error occurs
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0,
missing codepage or helper program, or other error.
The dmesg shows
XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0): DAX and reflink cannot be used together!
reflink
• Files share extents for the same data
• Advantages
  • Fast copy
  • Saves storage
• Copy on Write mechanism (CoW)
  • Copy the shared extents before writing data
[Figure: files and their extents on disk. After a copy with --reflink=always, File A and File B share Extent I. On a write, the shared extent is copied first (Extent I to Extent II) and then written: “Copy then Write”.]
fsdax (Filesystem DAX)
• A mode of an NVDIMM namespace
• Removes the page cache from the I/O path
• Allows mmap() to establish direct mappings to persistent memory media
[Figure: read()/write()/mmap() enter the VFS; the SSD/HDD path goes through the page cache and bio, while the NVDIMM path is direct access (DAX)]
Why can reflink and fsdax not work together?
• Issues
1. Need to enhance iomap and fsdax for CoW
  • The iomap interface extension
  • Support CoW in fsdax
  • Coexistence support for each filesystem
2. The memory management layer also needs to understand merged blocks
  • Improve the current NVDIMM-based reverse mapping
Need to enhance iomap and fsdax for CoW
• The iomap interface extension
• Support CoW in fsdax
• Coexistence support for each filesystem
Need to enhance iomap and fsdax for CoW
[Figure: write() path flowchart. start write, then “if DAX?”. N (SSD/HDD via VFS, page cache and bio): get destination extent, load data from disk to page cache, write data, submit bio, flush to disk; this path is labelled “CoW”. Y (NVDIMM via direct access): write data; this path is labelled “CoW not executed”.]
What must be implemented?
• iomap_begin
  • xfs_file_iomap_begin()
  • ALLOCATE a new extent for CoW
  • STORE the source extent info
• actor
  • dax_iomap_actor()
  • COPY the source data to the newly allocated extent
  • WRITE the user data
• iomap_end
  • xfs_file_iomap_end()
  • REMAP the extents of this file
• iomap
  • Introduce “srcmap”
  • Fill the “srcmap”
• fsdax
  • Add CoW for write()
  • Add CoW for mmap()
  • Add a “dax” dedupe
• XFS
  • Remap extents after CoW
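The three phases above (allocate and store the source in iomap_begin, copy then write in the actor, remap in iomap_end) can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions and not kernel code: the toy_* types and functions are hypothetical stand-ins for struct iomap, the srcmap, and the XFS/fsdax callbacks, and a "file" here consists of a single block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK  8   /* toy block size in bytes */
#define NBLK 4   /* toy device size in blocks */

struct toy_dev  { char data[NBLK][BLK]; int refs[NBLK]; };
struct toy_file { int pblk; };           /* one-block file: its physical block */
struct toy_map  { int pblk; int cow; };  /* stand-in for struct iomap */

/* "iomap_begin": if the block is shared, ALLOCATE a new block for the
 * destination and STORE the source block in srcmap. */
static int toy_begin(struct toy_dev *d, const struct toy_file *f,
                     struct toy_map *iomap, struct toy_map *srcmap)
{
    iomap->pblk = f->pblk;
    iomap->cow = 0;
    srcmap->pblk = -1;
    srcmap->cow = 0;
    if (d->refs[f->pblk] <= 1)
        return 0;                        /* not shared: write in place */
    for (int b = 0; b < NBLK; b++) {
        if (d->refs[b] == 0) {
            d->refs[b] = 1;              /* allocate the new block */
            srcmap->pblk = f->pblk;      /* where to copy from */
            iomap->pblk = b;             /* where to write to */
            iomap->cow = 1;              /* plays the role of IOMAP_COW */
            return 0;
        }
    }
    return -1;                           /* no free block */
}

/* "actor": COPY the source data first, then WRITE the user data. */
static void toy_actor(struct toy_dev *d, const struct toy_map *iomap,
                      const struct toy_map *srcmap,
                      size_t off, const char *buf, size_t len)
{
    assert(off <= BLK && len <= BLK - off);
    if (iomap->cow)
        memcpy(d->data[iomap->pblk], d->data[srcmap->pblk], BLK);
    memcpy(d->data[iomap->pblk] + off, buf, len);
}

/* "iomap_end": REMAP the file to the new block and drop the old reference. */
static void toy_end(struct toy_dev *d, struct toy_file *f,
                    const struct toy_map *iomap, const struct toy_map *srcmap)
{
    if (!iomap->cow)
        return;
    d->refs[srcmap->pblk]--;
    f->pblk = iomap->pblk;
}

/* Whole write path: begin, actor, end. Returns 0 on success, -1 on error. */
int toy_write(struct toy_dev *d, struct toy_file *f,
              size_t off, const char *buf, size_t len)
{
    struct toy_map iomap, srcmap;

    if (off > BLK || len > BLK - off)
        return -1;
    if (toy_begin(d, f, &iomap, &srcmap) != 0)
        return -1;
    toy_actor(d, &iomap, &srcmap, off, buf, len);
    toy_end(d, f, &iomap, &srcmap);
    return 0;
}
```

The point of the model is the ordering: the shared source block is never modified, and the file only starts pointing at the new block after both the copy and the user write have completed.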
iomap - Introduce “srcmap” (1/2)
• The source address for CoW
  • struct “iomap” tells where to write
  • CoW needs to know
    • start bno: where to copy from
    • length: how much data to copy
[Figure: Extent I of File A (source address, source length) is copied to the destination in File B]
iomap - Introduce “srcmap” (2/2)
• How?
  • Add a new IOMAP type called ‘IOMAP_COW’
    • Distinguishes CoW from a normal write
  • Add another struct iomap called ‘srcmap’
    • Remembers and passes the source extent info
+ #define IOMAP_COW 0x06 /* copy data from srcmap before writing */
int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
- unsigned flags, struct iomap *iomap);
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
iomap - Fill “srcmap”
• Fill “srcmap”
  • iomap: the destination extent to write to
  • srcmap: the source extent to copy from
  • Filled in ->iomap_begin()
• How?
  • Handle files with the DAX flag and shared extents
[Figure: flowchart with the steps “find extent”, “if hole?”, “if shared?”, “if DAX?”; on the CoW branch: fill srcmap, fill iomap, set IOMAP_COW]
fsdax - Add CoW for write()
• Execute CoW in the write() path
  • Use ->direct_access() to translate addresses
    • offset in the block device
    • physical memory address in pmem
• How?
  • Copy the source data before copying the user data
[Figure: flowchart. get srcmap, then “IOMAP_COW?”. Y: ->direct_access(), copy source data from srcmap (CoW), then copy user data from user. N: copy user data from user.]
fsdax - Add CoW for mmap()
• Execute CoW in the page fault
  • fsdax supports PTE faults and PMD faults
  • It also uses the iomap model
• How?
  • Copy the source data before the VMA is associated
[Figure: flowchart. get srcmap, then “IOMAP_COW?”. Y: ->direct_access(), copy source data from srcmap (CoW), then associate VMA. N: associate VMA.]
fsdax - Add a “dax” dedupe (1/2)
• Handle dedupe
  • The generic dedupe function
    • compares data in page caches
  • It is not adapted to fsdax
    • there is no page cache
[Figure: Extent I of File A and Extent II of File B are compared by a DAX compare function and found to be the same]
fsdax - Add a “dax” dedupe (2/2)
• How?
  • Introduce a DAX compare function
  • Dedupe only if both files have the same DAX flag
[Figure: flowchart. “if both DAX?” Y: DAX compare. N: “if neither DAX?” Y: generic compare. N: return error.]
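The decision in this flowchart is small enough to write down directly. The following is a minimal sketch of that branch logic only; the enum and function names are hypothetical and do not correspond to kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Which comparison routine a dedupe request would use. */
enum toy_cmp_path {
    TOY_CMP_DAX,      /* both files are DAX: compare directly in pmem */
    TOY_CMP_GENERIC,  /* neither file is DAX: compare via the page cache */
    TOY_CMP_ERROR     /* mixed DAX / non-DAX: not supported */
};

/* Hypothetical helper mirroring the flowchart: dedupe is only attempted
 * when both files have the same DAX flag. */
enum toy_cmp_path toy_choose_compare(bool a_is_dax, bool b_is_dax)
{
    if (a_is_dax && b_is_dax)
        return TOY_CMP_DAX;
    if (!a_is_dax && !b_is_dax)
        return TOY_CMP_GENERIC;
    return TOY_CMP_ERROR;
}
```

Rejecting the mixed case keeps each compare function simple: one always reads through direct access, the other always reads through the page cache.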
XFS - Remap extents after CoW
• Remap extents
  • Since a new extent is allocated, the file needs to remap it
• How?
  • Execute the remap function in ->iomap_end()
  • Clean up if CoW failed
[Figure: flowchart with the decisions “if CoW?” and “if io error?” and the actions “remap extents” and “clean up CoW extents”]
The memory management layer also needs to understand merged blocks
• Improve the current NVDIMM-based reverse mapping
When a memory failure occurs
1. Track all processes associated with the broken page on NVDIMM
2. Send a signal to kill those associated processes
[Figure: NVDIMM pages 0xA000 to 0xA005; page 0xA003 is broken and belongs to file A, which is used by processes 1 to 4]
Need to improve NVDIMM-based reverse mapping
• Currently, only 1-to-1 mapping is supported
• reflink on fsdax requires 1-to-N mapping
[Figure: the broken page 0xA003 on NVDIMM has ->mapping = mappingA and ->index = offsetA, so file A and the VMAs of processes 1 to 3 can be tracked; file B (VMAs of processes 4 and 5) and any further files sharing the page are NOT SUPPORTED]
Approach to solve this problem
• Idea 1 – per-page tracking
  • Introduce a dax-rmap rbtree
  • Track by iterating the dax-rmap rbtree
• Idea 2 – fs query tracking
  • Introduce ->storage_lost() for filesystems
  • Collect a list of shared files from ->storage_lost(), and iterate the list
• Idea 3 – dev_pagemap interface
  • Call the memory-failure handler in ->storage_lost()
  • Introduce ->memory_failure(), which is to be implemented by the driver
Idea 1 – per-page tracking (1/2)
• Store more file mappings
  • Introduce a dax-rmap rbtree
• Associate
  • 1st time: associate the file mapping and offset to the page
    • same as the old implementation
  • 2nd time: create the dax-rmap rbtree (in page->private)
    • page->index counts how many files are associated
    • insert the file mapping and offset as a node at the 2nd time and later
[Figure: old implementation: page 0xA003 stores page->mapping = mappingA and page->index = offsetA for file A's address_space and its VMAs. Improved: at the 1st association, page->mapping and page->index hold mappingA and indexA; from the 2nd association on, page->private holds the rbtree root (nodes mappingA/offsetA, mappingB/offsetB, mappingC/offsetC) and page->index holds the refcount.]
Idea 1 – per-page tracking (2/2)
• Track
  1. page ==> file mappings (new)
    • iterate the dax-rmap rbtree
  2. file mapping ==> processes
    • iterate each file mapping’s i_mmap to find the VMAs of processes
• Result
  • May cause huge overhead
    • each page contains a dax-rmap rbtree
  • Per-page tracking only works for files
    • cannot handle metadata
[Figure: memory failure flow: lock dax page, collect processes (1: + iterate dax-rmap rbtree, 2: iterate mapping->i_mmap, building the to_kill list), unmap pages, kill processes, unlock dax page]
Idea 2 – filesystem query tracking (1/2)
• Take advantage of a filesystem feature to reduce the huge overhead
  • rmapbt of XFS
    • records the owners of each data extent
  • Introduce a FS interface ->storage_lost() to return the owners list
    • used instead of the dax-rmap rbtree to reduce the huge overhead
    • queries the rmapbt for the owners of a block (broken page on NVDIMM)
    • returns a list of owners for memory failure
• Associate
  • Store the FS super block in page->mapping
    • as the specific caller when tracking
  • Count references in page->zone_device_data
    • not used in memory-failure before
[Figure: improved page fields: page->mapping = (XFS) sb, page->index = offset, page->zone_device_data = refcount]
Idea 2 – filesystem query tracking (2/2)
• Track
  1. page ==> file mappings (new)
    • iterate the owners list
  2. file mapping ==> processes
    • iterate each file mapping’s i_mmap to find the VMAs of processes
• Result
  • Reduced the huge overhead
  • However, a little overhead remains
    • extra memory space is needed for the owners list
[Figure: memory failure flow: lock dax page, collect processes (+ call ->storage_lost() to get the owners list, 1: + iterate the owners list, 2: iterate mapping->i_mmap, building the to_kill list), unmap pages, kill processes, unlock dax page]
Idea 3 – dev_pagemap interface (1/3)
• Handle memory-failure during querying to remove the overhead completely
  • Instead of creating the owners list, handle every owner during querying
    • call ->storage_lost() to query owners
    • call the memory-failure handler when an owner is found
    • iterate each owner’s i_mmap to find the VMAs of processes
• Result
  • The extra memory space is not necessary
  • Implemented 1-to-N reverse mapping
[Figure: memory failure flow: lock dax page, + call ->storage_lost(), + FS rmap query yields an owner, + memory-failure handler, collect processes (iterate owner mapping->i_mmap, building the to_kill list), unmap pages, kill processes, unlock dax page]
Idea 3 – dev_pagemap interface (2/3)
• Support more than fsdax mode
  • Dive into the driver
  • Introduce a driver interface ->memory_failure()
    • for pmem_device: call ->storage_lost()
    • for dax_device
[Figure: memory failure flow: lock dax page, + dev_pagemap->ops->memory_failure(), then either + pmem driver ->storage_lost(), FS rmap query, memory-failure handler, or + dax driver handler; then collect processes (iterate mapping->i_mmap, building the to_kill list), unmap pages, kill processes, unlock dax page]
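The key point of Idea 3, handling each owner while the filesystem query is still running instead of first building an owners list, can be sketched in user space with a callback. This is only an illustration under stated assumptions: all toy_* names are hypothetical, the array of records stands in for the filesystem's reverse-mapping data (such as the XFS rmapbt), and the pids array stands in for walking a file mapping's i_mmap.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VMAS 4
#define TOY_MAX_KILL 16

/* One file that owns a block, with the processes that map it
 * (stand-in for iterating mapping->i_mmap). */
struct toy_owner {
    int inode;
    int pids[TOY_MAX_VMAS];
    size_t npids;
};

/* One reverse-mapping record: physical block -> owning file. */
struct toy_rmap_rec {
    unsigned long pblk;
    const struct toy_owner *owner;
};

/* Handler invoked for each owner as soon as it is found. */
typedef void (*toy_owner_fn)(const struct toy_owner *owner, void *ctx);

/* Stand-in for ->storage_lost(): query the records for every owner of the
 * broken block and call the handler immediately, so that no intermediate
 * owners list is allocated. Returns the number of owners found. */
size_t toy_storage_lost(const struct toy_rmap_rec *recs, size_t nrecs,
                        unsigned long broken_pblk,
                        toy_owner_fn handler, void *ctx)
{
    size_t found = 0;

    for (size_t i = 0; i < nrecs; i++) {
        if (recs[i].pblk != broken_pblk)
            continue;
        handler(recs[i].owner, ctx);
        found++;
    }
    return found;
}

/* The to_kill list that the memory-failure handler fills. */
struct toy_kill_list {
    int pids[TOY_MAX_KILL];
    size_t n;
};

/* Stand-in for the memory-failure handler: collect the processes of one
 * owner. A process that maps several owners would appear more than once in
 * this toy version. */
void toy_collect_procs(const struct toy_owner *owner, void *ctx)
{
    struct toy_kill_list *kl = ctx;

    for (size_t i = 0; i < owner->npids; i++) {
        assert(kl->n < TOY_MAX_KILL);
        kl->pids[kl->n++] = owner->pids[i];
    }
}
```

With file A mapped by processes 1 to 3 and file B mapped by processes 4 and 5, both sharing block 0xA003, a single query visits both owners and collects all five processes, which is the 1-to-N behaviour that the 1-to-1 page->mapping scheme cannot provide.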
Idea 3 – dev_pagemap interface (3/3)
• 1-to-N reverse mapping has been implemented
  • No useless overhead
  • Compatible with all NVDIMM modes
  • Compatible with all filesystems
• Some remaining difficulties
  • Cannot obtain the filesystem from LVM (or another mapped device)
    - `get_super()` is for filesystems created directly on the device
    - not suitable for mapped devices
  • Requires each filesystem to have a feature like “rmapbt”
    - the “realtime device” of XFS does not support “rmapbt”
    - other filesystems
Conclusion
We talked about the following
• Basics of NVDIMM on Linux
• Issues of Filesystem-DAX (Direct Access mode)
• Deep dive to solve the issues of Filesystem-DAX
  • Support reflink & dedupe for fsdax
  • Fix NVDIMM-based reverse mapping
• The community has made many enhancements for NVDIMM on Linux
• We have worked on NVDIMM to remove the experimental status of Filesystem DAX
• We expect it will be achieved soon
Appendix
How to use NVDIMM in Linux
• Make a region
  • You can configure it on the BIOS screen (or with the ipmctl command for Intel DCPMM)
    • The region is created by hardware (memory controller)
  • You need to reboot to enable the created region
• Make a namespace
  • A namespace is a concept similar to a SCSI LUN
  • You can configure it with the ndctl command
• Format a filesystem (storage or Filesystem DAX)
  • If you would like to use Filesystem DAX, you need to select ext4 or xfs, which support Filesystem DAX
    • Currently, if you select xfs, the reflink option must be disabled
• Mount the filesystem (storage or Filesystem DAX) with the -o dax option
• Make a pool (Filesystem DAX or Device DAX) with the PMDK tool if necessary
Users of NVDIMM at kernel layer
• dm-writecache
  • Persistent write cache at the device mapper layer
    • dm-cache can use SSD as a cache; dm-writecache can use NVDIMM
    • This presentation is helpful
      • https://github.com/ChinaLinuxKernel/CLK2019/blob/master/clk2019-dm-writecache-04.pdf
• Kernel shows Device DAX as DRAM
  • Though Intel DCPMM has a “Memory Mode” in which users can use it as huge RAM, it has some pains
    • The DRAM capacity disappears, because DRAM becomes just a cache of NVDIMM
    • Software cannot choose between DRAM and DCPMM areas
  • With this kernel feature, the system RAM size becomes the sum of DRAM and NVDIMM
    • Use App Direct Mode and a Device DAX namespace
    • Memory hot-add the Device DAX area
    • It becomes a NUMA node

Más contenido relacionado

La actualidad más candente

Python パッケージの影響を歴史から理解してみよう!
Python パッケージの影響を歴史から理解してみよう!Python パッケージの影響を歴史から理解してみよう!
Python パッケージの影響を歴史から理解してみよう!Kir Chou
 
PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜
The Forefront of the Development for NVDIMM on Linux Kernel

  • 1. The Forefront of the Development for NVDIMM on Linux Kernel
      2020/12/3
      Yasunori Goto (Fujitsu Limited.)
      Ruan Shiyang (Nanjing Fujitsu Nanda Software Technology Co., Ltd.)
      Copyright 2020 FUJITSU LIMITED
  • 2. Agenda
      - Summary of current status of development for NVDIMM (Yasunori)
          - Basis of NVDIMM on Linux
          - Issues of Filesystem-DAX (Direct Access mode)
      - Deep dive to solve the issues of Filesystem-DAX (Ruan)
          - Support reflink & dedupe for fsdax
          - Fix NVDIMM-based Reverse mapping
      - Conclusion
  • 3. Self introduction
      - Yasunori Goto
      - I have worked on Linux and related OSS since 2002
          - Development of the memory hotplug feature of the Linux kernel
          - Technical support for Linux kernel troubles
          - etc.
      - Currently, leader of the Fujitsu Linux Kernel Development team
      - In the last few years, I have mainly worked on NVDIMM
          - Some enhancements for RAS of NVDIMM
          - For the fault location feature
          - For the fault prediction feature
      - "The ideal and reality of NVDIMM RAS"
        https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version
  • 4. Basis of NVDIMM on Linux
  • 5. Characteristics of Non-Volatile DIMM (NVDIMM)
      - Persistent memory device which can be inserted into a DIMM slot like DRAM
          - The CPU can read/write NVDIMM directly
          - It keeps data persistent even if the system is powered down or rebooted
          - Though its access latency is higher than DRAM, it is faster than NVMe
          - It offers larger capacity than DRAM
      - Use cases
          - In-memory database
          - Hierarchical storage, distributed storage
          - Etc.
      - Well-known product
          - Intel Data Center Persistent Memory Module (DCPMM)
      - [Figure: hierarchy of CPU cache, DRAM, NVDIMM, NVMe, SSD (SAS/SATA), HDD, with axes labeled Capacity and Latency/cost]
  • 6. Impact of NVDIMM
      - The traditional I/O layer becomes redundant for NVDIMM
          - Page cache: it was created for SLOW I/O storage, but NVDIMM is fast enough without the page cache
          - sync()/fsync()/msync(): if a user process calls a CPU cache flush instruction, that is enough to make data persistent
          - System call (switch to kernel mode): an application can read/write NVDIMM directly, and even a system call may be a waste of time due to switching between kernel mode and user mode
      - A new interface is expected for NVDIMM!
  • 7. NVDIMM is difficult for traditional software
      - Much software still assumes that memory is VOLATILE
      - What will be necessary for software to use NVDIMM?
          - Need to prepare for sudden power down
              - The CPU cache is still volatile
              - If the system loses power suddenly, some data may not be stored
          - Need data structure compatibility on NVDIMM
              - Data structures in NVDIMM should not change
              - If the structures are changed, a software update will cause a disaster
          - Need to detect / correct collapsing data
              - If the data is broken, software needs to detect and correct it
          - Need data area management
              - Software needs to be assigned not only free areas but also used areas, so that it can reuse its data
              - In addition, the kernel must assign the area to a suitable process with an authority check
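Slide 7 asks software to keep its on-media layout compatible and to detect broken data. One way to approach both points can be sketched in C as follows. This is a minimal sketch under stated assumptions: `region_header`, `REGION_MAGIC`, `REGION_VERSION`, `region_seal`, and `region_check` are hypothetical names invented for illustration, and the FNV-1a checksum is a stand-in; none of this comes from the talk or from PMDK, which uses its own formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifiers for this sketch only. */
#define REGION_MAGIC   0x4e56444dU
#define REGION_VERSION 1U

/* Fixed-width fields so the layout does not drift between builds. */
struct region_header {
    uint32_t magic;        /* identifies "our" data */
    uint32_t version;      /* layout version, checked before any access */
    uint64_t payload_len;  /* number of valid payload bytes */
    uint64_t checksum;     /* FNV-1a hash of the payload */
};

/* 64-bit FNV-1a, used here only as a simple corruption detector. */
static uint64_t fnv1a64(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Fill in the header for a payload that has just been written. */
static void region_seal(struct region_header *hdr,
                        const void *payload, size_t len)
{
    assert(hdr != NULL);
    hdr->magic = REGION_MAGIC;
    hdr->version = REGION_VERSION;
    hdr->payload_len = len;
    hdr->checksum = fnv1a64(payload, len);
}

/*
 * Validate a region found after restart.
 * Returns 0 if usable, -1 if the format or version is not ours,
 * -2 if the payload looks corrupted.
 */
static int region_check(const struct region_header *hdr,
                        const void *payload, size_t capacity)
{
    if (hdr->magic != REGION_MAGIC || hdr->version != REGION_VERSION)
        return -1;
    if (hdr->payload_len > capacity)
        return -2;
    if (fnv1a64(payload, hdr->payload_len) != hdr->checksum)
        return -2;
    return 0;
}
```

The version check runs before the checksum so that a newer or older layout is reported as incompatible rather than as corruption. Correcting the data, as opposed to detecting the problem, would need redundancy that this sketch does not provide.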
  • 8. Conflict of requirements
      - Filesystem is too slow
          - The traditional I/O stack is too heavy
              - Calling a system call
              - Page cache
              - Block device layer
              - etc.
          - A CPU cache flush is enough to make data persistent
          - Need a new access interface to NVDIMM for next-generation software
      - Filesystem is still useful
          - A filesystem gives many solutions for the previous considerations
              - Format compatibility of the filesystem
              - Data correction
              - Journaling, or CoW
              - Region management
              - Authority check
              - etc.
          - It is useful for current software
  • 9. Interfaces of NVDIMM (1/3)
      - For the previous reasons, Linux provides several interfaces for applications
      - Storage Access (green)
          - An application can access NVDIMM with the traditional I/O interface, like SSD/HDD
          - So an application can use this mode without any modification
      - Filesystem DAX (blue)
          - The page cache is skipped when you use read()/write() on Filesystem-DAX
          - An application can access the NVDIMM area directly if it calls mmap() for a file
          - Needs filesystem support
              - XFS, ext4…
          - This mode is suitable for modifying current applications for NVDIMM
      - [Figure: NVDIMM software stack across the user layer and kernel layer: Traditional App, New App with Filesystem DAX, New App with Device DAX, PMDK, management command, traditional filesystem, DAX enabled filesystem, BTT driver, pmem driver, device dax driver (/dev/dax), sysfs, ACPI driver, ACPI, NVDIMM]
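The Filesystem DAX path in slide 9 (mmap() a file, then load/store through the mapping) can be sketched in C as below. `store_message` and `DEMO_FILE_SIZE` are hypothetical names used only for this illustration. The sketch calls msync() as the portable way to request persistence; on a DAX-mounted filesystem the mapping refers to NVDIMM directly, while on an ordinary filesystem the same code runs through the page cache, so it can be tried without NVDIMM hardware.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_FILE_SIZE 4096  /* hypothetical size chosen for this sketch */

/*
 * Store a NUL-terminated message at the start of a file through mmap().
 * Returns 0 on success, -1 on failure.
 */
static int store_message(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    size_t n = strlen(msg) + 1;
    if (n > DEMO_FILE_SIZE)
        return -1;

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* The file must be large enough to back the whole mapping. */
    if (ftruncate(fd, DEMO_FILE_SIZE) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, DEMO_FILE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Plain CPU stores: no read()/write() system call per access. */
    memcpy(p, msg, n);

    /* Portable request to make the stores durable. */
    int rc = msync(p, DEMO_FILE_SIZE, MS_SYNC);

    munmap(p, DEMO_FILE_SIZE);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

A call such as `store_message("/mnt/pmem/demo.bin", "hello")` would exercise the DAX path if `/mnt/pmem` (a hypothetical mount point) is a filesystem mounted with the DAX option.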
  • 10. Interfaces of NVDIMM (2/3)
      - For the previous reasons, Linux provides several interfaces for applications
      - Device DAX (red)
          - An application can access the NVDIMM area directly if it calls mmap() for /dev/dax
          - /dev/dax allows only open(), mmap(), and close()
          - In other words, you cannot use read()/write() or any other system call
          - For innovative new applications with NVDIMM
      - PMDK (purple) is provided
          - Set of libraries and tools for Filesystem DAX and Device DAX
      - [Figure: same NVDIMM software stack diagram as slide 9]
  • 11. Interfaces of NVDIMM (3/3)
      - NVDIMM is shown as device files, like storage
          - Storage Access: /dev/pmem##.#s
          - Filesystem DAX: /dev/pmem##.#
          - Device DAX: /dev/dax##.#
      - The ndctl command can create these devices when it creates a namespace
      - Example of Filesystem DAX
            $ sudo ndctl create-namespace -e "namespace0.0" -m fsdax -f
            {
              "dev":"namespace0.0",
              "mode":"memory",
              "size":"5.90 GiB (6.34 GB)",
              "uuid":"dc47d0d7-7d8f-473e-9db4-1c2e473dbc8f",
              "blockdev":"pmem0",
              "numa_node":0
            }
            $ sudo ls /dev/pmem*
            /dev/pmem0
      - You can make a filesystem on /dev/pmem##.#s or /dev/pmem##.#
      - Note: /dev/dax##.# is a character device
          - Since you cannot use read()/write() on /dev/dax##.#, you cannot use the dd command for backup
  • 12. PMDK (1/2)
      - Set of libraries and tools for Filesystem DAX and Device DAX
      - Low-level support
          - libpmem: for local persistent memory
          - librpmem: for remote access to persistent memory
      - Transaction support
          - libpmemlog: for log files
          - libpmemobj: for general-purpose pmem allocation with transactions
          - libpmemblk: to create same-size blocks and update them atomically
      - Support for volatile memory usage
          - libmemkind
          - libvmemcache
      - libpmemkv: C, C++, Java, JS, Ruby
      - Language bindings: C, C++, PCJ, LLPJ, Python
          - In development: PCJ (Persistent Collection for Java), LLPJ (Low-Level Persistent Java Library), Python bindings
      - Tools
          - daxio: back up a device dax namespace
          - pmempool: pool management
      - Note
          - PMDK can be used not only on Linux but also on Windows
          - Not all components are included in this figure
          - To be precise, libmemkind is not a member of PMDK, but it is a recommended library
  • 13. PMDK (2/2): Typical libraries and a tool
      - [Figure: same PMDK component diagram as slide 12]
      - The low-layer library
          - Calls mmap() to use NVDIMM, calls a suitable CPU cache flush instruction, etc.
  • 14. PMDK (2/2): Typical libraries and a tool
      - [Figure: same PMDK component diagram as slide 12]
      - The high-layer library, which supports transactions on objects on DAX
          - For general use cases
          - This is the highly recommended library in PMDK
          - Users need to understand how to use its transactions
  • 15. PMDK (2/2): Typical libraries and a tool
      - [Figure: same PMDK component diagram as slide 12]
      - /dev/dax (Device DAX) is a character device, so you cannot use dd for backup
      - The daxio command is provided instead
  • 16. New libraries of PMDK (1/2)
      - libpmem2
          - New low-layer library
              - Introduces the new concept "GRANULARITY"
                  - PMEM2_GRANULARITY_PAGE: for traditional SSD/HDD
                  - PMEM2_GRANULARITY_CACHELINE: for persistent memory (the case where the process needs to flush the CPU cache to make data persistent)
                  - PMEM2_GRANULARITY_BYTE: for persistent memory (the case where the platform supports CPU cache persistency)
              - Introduces new functions to get the unsafe shutdown status and bad blocks
                  - This library uses the library of the ndctl command internally to get this information
          - Its interface is different from the old libpmem
  • 17. New libraries of PMDK (2/2)
      - librpma
          - The 2nd library for RDMA
              - The 1st library is librpmem, but it remains in experimental status due to "no user"
              - librpma is created based on users' requirements
          - Currently, the official release of the main library is planned for Q1/2021
      - https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf
      - (*) The table on this slide (not reproduced in this transcript) is quoted from this presentation
  • 18. Filesystem-DAX is still in experimental status
      - Filesystem-DAX is a highly anticipated interface…
          - NVDIMM is managed in almost the same way as a traditional filesystem
          - Operators can use traditional commands to manage the NVDIMM area
          - If an application calls mmap() for "a file" on Filesystem DAX, it can access NVDIMM directly
          - In contrast, Device DAX requires pool management by PMDK tools
          - Otherwise, software needs to possess the whole namespace (/dev/dax)
      - Unlike Device DAX, Filesystem-DAX is still in experimental status
          - The "experimental" message is shown when the filesystem is mounted with the DAX option
          - There are some issues in the kernel layer
          - "Experimental status" has lasted for a few years….
      - Why is it still experimental? What are the obstacles?
  • 19. Issues of Filesystem-DAX
      - What is solved, and what the current issues are
  • 20. Why is Filesystem-DAX experimental?
       In summary, there are 2 big reasons
      1. Filesystem-DAX combines storage and memory characteristics
        • This causes corner-case issues in Filesystem-DAX
      2. Additional features were required, but they are/were difficult to implement
        • Configuring DAX on/off for each inode (directory or file)
        • Co-existence with CoW filesystems
  • 21. Corner Case Issue (1/3)
      The most important problem is "updating metadata" of the file
        • In Filesystem-DAX, we expected that an application could make data persistent with only a CPU cache flush, as I said
        • However, this also means the kernel/filesystem has no chance to update the metadata
        • The update time of the file may not be correct
        • If an application writes some data to a file on Filesystem-DAX and a user removes some blocks of the file with truncate(2), the kernel cannot coordinate the two
          • Data of the file may be lost
        • If data is transferred by DMA/RDMA to a page allocated on Filesystem-DAX, a similar problem may occur
      (Figure: a file consists of metadata and data)
  • 22. Corner Case Issue (2/3)
       Current status of the update-metadata problems
        • General write access
          • Solved by introducing the new MAP_SYNC flag of mmap()
          • A page fault occurs on write access, and then the kernel updates the metadata
          • PMDK specifies this flag
        • DMA/RDMA data access, kernel/driver layer
          • Solved by making truncate() wait until the RDMA finishes
        • DMA/RDMA data access, user process layer (e.g. InfiniBand, video (V4L2))
          • Not solved
          • truncate() cannot wait for the completion of the transfer, because it may take too long
  • 23. Corner Case Issue (3/3)
       Current status of the update-metadata problems
        • General write access
          • Solved by introducing the new MAP_SYNC flag of mmap()
          • A page fault occurs on write access, and then the kernel updates the metadata
          • PMDK specifies this flag
        • DMA/RDMA data access, kernel/driver layer
          • Solved by making truncate() wait until the RDMA finishes
        • DMA/RDMA data access, user process layer (e.g. InfiniBand, video (V4L2))
          • Not solved
          • truncate() cannot wait for the completion of the transfer, because it may take too long
       Workaround: "On Demand Paging (ODP)"
        • With ODP, the driver/hardware usually does not map the pages of the DMA/RDMA area for the application in advance
        • It maps the pages when the application accesses them
        • The kernel/driver can coordinate the metadata at that time
        • Newer Mellanox cards have this feature
  • 24. DAX on/off for each inode (directory or file) (1/4)
       Expected use cases
        • Need for finer-grained settings
          • Users may want to change the DAX mode for each file
          • Performance tuning: since the write latency of NVDIMM is a bit higher than that of RAM, users may want to use the page cache by turning DAX off
          • Workaround when Filesystem-DAX has a bug
        • Changing the DAX attribute from an application
          • Configuration is always painful for administrators
          • If an application can detect and change the setting, it will be helpful for them
  • 25. DAX on/off for each inode (directory or file) (2/4)
       What is/was the problem?
        • If the filesystem changes the DAX attribute, it needs to switch its methods between those for DAX files and those for normal files, but the old methods may still be executing
        • Data in the page cache must be moved silently when the DAX attribute is turned off
        • These problems were very difficult
      (Figure: the methods in the inode's struct address_space need to change)
        • Normal file (page caches on DRAM): methods for normal files, such as xfs_vm_writepages(), xfs_iomap_set_page_dirty(), ...
        • DAX file (no page cache): methods for DAX files, such as xfs_dax_writepages(), noop_set_page_dirty(), ...
  • 26. DAX on/off for each inode (directory or file) (3/4)
       How was it solved?
        • The DAX attribute is changed only when the inode cache is NOT loaded in memory
          • The filesystem can load suitable methods for each attribute when it reloads the inode into memory
          • Page caches of the file are also dropped
        • Users can use this feature with the new mount option
          # mount … -o dax=inode
        • The DAX attribute is changed by command
          DAX on : $ xfs_io -c 'chattr +x' <file or directory>
          DAX off: $ xfs_io -c 'chattr -x' <file or directory>
      (Figure: a normal file becomes a DAX file, and vice versa, by dropping the inode cache, e.g. echo 3 > /proc/sys/vm/drop_caches, and reloading the inode)
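The rule "the attribute only takes effect when the inode is reloaded" can be illustrated with a toy model. This is a sketch in Python under simplifying assumptions, not kernel code: the `Inode` class and its `load()`, `chattr()` and `evict()` methods are hypothetical names for the on-disk flag, the cached in-memory state, and the eviction that `drop_caches` triggers.

```python
class Inode:
    """Toy inode: a persistent DAX flag plus an optional cached copy."""

    def __init__(self, on_disk_dax):
        self.on_disk_dax = on_disk_dax  # what chattr +x / -x changes
        self.cached_dax = None          # None means "not in the inode cache"

    def load(self):
        # Loading picks the methods (DAX or normal) from the on-disk flag,
        # but only if the inode is not already cached.
        if self.cached_dax is None:
            self.cached_dax = self.on_disk_dax
        return self.cached_dax

    def chattr(self, dax):
        # The flag is recorded, but a cached inode keeps its current mode.
        self.on_disk_dax = dax

    def evict(self):
        # Dropping the inode cache (and the page cache of the file).
        self.cached_dax = None
```

In this model, `chattr(True)` on a cached inode changes nothing visible until `evict()` is followed by another `load()`, which mirrors why the administrator may still need to drop caches.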
  • 27. DAX on/off for each inode (directory or file) (4/4)
       Note
        • All applications that use the target file must close it for the DAX attribute to change
        • The filesystem postpones changing the DAX attribute until the inode cache and page cache of the file are dropped
          • Currently, the administrator may still need to run drop_caches to achieve this
          • This is because the inode cache may remain due to a race condition
          # echo 3 > /proc/sys/vm/drop_caches
          This operation affects all page caches and slabs on the system!
       We are trying to solve this problem
        • We made patches to evict the inode and dentry cache as soon as possible when the DAX attribute is changed
          • Set I_DONTCACHE and DCACHE_DONTCACHE for it
          • Sync the inode before evicting it
  • 28. Coexisting with CoW (reflink/dedupe) filesystems (1/2)
       The Copy on Write feature of filesystems (xfs: reflink/dedupe, btrfs)
        • If different files contain the same data block, the filesystem can merge it into a single block
        • So far, it was enough for only the filesystem to manage such blocks
        • In Filesystem-DAX, that is NOT enough
      (Figure: File A at offset 100 and File B at offset 200 refer to the same block)
  • 29. Coexisting with CoW (reflink/dedupe) filesystems (2/2)
       Problems
        • The memory management layer also needs to understand merged blocks
          • Reverse mapping to multiple files is necessary
          • A merged block has only one struct page (page->index stores the offset in the file)
          • How can the offsets of multiple files be stored?
        • iomap and Filesystem-DAX need to be enhanced for CoW
          • iomap is a newer interface that replaces the block I/O layer (in other words, struct buffer_head)
          • It can lock and submit I/O for multiple pages at once
          • Filesystem-DAX depends on the iomap feature, but it does not support CoW
          • Though XFS already uses iomap, btrfs also needs to use it
      Ruan-san will talk about how to solve them
  • 30. Deep dive to solve the issues of Filesystem-DAX
       Support reflink/dedupe for fsdax*
       Improve NVDIMM-based reverse mapping
      * fsdax: Filesystem-DAX
  • 31. Self-introduction
       Ruan Shiyang
       A software engineer at Fujitsu Nanda
       Experience in embedded development
       Currently focusing on Linux filesystems and persistent memory
  • 32. Background
       fsdax is "EXPERIMENTAL"
       reflink and fsdax cannot work together
        • Create an XFS filesystem with the reflink feature and mount it with the "dax" option
          $ mkfs.xfs -m reflink=1 [...] && mount -o dax [...]
        • Then an error occurs
          mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0, missing codepage or helper program, or other error.
        • dmesg shows
          XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
          XFS (pmem0): DAX and reflink cannot be used together!
  • 33. reflink
       Files share extents for the same data
       Advantages
        • Fast copy
        • Saves storage
       Copy on Write mechanism (CoW)
        • Copies the shared extents before writing data
      (Figure: a normal copy duplicates Extent I of File A into Extent II for File B; a copy with --reflink=always lets File A and File B share Extent I; a later write copies the shared extent first and then writes)
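The sharing and the copy-before-write step can be modeled in a few lines. This is a minimal sketch in Python, assuming one extent per file and a simple reference count; `Disk`, `File` and `reflink_copy()` are hypothetical names, not XFS structures.

```python
class Disk:
    """Toy extent store with a reference count per extent."""

    def __init__(self):
        self.extents = {}   # extent id -> bytearray
        self.refcount = {}  # extent id -> number of files sharing it

    def alloc(self, data):
        eid = len(self.extents)
        self.extents[eid] = bytearray(data)
        self.refcount[eid] = 1
        return eid


class File:
    def __init__(self, disk, extent):
        self.disk = disk
        self.extent = extent

    def read(self):
        return bytes(self.disk.extents[self.extent])

    def write(self, offset, data):
        disk = self.disk
        if disk.refcount[self.extent] > 1:
            # Copy on Write: give this file a private copy of the shared
            # extent before any byte is modified.
            disk.refcount[self.extent] -= 1
            self.extent = disk.alloc(disk.extents[self.extent])
        disk.extents[self.extent][offset:offset + len(data)] = data


def reflink_copy(src):
    """Fast copy: the new file shares the source extent, no data is copied."""
    src.disk.refcount[src.extent] += 1
    return File(src.disk, src.extent)
```

After `b = reflink_copy(a)`, both files point at the same extent; the first `b.write(...)` allocates a new extent, so `a` keeps its original contents.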
  • 34. fsdax (Filesystem DAX)
       A mode of an NVDIMM namespace
       Removes the page cache from the I/O path
       Allows mmap() to establish direct mappings to persistent memory media
      (Figure: read(), write() and mmap() go through the VFS; SSD/HDD is reached via the page cache and bio, while NVDIMM is reached by direct access (DAX))
  • 35. Why can't reflink and fsdax work together?
       Issues
      1. iomap and fsdax need to be enhanced for CoW
        • The iomap interface extension
        • Support CoW in fsdax
        • Co-existing support for each filesystem
      2. The memory management layer also needs to understand merged blocks
        • Improve the current NVDIMM-based reverse mapping
  • 36. Need to enhance iomap and fsdax for CoW
       The iomap interface extension
       Support CoW in fsdax
       Co-existing support for each filesystem
  • 37. Need to enhance iomap and fsdax for CoW
      (Figure: flowchart of the write() path through the VFS, branching on "if DAX?")
        • Not DAX (SSD/HDD): get the destination extent, load data from disk into the page cache (CoW), write data, submit bio to flush to disk
        • DAX (NVDIMM): write data by direct access; CoW is not executed
  • 38. What must be implemented?
       iomap
        • Introduce "srcmap"
        • Fill the "srcmap"
       fsdax
        • Add CoW for write()
        • Add CoW for mmap()
        • Add a "dax" dedupe
       XFS
        • Remap extents after CoW
      (Figure: the three iomap steps)
        • iomap_begin, xfs_file_iomap_begin(): ALLOCATE a new extent for CoW; STORE the source extent info
        • actor, dax_iomap_actor(): COPY the source data to the newly allocated extent; WRITE the user data
        • iomap_end, xfs_file_iomap_end(): REMAP the extents of this file
  • 39. iomap - Introduce "srcmap" (1/2)
       The source address for CoW
        • struct iomap tells where to write
        • CoW also needs to know
          • start bno: where to copy from
          • length: how much data to copy
      (Figure: Extent I of File A is the source, with a source address and source length, and it is copied to the destination for File B)
  • 40. iomap - Introduce "srcmap" (2/2)
       How?
        • Add a new IOMAP type called 'IOMAP_COW'
          • Distinguishes CoW from a normal write
        • Add another struct iomap called 'srcmap'
          • Remembers and passes the source extent info
        + #define IOMAP_COW 0x06 /* copy data from srcmap before writing */
          int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
        -         unsigned flags, struct iomap *iomap);
        +         unsigned flags, struct iomap *iomap,
        +         struct iomap *srcmap);
  • 41. iomap - Fill "srcmap"
       Fill "srcmap"
        • iomap: the destination extent to write to
        • srcmap: the source extent to copy from
        • Both are filled in ->iomap_begin()
       How?
        • Handle files that have the DAX flag and shared extents
      (Figure: flowchart: find extent; if shared?; if hole?; if DAX?; fill srcmap; fill iomap; set IOMAP_COW)
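The decision made in ->iomap_begin() can be sketched as a plain function. This is a simplified model in Python, not the XFS implementation: `Iomap`, `Extent`, `iomap_begin()` and the `alloc_extent` callback are hypothetical, only the value 0x06 for IOMAP_COW comes from the slide, and the other type values are illustrative. The page-cache CoW path for non-DAX files and allocation into holes are deliberately left out.

```python
from dataclasses import dataclass

IOMAP_HOLE = 0x00    # illustrative value
IOMAP_MAPPED = 0x02  # illustrative value
IOMAP_COW = 0x06     # value shown on the slide


@dataclass
class Iomap:
    type: int = IOMAP_HOLE
    start: int = 0
    length: int = 0


@dataclass
class Extent:
    start: int
    length: int
    shared: bool = False


def iomap_begin(extent, is_dax, alloc_extent):
    """Return (iomap, srcmap) for a write that hits `extent`.

    alloc_extent(length) is an assumed callback that returns the start
    block of a newly allocated extent.
    """
    srcmap = Iomap()  # stays empty unless CoW is needed
    if extent is None:
        return Iomap(IOMAP_HOLE), srcmap
    if extent.shared:
        if not is_dax:
            raise NotImplementedError("page-cache CoW is outside this sketch")
        # Shared extent on a DAX file: remember the source, write elsewhere.
        srcmap = Iomap(IOMAP_MAPPED, extent.start, extent.length)
        new_start = alloc_extent(extent.length)
        return Iomap(IOMAP_COW, new_start, extent.length), srcmap
    return Iomap(IOMAP_MAPPED, extent.start, extent.length), srcmap
```

The caller can then test `iomap.type == IOMAP_COW` to know that `srcmap` describes data that must be copied before the user data is written.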
  • 42. fsdax - Add CoW for write()
       Execute CoW in the write() path
        • Use ->direct_access() to translate the address
          • from the offset in the block device
          • to the physical memory address in pmem
       How?
        • Copy the source data before copying the user data
      (Figure: flowchart: get srcmap; IOMAP_COW? If yes, ->direct_access() and copy the source data from srcmap; then copy the user data from the user)
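The order "copy the source data first, then the user data" is the core of the write() path. Below is a minimal sketch in Python that treats pmem as a `bytearray`; `BLOCK`, `direct_access()` and `dax_write()` are hypothetical stand-ins, and the dictionaries used for `iomap` and `srcmap` are illustrative rather than the kernel's struct iomap.

```python
BLOCK = 16  # hypothetical block size in bytes


def direct_access(pmem, block):
    """Translate a block number into a writable view of pmem."""
    return memoryview(pmem)[block * BLOCK:(block + 1) * BLOCK]


def dax_write(pmem, iomap, srcmap, pos, data):
    """Write `data` at byte `pos` inside the destination block."""
    dst = direct_access(pmem, iomap["block"])
    if iomap["type"] == "IOMAP_COW":
        # CoW: fill the new block with the source data first, so the bytes
        # the user does not overwrite are preserved.
        src = direct_access(pmem, srcmap["block"])
        dst[:] = src
    dst[pos:pos + len(data)] = data
```

Without the copy step, a partial write into the newly allocated block would leave the rest of that block without the original file contents.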
  • 43. fsdax - Add CoW for mmap()
       Execute CoW in the page fault
        • fsdax supports PTE faults and PMD faults
        • It also uses the iomap model
       How?
        • Copy the source data before the VMA is associated
      (Figure: flowchart: get srcmap; IOMAP_COW? If yes, ->direct_access() and copy the source data from srcmap; then associate the VMA)
  • 44. fsdax - Add a "dax" dedupe (1/2)
       Handle dedupe
        • The generic dedupe function
          • compares data in page caches
        • It is not adapted to fsdax
          • fsdax has no page cache
      (Figure: Extent I of File A and Extent II of File B are compared by a DAX compare function and found to be the same)
  • 45. fsdax - Add a "dax" dedupe (2/2)
       How?
        • Introduce a DAX compare function
        • Dedupe only if the files have the same DAX flag
      (Figure: flowchart: if both files are DAX, use the DAX compare; if neither is DAX, use the generic compare; otherwise return an error)
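The selection logic in the flowchart is small enough to write out directly. This is a sketch in Python with hypothetical names: `dax_compare()` and `generic_compare()` only stand in for "compare through direct access" and "compare through the page cache", and `pick_compare()` is the decision itself.

```python
def dax_compare(a, b):
    # Stand-in for comparing data read through direct access.
    return a == b


def generic_compare(a, b):
    # Stand-in for comparing data held in the page cache.
    return a == b


def pick_compare(a_is_dax, b_is_dax):
    """Choose the compare function; mixed DAX flags are rejected."""
    if a_is_dax and b_is_dax:
        return dax_compare
    if not a_is_dax and not b_is_dax:
        return generic_compare
    raise ValueError("files must have the same DAX flag to be deduped")
```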
  • 46. XFS - Remap extents after CoW
       Remap extents
        • Since a new extent is allocated, the file needs to remap it
       How?
        • Execute the remap function in ->iomap_end()
        • Clean up if CoW failed
      (Figure: flowchart: if CoW, then if there is an I/O error, clean up the CoW extents; otherwise remap the extents)
  • 47. The memory management layer also needs to understand merged blocks
       Improve the current NVDIMM-based reverse mapping
  • 48. When memory failure occurs
      1. Track all processes associated with the broken page on the NVDIMM
      2. Send a signal to kill those associated processes
      (Figure: NVDIMM pages 0xA000 to 0xA005, where 0xA003 is broken; the page belongs to file A, which is mapped by processes 1 to 4)
  • 49. Need to improve NVDIMM-based reverse mapping
       Currently, only 1-to-1 mapping is supported
       reflink on fsdax requires 1-to-N mapping
      (Figure: the broken page 0xA003 has ->mapping = mappingA and ->index = offsetA, which leads to file A and the VMAs of processes 1 to 3; file B, mapped by processes 4 and 5, and any further files sharing the page are NOT SUPPORTED)
  • 50. Approaches to solve this problem
       Idea 1 – per-page tracking
        • Introduce a dax-rmap rbtree
        • Track by iterating the dax-rmap rbtree
       Idea 2 – fs query tracking
        • Introduce ->storage_lost() for the filesystem
        • Collect a list of shared files from ->storage_lost(), and iterate the list
       Idea 3 – dev_pagemap interface
        • Call the memory-failure handler in ->storage_lost()
        • Introduce ->memory_failure(), which is to be implemented by the driver
  • 51. Idea 1 – per-page tracking (1/2)
       Store more file mappings
        • Introduce a dax-rmap rbtree
       Associate
        • 1st time: associate the file mapping and offset with the page
          • Same as the old implementation: page->mapping = mappingA, page->index = offsetA
        • 2nd time and later: create the dax-rmap rbtree (in page->private)
          • page->index counts how many files are associated
          • The file mapping and offset are inserted as a node
      (Figure: old implementation: page 0xA003 leads to the address_space of file A and then to the VMAs; improved: page->private holds the rbtree root with nodes (mappingA, offsetA), (mappingB, offsetB), (mappingC, offsetC), and page->index holds the refcount)
  • 52. Idea 1 – per-page tracking (2/2)
       Track
        1. page ==> file mappings (new)
          • Iterate the dax-rmap rbtree
        2. file mapping ==> processes
          • Iterate each file mapping's i_mmap to find the VMAs of processes
       Result
        • May cause huge overhead
          • Each page contains a dax-rmap rbtree
        • Per-page tracking only works for files
          • It cannot handle metadata
      (Figure: memory failure flow: lock dax page; collect processes into the to_kill list by (1) iterating the dax-rmap rbtree, which is new, and (2) iterating mapping->i_mmap; unmap pages; kill processes; unlock dax page)
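The association and tracking steps of Idea 1 can be modeled as follows. This is a sketch in Python under stated assumptions, not the proposed kernel patch: a `dict` stands in for the dax-rmap rbtree, `Page`, `associate()` and `collect_processes()` are hypothetical names, and keeping `page.mapping` unchanged after the second association is my simplification.

```python
class Page:
    """Toy struct page with the three fields used by Idea 1."""

    def __init__(self):
        self.mapping = None  # first file mapping
        self.index = None    # offset at first, owner count later
        self.private = None  # dict standing in for the dax-rmap rbtree


def associate(page, mapping, offset):
    if page.mapping is None:
        # 1st time: same as the old 1-to-1 implementation.
        page.mapping, page.index = mapping, offset
        return
    if page.private is None:
        # 2nd time: create the tree and move the first owner into it.
        page.private = {page.mapping: page.index}
    page.private[mapping] = offset
    page.index = len(page.private)  # now counts the associated files


def collect_processes(page, i_mmap):
    """Step 1: page -> file mappings; step 2: file mapping -> processes."""
    if page.private is not None:
        owners = sorted(page.private)
    elif page.mapping is not None:
        owners = [page.mapping]
    else:
        owners = []
    to_kill = []
    for mapping in owners:
        to_kill.extend(i_mmap.get(mapping, []))
    return to_kill
```

The model also makes the cost visible: every shared page carries its own container of owners, which is the overhead the slide warns about.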
  • 53. Idea 2 – filesystem query tracking (1/2)
       Take advantage of a filesystem feature to reduce the huge overhead
        • rmapbt of XFS
          • records the owners of each data extent
       Introduce a filesystem interface, ->storage_lost(), to return a list of owners
        • Used instead of the dax-rmap rbtree to reduce the huge overhead
        • Queries the rmapbt for the owners of a block (the broken page on the NVDIMM)
        • Returns a list of owners for memory failure
       Associate
        • Store the filesystem super block in page->mapping
          • as the specific callee when tracking
        • Count references in page->zone_device_data
          • which was not used in memory-failure before
      (Figure: improved page fields: page->mapping = (XFS) sb, page->index = offset, page->zone_device_data = refcount)
  • 54. Idea 2 – filesystem query tracking (2/2)
       Track
        1. page ==> file mappings (new)
          • Iterate the owners list
        2. file mapping ==> processes
          • Iterate each file mapping's i_mmap to find the VMAs of processes
       Result
        • Reduces the huge overhead
        • However, a little overhead remains
          • Extra memory space is needed for the owners list
      (Figure: memory failure flow: lock dax page; collect processes into the to_kill list by calling ->storage_lost() to get the owners list, which is new, then (1) iterating the owners list and (2) iterating mapping->i_mmap; unmap pages; kill processes; unlock dax page)
  • 55. Idea 3 – dev_pagemap interface (1/3)
       Handle memory failure during the query to remove the overhead completely
        • Instead of creating the owners list, handle every owner during the query
          • Call ->storage_lost() to query the owners
          • Call the memory-failure handler when an owner is found
          • Iterate each owner's i_mmap to find the VMAs of processes
       Result
        • The extra memory space is not necessary
        • 1-to-N reverse mapping is implemented
      (Figure: memory failure flow: lock dax page; call ->storage_lost(), which runs the FS rmap query and calls the memory-failure handler for each owner found; the handler iterates the owner's mapping->i_mmap and collects processes into the to_kill list; unmap pages; kill processes; unlock dax page)
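Handling each owner while the query is still running, instead of first building an owners list, can be sketched with a callback. This is a toy model in Python, not the kernel interface: the list of `(start, length, owner)` records stands in for the XFS rmapbt, and `storage_lost()` and `memory_failure()` are hypothetical functions that merely borrow the names used on the slides.

```python
def storage_lost(rmapbt, block, handle_owner):
    """Query the owners of `block` and handle each one as it is found.

    rmapbt is a list of (start, length, owner) records; records may overlap
    because a shared extent has several owners.
    """
    for start, length, owner in rmapbt:
        if start <= block < start + length:
            handle_owner(owner)  # no intermediate owners list is built


def memory_failure(block, rmapbt, i_mmap):
    """Collect the processes to kill for a broken block."""
    to_kill = []

    def handler(owner):
        # Per-owner memory-failure handling: walk the owner's i_mmap.
        to_kill.extend(i_mmap.get(owner, []))

    storage_lost(rmapbt, block, handler)
    return to_kill
```

Compared with returning a list from the query, the only state that grows here is the `to_kill` list, which the memory-failure path needs in any case.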
  • 56. Idea 3 – dev_pagemap interface (2/3)
       Support more than fsdax mode
       Dive into the driver
        • Introduce a driver interface, ->memory_failure()
          • for pmem_device: call ->storage_lost()
          • for dax_device
      (Figure: memory failure flow: lock dax page; call dev_pagemap->ops->memory_failure(); either the pmem driver calls ->storage_lost(), which runs the FS rmap query and the memory-failure handler, or the dax driver runs its own handler; the handler iterates mapping->i_mmap and collects processes into the to_kill list; unmap pages; kill processes; unlock dax page)
  • 57. Idea 3 – dev_pagemap interface (3/3)
       1-to-N reverse mapping has been implemented
        • No useless overhead
        • Compatible with all NVDIMM modes
        • Compatible with all filesystems
       Some remaining difficulties
        • The filesystem cannot be obtained from LVM (or another mapped device)
          - `get_super()` is for filesystems created directly on the device
          - It is not suitable for mapped devices
        • Each filesystem is required to have a feature like "rmapbt"
          - The "realtime device" of XFS does not support "rmapbt"
          - Other filesystems
  • 59. We talked about the following
       Basis of NVDIMM on Linux
       Issues of Filesystem-DAX (Direct Access mode)
       Deep dive to solve the issues of Filesystem-DAX
        • Support reflink & dedupe for fsdax
        • Fix NVDIMM-based reverse mapping
      • The community has made many enhancements for NVDIMM on Linux
      • We have worked on NVDIMM to remove the experimental status of Filesystem-DAX
      • We expect that this will be achieved soon
  • 62. How to use NVDIMM in Linux
       Make a region
        • You can configure it on the BIOS screen (or with the ipmctl command for Intel DCPMM)
          • The region is created by hardware (the memory controller)
        • You need to reboot to enable the created region
       Make a namespace
        • A namespace is a concept similar to a SCSI LUN
        • You can configure it with the ndctl command
       Format the filesystem (storage or Filesystem-DAX)
        • If you would like to use Filesystem-DAX, you need to select ext4 or xfs, which support Filesystem-DAX
          • Currently, if you select xfs, the reflink option must be disabled
       Mount the filesystem (storage or Filesystem-DAX) with the -o dax option
       Make a pool (Filesystem-DAX or Device DAX) with the PMDK tool if necessary
  • 63. Users of NVDIMM at the kernel layer
       dm-writecache
        • A persistent write cache at the device mapper layer
          • dm-cache can use an SSD as a cache; dm-writecache can use NVDIMM
          • This presentation is helpful:
          • https://github.com/ChinaLinuxKernel/CLK2019/blob/master/clk2019-dm-writecache-04.pdf
       The kernel shows Device DAX as DRAM
        • Though Intel DCPMM has "Memory Mode", in which users can use it as a huge RAM, it has some pain points
          • The DRAM capacity disappears, because DRAM becomes just a cache of the NVDIMM
          • Software cannot choose between the DRAM and DCPMM areas
        • With this kernel feature, the system RAM size becomes the sum of DRAM and NVDIMM
          • Use App Direct Mode and a Device DAX namespace
          • Hot-add the Device DAX area as memory
          • It becomes a NUMA node
