This is a talk from Open Source Summit Japan 2020.
--------------------------
NVDIMM (Non-Volatile DIMM) is one of the most interesting devices, because it has the characteristics of both memory and storage. To support NVDIMM, the Linux kernel provides three access methods for users: - Storage (Sector) mode - Filesystem DAX (=Direct Access) mode - Device DAX mode. Of these three methods, Filesystem DAX is the most anticipated, because applications can write data to the NVDIMM area directly, and it is easier to use than Device DAX mode. So, some software already uses it with official support. However, Filesystem-DAX still has "experimental" status in the upstream community due to some difficult issues. In this session, Yasunori Goto will talk about the forefront of the development of NVDIMM, and Ruan Shiyang will talk about his challenge, with the latest status since CLK2019.
1. The Forefront of the Development for NVDIMM on Linux Kernel
2020/12/3
Yasunori Goto (Fujitsu Limited.)
Ruan Shiyang (Nanjing Fujitsu Nanda Software Technology Co., Ltd.)
Copyright 2020 FUJITSU LIMITED
2. Agenda
Summary of current status of development for NVDIMM (Yasunori)
Basics of NVDIMM on Linux
Issues of Filesystem-DAX (Direct Access mode)
Deep dive to solve the issues of Filesystem-DAX (Ruan)
Support reflink & dedupe for fsdax
Fix NVDIMM-based Reverse mapping
Conclusion
3. Self introduction
Yasunori Goto
I have worked on Linux and related OSS since 2002
• Development of the memory hotplug feature of the Linux kernel
• Technical support for Linux kernel troubles
• etc.
Currently, leader of the Fujitsu Linux kernel development team
In the last few years, I have mainly worked on NVDIMM
• Some enhancements for NVDIMM RAS
• Fault location feature
• Fault prediction feature
“The ideal and reality of NVDIMM RAS”
https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version
5. Characteristics of Non-Volatile DIMM (NVDIMM)
A persistent memory device which can be inserted into a DIMM slot like DRAM
• The CPU can read/write NVDIMM directly
• It keeps data persistent even if the system is powered down or rebooted
• Though its access latency is higher than DRAM's, it is faster than NVMe
• Offers larger capacity than DRAM
Use cases
• In-memory database
• Hierarchical storage, distributed storage
• Etc...
Famous product
• Intel Data Center Persistent Memory Module (DCPMM)
[Figure: memory/storage hierarchy: CPU cache, DRAM, NVDIMM, NVMe, SSD (SAS/SATA), HDD, ordered by capacity vs. latency/cost]
6. Impact of NVDIMM
[Figure: traditional I/O path: system call (switch to kernel mode), page cache, sync()/fsync()/msync()]
The page cache was created for SLOW I/O storage, but NVDIMM is fast enough without it
If a user process issues CPU cache flush instructions, that is enough to make data persistent
• Applications can read/write NVDIMM directly
• Even a system call may be a waste of time due to switching between kernel mode and user mode
The traditional I/O layer becomes redundant for NVDIMM
A new interface is expected for NVDIMM!
7. Much software still assumes that memory is VOLATILE
NVDIMM is difficult for traditional software
What will be necessary for software to use NVDIMM?
• CPU cache is still volatile
• If the system powers down suddenly, some data may not be stored
Need to prepare for sudden power-down
• Data structures in NVDIMM should not be changed
• If the structures are changed, a software update will cause disaster
Need data-structure compatibility on NVDIMM
• If data is broken, software needs to detect it and correct it
Need to detect / correct corrupted data
• Software needs to manage not only free areas, but also used areas, to reuse their data
• In addition, the kernel must assign areas to suitable processes with authority checks
Need data-area management
8. Conflicting requirements
Filesystem is still useful
• A filesystem gives many solutions for the previous considerations
• It is useful for current software
• Format compatibility of the filesystem
• Data correction
• Journaling, or CoW
• Region management
• Authority check
• etc...
Filesystem is too slow
• The traditional I/O stack is too heavy
• A CPU cache flush is enough to make data persistent
• A new access interface to NVDIMM is needed for next-generation software
• System calls
• Page cache
• Block device layer
• etc...
9. Interfaces of NVDIMM (1/3)
Because of the previous reasons, Linux provides several interfaces for applications
Storage Access (green)
• Applications can access NVDIMM with the traditional I/O interface, like SSD/HDD
• So applications can use this mode without any modification
Filesystem DAX (blue)
• The page cache is skipped when you use read()/write() on Filesystem-DAX
• Applications can access the NVDIMM area directly by calling mmap() on a file
• Needs filesystem support
• xfs, ext4...
• This mode is suitable for modifying current applications for NVDIMM
[Figure: NVDIMM software stack: user layer (traditional app, new app with Filesystem DAX, new app with Device DAX, PMDK, management command); kernel layer (traditional filesystem, BTT driver, DAX-enabled filesystem, pmem driver, device dax driver /dev/dax, sysfs, ACPI driver); NVDIMM and ACPI hardware]
10. Interfaces of NVDIMM (2/3)
Device DAX (red)
• Applications can access the NVDIMM area directly by calling mmap() on /dev/dax
• /dev/dax allows only open(), mmap(), and close()
• IOW, you cannot use read()/write() nor any other system call
• For innovative new applications with NVDIMM
PMDK (purple) is provided
• A set of libraries and tools for Filesystem DAX and Device DAX
(Same software-stack figure as the previous slide)
11. Interfaces of NVDIMM (3/3)
NVDIMM is exposed as device files, like storage
For Storage Access: /dev/pmem##.#s
Filesystem DAX: /dev/pmem##.#
Device DAX: /dev/dax##.#
The ndctl command creates these devices when it creates a namespace
Example of Filesystem DAX
You can make a filesystem on /dev/pmem##.#s or /dev/pmem##.#
Note: /dev/dax##.# is a character device
Since you cannot use read()/write() on /dev/dax##.#, you cannot use the dd command for backup
$ sudo ndctl create-namespace -e "namespace0.0" -m fsdax -f
{
"dev":"namespace0.0",
"mode":"memory",
"size":"5.90 GiB (6.34 GB)",
"uuid":"dc47d0d7-7d8f-473e-9db4-1c2e473dbc8f",
"blockdev":"pmem0",
"numa_node":0
}
$ sudo ls /dev/pmem*
/dev/pmem0
12. PMDK (1/2)
A set of libraries and tools for Filesystem DAX and Device DAX
[Figure: PMDK components]
• libpmem / librpmem: low-level support for local / remote access to persistent memory
• libpmemobj: general-purpose pmem allocation with transaction support
• libpmemblk: create same-size blocks and update them atomically
• libpmemlog: for log files
• libpmemkv: key-value store (C, C++, Java, JS, Ruby bindings)
• libmemkind / libvmemcache: support for volatile memory usage
• Language bindings: C, C++; in development: PCJ (Persistent Collection for Java), LLPJ (Low-Level Persistent Java Library), Python bindings
• Tools: pmempool (pool management), daxio (back up a device dax namespace)
• You can use PMDK not only on Linux, but also on Windows
Note
• Not all components are included in this figure
• To be precise, libmemkind is not a member of PMDK, but it is a recommended library
13. PMDK (2/2): libpmem
Typical libraries and a tool
(Same PMDK component figure as the previous slide)
libpmem
• The low-layer library
• Calls mmap() to use NVDIMM, issues the suitable CPU cache flush instructions, etc.
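The note above says libpmem's core job is mmap() plus the right flush. As a rough illustration (a toy model, not the real libpmem API), the same store-then-flush pattern can be shown with Python's mmap on an ordinary file, where flush()/msync() stands in for the CPU cache-line flush that libpmem would issue on real pmem:

```python
import mmap

def persist_store(path, offset, data):
    """Map a file, store bytes at an offset, and flush the mapping.
    mm.flush() (msync) stands in here for the CPU cache flush
    (e.g. CLWB/CLFLUSHOPT + fence) that libpmem uses on real NVDIMM."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data
            mm.flush()  # data is durable once this returns
```

On a DAX-mounted file mapped with MAP_SYNC, the flush step would be a user-space cache flush instead of a system call, which is exactly the shortcut Filesystem-DAX enables.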
14. PMDK (2/2): libpmemobj
Typical libraries and a tool
(Same PMDK component figure as the previous slide)
libpmemobj
• The high-layer library, which supports transactions on objects on DAX
• For general use cases
• This is the most highly recommended library in PMDK
• Users need to understand how to use its transactions
15. PMDK (2/2): daxio
Typical libraries and a tool
(Same PMDK component figure as the previous slide)
daxio
• /dev/dax (device DAX) is a character device, so you cannot use dd for backup
• The daxio command is provided instead
16. New libraries of PMDK (1/2)
libpmem2
New low-layer library
• Introduces a new concept, “GRANULARITY”
• PMEM2_GRANULARITY_PAGE: for traditional SSD/HDD
• PMEM2_GRANULARITY_CACHELINE: for persistent memory (the process needs to flush the CPU cache to make data persistent)
• PMEM2_GRANULARITY_BYTE: for persistent memory (the platform keeps the CPU cache persistent)
• Introduces new functions to get the unsafe shutdown status and bad blocks
• This library uses the ndctl libraries internally to get this information
Its interface is different from the old libpmem
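The three granularities above determine what a store must do before it is persistent. As a small illustration (a hypothetical helper, not the libpmem2 API; only the constant names come from the slide), the mapping can be sketched like this:

```python
# Hypothetical mapping from libpmem2 granularity to the required flush action.
# The constant names follow the slide; the helper itself is illustrative only.
PMEM2_GRANULARITY_PAGE = "page"            # traditional SSD/HDD: page cache involved
PMEM2_GRANULARITY_CACHELINE = "cacheline"  # pmem: process must flush CPU cache lines
PMEM2_GRANULARITY_BYTE = "byte"            # pmem: platform keeps CPU caches persistent

def required_flush(granularity):
    """Return the action needed after a store before it counts as persistent."""
    return {
        PMEM2_GRANULARITY_PAGE: "msync()/fsync()",
        PMEM2_GRANULARITY_CACHELINE: "CPU cache flush (e.g. CLWB) + fence",
        PMEM2_GRANULARITY_BYTE: "store fence only",
    }[granularity]
```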
17. New libraries of PMDK (2/2)
librpma
The 2nd library for RDMA
• The 1st library was librpmem, but it remains in experimental status due to “no users”
• librpma is created based on users' requirements
Currently, the official release of the main library is planned for Q1/2021
https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf
(*) The above table is quoted from this presentation
18. Filesystem-DAX still has experimental status
Filesystem-DAX is a highly anticipated interface...
• The management of NVDIMM is almost the same as with a traditional filesystem
• Operators can use traditional commands to manage the NVDIMM area
• If an application calls mmap() for “a file” on Filesystem DAX, it can access NVDIMM directly
• In contrast, Device DAX requires pool management by the tools of PMDK
• Otherwise, one piece of software needs to possess the whole namespace (/dev/dax)
Unlike Device DAX, Filesystem-DAX still has experimental status
• The “experimental” message is shown when the filesystem is mounted with the DAX option
• There are some issues in the kernel layer
• The “experimental status” has lasted for a few years...
Why is it still experimental? What are the obstacles?
20. Why is Filesystem-DAX experimental?
In summary, there are 2 big reasons
1. Filesystem DAX combines storage and memory characteristics
• This causes corner-case issues in Filesystem-DAX
2. More additional features were required, but they are/were difficult to implement
• Configure DAX on/off for each inode (directory or file)
• Co-existence with CoW filesystems
21. Corner Case Issue (1/3)
The most important problem is updating the metadata of a file
In Filesystem-DAX, we expected that applications can make data persistent with only a CPU cache flush, as I said...
However, this also means there is no chance for the kernel/filesystem to update the metadata
• The update time of the file may not be correct
• If an application writes some data to a file on Filesystem DAX, and a user removes some blocks of the file by truncate(2), the kernel cannot coordinate them
• Data of the file may be lost
• If data is transferred by DMA/RDMA to a page which is allocated for Filesystem DAX, a similar problem may occur
22. Corner Case Issue (2/3)
Current status of the metadata update problems
General write access
• Solved by introducing the new MAP_SYNC flag of mmap()
• A page fault occurs on write access, and then the kernel updates the metadata
• PMDK specifies this flag
DMA/RDMA data access (e.g. InfiniBand, video (v4l2))
• Kernel/driver layer: solved by making truncate() wait until the DMA/RDMA finishes
• User process layer: not solved
• truncate() cannot wait for the completion of the transfer, because it may take too long
23. Corner Case Issue (3/3)
Current status of the metadata update problems
(Same status table as the previous slide)
Workaround: “On Demand Paging (ODP)”
• With ODP, the driver/hardware does not usually map the pages of the DMA/RDMA area for the application
• It maps the pages when the application accesses them
• The kernel/driver can coordinate the metadata at that time
• Newer Mellanox cards have this feature
24. DAX on/off for each inode (directory or file) (1/4)
Expected use-cases
Users may want to change the DAX mode for each file
Performance tuning
• Since the write latency of NVDIMM is a bit higher than RAM's, users may want to use the page cache by turning DAX off
Changing the DAX attribute from the application
• Configuration is always painful for administrators
• If an application can detect and change it, that will be helpful for them
Workaround when Filesystem-DAX has a bug
Need more fine-grained settings
25. DAX on/off for each inode (directory or file) (2/4)
What is/was the problem
If the filesystem changes the DAX attribute, it needs to switch the inode's methods between DAX and normal files, but they may still be executing
Data in the page cache must be moved silently when the DAX attribute is turned off
• These problems were very difficult
[Figure: a normal file's address_space has page caches on DRAM and methods such as xfs_vm_writepages() / xfs_iomap_set_page_dirty(); a DAX file has no page cache and methods such as xfs_dax_writepages() / noop_set_page_dirty(); the methods need to be changed]
26. DAX on/off for each inode (directory or file) (3/4)
How was it solved?
The DAX attribute is changed only when its inode cache is NOT loaded in memory
• The filesystem can load the suitable methods for each attribute when it reloads the inode into memory
• Page caches of the file are also dropped
• Users can use this feature with the new mount option: # mount ... -o dax=inode
• The DAX attribute is changed by command:
DAX on : $ xfs_io -c 'chattr +x' <file or directory>
DAX off: $ xfs_io -c 'chattr -x' <file or directory>
• If necessary, caches are dropped with: echo 3 > /proc/sys/vm/drop_caches
27. DAX on/off for each inode (directory or file) (4/4)
Note
All applications which use the target file must close it to change the DAX attribute
The filesystem postpones changing the DAX attribute until the inode cache and page cache of the file are dropped
• Currently, administrators may still need to run drop_caches to achieve it
• Because the inode cache may remain due to a race condition
# echo 3 > /proc/sys/vm/drop_caches
This operation affects ALL page caches and slabs on the system!
We are trying to solve this problem
• We made patches to evict the inode and dentry cache as soon as possible when the DAX attribute is changed
• Set I_DONTCACHE and DCACHE_DONTCACHE for it
• Sync the inode before evicting it
28. Coexisting with CoW (reflink/dedupe) filesystems (1/2)
The Copy-on-Write feature of filesystems (xfs: reflink/dedupe, btrfs)
If the same data block exists in different files, the filesystem can merge it into one shared block
So far, it was enough that only the filesystem managed such blocks
With Filesystem-DAX, that is NOT enough anymore
[Figure: File A (offset 100) and File B (offset 200) share one block]
29. Coexisting with CoW (reflink/dedupe) filesystems (2/2)
Problems
The memory management layer also needs to understand merged blocks
• Reverse mapping for multiple files is necessary
• A merged block has only one struct page (page->index saves the offset within one file)
• How can the offsets of multiple files be saved?
Need to enhance iomap and Filesystem DAX for CoW
• iomap is the newer interface replacing the block I/O layer (IOW, struct buffer_head)
• It can lock and submit I/O for multiple pages at once
• Filesystem-DAX depends on the iomap feature, but iomap does not support CoW
• Though XFS already uses iomap, btrfs also needs to use it
Ruan-san will talk about how to solve them
30. Deep dive to solve the issues of Filesystem-DAX
Support reflink/dedupe for fsdax*
Improve NVDIMM-based reverse mapping
* Filesystem-DAX
31. Self-introduction
Ruan Shiyang
A software engineer at Fujitsu Nanda
Experience in embedded development
Currently focusing on Linux filesystems and persistent memory
32. Create a reflink-enabled XFS and mount it with the “dax” option
Then an error occurs
dmesg shows:
Background
fsdax is “EXPERIMENTAL”
reflink and fsdax cannot work together
$ mkfs.xfs -m reflink=1 [...] && mount -o dax [...]
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0,
missing codepage or helper program, or other error.
XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0): DAX and reflink cannot be used together!
33. reflink
Files share extents for the same data (e.g. cp --reflink=always)
Advantages
• Fast copy
• Saves storage
Copy-on-Write mechanism (CoW)
• Copy the shared extent before writing data (“copy, then write”)
[Figure: File A and File B share Extent I on disk; on write, Extent I is copied to Extent II and the write goes to the copy]
34. fsdax (Filesystem DAX)
A mode of an NVDIMM namespace
Removes the page cache from the I/O path
Allows mmap() to establish direct mappings to persistent memory media
[Figure: read()/write()/mmap() enter the vfs; the SSD/HDD path goes through the page cache and bio; the NVDIMM path is direct access (DAX)]
35. Why can reflink and fsdax not work together?
Issues
1. Need to enhance iomap and fsdax for CoW
• The iomap interface extension
• Support CoW in fsdax
• Co-existence support for each filesystem
2. The memory management layer also needs to understand merged blocks
• Improve the current NVDIMM-based reverse mapping
36. Need to enhance iomap and fsdax for CoW
The iomap interface extension
Support CoW in fsdax
Co-existence support for each filesystem
37. Need to enhance iomap and fsdax for CoW
[Figure: write() path through the vfs. If not DAX: get the destination extent, load data from disk into the page cache, write data, submit bio, flush to disk, so CoW is executed via the page cache. If DAX: data goes by direct access to NVDIMM, and CoW is not executed]
38. What must be implemented?
iomap_begin
• xfs_file_iomap_begin()
• ALLOCATE a new extent for CoW
• STORE the source extent info
actor
• dax_iomap_actor()
• COPY the source data to the newly allocated extent
• WRITE the user data
iomap_end
• xfs_file_iomap_end()
• REMAP the extents of this file
iomap: introduce “srcmap”, fill the “srcmap”
fsdax: add CoW for write(), add CoW for mmap(), add a “dax” dedupe
XFS: remap extents after CoW
39. iomap - Introduce “srcmap” (1/2)
The source address for CoW
struct iomap tells where to write (the destination)
CoW also needs to know
• start bno: where to copy from
• length: how much data to copy
[Figure: copy from File A's Extent I (source address, source length) to File B's destination extent]
40. iomap - Introduce “srcmap” (2/2)
How?
Add a new IOMAP type called ‘IOMAP_COW’
• distinguishes CoW from a normal write
Add another struct iomap called ‘srcmap’
• remembers and passes the source extent info
+ #define IOMAP_COW 0x06 /* copy data from srcmap before writing */
int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
- unsigned flags, struct iomap *iomap);
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
41. iomap - Fill the “srcmap”
Fill the “srcmap”
• iomap: the destination extent to write at
• srcmap: the source extent to copy from
• filled in ->iomap_begin()
How?
• handle files with the DAX flag and shared extents
[Flow: find extent; if it is shared, not a hole, and DAX, fill the srcmap and set IOMAP_COW; then fill the iomap]
42. fsdax - Add CoW for write()
Execute CoW in the write() path
• use ->direct_access() to translate an offset in the block device into a physical memory address in pmem
How?
• copy the source data before copying the user data
[Flow: get srcmap; if IOMAP_COW, ->direct_access() and copy the source data from srcmap; then copy the user data from user space]
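The copy-before-write step above can be sketched with plain byte buffers (a toy model, not kernel code): the shared source extent is copied into a newly allocated extent, the user data is applied to the copy, and the original stays intact for the other file:

```python
def cow_write(src_extent: bytes, user_data: bytes, offset: int) -> bytearray:
    """Toy model of dax_iomap_actor() with IOMAP_COW:
    copy the source extent first, then apply the user write to the copy."""
    new_extent = bytearray(src_extent)                       # COPY source data (from srcmap)
    new_extent[offset:offset + len(user_data)] = user_data   # WRITE user data
    return new_extent  # the caller then remaps the file to this extent
```

For example, writing b"XY" at offset 2 of a shared extent b"AAAAAAAA" yields a new extent b"AAXYAAAA" while the shared original is unchanged, which is why the other file sharing it is unaffected.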
43. fsdax - Add CoW for mmap()
Execute CoW in the page fault path
• fsdax supports PTE faults and PMD faults
• also uses the iomap model
How?
• copy the source data before the VMA is associated
[Flow: get srcmap; if IOMAP_COW, ->direct_access() and copy the source data from srcmap; then associate the VMA]
44. fsdax - Add a “dax” dedupe (1/2)
Handle dedupe
• the generic dedupe function compares data in page caches
• it is not adapted to fsdax: there is no page cache
[Figure: File A's Extent I and File B's Extent II are compared by a DAX compare function and found to be the same]
45. fsdax - Add a “dax” dedupe (2/2)
How?
• introduce a DAX compare function
• used only if both files have the same DAX flag
[Flow: if both files are DAX, use the DAX compare; if neither is DAX, use the generic compare; otherwise return an error]
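The flow above can be modelled as a direct byte comparison of the two candidate extents plus the same-DAX-flag check (a sketch; the real code compares mapped pmem pages, not Python bytes):

```python
def dax_dedupe_compare(extent_a: bytes, a_is_dax: bool,
                       extent_b: bytes, b_is_dax: bool) -> bool:
    """Return True if the two extents hold the same data and may be merged.
    Raises ValueError when one file is DAX and the other is not,
    mirroring the 'return error' branch in the flow above."""
    if a_is_dax != b_is_dax:
        raise ValueError("cannot dedupe a DAX file with a non-DAX file")
    # DAX path: compare the media bytes directly (no page cache);
    # the non-DAX path would use the generic page-cache compare instead.
    return extent_a == extent_b
```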
46. XFS - Remap extents after CoW
Remap extents
• since a new extent is allocated, the file needs to remap to it
How?
• execute the remap function in ->iomap_end()
• clean up if CoW failed
[Flow: if CoW happened, remap the extents; if there was an I/O error, clean up the CoW extents]
47. The memory management layer also needs to understand merged blocks
Improve the current NVDIMM-based reverse mapping
48. When a memory failure occurs
[Figure: a broken page at 0xA003 on NVDIMM belongs to file A, which is mapped by processes 1-4]
1. track all processes associated with the broken page on NVDIMM
2. send a signal to kill those associated processes
49. Need to improve the NVDIMM-based reverse mapping
• currently, only 1-to-1 mapping is supported (page->mapping = mappingA, page->index = offsetA)
• reflink on fsdax requires 1-to-N mapping
[Figure: the broken page 0xA003 is shared by file A (processes 1-3, via their VMAs), file B (processes 4-5), and possibly more files, but struct page can record only one mapping; the other owners are NOT SUPPORTED]
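The 1-to-N requirement can be sketched as a per-page table of (mapping, offset) pairs replacing the single page->mapping / page->index slot (a toy model of the dax-rmap idea on the next slides; names are illustrative):

```python
class DaxPage:
    """Toy model: a pmem page that can belong to several files after reflink."""
    def __init__(self):
        self.rmap = []  # list of (file_mapping, offset) pairs, not just one slot

    def associate(self, mapping, offset):
        """Record one more owning file for this page."""
        self.rmap.append((mapping, offset))

    def processes_to_kill(self, i_mmap):
        """On memory failure, walk every owning file's VMA list (i_mmap)
        and collect the processes that map this page."""
        to_kill = set()
        for mapping, _offset in self.rmap:
            to_kill.update(i_mmap.get(mapping, []))
        return to_kill
```

With a page shared by file A (processes 1-3) and file B (processes 4-5), all five processes are collected, which the current 1-to-1 struct page layout cannot express.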
50. Approaches to solve this problem
Idea 1 - per-page tracking
• Introduce a dax-rmap rbtree
• Track by iterating the dax-rmap rbtree
Idea 2 - fs query tracking
• Introduce ->storage_lost() for the filesystem
• Collect a list of shared files from ->storage_lost(), and iterate the list
Idea 3 - dev_pagemap interface
• Call the memory-failure handler in ->storage_lost()
• Introduce ->memory_failure(), which is to be implemented by the driver
51. Idea 1 - per-page tracking (1/2)
Old implementation
• page->mapping = mappingA, page->index = offsetA
• the page is associated with one file's address_space, and VMAs are tracked through it
Improved: store more file mappings
• introduce a dax-rmap rbtree
Associate
• 1st time: associate the file mapping and offset to the page
• same as the old implementation (page->mapping, page->index)
• 2nd time and later: create the dax-rmap rbtree (rooted in page->private)
• page->index counts how many files are associated (refcount)
• insert the file mapping and offset as a node (mappingA/offsetA, mappingB/offsetB, mappingC/offsetC, ...)
52. Idea 1 - per-page tracking (2/2)
Track (in memory failure)
1. page ==> file mappings (new)
• iterate the dax-rmap rbtree
2. file mapping ==> processes
• iterate each file mapping's i_mmap to find the VMAs of processes
Result
• may cause huge overhead: each page contains a dax-rmap rbtree
• per-page tracking only works for files: it cannot handle metadata
[Flow: lock dax page; collect processes (1. iterate the dax-rmap rbtree, 2. iterate mapping->i_mmap into the to_kill list); unmap pages; kill processes; unlock dax page]
53. Idea 2 - filesystem query tracking (1/2)
Improved: take advantage of a filesystem feature to reduce the huge overhead
• the rmapbt of XFS records the owners of each data extent
• introduce an FS interface ->storage_lost() to return an owners list
• instead of the dax-rmap rbtree, to reduce the huge overhead
• query the rmapbt for the owners of a block (the broken page on NVDIMM)
• return a list of owners to memory failure
Associate
• store the FS superblock in page->mapping (as the specific caller when tracking)
• count references in page->zone_device_data (not used in memory failure before)
[Layout: page->mapping = (XFS) sb, page->index = offset, page->zone_device_data = refcount]
54. Idea 2 - filesystem query tracking (2/2)
Track
1. page ==> file mappings (new)
• iterate the owners list
2. file mapping ==> processes
• iterate each file mapping's i_mmap to find the VMAs of processes
Result
• reduced the huge overhead
• however, a little overhead remains: extra memory space is needed for the owners list
[Flow: lock dax page; call ->storage_lost() to build the owners list; collect processes (1. iterate the owners list, 2. iterate mapping->i_mmap into the to_kill list); unmap pages; kill processes; unlock dax page]
55. Idea 3 - dev_pagemap interface (1/3)
Handle memory failure during the query to remove the overhead completely
• instead of creating an owners list, handle every owner during the query
• call ->storage_lost() to query owners
• call the memory-failure handler when an owner is found
• iterate each owner's i_mmap to find the VMAs of processes
Result
• the extra memory space is not necessary
• 1-to-N reverse mapping is implemented
[Flow: call ->storage_lost(); FS rmap query; for each owner found, run the memory-failure handler (lock dax page, iterate the owner's mapping->i_mmap into the to_kill list, collect processes, unmap pages, kill processes, unlock dax page)]
56. Idea 3 - dev_pagemap interface (2/3)
Support more than fsdax mode: dive into the driver
• introduce a driver interface ->memory_failure() for the pmem device
• it calls ->storage_lost() for the dax_device
[Flow: memory failure invokes dev_pagemap ->ops->memory_failure(); the pmem driver calls ->storage_lost(); FS rmap query; memory-failure handler (lock dax page, iterate mapping->i_mmap into the to_kill list, collect processes, unmap pages, kill processes, unlock dax page); or a dax driver handler ...]
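The Idea 3 call chain above (driver ->memory_failure(), filesystem ->storage_lost(), per-owner memory-failure handling) can be sketched as plain callbacks; the function names are hypothetical stand-ins taken from the slides, and the rmapbt is modelled as a simple list of (owner, extent) records:

```python
def fs_storage_lost(rmapbt, broken_block, handle_owner):
    """Toy ->storage_lost(): query the rmapbt for owners of the broken
    block and invoke the memory-failure handler per owner, so no
    intermediate owners list needs to be allocated (Idea 3)."""
    killed = []
    for owner, (start, length) in rmapbt:
        if start <= broken_block < start + length:
            killed.extend(handle_owner(owner))  # kill this owner's processes now
    return killed

def pmem_memory_failure(page_to_block, rmapbt, broken_page, handle_owner):
    """Toy driver ->memory_failure(): translate the broken pmem page to a
    filesystem block, then ask the filesystem via ->storage_lost()."""
    return fs_storage_lost(rmapbt, page_to_block[broken_page], handle_owner)
```

Because each owner is handled as soon as the query finds it, a block shared by several files still reaches every affected process without building the Idea 2 owners list first.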
57. Idea 3 - dev_pagemap interface (3/3)
1-to-N reverse mapping has been implemented
• no useless overhead
• compatible with all NVDIMM modes
• compatible with all filesystems
Some remaining difficulties
• cannot obtain the filesystem from LVM (or another mapped device)
- get_super() is for filesystems created directly on the device
- not suitable for a mapped device
• requires each filesystem to have a feature like “rmapbt”
- the “realtime device” of XFS does not support “rmapbt”
- other filesystems
59. We talked about the following
Basics of NVDIMM on Linux
Issues of Filesystem-DAX (Direct Access mode)
Deep dive to solve the issues of Filesystem-DAX
• Support reflink & dedupe for fsdax
• Fix NVDIMM-based reverse mapping
• The community has made many enhancements for NVDIMM on Linux
• We have worked on NVDIMM to remove the experimental status of Filesystem DAX
• We expect this will be achieved soon
62. How to use NVDIMM in Linux
Make a region
• You can configure it on the BIOS screen (or with the ipmctl command for Intel DCPMM)
• The region is created by hardware (the memory controller)
• You need to reboot to enable the created region
Make a namespace
• A namespace is a concept similar to a SCSI LUN
• You can configure it with the ndctl command
Make a filesystem (storage or Filesystem DAX)
• If you would like to use Filesystem DAX, you need to select ext4 or xfs, which support Filesystem DAX
• Currently, if you select xfs, the reflink option must be disabled
Mount the filesystem (storage or Filesystem DAX) with the -o dax option
Make a pool (Filesystem DAX or Device DAX) with the PMDK tool if necessary
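The steps above (after the BIOS/region part, which needs a reboot) can be collected into the commands an administrator would run; this sketch only assembles the command strings, and the device name and mount point are illustrative:

```python
def fsdax_setup_commands(dev="pmem0", mountpoint="/mnt/pmem"):
    """Assemble the Filesystem-DAX setup steps from the slide as shell
    commands. The region itself is configured in the BIOS (or with
    ipmctl) and requires a reboot, so it is not listed here."""
    return [
        "ndctl create-namespace -m fsdax",        # make the namespace
        f"mkfs.xfs -m reflink=0 /dev/{dev}",      # reflink must currently be off for DAX
        f"mount -o dax /dev/{dev} {mountpoint}",  # mount with the dax option
    ]
```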
63. Users of NVDIMM at the kernel layer
dm-writecache
• Persistent cache for writes at the device-mapper layer
• dm-cache can use SSD as a cache; dm-writecache can use NVDIMM
• This presentation is helpful:
• https://github.com/ChinaLinuxKernel/CLK2019/blob/master/clk2019-dm-writecache-04.pdf
Kernel shows Device DAX as DRAM
• Though Intel DCPMM has a “Memory Mode” which users can use as huge RAM, it has some pains
• The DRAM size disappears, because DRAM becomes just a cache of NVDIMM
• Software cannot choose between the DRAM and DCPMM areas
• With this kernel feature, the system RAM size becomes the sum of DRAM and NVDIMM
• Use App Direct mode and a Device DAX namespace
• Memory hot-add the Device DAX area
• It becomes a NUMA node