This is a talk from Open Source Summit Japan 2020.
--------------------------
NVDIMM (Non-Volatile DIMM) is one of the most interesting devices, because it has the characteristics of both memory and storage. To support NVDIMM, the Linux kernel provides three access methods for users: - Storage (Sector) mode - Filesystem DAX (=Direct Access) mode - Device DAX mode. Of these three methods, Filesystem DAX is the most anticipated, because applications can write data to the NVDIMM area directly, and it is easier to use than Device DAX mode. So, some software already uses it with official support. However, Filesystem-DAX still has "experimental" status in the upstream community due to some difficult issues. In this session, Yasunori Goto will talk about the forefront of the development of NVDIMM, and Ruan Shiyang will talk about his challenge, with the latest status since CLK2019.
1. The Forefront of the Development for NVDIMM on Linux Kernel
2020/12/3
Yasunori Goto (Fujitsu Limited.)
Ruan Shiyang (Nanjing Fujitsu Nanda Software Technology Co., Ltd.)
Copyright 2020 FUJITSU LIMITED
2. Agenda
Summary of current status of development for NVDIMM (Yasunori)
Basics of NVDIMM on Linux
Issues of Filesystem-DAX (Direct Access mode)
Deep dive to solve the issues of Filesystem-DAX (Ruan)
Support reflink & dedupe for fsdax
Fix NVDIMM-based Reverse mapping
Conclusion
3. Self introduction
Yasunori Goto
I have worked on Linux and related OSS since 2002
• Development of the memory hotplug feature of the Linux kernel
• Technical support for Linux kernel troubles
• etc.
Currently, leader of the Fujitsu Linux kernel development team
In the last few years, I have mainly worked on NVDIMM
• Some enhancements for NVDIMM RAS
• Fault location feature
• Fault prediction feature
“The ideal and reality of NVDIMM RAS”
https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version
5. Characteristics of Non-Volatile DIMM (NVDIMM)
A persistent memory device which can be inserted into a DIMM slot like DRAM
• The CPU can read/write NVDIMM directly
• It keeps data persistent even if the system is powered down or rebooted
• Though its access latency is higher than DRAM's, it is faster than NVMe
• Offers larger capacity than DRAM
Use cases
• In-memory database
• Hierarchical storage, distributed storage
• Etc...
Famous product
• Intel Data Center Persistent Memory Module (DCPMM)
[Figure: memory/storage hierarchy: CPU cache, DRAM, NVDIMM, NVMe, SSD (SAS/SATA), HDD, ordered by capacity vs. latency/cost]
6. Impact of NVDIMM
[Figure: traditional I/O path: system call (switch to kernel mode), page cache, sync()/fsync()/msync()]
The page cache was created for SLOW I/O storage, but NVDIMM is fast enough without it
If a user process issues CPU cache flush instructions, that is enough to make data persistent
• Applications can read/write NVDIMM directly
• Even a system call may be a waste of time due to switching between kernel mode and user mode
The traditional I/O layer becomes redundant for NVDIMM
A new interface is expected for NVDIMM!
7. Much software still assumes that memory is VOLATILE
NVDIMM is difficult for traditional software
What will be necessary for software to use NVDIMM?
• CPU cache is still volatile
• If the system powers down suddenly, some data may not be stored
Need to prepare for sudden power-down
• Data structures in NVDIMM should not be changed
• If the structures are changed, a software update will cause disaster
Need data-structure compatibility on NVDIMM
• If data is broken, software needs to detect it and correct it
Need to detect / correct corrupted data
• Software needs to manage not only free areas, but also used areas, to reuse their data
• In addition, the kernel must assign areas to suitable processes with authority checks
Need data-area management
8. Conflicting requirements
Filesystem is still useful
• A filesystem gives many solutions for the previous considerations
• It is useful for current software
• Format compatibility of the filesystem
• Data correction
• Journaling, or CoW
• Region management
• Authority check
• etc...
Filesystem is too slow
• The traditional I/O stack is too heavy
• A CPU cache flush is enough to make data persistent
• A new access interface to NVDIMM is needed for next-generation software
• System calls
• Page cache
• Block device layer
• etc...
9. Interfaces of NVDIMM (1/3)
Because of the previous reasons, Linux provides several interfaces for applications
Storage Access (green)
• Applications can access NVDIMM with the traditional I/O interface, like SSD/HDD
• So applications can use this mode without any modification
Filesystem DAX (blue)
• The page cache is skipped when you use read()/write() on Filesystem-DAX
• Applications can access the NVDIMM area directly by calling mmap() on a file
• Needs filesystem support
• xfs, ext4...
• This mode is suitable for modifying current applications for NVDIMM
[Figure: NVDIMM software stack: user layer (traditional app, new app with Filesystem DAX, new app with Device DAX, PMDK, management command); kernel layer (traditional filesystem, BTT driver, DAX-enabled filesystem, pmem driver, device dax driver /dev/dax, sysfs, ACPI driver); NVDIMM and ACPI hardware]
10. Interfaces of NVDIMM (2/3)
Device DAX (red)
• Applications can access the NVDIMM area directly by calling mmap() on /dev/dax
• /dev/dax allows only open(), mmap(), and close()
• IOW, you cannot use read()/write() nor any other system call
• For innovative new applications with NVDIMM
PMDK (purple) is provided
• A set of libraries and tools for Filesystem DAX and Device DAX
(Same software-stack figure as the previous slide)
11. Interfaces of NVDIMM (3/3)
NVDIMM is exposed as device files, like storage
For Storage Access: /dev/pmem##.#s
Filesystem DAX: /dev/pmem##.#
Device DAX: /dev/dax##.#
The ndctl command creates these devices when it creates a namespace
Example of Filesystem DAX
You can make a filesystem on /dev/pmem##.#s or /dev/pmem##.#
Note: /dev/dax##.# is a character device
Since you cannot use read()/write() on /dev/dax##.#, you cannot use the dd command for backup
$ sudo ndctl create-namespace -e "namespace0.0" -m fsdax -f
{
"dev":"namespace0.0",
"mode":"memory",
"size":"5.90 GiB (6.34 GB)",
"uuid":"dc47d0d7-7d8f-473e-9db4-1c2e473dbc8f",
"blockdev":"pmem0",
"numa_node":0
}
$ sudo ls /dev/pmem*
/dev/pmem0
12. PMDK (1/2)
A set of libraries and tools for Filesystem DAX and Device DAX
[Figure: PMDK components]
• libpmem / librpmem: low-level support for local / remote access to persistent memory
• libpmemobj: general-purpose pmem allocation with transaction support
• libpmemblk: create same-size blocks and update them atomically
• libpmemlog: for log files
• libpmemkv: key-value store (C, C++, Java, JS, Ruby bindings)
• libmemkind / libvmemcache: support for volatile memory usage
• Language bindings: C, C++; in development: PCJ (Persistent Collection for Java), LLPJ (Low-Level Persistent Java Library), Python bindings
• Tools: pmempool (pool management), daxio (back up a device dax namespace)
• You can use PMDK not only on Linux, but also on Windows
Note
• Not all components are included in this figure
• To be precise, libmemkind is not a member of PMDK, but it is a recommended library
13. PMDK (2/2): libpmem
Typical libraries and a tool
(Same PMDK component figure as the previous slide)
libpmem
• The low-layer library
• Calls mmap() to use NVDIMM, issues the suitable CPU cache flush instructions, etc.
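The note above says libpmem's core job is mmap() plus the right flush. As a rough illustration (a toy model, not the real libpmem API), the same store-then-flush pattern can be shown with Python's mmap on an ordinary file, where flush()/msync() stands in for the CPU cache-line flush that libpmem would issue on real pmem:

```python
import mmap

def persist_store(path, offset, data):
    """Map a file, store bytes at an offset, and flush the mapping.
    mm.flush() (msync) stands in here for the CPU cache flush
    (e.g. CLWB/CLFLUSHOPT + fence) that libpmem uses on real NVDIMM."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(data)] = data
            mm.flush()  # data is durable once this returns
```

On a DAX-mounted file mapped with MAP_SYNC, the flush step would be a user-space cache flush instead of a system call, which is exactly the shortcut Filesystem-DAX enables.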
14. PMDK (2/2): libpmemobj
Typical libraries and a tool
(Same PMDK component figure as the previous slide)
libpmemobj
• The high-layer library, which supports transactions on objects on DAX
• For general use cases
• This is the most highly recommended library in PMDK
• Users need to understand how to use its transactions
15. PMDK (2/2): daxio
Typical libraries and a tool
(Same PMDK component figure as the previous slide)
daxio
• /dev/dax (device DAX) is a character device, so you cannot use dd for backup
• The daxio command is provided instead
16. New libraries of PMDK (1/2)
libpmem2
New low-layer library
• Introduces a new concept, “GRANULARITY”
• PMEM2_GRANULARITY_PAGE: for traditional SSD/HDD
• PMEM2_GRANULARITY_CACHELINE: for persistent memory (the process needs to flush the CPU cache to make data persistent)
• PMEM2_GRANULARITY_BYTE: for persistent memory (the platform keeps the CPU cache persistent)
• Introduces new functions to get the unsafe shutdown status and bad blocks
• This library uses the ndctl libraries internally to get this information
Its interface is different from the old libpmem
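The three granularities above determine what a store must do before it is persistent. As a small illustration (a hypothetical helper, not the libpmem2 API; only the constant names come from the slide), the mapping can be sketched like this:

```python
# Hypothetical mapping from libpmem2 granularity to the required flush action.
# The constant names follow the slide; the helper itself is illustrative only.
PMEM2_GRANULARITY_PAGE = "page"            # traditional SSD/HDD: page cache involved
PMEM2_GRANULARITY_CACHELINE = "cacheline"  # pmem: process must flush CPU cache lines
PMEM2_GRANULARITY_BYTE = "byte"            # pmem: platform keeps CPU caches persistent

def required_flush(granularity):
    """Return the action needed after a store before it counts as persistent."""
    return {
        PMEM2_GRANULARITY_PAGE: "msync()/fsync()",
        PMEM2_GRANULARITY_CACHELINE: "CPU cache flush (e.g. CLWB) + fence",
        PMEM2_GRANULARITY_BYTE: "store fence only",
    }[granularity]
```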
17. New libraries of PMDK (2/2)
librpma
The 2nd library for RDMA
• The 1st library was librpmem, but it remains in experimental status due to “no users”
• librpma is created based on users' requirements
Currently, the official release of the main library is planned for Q1/2021
https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf
(*) The above table is quoted from this presentation
18. Filesystem-DAX still has experimental status
Filesystem-DAX is a highly anticipated interface...
• The management of NVDIMM is almost the same as with a traditional filesystem
• Operators can use traditional commands to manage the NVDIMM area
• If an application calls mmap() for “a file” on Filesystem DAX, it can access NVDIMM directly
• In contrast, Device DAX requires pool management by the tools of PMDK
• Otherwise, one piece of software needs to possess the whole namespace (/dev/dax)
Unlike Device DAX, Filesystem-DAX still has experimental status
• The “experimental” message is shown when the filesystem is mounted with the DAX option
• There are some issues in the kernel layer
• The “experimental status” has lasted for a few years...
Why is it still experimental? What are the obstacles?
20. Why is Filesystem-DAX experimental?
In summary, there are 2 big reasons
1. Filesystem DAX combines storage and memory characteristics
• This causes corner-case issues in Filesystem-DAX
2. More additional features were required, but they are/were difficult to implement
• Configure DAX on/off for each inode (directory or file)
• Co-existence with CoW filesystems
21. Corner Case Issue (1/3)
The most important problem is updating the metadata of a file
In Filesystem-DAX, we expected that applications can make data persistent with only a CPU cache flush, as I said...
However, this also means there is no chance for the kernel/filesystem to update the metadata
• The update time of the file may not be correct
• If an application writes some data to a file on Filesystem DAX, and a user removes some blocks of the file by truncate(2), the kernel cannot coordinate them
• Data of the file may be lost
• If data is transferred by DMA/RDMA to a page which is allocated for Filesystem DAX, a similar problem may occur
22. Corner Case Issue (2/3)
Current status of the metadata update problems
General write access
• Solved by introducing the new MAP_SYNC flag of mmap()
• A page fault occurs on write access, and then the kernel updates the metadata
• PMDK specifies this flag
DMA/RDMA data access (e.g. InfiniBand, video (v4l2))
• Kernel/driver layer: solved by making truncate() wait until the DMA/RDMA finishes
• User process layer: not solved
• truncate() cannot wait for the completion of the transfer, because it may take too long
23. Corner Case Issue (3/3)
Current status of the metadata update problems
(Same status table as the previous slide)
Workaround: “On Demand Paging (ODP)”
• With ODP, the driver/hardware does not usually map the pages of the DMA/RDMA area for the application
• It maps the pages when the application accesses them
• The kernel/driver can coordinate the metadata at that time
• Newer Mellanox cards have this feature
24. DAX on/off for each inode (directory or file) (1/4)
Expected use-cases
Users may want to change the DAX mode for each file
Performance tuning
• Since the write latency of NVDIMM is a bit higher than RAM's, users may want to use the page cache by turning DAX off
Changing the DAX attribute from the application
• Configuration is always painful for administrators
• If an application can detect and change it, that will be helpful for them
Workaround when Filesystem-DAX has a bug
Need more fine-grained settings
25. DAX on/off for each inode (directory or file) (2/4)
What is/was the problem
If the filesystem changes the DAX attribute, it needs to switch the inode's methods between DAX and normal files, but they may still be executing
Data in the page cache must be moved silently when the DAX attribute is turned off
• These problems were very difficult
[Figure: a normal file's address_space has page caches on DRAM and methods such as xfs_vm_writepages() / xfs_iomap_set_page_dirty(); a DAX file has no page cache and methods such as xfs_dax_writepages() / noop_set_page_dirty(); the methods need to be changed]
26. DAX on/off for each inode (directory or file) (3/4)
How was it solved?
The DAX attribute is changed only when its inode cache is NOT loaded in memory
• The filesystem can load the suitable methods for each attribute when it reloads the inode into memory
• Page caches of the file are also dropped
• Users can use this feature with the new mount option: # mount ... -o dax=inode
• The DAX attribute is changed by command:
DAX on : $ xfs_io -c 'chattr +x' <file or directory>
DAX off: $ xfs_io -c 'chattr -x' <file or directory>
• If necessary, caches are dropped with: echo 3 > /proc/sys/vm/drop_caches
27. DAX on/off for each inode (directory or file) (4/4)
Note
All applications which use the target file must close it to change the DAX attribute
The filesystem postpones changing the DAX attribute until the inode cache and page cache of the file are dropped
• Currently, administrators may still need to run drop_caches to achieve it
• Because the inode cache may remain due to a race condition
# echo 3 > /proc/sys/vm/drop_caches
This operation affects ALL page caches and slabs on the system!
We are trying to solve this problem
• We made patches to evict the inode and dentry cache as soon as possible when the DAX attribute is changed
• Set I_DONTCACHE and DCACHE_DONTCACHE for it
• Sync the inode before evicting it
28. Coexisting with CoW (reflink/dedupe) filesystems (1/2)
The Copy-on-Write feature of filesystems (xfs: reflink/dedupe, btrfs)
If the same data block exists in different files, the filesystem can merge it into one shared block
So far, it was enough that only the filesystem managed such blocks
With Filesystem-DAX, that is NOT enough anymore
[Figure: File A (offset 100) and File B (offset 200) share one block]
29. Coexisting with CoW (reflink/dedupe) filesystems (2/2)
Problems
The memory management layer also needs to understand merged blocks
• Reverse mapping for multiple files is necessary
• A merged block has only one struct page (page->index saves the offset within one file)
• How can the offsets of multiple files be saved?
Need to enhance iomap and Filesystem DAX for CoW
• iomap is the newer interface replacing the block I/O layer (IOW, struct buffer_head)
• It can lock and submit I/O for multiple pages at once
• Filesystem-DAX depends on the iomap feature, but iomap does not support CoW
• Though XFS already uses iomap, btrfs also needs to use it
Ruan-san will talk about how to solve them
30. Deep dive to solve the issues of Filesystem-DAX
Support reflink/dedupe for fsdax*
Improve NVDIMM-based reverse mapping
* Filesystem-DAX
31. Self-introduction
Ruan Shiyang
A software engineer at Fujitsu Nanda
Experience in embedded development
Currently focusing on Linux filesystems and persistent memory
32. Create a reflink-enabled XFS and mount it with the “dax” option
Then an error occurs
dmesg shows:
Background
fsdax is “EXPERIMENTAL”
reflink and fsdax cannot work together
$ mkfs.xfs -m reflink=1 [...] && mount -o dax [...]
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0,
missing codepage or helper program, or other error.
XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0): DAX and reflink cannot be used together!
33. reflink
Files share extents for the same data (e.g. cp --reflink=always)
Advantages
• Fast copy
• Saves storage
Copy-on-Write mechanism (CoW)
• Copy the shared extent before writing data (“copy, then write”)
[Figure: File A and File B share Extent I on disk; on write, Extent I is copied to Extent II and the write goes to the copy]
34. fsdax (Filesystem DAX)
A mode of an NVDIMM namespace
Removes the page cache from the I/O path
Allows mmap() to establish direct mappings to persistent memory media
[Figure: read()/write()/mmap() enter the vfs; the SSD/HDD path goes through the page cache and bio; the NVDIMM path is direct access (DAX)]
35. Why can reflink and fsdax not work together?
Issues
1. Need to enhance iomap and fsdax for CoW
• The iomap interface extension
• Support CoW in fsdax
• Co-existence support for each filesystem
2. The memory management layer also needs to understand merged blocks
• Improve the current NVDIMM-based reverse mapping
36. Need to enhance iomap and fsdax for CoW
The iomap interface extension
Support CoW in fsdax
Co-existence support for each filesystem
37. Need to enhance iomap and fsdax for CoW
[Figure: write() path through the vfs. If not DAX: get the destination extent, load data from disk into the page cache, write data, submit bio, flush to disk, so CoW is executed via the page cache. If DAX: data goes by direct access to NVDIMM, and CoW is not executed]
38. What must be implemented?
iomap_begin
• xfs_file_iomap_begin()
• ALLOCATE a new extent for CoW
• STORE the source extent info
actor
• dax_iomap_actor()
• COPY the source data to the newly allocated extent
• WRITE the user data
iomap_end
• xfs_file_iomap_end()
• REMAP the extents of this file
iomap: introduce “srcmap”, fill the “srcmap”
fsdax: add CoW for write(), add CoW for mmap(), add a “dax” dedupe
XFS: remap extents after CoW
39. iomap - Introduce “srcmap” (1/2)
The source address for CoW
struct iomap tells where to write (the destination)
CoW also needs to know
• start bno: where to copy from
• length: how much data to copy
[Figure: copy from File A's Extent I (source address, source length) to File B's destination extent]
40. iomap - Introduce “srcmap” (2/2)
How?
Add a new IOMAP type called ‘IOMAP_COW’
• distinguishes CoW from a normal write
Add another struct iomap called ‘srcmap’
• remembers and passes the source extent info
+ #define IOMAP_COW 0x06 /* copy data from srcmap before writing */
int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
- unsigned flags, struct iomap *iomap);
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
41. iomap - Fill the “srcmap”
Fill the “srcmap”
• iomap: the destination extent to write at
• srcmap: the source extent to copy from
• filled in ->iomap_begin()
How?
• handle files with the DAX flag and shared extents
[Flow: find extent; if it is shared, not a hole, and DAX, fill the srcmap and set IOMAP_COW; then fill the iomap]
42. fsdax - Add CoW for write()
Execute CoW in the write() path
• use ->direct_access() to translate an offset in the block device into a physical memory address in pmem
How?
• copy the source data before copying the user data
[Flow: get srcmap; if IOMAP_COW, ->direct_access() and copy the source data from srcmap; then copy the user data from user space]
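The copy-before-write step above can be sketched with plain byte buffers (a toy model, not kernel code): the shared source extent is copied into a newly allocated extent, the user data is applied to the copy, and the original stays intact for the other file:

```python
def cow_write(src_extent: bytes, user_data: bytes, offset: int) -> bytearray:
    """Toy model of dax_iomap_actor() with IOMAP_COW:
    copy the source extent first, then apply the user write to the copy."""
    new_extent = bytearray(src_extent)                       # COPY source data (from srcmap)
    new_extent[offset:offset + len(user_data)] = user_data   # WRITE user data
    return new_extent  # the caller then remaps the file to this extent
```

For example, writing b"XY" at offset 2 of a shared extent b"AAAAAAAA" yields a new extent b"AAXYAAAA" while the shared original is unchanged, which is why the other file sharing it is unaffected.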
43. fsdax - Add CoW for mmap()
Execute CoW in the page fault path
• fsdax supports PTE faults and PMD faults
• also uses the iomap model
How?
• copy the source data before the VMA is associated
[Flow: get srcmap; if IOMAP_COW, ->direct_access() and copy the source data from srcmap; then associate the VMA]
44. fsdax - Add a “dax” dedupe (1/2)
Handle dedupe
• the generic dedupe function compares data in page caches
• it is not adapted to fsdax: there is no page cache
[Figure: File A's Extent I and File B's Extent II are compared by a DAX compare function and found to be the same]
45. fsdax - Add a “dax” dedupe (2/2)
How?
• introduce a DAX compare function
• used only if both files have the same DAX flag
[Flow: if both files are DAX, use the DAX compare; if neither is DAX, use the generic compare; otherwise return an error]
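The flow above can be modelled as a direct byte comparison of the two candidate extents plus the same-DAX-flag check (a sketch; the real code compares mapped pmem pages, not Python bytes):

```python
def dax_dedupe_compare(extent_a: bytes, a_is_dax: bool,
                       extent_b: bytes, b_is_dax: bool) -> bool:
    """Return True if the two extents hold the same data and may be merged.
    Raises ValueError when one file is DAX and the other is not,
    mirroring the 'return error' branch in the flow above."""
    if a_is_dax != b_is_dax:
        raise ValueError("cannot dedupe a DAX file with a non-DAX file")
    # DAX path: compare the media bytes directly (no page cache);
    # the non-DAX path would use the generic page-cache compare instead.
    return extent_a == extent_b
```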
46. XFS - Remap extents after CoW
Remap extents
• since a new extent is allocated, the file needs to remap to it
How?
• execute the remap function in ->iomap_end()
• clean up if CoW failed
[Flow: if CoW happened, remap the extents; if there was an I/O error, clean up the CoW extents]
47. The memory management layer also needs to understand merged blocks
Improve the current NVDIMM-based reverse mapping
48. When a memory failure occurs
[Figure: a broken page at 0xA003 on NVDIMM belongs to file A, which is mapped by processes 1-4]
1. track all processes associated with the broken page on NVDIMM
2. send a signal to kill those associated processes
49. Need to improve the NVDIMM-based reverse mapping
• currently, only 1-to-1 mapping is supported (page->mapping = mappingA, page->index = offsetA)
• reflink on fsdax requires 1-to-N mapping
[Figure: the broken page 0xA003 is shared by file A (processes 1-3, via their VMAs), file B (processes 4-5), and possibly more files, but struct page can record only one mapping; the other owners are NOT SUPPORTED]
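The 1-to-N requirement can be sketched as a per-page table of (mapping, offset) pairs replacing the single page->mapping / page->index slot (a toy model of the dax-rmap idea on the next slides; names are illustrative):

```python
class DaxPage:
    """Toy model: a pmem page that can belong to several files after reflink."""
    def __init__(self):
        self.rmap = []  # list of (file_mapping, offset) pairs, not just one slot

    def associate(self, mapping, offset):
        """Record one more owning file for this page."""
        self.rmap.append((mapping, offset))

    def processes_to_kill(self, i_mmap):
        """On memory failure, walk every owning file's VMA list (i_mmap)
        and collect the processes that map this page."""
        to_kill = set()
        for mapping, _offset in self.rmap:
            to_kill.update(i_mmap.get(mapping, []))
        return to_kill
```

With a page shared by file A (processes 1-3) and file B (processes 4-5), all five processes are collected, which the current 1-to-1 struct page layout cannot express.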
50. Approaches to solve this problem
Idea 1 - per-page tracking
• Introduce a dax-rmap rbtree
• Track by iterating the dax-rmap rbtree
Idea 2 - fs query tracking
• Introduce ->storage_lost() for the filesystem
• Collect a list of shared files from ->storage_lost(), and iterate the list
Idea 3 - dev_pagemap interface
• Call the memory-failure handler in ->storage_lost()
• Introduce ->memory_failure(), which is to be implemented by the driver
51. Idea 1 - per-page tracking (1/2)
Old implementation
• page->mapping = mappingA, page->index = offsetA
• the page is associated with one file's address_space, and VMAs are tracked through it
Improved: store more file mappings
• introduce a dax-rmap rbtree
Associate
• 1st time: associate the file mapping and offset to the page
• same as the old implementation (page->mapping, page->index)
• 2nd time and later: create the dax-rmap rbtree (rooted in page->private)
• page->index counts how many files are associated (refcount)
• insert the file mapping and offset as a node (mappingA/offsetA, mappingB/offsetB, mappingC/offsetC, ...)
52. Idea 1 - per-page tracking (2/2)
Track (in memory failure)
1. page ==> file mappings (new)
• iterate the dax-rmap rbtree
2. file mapping ==> processes
• iterate each file mapping's i_mmap to find the VMAs of processes
Result
• may cause huge overhead: each page contains a dax-rmap rbtree
• per-page tracking only works for files: it cannot handle metadata
[Flow: lock dax page; collect processes (1. iterate the dax-rmap rbtree, 2. iterate mapping->i_mmap into the to_kill list); unmap pages; kill processes; unlock dax page]
53. Idea 2 - filesystem query tracking (1/2)
Improved: take advantage of a filesystem feature to reduce the huge overhead
• the rmapbt of XFS records the owners of each data extent
• introduce an FS interface ->storage_lost() to return an owners list
• instead of the dax-rmap rbtree, to reduce the huge overhead
• query the rmapbt for the owners of a block (the broken page on NVDIMM)
• return a list of owners to memory failure
Associate
• store the FS superblock in page->mapping (as the specific caller when tracking)
• count references in page->zone_device_data (not used in memory failure before)
[Layout: page->mapping = (XFS) sb, page->index = offset, page->zone_device_data = refcount]
54. Idea 2 - filesystem query tracking (2/2)
Track
1. page ==> file mappings (new)
• iterate the owners list
2. file mapping ==> processes
• iterate each file mapping's i_mmap to find the VMAs of processes
Result
• reduced the huge overhead
• however, a little overhead remains: extra memory space is needed for the owners list
[Flow: lock dax page; call ->storage_lost() to build the owners list; collect processes (1. iterate the owners list, 2. iterate mapping->i_mmap into the to_kill list); unmap pages; kill processes; unlock dax page]
55. Idea 3 - dev_pagemap interface (1/3)
Handle memory failure during the query to remove the overhead completely
• instead of creating an owners list, handle every owner during the query
• call ->storage_lost() to query owners
• call the memory-failure handler when an owner is found
• iterate each owner's i_mmap to find the VMAs of processes
Result
• the extra memory space is not necessary
• 1-to-N reverse mapping is implemented
[Flow: call ->storage_lost(); FS rmap query; for each owner found, run the memory-failure handler (lock dax page, iterate the owner's mapping->i_mmap into the to_kill list, collect processes, unmap pages, kill processes, unlock dax page)]
56. Idea 3 - dev_pagemap interface (2/3)
Support more than fsdax mode: dive into the driver
• introduce a driver interface ->memory_failure() for the pmem device
• it calls ->storage_lost() for the dax_device
[Flow: memory failure invokes dev_pagemap ->ops->memory_failure(); the pmem driver calls ->storage_lost(); FS rmap query; memory-failure handler (lock dax page, iterate mapping->i_mmap into the to_kill list, collect processes, unmap pages, kill processes, unlock dax page); or a dax driver handler ...]
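The Idea 3 call chain above (driver ->memory_failure(), filesystem ->storage_lost(), per-owner memory-failure handling) can be sketched as plain callbacks; the function names are hypothetical stand-ins taken from the slides, and the rmapbt is modelled as a simple list of (owner, extent) records:

```python
def fs_storage_lost(rmapbt, broken_block, handle_owner):
    """Toy ->storage_lost(): query the rmapbt for owners of the broken
    block and invoke the memory-failure handler per owner, so no
    intermediate owners list needs to be allocated (Idea 3)."""
    killed = []
    for owner, (start, length) in rmapbt:
        if start <= broken_block < start + length:
            killed.extend(handle_owner(owner))  # kill this owner's processes now
    return killed

def pmem_memory_failure(page_to_block, rmapbt, broken_page, handle_owner):
    """Toy driver ->memory_failure(): translate the broken pmem page to a
    filesystem block, then ask the filesystem via ->storage_lost()."""
    return fs_storage_lost(rmapbt, page_to_block[broken_page], handle_owner)
```

Because each owner is handled as soon as the query finds it, a block shared by several files still reaches every affected process without building the Idea 2 owners list first.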
57. Idea 3 - dev_pagemap interface (3/3)
1-to-N reverse mapping has been implemented
• no useless overhead
• compatible with all NVDIMM modes
• compatible with all filesystems
Some remaining difficulties
• cannot obtain the filesystem from LVM (or another mapped device)
- get_super() is for filesystems created directly on the device
- not suitable for a mapped device
• requires each filesystem to have a feature like “rmapbt”
- the “realtime device” of XFS does not support “rmapbt”
- other filesystems
59. We talked about the following
Basics of NVDIMM on Linux
Issues of Filesystem-DAX (Direct Access mode)
Deep dive to solve the issues of Filesystem-DAX
• Support reflink & dedupe for fsdax
• Fix NVDIMM-based reverse mapping
• The community has made many enhancements for NVDIMM on Linux
• We have worked on NVDIMM to remove the experimental status of Filesystem DAX
• We expect this will be achieved soon
62. How to use NVDIMM in Linux
Make a region
• You can configure it on the BIOS screen (or with the ipmctl command for Intel DCPMM)
• The region is created by hardware (the memory controller)
• You need to reboot to enable the created region
Make a namespace
• A namespace is a concept similar to a SCSI LUN
• You can configure it with the ndctl command
Make a filesystem (storage or Filesystem DAX)
• If you would like to use Filesystem DAX, you need to select ext4 or xfs, which support Filesystem DAX
• Currently, if you select xfs, the reflink option must be disabled
Mount the filesystem (storage or Filesystem DAX) with the -o dax option
Make a pool (Filesystem DAX or Device DAX) with the PMDK tool if necessary
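The steps above (after the BIOS/region part, which needs a reboot) can be collected into the commands an administrator would run; this sketch only assembles the command strings, and the device name and mount point are illustrative:

```python
def fsdax_setup_commands(dev="pmem0", mountpoint="/mnt/pmem"):
    """Assemble the Filesystem-DAX setup steps from the slide as shell
    commands. The region itself is configured in the BIOS (or with
    ipmctl) and requires a reboot, so it is not listed here."""
    return [
        "ndctl create-namespace -m fsdax",        # make the namespace
        f"mkfs.xfs -m reflink=0 /dev/{dev}",      # reflink must currently be off for DAX
        f"mount -o dax /dev/{dev} {mountpoint}",  # mount with the dax option
    ]
```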
63. Users of NVDIMM at the kernel layer
dm-writecache
• Persistent cache for writes at the device-mapper layer
• dm-cache can use SSD as a cache; dm-writecache can use NVDIMM
• This presentation is helpful:
• https://github.com/ChinaLinuxKernel/CLK2019/blob/master/clk2019-dm-writecache-04.pdf
Kernel shows Device DAX as DRAM
• Though Intel DCPMM has a “Memory Mode” which users can use as huge RAM, it has some pains
• The DRAM size disappears, because DRAM becomes just a cache of NVDIMM
• Software cannot choose between the DRAM and DCPMM areas
• With this kernel feature, the system RAM size becomes the sum of DRAM and NVDIMM
• Use App Direct mode and a Device DAX namespace
• Memory hot-add the Device DAX area
• It becomes a NUMA node