Function Level Analysis of Linux NVMe Driver
Comparative analysis on legacy and MQ-based NVMe driver
Ingu Kang
Embedded Systems Lab.
Kookmin University
2016-05-09
3. Linux NVMe Function Call Flow
● Up to Linux 3.18 (~3.18), the NVMe driver bypasses the block layer routines and takes bio structure instances directly.
● On Linux 3.19+, bios go through blk-mq and are converted into request structures.
● blk-mq queues requests into the NVMe driver by calling nvme_queue_rq().
● On Linux 4.4+, the driver was refactored, including optimizations and bug fixes.
4. Linux ~3.18
1. nvme_make_request() takes a bio from the upper layer and converts it into an iod (I/O descriptor). An iod is allocated independently for each bio.
2. nvme_submit_iod() takes the iod and obtains a struct nvme_cmd_info, together with its index, from the space pre-allocated for it in the current CPU's nvmeq (a struct nvme_queue).
3. nvme_submit_iod() builds a command in the submission queue (SQ) from the iod. It also fills in the pre-allocated nvme_cmd_info with a reference to the iod and a callback, both of which are used later when the completion queue (CQ) entry is processed. Finally, it rings the SQ doorbell.
Note: Pointers to the iod and the callback are saved in cmd_info because SQ/CQ entries can carry a command_id but not the iod and callback pointers.
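The command_id bookkeeping described for Linux ~3.18 can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the driver's code: every name prefixed with sketch_ and the queue depth are hypothetical, and a linear scan stands in for whatever slot allocation the driver actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DEPTH 8 /* hypothetical queue depth, chosen for illustration */

typedef void (*sketch_completion_fn)(void *ctx, int status);

/* Simplified stand-in for struct nvme_cmd_info: one slot per command_id. */
struct sketch_cmd_info {
    sketch_completion_fn fn; /* callback to run on completion */
    void *ctx;               /* in the driver this would point at the iod */
    int in_use;
};

/* Simplified stand-in for the per-queue state (nvmeq). */
struct sketch_queue {
    struct sketch_cmd_info info[SKETCH_DEPTH]; /* pre-allocated with the queue */
};

/* Submission side: claim a free slot and return its index as the command_id.
 * Returns -1 when every slot is busy. */
int sketch_alloc_cmdid(struct sketch_queue *q, void *ctx, sketch_completion_fn fn)
{
    assert(q != NULL && fn != NULL);
    for (int id = 0; id < SKETCH_DEPTH; id++) {
        if (!q->info[id].in_use) {
            q->info[id].fn = fn;
            q->info[id].ctx = ctx;
            q->info[id].in_use = 1;
            return id;
        }
    }
    return -1;
}

/* Completion side: a CQ entry carries only command_id and status, so the
 * saved ctx and callback are recovered by indexing the slot array.
 * Returns 0 on success, -1 for an unknown or already completed command_id. */
int sketch_complete_cmdid(struct sketch_queue *q, uint16_t command_id, int status)
{
    if (command_id >= SKETCH_DEPTH || !q->info[command_id].in_use)
        return -1;
    struct sketch_cmd_info *slot = &q->info[command_id];
    sketch_completion_fn fn = slot->fn;
    void *ctx = slot->ctx;
    slot->in_use = 0; /* release the slot before running the callback */
    fn(ctx, status);
    return 0;
}

/* Example callback: stores the status in the int that ctx points at. */
void sketch_record_status(void *ctx, int status)
{
    *(int *)ctx = status;
}
```

A caller would zero-initialize a struct sketch_queue, call sketch_alloc_cmdid() at submission time, and pass the returned index to sketch_complete_cmdid() once the matching completion arrives.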
5. Linux 3.19+ (1)
1. blk_mq_make_request() takes a bio and builds a request instance from it. The instance is picked from the request pool of an hctx, which is pre-allocated at device initialization. request->tag holds the index of the request instance within the pool array. The tag is used as the command_id in nvme_submit_iod() later on. The request is then added to plug->mq_list.
2. blk_mq_flush_plug_list() and blk_mq_insert_requests() flush requests from plug->mq_list to ctx->rq_list (note: ctx->rq_list is a per-CPU SW queue).
3. flush_busy_ctxs() and __blk_mq_insert_request() flush requests from ctx->rq_list to a locally defined rq_list that acts as a 1-to-1 HW dispatch queue, and the requests are then processed with nvme_queue_rq().
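The three steps above (tagged request pool, plug list, per-CPU SW queue, local dispatch list) can be sketched as a user-space model. It is only an approximation under stated assumptions: all sketch_ names and the pool size are hypothetical, plain singly linked lists replace the kernel's list_head, and recording tags into an array stands in for the call to nvme_queue_rq().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_POOL_SIZE 4 /* hypothetical pool size for illustration */

/* Stand-in for struct request: tag is the request's own index in the pool. */
struct sketch_request {
    int tag;
    int in_use;
    struct sketch_request *next; /* link for whichever list holds it */
};

struct sketch_list {
    struct sketch_request *head, *tail;
};

/* Stand-in for an hctx with its pre-allocated request pool. */
struct sketch_hctx {
    struct sketch_request pool[SKETCH_POOL_SIZE];
};

void sketch_hctx_init(struct sketch_hctx *h)
{
    for (int i = 0; i < SKETCH_POOL_SIZE; i++) {
        h->pool[i].tag = i; /* tag == index in the pool array */
        h->pool[i].in_use = 0;
        h->pool[i].next = NULL;
    }
}

/* Pick a free request from the pool; returns NULL when the pool is empty. */
struct sketch_request *sketch_get_request(struct sketch_hctx *h)
{
    for (int i = 0; i < SKETCH_POOL_SIZE; i++) {
        if (!h->pool[i].in_use) {
            h->pool[i].in_use = 1;
            return &h->pool[i];
        }
    }
    return NULL;
}

void sketch_add_tail(struct sketch_list *l, struct sketch_request *rq)
{
    assert(l != NULL && rq != NULL);
    rq->next = NULL;
    if (l->tail)
        l->tail->next = rq;
    else
        l->head = rq;
    l->tail = rq;
}

/* Move every request from src to the tail of dst, leaving src empty.
 * Used twice: plug list -> SW queue, then SW queue -> dispatch list. */
void sketch_splice_tail(struct sketch_list *dst, struct sketch_list *src)
{
    if (!src->head)
        return;
    if (dst->tail)
        dst->tail->next = src->head;
    else
        dst->head = src->head;
    dst->tail = src->tail;
    src->head = src->tail = NULL;
}

/* Drain the dispatch list in order, recording each tag in tags_out.
 * Returns the number of requests dispatched (at most max). */
int sketch_dispatch_all(struct sketch_list *l, int *tags_out, int max)
{
    int n = 0;
    while (l->head && n < max) {
        struct sketch_request *rq = l->head;
        l->head = rq->next;
        if (!l->head)
            l->tail = NULL;
        rq->next = NULL;
        tags_out[n++] = rq->tag; /* the tag later serves as command_id */
    }
    return n;
}
```

Requests stay marked in_use after dispatch in this model, since a request is only returned to the pool once its completion has been handled.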
6. Linux 3.19+ (2)
4. nvme_queue_rq() allocates an iod and converts the request into it.
5. nvme_submit_iod() converts the iod into a command. request->tag is reused as the command_id. The callback and iod information are stored in nvme_cmd_info for later use in the completion routine. It then rings the SQ doorbell.
Note 1: The address of the pre-allocated nvme_cmd_info is obtained by calling blk_mq_rq_to_pdu(), which computes the address of the PDU data area by adding sizeof(struct request) to the address of the request instance.
Note 2: nvme_cmd_info and nvme_iod are merged in the most recent version of the NVMe driver. All iods are pre-allocated at device initialization, but the DMA memory segment information array iod->sg is NOT.
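The pointer arithmetic in Note 1 can be illustrated with a small user-space sketch. The pdu_ names, the struct fields, and the pool layout here are assumptions made for illustration; the kernel's pool allocation also applies padding and alignment that this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct request (fields are hypothetical). */
struct pdu_request {
    int tag;
    void *queue;
};

/* Stand-in for the driver's per-request data (nvme_cmd_info). */
struct pdu_cmd_info {
    void (*fn)(void *ctx, int status);
    void *ctx;
};

/* The driver data sits directly behind the request, so the request size
 * must keep the following struct properly aligned. */
_Static_assert(sizeof(struct pdu_request) % _Alignof(struct pdu_cmd_info) == 0,
               "driver data would be misaligned");

/* Same idea as blk_mq_rq_to_pdu(): the PDU area starts sizeof(request)
 * bytes after the request itself. */
void *pdu_rq_to_pdu(struct pdu_request *rq)
{
    return (char *)rq + sizeof(*rq);
}

/* Allocate depth slots, each holding a request followed by cmd_size bytes
 * of driver data. Returns NULL on allocation failure. */
void *pdu_alloc_pool(size_t depth, size_t cmd_size)
{
    return calloc(depth, sizeof(struct pdu_request) + cmd_size);
}

/* Locate the request with the given tag (its index in the pool). */
struct pdu_request *pdu_pool_request(void *pool, size_t cmd_size, size_t tag)
{
    size_t slot = sizeof(struct pdu_request) + cmd_size;
    return (struct pdu_request *)((char *)pool + tag * slot);
}
```

Because the driver data is co-allocated with each request, finding it needs no lookup table: the request pointer alone is enough.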