FSCK SX
An Approach to FSCK Performance Enhancement
Gaurav Naigaonkar, Sanjyot Tipnis, Ajay Mandvekar, Moksh Matta
{gnaigaonkar, sanj312}@gmail.com
{ajay30_sam, mokshmatta1004}@hotmail.com
Pune Institute of Computer Technology, Pune
Abstract

File System Check Program (fsck) is an interactive file system check and repair program. Fsck uses the redundant structural information in the UNIX file system to perform several consistency checks. Unfortunately, disk capacity is growing faster than disk bandwidth, seek times are hardly budging, and the overall chance of an I/O error occurring somewhere on the disk is increasing. The result: the traditional file system check and repair cycle will be not only longer, but also more frequent, with disastrous consequences for data availability. Data reliability will also decline with the frequency of corruption.

We have implemented techniques for reducing the average fsck time on ext3 file systems. First, we improve performance by parallelizing the two major operations of fsck: fetching metadata and checking it. This multithreaded operation, along with intelligent issuing of IO requests, helps to greatly reduce the overall seek time. We have also implemented 'Metaclustering', wherein we store the indirect blocks in clusters on a per-group basis instead of spreading them out along with the data blocks. This makes fsck even faster, since it can now read and verify all indirect blocks without much seeking.

Keywords: Fsck, Parallelism, Metadata clustering, Readahead

1. Introduction

File system repair is usually an afterthought for file system designers. One reason is that repair is difficult and annoying to reason about. It is neither possible nor worthwhile to fix all error modes, so we must focus our efforts on the ones that commonly occur, yet we do not know what they are until we encounter them in the wild. In practice, most file system repair code is written in response to an observed corruption mode. File system repair is annoying because, by definition, something went wrong and we must think outside the state space of our beautifully designed system. In the end, designing a file system is more fun than designing a file system checker.

For many years, we could brush off the importance of making file system repair fast and reliable with the following chain of reasoning: file system corruption is a rare event, and when it does occur, repairing it takes only a few minutes or maybe a few hours of downtime, and if repair is too difficult or time-consuming, "that's what backups are for." Unfortunately, if this reasoning was ever valid, it is being eroded by some inconvenient truths about disk hardware trends.

                   2006   2009   2013   Change
Capacity (GB)       500   2000   8000   16x
Bandwidth (Mb/s)   1000   2000   5000   5x
Seek Time (ms)        8    7.2    6.5   1.2x
Table 1: Projected disk hardware trends
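The squeeze implied by Table 1 can be made concrete with a little arithmetic. The sketch below (our own illustration in Python, reading the bandwidth column as megabits per second and taking 1 GB as 10^9 bytes) computes the time for one full sequential sweep of the disk, a lower bound for any workload that reads data proportional to disk size:

```python
# Rough arithmetic on the Table 1 projections: time for one full
# sequential sweep of the disk, the floor for any fsck-style scan.
# Assumption: "Mb/s" is megabits per second, 1 GB = 10^9 bytes.

def full_scan_seconds(capacity_gb, bandwidth_mbit):
    bytes_total = capacity_gb * 10**9           # disk capacity in bytes
    bytes_per_sec = bandwidth_mbit * 10**6 / 8  # megabits -> bytes/s
    return bytes_total / bytes_per_sec

t_2006 = full_scan_seconds(500, 1000)   # 4000 s, about 67 minutes
t_2013 = full_scan_seconds(8000, 5000)  # 12800 s, about 3.5 hours

print(t_2006, t_2013, t_2013 / t_2006)  # the ratio is 16x / 5x = 3.2
```

Even before a single seek is counted, the minimum time for a whole-disk sweep more than triples over the projection period, which is why seek-bound metadata reads dominate the picture that follows.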
As Table 1 shows, Seagate projects that during the same time that disk capacity increases by 16 times, disk bandwidth will increase by only 5 times, and seek times will remain nearly unchanged. This is good news for many common workloads: we can store more data and read and write more of it at once. But it is terrible news for any workload that is (a) proportional to the size of the disk, (b) throughput-intensive, and (c) seek-intensive.

One workload that fits this profile is file system check and repair. It has been calculated that file system check and repair time will increase by approximately a factor of 10 between 2006 and 2013 with today's file system formats.

At the same time that capacity is increasing, the per-bit error rate is improving. However, for an overall improvement in the error rate for operations that read data proportional to the file system size (such as fsck), the per-bit error rate must improve as fast as capacity grows, which seems unlikely. We conclude that the frequency of file system corruption and the necessary check and repair or restore is more likely to increase than decrease. This combination of increasing fsck time and increasing fsck frequency is what we call the fsck time crunch.

2. The fsck program

Cutting down crash recovery time for an ext3 file system depends on understanding how the file system checker program, fsck, works. After Linux has finished booting the kernel, the root file system is mounted read-only and the kernel executes the init program. As part of normal system initialization, fsck is run on the root file system before it is remounted read-write, and on other file systems before they are mounted. Repair of the file system is necessary before it can be safely written. When fsck runs, it checks to see if the ext3 file system was cleanly unmounted by reading the state field in the file system superblock. If the state is set as VALID, the file system is already consistent and does not need recovery; fsck exits without further ado. If the state is INVALID, fsck does a full check of the file system integrity, repairing any inconsistencies it finds. In order to check the correctness of allocation bitmaps, file nlinks, directory entries, etc., fsck reads every inode in the system, every indirect block referenced by an inode, and every directory entry. Using this information, it builds up a new set of inode and block allocation bitmaps, calculates the correct number of links of every inode, and removes directory entries to unreferenced inodes. It does many other things as well, such as sanity-checking inode fields, but these three activities fundamentally require reading every inode in the file system. Otherwise, there is no way to find out whether, for example, a particular block is referenced by a file but is marked as unallocated in the block allocation bitmap. In summary, there are no back pointers from a data block to the indirect block that points to it, or from a file to the directories that point to it, so the only way to reconstruct reference counts is to start at the top level and build a complete picture of the file system metadata.

Unsurprisingly, it takes fsck quite some time to rebuild the entirety of the file system metadata, approximately O(total file system size + data stored). The average laptop takes several minutes to fsck an ext2 file system; large file servers can sometimes take hours or, on occasion, days!

Straightforward tactical performance optimizations, such as requesting reads of needed blocks in sequential order and readahead requests, can only improve the situation so much, given that the whole operation will still
take time proportional to the entire file system. What we want is file system recovery time that is O(writes in progress), as is the case for journal replay in journaling file systems.

3. Motivation

The fundamental limiting factors in the performance of fsck are the amount of data read, the number of separate I/Os, how scattered the data is on disk, the number of dependent reads, and the CPU time required to check and repair the data read. The amount of memory available is a factor as well, though most fsck programs operate on an all-or-nothing basis: either there is enough memory to fit all the needed metadata for a particular checking pass, or the checker simply aborts.

The time required to read the file system metadata is partially constrained by the bandwidth of the disk. Depending on the file system, some kinds of file system data, such as blocks of statically allocated inodes or block group summaries, are located in contiguous chunks at known locations. Reading this data off disk is relatively swift.

Other kinds of file system data are dynamically allocated, such as directory entries, indirect blocks, and extents, and hence are scattered all over the disk. Many modern file systems allocate nearly all their metadata dynamically. The location of much of this kind of metadata is not known until the block of metadata pointing to it is read off the disk, introducing many levels of dependent reads. This portion of the file system check is usually the most time-consuming, as we must issue a set of small scattered reads, wait for them to complete, read the address of the next block, then issue another set of reads.

Finally, we need CPU time and sufficient memory to actually compute the consistency checks on the data we have read off disk. This takes a relatively small amount of time compared to the time spent doing what are effectively random 4 KB or similar-sized reads, although more complex file systems may burn more CPU time computing checksums or similar tasks. In summary, the ways to reduce fsck time, in rough order of effectiveness, are to reduce seeks, reduce dependent reads, reduce the amount of metadata that needs to be read (either by reducing the overall quantity or the amount that needs to be read), and to reduce the complexity of the consistency checks themselves.

4. Our Approach

In order to discover and correct file system errors, fsck must read all the metadata in the entire file system. Hence, the basic idea of our project is to introduce parallelism into the operation of fsck by prefetching these metadata blocks (which include inodes, bitmaps, directory entries, indirect blocks, block group summaries, etc.) and simultaneously performing consistency checks on this prefetched data.

Originally, fsck consists of a single thread of operation. To enhance fsck performance, we have added an extra thread to read ahead the indirect blocks. Thus, the project involves two threads, namely a 'main thread' and a 'prefetch thread'. The main thread is responsible for the actual data checking, while the prefetch thread fetches the metadata (indirect blocks) for the main thread. This modification ensures that when the main thread begins its checking operation, the metadata it requires has already been brought into the system cache by the prefetch thread. Hence, there is a reduction in the overall time taken by fsck to complete its operations.

While actually fetching the data from the disk, the prefetch thread has to go to the disk many times, and each time the data
brought into the cache will be minimal. As a solution to this, we have designed a strategy to reduce the number of disk seeks for indirect blocks and also to increase the amount of data brought in each time we go to the disk. This strategy involves queuing the block numbers to be brought in until we reach the end of a block group; once the end is reached, these queued IOs are issued to fetch the blocks from disk. Also, by merging the IO requests, we ensure that during each disk seek the maximum amount of data can be prefetched into the cache, instead of issuing single IOs.

Another aspect of the project is metadata clustering. Metaclustering refers to storing indirect blocks in clusters on a per-group basis instead of spreading them out along with the data blocks. This makes fsck faster, since it can now read and verify all indirect blocks without many seeks.

Fsck involves five passes. Pass 1 is responsible for checking inodes, blocks and sizes, and pass 2 for checking directory structure. Implementing the above-mentioned features helps us achieve a reduction in the times for these two passes, which take up the most time as compared to the other passes. Hence, by our modifications and additions to the original fsck utility, we can ensure an improvement in the overall performance of fsck.

5. Implementation

5.1 Parallelization Operation

As mentioned earlier, when the main fsck thread begins its operation, it needs to fetch the metadata from the disk. This data is then checked for consistency. In other words, while the metadata is being fetched, no other checking is performed and the CPU remains idle. This is a major bottleneck which leads to fsck taking an enormous amount of time to check and repair the file system. As a solution to this, we have implemented a multi-threaded model in which we create a new thread to perform the fetching of metadata. This thread, which we call the 'prefetch' thread, performs the task of prefetching metadata from disk and making it available to the main fsck thread for performing the usual consistency checks. In this way, we can ensure maximum CPU utilization by coordinating the operation of both the main and the prefetch thread. Since the prefetch thread reads in all the metadata that the main thread requires, the main thread is absolved of the fetching work. As a result, the checking and fetching of data can take place simultaneously, thus ensuring performance benefits with regard to the time taken for the overall working of fsck.

Figure 1: Multithreaded Fsck

5.2 Working of the prefetch thread

As the name suggests, the prefetch thread has been introduced to prefetch, or read ahead, the metadata for the main thread. We have added two new queues, namely a 'Select Queue' and an 'IO Queue', which form an integral part of the prefetch procedure.

The working of the prefetch thread and the queues can be better understood from Figure 2 and can be summarised as follows:

1. Initially, the inode table location on disk, i.e. the block number holding the inode table, is read into the IO queue.

2. This inode table block is then actually fetched from disk into the buffer cache.
3. From this table, individual inodes are picked up to perform various consistency checks.

4. For each inode in the table, the prefetch thread fetches the indirect block numbers associated with it into the select queue. Thus, the select queue holds the indirect block numbers of every inode in the current inode table.

5. Once the end of the block group is reached, the select queue is sorted by block number and merging is performed to club together contiguous block numbers. Then all those block numbers that lie within the current block group are transferred into the IO queue. Thus, the IO queue holds all those block numbers that are to be currently fetched from disk.

6. Finally, the indirect blocks indicated by the block numbers present in the IO queue are fetched from disk into the buffer cache. Thus, the required metadata blocks become available to the main thread.

Figure 2: Pre-fetch thread working

The above implementation provides the following benefits:

1. The main fsck thread performs only the checking of metadata and does not have to go to the disk to fetch any blocks, as all the blocks required by the main thread have been prefetched into the system cache by the prefetch thread.

2. The select queue that holds the indirect block numbers is sorted. This helps us achieve a nearly sequential sweep of the read-write head over the disk.

3. By merging the requests in the select queue, we reduce the number of times the prefetch thread needs to go to the disk to fetch blocks. Thus we ensure minimal inordered seeks and also minimal overall fetches from disk.

4. Only those block numbers that lie within the current block group are held by the IO queue. These blocks are then fetched from disk. Thus, we limit our fetching to the current block group while delaying the fetching of those indirect blocks that lie in other block groups. This further helps in reducing random
movement of the read-write head over the disk.

5.3 Metadata Clustering

Every block group has at its end a semi-reserved region called the 'metacluster'. This region is mostly used for allocating indirect blocks. Under normal circumstances, the metacluster is used only for allocating indirect blocks, which are allocated in decreasing order of block numbers. The non-metacluster region is used for data block allocations, which are allocated in increasing order of block numbers. However, when the metacluster (MC) runs out of space, indirect blocks can be allocated in the non-MC region along with the data blocks in the forward direction. Similarly, when the non-MC region runs out of space, new data blocks are allocated in the MC, but in the forward direction.

Figure 3: Metaclustering

The steps involved in metadata clustering are:

1. Read the inode table into the buffer cache.
2. Scan through each inode present in the table.
3. Find the number of indirect blocks indicated by the inodes.
4. Find the amount of contiguous free space (metacluster region) required to cluster or group together these indirect blocks.
5. Transfer the indirect blocks into the metacluster region.
6. Perform the required updates of metadata to reflect the above changes to the file system.

Such clustering helps to club together the indirect blocks and hence reduces the seeks needed to fetch these blocks, which otherwise are spread out across the file system.

6. Performance Evaluation

6.1 Test Environment

Processor:                      Intel Core 2 Duo (2.6 GHz)
Memory:                         512 MB
Operating System:               Kubuntu Gutsy Gibbon 7.10
Kernel:                         2.6.23
Base File System Checker Code:  e2fsprogs-1.40.4
Number of disks used:           2+1
Type of disks:                  IDE
Size of each disk:              40 GB
Partition size:                 80 GB (RAID 0)
Avg. Seek Time:                 9 ms
Avg. Rotational Latency:        6 ms
Table 2: Experiment Disk Characteristics (1 GB = 10^9 bytes)

6.2 Time vs percentage of file system in use

We measured the time taken to run fsck on our test machine with a gradual increase in the percentage of the file system used. We observed that by using the readahead concept to fetch the data, we get a 23-35% decrease (about 30% on average) in the time taken to complete an fsck run.

% filled   Original (s)   Modified (s)
15             99.62          66.26
25            198.52         132.69
35            303.10         215.65
45            331.88         231.46
55            412.32         290.87
65            475.72         365.14
75            526.39         397.67
Table 3: Time taken to run fsck
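As a quick sanity check, the per-row improvement in Table 3 can be recomputed directly (times in seconds, copied from the table):

```python
# Per-row percentage reduction in fsck runtime, from Table 3.
# Each tuple is (fill %, original seconds, modified seconds).
rows = [
    (15,  99.62,  66.26),
    (25, 198.52, 132.69),
    (35, 303.10, 215.65),
    (45, 331.88, 231.46),
    (55, 412.32, 290.87),
    (65, 475.72, 365.14),
    (75, 526.39, 397.67),
]

reductions = {fill: 100 * (orig - mod) / orig for fill, orig, mod in rows}
average = sum(reductions.values()) / len(reductions)

for fill, pct in reductions.items():
    print(f"{fill:2d}% full: {pct:4.1f}% faster")
print(f"average: {average:.1f}%")
```

The reduction ranges from roughly 23% at high fill levels to 33% at low fill levels, about 29% on average.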
Graph 1: Time vs. FS usage (%)

6.3 Finding sequential order breaks with respect to file system usage

We also measured the number of times fsck needs to fetch blocks which are inordered, i.e. which break the sequential movement of the read-write head, and found that in the original fsck the number of inordered reads is quite high. This count reduces considerably with the modified fsck. The actual counts are as follows:

File system used (GB)   Original   Modified
10                          29          5
20                          47         26
30                          89         40
40                         107         49
50                         136         61
Table 4: Number of inordered reads

6.4 Finding seeks per number of IOs

To get a rough estimate of how a 16x capacity increase, a 5x bandwidth increase, and almost no change in seek time will affect fsck time in 2013, we ran fsck on the /dev/md0 partition (ext3 formatted) of a desktop machine (see Table 2 for details on the disks used).

We measured the elapsed time and CPU time using the time command, and the number of individual I/O operations and the total data read using iostat. Using this data and projected changes in disk hardware, we made a rough estimate of the time needed to complete a file system check on a moderately sized desktop (RAID 0) file system in 2013.

First, we estimated how the elapsed time divides into time spent using the CPU, reading data off the disk, and head travel between blocks (seek time plus rotational latency). To check an 80 GB file system with 60 GB of data:

The total elapsed time of the original fsck is 527 seconds, 21 seconds of which are spent in CPU time. That leaves 506 seconds in I/O. The total elapsed time of the modified fsck is 398 seconds, 16 seconds of which are spent in CPU time. That leaves 382 seconds in I/O.

We measured 1.5 GB of data read. We estimated the amount of time to read 1.5 GB of data off the disk by using dd to transfer that amount of data from the partition into /dev/null, which took 37 seconds at the optimal read size.

The remaining time, 469 seconds (for the original fsck) and 345 seconds (for the modified fsck), we assume is spent seeking between tracks and waiting for the data to rotate under the head.

We measured 233,440 separate I/O requests. The average seek time for this disk is 9 ms, and the average rotational latency is 6 ms. We estimate that the original fsck required about 32,000 seeks (about one seek for every 35-38 I/Os), while the modified fsck required about 22,000 seeks (about one seek for every 58-60 I/Os).
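The decomposition above is straightforward to reproduce. The following sketch is our own restatement, using only the measurements reported in this section (elapsed time, CPU time, and the 37-second dd transfer), with the 9 ms seek plus 6 ms rotational latency taken as the average access time:

```python
# Decompose fsck elapsed time into CPU, transfer, and head-travel time,
# then estimate the number of head movements (seeks). All inputs are
# the measurements reported for the 80 GB (60 GB used) test partition.

def estimate_seeks(elapsed_s, cpu_s, transfer_s,
                   seek_ms=9.0, rotational_ms=6.0):
    io_time = elapsed_s - cpu_s            # time spent in I/O
    head_travel = io_time - transfer_s     # left over after raw transfer
    access_s = (seek_ms + rotational_ms) / 1000.0
    return head_travel / access_s          # one access per head movement

seeks_original = estimate_seeks(527, 21, 37)  # roughly 31,000
seeks_modified = estimate_seeks(398, 16, 37)  # roughly 23,000

print(round(seeks_original), round(seeks_modified))
```

Dividing the leftover head-travel time by a fixed 15 ms access time is a coarse model, which is why the results only agree with the reported estimates to the nearest thousand or so.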
Steps to find IOs per seek:

1. Elapsed time = CPU time + time to read data off the disk + head travel between the blocks (seek time + rotational latency).
2. Calculate the total elapsed time for e2fsck.
3. Subtract the CPU time from it to get the input/output time. Note: CPU time = user time + system time (the time command is used).
4. Calculate the time required to read a certain amount of metadata (approximately the current metadata for the test). Note: use a dd operation to calculate this.
5. Subtract the above time from the I/O time to get the time spent in head travel.
6. Divide this time by the disk access time to get the number of seeks. Note: access time = seek time + rotational latency.
7. Calculate the number of I/O requests for the test.
8. Divide this by the number of seeks calculated in step 6 to get the number of I/Os per seek.

7. Conclusion

This paper presents the design and implementation of a multithreaded file system checker (fsck), an improvement over the current single-threaded version. It also describes the extensions implemented in the current fsck to enable clustering of metadata, which further improves performance. The sample tests performed using FSCK-SX have shown its capability to achieve nearly a 30% enhancement over current performance. FSCK-SX thus provides a framework for an optimized file system checker on the ext3 file system, a concept which can be extended to other file systems.

8. References

[1] Valerie Henson. Repair-Driven File System Design. Open Source Technology Centre, Intel Corporation.

[2] Val Henson, Zach Brown, Theodore Ts'o, and Arjan van de Ven. Reducing fsck time for ext2 file systems. In Ottawa Linux Symposium, 2006.

[3] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In Hot Topics in System Dependability, 2006.

[4] Design and Implementation of the Second Extended File System. http://e2fsprogs.sourceforge.net/ext2intro.html

[5] T. J. Kowalski and Marshall K. McKusick. Fsck - the UNIX file system check program. Technical report, Bell Laboratories, 1978.