Note: There’s a limited number of pointers (e.g., 15
above)
What to do when a file is bigger than 15 blocks?
→ Indirection - Multi-level index
Multi-level indexing
One of the data pointers points to a data-block which contains
another pointer table (an indirect block)
So if we have 15 pointers in an inode, each pointer is 4 bytes, and
blocks are 4 KB, we get 14 + (4K / 4) = 14 + 1024 = 1038 pointers,
which means our files can have 1038 * 4K = 4152K (roughly
4 MB)
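As a sanity check, the arithmetic above (15 inode pointers, 4-byte pointers, 4 KB blocks) can be written out:

```python
BLOCK_SIZE = 4096          # 4 KB blocks, as assumed above
PTR_SIZE = 4               # 4-byte block pointers
INODE_PTRS = 15            # pointers in the inode

ptrs_per_block = BLOCK_SIZE // PTR_SIZE          # 1024 pointers fit in one block
direct = INODE_PTRS - 1                          # one pointer becomes the indirect block
total_ptrs = direct + ptrs_per_block             # 14 + 1024 = 1038
max_file_kb = total_ptrs * BLOCK_SIZE // 1024    # 4152 KB, roughly 4 MB

print(total_ptrs, max_file_kb)   # 1038 4152
```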
What if we need bigger files?
→ Double indirection
Add a pointer to a block that contains pointers to blocks with
pointers
with 15 inode pointers this gives 13 direct + 1024 (single
indirect) + 1024 * 1024 (double indirect) pointers, i.e. files of
roughly 4 GB
More needed?
→ Triple indirection
This approach is used by ext2, ext3, the original UNIX fs, etc.
Note that we get a very imbalanced tree
But this is okay - most files are small!
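In ext2, for example, 12 of the 15 pointers are direct and the last three are the single-, double-, and triple-indirect pointers. With 4 KB blocks, the maximum file size works out to roughly 4 TB:

```python
BLOCK_SIZE = 4096
PTRS = BLOCK_SIZE // 4        # 1024 pointers per indirect block

# ext2-style layout: 12 direct pointers, then one single-, one double-,
# and one triple-indirect pointer (15 pointers total)
direct = 12
single = PTRS                  # 1024 blocks
double = PTRS ** 2             # 1,048,576 blocks
triple = PTRS ** 3             # 1,073,741,824 blocks

max_blocks = direct + single + double + triple
max_bytes = max_blocks * BLOCK_SIZE
print(max_bytes // 2**40)      # roughly 4 (TB) with triple indirection
```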
Extents
If a file is stored as a set of contiguous blocks, it seems wasteful
to point to each separately
Some FSs (e.g., ext4) use “extents” - a pointer together with a size
in blocks
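A minimal sketch of extent-based mapping (names and layout are illustrative, not ext4's actual on-disk format):

```python
BLOCK_SIZE = 4096

# An extent: `length` contiguous blocks starting at disk block `start`.
# (Toy representation; ext4's on-disk extent tree is more involved.)
extents = [(100, 3), (500, 2)]   # file blocks 0-2 at disk 100-102, blocks 3-4 at 500-501

def block_for_offset(extents, offset):
    """Map a byte offset in the file to a disk block number."""
    file_block = offset // BLOCK_SIZE
    for start, length in extents:
        if file_block < length:
            return start + file_block
        file_block -= length           # skip past this extent
    raise ValueError("offset past end of file")

print(block_for_offset(extents, 0))        # 100
print(block_for_offset(extents, 3 * 4096)) # 500
```

Two extents here describe five blocks; with per-block pointers the same file would need five entries.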
Directories
A directory also has an inode (file type = directory)
Data blocks contain file entries: an i-number + the file name +
length of the name
The root directory (/) has a fixed, well-known i-number
Example:

  inum  reclen  strlen  name
  5     12      2       .
  2     12      3       ..
  12    12      4       foo
  13    12      4       bar
  24    36      28      foobar_is_a_pretty_longname
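A toy version of such a directory block, with a hypothetical fixed 8-byte header (inum, reclen, strlen) followed by the name padded out to reclen (real on-disk layouts differ in detail):

```python
import struct

# Hypothetical dirent layout: 4-byte inum, 2-byte reclen, 2-byte strlen,
# then the name padded with NULs to reclen - 8 bytes.
def pack_dirent(inum, reclen, name):
    header = struct.pack("<IHH", inum, reclen, len(name))
    return header + name.encode().ljust(reclen - 8, b"\0")

def parse_dirents(data):
    off, entries = 0, []
    while off < len(data):
        inum, reclen, strlen = struct.unpack_from("<IHH", data, off)
        name = data[off + 8 : off + 8 + strlen].decode()
        entries.append((inum, name))
        off += reclen               # reclen lets us skip padding
    return entries

block = pack_dirent(5, 12, ".") + pack_dirent(2, 12, "..") + pack_dirent(12, 12, "foo")
print(parse_dirents(block))   # [(5, '.'), (2, '..'), (12, 'foo')]
```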
Performance
With our FS, how many I/O operations does accessing a file take?
It depends on the length of the path (at least two reads per
directory)
For write/create operations, bitmaps and inodes also need to be
modified
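Counting reads for this simple FS (assuming one inode read plus one data-block read per directory, one inode read for the final file, and no caching):

```python
def reads_to_open(path):
    """Rough I/O count before any data I/O: every directory on the way
    (root, then each intermediate component) costs two reads, its inode
    and a data block; the final file costs one more read for its inode."""
    components = [c for c in path.split("/") if c]
    n_dirs = len(components)        # root + intermediates = number of components
    return 2 * n_dirs + 1

print(reads_to_open("/foo/bar"))    # 5
print(reads_to_open("/a/b/c/d"))    # 9
```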
→ Caching
Most file systems use main memory as a cache to store frequently
accessed blocks
Cache for reads: can prevent most I/Os
Cache for writes:
Impairs reliability
Most FSs buffer writes for between 5 and 30 seconds
Better I/O scheduling
Merge writes (e.g., for the bitmaps)
Consistency
What happens if a computer/disk loses power while writing
data?
Let’s say that we are trying to update a file (the name is not
important)
The update involves adding another 20 bytes of data
This means we have to add another block
What needs to be updated?
a block needs to be allocated (block bitmap)
the inode has to be updated (size + pointer to the new block)
the data block needs to be written
The power may be cut during any of these tasks
This gives us a few possible crash scenarios depending on whether
just one task was completed or two
One task completed:
A block was allocated
No inode tracks it -> space leak
Data lost, inconsistency
The inode was updated
Inode points to the disk address where data was about to be
written
But there’s just garbage, or contents of a previous file
Even worse: Block could be allocated twice!!
Data lost, inconsistency
The data block was written
No inode tracks it, it’s not marked as allocated
Data lost (as if it was never written), but no inconsistency
Two tasks completed:
Inode and bitmap: yes / Data: no
Data lost, no inconsistency
Inode points to garbage
Inode and data block: yes / Bitmap: no
Inode points to correct data
But: the data block may be allocated and overwritten via two different
files
Inconsistency, high probability of data loss
Bitmap and data block: yes / Inode: no
No inode tracks the data block -> space leak
Inconsistency
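The scenarios above can be enumerated mechanically. A toy classifier over which of the three writes completed (labels mirror the cases discussed above):

```python
from itertools import combinations

WRITES = {"bitmap", "inode", "data"}

def outcome(done):
    """Classify a crash where only the writes in `done` completed."""
    done = set(done)
    if done == WRITES:
        return "consistent, data safe"
    if done == {"data"}:
        return "data lost, no inconsistency"     # as if the write never happened
    if done == {"bitmap", "inode"}:
        return "data lost (inode points to garbage), metadata consistent"
    if "bitmap" in done and "inode" not in done:
        return "space leak: block marked used, no inode tracks it"
    if "inode" in done and "bitmap" not in done:
        return "inconsistency: block may be allocated twice"
    return "data lost"                            # nothing useful reached the disk

for k in range(4):
    for done in combinations(sorted(WRITES), k):
        print(done, "->", outcome(done))
```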
There are generally two solutions people have come up with:
Let bad things happen and check afterwards (fsck)
Write-ahead Logging (journalling)
File System Checking (fsck)
Idea: Bad things happen, let’s see if we can fix them by checking
the disk post-hoc
Usually, this means that a utility (fsck) is run
regularly
Scans the whole disk, looking for inconsistencies
Bitmaps: trust inodes and mark blocks used by inodes as not free
Also, if an inode looks used, mark it as used
Inodes
Check for corruption - do they look okay?
E.g., is the size consistent with the number of blocks?
If the inode cannot be fixed, clear it and deallocate
Inode ref count
Check that there is a corresponding number of directory entries
If not, update the ref count
If ref count == 0, deallocate
Duplicates: No two inodes should point to the same block
Might delete a bad inode
Or could create a copy of the block so each inode has its own
Bad blocks - pointer outside of the disk
Dir checks
Problem takes a long time for a big disk - everything needs to be
scanned
Might be too late to fix a problem
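A sketch of the "trust the inodes" bitmap pass, which also catches duplicates and bad pointers (toy in-memory representation, not an on-disk format):

```python
def rebuild_block_bitmap(inodes, n_blocks):
    """fsck-style pass: trust the inodes and rebuild the block bitmap.
    `inodes` maps i-number -> list of block numbers (toy representation)."""
    bitmap = [False] * n_blocks
    seen = {}
    for inum, blocks in inodes.items():
        for b in blocks:
            if b < 0 or b >= n_blocks:
                print(f"inode {inum}: bad block pointer {b}")  # outside the disk
                continue
            if bitmap[b]:
                print(f"block {b} claimed by inodes {seen[b]} and {inum}!")
            bitmap[b] = True        # trust the inode: mark as used
            seen[b] = inum
    return bitmap

# Inode 7 shares block 3 with inode 5 and has a pointer past the disk end
bm = rebuild_block_bitmap({5: [2, 3], 7: [3, 99]}, n_blocks=16)
print([i for i, used in enumerate(bm) if used])   # [2, 3]
```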
Journalling
Journal - an area of disk where the file system logs what it’s about
to do
Transactions
Journal write: Write the contents of the transaction (containing TxB
and the contents of the update) to the log; wait for these writes to
complete.
Journal commit: Write the transaction commit block (containing TxE)
to the log; wait for the write to complete; the transaction is now
committed.
Checkpoint: Write the contents of the update to their final
locations within the file system.
Free: Some time later, mark the transaction free in the journal by
updating the journal superblock
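The four steps can be sketched with a toy in-memory journal; on recovery, only transactions that reached their TxE record are replayed (names and record shapes are illustrative):

```python
# Toy write-ahead log: a transaction is [TxB, updates..., TxE].
journal = []
disk = {}

def journal_write(txid, updates):
    journal.append(("TxB", txid))
    journal.extend(("update", txid, addr, val) for addr, val in updates)

def journal_commit(txid):
    journal.append(("TxE", txid))          # commit point: transaction is durable

def recover():
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    for rec in journal:
        if rec[0] == "update" and rec[1] in committed:
            _, _, addr, val = rec
            disk[addr] = val               # checkpoint: replay into final location

journal_write(1, [("inode5", "size=40"), ("bitmap", "block7=used")])
journal_commit(1)
journal_write(2, [("inode6", "size=8")])   # crash before TxE: tx 2 is ignored
recover()
print(disk)   # {'inode5': 'size=40', 'bitmap': 'block7=used'}
```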
Metadata Journalling
Data write: Write data to final location; wait for completion (the
wait is optional; see below for details).
Journal metadata write: Write the begin block and metadata to the
log; wait for writes to complete.
Journal commit: Write the transaction commit block (containing TxE)
to the log; wait for the write to complete; the transaction (including
data) is now committed.
Checkpoint metadata: Write the contents of the metadata update to
their final locations within the file system.
Free: Later, mark the transaction free in the journal superblock