Intro
Persistent storage devices allow storing data (non-volatile
memory)
Data stored as a sequence of bits using some physical
representation, e.g.,
magnetic (HDDs, floppy disks, tape drives)
optic (CDs)
electronic (SSDs)
From a (low-level) software perspective: data stored as a sequence
of bytes, usually organized in blocks
Long-term storage in a linear array would be a pain:
What if files grow?
Where do we add new files?
How do we organize data?
→ File systems
Note, the very beginning of a disk has some standardized contents -
we talked about the MBR
Modern file systems have two abstractions:
File - think of them simply as a collection of
bytes that can be read and written and can be addressed using a
name
Directory - a collection of files and other
directories
In its simplest form, this gives rise to a tree-shaped organisation
(there are some complications that we will mention later):
foo/
|
+-----+---+---+--------+
| | | |
bar/ baz/ info.txt hello.c
| |
img.png |
|
+--------+-------+-------+-------+
| | | | |
meme.gif run.sh items.csv main.s letter.docx
File System API
The OS provides several system calls to interact with a file
system
We’ve worked with some file I/O operations:
open / close - open (or create) / close a
file
read / write - read / write bytes from/to
file
There’s more
Files
lseek
fsync
flush a buffer to a file - force data to be written
rename
stat
get information (metadata) about a file
link
associate another name with file data
unlink
remove (a reference to) a file / delete a file
Directories
mkdir
rmdir
opendir / readdir /
closedir
FS Implementation
File systems are typically implemented as “drivers” - but, really,
they do not abstract the hardware directly
Block device drivers that fill the layer below file systems
They provide a software abstraction over the lower-level storage
APIs
All they really do:
Translation: Translate a name + offset to the appropriate
disk position
Free space management: Keep track of where to store
data
To design a FS, we really need to understand and decide on two
things:
Data structures
Access methods
Data structs - how is data and metadata organized - arrays, linked
lists, trees, …
Access methods - how does the file API map to usage of these data
structures
We will look at a simple file system - very similar to the first
Unix FS by Ken Thompson
Useful Observations
Most files are small
2K is the most common size
The average file size is growing
Most bytes are in the large files
A few big file use up most of the space
File systems contain lots of files
File systems are roughly half full
Even as disks grow the number of files is ~50%
Directories are small
Many have few entries, most have 20 or less
Organization
Blocks
A disk is divided into blocks
Fixed size - 4KB is typical
Blocks are numbered sequentially: 0 to N - 1 (for a partition of
size N * 4K)
Let’s say we have 32 blocks…
+---------------------------------------------------------------+
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
+---------------------------------------------------------------+
0 31
Regions
What is stored in blocks?
Data
Majority of the blocks are data blocks - stored in the data
region
+---------------------------------------------------------------+
| | | | | | | | |D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0 31
File Metadata -> Inodes
Blocks belonging to a file
Size
Owner
Access
Inode = Index node
Stored as an inode table
Multiple inodes / block
+---------------------------------------------------------------+
| | | |I|I|I|I|I|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0 31
Free space info
Some blocks at the beginning track which blocks are free
In our example, represented as a bitmap
If a block is in use, its bit is set to 1 otherwise its 0
Need to keep track of both free data blocks and free inodes
Other representations are possible, e.g., free lists
+---------------------------------------------------------------+
| |i|d|I|I|I|I|I|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0 31
File system metadata - Superblock
Information about the file system itself
Type (ext2, ext3, ext4, btrfs, FAT, etc.) - “magic number”
How many inodes in the file system
How many data blocks
Beginning of the inode table
Important when mounting a file system
+---------------------------------------------------------------+
|S|i|d|I|I|I|I|I|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0 31
Adressing contents of files
Each inode has an i-number (a low-level identifier) -
its index in the inode table
Given an index, it’s straightforward to calculate the location of
an inode on a disk (its sector):
blk = (inumber * sizeof(inode_t)) / blockSize
sector = ((blk * blockSize) + inodeStartAddr) / sectorSize
Contains an array of block pointers (addresses) for the file:
foo.txt: +-+
+------>| |
+-+ | |-|
| |------+ | |
|-| |-|
| |-----+ +---->| |
|-| | | |-|
| |-----|--|---->| |
+-+ | | |-|
+--|---->| |
bar.txt: | |-|
+-+ | | |
| |----+ | |-|
|-| | | | |
| |----|---+ |-|
+-+ +-------->| |
+-+
Example inode contents (ext2):
2
mode
can this file be read/written/executed?
2
uid
who owns this file?
4
size
how many bytes are in this file?
4
time
what time was this file last accessed?
4
ctime
what time was this file created?
4
mtime
what time was this file last modified?
4
dtime
what time was this inode deleted?
2
gid
which group does this file belong to?
2
links_count
how many hard links are there to this file?
4
blocks
how many blocks have been allocated to this file?
4
flags
how should ext2 use this inode?
4
osd1
an OS-dependent field
60
block
a set of disk pointers (15 total for 32-bit)
4
generation
file version (used by NFS)
4
file_acl
a new permissions model beyond mode bits
4
dir_acl
called access control lists
Note: There’s a limited number of pointers (e.g., 15
above)
What to do when a file is bigger than 10 blocks?
→ Indirection - Multi-level index
Multi-level indexing
One of the data pointers points to a data-block which contains
another pointer table (an indirect block)
So if we have 15 pointers in an inode, and each pointer is 4 bytes,
we get 14 + (4K / 4) = 14 + 1024 = 1038 pointers, which
means our files can have 1038 * 4K = 4152K (roughly
4M)
What if we need bigger files?
→ Double indirection
Add a pointer to a block that contains pointers to blocks with
pointers
we get 1038 (single indirection) and an additional
1024 * 1024 pointers
More needed?
→ Triple indirection
This approach is used by ext2, ext3, the original UNIX fs, etc.
Note that we get a very imbalanced tree
But this is okay - most files are small!
Extents
If a file is stored as a set of contiguous blocks, it seems wasteful
to point to each separately
Some FSs (e.g., ext4) use “extents” - a pointer together with a size
in blocks
Directories
A directory also has an inode (file type = directory)
Data blocks contain file entries: an i-number + the file name +
length of the name
The root directory (/) has a set i-number
Example:
5
12
2
.
2
12
3
..
12
12
4
foo
13
12
4
bar
24
36
28
foobar_is_a_pretty_longname
With our FS, what is the number of I/O when accessing a file?
It depends on the length of the path (at least two reads per
directory)
For write/create operations, bitmaps and inodes need also
bemodified
→ Caching
Most file systems use main memory as a cache to store frequently
accessed blocks
Cache for reads: can prevent most I/Os
Cache for writes:
Impair reliability
Most FS cache writes between 5 and 30 seconds
Better I/O scheduling
Merge writes (e.g., for the bitmaps)