Lecture 12: File Systems

Intro

Persistent storage devices allow storing data (non-volatile memory)
Data stored as a sequence of bits using some physical representation, e.g.,
- magnetic (HDDs, floppy disks, tape drives)
- optic (CDs)
- electronic (SSDs)
From a (low-level) software perspective: data stored as a sequence of bytes, usually organized in blocks
Long-term storage in a linear array would be a pain:
- What if files grow?
- Where do we add new files?
How do we organize data?
→ File systems
Note, the very beginning of a disk has some standardized contents - we talked about the MBR
Modern file systems have two abstractions:
1. File - think of them simply as a collection of bytes that can be read and written and can be addressed using a name
2. Directory - a collection of files and other directories
In its simplest form, this gives rise to a tree-shaped organisation (there are some complications that we will mention later):

                   foo/
                    |
          +-----+---+---+--------+
          |     |       |        |
         bar/  baz/  info.txt hello.c        
          |     |
      img.png   |
                |
       +--------+-------+-------+-------+
       |        |       |       |       |
    meme.gif run.sh items.csv main.s letter.docx

File System API

The OS provides several system calls to interact with a file system
We’ve worked with some file I/O operations:
- open / close - open (or create) / close a file
- read / write - read / write bytes from/to file
There’s more

Files

lseek
- move withing a file
fsync
- flush a buffer to a file - force data to be written
rename
- rename a file
stat
- get information (metadata) about a file
link
- associate another name with file data
unlink
- remove (a reference to) a file / delete a file

Directories

mkdir
rmdir
opendir / readdir / closedir

FS Implementation

File systems are typically implemented as “drivers” - but, really, they do not abstract the hardware directly
- Block device drivers that fill the layer below file systems
They provide a software abstraction over the lower-level storage APIs
All they really do:
1. Translation: Translate a name + offset to the appropriate disk position
2. Free space management: Keep track of where to store data
To design a FS, we really need to understand and decide on two things:
1. Data structures
2. Access methods
Data structs - how is data and metadata organized - arrays, linked lists, trees, …
Access methods - how does the file API map to usage of these data structures
We will look at a simple file system - very similar to the first Unix FS by Ken Thompson

Useful Observations

Most files are small
- 2K is the most common size
The average file size is growing
- ~200K
Most bytes are in the large files
- A few big file use up most of the space
File systems contain lots of files
- ~100K on average
File systems are roughly half full
- Even as disks grow the number of files is ~50%
Directories are small
- Many have few entries, most have 20 or less

Organization

Blocks

A disk is divided into blocks
Fixed size - 4KB is typical
Blocks are numbered sequentially: 0 to N - 1 (for a partition of size N * 4K)
Let’s say we have 32 blocks…

+---------------------------------------------------------------+
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
+---------------------------------------------------------------+
0                                                             31

Regions

What is stored in blocks?

Data
- Majority of the blocks are data blocks - stored in the data region

+---------------------------------------------------------------+
| | | | | | | | |D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0                                                             31

File Metadata -> Inodes
- Blocks belonging to a file
- Size
- Owner
- Access
- Inode = Index node
- Stored as an inode table
- Multiple inodes / block

+---------------------------------------------------------------+
| | | |I|I|I|I|I|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0                                                             31

Free space info
- Some blocks at the beginning track which blocks are free
- In our example, represented as a bitmap
- If a block is in use, its bit is set to 1 otherwise its 0
- Need to keep track of both free data blocks and free inodes
- Other representations are possible, e.g., free lists

+---------------------------------------------------------------+
| |i|d|I|I|I|I|I|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0                                                             31

File system metadata - Superblock
- Information about the file system itself
- Type (ext2, ext3, ext4, btrfs, FAT, etc.) - “magic number”
- How many inodes in the file system
- How many data blocks
- Beginning of the inode table
- Important when mounting a file system

+---------------------------------------------------------------+
|S|i|d|I|I|I|I|I|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|D|
+---------------------------------------------------------------+
0                                                             31

Adressing contents of files

Each inode has an i-number (a low-level identifier) - its index in the inode table

Given an index, it’s straightforward to calculate the location of an inode on a disk (its sector):

blk = (inumber * sizeof(inode_t)) / blockSize
sector = ((blk * blockSize) + inodeStartAddr) / sectorSize

Contains an array of block pointers (addresses) for the file:

foo.txt:          +-+
          +------>| |
 +-+      |       |-|
 | |------+       | |
 |-|              |-|
 | |-----+  +---->| |
 |-|     |  |     |-|
 | |-----|--|---->| |
 +-+     |  |     |-|
         +--|---->| |
bar.txt:    |     |-|
 +-+        |     | |
 | |----+   |     |-|
 |-|    |   |     | |
 | |----|---+     |-|
 +-+    +-------->| |
                  +-+

Example inode contents (ext2):

Size	Name	Purpose
2	mode	can this file be read/written/executed?
2	uid	who owns this file?
4	size	how many bytes are in this file?
4	time	what time was this file last accessed?
4	ctime	what time was this file created?
4	mtime	what time was this file last modified?
4	dtime	what time was this inode deleted?
2	gid	which group does this file belong to?
2	links_count	how many hard links are there to this file?
4	blocks	how many blocks have been allocated to this file?
4	flags	how should ext2 use this inode?
4	osd1	an OS-dependent field
60	block	a set of disk pointers (15 total for 32-bit)
4	generation	file version (used by NFS)
4	file_acl	a new permissions model beyond mode bits
4	dir_acl	called access control lists

Note: There’s a limited number of pointers (e.g., 15 above)
What to do when a file is bigger than 10 blocks?
→ Indirection - Multi-level index

Multi-level indexing

One of the data pointers points to a data-block which contains another pointer table (an indirect block)
So if we have 15 pointers in an inode, and each pointer is 4 bytes, we get 14 + (4K / 4) = 14 + 1024 = 1038 pointers, which means our files can have 1038 * 4K = 4152K (roughly 4M)
What if we need bigger files?
→ Double indirection
Add a pointer to a block that contains pointers to blocks with pointers
we get 1038 (single indirection) and an additional 1024 * 1024 pointers
More needed?
→ Triple indirection
This approach is used by ext2, ext3, the original UNIX fs, etc.
Note that we get a very imbalanced tree
But this is okay - most files are small!

Extents

If a file is stored as a set of contiguous blocks, it seems wasteful to point to each separately
Some FSs (e.g., ext4) use “extents” - a pointer together with a size in blocks

Directories

A directory also has an inode (file type = directory)
Data blocks contain file entries: an i-number + the file name + length of the name
The root directory (/) has a set i-number
Example:

`inum`	`reclen`	`strlen`	`name`
5	12	2	`.`
2	12	3	`..`
12	12	4	`foo`
13	12	4	`bar`
24	36	28	`foobar_is_a_pretty_longname`

Performance

With our FS, what is the number of I/O when accessing a file?

It depends on the length of the path (at least two reads per directory)
For write/create operations, bitmaps and inodes need also bemodified
→ Caching
Most file systems use main memory as a cache to store frequently accessed blocks
Cache for reads: can prevent most I/Os
Cache for writes:
- Impair reliability
- Most FS cache writes between 5 and 30 seconds
- Better I/O scheduling
- Merge writes (e.g., for the bitmaps)