Squashfs is a compressed read-only filesystem for Linux.
It uses zlib, lz4, lzo, or xz compression to compress files, inodes and directories. Inodes in the system are very small and all blocks are packed to minimise data overhead. Block sizes greater than 4K are supported up to a maximum of 1 Mbyte (default block size 128K).
Squashfs is intended for general read-only filesystem use, for archival use (i.e. in cases where a .tar.gz file may be used), and in constrained block device/memory systems (e.g. embedded systems) where low overhead is needed.
Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org
1. Filesystem Features
Squashfs filesystem features versus Cramfs:
Feature | Squashfs | Cramfs |
Max filesystem size | 2^64 | 256 MiB |
Max file size | ~ 2 TiB | 16 MiB |
Max files | unlimited | unlimited |
Max directories | unlimited | unlimited |
Max entries per directory | unlimited | unlimited |
Max block size | 1 MiB | 4 KiB |
Metadata compression | yes | no |
Directory indexes | yes | no |
Sparse file support | yes | no |
Tail-end packing (fragments) | yes | no |
Exportable (NFS etc.) | yes | no |
Hard link support | yes | no |
“.” and “..” in readdir | yes | no |
Real inode numbers | yes | no |
32-bit uids/gids | yes | no |
File creation time | yes | no |
Xattr support | yes | no |
ACL support | no | no |
Squashfs compresses data, inodes and directories. In addition, inode and directory data are highly compacted, and packed on byte boundaries. Each compressed inode is on average 8 bytes in length (the exact length varies with file type, i.e. regular file, directory, symbolic link, and block/char device inodes have different sizes).
2. Using Squashfs
As squashfs is a read-only filesystem, the mksquashfs program must be used to create populated squashfs filesystems. This and other squashfs utilities can be obtained from http://www.squashfs.org. Usage instructions can also be obtained from this site.
- The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
2.1 Mount options
errors=%s | Specify whether squashfs errors trigger a kernel panic or not
threads=%s | Select the decompression mode or the number of threads. The accepted values depend on whether SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set or, when it is not set, on whether SQUASHFS_DECOMP_MULTI and SQUASHFS_MOUNT_DECOMP_THREADS are both set.
3. Squashfs Filesystem Design
A squashfs filesystem consists of a maximum of nine parts, packed together on a byte alignment:
 ---------------
|  superblock   |
|---------------|
|  compression  |
|    options    |
|---------------|
|  datablocks   |
|  & fragments  |
|---------------|
|  inode table  |
|---------------|
|   directory   |
|     table     |
|---------------|
|   fragment    |
|     table     |
|---------------|
|    export     |
|     table     |
|---------------|
|    uid/gid    |
| lookup table  |
|---------------|
|     xattr     |
|     table     |
 ---------------
Compressed data blocks are written to the filesystem as files are read from the source directory, and checked for duplicates. Once all file data has been written the completed inode, directory, fragment, export, uid/gid lookup and xattr tables are written.
3.1 Compression options
Compressors can optionally support compression specific options (e.g. dictionary size). If non-default compression options have been used, then these are stored here.
3.2 Inodes
Metadata (inodes and directories) are compressed in 8 Kbyte blocks. Each compressed block is prefixed by a two-byte length; the top bit is set if the block is uncompressed. A block will be uncompressed if the -noI option is set, or if the compressed block was larger than the uncompressed block.
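The two-byte prefix described above can be decoded with a mask and a flag test. The following is a minimal sketch, not the kernel's code: the function name and the `UNCOMPRESSED_FLAG` constant are hypothetical, and little-endian byte order is an assumption that the text above does not state.

```python
import struct

# Top bit of the 16-bit prefix: set when the block is stored uncompressed.
UNCOMPRESSED_FLAG = 0x8000


def parse_metadata_header(raw: bytes) -> tuple[int, bool]:
    """Return (stored_length, is_compressed) for a two-byte metadata prefix."""
    if len(raw) < 2:
        raise ValueError("a metadata block prefix is two bytes long")
    # Assumption: the prefix is little-endian.
    (header,) = struct.unpack_from("<H", raw)
    is_compressed = (header & UNCOMPRESSED_FLAG) == 0
    stored_length = header & (UNCOMPRESSED_FLAG - 1)  # lower 15 bits
    return stored_length, is_compressed
```

A reader would call this on the first two bytes of a metadata block, then read `stored_length` bytes and decompress them only when `is_compressed` is true.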
Inodes are packed into the metadata blocks, and are not aligned to block boundaries, so an inode may straddle two compressed blocks. Inodes are identified by a 48-bit number which encodes the location of the compressed metadata block containing the inode, and the byte offset into that block where the inode is placed (<block, offset>).
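As an illustration of the <block, offset> encoding, the sketch below packs and unpacks a 48-bit inode reference. The split into a 32-bit block location and a 16-bit offset is an assumption made for this example (an offset into an 8 Kbyte block fits comfortably in 16 bits); the function names are hypothetical.

```python
OFFSET_BITS = 16  # assumed width of the offset field
BLOCK_BITS = 32   # assumed width of the block location field (48 bits in total)


def make_inode_ref(block: int, offset: int) -> int:
    """Pack a metadata block location and a byte offset into one number."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block location out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (block << OFFSET_BITS) | offset


def split_inode_ref(ref: int) -> tuple[int, int]:
    """Recover (block, offset) from a packed inode reference."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)
```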
To maximise compression there are different inodes for each file type (regular file, directory, device, etc.), the inode contents and length varying with the type.
To further maximise compression, two types of regular file inode and directory inode are defined: inodes optimised for frequently occurring regular files and directories, and extended types where extra information has to be stored.
3.3 Directories
Like inodes, directories are packed into compressed metadata blocks, stored in a directory table. Directories are accessed using the start address of the metablock containing the directory and the offset into the decompressed block (<block, offset>).
Directories are organised in a slightly complex way, and are not simply a list of file names. The organisation takes advantage of the fact that (in most cases) the inodes of the files will be in the same compressed metadata block, and can therefore share the start block. Directories are therefore organised as a two-level list: a directory header containing the shared start block value, and a sequence of directory entries, each of which uses that shared start block. A new directory header is written whenever the inode start block changes. The directory header/directory entry list is repeated as many times as necessary.
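The two-level list can be pictured as grouping consecutive entries under a header for as long as their inodes share a start block. This is a rough sketch under that reading only; `DirHeader`, `DirEntry` and `build_directory` are hypothetical names and do not reflect the on-disk structure layout.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    name: str
    inode_offset: int  # byte offset of the inode inside its metadata block


@dataclass
class DirHeader:
    start_block: int  # inode start block shared by every entry below
    entries: list = field(default_factory=list)


def build_directory(files):
    """files: iterable of (name, inode start block, inode offset), sorted by name."""
    headers = []
    for name, start_block, offset in files:
        # Write a new header at the start, or whenever the start block changes.
        if not headers or headers[-1].start_block != start_block:
            headers.append(DirHeader(start_block))
        headers[-1].entries.append(DirEntry(name, offset))
    return headers
```

With inputs whose inodes mostly sit in the same metadata block, the start block is stored once per run of entries rather than once per entry.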
Directories are sorted, and can contain a directory index to speed up file lookup. Directory indexes store one entry per metablock, each entry storing the index/filename mapping to the first directory header in each metadata block. Directories are sorted in alphabetical order, and at lookup the index is scanned linearly looking for the first filename alphabetically larger than the filename being looked up. At this point the location of the metadata block the filename is in has been found. The general idea of the index is to ensure only one metadata block needs to be decompressed to do a lookup irrespective of the length of the directory. This scheme has the advantage that it doesn’t require extra memory overhead and doesn’t require much extra storage on disk.
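The linear scan described above might look like the following sketch. It assumes the index is a list of (first filename, block location) pairs in sorted order, that names sorting before the first index entry live in the directory's first block, and that Python's string comparison stands in for the alphabetical ordering; `find_metadata_block` is a hypothetical name.

```python
def find_metadata_block(first_block, index, name):
    """Return the location of the metadata block that should contain `name`.

    first_block: location of the directory's first metadata block.
    index: sorted list of (first_filename, block_location) pairs.
    """
    block = first_block
    for first_filename, block_location in index:
        # Stop at the first index entry that sorts after the wanted name:
        # the name, if present, is in the block found so far.
        if first_filename > name:
            break
        block = block_location
    return block
```

Only the block returned here needs to be decompressed and searched, however long the directory is.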
3.4 File data
Regular files consist of a sequence of contiguous compressed blocks, and/or a compressed fragment block (tail-end packed block). The compressed size of each datablock is stored in a block list contained within the file inode.
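Because the block list stores sizes rather than locations, finding a given datablock means adding up the sizes of the blocks before it. The sketch below shows that calculation under two simplifying assumptions: the blocks are laid out contiguously from a known start offset, and each block list entry is a plain byte count. `datablock_offsets` is a hypothetical helper.

```python
def datablock_offsets(start, block_sizes):
    """Return the on-disk offset of each datablock.

    start: offset of the file's first datablock.
    block_sizes: compressed size of each datablock, in file order.
    """
    offsets = []
    position = start
    for size in block_sizes:
        offsets.append(position)
        position += size  # the next block begins where this one ends
    return offsets
```

The cost of this walk grows with the block index, which is why lookups deep into a large file benefit from caching the mapping.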
To speed up access to datablocks when reading ‘large’ files (256 Mbytes or larger), the code implements an index cache that caches the mapping from block index to datablock location on disk.
The index cache allows Squashfs to handle large files (up to 1.75 TiB) while retaining a simple and space-efficient block list on disk. The cache is split into slots, caching up to eight 224 GiB files (128 KiB blocks). Larger files use multiple slots, with 1.75 TiB files using all 8 slots. The index cache is designed to be memory efficient, and by default uses 16 KiB.
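The figures quoted above are consistent with each other, as a quick calculation shows. The per-slot memory figure assumes the 16 KiB is divided evenly between the eight slots, which the text does not state.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

BLOCK_SIZE = 128 * KIB     # block size the slot figure is quoted for
SLOT_CAPACITY = 224 * GIB  # file size covered by one slot
SLOTS = 8
CACHE_BYTES = 16 * KIB     # default index cache size

blocks_per_slot = SLOT_CAPACITY // BLOCK_SIZE
max_file_size = SLOT_CAPACITY * SLOTS
bytes_per_slot = CACHE_BYTES // SLOTS  # assumes an even split across slots

print(blocks_per_slot)      # 1835008
print(max_file_size / TIB)  # 1.75
print(bytes_per_slot)       # 2048
```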
3.5 Fragment lookup table
Regular files can contain a fragment index which is mapped to a fragment location on disk and compressed size using a fragment lookup table. This fragment lookup table is itself stored compressed into metadata blocks. A second index table is used to locate these. For speed of access, and because it is small, this second index table is read at mount time and cached in memory.
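The two-step lookup can be sketched as follows: the index selects a metadata block through the in-memory second index table, then a byte offset inside that block once it is decompressed. The sketch assumes fixed-size table entries that never straddle an 8 Kbyte metadata block; the entry size passed in and the function names are hypothetical.

```python
METADATA_BLOCK_SIZE = 8192  # uncompressed size of a metadata block


def locate_entry(index, entry_size):
    """Return (slot in the second index table, byte offset inside that block)."""
    byte_position = index * entry_size
    return byte_position // METADATA_BLOCK_SIZE, byte_position % METADATA_BLOCK_SIZE


def lookup(second_index_table, index, entry_size):
    """Return (on-disk location of the metadata block, offset of the entry in it)."""
    slot, offset = locate_entry(index, entry_size)
    return second_index_table[slot], offset
```

The same two-step pattern is described for the uid/gid lookup table and the export table.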
3.6 Uid/gid lookup table
For space efficiency regular files store uid and gid indexes, which are converted to 32-bit uids/gids using an id lookup table. This table is stored compressed into metadata blocks. A second index table is used to locate these. For speed of access, and because it is small, this second index table is read at mount time and cached in memory.
3.7 Export table
To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems can optionally (disabled with the -no-exports Mksquashfs option) contain an inode number to inode disk location lookup table. This is required to enable Squashfs to map inode numbers passed in filehandles to the inode location on disk, which is necessary when the export code reinstantiates expired/flushed inodes.
This table is stored compressed into metadata blocks. A second index table is used to locate these. For speed of access, and because it is small, this second index table is read at mount time and cached in memory.
3.8 Xattr table
The xattr table contains extended attributes for each inode. The xattrs for each inode are stored in a list, each list entry containing a type, name and value field. The type field encodes the xattr prefix (“user.”, “trusted.” etc) and it also encodes how the name/value fields should be interpreted. Currently the type indicates whether the value is stored inline (in which case the value field contains the xattr value), or if it is stored out of line (in which case the value field stores a reference to where the actual value is stored). This allows large values to be stored out of line, improving scanning and lookup performance. It also allows values to be de-duplicated, the value being stored once, and all other occurrences holding an out of line reference to that value.
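The inline versus out-of-line choice, and the de-duplication it enables, can be illustrated with a small in-memory model. This is only a sketch: the `inline_limit` threshold, the `XattrWriter` class and the use of a list index as the "reference" are assumptions for illustration, not the on-disk encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class XattrEntry:
    prefix: str                    # e.g. "user." or "trusted."
    name: str
    out_of_line: bool
    value: Optional[bytes] = None  # set when the value is stored inline
    ref: Optional[int] = None      # set when the value is stored out of line


class XattrWriter:
    def __init__(self, inline_limit=64):
        self.inline_limit = inline_limit  # hypothetical size threshold
        self.values = []                  # out-of-line value store
        self._seen = {}                   # value -> index, for de-duplication

    def add(self, prefix, name, value):
        if len(value) <= self.inline_limit:
            return XattrEntry(prefix, name, out_of_line=False, value=value)
        ref = self._seen.get(value)
        if ref is None:
            # First occurrence: store the value once and remember where.
            ref = len(self.values)
            self.values.append(value)
            self._seen[value] = ref
        return XattrEntry(prefix, name, out_of_line=True, ref=ref)
```

Two inodes carrying the same large value end up with entries that hold the same reference, so the value itself is stored once.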
The xattr lists are packed into compressed 8K metadata blocks. To reduce overhead in inodes, rather than storing the on-disk location of the xattr list inside each inode, a 32-bit xattr id is stored. This xattr id is mapped into the location of the xattr list using a second xattr id lookup table.
4. TODOs and Outstanding Issues
4.1 TODO list
Implement ACL support.
4.2 Squashfs Internal Cache
Blocks in Squashfs are compressed. To avoid repeatedly decompressing recently accessed data Squashfs uses two small caches, one for metadata and one for fragments.
The cache is not used for file datablocks; these are decompressed and cached in the page-cache in the normal way. The cache is used to temporarily cache fragment and metadata blocks which have been read as a result of a metadata (i.e. inode or directory) or fragment access. Because metadata and fragments are packed together into blocks (to gain greater compression), the read of a particular piece of metadata or fragment will retrieve other metadata/fragments which have been packed with it. Because of locality-of-reference, these may be read in the near future. Temporarily caching them ensures they are available for near future access without requiring an additional read and decompress.
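The benefit of such a cache can be shown with a small model keyed by block location. This is a sketch, not the kernel's implementation: the least-recently-used eviction policy is an assumption chosen for simplicity, and `BlockCache` and `read_block` are hypothetical names.

```python
from collections import OrderedDict


class BlockCache:
    """Keep a few recently decompressed blocks, keyed by on-disk location."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block  # callable: location -> decompressed bytes
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, location):
        if location in self._entries:
            self._entries.move_to_end(location)  # mark as most recently used
            return self._entries[location]
        # Miss: read and decompress once, then keep the result for later.
        self.misses += 1
        data = self.read_block(location)
        self._entries[location] = data
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return data
```

Looking up several inodes that were packed into the same metadata block then costs one read and decompress rather than one per inode.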
In the future this internal cache may be replaced with an implementation which uses the kernel page cache. Because the page cache operates on page sized units this may introduce additional complexity in terms of locking and associated race conditions.