In January of 2001, another customer required compression support in JFFS to be provided as part of a contract undertaken. After a period of discussion on the mailing list, it was concluded that the most appropriate course of action would be a complete reimplementation, allowing all of the above-mentioned deficiencies in the original implementation to be addressed.
The JFFS2 code was intended to be portable, in particular to eCos, Red Hat's embedded operating system targetted at consumer devices[eCos]. For this reason, JFFS2 is released under a dual licence -- both under GPL and the MPL-style ``Red Hat eCos Public License'', to be compatible with the licence of the remainder of the eCos source.
Although portability was intended, no ports have yet been completed, and the current code is only usable with the 2.4 series of Linux kernels.
While the original JFFS had only one type of node on the medium, JFFS2 is more flexible, allowing new types of node to be defined while retaining backward compatibility through use of a scheme inspired by the compatibility bitmasks of the ext2 file system.
Every type of node starts with a common header containing the full node length, node type and a cyclic redundancy checksum (CRC). The common node structure is shown in Figure 1.
In addition to a numeric value uniquely identifying the node structure and meaning, the node type field also contains a bitmask in the most significant two bits which indicates the behaviour required by a kernel which does not support the node type used:
It is an unfortunate matter of record that this compatibility bitmask was in fact the reason why it was necessary to break compatibility with older JFFS2 file systems. Originally, the CRC was omitted from the common node header, and it was discovered that because the INCOMPAT feature mask has more bits set than the other bitmasks, it is relatively easy, by interrupting erases, to accidentally generate a structure on the medium which looks like an unknown node with the INCOMPAT feature bit set. For this reason, a CRC on the node header was added, in addition to the existing CRCs on the node contents and on the data or name field if present.
Aside from the differences in the individual nodes, the high-level layout of JFFS2 also changed from a single circular log format, because of the problem caused by strictly garbage collecting in order. In JFFS2, each erase block is treated individually, and nodes may not overlap erase block boundaries as they did in the original JFFS.
This means that the garbage collection code can work with increased efficiency by collecting from one block at a time and making intelligent decisions about which block to garbage collect from next.
Each erase block may be in one of many states, depending primarily on its contents. The JFFS2 code keeps a number of linked lists of structures representing individual erase blocks. During the normal operation of a JFFS2 file system, the majority of erase blocks will be on the clean_list or the dirty_list, which represent blocks full of valid nodes and blocks which contain at least one obsoleted node, respectively. In a new filesystem, many erase blocks may be on the free_list, and will contain only one valid node -- a marker which is present to show that the block was properly and completely erased.
As mentioned previously, the garbage collection code uses the lists to choose a sector for garbage collection. A very simple probabilistic method is used to determine which block should be chosen -- based on the jiffies counter. If jiffies % 100 is non-zero, a block is taken from the dirty_list. Otherwise, on the one-in-one-hundred occasions that the formula is zero, a block is taken from the clean_list. In this way, we optimise the garbage collection to re-use blocks which are already partially obsoleted, but over time, we still move data around on the medium sufficiently well to ensure that no one erase block will be worn out before the others.
The third major change in JFFS2 is the separation between directory entries and inodes, which allows JFFS2 to support hard links and also removes the problem of repeating name information which was referred to in the footnote on page .
At the time of writing there are three types of nodes defined and implemented by JFFS2. These are as follows:
Data attached to these nodes may be compressed using one of many compression algorithms which can be plugged into the JFFS2 code. The simplest types are ``none'' and ``zero'', which mean that the data are uncompressed, or that the data are all zero, respectively. Two compression algorithms were developed specifically for use in JFFS2, and also the JFFS2 code can contain yet another copy of the zlib compression library which is already present in at least three other places in the Linux kernel source.4
In order to facilitate rapid decompression of data upon readpage() requests, nodes contain no more than a single page of data, according to the hardware page size on the target platform. This means that in some cases JFFS2 filesystem images are not portable between hosts, but this is not a serious problem because the nature of the flash storage medium makes transportation between devices unlikely. JFFS2 is also entirely host-endian in its storage of numbers larger than a single byte.
POSIX requires that upon renaming, for example ``passwd.new'' to ``passwd'', the replacement of the passwd link should be atomic -- there should not be any time at which a lookup of that name shall fail to return either the old or the new target. JFFS2 meets that requirement, although as with many other file systems, the entire rename operation is not atomic.
Renaming is performed in two stages. First a new dirent node is written, with the new name and the inode number of the inode being renamed. This atomically replaces the link to the original inode with the new one, and is identical to the way in which a hard link is created. Next, the original name is unlinked, by writing a dirent node with the original name and target inode number zero.
This two-stage process means that at some point during the rename operation, the inode being renamed into place is accessible through both the old and the new names. This behaviour is permitted by POSIX -- the atomicity guarantee required is for the behaviour of the target link only.
The original JFFS simply assumed that any block which appeared at first scan to contain 0xFF in every byte was free, and would select the longest run of apparently free space at mount time to be the space between the head and tail of the log. Unfortunately, extensive power fail testing on JFFS proved this to be unwise. For many types of flash chips, if power is lost during an erase operation, some bits may be left in an unstable state, while most are reset to a logical one. If the initial scan happens to read all ones and treat a block containing such unstable bits as usable, then data may be lost -- and such data loss may not even be avoidable by the naïve method of verification by reading back data immediately by writing, because the bit may just happen to return the correct value when read back for verification.
Empirical results showed that even rereading the entire contents of the block multiple times in an attempt to detect unstable bits was not sufficiently reliable to avoid data loss, so an alternative approach was required. The accepted solution was to write the marker node to the flash block immediately after successful completion of an erase operation. Upon encountering flash blocks which do not appear to contain any valid nodes, JFFS2 will trigger an erase operation and subsequently write the appropriate marker node to the erased block.
This node type was introduced after JFFS2 had started to be used in real applications, and uses the RWCOMPAT_DELETE feature bitmask to signify that an older JFFS2 implementation may safely ignore the node.
The operation of JFFS2 is at a fundamental level very similar to that of the original JFFS -- nodes, albeit now of various types, are written out sequentially until a block is filled, at which point a new block is taken from the free_list and writing continues from the beginning of the new block.
When the size of the free_list reaches a heuristic threshold, garbage collection starts, moving nodes from an older block into the new block until space can be reclaimed by erasing the older one.
However, JFFS2 does not keep all inode information in core memory at all times. During mount, the full map is built as before -- but the structures kept in memory are strictly limited to the information which cannot be recreated quickly on-demand. For each inode on the medium, there is a struct jffs2_inode_cache which stores its inode number, the number of current links to the inode, and a pointer to the start of a linked list of the physical nodes which belong to that inode. These structures are stored in a hash table, with each hash bucket containing a linked list. The hash function is a very primitive one - merely the inode number modulo the size of the hash table. The distribution of inode numbers means this should be well-distributed.5
Each physical node on the medium is represented by a smaller struct jffs2_raw_node_ref, also shown in Figure 2, which contains two pointers to other raw node references -- the next one in the physical erase block and the next one in the per-inode list -- and also the physical offset and total length of the node. Because of the number of such structures and the limited amount of RAM available on many embedded systems, this structure is extremely limited in size.
Because all nodes on the medium are aligned to a granularity of four bytes, the least significant two bits of the flash_offset field are redundant. They are therefore available for use as extra flags. The least significant bit is set to indicate that the node represented is an obsolete node, and the other is not yet used.
For garbage collection, it is necessary to find, given a raw node reference, the inode to which it belongs. It is preferable not to add four bytes containing this information to every such structure, so instead we play even more evil games with the pointers. Rather than having a NULL-terminated linked list for the next_in_ino list, the last raw node reference actually contains a pointer to the struct jffs2_inode_cache for the relevant inode. Because that structure has a NULL at the offset where the struct jffs2_raw_node_ref would the pointer to the next node in the inode, the code traversing the list knows it has reached the end, at which point the pointer can be cast to the appropriate type and the inode number and other information can be read from the structure.
The NULL field shown in the inode cache structure is used only during the initial scan of the filesystem for temporary storage, and hence can be guaranteed to be NULL during normal operation.
During normal operation, the file system's read_inode() method is passed an inode number and is expected to populate a struct inode with appropriate information. JFFS2 uses the inode number to look up the appropriate struct jffs2_inode_cache in a hash table, then uses the list of nodes to directly read each node which belongs to the required inode, thereby building up a complete map of the physical locations of each range of the inode's data, similar to the information which JFFS would have kept in memory even while it was unused.
Once the full inode structure has been populated in this manner, it remains in memory until the kernel later tries to prune its inode cache under memory pressure, at which point the extra information is freed, leaving only the raw node references and the minimal JFFS2 inode cache structure which were originally present.
Mounting a JFFS2 file system involves a four-stage operation. First, the physical medium is scanned, the CRCs on all the nodes are checked for validity, and the raw node references are allocated. During this stage, the inode cache structures are also allocated and inserted into the hash table for each inode for which valid nodes are found.
Extra information from the nodes on the flash is cached, such as the version and the range of data covered by the node, to prevent the subsequent stages of the mount process from having to read anything again from the physical medium.
After the physical scan is complete, a first pass through all the physical nodes is made, building a full map of the map of data for each inode so that obsoleted nodes can be detected as such, and increasing the nlink field in the inode cache of the linked inode for each valid directory entry node.
A second pass is then made to find inodes which have no remaining links on the file system and delete them. Each time a directory inode is deleted, the pass is restarted, as other inodes may have been orphaned. In future, this behaviour may be modified to store orphaned inodes in a lost+found directory instead of just removing them.
Finally, a third pass is made to free the temporary information which was cached for each inode; leaving only the information which is normally kept in the struct jffs2_inode_cache during operation. In doing so, the field in the inode cache which corresponds to the next_in_ino field of the raw node reference is set to NULL, thereby enabling the slightly evil hack referred to earlier, for detecting that the end of the next_in_ino list has been reached.
In JFFS2, garbage collection moves data nodes by determining the inode to which the node to be garbage collected belongs, and calling the Linux kernel's iget() function for the inode in question. Often, the inode will be in the kernel's inode cache -- but sometimes, this will cause a call to the JFFS2 read_inode() function as described above.
Once the full inode structure is obtained, a replacement node can be written to obsolete the original node. If it was a data node, the garbage collection routine calls the standard readpage() function for the page for which the node contains data -- again using the existing file system caching mechanism because the required page may already be in the page cache. Then, as much of the page as possible is recompressed and written out in a new node. A partial page may be written if the node to be garbage collected is small and there is not sufficient slack space to allow a full page to be written, or if the page being garbage collected is at the end of the inode.
One of the features which is strongly desired for JFFS2 is a formal proof of correctness of the garbage collection algorithm. The current empirical method is not sufficient. The compression, however, gives rise to a serious potential problem with this proof. If a full page is written which compresses extremely well, and later a single byte is written in the middle of the page which reduces the compressibility of the page, then when garbage collecting the original page we may find that the new node written out is larger than the original. Thus, there would be no way to place an upper bound on the amount of space required to garbage collect an erase block full of data.
The proposed solution to this is to allow the total ordering of the version field to be relaxed to a partial ordering. We allow two nodes to have the same version field as long as they have identical data. Thus, when garbage collection finds a node which would expand, yet insufficient slack space to allow it to do so, it may copy the original node intact, preserving the original version so that the nodes which overlay the data contained therein will still continue to do so.
A problem which arose during the design stage for JFFS2 which had not already been addressed for the original version involved truncation of files. The sequence of events which could be problematic was a truncation followed by a write to an offset larger than the truncation point -- leaving a ``hole'' in the file which should return all zeroes upon being read.
On truncation, the original JFFS merely wrote out a new node giving the new length, and marked (in memory) the older nodes containing data beyond the truncation point as obsolete. Later writes would occur as normal.
During a scan of the file system on remounting, the sequential nature of the garbage collection ensured that all the old nodes containing actual data for the ranges which should be ``holes'' were garbage collected before the truncation node. As nodes were interpreted in version order after the physical scan, correct behaviour could be guaranteed, because the evidence of the truncation was still present at all times until the old data were erased.
For JFFS2, where blocks can be garbage collected out of order, it was necessary to ensure that old data could never ``show through'' the holes caused by truncation and subsequent extension of a file.
For this reason, it was decided that there should be no holes in the proper sense -- a complete absence of information for the range of bytes in question. Instead, upon receiving a request to write to an offset greater than the current size of a file, or a request to truncate to a larger size, JFFS2 inserts a data node with the previously-mentioned compression type JFFS2_COMPR_ZERO, meaning that no actual data are contained with the node, and the entire range represented by the node should be set to zero upon being read.
In the case where a file contains a very large hole, it is preferable to represent that hole by only a single physical node on the medium, rather than a ``hole'' node for each page in the range affected. Therefore, such hole nodes are a special case of data node; the only type of data node which may cover a range of more than one page.
This special case is itself the reason for further complication, because of concerns about expansion during garbage collection. If a single byte is written to a page which was previously part of a hole, it is necessary to ensure that garbage collection of either the original hole node or the node containing the new byte of data should not require more space than is taken by the original.
The solution to this problem is the same as for compressed nodes which may expand when merged -- if garbage collection would cause an expansion, and there is insufficient slack space to accommodate such growth, then the original node is copied exactly, retaining the original version number.