diff --git a/repos/gems/src/lib/tresor/README b/repos/gems/src/lib/tresor/README new file mode 100644 index 0000000000..0cf2a9b614 --- /dev/null +++ b/repos/gems/src/lib/tresor/README @@ -0,0 +1,1031 @@ + The Tresor library + + Martin Stein + +The Tresor library provides tools for creating and using data-storage +containers that are managed and encrypted on the block level according to the +Tresor scheme. The features of these containers include: + +* Protection against unauthorized data access, +* detection of any intentional or unintentional data modifications, +* recovery from system crashes to a consistent state, +* trust-anchor-based authorization, +* online replacement of encryption keys, +* online extending, +* management of incremental, read-only snapshots, +* and a container capacity of up to 4 terabyte. + + +Basic terminology +================= + +* Back-end storage: The medium that the container is stored on. It is assumed + to act like a Block device. + +* VBD: Virtual block device. The medium that the container provides to the + user. The data it stores is, in the background, managed and encrypted via + the Tresor scheme. + +* On disc: At the back-end storage. +* Physical block: A block at the backing storage. +* Virtual block: A block at the VBD. +* PBA: Physical block address +* VBA: Virtual block address + +* Block encryption key: A secret random number that is used to encrypt the + user data stored in the VBD. On disc, the key is stored encrypted as part of + the container. The block encryption key of a container can be transparently + replaced by the user through rekeying. + +* Master key: A secret random number that authenticates the container user, + is assumed to be known only to the trust anchor, and is used to encrypt + the block encryption keys of a container. + +* TA: Trust anchor. A software or hardware component that stores the + master key and the master hash. 
It is assumed that the trust anchor is the + only component in the system that knows these two values or can make correct + assumptions about them. The trust anchor provides an interface for performing + certain operations with these values without compromising the + aforementioned condition. + +* EVD block: Encrypted VBD data block. A block of user data that is stored in + the VBD and that was encrypted with the block encryption key. + +* COW: Copy-On-Write. The blocks that make up a container are never directly + overwritten. Instead, the new state is written into a free physical block + and references to both blocks are kept at first. + +* SB: Superblock. + +* Snapshot: One of the types of hash trees held in the container. A snapshot + represents one state of the VBD (incrementally). One container can hold + multiple snapshots. The most recent snapshot represents the VBD state that is + observed by the user under normal access. + +* FT: Free tree. One of the hash trees held in the container. Enables efficient + allocation of PBAs for COW at the VBD meta data and payload data. + +* MT: Meta tree. One of the hash trees held in the container. Enables efficient + allocation of PBAs for COW at the FT and MT meta data. + +* To secure the SB: The act of flushing the block caches, writing out + the superblock to the back-end storage, and finally storing the + superblock hash at the TA. This transitions the container's on-disc structures + from an older to a new consistent state and results in a momentary + synchronization of user state and on-disc state. + +* Generation: An integer value that identifies a VBD state. Each snapshot + corresponds to a unique generation value. The VBD starts at generation 0 and + then increments the generation value for each new snapshot. + +* Root node: A reference to the highest (meta-data or inner) block in a tree.
+* Root block: The block referenced by the root node of a tree + + +On-disc structure of a container +================================ + +Both the physical block size and the virtual block size are fixed to 4096 +bytes per block. + +On-disc block layout of a container: + +! Physical +! block address +! 0 +------------------------------------+ +! | Superblock #1 | +! 1 +------------------------------------+ +! . | . | +! . . +! . | . | +! 7 +------------------------------------+ +! | Superblock #8 | +! 8 +------------------------------------+ +! | Other block type | +! 9 +------------------------------------+ +! . | . | +! . . +! . | . | +! [Physical Blocks]-1 +------------------------------------+ +! | Other block type | +! [Physical Blocks] +------------------------------------+ + +The PBA range of the container is always contiguous and always starts at 0. +Extending the container always uses the PBAs that come right after the current +end of the PBA range in order to not violate this condition. + +The superblocks are always located in the first 8 physical blocks. The one +superblock to use for the container can be found by iterating over all +superblocks and finding the one whose hash matches the superblock hash stored in +the TA. If this fails, then the container is unusable, most probably +because it has been altered without authorization since the SB was last +secured. + +Although there are 8 superblocks, in reality, at most 2 of them at a time +reference an intact container and most of the time only one does. A freshly +initialized container has its current superblock at PBA 0. Whenever a new +superblock state is written out to the back-end storage, the next higher PBA +modulo 8 is overwritten. So, the PBAs 0 to 7 are used in a round-robin fashion. +During the timespan between writing out a new superblock to the back-end +storage and updating the superblock hash at the TA, both the most recent and the +previous SB PBA contain a valid superblock.
This allows for a roll-back to the +previous superblock state when storing the superblock hash at the TA failed. +However, when the new superblock hash was stored successfully at the TA, +software will also select the new superblock for accessing the container and +soon render the previous one unusable by re-assigning PBAs that still form +part of the older superblock's hash trees. + +There are 3 other types of blocks that live in the area that comes after the +superblocks: data blocks, type 1 blocks, and type 2 blocks. + +On-disc layout of a superblock: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | State | 1 +! 1 +------------------------------------+ +! | Rekeying VBA | 8 +| 9 +------------------------------------+ +! | Resizing: Number of PBAs | 8 +! 17 +------------------------------------+ +! | Resizing: Number of leaves | 8 +! 25 +------------------------------------+ +! | Previous key | 36 +! 61 +------------------------------------+ +! | Current key | 36 +! 97 +------------------------------------+ +! | Snapshot root node #1 | 72 +! 169 +------------------------------------+ +! . | . | . +! . . . +! . | . | . +! 3481 +------------------------------------+ +! | Snapshot root node #48 | 72 +! 3553 +------------------------------------+ +! | Last secured generation | 8 +! 3561 +------------------------------------+ +! | Current snapshot index | 4 +! 3565 +------------------------------------+ +! | Snapshot degree | 4 +! 3569 +------------------------------------+ +! | First PBA | 8 +! 3577 +------------------------------------+ +! | Number of PBAs | 8 +! 3585 +------------------------------------+ +! | Free tree root node | 64 +! 3649 +------------------------------------+ +! | Meta tree root node | 64 +! 3713 +------------------------------------+ + +On-disc superblock state values: + +! 0 = Invalid superblock +! 1 = Valid superblock; No special operations in progress +! 2 = Valid superblock; Rekeying in progress +! 
3 = Valid superblock; Extending virtual block device in progress +! 4 = Valid superblock; Extending free tree in progress + +On-disc layout of a key: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | Value | 32 +! 32 +------------------------------------+ +! | ID | 4 +! 36 +------------------------------------+ + +On-disc layout of a snapshot root node: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | Hash | 32 +! 32 +------------------------------------+ +! | PBA | 8 +! 40 +------------------------------------+ +! | Generation | 8 +! 48 +------------------------------------+ +! | Number of tree leaves | 8 +! 56 +------------------------------------+ +! | Maximum tree level index | 4 +! 60 +------------------------------------+ +! | Valid (boolean) | 1 +! 61 +------------------------------------+ +! | Snapshot ID | 4 +! 65 +------------------------------------+ +! | Keep snapshot (boolean) | 1 +! 66 +------------------------------------+ +! | 0* | 6 +! 72 +------------------------------------+ + +On-disc boolean values: + +! 0 = False +! 1 = True + +On-disc layout of an FT or MT root node: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | Generation | 8 +! 8 +------------------------------------+ +! | PBA | 8 +! 16 +------------------------------------+ +! | Hash | 32 +! 48 +------------------------------------+ +! | Maximum tree level index | 4 +! 52 +------------------------------------+ +! | Tree degree | 4 +! 56 +------------------------------------+ +! | Number of tree leaves | 8 +! 64 +------------------------------------+ + +On-disc layout of a type 1 block: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | Type 1 node #1 | 64 +! 64 +------------------------------------+ +! . | . | . +! . . . +! . | . | . +! X*64-1 +------------------------------------+ +! | Type 1 node #X | 64 +! X*64 +------------------------------------+ +!
| 0* | 4096-X*64 +! 4096 +------------------------------------+ +! +! With X <= Degree + +On-disc layout of a type 1 node: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | PBA | 8 +! 8 +------------------------------------+ +! | Generation | 8 +! 16 +------------------------------------+ +! | Hash | 32 +! 48 +------------------------------------+ +! | 0* | 16 +! 64 +------------------------------------+ + +On-disc layout of a type 2 block: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | Type 2 node #1 | 64 +! 64 +------------------------------------+ +! . | . | . +! . . . +! . | . | . +! X*64-1 +------------------------------------+ +! | Type 2 node #X | 64 +! X*64 +------------------------------------+ +! | 0* | 4096-X*64 +! 4096 +------------------------------------+ +! +! With X <= Degree + +On-disc layout of a type 2 node: + +! Byte offset Size in bytes +! 0 +------------------------------------+ +! | PBA | 8 +! 8 +------------------------------------+ +! | Last VBA | 8 +! 16 +------------------------------------+ +! | Alloc generation | 8 +! 24 +------------------------------------+ +! | Free generation | 8 +! 32 +------------------------------------+ +! | Last key ID | 4 +! 36 +------------------------------------+ +! | Reserved (boolean) | 1 +! 37 +------------------------------------+ +! | 0* | 27 +! 64 +------------------------------------+ + +Layout of a snapshot hash tree: + +! | (Root node) +! Level | | +! index | | +! | +---------------+ +! Max=3 | | Type 1 block | +! | +---------------+ +! | | ... | +! | | `--------. +! | | | +! | +---------------+ +---------------+ +! 2 | | Type 1 block | .. | Type 1 block | +! | +---------------+ +---------------+ +! | | ... | .. +! | | `--------. +! | | | +! | +---------------+ +---------------+ +! 1 | | Type 2 block | .. | Type 2 block | ... +! | +---------------+ +---------------+ +! | | ... | .. +! | | `--------. +! | | | +! 
| +---------------+ +---------------+ +---------------+ +! 0 | | EVD block | .. | EVD block | ... | EVD block | +! | +---------------+ +---------------+ +---------------+ +! +! ----------------------------------------------------------------------------- +! Virtual +! block 0 1 ... Leaves-1 +! address + +On-disc layout of an FT or MT hash tree: + +! | (Root node) +! Level | | +! index | | +! | +---------------+ +! Max=3 | | Type 1 block | +! | +---------------+ +! | | ... | +! | | `--------. +! | | | +! | +---------------+ +---------------+ +! 2 | | Type 1 block | .. | Type 1 block | +! | +---------------+ +---------------+ +! | | ... | .. +! | | `--------. +! | | | +! | +---------------+ +---------------+ +! 1 | | Type 2 block | .. | Type 2 block | ... +! | +---------------+ +---------------+ +! | | ... | .. +! | | `--------. +! | | | +! | +---------------+ +---------------+ +---------------+ +! 0 | | Managed block | .. | Managed block | ... | Managed block | +! | +---------------+ +---------------+ +---------------+ + +The dimension parameters of all trees are restricted as follows: + + * Degree >= 2 + * Degree <= 64 + * Degree is a power of 2 + * Maximum level index >= 5 + * Maximum level index <= 5 + * Number of leaves >= 1 + * Number of leaves <= (Degree^[Maximum level index]) - 1 + +The VBD consists of up to 48 incremental snapshots that are referenced by the +superblock. There are two types of valid snapshots: Those with the Keep flag +unset and those with the Keep flag set. Of the former, there should never be +more than 2 present. They are the most recent and the second-most recent +snapshot and software manages them automatically in order to keep track of the +most recent VBD state over the "secure SB" cycle. Of the latter, the Keep +snapshots, there can be up to 46 present. They are explicitly marked with the +Keep flag by the user and can be removed only explicitly by the user. 
This +means that they are the only VBD states besides the most recent one that the +user can access. However, in contrast to the most recent VBD state, these older +VBD states are read-only and must be addressed explicitly. + +By convention, this manual always depicts trees with node indices increasing +from the left to the right. This means that node #1 is always the left-most in +a block and VBA 0 is always the left-most at the tree's leaf level. Each node +in a snapshot hash-tree references its child block by its PBA. Furthermore, it +holds the hash of the data in that block. Whenever the child block is read by +software, it must be checked against this hash before being used. This ensures +data integrity on every level of operation in the container. Finally, each +node also contains a generation value indicating the VBD state +(snapshot) with which the referenced child block was added to the VBD. Note +that a block in a tree might be part of multiple snapshots as snapshots +work incrementally, meaning each snapshot only adds what has changed since the +last snapshot: + +! Snapshot Snapshot Snapshot +! Generation 5 Generation 7 Generation 11 +! | | | +! ___O___ ___O___ ___O___ +! / \ _____________ /_______\___________/_______\ +! / \ / / / +! O O O O +! / \ _____ /_\ __________/_\ / \ +! / \ / / \ / / \ +! O O O O O O O +! / \ / \ / \ / / \ _______________/_\ / \ +! / \ / \ / \ / / \/ / / \ +! O O O O O O O O O O O O +! ---------------------------------------------------------------------------- +! VBA 0 1 2 3 4 5 6 0 1 0 2 3 + +In the example, generation 7 alters only the block data at VBAs 0 and 1 whereas +in generation 11 only the block data at VBAs 0, 2, and 3 was modified. Note +that snapshots are not necessarily of the same dimensions or virtual storage +capacity. A snapshot tree is also not necessarily using all of the topology +possible with its dimensional parameters.
A snapshot tree only spans as far as +it is needed to provide the storage capacity configured by the user (although +there might be one additional unfinished branch created by an extension +operation). The rest of the topology is spared out as a contiguous area +reaching out from the bottom-right corner of the tree: + +! | +! ____O____ +! / \ +! / \ +! __O O +! / \ / . +! / \ / . . +! O O O . . . <------ Unused area +! / \ / \ / \ . . . +! O O O O O O <----------- Unfinished branch due to +! / \ / \ / \ / \ / \ . . . . . extension operation +! O O O O O O O O O O . . . . . . +! +! VBA 0 1 2 3 4 5 6 7 8 9 <-------------- Highest VBA +! ((storage capacity / block size) - 1) + +The nodes that would reach into the spared-out area are marked invalid, i.e., +set to all zeroes. The VBA range of a snapshot always starts at VBA 0 and then +continues contiguously without skipping any index at any tree level. + +The set of blocks at level 0 of the free tree is always equal to the set of blocks +that form the snapshot trees minus all blocks that form the tree of the most +recent snapshot. When a snapshot gets invalidated, all blocks in its tree that +are not part of any other valid snapshot are said to become "free". That means +they become available for re-assignment and, at the same time, unknown to the +VBD meta data. That said, the core functions of the free tree are to keep a +reference to these blocks and enable detecting whether one of these blocks is +free already or still in use. + + +Description of container operations +=================================== + +VBA access +~~~~~~~~~~ + +Reading a VBA +------------- + +For reading a virtual block, software has to walk down the one branch of the +most recent snapshot of the VBD that contains the corresponding EVD block. It +starts by reading the PBA that is noted in the snapshot root node.
The read +block is a type 1 block, but before doing anything with it, the block data must +be checked against the hash noted in the root node. If the hashes match, +software can select the type 1 node that leads towards the desired EVD block +from the read block. It does so by using the VBA, the tree level index and the +tree dimensions from the root node to determine the correct node index. This +process is repeated until reaching tree level 0. + +Once the hash for the level 0 block (the EVD block) was checked, it can be +decrypted using the key from the superblock and the resulting plain VBD data +can be delivered to the user. If the lowest type 1 node in the branch (tree +level 1) indicates that the EVD block is of generation 0 (the initial +generation), then the EVD block is not initialized yet and must be considered +to contain random data. In that case, software should refrain from reading and +hash-checking the EVD block and instead return one block of zeroes to the user. + +Writing a VBA +------------- + +The procedure of writing a VBA starts the same as reading a VBA. However, once +the lowest type 1 node of the branch (tree level 1) was determined, instead of +reading the EVD block, software determines how many free PBAs are required to +update the branch according to the write operation. This number differs because +software might have updated parts of that branch already during previous write +operations since the last synchronization with the back-end storage (secure SB +operation). + +This part is called volatile and can safely be modified again without doing +COW. Note that, if there is a volatile part, it always starts at the highest +tree level and ends at a lower level. This includes level 0 in case that the +exact same VBA was already written since the last synchronization. For each +level that requires COW, however, the free tree is consulted in order to +allocate a free PBA for the new block data. 
The allocation algorithm is +described in detail in the paragraph "PBA allocation". + +If the allocations succeeded, software encrypts the new virtual block data. +Then the algorithm walks up the tree branch again starting with level 0. At +each level, it first writes out the new block to the new or old but volatile +PBA and then updates the corresponding type 1 node at the level above. The +hash is always updated while the PBA and generation need an update only in +the part of the branch that was not yet volatile. When the algorithm has +updated the root node of the snapshot in the superblock, the write operation +is complete. Note that it is not necessary to directly secure the updated +superblock. It can keep accumulating further write operations in the increasingly +volatile snapshot (that is not yet known to the back-end storage) until another +operation requires a synchronization or the user explicitly requests one. + +PBA allocation +-------------- + +A VBA write operation creates not only a new version of the EVD block of the +VBA but also of each type 1 block in the snapshot branch that leads to this EVD +block. For each of these new block versions, a free PBA is required where the +data can be written to without overwriting data that is still in use. This is +how each PBA is allocated at the free tree: The PBA of the original block (the +one that shall be "replaced") is given to the free tree. When the free tree has +found a new physical block to hand out, it replaces the type 2 node of the new +block with a type 2 node for the old block. The new type 2 node indicates that +the old block remains reserved. This means that, although the old block isn't part of +the current VBD state anymore, it is potentially still used by older snapshots. + +Therefore, the type 2 node of the old block carries the generations +during which the block was allocated and freed, respectively.
As long as there is +still a snapshot with a generation number greater than or equal to the allocation +generation and less than the free generation of the block, the block stays reserved +in the free tree. Note, however, that this is only checked on-demand, when +trying to allocate a PBA. + +Rekeying +~~~~~~~~ + +The main goal of a rekeying operation is to remove the currently used block +encryption key from the container and replace it with a new key. This means +essentially decrypting all data that was encrypted with the current key and +re-encrypting it with the new one. This is assumed to be computation- and +time-intensive but not time-critical. Furthermore, rekeying is supposed to be +done online, i.e., as a background task while the user can keep accessing the +VBD. As a third requirement, it should be possible to keep the VBD performance +at a sensible level during rekeying. + +In order to meet the above stated requirements, the Tresor scheme provides +that rekeying is split up into smaller atomic operations, called steps, that +can be interleaved with, e.g., VBD access operations. After each rekeying step, +the container remains in a consistent state from which any other operation, +except resizing operations and rekeying, can be started. The original rekeying +can be continued at any time the container is idle again. + +There are two types of rekeying steps: The initialization step and VBA rekeying +steps. The initialization step transitions the container from the Normal to the +Rekeying state and initializes rekeying parameters. A VBA rekeying step adapts +all EVD blocks of a single VBA. Note that the VBD may contain multiple EVD +blocks per VBA, which each refer to a different state of the corresponding +virtual block over time: + +! Time ---T1--------T2-------------T3--------T4=Now------> +! +! Snapshot Snapshot Snapshot Snapshot +! / \ / \ / \ / \ +! / \/ \ / \ / \ +! / /\ \ / \/ \ +! / / \ \ / /\ \ +! /____/____\____\ /______/__\______\ +! | | | +!
+-----------+ +-----------+ +-----------+ +! | EVD block | | EVD block | | EVD block | +! | VBA 5 @T1 | | VBA 5 @T3 | | VBA 5 @T4 | +! +-----------+ +-----------+ +-----------+ + +The VBA rekeying steps start with VBA 0 and increment the VBA after each step. +Therefore, after each VBA rekeying step, the VBD can be divided into two +sections regarding the used block encryption key: + + VBA | Annotation | EVD encrypted with + --------------------------------------------------------------------------- + --------------------------------------------------------------------------- + 0 | Was the first to be rekeyed | new block + ------------------------------------------------------- encryption key + 1 | | + ------------------------------------------------------- + ... | | + ------------------------------------------------------- + X | Was rekeyed just now | + | (Superblock: Rekeying VBA = X) | + --------------------------------------------------------------------------- + X+1 | Will be rekeyed next | old block + ------------------------------------------------------- encryption key + ... | | + ------------------------------------------------------- + [Max number of | Will be the last to be rekeyed | + leaves of any | (every VBA of any version of the | + snapshot] - 1 | VBD must be processed) | + --------------------------------------------------------------------------- + +This is important as it allows for efficiently selecting the correct key for +VBD access and performing COW allocations during a rekeying operation. This +will be explained in detail in a moment. + +This illustrates the process of rekeying one VBA over 4 snapshots of a VBD +(type 1 blocks are divided into nodes with the PBA shown for each node): + +! Start +! | +! | Generation 15 Generation 14 +! | +--->| +---> ... +! |Read blocks | |Read blocks | +! Tree |to branch buffer | |to branch buffer | +! Level | | | | +! ----- | +----+ | | +----+ | +! Root | | 77 | | | | 12 | | +! | +-|--+ | | +-|--+ | +! 
| _| | | _| | +! | | | | | | +! | V | | V | +! | +-------------------+ | | +-------------------+ |Update and +! 3 | | 18 | 20 | 93 | 75 | | | | 18 | 36 | 29 | 90 | |COW-write +! | +-|-----------------+ | | +-|-----------------+ |Type 1 blocks +! | _| | | | | +! | | | | V | +! | V | | Already rekeyed | +! | +-------------------+ | | with generation 15 | +! 2 | | 13 | 44 | 69 | 41 | | V | +! | +-----------|-------+ | --------------------------+ +! | ___________| | Allocate PBAs +! | | | at free tree +! | V | +! | +-------------------+ | +! 1 | | 34 | 10 | 81 | 72 | | +! | +------|------------+ | +! | ______| |Update and +! | | |COW-write +! | V |Type 1 blocks +! | +-------------------+ | +! 0 | | EVD block | | +! | +-------------------+ | +! V | +! --------------------------+ +! Allocate PBAs Re-encrypt +! at free tree EVD block + +Continuation: + +! Generation 8 Generation 6 +! ... --->>| +--->| +---> End +! | | | | +! Tree | | | | +! Level | | | | +! ----- |R +----+ | |R +----+ | +! Root |e | 84 | Generation 15 | |e | 12 | Generation 14 | +! |a +-|--+ | |a +-|--+ | +! |d _| | |d _| | +! | | | | | | +! | V | | V | +! | +-------------------+ | | +-------------------+ | +! 3 | | 40 | 51 | 56 | 78 | | | | 60 | 49 | 42 | 22 | | +! | +-|-----------------+ | | +-|-----------------+ | +! | _| | | _| | +! | | | | | | +! | V | | V | +! | +-------------------+ |U | +-------------------+ | +! 2 | | 82 | 88 | 69 | 70 | |p | | 17 | 11 | 31 | 30 | | +! | +-----------|-------+ |d | +-----------|-------+ | +! | | |a | ___________| | +! | V |t | | | +! | Already rekeyed |e | V | +! | with generation 15 | | +-------------------+ | +! 1 V | | | 66 | 89 | 19 | 28 | | +! ---------------------------+ | +------|------------+ | +! Allocate | ______| |U +! | | |p +! | V |d +! | +-------------------+ |a +! 0 | | EVD block | |t +! | +-------------------+ |e +! V | +! ---------------------------+ +! 
Allocate Re-encrypt + +PBA allocation for rekeying +--------------------------- + +The allocation of physical blocks for rekeying is different from the allocation +of physical blocks for VBA access. When rekeying does COW, it doesn't do it to +preserve the old device state for later user access. It doesn't create new +snapshots, it merely re-writes the existing ones in place. Rekeying does COW +because, in the process of rekeying a VBA for one snapshot, it has to be +considered that the other, yet to be rekeyed snapshots still reference the +updated blocks with their original hashes. And if rekeying would update the +blocks without COW, it would break the remaining snapshots and run into a hash +mismatch before the end of the current VBA rekeying step. + +This makes clear why the common allocation strategy wouldn't work for rekeying. +The criterion when to revert the reservation of the old blocks in the free tree +is not the vanishing of certain snapshots but whether rekeying has reached a +certain point. In order to know this point for a specific block, two situations +must be distinguished. Either, the block forms part of the current device state +(I'll call this an effective block) or it doesn't and is relevant for older +snapshots only (I'll call this a superseded block). + +Let's look at effective blocks first. When the rekeying replaces them, they can +be re-allocated as soon as rekeying is done with the current VBA, because by +then the rekeying has replaced them in all snapshots. That said, a +COW-allocation adds the old block directly as "non-reserved" to the free tree. +This causes the block to become re-allocatable as soon as the current +generation is secured, which is done at the end of the VBA-rekeying step. + +With superseded blocks things are more complicated: When being replaced by +rekeying, they could become re-allocatable in the next generation as well. 
+However, in contrast to effective blocks, for a superseded block there is +already an entry in the free tree indicating that the block is reserved. Even +more, because of the way the free tree is designed, there is no efficient way +to find this entry. But we have to do something about this entry. Otherwise it +would keep the block reserved until all generations that used to reference it +disappeared (despite the fact that they are not referencing it anymore). + +Let's call these pseudo-reserved blocks and see how we can deal with them: +Luckily, we can make use of the ascending order in which VBAs are rekeyed, +because pseudo-reserved blocks always belong to a VBA less than the current +rekeying VBA. So, in each free tree entry, the Tresor additionally stores the +VBA for which the block was last used. For leaf nodes of the VBD, the last +VBA is obvious. For a type 1 node, the last VBA is the lowest VBA of the +sub-tree under that node. Furthermore, the ID of the last block encryption key +of the block is remembered. With these two additional values, pseudo-reserved +entries become detectable. If, during an allocation, the superblock is in the +"Rekeying" state, the free tree checks for reserved entries whether they have +the old key ID and a VBA less than the current rekeying VBA. If so, a +pseudo-reserved block was found that can be treated like a non-reserved block. +As a result, such blocks become re-allocatable as soon as the rekeying of their +VBA is finished. When the superblock returns to the "Normal" state, i.e., the +entire rekeying is complete, the remaining pseudo-reserved blocks stay +re-allocatable because rekeying raised the generation value of each snapshot. +So, the allocation algorithm now correctly concludes that these blocks are not +part of any snapshot anymore. + +Unfortunately, there is more to it.
The above mentioned scheme elegantly solves +things for the old PBAs of the superseded blocks (that were allocated during +the era of the old block encryption key, i.e., before the rekeying started). +But we haven't spoken yet about the new PBAs that rekeying allocates to replace +them. Usually, PBA allocations exchange the old type 2 node of the allocated +PBA with the new type 2 node of the now reserved PBA. However, in case of +rekeying a superseded block, the allocation result will not form part of the +most recent VBD state and therefore becomes reserved as well. Furthermore, +we already discussed that the PBA that is about to be replaced already has a +type 2 node somewhere in the free tree. So, we just stay with the existing +type 2 nodes as they are? That's . + +Let's illustrate this with an example: Rekeying has just rekeyed VBA 0 and +thereby replaced a superseded inner node of the VBD, physical block 10, with +physical block 20. Block 10 is still in the free tree but will be recognized as +pseudo-reserved because it is marked with the old key ID. Assume that we were +to mark the free tree entry of block 20 with the new key ID. Later, during the +rekeying of VBA 1, block 20 must be replaced again. This time, the +pseudo-reserved free-tree entry of block 20 will remain undetected because of +the new key ID. Alright. So, let's go back to the rekeying of VBA 0 and use the +old key ID for the free-tree entry instead. This won't work either because +now, the entry for block 20 is freed too soon, when the rekeying of VBA 0 is +done. + +We have to find another criterion for freeing such superseded blocks that were +allocated by rekeying itself. Luckily, this is possible because we know that +such a block is always replaced in the next VBA-rekeying step, given that the +current rekeying VBA is not the last one covered by the corresponding node in +the virtual block device.
So, we can mark the free tree entry with the old key +ID and the next VBA that is to be rekeyed. In the above example, this would +cause block 20 to be freed again as soon as the rekeying of VBA 1 is complete, +which is exactly what we want. + +The only thing left is what happens when the current rekeying VBA is the last +one covered by the VBD node. In this case, the block that is allocated for COW +will not become pseudo-reserved because it will contain the last version of the +VBD node that is created by the running rekeying process. Its free-tree entry +can therefore be a "commonly" reserved one with the new key ID. + +Resizing +~~~~~~~~ + +The Tresor is currently resizable in two ways. The virtual block device can be +extended and the free tree can be extended. As both operations have a lot in +common, I'll describe the basic idea first and will go into the peculiarities +later. + +Directly at the Tresor interface, an extension operation is communicated like +any other operation by submitting a request. The request has either the +operation type "Extend Virtual Block Device" or "Extend Free Tree". The request +carries one parameter that is the number of physical blocks that shall be added +to the Tresor. Note that the set of physical blocks that the Tresor uses is +always given as a contiguous range of block addresses that starts with block +address 0. The Tresor furthermore remembers this range in the superblock. That +said, when telling the Tresor to extend itself using N additional physical +blocks and A is the highest physical block address currently used, the Tresor +will incorporate the physical block addresses A + 1 to A + N. + +As an extension operation might require the Tresor to update and write many +branches of different trees, the operation can be time intensive depending on +the number of added physical blocks. Extension operations are therefore +implemented as a sequence of many small extension steps. 
After each step, the Tresor +container returns to a consistent state where "Read", "Write", "Sync", and +"Discard Snapshot" requests can be mixed in. Note that "Create Snapshot", +"Extend Free Tree", "Extend Virtual Block Device", and "Rekeying" requests cannot +be executed in parallel to an extension operation. + +Breaking up extensions into small steps also has the benefit that there are many +container states during an extension that can be secured to the physical block +device and the trust anchor. Should the system be turned off during an +extension, the progress isn't lost (except the last unfinished step of course) +and the extension operation can be continued on next startup. More precisely, it +has to be continued on next startup, because the virtual block device would +otherwise remain in a state that limits the functionality of the Tresor. + +That said, the Tresor has to remember inside the superblock that an extension +operation is pending and in which state it is. And it will automatically +continue a pending extension operation on startup. + +There are two types of extension steps: The initialization of the extension +process and the extension of the targeted tree by a number of leaf blocks that +is, at most, the tree's degree. After the initialization step, the Tresor keeps +doing extension steps on the targeted tree until the contingent of new physical +blocks is depleted. At the end of each extension step, the Tresor updates the +superblock and secures the device state. + +Resizing steps +-------------- + +In order to initiate the extension process, the Tresor first sets the superblock +to state "Extending Virtual Block Device" or "Extending Free Tree" depending on +the targeted tree. Furthermore, it remembers in the superblock the contingent of +new physical blocks that is left for the extension operation. 
Initially, +this is the number of physical blocks given in the extension request of the user +but throughout the extension process it is decreased step by step. + +When doing an extension step at the targeted tree, the Tresor first determines +the identifier of the right-most complete branch in the tree. For the virtual +block device, this is the highest virtual block address covered. For the free +tree, it is technically the same: the combination of node indices along the way +of the branch. But as the branches in the free tree are not related to block +addresses, we call it branch identifier instead. The left-most branch always has +the identifier 0, the second left-most the identifier 1, and so on. + +So, the identifier of the right-most branch in the tree is known. The Tresor now +wants to add a new branch to the right of the right-most branch. Consequently, +this new branch would have the identifier X + 1, where X is the identifier of +the right-most branch. With this identifier known, two situations must be +distinguished. For this distinction we need to know about the tree's current +geometry - the number of tree levels and the number of nodes per inner block. +This geometry defines a maximum for the number of branches that the tree can +contain. + +If the identifier of the new branch is greater than or equal to this maximum, the +current tree geometry doesn't suffice for adding another branch. In this case, +a new block is inserted between the current root block and the root node of the +tree before the new branch can be added (i.e., a new level is added to the +tree). The first node of this new root block references the previous root +block while the rest of its nodes are now available for extending the tree. The +physical block address for the new root is taken from the contingent of the +extension operation. 
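The geometry check described above can be sketched in a few lines of Python. This is only an illustration and not taken from the Tresor sources: the names 'max_branches', 'node_indices', and 'needs_new_level' are hypothetical, and the sketch assumes that "levels" counts only the levels of inner blocks above the leaves, so that a tree with degree D and L inner levels holds at most D^L branches.

```python
from typing import List


def max_branches(degree: int, inner_levels: int) -> int:
    """Upper bound on the number of branches (leaves) for a given geometry.

    Assumption: inner_levels counts only the levels of inner blocks.
    """
    return degree ** inner_levels


def node_indices(branch_id: int, degree: int, inner_levels: int) -> List[int]:
    """Node index to follow in each inner block, from the root block downwards.

    The branch identifier is read as a number in base 'degree'.
    """
    indices = []
    for _ in range(inner_levels):
        indices.append(branch_id % degree)
        branch_id //= degree
    return indices[::-1]


def needs_new_level(rightmost_branch_id: int, degree: int, inner_levels: int) -> bool:
    """True if the branch X + 1 does not fit into the current geometry."""
    new_branch_id = rightmost_branch_id + 1
    return new_branch_id >= max_branches(degree, inner_levels)
```

With degree 4 and two inner levels, the tree can hold 16 branches. Adding a branch to the right of branch 13 fits into the existing geometry, whereas adding one to the right of branch 15 requires a new root block first.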
+ +If the identifier of the new branch, however, is less than the maximum number of +branches that the tree can contain, the current tree geometry is sufficient and +can therefore remain unmodified. Note that in this case, the blocks for the new +branch already exist down to a certain tree level. We don't know down to which +level but we know that at least the leaf block does not exist so far. + +Now that we have the tree geometry right, adding the new branch is performed by +doing a tree walk for the identifier of the branch. Whenever we find a yet +unset node during this tree walk, the lowest physical block address is taken +from the resizing contingent and the node is initialized to reference the +corresponding block. This also applies to the missing leaf block at the end +of the tree walk. In the virtual block device, the new leaf block is marked +with generation 0 to indicate that its data is yet uninitialized. In the free +tree, the new leaf block is marked as not reserved with the current generation +as free generation. I.e., the new leaf block can be allocated as soon as the +next superblock securing is through. + +But wait! Once we are down here, we can utilize the situation better: If the +lowest inner block of the tree walk has multiple unset nodes, they can be used +to add further branches with almost no effort as each of them merely misses the +leaf. So, to say it more generally, at the end of the tree walk, we will +simply fill up all unused nodes of the lowest inner block with new leaf blocks +from the extension contingent. + +At this point, all inner blocks of the tree walk are in memory. Their hashes need +to be updated and then they can be written back to the physical block device. +Just as after a normal write request. Of course, for those blocks of the tree +walk that already existed and that are not yet volatile (not of the current +generation) a copy-on-write must be done in order to update them. 
The blocks that +were just added, however, need no copy-on-write. If we are in an "Extend Virtual +Block Device" request, the COW blocks are allocated at the free tree. If we are +in an "Extend Free Tree" request the COW blocks come from the meta tree instead. +After having allocated the COW blocks, the Tresor walks up again through the +loaded blocks, updates the hashes in the updated nodes, and does the +write-back. + +If the targeted tree is the virtual block device and the most recent device +state (the one which we did the tree walk on) was not yet volatile (no +unsynchronized writes so far), a new, volatile device state must be created +in order to reference the resized tree in the superblock. + +Finally, the number of remaining physical blocks for the extension operation is +updated in the superblock. If the number reaches 0, the Tresor returns to the +state "Normal" and the extension request has finished successfully. On the other +hand, if there are still physical blocks left in the extension contingent, the +superblock remains in the "Extending Virtual Block Device" respectively "Extending +Free Tree" state. To complete the extension step, the updated superblock is +secured. + +The outcome of each step of an "Extend VBD" operation visualized (blocks that +were just added are marked with A, blocks that were updated with U): + +! Init ----------> Extend #1 --------------> Extend #2 --------> ... +! +! Superblock Superblock Superblock +! State: Extend State: Extend State: Extend +! Blocks: 10 Blocks: 8 Blocks: 1 +! | | | +! __o___ __U___ _____A______ +! / | | \ / | | \ / \ +! o o o o o o o __U__ __o___ A +! ... / \ ... / | | \ / | | \ \ +! o o o o A A o o o __o__ __A__ +! ... / | | \ / | | \ +! o o o o A A A A +! ------------------------------------------------------------------------ +! VBA ... 12 13 ... 12 13 14 15 ... 12 13 14 15 16 17 18 19 +! +! +! +! +! ... -------> Extend #3 +! +! Superblock +! State: Normal +! | +! _____U___________ +! / \ +! __o___ _U_ +! 
/ | | \ / \ +! o o o __o__ __o__ A +! ... / | | \ / | | \ +! o o o o o o o o +! ------------------------------------------------------------------------ +! VBA ... 12 13 14 15 16 17 18 19 + +Meta-tree extension +------------------- + +When extending the free tree, there is one thing that is missing in the above +description of the algorithm. The meta tree, which manages the sparse blocks for +the COW in the free tree, must always be dimensioned according to the size of +the free tree. It is assumed that an allocation at the meta tree for COW in the +free tree never fails. Adding the fact that there are never more than two +versions of the free-tree meta-data, this means that the meta tree must have at +least as many leaves as there are inner blocks in the free tree. What holds for +the free-tree meta data also applies to the meta data of the meta tree itself. +So, to sum it up, the number of leaves in the meta tree must be at least the +number of inner blocks in the free tree plus the number of inner blocks in the +meta tree. + +That said, whenever the Tresor is at the point of adding the first new inner +block during a free-tree extension-step, the meta tree must be extended as a +prerequisite. This is not implemented so far but is technically necessary +and, therefore, the envisioned algorithm for this aspect is described here: + +First, we have to know how many leaves the meta tree must have so that we can +continue the extension step. For this, we have to know the total number N1 of +inner blocks that the free tree will have with the new branch. This number can +be calculated because, at this point, we already know how many leaves the free +tree will have after the extension step. Then, we can determine the number N2 +of inner blocks that the meta tree would have with N1 leaves. After that, we +calculate the number N3 of inner blocks that the meta tree would have with N1 + +N2 leaves. 
And so on and so on, until we reach the point where the assumed +number of leaves in the meta tree doesn't change anymore. This final number of +leaves is then set as goal for the meta-tree extension. + +Now, we check whether the meta tree already fulfills this goal or not. If not, +we issue one meta tree extension step, and afterwards check again. If the goal +is still not reached, we continue issuing meta tree extension steps until there +are enough leaves. Note that all this is done as part of the atomic free tree +extension step and no other request can be scheduled in between. After that, +the Tresor can continue with adding the new inner blocks to the free tree. + +An extension step at the meta tree is done using the same algorithm as for the +free tree and the virtual block device. The only difference is that, for doing +the COW, we allocate blocks directly from the lowest inner block of the new +meta-tree branch. This is always possible because the degree of the meta tree is +always greater than its number of levels. Either the lowest inner block is a new +block, then we can add new leaves as required. Or, the lowest inner block already +existed, which leaves us with two further situations. If one of the leaves of +the lowest inner block is already allocated, the branch needs no COW anymore. +Otherwise we have enough leaves to do the COW. + +Contingent depletion +-------------------- + +A remaining topic is how the depletion of the contingent of new physical blocks +is handled. The extension algorithm assumes that the contingent of new physical +blocks can be of any size and that it will always be incorporated completely by +the Tresor. So, the possibility of having no blocks left must be considered at +any point in the algorithm where a block shall be taken from the contingent. 
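To make the notion of a contingent that can run dry at any point more tangible, here is a minimal Python sketch. It is an illustration under assumptions and not the Tresor implementation: the class 'Contingent' and the helper 'contingent_for_extension' are hypothetical names. The sketch only models what the text states, namely that an extension by N blocks covers the addresses A + 1 to A + N, that the lowest remaining address is handed out first, and that every caller must cope with an empty contingent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contingent:
    """Hypothetical model of the remaining new physical blocks of an extension."""
    next_pba: int    # lowest physical block address not yet handed out
    remaining: int   # number of blocks left in the contingent

    def take(self) -> Optional[int]:
        """Hand out the lowest remaining PBA, or None if the contingent is depleted."""
        if self.remaining == 0:
            return None
        pba = self.next_pba
        self.next_pba += 1
        self.remaining -= 1
        return pba


def contingent_for_extension(highest_used_pba: int, nr_of_new_blocks: int) -> Contingent:
    """An extension by N blocks covers the addresses A + 1 to A + N."""
    return Contingent(next_pba=highest_used_pba + 1, remaining=nr_of_new_blocks)
```

For instance, with A = 99 and N = 2, two calls of 'take' yield the addresses 100 and 101, and a third call yields nothing. Each place in the extension step that calls 'take' would have to handle this third case explicitly.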
+ +Obviously, the easiest situation is that the contingent is consumed exactly +when having filled up all nodes of the lowest inner block of the extension tree +walk in the targeted tree. Then, the extension step can be finished as +described. The same goes for the case that we filled up some but not all of the +nodes of the lowest inner block. A future extension request will deal just fine +with the remaining unset nodes. + +More interesting is the situation that the contingent becomes empty when we want +to add an inner block to the targeted tree. But fortunately our algorithm is well +prepared for that. If the parent block of the missing inner block is not a new +block, this means that the extension step has done nothing to the targeted tree +so far. We can simply jump to securing the superblock without updating the +targeted tree (the meta tree or free tree, however, might have changed nonetheless). +If the parent block of the missing inner block is a new block, we have to stop and walk +up again updating the hashes and doing the write-back. The unfinished new branch +remains in the targeted tree with its lowest inner block having all nodes unset +(if it is not a new root block). The Tresor has no problem with this as it only +translates VBAs that are in its range. It will simply never do a tree walk that +leads into these new inner blocks. A future extension request on this tree, +however, expects finding an unset node during its tree walk for the lowest +invalid VBA. The position of this node is not relevant. + +If the contingent becomes depleted during the extension of the meta tree, all +this applies as well. The corresponding extension step at the free tree has done +nothing to the free tree so far.