- # Implementing a Fast, Durable, Flash-Friendly Queue
- **Goals**
- Minimize write amplification, particularly on consumer flash media such as MicroSD and USB drives
- Overflow many small records from memory to disk as necessary, but **only** as necessary
- Work on constrained systems
- *Maybe* `no_std`
- "Best effort" at-least-once delivery
- Be as configurable as reasonable
- Opinions are the mark of someone who knows what they're doing... usually
- **Non-Goals**
- 100% durability guarantees
- Power loss is not a case we will *necessarily* gracefully handle, though handling partial or torn writes is a goal
- Even during a graceful shutdown, we prefer to drop buffered records that have not been persisted to disk yet, as the wear and complexity cost of supporting partial pages is too great. This may change in the future
- Best performance
- Duplicate record handling
- Maximize storage utilization
- i.e. expect some storage overhead
- High concurrency
- **Ethos**
- Every write is *costly*
- Deletes are 4x as expensive
- **Features**
- Reduced write amplification via novel sequencing algorithm
- Coalescing writes that align to a configurable page size (defaults to 4 KiB)
- Use the minimum amount of information necessary to accomplish at-least-once delivery (up to the buffer's capped size)
- ## Design
- ### On-Disk
- The primary unit of serialization/storage is the `page`
- `page`s are aligned to 4 KiB by default, but may be more or less. A multiple of 4 KiB is recommended
- The size of a `page` must fit *at least* one user record type, otherwise a fatal error will be returned on init
- The total size of a page includes metadata described below in the Page section
- The first 4 KiB of the buffer file are reserved for configuration and metadata. The fields are listed below.
- *This could also be a separate file, to save roughly 3.8 KiB of unused space*, however that removes the ability to use a raw block device, which is not desirable. Perhaps we should have 2 modes of operation?
- `page_count`: the number of pages in this buffer. This can be used to (re)calculate the total size of the buffer
- `page_size`: the size in bytes of each page
- `num_pages`: (2 bits), values 0-3
- This is only set the first time the number of records exceeds the previous value. That is, on init this value is 0. After the first page is serialized, this is set to 1. When the second page is written, this gets set to 2. Finally, when the 3rd page is written, it gets set to 3. It should never be decreased once it has been incremented unless the whole buffer gets reinitialized.
- `looped`: a boolean flag to indicate that the buffer has looped back around at least once
- This signals to the reader that records greater than `head` should be treated as part of the tail of the queue
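A minimal sketch of how this reserved header might be laid out, assuming little-endian fixed-width fields at the start of the 4 KiB region (the offsets, widths, and encoding here are illustrative assumptions, not a committed format):

```rust
// Sketch of the reserved metadata region. Field names come from the list
// above; the byte layout and endianness are assumptions for illustration.
const HEADER_SIZE: usize = 4096; // first 4 KiB of the buffer is reserved

#[derive(Debug, PartialEq)]
struct Header {
    page_count: u32, // total size = HEADER_SIZE + page_count * page_size
    page_size: u32,  // bytes per page; a 4 KiB multiple is recommended
    num_pages: u8,   // 0-3, saturating as described above
    looped: bool,    // buffer has wrapped around at least once
}

impl Header {
    fn encode(&self) -> [u8; HEADER_SIZE] {
        let mut buf = [0u8; HEADER_SIZE];
        buf[0..4].copy_from_slice(&self.page_count.to_le_bytes());
        buf[4..8].copy_from_slice(&self.page_size.to_le_bytes());
        buf[8] = self.num_pages & 0b11; // 2-bit field
        buf[9] = self.looped as u8;
        buf
    }

    fn decode(buf: &[u8; HEADER_SIZE]) -> Self {
        Header {
            page_count: u32::from_le_bytes(buf[0..4].try_into().unwrap()),
            page_size: u32::from_le_bytes(buf[4..8].try_into().unwrap()),
            num_pages: buf[8] & 0b11,
            looped: buf[9] != 0,
        }
    }
}
```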
- #### Pages
- Pages should be large and typically aligned to the *block* size of the storage medium
- "block size" being the unit of deletion
- To minimize write-amplification and partially unused blocks
- Each page contains the following info
- Arbitrary bytes representing the user data; each page must contain 1 or more user records
- CRC32 of the userdata and sequence number
- `u16` sequence number
- `flags: u16`
- `0:1`: Whether the
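As a sketch, the framing above might look like the following, with the CRC computed over the user bytes concatenated with the sequence number so a torn write is detectable. The trailer ordering, little-endian encoding, and CRC-32 polynomial (the common IEEE one, as used by zlib) are assumptions; a real page would place the trailer at a fixed offset inside the fixed-size page rather than at the end of a `Vec`:

```rust
// Bitwise CRC-32 (reflected, poly 0xEDB88320); slow but dependency-free.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    !crc
}

// Append the page trailer: CRC32(userdata || seq), then seq, then flags.
fn seal_page(userdata: &[u8], seq: u16, flags: u16) -> Vec<u8> {
    let mut page = Vec::with_capacity(userdata.len() + 8);
    page.extend_from_slice(userdata);
    let mut crc_input = userdata.to_vec();
    crc_input.extend_from_slice(&seq.to_le_bytes());
    page.extend_from_slice(&crc32(&crc_input).to_le_bytes());
    page.extend_from_slice(&seq.to_le_bytes());
    page.extend_from_slice(&flags.to_le_bytes());
    page
}

// Returns the sequence number if the CRC matches, None on a torn/partial write.
fn check_page(page: &[u8]) -> Option<u16> {
    let n = page.len().checked_sub(8)?;
    let stored = u32::from_le_bytes(page[n..n + 4].try_into().ok()?);
    let seq = u16::from_le_bytes(page[n + 4..n + 6].try_into().ok()?);
    let mut crc_input = page[..n].to_vec();
    crc_input.extend_from_slice(&seq.to_le_bytes());
    (crc32(&crc_input) == stored).then_some(seq)
}
```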
- #### Sequence Numbers
- On buffer init, the sequence number begins at 1 and increments by 1 for each page added until the end of the buffer is reached
- Once the buffer loops around, the sequence number starts at `page_count` and decrements by 1 for every write until it reaches the end of the buffer again, at which point it loops around once more, starts from 1, and begins incrementing again, and so on
- The primary purpose of these sequence numbers is to algorithmically determine where the `head` of the buffer is on a reload of the circular buffer. The algorithm is as follows
- If `num_pages` is greater than 1
- Look at the first 2 pages in the buffer
- If page 1's sequence number is greater than page 0's, the `head` record can be obtained by scanning until the largest sequence number is found. This can short circuit as any tail records will always be less than the head's sequence number
- Similarly, if page 1's sequence number is *less* than page 0's sequence number, the `head` record can be obtained by scanning the buffer until the lowest sequence number is found. That record is the `head`.
- If `num_pages` is 1
- Page 0 is the `head`
- If `num_pages` is 0
- There are no records and `head` is 0
- The beauty of the sequence numbers is that we never need an extra write to a metadata record for each page write in order to identify the head of the buffer: it's baked right into each page's own metadata, and we only pay a small 2-byte overhead per page
![[Ring Page Sequencing.excalidraw]]
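The sequencing rules and head-recovery scan above can be sketched as follows. `seq_for` is a hypothetical helper that derives the sequence number for the n-th page write overall (a real writer would simply track its own state), and `find_head` walks the pages looking for the point where the exact ±1 run starting at page 0 breaks — the page before the break is the most recently written one:

```rust
/// Sequence number for the `write_index`-th page write into a buffer of
/// `page_count` pages: ascending 1..=page_count on even passes,
/// descending page_count..=1 on odd passes.
fn seq_for(write_index: u64, page_count: u16) -> u16 {
    let pass = write_index / page_count as u64;
    let pos = (write_index % page_count as u64) as u16;
    if pass % 2 == 0 { pos + 1 } else { page_count - pos }
}

/// Index of the `head` page. Pages 0 and 1 give the direction of the
/// in-progress pass; the head is the last page continuing that exact
/// ±1 step from page 0. Note: at the exact midpoint of a pass the old
/// and new numbering can momentarily line up, so treat the result as
/// best-effort, in keeping with the delivery guarantees above.
fn find_head(seqs: &[u16]) -> usize {
    if seqs.len() < 2 {
        return 0; // the num_pages <= 1 cases
    }
    let step: i32 = if seqs[1] > seqs[0] { 1 } else { -1 };
    for i in 1..seqs.len() {
        if seqs[i] as i32 - seqs[i - 1] as i32 != step {
            return i - 1; // run broke here: previous page was written last
        }
    }
    seqs.len() - 1 // run never broke: the final page is the head
}
```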
- #### Reloading State
- As described above, `head` can be determined using the sequence numbers. However, `tail` cannot be determined reliably in this way. We explicitly want to avoid writing any additional data as part of a *read* or drain of the buffer, so we only guarantee *at-least-once* delivery, which in the worst case means redelivery of every record still in the buffer (bounded, obviously, by the buffer's size)
- To reduce the potential for redelivered records, an optional tail index can be stored at the beginning of the buffer in its metadata
- This only *really* needs to be done during a safe shutdown of the process. If an unsafe shutdown (i.e. `kill -9`, `SIGKILL`) occurs, there are no guarantees about the structure of the data and we cannot "trust" that the tail index is accurate
- The rest of the buffer state can be loaded at any time (i.e. the `next` buffer)
- **Persistence**
- `page`s are only written when they are full
- Yes, this means power loss is a critical failure that will lose data. In fact, even a graceful shutdown will drop any buffered records that haven't yet filled a page
- **Initialization**
- If using a file, only the metadata must be written upon init
- If using a raw block device, the first 2 pages *must* be zeroed out upon initialization
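A sketch of initialization under these rules (the function name and signature are assumptions): write the reserved 4 KiB metadata region in both modes, and for a raw block device additionally zero the first two pages so the reload scan cannot misread stale sequence numbers from a previous use of the device:

```rust
use std::io::{self, Seek, SeekFrom, Write};

// Hypothetical init helper. A regular file only needs the metadata header
// written; a raw block device must also have its first 2 pages zeroed,
// since there is no file length to tell fresh space from stale data.
fn init_buffer<W: Write + Seek>(
    dev: &mut W,
    page_size: usize,
    raw_block_device: bool,
) -> io::Result<()> {
    dev.seek(SeekFrom::Start(0))?;
    dev.write_all(&[0u8; 4096])?; // reserved metadata region (fields set separately)
    if raw_block_device {
        // Zero pages 0 and 1 so the adjacent-pair direction check on reload
        // never sees leftover sequence numbers as valid pages.
        let zeros = vec![0u8; page_size];
        dev.write_all(&zeros)?;
        dev.write_all(&zeros)?;
    }
    dev.flush()
}
```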