Virtual Non-Volatile Heap for ResPECT
- Overview
- Behaves like a normal memory allocator, but all data stored is virtually persistent.
- This project is part of the ResPECT project (https://sys.cs.fau.de/research/respect), whose goal is to make the whole system state (operating system and applications) virtually persistent. A large volatile heap is a problem in this context, as flushing it to disk would be impractical when a power failure is imminent.
- We are collaborating with the Telecommunications Lab of Saarland University to make the networking stack transactional, which will prevent long-running operations (e.g., setting up a Bluetooth connection) from being attempted when a power failure is imminent.
- Assume that the kernel state, device state, userspace stack, and registers are persisted using another mechanism; several papers already address this (e.g., TreeSLS).
- Conceptually similar to “NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories” (https://dl.acm.org/doi/pdf/10.1145/1961295.1950380)
- There’s a lot of information here; you do not have to understand all of it now. Many references are optional, and you can focus on the things that are interesting to you.
- Target platform:
- You should be able to test and work on the project on a desktop computer, but you can also test it on the target hardware, which will be handed to you.
- ESP32-C3 (https://www.espressif.com/en/products/socs/esp32-c3) with Adafruit SPI FRAM (https://learn.adafruit.com/adafruit-spi-fram-breakout)
- Zephyr RTOS
- Design:
- Resources managed by the library:
- Persistent file (on NVRAM / disk)
- Memory
- If possible, all algorithms should be as simple as possible and have a bounded upper runtime (and energy consumption). Follow-up projects might statically calculate their Worst-Case Energy Consumption / Execution Time (WCEC/WCET).
- Interfaces (a combined Rust sketch follows at the end of this list):
init(max_dirty_bytes, file, memory)
- Initialize the heap.
sync()
- Must have an upper bound on energy use.
- The trigger for this data cache backup is a Power-Failure Interrupt (PFI), which is handled on a case-by-case basis either in userland or in the operating-system kernel (a minimal userland handler is sketched at the end of this sync() section).
- For details on the PFI concept see “Neverlast: Towards the Design and Implementation of the NVM-based Everlasting Operating System” (https://www4.cs.fau.de/Publications/2021/eichler_21_hicss.pdf)
- Number of dirty pages must be limited
- Eviction algorithm suggestion: “FIFO queues are all you need for cache eviction” (https://dl.acm.org/doi/10.1145/3600006.3613147)?
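A minimal sketch of what the suggested FIFO eviction over the dirty set could look like, assuming pages are tracked by id and the caller writes evicted pages back to NVRAM; the DirtyTracker name and structure are assumptions, not taken from the paper:

```rust
use std::collections::VecDeque;

/// Hypothetical dirty-page tracker with a hard upper bound,
/// following the FIFO eviction suggestion above.
struct DirtyTracker {
    fifo: VecDeque<u32>, // dirty page ids, oldest first
    max_dirty: usize,    // derived from max_dirty_bytes / page size
}

impl DirtyTracker {
    /// Record a page as dirty; if the bound is exceeded, return the
    /// oldest dirty page, which the caller must flush to NVRAM.
    fn mark_dirty(&mut self, page: u32) -> Option<u32> {
        if !self.fifo.contains(&page) {
            self.fifo.push_back(page);
        }
        if self.fifo.len() > self.max_dirty {
            self.fifo.pop_front()
        } else {
            None
        }
    }
}
```

Capping the dirty set this way is what gives sync() its upper bound on energy use: at most max_dirty pages ever need to be written back.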
- Triggered by a signal to the process (SIGUSR1 ~= SIG_ENERGY_OUT / PFI)
- Must be atomic with regards to allocate
- The data cache in RAM, whose “dirty content” is to be efficiently combined with the already persistent NVRAM content.
- Optional: in the case of large or even huge objects, consider not transferring entire object states, but rather combining the relevant (changed) RAM “object snippets” with the corresponding NVRAM counterparts by means of a delayed update [28] or an update mask [14].
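A minimal sketch of the userland variant of this trigger, assuming POSIX signals via the libc crate on the desktop test setup; SIGUSR1 stands in for SIG_ENERGY_OUT, and the VnvHeap stub is hypothetical:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical heap handle; sync() flushes the bounded dirty set.
struct VnvHeap;
impl VnvHeap {
    fn sync(&mut self) { /* write back dirty pages to NVRAM */ }
}

// Set by the signal handler; only async-signal-safe work
// (an atomic store) happens inside the handler itself.
static ENERGY_OUT: AtomicBool = AtomicBool::new(false);

extern "C" fn on_pfi(_sig: libc::c_int) {
    ENERGY_OUT.store(true, Ordering::SeqCst);
}

fn install_pfi_handler() {
    unsafe {
        // SIGUSR1 ~= SIG_ENERGY_OUT, as noted above.
        libc::signal(libc::SIGUSR1, on_pfi as libc::sighandler_t);
    }
}

/// Checked at safe points in the allocator's entry points, which is
/// how sync stays atomic with regards to allocate.
fn maybe_sync(heap: &mut VnvHeap) {
    if ENERGY_OUT.swap(false, Ordering::SeqCst) {
        heap.sync();
    }
}
```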
allocate(size) -> ?
- Interface to allocate virtually non-volatile memory.
- Returns Rust type from which the app can temporarily retrieve a read/writeable reference (reference counted or owned)
- Reference expires within N cycles/joules as determined by resourcegauge.rs (https://gitlab.com/netzdoktor/resourcegauge-rs)
- An implementation in Rust would be preferable because it allows the use of resourcegauge.rs (https://gitlab.com/netzdoktor/resourcegauge-rs); however, C is also possible for smaller projects if you are not interested in learning Rust.
- When the memory is not used, we can move it to disk and bring it back when a read/writeable reference is requested by the program.
- This might also be possible in C++.
- For context: if the hardware supported it, the solution would be different: transparently move pages to the file when unused.
- I.e., leverage the page cache and selective flushing
- This is not possible on our target platform (ESP32-C3)
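Putting the three calls together, a sketch of how the interface could look in Rust; all names (VnvHeap, VnvBox) are placeholders rather than a fixed API, and the bodies are deliberately left open:

```rust
use std::fs::File;
use std::marker::PhantomData;

/// Hypothetical heap handle owning the persistent file and the
/// volatile working memory handed to init().
pub struct VnvHeap {
    max_dirty_bytes: usize,
    file: File,                // persistent backing store (FRAM / disk)
    memory: &'static mut [u8], // volatile working memory
}

/// Hypothetical owning handle returned by allocate(); the payload
/// may live in the file until a read/writeable reference is taken.
pub struct VnvBox<T> {
    object_id: u32,
    _marker: PhantomData<T>,
}

impl VnvHeap {
    /// init(max_dirty_bytes, file, memory)
    pub fn init(max_dirty_bytes: usize, file: File, memory: &'static mut [u8]) -> Self {
        VnvHeap { max_dirty_bytes, file, memory }
    }

    /// allocate(size): returns a handle from which the app can
    /// temporarily obtain a read/writeable reference.
    pub fn allocate<T>(&mut self, _initial: T) -> VnvBox<T> {
        todo!("reserve space in the persistent file and track the object")
    }

    /// Flush the (bounded) dirty set to the persistent file.
    pub fn sync(&mut self) {
        todo!("write back at most max_dirty_bytes of dirty data")
    }
}
```

On the ESP32-C3 target, std::fs::File would be replaced by the SPI FRAM driver; using a file here is a desktop-testing assumption.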
- Challenges
- ResPECT creates a transactional networking stack. Can we create a transactional memory allocator?
- This would allow us to avoid starting operations that cannot complete within the current power budget.
- By requesting the guard type, the app would not only request the memory but also the power budget needed to compute on that memory. If this is not available, the app is blocked. When the budget is ready, the app is resumed.
- This would be a form of energy-driven scheduling.
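A sketch of the guard-plus-budget idea, assuming resourcegauge.rs can answer whether a given amount of energy is available; the EnergyGauge API below is invented for illustration and is not the real resourcegauge-rs interface:

```rust
/// Stand-in for an energy accountant such as resourcegauge-rs;
/// this API is invented for the sketch.
struct EnergyGauge;

impl EnergyGauge {
    /// Block the caller until `millijoules` of budget are available,
    /// then reserve them: the energy-driven scheduling point.
    fn reserve(&self, _millijoules: u32) {
        // e.g., park the task and resume it once the budget is ready
    }
}

/// Hypothetical persistent object handle, simplified to hold its
/// payload inline for this sketch.
struct VnvBox<T>(T);

impl<T> VnvBox<T> {
    /// Requesting the guard requests both the memory and the power
    /// budget needed to compute on it; without budget, the app blocks.
    fn get_mut(&mut self, gauge: &EnergyGauge, millijoules: u32) -> &mut T {
        gauge.reserve(millijoules);
        &mut self.0
    }
}
```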
- Can we achieve external synchrony?
- “To support external synchrony, an SLS should make sure that the state changes caused by a request are persisted before sending responses to external systems. With high-frequency checkpointing, TreeSLS achieves this by delaying external visible operations (e.g., sending network packets) until a checkpoint is taken. This can be implemented transparently to applications by allowing user-space services (e.g., network drivers) to register a checkpoint callback, which will be invoked at the end of each checkpointing, and a restore callback, which is invoked at the end of recovery. TreeSLS also provides an eternal PMO, which is a special kind of PMO that will not be rolled back during recovery.” (https://dl.acm.org/doi/pdf/10.1145/3600006.3613160)
- Write amplification
- e.g., changing 1 byte might require the whole block to be rewritten to disk
- Can we solve this using alignment to separate cold from hot data?
- Can we pack stuff that is frequently changed into one block?
- How to get this data? Can we leverage Rust guard types to count accesses?
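A sketch of using a Rust guard type to count accesses, so that the allocator can learn which objects are hot and pack them into the same block; all names are assumptions:

```rust
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicU32, Ordering};

/// Guard that counts accesses to its object, letting the allocator
/// separate hot from cold data when packing blocks.
struct CountingRef<'a, T> {
    data: &'a mut T,
    hotness: &'a AtomicU32, // per-object access counter
}

impl<'a, T> Deref for CountingRef<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.hotness.fetch_add(1, Ordering::Relaxed);
        self.data
    }
}

impl<'a, T> DerefMut for CountingRef<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.hotness.fetch_add(1, Ordering::Relaxed);
        self.data
    }
}
```

Counting on every deref may be too expensive; a cheaper variant bumps the counter once per guard creation instead.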
- When and how to flush data to storage?
- Synchronously in the main thread using a stop-the-world approach?
- Asynchronously in a background thread?
- When done correctly, this could be faster. But it would make the impl. more complex.
- Maybe a semaphore can be used to count the number of dirty blocks. Then we can slow down the worker thread when the background thread cannot flush data quickly enough (sketched below).
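A sketch of that idea using a Mutex/Condvar pair as a counting semaphore (stable std has no semaphore type); names are assumptions:

```rust
use std::sync::{Condvar, Mutex};

/// Counting gate over dirty blocks: the worker thread blocks in
/// mark_dirty() once `max` blocks are outstanding, which slows it
/// down to the speed of the background flusher.
struct DirtyGate {
    count: Mutex<usize>,
    max: usize,
    cv: Condvar,
}

impl DirtyGate {
    /// Called by the worker thread whenever it dirties a block.
    fn mark_dirty(&self) {
        let mut count = self.count.lock().unwrap();
        while *count >= self.max {
            count = self.cv.wait(count).unwrap(); // backpressure
        }
        *count += 1;
    }

    /// Called by the background thread after flushing one block.
    fn flushed(&self) {
        *self.count.lock().unwrap() -= 1;
        self.cv.notify_one();
    }
}
```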
- Related work
- TreeSLS (https://dl.acm.org/doi/pdf/10.1145/3600006.3613160): “[…] small-sized and frequently updated objects (e.g., Thread) are directly copied to the new checkpoint during checkpointing since the copying is quick. Second, large-sized and slowly changing objects (i.e., memory pages) are asynchronously copied during runtime. Third, objects that can be rebuilt (e.g., page tables) are not included in the checkpoint, which trades restore time for faster checkpointing.”
- “TreeSLS checkpoints these hot pages on DRAM with stop-and-copy and the rest pages on NVM with copy-on-write. Note that hybrid copy echos the high-level idea of speculative copy-on-write: predicting pages that are likely to be modified and copying them before the actual copy-on-write.”
- “TreeSLS introduces a dual-function active page list to track hot pages and implement the migration. When a page fault is triggered, we increase the page’s hotness value, and append the page to the list when its hotness exceeds the threshold.”
- For our vNV-Heap we could use the MPU or Rust reference types to track hotness.
- What allocation strategy to use?
- Related work
- TreeSLS (https://dl.acm.org/doi/pdf/10.1145/3600006.3613160): “As an NVM allocator, the checkpoint manager uses a buddy system to manage all NVM resources in TreeSLS. Both the runtime data and checkpoints are stored in the space allocated by the checkpoint manager. Slab systems are also used to facilitate the allocation of small fixed-sized objects. […] TreeSLS leverages redo/undo journaling to maintain the crash consistency of the checkpoint manager.”
- Linux also uses SLAB allocation (but for memory, whereas we have a disk)
- What does glibc / musl use today?
- Implementation:
- Tools and links that might be helpful:
- Zephyr RTOS Supported Boards » RISC-V Boards » ESP32-C3: https://docs.zephyrproject.org/latest/boards/riscv/esp32c3_devkitm/doc/index.html
- Storage Driver
- Hardware documentation: https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/api-reference/peripherals/spi_master.html#id7
- Existing bare-metal driver: https://gitlab.cs.fau.de/watwa/esp32c3/playground/-/blob/master/main/fram.cpp
- Zephyr SPI driver: https://docs.zephyrproject.org/latest/hardware/peripherals/spi.html
- Might be usable for our board with the right device tree (https://en.wikipedia.org/wiki/Devicetree)
- API bindings, libstd, and Cargo integration for running Rust applications on a Zephyr kernel: https://github.com/tylerwhall/zephyr-rust
- Maybe implement it as a standard Rust allocator to easily plug it into applications? (sketched below)
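A sketch of what plugging in via the standard GlobalAlloc trait would look like; the calls below just forward to the system allocator as a placeholder for the vNV-Heap:

```rust
use std::alloc::{GlobalAlloc, Layout, System};

/// Placeholder allocator; a real implementation would forward to the
/// vNV-Heap's allocate/free and its dirty tracking.
struct VnvAllocator;

unsafe impl GlobalAlloc for VnvAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Applications would then opt in with:
// #[global_allocator]
// static HEAP: VnvAllocator = VnvAllocator;
```

Note the tension: GlobalAlloc hands out raw pointers that must stay valid until dealloc, which rules out transparently evicting those objects to the file; the guard-based allocate() interface above avoids this.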