The NEON project proposes the concept of an “NVM-only” operating system, more precisely: an operating system that resides and executes, together with all machine programs, in byte-addressable non-volatile memory (NVRAM). Such a system not only manages NVRAM for other programs but also uses NVRAM for its own purposes, in an effort to maximize the energy efficiency and computational performance of the computing system while minimizing latency.
Assuming a mechanism for defining a checkpoint for the operating-system state stored in the device and processor registers and the cache at the moment of an impending system crash (REDOS), it is hypothesized that such an “NVM-only” operating system can dispense with many, if not all, of the persistence measures that would otherwise have to be implemented and thereby reduce its level of background noise. Specifically, this can decrease power consumption, increase computational power, and reduce latency, both for operations performed in the operating system itself and for actions occurring at the machine program level. Furthermore, by eliminating or simplifying these persistence measures, an “NVM-only” operating system will be leaner than its functionally identical twin of conventional (i.e. DRAM-based) design. On the one hand, this contributes to better analyzability of non-functional properties of the operating system – the hardware will provide upper timing bounds for NVRAM write accesses, which was previously not possible or only possible to a limited extent for persistence realized purely in software – and, on the other hand, results in a smaller attack surface or trusted computing base.
Typical examples of such persistence measures, especially in general-purpose operating systems such as UNIX and Linux or in machine programs running on them, are the superblock of a file system and the index node (inode) of a file, each of which stores important metadata persistently but is also kept permanently in main memory (in-core) for performance reasons. Another example is the caching of written data during block-oriented output (write caching) for the purpose of latency hiding (delayed write, lazy write), both within the operating-system kernel (buffer cache: write-back) and at machine program level (fwrite(3)). In UNIX systems, a program called update runs in the background and performs a sync(2) every 30 seconds, ensuring that relevant data residing in volatile foreground memory is transferred to non-volatile background memory. Linux likewise runs background programs for this purpose (bdflush; starting from version 2.6, the pdflush thread within the operating-system kernel), and these follow quite complex transaction protocols, for example to ensure the consistency of the superblock. All these measures can be simplified or even become superfluous if the data in question is inherently persistent in NVRAM.
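The layering of write caches described above can be made concrete with a minimal C sketch of what a conventional machine program has to do to get data onto stable storage. The helper name persist_record is an assumption made for this illustration and is not taken from NEON; only the calls fwrite(3), fflush(3), fileno(3) and fsync(2) are standard.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of NEON): write len bytes to path and
 * force them through both cache layers. Returns 0 on success, -1 on error. */
int persist_record(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    int ok = 1;
    /* Layer 1: fwrite(3) only copies into the stdio buffer in user space. */
    if (fwrite(buf, 1, len, fp) != len)
        ok = 0;
    /* Layer 2: fflush(3) hands the data to the kernel, where it may still
     * sit in the write-back cache. */
    if (ok && fflush(fp) != 0)
        ok = 0;
    /* Layer 3: fsync(2) asks the kernel to write the data to the device. */
    if (ok && fsync(fileno(fp)) != 0)
        ok = 0;
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Each of the three steps exists only because the layer above it is volatile. In an “NVM-only” setting, where the buffers themselves reside in NVRAM, the hypothesis of the text is that such explicit flushing can be simplified or dropped.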
Another example is recoverable mutual exclusion, for which the assumed checkpoint mechanism of an “NVM-only” approach allows special recovery measures to be dispensed with. If a system crash hits a process in a critical section in the NVM, this very process is implicitly continued after restart with the processor state that was valid at the moment of the crash. The lock variable, also located in the NVM, remains associated with this process, just as all data accessible by this process within the critical section is implicitly persistent and corresponds to the system state of this process. The extra effort (in space, time, and energy) that a process would otherwise need to ensure recovery of the entry log after restart is eliminated. This assumes, however, that the process itself has not contributed to the failure, for example because it proceeds within the critical section in a type-safe manner. For a critical section lying within the operating system, this may be assumed implicitly (by resorting to a type-safe programming language) or explicitly (by strictly passive programming).
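As a rough illustration of this argument, the following C sketch shows a lock whose entire state is one word holding the owner. All names (nvm_lock_t, nvm_lock_try_acquire and so on) are hypothetical, and the lock word is ordinary memory here; the sketch merely assumes that on an “NVM-only” system this word would live in NVRAM and that the owner resumes from its checkpoint, so no entry log and no recovery path are coded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NVM_LOCK_FREE 0u   /* sentinel: nobody holds the lock */

/* Assumption: this word would be placed in NVRAM and survive a crash. */
typedef struct {
    _Atomic uint32_t owner;   /* id of the holder, or NVM_LOCK_FREE */
} nvm_lock_t;

/* Try to take the lock for process pid; true if pid holds it afterwards. */
bool nvm_lock_try_acquire(nvm_lock_t *l, uint32_t pid)
{
    assert(l != NULL && pid != NVM_LOCK_FREE);
    uint32_t expected = NVM_LOCK_FREE;
    return atomic_compare_exchange_strong(&l->owner, &expected, pid);
}

/* Read the current owner; after a restart this is still the old holder. */
uint32_t nvm_lock_owner(nvm_lock_t *l)
{
    assert(l != NULL);
    return atomic_load(&l->owner);
}

/* Release the lock; has an effect only if pid is the current owner. */
void nvm_lock_release(nvm_lock_t *l, uint32_t pid)
{
    assert(l != NULL);
    uint32_t expected = pid;
    atomic_compare_exchange_strong(&l->owner, &expected, NVM_LOCK_FREE);
}
```

The point of the sketch is what is absent: the process that held the lock at the time of the crash simply continues inside its critical section and eventually calls the release function, while competing processes keep seeing the lock as taken.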
Besides the regular shutdown of the computing system, power failure and other serious error conditions (panic) in the operating system are assumed as occasions to create a checkpoint of the device and processor register contents. A power failure is reported to the operating system as an exception (trap, interrupt); the processor state of the interrupted process is then saved and persisted in NVRAM as part of exception handling. The emergency action of the operating system (in case of panic) runs through the same procedure, except that it is usually triggered synchronously with the current process. It is essential that the processor state can be saved completely within the time remaining until the processor stops functioning for lack of sufficient power.
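The save-and-resume procedure can be sketched in portable C as follows. This is a simulation under stated assumptions, not the NEON implementation: the register layout cpu_state_t, the marker CKPT_MAGIC and both function names are invented for illustration, and capturing real registers or writing back caches would require architecture-specific code that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CKPT_MAGIC 0x4E454F4Eu   /* ASCII "NEON", chosen for this sketch */

/* Hypothetical register file; a real layout depends on the CPU. */
typedef struct {
    uint64_t gpr[16];
    uint64_t pc;
    uint64_t sp;
} cpu_state_t;

/* Assumption: this record would reside at a fixed location in NVRAM. */
typedef struct {
    cpu_state_t state;
    uint32_t    magic;   /* written last, marks the checkpoint complete */
} checkpoint_t;

/* Called from the power-fail or panic handler. */
void checkpoint_save(checkpoint_t *area, const cpu_state_t *live)
{
    assert(area != NULL && live != NULL);
    area->magic = 0;                        /* invalidate any old checkpoint */
    memcpy(&area->state, live, sizeof *live);
    /* Real hardware would need a cache write-back and ordering fence here. */
    area->magic = CKPT_MAGIC;               /* commit */
}

/* Called on restart; returns false if no complete checkpoint exists. */
bool checkpoint_restore(checkpoint_t *area, cpu_state_t *out)
{
    assert(area != NULL && out != NULL);
    if (area->magic != CKPT_MAGIC)
        return false;
    memcpy(out, &area->state, sizeof *out);
    area->magic = 0;                        /* consume the checkpoint */
    return true;
}
```

Writing the marker last is one common way to make sure that a save interrupted by the final loss of power is not mistaken for a complete checkpoint on restart.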
Even in operation under normal conditions, “NVM-only” operating systems are assumed to improve the non-functional properties of the computing system. In terms of energy consumption, this applies to both static sleep states and dynamic runtime decisions. When entering a deep sleep state, the memory pages in the “NVM-only” approach are already persistent and do not have to be backed up from main memory to persistent secondary memory as in DRAM-based computing systems. Copying operations that are costly in time and power are therefore not necessary during the transition to this state. Similarly, the effort required to exit the deep sleep state is reduced (omission of readback operations). Furthermore, it is assumed that the fixed-point safety mechanisms provided by NEON form the basis for new sleep states in the millisecond range. High-frequency entry into and exit from these sleep states (latency in the millisecond range) is expected to establish the previously missing bridge between hardware-controlled sleep states of the CPU (C-states, microsecond range) and sleep states managed by the operating system (suspend-to-DRAM, in the range of seconds and coarser). From the system's point of view, these sleep states managed and controlled by the operating system resemble frequent, short-term power failures.
The approach followed with NEON as outlined here uses Linux as the base operating system. The adjacent figure roughly outlines the NEON function blocks. On the one hand, the existing system software is restructured so that it can be run directly in the NVRAM. In the first step, however, only a subset of operating-system functions is considered and subjected to a tool-supported adaptation to NVRAM specifics. On the other hand, software and hardware components are precisely aligned and configured to reduce the energy requirements of the computing system. The focus is on energy-aware operating-system methods that help to compensate for the higher energy requirements of NVRAM in certain operations (especially when writing).
The system software is designed so that its volatile state resides solely in the registers or caches of the underlying CPU; it does not assume that program text and data reside in DRAM-based main memory. The computing system assumed here does not need to be equipped with conventional volatile main memory for operation. The volatile contents of the CPU registers and hardware caches form the system state that is backed up to NVRAM in the event of a power failure. It must be guaranteed that the time required to back up this state does not exceed the residual energy window, that is, the time for which the power supply unit can keep the computing system alive. For this purpose, both the energy costs of the backup procedure and the NVRAM write bandwidth are determined, and the electrical characteristics of the computer’s power supply unit are measured.
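The timing constraint stated here can be written down as a simple budget check. The following C sketch uses a deliberately simplified model chosen for illustration: backup time is state size divided by NVRAM write bandwidth plus a fixed handler overhead, and the residual energy window is residual energy divided by power draw. The function names and the model itself are assumptions, not measured NEON parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Seconds the power supply can keep the system alive after power loss
 * (simplified: constant power draw). */
double holdup_window_s(double residual_energy_j, double power_draw_w)
{
    assert(power_draw_w > 0.0);
    return residual_energy_j / power_draw_w;
}

/* Seconds needed to write the volatile state to NVRAM, plus a fixed
 * overhead for entering and running the exception handler. */
double backup_time_s(double state_bytes, double write_bw_bytes_per_s,
                     double fixed_overhead_s)
{
    assert(write_bw_bytes_per_s > 0.0);
    return state_bytes / write_bw_bytes_per_s + fixed_overhead_s;
}

/* The guarantee from the text: backup must finish within the window. */
bool checkpoint_fits(double backup_s, double window_s)
{
    return backup_s <= window_s;
}
```

With purely illustrative numbers, 2 MB of register and cache state at 1 GB/s plus 1 ms overhead gives about 3 ms of backup time, which fits into a 10 ms window obtained from 0.5 J of residual energy at a 50 W draw. A real analysis would replace these constants with the measured values the text refers to.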
Not least, the Linux-based system software is subjected to an NVRAM-related artefact elimination. Software that explicitly provides persistence is removed: because the system runs directly in NVRAM, persistence now follows implicitly from the non-volatile system state, so this software has become redundant and would otherwise only generate overhead. These examinations rely on our own tools for compiler-based operating-system tailoring as well as on static program analysis to predict time and energy costs.
Finally, and by no means insignificantly, the minimal subset of persistence features suitable for use in operating-system kernels supports the NEON modules of Linux in running directly in NVRAM. The focus is on abstractions that help critical or sensitive sections run non-blocking or even wait-free.
Project results and findings
The first comprehensive NEON measure was the “NVRAM-ification” of Linux, that is, the provision of a Linux that, including the machine programs run by it, operates exclusively from NVRAM. The concept for this was presented at Dagstuhl Seminar 22341 prior to the work; the implementation in practice as well as an evaluation of “NVM-only” Linux are documented in a Technical Report and a contribution to ARCS 2023. However, the Linux variant described in these papers is not yet REDOS-ed.
For the purification of Linux, investigations were carried out into the dynamic updating and specialization of programs in general and of operating systems in particular. A basic consideration was the question of whether removing superfluous functions that ensure the persistence of system data structures may also require dynamic adaptation to the respective call environment, since no prior knowledge of it is available at Linux build time. This would primarily affect persistence functions that are justified by certain application scenarios and do not serve the Linux operating principle alone. Work on such update and specialization techniques was published at USENIX ATC 2023 and PLOS 2023.
As practical as the dynamic adaptation of Linux to environmental conditions may be (“from above” on the part of the applications and “from below” with respect to certain hardware properties, here NVRAM), it has turned out not to be absolutely necessary for the already demanding work on NEON. Rather, this “nice to have” work is now being pursued in the context of DOSS. The NVRAM-motivated “decluttering” of Linux will therefore happen statically (at build time), primarily in line with the operating principle and secondarily depending on the respective application profile.
The sister project PAVE considers a virtual NVM-only approach in which the availability of volatile main memory (e.g. DRAM) is assumed, but only enough to enable efficient computing operation given the higher access latencies to the NVRAM. Furthermore, methods for capacity scaling using NVRAM are being investigated. This work uses FreeBSD as the base operating system.
The ResPECT project, which focuses on embedded communication systems, is developing a holistic operating system and communication protocol concept, which assumes that the transfer of information (receiving control data for actuators or sending sensor data) is the core task of almost all networked nodes.
What all three projects have in common is the approach of a residual-energy-dependent, NVRAM-based operation shutdown (REDOS).