Demand-adaptive Memory Compression in Hardware

THE CHALLENGE

The density scaling of computer memory has lagged behind that of central processing units (CPUs) and storage, creating a cost bottleneck in data centers and cloud computing. To logically scale up memory density, operating systems (OSes) today compress and pack more values into the same amount of physical memory. To minimize the impact on application performance, the OS compresses memory in a demand-adaptive manner: as memory demand increases, it compresses more pages as needed, selecting the coldest (least-accessed) pages first. However, each memory access to a compressed page requires waking up the OS to decompress the page via a slow page fault; as a result, systems today can only afford to compress a small fraction (5–20%) of pages. Compressing more values (up to 80–90%) in memory while maintaining good performance requires speeding up accesses to compressed pages.

 

OUR SOLUTION

Xun (Steve) Jian and his lab have designed a novel demand-adaptive hardware memory compression module that compresses memory values in hardware (Figure 1). The module resides in the CPU’s memory controller; it compresses and packs cold pages, and it decompresses and unpacks compressed pages when they are accessed. To do so, it adds a new level of dynamic address translation between the OS-managed physical address and the actual memory (e.g., dynamic random-access memory, or DRAM) address, which lets it adaptively compress only as many pages as the current demand requires. The system schedules more compression only when the hardware-visible free (i.e., unused) memory falls below a specified threshold.
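
The threshold-driven behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the class name `CompressedMemoryModel`, the 4 KiB page size, the use of `zlib` as the compressor, and the least-recently-accessed ordering used to approximate "coldest" are all illustrative choices that do not come from the source.

```python
import zlib

PAGE_SIZE = 4096  # bytes per page (assumed for illustration)


class CompressedMemoryModel:
    """Toy model: a translation table that compresses cold pages on demand."""

    def __init__(self, dram_bytes, free_threshold):
        self.dram_bytes = dram_bytes          # capacity of the actual memory
        self.free_threshold = free_threshold  # compress when free bytes drop below this
        self.table = {}        # physical page number -> (is_compressed, stored bytes)
        self.last_access = {}  # physical page number -> logical access time
        self.clock = 0

    def free_bytes(self):
        used = sum(len(data) for _, data in self.table.values())
        return self.dram_bytes - used

    def _touch(self, ppn):
        self.clock += 1
        self.last_access[ppn] = self.clock

    def write(self, ppn, data):
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page")
        self.table[ppn] = (False, bytes(data))
        self._touch(ppn)
        self._compress_if_needed(exclude=ppn)

    def read(self, ppn):
        compressed, data = self.table[ppn]
        if compressed:
            # Decompression happens inside the model; the caller never sees it.
            data = zlib.decompress(data)
            self.table[ppn] = (False, data)
        self._touch(ppn)
        self._compress_if_needed(exclude=ppn)
        return data

    def _compress_if_needed(self, exclude):
        # Visit uncompressed pages from coldest to hottest, skipping the page
        # that was just accessed, and stop as soon as free memory recovers.
        candidates = sorted(
            (p for p, (c, _) in self.table.items() if not c and p != exclude),
            key=self.last_access.__getitem__,
        )
        for ppn in candidates:
            if self.free_bytes() >= self.free_threshold:
                break
            packed = zlib.compress(self.table[ppn][1])
            if len(packed) < PAGE_SIZE:  # keep incompressible pages as they are
                self.table[ppn] = (True, packed)


# Eight pages of actual memory; compress once fewer than two pages are free.
model = CompressedMemoryModel(dram_bytes=8 * PAGE_SIZE, free_threshold=2 * PAGE_SIZE)
for ppn in range(7):
    model.write(ppn, bytes([ppn]) * PAGE_SIZE)
```

In this model, no page is compressed while free memory stays at or above the threshold; only the seventh write pushes free memory below it, at which point the coldest pages are compressed until the threshold is met again.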

Compared to OS memory compression, each memory access to a compressed page can be twenty times faster because the hardware module can transparently locate and decompress hardware-compressed pages without needing any OS support during the access. However, this introduces two new sources of overhead relative to OS memory compression. First, the added level of address translation can slow down accesses to uncompressed pages. Second, transparently and dynamically changing the actual size of each page in hardware interferes with the OS's ability to precisely allocate memory to different workloads and thus increases performance variability for multi-tenant execution. To address the first issue, the module dynamically applies faster address translation to hot, uncompressed pages (Figure 2). To address the second issue, the module provides architectural support for the OS to precisely allocate actual memory (e.g., DRAM) to each workload (Figure 3).
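
To make the allocation problem concrete, the sketch below tracks how much actual memory each workload occupies once page sizes can change underneath the OS. The source does not describe the interface the module exposes, so everything here is hypothetical: the class `ActualMemoryAccountant`, its `set_quota` and `charge` methods, and the idea of a per-workload byte quota are assumptions chosen only to illustrate the concept.

```python
class ActualMemoryAccountant:
    """Hypothetical per-workload accounting of actual (post-compression) bytes."""

    def __init__(self):
        self.quota = {}       # workload -> maximum actual bytes it may occupy
        self.usage = {}       # workload -> actual bytes currently occupied
        self.page_bytes = {}  # (workload, page number) -> bytes that page occupies

    def set_quota(self, workload, quota_bytes):
        self.quota[workload] = quota_bytes
        self.usage.setdefault(workload, 0)

    def charge(self, workload, ppn, stored_bytes):
        """Record the current stored size of a page; refuse if it exceeds the quota."""
        key = (workload, ppn)
        new_usage = self.usage[workload] - self.page_bytes.get(key, 0) + stored_bytes
        if new_usage > self.quota[workload]:
            return False
        self.page_bytes[key] = stored_bytes
        self.usage[workload] = new_usage
        return True


acct = ActualMemoryAccountant()
acct.set_quota("A", 8192)               # workload A may occupy two pages' worth
first = acct.charge("A", 1, 4096)       # uncompressed page fits
second = acct.charge("A", 2, 4096)      # quota is now exactly full
third = acct.charge("A", 3, 4096)       # rejected: would exceed the quota
acct.charge("A", 1, 1024)               # page 1 shrinks after compression
acct.charge("A", 2, 1024)               # page 2 shrinks after compression
retry = acct.charge("A", 3, 4096)       # now fits within the same quota
```

The point of the example is that a workload's footprint in actual memory depends on the compressed size of its pages, so precise allocation requires charging workloads for what they occupy rather than for the number of pages they hold.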

 

Figure 1: The novel demand-adaptive memory compression module and its high-level operations, which are highlighted in green.

 

Figure 2: The most frequently accessed uncompressed pages are dynamically selected to use short compressed-memory translation entries (CTEs). Because short CTEs are small (e.g., 1 or 2 bits), they fit well in the translation cache (CTE$) and enjoy a high hit rate.
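
A rough calculation shows why entry size matters for the translation cache. The 2-bit short entry follows the caption; the 32 KiB cache size, the 64-bit full entry, and the 4 KiB page size are assumed values picked only to make the arithmetic concrete.

```python
PAGE_SIZE = 4096  # bytes per page (assumed)


def reach_bytes(cache_bytes, entry_bits, page_size=PAGE_SIZE):
    """Amount of memory whose translations fit in the cache at once."""
    entries = (cache_bytes * 8) // entry_bits
    return entries * page_size


CACHE_BYTES = 32 * 1024  # assumed translation cache capacity
FULL_CTE_BITS = 64       # assumed size of a full translation entry
SHORT_CTE_BITS = 2       # short entry size, per the 1-or-2-bit example

short_reach = reach_bytes(CACHE_BYTES, SHORT_CTE_BITS)
full_reach = reach_bytes(CACHE_BYTES, FULL_CTE_BITS)
print(short_reach // 2**20, "MiB vs", full_reach // 2**20, "MiB")
# prints: 512 MiB vs 16 MiB
```

Under these assumptions the same cache covers 32 times more memory with short entries, which is consistent with the high hit rate the caption describes for hot, uncompressed pages.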

 

Figure 3: Overview of the hardware support that allows the OS to precisely allocate actual memory to individual workloads while the hardware transparently compresses memory.
