High-Speed Computational Architecture For Reduced Latency In Big Data Processing

Novel computational architecture designs that reduce the latency of processing large volumes of data through three mechanisms: reconfiguration of memory and storage; computational logic within the register file to streamline read/write functions; and programmable scheduling and memory utilization within a configurable load/store unit.
Problem:
Big data applications such as search engines, natural language processing, and classification algorithms are energy intensive to execute and require high-throughput sorting and processing of data. Part of the inefficiency arises from:

Fixed data access processing among local storage, memory, and cloud storage.
A conventional register file that retrieves (“reads”) data for a processing operation performed by another component, then stores (“writes”) the processed data back into the register file as a separate step.
Current load/store units that give no control over scheduling or over which memory registers are used. Memory organization can become a limitation when processing large volumes of data.

Solution:

A reconfigurable memory and storage architecture that offloads functions from the central processing unit (CPU) and the intelligence processing unit (IPU). This allows direct access to both local and remote memory, with high bandwidth and low latency.
A computational register file that reduces latency by combining the “write” and operation steps described above. The computational register file can therefore operate on the content during the read/write, instead of performing the two steps separately.
A configurable load/store unit that reduces data access latency by applying different scheduling and utilization policies to different load/store instructions. The programmer has direct control over memory scheduling and utilization.

Technology:

To reconfigure memory and storage, logically disaggregated interconnections (Peripheral Component Interconnect Express (PCIe) switches) allow the field-programmable gate array (FPGA) to access both local and remote memory. The memory network serves local and remote requests over point-to-point connections instead of a switch-based network to reduce latency.
To perform operations within a computational register file, the operand name is expanded from RX to RX.Y, where X is the name of the register and Y is the port ID of the register. X selects which register in the register file is used, and Y selects which computational logic is applied. With this architecture, an operation requires only one load instruction.
Each new load/store unit has its ID and respective attributes stored in a configuration table. To improve scheduling and utilization, the programmer uses the table to define the size of a single memory request, the coalescing threshold, the coalescing window, and the scheduling priority.
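
The RX.Y operand naming described for the computational register file can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware design: the source says only that X selects the register and Y selects the computational logic, so the port numbers, the operations bound to them, and the names `PORT_LOGIC`, `parse_operand`, and `ComputationalRegisterFile` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical port-to-logic mapping. The source does not list which
# operations exist; these four are chosen only for illustration.
PORT_LOGIC: Dict[int, Callable[[int, int], int]] = {
    0: lambda old, new: new,            # plain write, like a conventional register file
    1: lambda old, new: old + new,      # accumulate into the register
    2: lambda old, new: max(old, new),  # keep the larger value
    3: lambda old, new: min(old, new),  # keep the smaller value
}


def parse_operand(operand: str) -> Tuple[int, int]:
    """Split 'RX.Y' into (register index X, port ID Y); a bare 'RX' means port 0."""
    name, _, port = operand.partition(".")
    if not name.startswith("R"):
        raise ValueError(f"not a register operand: {operand!r}")
    return int(name[1:]), (int(port) if port else 0)


class ComputationalRegisterFile:
    """Software model: the write and the operation happen in one load."""

    def __init__(self, num_registers: int = 8) -> None:
        self.regs: List[int] = [0] * num_registers

    def load(self, operand: str, value: int) -> int:
        # One load both delivers the value and applies port Y's logic in place.
        reg, port = parse_operand(operand)
        self.regs[reg] = PORT_LOGIC[port](self.regs[reg], value)
        return self.regs[reg]


crf = ComputationalRegisterFile()
crf.load("R2.0", 5)   # plain write
crf.load("R2.1", 3)   # accumulate in the same instruction that delivers the value
```

In this model, a conventional design would need a read, an add performed elsewhere, and a write back; the `R2.1` load folds those into one step, which is the latency saving the listing describes.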

Advantages:

Reconfiguration of Memory and Storage:

Fifteen-microsecond latency for data access, one hundred times lower than CPU-based systems
160 GB/s bandwidth, eight times higher than conventional CPU servers
Enables adaptation to accommodate different index algorithm schemes and different index sizes

Computational Register File:

Given a photo document dataset, processes indexes with four times fewer nodes and sixty-eight times lower latency than a CPU-only system
Handles multiple types of data, including text, images, and videos

Reconfigurable and Programmable Load/Store Unit:

Improves latency of big data applications by a factor of four
Better tradeoff between memory access latency and bandwidth by allowing control of memory scheduling

Stage of Development:

Proof of Concept
Bench Prototype

Fig. 1: The system architecture connects the Intelligence Processing Unit (IPU) to a storage and memory pool through a set of interconnections.
Fig. 2: Comparison of a conventional register file (RF) write process with the in situ logic operation of the computational register file (CRF). In this example, a new data element (“3”) arrives, is written into the computational register file, is compared with every other element, and is then written to the vector.
Fig. 3: The load/store unit includes an operand indicating the ID of the attributes for the load/store request. The programmer sends a request to the load/store unit along with this attribute ID. The unit looks up the corresponding attributes stored in a configuration table according to the ID.
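
The attribute lookup in Fig. 3 can be sketched as a table keyed by attribute ID. This is a minimal illustration, not the actual unit: the four fields are the ones the listing names, but the field types, the example values, and the names `LsuAttributes`, `CONFIG_TABLE`, `split_request`, and `schedule` are assumptions. The coalescing threshold and window are stored but not interpreted here, since the listing does not define their exact semantics.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LsuAttributes:
    request_size: int          # bytes in a single memory request
    coalescing_threshold: int  # stored only; semantics not given in the source
    coalescing_window: int     # stored only; semantics not given in the source
    priority: int              # assumed: larger value is scheduled first


# Hypothetical configuration table filled in by the programmer.
CONFIG_TABLE: Dict[int, LsuAttributes] = {
    0: LsuAttributes(request_size=64, coalescing_threshold=1,
                     coalescing_window=0, priority=0),
    1: LsuAttributes(request_size=128, coalescing_threshold=4,
                     coalescing_window=16, priority=1),
}


def split_request(address: int, nbytes: int, attr_id: int) -> List[Tuple[int, int]]:
    """Cut one load/store into (address, size) memory requests of the configured size."""
    attrs = CONFIG_TABLE[attr_id]  # the lookup by attribute ID
    chunks: List[Tuple[int, int]] = []
    offset = 0
    while offset < nbytes:
        size = min(attrs.request_size, nbytes - offset)
        chunks.append((address + offset, size))
        offset += size
    return chunks


def schedule(pending: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Order (address, attr_id) requests by priority; ties keep arrival order."""
    return sorted(pending, key=lambda req: -CONFIG_TABLE[req[1]].priority)


small = split_request(0, 160, attr_id=0)   # three requests of at most 64 bytes
large = split_request(0, 160, attr_id=1)   # two requests of at most 128 bytes
order = schedule([(0x100, 0), (0x200, 1), (0x300, 0)])
```

The same 160-byte access produces a different number of memory requests depending only on the attribute ID attached to the instruction, which is the kind of per-instruction control over the latency and bandwidth tradeoff that the listing attributes to the programmer.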
Intellectual Property: