The Open Speech Signal Processing Platform Web Portal

(Future Site)

 

The FPGA Deterministic Latency Advantage
The advantage that SoC FPGA devices bring to the open speech platform is the integration of ARM CPUs with the computational fabric of FPGAs, which can now contain floating-point multipliers and adders. This allows existing code libraries to be ported easily to the ARM CPUs, providing a familiar, conventional computing platform for signal processing. The true advantage of these devices for real-time speech and audio signal processing is that deterministic low-latency signal processing is possible in the same device by placing algorithms in the FPGA fabric.

Deterministic low-latency signal processing is impossible with standard computer architectures, which is why FPGAs are used almost exclusively in high-performance real-time digital signal processing (DSP) applications. These two different processing paradigms (ARM CPUs vs FPGA fabric) are both supported in the same SoC FPGA device.

We now contrast these two computational paradigms; the differences are illustrated in figures 1 and 2. We can either use conventional CPUs or create a custom architecture in the FPGA fabric. We first describe conventional CPU processing using the ARM CPUs, which is outlined in the steps below. At each step we note where latency indeterminacy is injected into the conventional CPU processing paradigm; this applies to any CPU+cache architecture.

 

Latency Variability in Conventional CPU Architectures
The following steps outline how the ARM CPUs process data as it arrives.
The CPU is first notified that data is available when the sensor providing the data asserts an interrupt line connected to the CPU.

The CPU must stop what it is currently doing and switch to an interrupt service routine (ISR) to get the data. This involves saving the current state of the CPU by pushing the CPU registers and status registers onto the stack. There is slight indeterminacy in the time it takes to get the data, because the instruction that was underway when the interrupt arrived must run to completion first.

The interrupt service routine must then place data into main memory via load/store operations. Here is the first opportunity for a significant amount of latency indeterminacy to be injected into the process. This latency indeterminacy occurs at two levels.

Any time the CPU reads or writes to main memory, there is an automatic cache operation. Ideally these memory operations result in a cache hit, making for fast memory access. A cache hit typically takes 2-10 clock cycles on the ARM CPU. However, since the cache is much smaller than main memory, there are cache misses.

On a cache miss (first level of indeterminacy), the entire cache line is updated, which involves moving 64 bytes of data even if the request was for only two bytes. Updating an entire cache line from main memory takes ~250 clock cycles (see figure 1).

Furthermore, when accessing DRAM main memory, the memory controller could be just starting a DRAM refresh operation that stalls memory access (second level of indeterminacy). DRAM refresh operations temporarily block memory access and can take up to 30% of a memory’s rated bandwidth (Micron’s rule of thumb for DRAM is that you can get only ~70% of rated memory bandwidth).

Thus any time the ARM CPU accesses main memory, the access times could be any one of the following:

1. 10 clock cycles (cache hit).
2. 250 clock cycles (cache miss).
3. 325 clock cycles (cache miss plus refresh wait).
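To make the spread between these three cases concrete, the following is a minimal sketch that computes the average cost of one memory access. The cycle counts come from the list above; the function names and the two probabilities (cache hit rate and the fraction of misses that also wait on a refresh) are hypothetical parameters chosen for illustration, not measurements of any ARM device.

```python
# Cycle counts taken from the list above.
HIT_CYCLES = 10
MISS_CYCLES = 250
MISS_REFRESH_CYCLES = 325


def expected_access_cycles(hit_rate, refresh_collision_rate):
    """Average clock cycles per memory access.

    hit_rate: fraction of accesses served from the cache (assumed value).
    refresh_collision_rate: fraction of cache misses that also wait on a
        DRAM refresh (assumed value).
    """
    miss_rate = 1.0 - hit_rate
    # Average cost of a miss, weighted by how often a refresh gets in the way.
    miss_cost = ((1.0 - refresh_collision_rate) * MISS_CYCLES
                 + refresh_collision_rate * MISS_REFRESH_CYCLES)
    return hit_rate * HIT_CYCLES + miss_rate * miss_cost


def worst_to_best_ratio():
    """How much slower the slowest case is than the fastest case."""
    return MISS_REFRESH_CYCLES / HIT_CYCLES


if __name__ == "__main__":
    for rate in (0.99, 0.95, 0.90):
        print(rate, expected_access_cycles(rate, 0.1))
    print("worst/best:", worst_to_best_ratio())
```

The average hides the real problem for real-time work: any individual access can still land on the slowest case, which is 32.5 times the cost of a cache hit.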

There is even more latency variability given memory access patterns (random vs bursting) and the use of multiple CPU cores, but this is beyond the scope of our description here. The point is that there are significant latency variations that are unavoidable, even for the best programmers who know the underlying architecture and make an effort to program for performance.

Once the data is placed in main memory, it is then processed by the CPU to implement the DSP algorithm. Again, many processing steps will require memory accesses, and each memory access can again result in significant latency indeterminacy (10 vs 250 vs 325 clock cycles).
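As a rough illustration of how this per-access variability accumulates over an algorithm, the sketch below bounds the total memory time for a given number of accesses. The per-access cycle counts are the ones quoted in the text; the function names, the 1000-access workload, and the 1 GHz CPU clock are assumptions made only for the arithmetic.

```python
def latency_bounds_cycles(n_accesses, best=10, worst=325):
    """Best-case and worst-case total cycles for n_accesses memory accesses.

    best:  cycles for a cache hit (from the text).
    worst: cycles for a cache miss plus refresh wait (from the text).
    """
    return n_accesses * best, n_accesses * worst


def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles * 1e6 / clock_hz


if __name__ == "__main__":
    ASSUMED_CPU_CLOCK_HZ = 1e9   # hypothetical clock rate, for illustration
    ASSUMED_ACCESSES = 1000      # hypothetical workload size

    best, worst = latency_bounds_cycles(ASSUMED_ACCESSES)
    print("best case  (us):", cycles_to_microseconds(best, ASSUMED_CPU_CLOCK_HZ))
    print("worst case (us):", cycles_to_microseconds(worst, ASSUMED_CPU_CLOCK_HZ))
```

Under these assumptions the memory time alone can fall anywhere between roughly 10 and 325 microseconds, and where it lands on a given run is outside the programmer's control.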

It should now be apparent that there are many CPU data processing steps that are latency indeterminate due to hardware effects and beyond the control of software programmers. Also, getting maximum performance out of a CPU+cache system involves significant programming effort because cache effects must be taken into account and cache misses minimized. If a programmer ignores cache effects, the program run time can be orders of magnitude slower if main memory (DRAM) is accessed frequently and the working data set is not in the cache. It should be noted that direct memory access (DMA) controllers can reduce this latency indeterminacy, but they will not eliminate it.

Deterministic Low-Latency Signal Processing in FPGAs
The advantage that FPGAs bring to signal processing is that designs with very low and deterministic latency can be crafted for the algorithms being placed in the FPGA fabric. The FPGA fabric is composed of interconnected blocks of logic elements, multipliers, adders, and memory blocks (see figure 2 for a simple FIR filter illustration). The interconnections between all these blocks are programmable, which is why these devices are called field-programmable gate arrays (FPGAs). Loading a new configuration changes the connections between these blocks in the FPGA fabric. A new configuration creates a completely new architecture that implements a completely different algorithm.

In contrast to the latency indeterminacy that caches bring to CPUs, the processing in the FPGA fabric occurs in lockstep with a system clock* as data moves through the different processing stages. This makes the data processing deterministic: one knows exactly how many clock cycles the processing takes, and this never changes. Processing can also be done with very low latency because the data moves through the computational fabric in far fewer clock cycles, thanks to the massive parallelism of the processing elements in the FPGA fabric, where thousands of multipliers and thousands of adders can run in parallel.
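The lockstep behavior can be illustrated with a small software model of a pipelined FIR filter. This is only a sketch under stated assumptions, not a description of any actual FPGA design: it assumes one register stage after the parallel multipliers and one register stage per level of a binary adder tree, and the names pipeline_depth and run_fir are hypothetical.

```python
from collections import deque


def pipeline_depth(n_taps):
    """Register stages in the assumed layout.

    One stage after the parallel multipliers, plus one stage per level of a
    binary adder tree, i.e. ceil(log2(n_taps)) levels.
    """
    return 1 + (n_taps - 1).bit_length()


def run_fir(coeffs, samples):
    """Clock samples through the model, one sample per clock cycle.

    Returns a list of (input_cycle, output_cycle, value) tuples.
    """
    depth = pipeline_depth(len(coeffs))
    taps = deque([0] * len(coeffs), maxlen=len(coeffs))  # sample shift register
    pipe = deque([None] * depth)                          # pipeline registers
    results = []
    # Keep clocking after the last sample so the pipeline drains.
    for cycle in range(len(samples) + depth):
        leaving = pipe.pop()  # result leaving the last register this cycle
        if leaving is not None:
            input_cycle, value = leaving
            results.append((input_cycle, cycle, value))
        if cycle < len(samples):
            taps.appendleft(samples[cycle])
            # All multiplications happen in parallel in hardware; here we
            # simply compute the sum of products for this cycle.
            value = sum(c * x for c, x in zip(coeffs, taps))
            pipe.appendleft((cycle, value))
        else:
            pipe.appendleft(None)
    return results


if __name__ == "__main__":
    for input_cycle, output_cycle, value in run_fir([1, 2, 3, 4], [1, 0, 0, 0, 5]):
        print(input_cycle, output_cycle, output_cycle - input_cycle, value)
```

In this model every result leaves the pipeline exactly pipeline_depth cycles after its input sample entered, whatever the sample values are. That fixed offset is the property the CPU+cache path cannot guarantee.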

*The system clock speed depends on the FPGA family. It ranges from ~100-200 MHz for the low-cost Cyclone V family to ~300-500 MHz for the midrange Arria 10 family. The recently announced high-end Stratix 10 FPGAs will have a ~1 GHz fabric clock speed.
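Because the cycle count of a design in the fabric is fixed, its latency in time follows directly from the fabric clock. The sketch below converts a cycle count to nanoseconds using the approximate clock ranges quoted in the footnote above; the 3-cycle pipeline in the example and the function names are hypothetical.

```python
# Approximate fabric clock ranges in MHz, as quoted in the footnote above.
FABRIC_CLOCK_MHZ = {
    "Cyclone V": (100, 200),
    "Arria 10": (300, 500),
    "Stratix 10": (1000, 1000),
}


def latency_ns(cycles, clock_mhz):
    """Latency in nanoseconds of a fixed-length pipeline at a given clock."""
    return cycles * 1000.0 / clock_mhz


def latency_range_ns(cycles, family):
    """(fastest, slowest) latency in ns across the family's clock range."""
    slow_mhz, fast_mhz = FABRIC_CLOCK_MHZ[family]
    return latency_ns(cycles, fast_mhz), latency_ns(cycles, slow_mhz)


if __name__ == "__main__":
    HYPOTHETICAL_PIPELINE_CYCLES = 3  # illustrative value only
    for family in FABRIC_CLOCK_MHZ:
        print(family, latency_range_ns(HYPOTHETICAL_PIPELINE_CYCLES, family))
```

The range here reflects the choice of clock speed within a family, not run-to-run variation: once a design is built for a given clock, the latency is a single fixed number.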