Skip to main content

One post tagged with "QuestaSim"

View All Tags

· 6 min read
Abid Arcot

In the realm of digital circuits, implementing mathematical functions efficiently is paramount for various applications, from signal processing to machine learning. In this technical exploration, we delve into the optimization of hardware implementations for one such crucial function: the exponential function (ex)(e^x).

Introduction

The task at hand involves implementing the exponential function for 16-bit fixed-point input values. While numerous methods exist for hardware realization, we opted for a Taylor expansion-based approach due to its simplicity and effectiveness. The 5th Taylor polynomial approximation serves as our foundation, providing a balance between accuracy and computational complexity.

Taylor Expansion of the Exponential Function:

The Taylor expansion of the exponential function exe^x is given by:

ex=n=0xnn!=1+x+x22!+x33!+x44!+x55!+ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \ldots

Baseline Implementation

Our journey begins with the baseline implementation, a fundamental circuit computed through Verilog. This circuit, rooted in the 5th Taylor polynomial, forms the cornerstone for subsequent optimizations. With a clear understanding of its functionality and performance metrics, we embark on the quest for enhancement.

Circuit Diagram:

Baseline Circuit Diagram

Figure 1: The baseline circuit for computing the 5th5^{th} Taylor polynomial approximating exe^x. Where y=((((a5x+a4)x+a3)x+a2)x+a1)x+a0,a0=1,a1=1,a2=12,a3=16,a4=124,a5=1120 y = ((((a5x + a4)x + a3)x + a2)x + a1)x + a0, a_0 = 1, a_1 = 1, a_2 = \frac{1}{2}, a_3 = \frac{1}{6}, a_4 = \frac{1}{24}, a_5 = \frac{1}{120}

The Verilog code for the baseline implementation computes the 5th Taylor polynomial approximation of exe^x. It takes 16-bit fixed-point input values in Q2.14 format and produces 32-bit fixed-point outputs in Q7.25 format. The circuit consists of multipliers and adders arranged in a sequential manner to compute the polynomial terms and accumulate the result. Source code can be found here.

Pipeline Optimization

The first stop on our optimization journey leads us to the realm of pipelining. By introducing pipelines for multipliers and adders, we aim to enhance throughput and mitigate critical path delays. Leveraging registers strategically, we achieve a pipeline latency of 10 cycles, ensuring a steady stream of outputs with minimal overhead. Source code can be found here.

Pipeline Block Diagram:

Pipeline Block Diagram

Figure 2: The pipeline circuit, where purple blocks are registers.

Thorough functional verification validates the efficacy of our pipelined design. Through waveform analysis and meticulous testing, we confirm the expected pipeline latency and ascertain the accuracy of our outputs.

Shared Hardware Design

In the shared hardware design, through the harmonious coordination of multipliers and adders orchestrated by a finite state machine (FSM), we aim to achieve optimal efficiency while maintaining high performance standards. Every input initiates a meticulously synchronized series of actions, resulting in outputs produced at regular intervals of 5 cycles. Source code can be found here.

Shared Hardware and FSM Diagram:

Shared Hardware Design

Figure 3: The shared circuit, where purple blocks are registers.

FSM

Figure 4: FSM for shared circuit.

Comparative Analysis

With our optimized designs in hand, we conduct a comprehensive comparative analysis to gauge their efficacy. From resource utilization to operating frequency and throughput, each design undergoes meticulous scrutiny. The pipeline emerges as the champion in throughput, leveraging its enhanced frequency of operation to deliver unparalleled performance.

Exact vs. Approximated Exponential Function

As we are reimplementing same circuit for baseline, pipeline, and shared, we achieved same accuracy

Graph

Figure 5: Exact vs. Approximated Exponential Function.

Efficiency Metrics:

We calculate efficiency metrics for each design based on resource utilization, operating frequency, throughput, and dynamic power consumption. It's important to note that these metrics are calculated assuming the whole device is filled with maximum replications for a given circuit design, where the maximum DSP slices available are 1518.

MetricBaseline CircuitPipelined CircuitShared HW Circuit
Resources for one circuit36 ALMs + 6 DSPs92 ALMs + 4 DSPs34 ALMs + 2 DSPs
Operating frequency47.7 MHz213.35 MHz146.13 MHz
Critical pathDSP mult-add + 2x DSP mult + LE-based adder + 6x DSP mult + LE-based adderMult1 (2 ALMs) + Addr1 (1 ALM) + Mult2 & y4 (1 DSP)State (1 FF) + multiplier’s mux (1 ALM) + Mult (2 DSPs) + Addr (1 ALM) + y_Q (1 FF)
Cycles per valid output1pipeline latency = 10, 1 cycle for every output after first output.5
Max. # of copies/device253 (i.e. 1518 DSPs/6)379 (i.e. 1518 DSPs/4)759 (i.e. 1518 DSPs/2)
Max. Throughput for a full device (computations/s)12,068.1 Million computations/s (i.e. 253 * 47.7 MHz)80,859.65 Million computations/s (i.e. 379 * 213.35 MHz)22,182.534 Million computations/s (i.e. 759 * 146.13 MHz / 5)
Dynamic power of one circuit21.41 mW18.28 mW7.76 mW
Max. throughput/Watt for a full device2.23 billion computations/Watt (i.e. Max. Throughput/(21.41mW * 253 ))11.67 billion computations/Watt (i.e. Max. Throughput/(18.28 mW * 379 ))3.76 billion computations/Watt (i.e. Max. Throughput/(7.76 mW * 759))

Toggle Rate Analysis:

We also analyze the toggle rates for each design, which indicates how often signals change in the circuit.

Circuit DesignToggle Rate (Million Transitions/s)
Baseline134.086
Pipelined129.790
Shared HW140.558

Conclusion

In conclusion, our exploration into optimizing hardware implementations of the exponential function has yielded significant insights and effective strategies. We've observed that implementing a few terms of Taylor’s expansion leads to some level of inaccuracy, which can be mitigated by incorporating more terms for enhanced precision. Furthermore, while clipping the most significant and least significant bits during multiplication to fit the Q7.25 format introduces some loss of accuracy, widening inputs and outputs to achieve the desired precision before formatting to Q7.25 in the final stage can counteract this effect.

In terms of circuit efficiency, our analysis of toggle rates reveals that the pipelined design boasts the lowest toggle rate. Despite having larger circuitry than the other two designs, its lower toggle rate underscores its greater efficiency.

The pipelined architecture stands out for its ability to achieve the highest throughput, owing to the benefits of pipelining that allow for a higher frequency of operation and output generation. Compared to the baseline and shared designs, the pipelined approach excels in both frequency of operation and frequency of output generation.