
· 6 min read
Abid Arcot

In the realm of digital circuits, implementing mathematical functions efficiently is paramount for various applications, from signal processing to machine learning. In this technical exploration, we delve into the optimization of hardware implementations for one such crucial function: the exponential function $e^x$.

Introduction

The task at hand involves implementing the exponential function for 16-bit fixed-point input values. While numerous methods exist for hardware realization, we opted for a Taylor expansion-based approach due to its simplicity and effectiveness. The 5th Taylor polynomial approximation serves as our foundation, providing a balance between accuracy and computational complexity.

Taylor Expansion of the Exponential Function:

The Taylor expansion of the exponential function $e^x$ is given by:

$$ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \ldots $$

Baseline Implementation

Our journey begins with the baseline implementation, a straightforward circuit described in Verilog. This circuit, rooted in the 5th Taylor polynomial, forms the cornerstone for subsequent optimizations. With a clear understanding of its functionality and performance metrics, we embark on the quest for enhancement.

Circuit Diagram:

Baseline Circuit Diagram

Figure 1: The baseline circuit for computing the $5^{th}$ Taylor polynomial approximating $e^x$, where $y = ((((a_5 x + a_4)x + a_3)x + a_2)x + a_1)x + a_0$, with $a_0 = 1$, $a_1 = 1$, $a_2 = \frac{1}{2}$, $a_3 = \frac{1}{6}$, $a_4 = \frac{1}{24}$, $a_5 = \frac{1}{120}$.

The Verilog code for the baseline implementation computes the 5th Taylor polynomial approximation of $e^x$. It takes 16-bit fixed-point input values in Q2.14 format and produces 32-bit fixed-point outputs in Q7.25 format. The circuit consists of multipliers and adders arranged in a sequential manner to compute the polynomial terms and accumulate the result. Source code can be found here.
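The actual datapath is written in Verilog, but the arithmetic it performs is easy to sketch in software. Below is a minimal C model of the same Horner-form, fixed-point evaluation, assuming the Q2.14 input and Q7.25 output formats described above; coefficient rounding and overflow handling are simplified, and the names are illustrative rather than taken from the project's source.

#include <stdio.h>
#include <stdint.h>

#define Q_IN   14   /* fractional bits of the input  (Q2.14) */
#define Q_OUT  25   /* fractional bits of the output (Q7.25) */

/* a0..a5 = 1, 1, 1/2, 1/6, 1/24, 1/120 stored as Q7.25 constants
 * (plain integer division, so rounding is simplified) */
static const int64_t A[6] = {
    1 << Q_OUT, 1 << Q_OUT, (1 << Q_OUT) / 2,
    (1 << Q_OUT) / 6, (1 << Q_OUT) / 24, (1 << Q_OUT) / 120
};

/* Horner evaluation: y = ((((a5*x + a4)*x + a3)*x + a2)*x + a1)*x + a0 */
int32_t exp_taylor5(int16_t x_q2_14) {
    int64_t acc = A[5];
    for (int i = 4; i >= 0; --i) {
        /* A Q7.25 * Q2.14 product has 39 fractional bits; shifting by 14
         * returns to Q7.25, mirroring the bit clipping done in hardware. */
        acc = ((acc * (int64_t)x_q2_14) >> Q_IN) + A[i];
    }
    return (int32_t)acc;   /* Q7.25 result */
}

int main(void) {
    int16_t x = (int16_t)(0.5 * (1 << Q_IN));   /* x = 0.5 in Q2.14 */
    printf("e^0.5 ~ %f\n", exp_taylor5(x) / (double)(1 << Q_OUT));
    return 0;
}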

Pipeline Optimization

The first stop on our optimization journey leads us to the realm of pipelining. By introducing pipelines for multipliers and adders, we aim to enhance throughput and mitigate critical path delays. Leveraging registers strategically, we achieve a pipeline latency of 10 cycles, ensuring a steady stream of outputs with minimal overhead. Source code can be found here.

Pipeline Block Diagram:

Pipeline Block Diagram

Figure 2: The pipeline circuit, where purple blocks are registers.

Thorough functional verification validates the efficacy of our pipelined design. Through waveform analysis and meticulous testing, we confirm the expected pipeline latency and ascertain the accuracy of our outputs.

Shared Hardware Design

In the shared hardware design, through the harmonious coordination of multipliers and adders orchestrated by a finite state machine (FSM), we aim to achieve optimal efficiency while maintaining high performance standards. Every input initiates a meticulously synchronized series of actions, resulting in outputs produced at regular intervals of 5 cycles. Source code can be found here.
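As a rough mental model of this time-multiplexing (not the project's Verilog or its exact FSM states), the shared datapath behaves like a single multiply-add unit reused for five consecutive cycles per input, which is why a valid output appears only once every 5 cycles:

#include <stdio.h>

/* Illustrative model of the shared datapath: one multiply-add reused for
 * 5 cycles per input. Doubles are used for readability; the real circuit
 * operates on the Q2.14/Q7.25 fixed-point formats described earlier. */
double shared_exp(double x) {
    static const double a[6] = {1.0, 1.0, 1.0/2, 1.0/6, 1.0/24, 1.0/120};
    double acc = a[5];                  /* loaded when a new input arrives */
    for (int cycle = 0; cycle < 5; ++cycle)
        acc = acc * x + a[4 - cycle];   /* the one shared multiply-add     */
    return acc;                         /* valid 5 cycles after the input  */
}

int main(void) {
    printf("e^0.5 ~ %f\n", shared_exp(0.5));
    return 0;
}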

Shared Hardware and FSM Diagram:

Shared Hardware Design

Figure 3: The shared circuit, where purple blocks are registers.

FSM

Figure 4: FSM for shared circuit.

Comparative Analysis

With our optimized designs in hand, we conduct a comprehensive comparative analysis to gauge their efficacy. From resource utilization to operating frequency and throughput, each design undergoes meticulous scrutiny. The pipelined design emerges as the throughput champion, leveraging its higher operating frequency to deliver the best performance of the three.

Exact vs. Approximated Exponential Function

Since the baseline, pipelined, and shared designs all implement the same 5th-order polynomial, they achieve identical accuracy.

Graph

Figure 5: Exact vs. Approximated Exponential Function.

Efficiency Metrics:

We calculate efficiency metrics for each design based on resource utilization, operating frequency, throughput, and dynamic power consumption. It's important to note that these metrics are calculated assuming the whole device is filled with maximum replications for a given circuit design, where the maximum DSP slices available are 1518.

| Metric | Baseline Circuit | Pipelined Circuit | Shared HW Circuit |
| --- | --- | --- | --- |
| Resources for one circuit | 36 ALMs + 6 DSPs | 92 ALMs + 4 DSPs | 34 ALMs + 2 DSPs |
| Operating frequency | 47.7 MHz | 213.35 MHz | 146.13 MHz |
| Critical path | DSP mult-add + 2x DSP mult + LE-based adder + 6x DSP mult + LE-based adder | Mult1 (2 ALMs) + Addr1 (1 ALM) + Mult2 & y4 (1 DSP) | State (1 FF) + multiplier’s mux (1 ALM) + Mult (2 DSPs) + Addr (1 ALM) + y_Q (1 FF) |
| Cycles per valid output | 1 | pipeline latency = 10, 1 cycle for every output after the first | 5 |
| Max. # of copies/device | 253 (i.e. 1518 DSPs / 6) | 379 (i.e. 1518 DSPs / 4) | 759 (i.e. 1518 DSPs / 2) |
| Max. throughput for a full device (computations/s) | 12,068.1 million computations/s (i.e. 253 × 47.7 MHz) | 80,859.65 million computations/s (i.e. 379 × 213.35 MHz) | 22,182.534 million computations/s (i.e. 759 × 146.13 MHz / 5) |
| Dynamic power of one circuit | 21.41 mW | 18.28 mW | 7.76 mW |
| Max. throughput/Watt for a full device | 2.23 billion computations/Watt (i.e. max. throughput / (21.41 mW × 253)) | 11.67 billion computations/Watt (i.e. max. throughput / (18.28 mW × 379)) | 3.76 billion computations/Watt (i.e. max. throughput / (7.76 mW × 759)) |
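For clarity, the throughput rows follow directly from the other table entries. Taking the shared-hardware column as a worked example:

$$ \text{Throughput}_{\text{device}} = \frac{N_{\text{copies}} \times f_{\max}}{\text{cycles per output}} = \frac{759 \times 146.13\ \text{MHz}}{5} \approx 22{,}183\ \text{million computations/s} $$

The efficiency row simply normalizes this throughput by the total dynamic power of the replicated circuits, $N_{\text{copies}} \times P_{\text{dynamic}}$.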

Toggle Rate Analysis:

We also analyze the toggle rates for each design, which indicates how often signals change in the circuit.

| Circuit Design | Toggle Rate (Million Transitions/s) |
| --- | --- |
| Baseline | 134.086 |
| Pipelined | 129.790 |
| Shared HW | 140.558 |

Conclusion

In conclusion, our exploration into optimizing hardware implementations of the exponential function has yielded significant insights and effective strategies. We've observed that implementing a few terms of Taylor’s expansion leads to some level of inaccuracy, which can be mitigated by incorporating more terms for enhanced precision. Furthermore, while clipping the most significant and least significant bits during multiplication to fit the Q7.25 format introduces some loss of accuracy, widening inputs and outputs to achieve the desired precision before formatting to Q7.25 in the final stage can counteract this effect.

In terms of circuit efficiency, our analysis of toggle rates reveals that the pipelined design boasts the lowest toggle rate. Despite having larger circuitry than the other two designs, its lower toggle rate underscores its greater efficiency.

The pipelined architecture stands out for its ability to achieve the highest throughput, owing to the benefits of pipelining that allow for a higher frequency of operation and output generation. Compared to the baseline and shared designs, the pipelined approach excels in both frequency of operation and frequency of output generation.

· 7 min read
Abid Arcot

This blog aims to guide you through the process of programming a Pololu robot to execute a specific set of actions: ascending a ramp, recognizing the summit, changing direction, and safely descending the ramp without the risk of falling off. We will delve into best practices for designing an embedded system, breaking down the process into modeling, designing, and analyzing.

Throughout this journey, we'll navigate the intricacies of implementing line sensing, feedback control, and accelerometer integration to ensure our robot successfully ascends a ramp, executes a turn at the top, and descends without veering off the edge.

Ramp

Background

For this project, we will utilize Lingua Franca, the first reactor-oriented coordination language. It allows you to specify reactive components and compose them. Lingua Franca's semantics eliminate race conditions by construction and provide a sophisticated model of time, including a clear and precise notion of simultaneity. The blog will explain the concepts sufficiently for you to comprehend the program, even if you're not familiar with the language.

To kickstart our project, we'll fork lf-3pi-template, which comes with pico-sdk, facilitating a quick start.

Understanding the Task at Hand

Our primary goal is to program the Pololu robot to perform a specific sequence: climb a ramp, detect the summit, turn around, and descend without falling off the edge. Key elements for this challenge include a ramp with a 15-degree slope, a flat turning surface at the top, a light-colored surface, and dark-colored edges for the robot to detect and avoid.

Ramp

Modeling

Let's initiate our project by modeling it with a finite-state machine and outlining the sensors we'll use.

Sensors Utilized

Line Sensors

Line sensors detect the dark-colored edges and keep the robot from falling off the wedge. Additionally, the accelerometer will provide roll, yaw, and pitch values, while the gyroscope provides the angular velocity needed for precise turns.

Encoders

Encoders play a crucial role in sustaining a consistent speed during both ascents and descents on the hill, effectively powering the wheels in accordance with the terrain.

Accelerometer

Implement feedback control to adjust individual wheel speed periodically, maintaining a near-zero roll measurement. We will leverage the Motors With Feedback reactor to compensate for roll-induced variations in motor power.

Gyroscope

Utilize the gyroscope to calculate angular displacement for accurate turning of the bot.

(Note: For more details about the implementation of the program for sensors, refer to the src directory)

Finite State Machine

Finite State Machine

Our journey begins in a calibrating state, where we set the line sensor threshold. After a 5-second calibration, we transition to the driving state. Subsequently, the robot defaults to moving forward.

While moving forward, the robot may encounter the dark line or reach the end of the wedge. Upon detection, the robot saves the data, tracks back a few centimeters, and enters the LineDetect state. Decisions in this state are based on the saved data, leading to a 20-degree turn in the opposite direction of the detected line or a 180-degree turn if the wedge's end is detected.

Preparing for the Climb

Before delving into the implementation, it's crucial to outline the key requirements for our robot program:

  • Ascend and descend a ramp with a 15-degree slope.
  • Detect dark-colored edges of the ramp using line sensors.
  • Implement feedback motor control using accelerometer and encoders to adjust wheel speeds based on roll and encoder readings.
  • Perform precise turns using gyroscope measurements.

Designing/Implementation

my-3pi/src/HillClimbSolution.lf
target C {
platform: {
name: "rp2040",
board: "pololu_3pi_2040_robot"
},
threading: false,
}

import Display from "lib/Display.lf"
import GyroAngle from "lib/IMU.lf"
import LineSensor from "HillLineDetectSolution.lf"
import Tilt from "lib/Tilt.lf"
import Encoders from "lib/Encoders.lf"
import MotorsWithFeedback from "lib/MotorsWithFeedback.lf";


reactor Robot {
output notify:string // Notify of mode change.
state turn_angle:int
state turn_direction:int //+1 for left direction, -1 for right direction
state lineOnLeft:bool
state lineOnCenter:bool
state lineOnRight:bool
state takenUTurn:bool=false


lineSensor = new LineSensor()
motors = new MotorsWithFeedback();
encoder = new Encoders();

encoder.right -> motors.right;
encoder.left -> motors.left;

tilt = new Tilt()
timer t(0, 10 ms)
timer t2(0, 50 ms)

reaction(startup) -> notify {=
lf_set(notify, "INIT:CALIBRATING");
=}

reaction(t) -> encoder.trigger {=
lf_set(encoder.trigger, true);
=}

initial mode CALIBRATING
{
reaction(lineSensor.left, lineSensor.center, lineSensor.right) -> reset(DRIVING), notify {=
lf_set_mode(DRIVING);
lf_set(notify, "DRIVING");
=}
}

mode DRIVING {

reaction(reset) -> motors.left_speed, motors.right_speed {=
lf_set(motors.left_speed, 0.2); //0.2 m/s
lf_set(motors.right_speed, 0.2);
=}

reaction(t2) -> tilt.trigger{=
lf_set(tilt.trigger, true);
=}

reaction(tilt.x, tilt.y) -> motors.left_speed, motors.right_speed, reset(TURN), notify {=

if(tilt.x->value <= 0 && !self->takenUTurn) // we have reached the top, so perform a U-turn
{
self->turn_angle = 180;
self->turn_direction = 1;//direction doesn't matter
self->takenUTurn = true;
lf_set_mode(TURN);
lf_set(notify, "TURN");
}

lf_set(motors.left_speed, 0.2 * (1.0 - 0.05 * tilt.y->value * ((tilt.x->value >= 0) ? 1.0 : -1.0)));
lf_set(motors.right_speed, 0.2 * (1.0 + 0.05 * tilt.y->value * ((tilt.x->value >= 0) ? 1.0 : -1.0)));

=}

reaction(lineSensor.left, lineSensor.center, lineSensor.right) -> reset(LineDetect), notify {=

self->lineOnLeft = lineSensor.left->value;
self->lineOnCenter = lineSensor.center->value;
self->lineOnRight = lineSensor.right->value;

if(self->lineOnLeft || self->lineOnCenter || self->lineOnRight)
{
lf_set_mode(LineDetect);
lf_set(notify, "LineDetect");
}
=}
}

mode LineDetect {
logical action turn

reaction(reset) -> motors.left_speed, motors.right_speed, turn, reset(DRIVING), notify {=

lf_set(motors.left_speed, -0.2);
lf_set(motors.right_speed, -0.2);

if(self->lineOnCenter)
{
self->turn_angle = 180;
self->turn_direction = 1;//direction doesn't matter

}
else if(self->lineOnLeft)
{
self->turn_angle = 20;
self->turn_direction = -1; //right
}
else if (self->lineOnRight)
{
self->turn_angle = 20;
self->turn_direction = 1; //left
}
else
{
//no line detected
lf_set_mode(DRIVING);
lf_set(notify, "DRIVING");
}

lf_schedule(turn, SEC(0.4));
=}

reaction(turn) -> reset(TURN), notify{=
lf_set_mode(TURN);
lf_set(notify, "TURN");
=}

}

mode TURN {
gyro = new GyroAngle()

reaction(reset) -> motors.left_speed, motors.right_speed {=
lf_set(motors.left_speed, self->turn_direction * -0.2);
lf_set(motors.right_speed,self->turn_direction * 0.2);
=}

reaction(t) -> gyro.trigger {=
lf_set(gyro.trigger, true);
=}

reaction(gyro.z) -> reset(DRIVING), notify{=
if(abs(gyro.z->value) > self->turn_angle)
{ lf_set_mode(DRIVING);
lf_set(notify, "DRIVING");
}
=}
}



}

main reactor {
robot = new Robot()
display = new Display()
robot.notify -> display.line0;
}

The primary component in the above Lingua Franca program is a reactor, functioning like an object to some extent. The reactor can have multiple states, and it provides mechanisms to transition between states based on time or physical/logical actions. It also has interfaces to input and output the data.

Here, in the program, we have two reactors:

  1. Robot
  2. Display

A Robot reactor has a total of 4 states:

  1. CALIBRATING
  2. DRIVING
  3. LINEDETECT
  4. TURN

The Robot reactor operates exclusively in one of its states at any given time. According to the design, the robot defaults to the CALIBRATING state and then shifts to DRIVING after completing the calibration process. The robot remains in the DRIVING state until it encounters obstacles. Upon detecting obstacles, it retraces a few centimeters, transitioning to the LINEDETECT state. In the LINEDETECT state, a decision is made to navigate around the obstacle. After deciding, it proceeds to the TURN state, determining the number of degrees to turn (let's say x degrees) and the direction (clockwise/anti-clockwise) based on information from LINEDETECT during exit.

Once the TURN state executes actions in line with the decision from LINEDETECT, control shifts back to the DRIVING state, with its default behavior of moving forward. Following the initial CALIBRATING state, the robot continually cycles through the DRIVING, LINEDETECT, TURN states to fulfill its designated task.

Finally, the Display reactor receives input from the Robot reactor to showcase obstacle detection, indicating positions such as left, right, and middle.

Analysis

The robot effectively accomplishes its assigned mission, skillfully steering clear of the dark-colored edges. Moreover, the feedback control adeptly mitigates variations in roll, ensuring a consistent speed both uphill and downhill. Once aligned with the ramp, the robot seamlessly follows a straight path, courtesy of the precise adjustments made by the feedback control system. In instances where the robot starts its task atop the dark-colored edges, it promptly detects the situation and makes informed decisions to evade potential issues.

Conclusion

This project seamlessly combines sensor integration, feedback control, and real-world navigation, offering valuable insights into the intricate challenges of robotics and a practical way to apply the concepts learned from embedded systems courses, including Designing, Modeling, and Analysis. Source code can be found at Github. Stay tuned for more exciting technical explorations in our upcoming blogs!


· 9 min read
Abid Arcot

Background: Fully Homomorphic Encryption Unveiled

Before we embark on our comparative analysis of the SEAL and OpenFHE libraries, let's unveil the revolutionary concept of Fully Homomorphic Encryption (FHE). In the realm of cryptography, FHE stands as a beacon of innovation, allowing computations on encrypted data without the need for decryption.

Traditional encryption methods ensure the confidentiality of sensitive information during transmission or storage. However, they often hinder the ability to perform computations on this encrypted data without decrypting it first. FHE, on the other hand, transcends these limitations by enabling operations directly on encrypted data, preserving its confidentiality throughout the computational process.

Introduction

In this technical blog, we delve into the world of FHE and conduct a comparative analysis of two leading open-source libraries: Microsoft SEAL and OpenFHE. Our exploration focuses on the implementation of two widely used FHE schemes, Brakerski-Gentry-Vaikuntanathan (BGV) and Brakerski/Fan-Vercauteren (BFV).

Understanding OpenFHE and SEAL

OpenFHE: A Glimpse into Versatility

OpenFHE, powered by Duality Technologies, emerges as a versatile open-source homomorphic encryption library. Its popularity stems from its support for a diverse range of FHE schemes, including BFV, BGV, CKKS, FHEW, and TFHE. Additionally, OpenFHE incorporates multiparty extensions for threshold FHE and proxy re-encryption, providing a comprehensive toolkit for various cryptographic applications.

Explore OpenFHE: OpenFHE GitHub Repository

To download OpenFHE binaries, visit the OpenFHE GitHub Releases section.

SEAL: Harnessing Microsoft's Expertise

SEAL, an open-source library developed by Microsoft, is designed to implement the most popular FHE schemes in use today: BGV, BFV, and CKKS. Leveraging Microsoft's expertise in cryptography, SEAL offers a robust platform for secure computations over encrypted data.

Explore SEAL: SEAL GitHub Repository

To download SEAL binaries, visit the SEAL GitHub Releases section.

Homomorphic Operations Data and Analysis

For our comparative analysis, we leverage the resources available in the GitHub repository. The repository contains C programs and binaries, facilitating a more detailed and resourceful exploration.

The provided binaries include:

  1. SEAL BGV
  2. SEAL BFV
  3. OpenFHE BGV
  4. OpenFHE BFV

Each binary executes homomorphic addition and multiplication operations on encrypted data, producing runtime and result outputs. These binaries serve as the foundation for our exploration into the efficiency and performance of FHE schemes.

Data Generation and Analysis

To initiate our analysis, we generate at least ten unique input combinations for both multiplication and addition, with each runtime averaged over ten iterations. The resulting tables showcase the runtimes of these operations for each scheme within the SEAL and OpenFHE libraries.

Pre-compiled binaries for the above libraries are provided in the GitHub repository. To build your own binaries, refer to the appendix. One can run the binaries as follows.

./openFHE_BFV 1 1 2

example-img

Sample output when running the provided binary

Multiplication:

| Input1 | Input2 | Input3 | OpenFHE_BFV (ms) | OpenFHE_BGV (ms) | SEAL_BFV (ms) | SEAL_BGV (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 52 | 26 | 118 | 43 |
| 4 | 5 | 6 | 54 | 26 | 115 | 47 |
| 7 | 8 | 9 | 55 | 28 | 118 | 44 |
| 10 | 11 | 12 | 56 | 29 | 121 | 46 |
| 13 | 14 | 15 | 56 | 26 | 124 | 44 |
| 16 | 17 | 18 | 58 | 28 | 120 | 50 |
| 19 | 20 | 21 | 57 | 28 | 114 | 48 |
| 22 | 23 | 24 | 52 | 28 | 120 | 46 |
| 25 | 26 | 27 | 55 | 28 | 121 | 45 |
| 28 | 29 | 30 | 56 | 28 | 120 | 45 |

Visualization for Enhanced Comprehension

mult graph

X-axis: input combination, Y-axis: overhead in milliseconds

Addition:

| Input1 | Input2 | Input3 | OpenFHE_BFV (ms) | OpenFHE_BGV (ms) | SEAL_BFV (ms) | SEAL_BGV (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 0 | 0 | 0 | 1 |
| 4 | 5 | 6 | 0 | 0 | 0 | 1 |
| 7 | 8 | 9 | 0 | 0 | 0 | 1 |
| 10 | 11 | 12 | 0 | 0 | 0 | 1 |
| 13 | 14 | 15 | 0 | 0 | 0 | 1 |
| 16 | 17 | 18 | 0 | 1 | 0 | 1 |
| 19 | 20 | 21 | 0 | 0 | 1 | 1 |
| 22 | 23 | 24 | 0 | 0 | 0 | 1 |
| 25 | 26 | 27 | 0 | 0 | 0 | 1 |
| 28 | 29 | 30 | 0 | 0 | 0 | 1 |

Visualization for Enhanced Comprehension

addition graph

X-axis: input combination, Y-axis: overhead in milliseconds

Comparing the overheads when FHE is not used

The GitHub repository also includes C programs that perform identical multiplication and addition operations on 64-bit integer values. The execution times are compared with FHE operations, revealing the overheads associated with FHE execution.

binaries/c_program.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <inttypes.h>

int main(int argc, char *argv[]) {
    int64_t num1, num2, num3;

    if (argc >= 4) {
        num1 = strtoll(argv[1], NULL, 10);
        num2 = strtoll(argv[2], NULL, 10);
        num3 = strtoll(argv[3], NULL, 10);
    } else {
        printf("usage: ./c_program <num1> <num2> <num3> \n");
        return 1;
    }

    int64_t add, mul;
    clock_t start, end;
    double add_time, mul_time;

    // Perform addition
    start = clock();
    add = num1 + num2 + num3;
    end = clock();
    add_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000; // time in milliseconds

    // Perform multiplication
    start = clock();
    mul = num1 * num2 * num3;
    end = clock();
    mul_time = ((double) (end - start)) / CLOCKS_PER_SEC * 1000; // time in milliseconds

    // Output the measured times
    printf("mul_time: %lf add_time: %lf\n", mul_time, add_time);

    return 0;
}

Multiplication:

| Input1 | Input2 | Input3 | non_FHE (ms) | OpenFHE_BFV (ms) | OpenFHE_BGV (ms) | SEAL_BFV (ms) | SEAL_BGV (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 0.0009 | 52 | 26 | 118 | 43 |
| 4 | 5 | 6 | 0.001 | 54 | 26 | 115 | 47 |
| 7 | 8 | 9 | 0.001 | 55 | 28 | 118 | 44 |
| 10 | 11 | 12 | 0.0009 | 56 | 29 | 121 | 46 |
| 13 | 14 | 15 | 0.001 | 56 | 26 | 124 | 44 |
| 16 | 17 | 18 | 0.0017 | 58 | 28 | 120 | 50 |
| 19 | 20 | 21 | 0.0005 | 57 | 28 | 114 | 48 |
| 22 | 23 | 24 | 0.0003 | 52 | 28 | 120 | 46 |
| 25 | 26 | 27 | 0.0006 | 55 | 28 | 121 | 45 |
| 28 | 29 | 30 | 0.0011 | 56 | 28 | 120 | 45 |

Average Overheads:

| Overheads (ms) | non-FHE | OpenFHE_BFV | OpenFHE_BGV | SEAL_BFV | SEAL_BGV |
| --- | --- | --- | --- | --- | --- |
| Multiplication | 0.0009 | 55.0991 | 27.4991 | 119.0991 | 45.7991 |
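To put these averages in perspective (using the table's own numbers; the native timings sit at the limit of clock() resolution, so treat these as order-of-magnitude figures), the slowdown of encrypted multiplication relative to the plain 64-bit version is roughly:

$$ \frac{27.4991\ \text{ms}}{0.0009\ \text{ms}} \approx 3\times 10^{4} \ \text{(OpenFHE BGV)}, \qquad \frac{119.0991\ \text{ms}}{0.0009\ \text{ms}} \approx 1.3\times 10^{5} \ \text{(SEAL BFV)} $$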

Addition:

| Input1 | Input2 | Input3 | non_FHE (ms) | OpenFHE_BFV (ms) | OpenFHE_BGV (ms) | SEAL_BFV (ms) | SEAL_BGV (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 3 | 0.0014 | 0 | 0 | 0 | 1 |
| 4 | 5 | 6 | 0.0017 | 0 | 0 | 0 | 1 |
| 7 | 8 | 9 | 0.001 | 0 | 0 | 0 | 1 |
| 10 | 11 | 12 | 0.0012 | 0 | 0 | 0 | 1 |
| 13 | 14 | 15 | 0.001 | 0 | 0 | 0 | 1 |
| 16 | 17 | 18 | 0.009 | 0 | 1 | 0 | 1 |
| 19 | 20 | 21 | 0.0009 | 0 | 0 | 1 | 1 |
| 22 | 23 | 24 | 0.0008 | 0 | 0 | 0 | 1 |
| 25 | 26 | 27 | 0.0008 | 0 | 0 | 0 | 1 |
| 28 | 29 | 30 | 0.0014 | 0 | 0 | 0 | 1 |

Average Overheads:

| Overheads (ms) | non-FHE | OpenFHE_BFV | OpenFHE_BGV | SEAL_BFV | SEAL_BGV |
| --- | --- | --- | --- | --- | --- |
| Addition | 0.00192 | 0.00192 | 0 | 0.1 | 0.1 |

Conclusion

Our analysis brings forth a compelling narrative. OpenFHE demonstrates superior implementation of FHE schemes, with OpenFHE_BGV emerging as the top performer, closely followed by OpenFHE_BFV. SEAL, while robust, exhibits higher runtimes compared to OpenFHE, suggesting that OpenFHE's implementation is more efficient.

Explore Further

To delve deeper into the world of homomorphic encryption and contribute to its advancements, consider exploring the GitHub repositories for OpenFHE and SEAL:

Download the respective binaries from the library repositories and engage in the evolving landscape of homomorphic encryption.

This exploration not only enhances our understanding of FHE but also contributes valuable insights to the ongoing development of secure and privacy-preserving computational technologies.

Appendix

This section provides details on the build process for the two libraries. The instructions are presented to facilitate independent experimentation with Fully Homomorphic Encryption (FHE) implementations.

Prerequisites

To execute this project, ensure that the following dependencies are installed:

  • CMake
  • G++
  • GMP
  • Clang

OpenFHE

Navigate to the openfhe directory and follow these steps to set up the project:

mkdir build
cd build
cmake ..
make
make install
make testall

After modifying the code, compile test programs using the following commands:

make added_bgv
make added_bfv

To execute the examples, run the binaries as follows:

bin/examples/pke/added_bgv
bin/examples/pke/added_bfv

Additionally, new files created will be automatically linked during the make process.

SEAL

Set up this library with the following instructions:

cmake -S . -B build -DSEAL_BUILD_EXAMPLES=ON
cmake --build build
sudo cmake --install build

To run the compiled work, locate the binary at build/bin/sealexamples.

· 5 min read
Abid Arcot

Background: Understanding Physical Unclonable Functions (PUFs)

What are PUFs?

Physical Unclonable Functions (PUFs) are hardware-based security primitives that leverage the inherent manufacturing variations present in semiconductor devices. These variations, often microscopic and uncontrollable during fabrication, result in unique physical fingerprints for each device. PUFs exploit these variations to generate responses that are practically impossible to replicate, providing a reliable means of device authentication.

Importance in Hardware Security

PUFs play a crucial role in enhancing hardware security. Unlike traditional cryptographic methods that rely on stored secret keys, PUFs derive their security from the physical characteristics of the device. This makes PUFs resistant to various attacks, including invasive attacks attempting to clone or tamper with the device.

Applications of PUFs

  • Device Authentication: PUFs can be used to authenticate electronic devices by generating unique identifiers based on their physical characteristics.
  • Secure Key Generation: PUFs serve as a source of entropy for generating cryptographic keys, ensuring a high level of randomness and security.
  • Anti-Counterfeiting: PUFs help combat counterfeiting by providing a reliable way to verify the authenticity of electronic components.

In the subsequent sections, we will delve into the implementation details of Arbiter PUF and Ring Oscillator (RO) PUF, highlighting their design principles, testing methodologies, and achieved results.

Introduction

In the realm of hardware security and trust, the implementation of Physical Unclonable Functions (PUFs) stands as a formidable solution. This blog explores the design and implementation of two distinct PUF variants: Arbiter PUF and Ring Oscillator (RO) PUF. The source code and project details can be found on GitHub.

Arbiter PUF: Harnessing Delay Differences

Design Overview

The Arbiter PUF relies on delay differences in multiplexers to generate unique responses based on challenge bits. Leveraging Verilog and ModelSim, we crafted a robust PUF architecture capable of handling a 4-bit challenge with 32-bit responses.

Implementation Details

To simulate the Arbiter PUF, we utilized the provided mux module with varying propagation delays. The 4-bit delay values for each multiplexer were parameterized using the DELAY parameter. The resulting PUF demonstrated an impressive single-chip Hamming distance of 49.10%, affirming its high-quality output.
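To make the additive-delay intuition concrete, here is a small behavioral C model of a 4-stage arbiter PUF. This is not the project's Verilog: the delay values are arbitrary stand-ins for the randomized DELAY parameters, and only a single response bit is modeled; roughly speaking, the 32-bit responses in the table further below correspond to many such bits with independently randomized delays.

#include <stdio.h>
#include <stdint.h>

#define N_STAGES 4   /* one MUX stage per challenge bit */

/* Per-stage propagation delays (arbitrary units) for the top and bottom
 * paths, [0] when the stage is straight and [1] when it is crossed.
 * In silicon these come from process variation; here they are made up. */
static const double d_top[N_STAGES][2] = {{1.02, 0.98}, {1.01, 1.03}, {0.97, 1.00}, {1.04, 0.99}};
static const double d_bot[N_STAGES][2] = {{0.99, 1.01}, {1.02, 0.98}, {1.03, 0.97}, {1.00, 1.02}};

/* A rising edge races down two MUX chains; each challenge bit selects
 * straight (0) or crossed (1) routing, and the arbiter latch outputs 1
 * if the top edge arrives first. */
int arbiter_puf_bit(uint8_t challenge) {
    double t_top = 0.0, t_bot = 0.0;
    for (int i = 0; i < N_STAGES; ++i) {
        int c = (challenge >> i) & 1;
        double new_top = (c ? t_bot : t_top) + d_top[i][c];
        double new_bot = (c ? t_top : t_bot) + d_bot[i][c];
        t_top = new_top;
        t_bot = new_bot;
    }
    return t_top < t_bot;
}

int main(void) {
    for (unsigned ch = 0; ch < (1u << N_STAGES); ++ch)
        printf("challenge %2u -> response bit %d\n", ch, arbiter_puf_bit((uint8_t)ch));
    return 0;
}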

Sub-Module Hierarchy and Design Decisions

Our design incorporated additional modules for a single response bit and a single round of the Arbiter PUF. The sub-module hierarchy was carefully crafted for modularity and scalability. The report details the design decisions made during the implementation.

components

single-response-puf

arbiter-puf

Verification and Testing

Ensuring the reliability of our PUF, we developed a comprehensive test bench for every module. The ModelSim GUI proved instrumental in observing signal waveforms and validating the functionality of the Arbiter PUF. The following table represents challenge-response pairs generated from random delay parameter bits.

| Challenge | Response |
| --- | --- |
| 0000 | 11010110000010111010110011001011 |
| 0001 | 11011110001010010111110111011011 |
| 0010 | 01111111101000000101101101010010 |
| 0011 | 01110111010001011110101001010010 |
| 0100 | 00110101001001000101001101010000 |
| 0101 | 10110101000101001101001101010100 |
| 0110 | 11110100000001111011010101010001 |
| 0111 | 11110000000001000011110101010001 |
| 1000 | 00110001100101000101001100110100 |
| 1001 | 10100101000101111101001110110100 |
| 1010 | 10000000000111111011010110110101 |
| 1011 | 11011000000101110011010110110101 |
| 1100 | 11001100010010111010110010101011 |
| 1101 | 11011000100010110111110110101011 |
| 1110 | 00011111111110010101101110110010 |
| 1111 | 00000111010110011110101010100110 |


Ring Oscillator PUF: Unleashing FPGA Potential

RO PUF Architecture

Shifting focus to the RO PUF, our implementation targeted a Cyclone V GX FPGA. This variant employed multiple ring oscillators, each with a unique frequency due to fabrication variations. Challenge-Response Pairs (CRPs) were generated by racing counters clocked by different RO outputs. We implemented an RO PUF capable of producing 8-bit challenge / 4-bit response pairs. For a single 8-bit challenge / 1-bit response slice, shown below with N = 16, the 4 MSBs of the challenge select the ring oscillator for the first MUX and the remaining 4 LSBs select the oscillator for the second MUX.

ro-puf

2*log2(N)-bit Challenge - 1-bit Response RO PUF
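Purely as an illustration of the racing-counter idea described above (the frequencies and names here are invented, not measured from the FPGA), one response bit can be modeled as:

#include <stdio.h>
#include <stdint.h>

#define N_RO 16

/* Made-up per-oscillator frequencies standing in for fabrication variation. */
static const double ro_freq_mhz[N_RO] = {
    201.3, 199.8, 202.7, 200.4, 198.9, 203.1, 200.0, 201.9,
    199.2, 202.2, 200.8, 198.5, 203.6, 199.5, 201.1, 200.6
};

/* The 4 MSBs of the challenge pick the oscillator for the first MUX, the
 * 4 LSBs pick the one for the second MUX; each clocks a counter for a fixed
 * evaluation window, and the response bit says which counter finished higher. */
int ro_puf_bit(uint8_t challenge, double window_us) {
    unsigned sel_a = (challenge >> 4) & 0xF;
    unsigned sel_b = challenge & 0xF;
    uint32_t count_a = (uint32_t)(ro_freq_mhz[sel_a] * window_us);
    uint32_t count_b = (uint32_t)(ro_freq_mhz[sel_b] * window_us);
    return count_a > count_b;
}

int main(void) {
    printf("challenge 0x72 -> %d\n", ro_puf_bit(0x72, 100.0));
    return 0;
}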

Synthesis and FPGA Integration

We utilized Quartus Prime to synthesize the Verilog design into an FPGA bit-stream. The Chip Planner tool facilitated the careful placement of RO modules within a single LAB Cell, minimizing propagation delays and ensuring optimal performance.

Results and Quality Assessment

The RO PUF exhibited a single-chip Hamming distance of 41.30% for 8-bit challenge / 4-bit response pairs, showcasing its robustness, while utilizing 192 ALMs. The following table represents challenge-response pairs generated from the FPGA.

| Challenge | FPGA Response |
| --- | --- |
| 00000000 | 1111 |
| 10001001 | 0101 |
| 01110011 | 1010 |
| 00001101 | 0011 |
| 11100101 | 0111 |
| 00110001 | 0101 |
| 01100110 | 1111 |
| 11000110 | 0111 |
| 11001000 | 1110 |
| 10000111 | 1010 |
| 00000011 | 1010 |
| 11110111 | 0101 |
| 10000001 | 0101 |
| 10111011 | 1111 |
| 00000001 | 0010 |
| 10100110 | 0001 |
| 11101011 | 1111 |
| 00010110 | 0011 |
| 10011111 | 1000 |
| 01000000 | 0101 |
| 00111100 | 0101 |
| 00101111 | 1010 |
| 01111100 | 1010 |
| 01010101 | 1111 |

Conclusion

This technical blog provides a comprehensive insight into the implementation of Arbiter PUF and RO PUF. The GitHub repository contains the complete source code and project files for further exploration. The achieved Hamming distances underscore the effectiveness of these PUF variants in generating unique and secure responses.

For detailed code snippets, project files, and a deeper dive into the implementation, visit our GitHub repository.

· 3 min read
Abid Arcot

Introduction

In the fast-evolving landscape of mobile computing, the convergence of edge devices and advanced image classification techniques holds immense promise. This blog documents a compelling project undertaken within the purview of CSE 535 Mobile Computing, spotlighting the seamless integration of server-side processes for image classification on edge devices. Through a meticulous blend of Android application development, convolutional neural networks (CNNs), and server integration, this endeavor illuminates the transformative potential of server integration in enhancing mobile computing capabilities.

Project Overview

This project aims to utilize an Android application to divide captured images into parts for classification on edge devices such as mobiles and laptops. Leveraging a CNN model trained on the MNIST dataset, the goal is to accurately categorize handwritten digits and organize them into respective folders based on their content.

Overview-img

Implementation Details

Building the Android Application

  • MainActivity.java and ImageCaptureActivity.java formed the backbone of the application, augmented by XML layout files to craft a seamless user interface.
  • Image division into four parts, facilitated by bitmap manipulation, enabled efficient processing and distribution to client edge devices.
main-page
capture-page
addition-page
view-1, view-2, view-3

Dividing the Image

  • Employing bitmap division techniques, captured images were sliced into manageable components for parallel processing.
  • Each edge device, acting as a server, received a distinct image part for classification, optimizing resource utilization and minimizing latency.
partion-preview
Image samples after division
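The actual split is done with Android bitmap operations in the repository; as a language-neutral illustration of the index arithmetic behind the four-way split shown above, here is a small C sketch over a raw grayscale buffer (names and layout are assumptions, not the app's code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Copy one quadrant (0 = top-left, 1 = top-right, 2 = bottom-left,
 * 3 = bottom-right) of a w x h grayscale image into its own buffer,
 * which could then be sent to one edge server for classification. */
uint8_t *extract_quadrant(const uint8_t *img, int w, int h, int quadrant) {
    int qw = w / 2, qh = h / 2;
    int x0 = (quadrant % 2) * qw;   /* left or right half */
    int y0 = (quadrant / 2) * qh;   /* top or bottom half */
    uint8_t *out = malloc((size_t)qw * qh);
    if (!out) return NULL;
    for (int y = 0; y < qh; ++y)
        memcpy(out + (size_t)y * qw, img + (size_t)(y0 + y) * w + x0, (size_t)qw);
    return out;                     /* caller frees */
}

int main(void) {
    uint8_t img[4 * 4] = {
         1,  2,  3,  4,
         5,  6,  7,  8,
         9, 10, 11, 12,
        13, 14, 15, 16
    };
    uint8_t *q = extract_quadrant(img, 4, 4, 3);   /* bottom-right: 11 12 15 16 */
    printf("%d %d %d %d\n", q[0], q[1], q[2], q[3]);
    free(q);
    return 0;
}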

Deep Learning Model

  • A robust CNN model, with an accuracy of 80.34% on the MNIST dataset, has been employed for the image classification efforts.
  • We have used a CNN model with 1 input layer, 2 hidden layers, and 1 output layer, using the ReLU activation function for the hidden layers and softmax for the output layer.

Server Integration

  • Flask servers, equipped with CNN models, awaited image parts for classification, seamlessly interfacing with the main mobile application.
  • Retrofit and OKHTTP3 libraries facilitated smooth communication between the mobile and server edge devices, ensuring swift data transmission and response retrieval.
  • Upon receiving predictions, the mobile application intelligently stored original images in designated folders based on classification confidence, exemplifying the project's end-to-end integration process.
running-servers
Servers running on different edge devices connected over local network

Results

The following are the key aspects of the project that we have implemented and achieved:

  • Launching and navigation through the Android application's user interface.
  • Capturing images using the device's camera and initiating the image classification process.
  • Seamless communication between the mobile application and server edge devices.
  • Real-time classification of handwritten digits and organization into respective folders based on classification results received from offloaded servers.

Conclusion

In summary, the integration of server-side processes for image classification on edge devices represents a paradigm shift in mobile computing. By harnessing the collaborative power of Android application development, deep learning, and server integration, this project underscores the transformative potential of mobile computing in real-world applications.

Source Code

Explore the source code of this project on GitHub: GitHub Repository