## Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs

Padmini Nagaraj University of California, Berkeley, Distributed Mentor Program, Researcher

**Summer 2004** 

Professor Elaheh Bozorgzadeh University of California, Irvine, Distributed Mentor Program, Mentor



# Reconfiguration Overhead

- Reconfiguration delay is a critical cost in dynamically reconfigurable architectures when reconfiguration is exploited at runtime.
- Project: study the trade-off between reconfiguration delay and the performance of a task implemented on an FPGA device.
- Reconfiguration delay is highly correlated with the physical layout of the implementation.
- On Xilinx devices, reconfiguration proceeds column by column.
- The number of columns occupied by the design's layout is therefore highly correlated with reconfiguration delay.
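The trade-off described above can be sketched with a simple linear model. This is only an illustration under stated assumptions, not the method used in the study: the function names, the per-column reconfiguration time, and the clock periods and cycle counts below are all hypothetical values chosen to show how the comparison works.

```python
def reconfig_delay_s(num_columns, seconds_per_column):
    """Reconfiguration delay, assuming it grows linearly with the column count.

    seconds_per_column is a hypothetical device-dependent constant.
    """
    if num_columns < 0 or seconds_per_column < 0:
        raise ValueError("column count and per-column time must be non-negative")
    return num_columns * seconds_per_column


def total_task_time_s(num_columns, seconds_per_column, clock_period_s, num_cycles):
    """Time to reconfigure the task's columns and then run it for num_cycles."""
    return reconfig_delay_s(num_columns, seconds_per_column) + num_cycles * clock_period_s


# Hypothetical numbers: a narrow layout with a slower clock
# versus a wider layout with a faster clock.
SECONDS_PER_COLUMN = 50e-6  # assumed, not a measured value

for cycles in (1_000, 1_000_000):
    narrow = total_task_time_s(10, SECONDS_PER_COLUMN, 8e-9, cycles)
    wide = total_task_time_s(14, SECONDS_PER_COLUMN, 6e-9, cycles)
    winner = "narrow" if narrow < wide else "wide"
    print(f"{cycles} cycles: narrow={narrow:.6f}s wide={wide:.6f}s, faster: {winner}")
```

Under these assumed numbers, the narrow layout wins for short-running tasks because reconfiguration dominates, while the wider layout wins once the task runs long enough for its shorter clock period to pay back the extra columns.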





# **Experimental Analysis**

- Metrics used (reported by the Xilinx place-and-route tools):
  - Number of CLB columns the design is constrained to
  - Maximum clock frequency
  - Maximum pin delay
  - Average delay of the 10 worst nets
- Applications:
  - Matrix Multiply
  - Fast Fourier Transform
  - 2-D Discrete Cosine Transform
  - JPEG
  - Others: CORDIC, Multiply Accumulator, Comb Filter, etc.

## Experimental Data: Matrix Multiplier











Matrix Multiplier constrained at 12 columns



## Experimental Data: Fast Fourier Transform








Padmini Nagaraj - minar@ocf.berkeley.edu

## Experimental Data: 2-D Discrete Cosine Transform







2DCT constrained at 28 columns


## Experimental Data

| Application | Minimum Number of CLB Columns | Minimum Clock Period (s) | Maximum Clock Frequency (Hz) | Max Pin Delay (s) | Average Delay of 10 Worst Nets (s) |
|----------------------------|----------------------------------------|-------------------------|----------------------------|---------------|-----------------------|
| FFT 256                    | 20                                     | 7.571E-09               | 1.321E+08                  | 5.228E-09     | 3.702E-09             |
| FFT                        | 16                                     | 1.053E-08               | 9.501E+07                  | 6.711E-09     | 5.617E-09             |
| 2-D Disc. Cosine Transform | 14                                     | 6.923E-09               | 1.444E+08                  | 4.040E-09     | 3.382E-09             |
| FFT 1024                   | 12                                     | 9.312E-09               | 1.074E+08                  | 5.462E-09     | 4.724E-09             |
| Matrix Multiplier          | 10                                     | 6.466E-09               | 1.547E+08                  | 4.235E-09     | 3.567E-09             |
| CORDIC                     | 4                                      | 8.453E-09               | 1.183E+08                  | 2.876E-09     | 2.288E-09             |
| Digital Down Converter     | 4                                      | 8.373E-09               | 1.194E+08                  | 3.108E-09     | 2.377E-09             |
| 1-D Disc. Cosine Transform | 2                                      | 4.857E-09               | 2.059E+08                  | 2.835E-09     | 2.360E-09             |
| Cascaded Int. Comb Filter  | 2                                      | 3.380E-09               | 2.959E+08                  | 1.461E-09     | 1.009E-09             |
| Multiply Accumulator       | 2                                      | 5.443E-09               | 1.837E+08                  | 3.060E-09     | 2.388E-09             |
| Sine/Cosine Look Up Table  | 2                                      | 0.000E+00               | 0.000E+00                  | 1.677E-09     | 1.120E-09             |
| Direct Digital Synthesizer | 2                                      | 4.532E-09               | 2.207E+08                  | 1.810E-09     | 1.233E-09             |
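The clock columns in the table are related by frequency = 1 / period, which also confirms that the periods are in seconds and the frequencies in hertz. The following minimal sketch checks that relation on a few rows copied from the table; the helper names are hypothetical and the tolerance is an assumption that allows for the four-digit rounding of the reported values.

```python
# (minimum clock period in s, reported maximum clock frequency in Hz),
# copied from the table above.
ROWS = {
    "FFT 256": (7.571e-09, 1.321e+08),
    "FFT": (1.053e-08, 9.501e+07),
    "Matrix Multiplier": (6.466e-09, 1.547e+08),
    "Cascaded Int. Comb Filter": (3.380e-09, 2.959e+08),
}


def max_frequency_hz(min_period_s):
    """Return 1 / period, or None when no positive period is reported."""
    if min_period_s <= 0:
        return None
    return 1.0 / min_period_s


def relative_error(computed, reported):
    """Relative difference between a computed and a reported value."""
    return abs(computed - reported) / reported


for name, (period_s, reported_hz) in ROWS.items():
    computed_hz = max_frequency_hz(period_s)
    err = relative_error(computed_hz, reported_hz)
    print(f"{name}: computed {computed_hz:.4e} Hz, reported {reported_hz:.4e} Hz, "
          f"relative error {err:.1e}")
```

The Sine/Cosine Look Up Table row reports zero for both clock columns, so `max_frequency_hz` returns `None` for it rather than dividing by zero.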





## Conclusion

- Studied the trade-off between reconfiguration delay and performance when implementing applications on an FPGA device.
- Compared performance across different layout areas for each implementation.
- Results show the following:
  - In several cases, relaxing the area constraint lets the tool improve performance; in other cases it does not, for the following reasons:
    - The application is I/O dominated.
    - FPGA CAD tools are not yet mature enough to exploit a small area for better performance.