Reconfiguration Overhead in Dynamic Task-Based Implementations on FPGAs

Padmini Nagaraj, UCB, Distributed Mentor Program, Researcher

**Summer 2004**

Professor Elaheh Bozorgzadeh, UCI, Distributed Mentor Program, Mentor

# Outline

- I. Introduction
- II. Project Description
- III. Example Application: Matrix Multiplier
- IV. Experimental Data
  - A. Matrix Multiplier
  - B. Fast Fourier Transform
  - C. 2-D Discrete Cosine Transform
  - D. Multiple Applications
- V. Real World Application: JPEG
- VI. Conclusion

# Introduction

**GOAL:** Application configuration time vs. performance time.

- Number of CLB columns
- Application clock frequency
- Several small to large independent applications
- Real world example: JPEG
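The trade-off named in the goal can be sketched numerically. The model below is an illustration only: it assumes configuration time grows linearly with the number of CLB columns and that run time is a cycle count divided by the application clock frequency. The function name, the per-column configuration time, and the cycle count are hypothetical and do not come from the experiments.

```python
def reconfiguration_overhead(clb_columns, seconds_per_column, cycles, clock_hz):
    """Return (config_time, exec_time, overhead_fraction) for one task.

    Assumption: configuration time is linear in the number of CLB columns.
    """
    config_time = clb_columns * seconds_per_column
    exec_time = cycles / clock_hz
    overhead = config_time / (config_time + exec_time)
    return config_time, exec_time, overhead


# Hypothetical numbers, chosen only to exercise the formula.
config_time, exec_time, overhead = reconfiguration_overhead(
    clb_columns=10, seconds_per_column=100e-6, cycles=1_000_000, clock_hz=100e6
)
print(f"config {config_time:.3e} s, exec {exec_time:.3e} s, overhead {overhead:.1%}")
```

Under this model a task that occupies fewer columns or runs for more cycles spends a smaller share of its time on reconfiguration.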

# Example Application: Matrix Multiplier

8 x 8 Matrix Multiplier

Needs lots of data! Options for supplying it:

a.) BRAMs: interested in seeing effects independent of other chip resources.

b.) Lots of I/O pins: okay.

c.) Neither: really slow! Too much time reading inputs.

# Example Application: Matrix Multiplier (cont...)

1.) Write code

2.) Simulate (testbench)

3.) Synthesis

4.) Place and route (constrain time and columns)
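For the simulation step, a testbench needs known-good outputs to compare against. One common approach, sketched here as an assumption rather than as the flow actually used, is a small software reference model of the 8 x 8 multiplier. The 8-bit unsigned operand width and the helper names are hypothetical.

```python
import random

N = 8  # matrix dimension from the slide


def matmul_reference(a, b):
    """Plain triple-loop product of two N x N integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def random_matrix(rng, bits=8):
    """N x N matrix of unsigned values; the 8-bit width is an assumption."""
    return [[rng.randrange(1 << bits) for _ in range(N)] for _ in range(N)]


rng = random.Random(0)  # fixed seed so the stimulus is repeatable
a, b = random_matrix(rng), random_matrix(rng)
expected = matmul_reference(a, b)  # values a testbench could check against

# Sanity check of the model itself: multiplying by the identity returns a.
identity = [[int(i == j) for j in range(N)] for i in range(N)]
print(matmul_reference(a, identity) == a)
```

The matrices `a`, `b`, and `expected` could then be written out in whatever stimulus format the simulator expects.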



# Experimental Data

Xilinx CORE Generator: intellectual property of Xilinx

Metrics used:

- CLB columns
- Maximum clock frequency
- Maximum pin delay
- Average delay of 10 worst nets
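Two of these metrics are simple derived quantities, and a short sketch may make them concrete: maximum clock frequency is the reciprocal of the minimum clock period, and the last metric is the mean of the ten largest net delays. The function names and the sample net delays below are hypothetical; only the 6.466E-09 s period is taken from the matrix multiplier results.

```python
def max_clock_frequency(min_period_s):
    """Maximum clock frequency in Hz for a minimum clock period in seconds."""
    if min_period_s <= 0:
        raise ValueError("clock period must be positive")
    return 1.0 / min_period_s


def average_worst_nets(net_delays_s, count=10):
    """Mean delay of the `count` slowest nets."""
    worst = sorted(net_delays_s, reverse=True)[:count]
    if not worst:
        raise ValueError("no net delays given")
    return sum(worst) / len(worst)


# Period reported for the matrix multiplier constrained to 10 columns.
print(f"{max_clock_frequency(6.466e-9):.3E} Hz")

# Hypothetical net delays in nanoseconds, for illustration only.
sample_delays = [float(d) for d in range(1, 13)]
print(average_worst_nets(sample_delays))
```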

# Experimental Data: Matrix Multiplier

[Figure: Matrix Multiplier constrained at 12 columns]

Padmini Nagaraj - minar@ocf.berkeley.edu

| Physical Constraint (number of CLB columns) | 10        | 12        | 14        | 16        | Whole Chip |
|---------------------------------------------|-----------|-----------|-----------|-----------|------------|
| Minimum Clock Period (s)                    | 6.466E-09 | 6.476E-09 | 6.496E-09 | 6.496E-09 | 5.930E-09  |
| Maximum Clock Frequency (Hz)                | 1.547E+08 | 1.544E+08 | 1.539E+08 | 1.539E+08 | 1.686E+08  |
| Maximum Pin Delay (s)                       | 4.235E-09 | 4.174E-09 | 3.938E-09 | 4.120E-09 | 3.787E-09  |
| Worst 10 Net Delays (s)                     | 3.567E-09 | 3.692E-09 | 3.406E-09 | 3.470E-09 | 3.396E-09  |
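To read the table as a cost of constraining the design, the minimum clock periods can be expressed relative to the whole-chip run. The sketch below uses only the periods from the table; the function and variable names are illustrative.

```python
# Minimum clock periods in seconds, copied from the table above.
WHOLE_CHIP_PERIOD = 5.930e-9
CONSTRAINED_PERIODS = {10: 6.466e-9, 12: 6.476e-9, 14: 6.496e-9, 16: 6.496e-9}


def slowdown_percent(period, baseline):
    """Percentage by which `period` exceeds `baseline`."""
    return (period / baseline - 1.0) * 100.0


slowdowns = {
    columns: slowdown_percent(period, WHOLE_CHIP_PERIOD)
    for columns, period in CONSTRAINED_PERIODS.items()
}
for columns, percent in slowdowns.items():
    print(f"{columns} columns: {percent:.2f}% longer clock period than whole chip")
```

For the matrix multiplier, every column constraint lengthens the clock period by roughly 9.0 to 9.5 percent, and the differences among the constrained runs are small by comparison.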

# Experimental Data: Fast Fourier Transform

| Physical Constraint (number of CLB columns) | 16        | 20        | 24        | 28        | 32        | Whole Chip |
|---------------------------------------------|-----------|-----------|-----------|-----------|-----------|------------|
| Minimum Clock Period (s)                    | 1.053E-08 | 7.214E-09 | 8.276E-09 | 8.276E-09 | 8.170E-09 | 8.365E-09  |
| Maximum Clock Frequency (Hz)                | 9.501E+07 | 1.386E+08 | 1.208E+08 | 1.208E+08 | 1.224E+08 | 1.195E+08  |
| Maximum Pin Delay (s)                       | 6.711E-09 | 5.545E-09 | 6.227E-09 | 5.397E-09 | 5.864E-09 | 5.540E-09  |
| Worst 10 Net Delay (s)                      | 5.617E-09 | 4.736E-09 | 5.404E-09 | 4.778E-09 | 5.067E-09 | 4.776E-09  |

# Experimental Data: 2-D Discrete Cosine Transform

[Figure: 2DCT unconstrained]

| Physical Constraint (number of CLB columns) | 12        | 16        | 20        | 24        | 28        | Whole Chip |
|---------------------------------------------|-----------|-----------|-----------|-----------|-----------|------------|
| Minimum Clock Period (s)                    | 7.169E-09 | 6.349E-09 | 6.197E-09 | 6.286E-09 | 6.163E-09 | 7.457E-09  |
| Maximum Clock Frequency (Hz)                | 1.395E+08 | 1.575E+08 | 1.614E+08 | 1.591E+08 | 1.623E+08 | 1.341E+08  |
| Maximum Pin Delay (s)                       | 4.798E-09 | 4.208E-09 | 4.163E-09 | 4.088E-09 | 3.707E-09 | 6.367E-09  |
| Worst 10 Net Delays (s)                     | 3.667E-09 | 3.420E-09 | 3.373E-09 | 3.295E-09 | 3.280E-09 | 5.711E-09  |
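With all three applications tabulated, it may help to compare the best constrained result for each against its whole-chip run. The sketch below copies the minimum clock periods from the matrix multiplier, FFT, and 2-D DCT tables; the names and the tie-breaking rule (prefer fewer columns) are choices made for this illustration.

```python
# Minimum clock periods in seconds, keyed by CLB column constraint.
CONSTRAINED = {
    "Matrix Multiplier": {10: 6.466e-9, 12: 6.476e-9, 14: 6.496e-9, 16: 6.496e-9},
    "FFT": {16: 1.053e-8, 20: 7.214e-9, 24: 8.276e-9, 28: 8.276e-9, 32: 8.170e-9},
    "2-D DCT": {12: 7.169e-9, 16: 6.349e-9, 20: 6.197e-9, 24: 6.286e-9, 28: 6.163e-9},
}
WHOLE_CHIP = {"Matrix Multiplier": 5.930e-9, "FFT": 8.365e-9, "2-D DCT": 7.457e-9}


def best_constraint(periods_by_columns):
    """Return (columns, period) with the smallest period; ties go to fewer columns."""
    return min(periods_by_columns.items(), key=lambda item: (item[1], item[0]))


summary = {}
for app, periods in CONSTRAINED.items():
    columns, period = best_constraint(periods)
    beats_whole_chip = period < WHOLE_CHIP[app]
    summary[app] = (columns, period, beats_whole_chip)
    print(f"{app}: best at {columns} columns, {period:.3e} s, "
          f"faster than whole chip: {beats_whole_chip}")
```

With the tabulated values, the best constrained run beats the whole-chip run for the FFT and the 2-D DCT but not for the matrix multiplier.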

# Experimental Data: Multiple Applications

| Application                | Minimum Number of CLB Columns | Minimum Clock Period (s) | Maximum Clock Frequency (Hz) | Max Pin Delay (s) | Worst 10 Net Delay (s) |
|----------------------------|-------------------------------|--------------------------|------------------------------|-------------------|------------------------|
| FFT 256                    | 20                            | 7.571E-09                | 1.321E+08                    | 5.228E-09         | 3.702E-09              |
| FFT                        | 16                            | 1.053E-08                | 9.501E+07                    | 6.711E-09         | 5.617E-09              |
| 2-D Disc. Cosine Transform | 14                            | 6.923E-09                | 1.444E+08                    | 4.040E-09         | 3.382E-09              |
| FFT 1024                   | 12                            | 9.312E-09                | 1.074E+08                    | 5.462E-09         | 4.724E-09              |
| Matrix Multiplier          | 10                            | 6.466E-09                | 1.547E+08                    | 4.235E-09         | 3.567E-09              |
| CORDIC                     | 4                             | 8.453E-09                | 1.183E+08                    | 2.876E-09         | 2.288E-09              |
| Digital Down Converter     | 4                             | 8.373E-09                | 1.194E+08                    | 3.108E-09         | 2.377E-09              |
| 1-D Disc. Cosine Transform | 2                             | 4.857E-09                | 2.059E+08                    | 2.835E-09         | 2.360E-09              |
| Cascaded Int. Comb Filter  | 2                             | 3.380E-09                | 2.959E+08                    | 1.461E-09         | 1.009E-09              |
| Multiply Accumulator       | 2                             | 5.443E-09                | 1.837E+08                    | 3.060E-09         | 2.388E-09              |
| Sine/Cosine Look Up Table  | 2                             | 0.000E+00                | 0.000E+00                    | 1.677E-09         | 1.120E-09              |
| Direct Digital Synthesizer | 2                             | 4.532E-09                | 2.207E+08                    | 1.810E-09         | 1.233E-09              |
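In a task-based setting, the minimum column counts suggest a quick feasibility check: which applications could be resident on the chip at the same time? The sketch below assumes tasks sit side by side in disjoint column ranges with no spacing overhead, which is a simplification, and the column budgets passed to `fits` are hypothetical rather than the size of any particular device.

```python
# Minimum number of CLB columns per application, copied from the table above.
MIN_COLUMNS = {
    "FFT 256": 20,
    "FFT": 16,
    "2-D Disc. Cosine Transform": 14,
    "FFT 1024": 12,
    "Matrix Multiplier": 10,
    "CORDIC": 4,
    "Digital Down Converter": 4,
    "1-D Disc. Cosine Transform": 2,
    "Cascaded Int. Comb Filter": 2,
    "Multiply Accumulator": 2,
    "Sine/Cosine Look Up Table": 2,
    "Direct Digital Synthesizer": 2,
}


def columns_needed(task_names):
    """Total columns if the named tasks are placed side by side."""
    return sum(MIN_COLUMNS[name] for name in task_names)


def fits(task_names, column_budget):
    """True if the named tasks fit together within `column_budget` columns."""
    return columns_needed(task_names) <= column_budget


print(columns_needed(MIN_COLUMNS))                  # all twelve applications
print(fits(["Matrix Multiplier", "CORDIC"], 16))    # hypothetical 16-column budget
print(fits(["FFT 256", "FFT"], 32))                 # hypothetical 32-column budget
```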

# Experimental Data: Multiple Applications (cont...)

[Figure: FFT constrained at 16 columns]





# Real World Application: JPEG

[Figures: JPEG encoding steps; JPEG decoding steps]




|                        | XAPP637 RGB to YCbCr | 2-D Disc. Cosine Transform | XAPP615 Quantization | XAPP615 Inverse-Quantization | Inverse 2-D Disc. Cosine Transform | XAPP238 YCrCb to RGB |
|------------------------|----------------------|----------------------------|----------------------|------------------------------|------------------------------------|----------------------|
| Num of CLB columns     | 2                    | 8                          | 6                    | 6                            | 8                                  | 2                    |
| Clock Period (s)       | 8.343E-09            | 8.249E-09                  | 8.378E-09            | 7.376E-09                    | 6.580E-09                          | 6.469E-09            |
| Clock Frequency (Hz)   | 1.199E+08            | 1.212E+08                  | 1.194E+08            | 1.356E+08                    | 1.520E+08                          | 1.546E+08            |
| Max Pin Delay (s)      | 3.571E-09            | 4.097E-09                  | 4.950E-09            | 4.847E-09                    | 3.583E-09                          | 3.130E-09            |
| Worst 10 Net Delay (s) | 2.712E-09            | 3.121E-09                  | 4.146E-09            | 4.026E-09                    | 3.368E-09                          | 2.377E-09            |
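The per-stage numbers can be rolled up into one figure per direction. The sketch below groups the first three columns of the table as the encoder and the last three as the decoder, and assumes that all stages in a chain share one clock, so the slowest stage sets the rate. That shared-clock assumption and the function name are not from the source, and only the stages listed in the table are included.

```python
# (stage name, CLB columns, clock period in seconds), in table order.
ENCODER = [
    ("RGB to YCbCr", 2, 8.343e-9),
    ("2-D DCT", 8, 8.249e-9),
    ("Quantization", 6, 8.378e-9),
]
DECODER = [
    ("Inverse Quantization", 6, 7.376e-9),
    ("Inverse 2-D DCT", 8, 6.580e-9),
    ("YCrCb to RGB", 2, 6.469e-9),
]


def summarize(stages):
    """Return (total columns, slowest stage name, shared clock frequency in Hz)."""
    total_columns = sum(columns for _, columns, _ in stages)
    slowest_name, _, slowest_period = max(stages, key=lambda stage: stage[2])
    return total_columns, slowest_name, 1.0 / slowest_period


for label, stages in (("encoder", ENCODER), ("decoder", DECODER)):
    columns, slowest, frequency = summarize(stages)
    print(f"{label}: {columns} columns, limited by {slowest} at {frequency:.3E} Hz")
```

Under these assumptions each direction needs 16 CLB columns if all of its stages are resident at once, with quantization limiting the encoder and inverse quantization limiting the decoder.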





[Figure: IQuantize constrained at 8 columns]

# Conclusion

- Place and route tools: lack sufficient intelligence.
- Density of application affects everything: clock period, maximum pin delay, and worst 10 net delay.
- User defined constraints: help place and route tools.