DEVELOPMENT OF A SMART CAMERA SYSTEM ON AN FPGA. Monica Jane Whitaker. A thesis submitted in partial fulfillment of the requirements for the degree

Size: px

Start display at page:

Download "DEVELOPMENT OF A SMART CAMERA SYSTEM ON AN FPGA. Monica Jane Whitaker. A thesis submitted in partial fulfillment of the requirements for the degree"

Samuel Cole
5 years ago
Views:

1 DEVELOPMENT OF A SMART CAMERA SYSTEM ON AN FPGA by Monica Jane Whitaker A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering MONTANA STATE UNIVERSITY Bozeman, Montana November, 2016

3 ii ACKNOWLEDGEMENTS I would like acknowledge the faculty and staff of the Electrical and Computer Engineering Department as well as those of the Gianforte School of Computing at Montana State University for their continued support and encouragement throughout my undergraduate and graduate education. Funding Acknowledgment This work was kindly supported by the Montana Research and Economic Development Initiative, Montana Board of Research and Commercialization Technology and Resonon, Inc.

4 iii TABLE OF CONTENTS 1. INTRODUCTION AND BACKGROUND...1 Introduction...1 Hyperspectral Imaging...3 Classifying a Hyperspectral Image...3 Object Sorting...4 Smart Cameras...5 Existing Smart Cameras...6 Matrix Vision...6 Matrox Imaging...7 Eye Vision Technology...7 Teledyne Dalsa...8 The Winner is... None of the Above MOTIVATION Beneficiaries Current Processing System Why FPGA? SYSTEM DESIGN Logistic Regression Algorithm Hardware Elements Arria 10 SoC Development Board Components Additional Custom Boards Project Overview Camera Interface DRAM Interface Computation Unit Pixel Classification Object Classification FPGA Interface Performance... 32

5 iv TABLE OF CONTENTS - CONTINUED 4. IMPLEMENTATION DETAILS Programmable Oscillator Registers Programmable Clock Generator Design Decisions Registers Burning a Configuration Utilizing the Clock Generator Altera IP Timing Constraints Toolchain Fights SignalTap TimeQuest Timing Analyzer Chip Planner MATLAB Toolchain Tricks TEST AND VERIFICATION Camera Interface DRAM Computation Unit FPGA to FPGA Transmission CONCLUSION REFERENCES CITED APPENDICES APPENDIX A: Register Descriptions APPENDIX B: VHDL Code APPENDIX C: MATLAB Code

6 v LIST OF TABLES Table Page 3.1 Pixel Information in DRAM Register settings for Si A.1 ENABLE Register Description A.2 IRQ ENABLE Register Description A.3 IRQ PENDING Register Description A.4 NUM BINS Register Description A.5 NUM PIXELS Register Description A.6 NUM CLASSES Register Description A.7 FRAME COUNT Register Description A.8 MEAN Register Description A.9 STD DEV I Register Description A.10 COEFFICIENT Register Description A.11 INNER PRODUCT Register Description A.12 DECISION VECTOR Register Description... 64

7 vi LIST OF FIGURES Figure Page 1.1 A mock-up of the full system as it is intended to operate An example of hyperspectral line scan images over several frames Robot sorting almonds Depiction of a typical image processing system [1] This is a depiction of the future of image processing, with an integrated camera sensor and FPGA processor [1] Graphical depiction of relative resources in the Arria 10 SoC chip Example inner product calculation High level view of the components external to the SoC utilized in the system. The colored regions depict the individual PCBs The PCBs created for the hyperspectral camera Block diagram of the full system functionality Block diagram of the camera interface subsystem Block diagram of the memory subsystem Block diagram of the computation subsystem The connection between the Arria V development board (top) and the Arria 10 development board (bottom) Factory Default Clock Register Settings for Si Preferred Clock Register Settings for Si Diagram of Pin Assignments for VersaClock 6 Programmable Clock Generator A fitted floor plan in the Arria Generated plot depicting ratios between the pixels of the line scan camera and the pixels of the hyperspectral camera Zoomed in plot of ratios between monochrome and hyperspectral camera pixels... 53

8 vii ABSTRACT In recent years, hyperspectral cameras have been appearing in many applications that need more information than what conventional color cameras can provide. A hyperspectral camera is able to capture data ranging in wavelengths from the visible spectrum all the way into the infrared. In this way, it is able to see hundreds of colors, much more than the human eye or any standard camera that typically uses only 3 spectral values (corresponding to the standard red, green, and blue colors). Due to the large amount of data that these cameras can generate at increasingly faster frame rates, conventional computers are not able to perform all the necessary processing in real-time. Because of this limitation, a new system is needed to perform the image processing. This master s thesis is meant to contribute to the development of a smart camera targeted for hyperspectral image processing using a Field Programmable Gate Array (FPGA) and object sorting with a prototype waterfall system. Through the use of a Hardware Description Language (HDL), a currently used image processing algorithm has been implemented to classify pixels. Additionally, design and test of an architecture for full object classification has been developed for the FPGA. High-speed transceivers are used to move data between multiple FPGA development boards. When paired with a hyperspectral camera and a monochrome line scan camera, this prototype system is capable of scanning objects in freefall and deciding within milliseconds whether or not to keep the object. This decision will dictate the action of air jets to displace unwanted objects. This full system is potentially of interest to small businesses or farms as it will enable farmers to perform their own premium bulk sorting in a cost effective manner.

9 1 INTRODUCTION AND BACKGROUND Introduction A smart camera system is being developed to target sorting applications using a hyperspectral camera. The overall system in development includes the hyperspectral camera, a monochrome line scan camera and a sorting mechanism that uses air jets to perform the physical sorting. This camera system will replace existing systems by removing the need for cables between the camera and the processing unit as well as replacing conveyor belts and robots with a vibrator feeder and air jets. In doing so, with the help of the hyperspectral data, sorting may become more accurate and the unit may end up being cheaper and consequently more accessible to small businesses. This project is a prototype for the end result and is consequently not as compact as the final product is anticipated to be, but it performs all the necessary calculations and produces a result to trigger the air jets for the sorting of objects with high precision due to the inclusion of hyperspectral data. This smart camera system utilizes two System-on-Chip (SoC) devices that each consist of a Field Programmable Gate Array (FPGA) fabric and a Hard Processor System (HPS) implemented on a single silicon chip for easy and fast interactions. The fabric of these SoC FPGAs is used for the processing of all data generated by the cameras, while the HPS is utilized in user interactions and memory transfers. The monochrome line scan camera is included for detection of the objects at the time of imaging and building an object profile for the processing unit to make a complete object decision based on the compiled individual pixel decisions. The decisions are made based on classes designed around the hyperspectral characteristics found using the hyperspectral camera included in this system.

air jets. This thesis focuses primarily on the development and implementation of the image processing algorithm in addition to the interaction between development boards.

10 2 Figure 1.1: A mock-up of the system as it is intended to operate. The product will fall from the conveyor belt and be imaged by both cameras (one high resolution monochrome and one hyperspectral) simultaneously before either continuing its fall or being ejected by air jets. This thesis focuses primarily on the development and implementation of the image processing algorithm in addition to the interaction between development boards. The air jet system is in development by a separate team of engineers, as is the monochrome camera processing subsystem that identifies object boundaries for the hyperspectral camera. In implementation of the prototype design for this project, the author of this thesis is responsible for the development and testing of the image processing algorithm for the hyperspectral camera data and the transceiver communication between development boards. The author also worked with the development tools to compile the whole project and fix timing errors. Additionally,

11 3 this author set up the data access method between the FPGA and the off-chip Dynamic Random Access Memory (DRAM) connected directly to the FPGA fabric via dedicated and hardened DRAM controllers. The details of this system are abstracted away and a few control lines are available for use by other subsystems. Hyperspectral Imaging Resonon defines a hyperspectral image as a digital image with far more spectral information for each pixel than traditional color cameras. The resulting data can be pictured as a cube with dimensions in the spatial x and y directions and a third dimension in the spectral wavelength, as seen in Figure 1.2. The cameras utilized in the sorting applications explored within this thesis are line scan cameras, so a frame consists of a single line (spatial y = 1) of pixels (spatial x) and then the spectral bands occupying the third dimension. With the extra wavelength values, including those in the infrared, hyperspectral cameras are able to sense much more information than the human eye and your typical RGB color camera. This technology is used in anything from remote sensing to quality control to sorting [2]. Classifying a Hyperspectral Image Every pixel within an object contains nearly unique spectral signatures which can be used to classify it. In order to do so, a class is defined by compiling a variety of images of the object and determining the spectral signature that most commonly defines the pixels within the object. This is done for each of the possible classes expected to be seen in the surveyed objects. In this way, each class is a vector of values across all the spectral bands. Using these classes, a statisticallybased machine learning algorithm is used in order to come up with a probability that the pixel belongs to each class. A number of different machine learning algorithms

12 4 Figure 1.2: (lines). [3] An example of hyperspectral line scan images over several frames are acceptable for this objective, however a logistic regression approach has been chosen for the implementation of the prototype based on currently used systems. For logistic regression, the considered computation utilizes an inner product calculation between the spectral signature of the pixel and the considered classes to reach a probability. The highest probability is kept for each pixel in an object and by adding the probabilities of each class over the object, the highest probability is kept and determined to be the class to which the object belongs. Object Sorting Sorting has been a large concern in developed nations for many years as manufacturers strive to put out quality products. It is gaining popularity in developing countries and becoming even more important in the places where it already exists, as quality control is brought to the forefront of society s attention. Particularly in the food industry, increased industrialization has brought forward a push toward healthy convenience foods. Sorting is also very important in agricultural applications

5 as farmers need to sort their crops after harvesting. In many cases, this is done by an industrial company who then reports back with the percent loss due to the sorting mechanism.

13 5 as farmers need to sort their crops after harvesting. In many cases, this is done by an industrial company who then reports back with the percent loss due to the sorting mechanism. Of course, by sending it away, farmers have no way of verifying the reported loss and it would be easier and more reliable for them to have their own means of sorting. Machine sorting helps to avoid the inconsistent nature of manual sorting [4] and avoids significant loss of good product that may result from vibration sorting or other mechanical means. Sorting machines come in a variety of shapes, sizes, and technologies. These include using lasers, cameras, or x-ray in conjunction with robotic arms or air jet systems to sort products and separate the bad from the good [4]. As the technology advances, the sorting abilities will be expected to do so, as well. Figure 1.3: Robot sorting almonds [2] Smart Cameras In many applications, cameras of all varieties are used to acquire data for studying something about the subject matter that can be viewed at a later time. This

14 6 data is also generally processed external to the system in which it abides. However, it is becoming more necessary and common for processing to happen on-board, enabling the system to adjust in real-time. Because of this, real-time processing is in high demand and the techniques are still being perfected. Further, the algorithms to process data generated by cameras are in constant refinement as researchers learn what they want to see from the data and how best to achieve those results. As algorithms are refined and cameras are built to generate more data than ever before, it becomes necessary to have the right infrastructure to support the real-time processing of imaging data, and thus we find the niche for a smart camera. Existing Smart Cameras There are several smart cameras currently in existence, including some with a programmable FPGA or select-able image processing algorithms to utilize in the desired application. These include several by Matrix Vision and some by other companies such as Matrox Imaging, EVT, and Teldyne Dalsa as further described in the following subsections. Though these cameras are likely very useful in some applications which require on-board image processing, they lack the real-time processing advantages gained in the use of the Arria 10 FPGA, which are detailed in the last subsection below. Matrix Vision Matrix Vision has created several smart cameras, two of which are notable for image processing in industry. The mvbluegemini is touted as a Tool box technology camera and includes an SoC with FPGA and Dual-Core Cortex-A9 with 800 MHzcapable clocks and a camera sensor with 1280 x 1024 resolution. The software that

15 7 comes with the camera includes a Graphical User Interface (GUI) with which users can choose the task to complete. The frame rate on this single sensor is unspecified. [5] The other smart camera by Matrix Vision is the mvbluelynx-x. This camera has options for either CCD or CMOS sensors in addition to a hybrid dual core. This features a Cortex-A8 ARM CPU with a separate real-time Digital Signal Processing (DSP) unit with video interface. The CPU has a clock speed up to 1 GHz, while the DSP can run up to 800 MHz. Though available in a number of different resolutions, the largest, 2592 x 1944 has a maximum frame rate of 14.4 Hz and the next largest, 1280 x 1024 has a maximum frame rate of 60 Hz. [6] Matrox Imaging Matrox makes a smart camera entitled the Iris GT which comes with a design assistant and a web-based interface for the integrated development environment. This camera has an Intel Atom embedded processor running Windows as well as a builtin keyboard, monitor, and mouse for friendly user interface. It is compatible with a variety of monochrome and color CCD sensors. Matrox claims this camera and software are ideal for a variety of machine vision applications including agriculture, aerospace, and more. The highest frame rate is 110 frames per second (fps), with an effective resolution of 640 x 480. [7] Eye Vision Technology Eye Vision Technology (EVT) creates several variations of smart cameras. The RazerCam, for instance, is packaged with a free programmable FPGA and two ARM Cores based on the Xilinx Zynq SoC. Users are limited to choosing between one of two matrix sensors or a line scan sensor with 4K resolution. That line sensor claims a frame rate of fps with 10-bit pixel data, but the matrix sensors are not above 60 fps. The ARM cores are running Linux for user convenience in interaction. The

16 8 greatest downfall of this camera is the lack of hardened floating point on the FPGA, which could hinder the speed or accuracy of results, not to mention development speed. EVT also has a series of EyeCheck smart cameras which are almost all around 30 or 60 fps at resolutions in the thousands. One version has 180 fps and a Xilinx Artix-7 FPGA with 28K logic elements. This FPGA is in the low-end of Xilinx s product portfolio, designed to optimize power and cost. [8] Teledyne Dalsa Teldyne Dalsa offers several vision systems with embedded software applications. There are monochrome sensors available with resolution up to 1600 x The processing includes an embedded CPU and DSP with a choice of software. These are built robustly for integration in factory environments. These cameras are ideal for still-image quality control and do not have the high clock rate possible with an FPGA. [9] The Winner is... None of the Above As seen here, there are many different options for smart cameras already on the market that could be fitted with a hyperspectral front end and used for sorting objects. But ultimately, none of these were chosen. This is because they lack what could be known as the best combination of options. Some of these are outfitted with DSP software and pre-programmed algorithms to choose from. Others use FPGAs for user configurability. However, the DSP software is not all-encompassing and the FPGAs more than likely do not have hardened floating point blocks. In the application space targeted here, the hardened floating point is particularly valuable for cheaper calculations with greater accuracy. Further, by using only an FPGA to do all the processing, any algorithm could be configured and used. Real-time processing also greatly benefits from the deterministic latencies of FPGAs whereas other systems are

17 9 compromised by the inclusion of numerous memory accesses or operating systems. Additionally, the sensors available for these cameras have frame rates less than 100 fps in most cases and the ideal sensor will be collecting data much faster than that. For these reasons, it was decided that a new smart camera should be developed and thus, this project was born.

18 10 MOTIVATION Beneficiaries This project is done in support of, and with support from, Resonon Inc. in the belief that they will be able to utilize the smart camera in their machine vision technology systems. Upon completion of a system prototype, of which this thesis is a subsection, they could utilize the processing method and small form factor in other integrated systems that they pair with their optical technology. Further, the Montana Board of Research and Commercialization Technology helped to start the work on this project and its first proof of concept iteration as they were providing the primary source of funding for materials and man-hours spent developing this technology implementation. The primary focus for this smart camera implementation is in food sorting, but the technology could be utilized in any sort of assembly-line environment requiring quality assurance checks. Currently, in areas using the Resonon machine vision technology, there is still the need for manual sorting after the machine has performed its sorting because the current system is not capable of processing all the necessary data at a sufficiently fast speed in order to make highly accurate classifications. Due to the unavailability of a suitable image processing system, the images are lower resolution than the Resonon optics technology is ultimately capable of in order to allow processing to be done on a traditional PC. The goal of having an efficient real-time integrated machine sorting system that is able to process high-resolution images, is to eliminate the need for manual sorting after the machine which will free up workers for other tasks. Using an FPGA enables a fully customizable smart camera implementation that could apply in several application areas.

11 Current Processing System A typical image processing system is shown in Figure 2.1. This system is comprised of a camera, a frame grabber to configure and capture image data from the camera, and a computer to perform the processing.

19 11 Current Processing System A typical image processing system is shown in Figure 2.1. This system is comprised of a camera, a frame grabber to configure and capture image data from the camera, and a computer to perform the processing. Figure 2.1: Depiction of a typical image processing system [1]. Though this method has worked for many years, it limits the speed at which images can be processed due to several bottlenecks. The first bottleneck comes from the cables that limit bandwidth. The second bottleneck is the speed of the computer that limits the speed of real-time computations. The previous proof-of-concept system utilized Camera Link connections to connect the camera to the FPGA. The Camera Link standard was designed modeling the Channel Link technology, which is able to transmit data at up to 2.38 Gbps [10]. With current applications of image processing, the need for real-time results is growing and placing a strain on the capabilities of existing systems. This project seeks to provide a solution for the replacement of these traditional systems by integrating all three components as shown in Figure 2.2 and removing the need for any cables. The proposed implementation involves short ribbon cables to move data between the board housing the camera sensor and the FPGA board. This is done so that the camera sensor can be easily placed at a 90 angle

20 12 to the board for this prototype. These ribbon cables will eventually be replaced by board-edge connectors since the cables are not required for implementation purposes, as far as the data movement is concerned and could easily be removed in a final product. Figure 2.2: This is a depiction of the future of image processing, with an integrated camera sensor and FPGA processor [1]. Why FPGA? One of the biggest advantages of FPGAs over standard computers is the deterministic low-latency data paths achievable in custom application-specific architectures. CPUs have fixed architectures and variable latency depending on where the data is stored or moved (cache, main memory, etc.). FPGAs are made of programmable logic blocks, SRAM, and DSP blocks that can be reconfigured for varying applications. The architecture of an FPGA is like a grid, with logic connected over interconnects between blocks. Because of this structure, FPGAs have no cache and a flexible architecture can be programmed by the user using a hardware description language. In doing so, ultimate control is maintained over what occurs on each clock cycle and even where each of the internal registers are placed to garner the best path through the device. The deterministic latency is key for real-time systems

21 13 as the user is able to guarantee that the performance is real-time. Fabric is also easily expanded by adding more logic blocks, which enables manufacturers to create similar devices of varying size and complexity. In this way, FPGAs have been tailored to be suitable for a whole market of people with varying needs, resources, and cost targets. In this project two SoCs are used, instead of the standard FPGA that does not include the ARM CPUs. A SoC contains a dual-core ARM processor on the same chip as the FPGA along with hardened peripheral (Ethernet, USB, etc.), which is referred to as a Hard Processor System (HPS). Providing an HPS that can serve as a smart interface between the FPGA logic and the outside world makes it easier to communicate with external computers for the passing of data. This is generally accomplished using the Ethernet connection to achieve an IP address on the Linux system running on the HPS. However, while the HPS is still functioning as a standard computer and is subject to timing constraints applied by the OS scheduler, the FPGA can continually be running and performing the computationally demanding or timing specific tasks concurrently. It is able to send interrupts to the HPS and the HPS can read or write to the memory available to the FPGA. A depiction of the resources available and their relative locations within the Arria 10 SoC chip is shown in Figure 2.3. The architecture of the Arria V SoC is similarly laid out, though with lower-level technology in the transceivers and DSP blocks.

22 14 Figure 2.3: Graphical depiction of relative resources in the Arria 10 SoC chip [11]

23 15 SYSTEM DESIGN Logistic Regression Algorithm One of the simplest hardware-implemented classification algorithms is logistic regression. The inputs are a vector each of classification coefficients, means, and standard deviations, the input data, and matrices of a full bright image and a full dark image. The equations describing the process are: normalized = (data dark)/white (3.1) corrected = (normalized mean)/standard dev (3.2) product = coef f icient corrected + previous (3.3) In these equations, previous represents the running sum. It starts with the zeroth coefficient value and subsequent products are added on. An example is shown here assuming input vectors of the corrected data and the coefficients. The coefficients vector is one more in length than the input vector. 1 for i in 1 to NUMBER BINS 2 i f i = 1 3 product = c o e f f i c i e n t ( i ) + c o r r e c t e d ( i ) c o e f f i c i e n t ( i +1) ; 4 else 5 product = product + c o r r e c t e d ( i ) c o e f f i c i e n t ( i +1) ; 6 end 7 end The final step of the logistic regression calculation includes calculating the probability associated with the inner product result. This probability calculation is as in Equation 3.4 where X = product after the inner product is completed. For this system, the classification is not dependent on the probability value, only on the relative probability. Given the one-to-one mapping between the inner product result and the probability, it is sufficient to use the inner product result as a representative

24 16 of the relative probability for each class to determine the class that each pixel most likely belongs to. P = 1 1 e X (3.4) To ensure that the computations are hardware-friendly, only multiplications and additions are implemented. This means, that any numbers needing to be divided are first inverted in the software before being ported to the hardware. The computation of logistic regression involves matrix inner products. Given normalized inputs and classification coefficients, the inputs are multiplied by the corresponding coefficients and a running sum is kept over a pixel to achieve a single result representing the probability that the pixel belongs to that class. Figure 3.1: Inner product calculation with vector of coefficients and matrix of normalized values with dimensions of number of bins by number of pixels. The normalization that takes place uses the white and dark values that are passed with each data input as well as stored mean and standard deviation values that represent the mean and standard deviation across the spectral bins from a training set. The white and dark values are used to normally distribute the data between 0 and 1, while the mean and standard deviation account for the frequency of particular values. The dark value is subtracted from the data and the result is multiplied by the inverse white value. Subsequent operations involve subtraction of the mean and multiplication by the inverted standard deviation. The inverse of both the white values and the standard deviations are calculated externally before being stored in

25 17 system memory so no hardware divides are required, enabling fewer resources used and faster clocks. Hardware Elements Arria 10 SoC The Arria 10 SoC by Altera (which was acquired by Intel in 2015) was chosen for the primary computation engine on this project for several reasons. The primary reason is its hardened floating point units which enables the device to allow for over 1.5 trillion floating point operations per second of performance [12]. This is the first device on the market to contain single-precision floating-point multipliers and adders incorporated into the hard DSP blocks [12], the addition of which provides a great improvement in system development since fixed-point algorithms take much more effort to develop and soft floating-point calculations use unnecessary amounts of resources to create floating-point multipliers in the FPGA fabric. In addition, the largest device in this family has 660K logic elements (LE), over 42Mb M20K memory, and up to 48 transceivers capable of 17.4 Gbps [11]. This device is the best middle-class FPGA on the market today. Future iterations of this system could utilize different versions of the Arria 10 or move to a higher-end device in the Stratix 10 (the largest of which will have around 30 billion transistors [13]). The Arria 10 is the best FPGA for this purpose right now because of its high performance, which surpasses the speed requirements of the cameras, for data alignment purposes while still maintaining affordability. Also, the Stratix devices, while better in number of logic elements and DSP performance, really excel in transceiver performance and are best suited for tasks involving high transmission. Though this design does use transceivers, and may benefit in the future from moving to these more advanced

26 18 devices, it is not necessary to have the higher capability given the limit of the data rate from the cameras. Development Board Components In addition to the Arria 10, other components on the development board that were utilized for this project include the DDR3 DRAM, the SMA connectors, and the FMC connector. The DRAM on the board is 1 GB of memory for each of the FPGA and HPS to utilize. This is used for storing the light and dark matrix values. The SMA connectors are used as the interface for the transceivers to communicate with the Arria V FPGA. A daughter card was developed to plug into the FMC for the purposes of bringing camera data into the FPGA. Also an important part of the project is the Arria V SoC, which is also on a development board that includes SMA connectors, an FMC connector, a Max V CPLD, and a programmable oscillator. The SMA connectors here are again the interface for the transceivers. The FMC is used for the custom daughter card to connect the monochrome camera to the FPGA and the oscillator is utilized for achieving the ideal clock frequency for the transceiver communications. The oscillator is programmed over I 2 C by the Max V, which has to be programmed separately prior to running the desired program on the FPGA. The high level diagram of system components is shown in Figure 3.2. Not shown is the external PC which will interact with the HPS over Ethernet. While in future system implementations, the Arria V will be replaced with an Arria 10, it is currently used for the monochrome camera subsystem because of its initial use in the development of this subsystem. Additional Custom Boards In addition to the two FPGA development boards the project also required the creation of several daughter cards, i.e. printed circuit boards (PCBs). Three cards

The first card is designated the sensor board, which contains all of the circuitry to connect to the lines of the CMOS sensor chosen to be the face for the hyperspectral front-end.

27 19 Figure 3.2: High level view of the components external to the SoC utilized in the system. The colored regions depict the individual PCBs. were developed using the PADS software from Mentor Graphics [14], by teammates Connor Dack and Alex Matejunas. The first card is designated the sensor board, which contains all of the circuitry to connect to the lines of the CMOS sensor chosen to be the face for the hyperspectral front-end. All of the lines are drawn out to two 100-pin ribbon cable connectors. This board is separate for the purposes of being able to orient at a 90 angle to the rest of the boards, but also so it is modular and could be easily swapped with a different sensor, should the need or desire arise. A second board contains more ribbon cable connectors and connects the data from the cables to the FMC, which will bring it into the FPGA for processing. This board also contains circuitry for the transceivers, including SMA connectors and a clock generator to provide a reference clock, since the Arria 10 development board does not contain any SMA connectors for transceivers. Both boards are shown in Figure 3.3 connected to the Arria 10 pre-production development board.

20 Figure 3.3: The PCBs created for the hyperspectral camera connection to the Arria 10 FPGA. They are shown attached to the FMC, without the ribbon cables and coaxial cables.

28 20 Figure 3.3: The PCBs created for the hyperspectral camera connection to the Arria 10 FPGA. They are shown attached to the FMC, without the ribbon cables and coaxial cables. A board was also created to connect to the FMC on the Arria V with inputs for the monochrome line scan camera s Camera Link cables. A custom board was created for this purpose because not all of the FMC pins are connected on the Arria V development board, though they are needed for communication with the camera. Consequently, a daughter card was developed to appropriately map the camera signals to connected FMC pins on the FPGA. Project Overview In order to implement a processing system on the FPGA, the tasks required were broken into system blocks as detailed in the following sections. These blocks are the camera interface, the DRAM interface, the computation block, and the FPGAto-FPGA interface, which encompasses the communication system between the two boards in order to send signals to the air jet system and transmit object information.

21 Figure 3.4: Block diagram of the full system functionality Camera Interface In order to integrate the camera sensor on this prototype system, two additional boards were fabricated.

29 21 Figure 3.4: Block diagram of the full system functionality Camera Interface In order to integrate the camera sensor on this prototype system, two additional boards were fabricated. The first board houses the sensor and has connectors for the data to pass to the second board, which is connected to the development board housing the FPGA, and routes all the data signals to the appropriate places to be accessed from the FPGA as well as ensuring the control signals are appropriately routed. This board also contains clock generator circuitry and SMA connectors for transceiver communication purposes. As previously mentioned, the two boards are connected via ribbon cable in the prototype to allow the sensor to be at a 90 angle from the other boards. Configuring the sensor on its own board makes this a modular product, in which other sensors (on their respective boards) could be used to replace the current one so long as the signals are routed in the same way through the connectors.

30 22 Figure 3.5: Block diagram of the camera interface subsystem. The programmable hardware interface for the camera consists of a state machine to compile all the bits per pixel as they are presented, and attaching location information which describes which pixel and spectral band the data belongs to. It also pulls the data from DRAM through a FIFO interface and verifies the location information matches that of the pixel that is being compiled. This interface also sends any control signals to the camera required for triggering a start and providing a clock signal to the camera. The latency in the Camera Interface is determined based on the camera data rate as well as the number of cycles needed to delay the data before it is passed on in order for it to be parallelized. Since the data is presented in 10 taps, one bit at a time, there is a delay needed to accumulate all the relevant bits per pixel per band. Additionally, to create 5 parallel channels, the data is delayed because it is initially presented serially. It was chosen to add this delay and parallelize because while it slows down the initial presentation of the data to the computation unit, the increase in computations completed through parallel channels is great enough to overcome this initial serial latency.

31 23 DRAM Interface In order to account for the effects of the camera, all incoming data is normalized by the white and dark values as in Equation 3.1. These are meant to correspond to the largest and smallest possible data values that have been, or could be, seen. This data is captured in still images taken prior to operation of the system. The dark image is taken while the lens cap is on the camera to provide a theoretical darkest environment possible. In contrast, the light image is captured while the camera is fixed on a white strip that is lit up to its brightest value seen given the operational lighting conditions. Given that these are full matrices, with potentially variant values at each pixel/band, all the information captured needs to be stored. Due to its size, it was decided to store this data in off-chip Dynamic Random Access Memory (DRAM) so that there is still plenty of room for more frequently accessed and changing data in the on-chip SRAM memory. This was also deemed an acceptable choice of storage location because the values are only accessed before the computations and on-chip memory is used to buffer the values as they are accessed, so there is less time-critical need of the data from the time of address specification (i.e. the DRAM matrix values are pre-fetched, alleviating any effects of DRAM refresh stalls). DRAM consists of a grid of capacitors and transistors where each capacitor is capable of storing a single bit based on its voltage level. The transistor is used to access that particular capacitor and charge or drain it as necessary. The memory has to be refreshed occasionally to keep the stored values as the capacitors drain their charge over time. Since each stored bit only requires a single transistor and capacitor, this memory is very dense and cheap, making it attractive in industrial applications. However, DRAM is not quick to access in comparison to SRAM that is located in the FPGA fabric. The timing of controller interactions with the memory is also

24 Figure 3.6: Block diagram of the memory subsystem. very technically challenging, but is handled through a hardened memory controller in the FPGA that provides the direct interface functionality.

32 24 Figure 3.6: Block diagram of the memory subsystem. very technically challenging, but is handled through a hardened memory controller in the FPGA that provides the direct interface functionality. A custom controller was developed to be the master of this interface and this controller issues read or write commands. As camera data enters the system, prior to being processed in the computation subsystem, it needs to be properly aligned with the corresponding location s white and dark matrix values. In order to accomplish this, the matrix values are pulled from memory sequentially in the same order that camera data is received. It is stored in such a way that this order can be accessed sequentially in the memory. By doing so, bursting access, reading multiple address locations in multiple sequential memory accesses, can be used to take advantage of the structure of DRAM. A pre-buffering system is in place to hold the bursting read data and enable alignment with the data

33 25 that is coming in faster than the DRAM can be accessed, if each location were to be read individually. The white and dark values for each location are stored together, requiring only one memory access per location. This was done to facilitate ease of access and use as both values need to be aligned with the incoming data. Additionally, the bus between DRAM and the FPGA can sustain signals of the width needed to accommodate each of the data values and the location (see Table 3.1). This buffer s output is made available to the camera interface to enable the data alignment with incoming values. Table 3.1: Information Associated With a Pixel in DRAM ZERO PADDING 0 31 LOCATION 0 31 DARK 0 31 WHITE 0 Computation Unit The pixel and object classifications are done in the computation unit. This block consists of the normalization step as well as the inner product engine to complete the classification. It also compiles a full object classification, sorts the results, and makes a decision at the end. The system is using linear regression to classify the pixels, as was introduced in Section 3, Logistic Regression Algorithm. Presented below is a detailed explanation of the block to classify the pixels, then subsequently the objects. Pixel Classification In order to perform the logistic regression calculation on the incoming pixels, there are a number of parallel blocks implemented. The first is the normalization which performs the calculations in Equations 3.1 and 3.2 on the incoming data in parallel. At this step there is a DSP block per calculation step per parallel data channel. The mean and standard deviation values are stored in on-chip RAM for easy access. The output of the normalization is passed to the inner product blocks.

34 26 Figure 3.7: Block diagram of the computation subsystem. There is an inner product block for each class within each parallel channel. This corresponds to NUMBER OF CLASSES*NUMBER OF PARALLEL CHANNELS DSP blocks, as each inner product requires only one DSP unit. For this prototype design, that means there are 16 5 = 80 parallel DSP blocks used for the inner product. The class coefficients are located in memory blocks for each class, addressed by band number. At the end of each pixel, the results of the inner product blocks for each parallel channel are added together to have one result per class. The output, then, is a vector of probabilities designating how likely it is that the pixel belongs to each class. This information is stored in on-chip memory for access by the user, as well as being passed to the object classification subsystem. The computational complexity of this classification is found by analyzing the number of operations that could be happening at any one time. Once the system is fully in operation, all of the normalize DSP blocks and inner product blocks can be running at once. Assuming this is the case, the performance of the pixel classification when running on the 70 MHz data clock with a 210 MHz operation clock is 4.55 GFLOPS. Concurrently, the on-chip memory bandwidth can be analyzed for each of the instantiated memory blocks. Taking in to consideration the blocks for the

35 27 means, standard deviations, classes, intercepts, and results, the total on-chip memory bandwidth is 44.8 GBytes/s. Much of the data pulled from these memory blocks is actually not used, but is required to fulfill acceptable memory port width ratios. This number was found using the 70 MHz clock rate that is used to read or write to the memory on the processing side of the system. Object Classification In classifying the full object, data is utilized from the monochrome line scan camera as well as the pixel classifications obtained using the hyperspectral data. The line scan camera is taking images at 80,000 frames (lines) per second (fps) and the Arria V FPGA is performing calculations to find the location of an object. The line scan line number and pixel number are translated into the corresponding hyperspectral numbers on the Arria V to prevent transmission of numerous repetitive entries. This translation is done because the monochrome pixels are at a much finer resolution than the hyperspectral pixels. Information about the object s location is transmitted to the computation block including the line number, an object number, and the beginning and ending pixel in that line that defines the object. A record is kept in the computation unit on the Arria 10 of object locations, which is used to accumulate pixel classifications within each object. After an object disappears from the scan line, the overall results are used in a lookup table to determine if the highest level classification probability is good or bad, though future systems could look at the accumulated class probabilities of all the classes to make a decision. The decision to eject the object is made off of this lookup and sent to the Arria V, which is also controlling the air jets. The final sorted results are made available to the HPS through a streaming process which feeds a modular Scatter Gather Direct Memory Access (msgdma) block that will write streaming data to SRAM belonging to the HPS.

36 28 Since not every pixel will need to be accumulated into an object, and pixel results do not show up every clock edge, there is only one DSP block implemented in this section. If it were constantly in operation, it would achieve a performance of 70 MFLOPS. The memory block that holds the accumulated results for each object has a theoretical maximum bandwidth of 4.48 GBytes/s. This is theoretical because, as with the adder, there will not be a memory access on every clock period due to the intermittent nature of the data that will be accumulated. The VHDL code for operation of the computation subsystem can be found in Appendix B. These files are computation unit.vhd regression.vhd normalize.vhd channel sum.vhd sort.vhd object tracking.vhd FPGA Interface In order to communicate between the two camera subsystems, a high-speed (6 Gbps) serial interface was designed to connect the Arria V and Arria 10 FPGAs. The monochrome line scan camera is connected to, and its data is processed on, the Arria V while the hyperspectral camera is connected to the Arria 10 where the inner product computations for classification occur. In order to make a full object classification, the information about each object s location needs to be passed from the Arria V to the Arria 10 and the ultimate decision to keep or discard each object is sent back from the Arria 10 to the Arria V, which also controls the air jet system. One reason for

37 29 Figure 3.8: The connection between the Arria V development board (top) and the Arria 10 development board (bottom). the two separate boards is the availability of FPGA Mezzanine Connectors (FMC) on the development boards. The monochrome line scan camera requires two Camera Link cables between the Arria V and the camera. The FMCs are connected to the FPGA in such a way that both connectors are required to connect all the desired signals for the camera. There are only two FMCs on the Arria V development board, and consequently, both are used for this camera. The hyperspectral camera also uses an FMC to connect to the FPGA. Since the Arria 10 is the larger device and includes hardened floating point, it is imperative that this board is used for connecting to the

38 30 hyperspectral camera. Since a daughter card has been developed such that all the correct signals are routed to one FMC for the monochrome camera, both cameras could be connected to the Arria V but not the Arria 10 as the available board only has one FMC. The Arria 10 board currently contains an engineering sample (preproduction) version of the Arria 10 since we have not been able to get a production evaluation board of the Arria 10. The production evaluation board will have two FMCs and at that time, the full design can be ported to one board, provided the logic will fit on the device, eliminating the need for the communication interface. Unknown at this point is if the Arria 10 has enough resources to fit the full design or if two Arria 10 FPGAs will be needed. The communication interface between the two FPGA boards is through SMA connectors connected to coax cables that use the high-speed transceivers in each of the FPGAs. Using the SerialLite2 protocol, the transceivers can establish a link and transmit data. SerialLite is a communication protocol that is particularly good for high-speed serial communication and has less overhead than other serial protocols. The protocol includes CRC checking as well as optional scrambling/descrambling of the data. It can also be used with multiple receivers and a broadcast mode, if desired. Though the SerialLite2 IP core provided by Altera is not yet readily available for the Arria 10, it is indirectly supported. Since SerialLite3, which is available for the Arria 10, is not compatible with SerialLite2 due to different encoding schemes and packet structure differences, the SerialLite2 core was needed to be able to communicate between the two boards. In addition to the SerialLite2 core, the native transceiver phy cores are used for their respective devices. These implement the PMA (physical medium attachment) aspect of the transceivers as well as some of the PCS (physical coding sublayer) and handle the physical transmission of the data.

39 31 The SerialLite2 core then sits on top of the native core and handles additional PCS tasks of transmission, such as providing a CRC for the data. In order to set up the reference clock, which is required to be MHz to be compatible with the general transceiver accepted reference clocks, a programmable clock was needed. At first, this was done using the programmable oscillator on the Arria V development board, which provides the reference clock to the transceivers that are connected to the SMA connectors on the board, and is programmable over I2C from the Max V CPLD controller. After the FMC breakout board was completed, in order to provide additional SMA connectors for testing of the transceiver channel, the clock generator on the breakout board had to be programmed over I 2 C from the FPGA to generate the desired frequency clock. The clock generator chosen for this purpose has four one-time programmable (OTP) configurations, so that the correct frequency can be loaded on power-up. After programming the volatile RAM with the desired values, they were burned into a configuration on the OTP memory of the generator and subsequent projects need only enable the output of the clock generator to get the desired frequency. This made it much easier to ensure the right frequency was available at transmission time, rather than programming the clock each time the power was cycled. On the Arria 10 pre-production development board, the FMC breakout board can be used, however the reference clocks connected to the clock generator outputs do not connect to the reference clock inputs on the same bank as the populated transceivers. One of the reference clocks on this bank is provided by a programmable oscillator on the development board that has approximately 10 clock outputs. Instead of programming yet another oscillator, the SMA clock outputs were found to provide a MHz clock that can be transmitted over SMA to one of the receivers to be used as a reference clock on the breakout board.

40 32 Performance A significant benefit of this system is the increase in performance from the previous method of processing. Previously, Resonon has been using a camera with a frame rate of 140 fps with a spatial resolution of 640 pixels and a spectral resolution of 240 bands. This system under development is comprised of a 500 fps camera sensor (full resolution) running at 2000 fps (partial resolution) with a spatial resolution of 1024 pixels (reduced to 256) and a spectral resolution of 160 bands. A large increase in spectral bands was neither needed nor desired by Resonon because they have found that the data becomes redundant and unuseful after a certain point. With a clock speed of 70 MHz on the computation side and approximately 157 cycles to classify a pixel, this means that it takes only 1.57µs to compute the classification for a full pixel. With 1024 pixels, this hyperspectral computation takes 1.6ms per frame (i.e. line scan). The monochrome line scan camera can run up to 80,000 lines per second, and the transmission rate between the two boards is at 6250 Mbps. With 54 bits needed to represent the information per object per line, objects are transmitted at a rate of MHz. When packaged in 32-bit data words and including start and end packets, the transmission is still accomplished in 20.48ns. Since the monochrome line scan calculations can run on an 83.5 MHz clock, it is able to keep the transmission buffer full (i.e. calculations take place faster than they are needed), but not overflowing since there will not be objects found on every clock edge. The decision on the hyperspectral side is made using the 70 MHz clock that the computation unit runs on, so there may be some dead spots in the return transmission. Upon receipt of the object information, the line and pixel numbers are stored using the object number in an array updated with each transmission to note where objects are on the line.

41 33 IMPLEMENTATION DETAILS Programmable Oscillator In order to achieve an accepted clock frequency for the transceiver reference clock on the Arria V board, a clock generator had to be programmed. The first iteration involved programming the programmable oscillator, a Si570 device from Silicon Labs, provided on the development board. In order to achieve the desired frequency of MHz, 6 of the available registers are required to be programmed via the I 2 C lines which are connected to the MAX V CPLD system controller that is also on the board. The oscillator does not have persistent memory, so it must be reprogrammed after every power loss to consistently have the desired frequency on every run of the device. This can be arranged by programming the Max V to run the I 2 C code as part of the device configuration. Programming the oscillator requires knowledge of the current frequency and register values as the calculations for new values are based off the current configuration. These default values and the new values were obtained using the Clock Controller GUI provided as part of the board test system from Altera. Registers The device has two sets of identical registers, one set for devices with 20 or 50 ppm temperature stability, and the other for devices with 7 ppm temperature stability. The oscillator provided has 7 ppm temperature stability, 20 ppm total stability as determined by the part number. The critical values needed to program the registers are the output divider values (N1 and HS DIV ) and the crystal frequency multiplication ratio (RF REQ). The output dividers are found by changing the existing values as little as possible, but keeping the digitally controlled oscillator (DCO) frequency within the acceptable range of operation. The factory default is

42 34 a 100 MHz clock with divider values and DCO frequency as shown in Figure 4.1. Using the GUI, the necessary values to program were easily obtained as shown in Figure 4.2. Though provided by this tool, they could also be found using a couple of equations, which were utilized in the MATLAB script created to print the VHDL constants for the programming of the registers. Based on the required values, all the registers needed to be programmed with values as shown in Table 4.1. The steps to derive these values are in Equations 4.1, 4.2, and 4.3 [15]. The RF REQ value is a 38-bit number with 28 decimal places, so is divided by 2 28 to achieve the correct decimal value prior to performing the calculations and multiplied by 2 28 at the end in order to shift the decimal accordingly. The values for HS DIV and N1 are chosen from a selection of allowed values with the goal of minimizing the DCO frequency (f dco) within an acceptable range, and also achieving the lowest possible N1 and the highest HS DIV. fxtal = (f0 HS DIV N1)/RF REQ (4.1) fdco = f1 HS DIV N1 (4.2) RF REQ = (fdco/fxtal) 2 28 (4.3) Table 4.1: Register settings for Si570 Register Number Old Value (Hex) New Value (Hex) A C3 15 BC B7 17 EE 0C 18 FA D9

43 35 Figure 4.1: Factory Default Clock Register Settings for Si570 Figure 4.2: Preferred Clock Register Settings for Si570 In order to perform the programming of the device, an I 2 C master component was utilized, provided by Scott Larson on EE Wiki [16]. A state machine was devised to progress through each of the registers and start individual transactions with the master driver. Following each write, a stop is sent, rather than continuously writing in order to ensure that the correct register is written to each time. Since all registers are written sequentially, this is not a strictly necessary course of action and all registers could have been written in a streaming write sequence, but using

44 36 individual transactions ensures that a specific register receives the data designated for it. This also set up the state machine in a useful manner for the clock generator, which does not require programming of all registers. The code for the implemented driver can be found in Appendix B under i2c driver.vhd. Programmable Clock Generator In order to further test the transceiver communication, two sets of transceivers were required. Since the development board for the Arria V only contains one set of SMA connectors and the Arria 10 engineering sample development board does not contain any, a daughter card was fabricated to utilize the transceivers through the FMC connector, with SMA connections. In order to achieve a viable reference clock on the transceivers utilized by the daughter card, a clock generator was included on the card along with the necessary circuitry. The VersaClock 6 Low Power Programmable Clock Generator from Integrated Device Technology was chosen because it is programmable over I 2 C, it has two configurable clock outputs, and it has the option for four one-time programmable configurations stored in non-volatile memory. The one-time programmable configurations are appealing in this project because it does not require any setup once the configuration has been programmed; the required frequency will be available on power-up of the device, unlike with the oscillator on the development board. Design Decisions Many of the additional circuitry required by the clock generator is specified in the datasheet, with recommendations such as using a 25 MHz crystal, and terminations for different output configurations [17]. One of the design decision made includes the connections of the I 2 C lines and the select line, pins 8, 9, and 24 respectively as

45 37 shown in Figure 4.3. Pin 24, OUT0 SEL I2CB is used to determine whether pins 8 and 9 will be select lines for one of the four stored configurations or the clock and data lines for I 2 C communication. If connected to a pull-up resistor, they will be select lines, otherwise, they will be used for I 2 C. Consequently, a pad was placed on the PCB to enable a pull-up to be used, but it was not populated so the device could be programmed over I 2 C. After power-up, this pin also serves as a clock output, acting as a buffer for the selected reference clock [17]. Each of the clock outputs is connected to a reference clock pin for the transceivers through the FMC connector and one of them is also connected to the global clock network for use in FPGA logic, if desired. Figure 4.3: Diagram of Pin Assignments for VersaClock 6 Programmable Clock Generator [17]

46 38 Registers The VersaClock Clock Generator has registers programmable for four output clocks, despite the fact that there are only two output clocks available on the device, in addition to the reference clock output. The registers available to be programmed include settings for the internal PLL divider and output dividers, both integer and fractional. There is also the option to choose between the crystal reference and a reference clock provided by the FPGA. The pins are shown in Figure 4.3. The registers chosen to program include those for the programmable capacitors, the internal PLL frequency dividers, and the output dividers. The values for the programmable tuning capacitors were chosen based on Equation 4.4 [17] with an estimated combined stray and external capacitance of 2 pf. Several values were tested to verify the values, but there was not a large noticeable difference between any of the results, as seen on an oscilloscope, so the originally designated values were kept. In choosing the values for the PLL frequency dividers and the output frequency dividers, a voltage controlled oscillator (VCO) frequency of 1250 MHz was targeted, which is the lower bound of the desired range for the oscillator. Using this value with the known expected output frequency of MHz meant there was no fractional divider values for the PLL or the output, which means fewer registers to program in addition to a more accurate clock division. A MATLAB script was used to print out the desired register configurations and the resulting VHDL code is included in Appendix A. CL = (9pF + 0.5pF XT AL[5 : 0] + Cs + Ce)/2, (4.4) Burning a Configuration Unlike with the programmable oscillator on the development board, the clock generator has the ability to hold four non-volatile configurations. The benefit of

47 39 using a non-volatile configuration, is that the clock output is available very soon after power-up, without having to re-program the generator each time. In order to burn a configuration, all the registers in RAM were set to the desired values, the VCO was calibrated, and then the registers designated for control of the OTP were programmed to define the registers to burn and then check to be sure that the burn completed successfully. By setting bit 7 in the OTP Control register, the part will automatically load data from OTP on power-up. Utilizing the Clock Generator With the configuration needed burned into the part and automatically loaded on power-up, the only thing needed to ensure that the clock can be used by the transceivers is to enable the output and select the appropriate reference clock. For the default configuration burned, the default reference is the crystal input at 25 MHz. The enable and select signals are both driven low on pins 6 and 7. Altera IP Within the Quartus software, Altera provides many different IP blocks as Megafunctions that can be customized and dropped in a design. These make handling transmission interfaces or creating memory blocks much simpler. However, in our efforts to make the system as modular as possible, some of these had to be bypassed and implemented by hand. Fortunately, the compiler will synthesize the components and create the desired blocks even when not created in a megafunction. One of the things that require care when writing the block by hand, however, is the rules of the block. For instance, a dual port memory is very tricky to implement by hand, as it cannot have arbitrary values on either side of the block. A benefit of creating the unit within the Megawizard, is the tools will inform the user of valid

48 40 values for each of the parameters. Without this interface, users must carefully choose their values or learn of a fail when the design is compiled. This was encountered in the memory block instantiations used within the computation subsystem. In order to avoid creating multiple different memory components and also in an effort to create a modular design, a memory block component was created that instantiates the Altera altsyncram megafunction with generic parameters that can be input at the time of instantiation. This is a perfectly reasonable approach until a port width ratio is violated. Rule violations happened several times over the course of development and fixing them resulted in the creation of extra locations within the memory block that were skipped on one port and ignored on the other, but required to be there to enable the port ratios to work within the block. This is an unfortunate waste of memory but not a huge concern for the design as it stands currently. Timing Constraints The hyperspectral camera is able to produce data at 6.6 Gbps with 500 fps when using the full 1280x1024 pixel image. Having reduced the image size to 256 pixels for this application, the frame rate is up to 2014 fps. After compilation of the parallel data streams, it will be passed into the computation unit at 66 MHz. In order to stay ahead of the incoming data, the computation unit needs a base clock at least this fast, though preferably faster. Fortunately, faster is possible. The base clock in the regression unit is targeted at 70 MHz. One of the tricks in running faster than data is produced is to ensure that the blank times are not affecting the overall results, since the unit is constantly adding in new values over each pixel. Therefore, a signal was added to classify each incoming data chunk as valid or not. Using this signal, the computation unit determines whether or not the value should be added into the existing calculations. The valid signal is not, however, the antithesis to the error

49 41 signal passed by the camera block. Error is set when the incoming data is bad or the location of the white/dark matrix values does not line up with the location of the incoming data. In this case, the pixel currently being calculated is zeroed out and no incoming data is considered until the start of the next pixel. The design uses two primary clocks for the computations and classification, one with a frequency three times faster than the other. The slower is required to keep up with the incoming data rate. The triple speed clock was included when it was found that the floating point adder and multiplier each take three cycles to complete. In the original design of the inner product unit, a multiplier was pipelined with an adder, but the result from the adder was needed as an input to the multiplier for the following calculation. This was a carryover from a previous implementation which received data from each spectral bin at a time, rather than from each pixel. Once the design was changed to accommodate data arriving for a full pixel before moving on, the inner product unit could also be changed. Quartus provides a multiply-accumulate floating point megafunction that completes in four cycles. Using this, the faster clock was no longer needed in this unit and data alignment was much easier. The faster clock was kept, though, and utilized in the normalization step and combination of the parallel data for the benefit of speed. A challenge encountered in the timing requirements of the computation unit was achieving the correct setup and hold timing for each of the clocks and a maximum frequency of the clocks that is at least the desired run frequency. The Quartus software contains a timing analyzer known as TimeQuest, which will check paths and analyze timing requirements as well as providing statistics on each of the paths. It will also provide some recommendations to help close timing, when possible. With the first inner product unit design, TimeQuest found the faster clock with a maximum rated frequency of 100 MHz less than where it needed to be in order to be triple the

50 42 speed of the other clock. This issue was the primary motivation for changing the design of the inner product unit. The paths that were failing setup timing were all related to the inner product and the Chip Planner, another tool within Quartus, was used to show the paths that were being taken. In most cases, the path involved an unnecessary stop at a register before passing back into the DSP block. By switching out the adder and multiplier for the multiply-accumulate megafunction, there was a significant decrease in required paths and registers. Therefore, routing was simpler and clocks were not bouncing around nearly as much with fewer registers required. There were a few changes made in order to accommodate the new architecture, but it helped with timing immensely and functionality was verified in MATLAB. The change allowed for the faster clock to have a maximum frequency up to 100 MHz faster than its required speed and the slower clock also has a significant increase in the ceiling for its speed. The setup timing failures were also removed with the removal of the extra registers outside of the DSP blocks. In analyzing the compilation results generated by Quartus, it was found that the software was optimizing out several design-critical signals, including the data inputs which caused much of the subsequent logic to also be optimized out. After issuing a few changes to combat these optimizations, including fixing the parenthesization around signal indices and utilizing the noprune attribute, new timing errors were uncovered. This is one of the biggest tricks in working with software programs and large projects. There are limitations to sizes of the inputs, and the optimizer will remove seemingly unused signals. If the developer is unaware of these optimizations, they could be placed in a false sense of security. Fortunately, this was discovered and the work done to check if the removals were legitimate. Most of them actually were because of extra space allocated in a signal that ends up never changing or remaining unused.

51 43 The timing battle continues when the full design is compiled together. Not only is the routing more challenging, new setup timing errors are uncovered because of the routes taken. Though floorplanning was attempted, in most cases it actually prevented the fitter from being able to fit the design. This simply continued the need for timing analysis and tweaks to re-achieve minimal setup and hold errors in each of the clocks. Toolchain Fights Some of the greatest frustration in implementation of the design, was simply in figuring out how to work with and achieve the desired results from the tools utilized. Oftentimes, it was a matter of tweaking settings in the software to display what you want to see (and that is actually occurring), rather than a problem with the hardware code that is being tested. SignalTap Altera provides an internal logic analyzer to watch signals in a design that could not be reached from an external analyzer. This is helpful in debugging a design, however since the logic analyzer uses device resources in the FPGA, anytime a change is made in the analyzer, the design has to be re-compiled. Additionally, the extra resource usage could make it a challenge for large designs. In this case, it is best for modular designs when you can break out a portion to look at without requiring the full design. This was a frequent problem encountered when debugging the part of the project that communicates with the external DRAM because there was no other way to look at the signals, and the particular signals that this project is passing in and out of DRAM are 128 bits wide, so they each take up a lot of resources. The trick,

52 44 then, is to choose only the signals critically needed to be looked at and minimize the size of the overall project to be scaled up after debugging is completed. TimeQuest Timing Analyzer Altera s Timequest Timing Analyzer is both a useful tool and a nuisance. It is helpful in predicting timing and showing what the maximum achievable frequency is. However, it requires proper user input to help interpret how data is moving through the design and how different clocks are related. Given the use of a clock that was set to be three times faster than another clock and data that moved freely from one clock s domain to the other, interpretation for the tool was critical. Without it, the setup and hold analysis had a total failure through the system of hundreds of seconds. The needed input was the correct multicycle paths to tell the tool how to analyze data that crosses clock domains between the system clock and the triple-speed clock. By adding this information, the setup time error went from hundreds of seconds to twenty seconds for the whole project. Further, upon changing the inner product unit, the maximum frequency of the clocks was correctly above where they need to be and the setup time error was completely mitigated. However, upon removal of some of the optimizations that were compiling away needed registers, some of the setup timing errors returned. These occurred mostly in the line scan camera side of the project, as the object is built. It is anticipated that utilizing the clock from the transceiver block that actually corresponds to the line scan data will fix some of those errors. Chip Planner The Chip Planner utility provided with Quartus is extremely useful in visualizing where resources are being used and the proximity of certain resources to each other. It displays where each of the registers, DSP blocks, and I/O are being used for the design after a compilation. It will also show data paths and can be linked to from

53 45 TimeQuest for viewing critical paths. A useful aspect that helps with timing is the floorplanning feature. As the designer for the project, I was able to group signals together and instruct the fitter to place them co-located. In doing so, the compile time was decreased because paths were found easier and the clock speed increased. This is not always the case, though. When a floorplanning technique from a project containing only the regression step was applied to the project containing the full computation unit, the fitter was unable to fit the design. This is likely due to the significant increase in size of the overall project, so the additional resources and paths prevented the use of the same techniques for fitting as were utilized in the smaller project. Nevertheless, even grouping a small portion of the design together assisted the fitter in finding placement for the whole design sooner and in a more efficient manner than if it were to do it itself without clues as to the grouping. Grouping for floorplans was particularly helpful in this project because of the way the design is implemented. Due to the many generate statements used for working with the parallel channels, it is helpful to the fitter to define what data is moving through each path since as the user, that should be clear. In doing so, the fitter is able to try and place the appropriate signals and data paths in proximity to related paths. A fitted floor plan for the production Arria 10 development board is shown in Figure 4.4. MATLAB The design was verified using the HDL Verifier toolbox available within the MATLAB tool set. In order to use this with the floating point computational blocks, there are a few specific files that need to be included in a particular order to ensure correct compilation with all libraries able to be located. Using this method, in conjunction with ModelSim cosimulation, was useful in verifying the design, but not easy to figure out at first. The HDL verifier is particularly useful in large projects

46 Figure 4.4: A fitted floor plan in the Arria 10. such as this because it will provide inputs to the system and the outputs can be compared with MATLAB calculations for easy verification.

54 46 Figure 4.4: A fitted floor plan in the Arria 10. such as this because it will provide inputs to the system and the outputs can be compared with MATLAB calculations for easy verification. However, it was also useful to have Modelsim running the design because it was sometimes easier to follow the data path through each of the signals visually, rather than trying to pull out the right information on the outputs. This was also a way to track internal signals without having to port them out. Verifying with MATLAB is an exercise in making sure that the functionality of the design is fully understood. It has to be programmed in both the MATLAB language as well as the hardware that you are testing. Because it is user code testing user code, it is important that the desired functionality is fully understood and the MATLAB code is believed to be correct. It is often necessary and useful to do a couple of iterations by hand in order to assure oneself of the working nature of the MATLAB program. When this does not happen, the debugging process is infinitely more frustrating. This was experienced in testing the regression system. MATLAB was used to provide the inputs and upon receiving an interrupt, it read the outputs from the registers and wrote them to a file. The results in this file were compared

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs April 16, 2009 John Wawrzynek Spring 2009 EECS150 - Lec24-blocks Page 1 Cross-coupled NOR gates remember, If both R=0 & S=0, then