
Machine learning has become an integral part of many of the cloud services we use daily, such as Google Assistant and Apple Siri. The neural networks behind these services typically run on high performance computing (HPC) nodes with GPU hardware accelerators. Emerging applications such as autonomous cars and drones combine machine learning and computer vision, and they require a high-throughput, low-latency platform that still meets strict size, weight and power (SWaP) restrictions. These applications are particularly well suited to FPGA hardware accelerators.

Updated for Vivado/Petalinux 2017.1.

GPUs vs FPGAs

So why go through all the trouble to target an FPGA over a GPU? Below is a typical vehicle sensor processing system using a traditional CPU-GPU SoC, such as an Nvidia Tegra, and a Zynq FPGA SoC developed with the reVISION toolchain from Xilinx.

CPU-GPU SoC vs FPGA SoC.

Latency

Take for example the vehicle’s automatic braking system. The latency induced by transferring data from the sensor processing in the GPU to the critical decision making in the CPU can result in a significant delay in applying the brakes. By removing the CPU from the datapath and embedding the sensor processing, critical decision making and output control in the FPGA fabric you can achieve significantly better, and deterministic, results.

Autonomous Vehicle Braking.

Performance Per Watt

Not only can you achieve significantly lower latencies but in many applications the performance per watt is far better as well.

Performance Per Watt.

It’s all about the Floating Point

When it comes to GPU vs FPGA neural network performance, the issue, at least for now, is floating point operations. FPGAs compute floating point operations in embedded DSP blocks, and today even the biggest, fastest FPGAs can’t compete with GPUs in that respect. However, according to recent research such as FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, “convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values”. This is great news for FPGAs, especially when the trained parameters of a Binarized Neural Network (BNN) can fit into on-chip memory, further reducing latency and increasing performance per watt.
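To see why binary values suit FPGA fabric so well, consider a dot product between binarized weights and activations. If +1 is stored as a 1 bit and -1 as a 0 bit, the multiply-accumulate collapses into an XNOR followed by a population count, which maps onto simple logic rather than DSP blocks. The sketch below illustrates that general idea only; the function names and the sign-based binarization rule are assumptions for illustration, not code from FINN or reVISION.

```python
def binarize(values):
    """Map real values to +1/-1 by sign (assumed rule: zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a list of +1/-1 values into an int; bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s not in (1, -1):
            raise ValueError("expected only +1 or -1")
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")        # popcount
    return 2 * matches - n                 # matches minus mismatches


weights = binarize([0.7, -1.2, 0.1, -0.3])      # [1, -1, 1, -1]
activations = binarize([0.5, -0.9, 0.4, 2.0])   # [1, -1, 1, 1]
print(binary_dot(pack_bits(weights), pack_bits(activations), 4))  # prints 2
```

The result matches the ordinary sum of products over the +1/-1 values, but the hardware only ever needs bitwise logic and a counter.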

The Toolchain

Now the bad news. The toolchains for creating FPGA logic from high-level frameworks such as OpenCV, OpenVX and Caffe are in their infancy. Currently Xilinx provides the reVISION toolchain and the SDSoC high-level compilers to translate these applications into Hardware Description Language (HDL). This process typically starts with cross-compiling your application for the ARM CPUs in the FPGA SoC.

That’s why we’re here: To develop our application on the FPGA SoC, and use the FPGA as an accelerator, we need a well supported Linux development environment such as Ubuntu.

FPGA Accelerator Toolchain.

Meet the UltraZed

The suite of Zedboards has been at the forefront of the Xilinx Zynq development community since the original Zynq release, and the latest UltraZed board follows the same system-on-module (SOM) design as the MicroZed and PicoZed before it. The UltraZed, however, moves from the Zynq 7000 SoC to the new Zynq Ultrascale+ MPSoC, specifically the XCZU3EG-SFVA625.

The UltraZed Starter Kit includes the UltraZed SOM and an I/O breakout board, which provides access to high speed peripherals, JTAG, USB-UARTs and plenty of I/O including an Arduino Shield and several Digilent PMOD connectors. Also available is the UltraZed PCIe Carrier Card, which provides a high-speed FMC connector as well as a PCIe edge connector for high-performance hardware acceleration applications.

UltraZed Starter Kit.

Zynq Ultrascale+ MPSoC (ZynqMP)

The ZynqMP Technical Reference Manual provides detail about the ZynqMP architecture featuring a quad-core Cortex-A53 Application Processing Unit (APU) and a dual-core Cortex-R5 Real-Time Processing Unit (RPU) as well as an ARM Mali-400 GPU. The addition of the Platform Management Unit (PMU) and Configuration Security Unit (CSU) provides a hardware-backed, triple-redundant, real-time system monitoring capability and a much higher level of reliability and security.

Zynq Ultrascale+ MPSoC.

ZynqMP Boot Flow

The ZynqMP Software Development Guide provides detail about both secure and non-secure boot flows (we’ll be using the non-secure boot flow). The CSU is a 32-bit triple redundant processor that runs out of boot ROM. The system can access its resources such as the AES, SHA and RSA hardware accelerators as well as key management blocks but we don’t need to build any firmware for the CSU.

ZynqMP Non-Secure Boot.

ZynqMP SD Card Boot

The ZynqMP Software Development Guide provides detail about booting from various sources. We’re going to boot from the SD Card, so we need to generate a boot image, Linux device tree and kernel for the first FAT32 partition and the Ubuntu root file system for the second EXT4 partition.

ZynqMP SD Card Boot.

Let’s Install Ubuntu

The following guide was developed running native 64-bit Ubuntu 16.04.2 Desktop Linux with Vivado 2017.1 and Petalinux 2017.1. Xilinx officially supports CentOS 7.1, RHEL 6.6/6.7/7.1/7.2, SUSE 12.0 and Ubuntu 16.04. Currently the UltraZed ships with ES1 silicon and requires a license voucher provided with the UltraZed Starter Kit.

Software Environment

  1. Using the PetaLinux Reference Guide install Petalinux and the required dependencies for your Linux distribution of choice.
  2. Install the 2017.1 Xilinx SDK – This can be done separately or as part of the Vivado 2017.1 installation below.
  3. The Xilinx SDK calls gmake rather than make when building the First Stage Bootloader (FSBL) and PMU. Make a symbolic link for gmake:

Hardware Environment

Install Vivado and UltraZed board files:

  1. The UltraZed Starter Kit currently ships with ES1 devices and the Vivado Webpack only supports production silicon, so we need to install the full version of Vivado 2017.1. Make sure to check the “Engineering Sample” box during installation.
  2. Download and install the 2016.4 board definition files from the UltraZed Reference Design. Extract and follow the included instructions.

Enable ES parts in Vivado:

  1. Create a “Vivado_init.tcl” TCL script in ~/.Xilinx/Vivado/ to enable use of ES parts.

Fix the Vivado License Manager:

If your Ethernet adapter defaults to “eth0”, skip this step.

  1. The Vivado License Manager only seems to be able to read the MAC address of the eth0 interface which needs to match the MAC used to generate your node locked license. As of Ubuntu 16.04 the network adapter naming scheme has changed and no longer defaults to eth0.
  2. Edit /etc/default/grub with the following parameter: GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"
  3. Update grub and reboot:

UltraZed Development

Clone the UltraZed Development Repository and set up your environment. The required Xilinx Software Repositories (ARM Trusted Firmware, Linux Kernel and uBoot) are cloned during the Petalinux build process; the Linux Device Tree Compiler has been included as a submodule. Edit env_setup.sh to point to your Vivado and Petalinux 2017.1 installation directories.

Build the FPGA

  1. Open the Vivado project (./fpga/ultrazed_base/ultrazed_base.xpr). If the licensing and Vivado_init.tcl script are set up properly, the project should open normally, showing the UltraZed I/O breakout board on the main Vivado screen. If not, check the TCL console and license manager for errors.
  2. Opening the block design shows the ZynqMP Processing System (PS), a small block memory, GPIO for LED control and an external AXI4 Lite interface. The AXI interface is connected to a dummy register block – this is intended as a starting point for communicating with custom FPGA logic.
  3. The ZynqMP PS component was configured using the presets from the UltraZed board definition files installed earlier. The only interfaces needed are SD1, GEM3, UART0, USB2/3 and SATA. The Display Port interface must be removed to work around a kernel panic during the boot process.
  4. Any changes made to the PS configuration need to be exported into the Hardware Definition File (HDF): ./fpga/ultrazed_base/ultrazed_base.sdk/ultrazed_top.hdf. The HDF can be exported using: File -> Export -> Export Hardware. The PMU and FSBL both import settings from the HDF.
  5. Click Generate Bitstream to build the design – The FPGA bitstream will be here: ./fpga/ultrazed_base/ultrazed_base.runs/impl_1/ultrazed_top.bit

Base System Block Design.

Build the FPGA – Vivado TCL

It’s a good idea to take a look at the block design at least once, as this is what you’ll be using to make changes to the PS or PL interfaces, but you can also build the FPGA design through the Vivado TCL interface.

Build the Platform Management Unit (PMU)

The PMU is responsible for configuration and monitoring of power supplies and system resources. You can also load custom PMU firmware for built-in self-tests and more advanced system monitoring. Any build errors here are most likely due to missing 32-bit cross compile libraries (the PMU is also a 32-bit triple redundant processor).

Build the First Stage Bootloader (FSBL)

The FSBL is configured to run in lock-step mode on the Cortex-R5 (RPU) and is responsible for configuring PS I/O and the PS-PL interfaces. We’re also using it to configure the FPGA at boot as the FPGA configuration driver in Linux is not currently working.

Build the Linux Device Tree Compiler (DTC)

The DTC is a utility used for building Linux Device Tree Binaries (DTB) as well as decompiling binaries to Device Tree Source (DTS).

Build the Linux Device Tree

Using the default Petalinux device tree results in the ZynqMP Ethernet MAC driver failing to connect to the TI DP83867 Ethernet PHY and the SD1 interface timing out during initialization. We’re going to compile a patched device tree that fixes both of these issues using the DTC tool.

Build ARM Trusted Firmware, uBoot and Linux Kernel

The Petalinux tool suite is designed to generate all the necessary components of the system including a base root file system. However, since we want to use Ubuntu we’re only going to use Petalinux to build ARM Trusted Firmware (ATF), uBoot and the kernel. The Petalinux Tools Reference Guide and Petalinux Command Line Reference Guide provide a ton of valuable information about using the Petalinux tool suite.

The only modification made to the Petalinux build is to override the default “CONFIG_EXTRA_ENV_SETTINGS” in the uBoot build. By modifying ./software/petalinux_build/project-spec/meta-user/recipes-bsp/u-boot/files/platform-top.h we tell uBoot to load its default environment from the “uEnv.txt” file in the boot partition. This allows us to modify the way uBoot boots the system without having to rebuild the image.
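A uEnv.txt file is plain text with one name=value pair per line, so it is easy to generate or inspect with a few lines of script. The sketch below shows one way to do that. The helper names are hypothetical, and the variable values (console device, root partition, kernel file name) are placeholders for illustration, not the contents of the uEnv.txt shipped in the repository.

```python
def render_uenv(variables):
    """Render a dict as uEnv.txt text: one name=value pair per line."""
    lines = []
    for name, value in variables.items():
        if "=" in name or "\n" in name or "\n" in value:
            raise ValueError("invalid environment entry: %r" % name)
        lines.append("%s=%s" % (name, value))
    return "\n".join(lines) + "\n"


def parse_uenv(text):
    """Parse uEnv.txt text back into a dict, skipping blank and '#' lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")   # split on the first '=' only
        env[name] = value
    return env


# Placeholder values, assumed for illustration only.
example = {
    "bootargs": "console=ttyPS0,115200 root=/dev/mmcblk0p2 rw rootwait",
    "kernel_image": "Image",
}
print(render_uenv(example), end="")
```

Because only the first “=” separates the name from the value, entries such as bootargs can carry their own key=value options unchanged.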

Build the Peek/Poke Application

We need an application to access the logic we’ve built into the FPGA. The simplest way to do this is to memory map the desired FPGA address space using /dev/mem. This is not as elegant as a Linux device driver, and it requires root access to use /dev/mem, but it’s a quick and dirty method for testing FPGA interfaces.

As you can see from the block diagram in Vivado, all of our FPGA resources are connected to the HPM0 AXI interface on the ZynqMP. Using the Address Editor in Vivado you can see the base address for this interface is 0x0080000000.
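The memory-mapping approach can be sketched in a few lines. This is not the Peek/Poke application from the repository, just a minimal illustration of the same idea: map the page of /dev/mem that contains the target address, then read or write one 32-bit word. The function names and the path parameter (which lets the code run against an ordinary file for testing) are assumptions. A real tool would likely also want to guarantee that each access is a single 32-bit bus transaction.

```python
import mmap
import os
import struct

PAGE = mmap.PAGESIZE  # mmap offsets must be a multiple of the page size


def _access(addr, value=None, path="/dev/mem"):
    """Read (value is None) or write one little-endian 32-bit word at addr."""
    if addr % 4:
        raise ValueError("address must be 32-bit aligned")
    page_base = addr & ~(PAGE - 1)   # start of the page containing addr
    offset = addr - page_base        # position of the word inside that page
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        with mmap.mmap(fd, PAGE, mmap.MAP_SHARED,
                       mmap.PROT_READ | mmap.PROT_WRITE,
                       offset=page_base) as mem:
            if value is not None:
                struct.pack_into("<I", mem, offset, value & 0xFFFFFFFF)
            # Always read the word back so a write can be verified by the caller.
            return struct.unpack_from("<I", mem, offset)[0]
    finally:
        os.close(fd)


def peek(addr, path="/dev/mem"):
    """Return the 32-bit word at physical address addr."""
    return _access(addr, path=path)


def poke(addr, value, path="/dev/mem"):
    """Write a 32-bit word to addr and return the value read back."""
    return _access(addr, value, path)
```

On the board, and as root, something like peek(0x80000000) would read the first word of the HPM0 address space. Note that poke reads the location back after writing, which may matter for registers with read side effects.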

ZynqMP Address Map

ZynqMP Address Map.

Build the ZynqMP Boot Image

The ZynqMP Software Development Guide lists all the available configuration parameters used by the Vivado bootgen tool to create the boot image. The boot image contains the FSBL, FPGA image, PMU, ATF and uBoot images as well as configuration headers needed to load each image.

Write the SD Card Boot Partition

The boot partition (FAT32 partition 1 on the SD card) needs to include the boot image, Linux device tree binary and the Linux kernel image. Since we’ve configured uBoot to source its environment from the uEnv.txt file, it also must be copied here.

Write the Ubuntu Root File System Partition

The root file system partition (EXT4 partition 2 on the SD card) is where the kernel will look to mount and run Ubuntu 16.04.2. The root file system included here is a slightly modified version of the SquashFS image available from the Ubuntu Daily Build. The only modification was to add the “zynqmp” user group with root privileges and a Python library used for testing our design.

The wr_rootpart.sh script needs root privileges in order to maintain root permissions while extracting the Ubuntu root file system.

Write the SD Card

Now we need to partition, format and write our boot and root file system partitions to an SD card. The default partition scheme is a 100MB boot partition and the remainder of the SD card for the root file system partition (a 4GB SD card is big enough for testing).
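To make the default layout concrete, here is a rough sketch of the arithmetic. It is not taken from wr_sdcard.sh: it assumes 512-byte sectors, 1 MiB partition alignment and that “100MB” means 100 MiB, and plan_partitions and its parameters are hypothetical names.

```python
SECTOR = 512          # bytes per sector (assumed)
MIB = 1024 * 1024


def plan_partitions(card_bytes, boot_mib=100, align_mib=1):
    """Return [(name, fs, start_sector, sector_count), ...] for a two-partition card."""
    align = align_mib * MIB // SECTOR      # sectors per alignment unit
    total = card_bytes // SECTOR           # sectors available on the card
    boot_start = align                     # leave the first unit for the partition table
    boot_count = boot_mib * MIB // SECTOR
    root_start = boot_start + boot_count   # stays aligned for whole-MiB boot sizes
    root_count = (total - root_start) // align * align   # round down to alignment
    if root_count <= 0:
        raise ValueError("card too small for the requested boot partition")
    return [
        ("boot", "fat32", boot_start, boot_count),
        ("rootfs", "ext4", root_start, root_count),
    ]


for name, fs, start, count in plan_partitions(4 * 10**9):
    print(name, fs, start, count, "%.1f MiB" % (count * SECTOR / MIB))
```

For a nominal 4GB card this leaves roughly 3.6 GiB for the root file system.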

The wr_sdcard.sh script needs root privileges in order to mount/unmount, partition and format the SD card.

The “--dev” argument is required; “--part” is only required the first time, to properly partition the SD card.

Boot Ubuntu and Test

The ultrazed.py Python library was added to the root file system for testing the block RAM and LED GPIO controller built into the FPGA. It uses the Peek/Poke application to test these interfaces.

  1. Using the UltraZed Starter Kit set the boot DIP switches to select the SD card as the boot source.
  2. Connect to and observe the boot through the USB-UART.
  3. Once booted, log in with the zynqmp user account. Login: zynqmp Password: password

Test Ethernet connectivity:

  1. Ping google.com or a host on your local network.
  2. Run “sudo apt-get update” then “sudo apt-get upgrade” to update the Ubuntu installation.

Test the FPGA register block:

Test the FPGA block RAM:
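As a rough idea of what a block RAM test involves, the sketch below writes a distinct pattern to each 32-bit word and then reads everything back. ram_test, the pattern, and the peek/poke callables it takes are assumptions for illustration, not the ultrazed.py API, and the demo runs against a plain dictionary instead of real hardware.

```python
def ram_test(peek, poke, base, n_words):
    """Write a per-word pattern to n_words 32-bit words, then verify it.

    peek(addr) returns a word and poke(addr, value) writes one.
    Returns a list of (address, expected, actual) tuples for every mismatch.
    """
    def pattern(i):
        # Distinct for each index below 256, so swapped or aliased words show up.
        return (0xA5A50000 ^ (i * 0x01010101)) & 0xFFFFFFFF

    for i in range(n_words):
        poke(base + 4 * i, pattern(i))

    errors = []
    for i in range(n_words):
        addr = base + 4 * i
        actual = peek(addr)
        if actual != pattern(i):
            errors.append((addr, pattern(i), actual))
    return errors


# Demo against a dictionary standing in for the block RAM.
fake_ram = {}
print(ram_test(fake_ram.get, fake_ram.__setitem__, 0x80000000, 16))  # prints []
```

Writing the whole range before reading any of it back is deliberate: a write-then-immediately-read loop would not catch address lines that alias two locations onto the same word.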

Test the FPGA LED GPIO:

Issues

  1. The Linux FPGA configuration driver is not functional.
  2. The PS to PL reference clocks are disabled during the kernel boot process. I suspect this is intentional behavior from the PMU kernel driver in an attempt to save power. To work around this issue the FPGA uses the 300MHz PL reference clock from the I/O breakout board’s clock generator.
  3. The display port PLL configuration is not functional. During boot, with the display port enabled, a kernel panic occurs indicating that only one device can use the VPLL output at a time. This seems like an error as both the display port video and audio blocks should use the same PLL as a reference.
  4. Petalinux will occasionally get the following error: “ERROR: No space left on device or exceeds fs.inotify.max_user_watches?”. Run “sudo sysctl -n -w fs.inotify.max_user_watches=32768” to resolve it and rebuild. Also, the Petalinux tools will occasionally fail to source ‘bitbake’. Restarting your terminal session and sourcing env_setup.sh should resolve this.

More Information

  1. Xilinx Wiki – Most of this guide was based on the information found here – A really helpful resource for Zynq and ZynqMP development.
  2. Xilinx Vivado – Great resource for learning how to use Vivado.
  3. Xilinx Petalinux – Official Xilinx Petalinux page.
  4. PetaLinux Tools Reference Guide – Provides documentation on installing and using the Petalinux tools.
  5. Petalinux Command Line Reference Guide – Provides documentation on using the various tools in the Petalinux suite.
  6. ZynqMP Software Development Guide – Boot process, security features and everything ZynqMP software.
  7. ZynqMP Technical Reference Manual – ZynqMP architecture details including block by block overview of the entire SoC and low level register maps.
  8. Ultrazed Starter Kit – Documentation and support for the UltraZed.
  9. Xilinx reVISION Backgrounder – Implementation details for reVISION.
  10. Xilinx reVISION Toolchain – Real-time machine learning and computer vision based on Zynq/ZynqMP SoCs using FPGA fabric for hardware acceleration.
  11. SDSoC – Xilinx’s high-level compiler used to translate C/C++ and OpenCL to FPGA HDL.
  12. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference – A paper sponsored by Xilinx Research Labs providing a methodology for reducing CNNs to BNNs for FPGA implementation.