Content follows this message
If you have enjoyed my articles, please consider these charities for donation:

Young Lives vs Cancer - Donate.
Blood Cancer UK - Donate.
Children's Cancer and Leukaemia Group - Donate.

Designing a RISC-V CPU in VHDL, Part 16: Arty S7 RPU SoC, Block Rams, 720p HDMI

Posted Sep 26, 2018, Reading time: 13 minutes.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

It’s finally time – the big deploy onto Digilent’s Arty S7 board.

In my previous part, I went over at a high level the changes made to my TPU cpu core in order to make it consume RISC-V. The CPU itself is still very simple, and I removed some of the more interesting features from TPU such as interrupts. Interrupts as implemented on TPU would not comply to the RISC-V spec, so best they were stripped out for the time being.

After I got my RISC-V SoC up and running on MiniSpartan6+, I was looking to develop my own Spartan 7 FPGA board to use as a programmable computer kit – FPGA for Soft CPU, another for Soft GPU, a microcontroller for system management – maybe even a 3rd FPGA for chipset I/O. However I quickly came to my senses and realised just how much of an endeavour that is. Spinning my own PCBs, soldering those big BGA chips (and ultimately failing to solder those £40-a-piece chips) would be a very costly affair. It is still a long term goal, but in the meantime I wanted to get an off-the-shelf Spartan 7 FPGA board, essentially to bringup the FPGA side of an eventual move to my own development board. When I saw the Digilent Arty S7 announced last year, I kept an eye on it knowing it would be a contender for my upgrade path to the 7-series chips. The Arty S7 I ended up purchasing is the S7-50 variant, sporting the XC7S50 Spartan 7 FPGA with significantly more resources than my previous Spartan 6 board. It also had 256MB DDR3 RAM, but lacked HDMI connectivity. Thankfully, the HDMI/DVI-D output issue has been solved and you can read about that in a previous article.

I’m based in the UK, and was able to purchase the Arty S7-50 from digikey.co.uk for £119 delivered. Checking just now, it looks like the price has increased and you’d now be paying in the region of £125. It’s still a very nice board for that! So, this post is dealing with porting my existing RISC-v “SoC” to this new FPGA board.

The SoC consists of my RPU CPU, fast internal FPGA Block RAM storage, external (and slow!) DDR3 memory, my HDMI output with legacy text mode HDMI output, and finally, access to SD card storage via SPI. First, we have to tackle a new development environment.

Xilinx Vivado

With the new 7-series FPGAs comes a new set of design tools for authoring HDL and deploying to devices. As with ISE, which we used for Spartan 6 FPGAs, Vivado has a free “webpack” edition which is compatible with Arty S7. You can grab the download from Xilinx here. For clarity, I use the 2018.1 version. 2018.2 is the latest version as this is written.

The UI has changed quite a bit from ISE. In my opinion, Vivado is harder to navigate and parts of the interface seem very clunky. The general areas of interest remain; a Project Manager “flow” holds the source hierarchy, and the Simulation, Synthesis, and Implementation flows hold the respective details and commands for those aspects of the project.

To create projects for the board, we will need the board definition files which are provided by Digilent. There is a small guide on how to download and install them, available here. If you already have Vivado installed, you want to skip to section 3. Digilent also have a demo project available with your usual flashy LED hello world functionality, if you want to start very simple. For the rest of this article however, I’m jumping straight into the project.

Vivado base Arty S7 project

With our RPU core interface already defined, we need to just import the VHDL source files into our project and begin a new top level design which incorporates the core. As a reminder, the RPU core entity is shown below, and it still heavily resembles the old TPU core.

The top level component will have various input and output definitions. We require the following:

Clock input
Switch input
LED output
TMDS DVI-D HDMI output
SPI input/output
DDR3 memory input/output

We will leave out SPI and DDR3 definitions for now. With this, our definition is as follows:

entity rpu_top is
Port (
-- Input 100MHz clock
CLK100MHZ : in STD_LOGIC;

-- Input switches from board
sw : in STD_LOGIC_VECTOR (3 downto 0);

-- Output leds to board
led : out STD_LOGIC_VECTOR (3 downto 0);

-- HDMI (DVI-D) video output
hdmi_out_p : out STD_LOGIC_VECTOR(3 downto 0);
hdmi_out_n : out STD_LOGIC_VECTOR(3 downto 0)
);
end rpu_top;

We need to ensure these signals are mapped to the pin constraints of the Arty S7. My board is a Rev. B board, so I made a copy of the respective .xdc file provided with the Digilent board files and exited it to point to my named signals. If you’re familiar with Xilinx ISE, this is like the old .ucf constraints file from my TPU CPU project. An example is below, with LEDs, Clock and the HDMI output defined. With HDMI we need to ensure the signal standard is TMDS_33. This is the definition required to map to my simple HDMI Pmod connectors.

Block RAM

The next thing we need to do is figure out some block ram memory. The block ram primitive objects have changed in the Spartan 7 series FPGAs and are larger. Unlike the MiniSpartan6 implementation, where I manually initialized the block rams and added additional switching logic, I am now using the Xilinx IP Block memory generator. This is available in the free Webpack version of vivado and allows for generation of a block ram object of your own data width and memory size. Internally, multiple block ram primitives will be combined into a single interface. I want 64Kbyte block rams, which are made up of 16 smaller 4Kbyte hardware block rams. With a 32 bit data interface, this Block Ram Generator saves us a lot of work. It also allows for a initialization file to be provided, so the block ram can have a defined contents at reset. This allows us to have our bootloader present in memory for bootstrapping the SoC.

Another option in the block memory generator which we must ensure is unselected is “common clock”. We want our block ram to be a true dual-port ram, with separate clocks for each. This allows the rams to be connected both to the CPU core for read/write, and also another system, such as our HDMI character generator for use as text console storage – running at a pixel clock rate, instead of the CPU clock rate.

With this, you can create the interface via the GUI and generate an HDL wrapper.

I created a wrapper so I could edit the source and ensure the data out ports were tri-stated when the block ram was disabled, for easier plumbing of multiple block rams together.

With our block ram wrapper available, we can connect this to memory interface of the RPU core. This is fairly simple, and we can attack multiple block rams by enabling them and muxing data lines depending on address bits. Because we tri-state the data output bus when the block ram is not selected we should be able to use a single output for all block rams, but for now I assign a signal to each output and explicitly select the one we need.

With the above, we should have 196Kbyte of total addressable RAM. However, there is a significant issue which needs attention before any code will run from these BRAMs. RPU currently expects data presented to it to be formatted into the expected data format. The memory interface does not have a byte enable or such like, as you would typically expect. So memory requests need swizzled for endianness and size prior to being given to RPU. This system will be getting a significant overhaul soon, which will change all of this so RPU is presented with simply a raw view of memory – but for now, we need to swizzle data from the BRAM before it enters the CPU. To do this, there are two additional processes, and the signals assigned by these processes are what is read from or written to the CPU interface.

There is one more process associated with memory, and that handles the state machine for handling the requests from the CPU. It also will assign various I/O data. It simply maps addresses to signals. I think this process can be implemented better, but for ease of editing for tinkering it works well for now.

Getting some code running

At this point, you should be able to write your “chasing LED” program and have it as the BRAM initial contents, so when the Arty S7 board is flashed with the FPGA bitcode you will see the onboard LEDs flash.

Wayhey! My RISC-V cpu is running a simple led test on Arty S7-50 - at 266MHz! pic.twitter.com/O6jqfRYbHk
— Colin Riley 🎗 (@domipheus) July 18, 2018

The code for this is just as simple as you’d expect.

{
    unsigned int i = 0u;
    volatile unsigned int* ui_addr_leds = (unsigned int *)IO_ADDR_LEDS;

    while (1)
    {
        *ui_addr_leds = (i++)>>18;
    }
}

IO_ADDR_LEDS above is defined to be 0xf0009000, so the MEM_proc process you can see earlier picks up this memory write and redirects the lowest significant 4 bits to the external LED I/O pins.

I previously mentioned that I’d build my own RISC-V toolchain using Windows Subsystem for Linux. I attempted to build the latest version of riscv-tools, however I kept running into build issues this time around. I instead have switched to using the GNU MCU Eclipse RISC-V Embedded GCC toolchain which is very handily released as a full windows binary package. It can be obtained from the github releases project page. A basic main() function with the above code is compiled with a linker script to place it at location 0x00000000 with no standard libraries or start files. The resulting elf binary is tiny, and you can use objdump with the -s argument to get part of a hex dump output which you can then transform into the .coe file required by the Xilinx block ram generator to use as the initial BRAM contents. The .coe file format is simple, and consists of two declarations – the radix of the data to follow, and a comma separated vector containing the data itself.

memory_initialization_radix=16;
memory_initialization_vector=
37810000, 1301c1ff, ef005074, 6f000000,
13060500, 13050000, 93f61500, 63840600,
...

I use this method to create the real bootstrap firmware which initializes and ends up copying code from SD card into the DDR3 ram for execution – but more on that in the next post! I was also able to use the new simulator in Vivado to check internal signals while some code ran.

Taken longer than anticipated, but finally worked out Vivado and the IP designer to get chunky Spartan 7 block rams integrated into my RPU simulation. My RISC-V core worked immediately, which was nice to see! pic.twitter.com/z3Q3yrHj2H
— Colin Riley 🎗 (@domipheus) July 12, 2018

One thing that caught me out is that changing the .coe initial BRAM contents file and rebuilding the project will not bring the changes from that file into the new BRAM IP. You need to right click the BRAM in the designer, select reset output products, and then generate them again for the updated .coe to be integrated. A rather annoying, slow and unnecessary step in my opinion – but maybe there is a reason for this that I do not understand as yet.

HDMI output

Flashing LEDs are cool, but we have a character generator to port! My previous miniSpartan6 design ran HDMI out at 640×480, using a widely available USB powered HDMI panel targeted at Raspberry Pi use. With ArtyS7, I wanted more resolution, and have defaulted to outputting 720p60. The changes to allow this on the DVI-D side of things are minimal – pixel clock updated, and the VGA timing signals for 1280×720 at 60Hz are used.

The pixel clock for 720p should be 74.25MHz, but the much-easier to obtain 75MHz will generally still work. The previous character generator (discussed in this blog post) works off of a 5x pixel clock – which in this case would be 375MHz. This is too high – the Spartan7 block rams of the Arty S7 are rated for around 350MHz – so we need to architect the character generator to run off of the raw pixel clock of 75MHz. This is actually fairly simple to do. The character generator works by feeding in an X and Y position of the current VGA timing location as it’s scanned out. If we offset the timing locations, by putting the various sync signals through an 8-deep FIFO, we can feed the character generator pixel values which are 8 pixels early – allowing the generator 8 cycles of latency in order to perform the necessary text and font lookups from BRAM. As the font glyphs are 8 pixels wide, we can prefetch the next glyph data. By the time the sync signals are at the end of the FIFO, the character generator will be providing the correct pixel colour values for the glyph row required.

The new pipeline for the character generator is as follows:

This provides a 160×45 text console display with 8×16 glyphs. To connect the FPGA dev board to an HDMI monitor, I took what I learned from my previous HDMI pmod post and made my first PCB using PCBWay. Then I soldered an surface mount HDMI connector and the PMOD 0.1″ angled header terminals. I may post a video of this in the future. I tinned the surface pads with a soldering iron and used a hot air gun with lots of flux to reflow the connector to the pads.

My first PCB works all soldered up. Knew that £25 hot air solder station would come in handy one day! Digilent FPGA High-Speed PMOD to HDMI! pic.twitter.com/4RBabPG9Zw
— Colin Riley 🎗 (@domipheus) June 11, 2018

It’s a very useful converter! And for my first attempt at soldering 0.5mm pins, a successful first PCB 🙂

I am using an Atrix Lapdock as my HDMI sink – it’s a laptop form factor with HDMI input for the screen, and a USB hub with integrated keyboard, trackpad and battery. Again, this is usually used to make raspberry pi laptops, as the Lapdock itself can provide 5v power to devices as well as powering the screen. So with this, I have a RISC-V Laptop!

Tested it with my miniSpartan6+ RPU cpu. The Lapdock has a scaler(!), so it's automatically up-scaling the FPGA HDMI output to the 1366x768 panel. I don't have USB yet, so they keyboard/trackpad doesn't work. The lapdock internal battery should power the FPGA board fine. pic.twitter.com/iK5hT6iE0d
— Colin Riley 🎗 (@domipheus) July 18, 2018

That is it for now – I intended to discuss the DDR3 memory in this post, but it just got too long. That post will follow shortly.

I have put the RPU CPU Core HDL on Github, as well as the Arty SoC project. This code is in front of these blog posts and includes the DDR3 implementation, so if you are impatient you can go and look now. There are timing constraint violations introduced with the HDMI output and DDR3 IP implementation, but I have yet to look into them – and the built FPGA bitfile flashes to my board and runs at a lower speed.

Thanks for reading! As always, I am available on twitter @domipheus for any queries. If you try out the SoC from github, let me know!

Domipheus Labs

Stuff that interests Colin ‘Domipheus’ Riley