
Designing a RISC-V CPU in VHDL, Part 22: Doom as a benchmark and adding Cache to RPU

Posted Jan 9, 2022, Reading time: 17 minutes.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Seems it’s a rite of passage for any project of this kind that Doom needs ported to it. For me, I wanted it as a more involved benchmark - the likes of Dhrystone that gets used everywhere is pretty boring. Why have that when you can publish a Doom timedemo score for your FPGA CPU? :)

Initial porting duties

I searched for folks who had already built Doom for a RV32 target, and found the RISC_666 emulator on github. This emulator includes a version of Doom with an API exposed for the framebuffer and input handling, so the graphical output can be seen. This was very useful for me, as it meant I could quickly hook up some custom functionality for how Doom would render its scene on our FPGA. Getting the source to build with my RV32I toolchain was fairly straightforward. As this work started before I had hardware integer multiply and divide, I leveraged LLVM’s compiler-rt project to get fast software implementations of them, and hooked up the fopen/fread functions to operate on the same SD card where the bootloader existed.

File I/O was the first major issue. We will use the shareware Doom distribution, Doom1.wad, which comes in at 4MB. I hooked up Petit FatFs to read the SD card - a library already used in the bootloader. The issue is that it targets resource-constrained systems, and as such is really very slow. Reading the 4MB file took several minutes, made worse by Doom constantly seeking, reading, then seeking elsewhere. The solution is to read the entire file into the DDR3 memory space once, after which seeks and reads are simply a memory offset and a load. We now have a delay at the start as the whole file is read, but this is followed by incredibly quick subsequent fread() operations. We are now at the point where we need a framebuffer to display the video output.
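A minimal sketch of that approach is below - the function names and the DDR3 staging address are hypothetical, and the real port hooks Doom’s own fopen/fread wrappers rather than using new entry points:

#include <stdint.h>
#include <string.h>

#define WAD_STAGING ((uint8_t *)0x40000000) /* hypothetical DDR3 address */

static uint32_t wad_size; /* set by the one slow Petit FatFs read at startup */
static uint32_t wad_pos;  /* current "file" offset */

/* After the one-off load, a seek is just an offset update... */
int wad_seek(uint32_t offset)
{
    if (offset > wad_size) return -1;
    wad_pos = offset;
    return 0;
}

/* ...and a read is a straight copy out of DDR3 - no SD card involved. */
uint32_t wad_read(void *dst, uint32_t bytes)
{
    if (wad_pos + bytes > wad_size)
        bytes = wad_size - wad_pos;
    memcpy(dst, WAD_STAGING + wad_pos, bytes);
    wad_pos += bytes;
    return bytes;
}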

(It should be noted that I had many more issues to solve before Doom ran, but they were mostly dumb compiler things that I don’t think are interesting with respect to this post - so I’ve omitted them.)

Mode 13h

Original Doom uses Mode 13h, which is the 256-color, 320 x 200 resolution mode available on VGA graphics hardware. With each pixel represented linearly by a single byte indexing into the 256-color palette, we need a ~64KB area of memory for the framebuffer itself (320 x 200 = 64000 bytes), plus storage for the palette.

With the RPU SoC set up to output 720p60, my plan is to have the top-left part of the screen display this graphical mode at a 2x scale - so one pixel in the 320x200 buffer represents 4 in the output, taking up 640x400 of the 1280x720 screen resolution. This lets me see things a bit clearer, allows us more cycles to fetch our data from the framebuffer/palette, and still leaves plenty of the text mode display available for printing other important information whilst running Doom.
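As a model of the 2x-scale address mapping the display readout performs (an illustration in C, not the VHDL itself):

/* Map a 720p output pixel back to its source byte in the 320x200
   linear framebuffer. The viewport is the top-left 640x400 of the
   1280x720 frame; halving each coordinate gives the 2x scale. */
static inline unsigned vmem_index(unsigned out_x, unsigned out_y)
{
    return (out_y >> 1) * 320 + (out_x >> 1);
}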

I added a memory-mapped flag for enabling the video memory (VMEM) area, so it’s only active when used - otherwise it would mask some output from the text mode buffers. I also added a memory-mapped register which sets a location in the 256-color palette; my implementation simply holds the palette in a VHDL array. This meant we could set the palette colors from inside the relevant Doom source by writing a 32-bit value to that location in memory.

int av_set_palette(struct av_color *palette, int ncolors)
{
	// write each entry to the palette-set MMIO register
	volatile unsigned int* palette_mmio = (volatile unsigned int*)0xf000c120;
	for (int c = 0; c < ncolors; c++)
	{
		// format is 32-bit comprising 8 bit values Index, R, G, B
		palette_mmio[0] = ((uint32_t)c) << 24 | ((uint32_t)palette[c].r) << 16 | ((uint32_t)palette[c].g) << 8 | ((uint32_t)palette[c].b);
	}
	return 0;
}

Then, I just set the current screen buffer in Doom to the start of my VMEM block ram area. It’s set up as a two-port block ram - one port making it readable and writeable from the CPU, the other running at the pixel clock for reading out to the display.

The display readout always runs, using the X and Y of the VGA generator to pipeline requests for the subsequent pixels, taking into account the additional latency required to do the palette lookup. The VMEM enable flag simply selects whether a pixel in the “VMEM viewport” of the 720p output comes from the text generator or from this VMEM display readout.

-- re-adjust vmem area so it shows correct pixels taking into account vmem and palette fetch latency
VMEM_AREA <= '1' when (
    (unsigned(pixel_h_pref(11 downto 1)) < 328) and
    (unsigned(pixel_h_pref(11 downto 1)) >= 9) and
    (unsigned(pixel_v_pref(11 downto 1)) < 201) and
    (unsigned(pixel_v_pref(11 downto 1)) >= 1)
    ) else '0';

VOUT_R <= VMEM_R when VMEM_ENABLE = '1' and VMEM_AREA = '1' else TXT_R;
VOUT_G <= VMEM_G when VMEM_ENABLE = '1' and VMEM_AREA = '1' else TXT_G;
VOUT_B <= VMEM_B when VMEM_ENABLE = '1' and VMEM_AREA = '1' else TXT_B;

Our memory map additions for this VMEM are as follows.

size  type       name              address
64KB  Block Ram  Video MEM (VMEM)  0x30000
4     MMIO       palette           0xf000c120
4     MMIO       VMEM Enable       0xf000c110

Trying Doom!

At this point, we got graphical output. This was quite an exciting moment. I’m a graphics engineer in my day job, and I’ve always thought I was drawn to that line of work due to the immediate feedback you get from viewing the result of your code run in real time. This was certainly pushing the definition of real time, however. It was running very slowly, not even 1 FPS, and you could see what the engine was working on rendering at any given moment - the walls filled into view, the floors and ceilings, then the enemies and other sprites, before it started the next frame of rendering. The following video tweet showed the effect, although when viewed on-device it looks much more glitchy!

What we needed to make the rendering more natural was to write to an intermediate buffer, and then copy the completed image at once to the VRAM memory.
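A sketch of that completed-frame copy, assuming the VMEM base address from the memory map above; the hook name here is illustrative, while screens[0] is Doom’s usual composed back buffer:

#include <stdint.h>
#include <string.h>

#define VMEM_BASE ((uint8_t *)0x30000)
#define SCREENWIDTH  320
#define SCREENHEIGHT 200

extern uint8_t *screens[5]; /* Doom's framebuffers; [0] is the composed frame */

void av_finish_update(void)
{
    /* copy all 64000 bytes at once, so a partially rendered frame is
       never visible on the display */
    memcpy(VMEM_BASE, screens[0], SCREENWIDTH * SCREENHEIGHT);
}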

The timedemo demo3 score was “2134 gametics in 93812 realtics”, or in more common terms: 0.8fps (Doom ticks 35 times a second, so that’s 35 x 2134 / 93812 ≈ 0.8 frames rendered per second).

So, why is it so slow?

It’s a slow CPU/SoC design - whilst it is running at 100MHz here and can go higher, each instruction can take a significant number of cycles to execute - and when executing from DDR3 memory, significantly more cycles are required for memory operations, given the additional request latency. Looking at an instruction trace of Doom running (see my post here on how this is implemented), well over 50% of all cycles are spent in the memory stage, waiting.

Cache

Thankfully, we already know of a technique we can use to aid us here - move a small amount of data from the slow DDR3 into a faster, lower latency memory closer to the CPU: a cache. RPU is completely cacheless, and every request to DDR3 actually pulls out 16 bytes - so if the request was for one byte, we have effectively thrown away 15 bytes of data.

I’m not going to go into all the various types of cache that commonly find themselves used in CPU systems. For that, if an introduction is needed, you can refer to https://en.wikipedia.org/wiki/CPU_cache. At the most basic, we have a limited number of fixed-size blocks of data - cache lines - which temporarily mirror data from the slower memory behind them. The cache lines are associated with an address in the slower memory via a tag system, and there are policies for what to do when the cache is full, and what to do on data write operations. The cache is low latency and very quick to access once the data from the slower memory has been transferred into it.

For us, we need to look at our requirements. The memory operations we are doing in Doom are fetching the instruction stream, reading/writing from stack, writing the frame buffer, and reads and writes of global state. Our instruction stream is entirely in DDR3 and so very slow - we want these to go through the cache. The stack is set to a location in our fast block ram, not DDR3, so we don’t need to worry about that. We write a buffer in DDR3 which is eventually copied into fast VMEM block ram, but those writes won’t benefit that much - the copy into VMEM however should. The global state will also end up in DDR3 - we want this cached.

The very simplest cache we can implement is one which operates on our current DDR3 request size of 16 bytes. We can have a small number of lines equal in size to this request size, and make it fully associative - i.e., allow any of those cache lines to hold any location in our DDR3. It means we must search every line tag for a cache hit, but as we expect this to be a small cache, that is okay.
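To make the tag system concrete, here is how a 32-bit address breaks down for these 16-byte lines - a C illustration mirroring the VHDL below. Fully associative means there is no index field: the whole upper 28 bits form the tag, and every line’s tag must be compared.

#include <stdint.h>

static inline uint32_t line_tag(uint32_t addr)    { return addr >> 4;  } /* upper 28 bits */
static inline uint32_t byte_offset(uint32_t addr) { return addr & 0xF; } /* byte within the line */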

The basic look of our internal cache data in VHDL is as follows:

    type tag is record
        addr : std_logic_vector(27 downto 0);
        valid : std_logic;
    end record;

    constant CACHE_DEPTH : integer := 4;

    type cache_data is array ((CACHE_DEPTH - 1) downto 0) of STD_LOGIC_VECTOR (127 downto 0);
    type cache_tags is array ((CACHE_DEPTH - 1) downto 0) of tag;

    signal data : cache_data := (others => (others => '0'));
    signal tags : cache_tags := (others => ((others => '0'), '0'));

On a read request, we will loop through the cache_tags looking for our address. If we find a location, and the valid flag is set, we will use the offset from the request address to subscript the corresponding cache_data line - providing that data back to the CPU within a couple of cycles instead of the 20+ required for a DDR3 request. If we do not find it, we will issue the DDR3 request as normal, and capture the data in the cache by replacing another cache line - we will just round robin using an incrementing index to decide which cache line to replace, nothing fancy.

if I_CMD = CMD_READ then
  var_tag_found := '0';
  -- check tags for I_ADDR
  for I in 0 to (CACHE_DEPTH - 1) loop
    if tags(I).addr = I_ADDR(31 downto 4) and tags(I).valid = '1' then
      S_REQ_TAG_ID <= I;
      var_tag_found := '1';
    end if;
  end loop;
  if (var_tag_found = '0') then
      state <= STATE_REQDATA_INIT;
  else
      state <= STATE_READDATA;
  end if;
elsif I_CMD = CMD_WRITE then
  ...

This is the first time I used process variables - the var_tag_found variable acts like a register which can be written and then read within a process, unlike signals, whose new values are not available immediately after a write. The memory process that handles all memory requests, from MMIO to BRAM to DDR3, was updated to always go through this cache when the memory address was within our DDR3 space. This process also handles the “STATE_REQDATA_*” functionality of kicking off a DDR3 request to fill a new cache line, and then forwarding that data to the cache object, where it takes a cycle to flip a cache line over to the new data and address tag. But this raises another question - what about writes?

Cache writes

Looking at our requirements for this cache from before, we saw that reads are the overwhelming majority of requests. So why even bother with the complexity of adding writes? When we see a write request, we just go through the existing slow DDR3 write path. Additionally, however, we ensure the cache still sees the request, so that it can search the tags for the request address and invalidate any matching line. This keeps the data in the cache coherent with the data in DDR3. One last thing we can do is set the round-robin line counter - used when we miss the cache - to the location of this newly invalidated line. This is a small optimization that ensures a new cache line will take the invalid slot rather than pushing out a valid line.
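As a rough C model of that invalidate-on-write flow (illustrative only - the real logic is a VHDL state machine, and the layout matches the 16-byte-line cache from earlier):

#include <stdint.h>

#define CACHE_DEPTH 4

struct tag { uint32_t addr; int valid; };

static struct tag tags[CACHE_DEPTH];
static int replace_idx; /* round-robin fill pointer used on a miss */

void cache_notify_write(uint32_t addr)
{
    for (int i = 0; i < CACHE_DEPTH; i++) {
        if (tags[i].valid && tags[i].addr == (addr >> 4)) {
            tags[i].valid = 0;  /* drop the stale line: cache stays coherent */
            replace_idx = i;    /* next miss refills this slot, not a live line */
        }
    }
    /* the write itself still goes down the normal slow DDR3 path */
}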

Now, I could have implemented this another way - whereby instead of invalidating the line, the cache simply updates it with the written value. This sounds great, but with my very simple (i.e., poor) cache implementation, it complicated things to the point where, for larger caches, the utilization on the FPGA was huge - and for Doom the gain was not worth it.

The benchmark impact

The latency reduction for a read from a cached address is quite significant. Excluding CPU request overheads, we go from a 27 cycle wait for a response from DDR3 to a 2 cycle response from the cache - but how did this affect our benchmark? Did we get more than our 0.8fps?

Wayhey! A significant uplift with the cache in front of DDR3 access. At this time I was also doing some optimizations to the memory controller parts of the CPU, which cut a few cycles from the fetch request latency, but as can be seen below, the next biggest jump came from hardware multiply operations. I’ve written about implementing multiply already in this series, but at this point I had not done divide, and there was no ability for the compiler to issue only multiply instructions - so multiply was added by manually inserting the instruction into the software multiply function, leaving divide in pure software. Nowadays you can use a compiler option to emit hardware multiply but software divide - there is even a RISC-V extension for exactly this, Zmmul - but when these experiments were done, this was not available to me.

However, the first graph above, showing the number of 16-byte lines, shows that above 16 lines the performance increase really drops off. This is a pretty big indicator that what we really need is a longer cache line. Thankfully, we could do this - the DDR3 controller can burst-read the next 16 bytes very quickly, meaning we could read a 32-byte line into the cache from DDR3 in one operation with only a small addition to latency. I made the relevant changes, doubling the size of the cache lines and all of the relevant signals used to populate the data. The increase in benchmark score was well worth it.

Note here the linear increase in performance as the number of lines increases, which let me know to look at other areas before increasing the cache line size any further.

Write-through tests

I mentioned earlier that I did a simple implementation of cache write-through, instead of simply invalidating the cache line and writing direct to DDR3 as before. Due to constraints with FPGA resources, the only version I could get to build correctly was one where only a 32-byte, naturally aligned write updated the cache. In this case, it wrote the new value into the cache line, keeping the line valid - but also, behind the scenes, performed the DDR3 write as before. This introduced some complicated latency problems which were difficult to fix - the write could still “stall” the DDR3 read/write system for more cycles than expected, but this was still fewer cycles than issuing another DDR3 read request after a line was invalidated, so there was a small performance uplift, which can be seen in the following graph. However, for the added complexity, and the issues with building for the FPGA, I stayed with the write-invalidation scheme. A rough model of the write-through idea is sketched below.
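This sketch extends the earlier write model - only a naturally aligned, full-line store refreshes a hit line in place, everything else falls back to invalidation. Names are illustrative, and the DDR3 write still happens behind the scenes either way.

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32

extern uint8_t cache_lines[][LINE_BYTES];
extern int  cache_tag_lookup(uint32_t addr); /* line index on hit, -1 on miss */
extern void cache_invalidate(int line);

void cache_notify_write(uint32_t addr, const uint8_t *src, unsigned len)
{
    int line = cache_tag_lookup(addr);
    if (line < 0)
        return; /* miss: nothing cached to keep coherent */
    if (len == LINE_BYTES && (addr % LINE_BYTES) == 0)
        memcpy(cache_lines[line], src, LINE_BYTES); /* line stays valid */
    else
        cache_invalidate(line); /* partial writes revert to the old scheme */
}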

VRAM swapping

The way we implemented updating the framebuffer - writing to an intermediate buffer before copying to our VRAM in fast block ram - was inefficient. Copying 64KB may seem hilariously tiny, but given how slow the CPU is at anything memory related, this had a significant impact on render speed. Instead, I decided to create a second 64KB VRAM area of memory - taking twice the block ram, but letting the display readout flip between the two buffers. The display reads one buffer out to screen whilst rendering occurs into the second buffer, and they ping-pong each frame. This was the best way of getting smooth output with no additional memory copying.

The memory map was updated with the second block ram VMEM area, and an MMIO register to indicate which VMEM should be scanned out to display.

size  type       name                 address
64KB  Block Ram  Video MEM A (VMEMA)  0x30000
64KB  Block Ram  Video MEM B (VMEMB)  0x20000
4     MMIO       palette              0xf000c120
4     MMIO       VMEM Enable          0xf000c110
4     MMIO       VMEM Select          0xf000c130
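A sketch of the per-frame flip, using the addresses from the table above; which select value maps to which buffer is an assumption, as is the function name:

#include <stdint.h>

#define VMEM_A ((uint8_t *)0x30000)
#define VMEM_B ((uint8_t *)0x20000)
#define VMEM_SELECT ((volatile uint32_t *)0xf000c130)

static uint32_t front; /* buffer currently being scanned out */

/* call when a frame is complete; returns the buffer to render the next
   frame into */
uint8_t *av_flip_buffers(void)
{
    front ^= 1;
    *VMEM_SELECT = front;           /* display switches to the finished frame */
    return front ? VMEM_A : VMEM_B; /* draw into the buffer now off-screen */
}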

This gave another large increase in performance. At this point, the RPU core was now RV32IM with full multiply and divide instructions, so there was also a bump from being able to build Doom without the software emulated multiply/divide runtime.

We are now running at over 6fps with very few changes to Doom itself - a large increase from our original 0.8fps.

Further optimizations and Cache Counters

The next set of optimizations were not really to do with the FPGA, but with small changes to the way Doom wrote to VRAM - not redrawing the border every time, doing certain things only once, and moving some things around - R_InitBuffer was one that was moved, and ST_refreshBackground modified. There was also some manual loop unrolling.

I did this with the aid of performance counters I added to the cache - on a request, increment a 64-bit request counter; on a miss, increment the miss counter. I then fixed a few addresses in our MMIO memory space to read the counters back. With them, I was able to output cache statistics along with the timedemo score, and saw that the hit rate was actually very high - which does make sense given the cache’s limited scope.

size      MMIO name                    address
uint32_t  perfcounter_cache_missl      0xff002000
uint32_t  perfcounter_cache_missh      0xff002004
uint32_t  perfcounter_cache_requestsl  0xff002010
uint32_t  perfcounter_cache_requestsh  0xff002014
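Reading one of these 64-bit counters means combining two 32-bit MMIO words, using the addresses from the table above. The high/low/high re-read guarding against a carry mid-read is my addition - the real code may simply read both words:

#include <stdint.h>

#define CACHE_MISS_L ((volatile uint32_t *)0xff002000)
#define CACHE_MISS_H ((volatile uint32_t *)0xff002004)

static uint64_t read_counter64(volatile uint32_t *lo, volatile uint32_t *hi)
{
    uint32_t h0, l, h1;
    do {
        h0 = *hi;
        l  = *lo;
        h1 = *hi;
    } while (h0 != h1); /* retry if the low word carried into the high word */
    return ((uint64_t)h1 << 32) | l;
}

/* e.g. hit rate = 1.0 - misses / requests, printed after a timedemo run */
uint64_t cache_misses(void) { return read_counter64(CACHE_MISS_L, CACHE_MISS_H); }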

As you can see from this chart, all of these gains up to nearly 8fps were due to reducing the number of cache requests - I don’t have a breakdown of exactly where in the code the reduction in reads came from, but as you can see it was significant.

I think I may be able to get even more performance by reducing loop unrolling - a branch in RPU isn’t any slower, and a hot loop that fits in cache could be better than a larger unrolled block of code which doesn’t.

But, for now, I’ll take the 10x speedup!

Conclusion

This post has taken a very long time to prepare. There was a lot of work done to test all the various configurations, and quite a few issues took up valuable development time that I have not included here - the default signedness of the char datatype, for example, caused issues that resulted in many nights looking through CPU traces of code sign-extending to larger types before I found the problem.

Running Doom is a fantastic milestone, and going from 0.8fps to nearly 8fps was great - even if some of that was from software changes. I was able to find a “timedemo demo3” run on a 386 PC which produced a result of 8.75fps (https://www.youtube.com/watch?v=TfKQ-uREnsI). Getting within 1fps of this is a win in my book.

The cache implementation itself can be improved massively. It is not using block rams, so its utilization is very high when there are many cache lines in the configuration.

I actually finished this work mid 2021, and since then have been sporadically working on fixing many of the timing issues present in the RPU design. For the longest time I was trying to shave cycles off operations such as memory accesses, but this was introducing timing problems which prevented the design running over 100MHz. I have found many issues, which will probably form another post in the series. The aim is to significantly simplify the memory subsystem of the CPU/SoC - things like endian flipping happening outside the CPU are an oddity, and they cause problems. Given I still have an aim of running Linux on RPU with virtual memory, simplifying the current memory system is a must!

Once simplified, I can see myself revisiting the cache and making a better version. For now though, this works nicely.

Thanks for reading. If you have any further questions, please do not hesitate to send me a message on twitter @domipheus!