This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
Seems it’s a rite of passage for any project of this kind that Doom needs to be ported to it. For me, I wanted it as a more involved benchmark - the likes of Dhrystone, which gets used everywhere, is pretty boring. Why have that when you can publish a Doom timedemo score for your FPGA CPU? :)
I searched for folks who had already built Doom for an RV32 target, and found the RISC_666 emulator on github. This emulator includes a version of Doom with an API exposed which pulls the framebuffer and input handling out of the engine, so the graphical output can be seen. This was very useful for me, as it meant I could quickly hook up some custom functionality for how Doom would render its scene on our FPGA. Getting the source to build with my RV32I toolchain was fairly straightforward. As this work started before I had hardware integer multiply and divide, I leveraged LLVM’s compiler-rt project to get fast software implementations of them, and hooked up the fopen/fread functions to operate on the same SD card where the bootloader exists.
File I/O was the first major issue. We will use the shareware Doom distribution, Doom1.wad, which comes in at 4MB. I hooked up petitFatFs - a library already used in the bootloader - to read the SD card. The issue is that it’s designed for resource-constrained systems, and as such is really very slow. Reading the 4MB file was in fact so slow it was taking several minutes, with the constant seeking, reading, then seeking elsewhere making it worse. The solution was to read the entire file into the DDR3 memory space once, after which seeks and reads are simply a memory offset and a load. We now have a delay at the start as the whole file is read, but this is followed by incredibly quick subsequent fread() operations. We are now at the point where we need a framebuffer to display the video output.
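As a sketch of that approach - load the file into DDR3 once, then serve seek/read calls as pointer arithmetic and a memcpy. The helper names here are mine for illustration, not the actual RPU firmware API:

```c
#include <stdint.h>
#include <string.h>

// Hypothetical sketch: the WAD is read from SD into DDR3 once at startup,
// and fseek/fread-style calls then operate on the in-memory image.
static uint8_t *wad_base;   // start of the file image in DDR3
static uint32_t wad_size;   // total bytes read from the SD card
static uint32_t wad_pos;    // current "file" offset

// Replacement for fseek(): just move the offset.
void wad_seek(uint32_t offset) { wad_pos = offset; }

// Replacement for fread(): a bounds-checked memcpy from DDR3.
uint32_t wad_read(void *dst, uint32_t bytes)
{
    if (wad_pos + bytes > wad_size)
        bytes = wad_size - wad_pos;
    memcpy(dst, wad_base + wad_pos, bytes);
    wad_pos += bytes;
    return bytes;
}
```

After the one-time load, every "file" operation costs only a memory copy, which is what makes the subsequent fread() calls so quick.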
(It should be noted that I had many more issues to solve here before Doom ran, but they were mostly dumb compiler things that aren’t interesting with respect to this post - so I’ve omitted them.)
Original Doom uses Mode 13h, which is the 256-color, 320 x 200 resolution mode available on VGA graphics hardware. With each pixel represented linearly by a single byte index into the 256-color palette, we’d need a ~64KB area of memory to represent this framebuffer and the palette.
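As a quick sanity check on that ~64KB figure:

```c
// Mode 13h sizes: one byte per pixel at 320x200, plus a 256-entry RGB palette.
enum {
    MODE13_PIXELS = 320 * 200,   // 64000 bytes of framebuffer
    PALETTE_BYTES = 256 * 3,     // 768 bytes of palette data
};
```

Framebuffer plus palette together stay just under the 64KB (65536-byte) mark.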
With the RPU SoC set up to output 720p60, my plan is to have the top left part of the screen display this graphical mode at a 2x scale - so one pixel in the 320x200 buffer represents 4 in the output, taking up 640x400 of the 1280x720 screen resolution. This lets me see things a bit more clearly, allows us more cycles to fetch our data from the framebuffer/palette, and still leaves plenty of the text mode display available for printing other important information whilst running Doom.
I added a memory mapped flag for enabling the video memory (VMEM) area, so it’s only active when used - otherwise it would mask some output from the text mode buffers. I also added a memory mapped register which sets an entry in the 256-color palette. My implementation simply has an array in the VHDL for the palette. This means we can set the palette colors from inside the relevant Doom source by writing a 32-bit value to that location in memory.
#include <stdint.h>

int av_set_palette(struct av_color *palette, int ncolors)
{
    // write each entry to the palette set MMIO register
    volatile uint32_t* palette_mmio = (volatile uint32_t*)0xf000c120;
    for (int c = 0; c < ncolors; c++)
    {
        // format is a 32-bit value comprising 8-bit fields: Index, R, G, B
        palette_mmio[0] = ((uint32_t)c << 24) | ((uint32_t)palette[c].r << 16) | ((uint32_t)palette[c].g << 8) | (uint32_t)palette[c].b;
    }
    return 0;
}
Then, I just set the current screen buffer in Doom to the start of my VMEM block ram area. It’s set up as a two-port block ram - one port used so it is readable and writeable from the CPU, with the other port running at the pixel clock for reading out to the display.
The display readout always runs, using the X and Y of the VGA generator to pipeline requests for the subsequent pixels, taking into account the additional latency required to do the palette lookup. The VRAM enable flag simply selects whether a pixel in the “VMEM viewport” of the 720p output comes from the text generator or this VMEM display readout.
-- re-adjust vmem area so it shows correct pixels taking into account vmem and palette fetch latency
VMEM_AREA <= '1' when (
(unsigned(pixel_h_pref(11 downto 1)) < 328) and
(unsigned(pixel_h_pref(11 downto 1)) >= 9) and
(unsigned(pixel_v_pref(11 downto 1)) < 201) and
(unsigned(pixel_v_pref(11 downto 1)) >= 1)
) else '0';
VOUT_R <= VMEM_R when VMEM_ENABLE = '1' and VMEM_AREA = '1' else TXT_R;
VOUT_G <= VMEM_G when VMEM_ENABLE = '1' and VMEM_AREA = '1' else TXT_G;
VOUT_B <= VMEM_B when VMEM_ENABLE = '1' and VMEM_AREA = '1' else TXT_B;
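For illustration, the viewport test and address mapping can be modelled in C. This ignores the fetch and palette latency offsets the VHDL above compensates for - a sketch of the nominal mapping only:

```c
#include <stdint.h>

// Model of the VMEM viewport: a 320x200, one-byte-per-pixel buffer shown
// at 2x scale in the top-left of the 720p output (so 640x400 of it).
int in_vmem_viewport(int x, int y)
{
    return x < 640 && y < 400;   // 2x scale of 320x200
}

// Byte offset into VMEM for an output pixel inside the viewport.
// The >>1 shifts mirror the (11 downto 1) slices in the VHDL.
uint32_t vmem_offset(int x, int y)
{
    return (uint32_t)(y >> 1) * 320 + (uint32_t)(x >> 1);
}
```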
Our memory map additions for this VMEM are as below.
size | type | name | address |
---|---|---|---|
64KB | Block Ram | Video MEM (VMEM) | 0x30000 |
4 | MMIO | palette | 0xf000c120 |
4 | MMIO | VMEM Enable | 0xf000c110 |
At this point, we got graphical output. This was quite an exciting moment. I’m a graphics engineer in my day job, and I’ve always thought I was drawn to that line of work due to the immediate feedback you get from viewing the result of your code run in real time. This was certainly pushing the definition of real time, however. It was running very slowly, not even 1 FPS, and you could see what the engine was working on rendering at any given moment - the walls filled into view, the floors and ceilings, then the enemies and other sprites, before it started the next frame of rendering. The following video tweet showed the effect, although when viewed on-device it looks much more glitchy!
Whilst getting some timing numbers for my next designing a CPU in VHDL blog I messed things up. Doom was running slowly and writing into the currently displayed videobuffer - rather than the next one.
— Colin Riley 🎗 (@domipheus) August 4, 2021
Lets you see how a doom frame is put together though. pic.twitter.com/b5B2L4nHHu
What we needed to make the rendering look more natural was to write to an intermediate buffer, and then copy the completed image to VRAM in one go.
The timedemo demo3 score was “2134 gametics in 93812 realtics”, or in more common terms: 0.8fps.
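Both numbers are counted in Doom’s 35Hz tics, so the conversion to frames per second is a one-liner:

```c
// Doom's timedemo counts gametics and realtics in 35Hz tics, so the
// average frame rate is gametics * 35 / realtics.
double timedemo_fps(long gametics, long realtics)
{
    return (double)gametics * 35.0 / (double)realtics;
}
```

For "2134 gametics in 93812 realtics" this gives 2134 × 35 / 93812 ≈ 0.8fps.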
It’s a slow CPU/SoC design - whilst it runs at 100MHz here and can go higher, each instruction can take a significant amount of time to execute - and when executing from DDR3, memory operations require significantly more cycles, given the additional request latency. When looking at an instruction trace (see my post here on how this is implemented) while running Doom, well over 50% of all cycles are spent in the memory stage, waiting.
Thankfully, we already know of a technique we can use to aid us here - move a small amount of data from the slow DDR3 into a faster, lower-latency memory closer to the CPU: a cache. RPU is completely cacheless, yet every memory request to DDR3 actually pulls out 16 bytes. If the request was for a single byte, we have effectively thrown away the other 15 bytes.
I’m not going to go into all the various types of cache commonly used in CPU systems. For that, if an introduction is needed, you can refer to https://en.wikipedia.org/wiki/CPU_cache. At its most basic, a cache is a limited number of fixed-size blocks of data - cache lines - used to temporarily mirror data from the slower memory behind it. Each cache line is associated with an address in the slower memory via a tag system, and policies are chosen for what to do when the cache is full, and what to do on data write operations. The cache is low latency and very quick to access once the data from the slower memory has been transferred into it.
For us, we need to look at our requirements. The memory operations we are doing in Doom are fetching the instruction stream, reading/writing from stack, writing the frame buffer, and reads and writes of global state. Our instruction stream is entirely in DDR3 and so very slow - we want these to go through the cache. The stack is set to a location in our fast block ram, not DDR3, so we don’t need to worry about that. We write a buffer in DDR3 which is eventually copied into fast VMEM block ram, but those writes won’t benefit that much - the copy into VMEM however should. The global state will also end up in DDR3 - we want this cached.
The very simplest cache we can implement would be one which operates on our current DDR3 request size of 16 bytes. We can have a small number of lines equal in size to this request size, and make it fully associative - i.e. allow any of those cache lines to hold any location in our DDR3. It means we will need to search every line tag for a cache hit, but as we expect this to be a small cache, that’s okay.
The basic look of our internal cache data in VHDL is as follows:
type tag is record
addr : std_logic_vector(27 downto 0);
valid : std_logic;
end record;
constant CACHE_DEPTH : integer := 4;
type cache_data is array ((CACHE_DEPTH - 1) downto 0) of STD_LOGIC_VECTOR (127 downto 0);
type cache_tags is array ((CACHE_DEPTH - 1) downto 0) of tag;
signal data : cache_data := (others => (others => '0'));
signal tags : cache_tags := (others => ((others => '0'), '0'));
On a read request, we will loop through the cache_tags looking for our address. If we find a location, and the valid flag is set, we will use the offset from the request address to subscript the corresponding cache_data line - providing that data back to the CPU within a couple of cycles instead of the 20+ required for a DDR3 request. If we do not find it, we will issue the DDR3 request as normal, and capture the data in the cache by replacing another cache line - we will just round robin using an incrementing index to decide which cache line to replace, nothing fancy.
if I_CMD = CMD_READ then
var_tag_found := '0';
-- check tags for I_ADDR
for I in 0 to (CACHE_DEPTH - 1) loop
if tags(I).addr = I_ADDR(31 downto 4) and tags(I).valid = '1' then
S_REQ_TAG_ID <= I;
var_tag_found := '1';
end if;
end loop;
if (var_tag_found = '0') then
state <= STATE_REQDATA_INIT;
else
state <= STATE_READDATA;
end if;
elsif I_CMD = CMD_WRITE then
...
This is the first time I used process variables - var_tag_found acts like a register which can be written and then read within a process, unlike signals, whose values are not available immediately after a write. The memory process that handles all memory requests, from MMIO to BRAM to DDR3, was updated to always go through this cache when the memory address is within our DDR3 space. This process also handles the “STATE_REQDATA_*” functionality of kicking off a DDR3 request to fill a new cache line, and then forwarding that data to the cache object, where it takes a cycle to flip a cache line over to the new data and address tag. But this then raises another question - what about writes?
Looking at our requirements for this cache from before, we saw that reads are the overwhelming majority of requests. So, why even bother with the complexity of adding writes? When we see a write request, we just go through the existing slow DDR3 write path. Additionally, however, we ensure the cache still sees this request, so that it can search the tags for the request address, and then invalidate this line. This ensures coherency between the data in the cache and the data in the DDR3. One last thing we can do is to set the round robin line counter for use when we miss the cache to the location of this newly invalidated line. This is a small optimization that ensures a new cache line will take this invalid line, rather than pushing out a valid line. With this, the cache state machine looks similar to this:
Now, I could have implemented this another way - whereby instead of invalidating the line, the cache simply updated it with the written value. This sounds great, but with my very simple (i.e. poor) cache implementation it complicated things to the point where, for larger caches, the FPGA utilization was huge - and for Doom, the gain was not worth it.
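Putting the behaviour described above together - linear tag search, round-robin fill, and invalidate-on-write - a behavioural C model of the cache might look like the following. This is a sketch mirroring the VHDL records, not the RTL itself:

```c
#include <stdint.h>
#include <string.h>

// Behavioural model: 4 fully associative 16-byte lines, linear tag search,
// round-robin replacement, and write-invalidation for coherency.
#define CACHE_DEPTH 4
#define LINE_BYTES  16

typedef struct {
    uint32_t tag;                 // line-aligned address
    int      valid;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t lines[CACHE_DEPTH];
static int next_victim;           // round-robin replacement index

// Read one byte; returns 1 on hit. On a miss the caller fetches the
// 16-byte line from DDR3 and installs it with cache_fill().
int cache_read(uint32_t addr, uint8_t *out)
{
    uint32_t tag = addr & ~(uint32_t)(LINE_BYTES - 1);
    for (int i = 0; i < CACHE_DEPTH; i++) {
        if (lines[i].valid && lines[i].tag == tag) {
            *out = lines[i].data[addr & (LINE_BYTES - 1)];
            return 1;
        }
    }
    return 0;
}

// Install a freshly fetched line, replacing round-robin.
void cache_fill(uint32_t addr, const uint8_t *data)
{
    cache_line_t *l = &lines[next_victim];
    next_victim = (next_victim + 1) % CACHE_DEPTH;
    l->tag   = addr & ~(uint32_t)(LINE_BYTES - 1);
    l->valid = 1;
    memcpy(l->data, data, LINE_BYTES);
}

// Writes go straight to DDR3; the cache only invalidates a matching line,
// and marks that freed slot as the next one to refill.
void cache_write_invalidate(uint32_t addr)
{
    uint32_t tag = addr & ~(uint32_t)(LINE_BYTES - 1);
    for (int i = 0; i < CACHE_DEPTH; i++) {
        if (lines[i].valid && lines[i].tag == tag) {
            lines[i].valid = 0;
            next_victim = i;
        }
    }
}
```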
The latency reduction for a read from a cached address is quite significant. Excluding CPU request overheads, we go from a 27-cycle wait for a response from DDR3 to a 2-cycle response from the cache - but how did this affect our benchmark? Did we get more than our 0.8fps?
Wayhey! A significant uplift with the cache in front of DDR3 access. At this time I was also doing some optimizations to the memory controller parts of the CPU, which cut a few cycles from the fetch request latency, but as can be seen below, the next biggest jump came from hardware multiply operations. I’ve written about implementing multiply already in this series.
However, the first graph above, showing the number of 16-byte lines, shows that above 16 lines the performance increase really drops off. This is a pretty big indicator that what we really need is a longer cache line. Thankfully, we could do this - the DDR3 controller can burst-read the next 16 bytes very quickly, meaning we can read a 32-byte line into the cache from DDR3 in one operation with only a small addition to latency. I made the relevant changes, doubling the size of the cache lines and all of the relevant signals used to populate the data. The increase in benchmark score was well worth it.
Note here the linear increase in performance as the number of lines increases, which told me to look at other areas before increasing the cache line size any further.
I mentioned earlier that I did try a simple implementation of cache write-through, instead of simply invalidating the cache line and writing directly to DDR3 as before. Due to FPGA resource constraints, the only version I could get to build correctly was one where a 32-byte, naturally aligned write was performed. In that case, it wrote the new value into the cache line, keeping the line valid - whilst behind the scenes performing the DDR3 write as before. This introduced some complicated latency problems which were difficult to fix - the write could still “stall” the DDR3 read/write system for more cycles than expected, but still fewer cycles than issuing another DDR3 read request after a line was invalidated, so there was a small performance uplift which can be seen in the following graph. However, given the added complexity and the issues building for the FPGA, I stayed with the write-invalidation scheme.
The way we implemented updating the framebuffer - writing to an intermediate buffer before copying to our VRAM in fast block ram - was inefficient. Copying 64KB may seem hilariously tiny, but given how slow the CPU is at anything memory related, this had a significant impact on render speed. Instead, I decided to create another 64KB VRAM area of memory. This took twice the block ram, but the display readout could now flip between the two buffers: the display reads one buffer out to screen whilst rendering occurs into the second, and they ping-pong each frame. This was the best way of getting smooth output with no additional memory copying.
The memory map was updated with the second block ram VMEM area, and an MMIO register to indicate which VMEM should be scanned out to display.
size | type | name | address |
---|---|---|---|
64KB | Block Ram | Video MEM A (VMEMA) | 0x30000 |
64KB | Block Ram | Video MEM B (VMEMB) | 0x20000 |
4 | MMIO | palette | 0xf000c120 |
4 | MMIO | VMEM Enable | 0xf000c110 |
4 | MMIO | VMEM Select | 0xf000c130 |
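A sketch of how the Doom side might drive the flip, using the addresses from the map above - the helper names and exact register semantics here are my assumptions:

```c
#include <stdint.h>

// Bank base addresses and the VMEM Select MMIO register from the memory map.
#define VMEMA_BASE  0x30000u
#define VMEMB_BASE  0x20000u

// Toggle which bank the display scans out. The register pointer is passed
// in so the logic is testable; on hardware it would be
// (volatile uint32_t *)0xf000c130. 'front' tracks the visible bank.
void vmem_flip(volatile uint32_t *vmem_select, int *front)
{
    *front ^= 1;
    *vmem_select = (uint32_t)*front;   // display now scans the other bank
}

// The bank the CPU should render into next (the one not being displayed).
uint8_t *vmem_back(int front)
{
    return (uint8_t *)(uintptr_t)(front ? VMEMA_BASE : VMEMB_BASE);
}
```

Render into `vmem_back(front)`, call `vmem_flip()` at end of frame, repeat - no 64KB copy needed.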
This gave another large increase in performance. At this point, the RPU core was now RV32IM with full multiply and divide instructions, so there was also a bump from being able to build Doom without the software emulated multiply/divide runtime.
We are now running at over 6fps with very few changes to Doom itself - a large increase from our original 0.8fps.
The next set of optimizations had little to do with the FPGA, and more to do with small changes to the way Doom wrote to VRAM - not redrawing the border every frame, doing one-time work only once, moving some things around - R_InitBuffer was one that was moved, and ST_refreshBackground modified. There was also some manual loop unrolling.
I did this with the aid of performance counters I added to the cache - on a request, increment a 64-bit request counter; on a miss, increment the miss counter. I then fixed a few addresses in our MMIO memory space to read the counters. With them, I was able to output cache statistics alongside the timedemo score, and saw that the hit rate was actually very high - which makes sense given the cache’s limited scope.
type | MMIO name | address |
---|---|---|
uint32_t | perfcounter_cache_missl | 0xff002000 |
uint32_t | perfcounter_cache_missh | 0xff002004 |
uint32_t | perfcounter_cache_requestsl | 0xff002010 |
uint32_t | perfcounter_cache_requestsh | 0xff002014 |
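Reading a 64-bit counter split across two 32-bit MMIO registers needs a little care, since the low word can wrap between the two reads. The double-read of the high half below is a standard trick - I’m assuming the RPU counters tolerate it, and the hit-rate helper is mine:

```c
#include <stdint.h>

// Read a 64-bit counter from its low/high 32-bit MMIO halves. The high
// half is sampled twice to guard against the low word wrapping mid-read.
uint64_t read_counter64(volatile uint32_t *lo, volatile uint32_t *hi)
{
    uint32_t h0, l, h1;
    do {
        h0 = *hi;
        l  = *lo;
        h1 = *hi;
    } while (h0 != h1);          // retry if a carry happened between reads
    return ((uint64_t)h1 << 32) | l;
}

// Hit rate in percent from the request and miss counters.
double cache_hit_rate(uint64_t requests, uint64_t misses)
{
    return requests ? 100.0 * (double)(requests - misses) / (double)requests
                    : 0.0;
}
```

On hardware the pointers would be the MMIO addresses from the table above, e.g. `(volatile uint32_t *)0xff002000` for the low miss word.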
As you can see from this chart, all of these gains up to nearly 8fps were due to reducing the number of cache requests - I don’t have numbers for exactly where the reduction in reads came from in the code, but as you can see it was significant.
I think I may be able to get even more performance by reducing loop unrolling - a branch on RPU isn’t any slower, and if a hot loop fits in the cache it could be better than a larger unrolled block of code which doesn’t fit.
But, for now, I’ll take the 10x speedup!
This post has taken a very long time to prepare; there was a lot of work done to test all the various configurations, and quite a few issues took up valuable development time that I have not included here - for example, the default signedness of the char datatype caused issues that resulted in many nights looking through CPU traces of code sign-extending to larger types before I found the problem.
Running Doom is a fantastic milestone, and going from 0.8fps to nearly 8fps was great - even if some of that came from software changes. I was able to find a “timedemo demo3” run on a 386 PC which produced a result of 8.75fps (https://www.youtube.com/watch?v=TfKQ-uREnsI). Getting within 1fps of that is a win in my book.
The cache implementation itself can be improved massively. It is not using block rams, so its utilization is very high when the configuration has many cache lines.
I actually finished this work in mid-2021, and since then have been sporadically working on fixing many of the timing issues present in the RPU design. For the longest time I was trying to shave cycles off operations such as memory accesses, but this was introducing timing problems which prevented the design from running over 100MHz. I have found many issues, which will probably form another post in the series. The aim is to significantly simplify the memory subsystem of the CPU/SoC - things like endian flipping happening outside the CPU are an oddity and cause problems. Given I still have an aim of running Linux on RPU with virtual memory, simplifying the current memory system is a must!
Once simplified, I can see myself revisiting the cache and making a better version. For now though, this works nicely.
Thanks for reading. If you have any further questions, please do not hesitate to send me a message on twitter @domipheus!