Content follows this message
If you have enjoyed my articles, please consider these charities for donation:

Young Lives vs Cancer - Donate.
Blood Cancer UK - Donate.
Children's Cancer and Leukaemia Group - Donate.

Designing a RISC-V CPU in VHDL, Part 21: Multi-cycle execute for multiply and divide

Posted May 29, 2021, Reading time: 10 minutes.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

One of the things which RPU has done from the start is keep the cpu pipeline very simple. It’s a Fetch, Decode, Execute, [Memory], Writeback pipeline, but it does not run pipelined. I.e, at any cycle, only one of these stages is active, and all state within the cpu corresponds to a single instruction from fetch to writeback. Due to this our instructions-per-cycle (IPC) count is very low, but the implementation is simpler to understand in terms of dataflow.

The control unit decides what stage is active, drawing on state outputs from previous stages to make those decisions. Each of the decode, execute and writeback pipeline stages takes a single cycle to execute. Memory stages, like Fetch and Memory can take a variable amount of cycles. The CSR unit actually requires multiple cycles to operate – a write or read-modify-write operation like csrrci takes 3 cycles to fully complete – however this is done asynchronously with the traditional pipeline, so does not impact upon the 1 cycle per stage limitation. The execute stage for a csrrci instruction remains one cycle.

Some operations really do require multiple cycles to execute – without asyncronous operation – and the execute stage will need to wait for it to complete. An example of this is integer multiply and divide, operations defined in the RISC-V M extension. RPU did not support these operations, but now does – along with an execution phase which can span an arbitrary amount of clock cycles. This is how it’s implemented.

Let’s start – Multiply

Currently, the RPU decode stage requests an illegal instruction exception be raised if an opcode relating to the M-extensions are found. Previously, I have used this mechanism to implement software multiply and divide in the exception handler, which allowed Zephyr RTOS to boot while building binaries which targeted rv32m – slow, but effective.

The first thing we need to change is this behaviour. Allow the M-extension functions to pass the decoder, and set a multicycle bit in the decoder output for the control unit to consume.

The multi-cycle bit

The control unit of RPU is basic, and the main cpu stages simply advanced to the next stage every cycle – as long as no interrupts are requested. We need to modify this behaviour when we are in the execution stage to enable multi-cycle use. We do this with a pair of new signals, one from the decoder “O_multycyAlu” and one from the ALU, “O_wait”.

The output from the decoder “O_multicyAlu” is really an optimization; we could ignore this and simply rely on the O_wait from the ALU, but this would add a cycle of latency to every excute stage – something we do not want. So, the decoder knows that the multiply and divide instructions take multiple cycles, so we tell the control unit this. The control unit then keeps the execute state active, until the ALU O_wait output is not asserted. Simple, and works well for our use case. This is when the differences come in for how multiply and divide will be implemented on RPU, so we’ll jump right back into how multiply is handled now in the ALU.

From the spec, we can see that there are 4 different multiply operations we need to implement. MULW only applies to rv64, so is ignored in our rv32 implementation.

MUL – write bottom 32-bit result of rs1 * rs2 into rD.
MULH – write upper 32-bit result of (signed(rs1)) * (signed(rs2)) into rD.
MULHU – write upper 32-bit result of (unsigned(rs1)) * (unsigned(rs2)) into rD.
MULHSU – write upper 32-bit result of (signed(rs1)) * (unsigned(rs2)) into rD.

For my implementation, I perform three multiplies for the various signed/unsigned options, and then in the next cycle I select what is written to the destination register.

I rely on the VHDL integer typecasts for unsigned/signed, apart from the MULHSU case, whereby I manually sign-extend one operand by appending the MSB, and then the other operand gets ‘0’ appended for forced unsigned. I use a 2-state machine to keep track of when to write the result – but as you can see from the comment, we immediately deassert O_wait on the first encounter, as there is always an additional cycle available to complete the operation.

FPGA Multiply

“But Colin – you’re just… multiplying, on an FPGA, and it’s working?!” – Yup. My target FPGA, the Xilinx Spartan 7-50 on a Digilent Arty S7 board synthesizes a 32-bit integer multiply operation in the VHDL into a set of cascaded DSP primitive slices – neat huh?

Just like the blocks I use for fast ram on the FPGA, there are fixed function hardware blocks embedded into the FPGA which can be utilized by user designs. In this case, it’s the DSP48 block, which has a 25x18bit multiplier. The synthesis tools chain multiple of these blocks together to generate our 32×32 results.

Divide

For division, it’s not so easy. I need to implement it manually. Thankfully, integer binary long division is an algorithm which can be easily understood and transformed into VHDL for use in my ALU.

I decided to create a new division entity which encapsulated this algorithm and exposed signals for the ALU to use. I took the algorithm explained on wikipedia as above, and then made it also work for the various operations required:

As with multiply, some operations only apply to rv64 which we can ignore. We need to implement the following:

DIV – signed integer division
DIVU – unsigned integer division
REM – signed integer remainder
REMU – unsigned integer remainder

The long division algorithm assumes unsigned integers, so for signed use we can test the sign bits when we start a divide operation. We then feed the negated values, if negative, through the unsigned operation loop – before restoring the sign at the end of the operation to get the correct result.

There are some edge cases that are handled separately – notably division by zero. In the RISC-V spec, division by zero specifically does not raise an exception.

The division unit runs in three states; IDLE, INFLIGHT and COMPLETE.

STATE_IDLE: Waits for the execution instruction, and if found, sets up the inputs for the required operation and passes to inflight. For division by zero or one, these edge cases are handled and we short circuit to complete.

STATE_INFLIGHTU: Runs for 32 cycles, performing the unsigned binary long division algorithm loop.

STATE_COMPLETE: Selects the required output values, and adjusts for signed/unsigned operation.

The inflight state performs the loop from the algorithm, and the VHDL can be seen to match the wikipedia psudocode:

Note that when R is referred, we always need to refer to its full representation as s_R(30 downto ) & s_N(s_i), as that initial assignment in the wikipedia code would not be available until the next cycle in the VHDL.

The interface for the division unit is as follows.

The operations passed in I_op correspond to two bits of the funct3 part of the opcode for divide operations. This means they pass straight through from the decoder to this divide unit. You’ll notice there is an interrupt output for this unit, but currently due to the RISC-V spec this is not used.

Now that we have a facility for performing the 32-bit integer divide operations we require – it’s time to put it all together! We place the the divide unit component in our alu, and control it when we recieve a divide operation. We control the o_wait output of the ALU, keeping it active until the divide unit is complete, before forwarding the result to the ALU’s own result. The control unit then transfers us to writeback, for the register write of the result.

And that’s it. I used the riscv-compliance runner for mul and div to test my implementation, and fixed a few issues with signed handling in divide – but other than that, things were generally straightforward.

In terms of performance, the biggest win was for multiply- two cycles is pretty quick compared to the super slow emulation with invalid instruction interrupts, or soft multiply libraries. Divide wasn’t as big a win, but this is mainly due to my use cases not making much use of divide. In Doom, there was a gain with the hardware multiply – I hope to cover that in more depth in a post specifically on optimizing the SoC for Doom. Using hardware multiply support instead of relying on the software multiply I was using (based on llvm compiler-rt) resulted in a 28% increase in FPS on timedemo demo3.

The performance is as follows:

Multiply: execute: 2 cycles
Divide/Remainder: execute: 34 cycles

Possible Optimizations

Reading the spec on division operations you can see the above note on operation fusing. This is an optimization whereby if we need both the quotient and remainder from an operation, the cpu will only do the slow division operation once, and then use the results from that to complete both DIV and REM requests.

The way the division unit has been fixed to the ALU means at the moment, for an instruction sequence of say div r4, r1, r2; rem r5, r1, r2, we will take at least 68 cycles to complete within the execution stage. Given that the inputs to the instructions (r1, r2) remain constant in this sequence, we should be able to reuse the intermediate results of the division instruction to complete the remainder operation quicker.

This would look somewhat like the following:

Within the alu_int32_div entity; when we get an I_exec command, test the input operands against those for a valid result we have previously computed.
- If the new input match the previously computed, valid, operation;
- Perform any state modification which depends on operation – as the operation can change, despite the data inputs being constant
- Move to STATE_COMPLETE, completely skipping STATE_INFLIGHT
Write the output as before, which will draw on the currently requested operation, and use the existing values for the s_Q and s_R intermediate results.

With this set-up, skipping the STATE_INFLIGHT section will save 32 cycles – meaning out div+rem operation will now only take 36 cycles, a considerable speedup if this is a hot path in the code.

As the comparisons for testing the input operands will compare actual data values, rather than register numbers, we don’t need to worry about hazards. A register write to r1 between the operations of a different value would mean the inflight short-circuit would not be taken. Additionally, it means that the DIV+REM instructions do not need to follow each other for this to work.

I’ve not implemented this optimization in RPU yet, but it is on my to-do list!

Conclusion

We’ve complicated the pipeline picture a little, but added some nice functionality in the process.

For those of you who are also making RV32 cores – you may be interested to know you can pass a GCC toolchain an option which emits multiply instructions, but not division. This would have been great for me during development of RPU 1.0! I Learned this from Luke Wren over on twitter.

Arrgh. Wish I'd known of this flag about a year ago when I was adding mul/div to RPU! https://t.co/VA41OnNv3e
— Colin Riley 🎗 (@domipheus) May 19, 2021

Additionally, seems the RISC-V spec is getting a multiply-only extention which in my view is a good move.

That’s it for this part in the series! thank you for reading. It’s been a long time since the last part, my excuse is that’s been a fairly crazy year. The multiply and divide functionality is already on github, and has been for quite a while as part of the RPU 1.0 release. If you have any further questions, please do not hesitate to send me a message on twitter @domipheus!

Domipheus Labs

Stuff that interests Colin ‘Domipheus’ Riley