Domipheus Labs

Stuff that interests Colin ‘Domipheus’ Riley


Designing a CPU in VHDL, Part 10: Interrupts and Xilinx block RAMs

Posted Oct 31, 2015, Reading time: 13 minutes.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

Part 10 was supposed to be a very big part, with a special surprise of TPU working with a cool peripheral device, but that work is still ongoing. It's taking a long time, mostly because I've been busy over the past few weeks. In this update, however, I'll look at bringing interrupts to TPU, as well as fixing an issue with the embedded ram that was causing bloating of the synthesized design.


Interrupts are needed on a CPU that is expected to work with multiple asynchronous devices whilst also doing some other computation. You can always have the CPU poll, but sometimes that isn't wise or suitable given other constraints. Interrupts are also good for keeping time with something – vsync, for example. This is where they come in: a signal fed to the CPU externally can "interrupt" what the CPU is currently executing, and perform some other computation before returning to its previous task.

The way I have implemented the interrupts is similar to the Z80 maskable interrupts, with an external interrupt input and an interrupt acknowledge output. The system is simplified and doesn’t have the different types of modes and non-maskable interrupts available on the Z80 but it should be enough for the needs of TPU. You can only handle a single request at a time, and there is only one mode to work with – but it’s powerful enough for most situations.

An overview of how the interrupts work is as follows:

It’s very important that the interrupt input is only acted upon during the end of the writeback stage. Doing it at any other point can result in an inconsistent execution state, whereby we do not know if the current instruction has executed to completion. Doing the interrupt at the end of a writeback means:

  1. the PC we save (to return to later) is already the ‘next’ PC, be that prev_pc+2, or a branch target;
  2. memory reads have had time to complete successfully; and
  3. any registers have had time to see and act upon write enable signals to store data.

The items that are needed, therefore, are:

Internal registers & Connections

I added a 16-bit register for the 'next PC' and also the 'interrupt data' to the ALU itself, rather than adding them to the register file. There are individual set/write control lines and also data lines for them into the ALU. It's a bit messy and adds a lot of ports to the ALU and control unit, but it worked, and I can change this later if I want to tidy things up. Having the registers as part of the ALU makes the instructions that access them incredibly simple and self-contained.

Control unit additions

The control unit now has an interrupt state, all of the control signals for setting the registers in the ALU, and also the logic for managing the phases of calling into the interrupt handler. If interrupts are enabled, the interrupt input is active, and it is the end of the writeback phase, the following occurs:

  1. Interrupt_ack is activated.
  2. A cycle of latency is provided.
  3. The bits on the data in bus are sampled and the ALU is instructed to store this value.
  4. The current PC (which is, at this point, the next instruction to execute) is saved by the ALU.
  5. The PC unit sets the current PC to the interrupt vector, currently fixed at 0x0008.
  6. The control unit resets its interrupt state, and proceeds to the fetch stage of the pipeline.

At the moment, interrupts are not disabled automatically when the handler is invoked, so the first instruction must be a disable interrupt instruction.
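The accept sequence above can be sketched as a small Python model. The class name, method name, and single-call timing here are my own illustration; the real logic is spread across the VHDL control unit and ALU, and the latency cycle is not modelled cycle-accurately.

```python
# Toy model of the control unit's interrupt-accept sequence (steps 1-6 above).
# All names are illustrative; the actual implementation is in VHDL.
INT_VECTOR = 0x0008  # fixed interrupt vector address

class InterruptCtrl:
    def __init__(self):
        self.int_enabled = True
        self.saved_pc = None      # 'next PC' register held in the ALU
        self.event_field = None   # interrupt event data register in the ALU

    def end_of_writeback(self, next_pc, int_line, data_in_bus):
        """Called at the end of every writeback phase; returns the next fetch PC."""
        if not (self.int_enabled and int_line):
            return next_pc  # no interrupt: fetch continues as normal
        # steps 1-3: ack is raised, a latency cycle passes, data in bus is sampled
        self.event_field = data_in_bus & 0xFFFF
        # step 4: save the PC that would have been fetched next (bbi returns here)
        self.saved_pc = next_pc
        # steps 5-6: redirect fetch to the interrupt vector
        return INT_VECTOR

ctrl = InterruptCtrl()
pc = ctrl.end_of_writeback(next_pc=0x0102, int_line=True, data_in_bus=0x014F)
print(hex(pc))             # 0x8 - fetch resumes at the vector
print(hex(ctrl.saved_pc))  # 0x102 - returned to later by bbi
```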

New Instructions

There are four new instructions used to manage and handle interrupts.

The Get Interrupt Event Field instruction transfers the value that was on the data bus at the time of an interrupt acknowledge into a register for further use. Using this value, we can work out what caused the interrupt and perform further actions from that point. For example, when used with a UART, the interrupt event field could contain the UART identifier in the high 8 bits and the byte of data which was received in the low 8 bits.
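As a quick illustration of that UART layout (the function name is mine, and the encoding is just the example scheme described above, not a fixed part of TPU):

```python
# Unpack a UART interrupt event field: UART id in the high 8 bits,
# received byte in the low 8 bits (the example layout described above).
def decode_uart_event(ief):
    uart_id = (ief >> 8) & 0xFF
    byte = ief & 0xFF
    return uart_id, byte

uart_id, byte = decode_uart_event(0x014F)
print(uart_id, hex(byte))  # 1 0x4f
```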

Branch back from Interrupt is similar to the reti instruction in the Z80. It branches back to the PC value which was due to be fetched next before the interrupt handler was invoked.

The enable and disable interrupt instructions are fairly obvious.

The interrupt vector

The interrupt vector is fixed at address 0x0008. The shape of the interrupt handler should be something like the following:

  1. Disable interrupts.
  2. Save all registers.
  3. Get the interrupt event data field.
  4. Perform an action according to the interrupt event field, or add the field data to a queue for later processing.
  5. Restore all registers.
  6. Enable interrupts.
  7. Branch back to 'normal' code.

Saving the registers can be done by pushing them to the current stack and then restoring them before returning from the handler. I've been using r7 as a 'standard' stack pointer in our very ad-hoc ABI spec, so this can be done. This does use the user stack, though, so it needs to be taken into account if stack space is a particular concern.

There are a few issues that could occur, mainly in the timing between disabling and enabling the interrupts. A new interrupt could be pending when the enable interrupts instruction is processed, and that interrupt will then be accepted before the bbi instruction branches back. This destroys the PC value saved when the original interrupt was raised, so I will probably change things around. There are a few solutions to this, one being that interrupts are by definition disabled when the branch to the interrupt vector occurs, and a bbi instruction then implicitly turns interrupts on again. I'll need to have a think about the best course of action for this.

The makeup of the test interrupt routines I've used is like the following (snipped for clarity):

  load.h  r7, 0x08
  subi    r7, r7, 4
  bi      $start
  dw      0x0000
intvec:   #interrupt vector 0x8
  # save the registers
  gief    r0
  #    inspect r0 for interrupt type
  #    branch to some other work
  # restore the registers
  load.l  r0, 0

The interrupt handler, whilst a bit messy in its implementation, works well in simulation. I've yet to use it when TPU is running on the FPGA with an external source, but I do not foresee many issues other than the one stated above.

A Look in the simulator

The waveform above shows an interrupt being flagged on a UART receive event, with the event field containing the UART ID (1) and the byte value received (0x4f). Walking through the waveform, we get the following:

  1. The UART has received a byte and signals this.
  2. An interrupt is immediately raised.
  3. Several cycles later, the ACK is signaled by the CPU.
  4. The interrupt event field (IEF) data is placed on the data in bus after a cycle of delay.
  5. The ACK is de-signaled, and the IEF is removed from the data in bus and saved internally (to later be used via the gief instruction).
  6. The CPU branches to the interrupt vector 0x0008, requesting the instruction from memory.

The internal RAM

I mentioned previously that the design resources had shot up, and it turns out this is due mainly to the internal ram not being synthesized as a block ram. I was getting an internal compiler error in the Xilinx toolchain when building the existing ram with a larger capacity (I think it was 512 bytes at that point), and to counter this I re-implemented the ram in another way. The way I did it, though, added an asynchronous element, which in turn forced the toolchain to implement the ram via look-up tables instead of utilizing the block ram. This is why there was a jump in resource requirements when using the Spartan6.

Block Rams

I could not get around the internal compiler error without an async element, so off to the documentation for the Spartan6 I went. It turns out there is a document specifically on the block rams available on the device I have.

The block rams are used by initializing a generic object in VHDL to various constants, and then interfacing with the ports that object exposes. There are two kinds of block rams available, but I decided to use the 18 kilobit, dual-port one: RAMB16BWER. It is made up of 16Kb for data and 2Kb for parity. ISE has a nice template library for instantiation of primitives, and the block ram I use is included. It can be found within Edit->Language Templates, and then within the VHDL->Device Primitives->Spartan6->RAM/ROM.

This brings up a window with initialization code to copy and paste into your own design. I took it, and edited the relevant areas to configure it for a 16-bit addressed memory.

Despite having the existing integrated ram address bytes explicitly, I decided against that with the block ram and instead addressed 16-bit values. To the TPU programmer, it still addresses bytes, but internally, data is really stored in 16-bit (2-byte) blocks. The main reason for this was latency and complexity. By addressing 16-bit values internally in the block ram, I can implement both 16-bit reads/writes and 8-bit reads/writes using a single port. The RAMB16BWER has a byte-wise write enable, so I can write either the high or low 8 bits of a memory location internal to the block ram, leaving the other half untouched. There is one issue that arises from this method – an unaligned 16-bit read/write (i.e., the address being odd) will result in incorrect behavior. At the moment nothing happens if you try this, but I intend to add a trap/exception. I could maybe invoke the interrupt handler with a known interrupt event field value to specify an unaligned memory operation.
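A Python model may make the access scheme clearer. The class and method names are illustrative rather than anything in the VHDL, it abstracts away the RAMB16BWER specifics, and it simply rejects unaligned 16-bit accesses, where the real hardware currently misbehaves silently.

```python
# Toy model of the block ram access scheme: storage is an array of 16-bit
# words, a byte write drives only one write-enable lane, and unaligned
# 16-bit accesses are rejected. In this model the byte at an odd address
# occupies the high half of its word.
class WordRam:
    def __init__(self, words=1024):
        self.mem = [0] * words  # 16-bit words, as in the RAMB16BWER

    def write(self, addr, value, size16):
        word, odd = addr >> 1, addr & 1
        if size16:
            if odd:
                raise ValueError("unaligned 16-bit write")
            self.mem[word] = value & 0xFFFF                                   # WEA = "0011"
        elif odd:
            self.mem[word] = (self.mem[word] & 0x00FF) | ((value & 0xFF) << 8)  # WEA = "0010"
        else:
            self.mem[word] = (self.mem[word] & 0xFF00) | (value & 0xFF)         # WEA = "0001"

    def read(self, addr, size16):
        word, odd = addr >> 1, addr & 1
        if size16:
            if odd:
                raise ValueError("unaligned 16-bit read")
            return self.mem[word]
        return (self.mem[word] >> 8) & 0xFF if odd else self.mem[word] & 0xFF

ram = WordRam()
ram.write(0x10, 0xBEEF, size16=True)
ram.write(0x11, 0xAA, size16=False)       # overwrite only the odd byte
print(hex(ram.read(0x10, size16=True)))   # 0xaaef
print(hex(ram.read(0x11, size16=False)))  # 0xaa
```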


There were several gotchas I encountered whilst trying the block ram with a testbench. The addressing scheme, first of all, was confusing. As the generic component was initialized for 16-bit data (18-bit when you include parity), I assumed it would transform the address itself into the correct form. After running the test bench, this did not seem to be the case. The documentation has a table of mappings and also a formula, but in the end it only took a few minutes of inspection in the simulator to work out what was happening.

The next issue was a rather silly affair! The initialization attributes for the block ram run from most-significant to least-significant order. Due to this, 16-bit instructions need to be byte-flipped when read in the code, and they also go from right to left along the initialization attribute.

-- BEGIN TASM RAMB16BWER INIT OUTPUT                                         
INIT_00 => X"06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E",

Maps to the instruction forms (only first 3 instructions shown):

X"8E", X"08", -- 0000: load.h  r7 0x08
X"1E", X"E9", -- 0002: subi    r7 r7 4
X"C1", X"0C", -- 0004: bi      0x0018

I will not admit the amount of time spent trying to figure out the issue of byte flipping in the initialization attribute 😉
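The byte ordering can be reproduced with a few lines of Python (illustrative only; TASM itself does this in C#, and the function name is mine):

```python
# Build one INIT_xx attribute string: 32 bytes of memory, emitted from
# most- to least-significant, so the first byte of code lands at the
# right-hand end of the string.
def init_attr(code_bytes):
    padded = bytes(code_bytes) + bytes(32 - len(code_bytes))  # pad to 32 bytes
    return "".join(f"{b:02X}" for b in reversed(padded))

# First three instructions from the listing above: load.h / subi / bi
prog = [0x8E, 0x08, 0x1E, 0xE9, 0xC1, 0x0C]
print(init_attr(prog))
# the string ends "...0CC1E91E088E": the load.h bytes 8E 08 sit rightmost
```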

The least significant digit of the address, specifying the high/low byte of the 16-bit memory location, is managed in the VHDL process. I've put that process (and other relevant signal operations) below for clarity. It's a large block of text even without some of the less important generic attributes/initializations, which I have omitted.

 generic map (
    -- DATA_WIDTH_A/DATA_WIDTH_B: 0, 1, 2, 4, 9, 18, or 36
    DATA_WIDTH_A => 18,
    DATA_WIDTH_B => 18,
    -- SIM_COLLISION_CHECK: Collision check enable "ALL", "WARNING_ONLY", "GENERATE_X_ONLY" or "NONE"
    -- SIM_DEVICE: Must be set to "SPARTAN6" for proper simulation behavior
    -- (remaining generic attributes omitted)
 )
 port map (
    -- Port A Data: 32-bit (each) output: Port A data
    DOA => DOA,       -- 32-bit output: A port data output
    DOPA => DOPA,     -- 4-bit output: A port parity output
    -- Port B Data: 32-bit (each) output: Port B data
    DOB => DOB,       -- 32-bit output: B port data output
    DOPB => DOPB,     -- 4-bit output: B port parity output
    -- Port A Address/Control Signals: 14-bit (each) input: Port A address and control signals
    ADDRA => ADDRA,   -- 14-bit input: A port address input
    CLKA => CLKA,     -- 1-bit input: A port clock input
    ENA => ENA,       -- 1-bit input: A port enable input
    REGCEA => REGCEA, -- 1-bit input: A port register clock enable input
    RSTA => RSTA,     -- 1-bit input: A port register set/reset input
    WEA => WEA,       -- 4-bit input: Port A byte-wide write enable input
    -- Port A Data: 32-bit (each) input: Port A data
    DIA => DIA,       -- 32-bit input: A port data input
    DIPA => DIPA,     -- 4-bit input: A port parity input
    -- Port B Address/Control Signals: 14-bit (each) input: Port B address and control signals
    ADDRB => ADDRB,   -- 14-bit input: B port address input
    CLKB => CLKB,     -- 1-bit input: B port clock input
    ENB => ENB,       -- 1-bit input: B port enable input
    REGCEB => REGCEB, -- 1-bit input: B port register clock enable input
    RSTB => RSTB,     -- 1-bit input: B port register set/reset input
    WEB => WEB,       -- 4-bit input: Port B byte-wide write enable input
    -- Port B Data: 32-bit (each) input: Port B data
    DIB => DIB,       -- 32-bit input: B port data input
    DIPB => DIPB      -- 4-bit input: B port parity input
 );

 -- End of RAMB16BWER_inst instantiation

--todo: assertion on non-aligned 16b read?

CLKA <= I_clk;
CLKB <= I_clk;

ENA <= I_cs;
ENB <= '0';--port B unused

ADDRA <= I_addr(10 downto 1) & "0000";

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte write: enable only the relevant byte lane
        if I_addr(0) = '1' then
          WEA <= "0010";
          DIA <= X"0000" & I_data(7 downto 0) & X"00";
        else
          WEA <= "0001";
          DIA <= X"000000" & I_data(7 downto 0);
        end if;
      else
        -- 2 byte write: both byte lanes enabled
        WEA <= "0011";
        DIA <= X"0000" & I_data(7 downto 0) & I_data(15 downto 8);
      end if;
    else
      WEA <= "0000";
      WEB <= "0000";
      if I_size = '1' then
        -- 1 byte read: select the high or low byte of the 16-bit word
        if I_addr(0) = '0' then
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(7 downto 0);
        else
          data(15 downto 8) <= X"00";
          data(7 downto 0)  <= DOA(15 downto 8);
        end if;
      else
        -- 2 byte read
        data(15 downto 8) <= DOA(7 downto 0);
        data(7 downto 0)  <= DOA(15 downto 8);
      end if;
    end if;
  end if;
end process;

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

Assembler Output

The last thing to do was to add another output file generator to TASM, my C# TPU assembler. It simply outputs the whole 2KB initialization table for the input assembly, which is then copy/pasted into the VHDL at the appropriate attribute location.
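A sketch of what such a generator does, assuming the 2KB image splits into 64 INIT lines of 32 bytes each (illustrative Python; the real generator lives in TASM's C# code):

```python
# Emit the INIT_00..INIT_3F attributes covering a 2KB image, each line
# holding 32 bytes in most- to least-significant order.
def emit_init_table(image):
    image = bytes(image) + bytes(2048 - len(image))  # pad image to 2KB
    lines = []
    for i in range(64):
        chunk = image[i * 32:(i + 1) * 32]
        hexstr = "".join(f"{b:02X}" for b in reversed(chunk))
        lines.append(f'INIT_{i:02X} => X"{hexstr}",')
    return lines

table = emit_init_table([0x8E, 0x08, 0x1E, 0xE9])
print(table[0])
print(len(table))  # 64
```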

Wrapping up

That’s it for this part. I really hope to have the next part with TPU talking to a peripheral device (and some changes to the ISA) in the next week or two. Fingers crossed!

Thanks for reading, comments as always to @domipheus.