This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.
Part 10 was supposed to be a very big part, with a special surprise of TPU working with a cool peripheral device, but that work is still ongoing. It’s taking a long time to do, mostly due to being busy myself over the past few weeks. However, in this update, I’ll look at bringing interrupts to TPU, as well as fixing an issue with the embedded ram that was causing bloating of the synthesized design.
Interrupts are needed on a CPU which is expected to work with multiple asynchronous devices whilst also doing some other computation. You can always have the CPU poll, but sometimes that isn’t wise and/or suitable given other constraints. An interrupt is also good for keeping time with something, vsync for example. This is where interrupts come in: a signal fed to the CPU externally can “interrupt” what the CPU is currently executing, and have it perform some other computation before returning to its previous task.
The way I have implemented the interrupts is similar to the Z80 maskable interrupts, with an external interrupt input and an interrupt acknowledge output. The system is simplified and doesn’t have the different types of modes and non-maskable interrupts available on the Z80 but it should be enough for the needs of TPU. You can only handle a single request at a time, and there is only one mode to work with – but it’s powerful enough for most situations.
An overview of how the interrupts will work is as follows:
It’s very important that the interrupt input is only acted upon during the end of the writeback stage. Doing it at any other point can result in an inconsistent execution state, whereby we do not know if the current instruction has executed to completion. Doing the interrupt at the end of a writeback means:
The items that are needed, therefore, are:
I added a 16-bit register for the ‘next PC’ and another for the ‘interrupt data’ to the ALU itself, rather than adding them to the register file. There are individual set/write control lines and also data lines for them into the ALU. It’s a bit messy and adds a lot of ports to the ALU and control unit, but it worked and I can change this later if I want to tidy things up. Having the registers as part of the ALU makes the instructions that access them incredibly simple and self-contained.
The control unit now has an interrupt state, all of the control signals for setting the registers in the ALU and also the logic for managing the phases of calling into the interrupt handler. If interrupts are enabled, the interrupt input is active and it’s the end of the writeback phase, the following occurs:
At the moment, interrupts are not disabled automatically when the handler is invoked, so the first instruction must be a disable interrupt instruction.
There are four new instructions used to manage and handle interrupts.
The Get Interrupt Event Field instruction (gief) transfers the value that was on the data bus just after the interrupt acknowledge into a register for further use. Using this value, we can work out what caused the interrupt and perform further actions from that point. One example is a UART: the interrupt data field could contain the UART identifier in the high 8 bits, and the byte of data which was received in the low 8 bits.
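As a rough illustration of that UART layout, here is a minimal Python sketch of how a handler might split the event field. It is only a software model, not TPU code; the function name `decode_uart_event` is hypothetical, and the example value 0x014F assumes the high-byte/low-byte packing described above.

```python
def decode_uart_event(field):
    """Split a 16-bit interrupt event field into (uart_id, data_byte).

    Assumes the layout suggested in the text: UART identifier in the
    high 8 bits, received byte in the low 8 bits.
    """
    field &= 0xFFFF                  # the event field is 16 bits wide
    uart_id = (field >> 8) & 0xFF    # high byte: which UART raised the event
    data_byte = field & 0xFF         # low byte: the byte that was received
    return uart_id, data_byte


uart_id, data_byte = decode_uart_event(0x014F)
print(uart_id, hex(data_byte))       # prints: 1 0x4f
```

On TPU itself the same split would be done with shift and mask instructions on the register written by gief.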
Branch back from Interrupt is similar to the reti instruction in the Z80. It branches back to the PC value which was due to be fetched next before the interrupt handler was invoked.
The enable and disable interrupt instructions are fairly obvious.
The interrupt vector is fixed at address 0x0008. The shape of the interrupt handler should be something like the following:
Saving the registers can be done by pushing them to the current stack and then restoring them before returning from the handler. I’ve been using r7 as a ‘standard’ stack pointer in our very ad-hoc ABI spec, so this can be done. This does use the user stack, though, so it needs to be taken into account if stack space is a particular concern.
There are a few issues that could occur, mainly in timing between disabling and enabling the interrupts. There could be a new interrupt waiting to be handled when the enable interrupts instruction is processed, and this interrupt will then be accepted before the bbi instruction that branches back. This will overwrite the PC value that was saved when the original interrupt was raised, so I will probably change things around. There are a few solutions to this, one being that interrupts are by definition disabled when the branch to the interrupt vector occurs, and then a bbi instruction implicitly turns interrupts on again. I’ll need to have a think about the best course of action for this.
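The hazard can be shown with a small Python model. This is a sketch under simplifying assumptions, not the control unit’s actual logic: it keeps only an enable flag and a single saved ‘next PC’ value, and the handler addresses used for ei and bbi (0x0010 and 0x0012) are hypothetical.

```python
INT_VECTOR = 0x0008  # fixed interrupt vector, as stated in the text


class InterruptModel:
    """Toy model: one enable flag and one saved 'next PC' register."""

    def __init__(self):
        self.enabled = False
        self.saved_pc = None

    def end_of_writeback(self, next_pc, irq_pending):
        """Return the PC to fetch next, taking the interrupt if allowed."""
        if self.enabled and irq_pending:
            self.saved_pc = next_pc      # overwrites any earlier saved value
            return INT_VECTOR
        return next_pc


def run_hazard_demo():
    cpu = InterruptModel()
    cpu.enabled = True
    # Main program: next fetch would be 0x0042 when the first interrupt arrives.
    cpu.end_of_writeback(next_pc=0x0042, irq_pending=True)
    original_return = cpu.saved_pc
    # Handler starts with di (at 0x0008), so nothing is accepted for now.
    cpu.enabled = False
    cpu.end_of_writeback(next_pc=0x000A, irq_pending=False)
    # ... handler body ... then ei (hypothetically at 0x0010), bbi at 0x0012.
    cpu.enabled = True
    # A second interrupt is already pending as ei completes.
    pc = cpu.end_of_writeback(next_pc=0x0012, irq_pending=True)
    return original_return, cpu.saved_pc, pc


original, saved, pc = run_hazard_demo()
# The saved PC now points at bbi inside the handler; 0x0042 has been lost.
print(hex(original), hex(saved), hex(pc))  # prints: 0x42 0x12 0x8
```

In this model, having bbi re-enable interrupts implicitly (so that no separate ei runs before it) would remove the window in which the saved value can be overwritten.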
The makeup of the test interrupt routines I’ve had are like the following (snipped for clarity):
entry:
    load.h r7, 0x08
    subi r7, r7, 4
    bi $start
    dw 0x0000
intvec: #interrupt vector 0x8
    di
    # save the registers
    gief r0
    # inspect r0 for interrupt type
    # branch to some other work
    # restore the registers
    ei
    bbi
start:
    load.l r0, 0
    ...
The interrupt handler, whilst a bit messy in its implementation, works well in simulation. I’ve yet to use it when TPU is running on the FPGA with an external source, but I do not foresee many issues other than the one stated above.
The above waveform is showing an interrupt being flagged on a UART receive event, the event field containing the UART ID (1) and the byte value received (0x4f). Walking through the waveform, we get the following:
I mentioned previously that the design resources had shot up, and it turns out this is due mainly to the internal ram not being synthesized as a block ram. I was getting an internal compiler error in the Xilinx toolchain when building the existing ram with a larger capacity (I think it was 512 bytes at this point) and to counter this I re-implemented the ram in another way. The way I did it, though, added an asynchronous element which in turn forced the toolchain to implement the RAM via look-up tables, instead of utilizing the block ram. This is why there was a jump in resource requirements when using the Spartan6.
I could not get around the internal compiler error without an async element, so off to the documentation for the spartan6 I went. Turns out there is a document specifically on the block rams available on the device I have.
The block rams are used by initializing a generic object in VHDL to various constants, and then interfacing with the ports that object exposes. There are two kinds of block rams available, but I decided to use the 18 kilobit, dual-port one: RAMB16BWER. It is made up of 16Kb for data and 2Kb for parity. ISE has a nice template library for instantiation of primitives, and the block ram I use is included. It can be found within Edit->Language Templates, and then within the VHDL->Device Primitives->Spartan6->RAM/ROM.
This brings up a window with initialization code to copy and paste into your own design. I took it, and edited the relevant areas to configure it for a 16-bit addressed memory.
Although the existing integrated ram addressed bytes explicitly, I decided against that with the block ram and instead addressed 16-bit values. To the TPU programmer, memory is still byte-addressed, but internally it is stored as 16-bit (2-byte) words. The main reason for this was latency and complexity. By addressing 16-bit values internally in the block ram, I can implement both 16-bit reads/writes and 8-bit reads/writes using a single port. The RAMB16BWER has a byte-wise write enable, so I can write either the high or low 8 bits of a memory location internal to the block ram, leaving the other half untouched. There is one issue that arises from this method: an unaligned 16-bit read/write (i.e., the address being odd) will result in incorrect behavior. At the moment nothing happens if you try this, but I intend to add a trap/exception. I could maybe invoke the interrupt handler with a known interrupt event field value to specify an unaligned memory operation.
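To make the word/byte-lane idea concrete, here is a minimal Python model of the scheme. It is a sketch that mirrors my reading of the VHDL process shown later in this post (even byte addresses use the low lane, odd ones the high lane, and 16-bit values are stored byte-swapped); the class name `WordRam` and its methods are hypothetical, and timing is not modelled at all.

```python
class WordRam:
    """Byte-addressed view over 16-bit words with per-byte write enables."""

    def __init__(self, num_words=1024):      # 1024 x 16 bits = 2KB
        self.words = [0] * num_words

    def write(self, addr, data, byte_access):
        index = (addr >> 1) % len(self.words)  # bit 0 only selects the lane
        word = self.words[index]
        if byte_access:
            if addr & 1:
                # Odd address: only the high lane is write-enabled.
                word = (word & 0x00FF) | ((data & 0xFF) << 8)
            else:
                # Even address: only the low lane is write-enabled.
                word = (word & 0xFF00) | (data & 0xFF)
        else:
            # 16-bit access: both lanes enabled, value stored byte-swapped.
            # Bit 0 of the address is ignored, so odd addresses are not trapped.
            word = ((data & 0xFF) << 8) | ((data >> 8) & 0xFF)
        self.words[index] = word

    def read(self, addr, byte_access):
        word = self.words[(addr >> 1) % len(self.words)]
        if byte_access:
            return (word >> 8) & 0xFF if addr & 1 else word & 0xFF
        return ((word & 0xFF) << 8) | ((word >> 8) & 0xFF)


ram = WordRam()
ram.write(0x0000, 0x8E08, byte_access=False)  # 16-bit store
ram.write(0x0001, 0xFF, byte_access=True)     # touches only one lane
print(hex(ram.read(0x0000, byte_access=False)))  # prints: 0x8eff
```

The byte store changes only the half of the word selected by address bit 0, which is the property the byte-wise write enable provides in hardware.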
There were several gotchas I encountered whilst trying the block ram with a testbench. The addressing scheme, first of all, was confusing. As the generic component was configured for 16-bit data (18-bit when you include parity), I assumed it would transform the address itself into the correct form. This did not seem to be the case after running the test bench. The documentation has a table of mappings and also a formula, but in the end it only took a few minutes of inspection in the simulator to work out what was happening.
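The mapping I ended up with is the single ADDRA assignment in the VHDL further down: bits 10 to 1 of the TPU address, followed by four zero bits. A minimal Python sketch of that bit manipulation follows; the function name `block_ram_address` is hypothetical, and the comment about why the low four bits are zero is my interpretation rather than something taken from the documentation.

```python
def block_ram_address(byte_addr):
    """Model of: ADDRA <= I_addr(10 downto 1) & "0000"."""
    # Drop bit 0 (the byte-lane select) and keep ten bits: 1024 words.
    word_index = (byte_addr >> 1) & 0x3FF
    # Append four zero bits. My understanding is that the 14-bit ADDRA
    # bus has finer granularity than a 16-bit word, so the word index
    # has to sit in the upper ten bits.
    return word_index << 4


print(hex(block_ram_address(0x0018)))  # prints: 0xc0
```

Both bytes of a word map to the same block ram address, so 0x0018 and 0x0019 give the same result here.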
The next issue was a rather silly affair! The initialization attributes for the block ram are written in most-significant to least-significant order. Because of this, 16-bit instructions need to be byte-flipped relative to how they read in the code, and they also run from right to left along the initialization attribute.
-- BEGIN TASM RAMB16BWER INIT OUTPUT
INIT_00 => X"06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E",
Maps to the instruction forms (only first 3 instructions shown):
X"8E", X"08", -- 0000: load.h r7 0x08
X"1E", X"E9", -- 0002: subi r7 r7 4
X"C1", X"0C", -- 0004: bi 0x0018
...snip...
I will not admit the amount of time spent trying to figure out the issue of byte flipping in the initialization attribute 😉
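One way to sanity-check an initialization attribute is to reverse it back into address order. The Python sketch below does that for the INIT_00 value shown above; the function name `init_row_to_bytes` is hypothetical and this is not part of TASM.

```python
# INIT_00 value copied from the TASM output earlier in this post.
INIT_00 = "06831180E27F00300000004F4C4C454801E102E100EF03E100000CC1E91E088E"


def init_row_to_bytes(row_hex):
    """Return the 32 bytes of one INIT row in ascending address order.

    The attribute is written most-significant byte first, so the byte
    at the lowest address is the last one in the string.
    """
    return bytes.fromhex(row_hex)[::-1]


row = init_row_to_bytes(INIT_00)
print(row[:6].hex())   # prints: 8e081ee9c10c
```

The first six bytes match the three instructions listed above, in the order the TPU programmer sees them.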
The least significant bit of the address, specifying the high/low byte of the 16-bit memory location, is managed in the VHDL process. I’ve put that process (and other relevant signal operations) below for clarity. It’s a large block of text even without some of the less important generic attributes/initializations, which I have omitted.
RAMB16BWER_inst : RAMB16BWER
generic map (
    -- DATA_WIDTH_A/DATA_WIDTH_B: 0, 1, 2, 4, 9, 18, or 36
    DATA_WIDTH_A => 18,
    DATA_WIDTH_B => 18,
    ...snip...
    -- SIM_COLLISION_CHECK: Collision check enable "ALL", "WARNING_ONLY", "GENERATE_X_ONLY" or "NONE"
    SIM_COLLISION_CHECK => "ALL",
    -- SIM_DEVICE: Must be set to "SPARTAN6" for proper simulation behavior
    SIM_DEVICE => "SPARTAN6",
    ...snip...
)
port map (
    -- Port A Data: 32-bit (each) output: Port A data
    DOA => DOA,       -- 32-bit output: A port data output
    DOPA => DOPA,     -- 4-bit output: A port parity output
    -- Port B Data: 32-bit (each) output: Port B data
    DOB => DOB,       -- 32-bit output: B port data output
    DOPB => DOPB,     -- 4-bit output: B port parity output
    -- Port A Address/Control Signals: 14-bit (each) input: Port A address and control signals
    ADDRA => ADDRA,   -- 14-bit input: A port address input
    CLKA => CLKA,     -- 1-bit input: A port clock input
    ENA => ENA,       -- 1-bit input: A port enable input
    REGCEA => REGCEA, -- 1-bit input: A port register clock enable input
    RSTA => RSTA,     -- 1-bit input: A port register set/reset input
    WEA => WEA,       -- 4-bit input: Port A byte-wide write enable input
    -- Port A Data: 32-bit (each) input: Port A data
    DIA => DIA,       -- 32-bit input: A port data input
    DIPA => DIPA,     -- 4-bit input: A port parity input
    -- Port B Address/Control Signals: 14-bit (each) input: Port B address and control signals
    ADDRB => ADDRB,   -- 14-bit input: B port address input
    CLKB => CLKB,     -- 1-bit input: B port clock input
    ENB => ENB,       -- 1-bit input: B port enable input
    REGCEB => REGCEB, -- 1-bit input: B port register clock enable input
    RSTB => RSTB,     -- 1-bit input: B port register set/reset input
    WEB => WEB,       -- 4-bit input: Port B byte-wide write enable input
    -- Port B Data: 32-bit (each) input: Port B data
    DIB => DIB,       -- 32-bit input: B port data input
    DIPB => DIPB      -- 4-bit input: B port parity input
);
-- End of RAMB16BWER_inst instantiation
--
--todo: assertion on non-aligned 16b read?
--
CLKA <= I_clk;
CLKB <= I_clk;
ENA <= I_cs;
ENB <= '0';--port B unused
ADDRA <= I_addr(10 downto 1) & "0000";
process (I_clk, I_cs)
begin
    if rising_edge(I_clk) and I_cs = '1' then
        if (I_we = '1') then
            if I_size = '1' then
                -- 1 byte
                if I_addr(0) = '1' then
                    WEA <= "0010";
                    DIA <= X"0000" & I_data(7 downto 0) & X"00";
                else
                    WEA <= "0001";
                    DIA <= X"000000" & I_data(7 downto 0);
                end if;
            else
                WEA <= "0011";
                DIA <= X"0000" & I_data(7 downto 0) & I_data(15 downto 8);
            end if;
        else
            WEA <= "0000";
            WEB <= "0000";
            if I_size = '1' then
                if I_addr(0) = '0' then
                    data(15 downto 8) <= X"00";
                    data(7 downto 0) <= DOA(7 downto 0);
                else
                    data(15 downto 8) <= X"00";
                    data(7 downto 0) <= DOA(15 downto 8);
                end if;
            else
                data(15 downto 8) <= DOA(7 downto 0);
                data(7 downto 0) <= DOA(15 downto 8);
            end if;
        end if;
    end if;
end process;
O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";
The last thing to do was to add another output file generator to TASM, my C# TPU assembler. This simply outputs the whole 2KB initialization table for the input assembly. It’s then just copy/pasted into the VHDL in the appropriate attribute location.
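The real generator lives in TASM and is written in C#, but the idea can be sketched in a few lines of Python. This is an illustrative model under assumptions, not the TASM code: the function name `init_rows` is hypothetical, the image is zero-padded to 2KB, and each 32-byte row is emitted most-significant byte first as INIT_00 through INIT_3F.

```python
def init_rows(image, ram_bytes=2048, row_bytes=32):
    """Format an assembled byte image as RAMB16BWER INIT_xx attribute lines."""
    if len(image) > ram_bytes:
        raise ValueError("image does not fit in the block ram")
    padded = bytes(image) + bytes(ram_bytes - len(image))  # zero fill
    lines = []
    for row in range(ram_bytes // row_bytes):
        chunk = padded[row * row_bytes:(row + 1) * row_bytes]
        # Reverse the chunk so the lowest address ends up rightmost.
        lines.append(f'INIT_{row:02X} => X"{chunk[::-1].hex().upper()}",')
    return lines


# First 32 bytes of the test program, in address order (taken from INIT_00 above).
program = bytes.fromhex(
    "8E081EE9C10C0000E103EF00E102E101"
    "48454C4C4F00000030007FE280118306"
)
print(init_rows(program)[0])
```

The first line printed should match the INIT_00 attribute shown earlier; the remaining 63 rows are all zeros for this short image.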
That’s it for this part. I really hope to have the next part with TPU talking to a peripheral device (and some changes to the ISA) in the next week or two. Fingers crossed!
Thanks for reading, comments as always to @domipheus.