Domipheus Labs

Stuff that interests Colin ‘Domipheus’ Riley

Content follows this message
If you have enjoyed my articles, please consider these charities for donation:
  • Young Lives vs Cancer - Donate.
  • Blood Cancer UK - Donate.
  • Children's Cancer and Leukaemia Group - Donate.

Designing a CPU in VHDL, Part 9: Byte addressing, memory subsystem and UART

Posted Sep 30, 2015, Reading time: 17 minutes.

This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing. This part is heavy going if you’ve not read the previous posts.

Byte Addressing

TPU currently operates with memory by addressing 16-bit words. It’s a fairly common set-up for custom processors (addressing non-‘byte’ sizes, that is), but I wanted byte addressing as it simplifies some assembly for operations and really shouldn’t be that difficult to add. There are various things that need to change:

The PC increment is trivial, simply changing the increment operator of the PC unit to add 2 instead of one. The embedded ram changes are not so bad either. We add a new input port to say whether or not the current operation is on bytes or words. When we are reading just a byte, we want to zero out the high byte of the output port and write just the low byte from memory. For a word operation, we perform two memory/ram reads and write them into the low and high parts of the output. It’s worth noting what we are doing here is technically big-endian; in that, when we do a read from a byte address, the most-significant byte is located at that address, followed by the least significant byte. I also added a chip select, which ‘disconnects’ the output when not enabled.

type store_t is array (0 to 103) of std_logic_vector(7 downto 0);
signal ram: store_t := ...

... snip ...

process (I_clk, I_cs)
begin
  if rising_edge(I_clk) and I_cs = '1' then
    if (I_we = '1') then
      if I_size = '1' then
        -- 1 byte
        ram(int_addr) <= I_data(7 downto 0);
      else
        ram(int_addr) <= I_data(15 downto 8);
        ram(int_addr+1) <= I_data(7 downto 0);
      end if;
    else
      if I_size = '1' then
        data(15 downto 8) <= X"00";
        data(7 downto 0)  <= b1;
      else
        data(15 downto 8) <= b1;
        data(7 downto 0)  <= b2;
      end if;
    end if;
  end if;
end process;

int_addr <= to_integer(unsigned(I_addr(7 downto 0)));

b1 <= ram(int_addr);
b2 <= ram(int_addr+1);

O_data <= data when I_cs = '1' else "ZZZZZZZZZZZZZZZZ";

The embedded ram gave me some problems when it came to synthesis – I had an internal compiler error in the Xilinx tools. I narrowed this down to a single line, and then a single token – one of the boundaries of a (X downto Y) statement. I re-wrote this component to get around the issue, and it also made me realise that this method of implementing the ram may be inefficient in terms of using the rams on-device. The version listed above is the new, re-written version. You can see there are multiple reads and multiple writes each cycle. I’ll need to look into how the internal block rams of the Spartan6 are used to make sure I’m not causing issues.

For the read and write instructions, I’d conveniently left a single bit of the instruction form free for later use. This is the ‘flag bit’ in position 8. When this bit is set, it instructs the memory system to do a byte operation instead of a word operation.

The changes to the assembler were pretty trivial; enable the read and write instructions for byte operations using the flag bit, and the changes required for byte addressing when calculating label offsets. With those changes, everything worked well and I was able to progress. An interesting point to note now is that I could enforce instructions being on 2-byte boundaries, which would mean all immediate branch offsets could be increased by an extra bit as they are currently. I’ve not done this, yet, but I likely will.

Memory Subsystem

I’d always intended to extend the memory subsystem of TPU as to be able to interface with memories that had different latencies. Currently, it’s fixed that memory takes a single cycle to respond. I wanted to expose an interface that would enable connecting up other memory-mapped devices (like a UART) or even enable the interfacing of SDRAM on the miniSpartan6+ board.

For this I’ve created a ‘core’ component which brings everything bar the embedded ram together in a single object. The memory is handled by an internal controller, which works as a state machine and is triggered by a command port. The state machine ports are exposed like the following.

entity mem_controller is
  Port( I_clk : in  STD_LOGIC;
        I_reset : in STD_LOGIC;

        O_ready : out STD_LOGIC;
        I_execute: in STD_LOGIC;
        I_dataWe : in  STD_LOGIC;
        I_address : in  STD_LOGIC_VECTOR (15 downto 0);
        I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        I_dataByteEn : in STD_LOGIC_VECTOR(1 downto 0);
        O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        O_dataReady: out STD_LOGIC;

        MEM_I_ready: in STD_LOGIC;
        MEM_O_cmd: out STD_LOGIC;
        MEM_O_we : out  STD_LOGIC;
        MEM_O_byteEnable : out STD_LOGIC_VECTOR (1 downto 0);
        MEM_O_addr : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_O_data : out  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_data : in  STD_LOGIC_VECTOR (15 downto 0);
        MEM_I_dataReady : in STD_LOGIC
  );
end mem_controller;

In this component we have the internal interface, and then the external interface (prefixed with “MEM_”). The order of operations is as follows:

  1. Wait until O_ready is active
  2. Set the various inputs: address, in data or write enable, and the dataByteEn – this enables 16 or 8-bit memory modes.
  3. Signal the I_execute input until O_ready goes inactive.

This then gets routed to the external ports. I wanted to have something like this within the core object in case the need arose for implementing segmentation or other types of memory addressing extensions. Then the external address buses would be wider, supplemented by a register somewhere to set the high order bits. The actual code for the memory_controller is trivial just now:

architecture Behavioral of mem_controller is
  signal we : std_logic := '0';
  signal addr : STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal indata: STD_LOGIC_VECTOR (15 downto 0) := X"0000";
  signal byteEnable: STD_LOGIC_VECTOR ( 1 downto 0) := "11";
  signal cmd : STD_LOGIC := '0';
  signal state: integer := 0;
  
begin

  process (I_clk, I_execute)
  begin
    if rising_edge(I_clk) then
      if I_reset = '1' then
        we <= '0';
        cmd <= '0';
        state <= 0;
      elsif state = 0 and I_execute = '1' and MEM_I_ready = '1' then
        we <= I_dataWe;
        addr <= I_address;
        indata <= I_data;
        byteEnable <= I_dataByteEn;
        cmd <= '1';
        O_dataReady <= '0';
        if I_dataWe = '0' then
          -- read
          state <= 1;
        else
          state <= 2;-- write
        end if;
      elsif state = 1 then
        cmd <= '0';
        if MEM_I_dataReady = '1' then
          O_dataReady <= '1';
          state <= 2;
        end if;
      elsif state = 2 then
        cmd <= '0';
        state <= 0;
        O_dataReady <= '0';
      end if;
    end if;
  end process;
  
  O_ready <= ( MEM_I_ready and not I_execute ) when state = 0 else '0';
  
  --
  MEM_O_cmd <= cmd;
  O_data <= MEM_I_data;
  MEM_O_byteEnable <= byteEnable;
  MEM_O_data <= indata;
  MEM_O_addr <= addr;
  MEM_O_we <= we;

end Behavioral;

The main point this serves in terms of TPU is that the control unit uses the output signals from the memory controller at the point when deciding to move to the next stage of the pipeline. This means that stages such as the fetch stage, and the memory stage, wait until the memory subsystem has indicated a command has completed:

For a write, we know the command has executed when the O_ready output goes back to active after deactivating after the signalling of I_execute.
For a read, we know the command has executed when the data we requested presents itself on the O_data line, with O_dataReady signalling the data on the line is valid for the previous request.

The controller does not have any buffering or a queue of operations, so everything waits for the memory system before continuing. At the moment, this is what I want as it makes debugging the system so much easier when things go wrong.

The ‘Core’

Our core TPU object looks like the following:

entity core is
  Port (I_clk : in  STD_LOGIC;
        I_reset : in  STD_LOGIC;
        I_halt : in  STD_LOGIC;

        -- memory interface
        MEM_I_ready : IN  std_logic;
        MEM_O_cmd : OUT  std_logic;
        MEM_O_we : OUT  std_logic;
        MEM_O_byteEnable : OUT  std_logic_vector(1 downto 0);
        MEM_O_addr : OUT  std_logic_vector(15 downto 0);
        MEM_O_data : OUT  std_logic_vector(15 downto 0);
        MEM_I_data : IN  std_logic_vector(15 downto 0);
        MEM_I_dataReady : IN  std_logic
      );
end core;

There are a few signals yet to add here; for one, there are no interrupts yet – something I’d like to add. But as you can see, the core TPU object now just exposes the memory interface, along with the clock and some control. If you apply the clock, a 16-bit request to read address 0 will be asserted, as it attempts to fetch it’s first instruction to execute.

In making a top-level module which we can flash to the FPGA, we need to have one of the cores, and an instance of our embedded ram. We also need to have some sort of logic which can handle the memory system commands from the core – and either let the embedded RAM service it, or some other component – such as our UART.

core_1: core PORT MAP (
  I_clk => I_clk,
  I_reset => I_reset,
  I_halt => I_halt,
  MEM_I_ready => MEM_I_ready,
  MEM_O_cmd => MEM_O_cmd,
  MEM_O_we => MEM_O_we,
  MEM_O_byteEnable => MEM_O_byteEnable,
  MEM_O_addr => MEM_O_addr,
  MEM_O_data => MEM_O_data,
  MEM_I_data => MEM_I_data,
  MEM_I_dataReady => MEM_I_dataReady
); 

ebram_1: ebram Port map ( 
  I_clk => I_clk,
  I_cs => CS_ERAM,
  I_we => MEM_O_we,
  I_addr => MEM_O_addr,
  I_data => MEM_O_data,
  I_size => ram_req_size,
  O_data => ram_output_data
);

uart_1: uart_simple PORT MAP (
  I_clk => I_clk,
  I_clk_baud_count => I_clk_baud_count,-- 0x1100 -- R/W
  I_reset => I_reset,             
  I_txData => I_txData,                -- 0x1102  -- W
  I_txSig => I_txSig,                  -- 0x1103  -- W
  O_txRdy => O_txRdy,                  -- 0x1104  -- R
  O_tx => O_tx,
  I_rx => I_rx,
  I_rxCont => I_rxCont,                -- 0x1105  -- W
  O_rxData => O_rxData,                -- 0x1106  -- R
  O_rxSig => O_rxSig,                  -- 0x1107  -- R
  O_rxFrameError => O_rxFrameError     -- 0x1108  -- R
);

In addition to the above parts of the top-level design, there is the LED byte output which ends up mapping to the LEDS on the miniSpartan6+ board. those are written to at location 0x1000, and looking at the uart_1 definition above, the comments indicate which other signals are mapped and at what location. Many of these are 1-bit signals, but I use a whole byte for simplicity.

Whilst the inputs to the embedded ram is connected directly from the MEM_signals, the output (O_data) is mapped to another signal, as any address above 0x1000 I have defined for now as not existing within the embedded RAM.

ram_req_size <= '1' when MEM_O_byteEnable = "10" else '0';
CS_ERAM <= '1' when MEM_O_addr < X"1000" else '0';
MEM_I_data <= ram_output_data when CS_ERAM = '1' else IO_DATA;

ram_output_data is selected by the CPU core appropriately, and there is an IO_DATA signal for any other memory-mapped device. Additionally, ram_req_size is used to translate between the byte enable signal (2 bits corresponding to byte locations) to the 8/16-bit embedded ram input port input.

Now, there is a process which manages the MEM_cmd and ready signals, and forwards/directs data where it needs to go depending on those commands. I’ll paste the whole thing, but it’s a bit hacky and is my first attempt – I’ll probably end up reimplementing it when I need SDRAM integration.

MEM_proc: process(I_clk)
  begin
    if rising_edge(I_clk) then
    
      if MEM_readyState = 0 then
        if MEM_O_cmd = '1' then
          -- LEDS memory map
          if MEM_O_addr = X"1000" then
            -- leds
            IO_LEDS <= MEM_O_data( 7 downto 0);
          end if;
          
          -- UART1 memory map
          case MEM_O_addr is
            when X"1100" =>
              if MEM_O_we = '0' then
                IO_DATA <= I_clk_baud_count;
              else
                I_clk_baud_count <= MEM_O_data;
              end if;
            when X"1102" =>
              if MEM_O_we = '1' then
                I_txData <= MEM_O_data(7 downto 0);
              end if;
            when X"1103" =>
              if MEM_O_we = '1' then
                I_txSig <= MEM_O_data(0);
              end if;
            when X"1104" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_txRdy;
              end if;
            when X"1105" =>
              if MEM_O_we = '1' then
                I_rxCont <= MEM_O_data(0);
              end if;
            when X"1106" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"00" & O_rxData;
              end if;
            when X"1107" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxSig;
              end if;
            when X"1108" =>
              if MEM_O_we = '0' then
                IO_DATA <= X"000" & "000" & O_rxFrameError;
              end if;
            when others =>
          end case;
          
          MEM_I_ready <= '0';
          MEM_I_dataReady  <= '0';
          if MEM_O_we = '1' then
            MEM_readyState <= 2;
          else
            MEM_readyState <= 1;
          end if;
        end if;
      elsif MEM_readyState >= 1 then
        if MEM_readyState = 1 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '1';
          MEM_readyState <= 0;
        elsif MEM_readyState = 2 then
          MEM_I_ready <= '1';
          MEM_I_dataReady  <= '0';
          MEM_readyState <= 0;
        else
          MEM_readyState <= MEM_readyState + 1;
        end if;
      end if;
    
    end if;
  
  end process;

As we disable the embedded ram when the address is >= 0x1000, we don’t need to worry about overwriting memory when looking for a memory mapped device. The LEDS are mapped with a simple if block, and the UART with a case. With this, the TPU core can access memory, write status to LEDS and also use the UART by reading and writing known memory locations.

We can simulate it with a simple test and it works well.

Using the UART

The UART itself is pretty much exactly the same one in my previous post on the subject. Ive fixed a few issues in the original code (some of which were mentioned in the article). All the ports are memory mapped, so to use the component from TPU assembly, we just read and write to those addresses. This is where 1) the byte addressing mode and 2) the immediate offset of memory addresses in the new read/write instructions help a lot. The general method for transmitting a byte over the UART is as follows:

  1. Wait until the ready bit is set
  2. Set the data input value
  3. Set the transmit signal bit
  4. Wait until the device signals not ready (i.e, working on a request)
  5. Unset the transmit signal bit

For receive, it’s similar:

  1. Ensure the RX is enabled
  2. Wait until the receive signal bit becomes active
  3. Read the output data value
  4. Wait until the receive signal bit de-asserts

In addition, the baud rate can be set by adjusting a counter value which is memory mapped. If you want a baud rate of 9600, you’d set the value to 5208 – which is the main system clock (50MHz) divided by the bit-rate required (50000000/9600).

In TPU assembly, the send and receive functions look simple enough.

uart_send:
  load.l  r1, 1
  load.h  r2, 0x11              #uart1 memory mapped offset r2=0x1100
us_waitready:
  read.b  r3, r2, 4             #readb [r2+4] (tx_ready) 
  cmp.u   r3, r3, r3            #compare
  bro.az  r3, %us_waitready     #loop until tx_ready nonzero
us_setdata:
  write.b r2, r0, 2             #write txdata register
us_setsig:
  write.b r2, r1, 3             #set txsig register 1
us_waitunready:
  read.b  r3, r2, 4             
  cmp.u   r3, r3, r3
  bro.anz r3, %us_waitunready   #loop until tx_ready zero
us_unsetsig:
  load.l  r1, 0
  write.b r2, r1, 3
us_fin:
  br r6

uart_recv:
  load.l  r1, 1
  load.h  r2, 0x11
ur_enable:
  write.b r2, r1, 5
ur_waitsig:
  read.b  r3, r2, 7             
  cmp.u   r3, r3, r3
  bro.az  r3, %ur_waitsig    #loop until rx_sig nonzero
ur_capdata:
  read.b  r0, r2, 6          #read data into r0
ur_waitunsig:
  read.b  r3, r2, 7             
  cmp.u   r3, r3, r3
  bro.anz r3, %ur_waitunsig  #loop until rx_sig nonzero
ur_fin:
  br r6

In these examples, the ‘functions’ are called with any argument in r0, the return value placed in r0, and the return PC to jump back to in r6. You can send and ‘H’ over the UART using:

load.l r0, 0x48
  load.l r6, $L_E
  bi $uart_send
L_E:
  ... snip ...

And receive sets r0 on return to the value sent across the connection.

I’m still assembling these programs and initializing the embedded RAM with those instructions at present; but this UART now gives the possibility of having a fixed bootloader which loads a binary blob from the serial port and then executes it. The assembled instructions can be output in VHDL for easy integration. Here is how an example which prints ‘HELLO’ after receiving a byte (which is discarded) and then continues to echo everything received back to the sender looks.

X"8D", X"04", -- 0000: load.l  r6 0x0004
X"C1", X"4E", -- 0002: bi      0x004e
X"81", X"48", -- 0004: load.l  r0 0x48
X"8D", X"0A", -- 0006: load.l  r6 0x000a
X"C1", X"34", -- 0008: bi      0x0034
X"81", X"45", -- 000A: load.l  r0 0x45
X"8D", X"10", -- 000C: load.l  r6 0x0010
X"C1", X"34", -- 000E: bi      0x0034
X"81", X"4C", -- 0010: load.l  r0 0x4c
X"8D", X"16", -- 0012: load.l  r6 0x0016
X"C1", X"34", -- 0014: bi      0x0034
X"81", X"4C", -- 0016: load.l  r0 0x4c
X"8D", X"1C", -- 0018: load.l  r6 0x001c
X"C1", X"34", -- 001A: bi      0x0034
X"81", X"4F", -- 001C: load.l  r0 0x4f
X"8D", X"22", -- 001E: load.l  r6 0x0022
X"C1", X"34", -- 0020: bi      0x0034
X"81", X"0D", -- 0022: load.l  r0 0x0d
X"8D", X"28", -- 0024: load.l  r6 0x0028
X"C1", X"34", -- 0026: bi      0x0034
X"81", X"00", -- 0028: load.l  r0 0
X"8D", X"2E", -- 002A: load.l  r6 0x002e
X"C1", X"4E", -- 002C: bi      0x004e
X"8D", X"28", -- 002E: load.l  r6 0x0028
X"C1", X"34", -- 0030: bi      0x0034
X"C1", X"64", -- 0032: bi      0x0064
X"83", X"01", -- 0034: load.l  r1 1
X"84", X"11", -- 0036: load.h  r2 0x11 #uart1 memory mapped offset
X"67", X"44", -- 0038: read.b  r3 r2 4 
X"96", X"6C", -- 003A: cmp.u   r3 r3 r3
X"D3", X"7C", -- 003C: bro.az  r3 -4   #loop until tx_ready nonzero
X"70", X"42", -- 003E: write.b r2 r0 2 #write txdata register
X"70", X"47", -- 0040: write.b r2 r1 3  
X"67", X"44", -- 0042: read.b  r3 r2 4  
X"96", X"6C", -- 0044: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0046: bro.anz r3 -4  
X"83", X"00", -- 0048: load.l  r1 0
X"70", X"47", -- 004A: write.b r2 r1 3
X"C0", X"C0", -- 004C: br      r6
X"83", X"01", -- 004E: load.l  r1 1
X"84", X"11", -- 0050: load.h  r2 0x11
X"72", X"45", -- 0052: write.b r2 r1 5
X"67", X"47", -- 0054: read.b  r3 r2 7
X"96", X"6C", -- 0056: cmp.u   r3 r3 r3
X"D3", X"7C", -- 0058: bro.az  r3 -4 
X"61", X"46", -- 005A: read.b  r0 r2 6  
X"67", X"47", -- 005C: read.b  r3 r2 7
X"96", X"6C", -- 005E: cmp.u   r3 r3 r3
X"D7", X"7C", -- 0060: bro.anz r3 -4 
X"C0", X"C0", -- 0062: br      r6
X"C1", X"64", -- 0064: bi      0x0064
X"00", X"00"  -- 0066: dw      0x0000

Using the miniSpartan6+ FTDI chip

One thing mentioned in the post about my own UART implementation is how I used a Teensy3.1 / external USB->serial TTL cable to interface with the FPGA.

I was contacted on twitter and made aware that the FTDI USB chip on the miniSpartan6+ is dual channel, and as well as interfacing with JTAG to flash the FPGA there is an additional COM port exposed through USB, and a TX/RX line connected to the FPGA. By connecting the TX and RX lines of my top level design to these pins, I can use the onboard USB to communicate with TPU!

This is done by assigning the external TX and RX ports of TPU to the pins on the Spartan6 that are connected to the FTDI USB chip. In the UCF constraints file:

NET "I_ext_rx" LOC = M7 | IOSTANDARD = LVTTL;
NET "O_ext_tx" LOC = N6 | IOSTANDARD = LVTTL;

Will expose those pins as channel B of the FTDI chip. It seems to communicate at 115200 baud, and in my tests it works well. Executing the TPU assembly above, with this USB->TPU setup, we can connect to the com port and communicate:

That ‘HELLO’ is hella-simple, but it shows a memory-mapped peripheral interface working with TPU, which is quite the milestone!

The State of TPU

TPU now has a decent set of instructions, and the ability to call functions, set up stacks, perform arithmetic and operate memory-mapped peripherals. The current top-level view of the system is below:

You can see we have some embedded RAM, our UART, and also a memory-mapped register for setting the LEDS on the miniSpartan6+ as well as reading the on-board switches. The core is our new component that simply exposes our memory interface. It’s a bit more than just a CPU, but we need all this to get the system working!

At the moment we’re using more of the FPGA than I’d like – mostly due to the UART. It takes a real chunk out, bringing utilization up to 33%. I don’t really mind this, as I know there is a lot of stuff implemented in very bad ways. So I’m not worried – this value will fall.

I’m not quite ready to update the Github with what’s here (this was very rushed) but the new ISA can be found here.

That about wraps up this post. I’m currently looking into interfacing something with the TPU via an additional UART which will be fun (and fairly ridiculous, really) – so there is that to look forward to!

Thanks for reading, comments as always to @domipheus.