Porting my VHDL Character Generator to Spartan3: Reducing clock speeds and pipelining

Posted Jul 26, 2016, Reading time: 10 minutes.

This is an article on porting my VHDL character generator from a Xilinx Spartan6 device to a Spartan3. It starts off as a simple port, analyzing device primitive differences and accounting for them in the design. Along the way, there were considerations on how clocks were generated, the characteristics of block ram timing, and general algorithmic design. I’ll assume you’ve read the sections of my Designing a CPU in VHDL series that detail the implementation of the character generator.

When I first attempted to synthesize my TPU CPU Core design for the miniSpartan3 developer board (made by the great folks at Scarab Hardware), the bulk of the code went through without a hitch. The processor core itself contains no primitives specific to a single vendor. However, the rest (Block Rams, Clock Generators) used instantiations of specific device primitives. These differ from one FPGA family to another, and they are where the most thought and investigation is needed, as changes can have knock-on impacts on operations further along the device path.

High level device differences are fairly minimal. The miniSpartan3 board has a 32MHz clock input instead of the 50MHz input on the miniSpartan6 setup, so we will need to change the ratios used to generate the pixel and ram clocks for the DVI-D/HDMI video output. The FTDI chip for serial communication is similar, and a communications channel is connected to the FPGA. We will also need to change the constraints file for the pin definitions, but that’s always expected.

Clocks

The Spartan6 TPU design utilizes multiple clocks:

Base Clock 50MHz
CPU Core 100MHz
Pixel 25MHz
5x Pixel 125MHz
5x Pixel Inverted 125MHz
Char/Text Clock 250MHz

The CPU Core and Read/Write Port A’s of all Block Rams use the CPU Core clock. The UART Baud Generator uses the Base Clock. The VGA/Graphics signal generator uses the Pixel clock. The TMDS/DVI-D/HDMI encoders and output buffers use the 5x Pixel and 5x Pixel Inverted clocks. The Character generator, and relevant Port B block rams (Text and Font Rams) use the Char/Text Clock.

When porting to Spartan3, we need to use Digital Clock Managers (DCMs) instead of the Phase Locked Loops (PLLs) on Spartan6. The interface to DCMs is considerably different, but the base terms remain and you can understand what needs to change in the design without much thought.

One of the main issues is that DCMs have far fewer outputs than the PLLs. On the Spartan6 implementation, a single PLL primitive drives all of the different clocks required. On Spartan3, we will need a DCM for each frequency.

Due to this, we will require 3 DCM objects. Our Spartan3 chip XC3S200A only has 4 in total, so we are using a significant amount of resources to generate these clocks. However, we do have the available DCMs to get started immediately.

The DCMs themselves have multiple configurations to set up. We use the digital frequency synthesizer (DFS) to get our 25MHz pixel clock from our 32MHz input. The maximum ranges for the DFS are outlined in the Spartan-3A datasheet.

To generate the pixel clock, we multiply our 32MHz input by 15 to get 480MHz, then divide by 19 to get approximately 25.26MHz.

-- 32MHz -> ~25MHz
DCM_SP_inst : DCM_SP
generic map (
  CLKDV_DIVIDE => 2.0, --  Divide by: 1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,6.0,6.5
                       --     7.0,7.5,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0 or 16.0
  CLKFX_DIVIDE => 19,         --  Can be any integer from 1 to 32
  CLKFX_MULTIPLY => 15,       --  Can be any integer from 1 to 32
  CLKIN_DIVIDE_BY_2 => FALSE, --  TRUE/FALSE to enable CLKIN divide by two feature
  CLKIN_PERIOD => 31.25,      --  Specify period of input clock in ns (32MHz = 31.25ns)
  CLKOUT_PHASE_SHIFT => "NONE", --  Specify phase shift of "NONE", "FIXED" or "VARIABLE" 
  CLK_FEEDBACK => "1X",         --  Specify clock feedback of "NONE", "1X" or "2X" 
  DESKEW_ADJUST => "SYSTEM_SYNCHRONOUS", -- "SOURCE_SYNCHRONOUS", "SYSTEM_SYNCHRONOUS" or
                                         --     an integer from 0 to 15
  DLL_FREQUENCY_MODE => "LOW",     -- "HIGH" or "LOW" frequency mode for DLL
  DUTY_CYCLE_CORRECTION => TRUE,   --  Duty cycle correction, TRUE or FALSE
  PHASE_SHIFT => 0,        --  Amount of fixed phase shift from -255 to 255
  STARTUP_WAIT => FALSE)   --  Delay configuration DONE until DCM_SP LOCK, TRUE/FALSE
port map (
  CLK0 => CLK0,     -- 0 degree DCM CLK output
  CLK180 => CLK180, -- 180 degree DCM CLK output
  CLK270 => CLK270, -- 270 degree DCM CLK output
  CLK2X => open,    -- 2X DCM CLK output
  CLK2X180 => open, -- 2X, 180 degree DCM CLK out
  CLK90 => open,    -- 90 degree DCM CLK output
  CLKDV => open,    -- Divided DCM CLK out (CLKDV_DIVIDE)
  CLKFX => clock_pixel_unbuffered,   -- DCM CLK synthesis out (M/D)
  CLKFX180 => CLKFX180, -- 180 degree CLK synthesis out
  LOCKED => LOCKED, -- DCM LOCK status output
  PSDONE => PSDONE, -- Dynamic phase adjust done output
  STATUS => open,   -- 8-bit DCM status bits output
  CLKFB => CLKFB,   -- DCM clock feedback
  CLKIN => clk32_buffered,   -- Clock input (from IBUFG, BUFG or DCM)
  PSCLK => open,    -- Dynamic phase adjust clock input
  PSEN => open,     -- Dynamic phase adjust enable input
  PSINCDEC => open, -- Dynamic phase adjust increment/decrement
  RST => '0'        -- DCM asynchronous reset input
);
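
The multiply/divide arithmetic for the DFS can be sanity-checked in a simulator before committing to a build. The following is a minimal sketch, not part of the TPU sources: the entity name, the constant names and the 2% tolerance are assumptions made for illustration, and the 25MHz target is simply the nominal pixel clock from the list above.

```vhdl
-- Simulation-only check of the DFS arithmetic: f_out = f_in * M / D.
-- Hypothetical helper; names and tolerance are illustrative assumptions.
entity dfs_ratio_check is
end entity dfs_ratio_check;

architecture sim of dfs_ratio_check is
  constant CLKIN_HZ  : real := 32.0e6;  -- miniSpartan3 clock input
  constant FX_MUL    : real := 15.0;    -- matches CLKFX_MULTIPLY
  constant FX_DIV    : real := 19.0;    -- matches CLKFX_DIVIDE
  constant TARGET_HZ : real := 25.0e6;  -- nominal pixel clock
  constant TOLERANCE : real := 0.02;    -- assumed acceptable error (2%)

  -- Synthesized output frequency of the CLKFX pin
  constant CLKFX_HZ  : real := CLKIN_HZ * FX_MUL / FX_DIV;
begin
  check : process
  begin
    report "CLKFX = " & real'image(CLKFX_HZ) & " Hz";
    assert abs(CLKFX_HZ - TARGET_HZ) <= TOLERANCE * TARGET_HZ
      report "DFS output is outside the assumed tolerance"
      severity failure;
    wait;  -- run once, then stop
  end process check;
end architecture sim;
```

With M = 15 and D = 19 the result is roughly 1% above the nominal 25MHz. Whether that is acceptable depends on the display, which is why the tolerance is left as a constant to adjust.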

Block Rams

The block rams on Spartan3 are very similar to their Spartan6 counterparts. They do have different timing characteristics, and therefore a different maximum operating frequency. The Block Ram primitives on my Spartan3 are not rated for the ~260MHz that those on the Spartan6 run at, so changes will be required to the Character generator to account for additional latency in the memory operations.

UART

Thankfully, Xilinx provide the PicoBlaze UART objects for Spartan3 as they do for Spartan6, so very little work was required to port these over, apart from using different library objects. The Baud clock routine was changed to strobe correctly using the 32MHz base clock instead of the 50MHz clock on the miniSpartan6. That was the only significant change here.
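
As an illustration of the kind of change involved, here is a minimal sketch of a strobe generator whose divider is derived from the base clock frequency, so that moving from 50MHz to 32MHz only means changing a generic. This is not the TPU source: the entity name, port names and the 115200 baud rate are assumptions, as is the 16x oversampling enable that the PicoBlaze UART macros are commonly driven with.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical baud strobe generator: pulses high for one clock cycle
-- at (approximately) 16x the baud rate.
entity baud_strobe is
  generic (
    CLK_HZ  : positive := 32_000_000;  -- base clock (50_000_000 on miniSpartan6)
    BAUD_HZ : positive := 115_200      -- assumed baud rate
  );
  port (
    I_clk          : in  std_logic;
    O_en_16_x_baud : out std_logic := '0'
  );
end entity baud_strobe;

architecture rtl of baud_strobe is
  -- Clock cycles per strobe, rounded to the nearest integer
  constant DIVIDER : positive := (CLK_HZ + 8 * BAUD_HZ) / (16 * BAUD_HZ);
  signal count : natural range 0 to DIVIDER - 1 := 0;
begin
  process(I_clk)
  begin
    if rising_edge(I_clk) then
      if count = DIVIDER - 1 then
        count <= 0;
        O_en_16_x_baud <= '1';   -- single-cycle strobe
      else
        count <= count + 1;
        O_en_16_x_baud <= '0';
      end if;
    end if;
  end process;
end architecture rtl;
```

With these assumed values the divider works out to 17, giving an effective rate of about 117647 baud (roughly 2% fast). A different baud rate or base clock would need the same check for acceptable error.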

Differential Signalling Buffers

The OBUFDS output buffers used before can be used on Spartan3, along with the ODDR2 Double Data Rate registers for generating the 10x HDMI signalling.
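
For readers who have not seen these primitives together, the sketch below shows one way an ODDR2 register and an OBUFDS buffer can be chained for a single differential pair. It is a simplified illustration rather than the encoder from the TPU project: the entity and port names are assumptions, and only the primitive names and their ports come from the Xilinx UNISIM library. Two bits are sent per 5x clock cycle, which gives the 10 bits per pixel period.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

-- Hypothetical wrapper: one DDR-serialized differential output pair.
entity ddr_diff_out is
  port (
    I_clk_5x     : in  std_logic;  -- 5x pixel clock
    I_clk_5x_inv : in  std_logic;  -- 5x pixel clock, inverted
    I_bit_first  : in  std_logic;  -- sent on the rising edge of I_clk_5x
    I_bit_second : in  std_logic;  -- sent on the rising edge of I_clk_5x_inv
    O_p          : out std_logic;  -- differential output, positive
    O_n          : out std_logic   -- differential output, negative
  );
end entity ddr_diff_out;

architecture rtl of ddr_diff_out is
  signal ddr_q : std_logic;
begin
  -- Double data rate register: two bits per 5x clock cycle
  ddr_reg : ODDR2
    generic map (
      DDR_ALIGNMENT => "NONE",
      INIT          => '0',
      SRTYPE        => "SYNC")
    port map (
      Q  => ddr_q,
      C0 => I_clk_5x,
      C1 => I_clk_5x_inv,
      CE => '1',
      D0 => I_bit_first,
      D1 => I_bit_second,
      R  => '0',
      S  => '0');

  -- Differential output buffer; the I/O standard is left to the constraints file
  diff_buf : OBUFDS
    port map (
      I  => ddr_q,
      O  => O_p,
      OB => O_n);
end architecture rtl;
```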

Character Generator

Most work was on the Character Generator. This was due to the base algorithm of the system needing slight amendments to account for the increased memory latencies and slower clocks. However, I think it’s useful to see what things can happen if we ignore all of that for a second, and simply ‘blind port’ the system ignoring the rated maximum frequencies, just to see what happens.

In fact, the blind port was my first attempt. And this was the result:

Built my CPU for the Spartan3 board with no changes to core, and seems it works! Now to fix this issue of... timing pic.twitter.com/bsolVQmFRT
— Colin Riley 🎗 (@domipheus) June 2, 2016

There are a few points of interest that you can take from this footage. I’ve singled out a frame to identify them more easily.

The colours are correct for the areas they should be in
The glyphs seem to be correct
The corruption, while random, is tied to the X direction, as there are bars which are consistent across character locations

If we look again at the state diagram for the character generator:

Since we can tell that the character and its colour are correct, we know it’s not data corruption in transfer. However, the vertical banding occurs at the start of each character, indicating the glyph row data is not getting to the system in time.

In this situation I went to the datasheets and application notes to find maximum frequency ratings for the clock synthesizers and block rams. The 250MHz Char/Text clock is well within specification for generation, but the block rams are only rated for 200MHz. Instead of attempting to redesign the character generator to run off a new, slightly slower clock (200MHz), I started modifying it so that it would operate correctly at the 125MHz 5x pixel serialization clock, as this would free up another DCM object and reduce our DCM utilization from 3 to 2.

The way I started on this problem was to delay the character generator by a single pixel, allowing the memory requests to be pipelined over two pixels instead of one. At 5 sub-cycles per pixel, this gives each request the 10 sub-cycles it needs.

Table A) shows how 10 sub-cycles are required, table B) shows how they would fit together into a pipelined 5 sub-cycle state machine and C) shows that optimized, as certain stages need only occur when you crossover into a new character. The latching and fetching of the glyph data is idempotent and does not incur additional costs, as the tram data which derives the addresses for glyph rows is only fetched on each 8-pixel character transition.

My first implementation of this seemed to work well enough, apart from there being a duplicated pixel in the glyph.

It is harder than you’d think to tell whether this duplicated pixel was at the start or end of glyph processing, so I forced the background colour of the screen to flip-flop between character locations as they were output, allowing you to see the specific zone that a glyph should reside within.

From here, I could tell it was the first pixel which was causing the problem. The first one encapsulates all memory requests – with the further 7 pixels in a glyph row only utilizing a cached version of the data, so from here it was time to go into the simulator and look at some internal signals.

In the Simulator

The first thing to notice was a disparity between the pixel x coordinate and the actual pixel/5x pixel clocks. Because the pixel operations are driven from the 5x clock, there could be instances where the coordinates changed mid-request. The fix was to have a process driven off the pixel clock which latches the X and Y coordinates, so that the various other logic driven from the 5x clock can use the latched values.

You can see the issue clearly in the above waveform. At (a) we kick off a new initial glyph row request (State 1 is only ever entered on the first pixel of a character row). If we do not latch the coordinate, the coordinate could flip halfway through the request at (b).

Since we already had a process running from the pixel clock to manage the blinker flags, this was a simple addition.

  -- This process latches the X and Y on the pixel clock
  -- Also manages the blinker.
  process(I_clk_pixel)
  begin
    if rising_edge(I_clk_pixel) then
      blinker_count <= blinker_count + 1;
      x_latched <= I_x;
      y_latched <= I_y;
    end if;
  end process;

The last issue was a really irritating one. Irritating as it was a very basic bug in my code. A simple state < 4 check should have been <= 4, meaning the 4 state prolonged an additional cycle, throwing the first pixel off. Easily fixed, and easily spotted in the simulator.

The last thing to do was to also try it on my miniSpartan6+ project, and it worked first time – which is great 🙂

Wrap Up

We now have the character generator running off the 5x pixel clock, with the font/text ram read ports also running at that slower clock. As well as keeping us within the specifications of the Spartan3 device I have, it will additionally allow for higher resolutions in the future, especially on the Spartan6 variant.

Thanks for reading, let me know what you think on Twitter @domipheus.
