Engineer application - sammyuri

Minecraft name: sammyuri

What’s a thing you have made which demonstrates sufficient engineering knowledge?:
The NSSPU or “not so sluggish processing unit”, an 8 bit pipelined CPU
(my first CPU was called the “sluggish processing unit” as it had a clock speed of 82 ticks)

  • 10 tick clock
  • 5 stage waterfall pipeline (fetch, decode, read, execute, writeback)
  • 7 general purpose registers and 16 bytes of RAM
  • 8-function ALU: add, subtract, add immediate, subtract immediate, AND, XOR, barrel shift left, barrel shift right
  • 64 bytes of program memory
  • 7 branch conditions
  • Integrated 16x16 double buffered GPU
  • URCL support

Programs the NSSPU can run include:

  • Multiplication and division algorithms
  • Sorting algorithms (up to 16 byte array), stack emulation,
  • Collatz conjecture, prime number finder, Bresenham’s line algorithm
  • Cellular automata
  • Simple games such as pong, guess the number

What engineering work went into designing this device?:
My original plans were to construct a larger, dual core CPU capable of running advanced programs such as Snake or Tetris; this CPU is just a test bed for the architecture, components and particularly the ISA for the future one. However, this one turned out to already be advanced enough it may be worthy of Engineer.
The first challenge I faced was to design an instruction set that would be flexible and powerful while not tedious to program. I have always preferred 3-operand over accumulator-based instruction sets and thus wanted to stick with that, yet also wanted the word length to be 8 bits to help with caching of the future CPU. Eventually I came up with a variable length instruction set where instructions can be between 1 and 3 bytes in length, meaning instructions take up no more space than they need to while having ample room for expansion and CISC instructions. The instruction set the NSSPU uses ( is a slightly modified version of it. In the end, my efforts paid off - complicated programs such as pong and rule 110 fit in 64 and 52 bytes respectively.
The next bit of work was figuring out the pipeline; this was my first pipelined CPU and so I had never encountered all the related issues. In the end I chose a 5-stage pipeline as this would facilitate timing any additional bytes of an instruction, and since most instructions that access the ALU take up 2 bytes, you can still use the result of one operation immediately in the next one, avoiding the complexity of forwarding. However, one big issue with so many pipeline stages was conditional branching, as flags often weren’t ready before the next instruction would be executed. To solve this I decided that the CPU would always execute the next instruction after a conditional branch instruction, even if the branch happens. This still allows for chains of conditional branches in a row, and (as long as you have it in mind while programming) rarely causes pipeline bubbles to have to be inserted.
After doing all that I got to work on the data loop - it actually turned out to be only 15 ticks, so I could have increased the clock speed slightly; however, since I had planned other bits of it with the 10 tick clock in mind, I decided to keep it at that. The rest of the CPU quickly came together and after about 6 hours of trying to figure out why it was giving 2 + 3 = 7, I had a functional machine. I colour coded the CPU and plastered it with banners to make it very easy to tell which part was which.
This CPU has a few odd quirks that stem from it being a scaled down version of one I have planned. For one thing, there are no IO ports; for programs that might want to utilise them you can simply treat the last few bytes of RAM as ports and load/store to them. There are also eight unused instructions that would deal with shared memory access, stalling and IO in a multi-core version. The way I implemented the pipeline also has a couple problems such as being unable to stop and continue/step the clock without ending the program altogether (although this might be a general issue for waterfall pipelines). Finally, 64 bytes turns out to not be that much program memory for complex programs (my next CPU will have way more).
Nevertheless, I am happy with how it turned out as it is decently fast, and is fun both to program and to watch running programs while surprisingly lag-free. During an interview I’d be happy to demonstrate pong or the cellular automaton as they are my favourite programs so far.

Image/s and/or video/s of the device: /warp nsspu

NSSPU running Rule 110, 2 minutes/generation

NSSPU running pong, 45-50 seconds/frame


This looks honestly really good, I just have one question:
I just looked at your ISA, the Source seems to be on the second byte, whereas detination and operand are on the first. Why is it like that?
In a 3 operand cpu, the first thing u need is the source. If you read in 8 bytes at a time, it would make more sense to first prepare source.

Since both source A and source B (or the immediate) need to reach the ALU at the exact same time for all ALU functions, and there were only enough available bits in the first byte for one of the two, it wouldn’t speed any instructions up reading one source earlier as the ALU would still have to wait for the other source. Also, this would add further complexity with timing everything which was already hard enough, and immediates are always in the second byte so the destination would have to be in the first byte for some instructions anyway. (I assume you meant 8 bits not 8 bytes)

1 Like

Accepted for an interview!