For some time now, I’ve been thinking about designing my own CPU architecture. Last week, I couldn’t get the thought out of my head, and I finally gave in and started to really think about what I’d want from a moderately simple CPU. I’ve decided that I’m going to document the process as I go, to hopefully force myself to finish this project; I’m often quite bad at starting something I find fun, and losing interest before I get to something I’d call complete. I am hoping to change this… I really hope a future potential employers don’t read this bit…
My aim for this project is to have something I can actually run programs on, and maybe even get an LLVM backend written so I can compile basic C programs for it. I plan to implement all the hardware design using the Haskell library Kansas Lava, which allows for designing hardware which can be both simulated in Haskell (you can play with most circuits in GHCI, which is amazingly nice), as well as produce VHDL which can be synthesised and used to configure things like an FPGA. My goal is to have this design running on my Spartan-3E FPGA Starter Board, or possibly one of these (due to its 64bit wide memory interface). So, on to the design!
Obviously it has to be RISC; my skills in hardware design are rusty enough without me having to figure out how to parse binary data in hardware, and besides, no one uses CISC any more these days, they only pretend. But just saying it’s RISC doesn’t get me far, and I only had a vague idea of what I wanted to be able to do. There were lots of features from ARM that I wanted:
(Almost) all instructions are conditional, which can make for some very efficient code, both in terms of speed and space required.
Lots of registers. Well, I guess 16 is lots, but I wanted more.
Most arithmetic instructions have free shifts on their second argument, so there’s no need for dedicated shift instructions. This is pretty neat, and I decided I wanted more free stuff! We’ll get to that in my next post.
There were also some other architectures that intrigued me, notably SPARC.
A register set to constant zero, useful for simplifying many operations. Sometimes you want to perform a computation, but you only care about some of its side effects, such as whether it overflows. You can just use this register as the destination, and the result will be lost, but the side effects wont. Also there’s no need for a negation instruction, since it’s just subtracting x from zero.
A register window. On function calls, the registers available to new function are not the same as those of calling function, but there is an overlap of 8 registers, which is where the first 8 (I think) function arguments are passed. When the function returns, the window slides back, and the result will have been passed back in what are now the top 8 registers. I decided this is not a feature I need, because… it seems complicated to implement, and the advantages in such a simple design as mine aren’t really worth it as far as I can tell.
A branch delay slot. ARM also has this, but I’d forgotten about it during my initial thinking, and realised it would be a useful thing to have to make implementation easier. Essentially what this means is that when a branch instruction is executed, the instruction immediately following it is executed before the branch actually occurs. In a pipelined architecture, this can be quite useful, it can help avoid pipeline stalls. I think one of the main reasons for them existing is due to the extra time needed to fetch the next instruction in order to execute it. The CPU could either stall for a cycle (or more) waiting for it, or it could do some useful work in the mean time.
So with this I got started. I decided, somewhat arbitrarily, that I wanted this to be a 64 bit architecture, with 64 bit wide instructions. This choice would allow me to have more registers than architectures like ARM. It also meant I had more room for constants in instructions, making a lot of tasks easier, and avoiding memory accesses for almost all constants (I’ve ended up with constants up to 36 bits). Initially I was going to go all out, and have 256 64bit registers, but I figured this was a bit of a waste, and I eventually decided (again, somewhat arbitrarily) on 64 registers (plus some special purpose ones).
I also really wanted conditional execution of instructions, and almost all instructions will be conditional. For those not in the know, this means that the instruction’s result is only executed if certain condition flags are set. This can lead to some extremely efficient branchless code, where in the past you would have had to jump between the two clauses of an if-else statement, now you can just perform the comparison, and have all the instructions from each execute conditionally. Sure you waste a few cycles, but you won’t stall the pipeline, and it usually means having less instructions in the code.
Next I started to think about what operations my CPU would need, and what I would like on top of the basics. A came up with a list of basic arithmetic instructions:
Arithmetic instructions
| Instruction |
Description |
| add{c}{s} |
Addition of 64bit two’s complement numbers |
| sub{c}{s} |
Subtract |
| rsub{c}{s} |
Subtract with arguments reversed. The reason for the inclusion of this will become apparent soon int my next post. |
| mul{s} |
Multiply |
| mula{s} |
Multiply and accumulate. res = res + a*b |
| addsat{s} |
Saturating addition (Because ARM has it, and it seemed nifty) |
| padd{32,16,8} |
Parallel addition |
| psub{32,16,8} |
Parallel subtraction |
| pmul{32,16,8} |
Parallel multiplication |
| |
|
| and{s} |
Bitwise and |
| or{s} |
Bitwise or |
| xor{s} |
Bitwise exclusive or |
| nand{s} |
Bitwise not-and |
| nor{s} |
Bitwise not-or |
| |
|
| cmp |
Comparison (cmp rm op2 is an alias for subs r0 rm op2) |
| |
|
| max{,32,16,8}{s} |
Maximum |
| min{,32,16,8}{s} |
Minimum |
The things in curly brackets are variants of the instruction. Instructions with the {s} variant mean they can set the condition flags based on their result, for example, adds can set the carry flag indicating that the addition had a carry out past the end of the result. Instructions with a {c} variant (ie addc) will perform their operation using the carry flag as an input, in whatever manner makes sense for the given instruction. For example, to add two 128 bit numbers in registers r10, 11 and r20, r21 with the result going into r30, r31, you might use something like:
adds r30, r10, r20; # Add, setting flags (ie carry)
addc r31, r11, r21; # Add, using the previously set carried bit
Instructions with sizes after them, as you might expect, operate on differing sized inputs. padd can add 2×32bit numbers, 4×16bit, or 8×8bit.
Then there’s the branching instructions. Since almost all instructions will be conditional, I only really need two kinds of branches. A standard branch, which covers all types of conditional branches automatically, and some kind of call instruction, which would not only modify the program counter, but also save the return address somewhere. There would also need to be its dual, a return instruction, which sets the program counter to the value that was previously saved.
Branching instructions
| Instruction |
Description |
| br |
Normal branch, pc <- src shiftL 3 |
| call |
Function call, otherwise known as branch and link on ARM. ra <- (pc shiftR 3)+1; pc <- src shiftL 3 |
| ret |
Function return. pc <- ra shiftL 3 |
There’s some odd stuff going on here, so I’ll explain. The shifts by three come from me wanting to ensure that instructions are always word aligned. Also doing this means that we can jump to constants 8 times further away than previously possible. In the call instruction, we save the address of the next instruction to the return address register, and set the program counter to the address given to the instruction. Each function is responsible for saving the ra if it’s going to make another function call, and restoring it before returning to its callee.
So far we’ve got enough to be a sorta, kinda, maybe turing complete (assuming infinite registers…) machine, but there’s something quite important missing: memory access. This is something I have less planned out than the other forms of instructions, since I’m not sure what sort of features would be really useful, so more time will have to spent on this before I come up with a final design. What I have so far in terms of instructions are:
Load/Store instructions
| Instruction |
Description |
| ld{,32,16,8} |
Load a {64,32,15,8}bit value from memory into a register. rdest <- mem[src] |
| st{,32,16,8} |
Store mem[dst] <- rsrc |
| ldsp{,32,16,8} |
Load relative to stack pointer (Frame pointer?) |
| stsp{,32,16,8} |
Store relative to stack pointer |
| push{,32,16,8} |
Push a value onto the stack [sp] <- rsrc; sp <- sp - {8,4,2,1} |
| pop{,32,16,8} |
Pop a value off the stack rdest <- [sp]; sp <- sp + {8,4,2,1} |
Here we have some pretty standard instruction, though the push and pop are maybe not in some RISC architectures because they’re easy to implement if you have direct access to the stack as a general purpose register. I haven’t decided whether I’ll do this or not, but I think it’s likely, since one day, it might be really useful to be able to swap stacks easily (Maybe it’s a possible security risk… ha, look at me worrying about security risks in a CPU that so far has to ability to run an operating system through lack of interrupts!). I’m also quite sure (thanks to shachaf on IRC) that my definitions for my push and pop instructions are wrong, there needs to be some addition to the stack pointer’s value before referencing it. I may also add a frame pointer to make life easier when working with function calls.
I may also add instructions to save a range of registers to the stack and load them back in, like ARM has (though I have no idea how to implement that just yet)
Lastly, there were some common operations, and some just plain cool ones I wanted to have available:
| Instruction |
Description |
| ctz |
Count trailing zeros |
| ctlz |
Count leading zeros |
| popcnt{,32,16,8}{a} |
Bit population count |
| rpow2 |
Round to next power of two |
| extract |
Extract a range of bits res = op1[m..n] shift o. This might get removed, and made one of the instruction argument formats. |
| |
|
| mor |
See pages 11 and 12 (physical 16 and 17) of |
| mxor |
https://docs.google.com/viewer?url=http://www-cs-faculty.stanford.edu/~uno/fasc1.ps.gz |
Many of these instructions are trivial to implement in hardware, but can take many many cycles to implement in software without proper support. I’m open to adding more of these if anyone can come up with some instructions they wish they had in their CPU of choice.
The last thing I wanted to talk about before finishing off this first post was my ideas on what I would do about registers. I mentioned earlier that I likes the SPARC idea of having a constant zero register. After I came up with the idea of the extract instruction, realised that having a register of all 1 bits would also be useful for creating masks. Having this means you could do things like complement all the bits in a certain range like so:
xor r4, r4, r1[7:10]; # Use the constant 1's register to form a mask
I think this will turn out to be extremely useful in many situations.
So far this is what I’ve come up with as a tentative plan for registers:
| Register (alt name) |
|
| r0 |
Constant zero register |
| r1 |
Constant 0xFFFFFFFFFFFFFFFF |
| r2-r60 |
General purpose (Maybe make r60 the frame pointer?) |
| r61 (sp) |
Stack pointer |
| r62 (ra) |
Return address |
| r63 (ip) |
Instruction pointer |
I have some ideas about what the calling convention for this architecture should be, but that will have to wait for a later post.
In the coming weeks and months I hope flesh out the details and design of this architecture, and hopefully you’ll find it fun to follow along. I plan to put everything up on github eventually, but I want something more concrete first. My next post will go into more detail about the instruction formats I’ve come up with, as well as my first adventure into Kansas Lava and creating a moderately complex adder/subtracter circuit. Until next time, happy hacking!