Exploring CPU design using Haskell
July 6th, 2012 by axman6

For some time now, I’ve been thinking about designing my own CPU architecture. Last week, I couldn’t get the thought out of my head, and I finally gave in and started to really think about what I’d want from a moderately simple CPU. I’ve decided that I’m going to document the process as I go, to hopefully force myself to finish this project; I’m often quite bad at starting something I find fun and then losing interest before I get to something I’d call complete. I am hoping to change this… I really hope any future potential employers don’t read this bit…

My aim for this project is to have something I can actually run programs on, and maybe even get an LLVM backend written so I can compile basic C programs for it. I plan to implement all of the hardware design using the Haskell library Kansas Lava, which lets you design hardware that can be simulated in Haskell (you can play with most circuits in GHCi, which is amazingly nice) as well as produce VHDL which can be synthesised and used to configure things like an FPGA. My goal is to have this design running on my Spartan-3E FPGA Starter Board, or possibly one of these (due to its 64bit wide memory interface). So, on to the design!

Obviously it has to be RISC; my skills in hardware design are rusty enough without me having to figure out how to parse binary data in hardware, and besides, no one uses CISC any more, they only pretend. But just saying it’s RISC doesn’t get me far, and I only had a vague idea of what I wanted to be able to do. There were lots of features from ARM that I wanted:

  • (Almost) all instructions are conditional, which can make for some very efficient code, both in terms of speed and space required.

  • Lots of registers. Well, I guess 16 is lots, but I wanted more.

  • Most arithmetic instructions have free shifts on their second argument, so there’s no need for dedicated shift instructions. This is pretty neat, and I decided I wanted more free stuff! We’ll get to that in my next post.

There were also some other architectures that intrigued me, notably SPARC.

  • A register set to constant zero, useful for simplifying many operations. Sometimes you want to perform a computation but you only care about some of its side effects, such as whether it overflows. You can just use this register as the destination, and the result will be lost, but the side effects won’t. It also means there’s no need for a negation instruction, since negating x is just subtracting it from zero.

  • A register window. On function calls, the registers available to the new function are not the same as those of the calling function, but there is an overlap of 8 registers, which is where the first 8 (I think) function arguments are passed. When the function returns, the window slides back, and the result will have been passed back in what are now the top 8 registers. I decided this is not a feature I need, because… it seems complicated to implement, and the advantages in such a simple design as mine aren’t really worth it as far as I can tell.

  • A branch delay slot (MIPS has these too). I’d forgotten about them during my initial thinking, but realised they would be a useful thing to have to make implementation easier. Essentially what this means is that when a branch instruction is executed, the instruction immediately following it is executed before the branch actually occurs. In a pipelined architecture this can be quite useful, as it can help avoid pipeline stalls. I think one of the main reasons they exist is the extra time needed to fetch the next instruction in order to execute it. The CPU could either stall for a cycle (or more) waiting for it, or it could do some useful work in the mean time. There’s a small sketch of the execution order just after this list.
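
Since I’m going to be simulating everything in Haskell anyway, here’s a tiny toy model of how a delay slot changes the order of execution. This isn’t Kansas Lava code, and the Instr type is made up purely for illustration; it just shows that a taken branch only redirects the PC after the following instruction has run.

import Data.Maybe (fromMaybe)

-- A toy model of a branch delay slot: a taken branch only changes the PC
-- after the instruction in the slot immediately following it has executed.
data Instr = Nop | Branch Int deriving Show

-- Run a program, returning the trace of addresses executed.
run :: [Instr] -> [Int]
run prog = go 0 Nothing
  where
    go pc pending
      | pc >= length prog = []
      | otherwise =
          let next     = fromMaybe (pc + 1) pending        -- a pending branch wins
              pending' = case prog !! pc of
                           Branch target -> Just target    -- takes effect next step
                           Nop           -> Nothing
          in pc : go next pending'

For example, run [Branch 3, Nop, Nop, Nop] gives [0,1,3]: the Nop at address 1 sits in the delay slot and runs before control reaches address 3.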

So with this I got started. I decided, somewhat arbitrarily, that I wanted this to be a 64 bit architecture, with 64 bit wide instructions. This choice would allow me to have more registers than architectures like ARM. It also meant I had more room for constants in instructions, making a lot of tasks easier, and avoiding memory accesses for almost all constants (I’ve ended up with constants up to 36 bits). Initially I was going to go all out, and have 256 64bit registers, but I figured this was a bit of a waste, and I eventually decided (again, somewhat arbitrarily) on 64 registers (plus some special purpose ones).
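
As a back-of-the-envelope check that those numbers fit together (the field widths below are illustrative guesses only, not the actual encoding, which I’ll go through in the next post):

-- Rough field budget for a 64-bit instruction word. The opcode and
-- condition widths here are guesses, purely for the sake of the arithmetic.
wordBits, regBits, opcodeBits, condBits :: Int
wordBits   = 64
regBits    = 6    -- 64 general purpose registers
opcodeBits = 8    -- guess
condBits   = 4    -- guess

-- Room left for an immediate in a "rdest <- rsrc `op` constant" instruction.
immediateRoom :: Int
immediateRoom = wordBits - (2 * regBits + opcodeBits + condBits)   -- = 40

so there’s comfortably space for constants in the mid-thirty-bit range alongside a couple of register fields.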

I also really wanted conditional execution of instructions, and almost all instructions will be conditional. For those not in the know, this means that an instruction’s result is only committed if certain condition flags are set. This can lead to some extremely efficient branchless code: where in the past you would have had to jump between the two clauses of an if-else statement, now you can just perform the comparison and have all the instructions from each clause execute conditionally. Sure, you waste a few cycles, but you won’t stall the pipeline, and it usually means having fewer instructions in the code.
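
To make that concrete, here’s a small Haskell sketch of the semantics (the Cond and Flags types are placeholders for this example, not the real flag set): an instruction’s write-back only happens when its condition holds for the current flags.

-- A toy model of conditional execution: every instruction carries a
-- condition, and its result is only written back if the condition
-- holds for the current flags.
data Flags = Flags { zero :: Bool, negative :: Bool, carry :: Bool }

data Cond = Always | IfEq | IfNe | IfLt    -- a few representative conditions

holds :: Cond -> Flags -> Bool
holds Always _ = True
holds IfEq   f = zero f
holds IfNe   f = not (zero f)
holds IfLt   f = negative f

-- Commit a register write only when the condition holds.
conditionally :: Cond -> Flags -> a -> a -> a
conditionally cond flags newVal oldVal
  | holds cond flags = newVal
  | otherwise        = oldVal

A branchless if-else then becomes a compare to set the flags, followed by the instructions from each clause predicated on opposite conditions.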

Next I started to think about what operations my CPU would need, and what I would like on top of the basics. I came up with a list of basic arithmetic instructions:

Arithmetic instructions
Instruction Description
add{c}{s} Addition of 64bit two’s complement numbers
sub{c}{s} Subtract
rsub{c}{s} Subtract with arguments reversed. The reason for including this will become apparent in my next post.
mul{s} Multiply
mula{s} Multiply and accumulate. res = res + a*b
addsat{s} Saturating addition (Because ARM has it, and it seemed nifty)
padd{32,16,8} Parallel addition
psub{32,16,8} Parallel subtraction
pmul{32,16,8} Parallel multiplication
and{s} Bitwise and
or{s} Bitwise or
xor{s} Bitwise exclusive or
nand{s} Bitwise not-and
nor{s} Bitwise not-or
cmp Comparison (cmp rm op2 is an alias for subs r0 rm op2)
max{,32,16,8}{s} Maximum
min{,32,16,8}{s} Minimum

The things in curly brackets are variants of the instruction. Instructions with an {s} variant can set the condition flags based on their result; for example, adds can set the carry flag, indicating that the addition had a carry out past the end of the result. Instructions with a {c} variant (i.e. addc) will perform their operation using the carry flag as an input, in whatever manner makes sense for the given instruction. For example, to add two 128 bit numbers in registers r10, r11 and r20, r21 with the result going into r30, r31, you might use something like:

adds r30, r10, r20; # Add, setting flags (ie carry)
addc r31, r11, r21; # Add, using the previously set carried bit

Instructions with sizes after them, as you might expect, operate on differing sized inputs. padd can add 2×32bit numbers, 4×16bit, or 8×8bit.
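
Since the whole point of this project is to model things in Haskell, here’s roughly what the adds/addc pair above is doing, written as plain functions over Word64 (a model of the intended semantics, not the hardware):

import Data.Word (Word64)

-- adds: add two words, also producing the carry-out flag.
addS :: Word64 -> Word64 -> (Word64, Bool)
addS a b = (s, s < a)            -- unsigned overflow iff the sum wrapped around
  where s = a + b

-- addc: add two words plus the incoming carry flag.
addC :: Bool -> Word64 -> Word64 -> (Word64, Bool)
addC cin a b = (s', s < a || (cin && s' < s))
  where
    s  = a + b
    s' = s + (if cin then 1 else 0)

-- 128-bit addition built from the two, mirroring the assembly above.
add128 :: (Word64, Word64) -> (Word64, Word64) -> (Word64, Word64)
add128 (aHi, aLo) (bHi, bLo) = (hi, lo)
  where
    (lo, c) = addS aLo bLo
    (hi, _) = addC c aHi bHi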

Then there are the branching instructions. Since almost all instructions will be conditional, I only really need two kinds of branches: a standard branch, which covers all types of conditional branches automatically, and some kind of call instruction, which would not only modify the program counter but also save the return address somewhere. There would also need to be its dual, a return instruction, which sets the program counter to the value that was previously saved.

Branching instructions
Instruction Description
br Normal branch, pc <- src shiftL 3
call Function call, otherwise known as branch and link on ARM. ra <- (pc shiftR 3)+1; pc <- src shiftL 3
ret Function return. pc <- ra shiftL 3

There’s some odd stuff going on here, so I’ll explain. The shifts by three come from me wanting to ensure that instructions are always word aligned. Doing this also means that we can jump to constants 8 times further away than would otherwise be possible. In the call instruction, we save the address of the next instruction to the return address register, and set the program counter to the address given to the instruction. Each function is responsible for saving the ra if it’s going to make another function call, and restoring it before returning to its caller.
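
Modelled in Haskell, the three branch instructions only ever touch the program counter and the return address register (BranchState is just a name I’m using for this sketch):

import Data.Bits (shiftL, shiftR)
import Data.Word (Word64)

data BranchState = BranchState { pc :: Word64, ra :: Word64 } deriving Show

-- br: pc <- src shiftL 3 (src is a word address, pc a byte address)
br :: Word64 -> BranchState -> BranchState
br src st = st { pc = src `shiftL` 3 }

-- call: save the word address of the next instruction, then branch.
call :: Word64 -> BranchState -> BranchState
call src st = st { ra = (pc st `shiftR` 3) + 1, pc = src `shiftL` 3 }

-- ret: pc <- ra shiftL 3
ret :: BranchState -> BranchState
ret st = st { pc = ra st `shiftL` 3 }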

So far we’ve got enough to be a sorta, kinda, maybe Turing complete (assuming infinite registers…) machine, but there’s something quite important missing: memory access. This is something I have less planned out than the other kinds of instructions, since I’m not sure what sort of features would be really useful, so more time will have to be spent on this before I come up with a final design. What I have so far in terms of instructions is:

Load/Store instructions
Instruction Description
ld{,32,16,8} Load a {64,32,16,8}bit value from memory into a register. rdest <- mem[src]
st{,32,16,8} Store mem[dst] <- rsrc
ldsp{,32,16,8} Load relative to stack pointer (Frame pointer?)
stsp{,32,16,8} Store relative to stack pointer
push{,32,16,8} Push a value onto the stack [sp] <- rsrc; sp <- sp - {8,4,2,1}
pop{,32,16,8} Pop a value off the stack rdest <- [sp]; sp <- sp + {8,4,2,1}

Here we have some pretty standard instructions, though push and pop are maybe not found in some RISC architectures because they’re easy to implement if you have direct access to the stack pointer as a general purpose register. I haven’t decided whether I’ll do this or not, but I think it’s likely, since one day it might be really useful to be able to swap stacks easily (maybe it’s a possible security risk… ha, look at me worrying about security risks in a CPU that so far has no ability to run an operating system, through lack of interrupts!). I’m also quite sure (thanks to shachaf on IRC) that my definitions for the push and pop instructions above are wrong; the stack pointer needs adjusting before it’s used as the address on one side or the other (see the sketch below). I may also add a frame pointer to make life easier when working with function calls.
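
As a sketch of one consistent way to define push and pop (my guess at the fix, modelled in Haskell with a Data.Map standing in for memory and a full, descending stack):

import qualified Data.Map as M
import Data.Word (Word64)

type Mem = M.Map Word64 Word64    -- byte-addressed memory, one 64-bit word per entry

-- push (64-bit variant): pre-decrement the stack pointer, then store there.
push :: Word64 -> (Word64, Mem) -> (Word64, Mem)
push val (sp, mem) = (sp', M.insert sp' val mem)
  where sp' = sp - 8

-- pop (64-bit variant): load from the stack pointer, then post-increment it.
pop :: (Word64, Mem) -> (Word64, (Word64, Mem))
pop (sp, mem) = (M.findWithDefault 0 sp mem, (sp + 8, mem))

With this pairing, pop (push 42 (sp0, mem0)) hands back 42 and restores the original stack pointer, which is the property my original definitions were missing.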

I may also add instructions to save a range of registers to the stack and load them back in, like ARM has (though I have no idea how to implement that just yet).

Lastly, there were some common operations, and some just plain cool ones I wanted to have available:

Instruction Description
ctz Count trailing zeros
ctlz Count leading zeros
popcnt{,32,16,8}{a} Bit population count
rpow2 Round to next power of two
extract Extract a range of bits res = op1[m..n] shift o. This might get removed, and made one of the instruction argument formats.
mor, mxor See pages 11 and 12 (physical 16 and 17) of Knuth’s MMIX fascicle: https://docs.google.com/viewer?url=http://www-cs-faculty.stanford.edu/~uno/fasc1.ps.gz

Many of these instructions are trivial to implement in hardware, but can take many many cycles to implement in software without proper support. I’m open to adding more of these if anyone can come up with some instructions they wish they had in their CPU of choice.
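
To give a feel for the “many cycles in software” claim, here are naive Haskell reference versions of a few of these, using only Data.Bits (popCount is already provided there; the scanning loops are the kind of work a CPU without these instructions ends up doing one bit at a time):

import Data.Bits ((.|.), shiftR, popCount, testBit)
import Data.Word (Word64)

-- popcnt: Data.Bits already provides this one.
popcnt :: Word64 -> Int
popcnt = popCount

-- ctz: count trailing zeros by scanning up from bit 0 until a set bit appears.
ctz :: Word64 -> Int
ctz 0 = 64
ctz x = length (takeWhile (not . testBit x) [0 ..])

-- ctlz: count leading zeros by scanning down from bit 63.
ctlz :: Word64 -> Int
ctlz 0 = 64
ctlz x = length (takeWhile (not . testBit x) [63, 62 ..])

-- rpow2: round up to the next power of two with the usual shift-and-or smear.
rpow2 :: Word64 -> Word64
rpow2 x = foldl (\v s -> v .|. (v `shiftR` s)) (x - 1) [1, 2, 4, 8, 16, 32] + 1

In hardware these are small, fixed circuits; in software they become loops or chains of shift-and-or steps.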

The last thing I wanted to talk about before finishing off this first post was my ideas on what I would do about registers. I mentioned earlier that I liked the SPARC idea of having a constant zero register. After I came up with the idea of the extract instruction, I realised that having a register of all 1 bits would also be useful for creating masks. Having this means you could do things like complement all the bits in a certain range like so:

xor r4, r4, r1[7:10]; # Use the constant 1's register to form a mask

I think this will turn out to be extremely useful in many situations.
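
Written out in Haskell, the mask that r1[7:10] is meant to produce looks something like this (a sketch of the intended semantics; extract, maskRange and complementRange are names I’m using only for this example, and whether the range is inclusive at both ends is still to be pinned down):

import Data.Bits (shiftL, shiftR, xor, (.&.))
import Data.Word (Word64)

-- extract: pull out bits m..n (inclusive) of a word as the low bits of the result.
extract :: Int -> Int -> Word64 -> Word64
extract m n x = (x `shiftR` m) .&. ((1 `shiftL` (n - m + 1)) - 1)

-- r1[7:10] from the example above: extract that range from the all-ones
-- register and shift it back into place, giving a mask over bits 7..10.
maskRange :: Int -> Int -> Word64
maskRange m n = extract m n maxBound `shiftL` m

-- The xor above, modelled directly: complement bits 7..10 of r4.
complementRange :: Word64 -> Word64
complementRange r4 = r4 `xor` maskRange 7 10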

So far this is what I’ve come up with as a tentative plan for registers:

Register (alt name)
r0 Constant zero register
r1 Constant 0xFFFFFFFFFFFFFFFF
r2-r60 General purpose (Maybe make r60 the frame pointer?)
r61 (sp) Stack pointer
r62 (ra) Return address
r63 (ip) Instruction pointer

I have some ideas about what the calling convention for this architecture should be, but that will have to wait for a later post.

In the coming weeks and months I hope to flesh out the details and design of this architecture, and hopefully you’ll find it fun to follow along. I plan to put everything up on github eventually, but I want something more concrete first. My next post will go into more detail about the instruction formats I’ve come up with, as well as my first adventure into Kansas Lava and creating a moderately complex adder/subtracter circuit. Until next time, happy hacking!


10 Responses  
  • Ben Gamari writes:
    July 7th, 2012 at 1:00 AM

    I look forward to hearing more!

  • Andy writes:
    July 7th, 2012 at 7:16 AM

    Sounds interesting, I’ll save a bookmark to this.

  • solrize writes:
    July 7th, 2012 at 3:02 PM

    Main thing I’d like to see in a new architecture is a no-overhead overflow trap for ordinary int arithmetic. Unexpected integer wraparound is the new buffer overflow.

  • FUZxxl writes:
    July 7th, 2012 at 11:32 PM

    You don’t really need ctz and ctlz given a popcnt instruction. ctz y x can be implemented with popcnt by this simple C-ish sequence: y = popcnt(~x & (x-1)). Another instruction that is very handy is an andn instruction (andn a b c : a = b & ~c), and similarly an orn instruction (a = b | ~c). Consider having a look at Knuth’s MMIX. There is really some pretty nice stuff inside!

  • illissius writes:
    July 8th, 2012 at 6:39 PM

    Seconded!

  • Asger Alstrup writes:
    July 9th, 2012 at 3:26 PM

    Combined load, operation, store instructions, like increment memory address. Atomic Swap memory. Atomic copy memory. Compare and exchange. Compare and increment. Random number. Bit rotation shifts. Sign and zero extension operations. Expose carry and other flags as a register. Convert big endian and little endian. Memcopy and memset and memscan.

  • Ben Gamari writes:
    July 10th, 2012 at 9:24 AM

    You may be interested to know that ARM’s new A64 instruction set has apparently[1] dropped conditional execution in all but the branch instructions.

    [1] https://lkml.org/lkml/2012/7/9/178

  • Ken writes:
    July 10th, 2012 at 12:39 PM

    You might be interested in Bluespec, a language and tool with considerable functional programming design. If you have a university affiliation, you can get a free academic license. Arvind’s classes at MIT design a simple pipelined MIPS processor in Bluespec, then do architectural exploration on it. Bluespec Inc’s own “SVP” product (which happens to be written in Bluespec) shows that ARM is possible.

  • renoX writes:
    July 19th, 2012 at 11:02 PM

    IMHO, you should look at MIPS and Alpha’s ISAs, both are great. Using 64 bits for the instruction size is a bit weird: your code density will be very low. The latest trend is 16/32-bit ISAs (Thumb2, MIPS16), though maybe I should say “was”: ARM dropped Thumb2 with v8.

    OTOH the nice thing about 64-bit instructions is that this should allow you to have a very regular ISA, dropping condition code registers: add should be an operation with three source registers (to merge add and add-with-carry into one operation) plus two destination registers, one for the result and the other for the carry, so 5×6 bits = 30 bits used, which is reasonable in a 64-bit instruction.

    @solrize: I agree with you: it’s a shame that only MIPS has trap-on-arithmetic-overflow instructions.

    @Asger Alstrup: Memcopy and memset and memscan are not RISC at all, think about the issues when a page fault happens! RISC is load-store (well, except when you want to do atomic memory operations).

  • Nicolas writes:
    November 5th, 2012 at 11:14 AM

    You might also consider adding prefetch, perhaps as a load to a constant register.

    As far as synchronization primitives go, Compare-and-Swap and Mem-to-Mem-swap are both powerful, relatively easy to implement and well used and studied.

    Are you considering SIMD instructions or floating-point arithmetic? If extract is in the argument format, we get fixed-point for free (though emulation using shift is quite cheap).

    Anyways, sounds exciting!

