(I apologize in advance if this message seems self congradulatory, but after
a long time of being disheartened by only marginal gains over ghc, I am
finally seeing some substantial benefits, many of which are the result of
optimizations that can actually be ported back to ghc)


So, what started out as a simple pat on my own back for getting strictness and
CPR finally working turned into an adventure in assembly language.  An
incidental discovery revealed a result that completly surprised me, for tight
mathematical inner loops jhc is producing code that runs 3-7x faster than ghc.

This surprised me because I expected them to be identical, they determine the
same strictness and produce the same worker/wrapper split, but due to a quirk
of how ghc generates C code, it ends up producing an ultimatly slower result.
fortunatly, I think the problem is easy to mitigate. (and jhc loses its lead
once again :) )

(all the following snippets are unedited/unreformatted output of the programs
specified except for the removal of some profiling tags)


the motivating example is the plain old basic accumulating factorial function:

fac :: Int -> Int -> Int
fac 1 r = r
fac n r = fac (n - 1) (n*r)

the strictness info is correctly infered that fac is strict in all its
arguments and returns a CAF so it is translated into a nice and speedy
version that takes unboxed arguments and returns unboxed results.

here is the C code that jhc generates. (As an aside, I am very proud of how
readable and how much structure the jhc generated C code preserves of the
original haskell. it's a small thing, perhaps only implementors appreciate it,
but I am glad I spent the time needed to do so.)

/* fW@.fMain.fac */
static int
fWXAXDfMainXDfac(int v94, int v95)
{
        int v96;
        int v97;
        int v98;
        int v99;
        switch (v94) {
        case 1:
            return v95;
            break;
        default:
            v96 = v94;
            v97 = (int)(v94 - 1);
            v98 = (int)(v94 * v95);
            v99 = fWXAXDfMainXDfac(v97, v98);
            return v99;
        break;
        }
}

notice that besides being a bit verbose and using a tailcall, this is exactly
what A C programmer would write. In fact, the generated assembly is quite
nice and I think perhaps provably optimal even compared to hand-coded assembly :)

jhc assembly output:

fWXAXDfMainXDfac:
.L108:
	cmpl	$1, %edi
	je	.L110
	imull	%edi, %esi
	decl	%edi
	jmp	.L108
.L110:
	movl	%esi, %eax
	ret


notice, a tiny inner loop, 4 instructions and one conditional jump. no memory
acceses whatsoever.


now, lets look at what ghc does with the same function:

the generated C code:

EI_(Main_zdwfac_info);
FN_(Main_zdwfac_entry) {
W_ _s27g;
W_ _s27i;
W_ _s27l;
FB_
_s27g = *Sp;
if (_s27g != 0x1UL) goto _c282;
R1.p = (P_)(Sp[1]);
Sp=Sp+2;
JMP_(*Sp);
_c282:
_s27l = _s27g * (Sp[1]);
_s27i = _s27g - 0x1UL;
Sp[1] = _s27l;
*Sp = _s27i;
JMP_((W_)&Main_zdwfac_info);
FE_
}

it looks complicated, but what it effectivly does is pop the arguments off the
stack, run the code but with an explicit jump rather than recursion and push
the result back onto the stack. other than the stack stuff at the beginnig and
end, we would expect this to get compiled to roughly the same assembly as the
jhc version with a nice tight inner loop and just some stack futzing
boilerplate.

and now the generated assembly.

Main_zdwfac_info:
.text
	.align 8
	.text
	movq	(%rbp), %rdx
	cmpq	$1, %rdx
	jne	.L2
	movq	8(%rbp), %r13
	leaq	16(%rbp), %rbp
	movq	(%rbp), %rax
.L4:
	jmp	*%rax
.L2:
	movq	%rdx, %rax
	imulq	8(%rbp), %rax
	movq	%rax, 8(%rbp)
	leaq	-1(%rdx), %rax
	movq	%rax, (%rbp)
	movl	$Main_zdwfac_info, %eax
	jmp	.L4


ack! lets count what happens on each iteration of the loop: we have 5 (!)
memory acceses and two jumps, one of them being indirect! in fact, each time
through the loop, it loads the same values into the same registers.


this is really bad compared to the jhc assembly. the basic issue is an
interaction between ghc's use of global registers, a stack, and indirect
calls.

an indirect jump is very expensive. modern processors are pipelined,
having many instructions in the queue at once, looking ahead and beginning to
evaluate what is coming up. when an indirect jump is encountered, the CPU has
no choice but to flush the whole pipeline because it has no idea where it
goes, even with conditional branches cpus can predict with good accuracy
which branch will be taken and at worst, it discards the pipeline, but with
an indirect jump, there is no chance of it having a full pipeline.


However the major issue is the following.  %rbp is the global stack pointer
pointing to the STG stack, since it is global it can be modified from
anywhere, since gcc can't know if the function it jumps to modified said
register it has no choice but to load all values relative to it every time
through the loop because it may have changed.

furthermore gotos and labels are very problematic for gcc to optimize around.
for various tiresome reasons gcc cannot perform (most) code motion
optimizations across explicit labels and gotos, especially when they deal with
the global register variables and memory stores and loads. since these are
arguably some of the most important optimizations, this is quite bad for the
generated assembly.

loop:

if () goto loop;

is not equivalent to a do-while loop, loop invarients cannot be hoisted out of
the above for instance (except in some cases... it is all quite tricky and we
want gcc to have as much freedom as possible).

all in all, this makes the code generated by gcc compiling something
generated by ghc not very good at all.

there are a couple of things we can do to mitigate these problems:

get rid of indirect jumps whenever possible.

use C control constructs rather than gotos.


A couple simple rules seem to help greatly.

* turn anything of the form JMP_((W_)&self) where self is oneself into a goto
that gotos a label at the beginning of the function.

* do simple pattern matthing on the basic blocks to recognize where C control
constructs can be placed.

for instance turn

if (x) { goto  y; }
blah..
baz..
JMP_(foo)

into

if (x) { goto  y; } else {
blah..
baz..
JMP_(foo)
}

extending the else to after the next jump or goto.


* getting stack dereferences out of your loops.


manually performing the first two optimizations mentioned above we get:

EI_(Main_zdwfac_info);
FN_(Main_zdwfac_entry) {
W_ _s27g;
W_ _s27i;
W_ _s27l;
FB_
fac_entry:
_s27g = *Sp;
if (_s27g != 0x1UL) { goto _c282; } else {
R1.p = (P_)(Sp[1]);
Sp=Sp+2;
JMP_(*Sp);
}
_c282:
_s27l = _s27g * (Sp[1]);
_s27i = _s27g - 0x1UL;
Sp[1] = _s27l;
*Sp = _s27i;
goto fac_entry;
FE_
}

and this produces the assembly:

Main_zdwfac_info:
.text
	.align 8
	.text
	movq	%rbp, %rdx
	movq	(%rbp), %rcx
	cmpq	$1, %rcx
	je	.L3
.L6:
	movq	%rcx, %rax
	imulq	8(%rdx), %rax
	movq	%rax, 8(%rdx)
	leaq	-1(%rcx), %rax
	movq	%rax, (%rbp)
.L4:
	movq	%rbp, %rdx
	movq	%rax, %rcx
	cmpq	$1, %rax
	jne	.L6
.L3:
	movq	8(%rdx), %r13
	leaq	16(%rdx), %rbp
	jmp	*(%rbp)

we still have some unnecesarry memory accesses, but the indirect jump and the
spurious jump are gone and we have less instructions in the main loop making
this code noticably faster.

in order to get rid of the unessesary memory accesess, we need to either

1. fully convert it to use C control constructs, so gcc will do it for us.
(code motion and loop invarient being inhibited again by gotos)

or

2. do it ourselves by analyzing when the consumer of what we are putting on
the stack is ourselves.

the first is greatly preferable but not always possible.


These should be straightforward to implement in the C code generator. it also
suggests we might want to try to use the native C calling convention on leaf
nodes that deal with unboxed values (so we get register passing and return
values for free) or taking groups of mutually recursive functions and turning
them all into one function with explicit jumps between them.


to show they are actually optimizing the same function, here are both of their
core representaions, other than syntax, they are identical:

jhc core: (uses unicode, utf8 formatted)

   W@.fMain.fac∷int → int → int = λx9216∷int.λx9222∷int.(case x9216 of
            1 → x9222 ;
            x9238∷case <(int)-(int,int) x9216 1∷int> of
                    x9252∷int → case <(int)*(int,int) x9216 x9222∷int> of
                        x34∷int → case W@.fMain.fac x9252 x34 of
                            x9208∷int → x9208;;;;;)

ghc core:

  %rec
  {zdwfac :: GHCziPrim.Intzh -> GHCziPrim.Intzh -> GHCziPrim.Intzh =
     \ (ww::GHCziPrim.Intzh) (ww1::GHCziPrim.Intzh) ->
	 %case (GHCziPrim.Intzh) ww %of (ds::GHCziPrim.Intzh)
	   {%_ ->
	      Main.zdwfac (GHCziPrim.zmzh ds (1::GHCziPrim.Intzh))
	      (GHCziPrim.ztzh ds ww1);
	    (1::GHCziPrim.Intzh) ->
	      ww1}};


some random notes:

the 3x-7x factor was tested on an i386, on x86_64 the margin is much much
greater for reasons that are still unclear.

while testing this I noticed that jhc and ghc have dramatically different
results on x86_64 pretty much across the board, if their programs take
comparable amounts of time on i386, the jhc one will run twice as fast as the
ghc one on x86_64. I think ghc must be tickling the x86_64 in a particularly
bad way for this dramatic of an effect to be observed. I will poke around the
generated assembly of ghc some more and see if I can find the culprit.