Optimization of the Day

Consider this (unoptimized) code:

.412
    ...
    sta %t0
    lda %t2
    adc %t6
    sta %t2
    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %r10
    sta %r12
    __bra .414
    ; next set = { 414 }

.413
    lda [%r0]
    tax
    ldy #2
    lda [%r0],y
    stx %t0
    sta %t2
    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %r10
    sta %r12
    ;next set = { 414 }

.414
    lda %r10
    sta %r4
    lda %r12
    sta %r4+2
    __bra .411
    ; prev set = { 412, 413 }
    ; next set = { 411 }

If blocks 412 and 413 look similar, it’s because the last 7 lines (14 bytes) are identical. Since they both flow into 414 (which has no other previous blocks), those lines could be moved to the start of 414. After other optimizations, this results in an 8-byte improvement.

.412
    ...
    sta %t0
    lda %t2
    adc %t6
    __bra .414
    ; next set = { 414 }

.413
    lda [%r0]
    tax
    ldy #2
    lda [%r0],y
    stx %t0
    ;next set = { 414 }

.414
    sta %t2
    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %r10
    sta %r12
    lda %r10
    sta %r4
    lda %r12
    sta %r4+2
    __bra .411
    ; prev set = { 412, 413 }
    ; next set = { 411 }

(This optimization is entirely theoretical and has not been implemented as of the publication date.)

Function Attributes (Redux)

For various reasons, I don’t support orca’s pragmas or toolbox extensions, instead handling them via function attributes. After converting some ORCA/C toolbox headers, I decided to add support for ORCA/C’s pascal and inline keywords.

So, a toolbox call can now be declared as

extern pascal Word MMStartUp(void) inline(0x0202, dispatcher);

or

extern Word [[pascal, inline(0x0202, dispatcher)]] MMStartUp(void);

As a side note, I suspect the inline statement came to ORCA/C by way of APW C by way of MPW C and MPW Pascal, which had an inline statement with a list of hex words (usually 680x0 trap instructions for the tool call). MPW Pascal IIgs also supported inline code but generally used library glue for tool calls.

The function attributes are now included with the return type rather than after the function definition. This better matches GCC and c++11 function attributes. Additionally, the c99 inline and c11 _Noreturn specifiers are now parsed (but so far do nothing).

CPP

LCC uses a stand-alone pre-processor (cpp). The book doesn’t mention any particular details about it. Presumably, it was whatever cpp happened to be lying around the Bell Labs offices at the time. Perhaps it was even written by Brian Kernighan or Dennis Ritchie.

The Plan 9 cpp clearly shares a common ancestor, but was updated to support C 99 variadic macros and fixed a bug with improper ## and # expansion within macros. (The Plan 9 version also uses Plan 9 I/O and supports UTF-8.).

As a side note, the Plan 9 C compilers have a subset of the pre-processor built in. The stand-alone version is usually not needed.

I back-ported (or perhaps side-ported) the C 99 updates and bug fixes to LCC-816.

Function Indirection

I am aware of four known ways to handle function indirection with the 65816/Apple IIgs. As none of them were appropriate for LCC-816, I’ve also devised a new method. I can’t claim it is novel but I am unaware of any previous usage.

Self-Modifying Code

The simplest and most performant way, used by assembly language programmers, is to simply use self-modifying code.

    lda function
    sta @sm+1
    lda function+1
    sta @sm+2
@sm jsl $000000

ORCA/Pascal also uses self-modifying code for object indirection calls with optimize flag $0020.

JSL Via RTL

ORCA/C and ORCA/Pascal manipulate the stack and dispatch via an RTL.

    ldx function+2
    lda function
    phx   
    pha   
    lda 1,s
    dec a
    pha   
    sep #$20
    longa off
    longi on
    lda 5,s
    sta 3,s
    lda #^@return
    sta 6,s
    rep #$20
    longa on
    longi on
    lda #@return
    sta 4,s
    rtl 
@return  
    ...

JML

ORCA/Modula2 uses JML [address] to dispatch. However, JML [address] is utterly broken and always reads from an absolute address in bank 0. Thus, to make use of it, self-modifying code is used.

    lda >function+2
    tay   
    lda >function
    tyx   
    sta <tmp
    stx <tmp+2
    phb   
    phk   
    plb   
    ldy #$0000
    tdc   
    clc   
    adc #tmp
    per @sm
    sta ($01,s),y
    pla   
    plb   
    phk   
    per @return
@sm     
    jml [$0000]
@return

JML (Revisted)

JML [address] and JMP (address) always read from bank 0 and are generally unusable in GS/OS since you don’t know where your stack/direct page will be. But the GS/OS loader knows where your stack is and will tell you if you ask nicely. The key is to create a named stack/direct page segment. Whenever and wherever you refer to the label, the GS/OS relocator will update it to the actual location in memory.

MyDP    STACK
    ds 1024
    ENDSTACK
...
    ; create rtl address
    phk
    per @return-1
    lda function
    sta MyDP
    lda function+2
    sta MyDP+2
    jml [|MyDP]
@return

(That will use DP location 0-3.)

Function Stub

The 65816 has an instruction that pushes the return address on the stack – JSL. We can load the address into the a/x registers and call a trampoline function to dispatch to the real function.

    lda function
    ldx function+2
    jsl __jsl_ax
    ...
__jsl_ax
    short x
    phx
    long x
    dec
    pha
    rtl

This is the method LCC-816 uses. This is primarily for the benefit of the optimizer – self modifying code and screwing around with the stack make it hard to reason about. But you have to admit it looks a lot nicer, too!

MPW PascalIIgs uses this technique as well, though I was not aware of that when I discovered it.

Near Functions

LCC-816 supports near functions, which are always within the same bank and use JSR/RTS for dispatch and return instead of JSL/RTL. This makes function indirection easier – JSR (absolute,X) can be used.

    ldx function
    jsr ($0000,x)

Reverse Subtract

It is an all too common occurrence – the value in your a register needs to be subtracted. ARM provides a RSB instruction. The 65816 (which, quite frankly, needs it more than ARM does) does not. But we can still achieve the same outcome!

A - B is equivalent to A + -B which is equivalent to -B + A. Normally, negation is handled with an EOR #$ffff / INC.

    eor #$ffff
    inc
    clc
    adc ....

The INC/CLC can be combined into a SEC to save an instruction:

    eor #$ffff
    sec
    adc ....

This works for multi-word math, as well.

%t0 = %t0 - a 

    eor #$ffff
    sec
    adc %t0
    sta %t0
    lda #$ffff ; #0 ^ #$ffff 
    adc %t2
    sta %t2

Smart Branches

The 65816 branch instructions use an 8-bit offset, which gives them a range of -128 to +127. There is one exception – BRL – which has a 16-bit offset, so 16-bit conditional branches can be synthesized.

    bcs @farfaraway

becomes

    bcc *+5
    brl @farfaraway

But what’s a compiler writer to do? Blindly using an 8-bit branch will sometimes fail and blindly using a 16-bit branch wastes cycles and interferes with basic block analysis. Not to mention complex comparison (like less than or equal) which have to be composed of bcc and beq and signed comparison.

My solution was to ignore native branch instructions in favor of a synthetic smart branch. Like a real branch, it has a condition and a target. Unlike a real branch, it has a flag to indicate if it’s short or far and may expand to multiple bytes.

After most optimizations run, there is a pass to calculate the code size and determine if a far branch is needed. This is actually a multi-pass process since using a far branch increases the space used and could impact other branches in the code.

Then there is a final peephole pass to remove redundant CMP #0 instructions and the smart branch is expanded to real 65816 code.

The smart branches are:

ORCA/C

Mike Westerfield has made the ORCA/C source code available on GitHub. The source code has been since around the turn of the century, via Disc 2 of the Opus CD but this is a big step forward for bug fixes and enhancements.

I have maintained my own private version of ORCA/C (including various bug fixes) and features backported from MPW ORCA/C.

Let us hope ORCA/C has a bright future ahead of it. (And lcc-816, too!)

Addendum

It’s been taken down until licensing, forking, and such can be clarified. Hopefully it will appear somewhere else, sooner rather than later.

JSL RTL

It’s reasonable for this code:

int yyy(void);
int xxx(void) {
    return yyy(); 
}

to compile to:

    JSL >yyy
    RTL 

(ORCA/C and APW C include extra boilerplate to set up and tear down the stack. Perhaps they’re not reasonable).

But we can save an extra byte if we convert the JSL to a JML and drop the RTL.

    JML >yyy

Stack Relative Addressing

This:

int abc(int a, int b, int c) { return a + b + c; }

Generates this:

abc start
    TSC 
    PHD 
    TCD 
    LDA <$04
    CLC 
    ADC <$06
    CLC 
    ADC <$08
    PLD 
    RTL 
    end

Not bad. But it can be better. And now it is:

abc start
    LDA $04,s
    CLC 
    ADC $06,s
    CLC 
    ADC $08,s
    RTL 
    end

Before direct page locations are assigned to registers, there’s a new optimization pass that checks if stack relative addressing can be used. If so, we can use it and bypass the stack frame setup code. On paper, stack relative addressing is slower (4 or 5 cycles vs 3 or 4 cycles) but dp addressing has a 1 cycle penalty if the low byte of the direct page is not 0. There’s only a 1 in 256 chance that the cycle will be saved so eliminating direct page muckery is probably worth it.

The optimization is only available in limited circumstances:

Variadic Arguments

Variadic arguments are a low-point for ORCA/C but a highpoint for LCC.

The ORCA/C calling convention (callee cleans up) makes them dangerous – va_end has to repair the stack but actually know how many bytes to remove. You can enable debug code to check for errors, but that comes at a price. Additionally, accessing the variadic arguments involves 24-bit pointer arithmetic.

With LCC, I opted to use a caller-cleans-up ABI (cdecl) by default (and for variadic functions). This prevents stack errors. Additionally, I use a more enlightened method for accessing variadic arguments. Since the stack is always in bank 0, we don’t need a 24-bit pointer – we can use ABSOLUTE LONG,X addressing.

Today, this code:

va_list va;
int arg;
...
arg = va_arg(va, int);

will compile to:

    PEI (<va)
    LDA <va
    INC 
    INC 
    STA <va
    PLX 
    LDA >$00,x
    STA <arg

That can be safely optimized even further to:

    LDA <va
    TAX
    INC 
    INC 
    STA <va
    LDA >$00,x
    STA <arg

Even better would be to use the OFFSET,S address mode. However, va_list can be (and often is) passed around to other functions, so that’s not always possible.

For comparison, ORCA/C will generate this code:

      PEI   $11
      LDA   $0F
      CLC   
      ADC   #$0002
      BCC   cc1
      PLX   
      INX   
      PHX   
cc1   PHA   
      LDA   $01,S
      STA   $0F
      LDA   $03,S
      STA   $11
      SEC   
      LDA   $01,S
      SBC   #$0002
      STA   $01,S
      BCS   cc2
      LDA   $03,S
      DEC   A
      STA   $03,S
cc2   PLA   
      PLX   
      STA   $03
      STX   $05
      LDA   [$03]
      STA   $07

Which is more or less equivalent to:

char *va_list;
char *tmp;
va_list += sizeof(int);
tmp = va_list - sizeof(int)
arg = *(int *)tmp;

Optimization of the Day

Starting with this:

myfunc(refNum);

How did we end up with this:

    LDA <$05
    PHA 
    JSL >myfunc

Instead of:

    PEI <$05
    JSL >myfunc

The answer: LCC had to cast refNum from an int to an unsigned. In doing so, the intermediate code looks like:

    lda %r0
    sta %t0
    pei %t0
    jsl close

The optimizer can’t currently optimize it to a simple PEI. And probably never will – there are functions that take parameters via the A and X registers. However, we can fix LCC to bypass the integer conversion. Currently, it only applies to function arguments. It might be helpful elsewhere but that requires some careful thought to make sure it doesn’t break anything.

Reorder Your Stores

Consider this code:

    ...
    LDX #$2211
    JSL >$e10000
    STA >_toolErr
    PLA 
    STA <$01
    PLA 
    STA <$03
    LDA <$01
    STA <$07
    LDA <$03
    STA <$09
    LDA |_toolErr
    ...

Which was generated from this code:

path = LGetPathname2(id, 1);

By way of this code:

    ldx #$2211
    jsl $e10000
    sta >_toolErr
    pla
    sta %t0+0
    pla
    sta %t0+2
    lda %t0
    sta %r2
    lda %t0+2
    sta %r2+2
    lda |_toolErr

The code we’d like (and expect) is:

    ldx #$2211
    jsl $e10000
    sta >_toolErr
    pla
    sta %r2+0
    pla
    sta %r2+2
    lda |_toolErr

However, LCC doesn’t want to store the result into the destination variable – after all, in a normal CPU with a normal ABI, the return value is a specific register. I have made some attempts in the past to fix that at the LCC level but wasn’t successful.

The assembler could reorder those stores, though.

    ldx #$2211
    jsl $e10000
    sta >_toolErr
    pla
    sta %t0+0
    sta %r2
    pla
    sta %t0+2
    sta %r2+2
    lda %t0
    lda %t0+2
    lda |_toolErr

After peepholing and removing dead writes, we would end up with the desired code:

    ldx #$2211
    jsl $e10000
    sta >_toolErr
    pla
    sta %r2+0
    pla
    sta %r2+2
    lda |_toolErr

Function Attributes

ORCA/C uses a variety of qualifiers, statements, and #pragmas to modify functions and the code generated. For LCC, I’ve moved them into a unified function attribute, based on gcc and c++11.

pascal

Use the pascal calling convention. This is based on the toolbox and mpw pascal, not ORCA/Pascal.

Similar to ORCA/C’s pascal qualifier and #pragma toolparms 1

cdecl

Use the cdecl calling convention (default).

stdcall

Use the stdcall calling convention. This provides some compatibility with ORCA/C.

debug

Generate GSBug debug names.

Equivalent to ORCA/C’s #pragma debug $8000 (assuming ORCA/C has been patched to generate debug names…)

inline

inline(address, x-register)
inline(address)

Similar to ORCA/C’s inline qualifier and used for the same purpose (tool calls).

A toolcall would be declared as

extern Word MMStartUp(void) [[pascal, inline(0xe10000, 0x0202)]];

rather than

extern pascal Word MMStartUp(void) inline(0x0202, 0xe10000);

segment

segment("segment name")

Specifies the OMF segment name. Default is blank.

Similar to ORCA/C’s segment statement.

databank

Save the b register and set it to the proper value.

Similar to ORCA/C’s #pragma databank 1.

near

Use JSR/RTS for function calls rather than JSL/RTL. This is more efficient, particularly with function pointers, but cannot cross banks (or segments).

noreturn

Currently parsed but has no effect. Indicates that a function will not return (exit, for example).

Third Success

I just finished up a quick and dirty Classic Desk Accessory to display passages from the Dragon Wars handbook. I mention this here because it was developed entirely with lcc-816.

(The CDA header was written in asm-816. Eventually I plan to have a separate utility to generate headers, similar to orca #pragmas.)

In the process, I added the debug attribute ([[debug]]), fixed pointer alignment, added some optimizations, and fixed some bugs. I also found some more optimizations to add in the future.

Alignment woes

The type metrics were copied from a 32-bit architecture. While I updated the sizes and alignments, I somehow neglected to update the pointer alignment from 4-bytes to 2-bytes. This, predictably, screwed up struct packing and caused GS/OS calls to fail.

I fixed that but it somehow altered the intermediate code:

Original code:

mystruct = array[index]; // mystruct is 6 bytes

4-byte struct alignment:

ADDP4(CVUU4(MULU2(CNSTU2(6), INDIRU2(VREGP(index)))), ADDRGP4(array))

2-byte struct alignment:

ADDP4(MULI4(CNSTI4(6), CVUI4(INDIRU2(VREGP(index)))), ADDRGP4(array))

Which is to say, with 4-byte pointer alignment, the 6 * index multiplication is 16-bit but with a 2-byte pointer alignment, it’s 32-bit.

Addendum

LCC checks the short, int, long, and long long types to find one that matches the pointer metrics (size and alignment) and uses that for pointer arithmetic. If nothing matches (which was previously the case with the goofy alignment), it falls back to using an integer.

Optimization Of The Day

    LDA <$1b
    AND #$ff
    CMP #$00
    BMI .31
    CMP #$1e
    BCC .31

Astute observers will note that the BMI cannot possibly be taken, given that no value can be negative after anding with $ff. Why does this even need to be optimized in the first place?

The original code looked like:

unsigned char b;
...
if (b > 29) ...

b is promoted to a signed integer for the comparison (and the comparison is inverted by lcc), which would normally require extra work to deal with signed comparison. However, since it’s being compared to a known positive number, we can check for negative numbers and then use an unsigned compare. In this case, negative numbers are impossible.

This could be fixed in lcc itself (promote unsigned chars to unsigned integers), the machine description (extra rules for comparing unsigned chars cast to signed ints), or in the peephole optimizer.

For now, it’s handled in the peephole optimizer.

Second Success

Last night, I successfully compiled, assembled, linked, and executed my second test program (qtime).

The first attempt at building had bad code due to a pointer dereference clobbering itself.

void *vp;
...
vp = *(void **)vp;

Originally, that would generate this sequence if the source was the destination:

    lda [%t0]
    sta %t0
    ldy #2
    lda [%t0],y
    sta %t0+2

I changed that to use the X register as an intermediate:

    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %t0
    sta %t0+2

Initially, this generates slightly worse code (one extra instruction). However, in some circumstances, it enables better optimizations later on. Consider:

    stx %t0
    sta %t0+2
    pei %t0+2
    pei %t0

With peephole and dead-write elimination, this becomes:

    pha
    phx

A Retargetable C Compiler: Design and Implementation

A Retargetable C Compiler: Design and Implementation is more or less required when working with LCC. The source code is available, but without any comments. (The compiler is apparently a literate program, so all comments or explanations end up in the book.)

Additional information on the LCC 4.x code generation interface is also valuable.

25 Years In The Making

On IRC, comp.sys.apple2, GEnie, Delphi, AOL, CompuServe, and local BBSs – any place Apple II programmers congregate – there was a common desire: a better C compiler. Through the years, there has been much talk but not much ado.

Sunday, June 28th, 2015, that changed. I successfully compiled, assembled, linked (with the help of MPW’s LinkIIgs), then ran an actual program.

Success was short-lived, though: the second test program generated some incorrect code.

There’s still more work to be done, more optimizations, more testing.

There may not be anyone left to enjoy it, but there will be a better C compiler.

Signed Int Considered Harmful

… at least for your code size. Why? Three reasons.

  1. The 65816 compare/branch instructions don’t support signed integers. Most other architectures provide branches that act on multiple flags.

    unsigned > C == 1 && Z == 0
    unsigned <= C == 0 || Z == 1
    signed >= N == V
    signed > N == V && Z == 0
    signed < N != V
    signed <= N != V || Z == 1

    The 65816 does not provide these. The CMP instruction doesn’t even set the V flag, so a signed %t0 < %t2 expands to this ungodly sequence:

         lda %t0
         sec
         sbc %t2
         bvs *+5
         eor #$8000
         bpl @target
    
  2. With an unsigned integer, the X or Y index registers can be used for, well, indexing. This isn’t possible with signed integers, so a 32-bit pointer needs to be generated and dereferenced. This also requires a 4-bytes of dp space.

  3. Extending a signed integer to a signed long requires extra work to propagate the sign. This requires a branch and dp space.

Moral of the story: use unsigned integers whenever possible. LCC-816 uses unsigned characters for these reasons.

printf("hello, world\n");

Gentle reader:

This blog will (or will not) chronicle some of the adventures as I try to build a C-compiler for the Apple IIgs/65816.

LCC – with a very customized lburg machine description – is the basis for the C compiler. Additionally, a new optimizing assembler is in progress. Linking is currently handled by the MPW LinkIIgs linker. Subroutines libraries will be novel, copied from ORCA/C, or ported from elsewhere (BSD?) as necessary.

Enjoy the ride!