Other LCCs

My efforts aren’t the first attempt to turn LCC into a 65816 compiler. Let’s look at some other efforts.

Toshiyasu Morita

In 1993/1994, Toshiyasu Morita modified LCC 1.9. LCC was modified to generate intermediate DAG files. A separate program (dag2gs) generated macro-heavy ORCA/M code.

Fabrice France (LCC65)

Fabrice France modified LCC 1.9 to generate 6502 code, for use with the ORIC computer.

Fabrice France (LCC65802)

Fabrice France apparently also modifed LCC to generate 65802/65816 code. Sadly it was never released but the examples look promising.

Jolse Maginnis

As Part of Wings OS for the C64/SCPU.

We’re Back

There hasn’t been any activity in the past year or so but perhaps that will change. In the meantime:

I’m not sure the current state of lcc-816. I was making some changes to the assembler (macros and records) when I stopped working on it. I plan on getting the assembler back into a usable shape and putting it on GitHub, too.

Smart Branches (version 2)

Macross is a macro cross assembler for the 6502. It’s a higher level assembly language in that there you can write code like this:

    inc bleh
    if (equal) {
        inc bleh+1
    }
    do {
        lsr
    } while (!carry)

and it will generate the appropriate branches and labels. I intend to add similar high level constructs to my 816 assembler. They won’t be useful for generated lcc code, but they will be useful for hand-written assembly (which includes parts of libc and the subroutine library).

The first step is a new branch syntax as an alternative to the smart branches. Behold:

    branch condition, target

    branch cc, here
    branch signed lt, there
    branch unsigned ge, another

Condition may also be a boolean (number) for always or never.

    branch true, label
    branch false, label  ; will be optimized out!

Debugging

“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” – Brian Kernighan.

That applies, even more so, to debugging an optimizer. The optimizer is smart enough now that half the “bugs” I’ve found recently weren’t bugs at all – the optimizer code was just smarter than I expected.

Optimization of the Day

Consider this (unoptimized) code:

.412
    ...
    sta %t0
    lda %t2
    adc %t6
    sta %t2
    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %r10
    sta %r12
    __bra .414
    ; next set = { 414 }

.413
    lda [%r0]
    tax
    ldy #2
    lda [%r0],y
    stx %t0
    sta %t2
    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %r10
    sta %r12
    ;next set = { 414 }

.414
    lda %r10
    sta %r4
    lda %r12
    sta %r4+2
    __bra .411
    ; prev set = { 412, 413 }
    ; next set = { 411 }

If blocks 412 and 413 look similar, it’s because the last 7 lines (14 bytes) are identical. Since they both flow into 414 (which has no other previous blocks), those lines could be moved to the start of 414. After other optimizations, this results in an 8-byte improvement.

.412
    ...
    sta %t0
    lda %t2
    adc %t6
    __bra .414
    ; next set = { 414 }

.413
    lda [%r0]
    tax
    ldy #2
    lda [%r0],y
    stx %t0
    ;next set = { 414 }

.414
    sta %t2
    lda [%t0]
    tax
    ldy #2
    lda [%t0],y
    stx %r10
    sta %r12
    lda %r10
    sta %r4
    lda %r12
    sta %r4+2
    __bra .411
    ; prev set = { 412, 413 }
    ; next set = { 411 }

(This optimization is entirely theoretical and has not been implemented as of the publication date.)

Function Attributes (Redux)

For various reasons, I don’t support orca’s pragmas or toolbox extensions, instead handling them via function attributes. After converting some ORCA/C toolbox headers, I decided to add support for ORCA/C’s pascal and inline keywords.

So, a toolbox call can now be declared as

extern pascal Word MMStartUp(void) inline(0x0202, dispatcher);

or

extern Word [[pascal, inline(0x0202, dispatcher)]] MMStartUp(void);

As a side note, I suspect the inline statement came to ORCA/C by way of APW C by way of MPW C and MPW Pascal, which had an inline statement with a list of hex words (usually 680x0 trap instructions for the tool call). MPW Pascal IIgs also supported inline code but generally used library glue for tool calls.

The function attributes are now included with the return type rather than after the function definition. This better matches GCC and c++11 function attributes. Additionally, the c99 inline and c11 _Noreturn specifiers are now parsed (but so far do nothing).

CPP

LCC uses a stand-alone pre-processor (cpp). The book doesn’t mention any particular details about it. Presumably, it was whatever cpp happened to be lying around the Bell Labs offices at the time. Perhaps it was even written by Brian Kernighan or Dennis Ritchie.

The Plan 9 cpp clearly shares a common ancestor, but was updated to support C 99 variadic macros and fixed a bug with improper ## and # expansion within macros. (The Plan 9 version also uses Plan 9 I/O and supports UTF-8.).

As a side note, the Plan 9 C compilers have a subset of the pre-processor built in. The stand-alone version is usually not needed.

I back-ported (or perhaps side-ported) the C 99 updates and bug fixes to LCC-816.

Function Indirection

I am aware of four known ways to handle function indirection with the 65816/Apple IIgs. As none of them were appropriate for LCC-816, I’ve also devised a new method. I can’t claim it is novel but I am unaware of any previous usage.

Self-Modifying Code

The simplest and most performant way, used by assembly language programmers, is to simply use self-modifying code.

    lda function
    sta @sm+1
    lda function+1
    sta @sm+2
@sm jsl $000000

ORCA/Pascal also uses self-modifying code for object indirection calls with optimize flag $0020.

JSL Via RTL

ORCA/C and ORCA/Pascal manipulate the stack and dispatch via an RTL.

    ldx function+2
    lda function
    phx   
    pha   
    lda 1,s
    dec a
    pha   
    sep #$20
    longa off
    longi on
    lda 5,s
    sta 3,s
    lda #^@return
    sta 6,s
    rep #$20
    longa on
    longi on
    lda #@return
    sta 4,s
    rtl 
@return  
    ...

JML

ORCA/Modula2 uses JML [address] to dispatch. However, JML [address] is utterly broken and always reads from an absolute address in bank 0. Thus, to make use of it, self-modifying code is used.

    lda >function+2
    tay   
    lda >function
    tyx   
    sta <tmp
    stx <tmp+2
    phb   
    phk   
    plb   
    ldy #$0000
    tdc   
    clc   
    adc #tmp
    per @sm
    sta ($01,s),y
    pla   
    plb   
    phk   
    per @return
@sm     
    jml [$0000]
@return

JML (Revisted)

JML [address] and JMP (address) always read from bank 0 and are generally unusable in GS/OS since you don’t know where your stack/direct page will be. But the GS/OS loader knows where your stack is and will tell you if you ask nicely. The key is to create a named stack/direct page segment. Whenever and wherever you refer to the label, the GS/OS relocator will update it to the actual location in memory.

MyDP    STACK
    ds 1024
    ENDSTACK
...
    ; create rtl address
    phk
    per @return-1
    lda function
    sta MyDP
    lda function+2
    sta MyDP+2
    jml [|MyDP]
@return

(That will use DP location 0-3.)

Function Stub

The 65816 has an instruction that pushes the return address on the stack – JSL. We can load the address into the a/x registers and call a trampoline function to dispatch to the real function.

    lda function
    ldx function+2
    jsl __jsl_ax
    ...
__jsl_ax
    short x
    phx
    long x
    dec
    pha
    rtl

This is the method LCC-816 uses. This is primarily for the benefit of the optimizer – self modifying code and screwing around with the stack make it hard to reason about. But you have to admit it looks a lot nicer, too!

MPW PascalIIgs uses this technique as well, though I was not aware of that when I discovered it.

Near Functions

LCC-816 supports near functions, which are always within the same bank and use JSR/RTS for dispatch and return instead of JSL/RTL. This makes function indirection easier – JSR (absolute,X) can be used.

    ldx function
    jsr ($0000,x)

Reverse Subtract

It is an all too common occurrence – the value in your a register needs to be subtracted. ARM provides a RSB instruction. The 65816 (which, quite frankly, needs it more than ARM does) does not. But we can still achieve the same outcome!

A - B is equivalent to A + -B which is equivalent to -B + A. Normally, negation is handled with an EOR #$ffff / INC.

    eor #$ffff
    inc
    clc
    adc ....

The INC/CLC can be combined into a SEC to save an instruction:

    eor #$ffff
    sec
    adc ....

This works for multi-word math, as well.

%t0 = %t0 - a 

    eor #$ffff
    sec
    adc %t0
    sta %t0
    lda #$ffff ; #0 ^ #$ffff 
    adc %t2
    sta %t2

Smart Branches

The 65816 branch instructions use an 8-bit offset, which gives them a range of -128 to +127. There is one exception – BRL – which has a 16-bit offset, so 16-bit conditional branches can be synthesized.

    bcs @farfaraway

becomes

    bcc *+5
    brl @farfaraway

But what’s a compiler writer to do? Blindly using an 8-bit branch will sometimes fail and blindly using a 16-bit branch wastes cycles and interferes with basic block analysis. Not to mention complex comparison (like less than or equal) which have to be composed of bcc and beq and signed comparison.

My solution was to ignore native branch instructions in favor of a synthetic smart branch. Like a real branch, it has a condition and a target. Unlike a real branch, it has a flag to indicate if it’s short or far and may expand to multiple bytes.

After most optimizations run, there is a pass to calculate the code size and determine if a far branch is needed. This is actually a multi-pass process since using a far branch increases the space used and could impact other branches in the code.

Then there is a final peephole pass to remove redundant CMP #0 instructions and the smart branch is expanded to real 65816 code.

The smart branches are: