Mine isn’t the first attempt to turn LCC into a 65816 compiler. Let’s
look at some earlier efforts.
Toshiyasu Morita
In 1993/1994, Toshiyasu Morita modified LCC 1.9 to generate intermediate
DAG files. A separate program (dag2gs) then generated macro-heavy ORCA/M code.
Fabrice France (LCC65)
Fabrice France modified LCC 1.9 to generate 6502 code, for use with
the ORIC computer.
Fabrice France (LCC65802)
Fabrice France apparently also modified LCC to generate 65802/65816 code.
Sadly it was never released but the
examples look promising.
Jolse Maginnis
Jolse Maginnis modified LCC as part of Wings OS for the C64/SCPU.
There hasn’t been any activity in the past year or so but perhaps that will
change. In the meantime:
I’m not sure of the current state of lcc-816. I was making some changes to the
assembler (macros and records) when I stopped working on it. I plan on getting
the assembler back into a usable shape and putting it on GitHub, too.
Macross
is a macro cross assembler for the 6502. It’s a higher-level assembly language
in that you can write code like this:
inc bleh
if (equal) {
inc bleh+1
}
do {
lsr
} while (!carry)
and it will generate the appropriate branches and labels. I intend to
add similar high level constructs to my 816 assembler. They won’t be useful
for generated lcc code, but they will be useful for hand-written assembly
(which includes parts of libc and the subroutine library).
The first step is a new branch syntax as an alternative to the smart branches.
Behold:
branch condition, target
branch cc, here
branch signed lt, there
branch unsigned ge, another
The condition may also be a boolean (number), for always or never.
branch true, label
branch false, label ; will be optimized out!
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” – Brian Kernighan.
That applies, even more so, to debugging an optimizer. The optimizer is smart enough now that half the “bugs” I’ve found recently weren’t bugs at all – the optimizer code was just smarter than I expected.
Consider this (unoptimized) code:
.412
...
sta %t0
lda %t2
adc %t6
sta %t2
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %r10
sta %r12
__bra .414
; next set = { 414 }
.413
lda [%r0]
tax
ldy #2
lda [%r0],y
stx %t0
sta %t2
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %r10
sta %r12
; next set = { 414 }
.414
lda %r10
sta %r4
lda %r12
sta %r4+2
__bra .411
; prev set = { 412, 413 }
; next set = { 411 }
If blocks 412 and 413 look similar, it’s because the last 7 lines (14 bytes)
are identical. Since they both flow into 414 (which has no other previous
blocks), those lines could be moved to the start of 414. After other
optimizations, this results in an 8-byte improvement.
.412
...
sta %t0
lda %t2
adc %t6
__bra .414
; next set = { 414 }
.413
lda [%r0]
tax
ldy #2
lda [%r0],y
stx %t0
; next set = { 414 }
.414
sta %t2
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %r10
sta %r12
lda %r10
sta %r4
lda %r12
sta %r4+2
__bra .411
; prev set = { 412, 413 }
; next set = { 411 }
(This optimization is entirely theoretical and has not been implemented
as of the publication date.)
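The transformation (sometimes called cross-jumping or tail merging) can be sketched in a few lines of Python. The block representation here is hypothetical, not the compiler’s actual data structures:

```python
def common_suffix(blocks):
    """Longest run of identical instructions at the end of every block."""
    suffix = []
    for instrs in zip(*(reversed(b) for b in blocks)):
        if len(set(instrs)) != 1:
            break
        suffix.append(instrs[0])
    suffix.reverse()
    return suffix

def tail_merge(preds, succ):
    """Hoist the common tail of all predecessor blocks into their shared
    successor.  Only valid when succ has no other predecessors and every
    block in preds flows only into succ."""
    tail = common_suffix(preds)
    n = len(tail)
    if n == 0:
        return preds, succ
    return [b[:-n] for b in preds], tail + succ
```

Running this on blocks .412 and .413 above hoists the seven shared instructions (sta %t2 through sta %r12) into .414.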
For various reasons, I don’t support ORCA’s pragmas or toolbox extensions,
instead handling them via function attributes. After converting some ORCA/C
toolbox headers, I decided to add support for ORCA/C’s pascal and inline
keywords.
So, a toolbox call can now be declared as
extern pascal Word MMStartUp(void) inline(0x0202, dispatcher);
or
extern Word [[pascal, inline(0x0202, dispatcher)]] MMStartUp(void);
As a side note, I suspect the inline statement came to ORCA/C by way of APW C
by way of MPW C and MPW Pascal, which had an inline statement with a list of
hex words (usually 680x0 trap instructions for the tool call). MPW Pascal
IIgs also supported inline code but generally used library glue for tool calls.
The function attributes are now included with the return type rather than
after the function definition. This better matches GCC and C++11 function
attributes. Additionally, the C99 inline and C11 _Noreturn specifiers are
now parsed (but so far do nothing).
LCC uses a stand-alone pre-processor (cpp). The book doesn’t mention any
particular details about it. Presumably, it was whatever cpp happened to
be lying around the Bell Labs offices at the time. Perhaps it was even
written by Brian Kernighan or Dennis Ritchie.
The Plan 9 cpp clearly shares a common ancestor, but was updated to support
C99 variadic macros and to fix a bug with improper ## and # expansion within
macros. (The Plan 9 version also uses Plan 9 I/O and supports UTF-8.)
As a side note, the Plan 9 C compilers have a subset of the pre-processor
built in. The stand-alone version is usually not needed.
I back-ported (or perhaps side-ported) the C99 updates and bug fixes to
LCC-816.
There are four known ways to handle function indirection on the
65816/Apple IIgs. As none of them were appropriate for LCC-816, I’ve
devised a new method. I can’t claim it is novel, but I am unaware of
any previous usage.
Self-Modifying Code
The simplest and most performant way, used by assembly language programmers,
is to simply use self-modifying code.
lda function
sta @sm+1
lda function+1
sta @sm+2
@sm jsl $000000
ORCA/Pascal also uses self-modifying code for object indirection calls with
optimize flag $0020.
JSL Via RTL
ORCA/C and ORCA/Pascal manipulate the stack and dispatch via an RTL.
ldx function+2
lda function
phx
pha
lda 1,s
dec a
pha
sep #$20
longa off
longi on
lda 5,s
sta 3,s
lda #^@return
sta 6,s
rep #$20
longa on
longi on
lda #@return
sta 4,s
rtl
@return
...
JML
ORCA/Modula2 uses JML [address] to dispatch. However, JML [address] is utterly
broken and always reads from an absolute address in bank 0. Thus, to make use
of it, self-modifying code is used.
lda >function+2
tay
lda >function
tyx
sta <tmp
stx <tmp+2
phb
phk
plb
ldy #$0000
tdc
clc
adc #tmp
per @sm
sta ($01,s),y
pla
plb
phk
per @return
@sm
jml [$0000]
@return
JML (Revisited)
JML [address] and JMP (address) always read from bank 0 and are generally
unusable in GS/OS since you don’t know where your stack/direct page will be.
But the GS/OS loader knows where your stack is and will tell you if you ask
nicely. The key is to create a named stack/direct page segment. Whenever
and wherever you refer to the label, the GS/OS relocator will update
it to the actual location in memory.
MyDP STACK
ds 1024
ENDSTACK
...
; create rtl address
phk
per @return-1
lda function
sta MyDP
lda function+2
sta MyDP+2
jml [|MyDP]
@return
(That will use DP location 0-3.)
Function Stub
The 65816 has an instruction that pushes the return address on the stack – JSL.
We can load the address into the a/x registers and call a trampoline function
to dispatch to the real function.
lda function
ldx function+2
jsl __jsl_ax
...
__jsl_ax
short x
phx
long x
dec
pha
rtl
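A quick way to convince yourself the stub lands in the right place is to simulate the three pushes and the rtl. This is a sketch of the stack bytes only, and jsl_ax is a hypothetical name:

```python
def jsl_ax(target):
    """Simulate __jsl_ax: a = low 16 bits of target, x = bank byte.
    RTL pops a 24-bit address and resumes at that address + 1, so the
    stub pushes bank:(low16 - 1)."""
    a = target & 0xFFFF
    x = (target >> 16) & 0xFF
    stack = []
    stack.append(x)                    # short x / phx: push the bank byte
    a = (a - 1) & 0xFFFF               # dec
    stack.append((a >> 8) & 0xFF)      # pha pushes the high byte first,
    stack.append(a & 0xFF)             # then the low byte
    # rtl: pull low, high, bank; new PC = bank : (word + 1)
    low, high, bank = stack.pop(), stack.pop(), stack.pop()
    return (bank << 16) | ((((high << 8) | low) + 1) & 0xFFFF)
```

Note the edge case: a target at $xx0000 still works, because dec wraps the word to $FFFF and rtl’s increment wraps it back within the same bank.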
This is the method LCC-816 uses. This is primarily for the benefit of the
optimizer – self-modifying code and screwing around with the stack make it
hard to reason about. But you have to admit it looks a lot nicer, too!
MPW Pascal IIgs uses this technique as well, though I was not aware of that
when I discovered it.
Near Functions
LCC-816 supports near functions, which are always within the same bank and
use JSR/RTS for dispatch and return instead of JSL/RTL. This makes function
indirection easier – JSR (absolute,X) can be used.
ldx function
jsr ($0000,x)
It is an all too common occurrence – the value in your a register needs
to be subtracted. ARM provides a RSB instruction. The 65816 (which, quite
frankly, needs it more than ARM does) does not. But we can still achieve
the same outcome!
A - B is equivalent to A + -B, which is equivalent to -B + A. Normally,
negation is handled with an EOR #$ffff / INC.
eor #$ffff
inc
clc
adc ....
The INC/CLC can be combined into a SEC to save an instruction:
eor #$ffff
sec
adc ....
This works for multi-word math, as well.
%t0 = %t0 - a
eor #$ffff
sec
adc %t0
sta %t0
lda #$ffff ; #0 ^ #$ffff
adc %t2
sta %t2
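The identity behind the trick (eor #$ffff with the carry set adds the two’s complement) is easy to check in Python; the function names here are hypothetical:

```python
def rsb16(a, mem):
    """mem - a via eor #$ffff / sec / adc mem, as in the snippet above."""
    total = (a ^ 0xFFFF) + mem + 1        # sec means carry-in = 1
    return total & 0xFFFF, total >> 16    # result, carry out

def rsb32(a, mem):
    """32-bit %t0:%t2 minus a 16-bit a (zero-extended), as in the
    multi-word snippet: the high word adds #$ffff (#0 ^ #$ffff) + carry."""
    low, carry = rsb16(a, mem & 0xFFFF)
    high = (0xFFFF + (mem >> 16) + carry) & 0xFFFF
    return (high << 16) | low

for a in (0, 1, 0x1234, 0x8000, 0xFFFF):
    for mem in (0, 1, 0x7FFF, 0xFFFF, 0x12345678):
        assert rsb16(a, mem & 0xFFFF)[0] == (mem - a) & 0xFFFF
        assert rsb32(a, mem) == (mem - a) & 0xFFFFFFFF
```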
The 65816 branch instructions use an 8-bit offset, which gives them a range
of -128 to +127. There is one exception – BRL – which has a 16-bit offset,
so 16-bit conditional branches can be synthesized.
bcs @farfaraway
becomes
bcc *+5
brl @farfaraway
But what’s a compiler writer to do? Blindly using an 8-bit branch will
sometimes fail, and blindly using a 16-bit branch wastes cycles and
interferes with basic block analysis. Not to mention complex comparisons
(like less than or equal), which have to be composed from bcc and beq, and
signed comparisons.
My solution was to ignore native branch instructions in favor of a synthetic
smart branch. Like a real branch, it has a condition and a target. Unlike
a real branch, it has a flag to indicate if it’s short or far and may expand
to multiple bytes.
After most optimizations run, there is a pass to calculate the code size
and determine if a far branch is needed. This is actually a multi-pass process
since using a far branch increases the space used and could impact other
branches in the code.
Then a final peephole pass removes redundant CMP #0 instructions,
and the smart branches are expanded to real 65816 code.
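The sizing pass amounts to a fixed-point iteration. Here is a minimal Python sketch, assuming 2 bytes for a short conditional branch and 5 for a far one (an inverted 8-bit branch plus a 3-byte brl); the code representation is hypothetical:

```python
SHORT, FAR = 2, 5   # bcc rel8 vs. inverted branch + brl rel16

def relax(code):
    """code: list of ('label', name), ('insn', size), or ('branch', target).
    Returns the set of branch indices that must be far.  Iterates because
    widening one branch can push other branches out of range."""
    far = set()
    changed = True
    while changed:
        changed = False
        pc, offsets, labels = 0, [], {}
        for i, item in enumerate(code):
            offsets.append(pc)
            if item[0] == 'label':
                labels[item[1]] = pc
            elif item[0] == 'insn':
                pc += item[1]
            else:                       # 'branch'
                pc += FAR if i in far else SHORT
        for i, item in enumerate(code):
            if item[0] == 'branch' and i not in far:
                # rel8 displacement is measured from the end of the branch
                disp = labels[item[1]] - (offsets[i] + SHORT)
                if not -128 <= disp <= 127:
                    far.add(i)
                    changed = True
    return far
```

Backward branches work the same way, with a negative displacement.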
The smart branches are:
- __bcc
- __bcs
- __bpl
- __bmi
- __bvs
- __bvc
- __beq
- __bne
- __bra
- __bugt (unsigned >)
- __buge (unsigned >=, aka __bcs)
- __bult (unsigned <, aka __bcc)
- __bule (unsigned <=)
- __bsgt (signed >)
- __bsge (signed >=)
- __bslt (signed <)
- __bsle (signed <=)
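For the composite unsigned branches, one plausible near-target expansion (after a cmp) composes beq with the carry branches. This is illustrative Python, not necessarily the exact code lcc-816 emits:

```python
# After cmp: carry set means a >= operand, zero set means a == operand.
# '@skip' stands for a local label just past the expansion.
EXPANSIONS = {
    '__buge': ['bcs @target'],                        # carry set
    '__bult': ['bcc @target'],                        # carry clear
    '__bule': ['bcc @target', 'beq @target'],         # below, or equal
    '__bugt': ['beq @skip', 'bcs @target', '@skip'],  # >= and not equal
}

def expand(branch, target):
    """Substitute the real target label into the expansion template."""
    return [line.replace('@target', target) for line in EXPANSIONS[branch]]
```

The signed variants additionally have to account for the overflow flag, which is why a synthetic branch layer pays for itself.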