Consider this (unoptimized) code:
.412
...
sta %t0
lda %t2
adc %t6
sta %t2
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %r10
sta %r12
__bra .414
; next set = { 414 }
.413
lda [%r0]
tax
ldy #2
lda [%r0],y
stx %t0
sta %t2
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %r10
sta %r12
;next set = { 414 }
.414
lda %r10
sta %r4
lda %r12
sta %r4+2
__bra .411
; prev set = { 412, 413 }
; next set = { 411 }
If blocks 412 and 413 look similar, it’s because the last 7 lines (14 bytes)
are identical. Since they both flow into 414 (which has no other previous
blocks), those lines can be moved to the start of 414 – a form of tail
merging, or cross-jumping. After other optimizations, this results in an
8-byte improvement.
.412
...
sta %t0
lda %t2
adc %t6
__bra .414
; next set = { 414 }
.413
lda [%r0]
tax
ldy #2
lda [%r0],y
stx %t0
;next set = { 414 }
.414
sta %t2
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %r10
sta %r12
lda %r10
sta %r4
lda %r12
sta %r4+2
__bra .411
; prev set = { 412, 413 }
; next set = { 411 }
(This optimization is entirely theoretical and has not been implemented
as of the publication date.)
For various reasons, I don’t support ORCA’s pragmas or toolbox extensions,
instead handling them via function attributes. After converting some ORCA/C
toolbox headers, I decided to add support for ORCA/C’s pascal and inline
keywords.
So, a toolbox call can now be declared as
extern pascal Word MMStartUp(void) inline(0x0202, dispatcher);
or
extern Word [[pascal, inline(0x0202, dispatcher)]] MMStartUp(void);
As a side note, I suspect the inline statement came to ORCA/C by way of APW C
by way of MPW C and MPW Pascal, which had an inline statement with a list of
hex words (usually 680x0 trap instructions for the tool call). MPW Pascal
IIgs also supported inline code but generally used library glue for tool calls.
The function attributes are now included with the return type rather than
after the function definition. This better matches GCC and C++11 function
attributes. Additionally, the C99 inline and C11 _Noreturn specifiers are
now parsed (but so far do nothing).
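For example (the function names here are made up), both specifiers are now accepted even though they currently have no effect:
/* Illustrative only: parsed by lcc-816, but currently ignored. */
inline int min16(int a, int b) { return a < b ? a : b; }  /* C99 inline */
_Noreturn void fatal(const char *msg);                    /* C11 _Noreturn */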
LCC uses a stand-alone pre-processor (cpp). The book doesn’t mention any
particular details about it. Presumably, it was whatever cpp happened to
be lying around the Bell Labs offices at the time. Perhaps it was even
written by Brian Kernighan or Dennis Ritchie.
The Plan 9 cpp clearly shares a common ancestor, but was updated to support
C99 variadic macros and fixed a bug with improper ## and # expansion within
macros. (The Plan 9 version also uses Plan 9 I/O and supports UTF-8.)
As a side note, the Plan 9 C compilers have a subset of the pre-processor
built in. The stand-alone version is usually not needed.
I back-ported (or perhaps side-ported) the C99 updates and bug fixes to
LCC-816.
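As a rough illustration of what that C99 support covers (this macro is hypothetical, not from the LCC-816 sources), a variadic macro collects its extra arguments in __VA_ARGS__:
/* Hypothetical example of a C99 variadic macro. */
#include <stdio.h>
#define LOG(fmt, ...) fprintf(stderr, "log: " fmt "\n", __VA_ARGS__)
int main(void)
{
    LOG("%d widgets, %s", 3, "ok");  /* 3 and "ok" land in __VA_ARGS__ */
    return 0;
}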
I know of four existing ways to handle function indirection on the
65816/Apple IIgs. As none of them were appropriate for LCC-816, I’ve also
devised a new method. I can’t claim it is novel, but I am unaware of
any previous usage.
Self-Modifying Code
The simplest and most performant approach, favored by assembly language
programmers, is self-modifying code.
lda function
sta @sm+1
lda function+1
sta @sm+2
@sm jsl $000000
ORCA/Pascal also uses self-modifying code for object indirection calls with
optimize flag $0020.
JSL Via RTL
ORCA/C and ORCA/Pascal manipulate the stack and dispatch via an RTL.
ldx function+2
lda function
phx
pha
lda 1,s
dec a
pha
sep #$20
longa off
longi on
lda 5,s
sta 3,s
lda #^@return
sta 6,s
rep #$20
longa on
longi on
lda #@return
sta 4,s
rtl
@return
...
JML
ORCA/Modula2 uses JML [address] to dispatch. However, JML [address] is utterly
broken and always reads from an absolute address in bank 0. Thus, to make use
of it, self-modifying code is used.
lda >function+2
tay
lda >function
tyx
sta <tmp
stx <tmp+2
phb
phk
plb
ldy #$0000
tdc
clc
adc #tmp
per @sm
sta ($01,s),y
pla
plb
phk
per @return
@sm
jml [$0000]
@return
JML (Revisited)
JML [address] and JMP (address) always read from bank 0 and are generally
unusable in GS/OS since you don’t know where your stack/direct page will be.
But the GS/OS loader knows where your stack is and will tell you if you ask
nicely. The key is to create a named stack/direct page segment. Whenever
and wherever you refer to the label, the GS/OS relocator will update
it to the actual location in memory.
MyDP STACK
ds 1024
ENDSTACK
...
; create rtl address
phk
per @return-1
lda function
sta MyDP
lda function+2
sta MyDP+2
jml [|MyDP]
@return
(That will use DP location 0-3.)
Function Stub
The 65816 has an instruction that pushes the return address on the stack – JSL.
We can load the address into the a/x registers and call a trampoline function
to dispatch to the real function.
lda function       ; low 16 bits of the target
ldx function+2     ; bank byte of the target
jsl __jsl_ax
...
__jsl_ax
short x            ; 8-bit index registers, so phx pushes a single byte
phx                ; push the bank byte
long x
dec                ; target address - 1 (rtl jumps to the pulled address + 1)
pha                ; push the low 16 bits
rtl                ; "return" into the target function
This is the method LCC-816 uses, primarily for the benefit of the
optimizer – self-modifying code and screwing around with the stack make
code hard to reason about. But you have to admit it looks a lot nicer, too!
MPW Pascal IIgs uses this technique as well, though I was not aware of that
when I discovered it.
Near Functions
LCC-816 supports near functions, which are always within the same bank and
use JSR/RTS for dispatch and return instead of JSL/RTL. This makes function
indirection easier – JSR (absolute,X) can be used.
ldx function
jsr ($0000,x)
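At the C level, a function could be marked near via the function attribute described later in these notes; the declaration below is only a sketch, with a made-up function name:
/* Sketch only: [[near]] follows the attribute syntax shown in the
   function-attribute section; the function name is made up. */
int lookup(int key) [[near]];   /* dispatched with JSR/RTS; callers must be in the same bank */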
It is an all too common occurrence – the value in your A register needs
to be subtracted. ARM provides an RSB (reverse subtract) instruction. The
65816 (which, quite frankly, needs it more than ARM does) does not. But we
can still achieve the same outcome!
A - B is equivalent to A + -B, which is equivalent to -B + A. Normally,
negation is handled with an EOR #$ffff / INC pair.
eor #$ffff
inc
clc
adc ....
The INC/CLC pair can be combined into a SEC to save an instruction – ADC
with the carry set supplies the +1 of the two’s complement:
eor #$ffff
sec
adc ....
This works for multi-word math, as well.
%t0 = %t0 - a
eor #$ffff
sec
adc %t0
sta %t0
lda #$ffff ; #0 ^ #$ffff
adc %t2
sta %t2
The 65816 branch instructions use an 8-bit offset, which gives them a range
of -128 to +127. There is one exception – BRL – which has a 16-bit offset
but is unconditional, so 16-bit conditional branches have to be synthesized.
bcs @farfaraway
becomes
bcc *+5
brl @farfaraway
But what’s a compiler writer to do? Blindly using an 8-bit branch will
sometimes fail, and blindly using a 16-bit branch wastes cycles and
interferes with basic block analysis. Not to mention complex comparisons
(like less than or equal), which have to be composed of bcc and beq, and
signed comparisons.
My solution was to ignore the native branch instructions in favor of a
synthetic smart branch. Like a real branch, it has a condition and a target.
Unlike a real branch, it has a flag to indicate whether it’s short or far
and may expand to multiple instructions. (A short __bule, for example, can
expand to a beq/bcc pair; the far form inverts the test around a brl.)
After most optimizations run, there is a pass to calculate the code size
and determine if a far branch is needed. This is actually a multi-pass process
since using a far branch increases the space used and could impact other
branches in the code.
Then there is a final peephole pass to remove redundant CMP #0 instructions,
and the smart branches are expanded to real 65816 code.
The smart branches are:
- __bcc
- __bcs
- __bpl
- __bmi
- __bvs
- __bvc
- __beq
- __bne
- __bra
- __bugt (unsigned >)
- __buge (unsigned >=, aka __bcs)
- __bult (unsigned <, aka __bcc)
- __bule (unsigned <=)
- __bsgt (signed >)
- __bsge (signed >=)
- __bslt (signed <)
- __bsle (signed <=)
Mike Westerfield has made the ORCA/C source code
available on GitHub. The source
code has been available since around the turn of the century, via Disc 2 of
the Opus CD, but this is a big step forward for bug fixes and enhancements.
I have maintained my own private version of ORCA/C, including various
bug fixes and features backported from MPW ORCA/C.
Let us hope ORCA/C has a bright future ahead of it. (And lcc-816, too!)
Addendum
It’s been taken down until licensing, forking, and such can be clarified.
Hopefully it will appear somewhere else, sooner rather than later.
It’s reasonable for this code:
int yyy(void);
int xxx(void) {
return yyy();
}
to compile to:
JSL >yyy
RTL
(ORCA/C and APW C include extra boilerplate to set up and tear down the stack.
Perhaps they’re not reasonable.)
But we can save an extra byte if we convert the JSL to a JML and drop the RTL.
JML >yyy
This:
int abc(int a, int b, int c) { return a + b + c; }
Generates this:
abc start
TSC
PHD
TCD
LDA <$04
CLC
ADC <$06
CLC
ADC <$08
PLD
RTL
end
Not bad. But it can be better. And now it is:
abc start
LDA $04,s
CLC
ADC $06,s
CLC
ADC $08,s
RTL
end
Before direct page locations are assigned to registers, there’s a new
optimization pass that checks if stack-relative addressing can be used.
If so, we can use it and bypass the stack frame setup code. On paper,
stack-relative addressing is slower (4 or 5 cycles vs 3 or 4 cycles), but
dp addressing has a 1-cycle penalty if the low byte of the direct page
register is not 0. There’s only a 1-in-256 chance that the cycle will
actually be saved, so eliminating the direct page muckery is probably
worth it.
The optimization is only available in limited circumstances:
- No [dp], [dp],y, dp,x, dp,y, (dp), or (dp,x) addressing used
- No instructions that affect the stack (PHA, PLA, et cetera)
- No TDC (used when taking the address of a local variable)
Variadic arguments are a low point for ORCA/C but a high point for LCC.
The ORCA/C calling convention (callee cleans up) makes them dangerous –
va_end has to repair the stack, and it has to know exactly how many bytes
to remove. You can enable debug code to check for errors, but that comes
at a price. Additionally, accessing the variadic arguments involves 24-bit
pointer arithmetic.
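To make the hazard concrete, here is an ordinary variadic function (illustrative, not from any particular library). Under ORCA/C’s callee-cleans-up convention, the va_end call is what repairs the stack, so omitting it corrupts the stack:
#include <stdarg.h>

/* Illustrative variadic function. */
int sum(int count, ...)
{
    va_list ap;
    int i, total = 0;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);   /* under ORCA/C, this is what removes the arguments from the stack */
    return total;
}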
With LCC, I opted to use a caller-cleans-up ABI (cdecl) by default (and for
variadic functions). This prevents stack errors. Additionally, I use a
more enlightened method for accessing variadic arguments. Since the stack
is always in bank 0, we don’t need a 24-bit pointer – we can use
absolute long,X addressing (with a bank 0 base address).
Today, this code:
va_list va;
int arg;
...
arg = va_arg(va, int);
will compile to:
PEI (<va)
LDA <va
INC
INC
STA <va
PLX
LDA >$00,x
STA <arg
That can be safely optimized even further to:
LDA <va
TAX
INC
INC
STA <va
LDA >$00,x
STA <arg
Even better would be to use the OFFSET,S address mode. However, va_list can
be (and often is) passed around to other functions, so that’s not always
possible.
For comparison, ORCA/C will generate this code:
PEI $11
LDA $0F
CLC
ADC #$0002
BCC cc1
PLX
INX
PHX
cc1 PHA
LDA $01,S
STA $0F
LDA $03,S
STA $11
SEC
LDA $01,S
SBC #$0002
STA $01,S
BCS cc2
LDA $03,S
DEC A
STA $03,S
cc2 PLA
PLX
STA $03
STX $05
LDA [$03]
STA $07
Which is more or less equivalent to:
char *va_list;
char *tmp;
va_list += sizeof(int);
tmp = va_list - sizeof(int);
arg = *(int *)tmp;
Starting with this:
myfunc(refNum);
How did we end up with this:
LDA <$05
PHA
JSL >myfunc
Instead of:
PEI <$05
JSL >myfunc
The answer: LCC had to cast refNum
from an int to an unsigned. In doing so,
the intermediate code looks like:
lda %r0
sta %t0
pei %t0
jsl myfunc
The optimizer can’t currently optimize it to a simple PEI – and probably
never will, since there are functions that take parameters via the A and X
registers. However, we can fix LCC to bypass the integer conversion.
Currently, the fix only applies to function arguments. It might be helpful
elsewhere, but that requires some careful thought to make sure it doesn’t
break anything.
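For context, the cast comes from the usual argument conversions; a declaration along these lines (assumed – the original prototype isn’t shown) is enough to trigger it:
/* Assumed declaration: the unsigned parameter forces an int -> unsigned conversion. */
extern void myfunc(unsigned int);
int refNum;
...
myfunc(refNum);   /* refNum is converted to unsigned before the call */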
Consider this code:
...
LDX #$2211
JSL >$e10000
STA >_toolErr
PLA
STA <$01
PLA
STA <$03
LDA <$01
STA <$07
LDA <$03
STA <$09
LDA |_toolErr
...
Which was generated from this code:
path = LGetPathname2(id, 1);
By way of this code:
ldx #$2211
jsl $e10000
sta >_toolErr
pla
sta %t0+0
pla
sta %t0+2
lda %t0
sta %r2
lda %t0+2
sta %r2+2
lda |_toolErr
The code we’d like (and expect) is:
ldx #$2211
jsl $e10000
sta >_toolErr
pla
sta %r2+0
pla
sta %r2+2
lda |_toolErr
However, LCC doesn’t want to store the result into the destination variable –
after all, in a normal CPU with a normal ABI, the return value is a specific
register. I have made some attempts in the past to fix that at the LCC level
but wasn’t successful.
The assembler could reorder those stores, though.
ldx #$2211
jsl $e10000
sta >_toolErr
pla
sta %t0+0
sta %r2
pla
sta %t0+2
sta %r2+2
lda %t0
lda %t0+2
lda |_toolErr
After peepholing and removing dead writes, we would end up with the
desired code:
ldx #$2211
jsl $e10000
sta >_toolErr
pla
sta %r2+0
pla
sta %r2+2
lda |_toolErr
ORCA/C uses a variety of qualifiers, statements, and #pragmas to modify
functions and the code generated. For LCC, I’ve moved them into a unified
function attribute syntax, based on GCC and C++11 attributes.
int myfunction() __attribute__((...));
int myfunction() [[...]];
pascal
Use the pascal calling convention. This is based on the toolbox and
MPW Pascal, not ORCA/Pascal. Similar to ORCA/C’s pascal qualifier and
#pragma toolparms 1.
cdecl
Use the cdecl calling convention (default).
stdcall
Use the stdcall calling convention. This provides some compatibility
with ORCA/C.
debug
Generate GSBug debug names. Equivalent to ORCA/C’s #pragma debug $8000
(assuming ORCA/C has been patched to generate debug names…).
inline
inline(address, x-register)
inline(address)
Similar to ORCA/C’s inline qualifier and used for the same purpose
(tool calls). A tool call would be declared as
extern Word MMStartUp(void) [[pascal, inline(0xe10000, 0x0202)]];
rather than
extern pascal Word MMStartUp(void) inline(0x0202, 0xe10000);
segment
segment("segment name")
Specifies the OMF segment name. Default is blank.
Similar to ORCA/C’s segment statement.
databank
Save the B register and set it to the proper value. Similar to ORCA/C’s
#pragma databank 1.
near
Use JSR/RTS for function calls rather than JSL/RTL. This is more
efficient, particularly with function pointers, but cannot cross banks
(or segments).
noreturn
Currently parsed but has no effect. Indicates that a function will not
return (exit, for example).
I just finished up a quick and dirty Classic Desk Accessory to display
passages from the Dragon Wars
handbook. I mention this here because it was developed entirely with lcc-816.
(The CDA header was written in asm-816. Eventually I plan to have a separate
utility to generate headers, similar to ORCA’s #pragmas.)
In the process, I added the debug
attribute ([[debug]]), fixed pointer
alignment, added some optimizations, and fixed some bugs. I also found
some more optimizations to add in the future.
The type metrics were copied from a 32-bit architecture. While I updated the
sizes and alignments, I somehow neglected to update the pointer alignment
from 4 bytes to 2 bytes. This, predictably, screwed up struct packing and
caused GS/OS calls to fail.
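A hypothetical parameter block (the field names are made up) shows the failure mode: with 4-byte pointer alignment, padding is inserted before the pointer, so the offsets no longer match the packed layout GS/OS expects:
/* Hypothetical GS/OS-style parameter block. */
struct OpenRec {
    unsigned short pCount;    /* offset 0 */
    char          *pathname;  /* expected at offset 2; 4-byte pointer alignment
                                 pushed it to offset 4, breaking the layout */
    unsigned short refNum;
};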
I fixed that but it somehow altered the intermediate code:
Original code:
mystruct = array[index]; // mystruct is 6 bytes
4-byte struct alignment:
ADDP4(CVUU4(MULU2(CNSTU2(6), INDIRU2(VREGP(index)))), ADDRGP4(array))
2-byte struct alignment:
ADDP4(MULI4(CNSTI4(6), CVUI4(INDIRU2(VREGP(index)))), ADDRGP4(array))
Which is to say, with 4-byte pointer alignment, the 6 * index multiplication
is 16-bit but with a 2-byte pointer alignment, it’s 32-bit.
Addendum
LCC checks the short, int, long, and long long types to find one that matches
the pointer metrics (size and alignment) and uses that for pointer arithmetic. If
nothing matches (which was previously the case with the goofy alignment), it
falls back to using an integer.
LDA <$1b
AND #$ff
CMP #$00
BMI .31
CMP #$1e
BCC .31
Astute observers will note that the BMI cannot possibly be taken, given
that no value can be negative after ANDing with $ff. Why is this code even
generated in the first place?
The original code looked like:
unsigned char b;
...
if (b > 29) ...
b is promoted to a signed integer for the comparison (and the comparison
is inverted by lcc), which would normally require extra work to deal with
signed comparison. However, since it’s being compared to a known positive
number, we can check for negative numbers and then use an unsigned compare.
In this case, negative numbers are impossible.
This could be fixed in lcc itself (promote unsigned chars to unsigned integers),
the machine description (extra rules for comparing unsigned chars cast to
signed ints), or in the peephole optimizer.
For now, it’s handled in the peephole optimizer.
Last night, I successfully compiled, assembled, linked, and executed my second
test program (qtime).
The first attempt at building had bad code due to a pointer dereference
clobbering itself.
void *vp;
...
vp = *(void **)vp;
Originally, that would generate this sequence if the source was the destination:
lda [%t0]
sta %t0
ldy #2
lda [%t0],y
sta %t0+2
I changed that to use the X register as an intermediate:
lda [%t0]
tax
ldy #2
lda [%t0],y
stx %t0
sta %t0+2
Initially, this generates slightly worse code (one extra instruction). However,
in some circumstances, it enables better optimizations later on.
Consider:
stx %t0
sta %t0+2
pei %t0+2
pei %t0
With peephole and dead-write elimination, this becomes:
pha
phx
A Retargetable C Compiler: Design and Implementation is more or less
required reading when working with LCC. The source code is available, but
without any comments. (The compiler is apparently a literate program, so
all comments or explanations end up in the book.)
Additional information on the
LCC 4.x code generation interface
is also valuable.
On IRC, comp.sys.apple2, GEnie, Delphi, AOL, CompuServe, and local BBSs –
any place Apple II programmers congregate – there was a common desire: a
better C compiler. Through the years, there has been much talk but not
much ado.
Sunday, June 28th, 2015, that changed. I successfully compiled, assembled,
linked (with the help of MPW’s LinkIIgs), then ran an actual program.
Success was short-lived, though: the second test program generated some
incorrect code.
There’s still more work to be done, more optimizations, more testing.
There may not be anyone left to enjoy it, but there will be a better C
compiler.
… at least for your code size. Why? Three reasons.
- The 65816 compare/branch instructions don’t support signed integers. Most
other architectures provide branches that act on multiple flags:
unsigned | >  | C == 1 && Z == 0
unsigned | <= | C == 0 || Z == 1
signed   | >= | N == V
signed   | >  | N == V && Z == 0
signed   | <  | N != V
signed   | <= | N != V || Z == 1
The 65816 does not provide these. The CMP instruction doesn’t even set the
V flag, so a signed %t0 < %t2 expands to this ungodly sequence:
lda %t0
sec
sbc %t2
bvs *+5
eor #$8000
bpl @target
- With an unsigned integer, the X or Y index registers can be used for, well,
indexing. This isn’t possible with signed integers, so a 32-bit pointer
needs to be generated and dereferenced. This also requires 4 bytes of
dp space.
- Extending a signed integer to a signed long requires extra work to
propagate the sign. This requires a branch and dp space.
Moral of the story: use unsigned integers whenever possible. LCC-816 uses
unsigned characters for these reasons.
Gentle reader:
This blog will (or will not) chronicle some of my adventures as I try to build
a C compiler for the Apple IIgs/65816.
LCC – with a very customized lburg machine
description – is the basis for the
C compiler. Additionally, a new optimizing assembler is in progress. Linking
is currently handled by the MPW LinkIIgs linker. Subroutine libraries will
be novel, copied from ORCA/C, or ported from elsewhere (BSD?) as necessary.
Enjoy the ride!