
Forthmacs Implementation
************************

This chapter describes how RISC OS Forthmacs implements the Forth virtual 
machine on the ARM processors.  It assumes that you have a fairly good 
knowledge of conventional Forth implementations; it does not attempt to be a 
tutorial on how Forth works.  


Dialect
=======

RISC OS Forthmacs started as an implementation of the Forth-83 standard, with a 
few exceptions.  It is now well on its way to being an ANS-compliant 
implementation.  It is still largely compatible with the other implementations 
for Sun-68k, Sparc, Atari, Macintosh and OS-9 computers.  


Stack Width and Addressing
==========================

In RISC OS Forthmacs, all stack items as well as memory cells are 32 bits 
wide; remember this when writing portable programs.  Use the portable ANS 
operators like cell+ and cells, or the RISC OS Forthmacs specific words like 
/cell and cells+. 
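
As a small illustration of cell-size independent code, the following sketch 
uses only the ANS words cells, allot, ! and @.  The names samples and sample 
are made up for this example: 

    create samples  8 cells allot               \ room for eight cells
    : sample ( n -- addr )  cells samples + ;   \ address of element n
    42 3 sample !   3 sample @ .                \ store, fetch, print 42

Because the element size comes from cells, the same source also works on 
systems whose cells are not 32 bits wide.  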

The address space could conceivably grow to 2 to the 32nd power bytes (4 
gigabytes), but the current CPU/MMU versions restrict it to 16/256 MBytes.  
16-bit (2-byte) memory accesses are no longer supported and must be emulated 
if necessary.  

Note: 16-bit word accesses are simulated by two byte accesses, so take care 
about interrupts occurring between them! 
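
A 16-bit access can be emulated along the following lines.  This is only a 
sketch: the names le-w@ and le-w! are hypothetical, and little-endian byte 
order (low byte at the lower address) is assumed.  

    : le-w@ ( addr -- w )            \ fetch 16 bits, low byte first
       dup c@  swap char+ c@  8 lshift or ;
    : le-w! ( w addr -- )            \ store 16 bits, low byte first
       over 255 and  over c!         \ low byte to addr
       swap 8 rshift 255 and         \ isolate the high byte
       swap char+ c! ;               \ high byte to addr+1
    4660 pad le-w!   pad le-w@ .     \ prints 4660

The two byte accesses are separate steps, so an interrupt occurring between 
them can observe a half-updated value.  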

The current ARM MMUs don't support non-aligned memory accesses.  NOTE: They 
don't abort or run any exception vector but just do something UNDEFINED and 
CPU core dependent.  Take care of this; it took me hours to find a bug! 

All accesses must be one of: 

1) byte-wide access to any address in the address area 

2) cell-wide ( 32-bit ) access to any aligned address 

The 16-bit (word-wide) access that is possible on at least StrongARM and ARM8 
CPUs is NOT supported by the RiscPC platforms.  Special source extensions are 
available for single-board platforms.  


Both stacks are pre-decrementing/post-incrementing.  The parameter stack holds 
its top-of-stack item in the register top (r10); this allows much faster code 
definitions because of the CPU's load-and-store architecture.  


Register Usage
==============


    r9      user area pointer       up
    r10     top-of-stack register   top
    r11     return stack pointer    rp
    r12     instruction pointer     ip
    r13     stack pointer           sp
    r14     link register           lk
    r15     pc + status + flags     pc
    r15     status register         sr  (holds the flags part of r15)

Note: The internal structure of the pc and flags registers differs between 
CPUs.  It is generally better to think of the pc and the status register as 
two separate registers.  The hardware-error handling and the .registers 
instruction know about this.  

r0, r1, r2, r3, r4, r5, and r6 are available for use within code definitions.  
Don't try to use them for permanent storage, because they are used by many 
code words with no attempt to preserve the previous contents.  

Registers r7-r14 can be used within code definitions with great care, but you 
have to save and restore their values at the beginning/end of the definition.  


Inner <address> Interpreter
===========================

The inner interpreter next is direct threaded and post-incrementing.  The 
compilation address of every definition contains machine code to be executed, 
not a pointer.  Each code definition ends with the next code, assembled 
in-line.  The next code is: 
         pc      ip )+   ldr
This means: Load the program counter pc (without affecting the CPU status) 
from the 4-byte cell pointed to by the instruction pointer ip, then 
post-increment the instruction pointer.  So next is only one CPU instruction 
and very fast.  It is much faster than 
         address  dolink branch
         ...
         pc link mov
constructions because there is only one pipeline reload per next.  On the 
other hand, there is definitely a larger overhead for calling secondaries.  

RISC OS Forthmacs versions >= 3.1/2.70 can switch to another next scheme.  The 
first 8 cells in the user area are free for debugging purposes.  SLOW-NEXT 
(you find this in lib.arm.debugm) patches all next as well as conditional next 
calls to 
    pc up mov
Normally there is a normal next instruction at up@, but you may install any 
service routine there to do additional checking at run time.  The new debugger 
uses this to branch into the debugger handler.  After 'debugging' you switch 
back to the normal next with FAST-NEXT.  

If you want to use this scheme, there is one thing to remember.  As SLOW-NEXT 
patches all instructions 
    pc ip )+ condition ldr
your handler might be patched as well.  In these cases you should use 
    pc 1  ip ia!  ldm
instead.  This does the same - at least from RISC OS Forthmacs point of view - 
but isn't patched.  See lib.arm.debugm again for an example in the debugger.  

For discussions about subroutine threaded ( macro extended ) versus threaded 
code implementations see the Forth literature.  Generally, macros do bring 
some advantage in execution speed but give less information about the code 
itself, so debuggers are less useful.  The penalty for direct threaded code is 
hard to predict; it depends very much on the type of application.  Something 
like 50% sounds reasonable, so optimising the bottlenecks could bring big 
advantages.  The 'runtimer' utilities might help you do this.  

The assembler macro c; assembles the next instruction and ends assembly with 
end-code.  A fast conditional next can be done by 
         ...
         r2 0 cmp
         eq next
         ...


Other Definitions
=================

Any word that is not a code definition contains a branch+link instruction at 
the code field.  This makes a relative branch to an inline address and saves 
the pc+sr in the lk register.  
         runtime-addr    dolink branch
The inline address points to a code fragment (headerless in most cases) that 
implements the run-time action of the word.  The parameter field starts just 
after this branch+link instruction and can be found by clearing the flags in 
the link register like this: 
         r0 lk    th fc000003 #   bic
         r0  get-link

The run-time codes may have to push the top-register to the stack, save the 
return pointer to the return-stack and set the instruction or stack pointer to 
the parameter field address.  All standard runtime codes (those of variables, 
constants, colon definitions, user variables ...) have been optimized for best 
cache-hit rates.  

Note: word-type ( cfa -- addr ) finds the address of the word's runtime code 
in this implementation.  


Colon definitions
=================

The runtime code: 
    mlabel docolon  assembler
         ip      rp      push
         ip      get-link c;
The body of a Colon Definition starts 4 bytes after the compilation address.  
The body contains a list of compilation addresses of other words.  Each such 
compilation address is a 32-bit number which is an absolute address.  


Variable
========

The Parameter Field of a variable contains a 32-bit number which is the value 
of the variable.  The runtime code: 
    mlabel dovariable  assembler
         top     sp      push
         top     get-link c;


Constants
=========

The Parameter Field of a constant contains the 32-bit value of the constant.  
The runtime code: 
    mlabel doconstant  assembler
         top     sp      push
         r0      get-link
         top     r0 )    ldr c;


User Variables
==============

The value of a user variable is stored in the user area as a 32-bit number.  
The Parameter Field of a user variable contains a 32-bit offset into the user 
area of the current task.  r9 contains the base address of the current user 
area; it is symbolically defined as up in the assembler.  The runtime code: 
    mlabel douser  assembler
         top     sp      push
         r0      get-link
         r0      r0 )    ldr
         top     r0      up add c;


Deferred words
==============

The compilation address of the word to be executed by a defer word is stored 
as a 32-bit absolute address in the user area.  The Parameter Field of a 
deferred word contains a 32-bit number which is an offset into the user area 
of the current task.  The runtime code: 
    mlabel dodefer  assembler
         r0      get-link
         r0      r0 )    ldr
         pc      r0      up  ib ldr end-code
The last line holds a somewhat optimized next instruction.  It means: Load the 
pc from the address in the user area at offset r0.  
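
From the user's side a deferred word behaves as in the sketch below.  It 
assumes the usual defer and is words; greet and hello are made-up names.  

    defer greet                      \ execution vector, initially unset
    : hello ( -- )  ." hello " ;
    ' hello is greet                 \ store hello's compilation address
    greet                            \ runs hello

Since the vector lives in the user area of the current task, each task can in 
principle have its own action for the same deferred word.  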


;code
=====

The compilation address of a word created by a create ...  ;code data type 
construction contains the standard branch+link instruction that branches to 
the runtime code.  

The runtime code is defined by the programmer in the ;code part of the 
definition.  

In versions up to 3.1/2.62 ;code assembled two instructions for your 
convenience 
         top     sp      push
         top     get-link
This is not the case any more.  I changed this to be more compatible with the 
FirmWorks implementation, and I feel that all Forth programmers using ;code 
should be able to handle this.  


does>
=====


    mlabel dodoes  assembler
         ip      rp      push
         ip      get-link c;
The runtime code is defined by the programmer in the does> part of the 
definition.  Before branching to the dodoes code, the does> instruction 
assembles 
         top     sp      push
         top     lk      th fc000003 # bic
to get the parameter field address.  
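
For illustration, a defining word built with does> might look like the 
following sketch (array and scores are made-up names).  The high-level code 
after does> finds the parameter field address on top of the stack, where the 
two instructions shown above have put it: 

    : array ( n -- )  create cells allot
       does> ( i -- addr )  swap cells + ;
    10 array scores                  \ ten cells of storage
    7 3 scores !   3 scores @ .      \ prints 7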


Local variables
===============

RISC OS Forthmacs has built-in local variables that conform to ANS Forth.  
They live on the return stack in stack frames.  The stack frames are linked 
via a user variable LOCAL-FRAME, which is also used to locate a local 
variable's value.  The frame structure is like: 
    | cfa:frame>   | old-frame     | old-rs        | loc   | loc   | .........
with cfa:pop-frame on top of the return-stack.  pop-frame removes the current 
frame and switches to the last frame.  
    headerless code pop-frame \ this routine is pushed on return stack by push-locals
         here /cell+ token,
    	r0 rp 2	rp ia	ldm
    	r0	'user local-frame str
    	ip	rp	pop c;
    
The local variables are accessed using (loc) followed by a stack-frame index.  
    code (loc)	\ ( -- n )  runtime-code of any local
    	r0	'user local-frame ldr
    	r1	ip )+	ldr
    	top	sp	push
    	top	r0 r1 2 #asl db ldr c;

Note: The decompiler cannot know the local variables' names, so it assumes 
names like ( v0 v1 ...).  
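
A minimal usage sketch, assuming that the ANS locals| syntax is the one 
provided (sum-squares is a made-up name): 

    : sum-squares ( a b -- n )
       locals| b a |        \ b takes the top of stack, a the item below
       a a *  b b * + ;
    3 4 sum-squares .       \ prints 25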


Tokens
======

Within the body of a colon definition, calls to other Forth words are compiled 
as the 32-bit absolute compilation address of those words.  These tokens have 
a corresponding bit in the relocation table.  


Branching
=========

Branch targets are offsets relative to the location that contains the branch 
offset.  They are stored as 32-bit twos-complement numbers representing the 
number of bytes between the offset location and the branch target.  For 
example, a branch to the following location could be compiled with: 

         postpone branch   4 ,

NOTE: This is implemented differently in version 3.1/2.00.  The relative 
offset is replaced by an immediate absolute relocated address.  


Doubles
=======

RISC OS Forthmacs versions newer than 1.83 have full double number support.  
All conversion tools (convert, number?, d.) use doubles, and the 'scaling' 
words */ */mod um/mod use double intermediate results.  

Also, the text interpreter and compiler accept a literal as a double when 
there is a period at the end of it.  
         : test 1234. d. ;
1234.  is a double number and d. displays it.  
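
The effect of the double intermediate result can be seen in a sketch like this 
one, which uses only standard words.  The product 1000000 * 1000000 does not 
fit into a 32-bit cell, yet the final result does: 

    1000000 1000000 1000 */ .     \ prints 1000000000
    1234. 2. d+ d.                \ double addition, prints 1236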

This could only be achieved by changing the stack effects of a number of 
words.  So these new RISC OS Forthmacs versions are no longer compatible with 
older ones when these words are used.  The lib.compatible tool does not cover 
these changes.  

The advantage of the new stack behaviour is its ANS compliance and the 
improved arithmetic capabilities.  


Floats
======

RISC OS Forthmacs versions newer than 3.1/2.13 have the ANS Floating and 
Floating Extended wordsets included.  There isn't any further documentation 
available so far; please use the ANS docs for this purpose.  


StrongARM compatibility
=======================

Versions from 3.1/2.30 run on StrongARM based machines, but optimized code is 
available from 3.1/2.40 onwards.  


Cache
=====

The newer ARM based CPUs ( ARM8, StrongARM ) have a different cache structure 
than the older versions.  Separate instruction and data caches are used, and 
the code has to be synchronized after any change to the code space.  

flush-cache and sync-cache are both implemented in current RISC OS Forthmacs 
versions (> 3.1/2.40) in such a way that the compiler is not significantly 
slowed down; in fact, compilation on a StrongARM is much faster than on the 
older ARM710.  


Header format - # of bytes in parentheses
=========================================

Source Field (4), Link Field (4), Name Field (n), Padding (0 to 3), Flags (1), 
Code Field (4), Parameter Field (n).  

The Link Field, Name Field, and Code Field are all aligned, as all cell 
addresses need to be.  
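
The amount of padding follows from these alignment rules.  As a sketch, 
assume that the name field starts on an aligned address and occupies one 
length byte plus the name characters, and that the single flags byte ends 
right before the aligned code field (name-padding is a made-up word): 

    : name-padding ( len -- n )  2 + negate 3 and ;
    3 name-padding .     \ 1+3 name bytes, 1 flags byte: prints 3
    2 name-padding .     \ 1+2 name bytes, 1 flags byte: prints 0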

Links point to links ( not to Name Fields, as in FIG Forth! ) 

The name field is a normal Forth packed string.  (Many Forth implementations 
set the high bit in the first and last characters of the name field; 
RISC OS Forthmacs does not).  

Name Field: length-byte, 0-31 character name.  



Vocabularies
============

Vocabularies have #threads-way hashing.  This means that each vocabulary has 
16 separate linked lists of words.  The threads are stored in the user area.  
The Parameter Field of a vocabulary contains the 32-bit offset of the threads 
in the user area, followed by the vocabulary-link, a 32-bit pointer to the 
previous vocabulary. The runtime high-level code is: 
         does> body> context token!

Before searching a vocabulary, a hashing function is applied to the name to be 
located.  The hashing function selects one of the 16 linked lists to search.  

The hashing function is very simple.  The lower 4 bits of the first character 
in the name (the first name character, not the length byte) are interpreted as 
a number from 0 to 15, selecting a linked list.  
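
Expressed in Forth, the hashing function described above could be sketched as 
follows.  thread# is a made-up name, not necessarily the word used in the 
kernel, and s" is used here only to obtain a test string: 

    : thread# ( c-addr u -- n )  drop c@ 15 and ;
    s" dup" thread# .       \ 'd' is hex 64: prints 4
    s" swap" thread# .      \ 's' is hex 73: prints 3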

Vocabularies are not chained to one another.  Search order is implemented 
using the also / only scheme.  Each vocabulary thread is terminated with a 
special link field in the final word.  The special link address is the address 
of the origin of the Forth system (which may change from session to session 
due to the relocation that the operating system applies when loading and 
executing the Forth system).  

The parameter field for a vocabulary looks like: 

User number (/cell), Voc-link (/cell) 

The user number selects the place in the user area where the head-of-list 
pointers for the 16 vocabulary threads are stored.  Each vocabulary requires 
16 cells (64 bytes) of user area storage for these 16 threads.  The values 
stored in the user area are the Link Field addresses of the top word in each 
thread.  


Relocation
==========

In the RISC OS environment all programs of the absolute type are loaded at 
$8000 and executed from there.  So at first sight the relocation table doesn't 
make much sense in this version, unless you care about being portable to 
other RISC OS Forthmacs implementations.  

But the relocation table can be used for target/meta-compiling or for 
relocating code during run-time.  This is necessary for producing turnkey 
applications with an 'Application Stripper'.  

If the program is not relocatable, then a file saved with save-forth will work 
only if it is later executed at the same address where it was executing when 
it was saved.  If saved with the application stripper, a program that is not 
relocatable will not work at all, regardless of the address where it is later 
executed.  Consequently, use of the application stripper requires strict 
adherence to the rules of relocatability.  

In most cases, the relocation bitmap is maintained automatically, without 
requiring any special effort on the part of the programmer.  However, there 
are some cases where the programmer must take explicit actions to ensure that 
the program is relocatable.  

The executable file contains a relocation list used to identify the locations 
in the program's binary image which contain absolute addresses.  When the 
program is loaded, each of these locations is modified by adding the starting 
address of the program to the number contained in that location.  Only 32-bit 
numbers may be so modified.  

While RISC OS Forthmacs is running, it maintains its own relocation table, 
identifying those locations in the Forth dictionary which must be relocated 
during cold-code.  Each bit in the map represents the address of one aligned 
location.  This relocation table is completely different from the standard 
RISC OS relocation tables; it is only used from within RISC OS Forthmacs.  
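
The following sketch shows how a one-bit-per-aligned-cell map can be 
addressed.  It is an illustration only: the map size, the bit order within a 
byte and the names demo-map, >map-bit, mark and marked? are assumptions, not 
the actual kernel layout.  Offsets are byte offsets from the start of the 
image.  

    create demo-map  16 allot   demo-map 16 erase    \ covers 128 cells
    : >map-bit ( offset -- mask byte-addr )
       2 rshift                 \ byte offset -> cell index
       8 /mod  demo-map +       \ bit number and address of the map byte
       1 rot lshift  swap ;     \ turn the bit number into a mask
    : mark ( offset -- )       >map-bit  dup c@ rot or  swap c! ;
    : marked? ( offset -- f )  >map-bit  c@ and 0<> ;
    40 mark   40 marked? .   44 marked? .            \ prints -1 0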

In order for this to work properly, the programmer must be careful to use one 
of the six words  token,  A,  link,  token!  A!  link!  to store an address or 
token in the dictionary.  All six set the relocation flags.  

Addresses may be stored into variables with ! ( without requiring the use of 
token! ) if the variable is re-initialized every time that the application is 
started.  token! is only necessary if the variable's value must be set before 
save-forth is executed, and then is used when the saved application is later 
invoked, without being re-initialized by the application's initialization 
code.  
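
The two cases can be contrasted in a short sketch.  The names 'handler, 
default-handler and init are made up, and the stack effect of token! is 
assumed to be ( cfa addr -- ): 

    variable 'handler
    : default-handler ( -- )  ;
    \ Case 1: set at every start-up, so a plain ! is enough
    : init ( -- )  ['] default-handler 'handler ! ;
    \ Case 2: set once before save-forth, so the relocation bit is needed
    ' default-handler 'handler token!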

If , or ! is used instead, the address will not be properly relocated if 
save-forth has been used to write the dictionary image to an executable file.  

Note: The lib/checkrel.fth program can help you catch relocation problems in 
your applications.  It should be loaded before you load your application, and 
will warn you if your application does things that may not be relocatable.  
After you have fixed the relocation problems, you can load your application 
without lib/checkrel.fth .  


See: .buffers .pointers token! token, A! A, link! link, set-relocation-bit 
relocation-map 


Program header
==============

The header of the executable binary image looks like this: 

 h_magic   (  0)    \ Magic Number
 h_tlen    (  4)    \ length of text (code)
 h_dlen    (  8)    \ length of initialised data
 h_blen    (  c)    \ length of BSS uninitialised data
 h_slen    ( 10)    \ length of symbol table
 h_entry   ( 14)    \ Entry address
 h_trlen   ( 18)    \ Text Relocation Table length
 h_drlen   ( 1c)    \ Data Relocation Table length

The magic number is the branch+link instruction just behind this header.  
Note: this header might change in future releases to follow Acorn's 
executable binary code standard.  


Heap memory
===========

RISC OS Forthmacs is loaded to $8000 and will have as much memory available as 
was defined by 'WimpSlot' .  

The main task's user area immediately follows the first instructions and some 
permanent data at $8040.  

$600 bytes are allocated in the module heap (RMA); this area holds the 
env-area, the command-line area plus all handlers used by shelled programs.  

The implementation of the dynamic memory manager changed in version 
3.1/2.00.  Since then the dictionary and the heap share the same memory area: 
the dictionary grows up from lower addresses, and the heap can be as large as 
the area between the stacks and here. 

Note: Of course you may install another memory manager or add more heaps.  


Dictionary memory
=================

At the top of the dictionary are both stacks, defined by rp0 - rs-size and sp0 
- ps-size, and the tib; below this are MBytes of free memory (well, 
hopefully).  here marks the end of the allocated dictionary; classically pad 
is here plus something.  

RISC OS Forthmacs knows about two dictionary areas, the resident (which is the 
dictionary you know from all implementations) and the transient.  The 
transient dictionary is in the heap memory; definitions made there won't use 
dictionary space in the target application.  So it might be useful to do: 
    transient
      fload assembler
      fload debugger
    resident
      fload myapplication
Now the debugger and assembler will be in transient address space.  To remove 
all links, pointers etc. into the transient address space use dispose; it 
will do this for you.  .dispose will also give some information about what is 
removed while executing dispose. 

