
Forthmacs Implementation
************************

This chapter describes how Risc-OS Forthmacs implements the Forth 
virtual machine on the ARM processors.  It assumes that you have a 
fairly good knowledge of conventional Forth implementations; it does 
not attempt to be a tutorial on how Forth works.  


Dialect
=======

Risc-OS Forthmacs is an implementation of the Forth-83 standard, with 
a few exceptions.  It is a descendent of the public-domain F83 
implementation by Laxen and Perry and contains most of the F83 
extensions as well as many new ones.  It is compatible with the other 
implementations for Sun-68k, Sparc, Atari, Macintosh and OS-9 
computers.  


Stack Width and Addressing
==========================

Risc-OS Forthmacs deviates from the 83 Standard in the width of the 
stack.  Forth 83 specifies that stack items are 16-bit numbers, and 
that the address space is 64K.  This wastes much of the power of 
modern CPUs and is almost impossible to implement on ARM based 
computers.  

In Risc-OS Forthmacs, all stack items as well as memory cells are 
32-bit wide, remember this when writing portable programs.  

The address could conceivably grow to 2 to the 32nd power (4 
gigabytes), but this is restricted by the current CPU/MMU versions to 
16 MBytes.  16-bit or 2-byte memory accesses are not supported any 
longer and must be emulated if necessary.  

Note: word accesses are simulated by two byte accesses, take care 
about interrupts occurring here! 

The current ARM MMUs don't support non-aligned memory accesses.  NOTE: 
They don't abort or run any exception vector but just do something not 
clearly defined and CPU dependent.  Take care of this, it took me 
hours to find a bug! 

All accesses must be one of: 

1) byte-wide access to any address in the address area 

2) cell-wide ( 32-bit ) access to any aligned address 


Both stacks are pre-decrementing/post-incrementing.  The parameter 
stack holds its top-of-stack in the top-register r10, this allows much 
faster code definitions because of the CPUs load-and-store 
architecture.  


Register Usage
==============


    r7      floating stack pointer  fsp
    r8      instruction pointer     ip
    r9      user area pointer       up
    r10     top-of-stack register   top
    r11     returnstack pointer     rp
    r12     RiscOS frame pointer    fp  never use this!
    r13     stack pointer           sp
    r14     link register           lk
    r15     pc + status + flags     pc
    r15     sr                      sr  hold the flags part of r15

Note: In future CPU Versions the internal structure of the pc-register 
might be different, it seems to be better, to imagine pc and status 
register as two registers.  The hardware-errors and the .REGISTERS 
instruction know about this.  

r0, r1, r2, r3, r4, r5, and r6 are available for use within code 
definitions.  Don't try to use them for permanent storage, because 
they are used by many code words with no attempt to preserve the 
previous contents.  

Registers r7-r11 and r14 can be used within code definitions, but you 
have to save and restore their values at the beginning/end of the 
definition.  r12, r13 and r15 should not be used.  


Inner <address> Interpreter
===========================

The inner interpreter NEXT is direct threaded, post incrementing.  The 
compilation address of all definitions contain machine code to be 
executed, not a pointer.  Each CODE definition ends with the NEXT 
code, assembled in-line.  The NEXT code is: 
         pc      ip )+   ldr
This means: Load the program-counter PC ( don't affect the CPU status 
) from the 4-byte cell pointed to by the instruction pointer ip, 
postincrement the instruction-pointer.  So the NEXT is only one CPU 
instruction and very fast.  It is much faster than 
         address  dolink branch
         ...
         pc link mov
constructions because of only one pipeline reload per NEXT. But on the 
other hand, there is definitely a larger overhead for calling 
secondaries.  

For discussions about subroutine threaded ( macro extended ) versus 
threaded code implementations see the Forth literature.  Generally, 
macros do bring some advantage in execution speed but give less 
information about the code itself, so debuggers are less useful.  The 
penalty for direct threaded code is hard to predict, it depends very 
much on the type of application.  Something like 50% sounds 
reasonable, so optimising the bottlenecks could bring big advantages.  
The 'runtimer ' utilities might help you doing this.  

The assembler macro C; assembles the NEXT instruction and ends 
assembling by END-CODE. A fast conditional next can be done by 
         ...
         r2 0 cmp
         eq next
         ...


Other Definitions
=================

Any word that is not a code definition contains a branch+link 
instruction at the code-field, this makes a relative branch to an 
inline-address and saves the pc+sr to the lk register.  
         runtime-addr    dolink branch
The inline address points to a code fragment (headerless in most 
cases) that implements the run-time action of the word.  The parameter 
field starts just after this branch+link instruction and can be found 
by clearing the flags in the link register like this: 
         r0 lk    th fc000003 #   bic
         r0  get-link

The run-time codes may have to push the top-register to the stack, 
save the return pointer to the return-stack and set the instruction or 
stack pointer to the parameter field address.  All standard runtime 
codes (those of variables, constants, colon definitions, user 
variables ...) have been optimized for best cache-hit rates on ARM3/6 
machines.  

Note: WORD-TYPE ( cfa -- addr ) finds the address of the words runtime 
code in this implementation.  


Colon definitions
=================

The runtime code: 
    mlabel docolon  assembler
         ip      rp      push
         ip      get-link c;
The body of a Colon Definition starts 4 bytes after the compilation 
address.  The body contains a list of compilation addresses of other 
words.  Each such compilation address is a 32-bit number which is an 
absolute address.  


Variable
========

The Parameter Field of a VARIABLE contains a 32-bit number which is 
the value of the variable.  The runtime code: 
    mlabel dovariable  assembler
         top     sp      push
         top     get-link c;


Constants
=========

The Parameter Field of a CONSTANT contains the 32-bit value of the 
constant.  The runtime code: 
    mlabel doconstant  assembler
         top     sp      push
         r0      get-link
         top     r0 )    ldr c;


User Variables
==============

The value of a USER variable is stored in the USER area as a 32-bit 
number.  The Parameter Field of a user variable contains a 32-bit 
offset into the user area of the current task.  r8 contains the base 
address of the current user area.  r8 is symbolically defined as UP in 
the assembler.  The runtime code: 
    mlabel douser  assembler
         top     sp      push
         r0      get-link
         r0      r0 )    ldr
         top     r0      up add c;


Deferred words
==============

The compilation address of the word to be executed by a DEFER word is 
stored as a 32-bit absolute address in the USER area.  The Parameter 
Field of a deferred word contains a 32-bit number which is an offset 
into the user area of the current task.  The runtime code: 
    mlabel dodefer  assembler
         r0      get-link
         r0      r0 )    ldr
         pc      r0      up  ib ldr end-code
The last line holds a somewhat optimized NEXT instruction, it means: 
Load the pc from the address in the user area with the offset r0.  


;code
=====

The compilation address of a word created by a CREATE ...  ;CODE data 
type construction contains the standard branch+link instruction that 
branches to the runtime code.  

The runtime code is defined by the programmer in the ;CODE part of the 
definition.  At first the ;CODE instruction assembles 
         top     sp      push
         top     get-link
before the programmers code.  This is mainly for convenience, so the 
top is already saved to the stack and points to the parameter field.  
The programmer might do '-2 cells allot ' to forget this for speed 
optimized code.  


does>
=====


    mlabel dodoes  assembler
         ip      rp      push
         ip      get-link c;
The runtime code is defined by the programmer in the DOES> part of the 
definition.  Before branching to the dodoes code, the does> 
instruction assembles 
         top     sp      push
         top     lk      th fc000003 # bic
to get the parameter field address.  


local variables
===============

Risc-OS Forthmacs has built in ANS Forth conforming local variables 
spending their lifetime on the return-stack in stack-frames.  The 
stack-frames are linked via a USER variable LOCAL-FRAME which is also 
used to locate a local variables value.  The frame structure is like: 
    | cfa:frame>   | old-frame     | old-rs        | loc   | loc   | .........
with cfa:pop-frame on top of the return-stack.  pop-frame removes the 
current frame and switches to the last frame.  
    headerless code pop-frame \ this routine is pushed on return stack by push-locals
         here-t /token-t + token,-t
    	r0 rp 2	rp ia	ldm
    	r0	'user local-frame str
    	ip	rp	pop c;
    
The local variables are accessed using (loc) followed by an stack 
frame index.  
    code (loc)	\ ( -- n )  runtime-code of any local
    	r0	'user local-frame ldr
    	r1	ip )+	ldr
    	top	sp	push
    	top	r0 r1 2 #asl db ldr c;

Note: The disassembler can NOT know the local variables names, so it 
assumes names like V0 V1 ... .  


Vocabularies
============

Each VOCABULARY has #THREADS ( currently 16 ) 32-bit pointers which 
are called "threads".  A thread is the head of a linked list of words.  
A hashing function selects which of the 16 linked lists a particular 
word belongs to.  The threads are stored in the USER area.  The 
Parameter Field of a VOCABULARY contains the 32-bit offset of the 
threads in the USER area, followed by the vocabulary-link, a 32-bit 
pointer to the previous VOCABULARY. The runtime high-level code is: 
         does> body> context token!


Tokens
======

Within the body of a colon definition, calls to other Forth words are 
compiled as the 32-bit absolute compilation address of those words.  
These tokens have a corresponding bit in the relocation table.  


Branching
=========

Branch targets are offsets relative to the location that contains the 
branch offset.  They are stored as 32-bit twos-complement numbers 
representing the number of bytes between the offset location and the 
branch target.  For example, a branch to the following location could 
be compiled with: 

         postpone branch   4 ,


Doubles
=======

Risc-OS Forthmacs versions newer than 1.83 have full double number 
support, all conversion tools CONVERT, NUMBER?, D. use doubles, the 
'scaling' words */ */MOD UM/MOD use double intermediate results.  

Also the text-interpreter and compiler accept literals as doubles when 
there is a period in it.  
         : test 1234. d. ;
1234.  is a double number and D. displays it.  

This could only be achieved with changing stack effects in a number of 
words.  So these new Risc-OS Forthmacs versions are no longer 
compatible when these words are used.  The lib.compatible tool does 
NOT cover these changes.  

The advantage of the new stack behaviour is it's ANS compliancy and 
the improved arithmetic capabilities.  


Header format - # of bytes in parentheses
=========================================

Source Field (4), Link Field (4), Name Field (n), Padding (0 to 3), 
Flags (1), Code Field (4), Parameter Field (n).  

As all addresses need to be, the Link Field, Name Field, and Code 
Field are all aligned.  

Links point to links ( NOT to Name Fields, as in FIG Forth! ) 

The name field is a normal Forth packed string.  (Many Forth 
implementations set the high bit in the first and last characters of 
the name field; Risc-OS Forthmacs does not).  

Name Field: length-byte, 0-31 character name.  



Vocabulary Format
=================

Vocabularies have 8-way hashing.  This means that each vocabulary has 
16 separate linked lists of words.  Before searching a vocabulary, a 
hashing function is applied to the name to be located.  The hashing 
function selects one of the 8 linked lists to search.  

The hashing function is very simple.  The lower 3 bits of the first 
character in the name (the first name character, not the length byte) 
are interpreted as a number from 0 to 7, selecting a linked lists.  

Vocabularies are not chained to one another.  Search order is 
implemented using the ALSO / ONLY scheme.  Each vocabulary thread is 
terminated with a special link field in the final word.  The special 
link address is the address of the origin of the Forth system (which 
may change from session to session due to the relocation that the 
operating system applies when loading and executing the Forth system.  

The parameter field for a vocabulary looks like: 

User number (4), Voc-link (4) 

The user number selects the place in the user area where the head of 
list pointers for the 4 vocabulary threads are stored.  Each 
vocabulary requires 8 bytes of user area storage for these 4 threads.  
The values stored in the user area are the Link field Addresses for 
the top word in each thread.  


Relocation
==========

In the RiscOS environment all programs of the absolute type are loaded 
at $8000 and executed from there.  So on first sight the relocation 
table doesn't make much sense in this version.  

But the relocation table can be used for target/meta-compiling or for 
relocating code during run-time.  

The executable file contains a relocation list used to identify the 
locations in the program's binary image which contain absolute 
addresses.  When the program is loaded, each of these locations is 
modified by adding the starting address of the program to the number 
contained in that location.  Only 32-bit numbers may be so modified.  

While Risc-OS Forthmacs is running, it maintains its own relocation 
table, identifying those locations in the Forth dictionary which must 
be relocated during COLD-CODE. Each bit in the map represents the 
address of one aligned location.  This relocation table is completely 
different from the standard Risc OS relocation tables, it is only used 
from within Risc-OS Forthmacs.  

In order for this to work properly, the programmer must be careful to 
use TOKEN, or TOKEN! to store an address in the dictionary, both set 
the relocation flags.  If , or ! is used instead, the address will not 
be properly relocated if SAVE-FORTH has been used to write the 
dictionary image to an executable file.  

See: TOKEN! TOKEN, SET-RELOCATION-BIT, RELOCATION-MAP 


Program header
==============

The header of the executable binary image looks like this: 

 h_magic   (  0)    \ Magic Number
 h_tlen    (  4)    \ length of text (code)
 h_dlen    (  8)    \ length of initialised data
 h_blen    (  c)    \ length of BSS unitialised data
 h_slen    (  10)   \ length of symbol table
 h_entry   (  14)   \ Entry address
 h_trlen   (  18)   \ Text Relocation Table length
 h_drlen   (  1c)   \ Data Relocation Table length

the magic number is the branch+link instruction just behind this 
header.  Note: this header might be changed with future releases 
according to Acorns executable binary code standard.  


Heap memory
===========

Risc-OS Forthmacs is loaded to $8000 and will have as much memory 
available as was defined by 'WimpSlot' .  

The MAIN-TASKs user area immediately follows the first instructions at 
$8050.  

$600 byte will be allocated in module-heap, it will hold the ENV-AREA, 
the command-line area, the INTERRUPT-CODE giving GET-TICKS plus all 
handlers used by shelled programs.  

The implementation of the dynamic memory manager has changed in 
Version 3.1-2.00.  From now on the dictionary and the heap share the 
same memory area, the dictionary grows from lower addresses and the 
heap can be as large as the area between the stacks and HERE. 

Note: Of course you may install another memory manager or add more 
heaps.  


Dictionary memory
=================

At the top of the dictionary are both stacks defined by RP0 - RS-SIZE 
and SP0 - PS-SIZE and the TIB, below this are MBytes of free memory 
(well, hopefully).  HERE marks the end of the allocated dictionary, 
classically PAD is HERE plus something.  

Risc-OS Forthmacs knows about two dictionary areas, the RESIDENT 
(which is the dictionary you know in all implementations) and the 
TRANSIENT. The transient dictionary is in the heap memory, definitions 
defined here won't use dictionary space in the target application.  So 
it might be useful to do: 
    transient
      fload assembler
      fload debugger
    resident
      fload myapplication
Now the debugger and assembler will be in transient address space.  To 
remove all links, pointers etc.  into the transient address space use 
DISPOSE, it will do this for you.  .DISPOSE will also give some 
informations what is removed.  

