<!-- Forthmacs Formatter generated HTML output -->
<html>
<head>
<title>Forthmacs Implementation</title>
</head>
<body>
<h1>Forthmacs Implementation</h1>
<hr>
<p>
This chapter describes how Risc-OS Forthmacs implements the Forth virtual 
machine on the ARM processors.  It assumes that you have a fairly good knowledge 
of conventional Forth implementations; it does not attempt to be a tutorial on 
how Forth works.  
<p>
<p>
<h2>Dialect</h2>
<p>
Risc-OS Forthmacs is an implementation of the Forth-83 standard, with a few 
exceptions.  It is a descendant of the public-domain F83 implementation by Laxen 
and Perry and contains most of the F83 extensions as well as many new ones.  It 
is compatible with the other implementations for Sun-68k, Sparc, Atari, 
Macintosh and OS-9 computers.  
<p>
<p>
<h2>Stack Width and Addressing</h2>
<p>
Risc-OS Forthmacs deviates from the 83 Standard in the width of the stack.  
Forth 83 specifies that stack items are 16-bit numbers, and that the address 
space is 64K.  This wastes much of the power of modern CPUs and is almost 
impossible to implement on ARM based computers.  
<p>
In Risc-OS Forthmacs, all stack items and memory cells are 32 bits wide; 
keep this in mind when writing portable programs.  
<p>
The address space could conceivably grow to 2 to the 32nd power (4 gigabytes), 
but the current CPU/MMU versions restrict it to 16 MBytes.  16-bit (2-byte) 
memory accesses are no longer supported and must be emulated if necessary.  
<p>
Note: 16-bit word accesses are simulated by two byte accesses, so beware of 
interrupts occurring between them! 
<p>
The current ARM MMUs don't support non-aligned memory accesses.  <strong>Note:</strong> 
They neither abort nor take an exception vector; they just do something that is 
not clearly defined and is CPU dependent.  Take care here, it took me hours to find a bug! 
<p>
All accesses must be one of: 
<p>
1) byte-wide access to any address in the address area 
<p>
2) cell-wide ( 32-bit ) access to any aligned address 
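<p>
The two access rules, and the emulation of a 16-bit access by two byte accesses, can be sketched in a small Python model. This is only an illustration, not Forthmacs code: the function names are made up here, and little-endian byte order is an assumption.

```python
# Model of the access rules: byte access anywhere, cell access only when aligned.
CELL = 4  # bytes per 32-bit cell

def is_legal_access(address, width):
    """Return True if an access of `width` bytes at `address` follows the rules."""
    if width == 1:
        return True                   # rule 1: byte access to any address
    if width == CELL:
        return address % CELL == 0    # rule 2: cell access needs an aligned address
    return False                      # 16-bit accesses are not supported directly

def fetch16(memory, address):
    """Emulate a 16-bit fetch with two byte fetches (little-endian assumed)."""
    low = memory[address]
    high = memory[address + 1]
    return low | (high << 8)

memory = bytearray([0x34, 0x12, 0x00, 0x00])
print(is_legal_access(0x8002, CELL))  # False: not cell-aligned
print(hex(fetch16(memory, 0)))        # 0x1234
```

<p>
Because <code>fetch16</code> reads the two bytes separately, anything that changes memory between the two reads (an interrupt, for instance) can produce a torn value.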
<p>
<p>
Both stacks are pre-decrementing/post-incrementing.  The parameter stack holds 
its top-of-stack item in the register top (r10); this allows much faster code 
definitions because of the CPU's load/store architecture.  
<p>
<p>
<h2>Register Usage</h2>
<p>
<p>
<br><code>    r7      floating stack pointer  fsp</code><br>
<br><code>    r8      instruction pointer     ip</code><br>
<br><code>    r9      user area pointer       up</code><br>
<br><code>    r10     top-of-stack register   top</code><br>
<br><code>    r11     returnstack pointer     rp</code><br>
<br><code>    r12     RiscOS frame pointer    fp  never use this!</code><br>
<br><code>    r13     stack pointer           sp</code><br>
<br><code>    r14     link register           lk</code><br>
<br><code>    r15     pc + status + flags     pc</code><br>
<br><code>    r15     sr                      sr  holds the flags part of r15</code><br>
<p>
Note: In future CPU versions the internal structure of the pc register might be 
different, so it is better to think of pc and status register as two separate 
registers.  The hardware-errors and the  <code><A href="_smal_AX#47"> .registers </A></code> 
instruction know about this.  
<p>
r0, r1, r2, r3, r4, r5, and r6 are available for use within code definitions.  
Don't try to use them for permanent storage, because they are used by many code 
words with no attempt to preserve the previous contents.  
<p>
Registers r7-r11 and r14 can be used within code definitions, but you have to 
save and restore their values at the beginning/end of the definition.  r12, r13 
and r15 should not be used.  
<p>
<p>
<h2>Inner &lt;address&gt; Interpreter</h2>
<p>
The inner interpreter  <code><A href="_smal_AK#3a"> next </A></code> is direct 
threaded and post-incrementing.  The compilation address of every definition contains 
machine code to be executed, not a pointer.  Each  <code><A href="_smal_AH#187"> code </A></code> 
definition ends with the  <code><A href="_smal_AK#3a"> next </A></code> code, 
assembled in-line.  The  <code><A href="_smal_AK#3a"> next </A></code> code is: 
<br><code>         pc      ip )+   ldr</code><br>
This means: load the program counter  <code><A href="_smal_BA#18"> pc </A></code> 
(without affecting the CPU status) from the 4-byte cell pointed to by the 
instruction pointer ip, then post-increment the instruction pointer.  So  <code><A href="_smal_AK#3a"> next </A></code> 
is a single CPU instruction and very fast.  It is much faster than 
<br><code>         address  dolink branch</code><br>
<br><code>         ...</code><br>
<br><code>         pc link mov</code><br>
constructions because there is only one pipeline reload per  <code><A href="_smal_AK#3a"> next </A>.</code> 
On the other hand, calling secondaries definitely costs more overhead.  
<p>
For discussions about subroutine threaded ( macro extended ) versus threaded 
code implementations see the Forth literature.  Generally, macros do bring some 
advantage in execution speed but give less information about the code itself, so 
debuggers are less useful.  The penalty for direct threaded code is hard to 
predict because it depends very much on the type of application.  Something like 50% 
sounds reasonable, so optimising the bottlenecks could bring big advantages.  
The <em>runtimer </em> utilities might help you do this.  
<p>
The assembler macro  <code><A href="_smal_BJ#171"> c; </A></code> assembles the  <code><A href="_smal_AK#3a"> next </A></code> 
instruction and ends assembling by  <code><A href="_smal_BV#1dd"> end-code </A>.</code> 
A fast conditional next can be done by 
<br><code>         ...</code><br>
<br><code>         r2 0 cmp</code><br>
<br><code>         eq next</code><br>
<br><code>         ...</code><br>
<p>
<p>
<h2>Other Definitions</h2>
<p>
Any word that is not a code definition contains a branch+link instruction in its 
code field; this makes a relative branch to an inline address and saves 
pc+sr in the lk register.  
<br><code>         runtime-addr    dolink branch</code><br>
The inline address points to a code fragment (headerless in most cases) that 
implements the run-time action of the word.  The parameter field starts just 
after this branch+link instruction and can be found by clearing the flags in the 
link register like this: 
<br><code>         r0 lk    th fc000003 #   bic</code><br>
<br><code>         r0  get-link</code><br>
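<p>
The effect of the <code>bic</code> with the mask fc000003 can be checked with plain integer arithmetic. The following Python sketch only models the bit operation; the function name and the sample link value are hypothetical.

```python
FLAG_MASK = 0xFC000003  # mask from the text: bits 31..26 and bits 1..0

def link_to_pfa(lk):
    """Clear the masked bits of a 32-bit link value, as `bic` does."""
    return lk & ~FLAG_MASK & 0xFFFFFFFF

# Hypothetical link value: address 0x00008124 with bit 31 and bits 1..0 set.
lk = 0x80008127
print(hex(link_to_pfa(lk)))  # 0x8124
```

<p>
What remains after the mask is a cell-aligned address, which is the parameter field address the run-time code needs.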
<p>
The run-time codes may have to push the top-register to the stack, save the 
return pointer to the return-stack and set the instruction or stack pointer to 
the parameter field address.  All standard runtime codes (those of variables, 
constants, colon definitions, user variables ...) have been optimized for best 
cache-hit rates on ARM3/6 machines.  
<p>
Note:  <code><A href="_smal_AI#338"> word-type </A></code> ( cfa -- addr ) finds 
the address of the word's runtime code in this implementation.  
<p>
<p>
<h2>Colon definitions</h2>
<p>
The runtime code: 
<br><code>    mlabel docolon  assembler</code><br>
<br><code>         ip      rp      push</code><br>
<br><code>         ip      get-link c;</code><br>
The body of a Colon Definition starts 4 bytes after the compilation address.  
The body contains a list of compilation addresses of other words.  Each such 
compilation address is a 32-bit number which is an absolute address.  
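<p>
How <code>next</code> and the colon run-time code cooperate can be sketched with a toy Python model. This is an illustration under simplifying assumptions, not the Forthmacs implementation: Python callables stand in for machine code, all addresses are hypothetical, and an outer loop calls <code>next</code> instead of each code word ending with it in-line.

```python
# Toy model of a threaded inner interpreter; a sketch, not Forthmacs code.
CELL = 4  # bytes per 32-bit cell

class VM:
    def __init__(self):
        self.mem = {}         # address -> cell (a token, or a callable for code)
        self.stack = []       # parameter stack
        self.rstack = []      # return stack
        self.ip = 0           # instruction pointer
        self.running = True

    def next(self):
        """Load the token at ip, post-increment ip, execute the token's code."""
        cfa = self.mem[self.ip]
        self.ip += CELL
        self.mem[cfa](self, cfa)

def docolon(vm, cfa):
    vm.rstack.append(vm.ip)   # save the caller's ip on the return stack
    vm.ip = cfa + CELL        # the body starts 4 bytes after the code field

def do_exit(vm, cfa):
    vm.ip = vm.rstack.pop()   # return to the calling definition

def two(vm, cfa):
    vm.stack.append(2)

def plus(vm, cfa):
    b, a = vm.stack.pop(), vm.stack.pop()
    vm.stack.append(a + b)

def halt(vm, cfa):
    vm.running = False

vm = VM()
# Hypothetical addresses for four code words.
TWO, PLUS, EXIT, HALT = 0x100, 0x104, 0x108, 0x10C
vm.mem.update({TWO: two, PLUS: plus, EXIT: do_exit, HALT: halt})
# Colon definition FOUR at 0x200: code field, then a list of tokens.
FOUR = 0x200
for i, cell in enumerate([docolon, TWO, TWO, PLUS, EXIT]):
    vm.mem[FOUR + i * CELL] = cell
# Top-level thread: call FOUR, then stop.
vm.mem[0x300], vm.mem[0x304] = FOUR, HALT
vm.ip = 0x300
while vm.running:
    vm.next()
print(vm.stack)  # [4]
```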
<p>
<p>
<h2>Variable</h2>
<p>
The Parameter Field of a  <code><A href="_smal_BN#325"> variable </A></code> 
contains a 32-bit number which is the value of the variable.  The runtime code: 
<br><code>    mlabel dovariable  assembler</code><br>
<br><code>         top     sp      push</code><br>
<br><code>         top     get-link c;</code><br>
<p>
<p>
<h2>Constants</h2>
<p>
The Parameter Field of a  <code><A href="_smal_AT#193"> constant </A></code> 
contains the 32-bit value of the constant.  The runtime code: 
<br><code>    mlabel doconstant  assembler</code><br>
<br><code>         top     sp      push</code><br>
<br><code>         r0      get-link</code><br>
<br><code>         top     r0 )    ldr c;</code><br>
<p>
<p>
<h2>User Variables</h2>
<p>
The value of a  <code><A href="_smal_BK#322"> user </A></code> variable is 
stored in the  <code><A href="_smal_BK#322"> user </A></code> area as a 32-bit 
number.  The Parameter Field of a user variable contains a 32-bit offset into 
the user area of the current task.  r9 contains the base address of the current 
user area.  r9 is symbolically defined as  <code><A href="_smal_BD#31b"> up </A></code> 
in the assembler.  The runtime code: 
<br><code>    mlabel douser  assembler</code><br>
<br><code>         top     sp      push</code><br>
<br><code>         r0      get-link</code><br>
<br><code>         r0      r0 )    ldr</code><br>
<br><code>         top     r0      up add c;</code><br>
<p>
<p>
<h2>Deferred words</h2>
<p>
The compilation address of the word to be executed by a  <code><A href="_smal_AE#1b4"> defer </A></code> 
word is stored as a 32-bit absolute address in the  <code><A href="_smal_BK#322"> user </A></code> 
area.  The Parameter Field of a deferred word contains a 32-bit number which is 
an offset into the user area of the current task.  The runtime code: 
<br><code>    mlabel dodefer  assembler</code><br>
<br><code>         r0      get-link</code><br>
<br><code>         r0      r0 )    ldr</code><br>
<br><code>         pc      r0      up  ib ldr end-code</code><br>
The last line holds a somewhat optimized  <code><A href="_smal_AK#3a"> next </A></code> 
instruction, it means: Load the pc from the address in the user area with the 
offset r0.  
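<p>
The address arithmetic behind user variables and deferred words is the same: the parameter field holds an offset, and the user area pointer supplies the base. A minimal Python model, with hypothetical base addresses, offsets and compilation addresses, might look like this:

```python
CELL = 4
memory = {}  # address -> 32-bit value (model of RAM)

def douser(up, offset):
    """Run-time of a user variable: address inside this task's user area."""
    return up + offset

def dodefer(up, offset):
    """Run-time of a deferred word: fetch the compilation address to execute."""
    return memory[up + offset]

VECTOR_OFFSET = 2 * CELL          # hypothetical offset kept in the parameter field
task_a, task_b = 0x9000, 0xA000   # hypothetical user-area base addresses
memory[douser(task_a, VECTOR_OFFSET)] = 0x8400  # hypothetical compilation address
memory[douser(task_b, VECTOR_OFFSET)] = 0x8800  # a different one for task b
print(hex(dodefer(task_a, VECTOR_OFFSET)))  # 0x8400
print(hex(dodefer(task_b, VECTOR_OFFSET)))  # 0x8800
```

<p>
Because the offset is resolved against the current task's user area, the same deferred word can execute a different action in each task.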
<p>
<p>
<h2>;code</h2>
<p>
The compilation address of a word created by a  <code><A href="_smal_BD#19b"> create </A></code> 
...   <code><A href="_smal_BN#115"> ;code </A></code> data type construction 
contains the standard branch+link instruction that branches to the runtime code.  
<p>
The runtime code is defined by the programmer in the  <code><A href="_smal_BN#115"> ;code </A></code> 
part of the definition.  At first the  <code><A href="_smal_BN#115"> ;code </A></code> 
instruction assembles 
<br><code>         top     sp      push</code><br>
<br><code>         top     get-link</code><br>
before the programmer's code.  This is mainly for convenience: the old top is 
already saved on the stack, and top points to the parameter field.  The programmer 
might do <em>-2 cells allot </em> to drop these two instructions in speed-optimized code.  
<p>
<p>
<h2>does&gt;</h2>
<p>
<p>
<br><code>    mlabel dodoes  assembler</code><br>
<br><code>         ip      rp      push</code><br>
<br><code>         ip      get-link c;</code><br>
The runtime code is defined by the programmer in the  <code><A href="_smal_BF#1cd"> does&gt; </A></code> 
part of the definition.  Before branching to the dodoes code, the does&gt; 
instruction assembles 
<br><code>         top     sp      push</code><br>
<br><code>         top     lk      th fc000003 # bic</code><br>
to get the parameter field address.  
<p>
<p>
<h2>local variables</h2>
<p>
Risc-OS Forthmacs has built-in local variables conforming to ANS Forth; they spend 
their lifetime on the return-stack in stack-frames.  The stack-frames are linked 
via a  <code><A href="_smal_BK#322"> user </A></code> variable <strong>local-frame</strong> 
which is also used to locate a local variable's value.  The frame structure 
looks like this: 
<br><code>    | cfa:frame&gt;   | old-frame     | old-rs        | loc   | loc   | .........</code><br>
with cfa:pop-frame on top of the return-stack.  pop-frame removes the current 
frame and switches to the last frame.  
<br><code>    headerless code pop-frame \ this routine is pushed on return stack by push-locals</code><br>
<br><code>         here-t /token-t + token,-t</code><br>
<br><code>    	r0 rp 2	rp ia	ldm</code><br>
<br><code>    	r0	'user local-frame str</code><br>
<br><code>    	ip	rp	pop c;</code><br>
<br><code>    </code><br>
The local variables are accessed using (loc) followed by a stack-frame index.  
<br><code>    code (loc)	\ ( -- n )  runtime-code of any local</code><br>
<br><code>    	r0	'user local-frame ldr</code><br>
<br><code>    	r1	ip )+	ldr</code><br>
<br><code>    	top	sp	push</code><br>
<br><code>    	top	r0 r1 2 #asl db ldr c;</code><br>
<p>
Note: The disassembler can  <code><A href="_smal_AM#27c"> not </A></code> know 
the local variables' names, so it assumes names like <strong>v0 v1 ...</strong>.  
<p>
<p>
<h2>Vocabularies</h2>
<p>
Each  <code><A href="_smal_BR#329"> vocabulary </A></code> has  <code><A href="_smal_BK#82"> #threads </A></code> 
( currently 16 ) 32-bit pointers which are called "threads".  A thread is the 
head of a linked list of words.  A hashing function selects which of the 16 
linked lists a particular word belongs to.  The threads are stored in the  <code><A href="_smal_BK#322"> user </A></code> 
area.  The Parameter Field of a  <code><A href="_smal_BR#329"> vocabulary </A></code> 
contains the 32-bit offset of the threads in the  <code><A href="_smal_BK#322"> user </A></code> 
area, followed by the vocabulary-link, a 32-bit pointer to the previous  <code><A href="_smal_BR#329"> vocabulary </A>.</code> 
The runtime high-level code is: 
<br><code>         does&gt; body&gt; context token!</code><br>
<p>
<p>
<h2>Tokens</h2>
<p>
Within the body of a colon definition, calls to other Forth words are compiled 
as the 32-bit absolute compilation address of those words.  These tokens have a 
corresponding bit in the relocation table.  
<p>
<p>
<h2>Branching</h2>
<p>
Branch targets are offsets relative to the location that contains the branch 
offset.  They are stored as 32-bit twos-complement numbers representing the 
number of bytes between the offset location and the branch target.  For example, 
a branch to the following location could be compiled with: 
<p>
<br><code>         postpone branch   4 ,</code><br>
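<p>
The offset arithmetic can be sketched in Python. The function names and addresses below are hypothetical; only the rule (offset = target minus the address of the offset cell, stored as a 32-bit two's-complement number) comes from the text.

```python
CELL = 4

def branch_offset(offset_location, target):
    """Value stored after `branch`: byte distance from the offset cell to the target."""
    return (target - offset_location) & 0xFFFFFFFF

def branch_target(offset_location, stored):
    """Decode a stored 32-bit two's-complement offset back into a target address."""
    signed = stored - 0x100000000 if stored & 0x80000000 else stored
    return offset_location + signed

offset_cell = 0x8104                                         # hypothetical address
print(branch_offset(offset_cell, offset_cell + CELL))        # 4
print(hex(branch_offset(offset_cell, 0x80F8)))               # 0xfffffff4
```

<p>
A branch to the cell directly after the offset therefore stores 4, which matches the <code>4 ,</code> in the example above; backward branches store negative offsets.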
<p>
<p>
<h2>Doubles</h2>
<p>
Risc-OS Forthmacs versions newer than 1.83 have full double number support.  All 
conversion tools  <code><A href="_smal_AW#196"> convert </A>,</code>  <code><A href="_smal_AR#281"> number? </A>,</code>  <code><A href="_smal_BR#1a9"> d. </A></code> 
use doubles, and the 'scaling' words  <code><A href="_smal_AH#c7"> */ </A></code>  <code><A href="_smal_AI#c8"> */mod </A></code>  <code><A href="_smal_AU#314"> um/mod </A></code> 
use double intermediate results.  
<p>
The text interpreter and compiler also accept a literal as a double when it 
contains a period.  
<br><code>         : test 1234. d. ;</code><br>
1234.  is a double number and  <code><A href="_smal_BR#1a9"> d. </A></code> 
displays it.  
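<p>
Why a double intermediate result matters for the scaling words can be shown with a little arithmetic. The Python sketch below is a model only: the function names are made up, and it assumes a double occupies two 32-bit cells (low and high half).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_double(n):
    """Split a signed 64-bit number into (low, high) 32-bit cells."""
    n &= MASK64
    return n & MASK32, n >> 32

def from_double(low, high):
    """Rebuild the signed 64-bit number from its two cells."""
    n = (high << 32) | low
    return n - (1 << 64) if n >> 63 else n

def star_slash(a, b, c):
    """a*b/c keeping the full-width product as the intermediate result."""
    return (a * b) // c

def star_slash_single(a, b, c):
    """Same, but with the product truncated to one 32-bit cell first."""
    return ((a * b) & MASK32) // c

print(to_double(1234))                    # (1234, 0)
print(star_slash(100000, 100000, 1000))   # 10000000
print(star_slash(100000, 100000, 1000) == star_slash_single(100000, 100000, 1000))  # False
```

<p>
The product 100000 * 100000 does not fit in 32 bits, so only the version with a double-width intermediate gives the expected quotient.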
<p>
This could only be achieved by changing the stack effects of a number of words.  
So these new Risc-OS Forthmacs versions are no longer compatible with older code 
that uses these words.  The lib.compatible tool does  <code><A href="_smal_AM#27c"> not </A></code> 
cover these changes.  
<p>
The advantages of the new stack behaviour are its ANS compliance and the improved 
arithmetic capabilities.  
<p>
<p>
<h2>Header format - # of bytes in parentheses</h2>
<p>
Source Field (4), Link Field (4), Name Field (n), Padding (0 to 3), Flags (1), 
Code Field (4), Parameter Field (n).  
<p>
As all addresses need to be, the Link Field, Name Field, and Code Field are all 
aligned.  
<p>
Links point to links (  <code><A href="_smal_AM#27c"> not </A></code> to Name 
Fields, as in FIG Forth! ) 
<p>
The name field is a normal Forth packed string.  (Many Forth implementations set 
the high bit in the first and last characters of the name field; 
Risc-OS Forthmacs does not).  
<p>
Name Field: length-byte, 0-31 character name.  
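<p>
The field offsets that follow from this layout can be computed with a short Python sketch. It assumes that the header starts at an aligned address and that the padding is chosen so that the Code Field directly after the flags byte is aligned; the function name is hypothetical.

```python
def header_layout(name):
    """Return byte offsets of the header fields, relative to the Source Field."""
    length = len(name)
    if not 0 <= length <= 31:
        raise ValueError("name must be 0 to 31 characters")
    name_field = 8                       # after Source Field (4) and Link Field (4)
    name_bytes = 1 + length              # length byte plus the characters
    padding = -(name_bytes + 1) % 4      # align the Code Field after the flags byte
    flags = name_field + name_bytes + padding
    return {
        "source": 0,
        "link": 4,
        "name": name_field,
        "padding": padding,
        "flags": flags,
        "code": flags + 1,
    }

print(header_layout("dup"))
print(header_layout("swap"))
```

<p>
For a three-character name such as dup, the name takes 4 bytes, 3 padding bytes follow, the flags byte sits at offset 15 and the Code Field at offset 16.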
<p>
<p>
<p>
<h2>Vocabulary Format</h2>
<p>
Vocabularies have 8-way hashing.  This means that each vocabulary has 8 
separate linked lists of words.  Before searching a vocabulary, a hashing 
function is applied to the name to be located.  The hashing function selects one 
of the 8 linked lists to search.  
<p>
The hashing function is very simple.  The lower 3 bits of the first character in 
the name (the first name character, not the length byte) are interpreted as a 
number from 0 to 7, selecting one of the linked lists.  
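<p>
The hashing function as described (lower 3 bits of the first name character) is small enough to model directly. The Python function name below is hypothetical.

```python
def thread_index(name):
    """Select a thread from the lower 3 bits of the first character of the name."""
    return ord(name[0]) & 7

for word in ("dup", "drop", "swap"):
    print(word, thread_index(word))
```

<p>
Words that start with the same character, such as dup and drop, always land in the same thread, so the distribution depends on how evenly first characters are spread.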
<p>
Vocabularies are not chained to one another.  Search order is implemented using 
the  <code><A href="_smal_BU#14c"> also </A></code> /  <code><A href="_smal_BB#289"> only </A></code> 
scheme.  Each vocabulary thread is terminated with a special link field in the 
final word.  The special link address is the address of the origin of the Forth 
system (which may change from session to session due to the relocation that the 
operating system applies when loading and executing the Forth system).  
<p>
The parameter field for a vocabulary looks like: 
<p>
User number (4), Voc-link (4) 
<p>
The user number selects the place in the user area where the head-of-list 
pointers for the vocabulary threads are stored.  Each thread requires one 
32-bit cell (4 bytes) of user area storage.  The values stored in the user 
area are the Link Field addresses of the top word in each thread.  
<p>
<p>
<h2>Relocation</h2>
<p>
In the RiscOS environment all programs of the absolute type are loaded at $8000 
and executed from there.  So at first sight the relocation table doesn't make 
much sense in this version.  
<p>
But the relocation table can be used for target/meta-compiling or for relocating 
code during run-time.  
<p>
The executable file contains a relocation list used to identify the locations in 
the program's binary image which contain absolute addresses.  When the program 
is loaded, each of these locations is modified by adding the starting address of 
the program to the number contained in that location.  Only 32-bit numbers may 
be so modified.  
<p>
While Risc-OS Forthmacs is running, it maintains its own relocation table, 
identifying those locations in the Forth dictionary which must be relocated 
during  <code><A href="_smal_AK#18a"> cold-code </A>.</code> Each bit in the map 
represents the address of one aligned location.  This relocation table is 
completely different from the standard Risc OS relocation tables; it is 
used only from within Risc-OS Forthmacs.  
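<p>
A bitmap with one bit per aligned location can be modelled as follows. This Python sketch is not the Forthmacs data structure: the class and function names are hypothetical, and the bit order within a byte is an assumption.

```python
CELL = 4

class RelocationMap:
    """One bit per aligned 32-bit location of an image."""
    def __init__(self, size_bytes):
        cells = size_bytes // CELL
        self.bits = bytearray((cells + 7) // 8)

    def set(self, offset):
        if offset % CELL:
            raise ValueError("only aligned locations can be relocated")
        index = offset // CELL
        self.bits[index // 8] |= 1 << (index % 8)

    def is_set(self, offset):
        index = offset // CELL
        return bool(self.bits[index // 8] & (1 << (index % 8)))

def relocate(cells, reloc, base):
    """Add `base` to every cell whose relocation bit is set."""
    return [(value + base) & 0xFFFFFFFF if reloc.is_set(i * CELL) else value
            for i, value in enumerate(cells)]

image = [0x10, 1234, 0x20]      # cells 0 and 2 hold addresses, cell 1 holds data
reloc = RelocationMap(len(image) * CELL)
reloc.set(0)
reloc.set(8)
print([hex(v) for v in relocate(image, reloc, 0x8000)])  # ['0x8010', '0x4d2', '0x8020']
```

<p>
A cell whose bit is not set is left alone, which is why an address stored without setting its bit would not be adjusted.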
<p>
In order for this to work properly, the programmer must be careful to use  <code><A href="_smal_BX#2ff"> token, </A></code> 
or  <code><A href="_smal_BW#2fe"> token! </A></code> to store an address in the 
dictionary; both set the relocation flags.  If , or ! is used instead, the 
address will not be properly relocated if  <code><A href="_smal_BK#2c2"> save-forth </A></code> 
has been used to write the dictionary image to an executable file.  

See:  <code><A href="_smal_BW#2fe"> token! </A></code>  <code><A href="_smal_BX#2ff"> token, </A></code>  <code><A href="_smal_BU#2cc"> set-relocation-bit </A>,</code>  <code><A href="_smal_AL#2ab"> relocation-map </A></code> 
<p>
<p>
<h2>Program header</h2>
<p>
The header of the executable binary image looks like this: 
<p>
<p><pre>
 h_magic   (  0)    \ Magic Number
 h_tlen    (  4)    \ length of text (code)
 h_dlen    (  8)    \ length of initialised data
 h_blen    (  c)    \ length of BSS uninitialised data
 h_slen    (  10)   \ length of symbol table
 h_entry   (  14)   \ Entry address
 h_trlen   (  18)   \ Text Relocation Table length
 h_drlen   (  1c)   \ Data Relocation Table length
</pre><p>
<p>
The magic number is the branch+link instruction just behind this header.  Note: 
this header might change in future releases to follow Acorn's executable 
binary code standard.  
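<p>
The table above describes eight consecutive 32-bit fields at offsets 0 to 1c. A Python sketch that unpacks such a header is shown below; little-endian byte order is an assumption, and all sample values are hypothetical.

```python
import struct

FIELDS = ("h_magic", "h_tlen", "h_dlen", "h_blen",
          "h_slen", "h_entry", "h_trlen", "h_drlen")
HEADER_FORMAT = "<8I"  # eight unsigned 32-bit fields, little-endian (assumed)

def parse_header(image):
    """Return the header fields of a binary image as a dictionary."""
    values = struct.unpack_from(HEADER_FORMAT, image, 0)
    return dict(zip(FIELDS, values))

# Hypothetical header contents, for illustration only.
sample = struct.pack(HEADER_FORMAT, 0xEB000006, 0x1000, 0x200, 0, 0, 0x20, 0x40, 0)
header = parse_header(sample)
print(struct.calcsize(HEADER_FORMAT))  # 32
print(hex(header["h_tlen"]))           # 0x1000
```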
<p>
<p>
<h2>Heap memory</h2>
<p>
Risc-OS Forthmacs is loaded to $8000 and will have as much memory available as 
was defined by <em>WimpSlot</em> .  
<p>
The  <code><A href="_smal_BN#265"> main-task </A>s</code> user area immediately 
follows the first instructions at $8050.  
<p>
$600 bytes are allocated in the module heap; they hold the  <code><A href="_smal_AC#1e2"> env-area </A>,</code> 
the command-line area, the  <code><A href="_smal_BO#236"> interrupt-code </A></code> 
giving  <code><A href="_smal_AG#216"> get-ticks </A></code> plus all handlers 
used by shelled programs.  
<p>
The implementation of the dynamic memory manager has changed in Version 
3.1-2.00.  Since then the dictionary and the heap share the same memory area: 
the dictionary grows from lower addresses and the heap can be as large as the 
area between the stacks and  <code><A href="_smal_AM#21c"> here </A>.</code> 
<p>
Note: Of course you may install another memory manager or add more heaps.  
<p>
<p>
<h2>Dictionary memory</h2>
<p>
At the top of the dictionary are both stacks defined by  <code><A href="_smal_BC#2ba"> rp0 </A></code> 
-  <code><A href="_smal_BE#2bc"> rs-size </A></code> and  <code><A href="_smal_AK#2da"> sp0 </A></code> 
-  <code><A href="_smal_AA#2a0"> ps-size </A></code> and the  <code><A href="_smal_BM#2f4"> tib </A>,</code> 
below this are MBytes of free memory (well, hopefully).   <code><A href="_smal_AM#21c"> here </A></code> 
marks the end of the allocated dictionary, classically  <code><A href="_smal_BL#293"> pad </A></code> 
is  <code><A href="_smal_AM#21c"> here </A></code> plus something.  
<p>
Risc-OS Forthmacs knows about two dictionary areas, the  <code><A href="_smal_AO#2ae"> resident </A></code> 
(which is the dictionary you know in all implementations) and the  <code><A href="_smal_AD#303"> transient </A>.</code> 
The transient dictionary is in the heap memory, definitions defined here won't 
use dictionary space in the target application.  So it might be useful to do: 
<br><code>    transient</code><br>
<br><code>      fload assembler</code><br>
<br><code>      fload debugger</code><br>
<br><code>    resident</code><br>
<br><code>      fload myapplication</code><br>
Now the debugger and assembler will be in transient address space.  To remove 
all links, pointers etc.  into the transient address space use  <code><A href="_smal_AS#1c2"> dispose </A>,</code> 
it will do this for you.   <code><A href="_smal_BG#de"> .dispose </A></code> 
will also give some information about what is removed.  
<p>
</body>
</html>
