<document title="Tokenize">
<define name="version" value="">
<define name="date" value="" length=30 align="right">




<literal mode="Text">
=============================================================================
Tokenize - Turn ASCII Files into Tokenized BASIC                 Version $$version$$

(c) Stephen Fryatt, 2014-2020                  $$date$$
=============================================================================

</literal>



<chapter title="Licence" file="Licence">

<cite>Tokenize</cite> is licensed under the EUPL, Version&nbsp;1.2 only (the &ldquo;Licence&rdquo;); you may not use this work except in compliance with the Licence.

You may obtain a copy of the Licence at <link ref="#url http://joinup.ec.europa.eu/software/page/eupl">http://joinup.ec.europa.eu/software/page/eupl</link>

Unless required by applicable law or agreed to in writing, software distributed under the Licence is distributed on an &ldquo;<strong>as is</strong>&rdquo; basis, <strong>without warranties or conditions of any kind</strong>, either express or implied.

See the Licence for the specific language governing permissions and limitations under the Licence.

The source for <cite>Tokenize</cite> can be found alongside this binary download, at <link ref="#url http://www.stevefryatt.org.uk/software">http://www.stevefryatt.org.uk/software</link>

</chapter>




<chapter title="Introduction" file="Introduction">

<cite>Tokenize</cite> is a tool for converting ASCII text files into tokenized BASIC (of the ARM&nbsp;BASIC&nbsp;V variety) at the command line. It is written in C, and can be compiled to run on platforms other than RISC&nbsp;OS.

In addition to tokenizing files, <cite>Tokenize</cite> can link multiple library files into a single output &ndash; both by passing multiple source files on the command line and by processing (and removing) <code>LIBRARY</code> statements found in the source.

Finally, <cite>Tokenize</cite> can perform the actions of the BASIC <code>CRUNCH</code> command on the tokenized output, substitute &lsquo;constant&rsquo; variables with values passed in on the command-line, and convert textual <code>SYS</code> names into their numeric equivalents.

Be aware that <cite>Tokenize</cite> is <strong>not</strong> a BASIC syntax checker. While it will check some aspects of syntax that it requires to do its job, it will not verify that a program is completely syntactically correct and that it will load into the BASIC Interpreter and execute as intended.

</chapter>



<chapter title="Command Line Usage" file="Use">

<cite>Tokenize</cite> is a command line tool, and should placed in your library or on your path as appropriate. Once installed, it can be used in its most basic form as

<command>tokenize &lt;source&nbsp;file&nbsp;1&gt; [&lt;source&nbsp;file&nbsp;2&gt; ...] -out &lt;output&gt; [&lt;options&gt;]</command>

where one or more source files are tokenized into a single output file.

It is possible to add the <param>-verbose</param> parameter switch, in which case <cite>Tokenize</cite> will generate more detailed information about what it is doing.

For more information about the options available, use <command>tokenize -help</command>.


<subhead title="Source Files">

The source files used by <cite>Tokenize</cite> are plain ASCII text: as saved from BASIC using the <code>TEXTSAVE</code> command, or by loading a BASIC file into a text editor and saving it as text.

The same requirements in terms of spacing and layout are placed on the files as would be applied by BASIC&rsquo;s own tokenizer. The only difference is that &ndash; as with loading files with <code>TEXTLOAD</code> and editing BASIC in most text editors on RISC&nbsp;OS &ndash; line numbers are optional. Thus a file could read

<codeblock>
 80 FOR I% = 1 TO 10
 90   PRINT "Hello World!"
100 NEXT I%
</codeblock>

but it could equally read

<codeblock>
FOR I% = 1 TO 10
  PRINT "Hello World!"
NEXT I%
</codeblock>

In the latter case, the line numbers will be applied automatically as described in the following section.

Keyword abbreviations are fully supported, using the rules defined in the <cite>BASIC Reference Manual</cite>. Thus <code>PRINT</code> and <code>P.</code> would both convert into the same token, while <code>OR</code> and <code>OR.</code> would not. For clarity, the use of abbreviations in source code is discouraged.


<subhead title="Line Numbers">

All BASIC programs must have line numbers, but if they are not included in the supplied source files then <cite>Tokenize</cite> will add them automatically.

A line number is considered to be any digits at the start of a line &ndash; otherwise, lines should start with non-numeric characters. The whitespace around line numbers is optional; if a number is present, the space before it will be ignored for the purposes of indenting what follows. Numbers can fall in the range from 0 to 65279 inclusive, which is the range allowed by BASIC.

If line numbers are present in the source files, then they will be used in the output. If they are not present, then <cite>Tokenize</cite> will number the lines it creates automatically: starting at the value given to the <param>-start</param> parameter and incrementing by the value given to the <param>-increment</param> parameter &ndash; these both default to 10 if not specified.

If line numbers are given, they must be sequential: duplicate numbers are not allowed, and lines can not be numbered backwards. It is possible to mix numbered and un-numbered lines, in which case <cite>Tokenize</cite> will switch between the two numbering methods: numbers must still be sequential, and if the numbered lines break the automatic sequence an error will be given. Thus

<codeblock>
   FOR i% = 1 TO 10
15   PRINT "Hello World!"
   NEXT i%
</codeblock>

would be valid using the default value for <param>-start</param> and <param>-increment</param> and would result in lines 10, 15 and 25 being generated. However

<codeblock>
   FOR i% = 1 TO 10
 5   PRINT "Hello World!"
   NEXT i%
</codeblock>

would not be valid, and would give the error &ldquo;Line number 5 out of sequence at line 2 of &lsquo;example file&rsquo;&rdquo; (because this would be trying to create the sequence 10, 5 and 15 using the default settings of <param>-start</param> and <param>-increment</param>).


<subhead title="Linking and Libraries">

Used with a single source file, <cite>Tokenize</cite> will take the file and convert it into a single tokenized BASIC program:

<command>tokenize &lt;source&nbsp;file&gt; -out &lt;output&nbsp;file&gt;</command>

If more than one source file is specified, then the files will be concatenated in the order that they are included in the parameters:

<command>tokenize &lt;source&nbsp;1&gt; &lt;source&nbsp;2&gt; &lt;source&nbsp;3&gt; -out &lt;output&nbsp;file&gt;</command>

would result in the files being linked in the order

<codeblock>
REM File 1
REM File 2
REM File 3
</codeblock>

In addition to files explicitly given on the command line, <cite>Tokenize</cite> can link in files referenced via <code>LIBRARY</code> commands in the source files themselves if the <param>-link</param> parameter is used. Thus if the statement <code>LIBRARY&nbsp;&quot;MoreCode&quot;</code> was included in a source file, the file <file>MoreCode</file> would be added to the list of files to be linked in and the statement would be removed from the tokenized output. Such files will be added to the queue for linking after any given on the command line, and will be processed in the order that they are found in the source.

<code>LIBRARY</code> commands must appear alone in a statement in order to be considered for linking. In other words, these two examples would be linked

<codeblock>
LIBRARY "WimpLib"
PROCfoo : LIBRARY "BarLib" : REM This is OK
</codeblock>

but on the other hand this example

<codeblock>
IF foo% THEN LIBRARY "BarLib"
</codeblock>

would not, because the <code>LIBRARY</code> statement is not able to be processed in isolation from the rest of the code. If a <code>LIBRARY</code> command is skipped in this way, a warning will be given.

All <code>LIBRARY</code> commands will be considered for linking, including any found within linked source files. The filenames given to <code>LIBRARY</code> must be string constants: if a variable is encountered (eg. <code>LIBRARY variable$</code>) then the statement will be left in the tokenized output and a warning will be given. Filenames found in <code>LIBRARY</code> commands are assumed to be in &lsquo;local&rsquo; format, including being treated as having relative location if applicable.

When used on RISC&nbsp;OS, filenames are passed straight to the filing system: any path variables must be set correctly before use &ndash; using the conventional <command>*Set</command> command &ndash; to point to ASCII versions of the specified library files. The <param>-path</param> parameter described below is <strong>not available</strong> on RISC&nbsp;OS, and will result in an error if used.

When used on other platforms, it is possible to specify &lsquo;path variables&rsquo; to <cite>Tokenize</cite> so that statements such as <code>LIBRARY &quot;BASIC:Library&quot;</code> can be used. On RISC&nbsp;OS, such a filename would resolve to the file <file>Library</file> somewhere on <code>&lt;BASIC$Path&gt;</code>. <cite>Tokenize</cite> allows paths to be specified using the <param>-path</param> parameter: <command>-path BASIC:libs/</command> would mean that <code>BASIC:</code> would expand to <code>libs/</code> and therefore result in <code>LIBRARY &quot;libs/Library&quot;</code> for the example here. As with RISC&nbsp;OS path variables, paths must end with a directory separator in the local format. More than one <param>-path</param> parameter can be specified on the command line if required.


<subhead title="Constant Variables">

When tokenizing a file, <cite>Tokenize</cite> can replace specific variables with constant values given via the <param>-define</param> parameter on the command line. Any instances of the variable found in the code will be replaced by the constant value, unless they are being assigned: in which case the statement containing the assignment will be removed completely.

A <param>-define</param> parameter is followed by a variable assignment in the form <code>&lt;name&gt;=&lt;value&gt;</code>. To assign an integer variable <code>int%</code> the constant value 10, for example, use <command>-define int%=10</command>; to give the string variable <code>name$</code> the value "Hello World", use <command>-define &quot;name$=Hello&nbsp;World&quot;</command>. Note that string values are not surrounded by double quotation marks, but if the value contains spaces then the whole parameter value must be enclosed as here.

Note that support for constants is currently limited: all references to the variable will be replaced, including those defined as <code>LOCAL</code>. In general, <cite>Tokenize</cite> will correctly identify assembler mnemonics within assembler blocks, but there may be some sequences which are not recognised and get treated as variables by mistake. It is advisable to check the generated code carefully when substitutions are being carried out.

Remember that there are some locations in BASIC where variables can be used but constant values can not. An example is diadic indirection, where

<codeblock>
PRINT a%!0
</codeblock>

is valid and will print the integer value stored at the location pointed to by a%, while

<codeblock>
PRINT &8200!0
</codeblock>

is unlikely to have the expected effect (printing 33280 followed by the inter value stored at &0). <cite>Tokenize</cite> will not validate your choice of substitutions for you.

Within these limitations, however, it is possible to use constants to insert details such as build dates into code. For example, given the code

<codeblock>
REM >Build Date
:
VERSION$ = "1.23"
BUILD_DATE$ = "12 Mar 2014"
:
PRINT "Example Program"
PRINT VERSION$;" (";BUILD_DATE$;")"
</codeblock>

then passing the definitions <command>-define VERSION$=0.12 -define &quot;BUILD_DATE$=01 Apr 2014&quot;</command> would result in the output

<codeblock>
REM >Build Date
:
:
PRINT "Example Program"
PRINT "0.12";" (";"01 Apr 2014";")"
</codeblock>

being produced.


<subhead title="SWI Number Conversion">

When tokenising a BASIC program, <cite>Tokenize</cite> can convert any SWI names used in <code>SYS</code> commands into numeric constants &ndash; removing the need for the BASIC interpreter to look them up when the program is run. To perform this operation, use the <param>-swi</param> parameter on the command line.

When running on RISC&nbsp;OS, <cite>Tokenize</cite> will look the SWI names up with the help of the operating system: as with any other BASIC cruncher, the modules used to provide any extension SWIs must be loaded when the tokenisation happens, or the names will be left in place (with a warning).

When running on other platforms, the option of looking up SWI names via the OS is not available. Instead, <cite>Tokenize</cite> can read the details from suitable C header files: <file>swis.h</file> that comes with GCC and Acorn&nbsp;C is a good option. While it can not parse the files in the same way that a C preprocessor would, <cite>Tokenize</cite> will look for any <code>#define</code> lines which appear to be followed by a valid SWI name and number combination; for example

<codeblock>
#define OS_WriteC    0x000000
#define OS_WriteS    0x000001
#define OS_Write0    0x000002
</codeblock>

The X versions of SWIs are inferred from their &lsquo;non-X&rsquo; forms, and vice versa: the three <code>#define</code> statements above would implicitly add

<codeblock>
#define XOS_WriteC   0x020000
#define XOS_WriteS   0x020001
#define XOS_Write0   0x020002
</codeblock>

at the same time. Were the X versions encountered first, the &lsquo;non-X&rsquo; versions would instead be added implicitly. 

Files containing SWI names should be passed to <cite>Tokenize</cite> using the <param>-swis</param> parameter. When used alongside the GCCSDK, it may be possible to access it via the GCCSDK environment variables &ndash; for example:

<codeblock>
-swis $GCCSDK_INSTALL_CROSSBIN/../arm-unknown-riscos/include/swis.h
</codeblock>


<subhead title="Tabs, Indentation and Crunching">

<cite>Tokenize</cite> can adjust the indentation and spacing of the BASIC source when generating tokenized output.

By default, any whitespace before a leading line number is ignored. Whitespace following the line number is retained, and used in the output. Thus both

<codeblock>
REM >Example
REM
:
FOR i% = 1 TO 10
  PRINT "Hello World!"
NEXT i%
</codeblock>

and

<codeblock>
 10REM >Example
 20REM
 30:
100FOR i% = 1 TO 10
110  PRINT "Hello World!"
120NEXT i%
</codeblock>

would both result in the same indentation. If a space were left between the line number and the rest of the statement in the second example, then this would be copied to the resulting file.

If tabs are used anywhere outside of string constants, then by default these will be expanded into spaces so as to align with the next tab stop if all the preceding keywords are fully expanded. By default tab stops are 8 columns apart, but this can be changed using the <param>-tab</param> parameter: for example <command>tokenize -tab 4</command> to use four column tabs. It is recommended to configure <cite>Tokenize</cite> to use the same tab width as set in the text editor used to edit the source code.

If <param>-tab</param> is used with a value of zero, then tabs will not be expanded and will instead be passed intact to the output file. Due to the requirement to expand keywords, using tabs can have unexpected effects if abbreviated keywords are used in the source.

The <param>-crunch</param> parameter can be used to make <cite>Tokenize</cite> reduce whitespace within the tokenized BASIC file. In use it operates very much like BASIC&rsquo;s <code>CRUNCH</code> command. It takes a series of letters after it, which indicate what crunching to apply &ndash; in some instances, these are case sensitive.

<definition target="E">
Setting <param>E</param> will cause empty statements to be removed from the file, along with any empty lines (whether already there or created by removing empty statements).
</definition>

<definition target="I">
The <param>I</param> option will cause all start-of-line indentation to be removed (tabs and spaces) so that all lines start in the first column.
</definition>

<definition target="L">
Setting <param>L</param> will cause completely empty lines to be removed from the file; any whitespace will cause them to be retained. This gives compatibility with the behaviour of TEXTLOAD. The <param>E</param> parameter includes the behaviour of <param>L</param>.
</definition>

<definition target="R">
The <param>r</param> and <param>R</param> options allow comments to be stripped from the source code; if used in conjunction with <param>E</param> then any lines which end up being empty will get removed. An upper case <param>R</param> will strip all comments; a lower case <param>r</param> will only strip comments after the first contiguous block of lines containing only <code>REM</code> statements at the head of the first file. In other words

<codeblock>
REM >Example
REM
REM This is the end of the head comment.

REM This line will be removed.

PRINT "Hello World!"
</codeblock>

and even

<codeblock>
REM >Example
REM
REM This is the end of the head comment.
PRINT "Hello World!" : REM This comment will be removed.
REM This line will be removed.
</codeblock>

would both result in

<codeblock>
REM >Example
REM
REM This is the end of the head comment.
PRINT "Hello World!"
</codeblock>

being output to the tokenized BASIC file given a setting of <command>-crunch er</command>.
</definition>

<definition target="T">
Setting <param>T</param> will cause trailing spaces to be stripped from lines, and lines only containing whitespace to be reduced to a single space. This gives compatibility with the behaviour of TEXTLOAD. The <param>E</param> parameter includes the behaviour of <param>T</param>.
</definition>

<definition target="W">
The <param>w</param> and <param>W</param> options allow whitespace to be removed from within lines. Using the lower case <param>w</param> results in all blocks of contiguous whitespace (tabs and spaces) being reduced to a single space, while the upper case <param>W</param> will cause it to be removed completely.
</definition>
</chapter>


<chapter title="Errors and Warnings" file="Errors">

There are a number of error and warning messages that <cite>Tokenize</cite> can generate while running. These can indicate problems with the source BASIC code, or with the availability of files on disc, and are detailed here.

Using the <param>-warn</param> flag will turn on additional warnings relating to the structure of the source code. In addition, adding the <param>-verbose</param> flag to the command line will cause <cite>Tokenize</cite> to produce additional information about what it is doing during linking and tokenizing.

<subhead title="Errors">

Errors are generated during the linking and tokenization process when a problem occurs which can not be resolved. The process will be halted, and must be re-started once the error has been resolved.

<definition target="AUTO line number too large">
Automatic line numbering applied via the <param>-start</param> and <param>-increment</param> parameters has risen above the maximum allowed value of 65279.
</definition>

<definition target="Constant variable &lt;variable&gt; already defined">
Variables can only be assigned as constants on the command line once. If a variable is listed more than once, this error is given.
</definition>

<definition target="Failed to open source file '&lt;file&gt;'">
A source file &ndash; either specified on the command line or via a linked <code>LIBRARY</code> statement &ndash; could not be opened for processing. This could be because it did not have the correct permissions, or because it did not exist in the location specified. Remember that on some platforms, filenames will be case-sensitive &ndash; references that work on RISC&nbsp;OS&rsquo;s case-insensitive Filecore systems might fail on other platform&rsquo;s case-sensitive filesystems.
</definition>

<definition target="Line number &lt;n&gt; out of range">
A line number explicitly specified in a source file is too large or too small. BASIC can only handle numbers between 0 and 65279.
</definition>

<definition target="Line too long">
A line can only be 251 bytes long once tokenized.
</definition>

<definition target="Misformed deleted statement">
If <cite>Tokenize</cite> needs to remove a statement &ndash; such as a linked in <code>LIBRARY</code> or <code>REM</code> that is to be crunched &ndash; then it mist be able to determine that the statement terminates cleanly. If it finds additional text where it is expecting to find a colon or line ending, then an error would be raised. For example

<codeblock>
LIBRARY &quot;LibFile&quot; ELSE
</codeblock>

would cause this error. Note that code which would raise this specific error &ndash; as opposed to warnings relating to the processing of <code>LIBRARY</code> commands &ndash; would almost certainly raise a Syntax Error from BASIC itself, anyway.
</definition>


<subhead title="Warnings">

Warnings are generated during linking and tokenization when an event occurs which the user should be aware of but which may well not prevent the tokenized program from working.

<definition target="Constant variable assignment to &lt;variable&gt; removed">
A variable defined on the command line has been found as the target of an assignment, resulting in the entire statement being removed.
</definition>

<definition target="Line number &lt;n&gt; out of sequence">
A line number explicitly specified in a source file is out of sequence, and would be less than or equal to the number of the line before. This could be due to a clear error such as

<codeblock>
20 REM This will give an error.
10 PRINT "Hello World!"
</codeblock>

or it could be more subtle. In this example

<codeblock>
10 REM >Example
   REM
   REM This will give an error.
20 PRINT "Hello World!"
</codeblock>

the default <param>-increment</param> of 10 will result in the two intermediate lines being given numbers of 20 and 30. By the time the <code>PRINT</code> statement is reached, the line 20 would need to be 41 or greater.
</definition>

<definition target="SYS &lt;name&gt; not found on lookup">
If the <param>-swi</param> option is in force, <cite>Tokenize</cite> failed to find a match for a textual SWI name &lt;name&gt; and therefore could not convert it into numeric form. This could be due to an error in the source file, or it could be because the name does not appear in the lookup table used by <cite>Tokenize</cite>. On RISC&nbsp;OS this could be as a result of the module providing the SWI not being loaded; if SWI definitions have been supplied via the <param>-swis</param> option (on all platforms), then it means that the SWI is not defined in these.
</definition>

<definition target="Unisolated LIBRARY not linked">
If the <param>-link</param> option is in force then in order to be able to remove <code>LIBRARY</code> statements after linking their associated files, <cite>Tokenize</cite> needs to know that they are self-contained. If <code>LIBRARY</code> does not appear at the start of a statement (such as if it appears in an <code>IF ... THEN</code> construct) then it will not be linked. Note that this will not catch <code>LIBRARY</code> as part of a multi-line <code>IF ... THEN ... ENDIF</code> &ndash; be aware that

<codeblock>
IF condition% = TRUE THEN
  LIBRARY "LibCode"
ENDIF
</codeblock>

would be linked, but might have some unexpected consequences.
</definition>

<definition target="Unterminated string">
String constants must be enclosed by double quotes (&quot;...&quot;), and a pair of double quotes (&quot;&quot;) together in a string is treated as a single double quote. <cite>Tokenize</cite> will raise a warning if it reaches the end of a line in a source file whilst thinking that it is still inside a string. Generally unterminated strings should be avoided, but in some circumstances it is possible to write (bad) code which BASIC will accept despite them being present &ndash; for this reason, their presence does not raise an error.
</definition>

<definition target="Variable LIBRARY not linked">
A <code>LIBRARY</code> statement with a variable following it was encountered while the <param>-link</param> option was in force. <code>LIBRARY</code> statements can only be linked if they are followed by a constant string (<code>LIBRARY &quot;LibraryFile&quot;</code>); when followed by a variable (<code>LIBRARY lib_info$</code>), <cite>Tokenize</cite> can not determine the name of the file and will therefore leave the statement in-situ.
</definition>


<subhead title="Optional Warnings">

By using the <param>-warn</param> parameter on the command line, <cite>Tokenize</cite> will emit additional warnings relating to the structure of the source code. At present it can check the use of variables, and the presence of function and procedure definitions in the code. The parameter is followed by a series of letters, which specify the warnings to be enabled &ndash; these may be case-sensitive.

<definition target="P">
If <param>-warn&nbsp;p</param> or <param>-warn&nbsp;P</param> are used, <cite>Tokenize</cite> will check the function and procedure definitions within the source files. A lower case <param>p</param> will report on missing or duplicate definitions; an upper case <param>P</param> will also report on any unused definitions.
</definition>

<definition target="V">
If <param>-warn&nbsp;v</param> or <param>-warn&nbsp;V</param> are used, <cite>Tokenize</cite> will check variable and array definitions.

For variables, a record is kept of each time a variable is assigned to and each time it is read. If a variable is found to have no assignments, then a warning will be raised. If an upper case <param>V</param> is used, a warning will also be given for any variable which is assigned but never read.

Note that a lack of warnings will not guarantee correct execution of the code. <cite>Tokenize</cite> makes no effort to check the order in which reads and assignments are carried out, or to watch for conditional execution. As a result, the following piece of code would fail to execute due to both variables in the <code>PRINT</code> statement being undefined, despite generating no warnings.

<codeblock>
IF FALSE THEN foo% = 19
PRINT foo%, bar%
bar% = 21
</codeblock>

For arrays, <cite>Tokenize</cite> does not track assignments but instead records the number of uses inside and outside of <code>DIM</code> statements. A warning will be given if an array is used without any <code>DIM</code> statement; if an upper case <param>V</param> is used, a warning will also be given for any array that is dimensioned but never used.

As with variables, no attempt is made to check the order of execution: the following would generate no warnings:

<codeblock>
array%(0) = 1
DIM array%(10)
</codeblock>

Multiple instances of an array being dimensioned are also not reported, as this is valid in the context of <code>LOCAL</code> arrays.
</definition>

The warning messages which can be reported are as follows:

<definition target="No definition found for &lt;name&gt;">
This warning, which will be generated if <param>-warn&nbsp;p</param> or <param>-warn&nbsp;P</param> is included on the command line, indicates that the source contains <code>FN</code> or <code>PROC</code> calls for which <code>DEF</code> statements have not been seen. These warnings are likely to be spurious unless libraries are being linked by <cite>Tokenize</cite>.
</definition>

<definition target="&lt;name&gt; defined more than once">
Also generated if <param>-warn&nbsp;p</param> or <param>-warn&nbsp;P</param> is passed on the command line, this warning indicates that two or more <code>DEF</code> statements have been found for the same <code>FN</code> or <code>PROC</code> name.
</definition>

<definition target="&lt;name&gt; is defined but not used">
This warning is only generated if <param>-warn&nbsp;P</param> is included on the command line, and indicates that a <code>DEF</code> statement has been found for an <code>FN</code> or <code>PROC</code> which is never called from within the source code.
</definition>

<definition target="Variable &lt;name&gt; referenced but not assigned">
This warning is generated if <param>-warn&nbsp;v</param> or <param>-warn&nbsp;V</param> is passed on the command line, and indicates that a variable has been found whose value is never assigned.
</definition>

<definition target="Variable &lt;name&gt; assigned but not referenced">
This warning is only generated if <param>-warn&nbsp;V</param> is passed on the command line, and indicates that a variable has been found whose value is assigned but apparently never referenced.
</definition>

<definition target="Array &lt;name&gt; used but not defined">
This warning is generated if <param>-warn&nbsp;v</param> or <param>-warn&nbsp;V</param> is passed on the command line, and indicates that an array has been found which never features in a <code>DIM</code> statement.
</definition>

<definition target="Array &lt;name&gt; defined but not used">
This warning is only generated if <param>-warn&nbsp;V</param> is passed on the command line, and indicates that an array has been found in a <code>DIM</code> statement which never appears anywhere else in the code.
</definition>

</chapter>



<chapter title="Compatibility Issues">

Since there&rsquo;s no definitive syntax for BBC&nbsp;BASIC, the performance of <cite>Tokenize</cite> has been tested against BASIC&rsquo;s <code>TEXTLOAD</code> command throughout development. The test base currently comprises over 3,300 files, of which at present less than 100 differ in their tokenised output when compared to the BASIC from RISC&nbsp;OS&nbsp;5.

The following list details known areas that <cite>Tokenize</cite> differs from BASIC, or which may cause confusion. If you find any others, please get in touch (details at the end of this file) &ndash; short examples in ASCII format which exhibit the problem, plus the version of RISC&nbsp;OS and BASIC that were used, are very helpful in identifying issues. Please be aware that even different versions of BASIC tokenise some pieces of source code in different ways.


<subhead title="Exponential numeric constants">
BASIC&rsquo;s tokeniser does not recognise the exponential part of numeric constants of the form <code>1E6</code>, whereas <cite>Tokenize</cite> does. Both BASIC and <cite>Tokenize</cite> would treat the following statement in the same way

<codeblock>
IF A &lt; 1E6 THEN GOTO 20
</codeblock>

However, if the spaces are removed to give

<codeblock>
IFA&lt;1E6THENGOTO20
</codeblock>

then while <cite>Tokenize</cite> will still generate the expected code, BASIC would run the <code>E6</code> into <code>THEN</code> and prevent its tokenisation.

While the behaviour of <cite>Tokenize</cite> differs from that of BASIC, it seems that it is actually more correct and so has been left in.


<subhead title="Numeric constants for line numbers">
BASIC treats numeric constants in its source differently depending on whether they are considered to be line numbers (such as the <code>100</code> in <code>GOTO&nbsp;100</code>) or just more general numeric expressions (such as the <code>42</code> in <code>L%=42</code>). While this distinction is generally handled correctly, the work done validating <cite>Tokenize</cite> has shown up a couple of confusing issues.

BASIC prior to RISC&nbsp;OS&nbsp;5 (including, presumably, RISC&nbsp;OS&nbsp;6) can incorrectly treat numeric constants as line numbers when they follow <code>TRACE</code> in its function form. Thus affected versions of BASIC would parse

<codeblock>
BPUT#TRACE, 2
</codeblock>

and treat the <code>2</code> as a line number despite it being a numeric parameter to <code>BPUT</code>.

A fix was implemented for this in RISC&nbsp;OS&nbsp;5, but it inadvertently prevented the recognition of other &ndash; valid &ndash; line number constants. One occurrence was conditional statements with the <code>THEN</code> omitted:

<codeblock>
IF A% = 0 GOTO 100
</codeblock>

would fail to spot the <code>100</code> as being a line number. There is a possibility that it also affected constants following <code>RESTORE</code> and <code>ON&nbsp;GOTO</code>-like constructs. A fix for this was implemented in RISC&nbsp;OS&nbsp;5 in June 2014, since when RISC&nbsp;OS&nbsp;5 BASIC and <cite>Tokenize</cite> appear to generate identical code.


<subhead title="The QUIT keyword">
The <code>QUIT</code> keyword has two standard forms: a statement with no parameters, and a function. As such, it can be used to start a variable name: the code

<codeblock>
QUITTING% = TRUE
</codeblock>

would set the variable <code>QUITTING%</code> to be <code>TRUE</code>. In RISC&nbsp;OS&nbsp;5, however, a third form has been added: a statment taking a single parameter (a value to pass back to the caller as a return code). This means that on RISC&nbsp;OS&nbsp;5, this line would tokenise as the keyword <code>QUIT</code> followed by the variable <code>TING%</code> and then a nonsensical <code>=&nbsp;TRUE</code>.

<cite>Tokenize</cite> takes the same approach as RISC&nbsp;OS&nbsp;5, such that <code>QUIT</code> can not be used to start variable names.


<subhead title="Control characters in string constants">
In BASIC, it&rsquo;s valid to include non-printing control characters in string constants if you can find a way to insert them. <cite>Tokenize</cite> will honour any such characters that it finds within string constants, but &ndash; just as with BASIC&rsquo;s <code>TEXTLOAD</code> &ndash; the presence of a newline in a string constant will be seen as the end of the line. This will result in an unterminated string warning, and the very likely failure to parse the following line of the source.

In general, the inclusion of raw non-printing characters in string constants is probably unwise: use can be made of <code>CHR$()</code> to avoid this problem. Similarly, the use of &lsquo;top-bit&rsquo; characters when manipulating BASIC source on other platforms may cause confusion unless care is taken with character sets. Again, the use of <code>CHR$()</code> can be helpful: such as

<codeblock>
PRINT CHR$(169);" John Smith"
</codeblock>

to include a copyright declaration, for example.


<subhead title="Line numbers">
BASIC&rsquo;s interpreter (as opposed to its tokeniser) is surprisingly relaxed about the use of line numbers if no reference is made to them (ie. if there are no <code>GOTO</code> or similar commands in the program). Duplicate line numbers are possible, as are non-sequential numbers (such as lines appearing, and being executed, in an order like 10, 30, 20, 40) &ndash; examples have even been seen where all the lines are numbered zero. The thing that all such BASIC programs share is the fact that they have been generated by some software other than BASIC itself (such as a BASIC linker, cruncher or automated source management tool).

Unfortunately, this relaxed approach to line numbering does not extend to re-tokenising code which has been saved out with such numbers as text. BASIC&rsquo;s <code>TEXTLOAD</code> and <cite>Tokenize</cite> differ in their handling of such source files, but both will generate tokenised files which do not match the original code. In such cases, the only options are to renumber the source text files by hand, or to remove the numbers completely.


<subhead title="Crunched code">
Although not an issue specific to <cite>Tokenize</cite>, it&rsquo;s worth re-iterating that crunching BASIC can be a one-way operation. While BASIC&rsquo;s <code>CRUNCH</code> command and <cite>Tokenise</cite>&rsquo;s <param>-crunch</param> parameter take care to ensure that the code they generate is still valid in ASCII form, most of the third-party crunchers are not concerned with the need to edit the compacted code.

Converting such code to text via <code>TEXTSAVE</code>, a text editor or similar, then passing it to either <code>TEXTLOAD</code> or <cite>Tokenize</cite> will often end in failure. The lack of spaces can easily result in keywords being mistaken for variables and vice versa &ndash; sometimes this will give a &ldquo;line too long&rdquo; error from the tokeniser; alternatively it may not show up until the affected parts of the program get executed by the interpreter.

</chapter>



<chapter title="Version History" file="History">

Here is a list of the versions of <cite>Tokenize</cite>, along with all the changes made.


<subhead title="Test Build">

Initial release for testing and feedback.


</chapter>





<literal mode="Text">

Updates and Contacting Me
-------------------------

  If you have any comments about Tokenize, or would like to report any bugs
  that you find, you can email me at the address below.

  Updates to Tokenize and more software for RISC OS computers can be found
  on my website at http://www.stevefryatt.org.uk/software/

  Stephen Fryatt
  email: info@stevefryatt.org.uk
</literal>
