
		HINTS FOR EFFICIENT PROGRAMMING
	    Copyright (c) 1992 by Omega Point, Inc.
		   Author: Ratko V. Tomic


-------------------------------------------------------------------------
NOTE: The text below is taken from the CodeRunneR (R) manual. CodeRunneR
      is a TSR library for C and assembly programmers, thus many hints
      deal implicitly with size optimization. For more information about
      CodeRunneR call Omega Point (508) 877-1819 or fax (508) 877-0915.
-------------------------------------------------------------------------

CodeRunneR was designed with the greatest attention for efficiency (size
and speed), all functions were carefully crafted in assembly language.
Combined with some care in the C code you can produce programs rivaling
in size to average assembly language programs, especially on medium and
large projects. On very small programs (few hundred bytes), the assembly
language will still be ahead. In terms of speed, the video functions
(at least) should be competitive with pure assembly code.

The C programming hints below have been found useful in reducing the
efficiency gap between C and assembly code.

1. INT or CHAR - Current compilers are notoriously bad in handling
   expressions involving byte size data. The reason for this is an
   overly literal interpretation of C "abstract machine", which mandates
   that the result should be evaluated as if extended to integer. On the
   other hand, the xxx86 processors have a rich subset of instructions
   operating on byte size items. These instructions make great majority
   of "integral promotions" unnecessary. Unfortunately, only a small
   fraction of these are utilized by C compilers, resulting in much more
   expensive operations on char versus int variables. Thus a saving of
   one byte for the variable storage is practically always more than
   paid for in code size.

   The following rules have been found useful in reducing this type
   of inefficiency:

   a) If a char variable is used within expressions, or is being passed to
      functions, or returned by functions, it is better to use int instead.
      This is true even in case when variable is only used as TRUE/FALSE
      flag. If defined as an int, the variable will occupy 1 more byte,
      but for each use one or more bytes are saved in generated code.
      In addition the speed is often improved due to an even address
      alignment of variables, preferred by the 16/32 bit CPU-s.

   b) In some situation rule (a) cannot be applied, since an array of
      characters is the source for byte size variables. In that case
      the characters should be cast as SIGNED whenever possible since
      the instruction CBW used to extend signed char to int is 1 byte,
      while the instruction MOV AH,0 used for unsigned is 2 bytes.

   c) CodeRunner functions which receive char or byte as an argument
      will ignore high byte passed. Hence you can safely pass integers
      without having to mask out high byte. This also allows application
      of the rule (b) independently of char being below or above 0x80.

   d) Auto variables of char type will occupy two bytes on stack
      anyways, hence one should always gain by using int instead.

   e) If an integer variable is to be compared to a char, or passed to
      function expecting char, casting integer to char is better than
      AND-ing it with 0xff.


2. INITIALIZED AUTO variables for strings, arrays, structures or any
   composite objects are very inefficient. Each such object will
   occupy exactly twice the size defined. Also, the compiler will
   on every function entry call an internal transfer function to
   copy the variable from original place into the auto variable.
   Besides wasting time, this transfer function is not usable by
   the C code as a memory copy function. And lastly, the space
   used as the destination is also internal space (stack space),
   which will often place much heavier demand on stack size. You
   can often create a "local copy" of such item in some temporarily
   free buffer (e.g. a disk i/o buffer when disk i/o is idle).

   a) If this type of variables is truly needed (rare case), one may
      use other general purpose memory copy function (like memcpy())
      which is usable for other purposes as well. In case of Turbo C
      this is even more important, since Turbo C uses far call and
      32 bit pointers to perform the copying, even in small model.

   b) In most situations, the intention was only to keep a variable
      local to the function and not to reinitialize variable each time
      function is activated. In that case the variable should be turned
      into static, accomplishing same result in half the size.

      Hence instead of:

        func()
        {
        char msg[]="This is an initialized local string.";
        }

      it is better to use:

        func()
        {
        static char msg[]="This is an initialized local string.";
        }

     This is one example where ANSI C standard has violated a useful
     classic C rule that more efficient constructs are shorter to type.


3. Another case of the same Classic C rule violation is the passing of
   structures by value to/from functions. Such deceivingly efficient
   constructs are wasteful on memory and time even when implemented
   well, and especially in the current C compilers where they were added
   as a marketing feature to boost the count of ANSI compliance points.

   Hence instead of:

     struct x_type func(struct x_type x)
     {
       ....
       process(x.field);
       ....
       return(x);
       ....
     }

   it is more efficient to use:

     void funct(struct x_type *x)
     {
       ...
       process(x->field);
       ...
       return;
      }

   In rare circumstances, where a local copy of a structure is needed in
   order to be modified while preserving the callers copy, it is still
   better to use a general purpose memory copy, rather than have compiler
   link in an internal-use-only transfer function.

4. Uninitialized AUTO variables are more efficient than the equivalent
   static/global variables both in memory (typically 1 byte per access)
   and speed. Additionally, the space used by auto variables is released
   when function exits. Therefore both simple and composite objects
   needed temporarily and locally should be defined as auto variables.

   Similar situation is with arguments a function receives - they are
   equivalent to auto variables in efficiency, thus they are often
   better than global variables.

5. Variables which are OFTEN being passed as arguments between functions
   should be redefined as globals, or at least static for that module.
   This rule needs to be counterbalanced with the previous rule - the
   decision parameter is "often". Namely each access to a global/static
   costs on average 1 extra byte (compared to access of a function
   argument), but the cost of passing such variable is on average 5 bytes.

6. If a variable occurs in the code as say: n-1, two or more times, it
   is better to define another variable, n1=n-1 and use n1 instead.
   Just the code space saved on the single re-use pays for the
   extra space of variable n1. The program speed will benefit too.
   This rule is applicable to char, int, long variables as well as
   pointers to any objects. More complex expressions (than n-1)
   will benefit even more.

   Note also that the assignment (n1=n-1; above) should be combined
   with the first use of (n-1):

   For example instead of:

     y = f(x) + n - 1;
     z = v * (n-1);

   use:

     y = f(x) + (n1=n-1);
     z = v * n1;


7. 2-D arrays are not efficiently handled by C compilers. Each access
   will produce multiply instruction to compute an offset into the array.
   If a 2-D array is used mostly in a sequential manner (like screens,
   edit buffers, matrices, etc.) one should define them as 1-D arrays
   and compute starting offset explicitly, and from then on use ++,--
   operators (on either pointers or indices).

8. Larger expressions are compiled more efficiently than the equivalent
   step-by-step sequence of expressions. This is also true for a function
   invocation within an argument of another function invocation. The
   reason for this is the freedom compilers have when evaluating
   expressions - compiler need not worry about side effects occurring
   during evaluation, their sequencing is in most cases undefined, thus
   compilers can ignore aliasing. On the other hand, the equivalent
   sequence of simple expressions has so called "sequence points" after
   each step, requiring often that the intermediate results be stored
   in memory. In other words the sequence of small expressions places
   additional, often useless, constraints on evaluation.

   The drawbacks to be weighed in a decision are higher susceptibility
   to errors, lower readability and lower flexibility of complex expressions.

9. Functions which accept string pointers and their lengths as arguments,
   should have lengths placed before pointers. This is advantageous when
   the following invocation method is used:

   char *s;
   ....
     str_func( str_len(s), s);
   ....

    The reason for this is the order in which arguments are pushed on the
    stack - in case of C parameter convention this is from right to left.
    Thus in the example above, the code will fetch the value s and push
    it twice than call str_len(). Had the argument order been opposite,
    the value s would be accessed (computed) twice.

10. When segments and offsets are defined within structures, offset should
    be placed first, followed by the segment. When passing far pointers
    as a separate words for offset and segment, offset should be first.
    The target function should receive this pair of words as the far
    pointer, and will access it using single instruction (LES/LDS).

11. If the total amount of data used by program exceeds 64K, the bulk of
    this space is normally occupied by arrays (programs with 30,000 simple
    variables are rare). The largest of these arrays should be moved into
    far heap (see section on (far) memory allocation). This is always more
    efficient than switching globally to large-data models.

    CodeRunner is compiled in small data model, allowing for 64K of data
    and 64K of code using near access, and up to memory size of far data.
    The library functions can move data to/from far memory. One can also
    use far pointers available in Turbo and MS C compilers (mixed models).

12. Accessing array elements via pointers (as opposed to index) is more
    efficient, especially if arrays have int elements (or larger).
    This only holds if the pointer is used without additional offset,
    i.e. as *p instead of *(p+i) or p[i].

13. If an index is used to access array elements, global arrays
    will compile more efficiently than arrays defined via starting
    pointer. Hence if an alternatives are p[i] and a[i] (p is a
    pointer, a[] is an array), the array is preferable. This should not
    be confused with the "abstract C machine" which mandates that array
    references are converted to pointers - the actual code does make
    distinction. The reason is the same as for the constants being
    more efficient than variables - in this case a[] is a constant,
    while p is a variable, both representing some memory address.

    One can unify this and the preceding rules as: The pointers are
    more efficient than arrays if they follow the array elements,
    and less efficient if they are fixed (a variable playing role
    of a constant).

14. Loops which always execute at least once, are best implemented
    as do {...} while() loops, as opposed to for() {...} loops or
    while() {...} loops. The saving is 2-3 bytes.

15. Switch statement is often implemented more efficiently using two
    arrays - one with words (or preferably bytes) containing case values,
    and another array containing function pointers for the matching cases.
    There are three reasons for this:

    A. Switch statement will often generate code that does exactly
       same thing: it has the two arrays, both with 16-bit elements
       plus the code which scans the first one (case values) then
       dispatches from the second one. You can not only use common
       search function, but also often reduce the case values to 8 bits.

    B. Actions done for particular case can be delegated to functions
       which are reusable in other switch statements or other code.

    C. For each case there will be a jump code to join the common
       end point of the switch statement - this is 2-3 byte instruction.
       With the suggested replacement, the RET instruction is used
       which is always 1 byte.

    This method also has a side benefit of allowing variables for case
    labels. Within the context of CodeRunner, this method is useful for
    hot-key response. The hot-key service routine receives as an argument
    the hot-key which has triggered TSR activation. The CR.LIB function
    word_pos() when applied to hot-key and the array of all hot-keys
    will return 1-based index of the hot-key in the array. The matching
    array of function pointers should have first pointer set
    to an error function (no match case), and the rest in the same
    order as the hot-key array. The whole decision process could
    look as follows:

    word hot_key_list[] = {  key1, key2, key3 }; /* Hot keys          */
    fp dispatch[]={err_func,func1,func2,func3 }; /* Function pointers */

    hot_key_service(key)
    {
     .....
     dispatch[word_pos(key,hot_key_list,3)]();
     .....
    }

    When a switch statement is coded in this manner, the err_func()
    should replace the default case of the switch statement.

    As usually there is a tradeoff involved - the switch statement allows
    access of auto variables belonging to the same function. If these
    have to be recoded as globals, or passed to individual case functions
    the overhead for globals (or function entry frame setup) may offset
    the benefits.

16. When left to choose register variables, compilers will generally
    pick the first two integer variables. You may override the default
    choice by weighing space/time priorities. If the space is important
    select variables which occur (literally) most frequently within a
    function. If the speed is important (and you don't use assembler)
    select variables used most frequently in the inner-most loop within
    a function.

17. For pop-up applications, compiler should be set to optimize for size,
    since the bulk of time consuming (low level) screen handling is done
    by the CodeRunner functions. In some cases the size optimization
    will actually produce faster code than speed optimization due to
    compilers disregard of CPU prefetch queue effects. And in almost
    all cases the speed difference will be beyond human perception.

18. Compiler switch for stack-overflow checking should be disabled for
    TSR application. Namely, the normal action of the compiler will
    be to abort the program and exit to DOS. This obviously will
    not work once program goes resident, since officially the program
    has already "exited" once. Thus all the checking code is at best
    useless in this environment (CodeRunneR library substitutes checking
    and exit function with a neutral function). An alternate approach
    is offered by a pair of CodeRunner functions _watch_stack() and
    _unused_stack(). These functions are intended for the test stage
    of development, when they can help determine minimal, but still
    safe stack size. They should be removed from the final program.

19. When selecting functions from the library, if there is a tossup
    choice between the two functions, you should select one already
    used somewhere else in the code (or at least the one more likely
    usable elsewhere). When the choice is not a tossup, but requires
    extra C code to make it use the same function, one should use the
    best suited library function. In other words, one should avoid
    variety used for variety sake.

20. The CodeRunner video functions are fast enough to make redisplaying
    of a line or a window an acceptable alternative to trying to repaint
    only the changed portions of the screen. This can save a great deal
    of code space.

21. For the same reason as in previous rule, the nested windows may be
    implemented without buffers to save an overlayed portion of another
    window. Calling the same function which drew the overlayed window to
    start with, will save on the data space. In case of two levels of
    nesting, a swap_blk() functions may be used to swap the screen block
    and another buffer.

22. Goto statement is still legal in C language. Your C compiler will
    never generate more than 3 bytes per goto (most often just 2).
    In the same time, even a single variable used to control the early
    exit from nested loops will generate 4 bytes to get set and 4-5
    bytes for each test, and then 2-3 bytes to jump anyways.

23. CodeRunneR's data compaction will not dispose strings specified
    as function arguments. These strings should be defined in the
    disposable data module, and accessed as an external variable
    from the disposable code module.

    For example, instead of:

      main()
      {
 	dsp("Signon Screen");  /* This string is not disposable */
      }

    it is more efficient to use:

      extern char msg[]; /* Defined in disposable data module */

      main()
      {
    	dsp(msg);
      }

    In assembler you can just define the string within the same module,
    but in _IDATA segment to make it disposable.

24. Educational C texts providing lists of DO-s and DON'T-s for portable,
    modular, readable, structured and object-oriented coding practices
    are a rich source of DON'T-s and DO-s for efficient programs.

    This advice should be taken with grain of salt. Namely, when sufficient
    effort has been spent polishing a program and when it looks that no
    more improvements are possible using "noble" methods, the point of
    tradeoffs has been reached. Only at this point, if further size/speed
    improvements are needed, one should look at the DON'T-s from such
    lists (or DO-s in your code) and choose the least of evils which will
    help achieve the desired performance.

    Violation of the good practices before the tradeoff point is sloppy
    rather than efficient programming.


25.  ELSE statement generates a JMP instruction around the ELSE-block.
     This can be eliminated as shown in the following example:

     Instead of:

        if (x<y) z=10;
	  else z=0;

     Use more compact form:

	z=0;
	if (x<y) z=10;

     NOTE: This also applies to ?: operator, which is just a compactly
     written variation of IF-ELSE statement.

26.  ELSE IF statement may often be skipped gaining in space at the
     expense of the execution speed. This is especially useful when the
     statements within each conditional block are simple assignment or
     single integer operator like +,-,^,|,&,++,--. In such case there is
     almost no speed penalty for the space gained.

     For example:

	if ((c>='a')&&(c<='z')) c-=0x20;
	  else if (c<0x20) c=0x20;

     can be replaced with seemingly redundant, but more compact check:

	if ((c>='a')&&(c<='z')) c-=0x20;
	if (c<0x20) c=0x20;

     This type of savings is especially useful in keystroke processing code
     which, due to human typing speed, need not be very fast.

     In addition to greater compactness, the decoupling of conditional
     statements makes code simpler to shuffle around.


27.  EXTRACTION of common expressions into functions can often eliminate
     lots of repetitive expressions. Simple call to a function with no
     arguments takes only 3 bytes. Any simple assignment will take at
     least that much (usually more). Therefore, any pair of assignments
     which repeats in the code 2 or more times can be made into a function.

     For example:

	...
	crs_x=X0; crs_y=Y0;	/* Put cursor in the corner */
	...
	crs_x=X0; crs_y=Y0;	/* Reset cursor to the same corner */
	...

     can be done in less space using:

	...
	corner0 ();
	...
	corner0 ();
	...

      corner0 ()
	{
	crs_x=X0; crs_y=Y0;
	}

     This example will save space even if Y0 is not used, and X0 is some
     global variable.


28.  The previous hint is an example of TIME/SPACE tradeoff. One can
     generalize the simple case above to any piece of code used more than
     once within a program. Such space optimizations are often done after
     the program is already finished, and can be done almost mechanically.
     Pieces of code which are similar, can be often made identical and
     then replaced with a function. Variables originally placed arbitrarily
     can be reorganized into records, allowing for simpler argument passing
     to common functions.

     In interactive applications, the response speed faster than the screen
     refresh time (around 20 ms) is without any visible effect. Since the
     function replacement method described above adds under 20 microseconds
     even on a 4.77 Mhz IBM PC, then fifty such replacements will add only
     one millisecond to the execution time, which is humanly unperceptible
     difference. On the other hand, few hundred bytes saved are certainly
     perceptible with ever growing DOS with each new version.

     The function-blocking method has an additional benefit that the code
     is easier to read and modify, especially if function names are
     well chosen. The resulting code will look as if designed top-down,
     without having to think through entire program up-front.


29.  EXAMINE GENERATED CODE by making compiler produce ASSEMBLER output.
     Most of todays C compilers have a switch to generate ASM file (-S for
     Turbo C and /Fa for Microsoft C). If you are familiar with assembly
     language, this exploration will show which constructs need
     improvements. Typically, you would look for expressions in which
     compiler extends char or byte to int or word (instructions like
     CBW or MOV AH,0 or XOR AH,AH). Such expressions can usually be
     typecast so that no extension instructions are generated.
     Other things to watch are Jumps-Around-Jumps necessitated by
     large conditional statement blocks. Reducing the size of these
     blocks by redistributing some work-load to functions will produce
     smaller and cleaner code (perhaps slightly slower).

30.  Arithmetic expressions can often beneficially replace IF/ELSE
     decisions.  Consider case of PING-PONG buffers: A pointer (char*)p
     is toggled between buffers buf1 and buf2. The switching of the
     pointer is usually done as:

	if (p==buf1) p=buf2;
	  else p=buf1;

     More efficient construct is:

	p = (char*) ((word)p ^ ((word)buf1 ^ (word)buf2));

     NOTE: The cast to word is done to allow use of ^ operator which is
     generally not defined on pointers. In this case it is well defined.
     In practical use the buf1^buf2 would be a constant computed in
     disposable code.

     This XOR trick can be used whenever some (integer) variable x needs to
     be toggled between two specified values a and b. You simply create
     variable ab=a^b, and toggle x via x^=ab; .

     In many cases the control variables could be defined in such way
     that the plain arithmetic could be applied to produce desired
     result.

31.  Along the lines of previous example, you should check the bit
     patterns of variables occurring in the program. There are often
     surprising properties of binary 2-complement arithmetic which can
     greatly optimize the code size and speed.

     A trivial example is replacement of say (x % 8) with (x & 7), or
     the same way for any power of two.

     Less obvious is, for example, the task of extracting the bit mask
     for the right-most bit set in a word.

     word x,mask;

     Instead of:

          mask=1;
          do if (mask & x) break;  /* Test if non-zero bit found */
            while ((mask<<=1)!=0);

     Use:
          mask = -x & x;


     Similarly if you are checking if a non-zero integer is power of two
     (i.e. it has exactly 1 bit set) you can use:

     #define is_power_of_2(x)  ( (x) & ((x)-1) )

     Thus to count bits in a word x, instead of checking all bits via
     shifting a bit mask, you can use the following:

     for (cnt=0; x; ++cnt) x &= x-1;

     This will produce bit count (cnt) on average twice as fast as the
     simple shift and test method.

     NOTE: The expressions above work for signed and unsigned integral
     values (char, int, long).


32.  Integer arithmetic in C is modular (also in assembler). Although
     the ANSI C guarantees this only for unsigned types, in practice
     on most processors (including 8086,68000,...) this is also true
     for signed integer variables (consequence of the binary 2-complement
     representation).

     This can be most often used when a bit-pattern is shifted to the
     left (also to the right for unsigned) to eliminate additional
     count variable set to 8, 16 or 32.

     Thus instead of:

         for (i=0,mask=1; i<16; i++,mask<<=1) {...}

     you can often use (assuming you need a mask):

         mask=1;
         do {
         ...
         } while ((mask<<=1)!=0);

     This way the mask doubles as a bit mask and as a counter.

     NOTE: We have also applied above the rule for loops which always
           execute at least once, therefore do-while form is used.


33.  Another consequence of modular arithmetic is that pointer/index
     overflow around 64K can be checked without resorting to long
     arithmetic.

     Thus instead of:

       char *p;
       word step; /* Assumed forward step */

         if (((long)p+step)<0x10000L) p+=step;

    You can use:

         if (p+step>p) p+=step;

    In case of backward step (check for wrap around 0) you can use:

         if (p-step<p) p-=step;


35.  You can avoid long arithmetic when subtracting from 64K.

     Thus instead:

       word used,free;
         ...
         free = 0x10000L-(long)used;
         ...

     you can do:

         free = -used;


34.  Loop counters are more efficient when counting down to 0 rather than
     from 0 up.

     Hence instead of:

        i=0; do {....} while (++i<N);

     it is better to use (whenever possible):

        i=N; do {....} while (--i);

     The latter often exploits CPU zero flag when decrementing i.
     The more recent C compilers also utilize LOOP instruction.


35.  Pre-increment/decrement operators are often more efficient than the
     post-increment/decrement version. This is valid mostly when these
     operators are combined in the expressions using compare (like ++i<N),
     and has no effect if the simple ++i; or --i; is done (without compare).
     The reason for this is that the processor flags are changed by ++ or --,
     thus the (i++ < N) must generate an extra code (usually a JMP) to
     prevent ++ from destroying the result of compare.


36.  Function argument passing has dual overhead - an overhead for pushing
     each argument on the stack, plus an overhead for clearing arguments
     from the stack. The latter is 3 bytes for MSC for any number of
     arguments (2 bytes for 1 argument in recent versions), but for Turbo C
     it is 1,2 or 3 bytes for 1,2 or 3 and more arguments, respectively.
     Thus to reduce stack clearing overhead it is useful (especially for TC)
     to keep number of arguments to 1 or 2. Of course, any reduction helps
     also in the argument pushing phase.

     One way to reduce number of arguments is to replace them with globals.
     As mentioned earlier, this has a drawback of more expensive global
     (or static) variable access. Another way is to use structures (or
     arrays) which hold several variables, and pass only the pointer to
     the structure (array).  The third method is useful when several
     arguments are passed via a chain of nested calls.

     For example instead of:

       func1(int a,int b,int c,int d)
       {
          ....
          func2(b,c,d);
          ....
       }

       func2(int x,int y,int z) {...}


     you can use:

       func1(int a,int b,int c,int d)
       {
          ...
          func2(&b);
          ...
       }

       func2(int *p)
       {
       /***  x is p[0], y is p[1] and z is p[2] ***/
       }

    Structures can be in the same manner for more complicated argument
    lists (or for code documentation purposes).

    NOTE: You should not use this for independent auto variables inside
          the func1(), such as int e,f,g; since the variables are not
          necessarily allocated in the order of declaration. On the
          other hand, the passing of arguments to functions is
          additionally constrained with the CPU stack discipline for
          C convention calls, thus it has well defined behavior across
          MS-DOS C compilers.


37. If there are 2 or more occurences of some function call, check the
    sourrounding statements. There are aften common pieces of code just
    before (or after) the call. The common section should be moved into
    the function. If the surrounding code is similar, one can often
    modify it by replacing some auto variable with a global, thus making
    it identical so that it can be moved to the function.

    If the code surrounding the call appears in the same form in several,
    but not all occurences consider what effect it would have to move it
    into the function anyways. Often the places which don't have the
    common piece are not affected by such code being executed, thus it
    can be safely moved into the function.

    The gain with this rule is even greater than the one from hint #27
    since the call already exists.


38. A switch statement may have two or more cases which are similar
    except for some variables being assigned different value. The
    similar case-blocks can be made to reuse same code (i.e. fall thru)
    by initializing the variables to one of the cases and modifying
    it in the remaining cases.

    For example this code:

      switch (i)
        {
        ...
        case KEY0:   func(0,a,b,c);
                     break;
        case KEY1:   func(1,a,b,c);
                     break;
        case KEY2:   func(2,a,b,c);
                     break;
        ...
        }

     can be optimized (for size) as follows:

      x=0;
      switch (i)
        {
        ...
        case KEY2:   x++;
        case KEY1:   x++;
        case KEY0:   func(x,a,b,c);
                     break;
        ...
        }

