Performance issues
==================

The StrongARM has significantly different performance characteristics to
older ARM processors. It is clocked 5 times faster than any previous ARM,
and many instructions execute in fewer cycles. In particular:

    *  B/BL take 2 cycles, rather than 3
    *  MOV PC,Rn and ADD PC,PC,Rn,LSL #2 etc take 2 cycles rather than 3
    *  LDR takes 2 cycles (from the cache) rather than 3, and will take
       only 1 cycle if the result is not used in the next instruction.
    *  STR takes 1 cycle rather than 2, if the write buffer isn't full
    *  MUL/MLA take 1-3 cycles rather than 2-17 cycles.
    *  Many instructions will in fact take only one cycle provided the
       result is not used in the next instruction.
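
The load-use rule above suggests issuing a load one step ahead of the
point where its result is needed. The following is a minimal sketch in C,
not a definitive implementation: the function name sum_words is
hypothetical, and a compiler is free to reorder the operations. In
hand-written assembler the same idea means placing an independent
instruction between an LDR and the first use of its result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n words, fetching the next word before
   the current one is added, so that a load result is not needed by
   the operation that immediately follows the load. */
uint32_t sum_words(const uint32_t *p, size_t n)
{
    uint32_t total = 0;
    if (n == 0)
        return 0;

    uint32_t current = p[0];          /* first load, outside the loop */
    for (size_t i = 1; i < n; i++) {
        uint32_t next = p[i];         /* issue the next load early */
        total += current;             /* use the previously loaded word */
        current = next;
    }
    return total + current;           /* add the final word */
}
```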

For fuller information see the StrongARM Technical Reference Manual,
available from Digital Semiconductor's WWW site (currently at
http://www.digital.com/info/semiconductor/dsc-strongarm.html)

The StrongARM's cache and write buffer are also significantly better than
those of previous ARMs, allowing an average fivefold speed increase despite
the unaltered system bus. Pumping large amounts of data will still be
limited by the system bus, but advantage can be taken of the write buffer
to interleave a large amount of processing with memory accesses. For
example, on StrongARM it is quicker to plot a 4bpp sprite to a 32bpp mode
than to plot a 32bpp sprite to a 32bpp mode; the latter case is pure data
transfer, while the former is less data transfer with interleaved (ie
effectively free) processing.
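
The 4bpp case can be sketched as follows: each byte read yields two
pixels, and the palette lookups happen between the 32-bit writes. This is
an illustrative sketch only. The function name expand_4bpp_to_32bpp, the
palette argument, and the pixel order (low nibble is the leftmost pixel)
are assumptions, and the pixel count is assumed to be even.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: expand 4bpp pixels to 32bpp through a
   16-entry palette. Assumes the low nibble of each byte is the
   leftmost pixel and that npixels is even. */
void expand_4bpp_to_32bpp(const uint8_t *src, uint32_t *dst,
                          size_t npixels, const uint32_t palette[16])
{
    for (size_t i = 0; i < npixels / 2; i++) {
        uint8_t pair = src[i];                   /* one read, two pixels */
        dst[2 * i]     = palette[pair & 0x0F];   /* low nibble first */
        dst[2 * i + 1] = palette[pair >> 4];     /* then high nibble */
    }
}
```

Only one eighth as much source data is read as in the 32bpp case; the
lookups are the processing that the write buffer allows to overlap with
the writes.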

The long cache lines of the ARM710 and StrongARM can impact performance.
A random read or instruction fetch from a cached area will load 8 words
into the cache; this can make traversal of a long linked list inefficient.
It is also often worth aligning code to an 8-word boundary. In current
versions of RISC OS, modules are loaded at an address of the form 16*n+4.
Future versions of RISC OS will probably load modules at an address of the
form 32*n+4, so it is worth aligning your service call entries appropriately
in preparation for this change.
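
The alignment arithmetic can be sketched as below. The helper name
pad_to_line and its parameters are hypothetical; it only computes how many
bytes of padding would place a given module offset on a cache line
boundary, given a load address of the form line*n + bias.

```c
#include <stdint.h>

/* Hypothetical helper: bytes of padding to insert before the code at
   module offset `offset` so that it lands on a `line`-byte boundary,
   given that the module is loaded at an address of the form
   line*n + bias. */
uint32_t pad_to_line(uint32_t offset, uint32_t line, uint32_t bias)
{
    uint32_t misalign = (offset + bias) % line;  /* bytes past a boundary */
    return (line - misalign) % line;             /* 0 if already aligned */
}
```

With line = 32 and bias = 4, an entry is aligned when its offset within
the module is 28 modulo 32; for example, pad_to_line(0x40, 32, 4) gives 28.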

Two significant disadvantages of StrongARM compared with previous
processors are:

    1) Burst reads are not performed from uncached areas. In particular
       this means that reads from the screen are slower on the StrongARM
       than on previous ARMs. A future version of RISC OS may address this
       by marking the screen cacheable before reading (eg in a block copy
       operation). Also, burst writes are not performed to unbuffered
       areas.
       
    2) Code modification is expensive. You can modify code, but a
       complete SynchroniseCodeAreas can take of the order of half
       a millisecond (ie 100000 processor cycles) to execute, and will
       flush the entire instruction cache. Thus use of self-modifying
       code is strongly deprecated; a static alternative will almost
       always be faster. Synchronisation of a single word (eg modifying
       a hardware vector) is cheaper (of the order of 100 processor
       cycles) but still requires the whole instruction cache to be
       flushed.
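
One possible static alternative to patching instructions is to select
between fixed routines once and call through a pointer. The sketch below
is an assumption about how such code might be structured, not a method
taken from RISC OS; all names in it (blend_fn, select_blend, plot_row) are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical example: two plotting variants selected through a
   function pointer instead of by modifying instructions. */
typedef uint32_t (*blend_fn)(uint32_t dst, uint32_t src);

static uint32_t blend_overwrite(uint32_t dst, uint32_t src)
{
    (void)dst;                /* destination is ignored */
    return src;
}

static uint32_t blend_xor(uint32_t dst, uint32_t src)
{
    return dst ^ src;
}

/* Choose the variant once. No code is modified, so no instruction
   cache synchronisation is needed. */
blend_fn select_blend(int use_xor)
{
    return use_xor ? blend_xor : blend_overwrite;
}

/* Apply the chosen variant to a row of n pixels. */
void plot_row(uint32_t *dst, const uint32_t *src, int n, blend_fn blend)
{
    for (int i = 0; i < n; i++)
        dst[i] = blend(dst[i], src[i]);
}
```

The indirect call has a cost of its own, so in an inner loop it may be
better to select between whole loops than between per-pixel routines.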

Note that future processors will no doubt have different performance
characteristics again; you shouldn't optimise your code too much for one
particular architecture at the expense of others. However, you should now
have a better idea of how to get good performance from a StrongARM.
