Artificial Neural Networks

There are 5 sections to this textfile

A   Background
B   Outline of backpropagation algorithm
C   Designing a network
D   Suggested Experiments
E   Instructions for use

                          

A  Background

The design of a normal computer has little to do with the design of a
nervous system. However, some workers have suggested that the
computing industry has much to learn from biology when it comes to
designing machines that can solve problems.

Artificial Neural Networks (ANN) are one attempt to use brain 'models'
as the basis for a computer. There have been several periods of
interest in them since the 1940s. The earliest work is that of the
physiologists McCulloch and Pitts (1943). They attempted to analyse
the nervous system using mathematical logic and were able to describe
and make an electronic binary-decision neuron, as a model of a nerve
cell. In a real nerve cell the effect of the synapses is variable;
this allows the neuron to demonstrate some adaptability, ie learning.
In the McCulloch and Pitts model the inputs to the 'nerve cell' came
via variable resistors. Thus, by changing the resistances, also called
the weights, the cell could be made adaptable.

The inputs feed into a summing device (an amplifier) which adds the
inputs together. The summed input is passed to a threshold device
which decides if the input is above some critical value. If it is
above, the cell will output a voltage, ie it will fire. In logic terms
it would fire to indicate true and not fire to indicate false. The
system can, therefore, be said to be a binary decision neuron. Note
that the inputs can be either negative (inhibitory) or positive
(excitatory). Many of the terms from the McCulloch and Pitts model
have remained in the field of ANN.
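
The summing and threshold steps described above can be sketched in a
few lines of Python. This is only an illustration: the function name,
the weights and the threshold value are assumptions chosen for the
example, not values from the original McCulloch and Pitts device.

```python
def mcp_neuron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Two excitatory inputs (positive weights) and one inhibitory input
# (negative weight). All values are illustrative assumptions.
weights = [1.0, 1.0, -2.0]
threshold = 1.5

fires_with_both = mcp_neuron([1, 1, 0], weights, threshold)        # weighted sum 2.0
silent_with_one = mcp_neuron([1, 0, 0], weights, threshold)        # weighted sum 1.0
silent_when_inhibited = mcp_neuron([1, 1, 1], weights, threshold)  # weighted sum 0.0
```

Changing the weights changes which input combinations make the cell
fire, which is the sense in which the cell is adaptable.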


In 1949 Hebb (a neurobiologist) formulated ideas about how neurons
were connected to each other. He realised that the connections, the
synapses, must carry signals and that the strength of a connection
could be important since a weak connection could degrade the signal
whereas a strong connection could enhance it. He theorised that
associative memory lies in the synaptic connections between nerve
cells. (Note how this differs from a von-Neumann computer; information
is not stored in specific locations, neurons, but in the connections
between them.) From this he developed a theory that allowed a neural
network to learn. Hebb's rule states that the connection between a
pair of neurons which are active together becomes stronger by synaptic
(weight) changes. The result is reinforcement of those pathways in the
brain.
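
One common way of writing Hebb's rule as arithmetic is to increase a
weight by an amount proportional to the product of the two neurons'
activities. The sketch below assumes that form; the function name and
the rate of 0.1 are hypothetical and are not taken from Hebb's work.

```python
def hebb_update(weight, pre, post, rate=0.1):
    """Strengthen the connection only when both neurons are active together."""
    return weight + rate * pre * post

w = 0.0
# (pre, post) activity pairs; only the first and last are co-active.
for pre, post in [(1, 1), (1, 0), (0, 1), (1, 1)]:
    w = hebb_update(w, pre, post)
```

Only the two co-active presentations change the weight, so the
connection ends up stronger than it started.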

The next major development came about in the 1950s and 60s with the
work of Rosenblatt who developed a 'machine' called a perceptron (see
later). This was a network of artificial neurons which could
automatically learn to recognise simple objects. This was a time of
great excitement and expectation. Unfortunately the whole field of
ANNs was to receive a major setback in 1969 when Minsky and Papert
identified some fundamental problems with the perceptron. They proved
that it was unable to decide whether two inputs have an exclusive-or
(XOR) relationship. An XOR relationship is a concept from mathematical
logic. It exists when exactly one of two values is true. It can be
illustrated by imagining a robot whose job is to drill a single hole
in a piece of metal. The robot picks up a piece of metal; if there is
no hole one should be drilled, if there is a hole the robot should do
nothing. A perceptron could not make such a decision. This criticism,
combined with the general damning of AI work by the Lighthill report
in the UK, led to a general collapse of ANN work. However, in the
1980s interest revived, helped by Hopfield's work on feedback networks
and by the development of the Back Error Propagation method. Since
then ANNs have received much more attention. The last few years have
seen a big increase in their application to many problems, including
a number of biological ones.
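
The XOR limitation can be illustrated (not proved) with a small search.
The sketch below tries every combination of two weights and a bias on a
coarse grid for a single threshold unit; the grid, the function names
and the use of a bias term are assumptions made for the example. It
finds settings that reproduce AND but none that reproduce XOR.

```python
from itertools import product

def threshold_unit(x1, x2, w1, w2, bias):
    """Single threshold neuron with two inputs."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0

def solvable(truth_table, grid):
    """True if some (w1, w2, bias) on the grid reproduces the whole truth table."""
    for w1, w2, bias in product(grid, repeat=3):
        if all(threshold_unit(x1, x2, w1, w2, bias) == out
               for (x1, x2), out in truth_table.items()):
            return True
    return False

grid = [i / 2 for i in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_ok = solvable(AND, grid)
xor_ok = solvable(XOR, grid)
```

A finer grid would not help: no single straight line can separate the
two XOR classes.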


The Perceptron

The perceptron was a direct attempt to mimic part of the mammalian
visual system. It consists of a number of association units
(A1, A2, ... An); each of these units extracts specific localised features
from an input device. This design is based on work by Hubel and Wiesel
who had produced an explanation of the early stages of vision. They
found specific groups of cells which responded in an apparently
pre-programmed way to simple events that occurred in a limited area of
the retina. Thus, there were cells which responded to the edge of a
straight line. These were the inspiration for the associative units.
Each of these associative units feeds via a variable resistor, so that
the weight can be changed (learning can occur), into a McCulloch and
Pitts cell. Remember that Hebb said that neurons which were active
together would develop strong interconnections. It is possible to
imagine what happens the first time the eye 'sees' a straight line. A
number of neurons, presumably some random selection, would produce an
output. According to Hebb's ideas the interconnections between these
active cells would become stronger, thus reinforcing their joint
activity. It is perhaps not too surprising that Hubel and Wiesel found
such groups.

The usual example used to illustrate the working of a perceptron is
differentiating between a T and an H shape. The perceptron has two sets
of associative units. Set Ah responds to horizontal fields, set Av to
vertical fields. Each of these units will fire if two or more of its
inputs are active (ie black). The perceptron would be trained, ie the
weights would be adjusted to give the optimum responses for T and H
shapes. Differentiating between the shapes will depend on a 'score'
generated by the perceptron; this will be determined by the weights
and active units. Thus, if the maximum score is to be achieved for a T
shape, Ah1 and Av2 would have the largest weights. If the H shape is to
have the minimum score then Av1, Av3, Ah1, Ah2 and Ah3 would all have
low weights, possibly negative. However, Ah1 is common to both shapes
so it cannot be used to differentiate between them; its weight would
therefore be adjusted to 0. How many other shapes, eg L, would be
misclassified as a T or H will depend upon the threshold function, ie a
cutoff score: anything above the score is a T, anything below is an H.
A threshold of about 0 would work reasonably well. Remember, unlike a
von-Neumann computer,
nobody programmed the rules for deciding that shape 1 is a T, the
conclusion derives directly from the strength of the connections
(weights).
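
The scoring scheme can be sketched as follows. The 3 x 3 input grid,
the pixel patterns and the particular weight values are all assumptions
made for illustration (the original figure is not available in this
text file); only the 'fire if two or more inputs are active' rule and
the threshold of 0 come from the description above.

```python
def unit_outputs(grid):
    """Return [Ah1, Ah2, Ah3, Av1, Av2, Av3]; a unit fires if >= 2 of its inputs are black."""
    rows = [1 if sum(row) >= 2 else 0 for row in grid]
    cols = [1 if sum(col) >= 2 else 0 for col in zip(*grid)]
    return rows + cols

# Assumed weights: Ah1 is shared by both shapes so it is set to 0, Av2
# supports T, and the remaining units count against T.
WEIGHTS = [0.0, -1.0, -1.0, -1.0, 1.0, -1.0]

def score(grid):
    return sum(a * w for a, w in zip(unit_outputs(grid), WEIGHTS))

def classify(grid, threshold=0.0):
    return "T" if score(grid) > threshold else "H"

T_SHAPE = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]
H_SHAPE = [[1, 0, 1],
           [1, 1, 1],
           [1, 0, 1]]
L_SHAPE = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 1]]
```

With these assumed weights the T scores 1, the H scores -4 and the L
scores -2, so an L would be classified as an H.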

This type of information storage system is very different from that
used in conventional computers in which the memory is arranged by
reference to its addresses and not content. In order to retrieve
information from the memory of a conventional computer the address at
which it is stored must be known. In a neural network the information
is held in the connections and their weights, ie it is dispersed.
Memory of this type is said to be content addressable. This is because
information can be retrieved by presenting the network with a fragment
of the original. This has some similarities to ways in which memories
may be stored in brains.

ANNs have undergone a great deal of development; there are many models,
some of which now have little relationship to real neural networks.


Major Network Topologies

Neural networks can be classified by their topology (connections
between neurons) and the training algorithm (how weights are
adjusted). There are two main topology subdivisions: feed-forward and
feedback topologies.

Feed-forward (or associative) nets are those in which there are no
closed loops. A closed loop exists when the output from one neuron is
taken as part of the input for the same or a previous neuron. It can be
easily demonstrated that single layer nets are easier to train than
those with hidden layers. Feed-forward nets are known as associative
nets because they learn to associate an input pattern with an output
pattern.

Feedback networks have connections from their outputs to their inputs;
they are said to be dynamic. Note that this type of arrangement is
probably biologically unrealistic. After a new input is applied the
output is fed back to modify the input. This new input produces a new
output, etc. This could go on for ever but if the network is stable
the changes get less and less until eventually the outputs become
stabilised. Unstable networks are not unknown! In these the output
never stabilises; nonetheless they can be useful, particularly in the
study of chaotic processes. It took a long time to develop stable
feedback networks and much of the credit is assigned to John Hopfield;
indeed some configurations are known as Hopfield nets.
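
A very small feedback network of the Hopfield type can be sketched as
below. It assumes units that take the values +1 and -1, weights set by
the product of the stored pattern's elements, and all units updated
together; these choices and all names are assumptions for illustration.
The output is repeatedly fed back as the next input until it stops
changing, and a damaged copy of the stored pattern settles back to the
original.

```python
def sign(value):
    return 1 if value >= 0 else -1

def train(pattern):
    """Weights are products of pattern elements; no unit connects to itself."""
    n = len(pattern)
    return [[0 if i == j else pattern[i] * pattern[j] for j in range(n)]
            for i in range(n)]

def settle(weights, state, max_steps=10):
    """Feed the output back as the next input until the state is stable."""
    for _ in range(max_steps):
        new_state = [sign(sum(w * s for w, s in zip(row, state)))
                     for row in weights]
        if new_state == state:
            break
        state = new_state
    return state

stored = [1, -1, 1, -1, 1, -1]
weights = train(stored)
corrupted = [1, -1, -1, -1, 1, -1]  # third element flipped
recalled = settle(weights, corrupted)
```

Retrieval from a damaged fragment is the content addressable behaviour
described earlier.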


The training of a network can be supervised or unsupervised.
Supervised training is the simplest and requires a knowledge of what the output
should be during training. The actual output is compared to the
desired (target) output and any errors are used to change the weights
and hence the output. This is continued until the net gives correct
results at the desired level of probability. This is easily achieved
on a single layer but is quite complex on a multilayer network.
Unsupervised training differs in that an ideal output is unknown so
there is nothing to compare the output to. The network is allowed to
come to its own 'conclusions' about what the output should be; in effect
the network behaves as an automatic classification system (where it defines
the classes).

The difficulties of training multilayer networks were one of the
reasons for a loss of interest in ANNs. Independent work by a number
of workers led to the appearance of the back propagation algorithm in
1985. The network is started with random weights. It is then supplied
with a training input, ie one for which the output is known. A
'teacher' identifies the magnitude of any error in the output. An
error signal (adjusted for the magnitude of the error) is fed back
through the network, altering the weights as it goes, in order to
prevent the same errors from happening again. The first corrections
take place on the output units. Subsequently, the corrections move on
to the middle (hidden) layer, where the weights are also adjusted. The
correction is not achieved in one pass; the same training input is
applied and the output is again tested. If it is incorrect another
error signal is sent back through the network. If all is well the
magnitude of the error will decrease with each pass, ie it will be
minimised. One of the main problems with back propagation nets is the
limit on their size, ie number of neurons which can be simulated by
commercial systems.


In networks with hidden layers it is important to have the correct
number of layers, and units per layer. If there are too many layers
the network can be easily trained to work correctly with the training
set, but it will probably perform poorly with test sets. This is
because the network will fail to 'identify' general principles; it
will simply set up unique paths for each member of the training set.
If there are too few layers or units the network will be untrainable
with even relatively small training sets. There is certainly a degree
of subjectivity about the actual design of a network; some workers
have even suggested using a genetic algorithm to find the best design.

Another element in the design of an artificial neural network is the
format of the transfer function. The transfer function is responsible
for converting the input to a neuron into its output. The simplest
transfer function compares the sum of a neuron's inputs with a
threshold value; if the sum is greater than the threshold the neuron
produces an output. Other types of transfer function may be used where
the output from a neuron is a non-linear function of its inputs. One of
the most commonly used transfer functions is the logistic function
which is sigmoidal (S-shaped). The actual shape of the logistic
function depends on the values of certain parameters which can usually
be defined when designing a network. It is possible to produce a
logistic function which has a very steep rise and thus approximates to
a simple threshold.


Evaluation of ANNs

Neural Nets are very poor at normal computer tasks such as arithmetic,
but they are quite good at tasks which require some type of pattern
recognition, probably because of their content addressable memories.
They have the advantage that you do not have to specify how to
differentiate between patterns, simply showing the network examples is
sufficient. The network will 'learn' how to differentiate and should
be able to do so on future occasions. It may be possible to use ANNs
with a variety of biological problems, for example sequence matching,
secondary structure prediction, identification of locations suitable
for conservation, etc. True neural networks should be fault tolerant,
thus if one of the nodes ceases to function it should not disrupt the
network unduly. 

Roberts (1989) has an interesting discussion about whether neural
networks are like human brains. Most of this discussion was the result
of a paper by Francis Crick (1989) in Nature. Crick is very critical
of the assumption that they can tell us anything about the brain,
others are less critical. Crick agrees that they may provide useful
analytical tools but thinks that they bear little resemblance to real
neural networks. For example, in real neural networks the outputs from
a single cell are either excitatory or inhibitory, but not both as is
found in some artificial neural networks. Crick also says that the
backprop algorithm could not work in real cells as it requires
bidirectional movement of signals down the axon.

Crick also thinks that many of the people modelling the brain with
artificial neural networks are frustrated mathematicians whose desire
is to find the general principle which governs information handling in
the brain. He thinks that there may be no general principle because
the structure of the brain is the result of natural selection and
natural selection is a 'tinkerer' which works on what is available and
not on what would be the best for some grand design. 'It [natural
selection] is opportunistic: anything will do as long as it works.
..... It may prefer a series of slick tricks to achieve its aim.' He
is strongly critical of workers who treat the brain as a black box and
hope to find out how it works by looking at just its inputs and
outputs. This is perhaps equivalent to trying to find out how a
television set works by looking at the picture on the screen and the
signal arriving at the television aerial.

He thinks that by looking in the brain we may find some mechanisms
which can be applied to artificial neural networks. He gives as an
example models which incorporate an NMDA type of receptor. NMDA
receptors are thought to be involved in associative learning in the
brain.

Suggested reading (biologically biased!)

Astion, M. L. and Wilding, P. (1992) Application of neural networks to
the interpretation of laboratory data in cancer diagnosis. Clin. Chem.
38(1), 34-
Beard, N. (1990) Mind over Micro. Personal Computer World, January,
182-186.
Beard, N. (1990) Having a Brainwave. Personal Computer World, February,
186-190.
Boddy, L. and Morris, C. W. (1991) Feasibility of using neural networks
to describe a simple wood decay data set. Binary 3, 61-64.
Boddy, L., Morris, C. W. and Wimpenny, J. W. T. (1990) Introduction to
neural networks. Binary 2, 179-180.
Cicchetti, D. V. (1992) Neural networks and diagnosis in the clinical
laboratory: state of the art. Clin. Chem. 38(1), 9-10.
Colasanti, R. L. (1991) Discussions of the possible use of neural
network algorithms in ecological modelling. Binary 3, 13-15.
Crick, F. (1989) The recent excitement about neural networks. Nature
337, 129-132.
Ferran, E. A. and Ferrara, P. (1992) Clustering proteins into families
using artificial neural networks. CABIOS 8(1), 39-44.
Lippmann, R. P. (1987) An introduction to Computing with Neural Nets.
The ASSP Magazine 4(2), 4-22.
von Heijne, R. (1991) Computer analysis of DNA and protein sequences.
Eur. J. Biochem. 199, 253-56.
Heppner, G. F. et al (1990) Artificial neural network classification
using a minimal training set: comparison to conventional supervised
classification. Photogr. Eng. Rem. Sens. 56(4), 469-73.
Holley, L. H. and Karplus, M. (1989) Protein secondary structure
prediction with a neural network. Proc. Natl. Acad. Sci. USA 86,
152-156.
Marshall, S. J. and Harrison, R. F. (1991) Optimization and training of
feedforward neural networks by genetic algorithms. Second International
Conference on Artificial Neural Networks, 39-43.
Marshall, S. J., Harrison, R. F. and Kennedy, R. (1991) Neural
classification of chest pain symptoms. Second International Conference
on Artificial Neural Networks, 200-204.
Mural, R. J. et al (1992) An AI approach to DNA sequence feature
recognition. TIBTECH 10, 66-69.
Petersen, S. B. et al (1990) Training neural networks to analyse
biological sequences. TIBTECH 8, 304-308.
Qian, N. and Sejnowski, T. J. (1988) Predicting the secondary structure
of globular proteins using neural network models. J. Mol. Biol. 202,
865-884.
Rataj, T. and Schindler, J. (1991) Identification of bacteria by a
neural network. Binary 3, 159-164.
Roberts, L. (1989) Are Neural Nets Like the Human Brain? Science 243,
481-2.
Wasserman, P. D. (1989) Neural Computing: Theory and Practice. pp 1-10.
Van Nostrand Reinhold, NY.
Sondergaard, I. et al (1992) Classification of crossed
immunoelectrophoretic patterns. Electrophoresis 13, 411-15.
Zhu, K., Noakes, P. D. and Green, A. D. P. (1991) EEG monitoring with
artificial neural networks. Second International Conference on
Artificial Neural Networks, 205-209.
          
**********************************************************************

B   Supervised Training: Outline of the algorithm


1 Obtain training and test data

Each case has a set of values associated with it; these may be
separated into two parts:

               [values of descriptor variables] [target output]
for example
                12.1  8.3  2.1                        0.0
  
Suppose, for instance, that we wished to use a network to differentiate
between male and female sanderlings (birds). It is possible to
separate the sexes on the basis of certain morphological
characteristics such as the length of the primary feathers and the
tail length. We would obtain data from birds whose sex was known. We
decide to set the target output for males to 0.0 and to 1.0 for
females (it could equally have been the other way around). Assume that
we have data for 40 birds (20 male and 20 female). We take a random
selection of 10 males and 10 females to form the training set; the
remaining 20 birds will form the test set.
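
The random allocation described above might be sketched like this. The
bird measurements are made-up placeholder numbers, and the function
name and seed argument are assumptions; the only things taken from the
text are the 0.0/1.0 targets and the 10 + 10 training selection.

```python
import random

def stratified_split(cases, n_train_per_class, seed=0):
    """Randomly pick n_train_per_class cases of each class for training."""
    rng = random.Random(seed)
    by_class = {}
    for case in cases:
        by_class.setdefault(case[-1], []).append(case)  # last value is the target
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        train.extend(members[:n_train_per_class])
        test.extend(members[n_train_per_class:])
    rng.shuffle(train)                # mix the classes within the training set
    return train, test

# Placeholder data: (primary length, tail length, target); 0.0 = male, 1.0 = female.
birds = ([(100.0 + i, 40.0 + i, 0.0) for i in range(20)] +
         [(105.0 + i, 42.0 + i, 1.0) for i in range(20)])

train_set, test_set = stratified_split(birds, 10)
```

Fixing the seed makes the split repeatable; changing it gives a
different random selection.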


2 Present the training cases, one at a time, to the network

Note: in the following equations Sum is the equivalent of sigma (summation).

For each case find the error, ie the difference between the network's
output and the target output. To avoid problems with negative and
positive errors we square the error to produce the pattern sum of
squares (pss). Thus

                              pss = (actual - target)^2

Note: if the network has more than one output neuron the pss is the
sum of the squared errors for each neuron, ie

                              pss = Sum(actual - target)^2

Use the magnitude of the error to adjust the weights (see later) in an
attempt to reduce the error. Because we take account of the size of
the pss, larger errors produce greater adjustments than small errors.

When all training cases have been presented this is one epoch. We can
find the total error (tss) for an epoch by summing the pss for each
case, ie

                              tss = Sum(pss)

When the output for each case is close to its target the tss will be small.
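
The pss and tss calculations can be written directly from the formulae
above. The function names follow the text; the three outputs and
targets are invented values used only to show the arithmetic.

```python
def pss(actual, target):
    """Pattern sum of squares: summed squared error over the output neurons."""
    return sum((a - t) ** 2 for a, t in zip(actual, target))

def tss(outputs, targets):
    """Total sum of squares for one epoch: the pss summed over all cases."""
    return sum(pss(a, t) for a, t in zip(outputs, targets))

# Three cases, one output neuron each (illustrative numbers).
outputs = [[0.2], [0.9], [0.5]]
targets = [[0.0], [1.0], [1.0]]

epoch_error = tss(outputs, targets)  # 0.04 + 0.01 + 0.25
```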


3 Compare the tss with a preset critical value (ecrit)

If tss > ecrit we need to continue training so repeat step 2. If,
however, tss < ecrit then the network is trained to our satisfaction.
Thus, training is an iterative process which may require many hundreds
of epochs to achieve the required tss. 
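
The stopping rule can be sketched as a loop around whatever routine
presents one epoch of training cases. Here run_epoch is a hypothetical
callable that trains for one epoch and returns the tss; the stand-in
used below simply halves the error each time so that the loop can be
run without a real network. The max_epochs limit is also an assumption.

```python
def train_until(run_epoch, ecrit, max_epochs=10000):
    """Repeat epochs until tss < ecrit; return the epoch count, or None if never reached."""
    for epoch in range(1, max_epochs + 1):
        if run_epoch() < ecrit:
            return epoch
    return None

def make_fake_epoch(start=1.0, factor=0.5):
    """Stand-in for real training: the reported tss shrinks by a fixed factor."""
    state = {"tss": start}
    def run_epoch():
        state["tss"] *= factor
        return state["tss"]
    return run_epoch

epochs_needed = train_until(make_fake_epoch(), ecrit=0.1)
```

A limit on the number of epochs is worth having in practice, since a
badly designed network may never reach ecrit.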

4 Test the network 

Present it with the test data, ie cases which have not been used to
train it. We can assess the validity of the network by determining the
number of correctly classified test cases. If the network performs
well we could then use it to classify other cases. If it performs
poorly we may need to think about redesigning it.
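
Counting correctly classified test cases needs a rule for turning the
network's output into a class. The sketch below assumes a single output
neuron, targets of 0.0 and 1.0, and a cutoff of 0.5; the cutoff and the
function name are assumptions, not part of the algorithm above.

```python
def count_correct(outputs, targets, cutoff=0.5):
    """Number of cases whose output falls on the same side of the cutoff as the target."""
    correct = 0
    for out, target in zip(outputs, targets):
        predicted = 1.0 if out >= cutoff else 0.0
        if predicted == target:
            correct += 1
    return correct

test_outputs = [0.1, 0.8, 0.4, 0.6]   # illustrative network outputs
test_targets = [0.0, 1.0, 1.0, 1.0]
n_correct = count_correct(test_outputs, test_targets)
```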



The difficult part of this algorithm is adjusting the weights, ie the
method by which the network learns. Let us consider the problem in
general before looking at how it is carried out in reality.

Consider a simple case. The network has an input layer of 4 neurons, a
hidden layer of 2 neurons (A and B) and a single output neuron C
(probably sufficient for our Sanderling problem). The input to C comes
from both A and B. 

Let us assume that the current state of the network is as described
below. The output from A is 2.5, which passes through a weight (A to
C) of 0.06 to produce an input to C of 2.5 x 0.06 = 0.15. The output
from B is 2.0 which passes through a weight (B to C) of 0.025 to
produce an input to C of 0.05. In this simple example the output from
C is the sum of its inputs, ie 0.15 plus 0.05 to give 0.2. The target
for this case is 0.0, thus the error is -0.2 giving a pss of 0.04.

This error is the result of two causes:
           incorrect weights into C (ie A to C and B to C)
           incorrect output from A and B (the result of their
                     inputs and weights from the input layer).
           
If we are to reduce the error from C we must distribute the error
through its 'causes' in proportion to their contributions to the
error. In order to do this we first adjust the weights to the output
layer and then adjust the weights into the hidden layer (to adjust
their output to the output layer), ie the error is back-propagated
through the network. The input from A is the largest contribution to
the output of 0.2 hence we need to make the biggest adjustments to
this side.

Let us assume that as a result of this error the following adjustments
were made (purely hypothetical values). Weight A -> C becomes 0.04, weight
B -> C becomes 0.02 and the outputs from A and B become 2.0 and 1.7 (the changes
in output would be a result of weight changes to the hidden layer).
Note that the alterations to the A 'side' are proportionally larger,
reflecting their larger contribution to the error.

The output from C is now 0.114, resulting in a reduced pss of 0.013.
If the resulting tss from all cases was larger than ecrit we would
repeat the above so that the pss was further reduced. Ideally we
eventually have an output of 0.0 for this case; in practice it may end
up as a very small value such as 0.001.
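
The arithmetic of this worked example can be checked with a few lines
of code. The numbers are those given above; the function and variable
names are assumptions.

```python
def output_c(out_a, w_ac, out_b, w_bc):
    """In this simplified example the output of C is just the sum of its weighted inputs."""
    return out_a * w_ac + out_b * w_bc

target = 0.0

before = output_c(2.5, 0.06, 2.0, 0.025)   # 0.15 + 0.05
after = output_c(2.0, 0.04, 1.7, 0.02)     # 0.08 + 0.034

pss_before = (before - target) ** 2
pss_after = (after - target) ** 2
```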

The above was a rather simplistic explanation; the reality is described
below.

Weight Adaptation Algorithm

Note: in order to create a simple text file some useful characters have
been lost. In the following equations i, j and k should be subscripts.

If wij(t) is the weight from neuron i to neuron j at the current time
(t) the new weight (wij(t+1)) will be found from:

   wij(t+1)    =  wij(t)  +  h.delta_j.xi                             (1)

   new weight  =  old     +  learning (h) . errorj (delta_j) . activationi
                  weight     rate

where the activation xi is the output from the transfer function for
neuron i, ie the signal that the weight wij carries into neuron j.

The calculation of the error (delta_j) depends on where neuron j is
positioned in the network. If j is in the output layer it is relatively
easy to calculate:

   delta_j = xj(1 - xj)(targetj - xj)                                  (2)

since j is an output neuron xj is the actual output (after passing
through the transfer function).

If j is in the middle layer the calculation of delta_j is more
difficult. A neuron in the hidden layer may be connected to more than
one output neuron, each of which has its own error.

   delta_j = xj(1 - xj).Sum(delta_k.wjk)                               (3)

        where the sum is taken over the k output units that j is
        connected to, delta_k is the error of output unit k (from
        equation 2) and wjk is the weight from j to k.

Equation (1) can be modified to reduce the chance of oscillations by
the addition of a momentum (a) term. This smoothes the changes and
prevents wide fluctuations.

   wij(t+1)    =  wij(t)  +  h.delta_j.xi  +  a(wij(t) - wij(t-1))    (4)

These four equations are the central part of the back-propagation
algorithm. They are important because they control how weights are
adjusted, and hence how the network 'learns'.

Try not to be too terrified by the sight of these equations. Look at
them closely and try to work out what is happening under different
conditions. Take equation (1) as an example

          wij(t+1)    =  wij(t)  +  h.delta_j.xi

The new weight is the old weight plus a bit extra (h.delta_j.xi). How
much extra depends on the values of three components. h is the
learning rate which has a range of 0 - 1; if it is 0 then the whole
term h.delta_j.xi will be zero and the weight will not change -
reasonable if the learning rate is 0. As h becomes larger the
magnitude of the change will increase. Similarly if delta_j is 0 there
would be no change, again reasonable since delta_j = 0 occurs when the
target output is the same as the output (see equation 2).
The magnitude of delta_j depends, at least for an output neuron, on
the discrepancy between the output and the target and the size of the
activation xj. Recall that the output from the logistic function is
in the range 0 - 1. Thus, if the output was 1 the (1 - xj) term would
be zero and delta_j would be 0. The maximum value of xj(1 - xj) occurs
when xj is 0.5*, indicative of 'indecision' and hence there is a need
for weight changes to induce a decision.

* If xj is 0.5 xj(1 - xj) is 0.25; 
  if xj is 0.1 xj(1 - xj) is 0.09;
  if xj is 0.9 then xj(1 - xj) is also 0.09.
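
Equations (1) to (4) can be sketched as three small functions. This is
an illustrative sketch rather than the code of any particular program:
the function names and the example values are assumptions, and x_i
denotes the activation of the sending neuron i (the signal that arrives
along the weight being updated).

```python
def output_delta(x_j, target_j):
    """Equation (2): error term for an output neuron j."""
    return x_j * (1.0 - x_j) * (target_j - x_j)

def hidden_delta(x_j, deltas_k, weights_jk):
    """Equation (3): error term for hidden neuron j feeding the output units k."""
    return x_j * (1.0 - x_j) * sum(d * w for d, w in zip(deltas_k, weights_jk))

def new_weight(w_now, w_prev, rate, delta_j, x_i, momentum=0.0):
    """Equations (1) and (4): updated weight on the connection from i to j."""
    return w_now + rate * delta_j * x_i + momentum * (w_now - w_prev)

# Illustrative values: one hidden neuron feeding one output neuron.
x_hidden = 0.6      # activation of the hidden neuron
x_out = 0.8         # activation of the output neuron
target = 0.0
w_hidden_out = 0.5  # weight from the hidden neuron to the output neuron

d_out = output_delta(x_out, target)                       # 0.8 * 0.2 * -0.8
d_hid = hidden_delta(x_hidden, [d_out], [w_hidden_out])   # 0.6 * 0.4 * (d_out * 0.5)
w_new = new_weight(w_hidden_out, w_hidden_out, 0.05, d_out, x_hidden)
```

Because the output (0.8) is above the target (0.0) the error term is
negative, so the weight into the output neuron is reduced slightly.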

**********************************************************************

C Designing a back-propagation artificial neural network : Practicalities

When designing a back-propagation artificial neural network there are
many decisions which need to be made, some of which may need to be
altered in the light of experience. Unfortunately there are no well
defined rules; we must examine the experiences of other workers for
clues as to what will work best.

The following decisions must be made 

1      What is its purpose
2      Size and membership of training and test data sets
3      Number of predictor variables
4      Number of classes to be differentiated
5      Number of neurons in the input layer (probably defined by 3)
6      Number of neurons in the hidden layer
7      Number of neurons in the output layer (probably defined by 4)
8      Connections (eg are all neurons fully interconnected between layers?)
9      Value of the Learning rate
10     Value of the Momentum
11     Value of Ecrit (point at which learning stops)
12     Type of transfer function
13     Initial weights


3 and 4 (and indirectly 5 and 7) will be determined (mainly) by the
answer to 1.

In the example here 8,12 and 13 are fixed. All layers are fully
interconnected: ie all hidden layer neurons receive inputs from all of
the input layer neurons and all output layer neurons receive input
from all hidden layer neurons. The transfer function is the logistic
function:
  
   output = 1/(1 + EXP(-input)) subject to the following conditions
               if the input >  11.5129 the output is assumed to be 0.99999
               if the input < -11.5129 the output is assumed to be 0.00001
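
The logistic function with its two cut-off conditions can be sketched
as follows. The constant name CLIP and the function name are
assumptions; the limits and the formula are those given above (11.5129
is approximately the natural log of 99999).

```python
import math

CLIP = 11.5129  # beyond this input the output is fixed, as described above

def logistic(net_input):
    """Logistic transfer function with the output held between 0.00001 and 0.99999."""
    if net_input > CLIP:
        return 0.99999
    if net_input < -CLIP:
        return 0.00001
    return 1.0 / (1.0 + math.exp(-net_input))
```

The cut-offs also avoid calling EXP with a very large argument, which
would overflow on many systems.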

In some descriptions of networks you will see a reference to a
parameter known as the gain. It is incorporated into the logistic
function as follows

    output = 1/(1 + EXP(-input/gain))

In the logistic function available in this software the gain is always
equal to 1 so can be ignored. The logistic function generates a value
between 0 and 1 (0.00001 to 0.99999 in this implementation) depending
on the input. The gain affects the slope of the transition from its
minimum to its maximum output. A small gain (approaching 0) produces a
sharp transition, ie a small change in the input can produce a large
change in the output; a large gain (approaching 1) produces a gentle
transition.
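
The effect of the gain can be seen by evaluating the function at the
same input with two different gains. The function name and the input
of 0.5 are assumptions made for illustration; no cut-offs are applied
here, so very large values of input/gain could overflow.

```python
import math

def logistic_with_gain(net_input, gain=1.0):
    """Logistic function with the input divided by the gain, as in the formula above."""
    return 1.0 / (1.0 + math.exp(-net_input / gain))

sharp = logistic_with_gain(0.5, gain=0.1)    # small gain: already close to 1
gentle = logistic_with_gain(0.5, gain=1.0)   # gain of 1: only a little above 0.5
```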

Initially the weights are set to random values in the range -0.5 to
+0.5. If all weights were equal this would prevent the network from
learning since all weights would consistently undergo the same
changes. A side effect of the random weight allocation is that, even
if everything else is kept constant, repeat runs may produce slightly
different results, particularly with respect to the number of epochs
needed for the training.
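
Random starting weights in the range -0.5 to +0.5 might be generated
as below. The function name, the list-of-lists layout and the optional
seed are assumptions; fixing the seed is one way to make repeat runs
identical when that is wanted.

```python
import random

def initial_weights(n_from, n_to, seed=None):
    """One weight per connection between two fully interconnected layers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_to)]
            for _ in range(n_from)]

weights_run_1 = initial_weights(4, 2, seed=1)   # 4 input neurons, 2 hidden neurons
weights_run_2 = initial_weights(4, 2, seed=2)   # a different seed gives different weights
```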
       
In 2 you should aim to get as many cases as possible; this will
increase the probability that the network will be able to generalise.
If the number of available cases is small it may be possible to add
'noise' to existing data and thus increase the size of the dataset.
Sondergaard et al (1992) also say that this increases the chance that
the network will be able to generalise. Allocation of cases to
training and test data sets should be random to avoid bias (subject to
the constraint that each class is equally represented in the two
sets). It is also advisable to ensure that the sequence of cases in
the training set is randomised, eg the training set should not contain
all of class 1 followed by all of class 2.
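
Adding 'noise' and randomising the order of the training set could be
sketched as follows. The use of Gaussian noise, the standard deviation
of 0.1, the number of copies and the function name are all assumptions;
the text above does not say how the noise should be generated.

```python
import random

def add_noise(cases, copies, sd, seed=0):
    """Make noisy copies of each case; the target (last value) is left unchanged."""
    rng = random.Random(seed)
    noisy = []
    for *values, target in cases:
        for _ in range(copies):
            noisy.append(tuple(v + rng.gauss(0.0, sd) for v in values) + (target,))
    return noisy

original = [(12.1, 8.3, 2.1, 0.0), (13.4, 9.0, 2.6, 1.0)]  # illustrative cases
augmented = original + add_noise(original, copies=3, sd=0.1)

# Randomise the order so that the classes are interleaved.
random.Random(0).shuffle(augmented)
```

The amount of noise should be small relative to the real differences
between classes, otherwise the added cases may blur them.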

The most difficult parameters to set are those in 6, 9, 10 and 11, ie
the number of neurons in the hidden layer; the learning rate, the
momentum and the critical value. The following points may help in your
decision making.

As the size of the hidden layer increases so does the number of weights
in the network; if this gets too large the network will be able to
'memorise' each training case (a unique set of weights for each case).
If there is a 1:1 relationship between training cases and weights
the network will be unable to generalise. This would be identifiable
if the network correctly identifies all of the training cases but
performs poorly with the test data. Sondergaard et al (1992) summarise
the case for minimising the size of the hidden layer by saying that a
large number will prevent the network from extracting the general
features of the patterns and will lead to an internal representation
(ie the weights) of the patterns. They suggest the following rule:
  
number in the hidden layer = (no. input neurons + no. output neurons)/2
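
The rule, and the number of weights that results from a given design,
can be computed as below. The function names are assumptions, the
weight count assumes fully interconnected layers with no bias weights,
and how a fractional result should be rounded is not stated in the
rule, so the raw value is returned.

```python
def suggested_hidden_size(n_inputs, n_outputs):
    """Rule quoted above: the mean of the input and output layer sizes."""
    return (n_inputs + n_outputs) / 2

def weight_count(n_inputs, n_hidden, n_outputs):
    """Weights in a fully interconnected input-hidden-output network (no biases)."""
    return n_inputs * n_hidden + n_hidden * n_outputs

size = suggested_hidden_size(4, 1)   # 4 inputs and 1 output give 2.5
weights = weight_count(4, 2, 1)      # the 4-2-1 network used earlier
```

Comparing the weight count with the number of training cases gives a
rough check on whether the network could simply memorise them.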


The critical value defines the point at which training stops. Setting
this too small has much the same effect as a large hidden layer. If it
is very small this means that the network is getting the majority of
training cases absolutely correct - this can restrict the network's
ability to generalise and hence correctly process non-training cases.
Recall that the comparison is between ecrit and tss; tss is the sum of
all pss values. Since tss is a sum its magnitude must be related to
the number of training cases, eg assume that the pss is 0.01 for all
training cases. If there are 10 training cases tss = 0.1, if there are
100 cases then tss = 1.0, even though the error on each training case
is identical. Thus the way to set ecrit is to think about the minimum
pss that you will accept for each training case (minpss) and set ecrit
at minpss x number of training cases. You will also need to recall
that the pss is a square, thus if the actual minimum error per case
you wish to accept is 0.01 (ie the actual difference between the
output and the target) minpss will be 0.01 x 0.01 = 0.0001.
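
The calculation of ecrit from an acceptable error per case can be
written as a one-line function. The function name is an assumption and
the 20 training cases are just an example figure.

```python
def ecrit_for(max_error_per_case, n_training_cases):
    """ecrit = minpss x number of training cases, where minpss is the squared error."""
    minpss = max_error_per_case ** 2
    return minpss * n_training_cases

ecrit = ecrit_for(0.01, 20)   # 0.0001 x 20
```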

The learning rate affects the magnitude of weight changes during
learning. Large weight changes should increase the rate of learning.
Unfortunately, if the learning rate is too large it can cause the
network to oscillate, eg a large negative output error would be
converted to a large positive error rather than following a gradual
descent towards zero error. A reasonable value for the learning rate
is 0.05. It is possible to alter the back-propagation algorithm so
that it includes a momentum term. This has the effect of damping the
weight changes so that oscillation becomes less of a problem. A
typical value for the momentum is 0.9.
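
The oscillation caused by a large learning rate can be seen on a toy
problem. The Python sketch below is not taken from BP: it follows a
single weight down a simple error curve (error = k x w x w / 2, chosen
purely for illustration) and uses the usual textbook update, in which
each weight change is the learning rate times the downhill gradient
plus the momentum times the previous change.

```python
def descend(w, learning_rate, momentum=0.0, steps=5, k=1.0):
    """Follow one weight down the toy error curve E = k * w**2 / 2."""
    history = [w]
    delta = 0.0                      # previous weight change
    for _ in range(steps):
        gradient = k * w             # slope of the toy error curve at w
        # New change: a step downhill plus a fraction of the last change.
        delta = -learning_rate * gradient + momentum * delta
        w += delta
        history.append(w)
    return history


print(descend(1.0, 0.05))   # small rate: steady descent towards 0
print(descend(1.0, 1.5))    # large rate: the sign flips on every step
```

With the small rate the weight creeps towards the minimum from one
side. With the large rate it overshoots the minimum each time, so the
sign alternates, which is the oscillation described above.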

It is apparent that specifying the design of a network depends on a
number of subjective decisions. It is possible to run the network many
times with different set-ups in the hope of finding the optimum
design. Marshall and Harrison (1991) have used a novel approach to
this problem. They make use of the principle of natural selection!
They employ a genetic algorithm, in which the parameters (learning
rate, size of hidden layer etc) are set up as 'genes' on
'chromosomes'. These are allowed to breed, mutate and undergo
'meiosis' to generate offspring. The offspring, whose genes define the
network design, are assessed for their performance, and the best of
these are used to produce the next generation. This approach has been
shown to produce acceptable solutions, ie networks which learn quickly
and are able to generalise.
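
The general idea of a genetic search over network designs can be
sketched in Python as below. This is not Marshall and Harrison's
method, only a generic illustration. The gene values, the function
names and especially toy_fitness are assumptions: a real fitness
function would train a network with each design and score how quickly
it learns and how well it generalises.

```python
import random

# Hypothetical search space: each 'gene' is one design parameter.
GENES = {
    "hidden": [2, 3, 4, 5, 6],
    "learning_rate": [0.025, 0.05, 0.1, 0.25],
    "momentum": [0.0, 0.5, 0.9],
}


def random_chromosome(rng):
    return {name: rng.choice(values) for name, values in GENES.items()}


def breed(mother, father, rng, mutation_rate=0.1):
    """Take each gene from one parent at random, then sometimes mutate."""
    child = {}
    for name, values in GENES.items():
        child[name] = rng.choice([mother[name], father[name]])
        if rng.random() < mutation_rate:
            child[name] = rng.choice(values)
    return child


def evolve(fitness, generations=10, population_size=12, keep=4, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(population_size)]
    for _ in range(generations):
        # The best designs survive and become the parents.
        parents = sorted(population, key=fitness, reverse=True)[:keep]
        children = [breed(rng.choice(parents), rng.choice(parents), rng)
                    for _ in range(population_size - keep)]
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(design):
    # Stand-in only: pretends that 3 hidden neurons and a learning rate
    # of 0.05 are ideal. Replace with a score from real training runs.
    return -abs(design["hidden"] - 3) - abs(design["learning_rate"] - 0.05)


print(evolve(toy_fitness))
```

Because the parents are carried over unchanged, the best design found
never gets worse from one generation to the next.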

**********************************************************************

D  Suggested Experiments

Using the data provided investigate the effect of changing:

    size of hidden layer 
    learning rate
    ecrit               

Use the following values 

momentum of 0.9 in all cases

    hidden layer size : 3 sizes to be advised (depends on data set to be used)
    learning rates    : 0.025  0.05  0.1
    ecrit             : assume n = number of training cases
                           n x 0.1;  n x 0.01; n x 0.001
       
This will mean looking at 27 different networks!
       
You can assess the effects of the changes by recording

    number of training cases correctly classified
    number of test cases correctly classified
    number of epochs required for training
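
The 27 set-ups can be listed mechanically. The Python sketch below
builds every combination of the values above. The hidden layer sizes
(2, 3, 4) and the figure of 10 training cases are placeholders only,
since the real values depend on the data set used.

```python
from itertools import product


def experiment_grid(hidden_sizes, n_training_cases):
    """Return one dictionary of settings per network to be trained."""
    learning_rates = [0.025, 0.05, 0.1]
    ecrits = [n_training_cases * factor for factor in (0.1, 0.01, 0.001)]
    return [
        {"hidden": h, "learning_rate": lr, "ecrit": e, "momentum": 0.9}
        for h, lr, e in product(hidden_sizes, learning_rates, ecrits)
    ]


# Placeholder hidden layer sizes and number of training cases.
grid = experiment_grid([2, 3, 4], n_training_cases=10)
print(len(grid), "networks to run")
```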


**********************************************************************

E  Instructions for use of BP

BP is an implementation of the back-propagation algorithm written in Basic
for the Acorn Archimedes. It was written for a research project in a very
restricted timescale, hence it is not the most efficient or 'pretty' program
ever written. If anyone would like to improve it I would be very happy to see
the results, particularly if you wished to make it multi-tasking!


Running the program

The example described below is the classic XOR problem, ie the ability to
correctly identify an XOR relationship. The inability of the original neural
networks to solve it was the major criticism made by Minsky and Papert. When
you run this example you will find that the tss seems to be stuck at a value
of about 1. It will stay like this for about 200-250 cycles, then it will
begin to fall quite sharply. If it gets stuck again (ie tss does not alter),
press Escape and rerun it - a new set of weights will be used and hopefully it
will find the correct weights!


Set up the datafiles by copying them (keeping the names specified below) into a
RAM disc, which can be created by dragging the relevant bar in the Tasks
display. Then double-click on the BP icon.

The output is spooled to a filename specified in the NNsetup file. Note that
this file will have a CR (x0d) at the start of each line. If these are not
removed it will result in a printout which is double spaced.
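
If the output file is moved to a machine with Python, the CR characters
can be removed with a few lines such as these. This is a sketch only:
the function names are made up, and it assumes every CR (x0d) byte in
the file is unwanted.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Remove every CR (x0d) byte, leaving the line feeds alone."""
    return data.replace(b"\r", b"")


def clean_file(source_path: str, target_path: str) -> None:
    # Work in binary mode so that no line endings are translated.
    with open(source_path, "rb") as source:
        cleaned = strip_carriage_returns(source.read())
    with open(target_path, "wb") as target:
        target.write(cleaned)
```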

Setting up the datafiles

Two files are used by BP

1    NNsetup

     This describes the network and consists of a series of lines as
     described below

     NNsetup                          Explanation


     xortest                          job title note NO SPACES
     ram:xor                          output file name, again no spaces
     1                                output format  1 = long  0 = short
     5                                number of neurones
     2                                number of input neurones
     1                                number of outputs
     0.025                            ecrit  program stops when tss<ecrit
     0.25                             learning rate 0 - 1
     0.9                              momentum 0 -1
     50                               number of cycles before output pauses
     0                                pause after each pattern 0=no  1=yes
     0                                pause after each epoch   0=no  1=yes
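
     For anyone wanting to read an NNsetup file outside BP, the Python
     sketch below shows one way to do it. It is not part of BP. The
     field names are invented for illustration, and it assumes one
     value per line in the order listed above, with anything after the
     first space on a line ignored.

```python
# (hypothetical field name, type) in the order the lines appear.
FIELDS = [
    ("title", str), ("output_file", str), ("long_output", int),
    ("neurones", int), ("inputs", int), ("outputs", int),
    ("ecrit", float), ("learning_rate", float), ("momentum", float),
    ("cycles_per_pause", int), ("pause_after_pattern", int),
    ("pause_after_epoch", int),
]


def parse_nnsetup(text):
    """Turn the lines of an NNsetup file into a dictionary."""
    # Keep only the first item on each non-blank line.
    values = [line.split()[0] for line in text.splitlines() if line.strip()]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d lines, found %d"
                         % (len(FIELDS), len(values)))
    return {name: convert(value)
            for (name, convert), value in zip(FIELDS, values)}
```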


2    Datafile

   This file contains the data to be analysed plus some other information
   at the top of the file.

   8           Number of cases (training plus test)
   2           Number of variables in the file excluding set,target and name
   4           Number of training patterns
   + 1 1 x     
   + 2 1 y
   1  0 0 0 pat00
   1  0 1 1 pat01
   1  1 1 0 pat11
   1  1 0 1 pat10
   0  0 0 0 tpat00
   0  0 1 1 tpat01
   0  1 1 0 tpat11
   0  1 0 1 tpat10
 
 

After the three header lines comes the list of variables to be used in the
analysis, one per line, each with a scale factor. Each line starts with a +
or - sign: + means use this variable, - means do not. Thus in this example
all variables are used. Next are the variable number, scale factor, and a
name (no spaces allowed). The scale factor is used to ensure that each
variable, when used by BP, has a range of 0 - 1. In the program each value is
divided by the scale factor. Thus if variable x had a value in the range
0 - 100, a scale factor of 100 would be used. This method allows flexibility
if you have a large number of variables, measured on a variety of scales.

Next come the cases in any order. Each line begins with a set number (in the
example above 1 = train, 0 = test), followed by the variables. Each line ends
with the target output(s) and a pattern name. Note that training plus test
cases must equal the total number of cases; the program does not verify this.
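
The layout of the datafile, including the division by the scale factor,
can be illustrated with a short Python reader. This is a sketch and not
BP's own code. It assumes that only the first item on each header line
matters, that variable numbers count from 1, and that the number of
target outputs is supplied by the caller (1 by default). The set number
is passed through unchanged.

```python
def parse_datafile(text, n_outputs=1):
    """Read a BP-style datafile and scale the variables that are in use."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    n_cases, n_vars, n_train = (int(lines[i][0]) for i in range(3))

    # Variable list: sign, variable number, scale factor, name.
    variables = []
    for sign, number, scale, name in (l[:4] for l in lines[3:3 + n_vars]):
        variables.append({"use": sign == "+", "number": int(number),
                          "scale": float(scale), "name": name})

    cases = []
    first = 3 + n_vars
    for fields in lines[first:first + n_cases]:
        raw = [float(v) for v in fields[1:1 + n_vars]]
        # Divide each variable in use by its scale factor.
        inputs = [raw[v["number"] - 1] / v["scale"]
                  for v in variables if v["use"]]
        targets = [float(t)
                   for t in fields[1 + n_vars:1 + n_vars + n_outputs]]
        cases.append({"set": int(fields[0]), "inputs": inputs,
                      "targets": targets,
                      "name": fields[1 + n_vars + n_outputs]})
    return {"n_training": n_train, "variables": variables, "cases": cases}
```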


Program output

An example


Mon,07 Oct 1992.12:36:42
xortest
cycles 448.00
ecrit 0.03
learning rate 0.25  momentum 0.90
tss 0.02
training cases 4.00
test cases 4.00
variables used 
x
y

Test pattern 
No.  name  target output   pss
1    tpat00   0  0.08      0.01    Target is 0, output is 0.08
2    tpat01   1  0.91      0.01    Target is 1, output is 0.91
3    tpat11   0  0.07      0.00    Target is 0, output is 0.07
4    tpat10   1  0.93      0.00    Target is 1, output is 0.93
Weights for hidden layer
input 0   4.60   5.27  
input 1  -4.35  -5.30  

Weights for output layer
input 2  -6.31  
input 3   6.65  

In this case all outputs are close to the targets so the network
performed well!
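
The pss column can be checked by hand, or with the short Python sketch
below, which squares the difference between target and output for each
pattern in the table. Note that it works from the printed outputs,
which are already rounded to two decimal places, so the figures are
approximate.

```python
# (name, target, output) copied from the table above.
results = [
    ("tpat00", 0, 0.08),
    ("tpat01", 1, 0.91),
    ("tpat11", 0, 0.07),
    ("tpat10", 1, 0.93),
]


def pss(target, output):
    """Pattern sum of squares for a single output neuron."""
    return (target - output) ** 2


pss_values = [pss(target, output) for _, target, output in results]
tss = sum(pss_values)

for (name, _, _), value in zip(results, pss_values):
    print(name, round(value, 2))
print("tss", round(tss, 2))
```

The total comes to about 0.024, which is just below the ecrit of 0.025
set in NNsetup.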




The second example has data in the files datafile2 and NNsetup2. 
Don't forget to change the names and put them in RAM before running
BP. Sample output is in the file sex! Data are given for the height
(inches), weight (pounds) and age (years) of 10 males and 10 females.
5 of each are used for training and 5 of each for testing. The network
will attempt to determine someone's gender from their height, weight and
age. It seems to work ok, except for male number 6 who comes out female!
(Target output for a male is 1, 0 for a female. His output is 0.01).
