
            Disc Structures and Allocation Strategies for NewCore

                              Draft Proposal


Author:    Mike Challis

History:

  0.01  17-Nov-93  Document started
  0.02  19-Nov-93  Distributed to JRe, JRo, WS, KW for comment
  0.03  01-Dec-93  Section 8 - Typical Scenarios - added
                   Minor correction to chunk table entry definition
  0.04  06-Dec-93  More "outstanding issues" added prior to moving on to next
                    topic


Outstanding issues:

  - complete the sections on deletion, extension and truncation
  - add details of cacheing/paging data structures
  - use of check sums
  - should directory updates be "protected" like map updates? (not for this
     document)
  - can an arbitrarily interrupted compaction be completed?
  - techniques for detecting memory corruption of critical mapping data
  - more diagrams and summaries of the main data structures
  - determine and describe all possible "states"
  - note maximum sizes determined by 16-bit entries - are they adequate?
  - check behaviour of hard discs when power is removed
  - consider impact/possibility of undetected write failures
  - check that hard discs don't re-order writes

  - note some ideas rejected:
      - storing last bit of file inside directory entry (p60, p66)
      - background compaction process (p60)
  - add observation about NVRAM - just a byte or two would be lovely!
  - work through example of allocating a large file (p89)

  - re-think integrity ideas - is it better to write the whole indirection
     table and free space bit map each time, rather than page them? Could
     multiple state blocks mean that a single write can include this as well?
  - consider adding a "transaction" capability for (very) small objects
  - document changes resulting from the need to allow small objects to be
     moved from one zone to another when extended:
      - object-id is no longer immutable - it may change when an object is
         re-allocated
      - no need for large objects to be indirected through an indirection
         table entry
      - no need for small/large bit in indirection table entry; this is now
         in object-id
      - maximum indirection table size no longer needs to include entries for
         each possible large object
  - assign 16 entries to hold the object-id's of final sections in each zone;
     this is cleaner than using some part of the free space bit map.
  - document possible extension to very large discs by paging the chunk table
  - document concerns about "near-full" behaviour, and proposed actions - and
     note other (rejected) possibilities
  - consider where bad-block/defect mapping should be implemented
  - can large object zones exist without mapping information?
  - discuss directories as a number of small objects - and define carefully
     what this level "knows" about the difference between directory objects
     and file objects

  - define the interface as a set of procedures



1  Summary


This document describes some of the features of "NewCore", a proposed
replacement for FileCore. These features are:

  - the representation of objects (sequences of bytes) as one or more sets of
     contiguous sectors on a disc;

  - strategies for allocating sets of sectors on the disc to objects;

  - techniques for maintaining data integrity in the face of hardware
     failures.

This document does not discuss the internal structure of directories, but
assumes that these will no longer be fixed-size objects, thus allowing
NewCore to cope with expected enhancements to RISC OS filing systems such as:

  - long filenames
  - large directories
  - more data in each directory entry (access, owner, soft link, type ...)

NewCore is also designed to manage very large discs (up to 128G at least);
FileCore is currently limited to 512M.



2  Objectives


The following list of objectives is due to John Redford, Project Manager for
RISC OS Black. Except where noted, the proposal here is intended to meet all
of these.

 * Simple layout and allocation schemes rather than the ultimate in space
   and time performance.

      NewCore's disc structures and allocation strategies are not as simple
      as might be hoped.

 * Must perform well when the disc is almost full. Eg. must always find
   space if it exists, must not be slow when disc is near full.

      NewCore should be better than FileCore in this respect, but will not
      be perfect: for example, special cases can be constructed which will
      use space badly, but these are most unlikely to arise in practice.

 * Must handle unexpected power down and unexpected media removal without
   disc structures becoming inconsistent. Eg. no lengthy disc structure
   checking required after unexpected power down.

      Jonathan Roach has expanded on this objective and identifies the
      following situations:

        * Power loss
        * Disc failure:
           * Disc head failure 
           * Disc head touchdown
           * Disc head bearing wear
           * Magnetic media failure
           * Disc head crash
        * In-memory data structure destruction

      NewCore should cope transparently with power loss, and should make
      file recovery from undamaged areas of a disc reasonably likely.

      Techniques to detect memory corruption of critical mapping data have
      yet to be determined.

 * Must handle unexpected power down and unexpected media removal in a way
   such that the contents of the disc are seen as reasonable by a typical
   user(!). Eg. if a file was renamed 10 seconds before the event, then the
   user will expect that file to have the new name after the event. Eg. if a
   file was open at the time of the event, then the user will expect the
   files contents to be some sensible mixture of old and new after the
   event. [This definition needs considerable tightening up! Most of it
   relates to file system caching schemes, but there may be some effect on
   the layout and allocation schemes.]

 * Must have no foolish numerical limitations built in. Eg. directories must
   be able to have arbitrary numbers of files in them. Discs of any
   (reasonable) size should be usable.

      Disc sizes from 128M to 128G should operate efficiently.

 * Users should not need to know their typical patterns of use in order to
   format the disc. Eg. user must not have to choose a maximum number of
   files that can exist on the disc, or have to choose minimum allocation
   sizes for the disc.

 * Must handle widely different typical types of use well. Eg. from the
   large numbers of directories and small files that a C developer may have
   to the few large files that a Replay user may have.

 * Performance must not degrade over time. Eg. heavy use of the disc should
   not lead to awkward fragmentation of space that can never be undone.

      Prolonged use will not lead to unacceptable space fragmentation, but
      may lead to a mixture of small and large object zones - whereas these
      are initially set up as separate. This is unlikely to lead to a serious
      performance drop.

 * Optimise for reading and writing of whole files and for reading and
   writing a stream from the start to the finish of a file rather than
   optimising for random access. Eg. keep all small and medium size files in
   contiguous areas on disc. Keep all large files in discontiguous areas on
   disc, however make sure that the minimum size of each contiguous fragment
   of a large file is large.

 * Must be able to map out areas of the disc that go bad without having to
   reformat disc or lose any data that was not touched by the area going
   bad.

      Details tbd - but unlikely to be a problem.

 * Would be useful (for OLE linking) if each disc object had a unique
   identifier that could be used to locate the object on a disc even if the
   user has move it. Ideally the object id would be time and space unique to
   allow location even when moving between discs.

      Each object on the disc has a unique object-id, but this allows only
      limited movement (within its zone) on the disc.

 * Must support more attributes per object than at present. Eg. space for
   creation time, owner, author, description, etc. [Needs specification. Are
   user attributes necessary, or would just the UNIX attributes and file
   type do?]

 * Must allow per user quota's to be enforced. Ie. keep track efficiently of
   how much space files owned by each user take up.

      Details tbd - but unlikely to be a problem.

 * Need not attempt to use the same schemes for both floppy and hard discs.

      The scheme proposed here is not likely to be suitable for floppy discs.

 * Need not attempt to allow files larger than a single disc.

 * Need not attempt to improve performance by distributing a file across  
   multiple discs.

 * Need not provide any fault tolerance to disc (or sector/track) failure,
   if disc (sector/track) goes bad, then affected data will be lost.

 * Need not support transactional facilities for programs. Eg. the ability
   to make groups of changes and commit them together, or the ability to
   rollback to a previously committed point.

 * Need not support locking of regions of a file. Eg. preventing other users
   from writing to an area of a file that one user has locked.



3  Properties of hard discs


A hard disc presents a continuous address space that is divided into equally
sized "sectors", numbered from zero upwards, with the following general
properties:

  a) The size of each sector is typically 512 bytes, and always a power of 2.

  b) Reading or writing contiguous sectors is faster than transferring
     data from/to sectors chosen at random.

  c) The time taken to move from one sector to another is, in the large,
     proportional to the difference between their addresses.


In fact, performance parameters for modern 3.5" discs show that (c) is not as
significant as has traditionally been the case, since average seek times now
approach average latency due to disc rotation. The figures for the new Conner
210M drive are as follows:

  512 byte sectors, 32k buffer
  Number of sectors per track varies from 63 to 100 (inside to outside)
  Number of tracks per surface = 2388
  Seek times:
     track-to-track:   3.0mS
     average:         14.0mS
     maximum:         26.0mS
  Latency:
     average:          8.3mS
     maximum:         16.7mS
  Data transfer rate varies from 2.56M/s to 4.42M/s (but see below)


We define two sectors to be "close" to each other if the average time (t2)
taken to move from the first sector to any sector on the track containing the
second sector is no more than 50% greater than the average time (t1) taken to
move from the first sector to any sector on the track next to it.

Using the figures above we have:

  t1 = 3.0 + 8.3 = 11.3
  t2 = 3.0 + (26.0 - 3.0)/2388 * n + 8.3

where n is the number of tracks which separate the two sectors. Solving for
t2 = 1.5*t1 we get n = 587; and with an average number of sectors per track
equal to 82, this shows that we can reasonably consider two sectors to be
close to each other provided they lie within the same 24Mbyte region of the
disc.


We could in the same vein define two sectors to be "very close" to each other
if the time taken to move from one to the other is at most 50% more than the
time taken to move from one sector to the sector immediately following it on
the following track. However, determining which are the sectors very close to
a given sector is a complex calculation depending on both disc geometry and
the positions of any defects on the disc; NewCore's allocation policies do
not take such considerations into account.


We can also calculate the maximum long-term data transfer rate for contiguous
data as follows:

  Time to transfer the data from one track = 16.7mS  (rotation period)
  Time to move to the next track           =  3.0mS  (track-to-track seek)
  Average data per track = 82 * 512 = 41k bytes

  Average maximum long-term data transfer rate is 41k/19.7mS = 2.08M/s

This is significantly less than the quoted (4.42+2.56)/2 = 3.49M/s; I guess
that this is because Conner are quoting the instantaneous data rate attained
during sector read [they also hold the eccentric view that 1M = 1000000].

We define a transfer rate as "fast" if it is at least 90% of this maximum
rate, and now consider the case of a large file which is divided into a
number of contiguous sections distributed at random across the disc. How
large do these individual sections need to be in order for the file as a
whole to be transferred "quickly"?

  Average time to move from one section to another = 14.0 + 8.3 = 22.3mS
  Average time to transfer n contiguous sectors = 512*n/2.08M secs
                                                = n/4.16 mS
  For a fast transfer we require n/4.16 = 9*22.3, which gives n = 835

So each individual section needs to be at least 417k long.


In summary, NewCore's allocation policies take into account the properties of
hard discs as follows:

  * Data transfer rates for contiguous sectors are much higher than can be
    achieved for sectors chosen at random - so small files are held in
    contiguous sectors.

  * It is reasonable to break up large files into contiguous sections each
    of the order of a megabyte in size.

  * Placing two objects "close" to each other means placing them within the
    same region of the disc, where regions are of the order of 16M in size.



4  Definitions and representations


4.1  Objects

An "object" is a sequence of bytes which is always an exact number of sectors
long. NewCore provides facilities to:

  - create an object of a given length
  - delete an object
  - extend an object
  - truncate an object

Each object, when created, is assigned an "object-id" which uniquely
identifies it throughout its life (but note that object-ids are local to each
disc, and are not unique in time).

Objects are expected to be used as containers for files and directories, with
each directory containing the object-ids of the files (and sub-directories)
within it. Each directory entry will also include additional information
about each object including its precise size in bytes, as well as its name
and other attribute information.

Although NewCore does not need to understand the internal structure of any
object when creating or moving it, its initial allocation strategy may depend
on whether the object is to hold a file or a directory, and some information
about the object's parent directory may be required.


4.2  Zones and chunks

After a "header section" (see below), a disc is divided into "zones", each of
which consists of a number of "chunks", which in turn contain sectors. All
chunks are the same size, and all zones except, possibly, the last are the
same size.

Zones are numbered from 0 upwards, as are sectors within each zone; we use
"zone-id" to identify a zone, and "sector-id" to identify a sector within a
zone.

Chunks are numbered from 0 upwards, starting with the first chunk in zone 0
and finishing with the last chunk in the final zone; we use "chunk-id" to
identify a chunk on the disc.

Zones are the "unit of proximity", and NewCore assumes that two objects which
lie in the same zone are "close".

Chunks are the unit for allocating space to large files, and NewCore assumes
that the long-term average transfer rate for files made up of chunks chosen
at random over the disc will be "fast".

As a concrete example, consider an 8G disc; this could be split up as shown
below:

  512 zones, each 16M in size        (8-bit zone-id)
  16 1M chunks in each zone
  Sector size 512 bytes
  Thus:
    Each chunk contains 2k sectors
    Each zone contains 32k sectors   (15-bit sector-id)
    The disc contains 8k chunks      (13-bit chunk-id)


4.3  Small and large objects

Objects are classified as "small" or "large" according as they fit inside a
single chunk or not.

Small objects are always allocated to contiguous sectors inside a zone, and
may be moved around within the zone if necessary to make further contiguous
space available.

Large objects are always allocated to one or more complete chunks (which may
come from a number of different zones). If the size of a large object is not
an exact multiple of the chunk size, then the final part of the object is
stored as a small object; this part is called the object's "final section".
Complete chunks allocated to a large object are never moved.


4.4  Indirection tables

Each zone includes mapping and allocation information in tables stored as
part of its first chunk; these are:

  * the "state block"
      - which contains information about the current state of the zone (for
        example, is any compaction in progress?), and describes where the
        current versions of the other tables can be found;

  * the "indirection table"
      - which contains information about the location of each object
        allocated to the zone;

  * the "free space bit map"
      - which contains allocation information.

Each object allocated to a zone is given an "entry-id" which is an index into
the indirection table. Each 16-bit entry is divided into a flag bit and a
15-bit value; possible formats are:

  flag   value   meaning

   0       0     This entry is not in use

   0      n>0    This entry describes a small object whose first sector has
                  sector-id = n

   1      n>0    This entry describes a large object whose first complete
                  chunk has chunk-id = n

(Details of the state block are given later.)


4.5  The free space bit map

This table has one bit for each sector in the zone; the bit is 1 if the
sector is allocated, and 0 otherwise.

[In fact, these bits are valid only for shared chunks (see below), and must
 not be relied upon for sectors inside unallocated or unshared chunks;
 indeed, in the latter case, this area of the free space bit map is used to
 hold other information (see section 4.7 below).]


4.6  The chunk table

The disc's header section contains mapping and allocation information global
to the disc as follows:

  * the "disc state block"
      - which contains information about the particular NewCore parameters
        chosen for this disc (chunk size, zone size etc.), and also describes
        where the current versions of the other tables can be found;

  * the "zone table"
      - which contains summary information about each zone;

  * the "chunk table"
      - which contains allocation information about each chunk on the disc.

The chunk table contains a 16-bit entry for each chunk on the disc, with one
flag bit and a 15-bit value field; possible formats are:

  flag   value   meaning

   0      n>=0   The chunk is shared, and has n free sectors within it

   1    n<&7ffe  The chunk is unshared, and the next complete chunk allocated
                  to this object has chunk-id = n

   1     &7ffe   The chunk is unshared, and is the final complete chunk
                  allocated to this object

   1     &7fff   The chunk is unallocated

(Details of the disc state block and zone table are given later.)

A chunk is said to be "shared" if it includes any (parts of) small objects,
and is said to be "unshared" if it is one of the complete chunks of a large
object.

Note that the first chunk of every zone can never be unshared, because of the
presence of the mapping and allocation information tables; this means that
even under optimal conditions no section of a large file can ever be more
than 15 chunks long.


4.7  Object-ids and locating (the parts of) an object

An object-id is a 32-bit quantity comprising a zone-id and an entry-id, thus
identifying an entry in an indirection table which, in turn, contains the
current location of the object (whether small or large).

If it's a large object, then the indirection table entry identifies both the
first complete chunk of the object and an entry in the chunk table - which in
its turn identifies the next complete chunk and another entry in the chunk
table, thus chaining together all of the complete chunks allocated to the
object.

What about any final section of a large object? This will be held as a small
object, and so needs a full object-id to locate it; unfortunately, there is
not sufficient space in a chunk table entry to accommodate this. However, we
can squeeze it into the free space bit map, since there is no need to hold
allocation information about the final complete chunk since we can determine
from the chunk table itself that the chunk is not shared.


4.8  Implementation notes

It's expected that the chunk table will always be in RAM, and that the
indirection tables for recently accessed zones will be cached; this means
that no disc accesses are required to identify:

  - unallocated chunks for a new large object;

  - a chunk with sufficient space for a new small object;

  - a sequence of shared chunks appropriate for compaction in order to gain
    more space.

If we allocate new files close to their parent directories, then we can also
expect to find location information in RAM for all files inside a recently
opened directory.


How much does this cost? For the 8G disc introduced above, the chunk table
occupies a mere 16k; but the space for each indirection table depends on how
many entries it contains.

The worst case arises when:

  a) every sector in the zone contains a different small object;

  b) every large object on the disc has an object-id which references this
     zone;

taken together this means that the worst case approaches the sum of the
number of sectors per zone plus the number of chunks on the disc (40k).

In practice, the number of entries required will be far fewer:

  a) if the average size of small objects is 8k, then each zone will contain
     around 16M/8k = 2k small objects;

  b) if we make the following assumptions about large objects:
       - average size 2M
       - together they occupy at most 25% of the disc
       - their object-ids are distributed evenly amongst the zones
     then the number of entries for large objects would be around 8;

so we might reasonably expect that more than 4k entries per zone - ie 8k
bytes of buffer space - is unlikely.

On the other hand, these entries will not necessarily be compact within the
indirection table, because once an entry-id has been assigned to an object
it can never be changed: so if at some stage a zone contains a large number
of small objects which are then almost - but not quite - all deleted, then
the indirection table will remain large, although most of its entries will be
unused.

To avoid the possibility of wasting lots of valuable cache space on unused
entries, we propose to cache parts of indirection tables; the unit size is
to be determined, but could be 4k, say, for our example disc.


During allocation, additional information will be needed, including the free
space bit map. For our example disc, this is 4k in size; but as we shall see
it is not always necessary to have all of it in memory at once, and so it
could also be cached if necessary.



5  High-level allocation strategy


5.1  Objectives

The main objectives are:

  * Keep small objects close to their parent directory objects.

  * Where this is not possible, try to keep such small objects in a small
    number of "close" groups.

  * Avoid the need to move large amounts of data in order to make space for
    new objects.


5.2  Strategies

NewCore adopts the following strategies to try to meet these objectives:

  1) Large objects are separated from small objects by allocating chunks
     from the base of the disc (chunk-id 0) upwards to large objects, and
     using zones from the top of the disc downwards for small objects.

  2) Try to allocate a small file in the same zone as its parent directory.

  3) Try to distribute directories across zones.

Extra allocation information is maintained in order to implement these
strategies, and it is kept inside the zone table. Each 16-bit entry contains
the following information about a zone:

  field name   size   meaning

   flags      2 bits   0 => zone not currently allocated
                       1 => small object zone
                       2 => large object zone

   num-dirs  11 bits  number of directory objects allocated to the zone

 [  spare     3 bits  ]

Other information about a zone (such as total free space available, how many
unallocated chunks, etc.) can be readily calculated from entries in the chunk
table.

A zone is designated a "small object zone" if it is first used to hold a
small object, and is designated a "large object zone" if it is first used to
hold a complete chunk of a large object. From then on, small object zones
are used in preference to large object zones when allocating small objects -
and vice-versa.

However, when the disc fills up (and allocation becomes difficult) it may be
necessary to use a zone for a different purpose; and when the disc empties
again, it may become sensible to redesignate a zone. Suggested rules are as
follows:

  If there remain shared chunks in use when the last unshared chunk is 
  released in a large object zone, that zone is redesignated a small object
  zone; otherwise it is recorded as unallocated.

  If there remain unshared chunks in use in a small object zone when the last
  small object is deleted from it, then that zone is redesignated a large
  object zone; otherwise it is recorded as unallocated.

These rules are intended to keep small objects and large objects apart, and
to adjust the ratio of small object to large object zones to reflect the
dynamic requirements of the disc as time progresses. On the other hand, the
initial split of large objects at one end of the disc and small at the other
must inevitably become less clear as some large object zones appear in the
"small" area, and vice-versa.


The following subsections give detailed tactics for the various possible
allocation functions, distinguishing where necessary between objects for
files and objects for directories.


5.3  Allocating a small directory

NewCore tries to distribute directories amongst small object zones in such a
way that a suitable compromise between the following two conflicting aims
is maintained:

  - small object zones often contain sufficient space to allocate small
    objects close to their parents;

  - small object zones are normally kept reasonably full.

The strategy is based on a value called "max_dirs_per_zone" which is set to
reflect the average amount of space required by a small directory and its
immediate small files; if we estimate this latter amount at 64k, then we
would set max_dirs_per_zone = 16M/64k = 256.

NewCore tries to make sure that:

  - no small object zone ever has more than this number of directories
    allocated to it;

  - directories are not assigned to a new small object zone unless the zone
    including their parent is at least 75% (say) full.

In detail, the strategy is as follows:


Let Z be the zone containing the directory's parent directory's object-id.

  a) Choose Z if it is a small object zone, is less than 75% full, has
     sufficient free space, and has fewer than max_dirs_per_zone directories
     already allocated to it.

  b) Otherwise, choose the nearest unallocated zone - which is then
     designated a small object zone.

  c) Otherwise, choose the small object zone which has sufficient free space
     and the largest amount of free space per directory.

  d) Otherwise, choose the large object zone which has sufficient free space
     and the largest amount of free space per directory.

Notes:

  This strategy favours small object zones for small objects; only in extreme
  circumstances are large object zones considered.

  Each small object zone is "filled up" before moving on to a new zone; this
  avoids the situation where small directories and their related small files
  are sparsely scattered across the disc, making it more difficult to find
  chunks for large files, and distancing related directories unnecessarily
  from each other.


5.4  Allocating a small file

Let Z be the zone containing the file's parent directory's object-id.

  a) Choose Z if it is a small object zone, and contains enough free space.

  b) Otherwise, choose any other small object zone which contains small files
     belonging to the same directory as this one and which contains enough
     free space.

  c) Otherwise, continue as if the small file were a small directory from
     step (b) in the subsection above.

Notes:

  This strategy assumes that the parent directory is available, and that it
  is easy to find out which zones already include small files from that
  directory. The purpose of step (b) is to encourage grouping of files with
  a common parent even if the parent's zone itself is full.


5.5  Allocating a large file

Allocate an object-id from the zone containing the file's parent directory's
object-id.

For each complete chunk required:

  a) Choose the first free chunk found in a large object zone searching from
     the bottom of the disc.

  b) Choose the first free chunk in any unallocated zone, searching from the
     bottom of the disc; the zone is then designated a large object zone.

  c) Choose the first large object zone in which a chunk can be made free by
     compaction, searching from the bottom of the disc.

  d) Choose the first free chunk found in a small object zone searching from
     the bottom of the disc.

  e) Choose the first small object zone in which a chunk can be made free by
     compaction, searching from the bottom of the disc.

For any final section:

  a) Choose the zone containing the last complete chunk of the file, if it
     has sufficient free space.

  b) Choose the first zone containing sufficient space, searching from the
     bottom of the disc.

Notes:

  This strategy favours large object zones for large objects, and will use
  chunks near the base of the disc first. This should help to keep large
  files separate from small ones, and large object zones separate from small
  object zones.

  It may be worth sorting the complete chunks into ascending chunk-id order
  before chaining them together through the chunk table.

  The best strategy for choosing the zone to contain the final section is not
  obvious (should preference be given to large object zones? should
  preference be given to a nearby zone?), but I doubt that this is a critical
  area.


5.6  Allocating a large directory

Allocate an object-id from:

  a) The zone containing the directory's parent directory's object-id,
     provided that it is a small object zone, is less than 75% full, and has
     fewer than max_dirs_per_zone directories already allocated to it.

  b) Otherwise, choose the nearest unallocated zone - which is then
     designated a small object zone.

  c) Otherwise, choose the small object zone which has the largest amount of
     free space per directory.

  d) Otherwise, choose the large object zone which has the largest amount of
     free space per directory.

Now proceed to allocate space for the large directory exactly as if it were
a large file.
     

5.7  Deleting a small object


5.8  Deleting a large object


5.9  Extending a small object

  - possibly to a large object


5.10 Extending a large object


5.11 Truncating a small object


5.12 Truncating a large object
 
  - possibly to a small object



6  Low-level allocation strategy


6.1  Principles

A zone may contain both shared and unshared chunks, and we consider each zone
to be divided into one or more "shareable sections" by any unshared chunks
that it contains.

Each shareable section consists of a contiguous sequence of chunks, and small
objects can be placed at will within it; in particular, small objects can
cross any chunk boundaries within the section.

On the other hand, the low-level routines consider each shareable section
separately, and will not move objects from one to another; this means that
one aim of the low level allocation strategy must be to minimise the number
of shareable sections in each zone. (Note that the high-level already helps
by keeping small objects and large objects apart.)

If necessary, small objects within a shareable section will be moved in order
to coalesce packets of free space to create a contiguous area of the required
size. NewCore will do this by moving small objects down across areas of free
space - but will never move one small object across another.

This principle makes it easy to determine the best course of action, and to
design and test low-level strategies; it may also make recovery procedures
(after power loss, for example) more straightforward. On the other hand, it
will mean that there are times when more data than necessary is moved. Only
experiment will show whether this is a serious problem.


6.2  Does this zone have sufficient space for a small object?

The algorithm used by the high-level to determine whether a given zone has
sufficient space for a small object of size S is as follows:

  Does the zone include a shareable section whose total free space is greater
  than or equal to S?

Note that the chunk table contains sufficient information to answer this
question.

Once the high-level has chosen a zone, the low-level locates the space as
follows:

  a) Using the chunk table, determine the smallest contiguous sequence of
     shared chunks whose total free space is at least as great as S; often
     this will be a single chunk.

  b) Scan the corresponding parts of the zone's free space bit map to
     determine the smallest number of sectors which need to be moved in order
     to make available a contiguous sequence of sectors of length S; often a
     large enough gap will be found, and so no sectors will need to be moved.

  c) If necessary, compact:

        - move data
        - update free space bit map  (to note new allocations of sectors)
        - update indirection table   (for moved objects*)
        - update chunk table         (to note possible new distribution of
                                      free space across chunks)

         * This requires a scan through all of the indirection table, looking
           for any entry which refers to a sector that is marked as allocated
           within the affected area: note that without explicit length data
           we have no way to determine when all objects within the affected
           area have been processed.


6.3  Does this zone have sufficient space for a complete chunk?

The algorithm used by the high-level to determine whether a given zone has
sufficient space to release a complete chunk is as follows:

  Does the zone include a shareable section whose total free space is greater
  than or equal to the chunk size?

The low-level then vacates a chunk as follows:

  a) Using the chunk table, determine the shareable section closest to the
     top of the zone whose total free space is at least as great as the
     chunk size.

  b) Using the chunk table, and starting from the top of the shareable
     section identified above, determine the smallest contiguous sequence of
     chunks with sufficient free space.

  c) Scan the corresponding parts of the zone's free space bit map - again
     from the top downwards - to determine which sectors need to be moved
     down in order to create at least one chunk free at the top of the
     shareable section.

  d) Compact, as above.



7  Integrity


7.1  Principles

The major objective is to make sure that data correctly recorded on the disc
can always be considered consistent; a secondary objective is to make it
possible to retrieve data from a broken disc.


To meet the primary objective, we need to make sure that certain changes to
the disc structures appear as "indivisible" or "atomic" operations; for
example, moving a small object and updating its indirection table entry.

The general technique used to do this is as follows:

  a) Write a "state" record to the disc which includes enough information
     about the operation to either complete it, or undo it, should it be
     unexpectedly interrupted.

  b) Execute the individual disc updates that constitute the operation.

  c) Write a state record to the disc to note that the operation is complete.

Whenever a disc is first opened, its state record is examined; if this
indicates that an atomic operation was in progress, then NewCore will either
complete this operation or undo it before proceeding.

Often the information necessary to undo an operation is held in the "initial"
value of a mapping table, and the information necessary to complete an
operation is held in the "final" value of that table. For this reason, we
reserve two areas for each table, writing to each alternately; the state
record includes information to say which is the "current" version.

In fact, it is convenient to have both a global state record and a local
state record for each zone; the global record is associated with the zone
and chunk tables, and the local record with the indirection table and free
space bit map.

Only the global record need be examined when a disc is opened; each local
record needs to be examined when the corresponding zone is accessed for the
first time.


The secondary objective is met as a consequence of the first; that is, the
duplication of mapping information - and its distribution across the disc -
makes it more likely that worthwhile amounts of data can be retrieved from
damaged discs.


7.2  State block structure

Each state block contains the same information twice, and has the following
overall structure:

       epoch
    <state data>
       epoch
       epoch
    <state data>
       epoch

The "epoch" is an integer that is incremented each time a new state block is
written.

The state block may be one or more sectors in size; current plans are that
one sector will be sufficient.

If, when reading the state block, the epoch values are not all identical,
then it is possible that either the first pair will be the same, or the
second pair will be the same - in which case the corresponding state data
should be taken as valid.


As mentioned above, space is reserved on the disc for two copies of each
table controlled by the state block, and the state block itself contains one
or more bits to specify which copy is the current version.

If the table is always read and written in its entirety, a single bit is
sufficient; but if it is to be cached, a separate bit for each cachable unit
will be required.

For example, suppose that the indirection table for each zone has a maximum
size of 64k, and that it is to be cached in 4k units; then the state data
would include 16 bits to describe the current version of the indirection
table.


The remaining data which may appear in a state block depends on the possible
atomic actions, but is likely to consist of a state code, and a small number
of state parameters (eg object-ids). The next subsection gives an example.


7.3  Example - compaction

Suppose we wish to compact part of a shareable section as follows:

  sector-id:  17 18 19 20 21 22 23 24 25 26 27 28

               *  *  *  1  1  1  *  *  2  2  *  3
                               to
               1  1  1  2  2  3  *  *  *  *  *  *
                               (where * indicates a free sector)

The steps involved are:

    a) move the data:
          sectors 20 to 17, 21 to 18, ... 28 to 22

    b) update the indirection table entries:
          object 1 has moved from sector 20 to 17
          object 2 from 25 to 20
          object 3 from 28 to 22

    c) update the free space bit map

We start by writing a new state block to disc with:

   state_code = "Compacting"
   state_param1 = 17
   state_param2 = 28

The two parameters are the start and end sector-id that describe the area to
be compacted.

Next, step (a) is executed; note that provided the compacted data is written
to disc from low sector to high sector, the area always includes at least
one copy of each sector of each of the objects 1, 2 and 3. This means that
no data is lost if an unexpected interruption occurs during the compaction.

Now steps (b) and (c) are executed, and the updated indirection table and
free space bit map written back to disc - but to the location alternate to
that which was last used.

Finally, a new state block is written to disc with:

  state = "OK"

and which includes bits to say that the updated indirection table and free
space bit map are now the current versions.


What happens if an unexpected interruption occurs during this process, and
so the next time the zone is accessed the "Compacting" state is found?

Firstly, the state block will identify the original versions of the
indirection table and free space bit map; together with these, the two state
parameters are sufficient to work out all of the details of the compaction:

  - which objects were to be moved
  - which indirection table entries needed to be updated (and how)
  - how to update the free space bit map

Secondly, it should be possible to determine by examination how far the
actual compaction has progressed. [I have yet to convince myself that this
is always the case, but if not I am sure it is possible to break up any
arbitrary compaction in such a way that it is.]

This means that it is now possible to complete the compaction, or to undo
it; which choice is best might depend on how much data remains to be moved.



8  Typical scenarios


8.1  Performance parameters for various discs

CFS210A is a 1993 Conner 3.5" hard disc drive; CFS420A is the two-platter
version.

CP30084E is a 1992 Conner 3.5" hard disc drive; CP30174E is the two-platter
version.

CP30100 is a 1990 Conner SCSI hard disc drive.

M2511A is a 1992 Fujitsu magneto-optical disc drive.

I325VM is a 1991 Insite Floptical drive.

The example floppy is quad density.


  disc     capacity  sector sectors  tracks   num      seek times     average
                      size    per     per    surfs  min    av    max  latency
                             track  surface               (mS)           (mS)

CFS210A       191M*    512     82*    2388     2    3.0   14.0   26.0     8.3
CFS420A       382M*                            4
CP30084E       81M     512     46     1806     2    3.0   17.0   30.0     7.8
CP30174E      162M                             4
CP30100       116M     512     39     1524     4    8.0   19.0   35.0     8.8
M2511A        122M     512     25    10000     1    5.0   30.0   55.0     8.3
I325VM         20M     512     27      753     2   18.0   80.0  170.0    41.7
Floppy (quad) 1.6M    1024     10       80     2    3.0  120.0  240.0   100.0

  * These discs have a variable number of sectors per track (63 - 100); I
    have taken the average as the basis for these calculations, which may
    account for the disc size discrepancy - Conner call the CFS210A a
    210M drive, whereas these calculations suggest that it has only 191M.


  disc            "close" region             minimum chunk size

              max     percent  min       max      min  percent  max
              size      of     num     transfer  chunk   of     num
                       disc   zones      rate     size  disc   chunks

CFS210A      23.5M     12.3%     8      2.04MB/s  410k   0.2%    477
CFS420A                 6.1%    16                       0.1%    954
CP30084E      8.1M     10.0%    10      1.21MB/s  270k   0.3%    307
CP30174E                5.0%    20                       0.2%    615
CP30100       9.0M      7.8%    13      0.74MB/s  186k   0.2%    638
M2511A       16.2M     13.3%     8      0.57MB/s  195k   0.2%    641
I325VM        2.0M      9.8%    10      0.13MB/s  142k   0.7%    142
Floppy (quad) 174k     10.9%     9      49.3kB/s   95k   6.0%     16


8.2  Suggested parameter values for various disc sizes

The following table suggests parameter values based on the following "rules"
and constraints:

  - Try to choose zone size and chunk size to match disc performance
     characteristics given in the previous section.

  - Maximum "chunks per disc" and "sectors per zone" is 32k, since there are
     only 15 bits available for each.

  - Try to keep at least 16 chunks in each zone.

The "typical" size for the indirection table is calculated on the basis that
the average small object size is 8k.



 disc sector chunk zone sectors chunks zones  indirection  free  chunk  zone
 size  size   size size   per     per   per      table     space table table
                         zone    disc  disc    max   typ    map   size  size

 128G  1024    4M  32M    32k    32k      4k  128k    8k     4k    64k    8k
  64G  1024    2M  32M    32k    32k      2k  128k    8k     4k    64k    4k
  32G   512    1M  16M    32k    32k      2k  128k    4k     4k    64k    4k
  16G   512    1M  16M    32k    16k      1k   96k    4k     4k    32k    2k
   8G   512    1M  16M    32k     8k    512    80k    4k     4k    16k    1k
   4G   512  512k   8M    16k     8k    512    48k    2k     2k    16k    1k
   2G   512  512k   8M    16k     4k    256    40k    2k     2k     8k  512
   1G   512  512k   8M    16k     2k    128    36k    2k     2k     4k  256
 512M   512  512k   8M    16k     1k     64    34k    2k     2k     2k  128
 256M   512  512k   8M    16k   512      32    33k    2k     2k     1k   64
 128M   512  512k   8M    16k   256      16    33k    2k     2k   512    32
  64M   512  256k   4M     8k   256      16    17k    1k     1k   512    32
  32M   512  256k   4M     8k   128       8    16k    1k     1k   256    16

Floptical:
  20M   512  128k   2M     4k   160      10     8k  512    512    320    20

Floppy (quad density)*:
1600k  1024   32k 256k   256     50       7   612    64     32    100    13

 * Note that the match between the parameter values chosen and the "ideal"
   values for a floppy is not good: chunk size should be > 95k, and the zone
   size should be < 174k.

