Notes on integrity options

*** Can we use any of these techniques to undelete a file?

1) More secure directories

Greater security can be achieved as follows:

  - Allocate twice the space required for each directory block - 4k instead
    of 2k

  - Surround the data in each directory block by a pair of "guard" words:

        0:  guard word
              <data>
      2047: guard word

    Each time an updated copy of the directory block is written to disc, a
    new guard value is assigned - the simplest approach is to increment
    cyclically; a more sophisticated approach is to make this value a "check
    sum" byte.

  - Whenever a directory block is written to disc, it is duplicated; the
    scatter list specifies the same data twice.

  - Whenever a directory block is read from disc, the guard values are
    investigated:

      - they must be the same
      - if a checksum, the block's content can also be checked

    If any discrepancy is found, an error is issued.

This approach costs little:

  - double the disc traffic for directory writes (but the transfers are
     contiguous)
  - small processor overhead for checksum checks if included

The advantages are:

  - greater resilience to transient failures (where for some transient reason
    a block is not correctly written): such failures may be detected at the
    time, or not noticed;

  - greater resilience to arbitrary power loss

But there is little added protection against physical damage, since the two
versions of each directory block are contiguous.

Note also that automatic recovery is not possible - we know nothing about the
state defined by the previous version of the directory block, assuming it is
still readable - but it should certainly help a disc doctor program.



2) Zone integrity


The objective is to ensure that at all times the mapping information held
on disc within a zone is consistent with the objects in that zone; together
they describe the "current disc state".

This is achieved by using the state block to switch atomically between two
alternate sets of mapping information (free space map and indirection
tables), and by obeying the following rule:

  - Never write anything to the zone which makes the current disc state
    inconsistent.

More precisely, a "consistent zone" is one in which objects, indirection
table and free space map all agree with each other - or the state block
includes sufficient information to make this the case (for example, by
undoing the latest "transaction", or by completing it).


One way to ensure consistency of all zones would be to read the state blocks
of each one whenever a disc is mounted - but this would not be an acceptable
overhead for large discs (eg 4k zones on a 128G disc).

Instead, consistency is checked only when a zone is first accessed - and at
that time any necessary recovery is undertaken.



3) Consistency of zone table and chunk table information.


The zone and chunk tables contain:

  (a) summary information about each zone and chunk
  (b) the list pointers that define the contents of large objects

Information of type (a) is redundant in the sense that it can be recalculated
from information held in the zones themselves, whereas type (b) data is vital
to the integrity of the disc structures.

When chunks are allocated/deallocated to large objects, we use the global
state block - with its map of the chunk and zone tables - to ensure that
such changes are atomic as far as the disc structures are concerned.

But this is certainly likely to be an unacceptable overhead for small object
allocations/deallocations, and so a separate mechanism is required; the
problem is illustrated as follows:

  - Create a new small object in zone 50. Suppose it is allocated to chunk
     805 which previously had 40 sectors spare, and now has only 35 spare.

  - This means that the chunk table entry for chunk 805 must be altered from
    40 to 35.

These two operations need to be atomic - or the possibility arises that the
free space information in the chunk table will not be consistent with the
free space information in the zones themselves. We can achieve this with
extra writes to the global state block:

  - Write "New object 50/203 causes chunk 805's free space to change from
     40 to 35" to the global state block.

  - Update zone 50 to record new object allocation.

  - Update zone and chunk tables in memory to reflect new allocation.

  - Write back global state block and new versions of zone and chunk tables
     with new state = NULL.

This overhead is likely to be significant because the global state block will
not normally be "close" to the zone in which the small object is allocated.

The only alternative is to accept the possibility that the free space data in
the chunk table may be incorrect when it is first read in after a disc has
been mounted. Consequences are:

  - A decision might be taken to allocate space in a chunk which, in fact,
     does not have sufficient space available.

  - Space allocation might overlook a chunk because it apparently does not
     have sufficient space to be of use.

Possible strategies to manage this problem are outlined below.


The first strategy is to reconstruct the chunk table from information in the
zone tables whenever a disc is mounted. This is simple, but probably not
acceptable for large discs - for example, a 128G disc with 4k zones would
require perhaps as much as a minute to mount.


The second strategy is to construct the chunk table dynamically - the first
time that an entry is required, the information is read from the
corresponding zone. This has a similar disadvantage to the first - the first
request for space after mounting an almost-full disc might require the
reconstruction of all of the chunk table in one go.


A third strategy is to make sure that the chunk table is always up-to-date
before dismounting a disc, and saving a marker in the global state block to
this effect. Under normal use (mount, dismount), the disc will always have
a consistent chunk table when it is mounted, and so no reconstruction or
testing is necessary. Only when a disc is powered down without dismount is it
necessary to reconstruct the chunk table next time it is mounted. This is
probably acceptable for non-removeable media - especially now that users are
encouraged to shutdown rather than just switch off - but could be a source of
irritation with large removable media.


A fourth strategy is to accept that discrepancies will arise - and that
erroneous allocation decisions may be made - but to correct for this in an
incremental fashion:

  - Whenever a zone's mapping information is read in for the first time, the
     corresponding part of the chunk table is reconstructed*.

  - When a decision is made to allocate space inside a zone which has not yet
     been accessed, this decision is revisited after loading the zone's
     mapping information.

   [* We can use "version numbers" to determine quickly whether such
      reconstruction is necessary:
          - increment and store a global version number inside a zone's state
             block whenever this is updated.
          - keep a copy of this version number in the zone table
      If the values are the same, then the corresponding part of the chunk
      table is up-to-date.]

The worst that can happen is that an allocation request is rejected for lack
of space when, in fact, space is available but not visible because the zone
containing it and the corresponding chunk table entries are out of step. This
doesn't seem likely to be a problem in practice.


Overall, I favour the fourth strategy - possibly accompanied by elements of
the third, so that at least a user message can be issued if an allocation
request fails on a disc whose chunk table free space information is known
to be out-of-date.



4) Higher-level integrity


Assuring atomicity of higher-level transactions can be achieved by using a
combination of techniques; sometimes zone states will be sufficient, but
sometimes a global state will have to be used.


As an example, imagine that two files - accnt1 and accnt2 - contain bank
balances, and that we wish to transfer 50 pounds from accnt1 to accnt2.
The following sequence is safe:

   x1 = read(accnt1);
   x2 = read(accnt2);
   write_global_state("Updating accnt1 to x1-50, accnt2 to x2+50");
   write(accnt1, x1-50);
   write(accnt2, x2+50);
   write_global_state(NULL);

An alternative would be to write zone state information to *both* zones - 
where we assume accnt1 is in zone1, and accnt2 in zone2:

   x1 = read(accnt1);
   x2 = read(accnt2);
   write_zone_state(zone1, "Updating accnt1 to x1-50, accnt2 to x2+50");
   write_zone_state(zone2, "Updating accnt1 to x1-50, accnt2 to x2+50");
   write(accnt1, x1-50);
   write(accnt2, x2+50);
   write_zone_state(zone1, NULL);
   write_zone_state(zone2, NULL);

If the state information was written only to one zone - say zone1 - then it
would still be possible to read an inconsistent balance from accnt2 after a
restart following an unexpected power loss after writing to accnt1.



5) Example - creating a new small file


Find space for the file:

  This process may involve a number of separate transactions - for example,
  if small objects need to be moved around within a zone in order to free up
  sufficient contiguous space - but the effect of each of these is such that
  the disc state is always "satisfactory" between them.

  When allocation is complete, make the following changes in memory:

    - update the zone's version number, and store this in the state block
       and in the corresponding zone table entry
    - update the corresponding chunk table entry to note the reduced amount
       of free space available
    - set the indirection table entry for the new object, say:
          indirection_table[54] = 725
    - update the free space map:
          free_space_map[725<->754] = 1
    - note the following state information:
          zone_state = "30 blocks allocated at entry_id 54 for new file
                             $.dir3.fred"


Find space for the directory entry:

  Again, this may involve other transactions - but we assume that each is
  executed atomically, and that the zone disc state remains consistent
  between them.

  Fill in the directory entry with the object_id - say 347/54.


Commit the zone in which the new object is to lie - that is, write the state
block, indirection table and free space map to disc. (Note that this may
already have happened as a consequence of other transactions during the
search for space for the directory entry).


Write out the directory block containing the new entry.


Set zone_state = NULL, and commit the object's zone again.


Notes:

  It's probably more elegant, safer and no less efficient to commit the zone
  in which the new object is to lie before starting to look for the directory
  entry.

  It may be better to look for the directory entry before looking for space
  for the file.

  Note that all is well even if the directory block is in a different zone -
  since by the time it is written, the object will have been created. [If it
  were the other way around, we might read the directory entry and imagine
  all is well, since the directory entry's zone's state is NULL.]


Comments:

  If we assume that space is already available both for the file and
  directory entry, the sequence requires five transfers:

    - write indirection table block
    - write free space map block
    - write zone state block
    - write directory block
    - write zone state block

  If we did not require atomicity, the two state block transfers could be
  omitted, resulting in just three transfers.

  If the state block, indirection table and free space map are small enough
  so that it is more cost-effective to always write them back in their
  entirety (perhaps on a floppy disc, for example), we have just three
  (longer) transfers - or two if atomicity is not required.

  It may help to allow scatter-lists which specify transfers from/to multiple
  disc addresses - thus making it more probable that indirection followed by
  free map transfers will complete in the same revolution.



6) Example - deleting a small file


  Read object-id and size from directory entry.

  Commit "Deleting $.dir3.fred - 27/54 size 30" to the directory zone.

  Update indirection table and free space map in zone 27 to reflect the
   deletion of entry-id 54 of size 30.

  Delete the directory entry - this may involve other transactions (possibly
   further object deletions if a directory block can now be released).

  Set zone_state=NULL for the directory zone.

  Commit the object zone.

  Commit the directory zone.


Note:

  If the directory and object zone are the same - often the case - then these
  last two steps are just one.



7) What FileCore does


It seems that:

  a) FileCore always writes out new mapping information twice.

     The two copies of the map are contiguous, and so if block 5 of a map of
     size 25 blocks is updated (ie an allocation is made in zone 5 of a disc
     with 25 zones), FileCore writes out - using scatter-write - blocks
     5, 6, ... , 23, 24, 0, 1, 2, 3, 4, 5 as a contiguous transfer. In other
     words, the new block 5 is written twice, and all other blocks are
     rewritten on the way. Presumably this request to transfer 26 blocks is
     quicker than asking the device driver to do two separate transfers of
     one block each.

  b) FileCore does not appear to be careful about directory entry/allocation
     consistency.

     For example, creating a new file proceeds as follows:

        - commit allocation by writing out new map information
        - write out updated directory

     And deleting a file:

        - write out updated directory
        - commit deallocation by writing out ne map information

     In both cases, it is possible that space becomes "lost" on the disc if
     an unexpected power loss occurs between the two operations.



8)  Policy


We want to make data held in NewCore filing systems resilient to drive and
media failures, but there is a cost associated with this. These notes are a
summary of my draft proposals; I'd welcome feedback - particularly on
whether you think that the overall policy is about right (for example, I'm
not convinced that checksums on mapping/directory blocks are "value for
money").


a)  Unexpected power loss (or media removal)

Assumption:

  Any sector write in progress at the time is completed.

Objectives:

  a1)  Object structures and mapping information is consistent - or can be
        made so by simple automatic recovery actions when power is restored.

  a2)  Directory structures and files are consistent - or can be made so by
        simple automatic recovery actions when power is restored.

  But note that file contents may not be up-to-date, since data can be
  buffered in RAM for indefinite periods - at least until the client closes
  or flushes the file's handle.

Discussion:

  These objectives can be achieved by judicious use of "atomic update"
  techniques and/or by ensuring that sufficient information about a change
  is written to disc before the change is started to undo it later if
  necessary.

  There will inevitably be an overhead associated with this, but only when
  creating, deleting or changing the size of objects - reading or writing
  to files will be unaffected.

  For comparison, I believe that FileCore makes a serious attempt to meet
  (a1), but does not take steps to meet (a2).


b)  Transient write failures

Assumption:

  Very occasionally, a write transfer will appear to be successful, but in
  fact the data recorded on disc is not correct.

Objective:

  b1)  Incorrectly written mapping information or directory information can
        be detected - but no automatic recovery is possible.

Discussion:

  This objective can be achieved by maintaining checksums on mapping and
  directory blocks. There is a small processor overhead associated with this.

  Note that file contents are not treated in this way.

  Thought: should new device drivers include an option to "read-after-write"
  for use by the paranoid?


c)  Permanent disc damage

Assumption:

  Damage is often limited to a relatively small area of the disc,
  corresponding to a contiguous part of the address space.

Objective:

  c1)  Provide as much assistance as possible to a disc recovery program
        which will try to recover user data from the remains of the disc.

Discussion:

  Zones are largely self-contained, and so recovery of all small objects
  from undamaged zones should always be possible.

  Satisfactory recovery of a large object will depend on whether the chunk
  table is readable, since it is there that the list of chunks which make up
  the object is held.

  Mapping blocks will live in fixed positions inside zones, and directory
  blocks will be easily recognisable - thus helping partial recovery of user
  data from damaged zones.

  We may choose to allocate space for two copies of each mapping (and/or
  directory) block, using each location alternately as part of the process
  of meeting objectives (a1) and (a2). However, these locations will be
  chosen to be close to each other (to minimise transfer overheads) rather
  than far apart (which would, instead, minimise risk of data loss if the
  media is damaged).



9)  Radical alternative suggestion


Current developments in hard disc drive technology suggest that drives are
becoming ever more intelligent, and contain larger and larger RAM caches.
It seems only a matter of time before these drives will cease to guarantee
the order of sector writes (this may already have happened), and at this
point coping with unexpected power loss begins to be more like a nightmare.

A radical alternative view is based on the following assumptions:

  - Disc drives are very reliable.
  - Users can bear to lose a little of their recent work if they can recover
     all other data without fuss (even if it takes explicit action and time)
     after unexpected power loss (or "switching off at the wrong moment").
  - Users would be delighted to have a disc recovery program which had a
     fighting chance of retrieving all data possible after a disc error or
     media damage.

Instead of trying to ensure that recorded disc states are always consistent,
we try to ensure that:

  - Vital mapping and directory information is always readable.
  - It's easy to distinguish between, say, directory blocks and file blocks,
     so that inconsistencies in the structures are readily detectable.

One way to make sure that mapping and directory information is likely to be
available after unexpected power loss/limited disc damage is to write one or
more extra copies of it from time to time - making sure that these extra
copies are geographically dispersed.

For example, we could choose to copy each zone's mapping information from the
start of the zone to the end of the zone at suitable times:

  - upon shutdown
  - after N updates (as part of the Nth update)
  - after N seconds (if it's changed)

The third option is rather un-RISC OS like, and could disturb other actions
in progress (such as Replay!) - but the first two options are distinctly
feasible.

What to do about directories is rather trickier, since there may be such a
large number of them; on the other hand, if we're willing to perform extended
updates in memory (eg deleting a number of files at once), then replacing
each disc update by two geographically distinct updates may not appear as a
performance degradation.
