The object interface

I've been thinking through some of the issues raised by your response to my
draft interface, and I'm beginning to come to the conclusion that it doesn't
make sense to implement objects as an abstraction separate from directories
and files.

One example of the difficulties that arise is my contrived attempt to pretend
that the allocation routines do not need to know whether the objects they are
allocating are directories or files.

Here are two other examples.


1) The size of an object

At present, the size of an object is not held explicitly within the object
data-structures. To determine a small object's size, a scan of the
indirection table and a look at the free space bit map is necessary; for a
large object, the chunk list in the chunk table must also be followed.

This decision was taken because we know that every object will be referenced
by a directory entry, and that that's where its size will be held. We could,
of course, replicate this information - but this would double the size of the
mapping tables (and it's not obvious where we would hold the size of large
objects).

The object allocation routines need to know the size of an object: consider,
for example, extending a small object when there is no space immediately
following it and it has to be relocated. If we separate directories from
objects, this means that:

  a) The object routines are forced to determine an object's size in the
     complex and indirect way described above;
or
  b) The directory routines pass the object's size in as a parameter; eg.
     NC_object_reallocate(handle, strategy, old_size, new_size, ...).


2) Integrity

Consider the following problem: we wish to increase the size of a small file
in such a way that mutual consistency between directory and file is
maintained in the face of arbitrary power-downs.

One way to do this is as follows:

  a)  Determine how to extend the small object. This may, of course, mean
      that its object-id changes. Suppose its original object-id and size
      are given by old-id, old-size, and its new ones by new-id, new-size.

  b)  Write a "state" record to disc noting that the following action is
      in progress:

        "Extending object old-id from size old-size to new-size; its new
         object-id will be new-id, and it is referenced by directory entry
         "fred" in object dir-id"

  c)  Re-allocate the object (assumed atomic).

  d)  Update the directory entry (assumed atomic).

  e)  Erase the "state" record.

If power is lost at some arbitrary time during this sequence, then the
possibilities can be sorted out as follows:

  a)  Directory entry records "new-size" as size of object.

      This means that only step (e) was incomplete, so no recovery action is
      required.

  b)  Directory entry records "old-size" as size of object.

    ba)  Object new-id exists and has size new-size (as determined by a scan
         of the indirection table etc.).

         In this case, the object has been successfully extended, but the
         directory entry has not been updated: recovery action is to update
         the directory entry.

    bb)  Object old-id exists and has size old-size.

         In this case, no part of the operation has been completed, and so
         no recovery action is necessary.
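
The case analysis above can be sketched as a single decision function. This
is only an illustration: StateRecord, RecoveryAction and decide_recovery are
hypothetical names, and the two "exists" flags stand in for whatever scan of
the indirection table the object level would really perform. The
RECOVER_INCONSISTENT result is an addition of mine for combinations the
analysis above does not cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of the "state" record written in step (b). */
typedef struct {
    uint32_t old_id, new_id;     /* object-ids before and after the extend */
    uint32_t old_size, new_size; /* sizes before and after the extend */
} StateRecord;

typedef enum {
    RECOVER_NOTHING,          /* cases (a) and (bb): only erase the record */
    RECOVER_UPDATE_DIR_ENTRY, /* case (ba): complete step (d) */
    RECOVER_INCONSISTENT      /* assumption: none of the listed cases fits */
} RecoveryAction;

/*
 * dir_size:       size currently recorded in the directory entry.
 * new_obj_exists: object new_id exists and has size new_size.
 * old_obj_exists: object old_id exists and has size old_size.
 */
RecoveryAction decide_recovery(const StateRecord *rec, uint32_t dir_size,
                               bool new_obj_exists, bool old_obj_exists)
{
    assert(rec != NULL);
    assert(rec->old_size != rec->new_size); /* sizes must be distinguishable */

    if (dir_size == rec->new_size)
        return RECOVER_NOTHING;                 /* case (a)  */
    if (dir_size == rec->old_size) {
        if (new_obj_exists)
            return RECOVER_UPDATE_DIR_ENTRY;    /* case (ba) */
        if (old_obj_exists)
            return RECOVER_NOTHING;             /* case (bb) */
    }
    return RECOVER_INCONSISTENT;
}
```

Note that the function needs facts from both the directory level (dir_size)
and the object level (the two flags), which is the coupling this example is
meant to show.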

The point of this example is that there is a close connection between the
directory level and the object level; to define this as an interface between
the two would result in something like:

   handle = start_extend(old-id, old-size, &new-id, &new-size);     /* (a) */
   note_state(old-id, new-id, old-size, new-size, "fred", dir-id);  /* (b) */
   continue_extend(handle);                                         /* (c) */
   <update directory entry>                                         /* (d) */
   note_state(NULL);                                                /* (e) */

Not at all nice.




Here are some other thoughts on your responses - I'd like us to get together
again to talk through these, and to start discussing directory structures:
are you away over Christmas (other than 24-Dec to 03-Jan, of course!)?
I'm not.



>Mike,

>A few 'top of the head' comments on your disc object level interface:

>There is no need for a client of the disc object interface to know about
>zones. Zones are a concept used by a particular implementation of disc
>objects to try and keep related objects 'close' together. Zones may be
>irrelevant in other implementations (eg. RAM disc). Don't complicate the
>interface with the concept of zones. Use ObjectID (not ZoneID) for things
>like parent_zone.

This is another example of a problem resulting from the attempt to separate
objects from directories/files.

The allocation strategy for small files is to try to store them close to
their parent directory; if this fails, then try to store them in the same
zone as other away-from-home files in that directory. The idea is to keep all
the files in one directory in a small number of zones, rather than scattering
them to the four winds as soon as the home zone is no longer available.

This means that there must be some cooperation - or, at least, mutual
understanding - between the directory structure level and the object level.

I had half-imagined that each directory would maintain a list of the zones
in which objects within it are held, and that this would form the basis for
allocation of further files. Hopefully the number of zones in the list will
be small, even when the number of files (and hence object-ids) is large; this
is the rationale behind revealing zones to the outside world!
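
A minimal sketch of that strategy, assuming the directory keeps its zone list
as a plain array and that the object level can say how much contiguous free
space each zone has. ZoneId, ZONE_NONE, choose_zone and the zone_free table
are all hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ZoneId;            /* hypothetical zone identifier */
#define ZONE_NONE ((ZoneId)0xFFFF)  /* hypothetical "no suitable zone" value */

/*
 * Pick a zone for a new small file of 'size' bytes.
 * home:       zone of the parent directory (first choice).
 * dir_zones:  zones already holding away-from-home files of this directory.
 * zone_free:  assumed table giving the largest free run in each zone,
 *             indexed by zone-id, with n_zones entries.
 */
ZoneId choose_zone(ZoneId home, const ZoneId *dir_zones, size_t n_dir_zones,
                   const uint32_t *zone_free, size_t n_zones, uint32_t size)
{
    assert(zone_free != NULL);

    /* First choice: close to the parent directory. */
    if (home < n_zones && zone_free[home] >= size)
        return home;

    /* Second choice: a zone this directory has already spilled into. */
    for (size_t i = 0; i < n_dir_zones; i++) {
        ZoneId z = dir_zones[i];
        if (z < n_zones && zone_free[z] >= size)
            return z;
    }
    return ZONE_NONE; /* caller must pick a fresh zone and add it to the list */
}
```

The point of the sketch is that the search loop runs over the directory's
zone list, not over its object-ids, so it stays short even for a directory
with many files.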

>Better to supply scatter lists rather than assuming contiguous buffer space
>available? This may be particularly necessary when there is an open file
>cache as the client of the disc object interface.

Yes - I'll base the next iteration on your next iteration. What should I do
about scatter list entries that cross large file segment boundaries? Perhaps
I have to copy the entire scatter list adjusting the entries as necessary.
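
Copying the list while splitting entries at segment boundaries might look
something like the following. This is a sketch under assumptions: the
ScatterEntry layout, the name scatter_split and a fixed segment size are all
mine, not part of either draft interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter list entry: one contiguous piece of memory. */
typedef struct {
    uint8_t *addr;  /* start of the piece */
    uint32_t len;   /* its length in bytes */
} ScatterEntry;

/*
 * Copy 'in' to 'out', splitting any entry that would cross a multiple of
 * seg_size in object-relative address space. 'offset' is the
 * object-relative address of the first byte of the transfer.
 * Returns the number of entries written, or 0 if 'out' is too small
 * (an empty input list also yields 0).
 */
size_t scatter_split(const ScatterEntry *in, size_t n_in, uint32_t offset,
                     uint32_t seg_size, ScatterEntry *out, size_t out_max)
{
    size_t n_out = 0;
    assert(seg_size > 0);

    for (size_t i = 0; i < n_in; i++) {
        uint8_t *addr = in[i].addr;
        uint32_t left = in[i].len;

        while (left > 0) {
            /* Bytes remaining before the next segment boundary. */
            uint32_t room = seg_size - (offset % seg_size);
            uint32_t take = left < room ? left : room;

            if (n_out == out_max)
                return 0;
            out[n_out].addr = addr;
            out[n_out].len = take;
            n_out++;

            addr += take;
            offset += take;
            left -= take;
        }
    }
    return n_out;
}
```

Each output entry then lies wholly within one segment, so the transfer can
be issued segment by segment without further adjustment.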

>The sibling_zone_func() is horrible. Who owns the memory that a pointer is
>returned to? How long will the contents be valid? Need a general purpose
>list returning function. Either allow the space required to read the list to
>be determined, or allow the list to be split into chunks, or set a fixed
>maximum list size (eg. 16 entries). Or more simply just have say a first
>choice (parent) and second choice (neighbour), any more than two and it
>probably doesn't matter.

Perhaps a better interface would be:

  n2 = sibling_zone_func(handle, zone-list-buffer, n1)

That is, the object code supplies the buffer and specifies its length, and
the higher level copies zone-ids into it and says how many - if any - are
there.
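
Seen from the higher level, an implementation of this call could be as small
as the sketch below. The ZoneId and DirZoneList types are hypothetical, and I
have assumed the handle simply points at the directory's zone list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ZoneId;  /* hypothetical zone identifier */

/* Hypothetical directory-level record of the zones a directory uses. */
typedef struct {
    const ZoneId *zones;
    size_t n_zones;
} DirZoneList;

/*
 * Copy up to n1 zone-ids into the caller's buffer and return how many
 * were copied (n2 <= n1). The caller owns the buffer throughout, so there
 * is no question about who frees it or how long its contents stay valid.
 */
size_t sibling_zone_func(const DirZoneList *handle, ZoneId *buf, size_t n1)
{
    assert(handle != NULL);
    assert(buf != NULL || n1 == 0);

    size_t n2 = handle->n_zones < n1 ? handle->n_zones : n1;
    for (size_t i = 0; i < n2; i++)
        buf[i] = handle->zones[i];
    return n2;
}
```

With this shape a fixed maximum such as 16 entries is just the size of the
buffer the object code chooses to pass in.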

I guess we could live with a limited size zone-list - only very large
directories would benefit from, say, more than 16 entries; on the other hand
I suspect two choices are not sufficient.

Once again, this problem arises only because we are trying to implement
objects independently of directories and files.

>How are read and write errors reported?

Don't know - based on the device driver feedback?

>What are the exact semantics of background write? Eg. How are errors
>reported? Are foreground operations like reallocate possible when background
>operations are in progress? etc.

I don't think it should be possible to move an object whilst some background
read/write is in progress (if we allowed this, we'd have to update scatter
lists dynamically).

>It might be useful to have a combined operation, part foreground and part
>background. Eg. the case of reading ahead on a file. The part requested by an
>application would be foreground (and written into application space), the
>next bit of the file would be read in the background (into buffer space). It
>is no good if two separate read calls have to be made that result in two
>separate disc reads with a disc spin in between occurring.

Sounds a good idea.

>It might be useful for the disc object interface to be able to return the
>amount of an object that can be considered close to a particular part of an
>object. Eg. from 2MB to 3MB is close to the point 2.25MB in our
>implementation with 1MB chunk sizes. This info would be used by the open
>file cache to avoid doing read ahead over a boundary of such closeness. It
>may be a bad idea to do a speculative read ahead of part of a file that is in
>a widely different part of the disc to that which a client has actually
>requested. It may be better to leave the heads near the part actually being
>used. At the moment I believe the FileCore open file cache takes note of
>this and only issues read ahead requests for the remainder of a contiguous
>lump of part of a file.

Possibly a bit over-complex. Maybe read-ahead calls should be distinct from
background reads: that is, background reads are mandatory, read-aheads are
advisory (and so the low-level could choose to read ahead only as far as the
next boundary).
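
For an advisory read-ahead, the low level could trim the request along these
lines. A sketch only: readahead_clamp is a hypothetical name, and it assumes
the only "closeness" boundary is a fixed chunk size (1MB in the example
quoted above).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given an object-relative offset and the number of bytes the caller would
 * like read ahead, return how many bytes to read without crossing the next
 * chunk boundary.
 */
uint32_t readahead_clamp(uint32_t offset, uint32_t wanted, uint32_t chunk_size)
{
    assert(chunk_size > 0);

    /* Bytes from 'offset' up to the end of the chunk containing it. */
    uint32_t to_boundary = chunk_size - (offset % chunk_size);
    return wanted < to_boundary ? wanted : to_boundary;
}
```

With 1MB chunks, a 1MB read-ahead requested at 2.25MB would be cut to the
0.75MB that remains before the 3MB boundary. A mandatory background read
would bypass this clamp.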

>This level should fault access outside an object (and not rely on the upper
>levels to keep track of object size). (But use this level's allocated space,
>not any higher level idea of the actual size of the object).

See earlier discussion about size.

>It should be possible to find out the size of the object. (Preferably from the
>ObjectID not the handle).

See earlier discussion about size.

>The 'allocation strategy' and 'parent/neighbour objects' should be advisory
>only. Ie. they help the implementation to choose good placements for
>objects, but they are optional. This should be made clear in the interface
>description.

Yes.

>It should be made clear that the implementation will attempt to allocate
>exactly the amount of space requested, but may fail (upwards) due to
>rounding errors. (Rather than for example the implementation always
>returning 20% more space than requested on the assumption that the extra
>space may be needed in future). It should be up to the upper levels to
>allocate such 'extra' space if they feel it would be useful.

I'd go for this - although it's not what FileCore does (FileCore appears to
allocate as much as is available in one fragment).

>It may be useful in addition to the actual minimum space required to also
>pass an advisory likely maximum size. The implementation is free to ignore
>the advisory likely maximum size, but it may use it to aid placement so that
>future reallocations up to this maximum size are fast.

I'd say this was more complex than we need.

>What about changing the 'parent/neighbour objects'? Eg. when a file is moved
>to a different directory, do you expect to move the data on disc? Should
>upper levels create a new object?

Good question! No. No (but they could if they wished to!).

>Why have the concept of open objects? There is no extra state associated
>with an open object (apart from its handle) so what is the point? Will this
>level need to cache significant amounts of info about objects that are in
>use? If not, then what is the advantage of the concept of open objects? Open
>objects complicate the semantics of background functions.

Some state which could be associated with open objects includes:
  - object size
  - information about mapping between object-relative addresses and disc
     addresses (eg chunk-id and corresponding address of latest read)
  - list of outstanding background transfers
I guess the second item is the most compelling reason for considering open
objects as different from closed ones.
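
To make the list concrete, here is a rough sketch of what an open-object
record might hold and how the cached mapping would be used. Everything in it
is an assumption for illustration: the OpenObject layout, the name
openobj_translate, a fixed chunk size, and the idea that a chunk is
contiguous on disc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-open-object state, following the three items above. */
typedef struct {
    uint32_t size;             /* object size in bytes */
    uint32_t cached_chunk;     /* chunk index of the latest read */
    uint32_t cached_disc_addr; /* disc address where that chunk starts */
    bool     cache_valid;      /* false until a first read fills the cache */
    uint32_t n_background;     /* outstanding background transfers */
} OpenObject;

/*
 * Translate an object-relative address to a disc address using only the
 * cached mapping. Returns false if the address is outside the object or
 * lies in a chunk other than the cached one (a full lookup is then needed).
 */
bool openobj_translate(const OpenObject *obj, uint32_t offset,
                       uint32_t chunk_size, uint32_t *disc_addr)
{
    assert(obj != NULL && disc_addr != NULL && chunk_size > 0);

    if (offset >= obj->size)
        return false;
    if (!obj->cache_valid || offset / chunk_size != obj->cached_chunk)
        return false;

    *disc_addr = obj->cached_disc_addr + offset % chunk_size;
    return true;
}
```

Sequential access to a large object would hit the cached chunk most of the
time, avoiding a walk of the chunk list on every transfer.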

>If you do have the idea of open objects, then what about whole object
>operations? Eg. create object with given data rather than create, open,
>write, close.

Might be worthwhile.

>There is no facility for atomic writes (for things like directories). Ie.
>either whole write succeeds or whole write fails.

I'd favour instead giving guarantees about the integrity of the filing system
structures, and exporting the "note_state" mechanism. Higher levels can then
use this to implement whatever indivisibility they need.

>Conventions. I would always prefix everything exported by an interface with
>the same string (modulo capitalisation and separator). In particular the
>exported types would have the same prefix as the exported routines (although
>I normally use an initial capital for a type name and initial lower case for
>function names). Personally I prefer __ as a separator between interface
>name prefix and the rest of the name (this allows _ to be used within the
>name itself without confusion). Also, rather irrationally I tend to go for
>capitalisation as a separator within interface (type) names, but _ as a
>separator within function names. So personally I would go for: eg. DiscObjID
>and discObj__read(). Also I would tend to put the primary verb first and
>modifiers later eg. discObj__read_bgr(). Personal preferences vary!

>John.
