Local area filesystems provide a separate view which has stronger consistency, but requires fairly fast connections, and removes access when the machines are disconnected. NFS, AFS, Coda, and xFS show various approaches to the local area filesystem. AFS and Coda provide some support for wide area filesystems, and Coda provides support for disconnected operation.
Question: Can these two different approaches be merged so that
the filesystem maintains varying levels of known consistency between
groups of machines?
Fault Tolerance
Various classes of machines need to be able to handle different types
of faults. Portable machines need to be able to handle a failure of all
remote nodes; similarly, wide-area or weakly connected machines may
also need to be able to handle failure of all remote nodes. Different
administrative domains inside a single organization may be unwilling
to create cross dependencies between their sub-groups. Even in local
area networks, failure of a central fileserver can cause everyone to
be unable to get work done.
Optimally, this fault tolerance would be embedded into the filesystem so that it is transparent to users. Conventionally, techniques like rdist and track are used to handle wide-area or portable machines and a subset of important files, while a central server is used to store shared, frequently updated data. A good filesystem would support both fault tolerance in the local area (so that some of the file servers can be lost without losing service, a la RAID disk systems) and weak consistency with complete fault tolerance for portable machines, wide-area machines, and machines in different administrative domains.
The simple question is: is this feasible?
Scalability
Sites are getting much larger as computer use continues to expand in
the workplace. As a result, file service already needs to be faster
and larger. Moreover, the difference between large central servers
and clients is diminishing, so the clients are putting a larger
relative load on the servers. For both of these reasons, centralized
fileservers are becoming a less effective solution. A potentially
better approach is to take advantage of the power of the clients to
assist the filesystem. For read requests, systems like CacheFS are
starting to help in this direction. We believe that the file service
also needs more scalability in both read and write performance.
If the shared filesystem capacity and capability scales with the
number of clients, then we gain additional benefits. We can keep
every file (including the root filesystem for each client) on the
shared filesystem. This allows very easy upgrades of clients because
the filesystem is globally available. Further, upgrades can be
applied even if the client is down. Finally, in the case of permanent
client failure, it is easy to bring up a replacement machine
since all of the client's files are on the shared filesystem.
Recoverability
There are two general classes of recoverability. The first is
recoverability from user errors. Users can lose files because of
accidental file deletion, or they may want to return to a previous
version. The second is recoverability from catastrophic failures. For
redundant filesystems (RAIDs), a catastrophic failure can occur due to
multiple failures. Similarly, natural disasters can require recovering
from off-site backups.
Previous systems (AFS, Plan 9) have demonstrated the advantage of getting to previous versions. AFS kept a special volume (.backup) which contained a snapshot of each volume at the time of a previous backup. Plan 9, through the use of special hardware (a WORM jukebox), provided time travel to an arbitrary point in the past. We believe that it would be useful to have arbitrary, user-specified snapshots (as well as periodic system snapshots), but that this support needs to be provided without the use of specialized hardware.
The second large class of recoverability is from catastrophic
failures. For this to be feasible, the filesystem needs to be able to
write out at least one (and preferably two) copies of the filesystem to
tape for off-site storage. It is possible that if the cost of disk
continues to fall faster than the cost of tape, it may be more
reasonable to mirror files to off-site disks. Regardless, we believe
that there are still significant challenges in supporting a filesystem
which can back up large files (many GB) as well as large filesystems
(many TB) without falling back to substantial administrative work for
partitioning up the filesystem space.
Customization
Often the view of the filesystem needs to vary between systems.
First, binaries compiled for one architecture do not work on others,
but for consistency, the files should all appear in the same place.
Second, some systems may have special hardware or roles that require
them to have additional or different files. Third, large software
collections require the ability to install and uninstall programs as
well as manage conflicts between programs. Fourth, some users may
need different versions of the same program at the same time to work
around bugs.
Clearly, customization at the level of machines is required. We believe that customization per user or per process may also be beneficial: users could access different versions of programs on the same machine, and processes could configure variant versions of the filesystem (a la chroot for security). However, this also introduces a potential nightmare for administrators, who can no longer easily tell how the filesystem appears to a particular program, and the number of potential configurations may become very large.
Flexible customization poses a number of problems for the
implementation. One approach works above the filesystem and has
been tried in depot and its variants. This approach uses symlinks
to build the customized appearance from some repository. Some
variants of depot support running programs after the customized
filesystem is built to create indices of separate parts (man page
indices, configuration indices, font lists, etc.). This support is
necessary for a fully general solution. However, symlinks can be
detected, causing some programs to behave incorrectly. Moreover,
getting programs to install into non-conflicting places in the first
place can be challenging. For these reasons, a solution embedded into
the filesystem looks appealing. However, supporting index creation
inside the filesystem may be challenging. A hybrid approach may turn
out to be best.
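The symlink approach can be illustrated with a minimal sketch. It assumes a hypothetical layout in which each package lives in its own directory under a repository root; the function name build_view and that layout are illustrative assumptions, not depot's actual interface. The sketch links every file into a single view and reports files that two packages both try to install at the same path.

```python
import os


def build_view(repository, view):
    """Symlink each file under repository/<package>/ into view/.

    Returns a list of (relative_path, package) pairs for files that were
    skipped because an earlier package already provided that path.
    """
    conflicts = []
    for package in sorted(os.listdir(repository)):
        package_root = os.path.join(repository, package)
        for dirpath, _dirs, files in os.walk(package_root):
            rel = os.path.relpath(dirpath, package_root)
            target_dir = os.path.normpath(os.path.join(view, rel))
            os.makedirs(target_dir, exist_ok=True)
            for name in files:
                link = os.path.join(target_dir, name)
                if os.path.lexists(link):
                    # Two packages want the same path in the view.
                    rel_file = os.path.normpath(os.path.join(rel, name))
                    conflicts.append((rel_file, package))
                    continue
                os.symlink(os.path.join(dirpath, name), link)
    return conflicts
```

Any program can call os.path.islink on an entry in the view, which is exactly how the symlinks become detectable. Index creation (man page indices, font lists) would have to run as a separate step after build_view returns.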
Monitoring and Diagnosing a Running System
Monitoring and Diagnosing a Running System: Discovering and Fixing
Performance and Correctness Problems, Planning for Long Term Trends.
We believe that for fault tolerance, each node should run its own copy of the database, reading from and storing to local disk. This database will only store the information for that node, but this approach helps guarantee that information will not be lost. The information then needs to be collected toward a single repository in order to make analysis and storage easier. The aggregation may happen in a tree for efficiency or to deal with geographic concerns. The aggregation will also have to happen in a fault-tolerant manner.
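A minimal sketch of tree aggregation under these requirements follows. The Summary and Node classes and the collect function are hypothetical names introduced for illustration, and a simple reachable flag stands in for real failure detection. Each node summarizes its own samples, merges the summaries of its reachable children, and reports unreachable children instead of failing.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Mergeable summary of one metric: count, sum, and maximum."""
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        self.count += other.count
        self.total += other.total
        self.maximum = max(self.maximum, other.maximum)


@dataclass
class Node:
    """Hypothetical host: its local samples plus the hosts it aggregates."""
    name: str
    samples: list = field(default_factory=list)
    children: list = field(default_factory=list)
    reachable: bool = True


def collect(node):
    """Summarize a subtree; return (summary, names of unreachable hosts)."""
    summary, missing = Summary(), []
    for value in node.samples:
        summary.add(value)
    for child in node.children:
        if not child.reachable:
            # The child's data remains in its local database until it returns.
            missing.append(child.name)
            continue
        child_summary, child_missing = collect(child)
        summary.merge(child_summary)
        missing.extend(child_missing)
    return summary, missing
```

Because the summaries are mergeable, an interior node forwards a fixed-size record rather than every sample, which is one reason a tree may be more efficient than direct collection.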
We would like the database itself to be fault tolerant; however,
given the type of updates and the extreme fault-tolerance
requirements, it may be simpler to build the fault tolerance on top of
the database. Similarly, we would like the database to automatically
update data via the various gathering mechanisms; again, however, we
believe we can build that on top. The primary area of concern is
scalability. With hundreds to thousands of hosts updating the
database continuously, the database could easily see thousands of
updates per second. We would like to avoid having to purchase a few
expensive machines solely to monitor the system.
Data Visualization
If there is only a single machine, then visualization is a
straightforward problem. The various metrics for the machine are
displayed on the screen for the administrator to see. However, as the
number of machines grows, this approach is no longer viable. Under
traditional designs, either a very small amount of information is
displayed about a lot of sources (e.g. up/down status for all the
routers in a system), or the other machines are ignored and a lot of
information is displayed about a few machines. The
problem is that the screen real-estate is fixed. Therefore,
displaying more information about more machines requires increasing
the information/pixel ratio. We propose to achieve this by use of
aggregation and by taking advantage of the high resolution color
displays that are available.
Aggregation compresses metrics of multiple systems into a single metric. Typical examples include average, max, and median. Unfortunately, aggregation loses information about the spread of the data. Since each of the aggregation methods has an associated measure of spread, we plan to display the aggregate value as well as the spread of the data, using two of the axes for displaying information on the screen. The related spread metrics are standard deviation, range, and semi-interquartile range (SIQR), respectively.
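The pairing of each aggregate with its spread can be sketched as follows. This is a minimal illustration using Python's standard statistics module; the function name summarize is hypothetical, and the use of the population standard deviation and of the module's default quartile method are assumptions rather than choices stated in the text.

```python
import statistics


def summarize(values):
    """Return {aggregate name: (aggregate, spread)} for one metric.

    values holds one reading per machine and needs at least two entries.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        # average paired with standard deviation
        "mean": (statistics.fmean(values), statistics.pstdev(values)),
        # max paired with range
        "max": (max(values), max(values) - min(values)),
        # median paired with semi-interquartile range
        "median": (statistics.median(values), (q3 - q1) / 2),
    }


# Example: load readings from five machines.
print(summarize([10, 20, 30, 40, 50]))
```

Each entry yields two numbers, which is what allows one display axis to carry the aggregate and a second axis to carry the spread.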
High resolution color screens provide a number of axes for
display. The first is pixels turned on or off in some area. We call
this fill. The next three axes correspond to the values in the HSV
color model. They are hue (the actual color), shade (saturation), and
tint (whiteness). We do not believe that you have to use all of the
axes at the same time, but we believe that all of the axes should be
available for use as humans can see over 100,000 different colors.
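One way to use the color axes is sketched below with Python's standard colorsys module, which names the three HSV components hue, saturation, and value. The particular mapping is an assumption made for illustration: the aggregate drives hue (green for low, red for high), the spread reduces saturation, and a third optional input drives value. The function name encode is hypothetical, and all inputs are assumed to be normalized to the range 0 to 1.

```python
import colorsys


def clamp(x):
    """Limit x to the range [0, 1]."""
    return min(1.0, max(0.0, x))


def encode(aggregate, spread, value=1.0):
    """Map normalized metrics to an (r, g, b) tuple of 0-255 integers."""
    hue = (1.0 - clamp(aggregate)) / 3.0  # 1/3 is green (low), 0 is red (high)
    saturation = 1.0 - clamp(spread)      # a wide spread washes the color out
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, clamp(value))
    return tuple(round(c * 255) for c in (r, g, b))


# A group of machines with a high aggregate and no spread is pure red.
print(encode(1.0, 0.0))
```

With this mapping, a group whose machines disagree widely fades toward white regardless of its aggregate, so the administrator sees at a glance that the single aggregate value should not be trusted.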
Simplifying Security
Simplifying Security: Raising the Level of Abstraction to Simplify
Security Programming. Using Existing Infrastructure to Aid
Acceptance.