Stella Reliability Concepts - Predictability

 

Predictability for reliability

For any system that must interact with the outside world, in particular in control systems, the predictability of the system is a very important factor in establishing its reliability. If a system cannot respond to external events predictably, then it will not be reliable. In this context, predictable means that the response to an event is guaranteed to occur within a known time frame.

Contention for resources

In principle, computer systems are predictable. In practice it can be very difficult to predict the behaviour of a system if certain algorithms are used. There are two approaches to this problem.

The most common approach is to pretend that it does not exist. For example, the leading suppliers of semaphore based "real-time" systems all claim that all operating system calls have well-defined execution times and quote them. However, since many of these calls access resources protected by semaphores, the time to access those resources depends on whether they are already in use by another task. In general, these suppliers claim "worst case" figures for these calls that are less than the time taken for a "best case" semaphore ping (i.e., the figures are false).

There are two defences of this "optimistic" performance evaluation.

  1. Contention is a rare occurrence.
  2. The time taken for a semaphore ping depends on how long the other task will take before it releases the semaphore and, therefore, cannot be evaluated.

The first justification is specious. If there were no possibility of contention, there would be no need for protection. If contention is possible, it must be taken into account for the worst case. The second justification only reveals that these suppliers are well aware that they cannot even define the worst case performance of their systems.

Using semaphores renders the performance of a system intrinsically unpredictable, as the execution time of any function accessing a shared resource depends on the internal state of the system. Moreover, even if semaphore deadlocks are avoided, it is possible that several critical resources may be in use by low priority tasks when required by high priority tasks. In this case, the high priority tasks may be delayed by a cascade of semaphore pings.
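
As a minimal sketch of the problem (using POSIX threads here purely for illustration; the "real-time" systems criticised above have their own APIs), the nominally fixed execution time of the high priority task's call below in fact includes however long the low priority task chooses to hold the resource:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t resource = PTHREAD_MUTEX_INITIALIZER;

    /* Low priority task: holds the shared resource for as long as it likes. */
    static void *low_priority_task(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&resource);
        sleep(1);                       /* "how long the other task will take" */
        pthread_mutex_unlock(&resource);
        return NULL;
    }

    /* High priority task: the "well-defined" execution time of its call now
     * silently includes the whole time the low priority task keeps the
     * resource. */
    static void *high_priority_task(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&resource);  /* blocks until the holder lets go     */
        /* ... access the shared resource ... */
        pthread_mutex_unlock(&resource);
        return NULL;
    }

    int main(void)
    {
        pthread_t lo, hi;
        pthread_create(&lo, NULL, low_priority_task, NULL);
        usleep(10000);                  /* let the low priority task grab it   */
        pthread_create(&hi, NULL, high_priority_task, NULL);
        pthread_join(hi, NULL);
        pthread_join(lo, NULL);
        return 0;
    }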

Cascading delays are not, however, the most serious problem. The probability of a particular resource being in use when required rises rapidly as the load on the system rises. This causes the average operating system efficiency to drop dramatically with increasing load, which has the same effect on the processor as further increasing the load. The result can be a fold back in performance that reduces the system overload point by more than an order of magnitude with respect to the "normal" overload point established during system tests.

This non-linear behaviour means that statistical techniques cannot be used to predict the probability of overload and system collapse.
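
As a purely hypothetical illustration of this feedback (the 20 to 1 contention penalty and the linear contention model below are invented for the example, not measurements of any real system), a toy model shows how a system whose nominal overload point is 100% processor load can collapse at only a few percent of offered load:

    #include <stdio.h>

    #define PENALTY 20.0  /* assumed cost ratio: contended call vs uncontended */

    /* Toy model: the probability of hitting a busy semaphore is taken to be
     * the load the processor actually sees, and the load the processor sees
     * is in turn inflated by the contended calls.  The effective load is
     * found by iterating to a fixed point; if there is no fixed point below
     * 100%, the system has folded back and collapsed. */
    static double effective_load(double offered)
    {
        double eff = offered;
        for (int i = 0; i < 10000; i++) {
            double p = eff < 1.0 ? eff : 1.0;     /* contention probability   */
            double next = offered * (1.0 + p * (PENALTY - 1.0));
            if (next >= 1.0)
                return 1.0;                       /* saturated: collapse      */
            eff = next;
        }
        return eff;
    }

    int main(void)
    {
        for (int pct = 1; pct <= 10; pct++) {
            double eff = effective_load(pct / 100.0);
            if (eff < 1.0)
                printf("offered load %3d%%  effective load %5.1f%%\n",
                       pct, eff * 100.0);
            else
                printf("offered load %3d%%  COLLAPSED\n", pct);
        }
        return 0;
    }

With these (assumed) numbers the model collapses at about 5% offered load: the overload point is reduced by more than an order of magnitude with respect to the uncontended figure, and the transition is abrupt rather than gradual.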

The Stella approach is rather different. No mutual exclusion mechanisms are used within Stella so that the execution time of every operating system call is independent of when the call is made. Proprietary INTSAFE asynchronous operations are used for inter-task communications. More complex operations are divided into sequences of INTSAFE and atomic operations.

Contrary to current academic dogma, the use of atomic operations does not give extended interrupt latency or excessive pre-emption delays, as semaphores themselves require atomic operations internally. With guaranteed interrupt latencies of 6 "instructions" and pre-emption delays of 25 "instructions", Stella outperforms the leading semaphore based systems under the most favourable conditions. Unlike semaphore delays, however, delays due to atomic operations cannot accumulate or cascade. Under unfavourable, heavy load conditions, Stella outperforms classical systems, not by "percents", but by orders of magnitude.
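
The sketch below shows the general principle using standard C11 atomics (it is not the INTSAFE implementation itself, whose internals are not described here): a single-producer, single-consumer queue in which neither side ever waits for the other, so each operation completes in a small, bounded number of instructions whatever the other task happens to be doing:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QUEUE_SIZE 16u                /* power of two, assumed              */

    struct event_queue {
        void *slot[QUEUE_SIZE];
        atomic_uint head;                 /* written only by the consumer       */
        atomic_uint tail;                 /* written only by the producer       */
    };

    /* Producer side: may be called from an interrupt or another task.  It
     * never blocks; if the queue is full it simply reports failure. */
    static bool queue_put(struct event_queue *q, void *item)
    {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head == QUEUE_SIZE)
            return false;                 /* full: bounded-time failure, no wait */
        q->slot[tail % QUEUE_SIZE] = item;
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: also bounded, also never waits. */
    static bool queue_get(struct event_queue *q, void **item)
    {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head == tail)
            return false;                 /* empty                               */
        *item = q->slot[head % QUEUE_SIZE];
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

Because neither queue_put nor queue_get ever loops waiting on the other side, their worst case execution times can be stated once and remain true under any load.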

Response to external events

The academic approach to external event handling is that if the response is not good enough, you just need to use a faster processor. Anyone who has installed some of the current generation of "multimedia" games on a variety of PCs will know that this is far from the truth.

The typical problems of jerky displays (due to lack of video stream synchronisation), erratic interaction (due to loss of mouse or keyboard events) or sound break up (due to a failure to fetch the sound samples at the right time) are not directly related to the processor or peripheral speeds. A game with perfect sound but erratic interaction on one machine may have perfect interaction but broken sound on another. The speed of the computer is not always a significant factor.

This type of reliability problem results from a contention for that most precious of computer resources: processor time. The response of a particular task to events (either external events such as key strokes or timers, or internal events such as requests by other tasks) depends on when, and for how long, that task executes.

Scheduling predictability

For sharing out the processor time between tasks (scheduling), classical systems rely on giving tasks a numerical priority which is either an absolute priority (a high priority task cannot be "pre-empted" by a task with a lower priority) or a timesharing priority (the tasks share the processor time with higher priority tasks getting a greater share than the lower priority tasks).

In classical systems, the relationship between the numerical priority and allocation of processor time is, however, very loose and varies considerably with changes in machine load.

Moreover, the re-allocation of the processor to a new task (a task switch) does not just happen: it has to be requested. With conventional computer architectures, if a number of external events occur within a short period, all the requests to respond to external events will have to be treated before the highest priority task can service the most critical external event. This means that, in the worst case, scheduled response times depend more on the time taken by the scheduler to handle the requests than on the time taken for the task switch itself.

Classical "real-time" operating systems use algorithms than minimise the "benchmark time" - the basic task switch. This increases the cost of each scheduler request and, therefore, makes the response time to critical events highly dependent on the timing of non-critical events.

Stella tackles both of these predictability problems.

  1. Where a number of jobs co-operate, Stella provides priorities that are relative to that group of jobs. Unlike classical systems, where the priorities are all jumbled together, the allocation of processor time within a group of jobs is independent of the number of jobs outside the group or their priorities: it is much more predictable (see the sketch after this list).
  2. The scheduling algorithms used by Stella are designed to minimise the scheduler request cost rather than the task switch. The result is that the operating system's contribution to the scheduler unpredictability due to low priority external events is nearly 100 times smaller than for "market leader" real time systems.
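
The following sketch is a hypothetical illustration of the first point, not a description of Stella's actual scheduler: if the only priorities a job has are relative to the other jobs in its own group, nothing that happens outside the group can change the ordering inside it:

    #include <stddef.h>

    /* Hypothetical structures: priorities are held per group and are only
     * ever compared with priorities of jobs in the same group. */
    struct job {
        int relative_priority;        /* meaningful only inside its own group */
        struct job *next;
    };

    struct job_group {
        struct job *jobs;             /* the co-operating jobs of this group  */
    };

    /* Choose the job to run within a group.  Nothing outside the group - not
     * the number of other jobs, nor their priorities - appears anywhere in
     * this decision, so the allocation of the group's share of processor time
     * among its own jobs is unaffected by the rest of the system. */
    static struct job *pick_within_group(const struct job_group *group)
    {
        struct job *best = NULL;
        for (struct job *j = group->jobs; j != NULL; j = j->next)
            if (best == NULL || j->relative_priority > best->relative_priority)
                best = j;
        return best;
    }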

Direct external event handling

The real solution to handling external events predictably is, however, not to use even Stella's much more predictable scheduling, but to use a "state" machine approach.

Stella's External Event Manager is a non-prioritised, threaded code sequencer for external event handlers. It uses a pre-calculated scheduling sequence to guarantee that the system meets pre-defined constraints on response times and event repetition rates - with a better worst case performance than vectored interrupts.
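
A minimal sketch of the general idea (the handler names and the hand-written sequence below are invented for illustration; in Stella the sequence is calculated by the External Event Manager itself from the declared constraints):

    #include <stddef.h>

    typedef void (*event_handler)(void);

    /* Stub handlers standing in for real device polling routines
     * (hypothetical names, for illustration only). */
    static void poll_serial_port(void) { /* fetch pending characters */ }
    static void poll_timer(void)       { /* update time-outs         */ }
    static void poll_keyboard(void)    { /* fetch key strokes        */ }
    static void refresh_display(void)  { /* redraw one display strip */ }

    /* The sequence is calculated in advance so that each handler appears
     * often enough in one pass to meet its response time and repetition
     * rate constraints: there are no priorities, only positions. */
    static event_handler sequence[] = {
        poll_serial_port,   /* tightest repetition rate: appears most often   */
        poll_timer,
        poll_serial_port,
        poll_keyboard,
        poll_serial_port,
        poll_timer,
        poll_serial_port,
        refresh_display,    /* loosest constraint: appears once per pass      */
    };

    /* Run the sequence.  The worst case delay before any handler runs is the
     * sum of the bounded execution times of the handlers ahead of it, so it
     * is known before the system is ever started - a burst of events cannot
     * pile up unpredictably as it can with vectored interrupts. */
    static void run_sequence(void)
    {
        for (;;)
            for (size_t i = 0; i < sizeof sequence / sizeof sequence[0]; i++)
                sequence[i]();
    }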

Naturally, as it is Stella, the constraints may be changed or event handlers added or removed at any time - the External Event Manager sets up a new sequence for the new requirements. It is much easier and safer than fiddling about with vectored interrupts.  



Copyright (c) 1997 Tony Tebby.