Fault Tolerance for Repositories

InSight Rendering makes a reasonable attempt to protect against the loss of jobs when a repository fails. Every job is initially 'owned' by the repository to which it was submitted, and the owning repository is responsible for ensuring that the job gets done. If a repository goes down (for example, because of a machine failure or program crash), the fault tolerance mechanism is initiated. How fault tolerance is provided depends on:
  • Whether the domain has one or more repositories
  • Whether a repository shutdown is abnormal (failure) or normal (all jobs are suspended and the BusinessServiceServer service is stopped)
  • The state that the job is in (in-progress or queued)

If a domain has more than one repository, then each repository acts as a backup repository for the others. A backup repository is responsible for taking ownership of another repository's queued and in-progress jobs if that repository goes down.

If termination of a repository is normal (that is, user requested), in-progress jobs will be cancelled (the Jobs tab will show them as "cancelled"). If desired, the Jobs tab can be used to resubmit these jobs. If a backup repository is available, the backup will take ownership of queued jobs (that is, jobs not yet started) and resume them. If no backup is available, then queued jobs are resumed when the terminated repository is restarted.

If termination is abnormal (that is, due to failure), then the backup (if any) will take ownership of both queued and in-progress jobs and resume them. If no backup is available, then these jobs are resumed when the failed repository is restarted.
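
The rules for normal and abnormal termination can be summarized as a small decision function. The following Python sketch is illustrative only: the names Shutdown, JobState, Outcome and job_outcome are hypothetical and are not part of InSight Rendering. It assumes nothing beyond the rules described in this section.

```python
from enum import Enum


class Shutdown(Enum):
    NORMAL = "normal"      # user requested
    ABNORMAL = "abnormal"  # failure


class JobState(Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in-progress"


class Outcome(Enum):
    CANCELLED = "cancelled"
    RESUMED_BY_BACKUP = "resumed by backup repository"
    RESUMED_ON_RESTART = "resumed when repository restarts"


def job_outcome(shutdown: Shutdown, state: JobState, has_backup: bool) -> Outcome:
    """Return what happens to a job when its owning repository goes down.

    Hypothetical helper that only restates the documented rules.
    """
    # A normal shutdown cancels in-progress jobs, with or without a backup.
    if shutdown is Shutdown.NORMAL and state is JobState.IN_PROGRESS:
        return Outcome.CANCELLED
    # In every other case the job is resumed: by the backup if there is
    # one, otherwise by the original repository once it is restarted.
    if has_backup:
        return Outcome.RESUMED_BY_BACKUP
    return Outcome.RESUMED_ON_RESTART
```

For example, job_outcome(Shutdown.ABNORMAL, JobState.IN_PROGRESS, has_backup=False) returns Outcome.RESUMED_ON_RESTART, which corresponds to a failure in a single-repository domain.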

In a multiple-repository domain, repositories may send jobs to other repositories to be executed. Each repository monitors the other repositories in the domain and is aware of which ones are down, which prevents jobs from being sent to repositories that are not available. In addition, if a job is sent to another repository and that repository subsequently goes down, the sending repository will detect this condition and retry execution of the job.
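
The monitoring and retry behavior can be pictured with the following Python sketch. It is a simplified model, not the product's implementation: the class SendingRepository and its methods are hypothetical, and the retry policy shown (resend to the first available peer, otherwise run the job locally) is an assumption, because this section does not state where a retried job is executed.

```python
class SendingRepository:
    """Hypothetical model of a repository that sends jobs to its peers."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = list(peers)  # other repositories in the domain
        self.down = set()         # peers currently known to be down
        self.sent = {}            # job id -> repository executing the job

    def send(self, job):
        """Send a job to the first peer that is not known to be down."""
        for peer in self.peers:
            if peer not in self.down:
                self.sent[job] = peer
                return peer
        # Assumption: with no peer available, the job is run locally.
        self.sent[job] = self.name
        return self.name

    def mark_down(self, peer):
        """Record that a peer went down and retry the jobs sent to it."""
        self.down.add(peer)
        retried = [job for job, target in self.sent.items() if target == peer]
        for job in retried:
            self.send(job)  # retry on another available repository
        return retried

    def mark_up(self, peer):
        """Record that a peer is available again."""
        self.down.discard(peer)
```

In this model, a job sent to repository B is automatically resent elsewhere when mark_down("B") is called, and later calls to send() skip B until mark_up("B") is called.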

Scheduled jobs and event notifications are handled in a similar way. Like regular jobs, these are owned by the repositories on which they were originally requested. If a repository goes down, then the backup repository (if any) will take ownership and be responsible for performing the work until the original repository is restarted. Ownership is then transferred back to the original repository and the work continues there. If no backup is available (single-repository domain), then the work will continue when that repository is restarted.
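
The hand-over and hand-back of ownership described above can be sketched as follows. This Python example is a hedged illustration only: ScheduledWork, repository_down and repository_restarted are hypothetical names, and representing "no owner available" as None is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduledWork:
    """A scheduled job or event notification (hypothetical model)."""
    name: str
    original_owner: str
    current_owner: Optional[str]  # None: work is waiting for a restart


def repository_down(work: ScheduledWork, repo: str, backup: Optional[str]) -> None:
    """The backup (if any) takes ownership when the current owner goes down."""
    if work.current_owner == repo:
        # backup is None in a single-repository domain, so the work waits.
        work.current_owner = backup


def repository_restarted(work: ScheduledWork, repo: str) -> None:
    """Ownership returns to the original repository once it is restarted."""
    if work.original_owner == repo:
        work.current_owner = repo
```

For example, work owned by repository A moves to backup B when repository_down(work, "A", "B") is called, and returns to A when repository_restarted(work, "A") is called.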