Analysis of on-chip fault-tolerant distributed algorithms

Függer, Matthias

Record link:

https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-39917
http://hdl.handle.net/20.500.12708/14747

Title:

Analysis of on-chip fault-tolerant distributed algorithms

Citation:

Függer, M. (2010). Analysis of on-chip fault-tolerant distributed algorithms [Dissertation, Technische Universität Wien]. reposiTUm. https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:1-39917

CatalogPlus:

AC07808074

Publication Type:

Thesis - Dissertation

Language:

English

Authors:

Függer, Matthias

Advisor:

Schmid, Ulrich

Co-advisor:

Lundelius Welch, Jennifer

Organisational Unit:

E182 - Institut für Technische Informatik

Date (published):

2010

Number of Pages:

151

Keywords:

Fehlertoleranz; Verteilte Algorithmen; System-on-Chip; Formale Analyse; Asynchrone Schaltungen; VLSI

fault-tolerance; distributed algorithms; System-on-Chip; formal analysis; clockless circuits; asynchronous circuits; VLSI

Abstract:

Die formale Spezifikation und Analyse von hochintegrierten Schaltungen (VLSI Circuits) ist für die Verwendung in hochzuverlässigen Anwendungen unerlässlich. Zwei Trends im VLSI Design sprechen für einen Modellierungsansatz analog zu dem verteilter Systeme: (i) merkliche Kommunikationszeiten zwischen Schaltungskomponenten und (ii) nicht vernachlässigbare Fehlerraten. Trotz dieser vielversprechenden Gemeinsamkeiten können VLSI Circuits mit Modellen klassischer verteilter Algorithmen nur umständlich oder gar nicht spezifiziert und analysiert werden: Im Gegensatz zu klassischen verteilten Systemen, in denen Berechnungen von Prozessen in diskreten Schritten ausgeführt werden, benötigt die Modellierung von VLSI Circuits ein kontinuierliches Berechnungsmodell. Ein on-chip Algorithmus wird in diesem Kontext zu einer Menge von Schaltungskomponenten, die auf kontinuierlichen Berechnungsströmen operieren und via kontinuierlicher Nachrichtenströme kommunizieren. In der Dissertation wird ein Spezifikations- und Analyse-Framework vorgestellt, welches auf die Besonderheiten von fehlertoleranten on-chip Algorithmen abgestimmt ist. Dieses basiert auf einer dreifachen Darstellung des Verhaltens der Komponenten über die Zeit. Die elementarste Darstellung ist ein ``signal'', eine Folge von Ereignissen. Darauf folgt die abstraktere Repräsentation des ``status'', der mehrere ähnliche signals umfasst und schließlich die ``counting function'', die eine noch größere Menge von signals repräsentiert. Diese dreifache Darstellung erlaubt die Spezifikation von unterschiedlichen Schaltungskomponenten auf verschiedenen Abstraktionsniveaus. Die Stärken des vorgestellten Frameworks werden beispielhaft an einem asynchronen on-chip Algorithmus, das ist ein Algorithmus, der nicht von einem zentralen Taktgenerator versorgt wird, illustriert. Die Erweiterung des Frameworks um eine Petrinetz-ähnliche Spezifikationssprache erlaubt es einige für das Design von asynchronen Schaltungen zentrale Komponenten kompakt zu spezifizieren. Unter diesen befindet sich auch das ``General Join'' Modul, welches das Zusammenführen von mehreren Datenpfaden von verschiedenen Quellen auf eine fehlertolerante Weise ermöglicht. Neben einer vollständigen Spezifikation werden für dieses Modul auch allgemeine Aussagen über das Zeitverhalten abgeleitet. Weiters wird eine Implementierung des General Join Moduls, bestehend aus einfachen Subkomponenten, präsentiert und deren Korrektheit bewiesen. Im Gegensatz zu asynchronen werden synchrone Designs von einem zentralen Takt versorgt, welcher einen unvermeidbaren single-point of failure darstellt. Eine übliche Methode, um synchrone Designs fehlertolerant zu machen, ist die Replikation des Designs und der Taktquelle. Dies führt allerdings zum sogenannten ``Tick Generation'' Problem: alle einzelnen Komponenten mit fehlertoleranten, synchronisierten Taktsignalen, die nicht über die Zeit divergieren, versorgen zu müssen. In der Dissertation wird eine Lösung mit Hilfe einer Schaltung aus interagierenden General Join Modulen präsentiert, deren zeitliches Verhalten charakterisiert und deren Korrektheit bewiesen. Abschließend werden Grenzen einfacher Schaltungen aufgezeigt, die nur aus kombinatorischer Logik und deren Verbindungen bestehen. Es wird bewiesen, dass es keine derartige Schaltung gibt, die das ``short-pulse-filter'' Problem löst; ein Problem von großer Wichtigkeit für das Design von zustandsbehafteten Schaltungen, wie Arbitern, an deren Ausgängen keine beliebig kurzen, transienten Zustandswechsel erlaubt sind. Die Dissertation endet mit einem probabilistischen Ansatz zur Implementierung eines short-pulse-filters.

For Very Large Scale Integrated (VLSI) Circuits intended to be used in highly reliable applications, formal specification and analysis is mandatory. Two trends in VLSI design favour a modeling approach analogous to that used for distributed systems: (i) noticeable communication delays between circuit components and (ii) increasing failure rates. Despite these striking similarities, specifying and analyzing circuits by means of classic distributed system models is either overly lengthy or not possible: In contrast to classic distributed systems, where computations are performed by processes in discrete steps in time, digital circuits in general adhere to a continuous computation model. An on-chip algorithm thus becomes a set of circuit components that compute on continuous streams of events and communicate by continuous streams of messages. In this thesis a modeling and analysis framework tied to the peculiarities of fault-tolerant on-chip algorithms is presented. It is based on a three-fold representation of the behavior of a circuit's outputs over time. The most fundamental is the signal, a trace of events; next comes the more abstract status, grouping together similar signals, and finally the counting function, grouping together an even larger set of similar signals. The three-fold representation allows the specification of diverse circuit components at different levels of abstraction. The capabilities of this framework are then illustrated by applying it to clockless on-chip algorithms, that is, circuits that are not driven by a central clock. The framework is extended by a Petri net like specification language, which is used to state pivotal circuit components for building clockless fault-tolerant on-chip algorithms. Among those is the General Join module, a module that allows to merge data provided by different sources in a fault-tolerant manner. In the thesis a complete specification is provided and generic timing properties are derived. Furthermore, an implementation of a General Join module in terms of simpler circuit components is given and proven correct. In contrast to clockless circuits, synchronous circuits are driven by a central clock which inherently constitutes a single-point of failure. A common technique to make synchronous circuits fault-tolerant is by replication of the circuit and its clock source. Thereby, the problem arises to provide fault-tolerant, synchronized clock signals that do not diverge over time to each of the replicas. This problem is termed the ``tick generation'' problem. It is shown that an alternative to replicated synchronized clock sources is to let a set of General Join modules, forming an on-chip distributed algorithm, generate synchronized clock signals in the course of their interaction. A correctness proof and performance measures of this solution are derived. Finally limitations of a restricted set of on-chip algorithms are established: It is shown that there is no circuit, only comprising combinatorial gates and wires with constant delay that solves the short-pulse-filter problem, a problem of great importance when building state-holding devices like arbiters that are not allowed to glitch at their output. The thesis concludes with a probabilistic by-pass of these limitations.

Additional information:

Zsfassung in dt. Sprache

License:

In Copyright

Appears in Collections:

Thesis