Reliability analysis in Parallel and Distributed Computing Systems
R Seethalakshmi, K Ravichandran
Citation
R Seethalakshmi, K Ravichandran. Reliability analysis in Parallel and Distributed Computing Systems. The Internet Journal of Medical Informatics. 2008 Volume 5 Number 1.
Abstract
Sustained research in parallel and distributed computing has led to the growth of high-speed computers and communication technology, which has produced the vision of commercially available real-time Distributed Systems and Parallel Systems. Distributed and Parallel Systems provide cost-effective ways of improving resource sharing, performance, throughput, fault tolerance and reliability. Their redundant resources and the cooperation among processing elements significantly affect reliability performance and fault tolerance. In general, a Parallel and Distributed System requires files, data, databases, and processing elements for its successful operation, whereas the communication links and resources are important for a reliable Distributed System. A Distributed System can typically be formulated as a Network of Workstations (
Introduction
Reliability analysis is a discipline that applies mathematical techniques to the measurement and prediction of reliability, where reliability is the ability of a system to function without failure for a specified period of time under specified conditions. It is also a measure of how well a system meets its design objectives, and it is expressed as a function of the reliabilities of the subsystems or components. Reliability analysis is categorized as follows:
- Reliability analysis of Parallel Processes
- Reliability analysis of Distributed Processes
- Reliability analysis of Object Migration
- Reliability analysis of Configurations
This paper addresses the design, creation, and distribution or parallelization of processes or data, and the analysis of such processes. A highly sophisticated environment is assumed for performing the analysis.
Reliability Analysis of Parallel Processes
The aim of parallel high-performance computing is to minimize the turnaround time of a specific application problem, to maximize the problem size that can be solved in a given amount of time, and to solve large-scale problems that could not be solved otherwise. The target machines are true supercomputers and clusters, which cost on the order of tens of millions of dollars and are on the order of 1000 times faster than desktop systems. They also aim to reduce the time per instruction and to increase the number of instructions executed per clock cycle. Typically, a parallel system has a master (or controller) process that sends fragments of work to each of a set of workers (processors). Workers perform the specified work, return the results when they are done, and request more work when idle.
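The master/worker pattern described above can be sketched as follows. This is a minimal illustration, not the system used in the paper: Python threads stand in for processors, and the names `master`, `worker`, `fragment_size` and the sum-of-squares workload are assumptions chosen for the example.

```python
import queue
import threading

def worker(tasks, results):
    """Pull work fragments until the master sends the None sentinel."""
    while True:
        item = tasks.get()          # an idle worker asks for more work
        if item is None:            # sentinel: no work left
            break
        index, fragment = item
        # Hypothetical workload: sum of squares of the fragment.
        results.put((index, sum(x * x for x in fragment)))

def master(data, n_workers=4, fragment_size=3):
    """Split data into fragments, hand them to workers, combine the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    fragments = [data[i:i + fragment_size]
                 for i in range(0, len(data), fragment_size)]
    for item in enumerate(fragments):
        tasks.put(item)
    for _ in threads:               # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    partial = dict(results.get() for _ in fragments)
    return sum(partial[i] for i in range(len(fragments)))
```

Workers pull from a shared queue rather than being assigned fixed shares, so a worker that finishes early simply takes the next fragment.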
Analysis of Parallel Processes
MPI provides support for:
- Task Parallelism [small simulated MIMD model]
- Data Parallelism
Task Parallelism
Task parallelism runs different tasks in parallel and enables communication between them. Figure 1 shows task-level parallelism across various nodes in the network; the result of a task is obtained by exchanging messages.
Reliability Analysis MIMD model
The analysis of task-level parallelism is carried out in a parallel environment in which four different tasks run in parallel, each supplied with different data, and the results are analyzed. The reliability analysis of this system is also carried out. Figure 2 shows the analysis of the parallel processes and Table 1 gives the reliability analysis.
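A small sketch of running four different tasks on different data at the same time is given below. The reliability metric used in the paper's Table 1 is not stated in the text, so the fraction of tasks that complete without raising an error is used here purely as an assumed stand-in; the function name `run_tasks` and the four example jobs are also hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(jobs):
    """Run each (function, data) pair concurrently.

    Returns the list of results (None for a failed task) and the fraction
    of tasks that completed without an exception (assumed metric).
    """
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(fn, data) for fn, data in jobs]
    results, succeeded = [], 0
    for future in futures:
        try:
            results.append(future.result())
            succeeded += 1
        except Exception:
            results.append(None)    # record the failure, keep going
    return results, succeeded / len(jobs)

# Four different tasks, each with its own data (MIMD style).
jobs = [(sum, [1, 2, 3]), (max, [4, 9, 2]), (sorted, [3, 1, 2]), (len, "abcd")]
results, observed_reliability = run_tasks(jobs)
```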
Data Level Parallelism
SIMD (Single-Instruction Stream, Multiple-Data Stream) architectures are essential in parallel computing. Their ability to manipulate large amounts of data in minimal time has created strong demand in areas such as student records and research. The power of this type of architecture is most visible when the number of processor elements equals the size of the given array. Several processors can then perform the same specified operation simultaneously, under the specified condition, but on different data sets. Figure 3 shows the graph of the reliability analysis of typical processes.
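The idea of applying one operation to different slices of an array can be sketched as below. Python threads only imitate the structure of data parallelism and do not behave like real SIMD hardware; the helper names `split` and `data_parallel` are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n):
    """Split data into n contiguous chunks whose sizes differ by at most one."""
    k, r = divmod(len(data), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def data_parallel(op, data, n_workers=4):
    """Apply the same operation op to every element, one chunk per worker."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda chunk: [op(x) for x in chunk], chunks))
    return [y for part in parts for y in part]   # results keep input order
```

When `n_workers` equals `len(data)`, each worker handles exactly one element, which mirrors the case of one processor element per array entry mentioned above.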
Analysis of Distributed Processes
A Distributed Network is formed by a set of communicating Distributed Processes. Different processes can run on different machines, and they are connected using protocols and sockets. A Windows-based distributed message-passing scheme is used to enable communication between the processes. If a message reaches its destination properly, the reliability of the link is 1. Otherwise the link is considered faulty; it is repaired and added back, after which the receiver process receives the message and tries to execute the task specified by the message. This gives rise to a dynamic process execution environment. Reliability values are distributed between 0 and 1, based on the current status of the links, using a triangular membership function, and the result is found to be 20% better than the fault-tree-based technique and the Minimal Spanning Forest based approach. Figure 4 shows a message-passing client-server scenario in which clients send and receive messages and the server executes the tasks requested by the clients and returns the results to them. Figure 5 shows a typical Distributed Process reliability.
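A triangular membership function maps a numeric input to a degree between 0 and 1. The sketch below shows the standard form only; the paper does not give its input variable or its breakpoints, so the status score `x` and the parameters `a`, `b`, `c` are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership with a < b < c.

    Returns 0 outside (a, c), rises linearly to 1 at x == b,
    then falls linearly back to 0 at x == c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical status scores on a 0..10 scale with the peak at 5.
degrees = [triangular(x, 0, 5, 10) for x in (0, 2.5, 5, 7.5, 12)]
```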
Reliability Analysis of Object Migration
Object migration is one of the most frequently performed operations in a Distributed Computing Environment (DCE) because it enhances the performance of the DCE, so the reliability analysis of DOME is considered very important. The Distributed Object Migration Environment (DOME) addresses three major issues of parallel computing in an architecture-independent manner: ease of programming, dynamic load balancing, and fault tolerance. Location transparency ends where performance considerations come into play. Even in state-of-the-art distributed systems, access to a remote object is orders of magnitude slower than access to a local object. A parallel program executed in a distributed environment can therefore utilize the full power of the parallel machine only if communication is minimized. Object migration is one way of adapting the distribution layout to the changing locality requirements of the application. Manual object placement is also possible. Here again a fuzzy-logic-based reliability analysis is performed, and the probability is distributed between 0 and 1 for the various states of migration: no migration, migrated and restarted, and migrated and resumed. The research work is carried out using RMI and "JACK", an agent-based tool. Figure 6 shows a typical file migration process.
Reliability Analysis of Configurations
The reliability analysis of configurations includes the series configuration, parallel configuration and mixed configurations.
Series Configuration
Series systems function properly only when all their components function properly. Examples are sequences of links, networks, and layered company organizations in which information is passed from one hierarchical level to the next.
The reliability of a series system is easily calculated from the reliability of its components. Let Yi be an indicator of whether component i fails or not; hence Yi = 1 if component i fails and Yi = 0 if component i functions properly. Also denote by Pi = P[Yi = 1] the probability that component i fails. The probability of failure of a system with n components in series is then
\[ P[\text{system fails}] = 1 - P[Y_1 = 0, \ldots, Y_n = 0] = 1 - \prod_{i=1}^{n} (1 - P_i), \]
where the last equality assumes that the components fail independently.
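As a numerical illustration, and assuming the components fail independently, the series failure probability is one minus the product of the component survival probabilities 1 - Pi. The function name `series_failure` is chosen for this sketch.

```python
import math

def series_failure(p_fail):
    """Failure probability of a series system.

    p_fail holds the component failure probabilities P_i, which are
    assumed to be independent. The system fails if any component fails.
    """
    return 1.0 - math.prod(1.0 - p for p in p_fail)
```

For two components with failure probabilities 0.1 and 0.2, the system survives with probability 0.9 x 0.8 = 0.72, so it fails with probability 0.28.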
Parallel Configuration
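In this case, the system fails only if all its components fail. For example, if an office has n copy machines, it is possible to make copies as long as at least one machine is in good working condition; similarly, a set of mirrored disks keeps working if at least one disk is in good working condition. The probability of failure of a parallel system of this type is obtained as
\[ P[\text{system fails}] = P[Y_1 = 1, \ldots, Y_n = 1] = \prod_{i=1}^{n} P_i, \]
where the last equality assumes that the components fail independently.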
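As a numerical illustration, and assuming the components fail independently, the parallel failure probability is the product of the component failure probabilities. The function name `parallel_failure` is chosen for this sketch.

```python
import math

def parallel_failure(p_fail):
    """Failure probability of a parallel system.

    p_fail holds the component failure probabilities P_i, which are
    assumed to be independent. The system fails only if every component fails.
    """
    return math.prod(p_fail)
```

Two redundant components with failure probabilities 0.1 and 0.2 give a system failure probability of 0.02, lower than that of either component alone.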
Mixed Configurations
A mixed configuration is a series-parallel or parallel-series arrangement consisting of a cascaded set of processes and proxy processes; its reliability is determined by combining the component reliabilities with OR or AND operations.
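One way to read the OR and AND combinations is in terms of failure events: a series stage fails if any member fails (OR), and a parallel stage fails only if all members fail (AND). The sketch below composes the two under an independence assumption; the function names and the nested-list layout are assumptions made for illustration, not the paper's notation.

```python
import math

def series_failure(p_fail):
    """OR of independent failure events: fails if any component fails."""
    return 1.0 - math.prod(1.0 - p for p in p_fail)

def parallel_failure(p_fail):
    """AND of independent failure events: fails only if all components fail."""
    return math.prod(p_fail)

def series_of_parallel(groups):
    """Series-parallel: redundant groups connected one after another."""
    return series_failure(parallel_failure(g) for g in groups)

def parallel_of_series(chains):
    """Parallel-series: redundant chains, each a series of components."""
    return parallel_failure(series_failure(c) for c in chains)
```

With the components [[0.1, 0.1], [0.2, 0.2]], the series-parallel arrangement fails with probability 1 - 0.99 x 0.96 = 0.0496, and the parallel-series arrangement fails with probability 0.19 x 0.36 = 0.0684.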
Conclusion
Distributed Computing Systems (DCS) have become very popular for their high fault tolerance, potential for parallel processing, and better reliability performance. Reliability performance is one of the important issues in the design of a DCS. Traditional reliability indexes such as source-to-terminal, survivability, multiterminal reliability, and