Reliability analysis in Parallel and Distributed Computing Systems

R. Seethalakshmi; K.S. Ravichandran

Citation

R Seethalakshmi, K Ravichandran. Reliability analysis in Parallel and Distributed Computing Systems. The Internet Journal of Medical Informatics. 2008 Volume 5 Number 1.

Abstract

Sustained research in parallel and distributed computing has driven the growth of high-speed computers and communication technology, which in turn has made real-time distributed and parallel systems commercially viable. Distributed and parallel systems provide cost-effective ways of improving resource sharing, performance, throughput, fault tolerance and reliability. Their redundant resources and the cooperation among their processing elements significantly affect reliability and fault tolerance. In general, a parallel and distributed system requires files, data, databases and processing elements to operate successfully, while communication links and resources are essential for a reliable distributed system. A distributed system is typically formed as a Network of Workstations (NoW), which consists of a network of processes that communicate with each other using messages. A parallel system can be either tightly coupled, that is, a multiprocessor-based system such as a supercomputer, or loosely coupled, that is, a multi-computer-based system such as a cluster. This research work focuses on a fuzzy logic based approach to the reliability analysis of such systems. Reliability analysis is a discipline that applies mathematical techniques to the measurement and prediction of reliability, where a reliable system is one that functions without failure for a specified period of time under specified conditions. Several factors influence the reliability of a system, such as reducing its complexity, improving the reliability of its components or subsystems, implementing large safety factors and using redundant components; this paper focuses its analysis on these issues.

Introduction

Reliability analysis is a discipline that applies mathematical techniques to the measurement and prediction of reliability, where a reliable system is one that functions without failure for a specified period of time under specified conditions. Reliability is also a measure of how well a system meets its design objectives, and it is expressed as a function of the reliabilities of the system's subsystems or components. In this paper, reliability analysis is categorized as follows:

Reliability analysis of Parallel Processes
Reliability analysis of Distributed Processes
Reliability analysis of Object Migration
Reliability analysis of Configurations

This paper addresses the design and creation of processes, the distribution or parallelization of processes or data, and the analysis of such processes. A highly sophisticated environment is assumed for performing the analysis.

Reliability Analysis of Parallel Processes

The aim of parallel high-performance computing is to minimize the turnaround time of a specific application problem, to maximize the problem size that can be solved in a given amount of time, and to solve large-scale problems that could not be solved otherwise. Such systems are true supercomputers and clusters, which cost on the order of tens of millions of dollars and are O(1000) times faster than desktop systems. They also aim to reduce the time per instruction and to increase the number of instructions executed per clock cycle. Typically, a parallel system has a master (or controller) process that sends fragments of work to each of a set of workers (processors). Workers perform the specified work, return the results when they are done, and request more work when idle.
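The master and worker pattern described above can be sketched in a few lines. This is a minimal illustration, not the environment used in this work: the names `work` and `master`, the fragment size and the summing task are all assumptions, and Python threads stand in for worker processors, so the sketch shows the control flow rather than real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def work(fragment):
    """Hypothetical unit of work: sum the numbers in one fragment."""
    return sum(fragment)

def master(data, n_workers=4, fragment_size=3):
    """Split data into fragments, hand them to workers, combine the results."""
    fragments = [data[i:i + fragment_size]
                 for i in range(0, len(data), fragment_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker takes the next pending fragment as soon as it is idle;
        # map() returns the partial results in fragment order.
        partial_results = list(pool.map(work, fragments))
    return sum(partial_results)

total = master(list(range(1, 11)))  # sums the integers 1..10
```

The pool plays the role of the set of workers: the master never assigns a fragment to a particular worker, it only queues work, and idle workers pick up whatever remains.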

Analysis of Parallel Processes

The Message Passing Interface (MPI) provides support for point-to-point message passing, collective communications, communicators to group messages and processes, inquiry routines to query the environment, and constants and data types. The analysis of parallel processes can be carried out in two ways:

Task Parallelism [small simulated MIMD model]
Data Parallelism

Task Parallelism

Task parallelism runs different tasks in parallel, with the tasks communicating with one another. Figure 1 shows the task-level parallelism of various nodes in the network; the result of a task is obtained by exchanging messages.

Figure 2: Analysis of Task Level Parallelism

Reliability Analysis of the MIMD Model

The analysis of task-level parallelism is carried out in a parallel environment in which four different tasks run in parallel, each supplied with different data, and the results are analyzed. The reliability analysis of this system is also carried out. Figure 2 below shows the analysis of the parallel processes, and Table 1 gives the reliability analysis.
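As a rough illustration of four different tasks running in parallel on different data, the sketch below runs four hypothetical tasks concurrently and records which of them completed. The function names, the tasks and the data are assumptions, and the fraction of tasks that finish without error is only one crude empirical indicator, not the reliability measure reported in Table 1.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(task_list):
    """Run each (function, data) pair concurrently.

    Returns (ok, value) pairs in task order; ok is False if the task raised.
    """
    with ThreadPoolExecutor(max_workers=len(task_list)) as pool:
        futures = [pool.submit(fn, data) for fn, data in task_list]
        outcomes = []
        for future in futures:
            try:
                outcomes.append((True, future.result()))
            except Exception as exc:
                outcomes.append((False, exc))
    return outcomes

def success_fraction(outcomes):
    """Fraction of tasks that completed without raising an exception."""
    return sum(ok for ok, _ in outcomes) / len(outcomes)

# Four hypothetical tasks, each supplied with its own data.
tasks = [(sum, [1, 2, 3]), (max, [4, 9, 2]), (min, [7, 5, 8]), (len, [0, 0, 0, 0])]
outcomes = run_tasks(tasks)
```

Unlike a data-parallel run, each worker here executes a different function, which is what makes the model MIMD-like.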

Table 1: Reliability Analysis

Figure 3: Reliability Analysis of Parallel Processes

Data Level Parallelism

SIMD (Single Instruction stream, Multiple Data stream) architectures are essential in parallel computing. Their ability to process large amounts of data in minimal time has created considerable demand in areas such as student records and research. The power of this type of architecture is most apparent when the number of processor elements equals the size of the given array. Several processors then perform the same specified operation simultaneously, under the specified condition, but on different data sets. Figure 3 shows the graph of the reliability analysis of typical processes.

Figure 4: Communication in Distributed Message Passing Systems

Analysis of Distributed Processes

A distributed network is formed by a set of communicating distributed processes. Different processes can run on different machines, connected using protocols and sockets. A Windows-based distributed message-passing scheme is used to enable communication between the processes. If a message reaches its destination correctly, the reliability of the link is 1. Otherwise the link is found to be faulty; it is repaired and added back, after which the receiver process receives the message and tries to execute the task that the message specifies. This gives rise to a dynamic process execution environment. The reliability is assigned a value between 0 and 1 based on current status using a triangular membership function, and it is found to be enhanced by 20% compared with the fault tree based technique and the Minimal Spanning Forest based approach. Figure 4 below shows a message-passing client-server scenario in which the server executes the tasks requested by the clients and returns the results to them, and Figure 5 shows a typical distributed process reliability.
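A triangular membership function of the kind mentioned above can be sketched as follows. The parameter names `a`, `b` and `c` (left foot, peak and right foot) and the idea of feeding in a numeric status score are assumptions made for illustration; the text does not specify which quantity is fuzzified or which breakpoints were used.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising linearly to 1 at b."""
    if not a < b < c:
        raise ValueError("require a < b < c")
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # rising edge
    return (c - x) / (c - b)       # falling edge

# Hypothetical status score of 0.25 on a scale whose peak is at 0.5.
membership = triangular(0.25, 0.0, 0.5, 1.0)
```

The result always lies in the interval from 0 to 1, which is what allows a status that is neither fully working nor fully failed to be graded rather than forced into one of the two extremes.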

Figure 5: Distributed Process Reliability

Figure 6: A highlighted migrated file

Reliability Analysis of Object Migration

Object migration is one of the most frequently performed operations in a Distributed Computing Environment (DCE), since it enhances the performance of the DCE, and the reliability analysis of the Distributed Object Migration Environment (DOME) is therefore considered very important. DOME addresses three major issues of parallel computing in an architecture-independent manner: ease of programming, dynamic load balancing, and fault tolerance. Location transparency ends where performance considerations come into play. Even in state-of-the-art distributed systems, access to a remote object is orders of magnitude slower than access to a local object. A parallel program executed in a distributed environment can therefore utilize the full power of the parallel machine only if communication is minimized. Object migration is one way of adapting the distribution layout to the changing locality requirements of the application. Manual object placement is also possible. Again, a fuzzy logic based reliability analysis is performed, and probability is distributed between 0 and 1 for the various states of migration, such as no migration, migrated and restarted, and migrated and resumed. The research work is carried out using RMI and "JACK", an agent-based tool. Figure 6 below shows a typical file migration process.

Reliability Analysis of Configurations

The reliability analysis of configurations includes the series configuration, parallel configuration and mixed configurations.

Series Configuration

Series systems function properly only when all their components function properly. Examples are sequences of links, networks, and layered company organizations in which information is passed from one hierarchical level to the next.

The reliability of a series system is easily calculated from the reliability of its components. Let Y_i be an indicator of whether component i fails: Y_i = 1 if component i fails and Y_i = 0 if component i functions properly. Denote by P_i = P[Y_i = 1] the probability that component i fails. The probability of failure of a system with n components in series is then

P[system failure] = 1 − P[system survival] = 1 − P[(Y_1 = 0) ∩ (Y_2 = 0) ∩ ... ∩ (Y_n = 0)]
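If the component failures are additionally assumed to be independent, which the text does not state, the joint survival probability factorizes and the series failure probability becomes 1 minus the product of the component survival probabilities (1 − P_i). The sketch below computes this; the function name and the example probabilities are hypothetical.

```python
from math import prod

def series_failure(p_fail):
    """Failure probability of a series system with independent components.

    p_fail holds the component failure probabilities P_i.
    """
    # The system survives only if every component survives.
    survival = prod(1.0 - p for p in p_fail)
    return 1.0 - survival

# Two hypothetical components failing with probability 0.1 and 0.2.
p_system = series_failure([0.1, 0.2])  # 1 - 0.9 * 0.8, about 0.28
```

Adding components to a series system can only raise its failure probability, which is why reducing complexity appears among the factors that improve reliability.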

Parallel Configuration

In this case, the system fails only if all its components fail. For example, if an office has n copy machines, it is possible to make a copy if at least one machine is in good working condition; likewise, a system with mirrored disks keeps working if at least one mirrored disk is in good working condition. The probability of failure of a parallel system of this type is obtained as

P[system failure] = P[(Y_1 = 1) ∩ (Y_2 = 1) ∩ ... ∩ (Y_n = 1)]
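Because a parallel system fails only when every component fails, its failure probability under the added assumption of independent component failures is simply the product of the component failure probabilities. The following sketch illustrates this; the function name and the numbers are hypothetical, and independence is an assumption not made explicit in the text.

```python
from math import prod

def parallel_failure(p_fail):
    """Failure probability of a parallel system with independent components.

    p_fail holds the component failure probabilities; the system fails
    only if all of them fail.
    """
    return prod(p_fail)

# Two hypothetical redundant components failing with probability 0.1 and 0.2.
p_system = parallel_failure([0.1, 0.2])  # 0.1 * 0.2, about 0.02
```

Each extra redundant component multiplies the failure probability by a factor no greater than 1, which is the quantitative reason redundancy improves reliability.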

Mixed Configurations

A mixed configuration is a series-parallel or parallel-series configuration that consists of a cascaded set of processes, including proxy processes. Its reliability is determined by combining the component reliabilities with OR and AND operations.
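One way to read the AND and OR combination is that a series stage works only if all of its parts work (AND), while a parallel stage works if at least one part works (OR). The sketch below applies this reading to a hypothetical series-parallel system, again assuming independent components; the function names and reliabilities are illustrative only.

```python
from math import prod

def and_reliability(reliabilities):
    """Series (AND): every part must work."""
    return prod(reliabilities)

def or_reliability(reliabilities):
    """Parallel (OR): at least one part must work."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical system: two stages in series, each a redundant pair.
stage_1 = or_reliability([0.9, 0.9])            # about 0.99
stage_2 = or_reliability([0.8, 0.8])            # about 0.96
system = and_reliability([stage_1, stage_2])    # about 0.9504
```

A parallel-series layout is evaluated the same way with the nesting reversed: `and_reliability` is applied to each chain first, and `or_reliability` then combines the chains.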

Conclusion

Distributed Computing Systems (DCS) have become very popular for their high fault tolerance, potential for parallel processing, and better reliability performance. One of the important issues in the design of a DCS is its reliability performance. Traditional reliability indexes such as source-to-terminal reliability, survivability, multiterminal reliability, and K-terminal reliability are not directly applicable to the analysis of distributed reliability in a DCS without appropriate modification. Thus, new approaches and algorithms for the reliability analysis of DCS must be developed. These include the analysis of task-level parallelism, data-level parallelism and object migration, and the provision of redundancy with the help of reliable, tested configurations. Because parallel and distributed processes are highly dynamic, a fuzzy logic based analysis is performed.

