Ben Brothers and Fa Yoeu
ECE311 Research Paper
3 May 2000
Computers are a vital part of our modern infrastructure; their reliability, especially in high-performance and otherwise critical applications is necessary. A great deal of research is going on in this field right now, and this paper is meant as an overview of of some of that research.
We want to briefly discuss the history of distributed and parallel computing, and its importance. We'll then look at some of the design constraints, and the architectures that have been developed. We'll also compare the massively parallel processing system (MPP) with the clustered workstation network (NOW), looking at the benefits and drawbacks of each, and on potential areas of improvement. Finally, we'll take a look at some trends which we feel will guide and/or result from continued research into the field.
First, however, we want to define some terms, to make sure that there isn't any confusion. Particularly, we want to clearly distinguish concurrent processing, distributed processing, and parallel processing. Lerman and Rudolph (1993) use a very nice and fairly standard taxonomy, which we adopt.
Concurrent processing is not parallel processing. Concurrent processing refers to the scenario where a single processor handles many separate processes by switching between them and executing them for some quantum of time. To the end user, the brevity of these quanta give the appearance of true parallel processing, even though only one process is being executed at any given time.
Distributed processing and parallel processing are similar in that there are many processing elements working on a problem simultaneously. The cooperation and communication required between these processing elements is what makes distributed computing so much of a challenge.
In a distributed system, however, the processing elements are usually distinct computers or workstations, communicating over a network. In a parallel system, by contrast, the processors are in the same computer, and communicate over a standard architecture datapath.
The idea of parallelism was developed very early in the history of modern computing. John Cocke and Daniel Slotnick, working at IBM, first proposed the use of parallelism in numerical calculations in 1958. This idea, and later ideas by Slotnick for a SIMD (Single Instruction Multiple Data) machine are never turned into a final product, but the basic design became the starting point for much later work.
The first computer with hardware support for timesharing was the Honeywell 800, which was introduced in 1960. This timesharing architecture allowed the user to execute up to eight programs concurrently. Then in 1962, the Atlas computer became the first machine to use virtual memory and paging. Additionally, it has a pipelined instruction execution path, and contains a separate floating point arithmetic unit, which is capable of roughly 200KFLOPS.
In 1964, Slotnick proposed building the first MPP machine for Lawrence Livermore National Laboratory. Eventually, his design is funded by the Air Force, and built here at the University of Illinois, as the ILLIAC-IV.
In addition to the continued development of bigger and better computers, this time period also saw the development of the ARPANET, and other smaller networking protocols which would become essential both to the industry at large and the distributed computing field in general. In 1973, engineers working at Xerox Palo Alto Research Center (PARC) develop the Ethernet LAN protocol, which quickly became a widespread standard.
Further advances included a formalization of the concept of data dependence, discussed by Utpal Banerjee at Illinois, and a proposal by M.H. Van Emden and G.J. De Lucena to use predicate logic as the language for parallel programming.
The continued advances in the field have only made the distributed computing environment more important. In the early days, parallel processing was used almost exclusively for easily predictable numeric calculations, and signal processing. One powerful use was (and is) the solution of simultaneous partial differential equations, which are indispensable in many engineering calculations. A parallel system can calculate the differential on discrete points across the entire space-time continuum, and solve the resultant equations separately at each point. The independent repetition of this process lends itself well to a parallel processing algorithm (Karplus 1989).
More recently, however, studies have shown that only 20% of modern parallel computing time goes to numeric calculations. Other uses include AI development, image processing, database management, and high-speed graphics design. And more and more, parallel systems are being used for general computing work (Lerman and Rudoplh 1993).
Of course, all this discussion raises the question: how do we implement in practice the parallelism that is so straightforward in theory? The short answer is that we use replicated processing elements and connect them to some type of communication network. The speed of any system is going to be determined by two factors. First, it is limited by the physical properties of the components used. To a first approximation, this is a constant. And second, the speed of a system is determined by the architectural design of the collected components.
George Deroschers (1987) describes three essential issues to consider when designing a parallel architecture. These are: the power of the individual processing elements, the topology of the communication network and data path between the processing elements, and the allocation of control to the individual elements.
The individual processing elements can be designed with varying degrees of complexity, which will determine the design of the rest of the system. Some architectures use many single-bit, single-use processors, while some use only a few large, general-purpose processors. The communications topology can be just as varied, with some architectures that closely resemble a traditional von Neumann architecture and some designs that act and function more like a LAN. And of course, the system must be capable of allocating tasks to itself, and synchronizing the interactions. In some cases, this is done by keeping as much control as possible centralized, while in others, it is done by giving the individual processing elements as much freedom as possible.
Two important processor configurations are vector and pipelined parallelism. Vector processors are designed to, as you would guess, operate on vector data. In a vector operation, the operation can be applied to all elements of the vector array at once. A vector processor will generally have a scalar unit to process the standard instructions in the instruction stream. When a vector instruction is needed, the vector unit can load some number of register files simultaneously. The vector unit can then perform its instructions on these register files, and store the results back to memory.
The processing elements of a vector processor are not general-purpose; they rely on the scalar unit to perform non-vector instructions. A vector processor does have a highly interconnected communication topology, since the communication is performed at a very low level, and without contention and the corresponding need for contention resolution. Thus, we can get a large degree of interconnectivity, without the normally associated penalty for bus arbitration. Control of the vector processor is centrally maintained, in order to provide for the necessary synchronous data transfers from the standard execution path to the vector unit.
If a vector processor can be vaguely described as "parallel", then perhaps we can best describe a pipelined processor as "serial". Most people are familiar with the pipelining algorithm, if not from a computer architecture point of view, then certainly from a manufacturing point of view. Instructions are executed in stages, and because these stages are separate and distinct, one instruction can be executed in one stage, while the next instruction is in the stage behind it, and the previous instruction (which may not yet be complete), is executed in the stage ahead. The best analogy is probably to an automobile assembly line. More than one car at a time can be built, and each stage is constrained by the time to complete the longest stage. Thus, we would like to design the pipeline such that each stage takes approximately the same amount of time (and thus we don't waste any clock cycles that we don't need to).
While the best candidates for pipelining are those instructions that consist of a sequence of identifiable and separate steps (like arithmetic instruction fetch and decoding), we don't want to perform pipelining on instructions that can't be divided, or on long strings of instructions that are interdependent. Thus, we want to include a pipeline path in the architecture, and use some other logic to decide when to shift program control to that path. Note that this makes the pipelined processor non-exclusive. We can have a vector processor that contains a pipeline. Indeed, vector pipelines are among the most popular designs for supercomputers (Desrochers 1987).
Modern parallel computers also use architectures designed to support SIMD (Single Instruction and Multiple Data Stream) and MIMD (Multiple Instruction and Multiple Data Stream) operations. SIMD processing is similar to a vector operation, in that in data is operated on in parallel. However, in the case of a vector processor, we were looking at a single processor loading large parallel registers to operate on simultaneously. In a SIMD system, a single instruction stream will control the operation of multiple processors, each of which can process a separate data stream. Because of this tight connection, SIMD machines are often known as array processors, as well (Hwang 1989). Because a SIMD architecture requires a large degree of coherency between processes, SIMD array processors are usually found in special-use machines devoted to signal and image processing (Levialdi 1985).
In MIMD machines, by contrast, each processing element has its own independent instruction queue as well as its own data stream. Because these processors are necessarily capable of executing a general instruction stream, they tend to be used more frequently in general purpose systems (Desrochers 1987). Almost all modern SMP workstations and desktop PCs are MIMD machines.
We can also make an important distinction among MIMD machines. Some use a shared-memory architecture and some use a distributed-memory architecture. As one might guess, a shared-memory architecture results in a much tighter binding of the processors to one another. The communications architecture tends to be highly interconnects, whether the processors contend for the use of a bus, or whether there are direct connections between processors. Obviously, a directly connected architecture is preferred mainly in high-cost supercomputers and other high-end mainframes (Hwang 1989).
In a distributed-memory system, each processor has its own memory, and can thus function as an almost independent CPU. A distributed-memory system is far less interconnected than is a system using a shared-memory architecture. They depend on an efficient interconnect in order to optimize the communication necessary for high performance. Increasingly, the architecture of the distributed-memory system is being designed with communication algorithms in mind (Chan 1989).
Such systems often use a message passing interface (MPIs) to communicate among nodes, since so much can be done independently (Hwang 1989). This reliance on MPIs, in which abstractions and higher-level routines are built on top of lower-level interfaces, helps with standardization and aids communication in a distributed-memory environment (Hebert 1997).
In addition to using different architectures, parallel computers also make use of different processing algorithms, from the old standby master-slave algorithm to (relatively) newer ideas like SPMD (Single Program Multiple Data).
In a master-slave setup, there is one master node, which is responsible for dividing the work, coordinating the different processors, and maintaining communication between the nodes. The slave nodes, on the other hand, simply execute the instructions they are told to execute. They perform the actual calculations, and then report back to the master when they are finished.
SPMD processing, by contrast, describes a programming model in which different processors execute the same program on different sets of data. By running the same program on every node, we reduce the amount of source code needed, as well. Instead of developing algorithms for both the master and the slave, we can simply have one program, which is smart enough to execute and branch based on the desired performance. While such algorithms are slightly more complex, they have the benefit of being more consistent and more robust. This has been shown to reduce code development time, as well as decreasing maintenance and updating costs (Yue and Lilja 1998).
The effective use of a parallel system depends on the efficient allocation of processors to tasks as they arise. The standard allocation strategy for a uniprocessor system has traditionally been timesharing. The processor will switch processes with each quantum, adopting a concurrent processing strategy. However, timesharing is not as effective in a multiprocessor environment as it is in a uniprocessor environment (Yue and Lilja 1998).
Several other strategies have been developed for use in multiprocessing system. Co-scheduling involves allocating all related events at the same time. Of course, this requires knowing (or guessing) the relation between all the tasks. It also can lead to processor fragmentation and frequent context switching. Another strategy is known as space-sharing or static partitioning. In this design, we divide the processors into partitions, and then give each application its own partition. On the downside, static partitioning will often result in low system utilization, since some applications will not use their entire partition. This can be partially solved with dynamic partitioning, in which underutilized partitions can be split apart and have the processors reassigned to another partition which may be overloaded. This strategy does improve system utilization, and reduces the number of context switches that are required, but it demands a precise coordination between partitions. Without such coordination, it is not possible to accurately determine the optimal processor configuration.
Recent studies by Yue and Lilja (1998) have shown that an adapted version of dynamic partitioning can result in much better performance than can the aforementioned allocation strategies. Their idea of loop-level process control is an implementation of dynamic partitioning which only reevaluates the partitioning setup at the beginning of each parallel section. For example, this strategy would look into repartitioning before beginning execution of a parallel loop. This controls the number of processes an application can spawn, and reduces the amount of knowledge the partitioning control must have. Additionally, by checking at those times when it would be most advantageous to repartition, loop-level process control can maintain most of the benefit of the dynamic partitioning algorithm.
And regardless of what type of design is chosen for a parallel system, one of the main concerns is data coherence. We need to keep globally accurate data -- that is, every process and every node need to be operating on the right data. If data is written while someone else is reading it, or immediately before someone else writes it, some results might be missed, corrupted, or totally inaccurate.
Special care must be taken whenever data is stored in multiple locations -- this can be a real problem in a parallel system, and especially so in a distributed-memory environment. By distributing data across multiple processing elements, we increase the speed at which each element can access the data. But we also make it more difficult to assure that the data is up-to-date.
One case where this is especially true is in the use of the on-board processor cache. As processor speeds have increased, memory has increasingly become a bottleneck in the process. We can ameliorate this problem by designing a memory hierarchy with caches (both L1 and L2), main memory, and then virtual memory that can be swapped out to disk.
The on-board cache is much faster than memory, and can operate at speeds very close to that of the processor. However, cache is far more expensive than memory, and it is infeasible to use it as a primary memory system. In order to get the largest performance gain from the on-board cache, we need to write hardware routines. Since the speed of the cache is so near that of the CPU, software routines won't be fast enough to accomplish the job -- we don't have CPU cycles to burn executing instructions which sort of defeats the point (Desrochers 1987).
Cache memory is physically located between the processor and main memory, and interprets memory addresses using an associative memory map. This map forms a correspondence between the requested address location and the contents of the cache. Since the cache will see requests from the processor to main memory, it will check to see if it has a copy of those address contents. If it does, it will kill the request to memory and supply the processor with the information itself. If it does not, the request will continue on to main memory. Once an item is read from main memory, it is also read into the cache (thus, on its next access, we can find it in the cache).
This raises the question: how do you determine what gets removed from the cache when a new item is entered? One frequently used algorithm is known as "Least recently used" (LRU), in which, as you might have guessed, the most recently cached data remains, while data that has not been accessed in a long time is swapped back out to main memory. This idea is based on the fact idea that program references are localized within some range of memory addresses, and it works well unless there are program loops that are larger than the size of the cache.
As we mentioned above, we also need to make sure that the data in the cache is coherent with data from main memory. If a second processor (or multiple processors) writes data to main memory while one processor has a copy of that data in its cache, the cached data is now obsolete. There are two main methods of maintaining consistent data: the dirty bit and the write-through (Derochers 1987).
In the dirty bit method, each cache entry is given an extra bit, which is set whenever data in the cache is written to. Once the data is set to be expunged from the cache, it is written back to memory if the bit is set. The benefit is that the new value is only written to memory once; the downside occurs when multiple processors share main memory, and require timely updates of the data there. Also, if the value in main memory is changed, and that data is also stored in a cache, the cache entry itself needs to be updated.
This problem can be solved by the write-through method for maintaining cache coherency. Here, all writes to the cache are immediately stored to main memory. This results in unnecessary writes to memory, and also, if the writes are consecutive, slows cache access to the same speed as main memory (since the previous write to memory must finish before a new write can be initiated).
Up to this point, we have been talking about parallel computing, and how to best design computer architectures to optimize performance. As computers and workstations have gotten cheaper, and as networks and LANs have become ubiquitous, another line of thinking has developed in the field of parallel computing.
If we can successfully connect multiple computers into a network designed to accomplish some problem, we can avoid the need to develop and customize a large, expensive supercomputer to accomplish the same thing. By using low-cost parts and standard equipment, we could develop easily reconfigurable distributed computing environments, use them to solve specific problems, and then update the design as needed to allow the system to remain useful and productive for a longer period of time.
This philosophy is the driving force in the development of NOW (Network of Workstations) architectures. In a NOW architecture, each workstation, capable of independent operation, functions as a node or processing element in the system. Like a radically extended version of the distributed memory architecture, this devolves most of the program control and data storage to the individual nodes.
While widespread networking capabilities like Ethernet and Tokenring have been around for decades, the move towards NOW architectures is not as old. Recent advances in such key technologies as switch-based networks, like ATM and Myrinet, have given newer NOW architectures the ability to closely integrate remotely distributed processors, memory and disks across a network (Patterson et al. 1996).
By taking advantage of these highly integrated systems (which can be locally and geographically distributed as well), the NOW architecture seeks to "allow communication at the speed of today's MPPs to harness the resources of hundreds of computers towards a single goal but with a seamless interface that makes using those hundreds of machines as simple as using a single workstation today" (Patterson et al. 1996).
The NOW architecture is not simply a large number of workstations connected on a really fast network. It uses that design while improving over MPP architectures in the use of virtual memory and file systems, and even parallel processing. For example, a program running on a NOW may attempt to increase speed by using all the DRAM in the network, and thus avoiding the need to swap memory out to disk. If this is not sufficient, other disks on the system can be used to speed up the I/O transfer process, and finally, the code can be designed to parallelize its computations. Thus, we have several successive "layers" of use, the first of which do not require any user intervention. By contrast, on a traditional MPP, the program needed to be rewritten for parallelization before any benefit could be seen (Anderson et al. 1994).
Additionally, the communication overhead is necessarily more complex than for a directly connected architecture of a MPP, even one using a bus arbitration scheme. In fact, studies show that the overhead needed for packet processing is the primary culprit for poor performance (Rodrigues 2000). Fortunately, network applications are written using an interface which sits on top of the underlying protocols defined by RFCs and the OSI architecture. This means that we can change the underlying implementations without affecting existing applications. This is the goal of current research going on at UC Berkeley: the design of "Fast Sockets", a user-level library which continues to provide the socket API, but improving high-speed communication. Initial tests indicate that the Fast Socket implementation is significantly better than TCP/IP for small packet transfers on local area networks, and at least equivalent to TCP/IP for large transfers. This is achieved in large part to a very small startup and initialization time (57 µs for Fast Socket versus 533 µs for Myrinet TCP/IP) (Rodrigues et al. 1997).
Even in the networked environment, implementations such as cyclic redundancy checks, two-dimensional and three-dimensional parity and checksums are used to detect errors. However, this adds overhead to the process, especially since error correction is so much more difficult than error detection. Thus, the standard procedure is often to detect errors and then request retransmission of the corrupted packet. This slows down communication even more, in addition to added windowing protocols and buffers and other forms of overhead necessary to handle retransmission requests.
Researchers here at Illinois are developing a new fault tolerance architecture for use in NOW environments. "Chameleon", as it is known, is designed to be an "adaptive infrastructure, which allows different levels of dependability requirements to be concurrently supported in a networked environment" (Kalbarczyk et al. 1999). Through the use of ARMORs (Adaptive, Reconfigurable and Mobile Objects for Reliability), Chameleon proposes to develop was is essentially a self-aware network. A manager ARMOR will oversee other ARMORs, and recover from their failures. Daemon ARMORs provide error detection and serve as communication gateways to ARMORs running on the host nodes. Finally, the common ARMORs provide specific, application-required dependability. By using this ARMOR architecture, the system can provide fault tolerance at several different levels. At one level, with applications designed using the ARMOR APIs, the system can recover from data corruption and ensure accurate results. If the node fails, the ARMORs can detect this and migrate the process to another node without failure. Even in the most general case, the system can perform majority voting on black-box programs to ensure successful execution (Kalbarczyk et al. 1999).
Increasingly, the NOW architecture has been successful at improving performance in the field of distributed computing. However, the are still some areas in which the traditional MPP design is better. In order to beat MPP performance in a NOW, we need to ensure efficient messaging, a low-latency, high-throughput network interface, architectural and OS support for synchronization, support for collective communication (multicasting), and an improved paradigm for distributed shared memory (Panda and Ni 1997).
In improving these features of a NOW, it is important to leverage the strengths inherent in the NOW design. By using the existing hardware and software available for PC networks, we can ensure quicker adoption and standardization than if we forced the use of an entirely new infrastructure.
A study by Cabillic and Puaut (1997) suggests that, to operate successfully, machines configured in a NOW require an interconnection architecture that supports multiple architectures, performance specifications, and operating systems. This allows the network to leverage more existing infrastructure and variation without limited the network to OS specific applications.
In order to successfully do this, the interface is going to require an effective data type conversion procedure. Because data conversion requires runtime overhead, it is important to minimize its impact to avoid unacceptable performance degradation. There are two standard procedures for data conversion: receiver makes right, and an architecture independent delivery standard. In a receiver makes right procedure, the message is broadcast across the network in whatever format the transmitting host desires. It is then the job of the receiver to interpret this format and convert it to its native format. In an architecture independent data type, by contrast, the transmitting host converts every message it sends out to an independent standard. The receiver then receives the message in that standard and processes it however it chooses.
Although the architecture independent format requires two data type conversions per message, instead of one, studies (Cabillic and Puaut 1997) indicate that it is still the preferred method for data messaging. By knowing in advance the format that will be received, the receiving host can perform a quicker conversion than it could if it had to parse the received message in order to determine its format.
Another necessity is the ability to conduct seamless load balancing and application reconfiguration. Local users don't want machines bogged down with remote applications -- priority should be given to the local user. But at the same time, spare cycles on all machines should be available to any demanding process that requires them; that's the whole point of the design.
These constraints demand an effective method of changing states and moving processes to different nodes. It necessitates consistent and independent capture technology, that saves both global and local data and the process stack and point of execution. By doing this, the load balancing algorithm can also be used to protect the system from crashes as well as system overload (Gomez et al. 1997).
The design of NOW architectures is one of the most exciting areas in current computer research. This makes it interesting to take a look at present trends, and try to guess where they might lead in the future. The recent explosion of the Internet has made possible the idea, at least in theory, of an interconnected NOW of thousands or even millions of machines. The amount of computing power to be harvested is enormous.
To this point, however, no one has been able to successfully design a WAN environment to operate as a high-performance network. Two possible architectures are meta-computing and remote computing (Weissman 1998). In a meta-computing environment, the underlying system can collect the resources of the WAN, and present to the user the illusion of a single machine. If this algorithm were efficient and accurate enough to allocate the correct resources to each requested task, and were able to do this seamlessly without large overhead, meta-computing could be extremely effective. However, such an algorithm would also require a very large pipe for high-speed communication.
Remote computing is a slightly less revolutionary design. Each task can choose a remote site best suited for its job. This would allow expensive or delicate or specific software to be located remotely and accessible globally. It would reduce the necessity of large software installations at every node, instead relying on transparent communication of data.
While both of these ideas are promising, the current Internet protocols are nowhere near efficient enough to allow feasible WAN computing. The bandwidth penalties on file transfers are large, and will represent an unacceptable small bottleneck in the process. But as communication methods improve, and as protocols are updated to take advantage of newer capabilities, the potential is there to revolutionize computer use.
Anderson, Thomas et al. "A Case for NOW (Networks of Workstations)". 9 December 1994.
Arpaci-Dusseau, Remzi et al. "Architectural costs of streaming I/O: a comparison of workstations, clusters, and SMPs". 1998.
Bruck, Jehoshua et al. "Efficient message passing interface (MPI) for parallel computing on clusters of workstations". Journal of Parallel and Distributed Computing, 40(1): 19-34, 10 January 1997.
Cabillic, Gilbert and Isabelle Puaut. "Stardust: an environment for parallel programming on networks of heterogeneous workstations". Journal of Parallel and Distributed Computing, 40(1): 65-80, 10 January 1997.
Desrochers, George. Principles of Parallel and Multi-processing. New York: McGraw-Hill, 1987.
Hebert, Shane. "Message passing interface (MPI) FAQ". MPI - the message passing interface standard. http://www-unix.mcs.anl.gov/mpi/, 1997.
Hwang, Kai. "Exploiting parallelism in multiprocessors and multicomputers". Parallel processing for super-computers and artificial intelligence. New York: McGraw-Hill, 1989.
Iyer, Ravi. "Measurement-based Analysis of Reliability and Performance". http://www.crhc.uiuc.edu/DEPEND/, 2000.
Kalbarczyk, Z et al. "Chameleon: a software infrastructure for adaptive fault tolerance". IEEE Transactions on Parallel and Distributed Systems, June 1999. (also at http://www.crhc.uiuc.edu/DEPEND/109049.ps.gz)
Karplus, Walter. "Vector processors and multiprocessors". Parallel processing for super-computers and artificial intelligence. New York: McGraw-Hill, 1989.
Lerman, Gil and Larry Rudolph. Parallel evolution of parallel processors. New York: Plenum, 1993.
Multicomputer vision. edited by S. Levialdi. London: Academic P, 1988.
Opportunities and constraints of parallel computing. edited by Jorge Sanz. New York: Springer-Verlag, 1989.
Panda, Dhabaleswar and Lionel Ni. "Special issue on workstation clusters and network-based computing". Journal of Parallel and Distributed Computing, 43(2): 63-64, 15 June 1997.
Patterson, David et al. "NOW progress report". http://now.cs.berkeley.edu/Progress/Now/, 1996.
Polychronopoulos, Constantine. Parallel programming and compilers. Boston: Kluwer, 1988.
Rodrigues, Steven. "High-performance communications". http://now.cs.berkeley.edu/Fastcomm/fastcomm.html, 2000.
Rodrigues, Steven et al. "High-performance local area communication with Fast Sockets". USENIX '97, 1997.
Weissman, Jon B. "Gallop: the benefits of wide-area computing for parallel processing". Journal of Parallel and Distributed Computing, 54(2): 183-205, 1 November 1998.
Wilson, Gregory. "History of the development of parallel computing". http://ei.cs.vt.edu/~history/Parallel.html, 2000.
Yue, Kelvin and David Lilja. "Comparing processor allocation strategies in multiprogrammed shared-memory multiprocessors". Journal of Parallel and Distributed Computing, 49(2): 245-258, 15 March 1998.