DISTRIBUTED COMPUTER SYSTEMS: BACKWARDS TOWARDS THE FUTURE

David J. Farber
Professor of Computer and Information Science
University of Pennsylvania
200 South 33rd Street
Philadelphia PA 19104-6389
EMAIL: Farber@cis.upenn.edu

Invited paper presented at the First International Symposium on Autonomous Decentralized Systems (ISADS), Kawasaki, Japan, April 1993

_________________________________________________________________

Abstract: The advent of gigabit network technology has inspired a rethinking of, among other things, the structure of networked computers, operating systems, and protocol structures. This rethinking has, in the author's opinion, led to the conclusion that a globally distributed computer system represents one of the best applications of this new technology. This paper will examine some of the considerations that led to this conclusion, as well as explore the nature of such a Global Computer.

Introduction: Initially, we will examine the way in which various components of a computer system are affected by their use in networks that offer a potential host-to-host throughput of more than one gigabit per second. A short summary of this examination follows:

EVERYTHING IS BROKEN

We will begin at the bottom of the computer architecture, that is, the processor-to-memory bus system. This system, which in many modern computers also carries the majority of the traffic that comes into memory via the I/O system, provides, in its current form, a bandwidth in the range of one gigabit per second. Thus, data coming in from the outside at one gigabit per second will conflict substantially with the needs of the processor-memory pipeline. A number of solutions to this problem have been suggested. The first and most common solution is to define another bus to serve as the memory-processor communication path and to use the main computer bus only for access to a dual-ported or multi-ported memory for the I/O system. This creates a number of difficult architectural problems and can notably increase the cost of the computer system. Another approach, and the one that the author suggests, is to treat the network as part of the path to a global multi-processor environment. In this view, the processor-memory path is intrinsically the same as the processor-network path, and the problem then reduces to an issue of how to avoid memory-bank conflicts on the bus. However, this is an open and confused area. In the short term, it is probably the limiting obstruction to the widespread use of workstations on high-speed networks.

Moving up the hierarchy to the computer-network interface, one encounters more problems. In order to get information into the computer from an external device, it is necessary to define some form of interface card which maps from the internal structure of the computer system to the format of the network. Currently, cards are being developed which translate buffers of data into ATM format and then interconnect through SONET chips to the gigabit network. Experience at the University of Pennsylvania has shown that, in many workstations, the I/O controllers are too slow to service the bandwidth requirements of such interface cards. This compounds the bus congestion problem and again creates a major bottleneck. In the future, controller chips will help alleviate this problem. For example, the University of Pennsylvania ATM interface board is capable of operating at 650 megabits per second, but is limited in the case of the RS-6000 to 135 megabits per second.
Future versions of the RS-6000 I/O controller will allow speeds up to 600 megabits per second, but still not one gigabit.

Even if information could be brought into a computer's physical memory fast enough, our problems have just begun. Modern operating systems, such as UNIX, have grown in size and complexity to the point that they create a major bottleneck for the utilization of high-speed networks. Even without considering protocol processing, the manipulations performed by the operating system in moving data (from the time it comes in from an external network until it can be delivered to a requesting processor) involve a substantial number of memory-to-memory moves, each consuming considerable computer time. This problem does not have an obvious solution. Rather, it suggests an approach that could be described as a "kinder, gentler" operating system, which goes back to the original simplicity of the early UNIX systems as created at Bell Laboratories. More will be said about this later in the paper.

Moving up one more level in the pipeline of bottlenecks, we run into the protocol system. Modern protocol systems such as TCP/IP have evolved in response to slow-speed communication systems operating in the low one-megabit range. These systems have been characterized by high error rates and modest bandwidth-delay products. In modern networks, one may have hundreds of thousands of bits flowing across the United States in response to a request for information transfer. Protocol systems employ windowing mechanisms to deal with the management of such latency. In the case of gigabit networks, there will be millions of bits flowing in the pipe if we take a similar approach to protocol design (a rough calculation appears at the end of this section). This amount of data would put an inordinate strain on any rational windowing system and, more seriously, make intelligent anticipation strategies extremely difficult. That is, it would be difficult for the system to know what to ask for so that the right information gets to the requesting host at the right time. Even if we could solve, in a simple way, the above problems, the complexity of the resulting protocol systems would, when operating at the speeds anticipated in the future, require most, if not all, of the computing cycles available in modern workstations. In this case, only the fastest supercomputers would have the processing power to service such protocol systems. If one presumes that the purpose of computing is to compute and not to run protocols, then something is fundamentally wrong.

Moving up one final step, we turn to the switching systems that will, in the future, support very high-speed networking. In spite of everything that has been said about the convergence of communications and computing, we still build networks, be they local or national, in a way that isolates them from the computers that are their customers. In a sense, the computers are like telephone subscribers dialing connections through the network, with the network unwilling and unable to help in any of the transactions other than merely establishing connections. This isolation creates additional burdens on the computer systems.
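To make the windowing argument above concrete, consider a rough calculation, assuming, purely for illustration, a one-way coast-to-coast propagation delay of about 30 milliseconds (a representative figure, not a measurement of any particular network):

\[
\text{bits in flight} = \text{bandwidth} \times \text{one-way delay}
                      = 10^{9}\ \text{bits/s} \times 0.03\ \text{s}
                      = 3 \times 10^{7}\ \text{bits}.
\]

That is roughly thirty million bits, nearly four megabytes, already committed to the pipe before the first acknowledgment can possibly return; a conventional windowing protocol must buffer, sequence, and potentially retransmit all of it.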
A richer interaction between computer and network--indeed, a blurring of the distinction between computer and network--is necessary.

BACK TO THE FUTURE

In 1970, an experiment called the DCS project attempted to create, by interconnecting a number of minicomputers, a system that would offer seamless user access to the computer system resources, fault tolerance, and incremental expansion, and, most importantly, would look like a single computer to the using software. The resulting system, which went operational in 1973 utilizing a PDP-11-like system, contained a number of innovative ideas in its structure. These included one of the first micro-kernels, a fully distributed fault-tolerant file system, an inter-process communication structure based on messaging, and a process-migration structure supported, as was the inter-process communication structure, by the communication subsystem. The communication subsystem was the first operational token-ring, which, in almost all its details, preceded and formed the basis for the 802.5 token-ring. It included a mechanism within the communication system that knew about process names, allowing messages to processes to be sent around the ring. In addition, the ring interfaces, which were external to the computing systems, could determine whether the processes addressed by these messages were resident in the attached processor. If a process was resident, a copy of the message was made and an appropriate indicator was put on the message, which continued around the ring. Resource management, process migration, and process binding all took advantage of the name tables that were in the ring interfaces. In an appendix to this paper, we have placed a short description of the DCS system. It is put there not to boast or claim priority, but to indicate that many of the ideas we will discuss later in this paper were around in the early 70's and are still reasonable ideas, as they have been for the past twenty years.

The Global Computer

One of the hopes for the DCS was that by adding mechanisms to the communication system, enough assistance would be provided to the computer software system to allow it to operate in an efficient fashion. Thus, we expected that by putting multiple processors on the ring, performance approaching the sum of the processors would be achieved. This turned out not to be quite true for a number of reasons, the most important of which was that the processors were burdened by the need to repeatedly copy information coming in from the ring interfaces, through their I/O system, into the memory system. Thus, DCS suffered from essentially the same problem we see in gigabit networks. Five years ago, when reflecting on the experience gained from DCS and from a companion system developed in the early 80's at the University of Delaware called SODS (Series One Distributed System), it was realized that if one continued to view local networks, such as the ring, as communications systems that were serviced through the I/O mechanisms of computers, it would be difficult, if not impossible, to achieve greatly improved performance. In response, a project called MEMNET, which is described in a dissertation by Gary Delp, was started. This project took the point of view that the memory system is a more natural vehicle for communication between components of a distributed system than is a message-based system. In the MEMNET system, we used special memory cards, one per processor. These cards were coupled together through a 230 megabit per second insertion ring. When a processor made a request to memory, the MEMNET card behaved like any other memory card in the processor. The MEMNET card contained a large memory cache, which it examined whenever it saw an address on the bus, to see whether the item was in the cache. If it was not, the processor was held and a request was made around the ring to all MEMNET cards to see where the requested memory object was. This object would be brought, through the ring, into the cache of the MEMNET card, and then the processor request would be honored. Since the ring was a broadcast ring, it essentially operated as a snooping cache, preserving ownership and providing proper data consistency. Experiences with MEMNET indicated that this approach is valid and that its performance is scalable up to a significantly larger number of processors.
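The fault path just described can be sketched in software, as below. This is a minimal illustration only; the names, the direct-mapped cache, and the blocking fetch are all hypothetical, and the real MEMNET logic was implemented in the memory cards themselves, not in code like this.

/* Hypothetical software rendering of the MEMNET fault path: satisfy a
 * reference from the card's local cache if possible, otherwise hold the
 * processor while the object is fetched around the ring. */
#include <stdio.h>

#define CACHE_LINES 256

typedef struct {
    long addr;   /* address of the cached memory object */
    int  valid;  /* does this line hold a current copy? */
    int  value;  /* the cached data itself              */
} CacheLine;

static CacheLine cache[CACHE_LINES];

/* Stand-in for a broadcast request around the insertion ring; in the
 * real system every MEMNET card snooped the request and the owning
 * card replied with the object. */
static int fetch_over_ring(long addr) {
    printf("ring request for object at %ld\n", addr);
    return 0;  /* dummy data "returned" by the owning card */
}

/* Behave like an ordinary memory card: hit in the cache, or hold the
 * processor (here: simply block in this call) until the ring delivers. */
int memnet_read(long addr) {
    CacheLine *line = &cache[addr % CACHE_LINES];  /* direct-mapped */
    if (!(line->valid && line->addr == addr)) {
        line->value = fetch_over_ring(addr);  /* processor held here */
        line->addr  = addr;
        line->valid = 1;
    }
    return line->value;
}

int main(void) {
    memnet_read(4096L);  /* miss: request goes around the ring   */
    memnet_read(4096L);  /* hit: served from the card's cache    */
    return 0;
}

Note that the processor itself is oblivious to all of this: from its side of the bus, a miss is indistinguishable from a slow memory cycle, which is exactly what makes the approach attractive.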
The Next Step

When one examines the gigabit world, one can make a number of interesting observations. A gigabit network has bandwidth equivalent to a local processor bus. Thus, one is tempted to talk about a national MEMNET. An immediate reaction to this is, "Yes, but how can we have a processor waiting for a memory request that must travel across the United States to be satisfied?" This situation is equivalent to saying that certain requests to memory might have long latencies before they are satisfied. This sounds familiar. It is the same behavior that a modern virtual memory paged computing system sees in normal operation. When a processor makes a request to memory for something from a page in such a system, often it gets it immediately, but occasionally it gets a page fault and the process is held waiting while the software system finds the page, brings it into memory, and then restarts the requested memory access. The page retrieval time is remarkably close to the time required for a transcontinental request for a missing object. Our modern computer systems have been fine-tuned to operate efficiently in such a paged environment, and, indeed, there are a number of multi-processor virtual memory computer systems on the market.

Another observation one might make is that it is easy to find a missing page in a local environment and considerably more difficult to find that page (or object) when it is scattered among the geographically dispersed components of the Global Computer. Indeed, this is a valid observation. The response to this is to call for the communication system to help us. The communication system must take part in both finding the location of requested objects and, what is more important, in keeping track of the migration of these objects as they move around the Global Computer. These issues are dealt with in the CAPNET proposal of Ivan Tam, in which a set of tables was to be added to each of the switches in a modern gigabit network. These tables are similar, but not identical, to the page tables in our modern computers. When a request for an item comes into the switch, a look-up is done in the page table to determine how to route the request, based on whether the switch has seen the requested page go through it. This is similar in principle to the process tables of the DCS system. Somewhat remarkably, it is also similar to the mechanisms that must be put into switching fabrics to support the future mobile personal communications systems. A person using a personal communicator would be free to migrate anywhere within the United States and could be called by an originating person who knows only his "name." The network would search to see where that communicator was last seen and would leave information within the switching fabrics so that a global search would not have to be done each time an attempt was made to connect to that person. As the person travels, the tables are kept up to date to minimize wasted searching.
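As a rough illustration of last-seen tables of the kind just described, whether the named item is a personal communicator or a memory object, consider the sketch below. All names, structures, and the hashing scheme are hypothetical; nothing here is taken from the actual CAPNET or GOBNET designs.

/* Hypothetical sketch of a CAPNET-style location table in a switch.
 * The switch remembers the output port on which it last saw a given
 * object and routes later requests accordingly; with no entry, it
 * falls back to a broadcast search on all ports. */
#include <stdio.h>

#define TABLE_SIZE 1024
#define NO_ENTRY   (-1)
#define BROADCAST  (-2)

typedef struct {
    long object_id;  /* global name of the memory object      */
    int  last_port;  /* output port where it was last seen    */
} LocationEntry;

static LocationEntry table[TABLE_SIZE];

static int hash(long object_id) {
    return (int)(object_id % TABLE_SIZE);  /* ids assumed nonnegative */
}

/* Called when the switch forwards an object (or a reply carrying it):
 * record the port so later requests can follow the migration.
 * Direct-mapped: a collision simply overwrites the older hint. */
void note_object_seen(long object_id, int port) {
    LocationEntry *e = &table[hash(object_id)];
    e->object_id = object_id;
    e->last_port = port;
}

/* Called when a request for an object arrives: return the port on
 * which to route it, or BROADCAST if the object was never seen here. */
int route_request(long object_id) {
    LocationEntry *e = &table[hash(object_id)];
    if (e->object_id == object_id && e->last_port != NO_ENTRY)
        return e->last_port;
    return BROADCAST;  /* global search, as for a lost communicator */
}

int main(void) {
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i].last_port = NO_ENTRY;

    note_object_seen(42L, 3);           /* object 42 migrated out port 3 */
    printf("%d\n", route_request(42L)); /* -> 3: follow the cached hint  */
    printf("%d\n", route_request(7L));  /* -> -2: broadcast search       */
    return 0;
}

A real switch would also need an eviction policy and a way to recover from stale hints when an object moves again; the point of the sketch is only that the per-switch state is small and closely resembles a page table.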
If one substitutes the words memory object for personal communicator, one has essentially the same structure, except for some "minor" details. This observation makes it feasible to discuss the actual deployment of hardware at the switching nodes that will support a Global Computer as well as personal communicators. Considerations of protection and data consistency suggest that, in practice, the items that are communicated within the Global Computer are not pages, but rather objects that can provide ownership and access-protection mechanisms. This evolution of CAPNET is described in a paper by H. Kim called "GOBNET, A Gigabit Object-oriented Network."

Given the structure outlined above, we now have all the mechanisms in place to build the Global Computer. When one views this system from a software perspective, an interesting thing takes place. The Global Computer, viewed from almost every level of software, is nothing more than a multi-processor, virtual memory (possibly object-oriented) computer system. Protocols are no longer an issue, except for the very simplest and those of the presentation layer. There are other issues that must be dealt with as well. For example, we must define appropriate memory cards that will, among other things, cache items that are moving around the national bus and make requests through the network for needed objects. Also, if, in fact, we are to build a distributed machine of heterogeneous elements, then we must pay attention to the issues of data mapping, etc. This, however, is an issue of heterogeneous systems, not of a particular distribution design.

Have we gotten away free? We have made it sound all too simple. Essentially, we have argued that ideas developed in the past twenty years for managing multi-processor machines and very large data bases are completely mappable onto the Global Computer. If one were a harsh critic of these ideas, one might ask, "Yes, but have you solved the latency problem? That is, every time you make a request outside your local environment, it will take a long time for that object to get to your machine." The answer is, of course, not simple. On one hand, the problems are no different from those one encounters in the design of a paged environment. Unless data structures are properly organized, a paged environment can collapse into a mass of long delays. The normal solution to this problem is to utilize an anticipation mechanism which will pre-page objects that must be used often and/or fix them in real local memory. Some of the twenty years of research has been directed towards gaining a deeper understanding of what these anticipation strategies should be within a local environment. A thesis just completed by Joseph Touch at the University of Pennsylvania addressed these issues as they extend to the Global Computer. He suggests in the thesis that the limiting performance property of a gigabit network will be determined by its ability to anticipate what the needs are for data and to move the proper data elements in advance of requests. Exploration of how, or whether, the switching system can help in this anticipation mechanism is under study at this time.
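To give the flavor of such an anticipation strategy, the sketch below shows the simplest possible one: one-object lookahead, which pre-fetches the next object whenever one is touched. This is a generic, hypothetical illustration, not taken from the Touch thesis or the Mirage model, and real strategies would be far richer.

/* Hypothetical sketch of a one-object-lookahead anticipation strategy:
 * on each access, pre-fetch the next object so a later sequential
 * access finds it already local. Object ids are assumed nonzero. */
#include <stdio.h>

#define CACHE_SIZE 256

static long cache[CACHE_SIZE];  /* object ids currently held locally */

static int cached(long id) {
    return cache[id % CACHE_SIZE] == id;  /* direct-mapped lookup */
}

/* Stand-in for a (slow) transcontinental fetch through the network. */
static void fetch(long id) {
    cache[id % CACHE_SIZE] = id;
    printf("fetched object %ld\n", id);
}

/* Access object `id`; anticipate sequential use by pre-fetching id+1.
 * In a real system the pre-fetch would be issued asynchronously so the
 * processor is not held; here it is done in line for simplicity. */
void access_object(long id) {
    if (!cached(id))
        fetch(id);          /* demand miss: the processor is held   */
    if (!cached(id + 1))
        fetch(id + 1);      /* anticipation: hide the next latency  */
}

int main(void) {
    access_object(100L);  /* misses 100, pre-fetches 101            */
    access_object(101L);  /* hit: latency hidden, pre-fetches 102   */
    return 0;
}

The interesting research question raised above is precisely what replaces the naive "id + 1" guess, and whether the switches, which see the reference traffic go by, can contribute to that guess.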
A final comment

While it may appear to the reader that we have left out much prior work in shared memory systems, let us assure you that we have factored into our efforts the work of those who have preceded us. To quote Richard Hamming, and Galois: "We have stood on the shoulders of those who preceded us." The paper "Memory as a Network Abstraction" cited in the references covers this prior work in detail.

Appendix

Some History

In this appendix, we will briefly give some technical details of the DCS effort that was undertaken at the University of California, Irvine (UCI) in the 1970 to 1976 era. First, the DCS system. This work was sponsored by the National Science Foundation starting in 1970. The following material is taken from "Experience with the Distributed Computer System (DCS)" by David J. Farber and Paul V. Mockapetris, which was issued in 1976 as UCI Technical Report 116.

The Distributed Computer System (DCS) was a distributed timesharing system developed at the University of California, Irvine. The system consisted of multiple minicomputers coupled by a ring communication system. The system was used as a vehicle for research and development of distributed computer systems, to support a class in systems design, and as a production system in a modest, but growing, sense. The design objectives of the system were coherence, high availability through fail-soft behavior, and low cost. Enhanced system availability was achieved through the distribution of hardware, software, and control. The distribution discipline of DCS also made changes in system configuration natural. The failure of a redundant hardware or software network component was isolated; the remaining components continued to function. Mechanisms for the detection and restart of failed components allowed the construction of fault-tolerant services. The paper described the message functions and techniques used in the DCS and pointed out some possible evolutionary trends in message-based distributed systems.

The DCS evolved out of a desire to explore construction of a coherent distributed environment. The original DCS design [David J. Farber, "A Distributed Computer System", UCI TR 4, Sept. 1970] aimed at producing a general-purpose timesharing environment. The system that evolved has the following main features:

1. The system has 3 identical 16-bit minicomputers as its main processing elements. Other machines of different types are essential for full operation of the system, but are integrated into the system in a restricted manner. The DCS system does not completely address the problems of a heterogeneous processor population.

2. The hosts in the DCS system are connected by a unidirectional data ring running at 2.2 megabits per second. The ring interface (RI) supports high-level protocols such as addressing by process name and implicit acknowledgments at the hardware level.

3. The system has a simple process structure that provides a coherent environment. The system interface used by processes has been kept simple. A user application consists of one or more user processes which communicate between themselves and access system resources via messages. Such an amalgamation is called a process net.

4. Interprocess communication is allowed only via the message system. Hence the physical location of processes is never a constraint.

5. The operating system kernel in each machine is small and simple. All machines on the system run identical kernels.
6. Most conventional operating system functions have been moved into autonomous server processes. These servers provide I/O, etc., via message protocols to the rest of the system.

The DCS system was fully operational in the mid-1970's, featured, in addition to the above, process migration, and motivated additional research and exploration in the US and Japan. In addition, the ring system formed the technical basis for the Proteon and IBM Token Rings and later the IEEE standard.

References for DCS

David J. Farber, "A Distributed Computer System", UCI TR 4, Sept. 1970.

David J. Farber and John Pickens, "The Overseer", Proceedings of the International Computer Communications Conference, 197 .

Paul V. Mockapetris, Michael R. Lyle, and David J. Farber, "On the Design of Local Network Interfaces", Information Processing 77, Proceedings of the IFIP 77 Congress.

References for the Global Computer

David J. Farber, "A Tale of Two Major Networking Problems - one Organizational and one Technical", invited article for The Harvard Information Quarterly, Fall 1989.

Gary Delp, David Farber, Ronald Minnich, Jonathan M. Smith, and Ming-Chit Tam, "Memory as a Network Abstraction", IEEE Network, Vol. 5(4), pp. 34-41, July 1991 (also appears in an IEEE CS Press book on distributed computing systems).

David D. Clark, Bruce S. Davie, David J. Farber, Inder S. Gopal, Bharath K. Kadaba, W. David Sincoskie, Jonathan M. Smith, and David L. Tennenhouse, "The AURORA Gigabit Testbed", Computer Networks and ISDN Systems, 1991.

Jonathan M. Smith and David J. Farber, "Traffic Characteristics of a Distributed Memory System", Computer Networks and ISDN Systems, Vol. 22(2), pp. 143-154, September 1991.

D. D. Clark, B. S. Davie, D. J. Farber, I. S. Gopal, B. K. Kadaba, W. D. Sincoskie, J. M. Smith, and D. L. Tennenhouse, "An Overview of the AURORA Gigabit Testbed", in Proceedings, INFOCOM 1992, Florence, Italy, 1992.

C. Brendan S. Traw and Jonathan M. Smith, "Implementation and Performance of an ATM Host Interface for Workstations", in Proceedings, IEEE Workshop on the Architecture and Implementation of High-Performance Communications Subsystems (HPCS '92), Tucson, AZ, February 17-19, 1992.

Jonathan M. Smith, "Protection in Distributed Shared Memories", in Proceedings, 4th International Workshop on Distributed Environments and Networks, Tokyo, Japan, October 28-31, 1991. Invited paper.

Ming-Chit Tam and David J. Farber, "CapNet - An Alternate Approach To Ultra-high Speed Networks", International Communications Conference, April 1990, Atlanta, Georgia.

Ming-Chit Tam, Jonathan Smith, and David J. Farber, "A Taxonomy Comparison of Several Distributed Shared Memory Systems", ACM Operating Systems Review, June 1990.

Joseph D. Touch and David J. Farber, "Mirage: A Model for Ultra High-Speed Protocol Analysis and Design", Proceedings of the IFIP WG 6.1/WG 6. Workshop on Protocols for High-Speed Networks, Zurich, Switzerland, 9-11 May 1989.

Ronald G. Minnich and David J. Farber, "The Mether System: A Distributed Shared Memory for SunOS 4.0", USENIX Summer 1989.

Ronald G. Minnich and David J. Farber, "Reducing Host Load, Network Load, and Latency in a Distributed Shared Memory", Proceedings of the Tenth IEEE Distributed Computing Systems Conference, 1990.