IEN 120 INTERNET ROUTING AND THE NETWORK PARTITION PROBLEM IEN #120 PRTN #279 Radia Perlman Bolt Beranek and Newman, Inc. October, 1979 I. INTRODUCTION As described in IEN 110, "Internet Addressing and Naming in a Tactical Environment", a network can become partitioned into two or more pieces. Assuming some of these pieces are still connected to the catenet, we would like the catenet to be able to efficiently deliver packets to a host in any such piece. Such a capability in the catenet could additionally be utilized by a scheme for delivering intranet traffic across partitions in a partitioned network. There are four parts to the solution: 1) detecting that a network is partitioned 2) deriving a name for each partition 3) figuring out which partition a host is in 4) routing packets to the correct partition The currently implemented gateway routing algorithm is based on the original ARPANET algorithm. To efficiently provide for routing to network partitions, routing must be based on a link state routing scheme. I will demonstrate this after first presenting the design (parts II-VIII), then showing what would be involved in modifying the original ARPANET algorithm for that purpose (part IX), and then comparing the two approaches (part X). II. TERMINOLOGY 1) neighbor gateways--two gateways attached to the same network 2) functioning neighbor gateways--neighbor gateways able to communicate with each other over their common network 3) attached network--a network physically attached to a gateway, and with which the gateway can communicate directly (not through another gateway) 4) neighbor network of gateway G--an attached network of a functioning neighbor gateway of G, excluding attached networks of G. - 1 - III. TABLES TO BE MAINTAINED BY EACH GATEWAY 1) a list of attached networks--This list is relatively constant and is updated by a gateway when it notices a network interface is down or for some other reason the gateway is incapable of communicating with an attached network. Keeping this table updated is solely the responsibility of each gateway, and does not require intergateway communication. 2) a table of all gateways and their attached networks--This table is maintained by intergateway communication -- gateways give copies of their table 1 to all other gateways. The table of all gateways never shrinks (a down gateway is assumed to exist but be unreachable). 3) a table of link states to neighbor gateways--This table in gateway G specifies, for each neighbor gateway G1, over which common networks G and G1 can communicate. This table is updated by G periodically bouncing packets off each neighbor gateway from which it has not recently received traffic. Note that I refer to two gateways as neighbor gateways even if they cannot (temporarily, hopefully) communicate with each other. 4) a list of neighbor networks--This list is derived from the table of link states to neighbor gateways and the list of gateways with attached networks (tables 3 and 2). 5) total link state--This is a table of all gateways and the state of their links to their neighbor gateways. This table is compiled from intergateway communication. When a gateway notices that its table of attached networks, or its table of link states to neighbor gateways (tables 2 and 3) changes, that gateway efficiently broadcasts this information to all other gateways in the catenet. To minimize numbers of reports when a link is flaky, a link on an attached network must be up continuously for some amount of time before its state is considered to change from down to up and trigger a link state report. 6) shortest distance matrix--This is a data structure from which routing decisions can be made directly. It is computed from the other tables. It is described more fully in part IV. IV. ROUTING COMPUTATION A gateway, using the tables described above, constructs a connectivity matrix whose rows and columns represent networks, and whose entries are 1 if any gateways claim to be attached to both networks, and infinity otherwise. Then the gateway *'s that matrix to construct a shortest distance matrix. (The operation "*" consists of "multiplying" a matrix by itself, using the operations min and plus instead of plus and times, until the result stabilizes. This is a well-known algorithm.) The gateway - 2 - then looks in the shortest distance matrix for the neighbor network (or set of such) closest to the destination network, and chooses a functioning neighbor gateway (or set of such) attached to that neighbor network, to forward packets to for that destination network. When a link state report changes the state of an entry in the connectivity matrix (remember, all gateways connecting two networks have to go down before a 1 changes to infinity), a gateway must recompute the distance matrix. This design is a slight modification of the design presented in "Gateway Routing", by Radia Perlman (PRTN #242, PSPWN #99). The modification is that the indices of the matrix are networks, not gateways. The purpose of this modification is to make the size of the matrix smaller, an important modification given that in the catenet there are many more gateways than networks. There are aspects to the scheme that are irrelevant to a discussion of how to solve the network partition problem, such as sequence numbers for link state reports, etc. The purpose of this paper is to direct a correct approach to the design, and not to present an implementation specification. Thus an implementer should read PRTN 242 to discover the details of a link state algorithm that were not relevant for presentation here. Note that an alternative to *'ing the matrix is to use the scheme that the ARPANET has switched over to, which is a link state scheme in which a shortest path routing tree is constructed from the connectivity information. The new ARPANET scheme is less costly to maintain as links change state. Its disadvantages are that it precludes load splitting, probably a very important problem in the case of the catenet, and is probably a little harder to implement. Since links will not change state very often, the author favors the overhead of the matrix *'ing scheme over the disadvantages of the ARPANET scheme. However, this decision is separable from the rest of the design and can be decided either way at a later time. V. DETECTING THAT A NETWORK HAS PARTITIONED Now we look at the problem of network partitions. In the design presented so far there is enough information for any gateway to detect a partitioned network and to isolate groups of gateways on each partition: A gateway G knows that network N is partitioned if there are two sets of gateways, set Q and set R, such that all gateways in both sets report they are attached to network N, but there are no two-way links between a member of set Q and a member of set R via network N. This information is derived independently by each gateway from the table of all gateways and their attached networks, and from the table of total link state (tables 2 and 5). - 3 - VI. DERIVING A NAME FOR EACH PARTITION It is necessary to expand the internet header to allow a field for identifying a network partition. The reason for this is to avoid the necessity for every gateway on a packet's route to discover to which partition the packet should be sent. The partition name must give sufficient information so that every gateway can make the proper routing decisions to send a packet to that partition, based on its tables of total link state and gateways/attached nets (tables 5 and 2). The following schemes for naming a partition are all done independently by all gateways, as opposed to having some central authority choose a name and inform all gateways, or having a group of gateways decide on a name "by committee". One method of identifying a partition is to use the name of any member gateway of the partition. It will not matter if two gateways choose different names for the same partition. Since the sets of gateways involved in the network partitions are disjoint, any member of the set identifies the set. Another method is to list (either by an explicit list or a bit table) the set of gateways that make up that partition. This is unnecessarily descriptive, since the list of gateways is derivable from a single member of the set. And it is a less robust scheme, because any change to the partition (a gateway going down, coming up, or the net partitioning into more pieces) can confuse a gateway trying to route to that set of gateways. In the first method, if the partition changes, the packet will be routed unambiguously to whatever partition the named gateway is in. Of course, if the named gateway goes down, the packet becomes undeliverable, but that is easier to deal with than trying to deliver a packet to a set of gateways that overlaps two partitions. A third method is for each gateway to number partitions from 1 to the number of partitions, ordered by, say, the highest numbered gateway in each partition. This method uses fewer bits in the packet header but is a much less robust scheme. With gateways having slightly differing information, partition names have different meanings. Also, partitions can switch names suddenly. For instance, a net can be partitioned into 2 pieces, numbered 1 and 2, and, assuming the highest numbered gateway was down, and comes up in partition 2, partitions 1 and 2 now switch identities. Thus the recommended method of identifying a partition is the first method. - 4 - VII. FIGURING OUT WHICH PARTITION A HOST IS IN Now we will examine several schemes for having the correct partition identified in a packet. It is the responsibility of either the source host or first gateway to do this. By examining the alternative schemes we can also determine whose responsibility it should be. a) Source host determines correct partition by trial and error -- The source host does not know about the structure of the catenet and does not know that the destination net is partitioned. When it sends a packet to that net with no partition name filled in, the first gateway to receive the packet sends back a message that that network is partitioned, and lists the partition names. Assuming there are k partitions, the source host sends k packets requiring ACKs to the destination, each packet addressed to a different partition. The packet that receives an ACK is the one addressed to the correct partition. If a gateway receives a packet with an incorrectly filled in partition name field, that gateway will send back the same kind of notification as for a packet with a blank field -- it will notify the host that the net is partitioned and list the partition names, or if the net is no longer partitioned, give that information. If the source host is sending packets that require acknowledgments, it will notice quickly if its packets stop getting successfully delivered to the destination. Then it can redetermine the host's partition. b) The first gateway, using trial and error -- If it is the first gateway that has the responsibility, it can do the same thing as the source host in scheme a, sending packets to the destination addressed to each partition to discover from which partition it receives an ACK. Since a network is unlikely to be partitioned into very many pieces, it is not costly to try all partitions. Either the correct partition will be found or no ACK will return (in which case presumably the host is down or the network is partitioned in such a way that some hosts are unreachable from all gateways). The disadvantage of having the first gateway do the work in this scheme is that a gateway does not know whether packets it is forwarding successfully reach their destination. Thus it must either keep a cache of host/partition correspondence, which can be out of date for some amount of time during which the gateway will misaddress packets to a destination, or the gateway must rediscover the correct partition on a packet by packet basis, which is of course unacceptably expensive. Also, assuming it is common for a source host to split its traffic among several gateways on the source net, after a gateway discovers the correct partition for a destination host it should inform all other gateways on the source net of the correct partition, to prevent the necessity of them rediscovering that fact. - 5 - c) gateways on a partitioned net could keep track of host/partition correspondence for their net -- Another method is for gateways on a partitioned net to find out which hosts they can reach, and exchange that information with the other gateways on that partitioned network. Then a gateway could respond more intelligently to a packet addressed to the incorrect partition by sending back a message giving the correct partition (to the packet source if that is who fills in the partition field in the packet header, or to all gateways on the source net otherwise). In addition, a gateway on the partitioned network can forward the misaddressed packet to the correct partition. This method requires gateways on the partitioned network either to keep a complete list of the hosts on the net, marked as to partition, or to keep a cache of hosts, adding hosts to the cache by querying the gateways on other partitions at the time the necessity of locating that host arises. In the complete list case, gateways on a partitioned net would periodically send packets requiring ACKs to all hosts on that net in order to keep their lists up-to-date. In the cache case, gateways would poke a host only when the need to know its location arose (when the gateway received a packet for that host, and the host was not already in its cache, or when a query from a gateway on a different partition of the net arrived, asking for that host's location). This method suffers from the same problem as method b, with the first gateway having responsibility for determining host/partition correspondence -- the tables in the gateways on the partitioned net can become out of date, during which time they will misdirect traffic, and they cannot constantly be checking their tables. Thus I recommend method a, having the source host fill in the partition field using the trial and error method of discovering host/partition correspondence. VIII. ROUTING PACKETS TO THE CORRECT PARTITION As stated above, a gateway G, distant from partitioned network N, must know which gateways are involved in a partition before G can correctly route a packet -- it might have to make a different routing decision for one partition than for another one. When G detects a network has become partitioned into n pieces, G must add n-1 rows and columns to its shortest distance matrix, i.e., it treats each partition as a separate network. It is an implementation detail, and not a difficult one, to ensure that the gateway understands the meaning of each row and column. And given that the gateway understands the meaning of each row and column, it is easy for it to fill in the connectivity matrix from its table of total link state. The computation is done exactly as in the nonpartitioned case. - 6 - IX. MODIFYING THE ORIGINAL ARPANET ROUTING FOR PARTITIONS The original ARPANET routing is the currently implemented routing algorithm in the gateways. The basic design is that gateways report their distance vector to all their neighbor gateways (their distance vector gives their distance to all destination nets). They derive their distance vector from their neighbors' distance vectors. (A gateway's distance to a destination net is 0 if the gateway is directly attached to the destination net. Otherwise, it is 1 hop further than the neighbor closest to the destination.) The major modifications that are necessary to handle partitioning are: 1) Currently distance vectors are just a list of numbers, and gateways have an assembled-in offset/net number correspondence. Thus the vectors do not need labels for each entry. If networks became partitioned, more destinations would need to be reported in the distance vector. Either some (very complicated) negotiation process would need to be carried out so that all gateways would agree, when nets became more or less partitioned, on a new offset/net number correspondence, or the distance vectors would need labels identifying the destination whose distance is being reported. The problems associated with a negotiation process make that scheme unworkable. Thus we can assume the vectors would be expanded to have an identifying label for each destination. The label would include net number and partition name. 2) Gateways do not have global knowledge of the structure of the catenet, in contrast to a link state scheme. Thus it is the responsibility of the gateways on a partitioned network to notice that the net has become partitioned and start a routing update. In the current implementation, there is no way for gateways on a partitioned net to tell the difference between having their net partitioned and having several gateways on their net go down, since they do not receive information about individual gateways -- they only receive distance vectors from their neighbors. They will no longer receive distance vectors from their neighbors on a partitioned net, or from neighbors who have gone down, so lack of response from neighbors does not distinguish between dead neighbors and a partitioned network. Thus either distance vectors would have to contain information about all catenet gateways (which adds a great deal of overhead since there are many more gateways than nets, and the only purpose of doing that is to detect partitions) or gateways on a network would report that the network has become partitioned every time a gateway goes down. - 7 - 3) Gateways in a partition must agree on a partition name, since if two of them started a routing update with two different names for the same partition, the rest of the catenet can draw no conclusion except that the two partition names refer to distinct destination partitions. Agreeing on a name is not that easy. If some simple algorithm is chosen, such as highest numbered gateway in that partition, the name of a partition can change. Suppose the old partition name was 5 and it changes to 12. A source host (or distant gateway) has gone through the overhead of determining that the proper partition for a destination host was 5. When the name of the partition changes, this overhead must be repeated. Also, when the name of a partition changes, the rest of the gateways on the catenet must be informed of that fact so that they will stop reporting about obsolete partition names in their distance vectors. X. COMPARISON OF LINK STATE AND ORIGINAL ARPANET SCHEMES The link state scheme is far more robust. Because gateways have global knowledge, routing is more likely to proceed calmly while routing updates are percolating throughout the catenet. Partition names are not as important in the link state scheme -- gateways do not have to agree on a single name for a partition. As stated above, because in the currently implemented scheme gateways report only their distances to destination networks, and not to individual gateways, either gateways would report network partitions whenever gateways went down, or the distance vectors would have to be expanded to include reports about all gateways. This is a further disadvantage of the original ARPANET scheme to this application. Another disadvantage of the original ARPANET routing, not related to partitioning, is that, because nodes do not have global knowledge of network connectivity, there are types of routing loops which they cannot distinguish from degradation of best routes due to connectivity changes. As currently implemented in the catenet, nodes report their distance to a destination as "infinity" (a number higher than the maximum possible distance in the catenet) when reporting to downstream neighbors. This fixes many kinds of routing loops. However, neither this scheme nor any variant (such as hold-down, the scheme chosen by the ARPANET as a modification of the original algorithm) can distinguish all kinds of routing loops from connectivity changes. Thus there are cases when a group of nodes will have to count up their distance to a destination until it reaches "infinity" before discovering the destination is unreachable. This does not make the scheme unworkable for the current catenet, since the longest possible path in the catenet is less than 10 hops. However, it is again a further disadvantage of the original ARPANET scheme. - 8 - Another important consideration is the link state scheme's flexibility. There are new features that the catenet is scheduled to provide, most notably extended routing, in which the functional differences between links are recognized and accounted for. As described in IEN #86 "Extended Internet Routing", by Radia Perlman, a link state scheme must be adopted eventually in order for the catenet to provide this service. Thus the link state approach should be adopted to provide for network partitioning. XI. CONCLUSIONS A link state scheme, as originally presented in PRTN 242, modified as presented in part IV of this paper should be the basis of internet routing. The internet header should include a field long enough for a gateway ID, for the purpose of specifying a partition name. A partition name is the ID of any member gateway on that partition. The first gateway that handles a packet checks to see if it is addressed to a partitioned network. If so, and if the partition name field in the internet header is blank, the gateway sends back a special packet to the source host informing it that the network is partitioned and giving it a name for each partition of that network. When a gateway on the source net handles a packet for an unpartitioned network in which the partition name field is not blank, it erases that field and informs the source that that network is no longer partitioned. When a source host receives notice that a network is partitioned, it stores the partition names for that network, and when it wishes to send a packet to a host on that net, it first tries all partitions to determine the correct one. It keeps a cache of host/partition correspondence. When packets for a host in its cache no longer reach the destination, the source host should again attempt to determine the correct partition for that host. - 9 - APPENDIX COMBINING USUALLY SEPARATE NETWORKS In IEN 110, Dr. Vinton Cerf raises the possibility of combining nets, given that the catenet could handle a partitioned network. In general if the networks in question are usually partitioned, this is a bad idea, since there is overhead involved in having a partitioned network. Every time a source wishes to send a packet to a destination, someone must discover which partition to send the packet to. However, the specific example discussed in IEN 110 is an example where there is also a cost associated with not combining networks. In the example there are two ground PR nets, A and B. There are also a number of PRs on airplanes, call them P1, P2, ... Pn. When Pi is within range of a PR in net A, Pi automatically becomes a part of network A. When Pi is within range of both PR nets, the nets become a single PR net. Keeping the two nets separate leads to problems of addressing the airplane PRs, since the net on which they reside changes. Combining the two nets into a single network has the overhead of introducing a usually partitioned network into the catenet. There is a third solution to the particular case involved here. That is to keep networks A and B as separate logical networks, and to have P1, P2, ... Pn also as separate logical networks on the internet level. On the packet radio level there might be only one net, because one of the Pi connects nets A and B. But on the internet level there will be n+2 nets. A gateway on net A, called G1, will have a half gateway associated with each of the nets it might be "directly connected to" in the internet sense. In other words, it will have a half gateway for A, P1, ... Pn. The half gateway associated with network A determines whether its interface to net A is up or down depending on the state of the hardware ready line, etc., as is now done. The half gateway associated with "network" Pi must determine whether it is "connected" to its "network" by some other means. One method is to have a special querying packet containing the number i. The packet would be addressed, with a local header only, to Pi, and sent out the interface to network A. Pi's responsibility, upon seeing this querying packet, is to send back a special answering packet, also containing the number i. The half gateway associated with network A, upon receiving one of these special answering packets, uses the number contained in the packet to dispatch the packet to the half gateway associated with Pi. The half gateway associated with Pi, upon receiving this special answering packet, knows that its "network" is up. G1's list of neighbor gateways will include, besides all the gateways on net A, all the gateways on net B, since a gateway on - 10 - net B also has the Pi as potential attached networks. If some Pi connects nets A and B, then the gateways on A and B will all consider each other functional neighbors, and A, B, and the connected Pi, which have formed themselves into a single functional PR net, will function as a single net on the internet level, too. If one of the Pi is not within reach of either net A or net B, then all the gateways on nets A and B will report that they are not attached to net Pi, and all the gateways in the catenet will know Pi is unreachable. If A and B have not merged into one net (none of the Pi are in both nets), then the gateways on each will report which Pi are reachable from them, so the catenet will automatically route packets for Pi to the correct ground PR net. [It would be reasonable to include, in gateway G1, a half gateway for net B also, since if nets A and B merged, G1 would be connected to net B. However, it is not necessary to and is slightly more efficient not to, since even if nets A and B are merged, PRs in B are probably physically closer to the gateways on net B, so the catenet should route packets for PRs in B to the gateways that "really" are on ground net B. The advantage of including a half gateway for B in G1 is that net B could potentially partition in such a way that some partition included no gateways from B, but was reachable in the catenet via net A and some Pi. It is not obvious, however, what algorithm a half gateway for B should use to determine whether its "network" is up.] The airplane PR Pi does not think of itself as a network. From its point of view it is an ordinary PR. The only difference between Pi and an ordinary PR on net A is that Pi (or the TIU attached to Pi, if we want to strictly adhere to packet radio terminology) has stored as its internet address, Pi for its net number. It also has a list of possible gateways to use for internet packets. This list includes all the gateways on nets A and B. In the current PR net there is only one gateway, and all PRs know the ID of the gateway. This will change such that there will either be a special ID for an information service that will give out the ID of a gateway on the net (so that Pi, instead of keeping a list of gateways, could ask for a gateway address, as would the rest of the PRs on nets A and B) or all PRs will have assembled in a list of gateways, and they will need to probe each in turn until they find one that responds. Thus the only difference in Pi's finding a gateway and in an ordinary PR on net A finding a gateway, is that (assuming the assembled-in gateway list scheme is used) Pi's list will be longer, since it will also include the gateways on net B. There is obviously a cost associated with this solution, too. If the number of Pi are small, then this is a reasonable solution. If there are enough Pi, then the cost of having all those logical nets becomes greater than the cost of having an often partitioned network, so the solution of combining A, B, and all the Pi into one logical net in the catenet is a more practical solution. - 11 -