It is important to measure the latency of a fabric at the API level. Even if the underlying physical media can exchange messages very quickly, the protocol implementation on top of the media can make a significant difference to the effective latency seen by the application. For example, if the parallel programming library is implemented on top of a general-purpose protocol such as TCP/IP, each packet takes significant time to pass through the various layers of the protocol before it reaches the physical media. In MPI implementations over TCP/IP, for instance, data is first copied from the NIC to kernel space and then from kernel space to user space. Although Linux has an excellent TCP/IP implementation, many MPI application environments call for a light-weight protocol. In sections ? and ? we describe some of these light-weight protocols and their implementations in detail.

It is also important to account for the effect of switches on the latency between two machines in the cluster. If messages have to pass through multiple switches before reaching their destination node (fig 3.2), their latency could be significantly higher than between nodes connected point-to-point. Such configurations also produce different latencies between different pairs of nodes. For example, in fig 3.1 nodes 1 and 2 would likely see lower latency between them than nodes 1 and 6, especially if the intermediate switches introduce significant latency. The implementation and scheduling of many parallel programs can benefit from knowledge of such non-uniformities, especially since many parallel algorithms involve a significant amount of communication between neighbours.

How a message is viewed in the interconnect fabric, and the mechanisms used to route it to its destination, make a significant difference to the latency seen by parallel applications in the cluster.
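The API-level measurement described above is commonly done with a ping-pong test: one side sends a small message, the other echoes it back, and half of the averaged round-trip time is taken as the one-way latency. The following is only a minimal sketch under stated assumptions: it uses plain TCP sockets over the loopback interface rather than MPI, so it exercises the protocol stack but not a real fabric, and the function names (`recv_exact`, `echo_server`, `measure_one_way_latency`) and default parameters are hypothetical choices made for illustration.

```python
import socket
import threading
import time


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes; a single recv() may return fewer."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def echo_server(listener: socket.socket, rounds: int, size: int) -> None:
    """Accept one connection and echo back `rounds` messages of `size` bytes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            conn.sendall(recv_exact(conn, size))


def measure_one_way_latency(rounds: int = 1000, size: int = 8) -> float:
    """Return the estimated one-way latency in seconds (half the mean round trip)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(
        target=echo_server, args=(listener, rounds, size), daemon=True
    )
    server.start()

    payload = b"x" * size
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(rounds):
            client.sendall(payload)
            recv_exact(client, size)
        elapsed = time.perf_counter() - start

    server.join()
    listener.close()
    return elapsed / (2 * rounds)


if __name__ == "__main__":
    latency = measure_one_way_latency()
    print(f"estimated one-way latency: {latency * 1e6:.1f} microseconds")
```

Run between two cluster nodes instead of over loopback (by binding the server on one node and connecting from the other), the same pattern would also include the switch latency between that particular pair of nodes, which is one way to observe the non-uniformities mentioned above.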
In the following section we briefly summarize the routing methodologies used by various cluster interconnects.

Figure 3.2: Cluster with two switches connected via an uplink

Message routing through the switches

There are two distinct ways in which interconnect fabrics transmit a message. Packet switching (or store-and-forward switching) transmits a message as a packet (or a sequence of packets), with each packet residing entirely at either a cluster node or an intermediate switch at any given time. (In descriptions of network topologies, the switches and hubs in the network are sometimes referred to as nodes. We avoid that usage here to prevent confusion with cluster nodes.) Wormhole switching (or cut-through switching), on the other hand, transmits a message as a worm, i.e. a continuous stream of bits that makes its way through the fabric, potentially spanning multiple cluster nodes and switches concurrently.

Using wormhole-based switches can significantly improve the latency characteristics of a cluster. Switches based on the store-and-forward methodology wait for the whole packet to arrive before making the routing decision for the packet's next stop. Wormhole switches do not have the overhead of storing whole packets before forwarding them to the next node or switch, and thus considerably reduce the latency overhead of passing through a switch. (There is some terminology ambiguity here as well: a technique in which the routing decision for a packet is made before the whole packet arrives, but the packet is forwarded only if it is known that all of it can be buffered at the next destination, is called virtual cut-through.)

Figure 3.3: Packet switching: A switch stores a packet in its queue until it can forward the packet to the next switch or node

Figure 3.4: Wormhole switching: A worm cuts through various switches to reach its destination

The packet switching technology (using the store-and-forward mechanism) found in many LAN/WAN switches can significantly degrade latency.
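The difference between the two switching methods can be illustrated with a simplified, contention-free latency model. This is a sketch under assumptions that are not from the text: all links have the same bandwidth, propagation delay is ignored, a cut-through switch needs only a fixed-size header before it can forward, and the function and parameter names are hypothetical. In store-and-forward switching the full packet is serialized once per link, so the transmission time is multiplied by the number of links; in wormhole switching the body of the message is pipelined behind the header, so each switch adds only the header time plus its routing delay.

```python
def store_and_forward_latency(message_bits: float, bandwidth_bps: float,
                              switches: int, switch_delay_s: float = 0.0) -> float:
    """Latency when every switch receives the whole packet before forwarding it."""
    links = switches + 1  # source -> switch(es) -> destination
    transmission = message_bits / bandwidth_bps
    return links * transmission + switches * switch_delay_s


def cut_through_latency(message_bits: float, header_bits: float,
                        bandwidth_bps: float, switches: int,
                        switch_delay_s: float = 0.0) -> float:
    """Latency when each switch forwards as soon as the header has arrived."""
    transmission = message_bits / bandwidth_bps
    per_switch = header_bits / bandwidth_bps + switch_delay_s
    return transmission + switches * per_switch


if __name__ == "__main__":
    # Illustrative numbers only: a 1500-byte message on 1 Gbit/s links
    # crossing two switches, with an assumed 8-byte routing header.
    bits, header, bandwidth, hops = 1500 * 8, 8 * 8, 1e9, 2
    sf = store_and_forward_latency(bits, bandwidth, hops)
    ct = cut_through_latency(bits, header, bandwidth, hops)
    print(f"store-and-forward: {sf * 1e6:.3f} microseconds")
    print(f"cut-through:       {ct * 1e6:.3f} microseconds")
```

In this model the store-and-forward penalty grows with both message size and the number of switches, whereas the cut-through penalty grows only with the number of switches. Real fabrics add contention and buffering effects that the model leaves out.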