| Article Index |
|---|
| Cluster Interconnects |
| Page 2 |
| Page 3 |
| Page 4 |
| All Pages |
 Bandwidth
Bandwidth refers to the amount of data that can flow through the interconnect fabric in a unit of time. As with latency, bandwidth should also be measured at the parallel API level, since the raw bandwidth of the physical network can be significantly higher than what the parallel application can actually realize.
There are two bandwidth numbers that need to be considered. First is the bandwidth of the network with a point-to-point link between two nodes in a cluster. Although in practice a cluster will rarely run in this configuration, this bandwidth number gives the best case scenario that a single port will support.Point-to-point bandwidth can be measured by a variant of the program 3.1, by sending large messages instead of small messages. Second number is the bisectional bandwidth of the interconnect fabric. Conceptually bisectional bandwidth is the rate of data that can be sent across an imaginary line dividing the cluster into two halves each with equal number of CPUs. Depending on how this imaginary line is drawn different bandwidth numbers can be obtained. The bisection bandwidth of the cluster is defined to be that offered by the worst case scenario. Bisectional bandwidth can get tricky to measure, in fact even difficult to define for complex network topologies.
If all the cluster compute nodes hook up to a single interconnect switch, then the cluster bisectional bandwidth is same as that of the switch. Say a switch has N ports (assuming N to be even). Now assume N/2 nodes in the cluster are sending a continuous stream of data, i.e. a very large packet, to the other N/2 nodes in the cluster. These second set of N/2 nodes are also sending a continuos stream back to the first set (this assumes network to be full duplex). The amount of data that flows through the switch per unit time defines the bisectional bandwidth of the switch. This is the maximum network bandwidth a cluster can expect if the switch in above scenario is at the center of its fabric. A non-blocking switch has the bisectional bandwidth equal to N times the point-to-point bandwidth. Bisectional bandwidth of an oversubscribed switch is less than the corresponding non-blocking switch with same number of ports. So, an oversubscribed switch may slow down a node from sending a message even if the link would allow higher bandwidth. In a cluster with very busy interconnect, with nodes having to send data out simultaneously at various times, use of a non-blocking switch may be essential. In a cluster where it is rare for nodes to be simultaneously pushing data out on the network, an oversubscribed switch may be a cost-effective choice. A critical number to note is the bisectional bandwidth per processor in the cluster. Again assuming all the nodes are connected to a single central switch, a non-blocking switch provides a constant bandwidth per processor as the number of nodes increase in the cluster. An oversubscribed switch, on the other hand, scales the bandwidth per processor upto a point after which the increase in number of nodes decreases the available bandwidth per processor.
Maintaining constant bandwidth per processor gets harder once the number of nodes in a cluster exceed the ports available in a single switch. For example, in fig 3.1 if there are no other connections between the nodes that connect to switch A and those connected to switch B, the wire connecting the two switches may become the bottleneck, limiting the total bisectional bandwidth of the cluster. One obvious way to avoid this is to use switches with higher count of ports. This is sometimes impractical because it requires investing upfront in the capacity which may get used in the future. Furthermore for bigger clusters, switches may not exist to satisfy the port requirements for all the nodes. Another way to offset at least some of this bottleneck is to use switches with faster uplinks. For example if the two switches in fig 3.2 are both 8 port Fast Ethernet switches, and the link between the two is a Gigabit Ethernet connection (see section on Ethernet), this link will likely not be a choke point anymore.
If more than two switches are needed in the interconnect fabric, or if faster uplink is not an option, more complex topologies of inter-switch connections need to be used to reduce the worst case scenarios of multiple hops. The subject of optimal network topology under different conditions has been studied fairly extensively, especially in the MPP domain. A detailed exposition of this material is out of scope of this book. Nevertheless we provide some examples of key topologies used in the technology overview below and in the last part of this book. There are many parallel applications which put higher communication load on a few selected nodes, i.e. their communication load is not evenly balanced across the cluster. If a cluster is to be tuned for such applications, an irregular topology of network links could be use e.g. multiple links can be added to a single node. E.g. in fig 3.3 all nodes are one hop away from Node 1. In this case the job scheduler will need to ensure that processes of applications are appropriately spawned to make use of such optimizations. In general we recommend using such irregular network topologies only in special cases where the cluster being deployed is going to run applications with communication skew for most of its deployment. Regular topologies result in ease of management, the allow use of well understood routing & job scheduling algorithms as well as are easy to administer.
![]() |
CPU overhead
While analyzing how the application is going to perform on a cluster it is important to consider the affect of network activity on CPU performance. If bulk of the network processing is taking place in the host CPU, then this load will compete with the load of instructions within the application itself. Many intelligent NICs try to handle various pieces of network processing within themselves, thus freeing the CPU to do application work. For a very busy interconnect this could be a critical criteria, either the NICs would need to be intelligent, or the CPU power would be required to be higher to accommodate the network load. A low CPU overhead and careful overlaping of computation with communication, allows the application to do useful work in the CPU while processing is being done on the processor embedded in the NIC. Myrinet NIC is an example of such intelligent NICs. Myrinet NIC has a processor embedded on the card, called the Lanai processor. The Lanai processor reduces the burden of the host CPU by taking over many of the network flow activities.
Another significant optimization to reduce the CPU overhead is the use of light-weight protocols. These protocols are designed to specifically ship relatively small messages across a controlled network topology like that of a cluster. Traditional protocols such as TCP/IP are designed to run on very general purpose networks and hence must take care of many exceptions and errors which are not relevant to compute clusters. The protocols used in cluster interconnects can be much more optimistic and can make various assumptions as compared to general purpose protocols. Implementation of such protocols generally requires some level of intelligence from the underlying NIC. This could be either in the form of an embedded processor, like the Lanai processor in a Myrinet NIC, or it could be in form of special firmware in the NIC. In another chapter we go into details of some popular light-weight protocols.
Multithreading
Multithreading refers to the ability of the network to transfer multiple messages concurrently as well as the ability of a node to use the fabric from different contexts. In a compute cluster using SMP nodes, i.e. nodes containing more than one processor, jobs may get scheduled in a way that independent applications on the same node want to use the interconnect for their message exchange. Certain NICs (and their corresponding drivers) are able to talk to multiple high level processes for message exchange, whereas some NICs are tied to a process until that process has completed its communication and relinquished the NIC.
Cost
Various factors should be considered while calculating the cost of an interconnect. The base cost of the interconnect includes the cost of NICs, switches, cables and required software libraries. There are secondary cost considerations as well. Particular choice of NIC may require higher speed host processors thus adding to the cost. A non-standard network also requires training cost, which may not required if a more commonly used network such as Ethernet is used. Availability of good driver and API implementations on a specific vendor's network options may reduce the in-house development and tuning costs.





