Cluster Interconnects


The network fabric that connects the compute nodes in a cluster and carries inter-node message traffic is referred to as the cluster interconnect, or simply the interconnect. The term System Area Network (SAN) is also sometimes used to denote a cluster interconnect, although the scope of SANs is somewhat broader than that of cluster interconnects. Note that an interconnect does not hook up to the user LAN. It is completely under the administrative domain of the cluster administrator. The LAN administrator need not even be aware of the type of interconnect used in a cluster, unless of course the same person plays both roles. User connectivity to the cluster is simply through a LAN connection on the head node and does not require much design consideration.

The choice of the optimal interconnect for a cluster is very important and can be very difficult to make. If the right balance is not struck between the processing power of the nodes and the communication ability of the interconnect, the nodes will waste computing resources while waiting for data. On the other hand, depending on the interconnect you choose, it may account for a significant percentage of the cost of the cluster. Research and practice in high-performance cluster interconnect technology lie between MPP interconnect technologies and generic LAN technologies. A cluster interconnect is much more specialized than a generic LAN, which provides many opportunities for performance optimization as well as the ability to use network equipment without the bells and whistles needed to support a general-purpose network. At the same time, a cluster interconnect is by its very nature more flexible than an MPP interconnect, and thus cannot use, for example, highly optimized routing algorithms that assume a fixed topology.

In this chapter we present an overview of the most popular networking hardware and software packages used as cluster interconnects. We also share practical experiences with many of these technologies. Before discussing particular technologies in detail, we first go over an abstract introduction that is needed to compare and contrast them. The following is not offered as Networking 101; we assume basic networking knowledge and cover only the aspects of networking most relevant to cluster interconnects.

Interconnect Basics

Fig 3.1 shows the path of a packet in a cluster from one compute node to another. We have simplified this model in order to explain the relevant concepts in this section; for example, we have not shown the various layers within the interconnect fabric itself. A process that is part of a parallel application is running on node A and is attempting to send a message to another process running on node B. The sending process initiates the message by making a call to a parallel programming library, e.g. an MPI library. The MPI library hands the message down to the protocol layer on which it is implemented. This protocol layer then passes the message on to the physical network, either directly or through the operating system. On the receiving side the same steps happen in the reverse order, ultimately delivering the packet to the application process. At the physical level the packet passes from the sending host's memory onto its NIC, then onto the network cable and through any switches, until it reaches the NIC of the receiving host and finally the memory of the receiving host.


Figure 3.1: Path of a packet between two compute nodes

When designing the interconnect fabric for a Linux compute cluster, the choice of Network Interface Cards (NICs) and their corresponding switches, if any are needed, depends highly on the application(s) for which the cluster is being built. The key parameters of the network type that affect this choice are latency, bandwidth, CPU overhead, multithreading and, of course, the cost of the network devices and required software packages. In the following sections we define each of these parameters and describe ways of comparing the various choices along them.

Latency

Latency refers to the time it takes for a single packet to leave the source and reach the destination. It is measured from the time the sending process sends the packet to the time the destination process starts receiving it. A popular way of measuring the latency of an interconnect is to have a pair of machines repeatedly send a small message back and forth. If the packet makes N round trips in time T, the one-way latency of the network is T/(2N). If MPI is the programming API being used, this measured time is referred to as the half-round-trip MPI latency. Program 3.1 shows an MPI procedure which, when run on two machines, provides a measure of the latency of the fabric in use.

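#include <mpi.h>

/* Assumed values: the listing does not define these constants. */
#define SMALL_PACKET_SIZE 1
#define SENDING_PROCESS   0
#define RECEIVING_PROCESS 1

double half_round_trip_latency(int N)
{
    char packet[SMALL_PACKET_SIZE] = {0};
    double startTime, endTime;
    int i, rank;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);   /* let both processes start together */
    startTime = MPI_Wtime();       /* wall-clock time in seconds */

    for (i = 0; i < N; i++) {
        if (rank == SENDING_PROCESS) {
            /* send the packet, then wait for it to come back */
            MPI_Send(packet, SMALL_PACKET_SIZE, MPI_CHAR, RECEIVING_PROCESS, i,
                     MPI_COMM_WORLD);
            MPI_Recv(packet, SMALL_PACKET_SIZE, MPI_CHAR, RECEIVING_PROCESS, i,
                     MPI_COMM_WORLD, &status);
        }

        if (rank == RECEIVING_PROCESS) {
            /* receive the packet and echo it straight back */
            MPI_Recv(packet, SMALL_PACKET_SIZE, MPI_CHAR, SENDING_PROCESS, i,
                     MPI_COMM_WORLD, &status);
            MPI_Send(packet, SMALL_PACKET_SIZE, MPI_CHAR, SENDING_PROCESS, i,
                     MPI_COMM_WORLD);
        }
    }

    endTime = MPI_Wtime();

    /* N round trips took (endTime - startTime); one way is T/(2N) */
    if (rank == SENDING_PROCESS)
        return (endTime - startTime) / (2.0 * N);
    else
        return 0;
}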
Program 3.1: Procedure to compute half-round-trip MPI latency
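
The T/(2N) arithmetic can also be checked by hand, without a cluster. The following is a minimal sketch in plain C; the function name one_way_latency_us and the conversion to microseconds are our own illustrative choices, not part of any MPI library.

```c
/* Hypothetical helper, for illustration only (not an MPI call):
 * converts a timed run of N round trips into a one-way latency.
 *
 *   total_seconds : wall-clock time T for all round trips
 *   round_trips   : number of round trips N
 *
 * Returns T / (2N) expressed in microseconds, or -1.0 if N is not
 * positive (no valid measurement). */
double one_way_latency_us(double total_seconds, long round_trips)
{
    if (round_trips <= 0)
        return -1.0;

    /* each round trip crosses the fabric twice, hence the factor 2 */
    return total_seconds / (2.0 * (double)round_trips) * 1e6;
}
```

For instance, if 10000 round trips were timed at 1.0 second in total, one_way_latency_us(1.0, 10000) gives a one-way latency of about 50 microseconds. These figures are made up for the arithmetic and are not a measurement of any particular interconnect.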

 
