|
Introduction to Clusters

"Cluster" is an ambiguous term in the computer industry. Depending on the vendor and the specific context, a cluster may refer to a wide variety of environments. In general, a cluster refers to a set of computer systems connected together. For the purposes of this book, a cluster is a set of computers that are connected to each other, and are physically located close to each other, in order to solve problems more efficiently. These types of clusters are also referred to as High Performance Computing (HPC) clusters, or simply compute clusters.

Another popular usage of the term cluster is to describe High Availability environments. In such an environment, a computer system acts as the backup system to one or more primary systems. When a primary system fails, the critical applications running on that system are failed over to its designated backup system. The detailed usage of, and technology behind, these types of clusters is outside the scope of this book. Nevertheless, we will touch upon specific uses of high availability technology within the context of compute clusters.

Clusters are becoming increasingly popular as computational resources in both research and commercial organizations. These organizations are favoring clusters over single large servers for a wide variety of problems. (We will refer to the set of processing elements within the boundaries of a single instantiation of an operating system as a single server or a shared-memory system.) The increasing popularity of the Linux operating system is adding fuel to this fire. Linux provides a very cost-effective and open environment in which to build a cluster. Since its inception in the early nineties, Linux has been very popular with researchers. The open-source code of Linux allows researchers to customize the system to best suit their specific problems, and to do advanced computing research without being tied to any specific system vendor. Starting in 1999, many system vendors began offering full production-level support on Linux platforms. This took Linux systems, and clusters built from them, out of the realm of early adopters and into mainstream usage.

In this chapter we will begin by delineating various approaches to building parallel computing environments, and compare and contrast them with clusters. We will go over the benefits of deploying clusters as well as expose their downsides. We will then discuss two broad categories of compute clusters and their use in solving problems from various fields.
We will end the chapter by giving a high-level description of the architecture of a cluster and introducing the terminology to be used throughout the book.

Approaches in building a parallel computing environment

Before discussing clusters in detail, let us examine different ways of putting various computational resources together to solve problems. As explained in a subsection below, the boundaries between various types of parallel systems have been blurred significantly, both by new technologies and by vendor marketing efforts. The simplified taxonomy in the next sections will help us discuss the trade-offs between different technologies.

Single Operating System Image

This class of systems has multiple CPUs within the boundary of a single operating system (OS) image. Design emphasis is put on scaling the OS to an increasing number of CPUs, and on providing an environment that allows a single application instance to scale across multiple CPUs. Symmetric multiprocessor (SMP) machines and the implementations of Non-Uniform Memory Access (NUMA) systems from the late nineties belong to this category. All CPUs within the system can directly access the physical memory as well as any peripherals installed in the system.

Massively Parallel Processors

Designers of Massively Parallel Processor (MPP) systems put emphasis on the scalability of the hardware. MPP systems have been designed to scale up to hundreds, and in some cases thousands, of processors. Processing elements (sometimes referred to as nodes) are connected together using a proprietary interconnect. Each node runs its own copy of the OS kernel (or microkernel). Programmers view an MPP as having distributed memory: a processor cannot directly access the physical memory located in a remote node. The programmer or the compiler has to instruct the machine to transfer data from one node to another as needed.

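The explicit data transfer that distributed memory requires can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real MPP interface: the Node class, its send and receive methods, and the partial_sum key are hypothetical names chosen for illustration, and a simple in-process queue stands in for the interconnect.

```python
from collections import deque


class Node:
    """A hypothetical MPP node: private memory plus a message inbox."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}      # local memory; other nodes never access it directly
        self.inbox = deque()  # stands in for the receive side of the interconnect

    def send(self, dest, key):
        """Copy one local value to another node as an explicit message."""
        dest.inbox.append((self.node_id, key, self.memory[key]))

    def receive(self):
        """Take the oldest pending message and store its payload locally."""
        sender, key, value = self.inbox.popleft()
        self.memory[key] = value
        return sender


node0 = Node(0)
node1 = Node(1)
node0.memory["partial_sum"] = 42

# On a distributed-memory machine node1 cannot simply read node0's memory;
# the value has to be transferred explicitly.
node0.send(node1, "partial_sum")
sender = node1.receive()
```

After the receive, node1 holds its own copy of the value and node0's memory is unchanged. On real systems this kind of transfer is typically expressed through a message-passing library such as MPI rather than hand-written classes.
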
Faster and well-controlled interconnects in MPPs have led to some attempts to provide a shared-memory look-alike programming model on these machines. However, these attempts suffer from scalability and availability concerns.

Network of Workstations (NOWs)

In many organizations, especially those engaged in engineering design work, individual users have powerful workstations on their desks. These workstations usually sit idle during off-work hours. Various innovative technologies make it possible to harness these idle cycles, thereby making fuller use of the computing infrastructure. In most environments these workstations are simply connected to the building's local area network (LAN) and do not have a special high-speed interconnect between them. A distributed resource manager controls the jobs submitted by users and attempts to execute them on idle workstations. The Condor system developed at the University of Wisconsin is an example of one such resource manager.

Clusters

Given the ambiguity in the usage of terminology and the blurred boundaries between various technologies, we will stick to the definition of clusters given by Pfister[]: A cluster is a type of parallel or distributed system that:

- consists of a collection of interconnected whole computers
- and is used as a single, unified computing resource.

The ``whole computer'' in the above definition can have one or more processors within a single operating system image.
|