General Cluster Nodes
Overview

This chapter expands on the definitions presented in the first chapter and lays the groundwork for the following sections.
2.1 Processor and Memory Choices

This section gives an overview of the processors available for use in a Beowulf cluster, covering both x86 compatible CPUs and other architectures. It then looks in more depth at the IA-32 CPUs from Intel and AMD and at IA-64. Low level processor and memory subsystem performance is discussed briefly, and an overview of memory architecture is also presented.

2.1.1 Common Processor Architectures

2.1.2 Intel 32 bit Processors

The IA-32 is sometimes generically called x86 or even x86-32. The term stands for Intel Architecture, 32 bit, which distinguishes it from the 16 bit versions that preceded it and from the 64 bit IA-64 architecture that followed it. In compiler and build directives it is also referred to as i386; such a directive tells the compiler to generate code only for the IA-32 instruction set. This instruction set was introduced with the Intel 80386 microprocessor in 1985. The basic instruction set has remained intact, but each successive generation of microprocessors has run it much faster. Intel invented this class of processors and is its biggest supplier, but it is not the only one. The second biggest supplier is AMD, and there are also several smaller, specialized suppliers. The following sections briefly describe the various features of the IA-32 family of processors.

Modes of operations

The IA-32 supports three basic operating modes, referred to as the Real Mode, the Protected Mode and the System Management Mode. The operating mode determines which instructions and architectural features are accessible to the processor. For example, in the Real Mode the processor is limited to accessing just 1 MB of memory, while in the Protected Mode it can access all its memory.
Real Mode

When the machine is booted the processor starts in the Real Mode and begins loading programs into RAM from ROM and disk. A program inserted somewhere along the boot sequence may be used to put the processor into the Protected Mode.

Protected mode

This mode is the native state of the processor. In this mode all instructions and architectural features are available, providing the highest performance and capability. Besides the larger addressable memory, several other advantageous features become available. The first is protected memory, which prevents programs from corrupting each other. The second is virtual memory, which lets programs use more memory than is physically installed on the machine. The third is task switching, also known as multitasking, which lets a computer juggle multiple programs so that they appear to run at the same time. Another important feature of the Protected mode is the ability to directly execute "real address mode" 8086 software in a protected, multitasking environment. This feature is called the virtual-8086 mode, though strictly speaking it is not an actual processor mode. It is in fact a protected mode attribute that can be enabled for any task. The address space in Protected mode is limited to 4 GB, but this is not the limit of the physical memory size in IA-32 processors. Using extensions to the processor's paging and segment memory management (for example Physical Address Extension, or PAE), an IA-32 processor may be able to access more physical memory than a 32 bit address can reach, even without switching to a 64 bit family of processors.

System Management mode (SMM)

This mode provides an operating system or executive with a transparent mechanism for implementing platform specific functions such as power management and system security.
The processor enters SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received from the advanced programmable interrupt controller (APIC). In SMM, the processor switches to a separate address space while saving the basic context of the currently running program or task. SMM-specific code may then be executed transparently. Upon returning from SMM, the processor is placed back into the state it was in before the system management interrupt. SMM was introduced with the Intel386 SL and Intel486 SL processors and is a standard IA-32 feature.

Registers

The 386 has eight 32 bit general purpose registers for application use. There are also 8 floating point stack registers. Later processors added new registers with SIMD instruction sets such as MMX, 3DNow! and SSE. There are also system registers that are used mostly by operating systems rather than by applications. These include segment, control, debug and test registers. There are 6 segment registers, used mainly for memory management. The number of control, debug and test registers varies from model to model.

General Purpose Registers

The x86 general purpose registers are not really as general purpose as their name implies, because some highly specialized tasks can only be done using one or two specific registers. These registers subdivide further into registers specializing in data and registers specializing in addressing.

8 bit and 16 bit register subsets

8 bit and 16 bit subsets of these registers are also accessible. For example, the lower 16 bits of the 32 bit EAX register can be accessed as the AX register. Some of the 16 bit registers can be further subdivided into 8 bit subsets; for example, the upper 8 bit half of AX is called AH and the lower half is called AL. Similarly EBX is subdivided into BX (16 bit) and BH and BL (8 bit each).
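The register subsets described above are simply bit fields of the full 32 bit register. The following is a minimal Python sketch of that relationship, not processor code; the function names `register_views` and `write_al` are hypothetical and chosen only for illustration.

```python
def register_views(eax: int) -> dict:
    """Return the 16 bit and 8 bit subsets of a 32 bit register value."""
    eax &= 0xFFFFFFFF            # a 32 bit register holds only 32 bits
    ax = eax & 0xFFFF            # AX: lower 16 bits of EAX
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,  # AH: upper 8 bit half of AX
        "AL": ax & 0xFF,         # AL: lower 8 bit half of AX
    }


def write_al(eax: int, value: int) -> int:
    """Model a write to AL: only the lowest 8 bits of EAX change."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


for name, value in register_views(0x12345678).items():
    print(f"{name} = {value:#x}")
print(f"EAX after writing 0xFF to AL = {write_al(0x12345678, 0xFF):#x}")
```

With EAX holding 0x12345678, AX is 0x5678, AH is 0x56 and AL is 0x78. The same masks apply to EBX, BX, BH and BL.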
General data registers

These include:
- EAX accumulator (with a special interpretation for arithmetic instructions; a for accumulator)
- EBX base register (used for addressing data in the data segment)
- ECX counter (with a special interpretation for loops, c for counter)
- EDX data register
General Address registers

These are used for address pointing and include:
- EBP stack base pointer (holds the base address of the current stack frame)
- ESI source index (for string operations)
- EDI destination index (for string operations)
- ESP stack pointer (holds top address of stack)
- EIP instruction pointer (holds current instruction address)
Floating point stack registers

There are eight x87 floating point registers, known as ST(0) to ST(7). These registers are accessed as a last in, first out (LIFO) stack. The register numbers are not fixed but are relative to the top of the stack: ST(0) is the top of the stack, ST(1) is the next one below the top, and so on. Data is always pushed onto the top of the stack and operations are always done against the top of the stack. As a result these registers are accessed in stack order rather than as a randomly accessible register file.

SIMD registers

These include the MMX, 3DNow! and SSE registers.

MMX registers

MMX added 8 registers to the architecture, known as MM0 through MM7. These registers are just aliases for the existing x87 FPU stack registers, so anything that is done to the floating point stack also affects the MMX registers. Unlike the FP stack, the MMn registers are fixed rather than relative, so they are randomly accessible. Each of these registers holds 64 bits of integer data. One of the main concepts of the MMX instruction set is that of packed data types: instead of using the whole register for a single 64 bit integer, it may hold two 32 bit, four 16 bit or eight 8 bit integers.

3DNow! registers

3DNow! was designed to be a natural evolution of MMX from integer to floating point. It uses the same naming convention as the MMX registers (MM0 to MM7), the only difference being that single precision floating point values can be packed into these registers. Because of the aliasing with the FPU registers, the same instructions and data structures that are used to save the state of the FPU registers can be used for these registers.

SSE registers

SSE is a SIMD instruction set that, like 3DNow!, works only on floating point values. Unlike 3DNow!, however, it has no connection with the FPU stack. It has larger registers than 3DNow! and can pack twice the number of single precision floats.
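The packed data types used by the MMX, 3DNow! and SSE registers can be illustrated with a short Python sketch. It treats a register as a plain integer or byte string and is only a model of the data layout; the helper names `unpack_lanes` and `packed_add` are hypothetical, and the wraparound behaviour of the lane-wise add is an assumption made for the example.

```python
import struct


def unpack_lanes(value: int, lane_bits: int, total_bits: int = 64) -> list:
    """Split a register value into unsigned lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, lane_bits)]


def packed_add(a: int, b: int, lane_bits: int, total_bits: int = 64) -> int:
    """Add two registers lane by lane; carries never cross a lane boundary."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane << shift
    return result


mm0 = 0x0001000200030004
print(unpack_lanes(mm0, 32))   # two 32 bit integers
print(unpack_lanes(mm0, 16))   # four 16 bit integers
print(unpack_lanes(mm0, 8))    # eight 8 bit integers

# Register sizes in bytes for packed floating point data
print(len(struct.pack("<2f", 1.0, 2.0)))             # 64 bit register, two singles
print(len(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)))   # 128 bit register, four singles
print(len(struct.pack("<2d", 1.0, 2.0)))             # 128 bit register, two doubles
```

The same 64 bit value is interpreted as two, four or eight independent integers depending on the lane width, which is the essence of a packed data type.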
The original SSE was designed for handling single precision only. SSE2 was then introduced for double precision numbers, which 3DNow! could not pack, since a double precision number is 64 bits in size and would fill a whole 3DNow! MMn register. At 128 bits, an SSE2 register can pack two double precision floats. Thus SSE2 is much more suitable for scientific calculations than either SSE1 or 3DNow!.

Memory management

The memory that the processor addresses on its bus is called the physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte has a unique address, called a physical address, which ranges from 0 up to a maximum of 64 GB minus 1. Any operating system designed to work with the IA-32 will use the processor's memory management facilities, which provide features such as segmentation and paging. With the flat memory model, memory appears to a program as a single continuous, byte addressable address space called the linear address space. With the segmented memory model, memory appears to a program as a group of independent address spaces called segments. When using this model, code, data and stacks are typically contained in separate segments. To address a byte in a segment, a program must issue a logical address, or far pointer, which consists of a segment selector and an offset. Programs running on an IA-32 processor can address up to 16383 segments of different sizes and types. The primary reason for using segmented memory is to increase the reliability of programs and systems. For example, placing a program's stack in a separate segment prevents the stack from growing into the code or data space and overwriting instructions or data. With either the flat or the segmented memory model, the linear address space is mapped into the processor's physical address space either directly or through paging. When using direct mapping, each linear address has a one-to-one correspondence with a physical address.
On the other hand, when using the IA-32's paging mechanism, the linear address space is divided into pages which are mapped into virtual memory. The pages of virtual memory are then mapped as needed into physical memory. The real address mode memory model uses the memory model of the Intel 8086 processor. This memory model is supported in the IA-32 architecture for compatibility with existing programs written to run on 8086 processors. The real address mode uses a specific implementation of segmented memory in which the linear address space for the program and the operating system/executive consists of an array of segments of up to 64 KB each.

2.1.3 Intel IA-64 Processors

IA-64 is a 64 bit processor architecture developed jointly by Intel and Hewlett-Packard for processors such as Itanium and Itanium 2. The goal of Itanium is to produce a post-RISC era architecture using a very long instruction word (VLIW) design. Unlike previous Intel x86 processors, the Itanium is not geared towards high performance execution of the IA-32 (x86) instruction set.

Architecture

A key feature of the IA-64 is its new 64 bit instruction set architecture, which applies a processor architecture technology known as EPIC (Explicit Parallel Instruction Computing). Another key feature is that it remains compatible with the IA-32 instruction set. In a mainstream design, a complex decoder system examines each instruction as it flows through the pipeline and determines which instructions can be operated on in parallel across different execution units. This ability to extract instruction level parallelism (ILP) from the instruction stream is essential to good performance in a modern CPU. However, predicting which code can and cannot be split up this way is a complex task. For instance, with an IF statement the input to one line depends on the output of another.
Even when the calculations themselves are independent of one another, the THEN following the IF needs the result of the IF to know whether it should proceed at all. In these cases the circuitry on the CPU typically "guesses" what the outcome of the condition will be. If the guess is wrong, there is a significant performance penalty, as the wrong result has to be discarded and the CPU needs to wait for the right result. The IA-64 relies on the compiler for this task. The compiler examines the code and makes, ahead of time, the decisions that would otherwise be made at run time on the chip itself. Once it has made these decisions, it gathers up the instructions and stores them in VLIW form in the program. This strategy of moving the task from the CPU to the compiler is one of the major advantages of the IA-64. Offloading the prediction task to the compiler greatly reduces the complexity of the circuitry, as the prediction logic can be very complicated. Furthermore, the compiler can spend more time examining the code, which the chip itself cannot do as it has to complete the task as quickly as possible. The Itanium architecture provides mechanisms such as instruction templates, branch hints and cache hints to enable the compiler to communicate compile-time information to the processor. It also allows compiled code to manage the processor hardware using run-time information. These compiler to processor communication mechanisms are vital in minimizing the performance penalties associated with branches and cache misses. The disadvantage, however, is that the program's run time behaviour is sometimes not obvious from the code. It also makes the VLIW strategy heavily dependent on the performance of the compilers, so there is a trade off between reducing microprocessor complexity and increasing compiler complexity.

Registers

This section briefly reviews some of the registers available in IA-64.
The IA-64 includes 128 integer registers of 64 bits and 128 floating point registers of 82 bits. Besides the sheer number of registers, the IA-64 also adds a register rotation mechanism, controlled by the Register Stack Engine, which allows the processor to rotate in a set of new registers to accommodate new function parameters or temporaries.

General registers

A set of 128 (64 bit) general registers provides the resource for all integer and integer multimedia computation. These are numbered GR0 through GR127. Each general register has 64 bits of normal data storage plus an additional bit, called the NaT bit, to track deferred speculative exceptions. The general registers are partitioned into two sets: GR0 to GR31 are termed static general registers, while GR32 to GR127 are called stacked general registers. GR8 to GR31 contain the IA-32 integer, segment selector and segment descriptor registers.

Floating point registers

There are 128 (82 bit) floating point registers. These are numbered FR0 to FR127 and are again partitioned into two subsets: FR0 to FR31 are called static floating point registers, while FR32 to FR127 are called rotating floating point registers. Floating point registers FR8 to FR31 contain the IA-32 floating point and multimedia registers while executing IA-32 instructions.

Register Stack Configuration register

The RSC register is a 64 bit register used to control the operation of the Register Stack Engine (RSE). Instructions that modify RSC can never set the privilege level field to a more privileged level than that of the currently executing process.

Predicate registers

A set of 64 (1 bit) predicate registers is used to hold the results of compare instructions. These are numbered PR0 to PR63 and are used for conditional execution of instructions. They are further partitioned into two subsets: static predicate registers (PR0 to PR15) and rotating predicate registers (PR16 to PR63).
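Conditional execution through predicate registers can be pictured as replacing an IF branch with instructions that always issue but only take effect when their predicate is true. The Python sketch below is a conceptual model under that assumption, not IA-64 code; `with_branch`, `commit` and `with_predication` are hypothetical names, and the arithmetic is arbitrary.

```python
def with_branch(x: int) -> int:
    """Ordinary control flow: the hardware must guess which path is taken."""
    if x < 0:
        return -x
    return 2 * x


def commit(predicate: bool, old: int, new: int) -> int:
    """Model a predicated instruction: its result is kept only if the predicate is true."""
    return new if predicate else old


def with_predication(x: int) -> int:
    """Both paths are issued; the predicates decide which result survives."""
    p_then = x < 0          # a compare writes a predicate ...
    p_else = not p_then     # ... and its complement
    result = 0
    result = commit(p_then, result, -x)      # guarded by p_then
    result = commit(p_else, result, 2 * x)   # guarded by p_else
    return result


for value in (-3, 0, 5):
    print(value, with_branch(value), with_predication(value))
```

Both functions return the same values, but the second has no branch to predict: the compiler has turned a control dependency into a data dependency on the two predicates.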
Branch registers

A set of 8 (64 bit) registers is used for holding branch information; these are numbered BR0 to BR7.

Instruction set

The architecture provides a CISC like complement of instructions, with explicit instructions for both floating point and multimedia operations. The Itanium supports several bundle mappings to allow more ways of mixing instructions, and strikes a balance between serial and parallel execution modes. Room has also been left in the initial bundle encodings for additional mappings to be added in future versions of IA-64. Despite the extensive capabilities of the IA-64 instruction set, it is notoriously difficult to program directly. Intel discourages assembly programming on Itanium and instead urges the use of the Intel C++ compiler, which has platform specific heuristics.
2.1.4 AMD x86 Compatible Processors

The AMD x86-64, or AMD64, is a 64 bit processor architecture invented by AMD. It is a superset of the x86 architecture (discussed in 2.1.2), which it natively supports. The AMD64 instruction set is currently used in AMD's Athlon 64, Athlon 64 FX and Opteron processors. An important part of AMD64 is that it allows the latest in processor innovation to be brought to the existing installed base of 32 bit applications and operating systems, while establishing an installed base of systems that are 64 bit capable. By contrast, the IA-64 is not built around the x86 instruction set, meaning that existing 32 bit applications are not anticipated to run with leading edge performance on IA-64 based processors. Instead, the AMD64 provides extensions to the reliable, proven and high performance x86 instruction set and preserves full compatibility between 32 and 64 bit environments.

AMD64 Architecture Overview

The AMD64 architecture extends the x86 architecture by introducing two major features: a 64 bit extension called long mode, and register extensions. The new modes are encoded using two flags in the segment descriptor. The first flag is the existing "D" bit that controls the size of operands; the second is a previously unused "L" bit, which determines whether a specific application is 64 bit enabled or runs in compatibility mode.

Long mode

Long mode is enabled by a global control bit called LMA (Long Mode Active). When LMA is disabled (LMA = 0), the processor operates as a standard x86 processor and is compatible with all existing 16 and 32 bit operating systems and applications. When LMA is activated (LMA = 1), the 64 bit processor extensions are enabled. Thus the system can auto configure according to the capabilities of the machine and the processor. Long mode consists of two sub modes:
- 64 bit mode: This mode supports the following new features:
- 64 bit virtual addresses (implementations can have less)
- Register extensions through a new prefix (REX), which adds eight GPRs (R8-R15), widens the GPRs to 64 bits and adds eight 128 bit Streaming SIMD Extensions (SSE) registers (XMM8-XMM15)
- 64 bit instruction pointer (RIP)
- New RIP-relative data addressing mode
- Flat address space with single code, data and stack space

Since the 64 bit mode supports a 64 bit virtual address space, it requires a 64 bit operating system and tool chain. A few instruction opcodes and prefix bytes are redefined to allow the register extensions and 64 bit addressing.
The default address size is 64 bits and the default operand size is 32 bits. The defaults can be overridden on an instruction-by-instruction basis using prefixes. A new REX prefix is introduced for specifying 64 bit operand size and the new registers. This mode is enabled by the OS on an individual code segment basis. The register extensions accessed via the REX prefix add eight 64 bit GPRs (R8-R15) and eight 128 bit Streaming SIMD Extensions registers (XMM8-XMM15), and widen all GPRs and the instruction pointer to 64 bits.
- Compatibility mode: Compatibility mode supports binary compatibility with existing 16 and 32 bit applications within a 64 bit environment. In compatibility mode, applications can only access the first 4 GB of virtual address space. As with the 64 bit mode, compatibility mode is enabled by the OS on an individual code segment basis. However, unlike in the 64 bit mode, x86 segmentation functions normally, using either the 16 bit or 32 bit protected mode semantics. From the application's point of view, compatibility mode looks like a legacy x86 protected mode environment. From the OS's point of view, address translation, interrupt and exception handling and system data structures use the 64 bit long mode mechanisms.
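The role of the REX prefix in 64 bit mode can be sketched in a few lines of Python. The bit layout used here (a byte in the range 0x40 to 0x4F, with the low four bits named W, R, X and B) is not spelled out in this text and is taken from the published AMD64 instruction encoding; the function names are hypothetical, and other prefixes that also affect operand size are deliberately ignored.

```python
def decode_rex(byte: int):
    """Return the REX flag bits if the byte is a REX prefix, else None."""
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": bool(byte & 0b1000),  # 1 selects a 64 bit operand size
        "R": bool(byte & 0b0100),  # extends the register field
        "X": bool(byte & 0b0010),  # extends the index register field
        "B": bool(byte & 0b0001),  # extends the base / r/m register field
    }


def operand_size(rex) -> int:
    """Default operand size is 32 bits; REX.W overrides it to 64 bits."""
    return 64 if rex is not None and rex["W"] else 32


def register_number(field3: int, extension_bit: bool) -> int:
    """Combine a 3 bit register field with a REX bit to select one of 16 registers."""
    return (int(extension_bit) << 3) | (field3 & 0b111)


rex = decode_rex(0x4C)
print(rex, operand_size(rex))
print(register_number(0, rex["R"]))   # 8, i.e. one of the new registers (R8)
print(operand_size(decode_rex(0x90))) # not a REX prefix, default size applies
```

The existing 3 bit register fields can name only eight registers; the extra bit supplied by the prefix is what makes R8-R15 and XMM8-XMM15 reachable without changing the rest of the instruction encoding.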
In addition to the long mode, the architecture also supports a pure x86 legacy mode, which preserves binary compatibility not only with existing 16 and 32 bit applications but also with existing 16 and 32 bit operating systems. None of the 64 bit features are available when the processor operates as a standard x86 processor. The Legacy mode is completely compatible with existing 32 bit implementations of the x86 architecture. This includes support for current features such as segmented memory and 32 bit GPRs and instruction pointer.

Register Extensions

To define the naming of the registers, the AMD64 architecture simply extends the scheme currently used for 16 and 32 bit instructions. For example, for 16 bit operations the two bytes of register A are addressed as AX, for 32 bit operations the four bytes of register A are addressed as EAX, and for 64 bit operations the eight bytes are addressed as RAX. In 64 bit mode the general purpose registers (GPRs) are extended to 64 bits. The 64 bit registers are called RAX, RBX, RCX, RDX, RDI, RSI, RBP and RSP, and the instruction pointer and flags registers become RIP and RFLAGS. The new 64 bit registers overlay and extend the existing registers. In addition, 8 new 64 bit GPRs are added for a total of 16 GPRs. There are also eight new streaming SIMD registers, called XMM8 through XMM15, for a total of 16 SIMD registers. The data segment registers ES, DS and SS are ignored in the 64 bit mode (FS and GS can still supply a base address). Code segments still exist, however: the CS is needed to encapsulate the default mode of the processor as well as the execution privilege level. When a 32 bit operation has a GPR as its destination, the 32 bit result is zero extended into the full 64 bit GPR. 8 bit and 16 bit operations on GPRs preserve all unwritten upper bits. This preserves the 16 and 32 bit semantics for partial width results. The final step is to simply define a set of instruction prefixes that specify a 64 bit operand size and allow access to the new registers.
This is similar to the method used to extend the x86 architecture for other functionality such as AMD's 3DNow! technology. By extending the x86 core rather than replacing it with a new, entirely different instruction set, AMD64 makes the transition to 64 bit much easier, faster and less expensive. The problem of migrating to a new architecture is greatly reduced, without limiting the forward compatibility and future performance of existing applications.

2.2 Motherboards and System Busses

This section presents the different choices for motherboards and the distinctions between workstation and server chassis, and introduces system busses. Most of the detailed information concerns the PCI bus. The two next generation buses that will replace PCI are also introduced.

2.2.1 Motherboards

The motherboard is the main circuit board inside the PC. It holds the processor, memory and expansion slots and connects directly or indirectly to every part of the PC. It is made up of a chipset, some ROM code and various interconnections known as buses. The physical layout of the motherboard varies greatly from PC to PC; two different boards can have very similar performance even though they are laid out completely differently. This is all the more true because a large number of vendors manufacture a wide variety of motherboards. The basic function of the motherboard, however, is to provide a useful working place for all the components of the PC. The following sections give a brief overview of the basic functionality and layout of the motherboard.

Motherboard Form Factors

The form factor of the motherboard describes its general shape, the kind of power supply used, its physical organization and the kind of cases it fits. The two most common form factors in motherboards are the AT and the Baby AT form factors.
These two forms differ mainly in width, the older AT board being 12" wide, while the Baby AT board is 8.5" wide and nominally 13" long. The AT form is the much older version and is usually found in older machines (386 or older). A troublesome feature of this board is that a good percentage of it overlaps with the drive bays, which makes installation and upgrading difficult and cumbersome. For the Baby AT form, the reduced width allows much less overlap with the drive bays. It has three rows of mounting holes: the first runs along the back of the board where the bus slots and key connectors reside, the second runs through the middle of the board, and the third runs along the front of the board near where the drives are mounted. One problem with the Baby AT is that many of its newer versions try to reduce cost by reducing the board size (for example, to 10" or 11" long). This often leads to mounting problems, as the third row of holes might not line up with the holes in the case. Both the AT and Baby AT form factors place the processor sockets or slots and the memory sockets at the front of the motherboard, and long expansion cards were designed to extend over them. This design was introduced over a decade ago. Present day processors, however, need bigger heat sinks and fans mounted on them, with the result that the processor, heat sink and fan combination can often block as many as three of the expansion slots on the motherboard. The SIMM/DIMM sockets pose a similar problem. Although the newer Baby AT motherboards move the SIMM/DIMM sockets out of the way, the processor remains a problem. The ATX was designed to solve this problem.

ATX and Mini ATX form factors

The ATX form was invented by Intel in 1995. Pentium Pro and Pentium II systems are the most common users of this kind of motherboard.
The ATX has many advantages over the older motherboards, which include:
- Integrated I/O Port Connectors: Baby AT motherboards use headers which stick up from the board, and a cable that goes from them to the physical serial and parallel port connectors mounted on to the case. The ATX has these connectors soldered directly onto the motherboard. This improvement reduces cost, saves installation time, improves reliability (since the ports can be tested before the motherboard is shipped) and makes the board more standardized.
- Integrated PS/2 Mouse Connector: On most retail baby AT style motherboards, there is either no PS/2 mouse port, or to get one you need to use a cable from the PS/2 header on the motherboard, just like the serial and parallel ports. (Of course most large OEMs have PS/2 ports built in to their machines, since their boards are custom built in large quantities). ATX motherboards have the PS/2 port built into the motherboard.
- Reduced Drive Bay Interference: Since the board is essentially "rotated" 90 degrees from the baby AT style, there is much less "overlap" between where the board is and where the drives are. This means easier access to the board, and fewer cooling problems.
- Reduced Expansion Card Interference: The processor socket/slot and memory sockets are moved from the front of the board to the back right side, near the power supply. This eliminates the clearance problem with baby AT style motherboards and allows full length cards to be used in most (if not all) of the system bus slots.
- 3.3V Power Support: The ATX style motherboard has support for 3.3V power from the ATX power supply. This voltage (or lower) is used on almost all newer processors, and this saves cost because the need for voltage regulation to go from 5V to 3.3V is removed.
- Soft Power Support: The ATX power supply is turned on and off using signalling from the motherboard, not a physical toggle switch. This allows the PC to be turned on and off under software control, allowing much improved power management. For example, with an ATX system you can configure Windows 95 so that it will actually turn the PC off when you tell it to shut down.
LPX and Mini LPX

The primary design goal behind the LPX form factor is reducing space usage (and cost). This can be seen in its most distinguishing feature: the riser card that is used to hold expansion slots. Instead of having the expansion cards go into system bus slots on the motherboard, like on the AT or ATX motherboards, LPX form factor motherboards put the system bus on a riser card that plugs into the motherboard. Then, the expansion cards plug into the riser card; usually, a maximum of just three. This means that the expansion cards are parallel to the plane of the motherboard. This allows the height of the case to be greatly reduced, since the height of the expansion cards is the main reason full-sized desktop cases are as tall as they are. LPX form factor motherboards also often come with video display adapter cards built into the motherboard. If the card built in is of good quality, this can save the manufacturer money and provide the user with a good quality display. However, if the user wants to upgrade to a new video card, this can cause a problem unless the integrated video can be disabled. LPX motherboards also usually come with serial, parallel and mouse connectors attached to them, like ATX.

NLX form factor

The need for a modern, small motherboard standard has led to the development of the new NLX form factor. In many ways, NLX is to LPX what ATX is to AT: it is generally the same idea as LPX, but with improvements and updates to make it more appropriate for the latest PC technologies. Also like ATX, the NLX standard was developed by Intel Corporation and is being promoted by Intel. Intel of course is a major producer of large-volume motherboards for the big PC companies. NLX still uses the same general design as LPX, with a smaller motherboard footprint and a riser card for expansion cards. The NLX form factor is, like the LPX, designed primarily for commercial PC makers mass-producing machines for the retail market.
Many of the changes made to it are based on improving flexibility to allow for various PC options and flavors, and to allow easier assembly and reduced cost. For homebuilders and small PC shops, the ATX form factor is the design of choice heading into the future.

2.2.3 PCI Bus

PCI, or Peripheral Component Interconnect, is a 32 bit bus architecture (64 bit with multiplexing) developed by DEC, IBM, Intel and others that is widely used in Pentium based PCs. A PCI bus provides a high bandwidth data channel between system board components such as the CPU and devices such as hard disks and video adapters. The PCI superseded the VL-bus, which was widely in use until the early 1990s. The essential purpose of introducing the PCI bus was to make expansion easier to implement by offering plug and play (PnP) hardware, i.e. a system that enables the PC to adjust automatically to new cards as they are plugged in, removing the need to check jumper settings and interrupt levels. By 1994 PCI was established as the dominant Local Bus standard. Unlike the VL-bus, which was essentially an extension of the bus that the CPU uses to access the main memory, the PCI is a separate bus isolated from the CPU but having access to the main memory. In addition, the VL-bus was designed to run at system bus speeds, whereas the PCI bus is linked to the system bus through special bridge circuitry, so its speed can be set synchronously or asynchronously depending on the chipset and the motherboard. In a synchronous setup (used in most PCs), the PCI bus runs at half the memory bus speed, which usually gives a PCI clock of 25, 30 or 33 MHz. In an asynchronous setup the speed of the PCI bus can be set independently of the memory bus speed, controlled through jumpers on the motherboard or BIOS settings. The PCI is also limited to five connectors, although each can be replaced by two devices built into the motherboard. It is also possible for a processor to support more than one bridge chip.
The PCI is more tightly specified than the VL-bus and offers a number of additional features. For example, it can support cards running from both 5 volt and 3.3 volt supplies, using different key slots to prevent the wrong card being put into the wrong slot. In its original implementation the PCI ran at 33 MHz, but this was raised to 66 MHz by the later PCI 2.1 specification. As a result the theoretical throughput was increased to 266 MBps. The PCI can also be configured as either a 32 bit or a 64 bit bus, and both kinds of cards can be used in either configuration.
PCI Bus Performance
The PCI is the highest performance general I/O bus currently used on PCs. This superior performance is due to several factors:
- Burst mode: The PCI bus can transfer information in a burst mode, where after an initial address is provided multiple sets of data can be transmitted in a row. This works in a manner similar to cache bursting.
- Bus Mastering: PCI supports full bus mastering, which leads to improved performance.
- High Bandwidth Options: The PCI 2.1 version is expanded to 64 bits and 66 MHz, thus quadrupling the bandwidth of the original 32 bit, 33 MHz bus.
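The bandwidth figures quoted for PCI follow from multiplying the bus width by the clock rate. The sketch below is a minimal illustration of that arithmetic, assuming one full-width transfer on every clock and no protocol overhead; the function name and the constants are chosen here for illustration and do not come from any specification.

```python
def peak_bandwidth_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak in MB/s, assuming one full-width transfer per clock."""
    return (width_bits / 8) * clock_mhz


# Nominal PCI clocks: "33 MHz" and "66 MHz" are 33.33 and 66.67 MHz.
CLOCK_33 = 100 / 3
CLOCK_66 = 200 / 3

configs = [
    ("32 bit, 33 MHz", 32, CLOCK_33),
    ("32 bit, 66 MHz", 32, CLOCK_66),
    ("64 bit, 66 MHz", 64, CLOCK_66),
]

for label, width, clock in configs:
    print(f"{label}: {peak_bandwidth_mb_s(width, clock):.1f} MB/s")
```

Doubling both the width and the clock gives four times the peak rate of the original bus, which is the quadrupling mentioned above. These are upper bounds; sustained throughput on a real bus is lower.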
PCI Internal Interrupts
The PCI bus uses its own interrupt system for dealing with requests from the cards on the bus. These interrupts are often called "#A", "#B", "#C" and "#D" to avoid confusion with the normal system IRQs (they are sometimes called "#1" to "#4" as well). If they are needed by cards in the slots, these interrupts are mapped to regular interrupts, normally IRQ9 through IRQ12. The PCI slots in most systems can be mapped to at most 4 regular IRQs. In systems having more than 4 PCI slots, two or more PCI devices share an IRQ.
PCI Bus Mastering
Bus mastering is the ability of devices on the PCI bus (other than the system chipset) to take control of the bus and perform transfers directly. The PCI bus is the first bus to popularize bus mastering. PCI's design allows bus mastering of multiple devices on the bus simultaneously, with the arbitration circuitry working to ensure that no device on the bus (including the processor) locks out any other device. At the same time it allows any given device to use the full bus throughput if no other device needs to transfer anything. Thus it acts as a tiny local network within the computer, in which multiple devices can talk to each other through a communication channel managed by the chipset. The PCI bus also allows compatible IDE/ATA hard disk drives to be set up as bus masters. This can increase performance over the PIO modes, which are the default method of data transfer used by IDE/ATA. However, for IDE bus mastering to work correctly, all of the following are needed:
- Bus Mastering Capable system hardware: This includes the motherboard, chipset, bus and BIOS. Most of the newer motherboards using the Intel 430 Pentium chipset family will support bus mastering IDE.
- Bus Mastering hard disk: All Ultra ATA hard disks support bus mastering
- 32 bit Multitasking OS
- Bus Mastering drivers: A special driver must be provided to the OS to enable bus mastering to work.
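Returning to the interrupt mapping described under PCI Internal Interrupts, the sketch below illustrates why systems with more than four PCI slots end up sharing IRQs. It is only an illustration: the one-to-one pairing of "#A" to "#D" with IRQ9 to IRQ12 and the round-robin assignment of slots to interrupt lines are assumptions made here, since the actual routing depends on the chipset and BIOS.

```python
from collections import defaultdict

PCI_LINES = ["#A", "#B", "#C", "#D"]

# Assumed pairing within the IRQ9-IRQ12 range mentioned in the text.
IRQ_FOR_LINE = {"#A": 9, "#B": 10, "#C": 11, "#D": 12}


def assign_irqs(num_slots: int) -> dict:
    """Map slot numbers to IRQs, cycling through the four PCI interrupt lines."""
    sharing = defaultdict(list)
    for slot in range(num_slots):
        line = PCI_LINES[slot % len(PCI_LINES)]  # hypothetical round-robin routing
        sharing[IRQ_FOR_LINE[line]].append(slot)
    return dict(sharing)


for slots in (4, 6):
    print(f"{slots} slots -> {assign_irqs(slots)}")
```

With four slots every slot has its own IRQ; with six, slots 0 and 4 share one IRQ and slots 1 and 5 share another.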
The PCI protocol
The PCI bus uses an immediate protocol rather than a register-to-register protocol. With a conventional PCI device, the following steps occur when the device switches a control signal:
- On the rising clock edge, the device switches the signal to a high or low state onto the PCI bus.
- The signal propagates across the bus (propagation delay).
- During the same clock cycle, the receiving device decodes the signal to determine whether the signal is intended for it and whether it must respond by switching one of its outputs.
- The receiving device responds immediately, that is, in the next clock cycle.
With a 33 MHz clock frequency the time allocated to the decode logic is of the order of 7 ns out of the total 30 ns clock cycle time. At 33 MHz this is sufficient time for the receiving device to respond on the next rising clock edge. However, an important problem with this protocol is that when the clock frequency is doubled to 66 MHz (thus reducing the clock cycle time to 15 ns), the time available for the receiving device to decode the signal is cut down to 3 ns. This severe time constraint makes it difficult for the conventional PCI bus to adapt to 66 MHz. From the mid-1990s, the main performance-critical components of the PC communicated with each other across the PCI bus. The most common of these PCI devices were the disk and graphics controllers, which were mounted either on the motherboard or on expansion cards in PCI slots. By the late 1990s, however, new processors and I/O devices were demanding much higher I/O bandwidth than PCI could deliver. This resulted in the creation of higher bandwidth buses like the PCI-X bus.
2.2.4 PCI-X Bus
The PCI-X is a high performance addendum to the PCI local bus specification, developed in collaboration by IBM, HP and Compaq. The PCI-X is generally viewed as an immediate solution to the increased I/O requirements of high bandwidth enterprise applications such as Gigabit Ethernet, Fibre Channel and high performance graphics. The PCI-X technology increases bus capacity to about eight times that of the conventional PCI bus, from 133 MB/s with the 32 bit, 33 MHz PCI bus to 1066 MB/s with the 64 bit, 133 MHz PCI-X bus. It also enhances the PCI protocol to provide an interconnect whose raw bandwidth exceeds 1 GB/s.
The following sections briefly describe some of the key elements of the PCI-X technology:
Register to Register protocol
With the PCI-X register-to-register protocol the following steps occur:
- On the rising clock edge, the device switches the signal to a high or low state onto the PCI-X bus.
- The signal propagates across the bus.
- The signal is sent to a register or flip-flop that holds the signal until the next clock cycle.
- The receiving device has a full clock cycle to decode the signal and determine the proper response.
- The receiving device responds two full clock cycles after the sending device first switched the signal.
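The timing difference between the two protocols can be put into numbers with a short calculation. The sketch below takes the decode budgets quoted earlier for conventional PCI (7 ns at 33 MHz, 3 ns at 66 MHz) and compares them with a register-to-register budget of one full clock period. Treating the whole period as decode time is a simplification made here, and the names in the code are illustrative only.

```python
# Nominal clock frequencies in MHz (33, 66 and 133 MHz are thirds of 100, 200, 400).
NOMINAL_MHZ = {33: 100 / 3, 66: 200 / 3, 133: 400 / 3}

# Decode budgets for the conventional protocol, as quoted in the text.
PCI_DECODE_NS = {33: 7.0, 66: 3.0}


def cycle_ns(label_mhz: int) -> float:
    """Clock period in nanoseconds for a nominal bus speed."""
    return 1000.0 / NOMINAL_MHZ[label_mhz]


def pcix_decode_ns(label_mhz: int) -> float:
    """Register-to-register: the receiver gets one full clock period to decode."""
    return cycle_ns(label_mhz)


for mhz in (33, 66, 133):
    pci = PCI_DECODE_NS.get(mhz)
    pci_text = f"{pci:.1f} ns" if pci is not None else "n/a"
    print(f"{mhz:>3} MHz: cycle {cycle_ns(mhz):.1f} ns, "
          f"PCI decode {pci_text}, PCI-X decode {pcix_decode_ns(mhz):.1f} ns")
```

Under this simplification, a register-to-register receiver at 133 MHz has about as much decode time as a conventional PCI receiver at 33 MHz, at the cost of one extra clock cycle before the response.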
Thus the PCI-X considerably eases the time constraints that were a bottleneck for the PCI bus by providing an entire clock cycle for the decode logic. The net difference is that a PCI-X transaction requires one more clock cycle than the conventional PCI transaction. With the timing constraint relaxed, it is much easier to design and implement adapters and systems that operate at 66 MHz and above.
Enhanced Bus Efficiency
The PCI-X bus incorporates the following technologies to improve the bus efficiency:
- Attribute Phase: The PCI-X includes a new transaction phase called the attribute phase, which uses a 36 bit attribute field to describe bus transactions in more detail than conventional PCI allows. The following enhancements are included in the attribute phase:
- Relaxed ordering: If the device driver or the controlling software sets this bit, the transaction is permitted to pass previously posted transactions from other devices. Relaxed ordering is especially important in applications such as audio or video streaming, where a delay in information would cause a noticeable interruption.
- Non-Cache-Coherent Transactions: Cache coherence refers to maintaining a consistent view of memory during a transaction between the processors and the I/O subsystem. On the PCI bus, whenever a device reads from or writes to main memory, the processor has to perform a snoop operation to make sure that the data does not exist in the cache memory. These snoop cycles limit the performance of the system by adding traffic. In the PCI-X, non-cache-coherent transactions are allowed by using a "don't snoop" bit. If a device driver or other software sets this bit, the PCI-X device informs the system cache controllers that no query is needed.
- Transaction Byte Count: In the PCI protocol the bridge fetches a default number of cache lines (one or two) for every data request, as it has no way of knowing how much data will be requested. With the PCI-X the bridge knows exactly how much data to fetch because the byte count is included in the attribute field. Each PCI-X transaction in a sequence identifies the total number of bytes remaining to be read or written in its associated sequence. This enables more efficient buffer management schemes in the bridge as well as more efficient utilization of the bus and other system resources.
- Sequence number: The sequence number uniquely identifies transactions that are part of the same sequence. The sequence number is used to increase efficiency in the buffer management algorithms.
- Split Transaction Support: The conventional PCI protocol supports delayed transactions. With a delayed transaction, the device requesting data must poll the target to determine when the request has been completed. With a split transaction in PCI-X, the device requesting data sends a signal to the target. The target device informs the requester that it has accepted the request, so that the requester is free to process other jobs, thus increasing the efficiency.
- Optimized Wait States: PCI-X eliminates the use of wait states, except for initial target latency. When a PCI-X device does not have data to transfer, it removes itself from the bus so that another device can use the bus bandwidth. This provides more efficient use of bus and memory resources.
- Standard Block Size Movements: With PCI-X, adapters and bridges are permitted to disconnect transactions only on naturally aligned 128 byte boundaries. This encourages longer bursts and enables more efficient use of cache line based resources such as the processor bus and main memory.
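Two of the items above, the transaction byte count and the 128 byte disconnect rule, can be illustrated together. The sketch below lists the addresses at which a burst could legally be interrupted and the remaining byte count that the next transaction in the sequence would carry. It is a simplified model written for this section: the function names are hypothetical, and it assumes the worst case in which the transfer is disconnected at every permitted boundary.

```python
BOUNDARY = 128  # disconnects are allowed only on naturally aligned 128 byte boundaries


def disconnect_points(start: int, byte_count: int) -> list:
    """Addresses inside the burst where a disconnect is permitted."""
    end = start + byte_count
    first = (start // BOUNDARY + 1) * BOUNDARY  # next aligned address after start
    return list(range(first, end, BOUNDARY))


def split_sequence(start: int, byte_count: int) -> list:
    """(address, bytes remaining) for each transaction if every boundary disconnects."""
    end = start + byte_count
    starts = [start] + disconnect_points(start, byte_count)
    return [(addr, end - addr) for addr in starts]


for addr, remaining in split_sequence(0x1040, 512):
    print(f"transaction at {addr:#06x}: {remaining} bytes remaining")
```

Because each transaction states how many bytes are still to come, a bridge in this model never has to guess how much data to fetch, unlike the fixed one or two cache lines of conventional PCI.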
The operating mode and the frequency of the PCI-X bus depend on the type and number of adapters installed on the bus. A PCI-X system automatically adjusts the bus frequency to match the frequency of the slowest adapter on that bus segment. PCI-X supports up to 256 bus segments, and each segment is initialized separately so that different operating frequencies can be used. As with conventional PCI, system designers can optimize a PCI-X system for particular I/O bandwidth needs. An important point about the PCI-X bus is that even if it operates as a conventional PCI bus, it still provides a significant performance enhancement.
2.2.5 InfiniBand Bus
To meet the increasing I/O demands of the computer industry, major technology leaders including Compaq, Dell, HP, IBM, Intel, Microsoft and Sun codeveloped the InfiniBand architecture (IBA) and released it in the year 2000. The InfiniBand architecture was developed as a means to connect servers with remote storage, networking devices and other servers, as well as for use inside servers for interprocessor communication. It offers many advantages over the existing PCI architecture and other I/O architectures, which include:
- Provides bandwidths which are an order of magnitude greater than existing I/O capabilities.
- Provides improved connection flexibility and scalability, as storage and I/O are separated from processor and memory.
- Offloads communications processing from the OS and CPU, thus eliminating traditional communications overhead.
- Allows simultaneous device communication, rather than waiting for other devices to complete their communication.
- Provides support for up to 64,000 addressable devices and support for Internet Protocol version 6 (IPv6) for effective communications between IBA fabrics and the Internet or intranets.
InfiniBand architecture
InfiniBand is a point-to-point, switched I/O fabric architecture. Each end point, or node, can vary from an inexpensive single SCSI chip or Ethernet adapter to a complex host system. Point-to-point means that each communication link extends between only two devices. Both devices at each end of a link have full and exclusive access to the communication path. To go beyond a point and traverse the network, switches come into play. By adding switches, multiple points can be interconnected to create a fabric. As more switches are added to a network, the aggregate bandwidth of the fabric increases. By adding multiple paths between devices, switches also provide a greater level of redundancy. There are five primary components that make up an InfiniBand fabric:
- Host Channel Adapter (HCA): An HCA is an interface that resides within a server and communicates directly with the server's memory and processor as well as the IBA fabric. The HCA guarantees delivery of data, performs advanced memory access and can recover from transmission errors. HCAs can communicate with a target channel adapter or a switch. An HCA can be a PCI to InfiniBand card or it can be integrated on a system motherboard.
- Target Channel Adapter (TCA): A TCA enables I/O devices, such as disk or tape storage, to be located within the network independent of a host computer. The TCA includes an I/O controller that is specific to its particular device's protocol. TCAs can communicate with an HCA or a switch.
- Switch: The switch allows many HCAs and TCAs to connect to it and handles network traffic. The switch looks at the local route header on each packet of data that passes through it and forwards it to the appropriate location. A group of switches is referred to as a fabric. The switch also frees up servers and other devices by handling network traffic.
- Router: A router forwards data packets from a local network (called a subnet) to other external subnets. The router reads the global route header and forwards packets based on the IPv6 network layer address. The router rebuilds each packet with the proper local address header as it passes it to the new subnet.
- Subnet Manager: The subnet manager is an application responsible for configuring the local subnet and ensuring its continued operation. Configuration responsibilities include managing switch and router setups and reconfiguring the subnet if a link goes down or a new one is added.
InfiniBand layers
InfiniBand comprises four primary layers that describe communication devices and methodology. These layers are briefly described below:
- Physical Layer: The InfiniBand physical layer defines its electrical and mechanical characteristics, including cables, connectors and hot-swap characteristics. Connectors include fiber, copper and backplane connectors. There are three link speeds, specified as 1X, 4X and 12X. The speeds are a function of the pin counts or wires within each cable. With a 1X link cable, there are four wires, two for each direction of communication (read and write). The 4X link has four times as many pins and wires and the 12X has twelve times as many pins and wires as a 1X link cable. The bandwidth of a 1X InfiniBand link is 2.5 Gb/s, which yields an actual raw data bandwidth of 2 Gb/s because 8b/10b data encoding is used on all transmissions, resulting in a 20% overhead. Because all links are bidirectional, the aggregate bandwidth can be doubled. Many InfiniBand products have multiple ports, further increasing I/O bandwidth.
- Link Layer: The link layer is central to InfiniBand and includes packet layout, point-to-point link instructions, switching within a local subnet and data integrity. There are two types of packets, management and data. Management packets handle link configuration and maintenance. Data packets carry up to 4 kilobytes of transaction payload. Packet forwarding and switching within a local subnet is also part of the link layer's responsibilities. Every device in a local subnet has a local ID (LID). Packets of data are forwarded to the appropriate LID by reading the local route header found in each packet of data. Virtual lanes are also part of the link layer. A virtual lane is a unique logical communication link that shares a single physical link. Each link can have up to 15 virtual lanes and a management lane. As a packet travels through the subnet, it can be assigned a priority or service level. Higher-priority packets are sent down special virtual lanes ahead of other packets.
- Network Layer: The network layer is responsible for routing packets from one subnet to another. The global route header located within a packet includes an IPv6 address for the source and destination of each packet. Using a router, packets are forwarded through different subnets. For single-subnet environments, the network layer information is not used.
- Transport Layer: The transport layer handles the order of packet delivery as well as partitioning, multiplexing and transport services that determine reliable connections.
The InfiniBand bus represents a significant improvement in reliability, availability and serviceability over the PCI bus. The basic InfiniBand link comprises only 4 signal wires, compared to more than 100 on a PCI bus. It can also accommodate multiple ports for each I/O unit. InfiniBand also includes a failover mechanism that allows the network to heal itself if a link fails. Further, it removes the I/O from the server, thus breaking the one-to-one relation between the server and the I/O elements. Thus if an I/O device fails, communication simply fails over to another redundant I/O device. This is unlike the PCI bus, and so saves time and resources and helps keep the server online.
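The link speeds given for the InfiniBand physical layer can be checked with a short calculation. The sketch below starts from the 2.5 Gb/s signalling rate of a 1X link and applies the 8b/10b overhead (8 data bits carried in every 10 transmitted bits) for each link width. The function and constant names are illustrative, and the model ignores packet headers and other protocol overhead.

```python
SIGNAL_RATE_GBPS = 2.5   # per 1X link, per direction (from the text)
LINK_WIDTHS = (1, 4, 12)  # the 1X, 4X and 12X links


def data_rate_gbps(width: int, bidirectional: bool = False) -> float:
    """Usable data rate in Gb/s after 8b/10b encoding for a link of the given width."""
    rate = width * SIGNAL_RATE_GBPS * 8 / 10
    return rate * 2 if bidirectional else rate


for width in LINK_WIDTHS:
    one_way = data_rate_gbps(width)
    both = data_rate_gbps(width, bidirectional=True)
    print(f"{width}X: {one_way:.0f} Gb/s per direction, {both:.0f} Gb/s aggregate")
```

A 1X link therefore carries 2 Gb/s of data in each direction, and a 12X link carries 24 Gb/s in each direction, or 48 Gb/s when both directions are counted.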