| Article Index |
|---|
| Application of cluster computing for seismic imaging in the energy industry |
| Page 2 |
| Page 3 |
| Page 4 |
| Page 5 |
| All Pages |
Case Study 3
This case study demonstrates how performance of complex algorithms can be improved by redesigning the code structure of such algorithms. The study uses an example of a seismic imaging algorithm to demonstrate techniques that can be used to attain improved performance on existing hardware including parallel and distributed systems.
Using Performance Programming in Seismic Imaging Algorithms
1. Introduction
Along with development of sophisticated algorithms for a wide variety of tasks, there is also the need for developing methods that can effective use the avaialble machines and use thier full potential. This practise of best use of the hardware has been termed as performance programming.Performance programming is markedly different from the usual variety of coding. The primary goal of performance programmimg is to bridge the gap between the speed at which a code is running and the speed at which it should run when best optimized. This paper uses an example of a seimic imaging algorithm to demonstrate the efficiency that can be attined via performamce programming.
The basic function of a seismic imaging/migration algorithm is that it converts signals recorded on the surface to an image of the subsurface by extrapolating the wavefield recorded on the surface down into depth. there is a Fast Forier Trnasform (FFT) involved of the wavefield so that each frequency component can be extrapolated independently. The primary cost of this kind of an algorithm is the extraoplation step. In this study the extrapolation is done using a 39 point symmetric filter which makes the scheme very accuarate but computationally intensive. This is because to evaluate each sample of the wavefield at a particular location 39 samples from the previous location has to be used in a convolution like scheme. The following sections will describe how this algorithm can be bullet=proofed and enhanced using principles of performance programming.
2.Improving the Code Structure
One of the important component of this algorithm is an intensive inner loop where operations like using the filter for convolution is done. Unless this inner loop performance is optimized, efficiency cannot be achieved. This means that the code must make optimal use of the machine's underlying instruction set. The machine that has beed used for this study isa 20 MHZ IBM RISC system/6000 computer. This machine can perform 2 floating point operations per cycle so that is maximum theoretical speed is 40 Mflops. This seismic imaging method being implemented here has an inner loop that needs to do 6 floating point adds and 4 multiplies requiring 6 cycles. The maximum performance achievable is therefore 33 Mflops. There however exists a strange condition in the 6000 C compiler which treats a filter operation of the form x+ = a*b + c*d as x = x + (a*b+ c*d)i. Thus instead of having 6 operations we have eight. Thus a simple rearrangement of the C code will speed up performance by a factor of 10%. Another problem is that although the 6000 can dispatch a floating point instruction every cycle, the results are not available for atleast 2 cycles. This is also an area where drastic improvements can be achieved by simple restructing of the code. However this study could not achieve the optimal goal of 6 cycles per iteration. But with knowledge of the underlying machine hardware fine tuning resulted in achieving 7 cycles per iteration for the inner loop.
Another point of concern for the inner loop is the fixed ppoint computations. This can be optimized by structuring the code such that the computation does not require more than one address computation per floating point instruction. An example of this for the seismic migration code is the computation of imaginary numbers. It is necessary that the real and the imaginary parts are computed together in one loop rather that in separate loops as this halves the total number of loads and reduced redundant communications.
Along with the inner loop, the outer loop of seismic migration needs to be optimized as well. In the seismic migration algorithm, the outer loop involves substantial setup before the inner loop is reached. The setup involves data accessing, which may lead to cache miss. It also involves two float to fixed conversions which involves a 5 cycle delay. These problems can also be optimized by computing setup values one iteration in advance. This means that during iteration number j of the inner loop setup the iteration number (j+1) before beginning thr execution of tje inner loop. This permits muti-cycle delay transfers between units to be overlapped with useful work. This technique also leads to a 10% improvement in performance of the seismic imaging code.
3.Managing Data Movement
Controls should also be made on the way data is moved between different levels of the memory hierarchy and also between processors.It is extremely important that cache and page misses are avoided. For example multi dimensional arrays should be accessed in a manner which reuses elements of the cache line before the line is moved out of cache. As fas as techniques for reducing communication overheads in the memory hierarchy is concerned, an useful strategy is to breakup the computation cube into subcubes that fit the hierarchy at various levels. Management strategy is improved significantly by processing subcubes that have small surface to volume ratio. Similar considerations must be applied when distributing a computation to multiple processors. Once the data management was properly introduced along with the implementation of the proper code structure the computation speed increased from 8 to 26 Mflops for the sequential computation.
4.Optimizing the Distributed Computation
The distributed system for this case study was a set of 6000 machines connecetd using token rings. The parallel impementation is done using a manager-worker organization, similar to a master-slave arrangement except that the workers request assignments as well. The workers repeatedly request assignments from the manager and process the work. Finally when there is no additional work to perform, the worker reports the result of their computations to the manager. This partail results are then accumulated by the manager and it then writes the output.
The computations are assigned to the workers by chopping or cutting the computation cube in the most efficient manner. For this case study it was found that chopping the cube perpendicular to the frequency axis is the most efficient way. For this chopping process to be fully efficient it is however necessary that each worker keeps a separate copy of x by z array (the other two dimensions of the cube). The parallelism is furter improved by recognizing that ideas from the previous section (particularly on how the inner loop was optimized) should be incorporated here. In this case the inner loop can be thought of as a sequential computation of the subcubes. While the setup procedure in this case deals with communicating the surfaces of these subcubes. Once this analogy with the sequential code is recognized all the techniques that were developed and used in the previous section can be now used here.
5.Results and Discussions
For the first det of results the subcubes or the work assignments are slices of the cube one frequency wide. The evaluation of this subcube requires the extrapolation table of seimic imaging, the velocity model (nx by nz matrix) and a row of frequency data (nx complex numbers).The first two rows of figure (1) shows the results obtained with a small (nx=467, nz=200,mw=228) dataset and a large (nx=1000, nz=400, nw=303) dataset. The table includes entries for runtime and utilization. Runtime is wall clock time on unloaded machines, while utilization is sequential runtime divided by the product of the number of workers and the runtime, expressed as percentage. The small-slice-at-a-time utilization figures are disappointing as distributing the computation seems to entail significant communication overhead. This overload was due to low effective bandwidth and this strategy failed to overlap this overhead with useful computation.
To overcome this problem a julienne strategy was then adapted, whose primary goal is to overlap communication with useful computation. For this the initial assignment to each worker should have a small surface area to allow rapid startip of all workers as well as a large volume so that the worker does not have a long wait before the manager is ready for the workers' next request. The surface to volume ratio of this initial assignment is minimized by sending twice as many velocity rows as frequency rows as the frequencies are complex. In this strategy the initial work assignment involved 25 frequency rows and 50 velocity rows. The manager sends 50 row swaths of the velocity matrix with each assignment until each processor has the entire velocity matrix. Thereafter the manager sends 25 row swaths of the frequency matrix. The results are shown in the third and fourth row. It shows a marked improvement in performance.
The julienne techniques can be further improved by using for example variable sized julienne strips. In this case study, the processor is idle when the worker is communicating with the manager. Thus placing two workers on the processor may allow further overlap. These strategies however have not been implemented in this study.
6.Conclusions
This Case study demonstrates that the performance of the seismic imaging code can be improved by a factor of 2 to 4 by exploiting knowledge of the architecture of the environment. This techniques of improving the performance apply at all levels of the machines, from the ALUs and registers to the memory hierarchy. Another important conclusion of this case study is that it is helpful to visualize a scientific computation as a multidimensional soild where the surface area corresponds to communication and the volume to computation.
7.References
1. G. Almasi, Parallel distributed seismic migration. In Proc. ACM Workshop on Parallel and Distributed computing, 1991
2.G.F. Grohoski. Machine organization of the IBM RISC System/6000 processor. IBM J. Res. Des., volume 34(1).
3.D. Hale, Stable explicit depth extrapolation of seismic wavefields, 60th Annual meeting, SEG, 1994.
4. H.S.Warren, Instruction scheduling for the IBM RISC System/6000 processor. IBM J. Res. Dev , volume 34, 1990.




