|
Page 3 of 5 Case Study 2 This case study shows how seismic modeling can benefit from PVM (Parallel Virtual Machine) techniques by describing a message passing implementation of DSO, the modeling and inversion code developed in The Rice Inversion Project using PVM3, that runs on a network of workstations. The study uses coarse grain parallelism, of SPMD type in which the individual tasks are full edged simulations. Thus, even though the amount of data needed by each task is substantial, the length of the computation is such that it is reasonably possible to offset the time taken by the communications. Task Level Parallelization for Seismic Imaging 1. Code Structure DSO (Differential Semblance Optimization) is a code for seismic inversion under development in the Rice Inversion Project, which can be viewed as doing simulation, or data-fitting, of multi-experiments. DSO tries to find a model that best explains observed data, by using a variant of the least-squares technique. It assumes that these data are produced by one physical model, but come from a suite of experiments done by varying some additional parameter. The basic DSO idea is to let part of the model depend on this parameter, but to penalize the objective function in so that the optimal solution will actually be independent of the parameter. This study illustrates two different examples, the first one from is an acoustics point-source experiment, the second one, written at Exxonmobil describes viscoelastic plane-wave propagation. In the first example, the earth is described by its velocity (assumed smooth), and its reflectivity (assumed rapidly varying), and the additional parameter is the position of the seismic source. DSO lets the reflectivity depend on shot position, but since there is only one earth,it penalizes to ensure the solution does not actually depend on it. This is in some sense a canonical example, and most of the terminology derives from it. Thus the additional parameter is invariably called a shot. For the second example, the earth is now visco-elastic but one-dimensional, the source is a plane wave and the additional parameter is the angle of incidence of the plane wave. Again, some of the quantities are artificially allowed to depend on the angle of incidence, and this dependence is penalized. As a consequence of this discussion, we see that the main computational task of DSO is the simulation of a large number of instances of a basic experiment, under different conditions. All these simulations are independent, and it is quite natural to process them simultaneously. For the first example above, a simulation is the solution of the wave equation with several right hand sides, each one corresponding to a different seismic source position. For the second example, simulation is done for the arrival of plane waves with different angles of incidence. The implementation of DSO is built around the principle that generic tasks should be coded in a generic way. A consequence is that the simulator is completely hidden from the optimization algorithm. This allows the use of the same inversion method with several different choices of propagation models. This allows code at a very high level: a simplified view reduces to two routines, destined to be cooperating processes: - simline is the driver. In the sequential version, its role is to loop over shots, read the data for each shot, call shot, and write the results it receives.
- shotis the main computational routine. It invokes the simulator that does all the actual work.
Since all instances of shot are independent, it is a very natural idea to run them in parallel. Such an approach to parallelism has several potential advantages which are: - Parallelism is at a very coarse grain. Each task is a full edged simulation;
- It requires very little communication between the nodes;
- It can be made fault-tolerant, as the number of tasks is completely independent of the number of processors;
- It uses SPMD parallelism, thus the program can run on a distributed memory computer, but also a loosely coupled network of workstations.
- It respects the DSO framework: parallelism is still model independent.
2. Parallel Implementation 2.1 The Basic Algorithm A manager - worker implementation suggests itself, with simline acting as the manager, or rather as a work dispatcher, and each incarnation of shot as a worker, simply treating shot after shot. This implementation has the additional advantage of providing free load-balancing. A fast worker will require (and receive) more work than a slower one, without the manager having to be concerned about balancing work between workers. One drawback is the possible bottleneck of having the manager do all of the I/O. But as we shall see on examples (section 4.2), this does not usually happen, and we even get overlap of the communication by the computation. We show below the pseudo-code for both the manager and the worker sides. We denote the number of tasks by ntasks, and the number of processors by nprocs. The algorithm for simline (the manager) is shown on figure 1. The main loop of simline is very simple: it waits for a message to arrive. In normal operations, this should be a worker reporting work done. The results are then stored, and if there are more tasks to process, simline reads the relevant data, and sends them the worker. Otherwise, simline simply sends a "stop" signal. A feature that is not shown in figure 1 is that there is also simple for of read-ahead that is implemented, before entering its main loop, the manager starts by sending two tasks to each worker, thus ensuring that when a worker requests a new task, one is already there waiting. Obviously, the success of this scheme hinges on the assumption that it takes longer to compute a task than to send it. Figure 2 shows the pseudo-code for shot. Its structure is even simpler. After having received initial information, it waits for work to arrive from simline, performs the corresponding computation, and goes back to wait for the next message. Hopefully, the prefetching outlined before is efficient, so that shot is never blocked on the receive statement. Figure1.1 Simline Pseudo Code  | Figure1.2 Worker Pseudo Code  | 2.2 Using PVM The previous section showed the basic algorithm used. For a practical implementation, it is important to choose a proper transport mechanism, in other words to specify how different processors communicate. Use of Remote Procedure Calls, and the UNIX mail system are not adequate because of the large amount of data to be exchanged. Instaed this study uses PVM as our message passing mechanism, because of its wide availability, and also because it is rapidly becoming a de facto standard for distributed applications. Its allows a heterogeneous network of computers to appear as one single distributed memory computer. PVM is composed of a daemon than enables PVM to run on a given machine, and of a user-callable library of routines that allow the user to start tasks on remote machines, send and receive messages between different processors, and provides tools to manage these different processes. Two important features of PVM of which this study made particular use are: - Messages are typed. This greatly increases program clarity. It allows for instance the manager to decide whether the message just received is a result form a worker, or an error message from a worker.
- All messages exchanged by PVM have to be packed. This has the double advantage of allowing different data types in one message, and mostly of reducing the overall number of messages exchanged. A minor inconvenience is that the receiving process needs to unpack the data in exactly the order in which it was packed by the sending process.
2.3 Instrumentation Assessing the performance of any program is difficult, and programmers are notoriously poor at guessing where to optimize a code. This is even more true of parallel programs, where it is compounded by the necessity to take into account not only the computations but also the communication delays. Tools are essential in order to achieve an understanding of the kind of performance we are getting out of a code, and of the bottlenecks. An essential aspect of parallel programs is their dynamic character. Accordingly, a visual display of the time evolution of the program is important. This requires the code to be instrumented, then log-files are created while the program runs (hopefully with a minimum impact on the behavior of the program), and the log-files are interpreted graphically a-posteriori. The tools used here are two packages working together: alog is used to instrument the code. It allows the user to define events, to be logged to a _le. The second package, upshot, interprets the log-files in a graphical way using the X Window System.
|