PhD Thesis, University of Cape Town, 1996

Philip Machanick

Programming shared-memory multiprocessor systems is becoming increasingly difficult as the gap between memory speed and processor speed increases. At the same time, this class of computer—based on standard microprocessors—is becoming increasingly common as an alternative to traditional mainframes and supercomputers.

Programs that are not sympathetic to caches can perform poorly on such systems. Problems include false sharing (unrelated data sharing a cache block, resulting in coherence misses), inadequate blocking (failing to process data as fully as possible before it is evicted from the cache) and poor exploitation of prefetch (failing to fetch data before it is needed).
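False sharing, in particular, can be illustrated with a short C++ sketch. The 64-byte block size, the struct names and the two-thread driver below are illustrative assumptions, not code from the thesis: two counters updated by different processors happen to share one cache block, so each write invalidates the other processor's copy, while padding each counter to its own block removes the coherence misses.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed cache block size; 64 bytes is common but hardware-dependent.
constexpr std::size_t kCacheBlock = 64;

// Unpadded: both counters fit in one cache block, so writes by two
// different processors ping-pong the block between their caches
// (false sharing, causing coherence misses).
struct SharedCounters {
    long a = 0;
    long b = 0;
};

// Padded: alignas places each counter on its own cache block, so the
// two processors no longer invalidate each other's cached copies.
struct PaddedCounters {
    alignas(kCacheBlock) long a = 0;
    alignas(kCacheBlock) long b = 0;
};

// Each thread increments its own counter; the result is the same either
// way, but the padded version avoids the coherence traffic.
template <typename Counters>
long run(Counters& c, long iters) {
    std::thread t1([&c, iters] { for (long i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&c, iters] { for (long i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
    return c.a + c.b;
}
```

Both versions compute the same totals; only the cache behaviour differs, which is exactly why false sharing is easy to introduce and hard to spot from the source code alone.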

This research addresses the problem of accommodating changes in memory hierarchy cost and cache characteristics through an object-oriented C++ library called OOSH (Object-Oriented Library for Shared Memory). OOSH includes low-level allocators which pad and align objects to cache block boundaries, and primitives for launching processes, locking and synchronization. OOSH also provides support for object blocking—an object-oriented version of blocking, a technique traditionally used in matrix operations. Blocking divides an algorithm’s data into blocks or tiles, which are processed as far as possible before moving on to other data, with the objective of reducing cache misses.
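Blocking in its traditional matrix form can be sketched as follows. This is a generic illustration, not OOSH's object-blocking interface; the tile size `B` is an assumed tuning parameter that would be chosen so that the working tiles fit in the cache:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Blocked (tiled) matrix multiply: each B x B tile of the operands is
// reused as fully as possible while it is still cache-resident, reducing
// capacity misses compared with the naive triple loop.
Matrix blocked_multiply(const Matrix& a, const Matrix& b, std::size_t B) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t kk = 0; kk < n; kk += B)
            for (std::size_t jj = 0; jj < n; jj += B)
                // Process one tile completely before moving on.
                for (std::size_t i = ii; i < std::min(ii + B, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + B, n); ++k)
                        for (std::size_t j = jj; j < std::min(jj + B, n); ++j)
                            c[i][j] += a[i][k] * b[k][j];
    return c;
}
```

Object blocking generalizes this idea from array tiles to objects: process the objects currently resident in the cache as far as possible before touching others.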

Performance of OOSH is evaluated by running selected applications from the Stanford SPLASH benchmarks on the Mint MIPS architecture simulator. Simulations are run both with parameters similar to current multiprocessor architectures and with higher memory hierarchy costs, to investigate the impact of technology trends. OOSH versions of the applications are compared with the SPLASH implementations, which are based on the Argonne National Laboratory Parmacs macros.

The applications are MP3D (particle-based wind tunnel simulation), Barnes-Hut (n-body gravitation simulation) and Water (water molecule simulation). MP3D needs only local synchronization, so the OOSH version uses distributed synchronization: global barriers are replaced by synchronization on local clocks in regions of space. Early versions of MP3D are poorly written in terms of their cache behaviour, and the OOSH version of MP3D is substantially rewritten. Barnes-Hut provides a second data point with fewer variables: it has relatively complex data structures and algorithms and is well suited to object-oriented implementation, so the OOSH version is relatively similar to the SPLASH version. Water is a straightforward program which is used to investigate the performance impact of OOSH when no attempt is made to improve the structure of the application or its data, leaving even fewer variables than in the re-implementation of Barnes-Hut.

OOSH is shown to give improved cache usage for programs that are rewritten to exploit its features, especially MP3D, which is substantially rewritten. The improvement in cache behaviour is significant even for Barnes-Hut, which has substantially the same structure in both the SPLASH and OOSH versions. Water, which is not significantly rewritten, suffers a performance drop of up to 9% from the overhead of object-oriented constructs. By comparison, the OOSH version of Barnes-Hut, on a simulated 32-processor architecture with a cache miss cost of 100 clock cycles, is about 20% faster than the SPLASH version.

Object blocking is shown to have a relatively minor effect in MP3D. It could be a useful optimization for simulations that make more references to each object per timestep than MP3D does. However, this research shows cache-block-aligned memory access to be a big win, and it is more likely than blocking to be a useful optimization across a variety of applications.

In general, as memory hierarchy costs increase, OOSH is shown to become increasingly beneficial, and even an application like Barnes-Hut, which had relatively few misses before being ported to OOSH, can gain significantly from cache-block-aligned memory allocators.
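The idea behind such an allocator can be sketched in modern C++. This is a minimal illustration of cache-block padding and alignment, not OOSH's actual interface; the 64-byte block size, the `CacheAligned` base class and the `Particle` type are all assumptions for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Assumed cache block size; real code would use the target machine's.
constexpr std::size_t kCacheBlock = 64;

// Objects derived from CacheAligned are padded to a whole number of
// cache blocks and start on a block boundary, so two distinct objects
// never share a block and cannot falsely share.
struct CacheAligned {
    static void* operator new(std::size_t size) {
        // Pad the allocation up to a multiple of the block size.
        const std::size_t padded =
            (size + kCacheBlock - 1) / kCacheBlock * kCacheBlock;
        // Request block-boundary alignment (C++17 aligned new).
        return ::operator new(padded, std::align_val_t{kCacheBlock});
    }
    static void operator delete(void* p) {
        ::operator delete(p, std::align_val_t{kCacheBlock});
    }
};

// Any heap-allocated Particle now occupies its own cache block(s).
struct Particle : CacheAligned {
    double x = 0, y = 0, z = 0;
};
```

A per-class `operator new` keeps the alignment policy with the data that needs it, so client code allocates objects in the ordinary way and gets block alignment for free.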

(PDF 344K; 141 pages)