We simulate the effects of a compiler-directed prefetching algorithm running on a range of bus-based multiprocessors. We show that, despite its high memory latency, this architecture does not necessarily support prefetching well; in some cases prefetching actually degrades performance. We pinpoint several problems with prefetching on a shared-memory architecture (additional conflict misses, no reduction in data sharing traffic and its associated latencies, a multiprocessor's greater sensitivity to memory utilization, and the sensitivity of the cache hit rate to prefetch distance) and measure their effect on performance. We then solve these problems through architectural techniques and prefetching heuristics that could easily be incorporated into a compiler: 1) victim caching, which eliminates most of the cache conflict misses caused by prefetching in a direct-mapped cache; 2) special prefetch algorithms for shared data, which significantly improve the ability of our basic prefetching algorithm to prefetch invalidation misses; and 3) compiler-based shared data restructuring, which eliminates many of the invalidation misses that the basic prefetching algorithm does not predict. The combined effect of these improvements is to make prefetching effective over a much wider range of memory architectures.
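As an illustration of the first technique, the following is a minimal sketch of how a small victim cache can absorb conflict misses in a direct-mapped cache. It is not the simulator used in this work: the function name `count_misses`, the trace format (a list of block addresses), the modulo index mapping, and the LRU replacement of the victim cache are all assumptions made for the example.

```python
from collections import OrderedDict


def count_misses(trace, num_sets, victim_entries=0):
    """Count fetches from memory for a direct-mapped cache.

    trace          -- iterable of block addresses (hypothetical trace format)
    num_sets       -- number of lines in the direct-mapped cache
    victim_entries -- size of a fully associative LRU victim cache (0 disables it)
    """
    lines = [None] * num_sets   # one resident block per cache line
    victim = OrderedDict()      # recently evicted blocks, oldest first
    misses = 0

    for block in trace:
        idx = block % num_sets  # assumed mapping: low-order bits select the line
        if lines[idx] == block:
            continue            # hit in the direct-mapped cache

        if block in victim:
            del victim[block]   # victim hit: block moves back, no memory fetch
        else:
            misses += 1         # true miss: block must come from memory

        evicted = lines[idx]
        lines[idx] = block
        if evicted is not None and victim_entries > 0:
            victim[evicted] = True
            if len(victim) > victim_entries:
                victim.popitem(last=False)  # drop the least recently evicted block

    return misses


# Blocks 0 and 8 map to the same line of an 8-line cache and evict each other.
conflict_trace = [0, 8, 0, 8, 0, 8]
without_victim = count_misses(conflict_trace, num_sets=8)
with_victim = count_misses(conflict_trace, num_sets=8, victim_entries=1)
```

In this toy trace the two blocks repeatedly displace one another, which is the same pattern that arises when a prefetched block maps to the line of a block still in use. With a single victim entry, only the two initial cold misses go to memory.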