Computer Fundamentals, C Programming: Processor Memory Unit

The processor memory unit is the interface between the processor and the caches. Currently, instruction caches are not simulated and are assumed to be perfect. RSIM also does not currently support virtual memory. A processor's accesses to its private data space are currently considered to be cache hits in all multiprocessor simulations and in uniprocessor simulations configured for this purpose. However, contention at all processor and cache resources and all memory ordering constraints are modeled for private accesses in all cases.
The most important responsibility of the processor memory unit is to insure that memory instructions occur in the correct order. There are three types of ordering constraints that must be upheld:

Constraints to guarantee precise exceptions
Constraints to satisfy uniprocessor data dependences
Constraints due to the multiprocessor memory consistency model

Constraints for precise exceptions
The RSIM memory system supports non-blocking loads and stores. To maintain precise exceptions, a store cannot issue before it is ready to be graduated; namely, it must be one of the instructions to graduate in the next cycle and all previous instructions must have completed successfully. A store can be allowed to graduate, as it does not need to maintain a space in the active list for any later dependences. However, if it is not able to issue to the cache before graduating, it must hold a slot in the memory unit until it is actually sent to the cache. The store can leave the memory unit as soon as it has issued, unless the multiprocessor memory constraints require the store to remain in the unit.
Loads always wait until completion before leaving the memory unit or graduating, as loads must write a destination register value. Prefetches can leave the memory unit as soon as they are issued to the cache, as these instructions have no destination register value. Furthermore, there are no additional constraints on the graduation of prefetches.

Constraints for uniprocessor data depedences
These constraints require that a processor's conflicting loads and stores (to the same address) appear to execute in program order. The precise exception constraint ensures that this condition holds for two stores and for a load followed by a store. For a store followed by a load, the processor may need to maintains this data dependence by enforcing additional constraints on the execution of the load. If the load has generated its address, the state of the store address determines whether or not the load can issue. Specifically, the prior store must be in one of the following three categories:

address is known, non-conflicting
address is known, conflicting
address is unknown

In the first case, there is no data dependence from the store to the load. As a result, the load can issue to the cache in all configuration options, as long as the multiprocessor ordering constraints allow the load to proceed.
In the second case, the processor knows that there is a data dependence from the store to the load. If the store matches the load address exactly, the load can forward its return value from the value of the store in the memory unit without ever having to issue to cache, if the multiprocessor ordering constraints allow this. If the load address and the store address only partially overlap, the load may have to stall until the store has completed at the caches; such a stall is called a partial overlap, and is discussed further.
In the third case, however, the load may or may not have a data dependence on the previous store. The behavior of the RSIM memory unit in this situation depends on the configuration options. In the default RSIM configuration, the load is allowed to issue to the cache, if allowed by the multiprocessor ordering constraints. When the load data returns from the cache, the load will be allowed to complete unless there is still a prior store with an unknown or conflicting address. If a prior store is now known to have a conflicting address, the load must either attempt to reissue or forward a value from the store as appropriate. If a prior store still has an unknown address, the load remains in the memory unit, but clears the busy bit of its destination register, allowing further instructions to use the value of the load. However, if a prior store is later disambiguated and is found to conflict with a later completed load, the load is marked with a soft exception, which flushes the value of that load and all subsequent instructions. Soft-exception handling is discussed further.
There are two less aggressive variations provided on this default policy for handling the third case. The first scheme is similar to the default policy; however, the busy bit of the load is not cleared until all prior stores have completed. Thus, if a prior store is later found to have a conflicting address, the instruction must only be forced to reissue, rather than to take a soft exception. However, later instructions cannot use the value of the load until all prior stores have been disambiguated.
The second memory unit variation stalls the issue of the load altogether whenever a prior store has an unknown address.

Constraints for multiprocessor memory consistency model.
RSIM supports memory systems three types of multiprocessor memory consistency protocols:

Relaxed memory ordering (RMO) and release consistency (RC)
Sequential consistency (SC)
Processor consistency (PC) and total store ordering (TSO)

Each of these memory models is supported with a straightforward implementation and optimized implementations. We first describe the straightforward implementation and then the more optimized implementations for each of these models.
The relaxed memory ordering (RMO) model is based on the memory barrier (or fence) instructions, called MEMBARs, in the SPARC V9 ISA . Multiprocessor ordering constraints are imposed only with respect to these fence instructions. A SPARC V9 MEMBAR can specify one or more of several ordering options. An example of a commonly used class of MEMBAR is a LoadStore MEMBAR, which orders all loads (by program order) before the MEMBAR with respect to all stores following the MEMBAR (by program order). Other common forms of MEMBAR instructions include StoreStore, LoadLoad, and combinations of the above formed by bitwise or (e.g. LoadLoad|LoadStore). Instructions that are ordered by the above MEMBAR instructions must appear to execute in program order. Additionally, RSIM supports the MemIssue class of MEMBAR, which forces all previous memory accesses to have been globally performed before any later instructions can be initiated; this precludes the use of the optimized consistency implementations described below.
Release consistency is implemented using RMO with LoadLoad|LoadStore fences after acquire operations and LoadStore|StoreStore fences before release operations.
In the sequential consistency (SC) memory model, all operations must appear to execute in strictly serial order. The straightforward implementation of SC enforces this constraint by actually serializing all memory instructions; i.e. a load or store is issued to the cache only after the previous memory instruction by program order is globally performed . Further, stores in SC maintain their entries in the memory unit until they have globally performed to facilitate maintaining multiprocessor memory ordering dependences from stores to later instructions. Unless RSIM is invoked with the store buffering command line option , stores in SC do not graduate until they have globally performed. Forwarding of values from stores to loads inside the memory unit is not allowed in the straightforward implementation of sequential consistency. MEMBARs are ignored in the sequential consistency model.
The processor consistency (PC) and total-store ordering (TSO) implementations are identical in RSIM. With these models, stores are handled just as in sequential consistency with store buffering. Loads are ordered with respect to other loads, but are not prevented from issuing, leaving the memory unit, or graduating if only stores are ahead of them in the memory unit. Processor consistency and total store ordering also do not impose any multiprocessor constraints on forwarding values from stores to loads inside the memory unit, or on loads issuing past stores that have not yet disambiguated. MEMBARs are ignored under the processor consistency and total store ordering models.
Beyond the above straightforward implementations, the processor memory unit in RSIM also supports optimized implementations of memory consistency constraints. These implementations use two techniques to improve the performance of consistency implementations: hardware-controlled non-binding prefetching from the active list and speculative load execution .
In the straightforward implementations of memory consistency models, a load or store is prevented from issuing into the memory system whenever it has an outstanding consistency constraint from a prior instruction that has not yet been completed at the memory system. Hardware-controlled non-binding prefetching from the active list allows loads or stores in the active list that are blocked for consistency constraints to be prefetched into the processor's cache. As a result, the access is likely to expose less latency when it is issued to the caches after its consistency constraints have been met. This technique also allows exclusive prefetching of stores that have not yet reached the head of the active list (and which are thus prevented from issuing by the precise exception constraints).
Speculative load execution allows the processor not only to prefetch the cache lines for loads blocked for consistency constraints into the cache, but also to use the values in these prefetched lines. Values used in this fashion are correct as long as they are not overwritten by another processor before the load instruction completes its consistency constraints. The processor detects potential violations by monitoring coherence actions due to sharing or replacement at the cache. As in the MIPS R10000, a soft exception is marked on any speculative load for which such a coherence action occurs ; this soft exception will force the load to reissue and will flush subsequent instructions. The soft exception mechanism used on violations is the same as the mechanism used in the case of aggressive speculation of loads beyond stores with unknown addresses. Speculative load execution can be used in conjunction with hardware-controlled non-binding prefetching.

Computer Fundamentals, C Programming

Wednesday, December 8, 2010

Processor Memory Unit

No comments:

Post a Comment