Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, March 06, 2012

New(ish) IBM z196 synchronization instructions


http://semipublic.comp-arch.net/wiki/Interlocked-access_facility#New_IBM_z196_synchronization_instructions

= New IBM z196 synchronization instructions =

The IBM z196 adds new [[synchronization instructions]] to the [[IBM System z Mainframe ISA]].

Augmenting the legacy instructions
* [[COMPARE AND SWAP]]
* [[PERFORM LOCKED OPERATION]]
* [[TEST AND SET]]

The reference [share] comments "there is no need for a COMPARE AND SWAP loop to perform these operations!"
- their exclamation mark!
This suggests the motivation for these instructions
- while [[compare and swap]] is one of the most powerful synchronization instructions,
it is not necessarily efficient.
Atomic operations such as [[atomic add to memory]] can perform in one instruction,
without a loop, things that would require looping for [[compare and swap]], [[test-and-set]], and [[load-linked store-conditional]].
Looping that may require special mechanisms to guarantee forward progress.

The z196 [[interlocked-access facility]] instructions include

New atomic instructions:
* [[LOAD AND ADD]]
* [[LOAD AND ADD LOGICAL]]
* [[LOAD AND AND]]
* [[LOAD AND EXCLUSIVE OR]]
* [[LOAD AND OR]]


    The "LOAD AND ..." part of these instructions' names is a bit misleading.
    These are really just [[fetch-and-op]] instructions to memory.
    Fetching the old value, performing the operation with a register input,
    storing the new value so produced,
    and returning the old value in a register.

    Flavours: add signed/unsigned, logical and/or/xor, 32/64 bits wide.



An interesting instruction that I feel must be called [[pseudo-atomic]], in much the same way [[LL/SC]] is [[pseudo-atomic]]:
* [[LOAD PAIR DISJOINT]]


    [[LOAD PAIR DISJOINT]] loads two separate, non-contiguous, memory locations (each of which must be naturally aligned),
    into an [[even/odd register pair]].
    It sets condition codes to indicate whether the fetch was atomic,
    or whether some other CPU or channel managed to sneak in a store between them.
    (The language in the [share] presentation suggests that the condition codes are only set if a store actually occurred,
    not "may have occurred" - since if thye latter, a correct implementation might always returned "failed to be atomic".)

    GLEW COMMENT: although undoubtedly most of the time such actions are atomic, it is not clear to me that there is any guarantee of forward progress,
    for a loop around [[LOAD PAIR DISJOINT]] that spins until atomicity is observed.

    I almost suspect that there may be plans to eventually provide some such guarantees,
    but that the detail of such guarantees does not want to be cast in concrete architecture at the time of introduction.


In addition, the existing instructions that perform [[add immediate to memory]], in signed/unsigned and 32b/64b forms, are declared to be atomic
when "the storage operand is aligned on an [[integral boundary]]" - IBM speak for [[natural alignment]].
* [[ADD IMMEDIATE]]
* [[ADD LOGICAL WITH SIGNED IMMEDIATE]]



    GLEW COMMENT: it is obviously desirable to be able to have a fetch-and-add immediate to memory, to avoid having to blow a register on the operand for the atomic modification.

    It is a bit unusual to be able to extend an existing instruction in this way.  If these existing instructions were in wide use, one would expect existing code to be somewhat slowed down.
    However (1) I suspect the trend towards simpler OOO instructions has already made these "add to memory" instructions slower,
    while on the other hand (2) advanced implementations make such atomic instructions neglibly slower than their non-atomic versions.



= [[Atomicity is always relative]] =

IBM literature says: "as observed by other CPUs and the channel subsystem, ... appear to occur atomically".

IBM also uses the phrase "block-concurrent interlocked update".  "Block concurrent" is a special IBM term related to memory ordering, that says that all bytes are accessed atomically as observed by other CPUs. However, they may be observed a byte at a time by channel programs... but "interlocked" means that channel program requests cannot be inserted between the bytes.
Bottom line: atomixc wrt CPUs and I/O channels.

= Reference =

Many references, scattered.

;[share]
: New CPU Facilities in the IBM zEnterprise 196, share.confex.com, 4 August 2010, http://share.confex.com/share/115/webprogram/Handout/Session7034/New%20CPU%20Facilities%20in%20the%20z196.pdf

No comments: