Krazy Glew's Blog

Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, March 06, 2012

malloc and transactyional memory (TM)

http://semipublic.comp-arch.net/wiki/Transactional_memory_and_malloc

Dynamic memory allocation such as malloc is a classic example of why one may want memory accesses that can leave a transaxction.

Consider a simple mallocator, that allocates from a single queue or arena.
This might mean that multiple transactions, otherwise completely independent
might appear to interfere with each other through malloc.

(This interference might take several forms.
E.g. if the transactions are first performed totally independently, they may allocate the same memory.
E.g. or if the transactions malloc separate areas, but sequentially dependent, rollback of one might require rollback of the other.
(Although even this is an advanced TM topic, if the footprints overlap.))

Certainly, there can be malloc algorithms that minimize this - e.g. multiple areas, increasing the chance that different transactions might not interfere.

Or... the topic of this page: permit malloc to exit the transaction
- so the [[parallel or concurrent]] mallocs may be properly synchronized, and receive independent memory addresses.

Q: what happens when a transaction aborts?
* Possibly could free the memory in the abort handler. But this becomes less hardware, and more software TM. Or at least a hybrid.
* Let garbage collection recover the memory speculative allocated within an aborted transaction.

(Q: what about [[deterministic finalization and aborted transactions]]?)

= Related Topics =

* [[speculative multithreading and memory allocation]]:
** [[malloc and speculative multithreading]]
** [[stack allocation and speculative multithreading]]

Pseudo-atomic - atomic operations that can fail

http://semipublic.comp-arch.net/wiki/Pseudo-atomic

[[Pseudo-atomic]] is a term that I (Glew) have coined to refer to atomic operations that can fail to be atomic, such as:
* [[Load-linked/store-conditional (LL/SC)]]
* IBM z196's [[LOAD PAIR DISJOINT]]
* even [[hardware transactional memory]]
** such as [[Intel Transactional Synchronization Extensions (TSX)]]'s [[Restricted Transactional Memory (RTM)]], where "the [[XBEGIN]] instruction takes an operand that provides a relative offset to the fallback instruction address if the RTM region could not be successfully executed transactionally."

Q: what does IBM transactional memory provide? Is it pseudo-atomic, or does it provide guarantees?

I.e. these [[pseudo-atomic]] operations do not guarantee completion.
At least not at the instruction level.
While it is possible to imagine implementations that detect use of pseudo-atomic instruction sequences
and provide guarantees that certain code sequences will eventually complete.
such mechanisms are
(1) not necessarily architectural
and (2) more complicated that non-pseudo[-atomic instructions.

E.g. for [[LL/SC]] hardware could "pick up" the instructions that lie between the load-linked and the store-conditional,
and map them onto a vocabulary of atomic instructions such as [[fetch-and-op]] that is supported by the memory subsystem.
Similarly, [[LL/SC]] might be implemented using [[transactional memory (TM)]].

Intel's TSX documentation says

GLEW OPINION: requiring alternate code paths has historically been a bad idea.
E.g. [[Intel Intanium ALAT]]. Now [[RTM (Restricted Transactional Memory)]]

Why, then, provide pseudo-atomicity?

* Pseudo-atomic operations allow complicated atomic operations to be built up out of simpler
* Plus, of course, it is easier than providing real atomicity. Most of the time it works. Most of the time may be good enough for many people, who may not care if it occasionally crashes when the seldom used alternate path is exercised.

New(ish) IBM z196 synchronization instructions

http://semipublic.comp-arch.net/wiki/Interlocked-access_facility#New_IBM_z196_synchronization_instructions

= New IBM z196 synchronization instructions =

The IBM z196 adds new [[synchronization instructions]] to the [[IBM System z Mainframe ISA]].

Augmenting the legacy instructions
* [[COMPARE AND SWAP]]
* [[PERFORM LOCKED OPERATION]]
* [[TEST AND SET]]

The reference [share] comments "there is no need for a COMPARE AND SWAP loop to perform these operations!"
- their exclamation mark!
This suggests the motivation for these instructions
- while [[compare and swap]] is one of the most powerful synchronization instructions,
it is not necessarily efficient.
Atomic operations such as [[atomic add to memory]] can perform in one instruction,
without a loop, things that would require looping for [[compare and swap]], [[test-and-set]], and [[load-linked store-conditional]].
Looping that may require special mechanisms to guarantee forward progress.

The z196 [[interlocked-access facility]] instructions include

New atomic instructions:
* [[LOAD AND ADD]]
* [[LOAD AND ADD LOGICAL]]
* [[LOAD AND AND]]
* [[LOAD AND EXCLUSIVE OR]]
* [[LOAD AND OR]]

An interesting instruction that I feel must be called [[pseudo-atomic]], in much the same way [[LL/SC]] is [[pseudo-atomic]]:
* [[LOAD PAIR DISJOINT]]

In addition, the existing instructions that perform [[add immediate to memory]], in signed/unsigned and 32b/64b forms, are declared to be atomic
when "the storage operand is aligned on an [[integral boundary]]" - IBM speak for [[natural alignment]].
* [[ADD IMMEDIATE]]
* [[ADD LOGICAL WITH SIGNED IMMEDIATE]]

= [[Atomicity is always relative]] =

IBM literature says: "as observed by other CPUs and the channel subsystem, ... appear to occur atomically".

IBM also uses the phrase "block-concurrent interlocked update". "Block concurrent" is a special IBM term related to memory ordering, that says that all bytes are accessed atomically as observed by other CPUs. However, they may be observed a byte at a time by channel programs... but "interlocked" means that channel program requests cannot be inserted between the bytes.
Bottom line: atomixc wrt CPUs and I/O channels.

= Reference =

Many references, scattered.

;[share]
: New CPU Facilities in the IBM zEnterprise 196, share.confex.com, 4 August 2010, http://share.confex.com/share/115/webprogram/Handout/Session7034/New%20CPU%20Facilities%20in%20the%20z196.pdf

Sounds like a Fraudster Phone Call

I just received a phone call whose caller ID says "Credit Service", 11:59am, 1-701-661-1003.

Recorded message saying something like "This is your credit card company." Note: they did not say what company, just a generic phrase like "your credit card company".

Going on "There is no reason to be alarmed. You are eligible for a special reduction in interest rate to 6.9%. You must act quickly, because this special offer expires soon. Press 1 if you want to receive this special lower interest rate."

When I pressed 1, after some hold music, eventually I got what sounded like a human. He said something that again sounded, to my recollection, like "This is the charge department".

At this point I said "Hold on, you guys cold called me, so I need you to tell me what company you are, and how I can verify..." And they hung up.

---

Now, I must be careful, since I am reporting the exact caller ID as reported by my phone, including the phone number, but since my notes above as to what they and I said are only an approximate recollection.

I don't record all of my telephone calls. At least, not at this time.

If this was a legitimate business call, at the very least they exhibited bad customer service.

However, this is also exactly the sort of thing that a fraud operation might do: try to fool people into giving out account numbers over the phone, etc.

---

I wonder if there are an police systems to report this sort of thing to.

Oregon: http://www.doj.state.or.us/finfraud/engexplanation.shtml

Wish there was a national service. FBI has a webpage, but looks like it is for actual fraud only, not suspicion.

Monday, March 05, 2012

Warnings, variables, and unnested brackets

I have long been fascinated by improperly nested bracketed constructs, such as [ ( ] ).

Today I ran into an example of a situation that might warrant such improper nesting. I decided to clean up a Perl script, converting it to 'use strict'. Along the way I enabled warnings using Perl's lexically scoped 'use warnings' facility. In places I had to disable warnings, to make the code compile and run with few enough changes.

And so I encountered:

if( ... ) {
   ...
   my $m = an expression with an ignorable warning;
   ...
}

I want to disable the warning for the smallest possible region, but

if( ... ) {
   ...
   { no warnings 'type';
   my $m = an expression with an ignorable warning;
   }
   ...
}

restricts the scope of both the warning disable (good), but also the variable (nad). Whereas letting the warning be disabled until the end of the enclosing lexical scope

if( ... ) {
   ...
   no warnings 'type';
   my $m = an expression with an ignorable warning;
   ...
}

disables the warning for too large a region. What you want is

if( ... ) {
   ...
   <disable-warnings 'type'>
   <variable-scope 'm'>
   my $m = an expression with an ignorable warning;
   </disable-warnings>
   ...
   </variable-scope>
}

Sunday, March 04, 2012

IBM z196 high word facility

http://semipublic.comp-arch.net/wiki/High-word_facility

== Discussion ==

The [[high-word facility]] is somewhat intermediate between [[overlapping registers]] and [[extending registers]].

In terms of dataflow, assuredly a large, 64 bit, register contains at least two separate pieces whose dataflow must be tracked.
This means that a natural OOO implementation of the 16 64-bit registers would have to track 32 different 32 bit words.
Writing to a 64 bit register would update the renaming pointers for both halves, etc.
(Note that I say "natural". Alternate implementations are possible, e.g. at the cost of [[partuial register stalls]].)

However, the [[high-word facility]] uses the high word
without increasing the size of the register number field in the instruction encoding.

IBM's own literature implies that it was led to do this
because 16 GPRs was not enough.

The question is, whether one should consider something like the high word facility
- generalized, perhaps, to not just be high word, but to allow access to different parts of overlapped registers,
e.g. 32 bit scalars extended to 128 bit SIMD packed vector registers.

One might argue that, while 16 registers is not enough, 32 or 64 is more than enough. So why bothedr?

Note that there are two levels. Level 0 is simply to provide accessors - e.g. access the X Y, Z, or W channels of a 128 buit wide quanity.
Level 1 provides operations, not just accessors.

== High word facility ==

The IBM z196 processor added the [[high-word facility]] or [[high word extension]]
to the [[IBM System z Mainframe ISA]], circa 2010.

Since its introduction in 1964, System 360 and all of its successors provided 16 general purpose registers.
Many programs were constrained by this rather small number of registers.

When the registers were extended to 64 bits, the upper 32 bits of the 64 bit registers became underused.
(In IBM parlance, bits 0-31 of the GPRs are the upper half, the most significant, and bits 32-63 are the lower half, the last significant.)

Many programs only needed 32 bit instructions and addresses;
indeed, many programs were limited to 32 bit addresses.
And even programs that use 64 bit instructions and addresses do not use them everywhere.
E.g. in C parlance, there may be 32 bit ints in a 64 bit register.
(Also, for that matter, 16 bit shorts and 8 bit chars or bytes.)

But even these 32 bit programs can benefit from an extra 16 32-bit in terms of register storage,
in the upper halves of the 64 bit registers.

So, this is what the [[high-word facility]] does: it provides a limited set of instructions that use the upper 32 bit halves of the 64 bit registers.

== List of Instructions ==

* ADD
** ADD HIGH, [[AHHHR]]: r_hi = r_hi + r_hi
** ADD HIGH, [[AHHLR]]: r_hi = r_hi + r_lo
*** Presenter comments "should perhaps be called ADD HIGH AND LOW".
** ADD HIGH IMMEDIATE, [[AIH]]: r_hi += imm32
* [[ADD LOGICAL]]
** ADD LOGICAL HIGH, [[ALHHHR]]: r_hi = r_hi + r_hi
** ADD LOGICAL HIGH, [[ALHHLR]]: r_hi = r_hi + r_lo
** ADD LOGICAL WITH SIGNED IMMEDIATE HIGH, [[ALSIH]]: r_hi += imm32
** ADD LOGICAL WITH SIGNED IMMEDIATE HIGH, [[ALSIHN]]: r_hi = -r_hi(self) + imm32
*** [[ALSIHN]] is like [[ALSIH]], but is an example of [[instructions that do not change condition codes]].

* BRANCH RELATIVE ON COUNT HIGH, [[BRCTH]]: if( --r_hi) goto target

* COMPARE
** COMPARE HIGH, [[CHHR]]: r_hi ? r_hi
** COMPARE HIGH, [[CHLR]]: r_hi ? r_lo
** COMPARE HIGH, [[CHF]]: r_hi ? S20
*** F/S20 is a storage operand, i.e. in memory specified by an addressing mode.
** COMPARE IMMEDIATE HIGH, [[CIH]]: r_hi ? imm32

* COMPARE LOGICAL HIGH: in HH, HL, HF and IH flavors

* LOADs:
** LOAD BYTE HIGH (signed/unsigned (LOGICAL))
** LOAD HALFWORD HIGH (signed/unsigned (LOGICAL))
** LOAD HIGH

* ROTATE THEN INSERT SELECTED BITS HIGH/LOW
** flavours to add 32 to bit indices

* STORE HIGH - 8 16, 32

* SUBTRACT HIGH: signed/unsigned, HHH and HHL flavors

The IBM slideset that I got this from says

GLEW COMMENT: it is sad when mnemonics become an obstacle to use.

== References ==

Regional Pricing

Interesting observation about regional pricing differentials:

I am purchasing bookshelves. Google Ikea, found yellow and green Billy bookshelves with glass doors for 79.99$.

However, in Portland they are being sold for 99$, with advertising that pushes them as University of Oregon "Ducks" colors.

Same part number. Not available for website purchase.

Non-painted equivalent bookcases - birch, black, brown - are 249.99$

Called Ikea. Confirmed that they are for sale everywhere except Oregon at 79.99$. E.g. Seattle. But 99$ in Portland. Product being closed out by April 1st. Fellow I talked to on the phone says that he has never seen a local price higher than the website price. But We can guess what is happening here: an Ikea store manager seeking to take advantage of Oregon Ducks fans possibly being more willing to pay a premium. Or, conversely: somebody tried to make an Oregon Ducks special, and is now closing it out, with an especially large discount outside of Oregon.

Myself, I did not go the UofO, so I would not pay a premium. I'm just looking for a good deal on glass fronted bookshelves

Marketing classes teach about price discrimination. It's interesting to see it in practice, as a consumer.

... Woah, prices for this family of bookcases are falling all across Ikea's webpages, since I looked at them last night.

Krazy Glew's Blog

Disclaimer

Tuesday, March 06, 2012

malloc and transactyional memory (TM)

Pseudo-atomic - atomic operations that can fail

New(ish) IBM z196 synchronization instructions

Sounds like a Fraudster Phone Call

Monday, March 05, 2012

Warnings, variables, and unnested brackets

Sunday, March 04, 2012

IBM z196 high word facility

Regional Pricing

Blog Archive

Labels

Search This Blog

Followers

About Me

Links to Me