John Rose, 2018-0310 (light edits 2020-0923, name changes 2022-0803)
There is a clear parallelism between immutable object types (notably value-based classes) and value types, which are immutable and have similar qualities to value-based classes. Indeed, recent changes in Project Valhalla tend to make value classes (per se) conform to the contract of value-based classes so closely as to make migration of VBSs to VCs a desirable choice (in most cases).
There is another parallelism between stateful object types and value classes, but it is more subtle. In many (not all) cases a stateful object can be transformed into a value, as long as the value which models the current state of the object is made available to all code that may need to read that state. This means that updating a state must involve discarding the previous value and using the new value which models the updated object state.
This all sounds complicated, but it feels simple in practice. “Codes like a class but works like an int” means that the current state value is always assigned to whatever variable(s) are looking at it. In fact, just like an int:
int x = 0;
Value v = Value(0);
…
x + 1; // bad code
x = x + 1; // good code
v.bump(); // bad code
v = v.bump(); // good code
(Hmmm… It might be time to support a __NoDiscard
modifier to return
values in APIs. The bad code with x
is a syntax error, but the bad
code with v
is silently accepted. To track this I filed JDK-8199429
and threw some ideas in to deal with later.)
Let’s do an example, where a stateful Iterator
object is transformed
into a stateless Cursor
value. This code is taken verbatim from the
JDK:
class CharIterator implements PrimitiveIterator.OfInt {
int cur = 0;
public boolean hasNext() {
return cur < length();
}
public int nextInt() {
if (hasNext()) {
return charAt(cur++);
} else {
throw new NoSuchElementException();
}
}
// omitted forEachRemaining
}
It is an inner class, used to construct a stream over an enclosing
object (a CharSequence
) which has length()
and charAt(i)
operations.
It could also be used as a normal Iterator
(and would return boxed
Integer
s, for what that’s worth).
(To peek under the covers at this, try javap on the local class
sometimes known as java.lang.CharSequence$1CharIterator
.)
An instance of this class has two fields: A pointer to the outer
CharSequence
, plus the varying index cur
. It is often the case that
iterators have one fixed field and one varying; sometimes there are
additional fixed fields, such as a limit to the varying field.
Sometimes there are additional varying fields, such as a prefetched current value. These fixed and varying fields correspond to loop variables, which are (respectively) loop-invariant and loop-varying, but they are neatly packaged inside of the iterator.
The public interface of this object is very simple, a variation of
java.util.Iterator
:
public interface PrimitiveIterator.OfInt
extends … Iterator<Integer> {
@Override boolean hasNext();
int nextInt();
}
A for-loop over this iterator looks like this:
for (final var t = myIntIterator(); t.hasNext(); ) {
var x = t.nextInt();
workOn(x);
}
This for-loop appears to use just one iteration variable t
, and it is
loop-invariant. But under the hood there is a loop-invariant pointer
to a CharSequence
and a loop-varying index.
A recurring problem with iterators is that they require a heap
allocation to hold the iteration variables. This is necessary in
order to make t
be loop-invariant, but to encapsulate the loop-varying
field cur
. A recurring fix to the problem is to use something called
“escape analysis” to determine that t
is local to the for-loop, and
expand its fields directly into the same stack frame as the for-loop.
Here’s the same loop as above, but with the fields of the iterator inlined (as source code) into the for-loop:
final var t = myIntIterator();
final var tcs = t.this$0; // CharSequence.this
for (var tcur = t.cur; tcur < tcs.length(); ) {
var x = tcs.charAt(tcur++);
workOn(x);
}
(Note how the hasNext()
method simplifies down to what you would write
in a hand-crafted loop, something like a[i++]
. We’ll want to preserve
this effect, which comes from correct usage of hasNext
and next
.)
While wonderful in theory, the escape analysis fix is fragile in practice. Value types, besides supplying flattened data structures, also supply (as a necessary component) an inherent property that they never “escape” in the sense of escape analysis, and therefore do not require any analysis in order to enjoy their optimization into groups of local variables.
The catch is that values have different characteristics, in order to
avoid escape analysis and allow flattening. Specifically, they must be
identity-free and immutable. That in turn requires different patterns
in source code to use them correctly. If a value models the same
state variables as the CharIterator
above, it must appear in the
for-loop as a loop-varying value, not a loop-invariant iterator
object. In a values-rich world, this loop-varying replacement
for iterator will allow loops to be coded generically, as
with iterators, but with no heap allocations to hold the state.
It seems possible that such loop-varying loop control values
could become as common as iterators, so they deserve their own
name; we can call them cursors.
Let’s work again through the iterator example to see how a value-based
cursor might replace the iterator. First, we’ll regroup the tcs
and
tcur
variables into a value class:
value class CharCursor extends … {
//CharSequence this$0; // make it a nested class
int cur = 0;
…
}
Now let’s rename the expanded loop to see how to fill in the value class:
var c = myIntCursor();
final var ccs = c.this$0; // CharSequence.this
for (var ccur = c.cur; ccur < ccs.length(); ) {
var x = tcs.charAt(ccur++);
workOn(x);
}
And then regroup ccs
/ccur
back into c
:
for (var c = myIntCursor(); c.hasValue(); c = c.advance()) {
var x = c.getAsInt();
workOn(x);
}
I chose the names more or less arbitrarily, but tried to approach, yet not encroach on, the existing iterator API.
(An earlier draft of this note just used
hasNext
andnext
, but using the same name with a subtly different behavioral contract is one of those clever-but-bad ideas some of us can tangled in. Users might callget
and expect it (somehow) to include the semantics ofadvance
, but that’s impossible for a stateless cursor.)
The name get
(or getAsInt
) allows a cursor to be a Supplier
(or
IntSupplier
). This is a legitimate implementation of the contract,
despite the fact that some suppliers, like all iterators, are
stateful.
The name hasValue
could also be hasCurrent
or notAtEnd
or
alive
, while the name get
could be value
or current
.
The name advance
could also have been bump
, increment
, etc. It
could even have been next
, encroaching on a different part of the
Iterator
API.
More importantly, the value version of the loop requires three methods where the object version requires two. Is this evidence of a mistake somewhere? Not really; it is evidence that we are succeeding in the core principle of value types: “codes like a class, works like an int”. Compare the previous loop with a simple array/int loop and you will see why:
final var a = myArray();
for (var i = 0; i < a.length; i = i + 1) {
var x = a[i];
workOn(x);
}
The same pattern of two iteration variables occurs (a
and i
), and the
same pattern of loop control operations occurs. Instead of
c.hasValue()
we have i<a.length
. Instead of c.getAsInt()
we have a[i]
,
which is the get value to iterate over. The third method call (which
wasn’t present in the iterator version) is simply the update to it.
As the example with arrays and ints makes clear, the iterator’s
t.nextInt()
operation was actually producing two results: First, it
handed out the get value in the iteration series, and second it
quietly updated the iterator object to point at the value after the
t.nextInt()
value just returned. Integers can’t do this trick; they
must be updated separately, and value types must follow along. So
instead of c=c.advance()
we have i=i+1
.
Note that if you were to repeatedly ask for c.getAsInt()
, you’d get the
same value (if c
were properly coded). In order to get a different
value you have to call c.advance()
. This works like an int and an
array: If you ask repeatedly for a[i]
you get the same value every
time (in a well-behaved program). To get a different array element
you must update i
.
<digression>
The Java language (following C) gives an abbreviated way to get the
effect of t.nextInt()
in a single expression, the famous a[i++]
. If
you look carefully at a[i++]
, you’ll notice that it returns a pair of
values: First, the value of the expression is the get element of that
array you are scanning. Second, the iteration state quietly updates
itself to point at the array element afterwards. Given the widespread
use of this deceptively simple expression, it seems worth considering
whether value types could benefit from such a syntax also; it would
call a method on a value stored in a variable, update the variable,
and then call another method on the previous value of the variable.
Maybe something like:
for (var c = myIntCursor(); c.hasValue(); ) {
var x = c{.=advance()}.getAsInt();
workOn(x);
}
Sadly, there don’t seem to be any obviously good answers here. And to
be honest i++
, though indispensible, is quite ugly; we’re just used to
it. So I’ll back away from that set of ideas. See JDK-8199429 for a
bit more.
</digression>
Now we can fill in our cursor class more fully:
value class CharCursor extends … PrimitiveCursor.OfInt {
//CharSequence this$0; // make it a nested class
int cur = 0;
public boolean hasValue() {
return cur < length();
}
public int getAsInt() {
if (hasValue()) {
return charAt(cur);
} else {
throw new NoSuchElementException();
}
}
public CharCursor advance() {
return this __With { i = i + 1; };
}
}
(The __With
token sweeps under the rug the details of how a new
value instance is obtained from an old value instance of the same
class by updating one of its fields. This is a separate and
complicated conversation, involving concepts like factory methods,
primary constructors, reconstructors, deconstructors, and “wither”
methods. __With
in this case is simply a punch-through to the current
JVM-level bytecode which handles such things, withfield
.)
The details of the class are not deeply important, although the above
is probably close to a working example in current Valhalla prototypes.
The code is almost identical to the original CharIterator
example.
(And I didn’t have to cherry-pick an example, or work hard on the
iterator-to-cursor transition.) Notice that the double effect of the
t.nextInt()
method derives from a tiny “++
” token embedded in the
heart of the method. We can derive the value type semantics by
dropping the “++
”. Of course it is not always this easy to make a
cursor from an iterator, but it probably not much harder in most
cases.
If we change “i++
” to “i
” in getInt
to drop the second effect, we had
to put it somewhere. In fact, it went into the third method, as one
would expect. In some proposals for the language support for values,
you can write “i++
” as the complete body of a reconstructor method:
public __Reconstructor advance() {
i++;
}
As a sanity check, let’s inline the cursor into our new loop and see how it optimizes:
for (var c = myIntCursor();
c.cur < c.this$0.length();
c = __WithField(c.cur, c.cur + 1)) {
var x = c.this$0.charAt(c.cur);
workOn(x);
}
This is more complicated than the inlining of the CharIterator
; is it
harder to optimize? For starters, there are no heap objects or object
identities, and thus no aliasing. The compiler can accurately track c
and
its components, c.cur
and c.this$0
as if they were all plain
variables. In fact, the above loop quickly folds to the previous
version of the inlined loop, including operations like ccur++
. The
compiler can easily and reliably note that c.this$0
is in fact a loop
invariant, and can be hoisted into a final like ccs
was above.
There’s no need to exhibit this loop because its behavior is
indistinguishable from the nice loop with ccs
, above.
A key part of optimizing the cursor-based loop is recognizing the loop
invariants. This recognition is slightly at risk since when the
cursor value is updated, it apparently updates all components of the
cursor (including this$0
), not just the part that changed (which is
cur
only). If the compiler could not inline the crucial call to
c.advance()
, it would be unable to deduce that c.this$0
is a loop
invariant. Of course it also wouldn’t see that the thing changing was
a nice little int
variable.
To make sure the loop optimizes fully, all three methods of CharCursor
must be inlined, and their bodies must be easy for the compiler to
analyze. Of course, the same is already true with the CharIterator
,
and we have seen that the code of the two classes is almost identical,
except that the CharCursor
splits out two methods from a single
CharIterator
method.
Nearly-identical code is a problem of course. If we move away from iterators and toward cursors, will there be a general duplication of loop iteration code, one copy for cursors and one for iterators? After all, iterators aren’t going anywhere, and sometimes you need one.
What’s necessary in order to adopt cursors widely is to have a story
for interconverting them with iterators, without always duplicating
code. (Sometimes we duplicate code if it helps performance but we try
not to indulge too often.) Given the similarity in structure, can
cursors be automatically derived from iterators? The answer seems to
be no (in general) since deriving the three methods of a cursor from
the two methods of an iterator requires somehow splitting the get
method of the iterator into its two effects, and assigning the bits of
code to the get
and advance
methods of the cursor.
But it seems that iterators can be derived from cursors. Here is a
proof, specialized to the current example of CharCursor
and
CharIterator
:
class CharIterator implements PrimitiveIterator.OfInt {
CharCursor c;
public boolean hasValue() {
return c.hasValue();
}
public int nextInt() {
int x = c.getAsInt(); // (where is my ++)
c = c.advance();
return x;
}
}
More generically:
abstract class IteratorFromCursor<T> implements Iterator<T> {
protected Cursor<T> c;
protected IteratorFromCursor<T>(Cursor<T> initialState) {
c = initialState;
}
public boolean hasNext() {
return c.hasValue();
}
public T next() {
T x = c.get(); // (where is my ++)
c = c.advance();
return x;
}
public void forEachRemaining(Consumer<T> block) {
c.forEachRemaining(block);
}
}
(In order to fully flatten this class, it will need to be a template class. This is a planned follow-on to value types. Before template classes, the generic iterator will have a slightly more complex heap structure. Where escape analysis succeeds with bespoke iterators, it is likely also to succeed with such generic iterators built on top of cursors.)
Given that derivation, it will be reasonable, over time, to refactor
some existing iterators as cursors, splitting the two effects of the
get
method by hand, and supplying backward-compatible iterators
automatically using generics like the above.
What have we learned?
Value classes can incrementally improve on certain kinds of mutable classes, such as iterators. The improvement is that the dependence on escape analysis is removed.
In particular, value classes can reduce heap pressure for loop, in cases where we care about such things.
The improvement requires recoding but is incremental in cost, because the code for cursors and iterators is very similar.
To benefit from cursors, for-loops which use iterators need recoding; this recoding is also incremental in nature.
There is a backward compatibility story for iterators in a world where cursors are the new way to do things.
In order to operate effectively with for-loops, value classes may need a bit more syntax love, so they can “work like an int” when that int is loop variable.
Since moving to cursors requires recoding for both loops and iterators, it’s fair to wonder whether we are actually stuck with iterators, despite their issues. Of course, most code will stay as it is, but performance sensitive code may benefit from recoding loops. In that case, having a basic supply of cursors on hand makes sense.
In any case, simply as a thought experiment, cursors have taught us more about how value types will work.