C, Fortran, and single-character strings
The C language famously does not worry much about the length of strings, which simply extend until the null byte at the end. Fortran, though, likes to know the sizes of the strings it is dealing with. When strings are passed as arguments to functions or subroutines, the GCC Fortran argument-passing conventions state that the length of each string is to be appended to the list of arguments. Consider a Fortran subroutine defined something like this:
    subroutine foo(i, s)
    integer i
    character s
    ...
When that subroutine is called from other Fortran code, the length of s will be added by the compiler as a third, "hidden" argument. The C compiler will do no such thing, though, so the proper call to that function from C would look like:
    int i;
    char *s = "bar";
    foo(&i, s, strlen(s));
From C, the length of s must be passed explicitly at the end of the list of arguments.
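As a minimal sketch of that convention, the C below uses a hypothetical stand-in, foo_standin(), in place of a real Fortran subroutine so that it compiles on its own. The names foo_standin and call_foo are assumptions made for illustration. The size_t type for the hidden length follows what the gfortran documentation describes for GCC 8 and later; older releases used int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C stand-in for the Fortran subroutine foo(i, s).
 * A Fortran string carries no terminating null byte, so the callee
 * relies on the hidden length passed after the declared arguments. */
void foo_standin(int *i, const char *s, size_t s_len)
{
    assert(i != NULL && s != NULL);
    *i = (int)s_len;   /* report how many characters the callee saw */
}

/* The caller's side: the length goes after all declared arguments. */
int call_foo(const char *s)
{
    int i = 0;
    foo_standin(&i, s, strlen(s));
    return i;
}
```

With a real Fortran library the declaration would name the compiler's mangled symbol instead, but the position of the length argument would be the same.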
At some distant point in the past, though, somebody decided that the hidden length argument should be omitted for single-character strings — those that are declared "character *1" in the called function, for example. It is not clear that any Fortran compiler anywhere ever implemented that behavior, but developers writing calls from C developed the habit of leaving out the length in that situation. As long as the called code knew that it was getting a single-character string, it would not need to check the (missing) hidden length parameter; everything worked, even though the calling standards were being violated. Various LAPACK subroutines expect single-character strings, and packages like CBLAS and LAPACKE duly leave out the length argument when calling them.
Once again, this is not how these functions are supposed to be called, but things worked anyway. At least, until they broke. It seems that the problem was originally worked out by Thomas Kalibera in the R language community: a fix for an unrelated ABI issue caused crashes with some LAPACK calls. After, seemingly, a great deal of analysis work, Kalibera figured out where things go wrong. A subroutine taking a single-character string would call another just prior to returning, passing the same string. The compiler would optimize that call into a tail call (or more properly a "sibling call" using the same parameters); prior to making the jump, it would helpfully store the string length at the end of the argument list. But that length wasn't there to begin with, and no space had been allocated for it, so the result was an unsightly stack traceback. The problem can be worked around by compiling the Fortran code with the ‑fno‑optimize‑sibling‑calls option.
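The failure can be pictured with a toy model in C. This is only a sketch under stated assumptions: the "stack" is an ordinary array of slots, the slot layout and every name here are invented for illustration, and nothing about it reflects GCC's actual code generation. It shows only why a store into a slot the caller never reserved damages the caller's own data.

```c
#include <assert.h>
#include <stddef.h>

enum { SLOT_I = 0, SLOT_S = 1, SLOT_LEN = 2, ARG_SLOTS = 3 };

/* Toy callee: before "jumping" to its sibling it rewrites the incoming
 * argument area in place, including the slot where it expects the
 * hidden length to live. */
void callee_sibcall(size_t *args)
{
    args[SLOT_LEN] = 1;   /* length of the one-character string */
}

/* Toy caller that follows the convention: a third slot is reserved,
 * and the caller's own data sits safely past the arguments. */
size_t conforming_caller(void)
{
    size_t frame[ARG_SLOTS + 1] = {0};
    frame[ARG_SLOTS] = 0xC0FFEE;    /* caller's own data */
    callee_sibcall(frame);
    return frame[ARG_SLOTS];        /* untouched */
}

/* Toy caller that omits the length: what the callee treats as the
 * length slot is really the caller's own data. */
size_t nonconforming_caller(void)
{
    size_t frame[ARG_SLOTS] = {0};
    frame[SLOT_LEN] = 0xC0FFEE;     /* not an argument at all */
    callee_sibcall(frame);
    return frame[SLOT_LEN];         /* overwritten by the callee */
}
```

In the model the conforming caller gets its value back unchanged, while the nonconforming caller finds it replaced by the string length.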
This behavior was reported as a GCC bug on May 3. Thomas Koenig responded:
The code that fails with new compilers is widely understood to actually have been broken (if "working") for years. Richard Biener suggested that the solution was to tell the affected users to fix their code. That suggestion did not go far, though, and the GCC developers took the problem seriously; it is not a good thing for a compiler update to break code that was working before. So a solution had to be found, but it wasn't clear what the best solution would be. Simply reverting the ABI fix was not an option, since that would reintroduce a real bug of its own.
After some discussion of options that were shown not to be real solutions, Koenig returned to the use of ‑fno‑optimize‑sibling‑calls which, he said, "restores the status quo because things would go back to being fragile, nonconforming, and they would work again". He suggested that this option could be enabled by default for GCC versions 7, 8, and 9, since the code in question used to work when built with those versions. For the upcoming GCC 10 release, developers would be warned and would have around a year to fix their code.
Unfortunately, as Jakub Jelinek pointed out, that solution was not viable either. There are programs performing recursive tail calls to a significant depth; turning those tail calls into real function calls would cause them to run out of stack space and crash. He suggested instead trying to avoid the tail (or sibling) calls only when there are string arguments involved. Koenig took this work and attached it to a new ‑fbroken‑callers option, which would be enabled by default in updates to older GCC releases.
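To see why disabling the optimization everywhere is risky, consider a tail-recursive function. The sketch below is a generic illustration, not code from any affected program, and its function names are hypothetical. When the compiler performs the sibling-call optimization, the recursive form can run in constant stack space like the loop; without it, each level of recursion consumes a new stack frame.

```c
#include <assert.h>

/* Tail-recursive sum of 1..n: the recursive call is the last action,
 * so a compiler may replace it with a jump that reuses the frame. */
unsigned long sum_to(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);   /* tail call */
}

/* Roughly what the optimization produces: constant stack usage. */
unsigned long sum_to_loop(unsigned long n)
{
    unsigned long acc = 0;
    while (n != 0) {
        acc += n;
        n--;
    }
    return acc;
}

/* Usage: the two forms compute the same value. */
void check_sum_forms(void)
{
    assert(sum_to(10, 0) == 55);
    assert(sum_to_loop(10) == 55);
}
```

At small depths the two behave identically; the difference only shows up when the recursion is deep enough to exhaust the stack.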
Jelinek didn't like the name of that option, though, so he reworked it into one called ‑ftail‑call‑workaround. Setting that option to two gives the same behavior as ‑fbroken‑callers, while setting it to one (the default value) limits the workaround to calls to functions without explicit prototypes. Setting it to zero disables the workaround entirely. This code has been backported for the (future) 8.4 and 9.2 releases (so far), and will appear in 10.1 as well, perhaps with a different default value.
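The three settings can be summarized as a small decision function. This is a sketch of the behavior as described above, not a transcription of GCC's implementation; the function name and both boolean parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the sibling-call optimization should be skipped.
 * level is the value given to -ftail-call-workaround (0, 1, or 2).
 * Both boolean parameters are invented names for this sketch. */
bool skip_sibling_call(int level, bool has_string_args,
                       bool callee_has_explicit_prototype)
{
    assert(level >= 0 && level <= 2);
    if (level == 0 || !has_string_args)
        return false;   /* workaround disabled, or nothing at risk */
    if (level == 2)
        return true;    /* every call involving string arguments */
    return !callee_has_explicit_prototype;   /* level 1 */
}
```

Under this reading, level one leaves the optimization in place for any call whose target has an explicit prototype, which is where most of the performance cost of level two would otherwise come from.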
Weinberg's second law states that "if builders built houses the way programmers built programs, the first woodpecker to come along would destroy civilization". Situations like this, where the interface between functions has been misunderstood for years, would appear to be a case in point. There is a lot of code on our systems that appears to work fine, but which is really just waiting for a woodpecker to come along and poke a hole in the right place. The GCC developers have worked out how to patch over this particular problem, but there is certainly still plenty of code out there that should never have worked, but which seems to, for now.
[Thanks to Dave Williams for the heads-up on this issue.]
Posted Jun 20, 2019 15:46 UTC (Thu) by joib (subscriber, #8541) [Link]
Posted Jun 20, 2019 18:06 UTC (Thu) by HenrikH (subscriber, #31152) [Link] (1 responses)
Posted Jun 20, 2019 18:27 UTC (Thu) by mathstuf (subscriber, #69389) [Link]
Posted Jun 20, 2019 22:35 UTC (Thu) by imMute (guest, #96323) [Link] (9 responses)
I would argue that the code *wasn't* working before...
Posted Jun 21, 2019 7:19 UTC (Fri) by ibukanov (guest, #3942) [Link] (4 responses)
Posted Jun 21, 2019 12:35 UTC (Fri) by nivedita76 (guest, #121790) [Link] (2 responses)
Posted Jun 22, 2019 21:37 UTC (Sat) by ncm (guest, #165) [Link] (1 responses)
To pass an integer that is to be interpreted as an enumeration value, you also pass just a pointer. In Fortran, enumeration values are conventionally assigned character codes. They're not strings, they're just convenient literal notation for a symbol. Passing a pointer to a character is just the language idiom for an enumerated-value argument.
But passing a pointer to a character is not wrong or fragile code. Changing a library to fail without an extra length argument, after conventional usage is well-established, would be. If, in fact, maintainers of Fortran libraries are doing that, shame on THEM. I see no reason for C users of Fortran libraries to be embarrassed. Their code is not wrong or fragile, absent specific documentation of the original library to the contrary. Changing the compiler to forbid this usage would be wrong.
If you look at the libraries in question, you will probably find places where an argument interpreted as an enumerator is followed by another actual argument, where inserting another, length, argument would produce the wrong result, or a crash.
Posted Jun 24, 2019 5:47 UTC (Mon) by joib (subscriber, #8541) [Link]
Posted Jul 7, 2019 1:07 UTC (Sun) by ericharris76 (guest, #132998) [Link]
If the code had no comments and all its variable names were completely arbitrary and the indenting was missing or goofy, it would be "working" but not "right" or "good", even if the calling sequences were all standards-conforming and it always produced the right results, before and after the change to the compiler.
Posted Jun 21, 2019 11:22 UTC (Fri) by mb (subscriber, #50428) [Link] (3 responses)
It was working code, if the definition was: "Get the actual job done."
Most people care about getting things done, not about ABI definitions.
Posted Jun 21, 2019 12:37 UTC (Fri) by nivedita76 (guest, #121790) [Link] (2 responses)
Posted Jun 22, 2019 10:21 UTC (Sat) by Jandar (subscriber, #85683) [Link]
This attitude is accepting reality and not naming something inconvenient fake facts. If something works, then it works: fact. The goal isn't simply to get something to work; it is to get something to work safely and reliably. This distinction between simply working (maybe only backed by luck) and good engineering is the topic of the woodpecker comment.
Posted Jun 27, 2019 14:51 UTC (Thu) by jschrod (subscriber, #1646) [Link]
Posted Jun 21, 2019 6:24 UTC (Fri) by bokr (subscriber, #58369) [Link] (6 responses)
single characters :) (of some character type)?
Posted Jun 21, 2019 17:29 UTC (Fri) by valarauca (guest, #109490) [Link] (2 responses)
> single characters :) (of some character type)?
But what is a "character"?
UTF-8 doesn't have "characters" anymore. It has "glyphs", each of which is a "displayable symbol", but some "glyphs" require multiple "unicode scalar values" (each normally held in a `uint32_t`). So even a "character" isn't just 1 value anymore.
Sure, we can pretend the rest of the world doesn't exist and ASCII is the only text standard, but that seems extremely short-sighted to burn into an ABI.
Posted Jun 21, 2019 21:17 UTC (Fri) by pbonzini (subscriber, #60935) [Link]
However older Fortran programs didn't have prototypes, so the parent comment's suggestion is not applicable.
Posted Jun 26, 2019 9:05 UTC (Wed) by tialaramex (subscriber, #21167) [Link]
UTF-8 consists of 8-bit _code units_, strings of which (and similarly for UCS-2, UTF-7, UTF-16 and UCS-4) can be decoded to get Unicode _code points_, which are just integers from an enormous enumeration with names, like "Latin Capital A", and a shared understanding of what they mean.
"Glyphs" are the pretty pictures in a typeface. Unicode isn't directly concerned with how typefaces work; it is ambivalent about whether you choose to have lots of pretty pictures and do some extra work to pick from those to draw text, or very few pretty pictures and do different extra work assembling those to draw text. In particular, Unicode doesn't care about allographs at all by default (though, for reasons to do with its mission to replace all previous text encodings, it in fact encodes a LOT of allographs); that's seen as purely a typeface problem.
You should definitely treat the word "character" as code smell, whatever is going on there will usually be at least confusing and an opportunity for bugs, if not itself directly a bug. That's sad for a lot of older programming languages which use the word "character" all the time. Too bad. See also "number" when used to actually mean something far more limited, like an integer, or a float, or a real.
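The distinction between code units and code points is easy to demonstrate. The following is a minimal sketch that assumes its input is valid, null-terminated UTF-8; the function name is hypothetical. It counts code points by skipping continuation bytes, those of the form 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a null-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is performed. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, "e" followed by the combining acute accent U+0301 is three code units and two code points, yet a reader would call it one character, which is why neither count answers the question of what a "character" is.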
Posted Jun 21, 2019 18:01 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)
E.g. http://www.netlib.org/lapack/explore-html/d7/d03/dpptrs_8... does "CALL dtpsv( 'Upper', 'Transpose', 'Non-unit', ...)", where the documentation and definition of dtpsv show it just compares the first argument to 'U'/'L', the second to 'N'/'T'/'C', the third to 'U'/'N', using "lsame" (case-insensitive comparison of the first character).
(This seems a really stupid way to design an API, even without the cross-language issue, because the compiler can't type-check the strings to detect typos. I guess Fortran didn't/doesn't have anything equivalent to C enums?)
Posted Jun 21, 2019 19:02 UTC (Fri) by joib (subscriber, #8541) [Link]
The LAPACK interface is Fortran 77, which didn't have derived types (structs in C) or enums like modern Fortran has, so it certainly is a lot more limited than what you'd be able to do today.
Posted Jun 21, 2019 21:20 UTC (Fri) by pbonzini (subscriber, #60935) [Link]
Also it would be an ABI break.
What does it have to do with the compiler?
Posted Jun 21, 2019 9:46 UTC (Fri) by NAR (subscriber, #1313) [Link] (3 responses)
I don't quite get what this bug has to do with the compiler. I mean, it's like a function with a "flags" argument that previously used only the lowest 2 bits, but now starts to use the 3rd bit, and suddenly all calls that previously set the 3rd bit are broken... Also, it doesn't seem that hard to fix the callers with a sed command.
The other thing I don't get: it's C calling FORTRAN, so I'd guess there are C header files generated from the FORTRAN code that contain the prototype of the public functions and those should have the length parameter, shouldn't they?
Posted Jun 21, 2019 10:32 UTC (Fri) by bjartur (guest, #67801) [Link]
Posted Jun 21, 2019 13:42 UTC (Fri) by mathstuf (subscriber, #69389) [Link]
Maybe for BLAS/LAPACK, but this isn't how it works in general since most Fortran code used directly from C happens within the same project. There's usually some configure-time logic to determine the mangling strategy of the in-use Fortran compiler. This then guides a macro selection around the core symbol names in a header which has the `extern` declarations of those functions with the right mangling. I suspect that the length arguments are visible in those C declarations, but not in the actual Fortran code.
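A header of the kind described might be sketched as follows. Everything here is an assumption made for illustration: the FORTRAN_NAME macro, the trailing-underscore mangling it hard-codes (a configure step would normally choose it), and the C stand-in for the Fortran definition, which is present only so that the sketch links on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the Fortran compiler lowercases names and appends one
 * underscore; a configure step would normally select this macro. */
#define FORTRAN_NAME(lower) lower##_

/* Declaration a C caller would see, with the hidden length explicit. */
extern void FORTRAN_NAME(foo)(int *i, const char *s, size_t s_len);

/* C stand-in for the Fortran definition, so the sketch is complete. */
void FORTRAN_NAME(foo)(int *i, const char *s, size_t s_len)
{
    assert(i != NULL && s != NULL);
    *i = (int)s_len;
}
```

Writing the length parameter into the C declaration is what lets the C compiler catch a caller that leaves it out.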
Posted Jun 21, 2019 19:10 UTC (Fri) by joib (subscriber, #8541) [Link]
They have C prototypes for the Fortran functions, but they are presumably hand-written and they have omitted the length parameter. Which is the reason behind this entire mess.
Posted Jun 22, 2019 5:05 UTC (Sat) by marcH (subscriber, #57642) [Link] (8 responses)
Not just Fortran but any remotely sane/safe/modern language including C++. Even newer and safer C APIs.
Some other comment mentioned "cargo cult programming techniques": null-terminated strings are probably one of the top examples of that. Any other language doing it?
Posted Jun 22, 2019 17:45 UTC (Sat) by ncm (guest, #165) [Link] (7 responses)
But it's not the only dodgy practice around strings, and they are accumulating at an impressive rate. A lot of Pascal-family languages store (or stored) the length in the first byte, with no great answer for how to handle a longer string. Others use the first two bytes. Lots of languages switched to two-byte, first-generation Unicode but have no concept of normalizing different representations with modifier code points, so strings that produce the same set of glyphs compare unequal, and there is no concept of a character representable only as a pair of inseparable code units.
Unicode has characters that have no visible glyph and take no space, so could be sprinkled anywhere, and lots of code points have glyphs necessarily identical to others, that normalization isn't allowed to choose just one of. Lots of languages have adopted UTF-8, but not tackled any of the similar problems.
Getting exercised over the choice of representing length with a null terminator will leave you entirely unequipped for the much bigger problems that matter.
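The first-byte scheme and its limit can be sketched in a few lines of C. The function name and the fixed 256-byte buffer are assumptions for illustration, not the layout of any particular Pascal implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length-prefixed string: byte 0 holds the length, so at most 255
 * characters fit. Returns 0 on success, -1 if the source is too long. */
int to_length_prefixed(unsigned char out[256], const char *src)
{
    size_t len;
    assert(out != NULL && src != NULL);
    len = strlen(src);
    if (len > 255)
        return -1;                /* length does not fit in one byte */
    out[0] = (unsigned char)len;
    memcpy(out + 1, src, len);    /* no terminator is stored */
    return 0;
}
```

The length is available in constant time, but anything longer than 255 bytes simply cannot be represented without changing the format.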
Posted Jun 22, 2019 18:12 UTC (Sat) by marcH (subscriber, #57642) [Link] (6 responses)
I was referring to *memory* length from a safety and performance perspective.
> Every language that has to interact with C does, which today is all of them.
Yeah, sure. Off-topic too.
Posted Jun 22, 2019 21:04 UTC (Sat) by ncm (guest, #165) [Link] (5 responses)
Null termination is an example of a venerable programming practice, the use of sentinel elements, lately fallen from favor now that memory and cycles are thousands, millions, or even billions of times cheaper than they once were.
If we sneer at choices made then, under the constraints of the time, how much more derision do we deserve for unfortunate choices made without such constraints? 'Cause I could list such, all day long, about any system, language, or technology you can think of.
Posted Jun 22, 2019 21:37 UTC (Sat) by marcH (subscriber, #57642) [Link] (4 responses)
> 'Cause I could list such, all day long, about any system, language, or technology you can think of.
Sure, let's start by looking at some CVE statistics. Wait, I said no digression sorry.
Posted Jun 23, 2019 4:04 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)
C strings allow you to pass substrings as a pair of pointers (or just one pointer for tail substrings), for example.
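Both forms can be sketched briefly; the helper names below are hypothetical. A tail substring is just a pointer into the original string, while an interior substring needs a second pointer because no null byte marks its end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A tail substring needs no copy: it is a pointer into the original
 * string and shares its terminating null byte. */
const char *tail_from(const char *s, size_t offset)
{
    assert(s != NULL && offset <= strlen(s));
    return s + offset;
}

/* An interior substring as a [begin, end) pointer pair; its length is
 * the pointer difference. */
size_t span_length(const char *begin, const char *end)
{
    assert(begin != NULL && begin <= end);
    return (size_t)(end - begin);
}
```

Neither helper allocates or copies, which is the appeal; the cost is that the substring is only valid for as long as the original buffer is.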
Posted Jun 23, 2019 18:02 UTC (Sun) by marcH (subscriber, #57642) [Link] (2 responses)
Yes the type of (safer) arrays would have been one step above "primitive".
Looking at string.h on opengroup.org, it's interesting to see almost half the functions there already have some size_t argument.
> C strings allow you to pass substrings as a pair of pointers (or just one pointer for tail substrings), for example.
This is indeed a performance optimization. It's also a dangerous one if the array is not const (who owns it now?) and I don't see how "higher level" arrays would stop you from still doing that, I would just discourage you from doing it routinely in non-critical paths.
Posted Jun 23, 2019 21:07 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)
Sure, but C was designed without such arrays. And a language with safe arrays won't be C.
I'm not saying that it's a good idea now, but null-terminated strings certainly make sense in C.
Posted Jun 25, 2019 16:41 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link]
> Sure, but C was designed without such arrays. And a language with safe arrays won't be C.
No. A language with only safe arrays won't be C. C is supposed to provide access to low-level functions and that includes unsafe pointers and arrays. But C is also supposed to allow programmers to build higher-level abstractions, including things like safe arrays and strings, and there's excellent reason to use those safe arrays and strings in place of the unsafe alternatives when performance is not critical.
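One possible shape for such an abstraction is a length-carrying view with a bounds-checked accessor. This is only a sketch; the struct and function names are hypothetical and not taken from any particular library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical length-carrying string view over a plain C array. */
struct str_view {
    const char *data;
    size_t len;
};

/* Bounds-checked read: stores the character in *out and returns true,
 * or returns false without reading outside the view. */
bool str_view_at(struct str_view v, size_t index, char *out)
{
    assert(out != NULL);
    if (index >= v.len)
        return false;
    *out = v.data[index];
    return true;
}
```

The check costs one comparison per access, which is the kind of overhead that matters only on critical paths.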