
C, Fortran, and single-character strings

By Jonathan Corbet
June 20, 2019
The calling interfaces between programming languages are, by their nature, ripe for misunderstandings; different languages can have subtly different ideas of how data should be passed around. Such misunderstandings often have the effect of making things break right away; these are quickly fixed. Others can persist for years or even decades before jumping out of the shadows and making things fail. A problem of the latter variety recently turned up in how some C programs are passing strings to Fortran subroutines, with unpleasant effects on widely used packages like LAPACK.

The C language famously does not worry much about the length of strings, which simply extend until the null byte at the end. Fortran, though, likes to know the sizes of the strings it is dealing with. When strings are passed as arguments to functions or subroutines, the GCC Fortran argument-passing conventions state that the length of each string is to be appended to the list of arguments. Consider a Fortran subroutine defined something like this:

    subroutine foo(i, s)
        integer i
        character s
        ...

When that subroutine is called from other Fortran code, the length of s will be added by the compiler as a third, "hidden" argument. The C compiler will do no such thing, though, so the proper call to that function from C would look like:

    int i;
    char *s = "bar";

    foo(&i, s, strlen(s));

From C, the length of s must be passed explicitly at the end of the list of arguments.
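
As a minimal sketch of that convention, the fragment below replaces the Fortran subroutine with a stand-in written in C, so that it builds without a Fortran compiler. The trailing underscore in foo_, the size_t type of the hidden length, and the last_len and call_foo names are assumptions made for illustration; the GCC Fortran documentation is the authority on the symbol name and length type used by any given compiler version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bookkeeping: the length the stand-in callee received. */
static size_t last_len;

/* Stand-in for "subroutine foo(i, s)". The third parameter plays the role
 * of the hidden length that a Fortran caller would have added on its own. */
void foo_(int *i, const char *s, size_t s_len)
{
    assert(i != NULL && s != NULL);
    /* Fortran strings have no terminating null; only s_len bounds them. */
    last_len = s_len;
}

/* The C side of the call: the length goes last, after every declared
 * argument, because the C compiler will not add it. */
size_t call_foo(const char *s)
{
    int i = 0;

    foo_(&i, s, strlen(s));
    return last_len;
}
```

Calling call_foo("bar") hands the stand-in a length of three, which is what a real Fortran callee would expect to find in its hidden argument.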

At some distant point in the past, though, somebody decided that the hidden length argument should be omitted for single-character strings — those that are declared "character *1" in the called function, for example. It is not clear that any Fortran compiler anywhere ever implemented that behavior, but developers writing calls from C developed the habit of leaving out the length in that situation. As long as the called code knew that it was getting a single-character string, it would not need to check the (missing) hidden length parameter; everything worked, even though the calling standards were being violated. Various LAPACK subroutines expect single-character strings, and packages like CBLAS and LAPACKE duly leave out the length argument when calling them.
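
For comparison, a conforming call to a routine that takes single-character flags might look like the sketch below. The xsolve_ routine is hypothetical (it is not a LAPACK function) and is again written in C as a stand-in; the sketch assumes that one hidden length per character argument follows the declared arguments, in the same order, and that the length type is size_t.

```c
#include <stddef.h>

/* Hypothetical bookkeeping: the hidden lengths the stand-in received. */
static size_t got_uplo_len, got_trans_len;

/* Stand-in for a Fortran routine declared roughly as:
 *     subroutine xsolve(uplo, trans, n)
 *     character*1 uplo, trans
 *     integer n
 * The two hidden lengths come after all of the declared arguments. */
void xsolve_(const char *uplo, const char *trans, const int *n,
             size_t uplo_len, size_t trans_len)
{
    (void)uplo;
    (void)trans;
    (void)n;
    got_uplo_len = uplo_len;
    got_trans_len = trans_len;
}

/* Conforming C caller: each flag is passed by reference, and a length of
 * one is supplied for each of them rather than being left out.
 * Returns 1 if the callee saw both lengths as one. */
int call_xsolve(char uplo, char trans, int n)
{
    xsolve_(&uplo, &trans, &n, (size_t)1, (size_t)1);
    return got_uplo_len == 1 && got_trans_len == 1;
}
```

The habit described above amounts to dropping the last two arguments from both the declaration and the call.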

Once again, this is not how these functions are supposed to be called, but things worked anyway. At least, until they broke. It seems that the problem was originally worked out by Thomas Kalibera in the R language community: a fix for an unrelated ABI issue caused crashes with some LAPACK calls. After, seemingly, a great deal of analysis work, Kalibera figured out where things go wrong. A subroutine taking a single-character string would call another just prior to returning, passing the same string. The compiler would optimize that call into a tail call (or more properly a "sibling call" using the same parameters); prior to making the jump, it would helpfully store the string length at the end of the argument list. But that length wasn't there to begin with, and no space had been allocated for it, so the result was an unsightly stack traceback. The problem can be worked around by compiling the Fortran code with the ‑fno‑optimize‑sibling‑calls option.
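
The shape of code that triggers the problem can be sketched as follows. Both routines are hypothetical stand-ins written in C, and the sketch passes the length correctly, so it is safe to run; it only illustrates the call pattern, not the failure itself.

```c
#include <stddef.h>

/* Hypothetical bookkeeping: the length that reached the inner routine. */
static size_t inner_len;

/* Stand-in for the inner Fortran subroutine. */
void inner_(const char *flag, size_t flag_len)
{
    (void)flag;
    inner_len = flag_len;
}

/* Stand-in for the outer subroutine. Its last action is a call to inner_
 * with the same string, which is the pattern a compiler may turn into a
 * sibling call. To do that, it places the outgoing arguments, flag_len
 * included, where outer_'s own incoming arguments were passed. When that
 * location is on the stack and the caller never supplied a length, the
 * store lands in memory that belongs to the caller. */
void outer_(const char *flag, size_t flag_len)
{
    inner_(flag, flag_len);
}

/* Conforming caller: supplies the length, so the slot exists. */
size_t call_outer(char flag)
{
    outer_(&flag, (size_t)1);
    return inner_len;
}
```

In this two-argument sketch the length would normally travel in a register on common 64-bit ABIs; the crashes described above arise in routines with enough arguments that the hidden length ends up on the stack.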

This behavior was reported as a GCC bug on May 3. Thomas Koenig responded:

OUCH. So, basically, people have been depending on C undefined behavior for ages, and this includes recent developments like LAPACKE. Only an accident of calling conventions has kept this "working". Oh my...

The code that fails with new compilers is widely understood to actually have been broken (if "working") for years. Richard Biener suggested that the solution was to tell the affected users to fix their code. That suggestion did not go far, though, and the GCC developers took the problem seriously; it is not a good thing for a compiler update to break code that was working before. So a solution had to be found, but it wasn't clear what the best solution would be. Simply reverting the ABI fix was not an option, since that would reintroduce a real bug of its own.

After some discussion of options that were shown not to be real solutions, Koenig returned to the use of ‑fno‑optimize‑sibling‑calls which, he said, "restores the status quo because things would go back to being fragile, nonconforming, and they would work again". He suggested that this option could be enabled by default for GCC versions 7, 8, and 9, since the code in question used to work when built with those versions. For the upcoming GCC 10 release, developers would be warned and would have around a year to fix their code.

Unfortunately, as Jakub Jelinek pointed out, that solution was not viable either. There are programs performing recursive tail calls to a significant depth; turning those tail calls into real function calls would cause them to run out of stack space and crash. He suggested instead trying to avoid the tail (or sibling) calls only when there are string arguments involved. Koenig took this work and attached it to a new ‑fbroken‑callers option, which would be enabled by default in updates to older GCC releases.

Jelinek didn't like the name of that option, though, so he reworked it into one called ‑ftail‑call‑workaround. Setting that option to two gives the same behavior as ‑fbroken‑callers, while setting it to one (the default value) limits the workaround to calls to functions without explicit prototypes. Setting it to zero disables the workaround entirely. This code has been backported for the (future) 8.4 and 9.2 releases (so far), and will appear in 10.1 as well, perhaps with a different default value.

Weinberg's second law states that "if builders built houses the way programmers built programs, the first woodpecker to come along would destroy civilization". Situations like this, where the interface between functions has been misunderstood for years, would appear to be a case in point. There is a lot of code on our systems that appears to work fine, but which is really just waiting for a woodpecker to come along and poke a hole in the right place. The GCC developers have worked out how to patch over this particular problem, but there is certainly still plenty of code out there that should never have worked, but which seems to — for now.

[Thanks to Dave Williams for the heads-up on this issue.]



C, Fortran, and single-character strings

Posted Jun 20, 2019 15:46 UTC (Thu) by joib (subscriber, #8541) [Link]

The R developers have a good writeup at https://developer.r-project.org/Blog/public/2019/05/15/gf...

C, Fortran, and single-character strings

Posted Jun 20, 2019 18:06 UTC (Thu) by HenrikH (subscriber, #31152) [Link] (1 responses)

Regarding the woodpecker problem, I wonder how many of us in the industry have worked at or visited places where there is this one server in the rack somewhere that is "oh that one we *never* touch no matter what happens".

C, Fortran, and single-character strings

Posted Jun 20, 2019 18:27 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

I've migrated us off of a few of those kinds of machines (or even people for some processes!) at my current job. Luckily we do sometimes find time to carve out for such things. Not as often as I'd like, but often enough for actual progress.

C, Fortran, and single-character strings

Posted Jun 20, 2019 22:35 UTC (Thu) by imMute (guest, #96323) [Link] (9 responses)

>it is not a good thing for a compiler update to break code that was working before

I would argue that the code *wasn't* working before...

C, Fortran, and single-character strings

Posted Jun 21, 2019 7:19 UTC (Fri) by ibukanov (guest, #3942) [Link] (4 responses)

The code wasn't working before in *theory*. In practice it was.

C, Fortran, and single-character strings

Posted Jun 21, 2019 12:35 UTC (Fri) by nivedita76 (guest, #121790) [Link] (2 responses)

Just because people were very very lucky. It broke when their luck ran out. Read some of the bug report chains: ALL FORTRAN compilers expect the length argument. Not passing it was probably something one programmer did by mistake once, and it has since been copied around via cargo cult programming techniques. We don’t even know if prior FORTRAN compilers just by luck never modified that argument for BLAS functions, or if they did but it didn’t kill the program and nobody noticed it.

C, Fortran, and single-character strings

Posted Jun 22, 2019 21:37 UTC (Sat) by ncm (guest, #165) [Link] (1 responses)

Just for now, consider this. The traditional Fortran calling convention is that ALL arguments are passed by reference. To pass a machine-word-sized integer, you pass a pointer. Float, pass a pointer. Double, pass a pointer. In NO case do you push a length after it.

To pass an integer that is to be interpreted as an enumeration value, you also pass just a pointer. In Fortran, enumeration values are conventionally assigned character codes. They're not strings, they're just convenient literal notation for a symbol. Passing a pointer to a character is just the language idiom for an enumerated-value argument.

But passing a pointer to a character is not wrong or fragile code. Changing a library to fail without an extra length argument, after conventional usage is well-established, would be. If, in fact, maintainers of Fortran libraries are doing that, shame on THEM. I see no reason for C users of Fortran libraries to be embarrassed. Their code is not wrong or fragile, absent specific documentation of the original library to the contrary. Changing the compiler to forbid this usage would be wrong.

If you look at the libraries in question, you will probably find places where an argument interpreted as an enumerator is followed by another actual argument, where inserting another, length, argument would produce the wrong result, or a crash.

C, Fortran, and single-character strings

Posted Jun 24, 2019 5:47 UTC (Mon) by joib (subscriber, #8541) [Link]

No, that's not how the Fortran character type works. It's a string with an associated length, not a single char (in C lingo).

C, Fortran, and single-character strings

Posted Jul 7, 2019 1:07 UTC (Sun) by ericharris76 (guest, #132998) [Link]

Maybe a better word to use here is not "working" but "good" or "right".

If the code had no comments and all its variable names were completely arbitrary and the indenting was missing or goofy, it would be "working" but not "right" or "good", even if the calling sequences were all standards-conforming and it always produced the right results, before and after the change to the compiler.

C, Fortran, and single-character strings

Posted Jun 21, 2019 11:22 UTC (Fri) by mb (subscriber, #50428) [Link] (3 responses)

That depends on the definition of "working".
It was working code, if the definition was: "Get the actual job done."
Most people care about getting things done, not about ABI definitions.

C, Fortran, and single-character strings

Posted Jun 21, 2019 12:37 UTC (Fri) by nivedita76 (guest, #121790) [Link] (2 responses)

This attitude is exactly what the woodpecker comment is getting at.

C, Fortran, and single-character strings

Posted Jun 22, 2019 10:21 UTC (Sat) by Jandar (subscriber, #85683) [Link]

> This attitude is exactly what the woodpecker comment is getting at.

This attitude is accepting reality and not naming something inconvenient fake facts. If something works then it works - fact. The goal isn't simply to get something to work, it is to get something to work safely and reliably. This distinction between simply working (maybe only backed by luck) and good engineering is the topic of the woodpecker comment.

C, Fortran, and single-character strings

Posted Jun 27, 2019 14:51 UTC (Thu) by jschrod (subscriber, #1646) [Link]

If I look at my 2 houses, the same attitude is used by house builders as well...

C, Fortran, and single-character strings

Posted Jun 21, 2019 6:24 UTC (Fri) by bokr (subscriber, #58369) [Link] (6 responses)

Could some workaround be based on passing single-character strings as
single characters :) (of some character type)?

C, Fortran, and single-character strings

Posted Jun 21, 2019 17:29 UTC (Fri) by valarauca (guest, #109490) [Link] (2 responses)

> Could some workaround be based on passing single-character strings as single characters :) (of some character type)?

But what is a "character"?

UTF-8 doesn't have "characters" anymore. It has "glyphs" which is a "displayable symbol", but some "glyphs" require multiple "unicode scalar values" (what is normally a `uint32_t`). So even a "character" isn't just 1 value anymore.

Sure we can pretend the rest of the world doesn't exist and ASCII is the only text standard, but that seems extremely short-sighted to burn into an ABI.

C, Fortran, and single-character strings

Posted Jun 21, 2019 21:17 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

That's not a problem, Fortran character means "byte".

However older Fortran programs didn't have prototypes, so the parent comment's suggestion is not applicable.

C, Fortran, and single-character strings

Posted Jun 26, 2019 9:05 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

No.

UTF-8 consists of 8-bit _code units_, strings of which (and similarly for UCS-2, UTF-7, UTF-16 and UCS-4) can be decoded to get Unicode _code points_, which are just integers from an enormous enumeration with names, like "Latin Capital A" and a shared understanding of what they mean.

"Glyphs" are the pretty pictures in a typeface. Unicode isn't directly concerned with how typefaces work; it is ambivalent about whether you choose to have lots of pretty pictures and do some extra work to pick from those to draw text, or very few pretty pictures and do different extra work assembling those to draw text. In particular Unicode doesn't care about allographs at all by default (but for reasons to do with its mission to replace all previous text encodings in fact it encodes a LOT of allographs); that's seen as purely a typeface problem.

You should definitely treat the word "character" as code smell, whatever is going on there will usually be at least confusing and an opportunity for bugs, if not itself directly a bug. That's sad for a lot of older programming languages which use the word "character" all the time. Too bad. See also "number" when used to actually mean something far more limited, like an integer, or a float, or a real.

C, Fortran, and single-character strings

Posted Jun 21, 2019 18:01 UTC (Fri) by excors (subscriber, #95769) [Link] (1 responses)

It looks like the arguments are not really single characters. They're strings, where the function only cares about the first character.

E.g. http://www.netlib.org/lapack/explore-html/d7/d03/dpptrs_8... does "CALL dtpsv( 'Upper', 'Transpose', 'Non-unit', ...)", where the documentation and definition of dtpsv show it just compares the first argument to 'U'/'L', the second to 'N'/'T'/'C', the third to 'U'/'N', using "lsame" (case-insensitive comparison of the first character).

(This seems a really stupid way to design an API, even without the cross-language issue, because the compiler can't type-check the strings to detect typos. I guess Fortran didn't/doesn't have anything equivalent to C enums?)

C, Fortran, and single-character strings

Posted Jun 21, 2019 19:02 UTC (Fri) by joib (subscriber, #8541) [Link]

Fortran doesn't have a separate type for a single character/glyph/grapheme/whatever. There's just the CHARACTER type which is, well, what many other languages call a string.

The LAPACK interface is Fortran 77, which didn't have derived types (structs in C) or enums like modern Fortran has, so it certainly is a lot more limited than what you'd be able to do today.

C, Fortran, and single-character strings

Posted Jun 21, 2019 21:20 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

No, because you don't have prototypes and when passing "a" the receiver could expect either a length-1 string or an unknown-length string. In the latter case the length would be needed, in the former it wouldn't.

Also it would be an ABI break.

What does it have to do with the compiler?

Posted Jun 21, 2019 9:46 UTC (Fri) by NAR (subscriber, #1313) [Link] (3 responses)

I don't quite get what this bug has to do with the compiler. I mean it's like a function that has a "flags" argument which previously only used the lowest 2 bits, but now it starts to use the 3rd bit and suddenly all calls that previously set the 3rd bit are broken... Also it doesn't seem that hard to fix the callers with a sed command.

The other thing I don't get: it's C calling FORTRAN, so I'd guess there are C header files generated from the FORTRAN code that contain the prototype of the public functions and those should have the length parameter, shouldn't they?

What does it have to do with the compiler?

Posted Jun 21, 2019 10:32 UTC (Fri) by bjartur (guest, #67801) [Link]

There is no ABI for passing string arguments to Fortran procedures. Since the now-broken code was written, an API was invented that works as you imagine. The code does not use it, because the code is older than the API, and it wouldn't fix the lack of an ABI. Speed can be gained by recompiling or swapping out BLAS/LAPACK, and hitherto this was not thought to require recompiling R or CBLAS/LAPACKE.

What does it have to do with the compiler?

Posted Jun 21, 2019 13:42 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

> so I'd guess there are C header files generated from the FORTRAN code

Maybe for BLAS/LAPACK, but this isn't how it works in general since most Fortran code used directly from C happens within the same project. There's usually some configure-time logic to determine the mangling strategy of the in-use Fortran compiler. This then guides a macro selection around the core symbol names in a header which has the `extern` declarations of those functions with the right mangling. I suspect that the length arguments are visible in those C declarations, but not in the actual Fortran code.

What does it have to do with the compiler?

Posted Jun 21, 2019 19:10 UTC (Fri) by joib (subscriber, #8541) [Link]

> it's C calling FORTRAN, so I'd guess there are C header files generated from the FORTRAN code that contain the prototype of the public functions and those should have the length parameter, shouldn't they?

They have C prototypes for the Fortran functions, but they are presumably hand-written and they have omitted the length parameter. Which is the reason behind this entire mess.

C, Fortran, and single-character strings

Posted Jun 22, 2019 5:05 UTC (Sat) by marcH (subscriber, #57642) [Link] (8 responses)

> The C language famously does not worry much about the length of strings, which simply extend until the null byte at the end. Fortran, though, likes to know the sizes of the strings it is dealing with.

Not just Fortran but any remotely sane/safe/modern language including C++. Even newer and safer C APIs.

Some other comment mentioned "cargo cult programming techniques": null-terminated strings is probably one of the top examples of that. Any other language doing it?

C, Fortran, and single-character strings

Posted Jun 22, 2019 17:45 UTC (Sat) by ncm (guest, #165) [Link] (7 responses)

Every language that has to interact with C does, which today is all of them.

But it's not the only dodgy practice around strings, and they are accumulating at an impressive rate. A lot of Pascal family languages store/stored the length in the first byte, with no great answer to how to do a longer string. Others, for first-two-bytes. Lots of languages switched to two-byte first generation Unicode, but have no concept of normalizing different representations with modifier code points, so e.g. strings that produce the same set of glyphs compare unequal, and there is no concept of a character representable only as a pair of two inseparable code units.

Unicode has characters that have no visible glyph and take no space, so could be sprinkled anywhere, and lots of code points have glyphs necessarily identical to others, that normalization isn't allowed to choose just one of. Lots of languages have adopted UTF-8, but not tackled any of the similar problems.

Getting exercised over the choice of representing length with a null terminator will leave you entirely unequipped for the much bigger problems that matter.

C, Fortran, and single-character strings

Posted Jun 22, 2019 18:12 UTC (Sat) by marcH (subscriber, #57642) [Link] (6 responses)

> Lots of languages switched to two-byte first generation Unicode,

I was referring to *memory* length from a safety and performance perspective.

> Every language that has to interact with C does, which today is all of them.

Yeah, sure. Off-topic too.

C, Fortran, and single-character strings

Posted Jun 22, 2019 21:04 UTC (Sat) by ncm (guest, #165) [Link] (5 responses)

If C strings are evidence of cargo-cultish programming, so are all other string implementations, without exception. Nobody gets a pass, or a diploma.

Null termination is an example of a venerable programming practice, the use of sentinel elements, lately fallen from favor now that memory and cycles are thousands, millions, or even billions of times cheaper than they once were.

If we sneer at choices made then, under the constraints of the time, how much more derision do we deserve for unfortunate choices made without such constraints? 'Cause I could list such, all day long, about any system, language, or technology you can think of.

C, Fortran, and single-character strings

Posted Jun 22, 2019 21:37 UTC (Sat) by marcH (subscriber, #57642) [Link] (4 responses)

Yes, some of C's unsafe choices were made in a completely different context and yes of course many of these choices were required to perform anything in a reasonable time on 50 years older hardware (and not under constant attack). For this particular question however - if we can stop digressing for a minute - I can't see any massive performance advantage for having a marker at the end of an array compared to storing its length somewhere near the start. Maybe for some operations but clearly not for others. By the way Fortran is older than C and is still in use today too because of its... performance. LISP is even older; it's now renowned for its performance yet it was somehow getting some work done at the time too.

> 'Cause I could list such, all day long, about any system, language, or technology you can think of.

Sure, let's start by looking at some CVE statistics. Wait, I said no digression sorry.

C, Fortran, and single-character strings

Posted Jun 23, 2019 4:04 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

C-style strings are really the only choice for a language like C. Since your primitive is a (regular) pointer you can't pass length naturally.

C strings allow you to pass substrings as a pair of pointers (or just one pointer for tail substrings), for example.

C, Fortran, and single-character strings

Posted Jun 23, 2019 18:02 UTC (Sun) by marcH (subscriber, #57642) [Link] (2 responses)

> Since your primitive is a (regular) pointer you can't pass length naturally.

Yes the type of (safer) arrays would have been one step above "primitive".

Looking at string.h on opengroup.org, it's interesting to see almost half the functions there already have some size_t argument.

> C strings allow you to pass substrings as a pair of pointers (or just one pointer for tail substrings), for example.

This is indeed a performance optimization. It's also a dangerous one if the array is not const (who owns it now?) and I don't see how "higher level" arrays would stop you from still doing that, I would just discourage you from doing it routinely in non-critical paths.

C, Fortran, and single-character strings

Posted Jun 23, 2019 21:07 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Yes the type of (safer) arrays would have been one step above "primitive".
Sure, but C was designed without such arrays. And a language with safe arrays won't be C.

I'm not saying that it's a good idea now, but null-terminated strings certainly make sense in C.

C, Fortran, and single-character strings

Posted Jun 25, 2019 16:41 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link]

> Sure, but C was designed without such arrays. And a language with safe arrays won't be C.

No. A language with only safe arrays won't be C. C is supposed to provide access to low-level functions and that includes unsafe pointers and arrays. But C is also supposed to allow programmers to build higher-level abstractions, including things like safe arrays and strings, and there's excellent reason to use those safe arrays and strings in place of the unsafe alternatives when performance is not critical.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds