Sunday, December 30, 2018

Potholes to avoid when migrating to IPv6

Some of my posts are sourced from observing the kinds of mistakes that people are likely to make. Occasionally, they clump together and present a pattern or two, and then I try to share the finding to help others avoid falling into the same trap.

Watching a bunch of services migrate to dual-stack IPv4 + IPv6 behavior showed me a lot of places where it can go wrong. A lot of it is subtle fiddly stuff which nobody really thought about before in the 4-only world, but that suddenly became important.

For example, how many people have built a service where they pass around connection details as a hostname, a colon, and then a port number? It might be something like this.

leader_host = bigdata.example.org:10443

Parsing that is easy enough, right? In the IPv4 world, you split on the colon. That gives you a host portion which goes to your DNS resolver call, like gethostbyname(). It also gives you a port portion which needs to go through the local equivalent of "atoi", and then gets crammed into a sockaddr struct and handed to connect().

Pretty old hat, these BSD sockets, yeah? Well... what happens when you get handed this?

leader_host = 192.168.200.2:10443

It turns out that this also splits cleanly, and the "dotted quad" notation for that IPv4 address passes straight through gethostbyname().

Odds are good that someone is creating this string by emitting (hostname) + ":" + (port) somewhere. What happens when you start using IPv6 addresses in the system, and then that code runs and generates an address string for another host? You get something like... this:

leader_host = 2001:0db8:f00f::0553:1211:0088:10443

Is your parser still going to work? If it's finding the first colon in the string, I'm guessing the answer is no. It'll end up with "2001" in the host section, and the whole rest of it trying to be parsed as an int. That's not going to work.
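
Here's a minimal sketch of that first-colon parser in Python. The function name is made up for illustration, and Python's int() is stricter than a C atoi(): it raises an error where atoi() would quietly hand back some number. Either way, the host you end up with is just "2001".

```python
def parse_first_colon(value):
    """Naive parser: split on the FIRST colon, then convert the port."""
    host, _, port = value.partition(":")
    return host, int(port)


print(parse_first_colon("bigdata.example.org:10443"))  # ('bigdata.example.org', 10443)
print(parse_first_colon("192.168.200.2:10443"))        # ('192.168.200.2', 10443)

try:
    parse_first_colon("2001:0db8:f00f::0553:1211:0088:10443")
except ValueError as err:
    # The "port" here is everything after "2001:", which is not an integer.
    print("IPv6 input failed:", err)
```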

Okay, so, someone will probably change the code to use the LAST instance of the colon in the string. That gives you the split you needed. You get "2001:0db8:f00f::0553:1211:0088" as the host, "10443" as the port, and you can proceed to the resolver step. Obviously, you have to use getaddrinfo() now, but you knew that already.
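
The "last colon" fix might look something like this sketch. Again, the function name is hypothetical, and it assumes the port is always present.

```python
def parse_last_colon(value):
    """Split on the LAST colon so an IPv6 host keeps its inner colons."""
    host, _, port = value.rpartition(":")
    return host, int(port)


print(parse_last_colon("2001:0db8:f00f::0553:1211:0088:10443"))
# ('2001:0db8:f00f::0553:1211:0088', 10443)
print(parse_last_colon("192.168.200.2:10443"))
# ('192.168.200.2', 10443)
```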

This version of the code will hold up for a while, but then one day, you'll start seeing errors like this:

Connect: unable to connect to leader: 2001:0db8:f00f::0553:1211 port 88: connection refused

This one will take a lot of head-scratching to figure out. Hopefully when you see this one, you remember this post and it saves you some time. What happened? Assumptions.

It turned out that the program only did the "split on colon" thing if the string actually HAD a colon in it. There was a standard port the service should run on, which is 10443 for our little story today. You didn't have to pass in the port. You could just give it an IP address or hostname and it would use the default.

So what happened is that someone used a generator to put out something like this:

leader_host = 2001:0db8:f00f::0553:1211:0088

Why? Well, they figured the extra ":port" was stupid, since they were always writing out ":10443", and it's the default, so why bother?

When the consuming program got a hold of it, the "hey, a colon!" code fired off, and stripped off the ":0088" as the port. It so happens that "0088" will wash through a lot of atoi type functions as decimal 88, and so it tried to use that port. That's the wrong port.

But, did you notice the actual IP address is bogus, too? Yep, thanks to the embedded "::" which fills in the center of the address with zeroes, the now-missing ":0088" means the "0553" and "1211" parts of that address get shifted down.

I'll use a monospaced font and fill in all of the zeroes to show what happened:

Intended: 2001:0db8:f00f:0000:0000:0553:1211:0088 with default port
  Result: 2001:0db8:f00f:0000:0000:0000:0553:1211 with port 0088
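
If you want to watch it happen, here's a small sketch that reproduces the failure. The parser is buggy on purpose and its name is invented for this example. Python's standard ipaddress module does the zero-filling.

```python
import ipaddress

DEFAULT_PORT = 10443  # the default from the story


def parse_optional_port(value):
    """Buggy on purpose: anything after the last colon is treated as a port."""
    if ":" not in value:
        return value, DEFAULT_PORT
    host, _, port = value.rpartition(":")
    return host, int(port)


host, port = parse_optional_port("2001:0db8:f00f::0553:1211:0088")
print(host, port)
# 2001:0db8:f00f::0553:1211 88

# With one group gone, "::" now has to stand for three groups of zeroes.
print(ipaddress.IPv6Address(host).exploded)
# 2001:0db8:f00f:0000:0000:0000:0553:1211
```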

The next thing that probably happens is that someone decrees that everyone will now specify IPv6 addresses in strings just like you do in URLs, using brackets. Parsers will be adjusted to only look for a possible ":port" AFTER those brackets.

The config now looks like this...

leader_host = [2001:0db8:f00f::0553:1211:0088]

... or this, specifying a port number:

leader_host = [2001:0db8:f00f::0553:1211:0088]:10443

Everyone changes their generators and parsers and life goes on for a while.
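
A bracket-aware parser could be sketched like this. The function name and the choice to reject unbracketed IPv6 addresses outright are my assumptions, not a description of any particular system.

```python
DEFAULT_PORT = 10443  # the default from the story


def parse_endpoint(value):
    """Parse 'host', 'host:port', '[v6addr]' or '[v6addr]:port'."""
    if value.startswith("["):
        end = value.find("]")
        if end == -1:
            raise ValueError("missing closing bracket: %r" % value)
        host, rest = value[1:end], value[end + 1:]
        if rest == "":
            return host, DEFAULT_PORT
        if not rest.startswith(":"):
            raise ValueError("junk after closing bracket: %r" % value)
        return host, int(rest[1:])
    # Assumption: a bare value with several colons is an unbracketed IPv6
    # address, and refusing it beats guessing which colon starts the port.
    if value.count(":") > 1:
        raise ValueError("IPv6 addresses must be bracketed: %r" % value)
    host, sep, port = value.partition(":")
    return host, (int(port) if sep else DEFAULT_PORT)


print(parse_endpoint("[2001:0db8:f00f::0553:1211:0088]"))
# ('2001:0db8:f00f::0553:1211:0088', 10443)
print(parse_endpoint("[2001:0db8:f00f::0553:1211:0088]:10443"))
# ('2001:0db8:f00f::0553:1211:0088', 10443)
print(parse_endpoint("bigdata.example.org"))
# ('bigdata.example.org', 10443)
```

Failing loudly on the bare form means the ":0088" surprise from earlier turns into an error at config load time instead of a connection to the wrong host.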

Then, one day, someone notices that there are too many connections being established based on the list of hosts. This makes no sense because the list is supposed to have a deduplication pass applied to it. Identical entries should just disappear, leaving just unique entries.

It turns out this is actually true. Identical entries are going away. That is, identical strings are going away. But... it turns out you have many ways to express those IP addresses.

Someone decided to be explicit about everything, and decided to spell out the zeroes. Maybe they didn't like what the "::" did to them before during the whole port number fiasco that we just talked about. They started generating this:

leader_host = [2001:0db8:f00f:0000:0000:0553:1211:0088]

That gave them a host string of "2001:0db8:f00f:0000:0000:0553:1211:0088". The existing entry is "2001:0db8:f00f::0553:1211:0088". Those strings aren't identical. They aren't even the same length! So, that must represent two different hosts and we should probably connect to both of them.

An intern is given this problem, and decides to fix it by simply turning all instances of "::" into the requisite number of "0000" groups. They call it "zero injection". (Meanwhile, another intern at another company is doing it the other way, calling it "zero squashing". This doesn't matter at all until years later when the two companies merge, but we'll leave that story for another day.)

"Zero injection" solves this problem, and life goes on.

Then, one day, it starts happening again. It's again about zeroes, but in a different way. Someone's been making config strings like this:

leader_host = [2001:0db8:f00f:0:0:553:1211:88]

Nothing says you have to zero-pad those things out to four hex characters, after all. 88, 088, and 0088 all mean the same thing: 0x88.

This sort of difference can come about from someone using printf, and person A uses %04x (thus zero-padding out to 4 places), while person B uses just %x. A and B look the same as long as you have values at or above 0x1000, but once you drop below it, things get interesting.
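
Python's % operator takes the same format codes, so here's a quick sketch of person A and person B at work. The list of group values is just the example address from this post written out as integers. Note that plain %x also drops the leading zero from "0db8".

```python
# The example address, one 16-bit group at a time.
groups = [0x2001, 0x0DB8, 0xF00F, 0x0000, 0x0000, 0x0553, 0x1211, 0x0088]

person_a = ":".join("%04x" % g for g in groups)  # zero-padded to 4 places
person_b = ":".join("%x" % g for g in groups)    # no padding

print(person_a)  # 2001:0db8:f00f:0000:0000:0553:1211:0088
print(person_b)  # 2001:db8:f00f:0:0:553:1211:88
print(person_a == person_b)  # False
```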

Someone else gets roped into fixing this. Since they're already doing "zero injection", they follow that same logic and just re-write ALL of the address parts to make sure they are always a fixed width using leading zeroes.

Things are pretty good. But you know we can't be done yet.

Did you notice that these addresses are hex, and thus contain six letters? I'm talking about how you represent the values 10 through 15. You use a, b, c, d, e, f. Or... maybe you use A, B, C, D, E and F.

So, yes, this happens, eventually:

leader_host = [2001:0DB8:F00F:0000:0000:0553:1211:0088]

Even though it's been both "zero injected" and "zero padded", there's still the little matter that "0db8" and "0DB8" don't match when you're talking about dumb string comparisons. Oops.

This, too, can come about from someone using different printf format strings. %04x gives one, and %04X gives the other. getaddrinfo() doesn't care, but your de-duplication string compares sure do!
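
One way out of chasing these one at a time is to stop comparing the strings people typed and compare parsed addresses instead. This is a sketch using Python's standard ipaddress module. The helper name is hypothetical, and lowercasing anything that isn't an IP address is my own assumption about how hostnames should be treated.

```python
import ipaddress


def canonical_host(host):
    """Return one spelling per address; pass non-address hostnames through."""
    try:
        return ipaddress.ip_address(host).compressed
    except ValueError:
        # Not an IP literal, so assume it is a hostname (assumption).
        return host.lower()


hosts = [
    "2001:0db8:f00f::0553:1211:0088",           # "::" shorthand
    "2001:0db8:f00f:0000:0000:0553:1211:0088",  # zeroes spelled out
    "2001:0db8:f00f:0:0:553:1211:88",           # no zero padding
    "2001:0DB8:F00F:0000:0000:0553:1211:0088",  # upper-case hex
]

unique = {canonical_host(h) for h in hosts}
print(unique)  # {'2001:db8:f00f::553:1211:88'}
```

All four spellings collapse to a single entry, which is what the deduplication pass wanted in the first place.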

All of this is just one way to discover the data parsing and representation bugs lurking in your system. You will also trip over this when it comes time to look for things in log files or in databases if you're using strings to hold these addresses. Imagine this SQL query:

SELECT auth_username FROM ssh_log WHERE host_ip = x;

Just how is 'x' going to be formatted, anyway? Does it have runs of zeroes squished out with ::? Do the numbers have leading zeroes? Does it use A-F or a-f? Or does it mix them up in the same thing? Did they wrap it with [] because someone said you should always store them that way after that one outage a long time ago?
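
One possible answer is to not store the string at all. This sketch keeps the address as its packed bytes (16 of them for IPv6) in an in-memory SQLite database. The table and column names come from the query above; the helper name and the "alice" row are made up for the example.

```python
import ipaddress
import sqlite3


def ip_key(text):
    """Turn any spelling of an address into its packed bytes."""
    return ipaddress.ip_address(text.strip("[]")).packed


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ssh_log (auth_username TEXT, host_ip BLOB)")

# Logged with brackets, upper-case hex, and every zero spelled out.
db.execute(
    "INSERT INTO ssh_log VALUES (?, ?)",
    ("alice", ip_key("[2001:0DB8:F00F:0000:0000:0553:1211:0088]")),
)

# Queried with the short lower-case spelling.
rows = db.execute(
    "SELECT auth_username FROM ssh_log WHERE host_ip = ?",
    (ip_key("2001:db8:f00f::553:1211:88"),),
).fetchall()
print(rows)  # [('alice',)]
```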

Strings are hard!

By the way, none of this is intended to scare people off IPv6. You'll get there eventually. Hopefully these tales just remind you to make the right choices when you do go for it.