The dangers of streaming across versions of glibc: A cautionary tale

Lists: pgsql-general
From: Matthew Kelly <mkelly(at)tripadvisor(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Cc: Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-06 21:24:17
Message-ID: BA6132ED-1F6B-4A0B-AC22-81278F5AB81E@tripadvisor.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

The following is a real critical problem that we ran into here at TripAdvisor, but have yet figured out a clear way to mitigate.

TL;DR:
Streaming replicas—and by extension, base backups—can become dangerously broken when the source and target machines run slightly different versions of glibc. Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on the slave. These indexes are sorted out of order with respect to the strcoll running on the slave. Because postgres is unaware of the discrepancy is uses these "corrupt" indexes to perform merge joins; merges rely heavily on the assumption that the indexes are sorted and this causes all the results of the join past the first poison pill entry to not be returned. Additionally, if the slave becomes master, the "corrupt" indexes will in cases be unable to enforce uniqueness, but quietly allow duplicate values.

Context:
We were doing a hardware upgrade on a large internal machine a couple months ago. We followed a common procedure here: stand up a the new HA pair as streaming replica's of the old system; then failover to the new pair. All systems involved were running 9.1.9 (though that is not relevant as we'll see), and built from source.

Immediately, after the failover we saw some weird cases with some small indexes. We thought it was because the streaming replication failover had gone poorly (and because we weren't running latest version of postgres on that machine), so we rebuilt them and moved on. Until last week when an important query stopped getting optimized as a hash join and turned into a merge join. From that query I generated a simple, single column join between two tables. That query returns 50 million rows with merge joins disabled and 0 rows with them enabled. Rebuilding the index fixed the issue, but this was an important table and so we did some digging.

Using some query optimizer coercion, I was able to show that 1. the "corrupt" index had the same number of rows as the table, and 2. the index returned rows in a different, but nearly identical ordering to the one that you would receive by explicitly sorting the column. Taking a pair of rows that were out of place, I manage to narrow the issue down. Luckily, we able to find the old server sitting on the floor. The simplest form of this issue is:

SELECT 'M' > 'ஐ';

Root cause:
Depending on your charset the first character might look like an ascii 'M'. It is not. The two characters in question are the utf8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other or if you run an initdb and then immediately start postgres.

Digging further lead me to: master:src/backend/utils/adt/varlena.c:1494,1497 These are the lines where postgres calls strcoll_l and strcoll, in order to sort strings in a locale aware manner.

In the simplest case, the attached c file returns inconsistent results across glibc versions and environments. It just sets the collation to 'en_US.UTF-8', and then compares two one character strings (the characters above). Depending on the version of glibc you are running, you may see positive, negative or zero as the output. I have observed:

Old Server (CentOS 5.8, kernel 2.6.18-308.24.1.el5):
0 -> glibc-2.5-81.el5_8.7
1 -> glibc-devel-2.5-81.el5_8.7 statically linked
0 -> Source builds of glibc (2.5, 2.6, 2.10)

New Server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64)
-1 -> glibc-2.12-1.132.el6.x86_64

Dev server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64)
-1 -> glibc-2.12-1.107.el6_4.5.x86_64
-1 -> Source build (2.12.2, 2.16, 2.18)
15 -> Source build (HEAD)

Laptop (Ubuntu, kernel 3.11.0-12-generic)
-1 -> 2.17-93ubuntu4
15 -> 2.17-93ubuntu4 statically linked

Mac OS (For comparison only)
62365 -> OSX 10.8, 10.9

From my digging, I have been unable to figure out why glibc is returning different results in different situations. It is probably worth getting a discussion going on their mailing lists as well.

Regardless, the reality is that there are different versions of glibc out there in the wild, and they do not sort consistently across versions/environments. Streaming replica's rely on the assumption that the sort order within a collation is consistent across machines. If they differ ever so slightly, then a single pair of rows with poison pill characters can yield an entire index invalid. We were lucky that the first discrepancy was at the beginning of the index. If it was 85% of the way through then we probably would never have noticed that merge joins were broken on that machine.

We still are discussing internally how we plan to prevent this in the future, but we'd like to open this up to the community for wider discussion.

- Matt K, TripAdvisor

Attachment Content-Type Size
localetest.c application/octet-stream 387 bytes

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Matthew Kelly <mkelly(at)tripadvisor(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 00:11:37
Message-ID: 20140807001137.GA14724@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Aug 6, 2014 at 09:24:17PM +0000, Matthew Kelly wrote:
> The following is a real critical problem that we ran into here at TripAdvisor,
> but have yet figured out a clear way to mitigate.
>
> TL;DR:
> Streaming replicas—and by extension, base backups—can become dangerously broken
> when the source and target machines run slightly different versions of glibc.
> Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on
> the slave. These indexes are sorted out of order with respect to the strcoll
> running on the slave. Because postgres is unaware of the discrepancy is uses
> these "corrupt" indexes to perform merge joins; merges rely heavily on the
> assumption that the indexes are sorted and this causes all the results of the
> join past the first poison pill entry to not be returned. Additionally, if the
> slave becomes master, the "corrupt" indexes will in cases be unable to enforce
> uniqueness, but quietly allow duplicate values.

No surprise; I have been expecting to hear about such breakage, and am
surprised we hear about it so rarely. We really have no way of testing
for breakage either. :-(

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Matthew Kelly <mkelly(at)tripadvisor(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 01:12:53
Message-ID: CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYVbwoU79=TuP8HE3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> No surprise; I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely. We really have no way of testing
> for breakage either. :-(

I guess that Trip Advisor were using some particular collation that
had a chance of changing. Sorting rules for English text (so, say,
en_US.UTF-8) are highly unlikely to change. That might be much less
true for other locales.

Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

So, the reality is that we only have ourselves to blame. :-(

LC_IDENTIFICATION serves this purpose on glibc. Here is what en_US
looks like on my machine:

"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper(at)redhat(dot)com>, 2000

LC_IDENTIFICATION
title "English locale for the USA"
source "Free Software Foundation, Inc."
address "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact ""
email "bug-glibc-locales(at)gnu(dot)org"
tel ""
fax ""
language "English"
territory "USA"
revision "1.0"
date "2000-06-24"
%
category "en_US:2000";LC_IDENTIFICATION
category "en_US:2000";LC_CTYPE
category "en_US:2000";LC_COLLATE
category "en_US:2000";LC_TIME
category "en_US:2000";LC_NUMERIC
category "en_US:2000";LC_MONETARY
category "en_US:2000";LC_MESSAGES
category "en_US:2000";LC_PAPER
category "en_US:2000";LC_NAME
category "en_US:2000";LC_ADDRESS
category "en_US:2000";LC_TELEPHONE
*** SNIP ***
"""

This is a GNU extension [1]. If the OS adds a new version of a
collation, that probably accidentally works a lot of the time, because
the collation rule added or removed was fairly esoteric anyway, such
is the nature of these things. If it was something that came up a lot,
it would surely have been settled by standardization years ago.

If OS vendors are not going to give us a standard API for versioning,
we're hosed. I thought about suggesting that we hash a strxfrm() blob
for about 2 minutes, before realizing that that's a stupid idea. Glibc
would be a good start.

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html
--
Regards,
Peter Geoghegan


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: peter(dot)geoghegan86(at)gmail(dot)com
Cc: bruce(at)momjian(dot)us, mkelly(at)tripadvisor(dot)com, pgsql-general(at)postgresql(dot)org, mspilich(at)tripadvisor(dot)com
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 01:30:48
Message-ID: 20140807.103048.1016702529143996447.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> Over time, collation order will vary: there may be fixes needed as
> more information becomes available about languages; there may be new
> government or industry standards for the language that require
> changes; and finally, new characters added to the Unicode Standard
> will interleave with the previously-defined ones. This means that
> collations must be carefully versioned.

Another idea could be having our own collation data to isolate any
changes from outside world. I vaguley recall this had been discussed
before.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Matthew Kelly <mkelly(at)tripadvisor(dot)com>, PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 01:46:49
Message-ID: CAEYLb_WvdCzuL=Cyf1xyzjwn-1CVo6kZEaWMKbxTS3jPhtjOig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Aug 6, 2014 at 6:30 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> Another idea could be having our own collation data to isolate any
> changes from outside world. I vaguley recall this had been discussed
> before.

That's probably the best solution. It would not be the first time that
we decided to stop relying on the operating system's facilities due to
various problems (e.g. we used to use the C standard library qsort()
until about 2006). The only problem is that it's a lot of work. One
possible solution that has been proposed is to adopt ICU [1]. That
might allow us to say "this is the official way that PostgreSQL 9.6
sorts Japanese; you may use the old way if you want, but it's
incompatible with the new way". ICU would give us a standard
versioning interface [2]. They seem to take this seriously, and are
aware of our considerations around B-Tree indexes on text.

[1] https://wiki.postgresql.org/wiki/Todo:ICU
[2] http://userguide.icu-project.org/collation/architecture#TOC-Versioning
--
Regards,
Peter Geoghegan


From: Matthew Kelly <mkelly(at)tripadvisor(dot)com>
To: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 15:07:04
Message-ID: 1A97C54F-1AF3-4C11-8EEA-F53EEFD5B91F@tripadvisor.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

We are currently running with the en_US.UTF-8 collation. It was a decision made long ago, and seeing as we never actually rely on the sort order of internationalized strings (other than for stability, apparently), we have never had any motivation to change this practice.

Some way of versioning collations, which is not tied to glibc seems immensely appealing. Without a good way of testing the differences between glibc sort versions, it seems the only safe thing to do at the moment is to guarantee all streaming replica's run from the exact same OS image. Which is fine until you want to upgrade your OS, and need to do a dump-restore instead of being able to do that in a rolling fashion.

To Bruce's point the way I was able to test for this issue in a particular index was (approximately):
--Assuming textfield is what the index is on, this causes the query planner to scan the index and give the position in the index.
CREATE TABLE index_order (SELECT textfield, dense_rank() OVER (ORDER BY textfield) as i_order FROM table);
--No index here, postgres must sort
CREATE TABLE both_order as (SELECT textfield, i_order, dense_rank() OVER (ORDER BY textfield) as sort_order FROM index_order);
-- If this doesn't return zero, you have a problem
SELECT count(*) FROM both_orders WHERE i_order <> sort_order;

This method is really slow on a big table, and I'm not going to promise it always works, but that is how we found the root cause.

- Matt K

On Aug 6, 2014, at 9:46 PM, Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
wrote:

> On Wed, Aug 6, 2014 at 6:30 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> Another idea could be having our own collation data to isolate any
>> changes from outside world. I vaguley recall this had been discussed
>> before.
>
> That's probably the best solution. It would not be the first time that
> we decided to stop relying on the operating system's facilities due to
> various problems (e.g. we used to use the C standard library qsort()
> until about 2006). The only problem is that it's a lot of work. One
> possible solution that has been proposed is to adopt ICU [1]. That
> might allow us to say "this is the official way that PostgreSQL 9.6
> sorts Japanese; you may use the old way if you want, but it's
> incompatible with the new way". ICU would give us a standard
> versioning interface [2]. They seem to take this seriously, and are
> aware of our considerations around B-Tree indexes on text.
>
> [1] https://wiki.postgresql.org/wiki/Todo:ICU
> [2] http://userguide.icu-project.org/collation/architecture#TOC-Versioning
> --
> Regards,
> Peter Geoghegan


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Matthew Kelly <mkelly(at)tripadvisor(dot)com>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 16:46:14
Message-ID: 20140807164614.GC14724@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 7, 2014 at 03:07:04PM +0000, Matthew Kelly wrote:
> We are currently running with the en_US.UTF-8 collation. It was a decision made long ago, and seeing as we never actually rely on the sort order of internationalized strings (other than for stability, apparently), we have never had any motivation to change this practice.
>
> Some way of versioning collations, which is not tied to glibc seems immensely appealing. Without a good way of testing the differences between glibc sort versions, it seems the only safe thing to do at the moment is to guarantee all streaming replica's run from the exact same OS image. Which is fine until you want to upgrade your OS, and need to do a dump-restore instead of being able to do that in a rolling fashion.
>
>
>
> To Bruce's point the way I was able to test for this issue in a particular index was (approximately):
> --Assuming textfield is what the index is on, this causes the query planner to scan the index and give the position in the index.
> CREATE TABLE index_order (SELECT textfield, dense_rank() OVER (ORDER BY textfield) as i_order FROM table);
> --No index here, postgres must sort
> CREATE TABLE both_order as (SELECT textfield, i_order, dense_rank() OVER (ORDER BY textfield) as sort_order FROM index_order);
> -- If this doesn't return zero, you have a problem
> SELECT count(*) FROM both_orders WHERE i_order <> sort_order;
>
> This method is really slow on a big table, and I'm not going to promise it always works, but that is how we found the root cause.

We could walk the index looking for inconsistent btree splits, e.g. the
split doesn't match the ordering returned by the existing collation
functions.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Matthew Kelly <mkelly(at)tripadvisor(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>, Matthew Spilich <mspilich(at)tripadvisor(dot)com>
Subject: Re: The dangers of streaming across versions of glibc: A cautionary tale
Date: 2014-08-07 22:10:33
Message-ID: CAEYLb_X6-ikGf00zsV3MzwHp81boONx3LK8Nd3dq2xdn=16hmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 7, 2014 at 9:46 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> We could walk the index looking for inconsistent btree splits, e.g. the
> split doesn't match the ordering returned by the existing collation
> functions.

I'm not sure I follow. I don't think that a tool like my btreecheck
tool will necessarily be able to catch anything like this on a
standby. Maybe it will, but that isn't guaranteed. For example, the
difference in collation rules in question might just not have cropped
up yet, but it's still a ticking time-bomb. Or, there are only
differences affecting values on internal pages. Things break down very
quickly.

In general, once there is an undetected inconsistency in collations
between replicas, that means that the varlena B-Tree support function
number 1 can violate various invariants that all operator classes must
obey. I doubt we want to get into the business of working backwards
from a broken state of affairs like that to figure out there is a
problem. Rather, I really do think we're compelled to offer better
versioning of collations using a versioning interface like Glibc's
LC_IDENTIFICATION. There is no way other way to properly fix the
problem. This is a problem that is well understood, and anticipated by
the Unicode consortium.

--
Regards,
Peter Geoghegan