We have some shiny new machines which have 16 disks in a 3U chassis, which is a pretty tight fit. In fact we re-jigged them so that instead of being wired with one block of eight in one RAID set and the other block of eight in another, we’ve put alternate disks in different RAID sets so that it matters less if a disk messes with a neighbour when it is being removed. (Two degraded RAID sets rather than one completely shafted one.) We’re going to use four of them as the 4th generation Cyrus machines.
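The interleaved layout can be sketched as follows (the slot numbering and set names here are illustrative, not our actual controller configuration):

```python
# Hypothetical sketch of the two wiring schemes for 16 disk slots.
slots = list(range(16))

# Old scheme: two contiguous blocks of eight.
blocked = {"set A": slots[:8], "set B": slots[8:]}

# New scheme: alternate slots, so physical neighbours are always in
# different RAID sets.
interleaved = {"set A": slots[0::2], "set B": slots[1::2]}

# Pulling slot 5 can disturb neighbours 4 and 6; in the interleaved
# layout those belong to the other set, so each set degrades by at
# most one disk instead of one set losing several at once.
print(interleaved["set A"])  # → [0, 2, 4, 6, 8, 10, 12, 14]
```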
As usual for us they have Intel mobos and LSI battery-backed RAID controllers - though unlike our older RAID cards, these ones have been re-badged by Intel and have a more graphical user interface, neither of which is an advantage. My colleague David has been doing our usual paranoid power-loss sanity checks, and it turns out that when these new machines find unwritten data in the write cache at boot time, they say “ooh look! unwritten data. I will log this fact and then throw it away”. D’oh!
Further investigation has revealed that this seems to be an incompatibility between the motherboard and the RAID card. The exact nature of the incompatibility is unclear, and it looks like we might have to ditch LSI and replace the cards with 3ware ones instead. This will be a particular pain because some of these new machines have already gone into service, e.g. for our local free software mirror site.
This reminded David that some of our Cyrus machines are now getting a bit old, and are now past the advertised lifetime of the RAID batteries, so he has been doing power-off tests. Fortunately the batteries seem to be surviving well. However he discovered that our 3rd generation Cyrus servers (16 disks in 4U) seem to be suffering from the same RAID/mobo incompatibility as the new machines. They worked OK originally, but required a mobo replacement some time back, and we didn’t expect this to stop the RAID working. Restoring these machines is NOT an easy job:
email@example.com:~ : 0 ; df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       2.0G  970M  951M  51% /
tmpfs           1.8G  8.0K  1.8G   1% /dev/shm
/dev/md0        3.3T  1.5T  1.8T  46% /spool
/dev/sdb1       2.0G  289M  1.6G  16% /var

Their networking is 100Mb/s, which means it will take at least 1.5 days to get the data back on to the machine after replacing the RAID card. A gigabit upgrade is now being planned...
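The back-of-the-envelope arithmetic behind that figure, taking the 1.5T of spool data in the df output as binary terabytes and the link speeds as nominal wire rates (real transfers will be slower still, once protocol overhead is counted):

```python
# Rough restore-time estimate for refilling the spool over the network.
spool_bytes = 1.5 * 2**40            # 1.5T used on /spool (binary TB)
for name, bits_per_sec in [("100Mb/s", 100e6), ("1Gb/s", 1e9)]:
    seconds = spool_bytes / (bits_per_sec / 8)
    print(f"{name}: {seconds / 86400:.1f} days")
# → 100Mb/s: 1.5 days
# → 1Gb/s: 0.2 days
```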