4K-sector drives and Linux

By Jonathan Corbet
March 9, 2010
Almost exactly one year ago, LWN examined the problem of 4K-sector drives and the reasons for their existence. In short, going to 4KB physical sectors allows drive manufacturers to increase storage density, always welcome in that competitive market. Recently, there have been a number of reports that Linux is not ready to work with these drives; kernel developer Tejun Heo even posted an extensive, worth-reading summary stating that "4 KiB logical sector support is broken in both the kernel and partitioners." As the subsequent discussion revealed, though, the truth of the matter is that we're not quite that badly prepared.

Linux is fully prepared for a change in the size of physical sectors on a storage device, and has been for a long time. The block layer was written to avoid hardwired sector sizes. Sector counts and offsets are indeed managed as 512-byte units at that level of the kernel, but the block layer is careful to perform all I/O in units of the correct size. So, one would hope, everything would Just Work.

But, as Tejun's document notes, "unfortunately, there are complications." These complications result from the fact that the rest of the world is not prepared to deal with anything other than 512-byte sectors, starting with the BIOS found on almost all systems. In fact, a BIOS which can boot from a 4K-sector drive is an exceedingly rare item - if, indeed, it exists at all. Fixing the BIOS is evidently harder than one might think, and there appears to be little motivation to do so. Martin Petersen, who has done much of the work around supporting these drives in Linux, noted:

Part of the hesitation to work on booting off of 4 KB lbs drives is motivated by a general trend in the industry to move boot functionality to SSD. There are 4 KB LBS SSDs out there but in general the industry is sticking to ATA for local boot.

The problem does not just exist at the BIOS level: bootloaders (whether they are Linux-oriented or not) are not set up to handle larger sectors; neither are partitioning tools, not to mention a wide variety of other operating systems. Something must be done to enable 4K-sector drives to work with all of this software.

That something, of course, is to interpose a mapping layer in the middle. So most 4K-sector drives will implement separate logical and physical sector sizes, with the logical size - the one presented to the host computer - remaining 512 bytes. The system can then pretend that it's dealing with the same kind of hardware it has always dealt with, and everything just works as desired.

Except that, naturally enough, there are complications. A 512-byte sector written to a 4K-sector drive will now force the drive to perform a read-modify-write cycle to avoid losing the data in the rest of the sector. That slows things down, of course, and also increases the risk of data loss should something go wrong in the middle. To avoid this kind of problem, the operating system should do transfers that are a multiple of the physical sector size whenever possible. But, to do that, it must know the physical sector size. As it happens, that information has been made available; the kernel makes use of this information internally and exports it via sysfs.
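As a rough illustration of the sysfs export mentioned above, the following sketch reads the logical and physical sector sizes for a device. It assumes the per-device queue/logical_block_size and queue/physical_block_size attributes found under /sys/block on recent kernels; the function name read_topology and its sysfs_root parameter are hypothetical, introduced here only so the example can be pointed at a different directory.

```python
from pathlib import Path


def read_topology(device, sysfs_root="/sys/block"):
    """Return the logical and physical sector sizes (in bytes) of a block device.

    `device` is a name such as "sda"; `sysfs_root` is a hypothetical knob
    that lets the function be pointed at a copy of the sysfs tree.
    """
    queue = Path(sysfs_root) / device / "queue"
    return {
        # Sector size the drive presents to the host.
        "logical": int((queue / "logical_block_size").read_text()),
        # Sector size the drive actually uses on the media.
        "physical": int((queue / "physical_block_size").read_text()),
    }
```

On a drive that emulates 512-byte sectors on top of 4K physical sectors, one would expect read_topology("sda") to report a logical size of 512 and a physical size of 4096.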

It is not quite that simple, though. The Linux kernel can go out of its way to use the physical sector size, and to align all transfers on 4KB boundaries from the beginning of the partition. But that goes badly wrong if the partition itself is not properly aligned; in this case, every carefully-arranged 4KB block will overlap two physical sectors - hardly an optimal outcome.
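The overlap can be worked through with a little arithmetic. The sketch below is only an illustration, not kernel code: it counts how many physical sectors a single filesystem block touches, given the partition's starting logical sector. The function name and its defaults (512-byte logical sectors, 4096-byte physical sectors and blocks, no device-level offset) are assumptions made for the example.

```python
def physical_sectors_touched(partition_start_lba, block_index,
                             block_size=4096, logical=512, physical=4096):
    """Count the physical sectors covered by one filesystem block.

    Assumes the device's first logical sector sits at the start of a
    physical sector (no offset hack).
    """
    # Byte position of the block, measured from the start of the device.
    first_byte = partition_start_lba * logical + block_index * block_size
    last_byte = first_byte + block_size - 1
    # Number of physical sectors spanned by [first_byte, last_byte].
    return last_byte // physical - first_byte // physical + 1
```

With a partition starting at logical sector 63, every 4KB block straddles two physical sectors; starting at sector 64, each block maps onto exactly one.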

As it happens, badly-aligned partitions are not just common; they are the norm. Consider an example: your editor was a lucky recipient of an Intel solid-state drive at the Kernel Summit which was quickly plugged into his system and partitioned for use. It has been a great move: git repositories on an SSD are much nicer to work with. A quick look at the partition table, though, shows this:

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x5361058c

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1              63    52452224    26226081   83  Linux

Note that fdisk, despite having been taken out of the "DOS compatibility" mode, is displaying the drive dimensions in units of heads and cylinders. Needless to say, this device has neither; even on rotating media, those numbers are entirely fictional; they are a legacy from a dark time before Linux even existed. But that legacy is still making life difficult now.

Once upon a time, it was determined that 63 (512-byte) sectors was far more than anybody would be able to fit into a single disk track. Since track-aligned I/O is faster on a rotating drive, it made sense to align partitions so that the data began at the beginning of a track. So, traditionally, the first partition on a drive begins at (logical) sector 63, the last sector of the first track. That sector holds the boot block; any filesystem stored on the partition will follow at the beginning of the next track. That placement, of course, misaligns the filesystem with regard to any physical sector size larger than 512 bytes; logical sector 64 (the first data sector in the partition) will be placed at the end of a 4K physical sector. Any subsequent partitions on the device will almost certainly be misaligned in the same way.

One might argue that the right thing to do is to simply ditch this particular practice and align partitions properly; it should not be all that hard to teach partitioning tools about physical sector sizes. This can certainly be done. The tools have been slow to catch on, but a suitably motivated system administrator can usually convince them to place partitions sensibly even now. So weird alignments should not be an insurmountable problem.
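What "teaching a partitioning tool about physical sector sizes" amounts to can be sketched in a few lines: round the requested starting sector up to the next physical-sector boundary. This is a minimal sketch under stated assumptions, not how any particular tool does it; align_up is a hypothetical name, and the calculation assumes the device's logical sector 0 is itself aligned.

```python
def align_up(start_lba, logical=512, physical=4096):
    """Round a starting logical sector up to a physical-sector boundary.

    Assumes logical sector 0 begins a physical sector.
    """
    per_physical = physical // logical  # logical sectors per physical sector
    # Ceiling division, then scale back to a logical sector number.
    return -(-start_lba // per_physical) * per_physical
```

With the defaults, a requested start of 63 is moved to 64, while a start of 64 is left alone.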

Unfortunately, there are complications. It would appear that Windows XP not only expects misaligned partitions; it actually will not function properly without them. One simply cannot run XP on a device which has been properly partitioned for 4K physical sector sizes. To cope with that, drive manufacturers have introduced an even worse hack: shifting all 512-byte logical sectors forward by one, so that logical sector 64 lands at the beginning of a physical sector. So any partitioning tool which wants to lay things out properly must know where the origin of the device actually is - and not all devices are entirely forthcoming with that information.

With luck, the off-by-one problem will go away before it becomes a big issue. As James Bottomley put it: "...fortunately very few of these have been seen in the wild and we're hopeful they can be shot before they breed." But that doesn't fix the problem with the alignment of partitions for use by XP. Later versions of Windows need not concern themselves with this problem, since they rarely coexist with XP (and Windows has never been greatly concerned about coexistence with other systems in general). Linux, though, may well be installed on the same drive as XP; that leads to differing alignment requirements for different partitions. Making that just work is not going to be fun.

Martin suggests that it might be best to just ignore the XP issue:

With regards to XP compatibility I don't think we should go too much out of our way to accommodate it. XP has been disowned by its master and I think virtualization will take care of the rest.

It may well be that there will not be a significant number of XP installations on new-generation storage devices, but failure to support XP may still create some misery in some quarters.

A related issue pointed out by Tejun is that the DOS partition format, which is still widely used, tops out at 2TB, which just does not seem all that large anymore. Using 4K logical sectors in the partition table can extend that limit as far as 16TB, but, again, that requires cooperation from the BIOS - and it still does not seem all that large. The long-term solution would appear to be moving to a partition format like GPT, but that is not likely to be an easy migration.
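The 2TB and 16TB figures follow from the 32-bit sector numbers used in the DOS partition table. The sketch below just reproduces that arithmetic; mbr_limit_bytes is a hypothetical helper, and it treats the limit as roughly 2^32 sectors, ignoring corner cases in how start and length fields combine.

```python
MBR_LBA_BITS = 32  # DOS partition entries store 32-bit sector numbers


def mbr_limit_bytes(logical_sector_size):
    """Approximate capacity addressable through a DOS partition table."""
    return (2 ** MBR_LBA_BITS) * logical_sector_size


TIB = 2 ** 40
print(mbr_limit_bytes(512) // TIB)   # 512-byte logical sectors
print(mbr_limit_bytes(4096) // TIB)  # 4K logical sectors
```

Note that the larger figure depends on the logical sector size, so a drive which keeps 512-byte logical sectors on top of 4K physical sectors stays at the lower limit.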

In summary: Linux is not all that badly placed to support 4K-sector drives, especially when there is no need to share a drive with older operating systems. There is still work required at the tools level to make that support work optimally without the need for low-level intervention by system administrators, but that is, as they say, just a matter of a bit of programming. As these drives become more widely available, we will be able to make good use of them.

Index entries for this article
KernelBlock layer/Large physical sectors


4K-sector drives and Fedora

Posted Mar 9, 2010 23:24 UTC (Tue) by rahulsundaram (subscriber, #21946) [Link]

This comment adds a lot of important details

http://www.osnews.com/thread?409410

4K-sector drives and Fedora

Posted Mar 10, 2010 13:33 UTC (Wed) by Trelane (subscriber, #56877) [Link]

Interesting link; thanks! :)

4K-sector drives and Fedora

Posted Mar 11, 2010 16:28 UTC (Thu) by msnitzer (subscriber, #57232) [Link]

Here is another link that should give more insight on how most of the Linux I/O
stack has already been updated upstream and will be included in Fedora 13:
http://lkml.org/lkml/2010/3/11/230

I think it is unfortunate that the article took the time to highlight Linux's preparedness for 4K sector drives but failed to accurately convey the true state of the various pieces involved, even inaccurately concluding that there is much work ahead for the various Linux tools.

The article focused on Tejun's misunderstanding and worked from there rather
than incorporating the more promising state of Linux's preparedness that was
revealed in reply to Tejun's post.

4K-sector drives and Fedora

Posted Mar 11, 2010 16:45 UTC (Thu) by corbet (editor, #1) [Link]

Interesting...I thought I wrote an article saying that, Tejun's worries notwithstanding, we're not in all that bad a shape. As far as I can tell, the only thing I got really wrong was regarding XP, which can handle things better than I had thought. As for the rest...what does it take to make a partitioning utility a little smarter?

Anyway, my apologies if you feel I misrepresented the situation.

4K-sector drives and Fedora

Posted Mar 11, 2010 17:09 UTC (Thu) by msnitzer (subscriber, #57232) [Link]

Hey Jon,

Not a big deal.. I was just saying that many of the partition tools and others have
already been updated. The "what does it take" to update them is somewhat
moot; as it has been done. Updating LVM was a bit more involved than
mkfs.ext3. Updating virtio and qemu was also somewhat intrusive.

But for those tools that haven't been updated they will either need to use libblkid
(like e2fsprogs does) or use the new "IO topology" block ioctls. Please see this
for more info (contains the specific ioctls and more):
http://people.redhat.com/msnitzer/docs/io-limits.txt

4K-sector drives and Fedora

Posted Mar 11, 2010 18:47 UTC (Thu) by ricwheeler (subscriber, #4980) [Link]

Thanks for the article!

One thing that seems to have been skipped in the discussion so far is that this is not just an issue with local, 4KB sector drives. The changes we have in the kernel and in the tool chain will help with external arrays which have long had larger internal "sectors" but pretended to have 512 byte sectors like a local disk.

The larger impact of the change is that it should all "just work" if we got all of the bits in place correctly :-)

Testing on a variety of storage hardware from various vendors, with and without DM and MD, is really, really interesting right now to help us uncover any bits we did miss.

4K-sector drives and Linux

Posted Mar 9, 2010 23:43 UTC (Tue) by pheldens (guest, #19366) [Link]

I just got one of these drives, a 1TB WD10EARS. I intend to put it in software RAID 1 with a WD10EACS; are there caveats there? I am using stock Linux 2.6.33.

RAID: was 4K-sector drives and Linux

Posted Mar 10, 2010 2:36 UTC (Wed) by smoogen (subscriber, #97) [Link]

My guess would be that you would see poor performance on the RAID-1, similar to pairing a 5400 RPM and a 10k RPM drive, or two completely different disks (like in the old days of one disk saying it had 16 platters and another one saying it had 4; I/O to them is very different). One is always going to have different I/O performance than the other, so sometimes things would be fast and at other times really slow.

But that would be my guess. If it's different, let us know.

4K-sector drives and Linux

Posted Mar 10, 2010 21:21 UTC (Wed) by pheldens (guest, #19366) [Link]

I did some tests with the EARS and the effect described in the article is very noticeable.

Default fdisk misaligns the sdb1 data area (63)
changing it to 64 makes formats 20-30% faster
changing it to 65-71 slower again,
changing it to 72 faster again. (+8 presumably 4096/512)

Will see what happens when I put it in an array aligned

4K-sector drives and Linux

Posted Mar 9, 2010 23:54 UTC (Tue) by pheldens (guest, #19366) [Link]

the article linked above suggests:
"...For /dev/sdc, I used fdisk the same as with sdd, but after creating the partition, I realigned it. I did this by entering expert mode ("x"), then setting the start sector ("b") to 64."
for about twice the write performance, compared to defaults (63).

thanks for this important tip.

4K-sector drives and Linux

Posted Mar 10, 2010 9:25 UTC (Wed) by Darkmere (subscriber, #53695) [Link]

Most _recent_ tools (parted 2.1, and util-linux-ng as of... *tries to remember* some really new release; Fedora 13 has it, 12 does not) work with the new disks perfectly as well.

parted even goes as far as to yell at you if you have the wrong alignment, and offers to fix it for you.

However, you may have to use % based partitions for parted to be able to fix the auto-alignment, otherwise it complains.

mkpart primary ext3 0% +200M
mkpart primary ext4 200M 100%
Or however you want to work things.

4K-sector drives and Linux

Posted Mar 18, 2010 20:38 UTC (Thu) by till (guest, #50712) [Link]

gdisk aka GPT fdisk ( http://www.rodsbooks.com/gdisk/whygdisk.html ) on Fedora 12 seems to work nicely, too. It aligns data by default and does not irritate with obscure old data like CHS and the partitions seem to be supported by linux. I don't care about Windows, though.

4K-sector drives and Linux

Posted Mar 11, 2010 16:15 UTC (Thu) by jackb (guest, #41909) [Link]

So if you don't need to worry about non-linux operating systems will this procedure fix the problem for every 4K sector drive except those that offset by one?

4K-sector drives and Linux

Posted Mar 10, 2010 9:25 UTC (Wed) by ringerc (subscriber, #3071) [Link]

"...fortunately very few of these have been seen in the wild and we're hopeful they can be shot before they breed"

I have six, and another six on back-order :-(

The WD Green series (at least the 1TB drive WDC WD10EARS-00Y5B1) have 4kb sectors and offset-by-one. Thankfully, that awful hack is disabled by default and must be turned on by the application of a jumper across a pin pair on the back of the drive.

This is fine so long as the host OS can probe the disk to find out whether or not it's so jumpered, but somehow I doubt it's so easy.

4K-sector drives and Linux

Posted Mar 10, 2010 21:49 UTC (Wed) by kjp (guest, #39639) [Link]

I don't understand the problem with GPT. And does it fix XP as well?

4K-sector drives and Linux

Posted Mar 11, 2010 19:41 UTC (Thu) by cmccabe (guest, #60281) [Link]

> I don't understand the problem with GPT.

I agree with you. We have to move to GPT soon anyway, unless you think that 2 TB should be enough for anyone.

Rather than fooling with this C/H/S nonsense, we should just fix BIOSes and such so that they can use the new format.

> And does it fix XP as well?

The 32 bit version of Windows XP can only work with MBRs, never GPTs. There is some weird way that you can have both an MBR and a GPT on your disk, but it looks ugly... very ugly. And it still doesn't let you break the 2 TB limit under XP.

On the other hand, the 64-bit version of Windows XP can read and write GPT partitions, but can't boot from them.

4K-sector drives and Linux

Posted Mar 12, 2010 21:46 UTC (Fri) by cmccabe (guest, #60281) [Link]

Ok, I figured out that a 4k-sector drive with a traditional MBR can be up to 16TB, not 2TB.

So that should buy the old MBR scheme another few years.

4K-sector drives and Linux

Posted Mar 15, 2010 10:16 UTC (Mon) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

> I figured out that a 4k-sector drive with a traditional MBR can be up to 16TB, not 2TB.

But only for disks with 4096 bytes per LOGICAL sector and 4096 bytes per PHYSICAL sector; you can't find those drives on the market.
Only drives with 512 bytes per LOGICAL sector and 4096 bytes per PHYSICAL sector are for sale; those are 100% compatible with current 512-bytes-per-sector drives, just slower in some cases (so they have the same limits).

XP compatibility

Posted Mar 10, 2010 21:49 UTC (Wed) by mrpippy (guest, #57134) [Link]

From reading AnandTech's article on 4K sectors
(http://www.anandtech.com/storage/showdoc.aspx?i=3691&p=2), I got the impression that
Windows XP can function with aligned partitions, it's just not possible to create them with any of
XP's partition editors.
That's why XP either needs to run with the sector+1 hack, or use the WD Align tool after
installation that literally shifts the entire partition back one sector. Both methods result in an
aligned partition that XP can run off of.

4K-sector drives and Linux

Posted Mar 11, 2010 5:44 UTC (Thu) by ranmachan (guest, #21283) [Link]

> One simply cannot run XP on a device which has been properly partitioned for 4K physical sector sizes

That's not true.
If you use 32 sectors per track you'll create only correctly aligned partitions. (If the disk doesn't use an offset)
Of course you'll need to go to expert mode in fdisk to change the sector count and you can only do that on a yet unpartitioned disk because IIRC it doesn't recalculate the CHS values for already existing partitions.
And you're still screwed with drives that use an offset. :)

Why partition alignment?

Posted Mar 11, 2010 8:43 UTC (Thu) by PO8 (guest, #41661) [Link]

I don't understand why the kernel can't just read and write with the appropriate disk alignment, rather than relying on the partition being aligned. Surely it can just look at the partition table for the device and remember the correct alignment for each partition, the same as the article above does? The off-by-one thing is still a problem, but other than that it all seems straightforward. What am I missing?

Why partition alignment?

Posted Mar 11, 2010 15:50 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

So you want the filesystem to start at some offset from the start of the partition? How would an old kernel know about that offset? How would a new kernel know that an old filesystem did not have that offset? The rule is that the partition table specifies the start.

Why partition alignment?

Posted Mar 11, 2010 17:15 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

I am not sure that is the solution either, but if it is needed for ext2/3/4fs, it could be implemented in two steps:
1. The EXTxFS superblock is no longer located at 1 Kbyte from the beginning of the partition but at the 3rd sector, i.e. LBA=2.
That would only make unreadable the EXTxFS filesystems located on DVD-RAM or the EXTxFS images written to CDROM/DVDs.
Also, it seems strange to search for a signature in the middle of a sector when the device has 4096 bytes/sector.
2. The EXTxFS superblock is located at the 3rd *physical* sector of the partition.
Then, to mount the FS, the software has to scan a few sectors to see if it finds an EXT* superblock, and the old mount command can probably handle the "-o offset=1" parameter.

Why partition alignment?

Posted Mar 11, 2010 18:39 UTC (Thu) by PO8 (guest, #41661) [Link]

No, I just want the kernel to issue reads and writes aligned with the disk blocks. I don't see why anything has to move on-disk at all? Reads presumably aren't much of an issue anyhow, and writes are only going to be a problem at the beginning and end. I don't see why just doing the reads and writes properly aligned wouldn't be almost as good as shifting the data on disk?

Why partition alignment?

Posted Mar 11, 2010 19:33 UTC (Thu) by cmccabe (guest, #60281) [Link]

What you're talking about is actually done by a lot of SSDs that are sold today. It's called write coalescing. SSDs don't really have sectors, they have "erase blocks" which are often 16k or so in size.

So when they get a request for a 512-byte write, rather than doing the read-modify-write of a 16k block, they wait to see if the user wants to do any more I/O to that erase block.

The disadvantages of write coalescing are kind of obvious-- it's complex, requires temporary storage (for the un-coalesced 512-byte chunks). More buffering also means there's a longer window when power failures can result in data loss.

Overall, it's not something you want to do unless you absolutely have to. Performance and stability would be a lot better if the kernel knew about the real situation on the hardware.

4K-sector drives and Linux

Posted Mar 11, 2010 12:32 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

About NTFS having problems with unaligned start of the partition (start at sector 63), it seems strange to read:
http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx
> Formatting Volumes: Formatting also aligns clusters at the cluster size boundary.
Same for FAT created on the other OS, we can read a bit further:
> Because formatting in Windows Server 2003 aligns FAT data clusters at the cluster size boundary
The FAT filesystem cluster alignment can be modified (and it seems to be the same for NTFS) depending on the alignment of the first sector; it means that you will not generate the same FAT for two partitions which have the same size but are aligned differently - as a consequence you cannot directly copy them either (by "dd").
The Gujin bootloader is aware of that when creating FATs.
I did not find such a field in the EXT2/3/4 filesystem to ignore some sectors at the beginning of the FS.

About bootloaders and 4096 sectors, the Gujin bootloader may be able to help thanks to its minimal IDE driver in the 512 bytes MBR, but the problem is a lack of hardware to test:
http://www.wdc.com/en/products/products.asp?driveid=336
says drive WD10EACS has 1,000,204 MB and 1,953,525,168 sectors, i.e. (1,000,204 * 1000 * 1000) / 1,953,525,168 = 512 bytes/sectors
http://www.wdc.com/en/products/products.asp?driveid=763
says drive WD10EARS has exactly the same 512 bytes/sectors
and WD10EARS-00Y5B1 doesn't even have a hit on WD web site... Is that available in UK?

BTW, GPT is quite easy to use to define partitions.

4K-sector drives and Linux

Posted Mar 18, 2010 15:44 UTC (Thu) by welinder (guest, #4699) [Link]

Why not start the first partition at sector 63*8?

That wastes 7 extra "cylinders" (7*63*512 = ~226KB) but is aligned
any way you look at it. XP should be able to read it.

4K-sector drives and Linux

Posted Mar 18, 2010 20:24 UTC (Thu) by till (guest, #50712) [Link]

The XP-bugaround mode moves the sectors by one sector, so there is no factor that works for both kinds of disks, with and without this bug.

4K-sector drives and Linux

Posted Apr 25, 2010 15:23 UTC (Sun) by ramiro_morales (guest, #65623) [Link]

> Once upon a time, it was determined that 63 (512-byte) sectors was far more than anybody would be able to fit into a single disk track.
> [...] it made sense to align partitions so that the data began at the beginning of a track. So, traditionally, the first partition on a drive begins at (logical) sector 63, the last sector of the first track.
> That sector holds the boot block; any filesystem stored on the partition will follow at the beginning of the next track.

I think the two last paragraphs are incorrect. Logical sector 63 is the first sector of the second track, sectors 0-62 are in the first track. So the first partition is completely (both administrative overhead and data) located on the second track.

Not that this matters now that legacy emulated geometry is finally getting obsoleted.

4K-sector drives and Linux

Posted Dec 12, 2011 20:17 UTC (Mon) by derickmoore (guest, #81787) [Link]

One of the first things I noticed when I came to this page was the remark, "In fact, a BIOS which can boot from a 4K-sector drive is an exceedingly rare item - if, indeed, it exists at all."

I can't speak for any 'other' BIOS, but the 2nd Generation SAS products from LSI do have 4K sector support under INT13 Boot.

The only drawback to that support is the RMW (Read/Modify/Write) cycles that must take place in an environment that has no concept of 4K sectors.

The BIOS is smart enough to 'package' accesses and minimize reads in consecutive blocks, but of necessity is unable to do anything about the many one sector (512 byte) accesses.

On the other hand it does 'remember' previous reads into a 4K block, so that consecutive 512 byte accesses don't reread the 'same' 4K block when it hasn't changed. (Remember that INT13 is single threaded)

I don't know who this might help, but there it is!

Derick

P.S. I wish someone would update the DOS version of GDISK to work with 4K drives. Currently, it blows up when it sees one!


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds