I like Google News as it offers an aggregated snapshot of the day’s trending news.
I caught wind of Linus Torvalds’ take on the ZFS file system and his reluctance to merge it into the Linux kernel.
The reason he cited is Oracle’s litigious posture.
Jim Salter’s Response
Last Monday in the “Moderated Discussions” forum at realworldtech.com, Linus Torvalds—founding developer and current supreme maintainer of the Linux kernel—answered a user’s question about a year-old kernel maintenance controversy that heavily impacted the ZFS on Linux project. After answering the user’s actual question, Torvalds went on to make inaccurate and damaging claims about the ZFS filesystem itself.
Given the massive weight automatically given Torvalds’ words due to his status as founding developer and chief maintainer of the Linux kernel, we feel it’s a good idea to explain both the controversial kernel change itself, and Torvalds’ comments about both the change in question and the ZFS filesystem.
The original January 2019 controversy, explained
In January 2019, senior kernel developer Greg Kroah-Hartman fierily defended a Linux kernel commit which disabled exporting certain kernel symbols to non-GPL loadable kernel modules.
For those whose heads are spinning, kernel symbol exports expose internal information about the kernel state to loadable kernel modules. The particular symbol being discussed here, kernel_fpu, tracks the state of the processor’s Floating Point Unit. Without access to that symbol, external kernel modules that access the FPU directly—as ZFS does—must implement state preservation code of their own. State preservation, whether in-kernel or native to kernel modules, makes sure that the original state of the FPU is restored before control is released to other kernel code that may be dependent on the values they last saw in the FPU’s registers.
The technical impact of refusing to continue exporting the kernel_fpu symbol is not to prevent modules from accessing the FPU directly—it only prevents them from using the kernel’s own state-management facilities to preserve and restore state. Removing access to that symbol therefore requires module developers to reinvent their own state-preservation code individually. This increases the likelihood of catastrophic error within the kernel itself, since improperly restored state could cause a later kernel operation to crash.
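The save-and-restore discipline described above can be illustrated with a small user-space toy. This is only an analogy, not kernel code: `ToyFPU`, `preserved_state`, and `checksum_with_simd` are hypothetical names invented for this sketch, and a four-element list stands in for the real register file.

```python
from contextlib import contextmanager


class ToyFPU:
    """Hypothetical stand-in for a CPU's floating-point/SIMD registers."""

    def __init__(self):
        self.registers = [0.0] * 4


@contextmanager
def preserved_state(fpu):
    """Save the registers on entry and restore them on exit, even on error."""
    saved = list(fpu.registers)
    try:
        yield fpu
    finally:
        fpu.registers[:] = saved


def checksum_with_simd(fpu, data):
    """Pretend module code that overwrites the registers while it works."""
    fpu.registers[:] = [0.0] * len(fpu.registers)
    for i, value in enumerate(data):
        fpu.registers[i % len(fpu.registers)] += value
    return sum(fpu.registers)


# With state preservation: the earlier register contents survive the call.
fpu = ToyFPU()
fpu.registers = [1.5, 2.5, 3.5, 4.5]  # values some other code still needs
with preserved_state(fpu):
    result = checksum_with_simd(fpu, [1, 2, 3, 4, 5])

# Without it: whoever runs next sees the module's leftovers.
unguarded = ToyFPU()
unguarded.registers = [1.5, 2.5, 3.5, 4.5]
checksum_with_simd(unguarded, [1, 2, 3, 4, 5])
```

After this runs, `fpu.registers` holds its original values, while `unguarded.registers` holds whatever the checksum routine left behind. That second case is the kind of improperly restored state the paragraph above warns about.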
Kroah-Hartman’s defense of the decision to stop exporting the symbol to non-GPL kernel modules appeared to be driven largely by spite, as borne out by his own comment regarding the change: “my tolerance for ZFS is pretty non-existent.” Normally, ZFS—on any platform, including the BSDs—uses SSE/AVX SIMD vector optimization to speed up certain operations. Without access to the kernel_fpu symbol, ZFS developers were initially forced to disable the SIMD optimizations entirely, with fairly significant real-world performance degradation.
Although the change, and the way Kroah-Hartman defended it, initially spawned a lot of drama and uncertainty, the long-term impact on the Linux ZFS community was fairly minimal. The breaking change only affected bleeding-edge kernels that few ZFS users were using in production, and in July 2019 new, in-module state management code was committed to the ZFS on Linux source tree.
“We don’t break users”
Torvalds’ position in last Monday’s forum post starts out reasonable and well-informed—after all, he’s Linus Torvalds, discussing the Linux kernel. He notes that the famous kernel mantra “we don’t break users” is “literally about user-space applications”—and so it does not apply to the decision to stop exporting kernel symbols to non-GPL kernel modules. By definition, if you’re looking for a kernel symbol, you aren’t a user-space application. The line being drawn here is a very bright and functional one: Torvalds is saying that if you want to run in kernel space, you need to keep up with kernel development.
From there, Torvalds branches out into license concerns, another topic on which he’s accurate and reasonable. “Honestly, there is no way I can merge any of the ZFS efforts until I get an official letter from Oracle,” he writes. “Other people think it can be OK to merge ZFS code into the kernel and that the module interface makes it OK, and that’s their decision. But considering Oracle’s litigious nature, and the questions over licensing, there’s no way I can feel safe in ever doing so.”
He goes on to discuss the legally flimsy nature of the kernel module “shim” that the ZFS on Linux project (along with other non-GPL and non-weak-permissive projects, such as Nvidia’s proprietary graphics drivers) uses. There’s some question as to whether such shims constitute a reasonable defense now, since nobody has challenged any project for using an LGPL shim for 20 years and running, but in purely logical terms, there isn’t much question that the shims don’t accomplish much. The real function of an LGPL kernel module shim isn’t to sanction touching the kernel with non-GPL code; it’s to protect the proprietary code on the far side of the shim from being forcibly published in the event of a GPL enforcement lawsuit victory.
So far, so good, but then Torvalds dips into his own impressions of ZFS itself, both as a project and a filesystem. This is where things go badly off the rails, as Torvalds states, “Don’t use ZFS. It’s that simple. It was always more of a buzzword than anything else, I feel… [the] benchmarks I’ve seen do not make ZFS look all that great. And as far as I can tell, it has no real maintenance behind it any more…”
“It was always more of a buzzword than anything else”
This jaw-dropping statement makes me wonder whether Torvalds has ever actually used or seriously investigated ZFS. Keep in mind, he’s not merely making this statement about ZFS now, he’s making it about ZFS for the last 15 years—and is relegating everything from atomic snapshots to rapid replication to on-disk compression to per-block checksumming to automatic data repair and more to the status of “just buzzwords.”
There’s only one other widely available filesystem that even takes a respectable stab at providing most of those features, and that’s btrfs—which was not available for the first several years of ZFS’ general availability. In fact, btrfs still isn’t really stable enough for production use, unless you nerf all the features that make it interesting in the first place.
ZFS’ per-block checksumming and automatic data repair have prevented data loss in my own real-world use many times, including one particularly egregious case of a SATA controller gone rabid. A standard RAID1 mirror would have cheerfully returned the resulting 119GB of bad data with no warning whatsoever, but ZFS’ live checksumming and error detection mitigated the whole thing to the point of never having to so much as touch a backup.
Meanwhile, atomic snapshots make it possible to keep a full block-for-block identical copy of storage at a point in time with negligible performance overhead and minimal storage overhead—and replication of those snapshots is typically hundreds or thousands of times faster (and more reliable) than non-filesystem-integrated solutions like rsync.
It’s possible to not have a personal need for ZFS. But to write it off as “more of a buzzword than anything else” seems to expose massive ignorance on the subject.
“The benchmarks I’ve seen do not make ZFS look all that great”
If you’re looking for the absolute-fastest uncached and uncompressed performance, ZFS is unlikely to satisfy. While not much slower than alternatives like ext4 or xfs, it is generally at least a bit slower—again, if you design your benchmark workload to exclude the benefits of caching or compression.
To be fair to Torvalds, deliberately nerfing cache is a very, very long-standing practice in storage system benchmarking. The reason why, however, is pretty telling. The assumption is that all filesystems will cache equally well (or poorly), and all use the same caching algorithm. This is a good assumption for most filesystems, which use a simple LRU cache built into their operating system’s kernel. ZFS, on the other hand, uses an algorithm called ARC—short for Adaptive Replacement Cache.
When new blocks are read in from the filesystem, an LRU (least recently used) cache evicts whichever blocks have been sitting in cache un-read for the longest time. Each time you read a block from an LRU cache, it does jump that block back up to the top of the cache; but the LRU cache does not know or care how many times the block has been hit—only about how long it has been sitting without a hit.
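The LRU behavior just described can be sketched in a few lines. This is a minimal illustration, not any kernel’s actual page cache; the `LRUCache` class and the block names are made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used block cache (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first, newest last

    def read(self, block_id):
        """Return True on a cache hit, False on a miss."""
        if block_id in self.blocks:
            # A hit moves the block back to the "top" of the cache.
            self.blocks.move_to_end(block_id)
            return True
        if len(self.blocks) >= self.capacity:
            # Evict whichever block has gone unread the longest.
            self.blocks.popitem(last=False)
        self.blocks[block_id] = True
        return False


cache = LRUCache(capacity=3)
for _ in range(100):
    cache.read("hot")           # one miss, then 99 hits
for block in ("a", "b", "c"):   # a one-off scan of three other blocks
    cache.read(block)

hot_survived = "hot" in cache.blocks
```

Here `hot_survived` ends up `False`: three blocks that were each read exactly once push out a block that had been hit 99 times, because the cache only tracks how long ago each block was last touched.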
The ARC is more complex, but it can simplistically be thought of as a “weighted” cache. It keeps track of the history of both cached and recently evicted blocks; a block that has been hit very frequently will be more difficult to evict from cache than one that has been hit only once, even if the only-once block was hit more recently than the frequent flyer.
Since the cache also tracks block evictions, it can also detect—and get more stubborn about evicting—blocks that have frequently been evicted and re-read after eviction.
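The idea can be sketched with a deliberately simplified cache that keeps a “seen once” list, a “seen repeatedly” list, and a list of recently evicted block IDs. To be clear, this is not the real ARC algorithm, which also adapts the balance between its lists over time; the class name `FrequencyAwareCache` and its eviction rule are assumptions made purely for illustration.

```python
from collections import OrderedDict


class FrequencyAwareCache:
    """Simplified ARC-flavored toy cache; NOT the actual ARC algorithm."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.recent = OrderedDict()    # blocks read once since admission
        self.frequent = OrderedDict()  # blocks read at least twice
        self.ghosts = OrderedDict()    # IDs only, of recently evicted blocks

    def _evict_one(self):
        # Prefer to evict read-once blocks; touch frequent ones only if forced.
        victims = self.recent if self.recent else self.frequent
        block_id, _ = victims.popitem(last=False)
        self.ghosts[block_id] = True
        if len(self.ghosts) > self.capacity:
            self.ghosts.popitem(last=False)

    def read(self, block_id):
        """Return True on a cache hit, False on a miss."""
        if block_id in self.frequent:
            self.frequent.move_to_end(block_id)
            return True
        if block_id in self.recent:
            # A second hit promotes the block to the frequent list.
            del self.recent[block_id]
            self.frequent[block_id] = True
            return True
        # Miss: check whether this block was evicted recently.
        was_ghost = self.ghosts.pop(block_id, None) is not None
        if len(self.recent) + len(self.frequent) >= self.capacity:
            self._evict_one()
        if was_ghost:
            # Evicted and re-read: admit straight into the frequent list.
            self.frequent[block_id] = True
        else:
            self.recent[block_id] = True
        return False


cache = FrequencyAwareCache(capacity=3)
for _ in range(100):
    cache.read("hot")                # becomes a "frequent flyer"
for block in ("a", "b", "c", "d"):   # one-off scan; evicts "a", then "b"
    cache.read(block)
cache.read("a")                      # re-read shortly after eviction
```

After the scan, `"hot"` is still cached even though every scanned block was read more recently, and the re-read of `"a"` lands directly in the harder-to-evict `frequent` list because its ID was still on the ghost list.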
The long and the short of this is that the ARC can be considerably more efficient for many workloads, with much higher hit-rates from cache than the simple LRU caches used by Linux, BSD, Windows, and macOS kernels, and by the filesystems that depend on those more naive caches.
So it’s no longer particularly reasonable to discard cache information when comparing and contrasting other filesystems with ZFS, since, contrary to long-standing expectation, they are no longer all on an even footing in regard to cache.
Another frequently large ZFS performance win is inline compression. ZFS allows datasets to be live-compressed with algorithms such as LZ4 and Gzip. Contrary to popular belief, this is usually a performance win as well as a storage efficiency win—particularly with streaming algorithms such as LZ4, or the newer ZSTD. Modern CPUs are significantly faster at both compression and decompression of arbitrary datastreams using these lightweight algorithms than even the fastest SSD storage. Some of my own real-world testing demonstrated a five-percent performance hit when using inline LZ4 compression on completely incompressible data but a 27-percent performance win on a Windows Server ISO—and that was on very fast SSD storage.
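A rough way to see why the type of data matters so much is to compare compression ratios for incompressible and highly compressible input. The sketch below uses Python’s standard `zlib` module (the DEFLATE algorithm behind gzip) at its fastest level, because LZ4 is not in the standard library. It measures size only, not throughput, and the sample data is synthetic, so it says nothing about the specific percentages quoted above.

```python
import os
import zlib


def compression_ratio(data, level=1):
    """Original size divided by compressed size (higher means smaller output)."""
    return len(data) / len(zlib.compress(data, level))


incompressible = os.urandom(1 << 20)      # 1 MiB of random bytes
compressible = b"ZFS on Linux " * 80_000  # about 1 MiB of repetitive text

random_ratio = compression_ratio(incompressible)
text_ratio = compression_ratio(compressible)

# Compression must be lossless: the round trip returns the original bytes.
round_trip_ok = zlib.decompress(zlib.compress(compressible, 1)) == compressible

print(f"random data ratio:     {random_ratio:.3f}")
print(f"repetitive data ratio: {text_ratio:.1f}")
```

Random bytes come out at a ratio of roughly 1, so the CPU time spent compressing them buys nothing. The repetitive buffer shrinks dramatically, which means far fewer bytes have to travel to and from the disk.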
Summary on Performance Comparison
The TL;DR here is that it’s not really accurate to make blanket statements about ZFS performance, absent a very particular, well-understood workload to measure that performance on. But more importantly, quibbling about the fastest possible benchmark rather loses the main point of ZFS. This filesystem is meant to provide an eminently scalable filesystem that’s extremely resistant to data loss; those are points Torvalds notably never so much as touches on.
Source Code Maintenance
“As far as I can tell, it has no real maintenance behind it any more” – Linus Torvalds
We’re not entirely sure where Torvalds was looking to see evidence of “maintenance.” The most charitable take is that he didn’t understand the difference between OpenZFS and Oracle ZFS, couldn’t find a Git tree for the latter, and assumed it was dead. In reality, Oracle’s version of ZFS—which underlies the Oracle Storage Appliance—is still in very active development, which I can attest to in part from personally knowing people on that team. However, that development has been closed and proprietary since 2010.
This is a bit of a red herring, of course, since the only ZFS relevant to Torvalds in the first place is ZFS on Linux, which is itself a part of the larger OpenZFS project. The ZFS on Linux master tree at GitHub has seen 52 merged pull requests, 60 commits pushed to master from 25 authors, and 5,807 additions to files within the project in the last month alone. Although there’s no fixed release schedule, ZFS on Linux releases a full, new version three or four times per year.
Meanwhile, OpenZFS is actively consumed, developed, and in some cases commercially supported by organizations ranging from the Lawrence Livermore National Laboratory (where OpenZFS is the underpinning of some of the world’s largest supercomputers) through Datto, Delphix, Joyent, ixSystems, Proxmox, Canonical, and more.
One sysadmin’s opinion
It’s entirely reasonable for Torvalds, as chief maintainer of the Linux kernel, to declare his unwillingness to so much as flirt with the integration of ZFS—as a CDDL-licensed project—into the Linux kernel. There is a pretty good chance that enforcement attempts to separate CDDL and GPL licensed code might fail in court—note the fact that nobody has challenged Canonical on its integration of ZFS code into mainline kernels over the last four years—but nothing forces Torvalds to take a controversial stance here.
It’s equally reasonable for Torvalds to draw a hard, bright line separating the kernel mandate for not breaking userland from any nonexistent promise to maintain ABI compatibility in kernel space. If you want to play in kernel space, you have to play with the big kids—and that means keeping up with ABI changes that might break your stuff. As aggressive as OpenZFS users and developers might find Kroah-Hartman’s language around removing the export of kernel symbols, the actual change was well within Linux kernel policy. And there’s nothing wrong with Torvalds supporting either the kernel change itself, or the policies it followed.
Use or Not Use
Where things went off the rails is when Torvalds started talking about reasons to use or not use ZFS. Torvalds’ status within the Linux community grants his words an impact that can be entirely out of proportion to Torvalds’ own knowledge of a given topic—and this was clearly one of those topics.
Linus Torvalds is eminently qualified to discuss issues with license compatibility and kernel policy. However, this does not mean he’s equally qualified to discuss individual projects in project-specific context.
Life offers multiple opportunities for Yes and No takes.
I am thankful that Jim Salter took the time to offer a nuanced personal take on this.