20-24 September 2021
US/Pacific timezone

Bad Storage vs. Filesystems

21 Sep 2021, 09:45
30m
Microconference4/Virtual-Room (LPC Virtual)

File Systems MC

Speaker

Darrick Wong (Oracle)

Description

This session focuses on mitigating the effects of unreliable storage devices. I work at a cloud vendor (as is fashionable now), and one of the big story arcs of recent years has been that storage devices are not as reliable as we believed even a few years ago.

Specifically, I've observed that as the world moves from direct-attached spinning rust to software-defined storage on cheap devices, we increasingly must deal with large devices that corrupt data, temporarily stop responding (due to problems on the network/control plane/hypervisor/whatever), or have some odd means to request re-reads.

XFS sort of mitigates some of these problems by enabling sysadmins to configure its response to certain kinds of hardware errors (mostly EIO and ENOSPC). Other filesystems lack these control knobs; how might we standardize them? The block layer has some retry capabilities, but no filesystems touch them. We don't have a general corrupted-read retry mechanism, and have not succeeded in adding one.
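For anyone who has not met the XFS knobs mentioned above, here is a minimal sketch of how a script might read and set them. It assumes the sysfs layout described in the kernel's XFS admin guide (`/sys/fs/xfs/<device>/error/metadata/<class>/max_retries` and `retry_timeout_seconds`); the helper names `knob_path`, `read_knob`, and `write_knob` are hypothetical, and the `sysfs_root` parameter exists only so the code can be exercised against a scratch directory. Treat it as an illustration, not a recommended tool.

```python
from pathlib import Path

# Assumed layout (from the kernel's XFS admin guide, not from this abstract):
#   <sysfs_root>/<device>/error/metadata/<class>/<knob>
ERROR_CLASSES = ("default", "EIO", "ENOSPC", "ENODEV")
KNOBS = ("max_retries", "retry_timeout_seconds")


def knob_path(device, error_class, knob, sysfs_root="/sys/fs/xfs"):
    """Build the path of one XFS metadata-error knob (hypothetical helper)."""
    if error_class not in ERROR_CLASSES:
        raise ValueError(f"unknown error class: {error_class!r}")
    if knob not in KNOBS:
        raise ValueError(f"unknown knob: {knob!r}")
    return Path(sysfs_root) / device / "error" / "metadata" / error_class / knob


def read_knob(device, error_class, knob, sysfs_root="/sys/fs/xfs"):
    """Return the current integer value of a knob."""
    text = knob_path(device, error_class, knob, sysfs_root).read_text()
    return int(text.strip())


def write_knob(device, error_class, knob, value, sysfs_root="/sys/fs/xfs"):
    """Set a knob; -1 is the documented 'retry forever' value."""
    if value < -1:
        raise ValueError("value must be -1 or greater")
    knob_path(device, error_class, knob, sysfs_root).write_text(f"{value}\n")
```

On a real system the device name would be the short block device name of a mounted XFS filesystem (for example `sda1`), and writing requires root. Nothing comparable exists for most other filesystems, which is the gap the questions below are about.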

So what I want to know is: Who cares? Are sysadmins and users happy with the current patchwork? Do they accept the defaults? Would they like more control or better communication between layers?

Primary author

Darrick Wong (Oracle)
