Discussion:
[PATCH][try6] VFS: new want_holesize and got_holesize buffer_head flags for fiemap
Bob Peterson
2014-10-22 01:09:55 UTC
Permalink
Hi,

This is my sixth rework of this patch. The problem with the previous
version is that the underlying file system's block_map function
may set the buffer_head's b_state to 0, which may be misinterpreted
by fiemap as meaning it has returned the hole size in b_size.
This version implements a suggestion from Steve Whitehouse: it
improves on the design by (unfortunately) using two flags: a flag to
tell block_map to return hole size if possible, and another flag to
tell fiemap that the hole size has been returned. This way there is
no possibility of a misunderstanding.

The problem:
If you do a fiemap operation on a very large sparse file, it can take
an extremely long amount of time (we're talking days here) because
function __generic_block_fiemap does a block-for-block search when it
encounters a hole.

The solution:
Allow the underlying file system to return the hole size so that
function __generic_block_fiemap can quickly skip the hole. I have a
companion patch to GFS2 that takes advantage of the new flags to
speed up fiemap on sparse GFS2 files. Other file systems can do the
same as they see fit. For GFS2, the time it takes to skip a 1PB hole
in a sparse file goes from days to milliseconds.

Patch description:
This patch changes function __generic_block_fiemap so that it sets a new
buffer_want_holesize bit. The new bit signals to the underlying file system
to return a hole size from its block_map function (if possible) in the
event that a hole is encountered at the requested block. If the block_map
function encounters a hole, and sets buffer_got_holesize, fiemap takes the
returned b_size to be the size of the hole, in bytes. It then skips the
hole and moves to the next block. This may be repeated several times
in a row, especially for large holes, due to possible limitations of the
fs-specific block_map function. This is still much faster than trying
each block individually when large holes are encountered. If the
block_map function does not set buffer_got_holesize, the request for
holesize has been ignored, and it falls back to today's method of doing a
block-by-block search for the next valid block.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson <***@redhat.com>
---
fs/ioctl.c | 7 ++++++-
include/linux/buffer_head.h | 4 ++++
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad..9c07a40 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -291,13 +291,18 @@ int __generic_block_fiemap(struct inode *inode,
memset(&map_bh, 0, sizeof(struct buffer_head));
map_bh.b_size = len;

+ set_buffer_want_holesize(&map_bh);
ret = get_block(inode, start_blk, &map_bh, 0);
if (ret)
break;

/* HOLE */
if (!buffer_mapped(&map_bh)) {
- start_blk++;
+ if (buffer_got_holesize(&map_bh))
+ start_blk += logical_to_blk(inode,
+ map_bh.b_size);
+ else
+ start_blk++;

/*
* We want to handle the case where there is an
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..fc8e47c 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -37,6 +37,8 @@ enum bh_state_bits {
BH_Meta, /* Buffer contains metadata */
BH_Prio, /* Buffer should be submitted with REQ_PRIO */
BH_Defer_Completion, /* Defer AIO completion to workqueue */
+ BH_Want_Holesize, /* Return hole size if possible */
+ BH_Got_Holesize, /* Hole size has been returned */

BH_PrivateStart,/* not a state bit, but the first bit available
* for private allocation by other entities
@@ -128,6 +130,8 @@ BUFFER_FNS(Boundary, boundary)
BUFFER_FNS(Write_EIO, write_io_error)
BUFFER_FNS(Unwritten, unwritten)
BUFFER_FNS(Meta, meta)
+BUFFER_FNS(Want_Holesize, want_holesize)
+BUFFER_FNS(Got_Holesize, got_holesize)
BUFFER_FNS(Prio, prio)
BUFFER_FNS(Defer_Completion, defer_completion)
--
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-10-22 06:04:34 UTC
Permalink
This looks like a big indicator that get_blocks is the wrong
interface for fiemap.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Whitehouse
2014-10-22 10:50:39 UTC
Permalink
Hi,
Post by Christoph Hellwig
This looks like a big indicator that get_blocks is the wrong
interface for fiemap.
The question is then, what a better interface would look like? We could
have "get_hole" as an extra operation I suppose. Not sure that I really
see why thats better or worse than the extra flag at the moment? or did
you have something else in mind?

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bob Peterson
2014-10-22 12:28:53 UTC
Permalink
----- Original Message -----
Post by Christoph Hellwig
This looks like a big indicator that get_blocks is the wrong
interface for fiemap.
Hi Christoph,

Yes, I thought about that.
One of my early prototypes had a separate function used by fiemap.
Function __generic_block_fiemap would call get_block() which
returned an indication of a hole as it does today. When it saw
the hole, fiemap called a new function get_hole_size() that was
passed in like get_block. The problem is: it's grossly inefficient,
since the new function get_hole_size() has to redo most of the work
that get_block just did (at least in the case of GFS2). (Which in the
case of a 1PB sparse file is non-trivial, since it involves several
levels of metadata indirection). Combining it with get_block made it
much more efficient.

Making a separate get_block_map_fiemap() function just seems like an
exercise in redundancy.

Regards,

Bob Peterson
Red Hat File Systems
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-10-23 08:51:43 UTC
Permalink
Post by Bob Peterson
Yes, I thought about that.
One of my early prototypes had a separate function used by fiemap.
Function __generic_block_fiemap would call get_block() which
returned an indication of a hole as it does today. When it saw
the hole, fiemap called a new function get_hole_size() that was
passed in like get_block. The problem is: it's grossly inefficient,
since the new function get_hole_size() has to redo most of the work
that get_block just did (at least in the case of GFS2). (Which in the
case of a 1PB sparse file is non-trivial, since it involves several
levels of metadata indirection). Combining it with get_block made it
much more efficient.
Making a separate get_block_map_fiemap() function just seems like an
exercise in redundancy.
I was thinking of replacing get_blocks entirely. We're not actually
using a buffer_head in fiemap, so the interface seems somewhat awkward.
If it used something like the iomap interface proposed by Dave long
time ago we'd have a much saner interface that for example XFS could use
as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bob Peterson
2014-10-23 12:34:58 UTC
Permalink
----- Original Message -----
Post by Christoph Hellwig
Post by Bob Peterson
Yes, I thought about that.
One of my early prototypes had a separate function used by fiemap.
Function __generic_block_fiemap would call get_block() which
returned an indication of a hole as it does today. When it saw
the hole, fiemap called a new function get_hole_size() that was
passed in like get_block. The problem is: it's grossly inefficient,
since the new function get_hole_size() has to redo most of the work
that get_block just did (at least in the case of GFS2). (Which in the
case of a 1PB sparse file is non-trivial, since it involves several
levels of metadata indirection). Combining it with get_block made it
much more efficient.
Making a separate get_block_map_fiemap() function just seems like an
exercise in redundancy.
I was thinking of replacing get_blocks entirely. We're not actually
using a buffer_head in fiemap, so the interface seems somewhat awkward.
If it used something like the iomap interface proposed by Dave long
time ago we'd have a much saner interface that for example XFS could use
as well.
Hi Christoph. Can you send a link to the thread regarding Dave's iomap?
proposal? I don't recall it offhand, so I don't know what it was or
why it was never implemented. I assume you mean Dave Chinner. Maybe it's
time to revisit the concept as a long-term solution.

In the meantime, do you otherwise object to this patch as a short-term
solution?

Regards,

Bob Peterson
Red Hat File Systems
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...