Matthew Wilcox
2014-10-08 19:05:23 UTC
One of the things on my todo list is making O_DIRECT work to a
memory-mapped direct_access file. Right now, it simply doesn't work
because there's no struct page for the memory, so get_user_pages() fails.
Boaz has posted a patch to create struct pages for direct_access files,
which is certainly one way of solving the immediate problem, but it
ignores the deeper problem.
For normal files, get_user_pages() elevates the reference count on
the pages. If those pages are subsequently truncated from the file,
the underlying file blocks are released to the filesystem's free pool.
The pages are removed from the page cache and the process's address space,
but hang around until the caller of get_user_pages() calls put_page() on
them, at which point they are released into the pool of free pages.
Once we have a struct page for (or some other way to handle pinning of)
persistent memory blocks, truncating a file that has pinned pages will
still cause the disk blocks to be released to the free pool. But there
weren't any pages of DRAM between the filesystem and the application!
So those blocks are "freed" while still referenced. And that reference
might well be programmed into a piece of hardware that's doing DMA;
it can't be stopped.
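
To make the difference between the two cases concrete, here is a toy
user-space model in C. It is not kernel code, and every name in it
(toy_file, toy_pin, toy_truncate, toy_pin_is_safe) is hypothetical. It
only tracks what a pinned address refers to, and whether that memory is
still owned by anyone after a truncate.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_target { TARGET_DRAM_PAGE, TARGET_PMEM_BLOCK };

struct toy_file {
    bool dax;             /* true: mappings point straight at the block  */
    bool block_allocated; /* does the filesystem still own the block?    */
    int  page_refs;       /* refs on the DRAM page (page cache case);    */
                          /* starts at 1 for the page cache's own ref    */
};

/* Model of get_user_pages(): returns what the pin really points at. */
static enum toy_target toy_pin(struct toy_file *f)
{
    if (f->dax)
        return TARGET_PMEM_BLOCK;  /* no DRAM page in between */
    f->page_refs++;                /* caller now holds a page reference */
    return TARGET_DRAM_PAGE;
}

/* Model of truncate: the block always goes back to the free pool. */
static void toy_truncate(struct toy_file *f)
{
    f->block_allocated = false;
    if (!f->dax)
        f->page_refs--;            /* page cache drops its own reference */
}

/* Is the memory behind the pin still safe to DMA into? */
static bool toy_pin_is_safe(const struct toy_file *f, enum toy_target t)
{
    if (t == TARGET_DRAM_PAGE)
        return f->page_refs > 0;   /* the pin keeps the DRAM page alive */
    return f->block_allocated;     /* the pin IS the block; nothing else */
}
```

In this model a page-cache file stays safe after truncate because the
pin holds a reference on a DRAM page, while a DAX file does not, because
the only thing the pin refers to is the block that was just freed.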
I see three solutions here:
1. If get_user_pages() is called, copy from PMEM into DRAM, and provide
the caller with the struct pages of the DRAM. Modify DAX to handle some
file pages being in the page cache, and make sure that we know whether
the PMEM or DRAM is up to date. This has the obvious downside that
get_user_pages() becomes slow.
2. Modify filesystems that support DAX to handle pinning blocks.
Some filesystems (that support COW and snapshots) already support
reference-counting individual blocks. We may be able to do better by
using a tree of pinned extents or something. This makes it much harder
to modify a filesystem to support DAX, and I don't see patches adding
this capability to ext2 being warmly welcomed.
3. Make truncate() block if it hits a pinned page. There's really no
good reason to truncate a file that has pinned pages; it's either a bug
or you're trying to be nasty to someone. We actually already have code
for this: inode_dio_wait() / inode_dio_done(). But page pinning isn't
just for O_DIRECT I/Os and other transient users like crypto, it's also
for long-lived things like RDMA, where we could potentially block for
an indefinite time.
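
As a rough illustration of option 2, the following user-space sketch in
C models per-block pin counts with a deferred free. It is a toy under
stated assumptions, not a proposal for any real filesystem: the
fixed-size arrays and all names (toy_fs, toy_pin_block, toy_unpin_block,
toy_truncate_blocks) are hypothetical, and a real implementation would
more likely use the tree of pinned extents mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 8

/* Toy model only: none of these names exist in the kernel. */
struct toy_fs {
    int  pins[TOY_NBLOCKS];      /* outstanding pins per block          */
    bool in_file[TOY_NBLOCKS];   /* block currently belongs to the file */
    bool free_pool[TOY_NBLOCKS]; /* block may be handed out again       */
};

static void toy_fs_init(struct toy_fs *fs)
{
    for (size_t b = 0; b < TOY_NBLOCKS; b++) {
        fs->pins[b] = 0;
        fs->in_file[b] = true;
        fs->free_pool[b] = false;
    }
}

/* Pin a block; fails if the block is no longer part of the file. */
static bool toy_pin_block(struct toy_fs *fs, size_t b)
{
    if (b >= TOY_NBLOCKS || !fs->in_file[b])
        return false;
    fs->pins[b]++;
    return true;
}

/* Drop a pin; the last unpin of a truncated block finally frees it. */
static void toy_unpin_block(struct toy_fs *fs, size_t b)
{
    if (b >= TOY_NBLOCKS || fs->pins[b] == 0)
        return;
    if (--fs->pins[b] == 0 && !fs->in_file[b])
        fs->free_pool[b] = true;
}

/* Truncate to new_nblocks: unpinned blocks are freed now, pinned later. */
static void toy_truncate_blocks(struct toy_fs *fs, size_t new_nblocks)
{
    for (size_t b = new_nblocks; b < TOY_NBLOCKS; b++) {
        if (!fs->in_file[b])
            continue;
        fs->in_file[b] = false;
        if (fs->pins[b] == 0)
            fs->free_pool[b] = true;
    }
}
```

Option 3 would differ only in toy_truncate_blocks(): instead of
deferring the free, it would wait there until every affected pin count
reached zero, which is where the indefinite blocking comes from.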
Does option 3 open up a new attack surface? I'm thinking about somebody
opening a large file that's publicly readable, and pinning part of
it by handing part of it to an RDMA card. That would prevent the owner
from truncating it.
One thing that option 3 doesn't do is affect whether a file can be
removed. Just having the file mmapped is enough to prevent the file blocks
from being reused, even if all names for that file have been removed.
I'm open to other solutions ...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html