Discussion:
munmap, msync: synchronization
Michael Kerrisk (man-pages)
2014-04-21 10:16:46 UTC
[CCing a few people who may correct my errors; perhaps there are some
improvements that are needed for the mmap() and msync() man pages.]

Hello Heinrich,
Hello Michael,

when analyzing how the fanotify API interacts with mmap(2) I stumbled
over this passage in the msync(2) man page:

"msync() flushes changes made to the in-core copy of a file that was
mapped into memory using mmap(2) back to disk."

"back to disk" implies that the file system is forced to actually write
to the hard disk, somewhat equivalent to invoking sync(1). Is that
guaranteed for all file systems?

Not all file systems are necessarily disk based (e.g. davfs, tmpfs).

"... back to the file system."
Yes, that seems better to me. Done.
http://pubs.opengroup.org/onlinepubs/007904875/functions/msync.html
says
"... to permanent storage locations, if any,"

The man page of munmap(2) leaves it unclear whether copying back to the
filesystem is synchronous or asynchronous.
In fact, the page says nearly nothing about whether it syncs at all.
That is (I think) more or less deliberate. See below.
This bit of information is important because, if munmap() is
asynchronous, applications might want to call msync(,,MS_SYNC) before
calling munmap(). If munmap() is synchronous, it might block until the
file system responds (think of waiting for a tape to be loaded, or a
webdav server to respond).

What happens to an unfinished prior asynchronous update by
msync(,,MS_ASYNC) when munmap() is called?
I believe the answer is: On Linux, nothing special; the asynchronous
update will still be done. (I'm not sure that anything needs to be
said in the man page... But, if you have a good argument about why
something should be said, I'm open to hearing it.)
Will munmap() "invalidate other mappings of the same file (so that they
can be updated with the fresh values just written)" like
msync(,,MS_INVALIDATE) does?
I don't believe there's any requirement that it does. (Again, I'm not
sure that anything needs to be said in the man page... But, if
you have a good argument...)

So, here's how things are as I understand them.

1. In the bad old days (even on Linux, AFAIK, but that was in days
before I looked closely at what goes on), the page cache and
the buffer cache were not unified. That meant that a page from
a file might both be in the buffer cache (because of file I/O
syscalls) and in the page cache (because of mmap()).

2. In a non-unified cache system, pages can naturally get out of
synch in the two locations. Before it had a unified cache, Linux
used to jump some hoops to ensure that contents in the two
locations remained consistent.

3. Nowadays Linux--like most (all?) UNIX systems--has a
unified cache: file I/O, mmap(), and the paging system all
use the same cache. If a file is mmap()-ed and also subject
to file I/O, there will be only one copy of each file page
in the cache. Ergo, the inconsistency problem goes away.

4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
exist only because of the bad old non-unified cache days.
MS_INVALIDATE was a way of saying: make sure that writes
to the file by other processes are visible in this mapping.
msync() without the MS_INVALIDATE flag was a way of saying:
make sure that read()s from the file see the changes made
via this mapping. Using either MS_SYNC or MS_ASYNC
was the way of saying: "I either want to wait until the file
updates have been completed", or "please start the updates
now, but I don't want to wait until they're completed".

5. On systems with a unified cache, msync(MS_INVALIDATE)
is a no-op. (That is so on Linux.)

6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
cache system. Filesystem I/O always sees a consistent view,
and MS_ASYNC never undertook to give a guarantee about *when*
the update would occur. (The Linux buffer cache logic will
ensure that it is flushed out sometime in the near future.)

7. On Linux (and probably many other modern systems), the only
call that has any real use is msync(MS_SYNC), meaning
"flush the buffers *now*, and I want to wait for that to
complete, so that I can then continue safe in the knowledge
that my data has landed on a device". That's useful if we
want insurance for our data in the event of a system crash.

8. POSIX makes no mandate for a unified cache system. Thus,
we have MS_ASYNC and MS_INVALIDATE in the standard, and
the standard says nothing (AFAIK) about whether munmap()
will flush data. On Linux (and probably most modern systems),
we're fine, but portable applications that care about
standards and non-unified caches need to use msync().

My advice: To ensure that the contents of a shared file
mapping are written to the underlying file--even on bad old
implementations--a call to msync() should be made before
unmapping a mapping with munmap().

9. The mmap() man page says this:

MAP_SHARED
Share this mapping. Updates to the mapping are visible
to other processes that map this file, and are
carried through to the underlying file. The file
may not actually be updated until msync(2) or
munmap() is called.

I believe the piece "or munmap()" is misleading. It implies
that munmap() must trigger a sync action. I don't think this
is true. All that it is required to do is remove some range
of pages from the process's virtual address space. I'm
inclined to remove those words, but I'd like to see if any
FS person has a correction to my understanding first.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-04-21 18:14:31 UTC
Post by Michael Kerrisk (man-pages)
1. In the bad old days (even on Linux, AFAIK, but that was in days
before I looked closely at what goes on), the page cache and
the buffer cache were not unified. That meant that a page from
a file might both be in the buffer cache (because of file I/O
syscalls) and in the page cache (because of mmap()).
Correct.
Post by Michael Kerrisk (man-pages)
2. In a non-unified cache system, pages can naturally get out of
synch in the two locations. Before it had a unified cache, Linux
used to jump some hoops to ensure that contents in the two
locations remained consistent.
Yeah.
Post by Michael Kerrisk (man-pages)
3. Nowadays Linux--like most (all?) UNIX systems--has a
unified cache: file I/O, mmap(), and the paging system all
use the same cache. If a file is mmap()-ed and also subject
to file I/O, there will be only one copy of each file page
in the cache. Ergo, the inconsistency problem goes away.
Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
its own file cache that is not coherent with the VM cache at the
implementation level. Not sure how much of this leaks to userspace,
though.
Post by Michael Kerrisk (man-pages)
4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
exist only because of the bad old non-unified cache days.
MS_INVALIDATE was a way of saying: make sure that writes
to the file by other processes are visible in this mapping.
msync() without the MS_INVALIDATE flag was a way of saying:
make sure that read()s from the file see the changes made
via this mapping. Using either MS_SYNC or MS_ASYNC
was the way of saying: "I either want to wait until the file
updates have been completed", or "please start the updates
now, but I don't want to wait until they're completed".
Right.
Post by Michael Kerrisk (man-pages)
5. On systems with a unified cache, msync(MS_INVALIDATE)
is a no-op. (That is so on Linux.)
Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
why, though..
Post by Michael Kerrisk (man-pages)
6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
cache system. Filesystem I/O always sees a consistent view,
and MS_ASYNC never undertook to give a guarantee about *when*
the update would occur. (The Linux buffer cache logic will
ensure that it is flushed out sometime in the near future.)
Right. It's a fairly inefficient noop, though - it actually loops
over all vmas to do nothing with them.
Post by Michael Kerrisk (man-pages)
7. On Linux (and probably many other modern systems), the only
call that has any real use is msync(MS_SYNC), meaning
"flush the buffers *now*, and I want to wait for that to
complete, so that I can then continue safe in the knowledge
that my data has landed on a device". That's useful if we
want insurance for our data in the event of a system crash.
Right. It's basically another way to call fsync, which is used to
implement it underneath. It actually should be a ranged fdatasync,
but right now it's implemented horribly inefficiently in that it
does an fsync call for each vma that it encounters in the range
specified.
Post by Michael Kerrisk (man-pages)
8. POSIX makes no mandate for a unified cache system. Thus,
we have MS_ASYNC and MS_INVALIDATE in the standard, and
the standard says nothing (AFAIK) about whether munmap()
will flush data. On Linux (and probably most modern systems),
we're fine, but portable applications that care about
standards and non-unified caches need to use msync().
My advice: To ensure that the contents of a shared file
mapping are written to the underlying file--even on bad old
implementations--a call to msync() should be made before
unmapping a mapping with munmap().
Agreed.
Post by Michael Kerrisk (man-pages)
MAP_SHARED
Share this mapping. Updates to the mapping are visible
to other processes that map this file, and are
carried through to the underlying file. The file
may not actually be updated until msync(2) or
munmap() is called.
I believe the piece "or munmap()" is misleading. It implies
that munmap() must trigger a sync action. I don't think this
is true. All that it is required to do is remove some range
of pages from the process's virtual address space. I'm
inclined to remove those words, but I'd like to see if any
FS person has a correction to my understanding first.
I would expect non-coherent systems to update their caches on munmap,
but POSIX does not seem to require this, and I can't find any language
towards that in the HP-UX man page, which was a system that I remember
as non-coherent until the end.
Michael Kerrisk (man-pages)
2014-04-21 19:54:16 UTC
Christoph,
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
1. In the bad old days (even on Linux, AFAIK, but that was in days
before I looked closely at what goes on), the page cache and
the buffer cache were not unified. That meant that a page from
a file might both be in the buffer cache (because of file I/O
syscalls) and in the page cache (because of mmap()).
Correct.
Post by Michael Kerrisk (man-pages)
2. In a non-unified cache system, pages can naturally get out of
synch in the two locations. Before it had a unified cache, Linux
used to jump some hoops to ensure that contents in the two
locations remained consistent.
Yeah.
Post by Michael Kerrisk (man-pages)
3. Nowadays Linux--like most (all?) UNIX systems--has a
unified cache: file I/O, mmap(), and the paging system all
use the same cache. If a file is mmap()-ed and also subject
to file I/O, there will be only one copy of each file page
in the cache. Ergo, the inconsistency problem goes away.
Mostly true, except for FreeBSD and Solaris when they use ZFS, which has
its own file cache that is not coherent with the VM cache at the
implementation level. Not sure how much of this leaks to userspace,
though.
Thanks for that detail.
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
4. IIUC, the pieces like msync(MS_ASYNC) and msync(MS_INVALIDATE)
exist only because of the bad old non-unified cache days.
MS_INVALIDATE was a way of saying: make sure that writes
to the file by other processes are visible in this mapping.
msync() without the MS_INVALIDATE flag was a way of saying:
make sure that read()s from the file see the changes made
via this mapping. Using either MS_SYNC or MS_ASYNC
was the way of saying: "I either want to wait until the file
updates have been completed", or "please start the updates
now, but I don't want to wait until they're completed".
Right.
Post by Michael Kerrisk (man-pages)
5. On systems with a unified cache, msync(MS_INVALIDATE)
is a no-op. (That is so on Linux.)
Almost. It returns EBUSY if it hits any mlock()ed region. Don't ask me
why, though..
Ahhh yes, I was aware of that detail, but overlooked it in the point
above.
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
cache system. Filesystem I/O always sees a consistent view,
and MS_ASYNC never undertook to give a guarantee about *when*
the update would occur. (The Linux buffer cache logic will
ensure that it is flushed out sometime in the near future.)
Right. It's a fairly inefficient noop, though - it actually loops
over all vmas to do nothing with them.
Post by Michael Kerrisk (man-pages)
7. On Linux (and probably many other modern systems), the only
call that has any real use is msync(MS_SYNC), meaning
"flush the buffers *now*, and I want to wait for that to
complete, so that I can then continue safe in the knowledge
that my data has landed on a device". That's useful if we
want insurance for our data in the event of a system crash.
Right. It's basically another way to call fsync, which is used to
implement it underneath. It actually should be a ranged fdatasync,
but right now it's implemented horribly inefficiently in that it
does an fsync call for each vma that it encounters in the range
specified.
Post by Michael Kerrisk (man-pages)
8. POSIX makes no mandate for a unified cache system. Thus,
we have MS_ASYNC and MS_INVALIDATE in the standard, and
the standard says nothing (AFAIK) about whether munmap()
will flush data. On Linux (and probably most modern systems),
we're fine, but portable applications that care about
standards and non-unified caches need to use msync().
My advice: To ensure that the contents of a shared file
mapping are written to the underlying file--even on bad old
implementations--a call to msync() should be made before
unmapping a mapping with munmap().
Agreed.
Thanks for checking all of this over and thanks also
for confirming that I learned my lessons well in the
"Jamie Lokier school of tough technical reviewing" ;-).
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
MAP_SHARED
Share this mapping. Updates to the mapping are visible
to other processes that map this file, and are
carried through to the underlying file. The file
may not actually be updated until msync(2) or
munmap() is called.
I believe the piece "or munmap()" is misleading. It implies
that munmap() must trigger a sync action. I don't think this
is true. All that it is required to do is remove some range
of pages from the process's virtual address space. I'm
inclined to remove those words, but I'd like to see if any
FS person has a correction to my understanding first.
I would expect non-coherent systems to update their caches on munmap,
but POSIX does not seem to require this, and I can't find any language
towards that in the HP-UX man page, which was a system that I remember
as non-coherent until the end.
Yes, that's how I read it too. POSIX seems to have no requirements here,
so I assume it was catering to the lowest common denominator.

Cheers,

Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Jamie Lokier
2014-04-21 21:34:18 UTC
Post by Michael Kerrisk (man-pages)
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
7. On Linux (and probably many other modern systems), the only
call that has any real use is msync(MS_SYNC), meaning
"flush the buffers *now*, and I want to wait for that to
complete, so that I can then continue safe in the knowledge
that my data has landed on a device". That's useful if we
want insurance for our data in the event of a system crash.
Right. It's basically another way to call fsync, which is used to
implement it underneath. It actually should be a ranged fdatasync,
but right now it's implemented horribly inefficiently in that it
does an fsync call for each vma that it encounters in the range
specified.
A ranged-fdatasync, for databases with little logs inside the big data
file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
likelihood of that ever appearing in Linux? sync_file_range() comes
with its Warning in the man page which basically means "don't trust me
unless you know the filesystem exactly".
Post by Michael Kerrisk (man-pages)
Thanks for checking all of this over and thanks also
for confirming that I learned my lessens well in the
"Jamie Lokier school of tough technical reviewing" ;-).
Hi! That was a long time ago :)
Post by Michael Kerrisk (man-pages)
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
MAP_SHARED
Share this mapping. Updates to the mapping are visible
to other processes that map this file, and are
carried through to the underlying file. The file
may not actually be updated until msync(2) or
munmap() is called.
I believe the piece "or munmap()" is misleading. It implies
that munmap() must trigger a sync action. I don't think this
is true. All that it is required to do is remove some range
of pages from the process's virtual address space. I'm
inclined to remove those words, but I'd like to see if any
FS person has a correction to my understanding first.
I would expect non-coherent systems to update their caches on munmap,
but POSIX does not seem to require this, and I can't find any language
towards that in the HP-UX man page, which was a system that I remember
as non-coherent until the end.
Yes, that's how I read it too. POSIX seems to have no requirements here,
so I assume it was catering to to the lowest common denominator.
According to this:

http://h30499.www3.hp.com/t5/System-Administration/2-second-delays-in-fsync-msync-munmap/td-p/3092785/page/2#.U1WBw8dSI1-

and the conclusion of the following page:

- munmap() does _something_ on HP-UX, but it might be just a poorly
implemented artifact rather than equivalent to msync.

- While we're there, the lowest common denominator for HP-UX was
that pwrite() followed by mmap() does not provide the data
recently written, even with fsync() between. The thread ended
there, but I would guess either it's a bug _or_ perhaps
write+mmap+msync(MS_INVALIDATE) are needed in that order despite
the write being before the mmap, perhaps if the shared segment
was maintained by another process.

- To keep it exciting, if you look at the HP-UX man page, 32-bit
and 64-bit processes have separate mmap caches - writing to
shared memory in one of them won't be seen immediately by the other.

Then there's this, about Linux NFS incoherency with msync() and O_DIRECT:

- https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ

I don't know if any of the above are _true_ though :)

Best,
-- Jamie
Christoph Hellwig
2014-04-22 06:03:20 UTC
Post by Jamie Lokier
A ranged-fdatasync, for databases with little logs inside the big data
file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
likelihood of that ever appearing in Linux? sync_file_range() comes
with its Warning in the man page which basically means "don't trust me
unless you know the filesystem exactly".
We have the infrastructure for range fsync and fdatasync in the kernel,
it's just not exposed. Given that you've already done the research
how about you send a patch to wire it up? Do the above implementations
at least agree on an API for it?

sync_file_range() unfortunately only writes out pagecache data and never
the needed metadata to actually find it. While we could multiplex a
range fsync over it, that seems very confusing (and would be more
complicated than just adding new syscalls).
Post by Jamie Lokier
- https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
That mail is utterly confused. Yes, NFS has less coherency than normal
filesystems (google for close-to-open), but msync actually does its
proper job on NFS.

Jamie Lokier
2014-04-22 07:04:21 UTC
Post by Christoph Hellwig
Post by Jamie Lokier
A ranged-fdatasync, for databases with little logs inside the big data
file, would be nice. AIX, NetBSD and FreeBSD all have one :) Any
likelihood of that ever appearing in Linux? sync_file_range() comes
with its Warning in the man page which basically means "don't trust me
unless you know the filesystem exactly".
We have the infrastructure for range fsync and fdatasync in the kernel,
it's just not exposed. Given that you've already done the research
how about you send a patch to wire it up? Do the above implementations
at least agree on an API for it?
Hi Christoph,

Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).

As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.
Post by Christoph Hellwig
sync_file_range() unfortunately only writes out pagecache data and never
the needed metadata to actually find it. While we could multiplex a
range fsync over it that seems to be very confusing (and would be more
complicated than just adding new syscalls)
I agree. I never saw the point in sync_file_range() except to mislead,
whereas fsync_range() always seemed obvious!

In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?

For example, imagine two dirty pages 0 and 1, two disk blocks A and B,
and a non-overwriting filesystem (similar to btrfs) which knows about
the dirty flags and has formulated a plan to journal a single metadata
change containing two pointers, from [0->A,1->B] to [0->C,1->D] when
it flushes metadata _after_ pages 0 and 1 are written to new disk
blocks C and D. And you do fsync_range just on block 1. Now if only
page 1 gets written and page 0 does not, it's important that a
different metadata change is journalled: [0->A,1->D] (or just [1->D]).
Now hopefully, all filesystems are sane enough to just do that, by
calculating what to journal as a response to only data I/O that's in
flight and behind a barrier. But I wouldn't like to _assume_ that no
filesystem's algorithm queues up the joint [0->C,1->D] metadata
change somehow, having seen the dirty flags, in a way that gets
confused by a forced metadata flush after a partial dirty data flush.

(Similar things apply to converting preallocated-but-unwritten regions
to written.)

So I have this weird idea that to do it carefully needs a little
checking what filesystems do with carefully ordered block-pointer
metadata writes.
Post by Christoph Hellwig
Post by Jamie Lokier
- https://groups.google.com/d/msg/comp.os.linux.development.apps/B49Rej6KV24/xEouZOVXs9gJ
That mail is utterly confused. Yes, NFS has less coherency than normal
filesystems (google for close-to-open), but msync actually does its
proper job on NFS.
Good to know :)

-- Jamie
Christoph Hellwig
2014-04-22 09:28:37 UTC
Post by Jamie Lokier
Hi Christoph,
Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).
There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.

The APIs are mostly the same, except that AIX reuses O_ flags as the
argument and NetBSD has a separate namespace. Following the latter
seems more sensible, and also allows a developer to define the separate
name to the O_ flag for portability.
Post by Jamie Lokier
As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.
I've cooked up a patch, but I really need someone to test it and promote
it. Find the patch attached. There are two differences to the NetBSD
one:

1) It doesn't fail for read-only FDs. fsync doesn't, and while
standards used to have fdatasync and aio_fsync fail for them,
Linux never did and the standards are catching up:

http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671

2) I don't implement FDISKSYNC. Requiring it is utterly broken,
and we wouldn't even have the infrastructure for it. It might make
sense to provide it defined to 0 so that we have the identifier but
make it a no-op.
Post by Jamie Lokier
In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?
The filesystems I care about handle it fine, and while I don't know
the details of others, they had better handle it properly, given that we
use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
from the NFS server.
Michael Kerrisk (man-pages)
2014-04-23 14:33:06 UTC
Post by Christoph Hellwig
Post by Jamie Lokier
Hi Christoph,
Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).
There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.
The APIs are mostly the same, except that AIX reuses O_ flags as
argument and NetBSD has a separate namespace. Following the latter
seems more sensible, and also allows developer to define the separate
name to the O_ flag for portability.
Post by Jamie Lokier
As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.
I've cooked up a patch, but I really need someone to test it and promote
it. Find the patch attached. There are two differences to the NetBSD
1) It doesn't fail for read-only FDs. fsync doesn't, and while
standards used to have fdatasync and aio_fsync fail for them,
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
and we wouldn't even have the infrastructure for it. It might make
sense to provide it defined to 0 so that we have the identifier but
make it a no-op.
Post by Jamie Lokier
In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?
The filesystems I care about handle it fine, and while I don't know
the details of others, they had better handle it properly, given that we
use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
from the NFS server.
The functionality sounds like it would be worthwhile. I've applied the
patch against 3.15-rc2, and employed the test program below, with test
files on a standard laptop HDD (ext4). The test program repeatedly
a) overwrites a specified region of a file
b) does an fsync_range() on a specified range of the file (need not be
the same region that was written).

The CLI is crude, but the arguments are:

1: pathname
2: number of loops
3: Starting point for writes each time round loop
4: Length of region to write
5: Either 'f' for FFILESYNC or 'd' for FDATASYNC
6: start offset for fsync_range()
7: length for fsync_range()

It seems that the patch does roughly what it says on the tin:

# Precreate a 1MB file

$ sync; time ./t_fsync_range /testfs/f 100 0 1000000 d 0 1000000^C
$ dd of=/testfs/f bs=1000 count=1000 if=/dev/full
1000+0 records in
1000+0 records out
1000000 bytes (1.0 MB) copied, 0.00575843 s, 174 MB/s

# Take journaling and atime out of the equation:

$ sudo umount /dev/sdb6
$ sudo tune2fs -O ^has_journal /dev/sdb6
[sudo] password for mtk:
tune2fs 1.42.8 (20-Jun-2013)
$ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs

# Filesystem unmounted and remounted (with above options) before
# each of the following tests

===

# 1000 loops, writing 1 MB, syncing entire 1MB range, with FFILESYNC:

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real 0m10.677s
user 0m0.011s
sys 0m0.816s


# 1000 loops, writing 1MB, syncing entire 1MB range, with FDATASYNC:
# (Takes less time, as expected)

$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real 0m8.685s
user 0m0.017s
sys 0m0.825s

===

# 1000 loops, writing 1 MB, syncing just 100kB, with FFILESYNC:
# (Takes less time than syncing entire 1MB range, as expected)

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 100000
fsync_range(3, 0x20, 0, 100000)
Performed 16000 writes
Performed 1000 sync operations

real 0m1.501s
user 0m0.005s
sys 0m0.339s

# 1000 loops, writing 1 MB, syncing just 10kB, with FFILESYNC:

$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 10000
fsync_range(3, 0x20, 0, 10000)
Performed 16000 writes
Performed 1000 sync operations

real 0m0.616s
user 0m0.004s
sys 0m0.240s

=======

But I have a question:

When I precreate a 10MB file, and repeat the tests (this time with
100 loops), I no longer see any significant difference between
FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
though I did the tests repeatedly with broadly similar results
each time:

#FFILESYNC

$ time ./t_fsync_range /testfs/f 100 0 10000000 f 0 10000000
fsync_range(3, 0x20, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations

real 0m17.575s
user 0m0.001s
sys 0m0.656s

# FDATASYNC

$ time ./t_fsync_range /testfs/f 100 0 10000000 d 0 10000000
fsync_range(3, 0x10, 0, 10000000)
Performed 15300 writes
Performed 100 sync operations

real 0m17.228s
user 0m0.005s
sys 0m0.624s

======

Add another question: is there any piece of sync_file_range()
functionality that could or should be incorporated in this API?

======

Tested-by: Michael Kerrisk <***@gmail.com>

Cheers,

Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2014-04-23 15:45:50 UTC
Permalink
Post by Michael Kerrisk (man-pages)
$ sudo umount /dev/sdb6
$ sudo tune2fs -O ^has_journal /dev/sdb6
tune2fs 1.42.8 (20-Jun-2013)
$ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
The second strictatime option overrides the earlier norelatime,
so you put atime updates back into the picture.
Post by Michael Kerrisk (man-pages)
When I precreate a 10MB file, and repeat the tests (this time with
100 loops), I no longer see any significant difference between
FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
though I did the tests repeatedly with broadly similar results
Not sure. Do you also see this on other filesystems?
Post by Michael Kerrisk (man-pages)
Add another question: is there any piece of sync_file_range()
functionality that could or should be incorporated in this API?
I don't think so. sync_file_range is a complete mess and impossible
to use correctly for data integrity operations. Especially the whole
notion that submitting I/O and waiting for it are separate operations
is incompatible with a data integrity call.

Jamie Lokier
2014-04-23 22:20:11 UTC
Permalink
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
Add another question: is there any piece of sync_file_range()
functionality that could or should be incorporated in this API?
I don't think so. sync_file_range is a complete mess and impossible
to use correctly for data integrity operations. Especially the whole
notion that submitting I/O and waiting for it are separate operations
is incompatible with a data integrity call.
I guess it's also to give the application a way to nudge a preferred
asynchronous writeback order, prior to a synchronous wait. If the
application knows there's a lot of dirty data being generated over
time prior to needing a short fdatasync, it might see it as beneficial
to tell the kernel to start writing that data sooner, so the fdatasync
delay will be shorter.

-- Jamie
Christoph Hellwig
2014-04-25 06:07:34 UTC
Permalink
Post by Jamie Lokier
I guess it's also to give the application a way to nudge a preferred
asynchronous writeback order, prior to a synchronous wait. If the
application knows there's a lot of dirty data being generated over
time prior to needing a short fdatasync, it might see it as beneficial
to tell the kernel to start writing that data sooner, so the fdatasync
delay will be shorter.
If they want to do an async writeback pass first they can just use
sync_file_range for it, that's the only thing it's actually useful for.

Michael Kerrisk (man-pages)
2014-04-24 09:34:31 UTC
Permalink
(Oops -- I see that I forgot to attach the test program in my last
mail. Appended below, now.)
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
$ sudo umount /dev/sdb6
$ sudo tune2fs -O ^has_journal /dev/sdb6
tune2fs 1.42.8 (20-Jun-2013)
$ sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
The second strictatime option overrides the earlier norelatime,
so you put atime updates back into the picture.
Oh -- have I misunderstood something? I was wanting classical behavior:
atime always updated (but only synced to disk by FILESYNC). Is that not
what I should get with norelatime+strictatime?
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
When I precreate a 10MB file, and repeat the tests (this time with
100 loops), I no longer see any significant difference between
FFILESYNC and FDATASYNC. What am I missing? Sample runs here,
though I did the tests repeatedly with broadly similar results
Not sure. Do you also see this on other filesystems?
=======

So, here's some results from XFS:

# 1000 loops. 1MB file, 1MB fsync_range()
# As with ext4, FDATASYNC is faster than FFILESYNC (as expected)

$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 f 0 1000000
fsync_range(3, 0x20, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real 0m52.264s
user 0m0.018s
sys 0m0.926s
$ sudo umount /dev/sdb6; sudo mount -o norelatime,strictatime /dev/sdb6 /testfs
$ time ./t_fsync_range /testfs/f 1000 0 1000000 d 0 1000000
fsync_range(3, 0x10, 0, 1000000)
Performed 16000 writes
Performed 1000 sync operations

real 0m33.689s
user 0m0.002s
sys 0m0.915s

# (Note that I did not disable XFS journalling--it's not possible to
# do so, right?)

====

# 100 loops, 100MB file, 100MB fsync_range()
# FDATASYNC and FFILESYNC times are again similar

$ time ./t_fsync_range /testfs/f 100 0 100000000 f 0 100000000
fsync_range(3, 0x20, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations

real 4m45.257s
user 0m0.004s
sys 0m5.607s

$ time ./t_fsync_range /testfs/f 100 0 100000000 d 0 100000000
fsync_range(3, 0x10, 0, 100000000)
Performed 152600 writes
Performed 100 sync operations

real 4m43.925s
user 0m0.010s
sys 0m3.824s

# Again, the same pattern: no difference between FFILESYNC and FDATASYNC

=====
On JFS, I get

1000 loops, 1MB file, 1MB fsync_range, FFILESYNC:
* Quite a lot of variability (11.3 to 16.5 secs)
1000 loops, 1MB file, 1MB fsync_range, FDATASYNC:
* Quite a lot of variability (8.6 to 10.9 secs)
==> FDATASYNC is on average faster than FFILESYNC

100 loops, 100 MB file, 100MB fsync_range, FFILESYNC:
281 seconds (just a single test)
100 loops, 100 MB file, 100MB fsync_range, FDATASYNC:
280 seconds (just a single test)

So, again, it seems like for a large file sync, there's no difference between
FFILESYNC and FDATASYNC.
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
Add another question: is there any piece of sync_file_range()
functionality that could or should be incorporated in this API?
I don't think so. sync_file_range is a complete mess and impossible
to use correctly for data integrity operations. Especially the whole
notion that submitting I/O and waiting for it are separate operations
is incompatible with a data integrity call.
Okay -- I just thought it worth checking.

Cheers,

Michael

========
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

/* flags for fsync_range */
#define FDATASYNC 0x0010
#define FFILESYNC 0x0020

#define SYS_fsync_range 317

static int
fsync_range(unsigned int fd, int how, loff_t start, loff_t length)
{
    return syscall(SYS_fsync_range, fd, how, start, length);
}

#define BUF_SIZE 65536
static char buf[BUF_SIZE];

int
main(int argc, char *argv[])
{
    int j, fd, nloops, how;
    size_t writeLen, syncLen, wlen;
    size_t bufSize;
    off_t writeOffset, syncOffset;
    int scnt, wcnt;

    if (argc != 8 || strcmp(argv[1], "--help") == 0) {
        fprintf(stderr, "%s pathname nloops write-offset write-length {f|d} "
                "sync-offset sync-len\n", argv[0]);
        exit(EXIT_SUCCESS);
    }

    fd = open(argv[1], O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        errExit("open");

    nloops = atoi(argv[2]);
    writeOffset = atoi(argv[3]);
    writeLen = atoi(argv[4]);
    how = (argv[5][0] == 'd') ? FDATASYNC :
          (argv[5][0] == 'f') ? FFILESYNC : 0;
    syncOffset = atoi(argv[6]);
    syncLen = atoi(argv[7]);

    if (how != 0)
        fprintf(stderr, "fsync_range(%d, 0x%x, %lld, %zd)\n",
                fd, how, (long long) syncOffset, syncLen);

    scnt = 0;
    wcnt = 0;

    for (j = 0; j < nloops; j++) {
        memset(buf, j % 256, BUF_SIZE);
        if (lseek(fd, writeOffset, SEEK_SET) == -1)
            errExit("lseek");

        wlen = writeLen;
        while (wlen > 0) {
            bufSize = (wlen > BUF_SIZE) ? BUF_SIZE : wlen;
            wlen -= bufSize;

            if (write(fd, buf, bufSize) != bufSize) {
                fprintf(stderr, "Write failed\n");
                exit(EXIT_FAILURE);
            }

            wcnt++;
        }

        if (how != 0) {
            scnt++;
            if (fsync_range(fd, how, syncOffset, syncLen) == -1)
                errExit("fsync_range");
        }
    }

    fprintf(stderr, "Performed %d writes\n", wcnt);
    fprintf(stderr, "Performed %d sync operations\n", scnt);
    exit(EXIT_SUCCESS);
}
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Jamie Lokier
2014-04-23 22:15:27 UTC
Permalink
Post by Christoph Hellwig
Post by Jamie Lokier
Hi Christoph,
Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).
There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.
Yes, especially with the headings on the man pages saying FreeBSD :)
Just checked a FreeBSD 8.2 system, doesn't have it.
Post by Christoph Hellwig
The APIs are mostly the same, except that AIX reuses O_ flags as the
argument and NetBSD has a separate namespace. Following the latter
seems more sensible, and also allows a developer to define the separate
name to the O_ flag for portability.
...
Post by Christoph Hellwig
I've cooked up a patch, but I really need someone to test it and promote
it. Find the patch attached. There are two differences to the NetBSD
1) It doesn't fail for read-only FDs. fsync doesn't, and the standards
requirement that fdatasync and aio_fsync fail for them was dropped:
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
See also for maybe why:

http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html
Post by Christoph Hellwig
2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
and we wouldn't even have the infrastructure for it. It might make
sense to provide it defined to 0 so that we have the identifier but
make it a no-op.
I presume Linux does the equivalent without needing FDISKSYNC, if and
only if the filesystem is mounted with barriers enabled, which is the
default nowadays?
Post by Christoph Hellwig
Post by Jamie Lokier
In the kernel, I was always under the impression the simple part of
fsync_range - writing out data pages - was solved years ago, but being
sure the filesystem's updated its metadata in the proper way, that
begs for a little research into what filesystems do when asked,
doesn't it?
The filesystems I care about handle it fine, and while I don't know
the details of others they better handle it properly, given that we
use vfs_fsync_range to implement O_SYNC/O_DSYNC writes and commits
from the nfs server.
Excellent. This really looks like it should have gone in as a system
call years ago, since vfs_fsync_range was there all along waiting to
be used!
Post by Christoph Hellwig
1) It doesn't fail for read-only FDs. fsync doesn't, and while standards
used to require fdatasync and aio_fsync to fail for read-only file
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
2) It doesn't implement FDISKSYNC. Requiring a flag to actually make
data persistent is completely broken, and the Linux infrastructure
doesn't support it anyway. We could provide it as a no-op if we
really need to.
Ah, more differences, which I think should be dropped actually.

3) Does not implement NetBSD's documented behaviour when length == 0.
NetBSD says "If the length parameter is zero, fsync_range() will
synchronize all of the file data". This patch instead syncs from the given offset.

4) Other weird range stuff inherited from sync_file_range() on 32
bit machines only. May not be correct with O_DIRECT or
filesystems that don't use page cache.
Post by Christoph Hellwig
+static loff_t end_offset(loff_t offset, loff_t nbytes)
+{
+	loff_t endbyte = offset + nbytes;
+
+	if ((s64)offset < 0)
+		return -EINVAL;
+	if ((s64)endbyte < 0)
+		return -EINVAL;
+	if (endbyte < offset)
+		return -EINVAL;
+
+	if (sizeof(pgoff_t) == 4) {
+		if (offset >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+			/*
+			 * The range starts outside a 32 bit machine's
+			 * pagecache addressing capabilities. Let it "succeed"
+			 */
+			return 0;
+		}
+		if (endbyte >= (0x100000000ULL << PAGE_CACHE_SHIFT)) {
+			/*
+			 * Out to EOF
+			 */
+			return LLONG_MAX;
+		}
+	}
+
+	if (nbytes == 0)
+		endbyte = LLONG_MAX;
+	else
+		endbyte--;	/* inclusive */
+
+	return endbyte;
+}
That was in sync_file_range(), where I think it might have made more
sense as that's obviously tied to the page cache only. So:

a) Giving zero length results in sync from offset..LLONG_MAX.
(NetBSD would have it be 0..LLONG_MAX, according to man page.)

b) If the offset is "too large" for page cache on a 32-bit machine,
it won't do anything -- including no metadata side-effects.

c) If the length is "too large" for page cache on a 32-bit machine,
it extends the length to LLONG_MAX.

The desired behaviour with zero length, that's obviously a judgement
call. I guess that provided NetBSD applications the option to use
FDISKSYNC without a range :)

About b) and c) they both look dubious, because it's not a given that
a filesystem is using page cache, or only using page cache. For
example FUSE using O_DIRECT. (Not that I've checked if you can
actually write anything in those ranges though.)

b) looks worse because it means side effects are also quietly not
done, and a file might legitimately not use the page cache (consider a
FUSE-mounted file accessed with O_DIRECT).

So, would it not make sense to just check the offset, length and
offset+length fit into s64; and if length is zero change the range to
0..LLONG_MAX, and simply match NetBSD that way? (Or, call me crazy,
just return if length is zero.)

Best,
-- Jamie
Christoph Hellwig
2014-04-25 06:26:18 UTC
Permalink
Post by Jamie Lokier
Post by Christoph Hellwig
1) It doesn't fail for read-only FDs. fsync doesn't, and while
standards used to have fdatasync and aio_fsync fail for them,
http://austingroupbugs.net/view.php?id=501
http://austingroupbugs.net/view.php?id=671
http://www.eivanov.com/2011/06/using-fsync-and-fsyncrange-with.html
I don't really see a "why" there, just the observation that fsync and
fsync_range behavior different on NetBSD, which is odd but documented
behavior.
Post by Jamie Lokier
Post by Christoph Hellwig
2) I don't implement the FDISKSYNC. Requiring it is utterly broken,
and we wouldn't even have the infrastructure for it. It might make
sense to provide it defined to 0 so that we have the identifier but
make it a no-op.
I presume Linux does the equivalent without needing FDISKSYNC, if and
only if the filesystem is mounted with barriers enabled, which is the
default nowadays?
That's correct, at least for modern mainstream filesystems. Either way
the filesystem would have to implement the cache flush, so those that
don't support it couldn't support FDISKSYNC either.
Post by Jamie Lokier
Ah, more differences, which I think should be dropped actually.
3) Does not implement NetBSD's documented behaviour when length == 0.
NetBSD says "If the length parameter is zero, fsync_range() will
synchronize all of the file data". This patch instead syncs from the given offset.
Indeed. AIX also documents the same behavior.
Post by Jamie Lokier
4) Other weird range stuff inherited from sync_file_range() on 32
bit machines only. May not be correct with O_DIRECT or
filesystems that don't use page cache.
It's not really possible to implement a full Linux filesystem without
touching the pagecache, but I agree that this probably doesn't
belong into the VFS. sync_file_range is one of these odd layering
violations that calls straight into the pagecache without going into
the filesystem first (readahead is the other one that comes to mind).
Post by Jamie Lokier
The desired behaviour with zero length, that's obviously a judgement
call. I guess that provided NetBSD applications the option to use
FDISKSYNC without a range :)
It seems to originate from the earlier AIX version, but I think it's
just their way to sync the whole range. I prefer our 0, LLONG_MAX
notation, but given the existing user interface we should stick to it.

Dave Chinner
2014-04-24 01:34:35 UTC
Permalink
Post by Christoph Hellwig
Post by Jamie Lokier
Hi Christoph,
Hardly research, I just did a quick Google and was surprised to find
some results. AIX API differs from the BSDs; the BSDs seem to agree
with each other. fsync_range(), with a flag parameter saying what type
of sync, and whether it flushes the storage device write cache as well
(because they couldn't agree that was good - similar to the barriers
debate).
There is no FreeBSD implementation, I think you were confused by FreeBSD
also hosting NetBSD man pages on their site, just as I initially was.
The APIs are mostly the same, except that AIX reuses O_ flags as the
argument and NetBSD has a separate namespace. Following the latter
seems more sensible, and also allows a developer to define the separate
name to the O_ flag for portability.
Post by Jamie Lokier
As for me doing it, no, sorry, I haven't touched the kernel in a few
years, life's been complicated for non-technical reasons, and I don't
have time to get back into it now.
I've cooked up a patch, but I really need someone to test it and promote
it. Find the patch attached. There are two differences to the NetBSD
.....
Post by Christoph Hellwig
From b63881cac84b35ce3d6a61a33e33ac795a5c583c Mon Sep 17 00:00:00 2001
Date: Tue, 22 Apr 2014 11:24:51 +0200
Subject: fs: implement fsync_range
Christoph, if this is going into the kernel, can you add support for
xfs_io and write a couple of xfstests to test it? I'm not
comfortable with adding new data integrity primitives to the kernel
without having robust validation infrastructure already in place for
it. It might also be worthwhile looking to extend Josef's
fsync-tester.c to be able to use ranged fsyncs so as to test all the
various corner cases that we need to....

Cheers,

Dave.
--
Dave Chinner
david-***@public.gmane.org
Christoph Hellwig
2014-04-25 06:06:52 UTC
Permalink
Post by Dave Chinner
Christoph, if this is going into the kernel, can you add support for
xfs_io and write a couple of xfstests to test it? I'm not
comfortable with adding new data integrity primitives to the kernel
without having robust validation infrastructure already in place for
it. It might also be worthwhile looking to extend Josef's
fsync-tester.c to be able to use ranged fsyncs so as to test all the
various corner cases that we need to....
If we actually want to add it, it will obviously need test coverage. Seems
like I can't really get people excited enough to make this more than a
PoC so far, though.

Matthew Wilcox
2014-04-23 14:03:08 UTC
Permalink
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
6. On Linux, MS_ASYNC is also a no-op. That's fine on a unified
cache system. Filesystem I/O always sees a consistent view,
and MS_ASYNC never undertook to give a guarantee about *when*
the update would occur. (The Linux buffer cache logic will
ensure that it is flushed out sometime in the near future.)
Right. It's a fairly inefficient noop, though - it actually loops
over all vmas to do nothing with them.
This will probably change for Persistent Memory. The reason it
works today is that we have a page cache which tracks dirty bits and
periodically writes dirty pages to storage. If we bypass the page cache,
we have to ensure that everything does still eventually get synced.

I don't quite know how this is going to work yet ... I have a number of
ideas in my head. It probably won't be asynchronous though!
Post by Christoph Hellwig
Post by Michael Kerrisk (man-pages)
7. On Linux (and probably many other modern systems), the only
call that has any real use is msync(MS_SYNC), meaning
"flush the buffers *now*, and I want to wait for that to
complete, so that I can then continue safe in the knowledge
that my data has landed on a device". That's useful if we
want insurance for our data in the event of a system crash.
Right. It's basically another way to call fsync, which is used to
implement it underneath. It actually should be a ranged fdatasync,
but right now it's implemented horribly inefficiently in that it
does an fsync call for each vma that it encounters in the range
specified.
See also:

From: Matthew Wilcox <***@intel.com>
To: linux-***@kvack.org, linux-***@vger.kernel.org
Cc: Matthew Wilcox <***@intel.com>, ***@linux.intel.com
Subject: [PATCH] Sync only the requested range in msync
Date: Thu, 27 Mar 2014 19:02:41 -0400
Message-Id: <1395961361-21307-1-git-send-email-***@intel.com>
