Discussion:
EXT4 nodelalloc => back to stone age.
Dmitry Monakhov
2013-04-01 11:06:18 UTC
I've mounted ext4 with -onodelalloc on my SSD (INTEL SSDSA2CW120G3,4PC10362)
It shows numbers slower than an HDD produced 15 years ago:
# mount $SCRATCH_DEV $SCRATCH_MNT -onodelalloc
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
blktrace shows horrible traces:
Eric Sandeen
2013-04-01 15:18:51 UTC
1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
...
2) Why don't we have writepages for non delalloc case ?
...

I'd add:

3) Why do we have a "nodelalloc" mount option at all?

but then I thought:

Is it also this bad when using the ext4 driver to run an ext3 fs?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Ts'o
2013-04-01 15:39:52 UTC
Post by Eric Sandeen
3) Why do we have a "nodelalloc" mount option at all?
Is it also this bad when using the ext4 driver to run an ext3 fs?
Yes, and there would be a similar performance problem if you are
using the ext3 file system driver, since ext3_*_writepage() also ends
up calling block_write_full_page() which will also result in the
writes happening with WRITE_SYNC.

The main reason why we keep nodelalloc at this point is bug-for-bug
compatibility with ext3 file systems --- basically, for users who are
using this as a workaround for the O_PONIES issue instead of fixing
their applications to use fsync() appropriately.

So another question is how much do we care about exact emulation of
ext3's behaviour for those distributions who wish to use ext4 file
system driver for ext2 and ext3 file systems?

One of the reasons for keeping nodelalloc mode was the argument
that removing it wouldn't really allow us to remove that much
complexity from ext4. But adding a nodelalloc-specific ext4_writepages
would result in adding a huge amount of complexity, and my first
reaction is that it's really not worth the code maintenance headache.
Dmitry, is there a reason why you are especially worried about the
performance of nodelalloc mode?

- Ted
Eric Sandeen
2013-04-01 16:00:33 UTC
Post by Theodore Ts'o
Post by Eric Sandeen
3) Why do we have a "nodelalloc" mount option at all?
Is it also this bad when using the ext4 driver to run an ext3 fs?
Yes, and there would be a similar performance problem if you are
using the ext3 file system driver, since ext3_*_writepage() also ends
up calling block_write_full_page() which will also result in the
writes happening with WRITE_SYNC.
The main reason why we keep nodelalloc at this point is bug-for-bug
compatibility with ext3 file systems --- basically, for users who are
using this as a workaround for the O_PONIES issue instead of fixing
their applications to use fsync() appropriately.
Sorry for getting off the original thread here, but IMHO these are
2 different things:

nondelalloc behavior makes sense for ext3, but:
-o nodelalloc mount options don't make sense for ext4.
Post by Theodore Ts'o
So another question is how much do we care about exact emulation of
ext3's behaviour for those distributions who wish to use ext4 file
system driver for ext2 and ext3 file systems?
One of the reasons for keeping nodelalloc mode was the argument
that removing it wouldn't really allow us to remove that much
complexity from ext4.
IMHO we should keep the mode for ext2/3, but lose the ext4 option.
It'd just be one less row in the ext4 test matrix.

-Eric
Post by Theodore Ts'o
But adding a nodelalloc-specific ext4_writepages
would result in adding a huge amount of complexity, and my first
reaction is that it's really not worth the code maintenance headache.
Dmitry, is there a reason why you are especially worried about the
performance of nodelalloc mode?
- Ted
Zheng Liu
2013-04-01 16:34:33 UTC
Hi Eric,
Post by Eric Sandeen
Post by Theodore Ts'o
Post by Eric Sandeen
3) Why do we have a "nodelalloc" mount option at all?
Is it also this bad when using the ext4 driver to run an ext3 fs?
Yes, and there would be a similar performance problem if you are
using the ext3 file system driver, since ext3_*_writepage() also ends
up calling block_write_full_page() which will also result in the
writes happening with WRITE_SYNC.
The main reason why we keep nodelalloc at this point is bug-for-bug
compatibility with ext3 file systems --- basically, for users who are
using this as a workaround for the O_PONIES issue instead of fixing
their applications to use fsync() appropriately.
Sorry for getting off the original thread here, but IMHO these are
-o nodelalloc mount options don't make sense for ext4.
nodelalloc makes sense to me. In our production system, we hit a latency
problem caused by the delalloc feature. The workload is a web app
that does append writes (approximately 5M/s) and waits for the flusher to
do the writeout. We observe that every 30 seconds the latency
reaches a high level (approximately 100-200ms or higher, versus the normal
10-20ms). The reason is that when the flusher tries to write dirty pages out,
it takes the i_data_sem lock (write lock) and allocates blocks for
these dirty pages. In the meantime the app does append
write(2)s that try to take the i_data_sem lock (read lock) too, so the
app is delayed. So I think nodelalloc is still useful for us.
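
To make the stall pattern concrete, here is a minimal toy model of the contention described above. It is only a sketch: the 150 ms hold time and 15 ms baseline are hypothetical values picked from inside the ranges reported in this message, and the function name and parameters are made up for illustration. It does not model the real i_data_sem or the flusher.

```python
def append_latency_ms(arrival_ms, base_ms=15.0, period_ms=30_000.0, hold_ms=150.0):
    """Latency of one append write in a toy model of the lock contention.

    Assumptions (hypothetical, not measured): every period_ms the flusher
    takes the lock exclusively for hold_ms, and an append that arrives
    inside that window has to wait until the window ends before it can
    proceed with its normal base_ms of work.
    """
    offset = arrival_ms % period_ms      # time since the flusher last started
    wait = max(0.0, hold_ms - offset)    # remaining exclusive hold, if any
    return wait + base_ms


# Appends arriving mid-period see the baseline; appends arriving just as
# the flusher starts see the baseline plus the remaining hold time.
for t in (1_000.0, 30_000.0, 30_100.0, 31_000.0):
    print("arrival %8.0f ms -> latency %6.1f ms" % (t, append_latency_ms(t)))
```

Under these assumptions only the appends that land inside the flusher's window are slow, which would match a spike every 30 seconds rather than a uniformly higher latency.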

Regards,
- Zheng


Chris Mason
2013-04-01 15:45:41 UTC
Quoting Eric Sandeen (2013-04-01 11:18:51)
Post by Eric Sandeen
1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
Yes? The stuff we wait on should be WRITE_SYNC.
Post by Eric Sandeen
...
2) Why don't we have writepages for non delalloc case ?
...
3) Why do we have a "nodelalloc" mount option at all?
Is it also this bad when using the ext4 driver to run an ext3 fs?
Quick comparison on a single iodrive:

Ext4 (defaults):
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.95442 s, 549 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.45012 s, 740 MB/s

Ext4 (nodelalloc):
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 2.97308 s, 361 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.76617 s, 608 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc

XFS gives 628, 733MB/s

Btrfs gives 659, 635MB/s -- since we're doing fsync, this includes all
the crcs for the data.

Ext3 mounted by ext4.ko: 291, 467MB/s
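
For anyone comparing these figures by hand, a small sketch that recomputes throughput from dd summary lines may help. It assumes the GNU dd summary format shown in this thread and uses dd's convention of 1 MB = 10^6 bytes; the helper name and the labels are made up for illustration.

```python
import re

# Matches the summary line GNU dd prints, for example:
# "1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s")


def dd_throughput_mb_s(line):
    """Return throughput in MB/s (10**6 bytes per second) for a dd summary line."""
    m = DD_SUMMARY.search(line)
    if m is None:
        raise ValueError("not a dd summary line: %r" % line)
    nbytes, seconds = int(m.group(1)), float(m.group(2))
    return nbytes / seconds / 1e6


# Summary lines copied from the messages in this thread.
runs = {
    "SSD, ext4 nodelalloc": "1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s",
    "iodrive, ext4 defaults": "1073741824 bytes (1.1 GB) copied, 1.45012 s, 740 MB/s",
    "iodrive, ext4 nodelalloc": "1073741824 bytes (1.1 GB) copied, 1.76617 s, 608 MB/s",
}

for label, line in runs.items():
    print("%-26s %7.1f MB/s" % (label, dd_throughput_mb_s(line)))
```

Because conv=fsync is used, the elapsed time in each line includes the final flush to the device, so the numbers are comparable across file systems.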

-chris

Chris Mason
2013-04-01 15:57:05 UTC
Quoting Chris Mason (2013-04-01 11:45:41)
Post by Chris Mason
Quoting Eric Sandeen (2013-04-01 11:18:51)
Post by Eric Sandeen
1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
Yes? The stuff we wait on should be WRITE_SYNC.
Post by Eric Sandeen
...
2) Why don't we have writepages for non delalloc case ?
...
3) Why do we have a "nodelalloc" mount option at all?
Is it also this bad when using the ext4 driver to run an ext3 fs?
On the theory that writepages is the problem try echo 1 >
/sys/block/xxx/queue/rotational. With request merging on here in
nodelalloc mode:

dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 2.53741 s, 423 MB/s

dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.37795 s, 779 MB/s

-chris
Jan Kara
2013-04-02 13:46:34 UTC
Post by Dmitry Monakhov
I've mounted ext4 with -onodelalloc on my SSD (INTEL SSDSA2CW120G3,4PC10362)
It shows numbers slower than an HDD produced 15 years ago:
# mount $SCRATCH_DEV $SCRATCH_MNT -onodelalloc
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
Hum, not sure why you see all the events 4x. But that's not important I
guess.
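
A rough way to read such a trace is to drop the verbatim duplicate lines and look only at the queue (Q) events. The sketch below assumes the default blkparse line layout seen above (device, CPU, sequence, timestamp, PID, action, RWBS, sector, "+", sector count, process) and 512-byte sectors; the function names are made up for illustration, and only a few lines of the trace are included.

```python
from collections import Counter

SECTOR_BYTES = 512  # blktrace reports offsets and lengths in 512-byte sectors

# A few lines copied from the trace above, duplicates included.
TRACE = """\
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
"""


def parse_event(line):
    """Split one default-format blkparse line into named fields."""
    dev, cpu, seq, ts, pid, action, rwbs, sector, _plus, nsect, proc = line.split()
    return {
        "time": float(ts),
        "action": action,                     # Q = queued, C = completed
        "rwbs": rwbs,                         # W = write, WS = write flagged sync
        "sector": int(sector),
        "bytes": int(nsect) * SECTOR_BYTES,
        "proc": proc.strip("[]"),
    }


def queued_events(trace_text):
    """Drop verbatim duplicate lines, then keep only queue (Q) events."""
    unique_lines = dict.fromkeys(l for l in trace_text.splitlines() if l.strip())
    return [e for e in map(parse_event, unique_lines) if e["action"] == "Q"]


events = queued_events(TRACE)
for e in events:
    print(e["proc"], e["rwbs"], e["sector"], e["bytes"])
print(Counter(e["rwbs"] for e in events))
```

Every queued request in the sample is 8 sectors, i.e. 4096 bytes, which is one page on a system with 4 KiB pages, and the requests come from jbd2, the flusher thread and dd interleaved.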
Post by Dmitry Monakhov
As one can see, data is written from two threads, dd and jbd2, on a per-page
basis, and jbd2 submits pages with WRITE_SYNC, i.e. we write page-by-page
synchronously :)
journal_submit_inode_data_buffers
  wbc.sync_mode = WB_SYNC_ALL
  ->generic_writepages
    ->write_cache_pages
      ->ext4_writepage
        ->ext4_bio_write_page
          ->io_submit_add_bh
            ->io_submit_init
              WRITE);
          ->ext4_io_submit(io);
1)Do we really have to use WRITE_SYNC in case of WB_SYNC_ALL ?
Actually WRITE_SYNC doesn't mean we write synchronously. We just tell the
IO scheduler that we are going to wait for the IO to complete soon. So it
prioritizes these writes against other async writes. We don't have to use
WRITE_SYNC but really in this case we do pretty much what IO scheduler
people want - flag IO that's going to be waited upon.
Post by Dmitry Monakhov
Why is blk_finish_plug(&plug), which is called from generic_writepages(),
not enough? As far as I can see this code was copy-pasted from XFS.
DIO also tags bios with WRITE_SYNC, but what happens if the file
is highly fragmented (or the block device is RAID0)? We will end up doing
synchronous I/O.
I see you are tracing the DM device. That may be actually somewhat
confusing since you are missing some actions like merges of requests and
dispatches to underlying device.
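
To illustrate what a merge could do with the requests in the trace above, here is a toy sketch that coalesces back-to-back (start sector, sector count) pairs. It is not the block layer's merge logic, which also checks things such as request direction and size limits; the function name is made up for illustration.

```python
def merge_contiguous(requests):
    """Coalesce (start_sector, nr_sectors) pairs that touch end to start."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This request begins exactly where the previous one ends.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged


# The five distinct 8-sector requests that appear in the trace.
requests = [(1219344, 8), (1219352, 8), (1219360, 8), (1219368, 8), (1219376, 8)]
print(merge_contiguous(requests))
```

In this toy model the five page-sized requests collapse into a single 40-sector request, which is the kind of step a trace taken only at the DM level would not show.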
Post by Dmitry Monakhov
2) Why don't we have writepages for non delalloc case ?
I want to fix (2) by implementing writepages() for the non-delalloc case.
Once this is done we may add a new flag, WB_SYNC_NOALLOC, so that
journal_submit_inode_data_buffers will use
__filemap_fdatawrite_range(, , , WB_SYNC_ALL | WB_SYNC_NOALLOC),
which will call an optimized ->ext4_writepages().
So what would you expect from ->writepages() implementation?

Anyway the throughput you see looks bad. What kernel version are you using?
There's a possibility that my recent changes to ext4_writepage() could have
slowed something down...

Honza
--
Jan Kara <***@suse.cz>
SUSE Labs, CR