Discussion: fs test suite
Pat LaVarre
2003-10-07 04:11:28 UTC
Anybody got an fs test suite posted that I could easily apply to the udf
of linux-2.6.0-test6?

I ask because I'm seeing ext3 work fine on top of loop devices, not so
udf.

Pat LaVarre


Randy.Dunlap
2003-10-07 14:17:55 UTC
On 06 Oct 2003 22:11:28 -0600 Pat LaVarre <***@ieee.org> wrote:

| Anybody got an fs test suite posted that I could easily apply to the udf
| of linux-2.6.0-test6?
|
| I ask because I'm seeing ext3 work fine on top of loop devices, not so
| udf.

Are you looking for a test (suite) that tests fs metadata moreso
than fs IO? People have asked for that a few times, but I don't
know of one that is made for that.

Here are some possibilities:

iozone - all sorts of read/write testing, little metadata
http://www.iozone.org
postmark - email-like tester, mostly small files, with file
create/delete
[google for it]
fsx - tester that has stressed (and busted) extN and nfs several
times [I would start here.]
from: http://www.codemonkey.org.uk/cruft/
or: http://www.zip.com.au/~akpm/linux/patches/stuff/
(these might be different versions of the same prog.)

--
~Randy
Zachary Peterson
2003-10-07 14:59:59 UTC
Also try Connectathon, which runs a series of individual system call
tests that check for correctness and deliver performance metrics.

http://www.connectathon.org/

It's not great, but free.

Zachary


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Zachary Peterson ***@cse.ucsc.edu
http://znjp.com

856D 29FA E1F7 DB5E 9215 C68D 5F0F 3929 C929 9A72
Post by Randy.Dunlap
| Anybody got an fs test suite posted that I could easily apply to the udf
| of linux-2.6.0-test6?
|
| I ask because I'm seeing ext3 work fine on top of loop devices, not so
| udf.
Are you looking for a test (suite) that tests fs metadata moreso
than fs IO? People have asked for that a few times, but I don't
know of one that is made for that.
iozone - all sorts of read/write testing, little metadata
http://www.iozone.org
postmark - email-like tester, mostly small files, with file
create/delete
[google for it]
fsx - tester that has stressed (and busted) extN and nfs several
times [I would start here.]
from: http://www.codemonkey.org.uk/cruft/
or: http://www.zip.com.au/~akpm/linux/patches/stuff/
(these might be different versions of the same prog.)
--
~Randy
Randy.Dunlap
2003-10-07 17:16:02 UTC
In that vein, there's also the Linux Test Project (LTP),
http://ltp.sourceforge.net/

--
~Randy


On Tue, 7 Oct 2003 07:59:59 -0700 (PDT) Zachary Peterson <***@cse.ucsc.edu> wrote:

|
| Also try Connectathon, which runs a series of individual system call
| tests, that look for correctness and deliver performance metrics.
|
| http://www.connectathon.org/
|
| It's not great, but free.
|
| Zachary
|
|
| =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
| Zachary Peterson ***@cse.ucsc.edu
| http://znjp.com
|
| 856D 29FA E1F7 DB5E 9215 C68D 5F0F 3929 C929 9A72
|
| On Tue, 7 Oct 2003, Randy.Dunlap wrote:
|
| >On 06 Oct 2003 22:11:28 -0600 Pat LaVarre <***@ieee.org> wrote:
| >
| >| Anybody got an fs test suite posted that I could easily apply to the udf
| >| of linux-2.6.0-test6?
| >|
| >| I ask because I'm seeing ext3 work fine on top of loop devices, not so
| >| udf.
| >
| >Are you looking for a test (suite) that tests fs metadata moreso
| >than fs IO? People have asked for that a few times, but I don't
| >know of one that is made for that.
| >
| >Here are some possibilities:
| >
| >iozone - all sorts of read/write testing, little metadata
| > http://www.iozone.org
| >postmark - email-like tester, mostly small files, with file
| > create/delete
| > [google for it]
| >fsx - tester that has stressed (and busted) extN and nfs several
| > times [I would start here.]
| > from: http://www.codemonkey.org.uk/cruft/
| > or: http://www.zip.com.au/~akpm/linux/patches/stuff/
| > (these might be different versions of the same prog.)
| >
| >--
Pat LaVarre
2003-10-07 18:54:57 UTC
Post by Randy.Dunlap
Are you looking for a test (suite) that tests
fs metadata moreso than fs IO? People have
asked for that a few times, but I don't know
of one that is made for that.
May I ask you to elaborate? I'm not yet confident I understand the
question. I mean to ask how do I increase my confidence that 2.4.x and
2.6.x udf.ko will read back to me what I wrote thru it. I figure that
mixes together metadata and data, since the metadata tells me how much
and from where I read back my data.

I see my semi-private ***@hpesjro.fc.hp.com thread titled "zeroes
read back more often than appended" says a write-read-compare test as
trivial as fopen-fwrite-fclose doesn't yet work.
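
For concreteness, a minimal sketch of that kind of write-read-compare test
(the /mnt/udf mount point and file name are assumptions, not anything from
this thread):

/* Minimal write-read-compare sketch; /mnt/udf is an assumed mount point. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *path = "/mnt/udf/rwtest.dat";   /* assumption */
        char wbuf[4096], rbuf[4096];
        size_t i;
        FILE *f;

        for (i = 0; i < sizeof(wbuf); i++)
                wbuf[i] = (char)(i & 0xff);          /* non-zero pattern */

        f = fopen(path, "wb");                       /* fopen-fwrite-fclose */
        if (!f || fwrite(wbuf, 1, sizeof(wbuf), f) != sizeof(wbuf) || fclose(f)) {
                perror("write");
                return 1;
        }

        f = fopen(path, "rb");
        if (!f || fread(rbuf, 1, sizeof(rbuf), f) != sizeof(rbuf)) {
                perror("read");
                return 1;
        }
        fclose(f);

        if (memcmp(wbuf, rbuf, sizeof(rbuf))) {
                fprintf(stderr, "mismatch: read back differs from what was written\n");
                return 1;
        }
        puts("ok");
        return 0;
}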

I blame that either on my own bonehead newbie errors i.e. illegit test
setup, else low-hanging bugs. I'm hear wondering, can I easily look for
other low-hanging fruit.
Post by Randy.Dunlap
http...
Thank you! I will pursue:

http://www.codemonkey.org.uk/cruft/
http://www.zip.com.au/~akpm/linux/patches/stuff/
http://ltp.sourceforge.net/
http://www.iozone.org
http://www.google.com/search?q=postmark+linux
http://www.connectathon.org/
http://www.google.com/search?q=bonnie+linux

Pat LaVarre


Randy.Dunlap
2003-10-07 18:58:46 UTC
On 07 Oct 2003 12:54:57 -0600 Pat LaVarre <***@ieee.org> wrote:

| > Are you looking for a test (suite) that tests
| > fs metadata moreso than fs IO? People have
| > asked for that a few times, but I don't know
| > of one that is made for that.
|
| May I ask you to elaborate? I'm not yet confident I understand the
| question. I mean to ask how do I increase my confidence that 2.4.x and
| 2.6.x udf.ko will read back to me what I wrote thru it. I figure that
| mixes together metadata and data, since the metadata tells me how much
| and from where I read back my data.

Sure, they are usually mixed, but some tests emphasize (or stress)
file data IO vs. metadata more than others do.
And sometimes people ask for a metadata stress test, which would
focus on mv, ln, stat, etc., more than reading/writing file data.
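
For illustration, a small sketch of the sort of metadata-heavy exercise being
described: create, link, stat, rename and unlink without ever writing file
data. The /mnt/test directory and the loop count are assumptions.

/* Metadata-only exercise: no file data is ever written. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
        char a[64], b[64];
        struct stat st;
        int i, fd;

        for (i = 0; i < 1000; i++) {
                snprintf(a, sizeof(a), "/mnt/test/f%d", i);  /* assumed dir */
                snprintf(b, sizeof(b), "/mnt/test/g%d", i);

                fd = open(a, O_CREAT | O_WRONLY, 0644);      /* create, no data */
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                close(fd);

                if (link(a, b) || stat(b, &st) ||            /* ln, stat */
                    unlink(a) || rename(b, a) ||             /* rm, mv */
                    unlink(a)) {
                        perror("metadata op");
                        return 1;
                }
        }
        return 0;
}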

| I see my semi-private ***@hpesjro.fc.hp.com thread titled "zeroes
| read back more often than appended" says a write-read-compare test as
| trivial as fopen-fwrite-fclose doesn't yet work.

That sounds like a filesystem IO test more than a metadata test,
though the problem could be in either area.

| I blame that either on my own bonehead newbie errors i.e. illegit test
| setup, else low-hanging bugs. I'm hear wondering, can I easily look for
| other low-hanging fruit.

Hear?

--
~Randy
Pat LaVarre
2003-10-07 19:26:50 UTC
some tests emphasize ...
mv, ln, stat, etc.,
more than reading/writing file
Clear now, thank you. Sorry to hear no one much exercises the many ways
of writing only metadata. Immediately I think to include `ls` and
`touch` in your etc., also I see `head` and `tail` and `tail -f` halfway
back towards stressing data.
Post by Pat LaVarre
I'm hear wondering,
can I easily look for other low-hanging fruit.
You know, “with enough eyeballs, all bugs are shallow”.
Post by Pat LaVarre
... I'm hear wondering,
can I easily look for other low-hanging fruit.
Hear?
I meant "here", I'm not sure if you understood me or not, sorry if not,
grin if yes. Me, I learned American English as a phonetic foreign
language. For example, I didn't discover people who pronounced "herb"
with an h until I discovered English English, and I prefer to spell
"façade" with a soft 'ç', and I still think "aisle" and "isle" ought to
sound more like "ayzel" and less like "island", and ...

Pat LaVarre


Nir Tzachar
2003-10-20 09:12:07 UTC
hello all.

We're proud to announce the availability of a _proof of concept_ file
system, called srfs. ( http://www.cs.bgu.ac.il/~srfs/ ).
a quick overview: [from the home page]
srfs is a global file system designed to be distributed geographically over
multiple locations and provide a consistent, highly available and durable
infrastructure for information.

Started as a research project into file systems and self-stabilization at
Ben-Gurion University of the Negev's Department of Computer Science, the
project aims to integrate self-stabilization methods and algorithms into
the file (and operating) system to provide a system with the desired
behavior in the presence of transient faults.

Based on layered self-stabilizing algorithms, it provides a tree replication
structure based on auto-discovery of servers using local and global IP
multicasting. The tree structure provides the command and timing
infrastructure required for a distributed file system.

The project is basically divided into two components:
1) a kernel module, which provides the low-level functionality and
disk management.
2) a user space caching daemon, which provides the stabilization and
replication properties of the file system.
These two components communicate via a character device.

More info on the system architecture can be found on the web page, and
here: http://www.cs.bgu.ac.il/~tzachar/srfs.pdf

We hope some will find this interesting enough to take for a test drive,
and won't mind the latencies (currently, the caching daemon is a bit slow;
hopefully, we will improve it in the future).
Anyway, please keep in mind this is a very early version that only works
and keeps the stabilization properties. No POSIX compliance whatsoever...

the code contains several hacks and design flaws that we're aware of,
and probably many that we're not... so please be gentle ;)

If you find this interesting, please contact us with your insights.
cheers,
the srfs team.

P.S. I would like to thank all members of this mailing list (fsdevel) for
your continual help with problems we encountered during the development.
Thanks guys (and girls???).

========================================================================
nir.
Eric Sandall
2003-10-20 21:00:38 UTC
Post by Nir Tzachar
hello all.
We're proud to announce the availability of a _proof of concept_ file
system, called srfs. ( http://www.cs.bgu.ac.il/~srfs/ ).
a quick overview: [from the home page]
srfs is a global file system designed to be distributed geographicly over
multiple locations and provide a consistent, high available and durable
infrastructure for information.
Started as a research project into file systems and self-stabilization in
Ben Gurion University of the Negev Department of Computer Science, the
project aims to integrate self-stabilization methods and algorithms into
the file (and operation) systems to provide a system with a desired
behavior in the presence of transient faults.
Based on layered self-stabilizing algorithms, provide a tree replication
structure based on auto-discovery of servers using local and global IP
multicasting. The tree structure is providing the command and timing
infrastructure required for a distributed file system.
1) a kernel module, which provides the low level functionality, and
disk management.
2) a user space caching daemon, which provide the stabilization and
replication properties of the file system.
these two components communicate via a character device.
more info on the system architecture can be find on the web page, and
here: http://www.cs.bgu.ac.il/~tzachar/srfs.pdf
We hope some will find this interesting enough to take for a test drive,
and wont mind the latencies ( currently, the caching daemon is a bit slow.
hopefully, we will improve it in the future. )
anyway, please keep in mind this is a very early version that only works,
and keeps the stabilization properties. no posix compliance whatsoever...
the code contains several hacks and design flaws that we're aware of,
and probably many that we're not... so please be gentle ;)
if someone found this interesting, please contact us with ur insights.
cheers,
the srfs team.
p.s I would like to thank all members of this mailing list (fsdevel), for
ur continual help with problems we encountered during the development.
thanks guys (and girls???).
========================================================================
nir.
This sounds fairly similar to Coda[0], which is already in development and use.

-sandalle

[0] http://www.coda.cs.cmu.edu/
--
PGP Key Fingerprint: FCFF 26A1 BE21 08F4 BB91 FAED 1D7B 7D74 A8EF DD61
http://search.keyserver.net:11371/pks/lookup?op=get&search=0xA8EFDD61

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS/E/IT$ d-- s++:+>: a-- C++(+++) BL++++VIS>$ P+(++) L+++ E-(---) W++ N+@ o?
K? w++++>-- O M-@ V-- PS+(+++) PE(-) Y++(+) PGP++(+) t+() 5++ X(+) R+(++)
tv(--)b++(+++) DI+@ D++(+++) G>+++ e>+++ h---(++) r++ y+
------END GEEK CODE BLOCK------

Eric Sandall | Source Mage GNU/Linux Developer
***@sandall.us | http://www.sourcemage.org/
http://eric.sandall.us/ | SysAdmin @ Inst. Shock Physics @ WSU
http://counter.li.org/ #196285 | http://www.shock.wsu.edu/

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
Nir Tzachar
2003-10-21 12:07:04 UTC
Post by Eric Sandall
This sounds fairly similar to Coda[0], which is already in development and use.
not at all.

coda is not self stabilizing at all.
srfs is also a totally distributed file system -> see the doc.
bye
========================================================================
nir.

Brian Beattie
2003-10-21 14:29:51 UTC
Post by Nir Tzachar
Post by Eric Sandall
This sounds fairly similar to Coda[0], which is already in development and use.
not at all.
coda is not self stabilizing at all.
srfs is also a totally distributed file system -> see the doc.
what does "self stabilizing" mean in this context?
Post by Nir Tzachar
bye
bye bye
--
Brian Beattie | Experienced kernel hacker/embedded systems
***@beattie-home.net | programmer, direct or contract, short or
www.beattie-home.net | long term, available immediately.

"Honor isn't about making the right choices.
It's about dealing with the consequences." -- Midori Koto
Jan Harkes
2003-10-21 16:59:28 UTC
Post by Nir Tzachar
not at all.
coda is not self stabilizing at all.
srfs is also a totally distributed file system -> see the doc.
In what way do you think that Coda isn't distributed?

Also Coda does have 'self-stabilizing' properties, but probably in a
different way compared to how you think about self stabilization.

When a server becomes loaded (too many clients, heavy CPU/memory usage
by other processes, network trouble) its responses slow down and
clients will automatically switch to some lighter loaded replica that
stores the same data. We work based on an estimate of the available
bandwidth on a per-client basis, and the switch is performed in a
non-deterministic fashion, i.e. we don't pick the 'fastest' machine, but
decide to switch to a random machine when we're talking to the 'slowest'
one. As a result this works very well at balancing the load across all
available replicas. This adaptation mostly affects read-oriented data
traffic.

Similarly, when a client happens to be sending modifications (writes) to
an overloaded server, it will at some point switch to writeback caching
(write-disconnected operation), in this state it keeps track of
modifications without writing them back to the server immediately.
During this time it can optimize away some operations (intermediate
files created during a compilation) and once the local data has 'aged'
enough to be considered stable, it reintegrates the modifications in
batches of multiple operations at a time. When several operations arrive
in a batch, the server only needs to commit a single transaction for up
to 100 operations at a time, which results in a far more efficient use
of the CPU and disk IO resources on the server. The trade-off is
of course a weaker consistency model.

So there is definitely a self-stabilizing mechanism present in Coda.

Jan

Pavel Machek
2003-10-23 13:58:51 UTC
Hi!
Post by Nir Tzachar
Post by Eric Sandall
This sounds fairly similar to Coda[0], which is already in development and use.
not at all.
coda is not self stabilizing at all.
srfs is also a totally distributed file system -> see the doc.
bye
Yes, but perhaps differences can be localized to userspace daemon,
having same kernel part for coda and srfs?
That would be *good*.

Pavel
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...

Nir Tzachar
2003-10-24 09:28:54 UTC
hi
Post by Pavel Machek
Yes, but perhaps differences can be localized to userspace daemon,
having same kernel part for coda and srfs?
That would be *good*.
In essence, you're correct. We would have taken that approach if we were not
aiming at building a file system on top of an object storage. This
approach simplifies things a bit, and the kernel part is reduced.
--
========================================================================
nir.

Erik Andersen
2003-10-22 04:57:09 UTC
Post by Nir Tzachar
more info on the system architecture can be find on the web page, and
here: http://www.cs.bgu.ac.il/~tzachar/srfs.pdf
Suppose I install srfs on both my laptop and my server. I then
move the CVS repository for my pet project onto the new srfs
filesystem and I take off for the weekend with my laptop. Over
the weekend I commit several changes to file X. Over the weekend
my friend also commits several changes to file X.

When I get home and plug in my laptop, presumably the caching
daemon will try to stabilize the system by deciding which version
of file X was changed last and replicating that latest version.

Whose work will the caching daemon overwrite? My work, or my
friend's work?

Of course, this need not involve anything so extreme as days of
disconnected independent operation. A rebooting router between
two previously synced srfs peers seems sufficient to trigger this
kind of data loss, unless you make the logging daemon fail all
writes when disconnected.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
Nir Tzachar
2003-10-22 10:16:02 UTC
First, I'd like to thank you guys for playing around with the idea.

Now, I want to apologize if my explanation was not clear enough:
self-stabilization (original idea by Dijkstra) - a self-stabilizing system
is a system that can automatically recover following the occurrence of
(transient) faults. The idea is to design a system which can be started
in an arbitrary state and still converge to a desired behavior.

Our file system behaves like this:
let's say you have several servers, with different file system trees on
them. If (and when ...) you connect these file systems with an srfs
framework, all servers will display the same file system tree, which is
somewhat of a union between them all.
If you wish to talk in Coda terms, you can say all servers operated
disconnectedly and then were connected at the same time. The conflict
resolution mechanism we use is by majority.

We differ from Coda in the sense that we don't have a main server which pushes
volumes to sub-servers (I'm not sure what the Coda terminology is...), and
data is served in a load-balanced way. In srfs, all the data resides on
all servers (hosts) and is replicated between them.
Replication takes place at two levels: tree view (plus metadata) and the
actual data.
tree view - the tree view on all hosts is the same. An `ls` on a dir
on any host will produce the same output.
data - data will be replicated to all hosts upon a successful write,
and upon each access to a dirty file on each host.

All replication is lazy, and happens only on access to dirs / files
(and on successful writes - when the file is being closed).

Thus, the following behavior can be achieved:
let's say you have 2N+1 hosts, all with coherent file system trees.
Now, take N of them offline, change the tree, put those N back online,
and their tree will be the same as on the other N+1 hosts.
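
As a toy illustration of that majority rule (not srfs code; the host count
and the content digests below are made up): a copy survives only if more
than half of the 2N+1 hosts agree on it.

#include <stdio.h>
#include <string.h>

#define NHOSTS 5        /* 2N+1 hosts, N = 2 (assumed) */

/* Return the index of a digest held by a strict majority, or -1 if none. */
static int majority(const char *digest[NHOSTS])
{
        int i, j, votes;

        for (i = 0; i < NHOSTS; i++) {
                votes = 0;
                for (j = 0; j < NHOSTS; j++)
                        if (strcmp(digest[i], digest[j]) == 0)
                                votes++;
                if (votes > NHOSTS / 2)
                        return i;
        }
        return -1;      /* no majority: the copy is discarded ("nuked") */
}

int main(void)
{
        /* N+1 hosts agree on v2; N hosts were offline with an older copy. */
        const char *d[NHOSTS] = { "v2", "v2", "v2", "v1", "v1" };

        printf("winner: %d\n", majority(d));    /* index 0: the "v2" copy wins */
        return 0;
}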

The main goal of the file system is self-stabilization, over long periods
of time and long distances. You can use it as a SAN, or as a data farm,
using a system like LinuxVirtualServer to balance the load between nodes.

cheers.

========================================================================
nir.




Jan Harkes
2003-10-22 14:22:13 UTC
Post by Nir Tzachar
if you wish to talk in coda terms, you can say all servers operated
disconnectedly, and then were connected at the same time. the conflict
resolving mechanism we use, is by majority.
That's annoying when >50% of your servers were unavailable for a period
of time, because all recent changes will be lost when connectivity is
restored.
Post by Nir Tzachar
We differ from coda in the sense we don't have a main server, which pushes
Volumes to sub-servers (im not sure what the coda terminology is... ), and
Where in the world did you get the idea that Coda has a main server that
pushes out modifications? That is so wrong, I don't even know where to
begin.
Post by Nir Tzachar
data is served in a load-balanced way. In Srfs, all the data resides on
all servers (hosts) and is replicated between them.
replication takes place at two levels: tree view (plus meta data) and the
actual data.
tree view - the tree view on all hosts is the same. an `ls` on a dir
on any host will produce the same output.
data - data will be replicated to all hosts upon a successful write,
and upon each access to a dirty file on each host.
Coda also uses a global namespace that's pretty normal for distributed
filesystems (AFS/DFS).

So the only differences really are that Coda uses a version-vector
based mechanism to detect and resolve version conflicts instead of
majority voting. i.e. even when only a single server is accessible for a
period of time, the committed updates will eventually propagate to
others. And we don't throw away a file just because 2 out of three
servers happen to have an old copy and vote against it.

And Coda gives an administrator the ability to use different replication
groups within his servers for different types of data based on for
instance expected access patterns. Temporary objects or files that are
rarely used could only have a single replica. Mail folders would have 2
replicas (as only one user would read it, so the replication is only
needed to protect against occasional server outage), and data shared by
many users (binaries) but rarely updated could be available from many
replicas.
Post by Nir Tzachar
all replication is lazy, and happens only on access to dirs / files
(and on successful writes - when the file is being closed.)
Did you read _any_ of the Coda papers that were written during the past
16 years?

Well, this one is pretty recent and nicely summarizes the history of
Coda, and provides an overview of what Coda actually does.

M. Satyanarayanan, 'The Evolution of Coda'
ACM Transactions on Computer Systems (TOCS)
Volume 20, Issue 2 (May 2002)
Pages: 85 - 124

http://portal.acm.org/citation.cfm?id=507052.507053&dl=GUIDE&dl=GUIDE&idx=J774&part=periodical&WantType=periodical&title=ACM%20Transactions%20on%20Computer%20Systems%20(TOCS)

Jan

Nir Tzachar
2003-10-23 07:50:01 UTC
hi there.
Post by Jan Harkes
That's annoying when >50% of your servers were unavailable for a period
of time, because all recent changes will be lost when connectivity is
restored.
Well, if you want a _full_ self-stabilizing file system, you cannot behave
any other way. When you have a self-stabilizing algorithm, you __have__ to
operate under the assumption that transient errors can and will happen.
So, a cosmic ray can hit N out of your 2N+1 hosts and corrupt the data
they hold. The chance is very slim, but you have to take these kinds of
errors into account to prove the correctness of the algorithm.
Post by Jan Harkes
Where in the world did you get the idea that Coda has a main server that
pushes out modifications? That is so wrong, I don't even know where to
begin.
You're right, what I described was more like AFS, but you got my point...
Post by Jan Harkes
Coda also uses a global namespace that's pretty normal for distributed
filesystems (AFS/DFS).
Well, I was not talking about a global namespace. Surely you must have one,
otherwise things will get tough...
What I meant is, data is replicated at two levels: first, the metadata
(file attributes) is replicated, and the actual data will only get
replicated upon access.

To summarize:
at no time did we set srfs to be "better" than CODA.

Our design was aimed to be "close" to CODA, but our emphasis was
self-stabilization, minimal dependency on a single point of failure, and
trust no one (either a central manager or stored information).
From here came all of our design.

cheers.
--
========================================================================
nir.



Jan Hudec
2003-10-23 12:33:57 UTC
Post by Nir Tzachar
hi there.
Post by Jan Harkes
That's annoying when >50% of your servers were unavailable for a period
of time, because all recent changes will be lost when connectivity is
restored.
well, if u want a _full_ self stabilizing file system, you cannot behave
any other way. When you have a self stabilizing algorithm, you __have__ to
operate under the assumption that transient errors can and will happen.
so, a cosmic ray can hit N out of your 2N+1 hosts, and corrupt the data
they hold. its very slim, but you have to take these kind of errors into
account to prove the correctness of the algorithm.
But the vector time approach solves this too and does so a lot better.

If we return to the example with notebook. Assume there is a computer
lab with 20 computers and all have replicas of some file. Assume, that
I take a laptop, connect it to the system, replicate the file and
disconnect. Then I work on it while disconnected and then reconnect
again.

With vector time, the system decides that the copy on all 20 computers
is an ancestor of my copy and replaces everything with my copy. With
majority vote, my copy loses 1:20 and is lost.

Imagine further, that my friend does the same.

Now, with vector time, the system decides that:
* All copies in the lab are old, and invalidates them.
* Our copies conflict. It does not blindly choose one; rather it asks
for assistance.
While with majority vote, both our copies lose 1:20 and are discarded.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <***@ucw.cz>
Pat LaVarre
2003-10-23 20:12:25 UTC
Post by Jan Hudec
transient errors can and will happen. so, a cosmic ray can hit N out
of your 2N+1 hosts, and corrupt the data they hold. its very slim,
but you have to take these kind of errors into
account to prove the correctness of the algorithm.
But the vector time approach solves this too and does so a lot better.
If we return to the example with notebook ...
Now, with vector time, the system decides, that
* All copies in the lab are old and invalidates them.
* Our copies conflict. It does not blindly choose one,
rather it asks for assistance.
We regard getting everyone to agree about what the time is as a solved
problem?

I'm inspired to ask because I'm posting this query from behind an
employer-owned firewall thru which I have not yet punched time service,
cvs, etc. There was a day, not so long ago, when I couldn't punch thru
ftp and streaming media ...

Pat LaVarre


Nir Tzachar
2003-10-24 09:21:38 UTC
Post by Pat LaVarre
Post by Jan Hudec
If we return to the example with notebook ...
Now, with vector time, the system decides, that
* All copies in the lab are old and invalidates them.
* Our copies conflict. It does not blindly choose one,
rather it asks for assistance.
We regard getting everyone to agree about what the time is as a solved
problem?
I'm inspired to ask because I'm posting this query from behind an
employer-owned firewall thru which I have not yet punched time service,
cvs, etc. There was a day, not so long ago, when I couldn't punch thru
ftp and streaming media ...
I think you're right, but even when you do succeed, let's take a more
Byzantine approach: what if the time service is down? Sabotaged? Damaged?
Maybe it lies through its teeth (ports)? Maybe your vector time is
corrupted? Maybe a user deliberately changed his vector time - he will
bring havoc upon your system.

So, srfs takes the approach of 'trust no one, not even myself'.
A bit paranoid, but very useful (although the cost is very high...)
--
========================================================================
nir.



Matthew Wilcox
2003-10-24 12:08:22 UTC
Post by Nir Tzachar
i think ur right, but even when u do succeed, lets take a more byzantine
approach: what if the time service is down? sabotaged? damaged? maybe it
lies through its teeth(ports) ?? maybe ur vector time is corrupted? maybe
a user deliberately changed his vector time - he will bring havoc upon
ur system .
uh, *vector time*, not real time. Think CVS branches.

And if your server allows clients to corrupt it, then it's broken. I doubt
Coda does that.
--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk
Nir Tzachar
2003-10-24 19:14:22 UTC
Post by Matthew Wilcox
uh, *vector time*, not real time. Think CVS branches.
And if your server allows clients to corrupt it, then it's broken. I doubt
Coda does that.
maybe not intentionally, but as Murphy put it:

Anything that can go wrong will go wrong.
If there is a possibility of several things going wrong, the one that will
cause the most damage will be the one to go wrong. Corollary: If there is
a worse time for something to go wrong, it will happen then.
If anything simply cannot go wrong, it will anyway.
If you perceive that there are four possible ways in which a procedure can
go wrong, and circumvent these, then a fifth way, unprepared for, will
promptly develop.
Left to themselves, things tend to go from bad to worse.
If everything seems to be going well, you have obviously overlooked
something.
Nature always sides with the hidden flaw.
Mother nature is a bitch.
It is impossible to make anything foolproof because fools are so ingenious

[ taken from http://dmawww.epfl.ch/roso.mosaic/dm/murphy.html ]

so, here u go... ;)
--
========================================================================
nir.

Jan Harkes
2003-10-24 14:38:29 UTC
Post by Nir Tzachar
i think ur right, but even when u do succeed, lets take a more byzantine
approach: what if the time service is down? sabotaged? damaged? maybe it
lies through its teeth(ports) ?? maybe ur vector time is corrupted? maybe
a user deliberately changed his vector time - he will bring havoc upon
ur system .
Ever heard of lamport clocks?

A version vector is incremented on updates. So it doesn't matter whether
I changed the file on Saturday and you changed it on Sunday, when we
both return on Monday the system _will_ detect that both of us have a
new version of the same original file and considers it a conflict. It is
just another way of detecting version differences.

If a server has been off-line for a while, the versions on its files are
lower than those of files that were updated on the on-line servers. So
we see that it simply has an older version and we can (trivially)
resolve the conflict by forcing the new versions to the restored server.
This even works if the broken server had to be rebuilt from scratch and
has no data (i.e. all 'versions-vectors' are all zeros).

But we don't need to have a majority of the servers available to perform
successful writes. It is just a different solution from yours, with
its own unique limitations (the limited length of the version vector limits
the maximal replication factor).
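
A hedged illustration of the comparison rule described above (not Coda's
actual data structures; the replication factor and names are assumptions):
one vector dominates another if it is greater or equal in every slot, and if
each side is ahead somewhere, the replicas are in conflict.

#include <stdio.h>

#define NREPL 3         /* replication factor (assumed) */

enum vv_cmp { VV_EQUAL, VV_DOMINATES, VV_DOMINATED, VV_CONFLICT };

/* Compare two version vectors slot by slot. */
static enum vv_cmp vv_compare(const unsigned a[NREPL], const unsigned b[NREPL])
{
        int a_ahead = 0, b_ahead = 0, i;

        for (i = 0; i < NREPL; i++) {
                if (a[i] > b[i])
                        a_ahead = 1;
                if (b[i] > a[i])
                        b_ahead = 1;
        }
        if (a_ahead && b_ahead)
                return VV_CONFLICT;     /* concurrent updates: needs resolution */
        if (a_ahead)
                return VV_DOMINATES;    /* a is strictly newer: push it out */
        if (b_ahead)
                return VV_DOMINATED;    /* a is older: fetch the newer copy */
        return VV_EQUAL;
}

int main(void)
{
        /* The laptop logged two updates; a rebuilt server has all zeros. */
        unsigned laptop[NREPL]  = { 2, 0, 0 };
        unsigned rebuilt[NREPL] = { 0, 0, 0 };

        printf("%d\n", vv_compare(laptop, rebuilt));    /* VV_DOMINATES */
        return 0;
}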

Jan

Nir Tzachar
2003-10-24 19:16:49 UTC
Post by Jan Harkes
Ever heard of lamport clocks?
i know them as vector clocks, but yes.
Post by Jan Harkes
A version vector is incremented on updates. So it doesn't matter whether
I changed the file on Saturday and you changed it on Sunday, when we
both return on Monday the system _will_ detect that both of us have a
new version of the same original file and considers it a conflict. It is
just another way of detecting version differences.
I know what you're talking about, but the model Lamport uses does not fit
ours. Lamport's vector clocks are not self-stabilizing, and a corrupted
vector (intentionally or unintentionally) can break your system.
[Vector clocks are also not space-bounded, although there are some
solutions to this problem.]

Let me give you an example:
let's say you connected your laptop to the Coda pool, worked on a file
locally, and then disconnected from the pool.
Now, before reconnecting your laptop to the pool, you accidentally dropped
it, and your hard disk got banged. As a result, some random bits on it
changed, and the saved vector clock, as well as the local counter and the
file's content, got corrupted.
Upon reconnection, the system will rightly decide your version is the
correct one, and will push your corrupted replica to all other servers.
You have broken system integrity.

There is a one in 10^gazillion chance of this happening, but this is
exactly what we're aiming at: don't take any chances (or prisoners).
If we don't agree on a file, nuke it. Let's stay on the safe side.
--
========================================================================
nir.


Andreas Dilger
2003-10-24 20:11:46 UTC
Post by Nir Tzachar
there is a one to 10^gazillion chance of this happening, but this is
exactly what we're aiming at: dont take any chances (or prisoners).
if we dont agree on a file, nuke it. lets stay on the safe side.
So, system stabilizes when there are no files left ;-).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

Pat LaVarre
2003-10-24 20:24:48 UTC
Post by Andreas Dilger
Post by Nir Tzachar
a one to 10^gazillion chance of this happening,
So, system stabilizes when there are no files left ;-).
Anyone have a measure of how often these events actually do occur?

What little I've seen people say of how they design file and RAID
systems speaks as if HDD's reliably choose either to read back what you
wrote to them or else report an error.

What about when the HDD actually reads back something else?

How can we know how commonly that occurs in practice, so that we can
know how often we're wrong to believe such things as our locally
recorded vector time?

Pat LaVarre


Andreas Dilger
2003-10-24 20:38:02 UTC
Post by Pat LaVarre
Post by Andreas Dilger
Post by Nir Tzachar
a one to 10^gazillion chance of this happening,
So, system stabilizes when there are no files left ;-).
Anyone have a measure of how often these events actually do occur?
What little I've seen people say of how they design file and RAID
systems speaks as if HDD's reliably chose either to read back what you
wrote to them or else reported an error.
What about when the HDD actually reads back something else?
How can we know how commonly that occurs in practice, so that we can
know how often we're wrong to believe such things as our locally
recorded vector time?
There are lots of ways to read back garbage from a disk unrelated to
physical HDD errors: memory errors, bad cables, software errors (driver,
fs, vm, etc), bad IDE DMA settings, power failures during write, etc...

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

Pat LaVarre
2003-10-24 20:52:44 UTC
Post by Andreas Dilger
There are lots of ways to read back garbage from a disk unrelated to
physical HDD errors: memory errors, bad cables, software errors (driver,
fs, vm, etc), bad IDE DMA settings, power failures during write, etc...
Yes, thank you for finding words to express that fact so much more
clearly than I did.
Post by Andreas Dilger
There are lots of ways to read back garbage from a disk unrelated to
physical HDD errors: memory errors, bad cables, software errors (driver,
fs, vm, etc), bad IDE DMA settings, power failures during write, etc...
In particular, I see HDD's vary in their opinion of which cabling and
configuration and protocol is bad. Therefore I ask:

"How can we know how commonly that occurs in practice, so that we can
know how often we're wrong to believe such things as our locally
recorded vector time?"

Is there, as yet, no linux filesystem that preserves the integrity of
the data and metadata despite such failures? If I could write such an
fs on to a single directly-attached local drive, then I could measure
how often I myself experience such failures.

I'm confident I actually do experience these failures because often I
work in comparative, raid-like measures. When I see one drive and
another disagree about what I wrote, then whenever I trust my write and
diff and read tools I must conclude one or both of the HDD's is wrong.

Pat LaVarre


Nir Tzachar
2003-10-24 21:00:09 UTC
Post by Pat LaVarre
I'm confident I actually do experience these failures because often I
work in comparative, raid-like measures. When I see one drive and
another disagree about what I wrote, then whenever I trust my write and
diff and read tools I must conclude one or both of the HDD's is wrong.
So, won't a f/s that can guarantee (and prove) its stability be nice?
That's what we're aiming at ;)

Since these kinds of errors are transient (meaning, in an infinite
execution time only a finite number of errors occur), srfs should be
capable of dealing with them.

--
========================================================================
nir.

Pat LaVarre
2003-10-24 21:22:18 UTC
Post by Nir Tzachar
Post by Pat LaVarre
I'm confident I actually do experience these failures because often I
work in comparative, raid-like measures. When I see one drive and
another disagree about what I wrote, then whenever I trust my write and
diff and read tools I must conclude one or both of the HDD's is wrong.
so, wont a f/s that can guarantee (and prove) its stability be nice?
Yes.
Post by Nir Tzachar
thats what we're aiming at ;)
since these kind of errors are transient (meaning, in an infinite
execution time only a finite number of errors occur), srfs should be
capable to deal with'em.
Good.

Help the mass market more accurately measure how often the millions of
commodity HDD's actually do fail to read back what was written, and
you'll get noticed, I think.

We can't know til after we run this experiment?

We might actually discover that in fact quantifying the real experience
of HDD failure does give us numbers roughly equal to the more easily
repeated, carefully controlled, therefore useless to me, laboratory
results that some folk prefer to publish.

Pat LaVarre


Nir Tzachar
2003-10-24 23:03:53 UTC
Post by Pat LaVarre
Help the mass market more accurately measure how often the millions of
commodity HDD's actually do fail to read back what was written,
the numbers are probably _very_ low.
Post by Pat LaVarre
and you'll get noticed, I think.
I think I know where you're going, but I disagree.
Say you have a system, and you wish it to be operational with zero maintenance.
How else can this be achieved?
--
========================================================================
nir.

Bryan Henderson
2003-10-25 00:23:13 UTC
Post by Nir Tzachar
these kind of errors are transient (meaning, in an infinite
execution time only a finite number of errors occur)
You lost me here. First, why is the number of errors finite when the
execution time is infinite? Second, how does that mean the errors are
transient? I'd think the relationship is exactly the opposite: If the
errors are permanent, then there's a limit to how many can occur
regardless of time.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
Nir Tzachar
2003-10-25 10:37:43 UTC
Post by Bryan Henderson
You lost me here. First, why is the number of errors finite when the
execution time is infinite? Second, how does that mean the errors are
transient? I'd think the relationship is exactly the opposite: If the
errors are permanent, then there's a limit to how many can occur
regardless of time.
First, transient means short-lived, not permanent (from the dictionary).
Now, you can describe a model of a file system as an infinite execution of
file operations. [We use infinite sequences to prove correctness.]
I'm not talking about execution time, but an infinite number of separate
operations.

When striving to achieve self-stabilization, you need to prove that as
long as at some point you will get no more errors (hence, transient
errors) the system will stabilize and keep on working correctly.
--
========================================================================
nir.

Andreas Dilger
2003-10-24 21:15:41 UTC
Post by Pat LaVarre
"How can we know how commonly that occurs in practice, so that we can
know how often we're wrong to believe such things as our locally
recorded vector time?"
Is there, as yet, no linux filesystem that preserves the integrity of
the data and metadata despite such failures? If I could write such an
fs on to a single directly-attached local drive, then I could measure
how often I myself experience such failures.
I'm confident I actually do experience these failures because often I
work in comparative, raid-like measures. When I see one drive and
another disagree about what I wrote, then whenever I trust my write and
diff and read tools I must conclude one or both of the HDD's is wrong.
I recall there being a loopback driver that will write a checksum for each
block written to the device into a separate block device (probably just
another loop device on a separate filesystem) so you could use that to
verify your data on each read.
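
This is not that loop driver, just a userspace sketch of the same idea:
record a checksum per fixed-size block at write time, then recompute and
compare at read time to catch silently corrupted data. The toy checksum
and the file names are assumptions.

#include <stdio.h>
#include <stdint.h>

#define BLKSZ 4096

/* Toy per-block checksum; a real implementation would use a CRC or hash. */
static uint32_t csum(const unsigned char *buf, size_t len)
{
        uint32_t s = 0;
        size_t i;

        for (i = 0; i < len; i++)
                s = s * 31 + buf[i];
        return s;
}

int main(void)
{
        unsigned char blk[BLKSZ];
        uint32_t stored;
        size_t n;
        long blkno = 0;
        FILE *data = fopen("/tmp/data.img", "rb");      /* assumed file names */
        FILE *sums = fopen("/tmp/data.sums", "rb");

        if (!data || !sums) {
                perror("open");
                return 1;
        }
        while ((n = fread(blk, 1, BLKSZ, data)) > 0) {
                if (fread(&stored, sizeof(stored), 1, sums) != 1)
                        break;                          /* ran out of recorded sums */
                if (csum(blk, n) != stored)
                        fprintf(stderr, "block %ld: checksum mismatch\n", blkno);
                blkno++;
        }
        fclose(data);
        fclose(sums);
        return 0;
}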

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

Nir Tzachar
2003-10-24 20:53:04 UTC
Post by Andreas Dilger
Post by Nir Tzachar
exactly what we're aiming at: dont take any chances (or prisoners).
if we dont agree on a file, nuke it. lets stay on the safe side.
So, system stabilizes when there are no files left ;-).
You know, as a sysadmin I always told my boss that getting rid of all of
our users would solve all of our problems <@-;)
--
========================================================================
nir.

Jan Hudec
2003-10-25 08:01:25 UTC
Post by Nir Tzachar
Post by Jan Harkes
Ever heard of lamport clocks?
i know them as vector clocks, but yes.
So I'd expect you would get the point when I used the term "vector time"
;-)

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <***@ucw.cz>
Nir Tzachar
2003-10-22 10:21:37 UTC
Post by Erik Andersen
Who's work will the caching daemon overwrite? My work, or my
friends work?
Well, in our system, unless you break the symmetry, the daemon will
pick a random file. Since no majority can be found, this is the default.
But let's say your friend was connected to a third server, and his work
was saved there also. When you connect your laptop, all of your work will
be lost, and what you'll see is only his work ;)


========================================================================
nir.
Charles Manning
2003-10-25 09:27:55 UTC
Hi

I'm the maintainer for YAFFS, the NAND-flash file system.

I've had readpage implemented for a long while to support read memory mapping
(eg to execute a program).

I've also had prepare_write and commit_write implemented for a while,
thinking this was sufficient to support write mmapping. Someone found that
this is not the case.

I need to therefore implement writepage and have a few questions:

1) Is there a generic writepage lurking somewhere that will use
prepare/commit_write instead?
2) What fiddling is required with kmap & page flags within writepage or is
this all handled by the caller?

Any help appreciated.

Thanx

-- Charles
David Woodhouse
2003-10-25 16:18:19 UTC
No you don't. This is flash -- people don't really need shared writable
mmap; if they think they do, they need educating not pandering to.
Post by Charles Manning
1) Is there a generic writepage lurking somewhere that will use
prepare/commit_write instead?
Don't think so. Offhand I don't see why it couldn't be done, but it's
not what most file systems would want.
Post by Charles Manning
2) What fiddling is required with kmap & page flags within writepage or is
this all handled by the caller?
You'll need to kmap the page since you actually want to touch the data
with the CPU. You probably also need to mark the page uptodate when
you're done. Take a look at the generic block writepage.
Post by Charles Manning
Any help appreciated.
Ensure you have no memory allocations in the writepage() code path,
unless they're done with (IIRC) GFP_NOIO.
--
dwmw2

Charles Manning
2003-10-25 22:40:17 UTC
Post by David Woodhouse
No you don't. This is flash -- people don't really need shared writable
mmap; if they think they do, they need educating not pandering to.
It's a matter of what software they're using. Debian apt-get is the current
woesome application. Some people have vast YAFFS fs (512Mbytes or more) and
are starting to think of it as a real disk equivalent rather than just a
little pokey place to store a few configs. This means they want to run
regular software.
Post by David Woodhouse
Post by Charles Manning
1) Is there a generic writepage lurking somewhere that will use
prepare/commit_write instead?
Don't think so. Offhand I don't see why it couldn't be done, but it's
not what most file systems would want.
It seems weird to me that generic_file_write uses the address_ops
prepare/commit_write via the page cache, yet mmap does not have a generic
function to do the same, even though this might not be the best approach for
efficiency.
Post by David Woodhouse
Post by Charles Manning
2) What fiddling is required with kmap & page flags within writepage or
is this all handled by the caller?
You'll need to kmap the page since you actually want to touch the data
with the CPU. You probably also need to mark the page uptodate when
you're done. Take a look at the generic block writepage.
Post by Charles Manning
Any help appreciated.
Ensure you have no memory allocations in the writepage() code path,
unless they're done with (IIRC) GFP_NOIO.
Thanx

-- Charles
David Woodhouse
2003-10-26 10:25:49 UTC
Post by Charles Manning
It's a matter of what software they're using. Debian apt-get is the current
woesome application. Some people have vast YAFFS fs (512Mbytes or more) and
are starting to think of it as a real disk equivalent rather than just a
little pokey place to store a few configs. This means they want to run
regular software.
While obviously YAFFS exists because you don't always make the same
choices as me, I'd encourage you to resist calls to implement shared
writable mmap on flash, or at least to make it an optional feature which
is omitted by default. Otherwise, people might actually use it :)

The life time of flash is limited, and users should endeavour to reduce
the number of writes even if the media become big enough that the rest
of the pain of using flash is alleviated.

Shared writable mmap can cause a rewrite of a whole page every time a
single byte is changed; explicit writes are almost always going to be a
more efficient use of the flash.

Your file system is not broken; your application author is. Mend that
instead. :)
--
dwmw2


Matthew Wilcox
2003-10-26 15:28:19 UTC
Post by David Woodhouse
Your file system is not broken; your application author is. Mend that
instead. :)
I don't see why every application should be rewritten for the needs of
the current generation of flash. If this is such a problem for flash,
then maybe the filesystem should implement its own caching strategy
for these pages. (Again, expecting the page cache to understand about
flash's special requirements is unreasonable.)

Your argument makes sense for the embedded market, but not for mainstream.
--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk
Mark B
2003-10-26 18:47:16 UTC
Post by Matthew Wilcox
Post by David Woodhouse
Your file system is not broken; your application author is. Mend that
instead. :)
I don't see why every application should be rewritten for the needs of
the current generation of flash. If this is such a problem for flash,
then maybe the filesystem should implement its own caching strategy
for these pages. (Again, expecting the page cache to understand about
flash's special requirements is unreasonable.)
Your argument makes sense for the embedded market, but not for mainstream.
I agree,
and someone may even care to use the expensive flashes to keep the data
secure, because in some cases the data is more valuable than the drive, but
there is not so much data as to need a RAID or something; it's like using a
cannon to kill a mosquito.
I'm telling this from experience, since I'm currently developing a filesystem
for such a purpose, minimising writes and aligning them to the pages of
the media + supporting small transactions across files (via ioctl hints by
the app) + the usual journalling.
--
Mark Burazin
***@lemna.hr
---<>---<>---<>---<>---<>---<>---<>---<>---<>
Lemna d.o.o.
http://www.lemna.biz - ***@lemna.hr
<>---<>---<>---<>---<>---<>---<>---<>---<>---


Charles Manning
2003-10-26 20:40:15 UTC
Permalink
Post by Mark B
Post by Matthew Wilcox
Post by David Woodhouse
Your file system is not broken; your application author is. Mend that
instead. :)
I don't see why every application should be rewritten for the needs of
the current generation of flash. If this is such a problem for flash,
then maybe the filesystem should implement its own caching strategy
for these pages. (Again, expecting the page cache to understand about
flash's special requirements is unreasonable.)
Your argument makes sense for the embedded market, but not for mainstream.
I agree,
and some people may even care to use expensive flash to keep their data
secure, because in some cases the data is more valuable than the drive, but
there isn't enough of it to justify a RAID or anything like that; it would
be like using a cannon to kill a mosquito.
I'm saying this from experience, since I'm currently developing a filesystem
for exactly that purpose: it minimises writes and aligns them to the pages
of the media, supports small transactions across files (via ioctl hints from
the app), and does the usual journalling.
OK, fellas, we've all had a fine time disagreeing with David Woodhouse's take
on this; now can someone tell me how to do it?

Thanx

-- Charles

David Woodhouse
2003-10-26 21:04:48 UTC
Permalink
Post by Charles Manning
Ok fellas, we've all had a fine time disagreeing with David Woodhouse's take
on this, now can someome tell me how to do it?
Sorry, I thought we'd already done that. Make your writepage write out
the page, without allocating memory (at least not with GFP_KERNEL), then
mark the offending page uptodate and unlock it.
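
Something with roughly this shape, say; a sketch only, written against
2.6-era signatures, with a hypothetical myfs_write_chunk() standing in
for whatever actually pushes data to the flash (a real version would
also clamp the last page to i_size and report write errors properly):

static int myfs_writepage(struct page *page, struct writeback_control *wbc)
{
        struct inode *inode = page->mapping->host;
        void *addr;
        int err;

        /* No GFP_KERNEL allocations from here on. */
        addr = kmap(page);
        err = myfs_write_chunk(inode, page->index, addr, PAGE_CACHE_SIZE);
        kunmap(page);

        if (!err)
                SetPageUptodate(page);
        unlock_page(page);
        return err;
}
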

You said in private mail you were looking at smbfs. That looks like it's
perfectly sufficient to let you implement your own.
--
dwmw2


Charles Manning
2003-10-26 20:54:06 UTC
Permalink
Post by David Woodhouse
Post by Charles Manning
It's a matter of what software they're using. Debian apt-get is the
current woesome application. Some people have vast YAFFS fs (512Mbytes or
more) and are starting to think of it as a real disk equivalent rather
than just a little pokey place to store a few configs. This means they
want to run regular software.
While obviously YAFFS exists because you don't always make the same
choices as me, I'd encourage you to resist calls to implement shared
writable mmap on flash, or at least to make it an optional feature which
is omitted by default. Otherwise, people might actually use it :)
The lifetime of flash is limited, and users should endeavour to reduce
the number of writes even if the media becomes big enough that the rest
of the pain of using flash is alleviated.
Shared writable mmap can cause a rewrite of a whole page every time a
single byte is changed; explicit writes are almost always going to be a
more efficient use of the flash.
Your file system is not broken; your application author is. Mend that
instead. :)
I disagree, David, to an extent. Sure, a change of a single byte can cause
the rewrite of a whole page, but you can also cause the same thing with a
poorly written application using write(); mmap is not inherently at fault. I
do, however, agree that mmap is, in general, not the best way to construct
apps for flash friendliness.

I also think that telling people to go rewrite their apps for flash is not
really GoodForm(tm). Doing limited mmap with a few apps here and there is not
going to hurt YAFFS. Why embed Linux if you're going to have to scratch
through all the utils to pull out the mmaps?

Yes, the lifetime of flash is limited, but I've done accelerated lifetime
tests (30GB or so of writes) and others have done far more than this (20 or
30 times as much). NAND flash, with YAFFS, is unlikely to wear out in most
embedded usage scenarios, with an order of magnitude or so to spare. Sure,
you could craft an atypical "killer app".

I think it appropriate that YAFFS supports mmap. Now what I want to know is
how to implement it.

BTW: In case others on this list think I'm slagging off at David, you're much
mistaken. I hold him and his work in high regard. It would be a boring world
if we all agreed.

-- Charles


Nikita Danilov
2003-10-27 08:34:47 UTC
Permalink
Post by David Woodhouse
No you don't. This is flash -- people don't really need shared writable
mmap; if they think they do, they need educating not pandering to.
Note that ->writepage() is used not only by mmap() (actually, it is only
used by mmap() if the file system doesn't provide its own
->writepages()). ->writepage() is used by the VM to write pages in response
to memory pressure (see mm/vmscan.c:shrink_list()). Every
well-behaving file system has to provide ->writepage() for this purpose.
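
For reference, this is the wiring in question: ->writepage() is one slot
in the address_space operations, and the VM calls it from the reclaim
path as well as for mmap writeback. The myfs_* names below are
hypothetical:

static struct address_space_operations myfs_aops = {
        .readpage       = myfs_readpage,
        .writepage      = myfs_writepage,  /* also called under memory pressure */
        .prepare_write  = myfs_prepare_write,
        .commit_write   = myfs_commit_write,
};
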
[...]
Post by David Woodhouse
--
dwmw2
Nikita.
David Woodhouse
2003-10-27 08:39:27 UTC
Permalink
Post by Nikita Danilov
Note that ->writepage() is used not only by mmap() (actually, it is only
used by mmap() if the file system doesn't provide its own
->writepages()). ->writepage() is used by the VM to write pages in response
to memory pressure (see mm/vmscan.c:shrink_list()). Every
well-behaving file system has to provide ->writepage() for this purpose.
How do you get dirty but still file-backed pages if you don't get them
by shared writable mmap?
--
dwmw2

Nikita Danilov
2003-10-27 08:43:29 UTC
Permalink
Post by David Woodhouse
Post by Nikita Danilov
Note that ->writepage() is used not only by mmap() (actually, it is only
used by mmap() if the file system doesn't provide its own
->writepages()). ->writepage() is used by the VM to write pages in response
to memory pressure (see mm/vmscan.c:shrink_list()). Every
well-behaving file system has to provide ->writepage() for this purpose.
How do you get dirty but still file-backed pages if you don't get them
by shared writable mmap?
By write(2)? Maybe I am missing something in this discussion, though.
Post by David Woodhouse
--
dwmw2
Nikita.
David Woodhouse
2003-10-27 08:46:44 UTC
Permalink
Post by Nikita Danilov
By write(2)? Maybe I am missing something in this discussion, though.
Either your commit_write() is synchronous or it does what writepage()
would do anyway... which is to start the I/O but not wait for it.
--
dwmw2

Nikita Danilov
2003-10-27 08:52:05 UTC
Permalink
Post by David Woodhouse
Post by Nikita Danilov
By write(2)? Maybe I am missing something in this discussion, though.
Either your commit_write() is synchronous or it does what writepage()
would do anyway... which is to start the I/O but not wait for it.
I don't quite follow why. generic_commit_write() only marks buffers
dirty. Actual IO is started by either ->writepages() called from within
balance_dirty_pages() (or pdflush), or by ->writepage() called by the VM
scanner.
Post by David Woodhouse
--
dwmw2
Nikita.
David Woodhouse
2003-10-27 09:06:58 UTC
Permalink
Post by Nikita Danilov
I don't quite follow why. generic_commit_write() only marks buffers
dirty. Actual IO is started by either ->writepages() called from within
balance_dirty_pages() (or pdflush), or by ->writepage() called by the VM
scanner.
Charles isn't using generic_commit_write(); this is not a traditional
block-device-backed file system.

I strongly suspect his commit_write() is in fact synchronous.
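
For what it's worth, a synchronous commit_write() for a non-block-backed
filesystem might look roughly like this; a sketch only, using 2.6-era
signatures, with a hypothetical myfs_write_chunk() and error handling
trimmed:

static int myfs_commit_write(struct file *file, struct page *page,
                             unsigned offset, unsigned to)
{
        struct inode *inode = page->mapping->host;
        loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
        void *addr;
        int err;

        /*
         * The data reaches the medium before we return, so no dirty
         * file-backed page is ever left behind for the VM to write.
         */
        addr = kmap(page);
        err = myfs_write_chunk(inode, page->index, addr + offset, to - offset);
        kunmap(page);

        if (!err && pos > inode->i_size)
                i_size_write(inode, pos);
        return err;
}
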
--
dwmw2

David Woodhouse
2003-10-27 09:08:12 UTC
Permalink
Sorry, I should be clearer...
Post by David Woodhouse
Post by Nikita Danilov
I don't quite follow why. generic_commit_write() only marks buffers
dirty. Actual IO is started by either ->writepages() called from within
balance_dirty_pages() (or pdflush), or by ->writepage() called by the VM
scanner.
Ah yes, you are probably right in the case of block device file systems;
I missed that. But...
Post by David Woodhouse
Charles isn't using generic_commit_write(); this is not a traditional
block-device-backed file system.
I strongly suspect his commit_write() is in fact synchronous.
--
dwmw2
