Kernel page cache and FUSE

by Archie Cobbs

Hi all,

I really like FUSE and have written my first FUSE filesystem, called
s3backer <http://code.google.com/p/s3backer/>. All this filesystem does is
contain a single normal file which is backed by a remote network data store
(Amazon S3). The file is divided up into blocks (typically the same size
as the kernel page size) and then you do a loopback mount of a normal
filesystem on top of this file.
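
As a rough illustration of the block layout described above, the following sketch splits a byte range of the backing file into per-block pieces. It is not taken from s3backer: the function name `blocks_for_range` and the default block size of 4096 bytes (a common page size) are assumptions made for the example.

```python
def blocks_for_range(offset, length, block_size=4096):
    """Split a byte range into (block_index, offset_in_block, byte_count) pieces.

    block_size=4096 is an assumed value, not something the filesystem mandates.
    """
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset and length must be >= 0, block_size > 0")
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // block_size          # which block this byte falls in
        start = offset % block_size           # position inside that block
        count = min(block_size - start, end - offset)
        pieces.append((index, start, count))
        offset += count
    return pieces


if __name__ == "__main__":
    # A 200-byte write starting at offset 4000 straddles blocks 0 and 1.
    print(blocks_for_range(4000, 200))  # [(0, 4000, 96), (1, 0, 104)]
```

Each piece would correspond to one block read or written over the network, so an unaligned request from the upper filesystem can touch more than one remote block.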

As the "upper" filesystem reads and writes blocks, the "lower" s3backer
filesystem reads and writes over the network. I'm sure you've seen a similar
arrangement before in other FUSE filesystems. The result is that you treat
the single file in the FUSE filesystem more like a hard disk type block
device, where the "hard disk" storage is remotely located over the network.

My questions all relate to kernel caching of the file data in this scenario.
I'm mostly ignorant about how exactly Linux kernel caching works. And this
is a complicated scenario because it involves two filesystems (the "upper"
one and the "lower" one) and a loopback mount...

How do the kernel page cache, and the data blocks read from/written to the
FUSE filesystem (on behalf of the "upper" filesystem) interact?

How does the kernel handle caching of the underlying file data blocks when
doing a loopback mount? Does the fact that the underlying file is within a
FUSE filesystem matter at all?

If I create a bunch of swap space, will the kernel take advantage of it and
therefore do more caching of the FUSE file's data blocks?

Or will the kernel refuse to cache file data blocks in swap because it
treats the FUSE file like a hard disk because another filesystem is
loopback-mounted on top of it?

Please see this discussion on the wiki
<http://code.google.com/p/s3backer/wiki/ManPage> for more info on
some of the questions raised.

Thanks,
-Archie

P.S. Apologies if this topic has already been addressed; the SourceForge
mailing list search seems to be broken.

--
Archie L. Cobbs
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
fuse-devel mailing list
fuse-devel@...
https://lists.sourceforge.net/lists/listinfo/fuse-devel

Re: Kernel page cache and FUSE

by Miklos Szeredi

(Sorry about the late answer, it must have slipped my attention).

On Fri, 18 Jul 2008, Archie Cobbs wrote:
> I really like FUSE and have written my first FUSE filesystem, called
> s3backer <http://code.google.com/p/s3backer/>. All this filesystem does is
> contain a single normal file which is backed by a remote network data store
> (Amazon S3). The file is divided up into blocks (typically the same size
> as the kernel page size) and then you do a loopback mount of a normal
> filesystem on top of this file.

Note that loop over fuse is not "supported", in the sense that I
haven't really thought about all the nasty corner cases that happen
when the machine is out of memory and is trying to free some up by
writing out dirty data through the loopback device and then through
fuse.

This is generally a difficult problem, and even your suggested "open
file with O_DIRECT" approach wouldn't solve it, as the cached data
belongs to the filesystem, not to the loop device.

> As the "upper" filesystem reads and writes blocks, the "lower" s3backer
> filesystem reads and writes over the network. I'm sure you've seen a similar
> arrangement before in other FUSE filesystems. The result is that you treat
> the single file in the FUSE filesystem more like a hard disk type block
> device, where the "hard disk" storage is remotely located over the network.

Why not NBD?  That has been designed especially for this.

> My questions all relate to kernel caching of the file data in this scenario.
> I'm mostly ignorant about how exactly Linux kernel caching works. And this
> is a complicated scenario because it involves two filesystems (the "upper"
> one and the "lower" one) and a loopback mount...
>
> How do the kernel page cache, and the data blocks read from/written to the
> FUSE filesystem (on behalf of the "upper" filesystem) interact?
>
> How does the kernel handle caching of the underlying file data blocks when
> doing a loopback mount? Does the fact that the underlying file is within a
> FUSE filesystem matter at all?

Yes, it matters: when writing out file-backed dirty data to free up
memory, the kernel has complicated mechanisms to prevent deadlocks
when more memory is needed to complete the write.

When fuse is involved, the kernel doesn't have any idea that the
allocation by the filesystem is special and needs to complete in order
to complete the original write request.

Miklos


Re: Kernel page cache and FUSE

by Archie Cobbs

On Mon, Aug 25, 2008 at 5:37 AM, Miklos Szeredi <miklos@...> wrote:

> > (Amazon S3). The file is divided up into blocks (typically the same
> > size as the kernel page size) and then you do a loopback mount of a
> > normal filesystem on top of this file.
>
> Note that loop over fuse is not "supported", in the sense that I
> haven't really thought about all the nasty corner cases that happen
> when the machine is out of memory and is trying to free some up by
> writing out dirty data through the loopback device and then through
> fuse.
>

Yes, this is an interesting/murky area.

I guess it depends on how robust the algorithm for writing back dirty blocks
is. For example, in this scenario the system will need to write back certain
dirty pages (upper filesystem files) before it can write back other dirty
pages (lower filesystem files). So as long as the algorithm keeps trying and
cycling through, it should eventually perform correctly... where "correctly"
means that if there is any possible way to free memory the system will
eventually figure it out.
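
The "keep trying and cycling through" idea can be sketched as a toy model. This is not how the kernel's writeback code works; it only illustrates the reasoning. The function name `write_back` and the page names are hypothetical, and a "dependency" here simply means a page that must be written out first.

```python
def write_back(dirty):
    """Toy model: repeatedly write out every page whose prerequisites are clean.

    dirty maps a page name to the set of page names that must be written
    first. Returns (write_order, stuck_pages). Purely illustrative.
    """
    remaining = {page: set(deps) for page, deps in dirty.items()}
    order = []
    while remaining:
        # A page is ready when none of its prerequisites are still dirty.
        ready = sorted(p for p, deps in remaining.items()
                       if deps.isdisjoint(remaining))
        if not ready:
            break  # no progress is possible: the remaining pages are stuck
        for page in ready:
            order.append(page)
            del remaining[page]
    return order, set(remaining)


if __name__ == "__main__":
    # One-way dependency: cycling through eventually cleans everything.
    print(write_back({"upper-1": set(), "lower-1": {"upper-1"}}))
    # Circular dependency: no amount of retrying helps.
    print(write_back({"a": {"b"}, "b": {"a"}}))
```

In this model, retrying is enough as long as the dependencies only run one way; a circular dependency leaves pages stuck no matter how often the loop runs.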


> This is generally a difficult problem, and even your suggested "open
> file with O_DIRECT" approach wouldn't solve it, as the cached data
> belongs to the filesystem, not to the loop device.
>

You're referring to the upper filesystem, correct? From my tests it appears
that 'direct_io' does indeed prevent any files from a FUSE mount from being
cached.

For the upper filesystem, mounting with 'sync' should help the "upper"
caching problem I'd imagine...

Earlier I created a patch to mount(8) and losetup(8) to add a "direct" flag
(here <http://article.gmane.org/gmane.linux.utilities.util-linux-ng/1731>)...
but then realized that FUSE doesn't support opening files with O_DIRECT...
and in any case, the 'direct_io' option does the same thing.


> > As the "upper" filesystem reads and writes blocks, the "lower" s3backer
> > filesystem reads and writes over the network. I'm sure you've seen a
> > similar arrangement before in other FUSE filesystems. The result is that
> > you treat the single file in the FUSE filesystem more like a hard disk
> > type block device, where the "hard disk" storage is remotely located
> > over the network.
>
> Why not NBD?  That has been designed especially for this.
>

s3backer is designed to work specifically with Amazon S3, which uses HTTP
for access and has weaker guarantees on data and timing than a "normal"
block device.


> > How does the kernel handle caching of the underlying file data blocks
> > when doing a loopback mount? Does the fact that the underlying file is
> > within a FUSE filesystem matter at all?
>
> Yes, it matters: when writing out file-backed dirty data to free up
> memory, the kernel has complicated mechanisms to prevent deadlocks
> when more memory is needed to complete the write.
>
> When fuse is involved, the kernel doesn't have any idea that the
> allocation by the filesystem is special and needs to complete in order
> to complete the original write request.
>

If the kernel asks a filesystem to write a dirty page, and the filesystem
comes back with ENOMEM, will the kernel try again later?

-Archie

--
Archie L. Cobbs

Re: Kernel page cache and FUSE

by Miklos Szeredi

On Mon, 25 Aug 2008, Archie Cobbs wrote:
> > This is generally a difficult problem, and even your suggested "open
> > file with O_DIRECT" approach wouldn't solve it, as the cached data
> > belongs to the filesystem, not to the loop device.
> >
>
> You're referring to the upper filesystem, correct? From my tests it appears
> that 'direct_io' does indeed prevent any files from a FUSE mount from being
> cached.

Right, but caching isn't really problematic, only caching "dirty" data
is.  And fuse doesn't do that normally (now that it supports writable
mmaps, having dirty pages can happen, but obviously mmap is not
practical without caching).

> For the upper filesystem, mounting with 'sync' should help the "upper"
> caching problem I'd imagine...

Oh, OK.  Yes, that should help, though I don't know how the 'sync'
option is actually implemented.  The page cache is still probably
involved in that case.

> > When fuse is involved, the kernel doesn't have any idea that the
> > allocation by the filesystem is special and needs to complete in
> > order to complete the original write request.
> >
>
> If the kernel asks a filesystem to write a dirty page, and the filesystem
> comes back with ENOMEM, will the kernel try again later?

No, but even that's not the biggest problem, because ENOMEM will
happen only when the machine is _really_ out of memory, and then
nothing better can be done anyway.

What is really bad is if the allocation deadlocks completely, because
it's waiting for memory to be freed up, and that memory happens to
be the memory that is currently being written out.

This used to be a big problem, but the dirty memory limiting was made
more robust so it cannot use up all the memory in the system.  But I
think there are still corner cases where an allocation can hang on
writeback in the loop over fuse scenario.

Miklos
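
For anyone experimenting with a loop-over-fuse setup, one rough way to observe how much dirty data is outstanding is to read the `Dirty` and `Writeback` fields of `/proc/meminfo` (values are reported in kB). The sketch below only parses that file; the helper names `parse_meminfo` and `read_dirty_state` are made up for the example, and it says nothing about whether a given workload can hang.

```python
def parse_meminfo(text, fields=("Dirty", "Writeback")):
    """Extract selected fields (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in fields:
            parts = rest.split()
            if parts:
                values[name] = int(parts[0])  # first token is the number
    return values


def read_dirty_state(path="/proc/meminfo"):
    """Read the live values on a Linux system (hypothetical helper)."""
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__":
    sample = "MemTotal:  1000 kB\nDirty:   128 kB\nWriteback:   0 kB\n"
    print(parse_meminfo(sample))
```

Polling `read_dirty_state()` while writing to the upper filesystem gives a coarse view of how much data is waiting to be pushed through the loop device and then through FUSE.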
