Welcome! Log In Create A New Profile

Advanced

[git pull] vfs and fs fixes

Posted by Al Viro 
Al Viro
[git pull] vfs and fs fixes
April 17, 2012 07:30AM
A bunch of endianness fixes plus a patch from bfields untangling
dependencies between vfs and nfsd trees; in principle, we could keep it
in nfsd tree (along with a bunch of followups that definitely belong there),
but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
and unionfs series - even closer, so that would create serious PITA for
both, whichever tree it would sit in.

Speaking of endianness stuff, I'm rather tempted to slap
ccflags-y += -D__CHECK_ENDIAN__ in fs/Makefile, if not making it
default for the entire tree; nfsd regressions I've caught make one
hell of a pile and we'd obviously benefit from having that kind of
stuff caught earlier...

Please, pull from the usual place -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (12):
nfsd: fix b0rken error value for setattr on read-only mount
nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails
nfsd: fix endianness breakage in TEST_STATEID handling
nfsd: fix error value on allocation failure in nfsd4_decode_test_stateid()
nfsd: fix compose_entry_fh() failure exits
ext4: fix endianness breakage in ext4_split_extent_at()
btrfs: btrfs_root_readonly() broken on big-endian
ocfs2: ->l_next_free_req breakage on big-endian
ocfs: ->rl_used breakage on big-endian
ocfs2: ->rl_count endianness breakage
ocfs2: ->e_leaf_clusters endianness breakage
lockd: fix the endianness bug

J. Bruce Fields (1):
vfs: take i_mutex on renamed file

Diffstat:
Documentation/filesystems/directory-locking | 11 ++++++-----
fs/btrfs/ctree.h | 2 +-
fs/ext4/extents.c | 2 +-
fs/lockd/clnt4xdr.c | 2 +-
fs/lockd/clntxdr.c | 2 +-
fs/namei.c | 3 +++
fs/nfsd/nfs3xdr.c | 22 ++++++++--------------
fs/nfsd/nfs4proc.c | 7 ++++---
fs/nfsd/nfs4state.c | 23 +++++++++--------------
fs/nfsd/nfs4xdr.c | 4 ++--
fs/ocfs2/alloc.c | 2 +-
fs/ocfs2/refcounttree.c | 12 ++++++------
fs/ocfs2/suballoc.c | 4 ++--
include/linux/fs.h | 9 ++++++---
14 files changed, 51 insertions(+), 54 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
Re: [git pull] vfs and fs fixes
April 17, 2012 05:10PM
On Mon, Apr 16, 2012 at 10:25 PM, Al Viro <[email protected]> wrote:
>        A bunch of endianness fixes plus a patch from bfields untangling
> dependencies between vfs and nfsd trees; in principle, we could keep it
> in nfsd tree (along with a bunch of followups that definitely belong there),
> but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
> and unionfs series - even closer, so that would create serious PITA for
> both, whichever tree it would sit in.

Why is that double mutex taking in vfs_rename_other() safe from ABBA?

We aren't guaranteed to hold the s_vfs_rename_mutex, since the parent
directories may be the same.

And yes, we hold the i_mutex on that shared parent, but the inodes may
exist (hardlinked) in another directory, so another rename could be
doing the i_mutex in the reverse order.

Maybe there is some reason why that double lock is safe, but I don't
see it, and I want it clearly documented. So I'm not pulling this.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 17, 2012 06:30PM
On Tue, Apr 17, 2012 at 08:01:48AM -0700, Linus Torvalds wrote:
> On Mon, Apr 16, 2012 at 10:25 PM, Al Viro <[email protected]> wrote:
> >        A bunch of endianness fixes plus a patch from bfields untangling
> > dependencies between vfs and nfsd trees; in principle, we could keep it
> > in nfsd tree (along with a bunch of followups that definitely belong there),
> > but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
> > and unionfs series - even closer, so that would create serious PITA for
> > both, whichever tree it would sit in.
>
> Why is that double mutex taking in vfs_rename_other() safe from ABBA?
>
> We aren't guaranteed to hold the s_vfs_rename_mutex, since the parent
> directories may be the same.
>
> And yes, we hold the i_mutex on that shared parent, but the inodes may
> exist (hardlinked) in another directory, so another rename could be
> doing the i_mutex in the reverse order.
>
> Maybe there is some reason why that double lock is safe, but I don't
> see it, and I want it clearly documented. So I'm not pulling this.

Ugh, no, I think you're right:

rename A/a->A/b
rename B/b->B/b

where A/a and B/a are the same file, and A/b and B/b are the same file,
can result in the first rename holding the lock on A and a and waiting
on b, and the second holding the lock on B and b and waiting on a.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
Re: [git pull] vfs and fs fixes
April 17, 2012 06:40PM
On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <[email protected]> wrote:
>
> Ugh, no, I think you're right:
>
>        rename A/a->A/b
>        rename B/b->B/b
>
> where A/a and B/a are the same file, and A/b and B/b are the same file,
> can result in the first rename holding the lock on A and a and waiting
> on b, and the second holding the lock on B and b and waiting on a.

In fact I don't think you need even that much. Just a simple

touch a
ln a b
mv a b

looks like it should deadlock on itself, no? source and dest inodes
will be the same, so the mutex_lock() will just deadlock without even
any ABBA race.

(I didn't really check - maybe there is some reason that doesn't happen).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 17, 2012 07:10PM
On Tue, Apr 17, 2012 at 09:33:06AM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <[email protected]> wrote:
> >
> > Ugh, no, I think you're right:
> >
> >        rename A/a->A/b
> >        rename B/b->B/b
> >
> > where A/a and B/a are the same file, and A/b and B/b are the same file,
> > can result in the first rename holding the lock on A and a and waiting
> > on b, and the second holding the lock on B and b and waiting on a.
>
> In fact I don't think you need even that much. Just a simple
>
> touch a
> ln a b
> mv a b
>
> looks like it should deadlock on itself, no? source and dest inodes
> will be the same, so the mutex_lock() will just deadlock without even
> any ABBA race.
>
> (I didn't really check - maybe there is some reason that doesn't happen).

Yeah, rename has that funny exception that makes the above a no-op, so I
think that's safe.

But the patch is still wrong; back to the drawing board.

Maybe a paper bag over my head will help my concentration....

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Al Viro
Re: [git pull] vfs and fs fixes
April 17, 2012 08:10PM
On Tue, Apr 17, 2012 at 09:33:06AM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 9:22 AM, J. Bruce Fields <[email protected]> wrote:

> In fact I don't think you need even that much. Just a simple
>
> touch a
> ln a b
> mv a b
>
> looks like it should deadlock on itself, no? source and dest inodes
> will be the same, so the mutex_lock() will just deadlock without even
> any ABBA race.

No, this one will bail out much earlier.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Al Viro
Re: [git pull] vfs and fs fixes
April 17, 2012 08:10PM
On Tue, Apr 17, 2012 at 08:01:48AM -0700, Linus Torvalds wrote:
> On Mon, Apr 16, 2012 at 10:25 PM, Al Viro <[email protected]> wrote:
> > ? ? ? ?A bunch of endianness fixes plus a patch from bfields untangling
> > dependencies between vfs and nfsd trees; in principle, we could keep it
> > in nfsd tree (along with a bunch of followups that definitely belong there),
> > but Miklos' stuff in fs/namei.c steps fairly close to it and overlayfs
> > and unionfs series - even closer, so that would create serious PITA for
> > both, whichever tree it would sit in.
>
> Why is that double mutex taking in vfs_rename_other() safe from ABBA?
>
> We aren't guaranteed to hold the s_vfs_rename_mutex, since the parent
> directories may be the same.
>
> And yes, we hold the i_mutex on that shared parent, but the inodes may
> exist (hardlinked) in another directory, so another rename could be
> doing the i_mutex in the reverse order.
>
> Maybe there is some reason why that double lock is safe, but I don't
> see it, and I want it clearly documented. So I'm not pulling this.

It isn't. Hell knows - I wonder if taking s_vfs_rename_mutex in all cases
in lock_rename() would be the right thing to do; it would remove the
problem, but might cost us too much contention...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Al Viro
Re: [git pull] vfs and fs fixes
April 17, 2012 08:30PM
On Tue, Apr 17, 2012 at 07:01:29PM +0100, Al Viro wrote:

> It isn't. Hell knows - I wonder if taking s_vfs_rename_mutex in all cases
> in lock_rename() would be the right thing to do; it would remove the
> problem, but might cost us too much contention...

Actually, it's even worse. ext4_move_extents() locks a _pair_ of
->i_mutex (having checked that both are non-directories first). In
i_ino order. So the only plausible ordering would be
* directories by tree order (with s_vfs_rename_mutex held to
stabilize the tree topology)
* non-directories after all directories, ordered in some consistent
way. Which would have to be by inumber if we want to leave ext4 code
as-is.

Bruce: for now I'm dropping that patch. We _might_ take ext4
mutex_inode_double_lock() into fs/namei.c and have it used by
vfs_rename_other(), but I'm not convinced that this is the right
thing to do. Is there any other sane way to deal with nfsd problem?
i_mutex is already used for more things than I'd like...

Linus: I've removed that sucker from the queue; it should propagate to
git.kernel.org in a few. Could you pull the rest? Same place, the stats
are below:

Shortlog:
Al Viro (12):
nfsd: fix b0rken error value for setattr on read-only mount
nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails
nfsd: fix endianness breakage in TEST_STATEID handling
nfsd: fix error value on allocation failure in nfsd4_decode_test_stateid()
nfsd: fix compose_entry_fh() failure exits
ext4: fix endianness breakage in ext4_split_extent_at()
btrfs: btrfs_root_readonly() broken on big-endian
ocfs2: ->l_next_free_req breakage on big-endian
ocfs: ->rl_used breakage on big-endian
ocfs2: ->rl_count endianness breakage
ocfs2: ->e_leaf_clusters endianness breakage
lockd: fix the endianness bug

Diffstat:
fs/btrfs/ctree.h | 2 +-
fs/ext4/extents.c | 2 +-
fs/lockd/clnt4xdr.c | 2 +-
fs/lockd/clntxdr.c | 2 +-
fs/nfsd/nfs3xdr.c | 22 ++++++++--------------
fs/nfsd/nfs4proc.c | 7 ++++---
fs/nfsd/nfs4state.c | 23 +++++++++--------------
fs/nfsd/nfs4xdr.c | 4 ++--
fs/ocfs2/alloc.c | 2 +-
fs/ocfs2/refcounttree.c | 12 ++++++------
fs/ocfs2/suballoc.c | 4 ++--
11 files changed, 36 insertions(+), 46 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 17, 2012 11:20PM
On Tue, Apr 17, 2012 at 07:28:26PM +0100, Al Viro wrote:
> On Tue, Apr 17, 2012 at 07:01:29PM +0100, Al Viro wrote:
>
> > It isn't. Hell knows - I wonder if taking s_vfs_rename_mutex in all cases
> > in lock_rename() would be the right thing to do; it would remove the
> > problem, but might cost us too much contention...
>
> Actually, it's even worse. ext4_move_extents() locks a _pair_ of
> ->i_mutex (having checked that both are non-directories first). In
> i_ino order. So the only plausible ordering would be
> * directories by tree order (with s_vfs_rename_mutex held to
> stabilize the tree topology)
> * non-directories after all directories, ordered in some consistent
> way. Which would have to be by inumber if we want to leave ext4 code
> as-is.
>
> Bruce: for now I'm dropping that patch. We _might_ take ext4
> mutex_inode_double_lock() into fs/namei.c and have it used by
> vfs_rename_other(), but I'm not convinced that this is the right
> thing to do. Is there any other sane way to deal with nfsd problem?
> i_mutex is already used for more things than I'd like...

I don't want to give out a delegation while a rename, link, unlink, or
setattr of an inode is in progress. All but rename are covered by the
i_mutex.

I'm happy just failing the delegation in case of conflict.

Maybe instead I could continue using the i_mutex but handle rename some
other way; e.g. in delegation code:

if (!mutex_trylock(inode->i_mutex))
return -EAGAIN;
if (atomic_read(inode->i_renames_in_progress))
return -EAGAIN;

and add an

atomic_inc(inode->i_renames_in_progress);
atomic_dec(inode->i_renames_in_progress);

pair around rename.

Or I could increment that counter for all the conflicting operations and
rely on it instead of the i_mutex. I was trying to avoid adding
something like that (an inc, a dec, another error path) to every
operation. And hoping to avoid adding another field to struct inode.
Oh well.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
Re: [git pull] vfs and fs fixes
April 18, 2012 12:10AM
On Tue, Apr 17, 2012 at 2:14 PM, J. Bruce Fields <[email protected]> wrote:
> On Tue, Apr 17, 2012 at 07:28:26PM +0100, Al Viro wrote:>
> Maybe instead I could continue using the i_mutex but handle rename some
> other way; e.g. in delegation code:
>
>        if (!mutex_trylock(inode->i_mutex))
>                return -EAGAIN;
>        if (atomic_read(inode->i_renames_in_progress))
>                return -EAGAIN;
>
> and add an
>
>        atomic_inc(inode->i_renames_in_progress);
>        atomic_dec(inode->i_renames_in_progress);
>
> pair around rename.

Please don't make up your own locking. Plus it's broken anyway, since
a rename could come in directly after your atomic_read (and this is
*why* people shouldn't make up their own locks - they are invariably
broken).

> Or I could increment that counter for all the conflicting operations and
> rely on it instead of the i_mutex.  I was trying to avoid adding
> something like that (an inc, a dec, another error path) to every
> operation.  And hoping to avoid adding another field to struct inode.
> Oh well.

We could just say that we can do a double inode lock, but then
standardize on the order. And the only sane order is comparing inode
pointers, not inode numbers like ext4 apparently does.

With a standard order, I don't think it would be at all wrong to just
take the inode lock on rename.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Al Viro
Re: [git pull] vfs and fs fixes
April 18, 2012 01:50AM
On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > Or I could increment that counter for all the conflicting operations and
> > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > something like that (an inc, a dec, another error path) to every
> > operation. ?And hoping to avoid adding another field to struct inode.
> > Oh well.
>
> We could just say that we can do a double inode lock, but then
> standardize on the order. And the only sane order is comparing inode
> pointers, not inode numbers like ext4 apparently does.
>
> With a standard order, I don't think it would be at all wrong to just
> take the inode lock on rename.

In principle, yes, but have you tried to grep for i_mutex? Note that
we have *another* place where multiple ->i_mutex might be held on
non-directories (and unless I'm missing something, ext4 move_extent.c
stuff doesn't play well with it): quota writes. Which can, AFAICS,
happen while write(2) is holding ->i_mutex on a regular file. So
it's not _that_ easy - we want something like "and quota file is goes
last", since there we don't get to change the locking order - the first
->i_mutex is taken too far outside.

I really don't like how messy i_mutex had become these days. Right now
I'm staring at 700-odd lines all over the place where it's taken/released
and it's a wastebucket lock - used to protect random bits and scraps, with a
lot of filesystems, etc. using it for purposes of their own ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 18, 2012 02:50AM
On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> On Tue, Apr 17, 2012 at 2:14 PM, J. Bruce Fields <[email protected]> wrote:
> > On Tue, Apr 17, 2012 at 07:28:26PM +0100, Al Viro wrote:>
> > Maybe instead I could continue using the i_mutex but handle rename some
> > other way; e.g. in delegation code:
> >
> >        if (!mutex_trylock(inode->i_mutex))
> >                return -EAGAIN;
> >        if (atomic_read(inode->i_renames_in_progress))
> >                return -EAGAIN;
> >
> > and add an
> >
> >        atomic_inc(inode->i_renames_in_progress);
> >        atomic_dec(inode->i_renames_in_progress);
> >
> > pair around rename.
>
> Please don't make up your own locking. Plus it's broken anyway, since
> a rename could come in directly after your atomic_read (and this is
> *why* people shouldn't make up their own locks - they are invariably
> broken).

Doh, yes, sounds like a good rule. (I was misremembering some previous
attempt at this--which admittedly may just have failed in some more
complicated way.)

--b.

> > Or I could increment that counter for all the conflicting operations and
> > rely on it instead of the i_mutex.  I was trying to avoid adding
> > something like that (an inc, a dec, another error path) to every
> > operation.  And hoping to avoid adding another field to struct inode.
> > Oh well.
>
> We could just say that we can do a double inode lock, but then
> standardize on the order. And the only sane order is comparing inode
> pointers, not inode numbers like ext4 apparently does.
>
> With a standard order, I don't think it would be at all wrong to just
> take the inode lock on rename.
>
> Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 18, 2012 02:50AM
On Wed, Apr 18, 2012 at 12:44:24AM +0100, Al Viro wrote:
> On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > Or I could increment that counter for all the conflicting operations and
> > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > something like that (an inc, a dec, another error path) to every
> > > operation. ?And hoping to avoid adding another field to struct inode.
> > > Oh well.
> >
> > We could just say that we can do a double inode lock, but then
> > standardize on the order. And the only sane order is comparing inode
> > pointers, not inode numbers like ext4 apparently does.
> >
> > With a standard order, I don't think it would be at all wrong to just
> > take the inode lock on rename.

I'll take a stab at that. But:

> In principle, yes, but have you tried to grep for i_mutex? Note that
> we have *another* place where multiple ->i_mutex might be held on
> non-directories (and unless I'm missing something, ext4 move_extent.c
> stuff doesn't play well with it): quota writes. Which can, AFAICS,
> happen while write(2) is holding ->i_mutex on a regular file. So
> it's not _that_ easy - we want something like "and quota file is goes
> last", since there we don't get to change the locking order - the first
> ->i_mutex is taken too far outside.
>
> I really don't like how messy i_mutex had become these days. Right now
> I'm staring at 700-odd lines all over the place where it's taken/released
> and it's a wastebucket lock - used to protect random bits and scraps, with a
> lot of filesystems, etc. using it for purposes of their own ;-/

I can understand not wanting the i_mutex to have another use.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Linus Torvalds
Re: [git pull] vfs and fs fixes
April 18, 2012 03:00AM
On Tue, Apr 17, 2012 at 4:44 PM, Al Viro <[email protected]> wrote:
>
> In principle, yes, but have you tried to grep for i_mutex?

No, I hadn't, and right you are. That's a mess.

It's not the i_mutex hits themselves that look scary, but there's 75
hits on just the *nested* locking. That's way more than I would have
guessed without the grep.

The ext4 ones look pretty simple. But yeah, there's a lot of other
noise there...

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 19, 2012 12:00AM
On Wed, Apr 18, 2012 at 12:44:24AM +0100, Al Viro wrote:
> On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > Or I could increment that counter for all the conflicting operations and
> > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > something like that (an inc, a dec, another error path) to every
> > > operation. ?And hoping to avoid adding another field to struct inode.
> > > Oh well.
> >
> > We could just say that we can do a double inode lock, but then
> > standardize on the order. And the only sane order is comparing inode
> > pointers, not inode numbers like ext4 apparently does.
> >
> > With a standard order, I don't think it would be at all wrong to just
> > take the inode lock on rename.
>
> In principle, yes, but have you tried to grep for i_mutex? Note that
> we have *another* place where multiple ->i_mutex might be held on
> non-directories (and unless I'm missing something, ext4 move_extent.c
> stuff doesn't play well with it): quota writes. Which can, AFAICS,
> happen while write(2) is holding ->i_mutex on a regular file. So
> it's not _that_ easy - we want something like "and quota file is goes
> last"

So the idea would be to always take the i_mutex on non-quota files
before taking it on quota files?

I tried pulling the ext4 thing into fs/inode.c, modifying the order to
do that, and then doing the rename change on top of that.

One thing I don't understand is how that interacts with
quota_on/quota_off. How do we decide the right lock ordering if files
can go back and forth between being quota files?

--b.

> , since there we don't get to change the locking order - the first
> ->i_mutex is taken too far outside.
>
> I really don't like how messy i_mutex had become these days. Right now
> I'm staring at 700-odd lines all over the place where it's taken/released
> and it's a wastebucket lock - used to protect random bits and scraps, with a
> lot of filesystems, etc. using it for purposes of their own ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Benjamin Herrenschmidt
Re: [git pull] vfs and fs fixes
April 19, 2012 05:30AM
> ext4: fix endianness breakage in ext4_split_extent_at()

That smells like something we'd want in -stable no ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Ted Ts'o
Re: [git pull] vfs and fs fixes
April 19, 2012 05:00PM
On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > ext4: fix endianness breakage in ext4_split_extent_at()
>
> That smells like something we'd want in -stable no ?

Yes, commit af1584f5 is something that should definitely go into
-stable.


- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Jan Kara
Re: [git pull] vfs and fs fixes
April 20, 2012 01:20PM
On Wed 18-04-12 00:44:24, Al Viro wrote:
> On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > Or I could increment that counter for all the conflicting operations and
> > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > something like that (an inc, a dec, another error path) to every
> > > operation. ?And hoping to avoid adding another field to struct inode.
> > > Oh well.
> >
> > We could just say that we can do a double inode lock, but then
> > standardize on the order. And the only sane order is comparing inode
> > pointers, not inode numbers like ext4 apparently does.
> >
> > With a standard order, I don't think it would be at all wrong to just
> > take the inode lock on rename.
>
> In principle, yes, but have you tried to grep for i_mutex? Note that
> we have *another* place where multiple ->i_mutex might be held on
> non-directories (and unless I'm missing something, ext4 move_extent.c
> stuff doesn't play well with it): quota writes. Which can, AFAICS,
> happen while write(2) is holding ->i_mutex on a regular file. So
> it's not _that_ easy - we want something like "and quota file is goes
> last", since there we don't get to change the locking order - the first
> ->i_mutex is taken too far outside.
Hum, I think I could just do away with quota file i_mutex being special.
It's used for two purposes:
1) When quota is being turned on/off, we want to set/clear inode immutable
flag, truncate page cache, etc. But we should be able push this locking
outside of quota locks.
2) Inside filesystems when quota file is written to. Quota writes are
serialized by quota code anyway and noone else has any bussiness with quota
files (they are marked as immutable to avoid mistakes) so there i_mutex is
not really needed.

> I really don't like how messy i_mutex had become these days. Right now
> I'm staring at 700-odd lines all over the place where it's taken/released
> and it's a wastebucket lock - used to protect random bits and scraps, with a
> lot of filesystems, etc. using it for purposes of their own ;-/

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Greg KH
Re: [git pull] vfs and fs fixes
April 24, 2012 07:50PM
On Thu, Apr 19, 2012 at 10:50:07AM -0400, Ted Ts'o wrote:
> On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > > ext4: fix endianness breakage in ext4_split_extent_at()
> >
> > That smells like something we'd want in -stable no ?
>
> Yes, commit af1584f5 is something that should definitely go into
> -stable.

Now queued up, thanks.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Al Viro
Re: [git pull] vfs and fs fixes
April 24, 2012 07:50PM
On Tue, Apr 24, 2012 at 10:40:06AM -0700, Greg KH wrote:
> On Thu, Apr 19, 2012 at 10:50:07AM -0400, Ted Ts'o wrote:
> > On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > > > ext4: fix endianness breakage in ext4_split_extent_at()
> > >
> > > That smells like something we'd want in -stable no ?
> >
> > Yes, commit af1584f5 is something that should definitely go into
> > -stable.
>
> Now queued up, thanks.

Actually, the rest of commits from that pull also should go there; forgot
to add Cc: stable, my apologies...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Greg KH
Re: [git pull] vfs and fs fixes
April 24, 2012 08:00PM
On Tue, Apr 24, 2012 at 06:45:02PM +0100, Al Viro wrote:
> On Tue, Apr 24, 2012 at 10:40:06AM -0700, Greg KH wrote:
> > On Thu, Apr 19, 2012 at 10:50:07AM -0400, Ted Ts'o wrote:
> > > On Thu, Apr 19, 2012 at 01:23:18PM +1000, Benjamin Herrenschmidt wrote:
> > > > > ext4: fix endianness breakage in ext4_split_extent_at()
> > > >
> > > > That smells like something we'd want in -stable no ?
> > >
> > > Yes, commit af1584f5 is something that should definitely go into
> > > -stable.
> >
> > Now queued up, thanks.
>
> Actually, the rest of commits from that pull also should go there; forgot
> to add Cc: stable, my apologies...

So that would be commits
96f6f98501196d46ce52c2697dd758d9300c63f5..e847469bf77a1d339274074ed068d461f0c872bc
inclusive?

Or a different range?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Al Viro
Re: [git pull] vfs and fs fixes
April 24, 2012 08:10PM
On Tue, Apr 24, 2012 at 10:59:11AM -0700, Greg KH wrote:

> So that would be commits
> 96f6f98501196d46ce52c2697dd758d9300c63f5..e847469bf77a1d339274074ed068d461f0c872bc
> inclusive?

Yes.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 24, 2012 10:00PM
On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> On Wed 18-04-12 00:44:24, Al Viro wrote:
> > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > Or I could increment that counter for all the conflicting operations and
> > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > something like that (an inc, a dec, another error path) to every
> > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > Oh well.
> > >
> > > We could just say that we can do a double inode lock, but then
> > > standardize on the order. And the only sane order is comparing inode
> > > pointers, not inode numbers like ext4 apparently does.
> > >
> > > With a standard order, I don't think it would be at all wrong to just
> > > take the inode lock on rename.
> >
> > In principle, yes, but have you tried to grep for i_mutex? Note that
> > we have *another* place where multiple ->i_mutex might be held on
> > non-directories (and unless I'm missing something, ext4 move_extent.c
> > stuff doesn't play well with it): quota writes. Which can, AFAICS,
> > happen while write(2) is holding ->i_mutex on a regular file. So
> > it's not _that_ easy - we want something like "and quota file is goes
> > last", since there we don't get to change the locking order - the first
> > ->i_mutex is taken too far outside.
> Hum, I think I could just do away with quota file i_mutex being special.
> It's used for two purposes:
> 1) When quota is being turned on/off, we want to set/clear inode immutable
> flag, truncate page cache, etc. But we should be able push this locking
> outside of quota locks.
> 2) Inside filesystems when quota file is written to. Quota writes are
> serialized by quota code anyway and noone else has any bussiness with quota
> files (they are marked as immutable to avoid mistakes) so there i_mutex is
> not really needed.

Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2. The
former two are in code called from the quota code (through the
->quota_write method). But the gfs2 code appears to be called directly
from gfs2's write code.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Greg KH
Re: [git pull] vfs and fs fixes
April 24, 2012 10:40PM
On Tue, Apr 24, 2012 at 07:04:07PM +0100, Al Viro wrote:
> On Tue, Apr 24, 2012 at 10:59:11AM -0700, Greg KH wrote:
>
> > So that would be commits
> > 96f6f98501196d46ce52c2697dd758d9300c63f5..e847469bf77a1d339274074ed068d461f0c872bc
> > inclusive?
>
> Yes.

Wonderful, now queued up, thanks for letting me know.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Jan Kara
Re: [git pull] vfs and fs fixes
April 25, 2012 12:30AM
On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > Or I could increment that counter for all the conflicting operations and
> > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > something like that (an inc, a dec, another error path) to every
> > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > Oh well.
> > > >
> > > > We could just say that we can do a double inode lock, but then
> > > > standardize on the order. And the only sane order is comparing inode
> > > > pointers, not inode numbers like ext4 apparently does.
> > > >
> > > > With a standard order, I don't think it would be at all wrong to just
> > > > take the inode lock on rename.
> > >
> > > In principle, yes, but have you tried to grep for i_mutex? Note that
> > > we have *another* place where multiple ->i_mutex might be held on
> > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > stuff doesn't play well with it): quota writes. Which can, AFAICS,
> > > happen while write(2) is holding ->i_mutex on a regular file. So
> > > it's not _that_ easy - we want something like "and quota file is goes
> > > last", since there we don't get to change the locking order - the first
> > > ->i_mutex is taken too far outside.
> > Hum, I think I could just do away with quota file i_mutex being special.
> > It's used for two purposes:
> > 1) When quota is being turned on/off, we want to set/clear inode immutable
> > flag, truncate page cache, etc. But we should be able push this locking
> > outside of quota locks.
> > 2) Inside filesystems when quota file is written to. Quota writes are
> > serialized by quota code anyway and noone else has any bussiness with quota
> > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > not really needed.
>
> Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2. The
> former two are in code called from the quota code (through the
> ->quota_write method). But the gfs2 code appears to be called directly
> from gfs2's write code.
Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
wrote.

Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 25, 2012 01:40PM
On Wed, Apr 25, 2012 at 12:23:12AM +0200, Jan Kara wrote:
> On Tue 24-04-12 15:52:36, J. Bruce Fields wrote:
> > On Fri, Apr 20, 2012 at 01:15:17PM +0200, Jan Kara wrote:
> > > On Wed 18-04-12 00:44:24, Al Viro wrote:
> > > > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > > > Or I could increment that counter for all the conflicting operations and
> > > > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > > > something like that (an inc, a dec, another error path) to every
> > > > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > > > Oh well.
> > > > >
> > > > > We could just say that we can do a double inode lock, but then
> > > > > standardize on the order. And the only sane order is comparing inode
> > > > > pointers, not inode numbers like ext4 apparently does.
> > > > >
> > > > > With a standard order, I don't think it would be at all wrong to just
> > > > > take the inode lock on rename.
> > > >
> > > > In principle, yes, but have you tried to grep for i_mutex? Note that
> > > > we have *another* place where multiple ->i_mutex might be held on
> > > > non-directories (and unless I'm missing something, ext4 move_extent.c
> > > > stuff doesn't play well with it): quota writes. Which can, AFAICS,
> > > > happen while write(2) is holding ->i_mutex on a regular file. So
> > > > it's not _that_ easy - we want something like "and quota file is goes
> > > > last", since there we don't get to change the locking order - the first
> > > > ->i_mutex is taken too far outside.
> > > Hum, I think I could just do away with quota file i_mutex being special.
> > > It's used for two purposes:
> > > 1) When quota is being turned on/off, we want to set/clear inode immutable
> > > flag, truncate page cache, etc. But we should be able push this locking
> > > outside of quota locks.
> > > 2) Inside filesystems when quota file is written to. Quota writes are
> > > serialized by quota code anyway and noone else has any bussiness with quota
> > > files (they are marked as immutable to avoid mistakes) so there i_mutex is
> > > not really needed.
> >
> > Grepping for I_MUTEX_QUOTA shows hits in ext4, reiserfs, and gfs2. The
> > former two are in code called from the quota code (through the
> > ->quota_write method). But the gfs2 code appears to be called directly
> > from gfs2's write code.
> Ah, gfs2 doesn't use generic quota code so whatever it does is it's own
> invention. For ext4 and reiserfs I could get rid of I_MUTEX_QUOTA as I
> wrote.

So, just the appended?

But unfortunately as long as that's left in gfs2 we're still stuck
trying to order quota files after other files when we take two
non-directory i_mutexes elsewhere.

--b.

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index e1025c7..1a6fb52 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1441,7 +1441,6 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
struct buffer_head tmp_bh;
struct buffer_head *bh;

- mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
while (towrite > 0) {
tocopy = sb->s_blocksize - offset < towrite ?
sb->s_blocksize - offset : towrite;
@@ -1471,16 +1470,13 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
blk++;
}
out:
- if (len == towrite) {
- mutex_unlock(&inode->i_mutex);
+ if (len == towrite)
return err;
- }
if (inode->i_size < off+len-towrite)
i_size_write(inode, off+len-towrite);
inode->i_version++;
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
mark_inode_dirty(inode);
- mutex_unlock(&inode->i_mutex);
return len - towrite;
}

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index cf0b592..7c08c93 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -3000,7 +3000,6 @@ static ssize_t ext3_quota_write(struct super_block *sb, int type,
(unsigned long long)off, (unsigned long long)len);
return -EIO;
}
- mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
bh = ext3_bread(handle, inode, blk, 1, &err);
if (!bh)
goto out;
@@ -3024,10 +3023,8 @@ static ssize_t ext3_quota_write(struct super_block *sb, int type,
}
brelse(bh);
out:
- if (err) {
- mutex_unlock(&inode->i_mutex);
+ if (err)
return err;
- }
if (inode->i_size < off + len) {
i_size_write(inode, off + len);
EXT3_I(inode)->i_disksize = inode->i_size;
@@ -3035,7 +3032,6 @@ out:
inode->i_version++;
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
ext3_mark_inode_dirty(handle, inode);
- mutex_unlock(&inode->i_mutex);
return len;
}

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ceebaf8..97938db 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4760,7 +4760,6 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
return -EIO;
}

- mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
bh = ext4_bread(handle, inode, blk, 1, &err);
if (!bh)
goto out;
@@ -4776,16 +4775,13 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
err = ext4_handle_dirty_metadata(handle, NULL, bh);
brelse(bh);
out:
- if (err) {
- mutex_unlock(&inode->i_mutex);
+ if (err)
return err;
- }
if (inode->i_size < off + len) {
i_size_write(inode, off + len);
EXT4_I(inode)->i_disksize = inode->i_size;
ext4_mark_inode_dirty(handle, inode);
}
- mutex_unlock(&inode->i_mutex);
return len;
}

diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 8b7616e..c07b7d7 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -2270,7 +2270,6 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
(unsigned long long)off, (unsigned long long)len);
return -EIO;
}
- mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
while (towrite > 0) {
tocopy = sb->s_blocksize - offset < towrite ?
sb->s_blocksize - offset : towrite;
@@ -2302,16 +2301,13 @@ static ssize_t reiserfs_quota_write(struct super_block *sb, int type,
blk++;
}
out:
- if (len == towrite) {
- mutex_unlock(&inode->i_mutex);
+ if (len == towrite)
return err;
- }
if (inode->i_size < off + len - towrite)
i_size_write(inode, off + len - towrite);
inode->i_version++;
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
mark_inode_dirty(inode);
- mutex_unlock(&inode->i_mutex);
return len - towrite;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
J. Bruce Fields
Re: [git pull] vfs and fs fixes
April 25, 2012 05:30PM
On Wed, Apr 18, 2012 at 05:52:38PM -0400, J. Bruce Fields wrote:
> On Wed, Apr 18, 2012 at 12:44:24AM +0100, Al Viro wrote:
> > On Tue, Apr 17, 2012 at 03:08:26PM -0700, Linus Torvalds wrote:
> > > > Or I could increment that counter for all the conflicting operations and
> > > > rely on it instead of the i_mutex. ?I was trying to avoid adding
> > > > something like that (an inc, a dec, another error path) to every
> > > > operation. ?And hoping to avoid adding another field to struct inode.
> > > > Oh well.
> > >
> > > We could just say that we can do a double inode lock, but then
> > > standardize on the order. And the only sane order is comparing inode
> > > pointers, not inode numbers like ext4 apparently does.
> > >
> > > With a standard order, I don't think it would be at all wrong to just
> > > take the inode lock on rename.
> >
> > In principle, yes, but have you tried to grep for i_mutex? Note that
> > we have *another* place where multiple ->i_mutex might be held on
> > non-directories (and unless I'm missing something, ext4 move_extent.c
> > stuff doesn't play well with it): quota writes. Which can, AFAICS,
> > happen while write(2) is holding ->i_mutex on a regular file. So
> > it's not _that_ easy - we want something like "and quota file is goes
> > last"
>
> So the idea would be to always take the i_mutex on non-quota files
> before taking it on quota files?
>
> I tried pulling the ext4 thing into fs/inode.c, modifying the order to
> do that, and then doing the rename change on top of that.

Patches follow, with the ordering change at the end.

And a documentation fix that I suppose could go in whenever.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: "J. Bruce Fields" <[email protected]>

We want to do this elsewhere as well.

Also, compare pointers instead of inode numbers.

Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/ext4/move_extent.c | 21 ++-------------------
fs/inode.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +++
3 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index c5826c6..b87d94a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -1086,19 +1086,7 @@ mext_inode_double_lock(struct inode *inode1, struct inode *inode2)
if (ret < 0)
goto out;

- if (inode1 == inode2) {
- mutex_lock(&inode1->i_mutex);
- goto out;
- }
-
- if (inode1->i_ino < inode2->i_ino) {
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
- } else {
- mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
- mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
- }
-
+ lock_two_nondirectories(inode1, inode2);
out:
return ret;
}
@@ -1123,12 +1111,7 @@ mext_inode_double_unlock(struct inode *inode1, struct inode *inode2)
if (ret < 0)
goto out;

- if (inode1)
- mutex_unlock(&inode1->i_mutex);
-
- if (inode2 && inode2 != inode1)
- mutex_unlock(&inode2->i_mutex);
-
+ unlock_two_nondirectories(inode1, inode2);
out:
return ret;
}
diff --git a/fs/inode.c b/fs/inode.c
index 9f4f5fe..9718f1c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -962,6 +962,42 @@ void unlock_new_inode(struct inode *inode)
EXPORT_SYMBOL(unlock_new_inode);

/**
+ * lock_two_nondirectories - take two i_mutexes on non-directory objects
+ * @inode1: first inode to lock; must be non-NULL
+ * @inode2: second inode to lock; optional, may equal first or be NULL.
+ */
+void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+ if (inode1 == inode2 || inode2 == NULL)
+ mutex_lock(&inode1->i_mutex);
+ else if (inode1 < inode2) {
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+
+ } else {
+ mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+ mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+ }
+}
+EXPORT_SYMBOL(lock_two_nondirectories);
+
+/**
+ * lock_two_nondirectories - release locks from lock_two_nondirectories()
+ * @inode1: first inode to unlock
+ * @inode2: second inode to lock
+ *
+ * Arguments must be same as those given to corresponding
+ * lock_two_nondirectories() call.
+ */
+void unlock_two_nondirectories(struct inode *inode1, struct inode *inode2)
+{
+ mutex_unlock(&inode1->i_mutex);
+ if (inode2 && inode2 != inode1)
+ mutex_unlock(&inode2->i_mutex);
+}
+EXPORT_SYMBOL(unlock_two_nondirectories);
+
+/**
* iget5_locked - obtain an inode from a mounted file system
* @sb: super block of file system
* @hashval: hash value (usually inode number) to get
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f31e45c..0bee727 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -875,6 +875,9 @@ enum inode_i_mutex_lock_class
I_MUTEX_QUOTA
};

+void lock_two_nondirectories(struct inode *, struct inode*);
+void unlock_two_nondirectories(struct inode *, struct inode*);
+
/*
* NOTE: in a 32bit arch with a preemptable kernel and
* an UP compile the i_size_read/write must be atomic
--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: "J. Bruce Fields" <[email protected]>

A write can take an i_mutex on a quota file while holding the i_mutex on
the file being written to.

And both rename and fs/ext4/move_extent.c:mext_inode_double_lock() can
also take the i_mutex on two regular files.

Either of those could take locks in opposite order from a quota file
write, and end up deadlocked.

Changing the locking order in the quota-update-while-writing case looks
hard. So, instead, change the order in the mext_inode_double_lock()
case so that the i_mutex is always taken on a quota file after being
taken on a file that isn't a quota file.

Signed-off-by: J. Bruce Fields <[email protected]>
---
Documentation/filesystems/directory-locking | 25 +++++++++++++++----------
fs/inode.c | 13 ++++++++++++-
2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking
index 9e8a629..022d94f 100644
--- a/Documentation/filesystems/directory-locking
+++ b/Documentation/filesystems/directory-locking
@@ -3,8 +3,12 @@ kinds of locks - per-inode (->i_mutex) and per-filesystem
(->s_vfs_rename_mutex).

When taking the i_mutex on multiple non-directory objects, we
-always acquire the locks in order by increasing address. We'll call
-that "inode pointer" order in the following.
+always acquire them in the following order (which we'll call "the usual
+order" in the following):
+
+ * non-IS_NOQUOTA inodes before IS_NOQUOTA inodes
+ * within each category, inodes with smaller addresses before
+ inodes with larger addresses

For our purposes all operations fall in 5 classes:

@@ -17,7 +21,7 @@ locks victim and calls the method.

4) rename() that is _not_ cross-directory. Locking rules: caller locks
the parent and finds source and target. If source and target both
-exist, they are locked in inode pointer order. Otherwise lock just
+exist, they are locked in the usual order. Otherwise lock just
source. Then call method.

5) link creation. Locking rules:
@@ -35,8 +39,8 @@ rules:
fail with -ENOTEMPTY
* if new parent is equal to or is a descendent of source
fail with -ELOOP
- * If target exists, lock both source and target, in inode
- pointer order. Otherwise lock just source.
+ * If target exists, lock both source and target, in the
+ usual order. Otherwise lock just source.
* call the method.


@@ -63,10 +67,10 @@ objects - A < B iff A is an ancestor of B.
the order until we had acquired all locks).

(3) locks on non-directory objects are acquired only after locks on
- directory objects, and are acquired in inode pointer order.
+ directory objects, and are acquired in the usual order.
(Proof: all operations but renames take lock on at most one
non-directory object, except renames, which take locks on source and
- target in inode pointer order.)
+ target in the usual order.)

Now consider the minimal deadlock. Each process is blocked on
attempt to acquire some lock and already holds at least one lock. Let's
@@ -75,9 +79,10 @@ not contended, since any process blocked on it is not holding any locks.
Thus all processes are blocked on ->i_mutex.

By (3), any process holding a non-directory lock can only be
-waiting on another non-directory lock with a larger address. Therefore
-the process holding the "largest" such lock can always make progress, and
-non-directory objects are not included in the set of contended locks.
+waiting on another non-directory that is "larger" in the usual order.
+Therefore the process holding the "largest" such lock can always make
+progress, and non-directory objects are not included in the set of
+contended locks.

Thus link creation can't be a part of deadlock - it can't be
blocked on source and it means that it doesn't hold any locks.
diff --git a/fs/inode.c b/fs/inode.c
index 487c924..13d23b6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -961,6 +961,17 @@ void unlock_new_inode(struct inode *inode)
}
EXPORT_SYMBOL(unlock_new_inode);

+/*
+ * We order !IS_NOQUOTA files before ISNOQUOTA files, and by pointer
+ * within each category.
+ */
+static bool nondir_mutex_ordered(struct inode *inode1, struct inode *inode2)
+{
+ if (IS_NOQUOTA(inode1) == IS_NOQUOTA(inode2))
+ return inode1 < inode2;
+ return IS_NOQUOTA(inode2);
+}
+
/**
* lock_two_nondirectories - take two i_mutexes on non-directory objects
* @inode1: first inode to lock; must be non-NULL
@@ -970,7 +981,7 @@ void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
{
if (inode1 == inode2 || inode2 == NULL)
mutex_lock(&inode1->i_mutex);
- else if (inode1 < inode2) {
+ else if (nondir_mutex_ordered(inode1, inode2)) {
mutex_lock(&inode1->i_mutex);
mutex_lock_nested(&inode2->i_mutex, I_MUTEX_QUOTA);

--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: "J. Bruce Fields" <[email protected]>

Signed-off-by: J. Bruce Fields <[email protected]>
---
include/linux/fs.h | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8de6755..f31e45c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -860,7 +860,8 @@ static inline int inode_unhashed(struct inode *inode)
* 0: the object of the current VFS operation
* 1: parent
* 2: child/target
- * 3: quota file
+ * 3: xattr
+ * 4: quota file
*
* The locking order between these classes is
* parent -> child -> normal -> xattr -> quota
--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Sorry, only registered users may post in this forum.

Click here to login