[PATCH 00/19] Fix filesystem freezing deadlocks

Posted by Jan Kara on March 05, 2012 05:12PM

Hallelujah,

after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing. Filesystem freezing is currently
racy and thus we can end up with dirty data on a frozen filesystem (see the
changelog of patch 06 for a detailed description of the race). This patch
series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling into mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). Another advantage of this
change is that all filesystems get freeze protection almost for free - even
ext2 can handle freezing well now.
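To illustrate the idea, here is a rough sketch of what the mnt_want_write() /
mnt_drop_write() pair looks like with freeze handling folded in (simplified;
__mnt_want_write()/__mnt_drop_write() stand for the existing read-only
accounting and are not necessarily the exact names used in the series):

int mnt_want_write(struct vfsmount *mnt)
{
	int ret;

	/* Blocks while the superblock is frozen; released in mnt_drop_write() */
	sb_start_write(mnt->mnt_sb);
	ret = __mnt_want_write(mnt);	/* existing read-only / remount checks */
	if (ret)
		sb_end_write(mnt->mnt_sb);
	return ret;
}

void mnt_drop_write(struct vfsmount *mnt)
{
	__mnt_drop_write(mnt);
	sb_end_write(mnt->mnt_sb);
}

Any filesystem that already brackets its write paths with mnt_want_write() /
mnt_drop_write() therefore blocks on a frozen filesystem without further
changes.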

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files, which may be impractical in some cases. The main reason for this is the
difficulty of handling file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such a filesystem into a consistent state... But if people think that freezing
with open but unlinked files should be allowed, then I have some possible
solutions in mind (maybe as a separate patchset, since this one is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on a frozen
filesystem despite beating on it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs), so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test: this patchset is based on the patch series
"Push file_update_time() into .page_mkwrite", so you'll need to pull that one
in as well.

Changes since v2:
* completely rewritten
* freezing is now blocked at VFS entry points
* two-stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
* have two counters to provide safe state transitions for SB_FREEZE_WRITE
and SB_FREEZE_TRANS states
* use percpu counters instead of own percpu structure
* added documentation fixes from the old fs freezing series
* converted XFS to use SB_FREEZE_TRANS counter instead of its private
m_active_trans counter

Honza

CC: Alex Elder <[email protected]>
CC: Anton Altaparmakov <[email protected]>
CC: Ben Myers <[email protected]>
CC: Chris Mason <[email protected]>
CC: cluster-devel@redhat.com
CC: "David S. Miller" <[email protected]>
CC: fuse-devel@lists.sourceforge.net
CC: "J. Bruce Fields" <[email protected]>
CC: Joel Becker <[email protected]>
CC: KONISHI Ryusuke <[email protected]>
CC: linux-btrfs@vger.kernel.org
CC: linux-ext4@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-nilfs@vger.kernel.org
CC: linux-ntfs-dev@lists.sourceforge.net
CC: Mark Fasheh <[email protected]>
CC: Miklos Szeredi <[email protected]>
CC: ocfs2-devel@oss.oracle.com
CC: OGAWA Hirofumi <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: "Theodore Ts'o" <[email protected]>
CC: xfs@oss.sgi.com
Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new
freeze protection. We also protect several ioctl entry points which
were missing the protection.

CC: Mark Fasheh <[email protected]>
CC: Joel Becker <[email protected]>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <[email protected]>
---
fs/ocfs2/file.c | 11 +++++++++--
fs/ocfs2/ioctl.c | 14 ++++++++++++--
fs/ocfs2/mmap.c | 2 ++
3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 061591a..9b1e3d4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1971,6 +1971,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
{
struct inode *inode = file->f_path.dentry->d_inode;
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ int ret;

if ((cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) &&
!ocfs2_writes_unwritten_extents(osb))
@@ -1985,7 +1986,12 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;

- return __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+ ret = __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+ mnt_drop_write_file(file);
+ return ret;
}

static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
@@ -2261,7 +2267,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
if (iocb->ki_left == 0)
return 0;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ sb_start_write(inode->i_sb);

appending = file->f_flags & O_APPEND ? 1 : 0;
direct_io = file->f_flags & O_DIRECT ? 1 : 0;
@@ -2434,6 +2440,7 @@ out_sems:
ocfs2_iocb_clear_sem_locked(iocb);

mutex_unlock(&inode->i_mutex);
+ sb_end_write(inode->i_sb);

if (written)
ret = written;
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index a6fda3c..59de312 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -928,7 +928,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (get_user(new_clusters, (int __user *)arg))
return -EFAULT;

- return ocfs2_group_extend(inode, new_clusters);
+ status = mnt_want_write_file(filp);
+ if (status)
+ return status;
+ status = ocfs2_group_extend(inode, new_clusters);
+ mnt_drop_write_file(filp);
+ return status;
case OCFS2_IOC_GROUP_ADD:
case OCFS2_IOC_GROUP_ADD64:
if (!capable(CAP_SYS_RESOURCE))
@@ -937,7 +942,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (copy_from_user(&input, (int __user *) arg, sizeof(input)))
return -EFAULT;

- return ocfs2_group_add(inode, &input);
+ status = mnt_want_write_file(filp);
+ if (status)
+ return status;
+ status = ocfs2_group_add(inode, &input);
+ mnt_drop_write_file(filp);
+ return status;
case OCFS2_IOC_REFLINK:
if (copy_from_user(&args, (struct reflink_arguments *)arg,
sizeof(args)))
diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
index 9cd4108..d150372 100644
--- a/fs/ocfs2/mmap.c
+++ b/fs/ocfs2/mmap.c
@@ -136,6 +136,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sigset_t oldset;
int ret;

+ sb_start_pagefault(inode->i_sb);
ocfs2_block_signals(&oldset);

/*
@@ -165,6 +166,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

out:
ocfs2_unblock_signals(&oldset);
+ sb_end_pagefault(inode->i_sb);
return ret;
}

--
1.7.1

[PATCH 11/19] xfs: Convert to new freezing code
Posted by Jan Kara on March 05, 2012 05:13PM

Generic code now blocks all writers from the standard write paths. So we block
all writers coming from ioctls and replace the blocking of transactions on a
frozen filesystem with a debugging check. As a bonus, ioctls are now also
protected against a racing remount read-only. We also convert
xfs_file_aio_write() to non-racy freeze protection.

CC: Ben Myers <[email protected]>
CC: Alex Elder <[email protected]>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <[email protected]>
---
fs/xfs/xfs_file.c | 10 ++++++--
fs/xfs/xfs_ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--
fs/xfs/xfs_ioctl32.c | 12 ++++++++++
fs/xfs/xfs_iomap.c | 1 -
fs/xfs/xfs_mount.c | 2 +-
fs/xfs/xfs_mount.h | 3 --
fs/xfs/xfs_sync.c | 2 +-
fs/xfs/xfs_trans.c | 2 +-
8 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7e5bc87..57dd20e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -874,10 +874,12 @@ xfs_file_aio_write(
if (ocount == 0)
return 0;

- xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
+ sb_start_write(inode->i_sb);

- if (XFS_FORCED_SHUTDOWN(ip->i_mount))
- return -EIO;
+ if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+ ret = -EIO;
+ goto out;
+ }

if (unlikely(file->f_flags & O_DIRECT))
ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
@@ -896,6 +898,8 @@ xfs_file_aio_write(
ret = err;
}

+out:
+ sb_end_write(inode->i_sb);
return ret;
}

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 76f3ca5..7890105 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -367,9 +367,15 @@ xfs_fssetdm_by_handle(
if (copy_from_user(&dmhreq, arg, sizeof(xfs_fsop_setdm_handlereq_t)))
return -XFS_ERROR(EFAULT);

+ error = mnt_want_write_file(parfilp);
+ if (error)
+ return error;
+
dentry = xfs_handlereq_to_dentry(parfilp, &dmhreq.hreq);
- if (IS_ERR(dentry))
+ if (IS_ERR(dentry)) {
+ mnt_drop_write_file(parfilp);
return PTR_ERR(dentry);
+ }

if (IS_IMMUTABLE(dentry->d_inode) || IS_APPEND(dentry->d_inode)) {
error = -XFS_ERROR(EPERM);
@@ -385,6 +391,7 @@ xfs_fssetdm_by_handle(
fsd.fsd_dmstate);

out:
+ mnt_drop_write_file(parfilp);
dput(dentry);
return error;
}
@@ -631,7 +638,11 @@ xfs_ioc_space(
if (ioflags & IO_INVIS)
attr_flags |= XFS_ATTR_DMI;

+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_change_file_space(ip, cmd, bf, filp->f_pos, attr_flags);
+ mnt_drop_write_file(filp);
return -error;
}

@@ -1160,6 +1171,7 @@ xfs_ioc_fssetxattr(
{
struct fsxattr fa;
unsigned int mask;
+ int error;

if (copy_from_user(&fa, arg, sizeof(fa)))
return -EFAULT;
@@ -1168,7 +1180,12 @@ xfs_ioc_fssetxattr(
if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
mask |= FSX_NONBLOCK;

- return -xfs_ioctl_setattr(ip, &fa, mask);
+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
+ error = xfs_ioctl_setattr(ip, &fa, mask);
+ mnt_drop_write_file(filp);
+ return -error;
}

STATIC int
@@ -1193,6 +1210,7 @@ xfs_ioc_setxflags(
struct fsxattr fa;
unsigned int flags;
unsigned int mask;
+ int error;

if (copy_from_user(&flags, arg, sizeof(flags)))
return -EFAULT;
@@ -1207,7 +1225,12 @@ xfs_ioc_setxflags(
mask |= FSX_NONBLOCK;
fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));

- return -xfs_ioctl_setattr(ip, &fa, mask);
+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
+ error = xfs_ioctl_setattr(ip, &fa, mask);
+ mnt_drop_write_file(filp);
+ return -error;
}

STATIC int
@@ -1382,8 +1405,13 @@ xfs_file_ioctl(
if (copy_from_user(&dmi, arg, sizeof(dmi)))
return -XFS_ERROR(EFAULT);

+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
+
error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask,
dmi.fsd_dmstate);
+ mnt_drop_write_file(filp);
return -error;
}

@@ -1431,7 +1459,11 @@ xfs_file_ioctl(

if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
return -XFS_ERROR(EFAULT);
+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_swapext(&sxp);
+ mnt_drop_write_file(filp);
return -error;
}

@@ -1460,9 +1492,14 @@ xfs_file_ioctl(
if (copy_from_user(&inout, arg, sizeof(inout)))
return -XFS_ERROR(EFAULT);

+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
+
/* input parameter is passed in resblks field of structure */
in = inout.resblks;
error = xfs_reserve_blocks(mp, &in, &inout);
+ mnt_drop_write_file(filp);
if (error)
return -error;

@@ -1493,7 +1530,11 @@ xfs_file_ioctl(
if (copy_from_user(&in, arg, sizeof(in)))
return -XFS_ERROR(EFAULT);

+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_growfs_data(mp, &in);
+ mnt_drop_write_file(filp);
return -error;
}

@@ -1503,7 +1544,11 @@ xfs_file_ioctl(
if (copy_from_user(&in, arg, sizeof(in)))
return -XFS_ERROR(EFAULT);

+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_growfs_log(mp, &in);
+ mnt_drop_write_file(filp);
return -error;
}

@@ -1513,7 +1558,11 @@ xfs_file_ioctl(
if (copy_from_user(&in, arg, sizeof(in)))
return -XFS_ERROR(EFAULT);

+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_growfs_rt(mp, &in);
+ mnt_drop_write_file(filp);
return -error;
}

diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index f9ccb7b..2012369 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -602,7 +602,11 @@ xfs_file_compat_ioctl(

if (xfs_compat_growfs_data_copyin(&in, arg))
return -XFS_ERROR(EFAULT);
+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_growfs_data(mp, &in);
+ mnt_drop_write_file(filp);
return -error;
}
case XFS_IOC_FSGROWFSRT_32: {
@@ -610,7 +614,11 @@ xfs_file_compat_ioctl(

if (xfs_compat_growfs_rt_copyin(&in, arg))
return -XFS_ERROR(EFAULT);
+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_growfs_rt(mp, &in);
+ mnt_drop_write_file(filp);
return -error;
}
#endif
@@ -629,7 +637,11 @@ xfs_file_compat_ioctl(
offsetof(struct xfs_swapext, sx_stat)) ||
xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
return -XFS_ERROR(EFAULT);
+ error = mnt_want_write_file(filp);
+ if (error)
+ return error;
error = xfs_swapext(&sxp);
+ mnt_drop_write_file(filp);
return -error;
}
case XFS_IOC_FSBULKSTAT_32:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..7ab98d9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,7 +679,6 @@ xfs_iomap_write_unwritten(
* the same inode that we complete here and might deadlock
* on the iolock.
*/
- xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
tp->t_flags |= XFS_TRANS_RESERVE;
error = xfs_trans_reserve(tp, resblks,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..bead529 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1541,7 +1541,7 @@ xfs_unmountfs(
int
xfs_fs_writable(xfs_mount_t *mp)
{
- return !(xfs_test_for_freeze(mp) || XFS_FORCED_SHUTDOWN(mp) ||
+ return !(mp->m_super->s_writers.frozen || XFS_FORCED_SHUTDOWN(mp) ||
(mp->m_flags & XFS_MOUNT_RDONLY));
}

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 19f69e2..f0afeb9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -310,9 +310,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
#define SHUTDOWN_REMOTE_REQ 0x0010 /* shutdown came from remote cell */
#define SHUTDOWN_DEVICE_REQ 0x0020 /* failed all paths to the device */

-#define xfs_test_for_freeze(mp) ((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l) vfs_check_frozen((mp)->m_super, (l))
-
/*
* Flags for xfs_mountfs
*/
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 40b75ee..c44d687 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -498,7 +498,7 @@ xfs_sync_worker(

if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
/* dgc: errors ignored here */
- if (mp->m_super->s_frozen == SB_UNFROZEN &&
+ if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
xfs_log_need_covered(mp))
error = xfs_fs_log_dummy(mp);
else
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 329b06a..6468a2a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,7 +577,6 @@ xfs_trans_alloc(
xfs_mount_t *mp,
uint type)
{
- xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
return _xfs_trans_alloc(mp, type, KM_SLEEP);
}

@@ -589,6 +588,7 @@ _xfs_trans_alloc(
{
xfs_trans_t *tp;

+ WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
atomic_inc(&mp->m_active_trans);

tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
--
1.7.1

When mnt_want_write() starts to handle freezing, it will get full lock
semantics requiring proper lock ordering. So push the mnt_want_write() call
outside of i_mutex, as in other places.
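The resulting ordering, common to all the conversions in this series, looks
roughly like this (a generic sketch with a made-up handler name, not code taken
from any single patch):

static long example_setattr_ioctl(struct file *file)
{
	struct inode *inode = file->f_path.dentry->d_inode;
	int err;

	err = mnt_want_write_file(file);	/* may block on a frozen fs */
	if (err)
		return err;
	mutex_lock(&inode->i_mutex);

	/* ... modify the inode ... */

	mutex_unlock(&inode->i_mutex);
	mnt_drop_write_file(file);		/* drop freeze protection last */
	return err;
}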

CC: OGAWA Hirofumi <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/fat/file.c | 15 +++++++--------
1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index a71fe37..e007b8b 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -43,10 +43,10 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
if (err)
goto out;

- mutex_lock(&inode->i_mutex);
err = mnt_want_write_file(file);
if (err)
- goto out_unlock_inode;
+ goto out;
+ mutex_lock(&inode->i_mutex);

/*
* ATTR_VOLUME and ATTR_DIR cannot be changed; this also
@@ -73,14 +73,14 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
/* The root directory has no attributes */
if (inode->i_ino == MSDOS_ROOT_INO && attr != ATTR_DIR) {
err = -EINVAL;
- goto out_drop_write;
+ goto out_unlock_inode;
}

if (sbi->options.sys_immutable &&
((attr | oldattr) & ATTR_SYS) &&
!capable(CAP_LINUX_IMMUTABLE)) {
err = -EPERM;
- goto out_drop_write;
+ goto out_unlock_inode;
}

/*
@@ -90,12 +90,12 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
*/
err = security_inode_setattr(file->f_path.dentry, &ia);
if (err)
- goto out_drop_write;
+ goto out_unlock_inode;

/* This MUST be done before doing anything irreversible... */
err = fat_setattr(file->f_path.dentry, &ia);
if (err)
- goto out_drop_write;
+ goto out_unlock_inode;

fsnotify_change(file->f_path.dentry, ia.ia_valid);
if (sbi->options.sys_immutable) {
@@ -107,10 +107,9 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)

fat_save_attrs(inode, attr);
mark_inode_dirty(inode);
-out_drop_write:
- mnt_drop_write_file(file);
out_unlock_inode:
mutex_unlock(&inode->i_mutex);
+ mnt_drop_write_file(file);
out:
return err;
}
--
1.7.1

Move the freeze check from ntfs_file_aio_write_nolock() to
ntfs_file_aio_write() and use the new freeze protection.

CC: linux-ntfs-dev@lists.sourceforge.net
CC: Anton Altaparmakov <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ntfs/file.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index c587e2d..0503e65 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2084,7 +2084,6 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
if (err)
return err;
pos = *ppos;
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
current->backing_dev_info = mapping->backing_dev_info;
written = 0;
@@ -2117,6 +2116,7 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,

BUG_ON(iocb->ki_pos != pos);

+ sb_start_write(inode->i_sb);
mutex_lock(&inode->i_mutex);
ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
@@ -2125,6 +2125,7 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (err < 0)
ret = err;
}
+ sb_end_write(inode->i_sb);
return ret;
}

--
1.7.1

Currently, mnt_want_write() is sometimes called with i_mutex held and
sometimes without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (it essentially has trylock semantics), but when the
function starts to also handle frozen filesystems, it will get full lock
semantics and thus proper lock ordering has to be established. So move all
mnt_want_write() calls outside of i_mutex.

One non-trivial case needing conversion is kern_path_create() /
user_path_create(), which didn't include mnt_want_write() but now need to
because they acquire i_mutex. Because there are virtual filesystems which don't
bother with freeze / remount-ro protection, we actually provide both versions
of each function - one which calls mnt_want_write() and one which does not.
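Roughly, the new calling convention for such creation paths becomes (a
condensed sketch of the pattern used throughout the patch; error handling
abbreviated):

	dentry = user_path_create_thawed(dfd, filename, &path, 0);
	if (IS_ERR(dentry))
		return PTR_ERR(dentry);

	err = security_path_mknod(&path, dentry, mode, dev);
	if (!err)
		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);

	dput(dentry);
	mutex_unlock(&path.dentry->d_inode->i_mutex);
	mnt_drop_write(path.mnt);	/* write access taken inside *_create_thawed() */
	path_put(&path);

i.e. freeze / remount-ro protection is acquired inside the _thawed helper
(before i_mutex) and dropped only after i_mutex has been released.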

CC: ocfs2-devel@oss.oracle.com
CC: Mark Fasheh <[email protected]>
CC: Joel Becker <[email protected]>
CC: "David S. Miller" <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/namei.c | 115 +++++++++++++++++++++++++++--------------------
fs/ocfs2/refcounttree.c | 10 +---
include/linux/namei.h | 2 +
net/unix/af_unix.c | 13 ++----
4 files changed, 74 insertions(+), 66 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a780ea5..575d3a4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2383,7 +2383,9 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
return file;
}

-struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+static struct dentry *do_kern_path_create(int dfd, const char *pathname,
+ struct path *path, int is_dir,
+ int freeze_protect)
{
struct dentry *dentry = ERR_PTR(-EEXIST);
struct nameidata nd;
@@ -2401,6 +2403,14 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
nd.flags |= LOOKUP_CREATE | LOOKUP_EXCL;
nd.intent.open.flags = O_EXCL;

+ if (freeze_protect) {
+ error = mnt_want_write(nd.path.mnt);
+ if (error) {
+ dentry = ERR_PTR(error);
+ goto out;
+ }
+ }
+
/*
* Do the final lookup.
*/
@@ -2429,24 +2439,49 @@ eexist:
dentry = ERR_PTR(-EEXIST);
fail:
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+ if (freeze_protect)
+ mnt_drop_write(nd.path.mnt);
out:
path_put(&nd.path);
return dentry;
}
+
+struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+ return do_kern_path_create(dfd, pathname, path, is_dir, 0);
+}
EXPORT_SYMBOL(kern_path_create);

+struct dentry *kern_path_create_thawed(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+ return do_kern_path_create(dfd, pathname, path, is_dir, 1);
+}
+EXPORT_SYMBOL(kern_path_create_thawed);
+
struct dentry *user_path_create(int dfd, const char __user *pathname, struct path *path, int is_dir)
{
char *tmp = getname(pathname);
struct dentry *res;
if (IS_ERR(tmp))
return ERR_CAST(tmp);
- res = kern_path_create(dfd, tmp, path, is_dir);
+ res = do_kern_path_create(dfd, tmp, path, is_dir, 0);
putname(tmp);
return res;
}
EXPORT_SYMBOL(user_path_create);

+struct dentry *user_path_create_thawed(int dfd, const char __user *pathname, struct path *path, int is_dir)
+{
+ char *tmp = getname(pathname);
+ struct dentry *res;
+ if (IS_ERR(tmp))
+ return ERR_CAST(tmp);
+ res = do_kern_path_create(dfd, tmp, path, is_dir, 1);
+ putname(tmp);
+ return res;
+}
+EXPORT_SYMBOL(user_path_create_thawed);
+
int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
{
int error = may_create(dir, dentry);
@@ -2502,7 +2537,7 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
if (S_ISDIR(mode))
return -EPERM;

- dentry = user_path_create(dfd, filename, &path, 0);
+ dentry = user_path_create_thawed(dfd, filename, &path, 0);
if (IS_ERR(dentry))
return PTR_ERR(dentry);

@@ -2511,12 +2546,9 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
error = may_mknod(mode);
if (error)
goto out_dput;
- error = mnt_want_write(path.mnt);
- if (error)
- goto out_dput;
error = security_path_mknod(&path, dentry, mode, dev);
if (error)
- goto out_drop_write;
+ goto out_dput;
switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(path.dentry->d_inode,dentry,mode,NULL);
@@ -2529,11 +2561,10 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
break;
}
-out_drop_write:
- mnt_drop_write(path.mnt);
out_dput:
dput(dentry);
mutex_unlock(&path.dentry->d_inode->i_mutex);
+ mnt_drop_write(path.mnt);
path_put(&path);

return error;
@@ -2571,24 +2602,20 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
struct path path;
int error;

- dentry = user_path_create(dfd, pathname, &path, 1);
+ dentry = user_path_create_thawed(dfd, pathname, &path, 1);
if (IS_ERR(dentry))
return PTR_ERR(dentry);

if (!IS_POSIXACL(path.dentry->d_inode))
mode &= ~current_umask();
- error = mnt_want_write(path.mnt);
- if (error)
- goto out_dput;
error = security_path_mkdir(&path, dentry, mode);
if (error)
- goto out_drop_write;
+ goto out_dput;
error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
-out_drop_write:
- mnt_drop_write(path.mnt);
out_dput:
dput(dentry);
mutex_unlock(&path.dentry->d_inode->i_mutex);
+ mnt_drop_write(path.mnt);
path_put(&path);
return error;
}
@@ -2683,6 +2710,9 @@ static long do_rmdir(int dfd, const char __user *pathname)
}

nd.flags &= ~LOOKUP_PARENT;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto exit1;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
@@ -2693,19 +2723,15 @@ static long do_rmdir(int dfd, const char __user *pathname)
error = -ENOENT;
goto exit3;
}
- error = mnt_want_write(nd.path.mnt);
- if (error)
- goto exit3;
error = security_path_rmdir(&nd.path, dentry);
if (error)
- goto exit4;
+ goto exit3;
error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
-exit4:
- mnt_drop_write(nd.path.mnt);
exit3:
dput(dentry);
exit2:
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+ mnt_drop_write(nd.path.mnt);
exit1:
path_put(&nd.path);
putname(name);
@@ -2772,6 +2798,9 @@ static long do_unlinkat(int dfd, const char __user *pathname)
goto exit1;

nd.flags &= ~LOOKUP_PARENT;
+ error = mnt_want_write(nd.path.mnt);
+ if (error)
+ goto exit1;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
@@ -2784,21 +2813,17 @@ static long do_unlinkat(int dfd, const char __user *pathname)
if (!inode)
goto slashes;
ihold(inode);
- error = mnt_want_write(nd.path.mnt);
- if (error)
- goto exit2;
error = security_path_unlink(&nd.path, dentry);
if (error)
- goto exit3;
+ goto exit2;
error = vfs_unlink(nd.path.dentry->d_inode, dentry);
-exit3:
- mnt_drop_write(nd.path.mnt);
- exit2:
+exit2:
dput(dentry);
}
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
if (inode)
iput(inode); /* truncate the inode here */
+ mnt_drop_write(nd.path.mnt);
exit1:
path_put(&nd.path);
putname(name);
@@ -2858,23 +2883,19 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
if (IS_ERR(from))
return PTR_ERR(from);

- dentry = user_path_create(newdfd, newname, &path, 0);
+ dentry = user_path_create_thawed(newdfd, newname, &path, 0);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_putname;

- error = mnt_want_write(path.mnt);
- if (error)
- goto out_dput;
error = security_path_symlink(&path, dentry, from);
if (error)
- goto out_drop_write;
+ goto out_dput;
error = vfs_symlink(path.dentry->d_inode, dentry, from);
-out_drop_write:
- mnt_drop_write(path.mnt);
out_dput:
dput(dentry);
mutex_unlock(&path.dentry->d_inode->i_mutex);
+ mnt_drop_write(path.mnt);
path_put(&path);
out_putname:
putname(from);
@@ -2964,7 +2985,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
return error;

- new_dentry = user_path_create(newdfd, newname, &new_path, 0);
+ new_dentry = user_path_create_thawed(newdfd, newname, &new_path, 0);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out;
@@ -2972,18 +2993,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
error = -EXDEV;
if (old_path.mnt != new_path.mnt)
goto out_dput;
- error = mnt_want_write(new_path.mnt);
- if (error)
- goto out_dput;
error = security_path_link(old_path.dentry, &new_path, new_dentry);
if (error)
- goto out_drop_write;
+ goto out_dput;
error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
-out_drop_write:
- mnt_drop_write(new_path.mnt);
out_dput:
dput(new_dentry);
mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+ mnt_drop_write(new_path.mnt);
path_put(&new_path);
out:
path_put(&old_path);
@@ -3174,6 +3191,10 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
if (newnd.last_type != LAST_NORM)
goto exit2;

+ error = mnt_want_write(oldnd.path.mnt);
+ if (error)
+ goto exit2;
+
oldnd.flags &= ~LOOKUP_PARENT;
newnd.flags &= ~LOOKUP_PARENT;
newnd.flags |= LOOKUP_RENAME_TARGET;
@@ -3209,23 +3230,19 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
if (new_dentry == trap)
goto exit5;

- error = mnt_want_write(oldnd.path.mnt);
- if (error)
- goto exit5;
error = security_path_rename(&oldnd.path, old_dentry,
&newnd.path, new_dentry);
if (error)
- goto exit6;
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
-exit6:
- mnt_drop_write(oldnd.path.mnt);
exit5:
dput(new_dentry);
exit4:
dput(old_dentry);
exit3:
unlock_rename(new_dir, old_dir);
+ mnt_drop_write(oldnd.path.mnt);
exit2:
path_put(&newnd.path);
putname(to);
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index cf78233..a99b8e2 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4453,7 +4453,7 @@ int ocfs2_reflink_ioctl(struct inode *inode,
return error;
}

- new_dentry = user_path_create(AT_FDCWD, newname, &new_path, 0);
+ new_dentry = user_path_create_thawed(AT_FDCWD, newname, &new_path, 0);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry)) {
mlog_errno(error);
@@ -4466,19 +4466,13 @@ int ocfs2_reflink_ioctl(struct inode *inode,
goto out_dput;
}

- error = mnt_want_write(new_path.mnt);
- if (error) {
- mlog_errno(error);
- goto out_dput;
- }
-
error = ocfs2_vfs_reflink(old_path.dentry,
new_path.dentry->d_inode,
new_dentry, preserve);
- mnt_drop_write(new_path.mnt);
out_dput:
dput(new_dentry);
mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+ mnt_drop_write(new_path.mnt);
path_put(&new_path);
out:
path_put(&old_path);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ffc0213..432f6bb 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -77,7 +77,9 @@ extern int user_path_at_empty(int, const char __user *, unsigned, struct path *,
extern int kern_path(const char *, unsigned, struct path *);

extern struct dentry *kern_path_create(int, const char *, struct path *, int);
+extern struct dentry *kern_path_create_thawed(int, const char *, struct path *, int);
extern struct dentry *user_path_create(int, const char __user *, struct path *, int);
+extern struct dentry *user_path_create_thawed(int, const char __user *, struct path *, int);
extern int kern_path_parent(const char *, struct nameidata *);
extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
const char *, unsigned int, struct path *);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 85d3bb7..7888b09 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -856,7 +856,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
* Get the parent directory, calculate the hash for last
* component.
*/
- dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
+ dentry = kern_path_create_thawed(AT_FDCWD, sun_path, &path, 0);
err = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_mknod_parent;
@@ -866,19 +866,13 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
*/
mode = S_IFSOCK |
(SOCK_INODE(sock)->i_mode & ~current_umask());
- err = mnt_want_write(path.mnt);
- if (err)
- goto out_mknod_dput;
err = security_path_mknod(&path, dentry, mode, 0);
if (err)
- goto out_mknod_drop_write;
- err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
-out_mknod_drop_write:
- mnt_drop_write(path.mnt);
- if (err)
goto out_mknod_dput;
+ err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
mutex_unlock(&path.dentry->d_inode->i_mutex);
dput(path.dentry);
+ mnt_drop_write(path.mnt);
path.dentry = dentry;

addr->hash = UNIX_HASH_SIZE;
@@ -916,6 +910,7 @@ out:
out_mknod_dput:
dput(dentry);
mutex_unlock(&path.dentry->d_inode->i_mutex);
+ mnt_drop_write(path.mnt);
path_put(&path);
out_mknod_parent:
if (err == -EEXIST)
--
1.7.1

[PATCH 18/19] fs: Remove old freezing mechanism
Posted by Jan Kara on March 05, 2012 05:21PM

Now that all users are converted, we can remove functions, variables, and
constants defined by the old freezing mechanism.

Signed-off-by: Jan Kara <[email protected]>
---
fs/super.c | 1 -
include/linux/fs.h | 8 --------
2 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 8b0b40c..0830afb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -209,7 +209,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
mutex_init(&s->s_dquot.dqio_mutex);
mutex_init(&s->s_dquot.dqonoff_mutex);
init_rwsem(&s->s_dquot.dqptr_sem);
- init_waitqueue_head(&s->s_wait_unfrozen);
s->s_maxbytes = MAX_NON_LFS;
s->s_op = &default_op;
s->s_time_gran = 1000000000;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 37667f3..fcb4135 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1483,8 +1483,6 @@ struct super_block {
struct hlist_node s_instances;
struct quota_info s_dquot; /* Diskquota specific options */

- int s_frozen;
- wait_queue_head_t s_wait_unfrozen;
struct sb_writers s_writers;

char s_id[32]; /* Informational name */
@@ -1537,12 +1535,6 @@ extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
extern struct timespec current_fs_time(struct super_block *sb);

/*
- * Snapshotting support.
- */
-/* Will go away when all users are converted */
-#define vfs_check_frozen(sb, level) do { } while (0)
-
-/*
* until VFS tracks user namespaces for inodes, just make all files
* belong to init_user_ns
*/
--
1.7.1

When mnt_want_write() starts to handle freezing, it will get full lock
semantics requiring proper lock ordering. So push the mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason <[email protected]>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <[email protected]>
---
fs/btrfs/ioctl.c | 23 +++++++++++------------
1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
if (!inode_owner_or_capable(inode))
return -EACCES;

+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+
mutex_lock(&inode->i_mutex);

ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
}
}

- ret = mnt_want_write_file(file);
- if (ret)
- goto out_unlock;
-
if (flags & FS_SYNC_FL)
ip->flags |= BTRFS_INODE_SYNC;
else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
inode->i_flags = i_oldflags;
}

- mnt_drop_write_file(file);
out_unlock:
mutex_unlock(&inode->i_mutex);
+ mnt_drop_write_file(file);
return ret;
}

@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;

+ error = mnt_want_write(parent->mnt);
+ if (error)
+ return error;
+
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);

dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
if (dentry->d_inode)
goto out_dput;

- error = mnt_want_write(parent->mnt);
- if (error)
- goto out_dput;
-
error = btrfs_may_create(dir, dentry);
if (error)
- goto out_drop_write;
+ goto out_dput;

down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);

@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
fsnotify_mkdir(dir, dentry);
out_up_read:
up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
- mnt_drop_write(parent->mnt);
out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&dir->i_mutex);
+ mnt_drop_write(parent->mnt);
return error;
}

--
1.7.1

Convert the check in fuse_file_aio_write() to use the new freeze protection.

CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/fuse/file.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eade72e..05f661f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -943,8 +943,8 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (err)
return err;

+ sb_start_write(inode->i_sb);
mutex_lock(&inode->i_mutex);
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
@@ -970,6 +970,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
out:
current->backing_dev_info = NULL;
mutex_unlock(&inode->i_mutex);
+ sb_end_write(inode->i_sb);

return written ? written : err;
}
--
1.7.1

It is enough to update gfs2_page_mkwrite() to use the new freeze protection.
The rest is handled by the generic code.

CC: cluster-devel@redhat.com
CC: Steven Whitehouse <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/gfs2/file.c | 15 +++------------
1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 1f03531..806b7a4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -369,11 +369,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size;
int ret;

- /* Wait if fs is frozen. This is racy so we check again later on
- * and retry if the fs has been frozen after the page lock has
- * been acquired
- */
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ sb_start_pagefault(inode->i_sb);

/* Update file times before taking page lock */
file_update_time(vma->vm_file);
@@ -457,14 +453,9 @@ out:
gfs2_holder_uninit(&gh);
if (ret == 0) {
set_page_dirty(page);
- /* This check must be post dropping of transaction lock */
- if (inode->i_sb->s_frozen == SB_UNFROZEN) {
- wait_on_page_writeback(page);
- } else {
- ret = -EAGAIN;
- unlock_page(page);
- }
+ wait_on_page_writeback(page);
}
+ sb_end_pagefault(inode->i_sb);
return block_page_mkwrite_return(ret);
}

--
1.7.1

There are several entry points which dirty pages in a filesystem: mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write()),
truncate, and fallocate (the last two can dirty the last partial page and are
handled inside each filesystem separately). Protect these places with
sb_start_write() and sb_end_write().

->page_mkwrite() calls are particularly complex since they are made with
mmap_sem held and thus we cannot use the standard sb_start_write() due to lock
ordering constraints. We solve the problem with a special freeze protection,
sb_start_pagefault(), which ranks below mmap_sem.
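Schematically, the two kinds of write paths end up bracketed as follows (a
summary of the conversions below, not code taken from a single hunk):

	/* Syscall-level writers: buffered write, splice, truncate, fallocate */
	sb_start_write(inode->i_sb);
	mutex_lock(&inode->i_mutex);
	/* ... dirty pages / modify the inode ... */
	mutex_unlock(&inode->i_mutex);
	sb_end_write(inode->i_sb);

	/* Page faults: ->page_mkwrite() runs with mmap_sem already held, so it
	 * uses the lower-ranking pagefault level instead */
	sb_start_pagefault(inode->i_sb);
	/* ... mark the page dirty, wait for writeback ... */
	sb_end_pagefault(inode->i_sb);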

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 22 ++++------------------
fs/open.c | 7 ++++++-
fs/splice.c | 3 +++
mm/filemap.c | 12 ++++++++++--
mm/filemap_xip.c | 5 +++--
5 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 5294a33..89ed4af 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2288,8 +2288,8 @@ EXPORT_SYMBOL(block_commit_write);
* beyond EOF, then the page is guaranteed safe against truncation until we
* unlock the page.
*
- * Direct callers of this function should call vfs_check_frozen() so that page
- * fault does not busyloop until the fs is thawed.
+ * Direct callers of this function should protect against filesystem freezing
+ * using sb_start_write() - sb_end_write() functions.
*/
int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block)
@@ -2327,18 +2327,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,

if (unlikely(ret < 0))
goto out_unlock;
- /*
- * Freezing in progress? We check after the page is marked dirty and
- * with page lock held so if the test here fails, we are sure freezing
- * code will wait during syncing until the page fault is done - at that
- * point page will be dirty and unlocked so freezing code will write it
- * and writeprotect it again.
- */
set_page_dirty(page);
- if (inode->i_sb->s_frozen != SB_UNFROZEN) {
- ret = -EAGAIN;
- goto out_unlock;
- }
wait_on_page_writeback(page);
return 0;
out_unlock:
@@ -2353,12 +2342,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
int ret;
struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;

- /*
- * This check is racy but catches the common case. The check in
- * __block_page_mkwrite() is reliable.
- */
- vfs_check_frozen(sb, SB_FREEZE_WRITE);
+ sb_start_pagefault(sb);
ret = __block_page_mkwrite(vma, vmf, get_block);
+ sb_end_pagefault(sb);
return block_page_mkwrite_return(ret);
}
EXPORT_SYMBOL(block_page_mkwrite);
diff --git a/fs/open.c b/fs/open.c
index 456e415..444ffbb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -164,11 +164,13 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
if (IS_APPEND(inode))
goto out_putf;

+ sb_start_write(inode->i_sb);
error = locks_verify_truncate(inode, file, length);
if (!error)
error = security_path_truncate(&file->f_path);
if (!error)
error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
+ sb_end_write(inode->i_sb);
out_putf:
fput(file);
out:
@@ -266,7 +268,10 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (!file->f_op->fallocate)
return -EOPNOTSUPP;

- return file->f_op->fallocate(file, mode, offset, len);
+ sb_start_write(inode->i_sb);
+ ret = file->f_op->fallocate(file, mode, offset, len);
+ sb_end_write(inode->i_sb);
+ return ret;
}

SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
diff --git a/fs/splice.c b/fs/splice.c
index 1ec0493..401297d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -992,6 +992,8 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
};
ssize_t ret;

+ sb_start_write(inode->i_sb);
+
pipe_lock(pipe);

splice_from_pipe_begin(&sd);
@@ -1028,6 +1030,7 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
*ppos += ret;
balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
}
+ sb_end_write(inode->i_sb);

return ret;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index b865c0b..8ff3111 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1765,6 +1765,7 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
int ret = VM_FAULT_LOCKED;

+ sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
lock_page(page);
if ((page->mapping != inode->i_mapping) ||
@@ -1773,7 +1774,14 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
ret = VM_FAULT_NOPAGE;
goto out;
}
+ /*
+ * We mark the page dirty already here so that when freeze is in
+ * progress, we are guaranteed that writeback during freezing will
+ * see the dirty page and writeprotect it again.
+ */
+ set_page_dirty(page);
out:
+ sb_end_pagefault(inode->i_sb);
return ret;
}
EXPORT_SYMBOL(filemap_page_mkwrite);
@@ -2537,8 +2545,6 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
count = ocount;
pos = *ppos;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
-
/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
written = 0;
@@ -2635,6 +2641,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,

BUG_ON(iocb->ki_pos != pos);

+ sb_start_write(inode->i_sb);
mutex_lock(&inode->i_mutex);
blk_start_plug(&plug);
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
@@ -2648,6 +2655,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
ret = err;
}
blk_finish_plug(&plug);
+ sb_end_write(inode->i_sb);
return ret;
}
EXPORT_SYMBOL(generic_file_aio_write);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 591dba6..b051f0d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -402,6 +402,8 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
loff_t pos;
ssize_t ret;

+ sb_start_write(inode->i_sb);
+
mutex_lock(&inode->i_mutex);

if (!access_ok(VERIFY_READ, buf, len)) {
@@ -412,8 +414,6 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
pos = *ppos;
count = len;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
-
/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;

@@ -435,6 +435,7 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
current->backing_dev_info = NULL;
out_up:
mutex_unlock(&inode->i_mutex);
+ sb_end_write(inode->i_sb);
return ret;
}
EXPORT_SYMBOL_GPL(xip_file_write);
--
1.7.1

When mnt_want_write() starts to handle freezing, it will get full lock
semantics requiring proper lock ordering. So push the mnt_want_write() call
consistently outside of i_mutex.

CC: linux-nfs@vger.kernel.org
CC: "J. Bruce Fields" <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/nfsd/nfs4recover.c | 9 +++--
fs/nfsd/nfsfh.c | 1 +
fs/nfsd/nfsproc.c | 9 ++++-
fs/nfsd/vfs.c | 79 ++++++++++++++++++++++---------------------
fs/nfsd/vfs.h | 11 +++++-
include/linux/nfsd/nfsfh.h | 1 +
6 files changed, 64 insertions(+), 46 deletions(-)

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 0b3e875..8e599a6 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -135,6 +135,10 @@ void nfsd4_create_clid_dir(struct nfs4_client *clp)
if (status < 0)
return;

+ status = mnt_want_write_file(rec_file);
+ if (status)
+ return;
+
dir = rec_file->f_path.dentry;
/* lock the parent */
mutex_lock(&dir->d_inode->i_mutex);
@@ -154,11 +158,7 @@ void nfsd4_create_clid_dir(struct nfs4_client *clp)
* as well be forgiving and just succeed silently.
*/
goto out_put;
- status = mnt_want_write_file(rec_file);
- if (status)
- goto out_put;
status = vfs_mkdir(dir->d_inode, dentry, S_IRWXU);
- mnt_drop_write_file(rec_file);
out_put:
dput(dentry);
out_unlock:
@@ -170,6 +170,7 @@ out_unlock:
" (err %d); please check that %s exists"
" and is writeable", status,
user_recovery_dirname);
+ mnt_drop_write_file(rec_file);
nfs4_reset_creds(original_cred);
}

diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 68454e7..8b93353 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -635,6 +635,7 @@ fh_put(struct svc_fh *fhp)
fhp->fh_post_saved = 0;
#endif
}
+ fh_drop_write(fhp);
if (exp) {
cache_put(&exp->h, &svc_export_cache);
fhp->fh_export = NULL;
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index e15dc45..aad6d45 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -196,6 +196,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
struct dentry *dchild;
int type, mode;
__be32 nfserr;
+ int hosterr;
dev_t rdev = 0, wanted = new_decode_dev(attr->ia_size);

dprintk("nfsd: CREATE %s %.*s\n",
@@ -214,6 +215,12 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
nfserr = nfserr_exist;
if (isdotent(argp->name, argp->len))
goto done;
+ hosterr = fh_want_write(dirfhp);
+ if (hosterr) {
+ nfserr = nfserrno(hosterr);
+ goto done;
+ }
+
fh_lock_nested(dirfhp, I_MUTEX_PARENT);
dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
if (IS_ERR(dchild)) {
@@ -330,7 +337,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
out_unlock:
/* We don't really need to unlock, as fh_put does it. */
fh_unlock(dirfhp);
-
+ fh_drop_write(dirfhp);
done:
fh_put(dirfhp);
return nfsd_return_dirop(nfserr, resp);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index edf6d3e..af49c63 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1268,6 +1268,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
* If it has, the parent directory should already be locked.
*/
if (!resfhp->fh_dentry) {
+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
/* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
fh_lock_nested(fhp, I_MUTEX_PARENT);
dchild = lookup_one_len(fname, dentry, flen);
@@ -1311,14 +1315,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
goto out;
}

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
-
/*
* Get the dir op function pointer.
*/
err = 0;
+ host_err = 0;
switch (type) {
case S_IFREG:
host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
@@ -1335,10 +1336,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
break;
}
- if (host_err < 0) {
- fh_drop_write(fhp);
+ if (host_err < 0)
goto out_nfserr;
- }

err = nfsd_create_setattr(rqstp, resfhp, iap);

@@ -1350,7 +1349,6 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
err2 = nfserrno(commit_metadata(fhp));
if (err2)
err = err2;
- fh_drop_write(fhp);
/*
* Update the file handle to get the new inode info.
*/
@@ -1409,6 +1407,11 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
err = nfserr_notdir;
if (!dirp->i_op->lookup)
goto out;
+
+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
fh_lock_nested(fhp, I_MUTEX_PARENT);

/*
@@ -1441,9 +1444,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
v_atime = verifier[1]&0x7fffffff;
}

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
if (dchild->d_inode) {
err = 0;

@@ -1514,7 +1514,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (!err)
err = nfserrno(commit_metadata(fhp));

- fh_drop_write(fhp);
/*
* Update the filehandle to get the new inode info.
*/
@@ -1525,6 +1524,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
fh_unlock(fhp);
if (dchild && !IS_ERR(dchild))
dput(dchild);
+ fh_drop_write(fhp);
return err;

out_nfserr:
@@ -1604,6 +1604,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
err = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
if (err)
goto out;
+
+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
fh_lock(fhp);
dentry = fhp->fh_dentry;
dnew = lookup_one_len(fname, dentry, flen);
@@ -1611,10 +1616,6 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
if (IS_ERR(dnew))
goto out_nfserr;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_nfserr;
-
if (unlikely(path[plen] != 0)) {
char *path_alloced = kmalloc(plen+1, GFP_KERNEL);
if (path_alloced == NULL)
@@ -1674,6 +1675,12 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
if (isdotent(name, len))
goto out;

+ host_err = fh_want_write(tfhp);
+ if (host_err) {
+ err = nfserrno(host_err);
+ goto out;
+ }
+
fh_lock_nested(ffhp, I_MUTEX_PARENT);
ddir = ffhp->fh_dentry;
dirp = ddir->d_inode;
@@ -1685,18 +1692,13 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,

dold = tfhp->fh_dentry;

- host_err = fh_want_write(tfhp);
- if (host_err) {
- err = nfserrno(host_err);
- goto out_dput;
- }
err = nfserr_noent;
if (!dold->d_inode)
- goto out_drop_write;
+ goto out_dput;
host_err = nfsd_break_lease(dold->d_inode);
if (host_err) {
err = nfserrno(host_err);
- goto out_drop_write;
+ goto out_dput;
}
host_err = vfs_link(dold, dirp, dnew);
if (!host_err) {
@@ -1709,12 +1711,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
else
err = nfserrno(host_err);
}
-out_drop_write:
- fh_drop_write(tfhp);
out_dput:
dput(dnew);
out_unlock:
fh_unlock(ffhp);
+ fh_drop_write(tfhp);
out:
return err;

@@ -1757,6 +1758,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
if (!flen || isdotent(fname, flen) || !tlen || isdotent(tname, tlen))
goto out;

+ host_err = fh_want_write(ffhp);
+ if (host_err) {
+ err = nfserrno(host_err);
+ goto out;
+ }
+
/* cannot use fh_lock as we need deadlock protective ordering
* so do it by hand */
trap = lock_rename(tdentry, fdentry);
@@ -1787,17 +1794,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
host_err = -EXDEV;
if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
goto out_dput_new;
- host_err = fh_want_write(ffhp);
- if (host_err)
- goto out_dput_new;

host_err = nfsd_break_lease(odentry->d_inode);
if (host_err)
- goto out_drop_write;
+ goto out_dput_new;
if (ndentry->d_inode) {
host_err = nfsd_break_lease(ndentry->d_inode);
if (host_err)
- goto out_drop_write;
+ goto out_dput_new;
}
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err) {
@@ -1805,8 +1809,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
if (!host_err)
host_err = commit_metadata(ffhp);
}
-out_drop_write:
- fh_drop_write(ffhp);
out_dput_new:
dput(ndentry);
out_dput_old:
@@ -1822,6 +1824,7 @@ out_drop_write:
fill_post_wcc(tfhp);
unlock_rename(tdentry, fdentry);
ffhp->fh_locked = tfhp->fh_locked = 0;
+ fh_drop_write(ffhp);

out:
return err;
@@ -1847,6 +1850,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (err)
goto out;

+ host_err = fh_want_write(fhp);
+ if (host_err)
+ goto out_nfserr;
+
fh_lock_nested(fhp, I_MUTEX_PARENT);
dentry = fhp->fh_dentry;
dirp = dentry->d_inode;
@@ -1865,21 +1872,15 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (!type)
type = rdentry->d_inode->i_mode & S_IFMT;

- host_err = fh_want_write(fhp);
- if (host_err)
- goto out_put;
-
host_err = nfsd_break_lease(rdentry->d_inode);
if (host_err)
- goto out_drop_write;
+ goto out_put;
if (type != S_IFDIR)
host_err = vfs_unlink(dirp, rdentry);
else
host_err = vfs_rmdir(dirp, rdentry);
if (!host_err)
host_err = commit_metadata(fhp);
-out_drop_write:
- fh_drop_write(fhp);
out_put:
dput(rdentry);

diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 1dcd238..d5a50b4 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -108,12 +108,19 @@ int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);

static inline int fh_want_write(struct svc_fh *fh)
{
- return mnt_want_write(fh->fh_export->ex_path.mnt);
+ int ret = mnt_want_write(fh->fh_export->ex_path.mnt);
+
+ if (!ret)
+ fh->fh_want_write = 1;
+ return ret;
}

static inline void fh_drop_write(struct svc_fh *fh)
{
- mnt_drop_write(fh->fh_export->ex_path.mnt);
+ if (fh->fh_want_write) {
+ fh->fh_want_write = 0;
+ mnt_drop_write(fh->fh_export->ex_path.mnt);
+ }
}

#endif /* LINUX_NFSD_VFS_H */
diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
index ce4743a..fa63048 100644
--- a/include/linux/nfsd/nfsfh.h
+++ b/include/linux/nfsd/nfsfh.h
@@ -143,6 +143,7 @@ typedef struct svc_fh {
int fh_maxsize; /* max size for fh_handle */

unsigned char fh_locked; /* inode locked by us */
+ unsigned char fh_want_write; /* remount protection taken */

#ifdef CONFIG_NFSD_V3
unsigned char fh_post_saved; /* post-op attrs saved */
--
1.7.1

Make the default vm_ops provide a ->page_mkwrite handler. Currently it only
updates the file's modification time and returns the page locked, but later it
will also handle filesystem freezing.

Signed-off-by: Jan Kara <[email protected]>
---
include/linux/mm.h | 1 +
mm/filemap.c | 20 ++++++++++++++++++++
mm/filemap_xip.c | 1 +
3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17b27cd..074c075 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1419,6 +1419,7 @@ extern void truncate_inode_pages_range(struct address_space *,

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
int write_one_page(struct page *page, int wait);
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..b865c0b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1759,8 +1759,28 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ int ret = VM_FAULT_LOCKED;
+
+ file_update_time(vma->vm_file);
+ lock_page(page);
+ if ((page->mapping != inode->i_mapping) ||
+ (page_offset(page) > i_size_read(inode))) {
+ unlock_page(page);
+ ret = VM_FAULT_NOPAGE;
+ goto out;
+ }
+out:
+ return ret;
+}
+EXPORT_SYMBOL(filemap_page_mkwrite);
+
const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .page_mkwrite = filemap_page_mkwrite,
};

/* This is used for a general mmap of a disk file */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index a4eb311..591dba6 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -304,6 +304,7 @@ out:

static const struct vm_operations_struct xip_file_vm_ops = {
.fault = xip_file_fault,
+ .page_mkwrite = filemap_page_mkwrite,
};

int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
--
1.7.1

vfs_check_frozen() tests are racy since the filesystem can be frozen just
after the test is performed. Thus in write paths we can end up marking some
pages or inodes dirty even though the file system is already frozen. This
creates problems with the flusher thread hanging on the frozen filesystem.
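
For illustration (a minimal sketch of the race, not code lifted from any
particular filesystem), the pattern this series gets rid of is:

	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);	/* not frozen here... */
	/* ...but the filesystem can be frozen at exactly this point... */
	mark_inode_dirty(inode);	/* ...so we dirty an already frozen fs */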

Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting the page dirty and then verifying
s_frozen. This guaranteed that either the freezing code sees the faulted page,
writes it, and writeprotects it again, or we see s_frozen set and bail out of
the page fault. This protects against a page being marked writeable while
filesystem freezing is running, but it has the unpleasant artefact of leaving
dirty (although unmodified and writeprotected) pages on the frozen filesystem,
resulting in the same flusher thread problems as the first issue.

This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are two such semaphores for lock ranking reasons - one for
page fault handlers (->page_mkwrite) and one for all other writers. Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold the reader side of the semaphore. Code freezing the
filesystem takes the writer side.

However, we don't really want to bounce cachelines of the semaphores between
CPUs for each write that happens. So we implement the reader side of the
semaphore as a per-cpu counter, and the writer side is implemented using the
s_writers.frozen superblock field.
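
For illustration, the intended usage looks roughly like this (a sketch only,
not code taken from the patch; the helpers are the ones introduced below):

	sb_start_write(sb);	/* blocks while s_writers.frozen >= SB_FREEZE_WRITE */
	/* ... dirty pages and inodes, start transactions ... */
	sb_end_write(sb);	/* wakes up a possibly waiting freeze_super() */

	sb_start_pagefault(sb);	/* same, but ranks below mmap_sem */
	/* ... ->page_mkwrite work dirtying the faulted page ... */
	sb_end_pagefault(sb);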

Signed-off-by: Jan Kara <[email protected]>
---
fs/super.c | 262 +++++++++++++++++++++++++++++++++++++++++++++++-----
include/linux/fs.h | 42 +++++++--
2 files changed, 274 insertions(+), 30 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 6277ec6..8b0b40c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -32,12 +32,15 @@
#include <linux/backing-dev.h>
#include <linux/rculist_bl.h>
#include <linux/cleancache.h>
+#include <linux/lockdep.h>
#include "internal.h"


LIST_HEAD(super_blocks);
DEFINE_SPINLOCK(sb_lock);

+static struct lock_class_key sb_writers_key[2];
+
/*
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
@@ -101,6 +104,33 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
return total_objects;
}

+static int init_sb_writers(struct super_block *s)
+{
+ int err;
+
+ err = percpu_counter_init(&s->s_writers.counter[0], 0);
+ if (err < 0)
+ return err;
+ err = percpu_counter_init(&s->s_writers.counter[1], 0);
+ if (err < 0) {
+ percpu_counter_destroy(&s->s_writers.counter[0]);
+ return err;
+ }
+ init_waitqueue_head(&s->s_writers.wait);
+ init_waitqueue_head(&s->s_writers.wait_unfrozen);
+ lockdep_init_map(&s->s_writers.lock_map[0], "sb_writers",
+ &sb_writers_key[0], 0);
+ lockdep_init_map(&s->s_writers.lock_map[1], "sb_pagefaults",
+ &sb_writers_key[1], 0);
+ return 0;
+}
+
+static void destroy_sb_writers(struct super_block *s)
+{
+ percpu_counter_destroy(&s->s_writers.counter[0]);
+ percpu_counter_destroy(&s->s_writers.counter[1]);
+}
+
/**
* alloc_super - create new superblock
* @type: filesystem type superblock should belong to
@@ -115,18 +145,19 @@ static struct super_block *alloc_super(struct file_system_type *type)

if (s) {
if (security_sb_alloc(s)) {
+ /*
+ * We cannot call security_sb_free() without
+ * security_sb_alloc() succeeding. So bail out manually
+ */
kfree(s);
s = NULL;
goto out;
}
#ifdef CONFIG_SMP
s->s_files = alloc_percpu(struct list_head);
- if (!s->s_files) {
- security_sb_free(s);
- kfree(s);
- s = NULL;
- goto out;
- } else {
+ if (!s->s_files)
+ goto err_out;
+ else {
int i;

for_each_possible_cpu(i)
@@ -135,6 +166,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
#else
INIT_LIST_HEAD(&s->s_files);
#endif
+ if (init_sb_writers(s))
+ goto err_out;
+
s->s_bdi = &default_backing_dev_info;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
@@ -187,6 +221,16 @@ static struct super_block *alloc_super(struct file_system_type *type)
}
out:
return s;
+err_out:
+ security_sb_free(s);
+#ifdef CONFIG_SMP
+ if (s->s_files)
+ free_percpu(s->s_files);
+#endif
+ destroy_sb_writers(s);
+ kfree(s);
+ s = NULL;
+ goto out;
}

/**
@@ -200,6 +244,7 @@ static inline void destroy_super(struct super_block *s)
#ifdef CONFIG_SMP
free_percpu(s->s_files);
#endif
+ destroy_sb_writers(s);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
@@ -646,10 +691,11 @@ struct super_block *get_super_thawed(struct block_device *bdev)
{
while (1) {
struct super_block *s = get_super(bdev);
- if (!s || s->s_frozen == SB_UNFROZEN)
+ if (!s || s->s_writers.frozen == SB_UNFROZEN)
return s;
up_read(&s->s_umount);
- vfs_check_frozen(s, SB_FREEZE_WRITE);
+ wait_event(s->s_writers.wait_unfrozen,
+ s->s_writers.frozen == SB_UNFROZEN);
put_super(s);
}
}
@@ -727,7 +773,7 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
int retval;
int remount_ro;

- if (sb->s_frozen != SB_UNFROZEN)
+ if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;

#ifdef CONFIG_BLOCK
@@ -1162,6 +1208,164 @@ out:
return ERR_PTR(error);
}

+static void __sb_end_write(struct super_block *sb, int level)
+{
+ percpu_counter_dec(&sb->s_writers.counter[level-1]);
+ /*
+ * Make sure s_writers are updated before we wake up waiters in
+ * freeze_super().
+ */
+ smp_mb();
+ if (waitqueue_active(&sb->s_writers.wait))
+ wake_up(&sb->s_writers.wait);
+ rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
+}
+
+/**
+ * sb_end_write - drop write access to a superblock
+ * @sb: the super we wrote to
+ *
+ * Decrement number of writers to the filesystem. Wake up possible waiters
+ * wanting to freeze the filesystem.
+ */
+void sb_end_write(struct super_block *sb)
+{
+ __sb_end_write(sb, SB_FREEZE_WRITE);
+}
+EXPORT_SYMBOL(sb_end_write);
+
+/**
+ * sb_end_pagefault - drop write access to a superblock from a page fault
+ * @sb: the super we wrote to
+ *
+ * Decrement number of processes handling write page fault to the filesystem.
+ * Wake up possible waiters wanting to freeze the filesystem.
+ */
+void sb_end_pagefault(struct super_block *sb)
+{
+ __sb_end_write(sb, SB_FREEZE_PAGEFAULT);
+}
+EXPORT_SYMBOL(sb_end_pagefault);
+
+static int __sb_start_write(struct super_block *sb, int level, bool wait)
+{
+retry:
+ if (wait) {
+ wait_event(sb->s_writers.wait_unfrozen,
+ sb->s_writers.frozen < level);
+ } else if (sb->s_writers.frozen >= level)
+ return 0;
+
+ rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, !wait,
+ _RET_IP_);
+ percpu_counter_inc(&sb->s_writers.counter[level-1]);
+ /*
+ * Make sure counter is updated before we check for frozen.
+ * freeze_super() first sets frozen and then checks the counter.
+ */
+ smp_mb();
+ if (sb->s_writers.frozen >= level) {
+ __sb_end_write(sb, level);
+ goto retry;
+ }
+ return 1;
+}
+
+/**
+ * sb_start_write - get write access to a superblock
+ * @sb: the super we write to
+ *
+ * When a process wants to write data or metadata to a file system (i.e. dirty
+ * a page or an inode), it should embed the operation in a sb_start_write() -
+ * sb_end_write() pair to get exclusion against file system freezing. This
+ * function increments number of writers preventing freezing. If the file
+ * system is already frozen, the function waits until the file system is
+ * thawed.
+ *
+ * Since freeze protection behaves as a lock, users have to preserve
+ * ordering of freeze protection and other filesystem locks. Generally,
+ * freeze protection should be the outermost lock. In particular, we have:
+ *
+ * sb_start_write
+ * -> i_mutex (write path, truncate, directory ops, ...)
+ * -> s_umount (freeze_super, thaw_super)
+ */
+void sb_start_write(struct super_block *sb)
+{
+ __sb_start_write(sb, SB_FREEZE_WRITE, true);
+}
+EXPORT_SYMBOL(sb_start_write);
+
+int sb_start_write_trylock(struct super_block *sb)
+{
+ return __sb_start_write(sb, SB_FREEZE_WRITE, false);
+}
+EXPORT_SYMBOL(sb_start_write_trylock);
+
+/**
+ * sb_start_pagefault - get write access to a superblock from a page fault
+ * @sb: the super we write to
+ *
+ * When a process starts handling write page fault, it should embed the
+ * operation into sb_start_pagefault() - sb_end_pagefault() pair to get
+ * exclusion against file system freezing. This is needed since the page fault
+ * is going to dirty a page. This function increments number of running page
+ * faults preventing freezing. If the file system is already frozen, the
+ * function waits until the file system is thawed.
+ *
+ * Since page fault freeze protection behaves as a lock, users have to preserve
+ * ordering of freeze protection and other filesystem locks. It is advised to
+ * put sb_start_pagefault() close to mmap_sem in lock ordering. Page fault
+ * handling code implies lock dependency:
+ *
+ * mmap_sem
+ * -> sb_start_pagefault
+ */
+void sb_start_pagefault(struct super_block *sb)
+{
+ __sb_start_write(sb, SB_FREEZE_PAGEFAULT, true);
+}
+EXPORT_SYMBOL(sb_start_pagefault);
+
+/**
+ * sb_wait_write - wait until all writers to given file system finish
+ * @sb: the super for which we wait
+ * @level: type of writers we wait for (normal vs page fault)
+ *
+ * This function waits until there are no writers of given type to given file
+ * system. Caller of this function should make sure there can be no new writers
+ * of type @level before calling this function. Otherwise this function can
+ * livelock.
+ */
+static void sb_wait_write(struct super_block *sb, int level)
+{
+ s64 writers;
+
+ /*
+ * We just cycle through lockdep here so that it does not complain
+ * about returning to userspace with the lock held
+ */
+ rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
+ rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
+
+ do {
+ DEFINE_WAIT(wait);
+
+ /*
+ * We use a barrier in prepare_to_wait() to separate setting
+ * of frozen and checking of the counter
+ */
+ prepare_to_wait(&sb->s_writers.wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
+ if (writers)
+ schedule();
+
+ finish_wait(&sb->s_writers.wait, &wait);
+ } while (writers);
+}
+
/**
* freeze_super - lock the filesystem and force it into a consistent state
* @sb: the super to lock
@@ -1176,7 +1380,7 @@ int freeze_super(struct super_block *sb)

atomic_inc(&sb->s_active);
down_write(&sb->s_umount);
- if (sb->s_frozen) {
+ if (sb->s_writers.frozen != SB_UNFROZEN) {
deactivate_locked_super(sb);
return -EBUSY;
}
@@ -1187,33 +1391,48 @@ int freeze_super(struct super_block *sb)
}

if (sb->s_flags & MS_RDONLY) {
- sb->s_frozen = SB_FREEZE_TRANS;
- smp_wmb();
+ /* Nothing to do really... */
+ sb->s_writers.frozen = SB_FREEZE_COMPLETE;
up_write(&sb->s_umount);
return 0;
}

- sb->s_frozen = SB_FREEZE_WRITE;
+ /* From now on, no new normal writers can start */
+ sb->s_writers.frozen = SB_FREEZE_WRITE;
smp_wmb();

- sync_filesystem(sb);
+ /* Release s_umount to preserve sb_start_write -> s_umount ordering */
+ up_write(&sb->s_umount);
+
+ sb_wait_write(sb, SB_FREEZE_WRITE);

- sb->s_frozen = SB_FREEZE_TRANS;
+ /* Now we go and block page faults... */
+ down_write(&sb->s_umount);
+ sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
smp_wmb();

- sync_blockdev(sb->s_bdev);
+ sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
+
+ /* All writers are done so after syncing there won't be dirty data */
+ sync_filesystem(sb);
+
if (sb->s_op->freeze_fs) {
ret = sb->s_op->freeze_fs(sb);
if (ret) {
printk(KERN_ERR
"VFS:Filesystem freeze failed\n");
- sb->s_frozen = SB_UNFROZEN;
+ sb->s_writers.frozen = SB_UNFROZEN;
smp_wmb();
- wake_up(&sb->s_wait_unfrozen);
+ wake_up(&sb->s_writers.wait_unfrozen);
deactivate_locked_super(sb);
return ret;
}
}
+ /*
+ * This is just for debugging purposes so that fs can warn if it
+ * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
+ */
+ sb->s_writers.frozen = SB_FREEZE_COMPLETE;
up_write(&sb->s_umount);
return 0;
}
@@ -1230,7 +1449,7 @@ int thaw_super(struct super_block *sb)
int error;

down_write(&sb->s_umount);
- if (sb->s_frozen == SB_UNFROZEN) {
+ if (sb->s_writers.frozen == SB_UNFROZEN) {
up_write(&sb->s_umount);
return -EINVAL;
}
@@ -1243,16 +1462,15 @@ int thaw_super(struct super_block *sb)
if (error) {
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
- sb->s_frozen = SB_FREEZE_TRANS;
up_write(&sb->s_umount);
return error;
}
}

out:
- sb->s_frozen = SB_UNFROZEN;
+ sb->s_writers.frozen = SB_UNFROZEN;
smp_wmb();
- wake_up(&sb->s_wait_unfrozen);
+ wake_up(&sb->s_writers.wait_unfrozen);
deactivate_locked_super(sb);

return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..37667f3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -397,6 +397,7 @@ struct inodes_stat_t {
#include <linux/atomic.h>
#include <linux/shrinker.h>
#include <linux/migrate_mode.h>
+#include <linux/lockdep.h>

#include <asm/byteorder.h>

@@ -1392,6 +1393,14 @@ extern void f_delown(struct file *filp);
extern pid_t f_getown(struct file *filp);
extern int send_sigurg(struct fown_struct *fown);

+struct mm_struct;
+
+void sb_start_write(struct super_block *sb);
+int sb_start_write_trylock(struct super_block *sb);
+void sb_end_write(struct super_block *sb);
+void sb_start_pagefault(struct super_block *sb);
+void sb_end_pagefault(struct super_block *sb);
+
/*
* Umount options
*/
@@ -1405,6 +1414,28 @@ extern int send_sigurg(struct fown_struct *fown);
extern struct list_head super_blocks;
extern spinlock_t sb_lock;

+/* Possible states of 'frozen' field */
+enum {
+ SB_UNFROZEN = 0, /* FS is unfrozen */
+ SB_FREEZE_WRITE = 1, /* Writes, dir ops, ioctls frozen */
+ SB_FREEZE_TRANS = 2,
+ SB_FREEZE_PAGEFAULT = 2, /* Page faults stopped as well */
+ SB_FREEZE_COMPLETE = 3, /* ->freeze_fs finished successfully */
+};
+
+struct sb_writers {
+ struct percpu_counter counter[2]; /* counter of running writes
+ * and page faults */
+ wait_queue_head_t wait; /* queue for waiting for
+ writers / faults to finish */
+ int frozen; /* Is sb frozen? */
+ wait_queue_head_t wait_unfrozen; /* queue for waiting for
+ sb to be thawed */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map lock_map[2];
+#endif
+};
+
struct super_block {
struct list_head s_list; /* Keep this first */
dev_t s_dev; /* search index; _not_ kdev_t */
@@ -1454,6 +1485,7 @@ struct super_block {

int s_frozen;
wait_queue_head_t s_wait_unfrozen;
+ struct sb_writers s_writers;

char s_id[32]; /* Informational name */
u8 s_uuid[16]; /* UUID */
@@ -1507,14 +1539,8 @@ extern struct timespec current_fs_time(struct super_block *sb);
/*
* Snapshotting support.
*/
-enum {
- SB_UNFROZEN = 0,
- SB_FREEZE_WRITE = 1,
- SB_FREEZE_TRANS = 2,
-};
-
-#define vfs_check_frozen(sb, level) \
- wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
+/* Will go away when all users are converted */
+#define vfs_check_frozen(sb, level) do { } while (0)

/*
* until VFS tracks user namespaces for inodes, just make all files
--
1.7.1

Dave Chinner
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 09, 2012 12:30AM
On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> Generic code now blocks all writers from standard write paths. So we block all
> writers coming from ioctl and replace blocking of transactions on frozen
> filesystem with a debugging check. As a bonus, we get a protection of ioctl
> against racing remount read-only. We also convert xfs_file_aio_write() to a
> non-racy freeze protection.
.....
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 329b06a..6468a2a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -577,7 +577,6 @@ xfs_trans_alloc(
> xfs_mount_t *mp,
> uint type)
> {
> - xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> return _xfs_trans_alloc(mp, type, KM_SLEEP);
> }

So what is there to stop internal XFS threads from starting
transactions when the filesystem is frozen? Previously this
SB_FREEZE_TRANS would guarantee even internal functions would get
stopped, but now there's nothing?

I do believe that ext4 has the same problem (the issue reported with
the lazy inode init background thread), and I can see that any other
filesystem that can make modifications via internal triggers will
see the same problem - freeze doesn't block them any more...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Jan Kara
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 09, 2012 09:32AM
On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > xfs_mount_t *mp,
> > uint type)
> > {
> > - xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > }
>
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal functions would get
> stopped, but now there's nothing?
>
> I do believe that ext4 has the same problem (the issue reported with
> the lazy inode init background thread), and I can see that any other
> filesystem that can make modifications via internal triggers will
> see the same problem - freeze doesn't block them any more...
Yes, I'll have to audit filesystem internal threads. I forgot about these
in the first round. Most of them shouldn't have anything to do because the
filesystem is clean but there are exceptions as ext4 lazyinit has shown.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 09, 2012 03:30PM
On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > xfs_mount_t *mp,
> > uint type)
> > {
> > - xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > }
>
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal functions would get
> stopped, but now there's nothing?
I've checked XFS now and I didn't find any work that would be done on a
clean filesystem. I found:
_xfs_mru_cache_reap() - doesn't seem to do any IO
xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
reads can happen on a frozen filesystem
xfs_reclaim_worker() - everything is clean so no IO should happen
xfs_flush_worker() - everything is clean so no IO should happen
xfs_sync_worker() - again the same if I understand the code right

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Kamal Mostafa
Re: [PATCH 00/19] Fix filesystem freezing deadlocks
March 11, 2012 09:30PM
On Mon, 2012-03-05 at 17:00 +0100, Jan Kara wrote:
> Hallelujah,
>
> after a couple of weeks and several rewrites, here comes the third iteration
> of my patches to improve filesystem freezing. [...]

We've been testing this patch set at Canonical on the multipath failover
SAN configuration where we originally encountered the freeze deadlock.
We are happy to report that it does appear to fix the problem. Thanks
Jan!

Please add the following endorsements for these patches (those actually
exercised by our test case): 01, 02, 03, 06, 07, 08, 09, 10, 14, 18, 19

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Tested-by: Dann Frazier <[email protected]>
Tested-by: Massimo Morana <[email protected]>

-Kamal
Dave Chinner
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 11, 2012 11:50PM
On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > Generic code now blocks all writers from standard write paths. So we block all
> > > writers coming from ioctl and replace blocking of transactions on frozen
> > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > non-racy freeze protection.
> > ....
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 329b06a..6468a2a 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > > xfs_mount_t *mp,
> > > uint type)
> > > {
> > > - xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > > return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > > }
> >
> > So what is there to stop internal XFS threads from starting
> > transactions when the filesystem is frozen? Previously this
> > SB_FREEZE_TRANS would guarantee even internal functions would get
> > stopped, but now there's nothing?
> I've checked XFS now and I didn't find any work that would be done on a
> clean filesystem. I found:
> _xfs_mru_cache_reap() - doesn't seem to do any IO

Causes iput() on inodes, which if dropping the last reference to the
inode can cause them to enter reclaim and hence have transactions
run on them to either truncate blocks beyond EOF or to do the
second phase of an unlink(). So it will definitely break the freeze
model.

> xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
> reads can happen on a frozen filesystem
> xfs_reclaim_worker() - everything is clean so no IO should happen

It will do work as other things push inodes into reclaim. e.g.
filestreams inode expiry. And it will still run to reclaim clean
inodes...

> xfs_flush_worker() - everything is clean so no IO should happen
> xfs_sync_worker() - again the same if I understand the code right

xfs_sync_worker() will always trigger a log force, so if there is
anything that has dirtied the journal, it will trigger IO. We have
no protection against the journal being dirtied anymore, so no
guarantees can be given here.

Basically, your patchset creates a "shell" around the outside of the
filesystem that catches all the external modifications that can
occur through the VFS and ioctls. The "shell" is now the only layer
of defense because the patchset removes the layer of protection that
prevents internal modifications from occurring.

For example, the XFS patch added a bunch of protection to ioctls on
XFS that only modify file metadata and so never were protected
against data freezes - they were all implicitly protected by the inner
transaction subsystem freeze. All of the above cases were protected
by the inner layer of defense, which is now gone.

Not to mention the shrinkers. The inode cache shrinkers can
cause inodes to enter reclaim which can trigger EOF truncation just
like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
IO, and dquots will be dirtied by EOF truncation. Same with buffers,
and the buffer cache shrinker.

I'm sure other filesystems have just as complex or even more complex
internal paths that trigger dirtying and IO that the "freeze shell"
model does not prevent. I don't think auditing is good enough - the
shell model is, IMO, too easy to break and too hard to test for
breakage...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Jan Kara
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 12, 2012 07:00PM
On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > > Generic code now blocks all writers from standard write paths. So we block all
> > > > writers coming from ioctl and replace blocking of transactions on frozen
> > > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > > non-racy freeze protection.
> > > ....
> > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > index 329b06a..6468a2a 100644
> > > > --- a/fs/xfs/xfs_trans.c
> > > > +++ b/fs/xfs/xfs_trans.c
> > > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > > > xfs_mount_t *mp,
> > > > uint type)
> > > > {
> > > > - xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > > > return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > > > }
> > >
> > > So what is there to stop internal XFS threads from starting
> > > transactions when the filesystem is frozen? Previously this
> > > SB_FREEZE_TRANS would guarantee even internal functions would get
> > > stopped, but now there's nothing?
> > I've checked XFS now and I didn't find any work that would be done on a
> > clean filesystem. I found:
> > _xfs_mru_cache_reap() - doesn't seem to do any IO
>
> Causes iput() on inodes, which if dropping the last reference to the
> inode can cause them to enter reclaim and hence have transactions
> run on them to either truncate blocks beyond EOF or to do the
> second phase of an unlink(). So it will definitely break the freeze
> model.
OK, I've missed this. I'll fix it.

> > xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
> > reads can happen on a frozen filesystem
> > xfs_reclaim_worker() - everything is clean so no IO should happen
>
> It will do work as other things push inodes into reclaim. e.g.
> filestreams inode expiry. And it will still run to reclaim clean
> inodes...
Sure. But reclaim will start a transaction only if inode is dirty. And
inode must not be dirty on a frozen filesystem.

> > xfs_flush_worker() - everything is clean so no IO should happen
> > xfs_sync_worker() - again the same if I understand the code right
>
> xfs_sync_worker() will always trigger a log force, so if there is
> anything that has dirtied the journal, it will trigger IO. We have
> no protection against the journal being dirtied anymore, so no
> guarantees can be given here.
Yes, it would be a bug if something dirtied the journal while the
filesystem is frozen. To catch issues like this, I've added a WARN_ON in
xfs_trans_alloc() to actually warn when a transaction would be started on a
frozen filesystem. My testing never triggered it with the latest patch set.
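
Roughly like this (a sketch only; the exact check in the patch may differ):

	/* in _xfs_trans_alloc(): */
	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);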

> Basically, your patchset creates a "shell" around the outside of the
> filesystem that catches all the external modifications that can
> occur through the VFS and ioctls. The "shell" is now the only layer
> of defense because the patchset removes the layer of protection that
> prevents internal modifications from occurring.
Yes. I'd just note that the "shell" is there already to provide reliable
remounting read only. I just had to change some properties of the shell to
be usable for freezing as well. But the shell has to be maintained
regardless of how we decide to do freezing.

Also believe me that I've initially tried to fix the freezing without the
external shell. But it just isn't enough to prevent dirty inodes from being
created (e.g. from directory operations) while sync during freezing is
running.

Sure I could keep the freezing mechanism on transaction start. But it
seemed like a cleaner approach to me to protect all the places properly
with the generic mechanism than having two mechanisms and unclear
(especially locking) interactions between them.

But in the end, if you guys really feel strongly about it and decide that
XFS wants to keep it's mechanism of freezing on transaction start, then I
won't stop you. Although it would mean XFS would have to have three
counters to protect freezing while other filesystems will need only two.
Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
but I don't know XFS internals enough to really argue...

> For example, the XFS patch added a bunch of protection to ioctls on
> XFS that only modify file metadata and so never were protected
> against data freezes - they were all implicitly protected by the inner
> transaction subsystem freeze. All of the above cases were protected
> by the inner layer of defense, which is now gone.
Yes, but all these ioctls are currently buggy with respect to remounting
read only (not the whole filesystem, just one mountpoint of many where
filesystem may be mounted). So I'd argue that these particular changes are
actually bug fixes.

> Not to mention the shrinkers. The inode cache shrinkers can
> cause inodes to enter reclaim which can trigger EOF truncation just
> like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
> IO, and dquots will be dirtied by EOF truncation. Same with buffers,
> and the buffer cache shrinker.
OK, now I see that even iput() of a clean inode can cause truncation to
happen for XFS. Thanks for the pointer. I'll see what we could do about
this.

> I'm sure other filesystems have just as complex or even more complex
> internal paths that trigger dirtying and IO that the "freeze shell"
> model does not prevent. I don't think auditing is good enough - the
> shell model is, IMO, too easy to break and too hard to test for
> breakage...
Well, actually I don't know of a filesystem which would have interactions as
complex as XFS has. But maybe there are some. In either case I don't think
maintaining the "shell" is that hard. You have to maintain the one facing
userspace anyway due to remounting. For the internal filesystem threads, you
have to be aware of when you can cause a write and be protected against
freezing in that case. I don't think it's that complex, although I agree that
the current test coverage will be small (freezing isn't used very often) so
bugs can go unnoticed for a long time.

Maybe we could add a test in xfs_trans_alloc() checking that the fs is
protected against freezing? That would increase the test coverage a lot.
Although the test should probably be enabled only for XFS_DEBUG because it
will be an expensive one.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Dave Chinner
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 13, 2012 01:00AM
On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > It will do work as other things push inodes into reclaim. e.g.
> > filestreams inode expiry. And it will still run to reclaim clean
> > inodes...
> Sure. But reclaim will start a transaction only if inode is dirty. And
> inode must not be dirty on a frozen filesystem.

That's what I've been trying to tell you - even clean inodes can be
dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
enough guarantee. And not just through the shrinkers - just closing
a file descriptor can trigger truncation as well via .release. See
xfs_release:

/* If this is a read-only mount, don't do this (would generate I/O) */
if (mp->m_flags & XFS_MOUNT_RDONLY)
goto out;

> > > xfs_flush_worker() - everything is clean so no IO should happen
> > > xfs_sync_worker() - again the same if I understand the code right
> >
> > xfs_sync_worker() will always trigger a log force, so if there is
> > anything that has dirtied the journal, it will trigger IO. We have
> > no protection against the journal being dirtied anymore, so no
> > guarantees can be given here.
> Yes, it would be a bug if something dirtied the journal while the
> filesystem is frozen. To catch issues like this, I've added a WARN_ON in
> xfs_trans_alloc() to actually warn when a transaction would be started on a
> frozen filesystem. My testing never triggered it with the latest patch set.

Just hold a file descriptor open for a large write over the freeze
(don't get it caught in the freeze), and then close it after the
filesystem is frozen. Perhaps even drop the slab caches.
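
In userspace terms, something like this (an illustrative sketch only; the
path, size and the buf/mntfd setup are assumed, not taken from the thread):

	int fd = open("/mnt/test/bigfile", O_CREAT | O_WRONLY, 0644);
	write(fd, buf, 64 << 20);	/* large write may leave preallocation past EOF */
	ioctl(mntfd, FIFREEZE, 0);	/* freeze the fs while fd is still open */
	close(fd);			/* ->release now wants to trim the post-EOF blocks */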

> > Basically, your patchset creates a "shell" around the outside of the
> > filesystem that catches all the external modifications that can
> > occur through the VFS and ioctls. The "shell" is now the only layer
> > of defense because the patchset removes the layer of protection that
> > prevents internal modifications from occurring.
> Yes. I'd just note that the "shell" is there already to provide reliable
> remounting read only.

But it doesn't provide reliable read-only behaviour - internal
filesystem triggers still need to check and disarm to make read-only
mounts reliable. Indeed, XFS has many internal checks for read only
mounts, and the places we are talking about here are similar to the
places where you've removed freeze protection from by gutting the
transaction level freezing.

> I just had to change some properties of the shell to
> be usable for freezing as well. But the shell has to be maintained
> regardless of how we decide to do freezing.
>
> Also believe me that I've initially tried to fix the freezing without the
> external shell. But it just isn't enough to prevent dirty inodes from being
> created (e.g. from directory operations) while sync during freezing is
> running.

Sure - the old mechanism was "freeze data" which only protected data
modification paths, followed by "freeze trans" which protected
metadata modification paths (like directory operations). You started
by removing the metadata modification path protection - is it any
wonder that those operations continued to dirty inodes until you
expanded the "freeze data" shell to mean "freeze data and metadata"?

> Sure I could keep the freezing mechanism on transaction start. But it
> seemed like a cleaner approach to me to protect all the places properly
> with the generic mechanism than having two mechanisms and unclear
> (especially locking) interactions between them.

Once again, that's a view that does not take into account that
modifications don't all come through the VFS.

It's like the hard shell/soft center security model - it protects
well from external attackers, but once they get inside, there's no
protection. Indeed, there is zero protection from inside jobs, and
that's where multiple layers of defense are needed.

Your freeze changes provide us with a hard outer shell, but it's got
a really, really soft center. We can't use the outer shell defenses
deep inside the shell because of the locking orders that have been
defined (freeze protection must be outermost), so we need another
layer....

> But in the end, if you guys really feel strongly about it and decide that
> XFS wants to keep its mechanism of freezing on transaction start, then I
> won't stop you. Although it would mean XFS would have to have three
> counters to protect freezing while other filesystems will need only two.

There are two counters only to avoid lockdep problems because the
different entry points to the outer shell have different locking
orders. I fail to see why adding a third counter to provide an
innermost freeze lock that retains the current level of protection
is a big deal because all that it really needs to work is a
different lock order.

Also, ext4 could retain its current outer/inner protection, so I
don't see that this behaviour is XFS specific.....

> Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
> but I don't know XFS internals enough to really argue...

.... especially given the reported ext4 regression. I have much less
confidence than you that the outer shell is sufficient for ext4 or
any other complex filesystem.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Jan Kara
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 13, 2012 10:40PM
On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > > It will do work as other things push inodes into reclaim. e.g.
> > > filestreams inode expiry. And it will still run to reclaim clean
> > > inodes...
> > Sure. But reclaim will start a transaction only if inode is dirty. And
> > inode must not be dirty on a frozen filesystem.
>
> That's what I've been trying to tell you - even clean inodes can be
> dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
> enough guarantee. And not just through the shrinkers - just closing
> a file descriptor can trigger truncation as well via .release. See
> xfs_release:
>
> /* If this is a read-only mount, don't do this (would generate I/O) */
> if (mp->m_flags & XFS_MOUNT_RDONLY)
> goto out;
Yes, I've noticed this while writing the email but forgot to update the
beginning. Sorry for the confusion.

> > > Basically, your patchset creates a "shell" around the outside of the
> > > filesystem that catches all the external modifications that can
> > > occur through the VFS and ioctls. The "shell" is now the only layer
> > > of defense because the patchset removes the layer of protection that
> > > prevents internal modifications from occurring.
> > Yes. I'd just note that the "shell" is there already to provide reliable
> > remounting read only.
>
> But it doesn't provide reliable read-only behaviour - internal
> filesystem triggers still need to check and disarm to make read-only
> mounts reliable. Indeed, XFS has many internal checks for read only
> mounts, and the places we are talking about here are similar to the
> places where you've removed freeze protection from by gutting the
> transaction level freezing.
And indeed your read-only checks are racy in the same way freezing was...
But that's a different story. You actually have a good point that freezing
protection and read-only mount protection should be in the same places.

> > I just had to change some properties of the shell to
> > be usable for freezing as well. But the shell has to be maintained
> > regardless of how we decide to do freezing.
> >
> > Also believe me that I've initially tried to fix the freezing without the
> > external shell. But it just isn't enough to prevent dirty inodes from being
> > created (e.g. from directory operations) while sync during freezing is
> > running.
>
> Sure - the old mechanism was "freeze data" which only protected data
> modification paths, followed by "freeze trans" which protected
> metadata modification paths (like directory operations). You started
> by removing the metadata modification path protection - is it any
> wonder that those operations continued to dirty inodes until you
> expanded the "freeze data" shell to mean "freeze data and metadata"?
That's just not true. Remember older versions of this patch set. Old
model is doing:
freeze data
sync
freeze metadata (via freezing transactions)
and it is broken mainly because while sync is running, new metadata are
dirtied and thus you end up with dirty metadata on a frozen filesystem.
Forcing the journal commit in ->freeze_fs() pushes changes to disk but
inodes still remain dirty and actually, it can be too late for flusher
thread anyway because it can be already hanging trying to start a
transaction.

Then there are also other, less severe problems, like mark_inode_dirty()
racing with freezing in such a way that the inode gets added to the dirty
list only after the filesystem was fully frozen.

> > Sure I could keep the freezing mechanism on transaction start. But it
> > seemed like a cleaner approach to me to protect all the places properly
> > with the generic mechanism than having two mechanisms and unclear
> > (especially locking) interactions between them.
>
> Once again, that's a view that does not take into account that
> modifications don't all come through the VFS.
"all the places" in my sentence meant "including the internal sources of
modification"...

> It's like the hard shell/soft center security model - it protects
> well from external attackers, but once they get inside, there's no
> protection. Indeed, there is zero protection from inside jobs, and
> that's where multiple layers of defense are needed.
>
> Your freeze changes provide us with a hard outer shell, but it's got
> a really, really soft center. We can't use the outer shell defenses
> deep inside the shell because of the locking orders that have been
> defined (freeze protection must be outermost), so we need another
> layer....
I think there are two different issues. One is your complaint that
extending the shell to cover also internal sources of fs modification will
be too fragile if not done implicitly on transaction start.

Another is a question whether we could reasonably fend off internal
modification using the current two counters. The MRU thread or ->release
can be easily handled with them (->release can be handled by the counter
ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
outer counter won't work. The inner one with lock ranking just below
mmap_sem might work - depends on whether there are any XFS locks under
mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
issues might be nasty enough to warrant another counter.

This is connected with another somewhat related issue: Do we want shrinkers
to block on a frozen filesystem? That could result in unrelated processes
blocking on the frozen filesystem. I'd find it nicer if we just skipped
the writes if the filesystem is frozen, as it seems to me
xfs_free_eofblocks() is somewhat optional. Is that true?
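
Something like this, perhaps (an illustrative sketch only, using the helpers
added earlier in this series; the xfs_free_eofblocks() call stands in for
whatever optional cleanup the worker does):

	if (!sb_start_write_trylock(mp->m_super))
		return;		/* fs frozen or freezing - skip the optional work */
	/* ... xfs_free_eofblocks() or similar optional truncation ... */
	sb_end_write(mp->m_super);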

> > But in the end, if you guys really feel strongly about it and decide that
> > XFS wants to keep its mechanism of freezing on transaction start, then I
> > won't stop you. Although it would mean XFS would have to have three
> > counters to protect freezing while other filesystems will need only two.
>
> There are two counters only to avoid lockdep problems because the
> different entry points to the outer shell have different locking
> orders. I fail to see why adding a third counter to provide an
> innermost freeze lock that retains the current level of protection
> is a big deal because all that it really needs to work is a
> different lock order.
The two counters are needed to avoid deadlocks, not just because of
lockdep... The third counter can be added but per-cpu counters are not
exactly cheap (in terms of memory) so I don't want to add it without a good
reason. Lock ordering constraints in shrinkers might be this good reason.

Also fitting the XFS transaction system to freezing was a bit ugly (because
you need to log a dummy transaction on a frozen filesystem and you have
things like xfs_trans_dup() which requires us to bump the counter even
though the filesystem is already in freezing process) so I'd be happier if
we could avoid it...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Dave Chinner
Re: [PATCH 11/19] xfs: Convert to new freezing code
March 14, 2012 04:10AM
On Tue, Mar 13, 2012 at 10:30:20PM +0100, Jan Kara wrote:
> On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> > On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > I just had to change some properties of the shell to
> > > be usable for freezing as well. But the shell has to be maintained
> > > regardless of how we decide to do freezing.
> > >
> > > Also believe me that I've initially tried to fix the freezing without the
> > > external shell. But it just isn't enough to prevent dirty inodes from being
> > > created (e.g. from directory operations) while sync during freezing is
> > > running.
> >
> > Sure - the old mechanism was "freeze data" which only protected data
> > modification paths, followed by "freeze trans" which protected
> > metadata modification paths (like directory operations). You started
> > by removing the metadata modification path protection - is it any
> > wonder that those operations continued to dirty inodes until you
> > expanded the "freeze data" shell to mean "freeze data and metadata"?
> That's just not true. Remember older versions of this patch set. Old
> model is doing:
> freeze data
> sync
> freeze metadata (via freezing transactions)
> and it is broken mainly because while sync is running, new metadata are
> dirtied and thus you end up with dirty metadata on a frozen filesystem.

But that's exactly the issue that the filesystem .freeze_fs
method is supposed to handle. It's supposed to finish the
freeze process by finishing all outstanding metadata operations and
cleaning all remaining dirty metadata. It still has to guarantee
this regardless of the model used prior to this.

> Forcing the journal commit in ->freeze_fs() pushes changes to disk but
> inodes still remain dirty and actually, it can be too late for flusher
> thread anyway because it can be already hanging trying to start a
> transaction.

If the changes are already in the journal, then there is no new
transaction work needed to clean the VFS state. Indeed, .freeze_fs
needs to write back all dirty metadata objects that it is tracking
internally in the journal so that a snapshot of the frozen image is
consistent and can be mounted read-only without running recovery...

> > It's like the hard shell/soft center security model - it protects
> > well from external attackers, but once they get inside, there's no
> > protection. Indeed, there is zero protection from inside jobs, and
> > that's where multiple layers of defense are needed.
> >
> > Your freeze changes provide us with a hard outer shell, but it's got
> > a really, really soft center. We can't use the outer shell defenses
> > deep inside the shell because of the locking orders that have been
> > defined (freeze protection must be outermost), so we need another
> > layer....
> I think there are two different issues. One is your complaint that
> extending the shell to cover also internal sources of fs modification will
> be too fragile if not done implicitly on transaction start.

Most definitely. It means that we have to have freeze behaviour in
mind when doing any async/background work. That's usually the
furthest thing from people's minds when implementing stuff like
background inode initialisation...

> Another is a question whether we could reasonably fend off internal
> modification using the current two counters. The MRU thread or ->release
> can be easily handled with them (->release can be handled by the counter
> ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
> outer counter won't work. The inner one with lock ranking just below
> mmap_sem might work - depends on whether there are any XFS locks under
> mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
> issues might be nasty enough to warrant another counter.

Might be nasty? There's no question about it in my mind that it
*would* be nasty....

> This is connected with another somewhat related issue: Do we want shrinkers
> to block on a frozen filesystem? That could result in unrelated processes
> blocking on the frozen filesystem. I'd find it nicer if we just skipped
> the writes if the filesystem is frozen, as it seems to me
> xfs_free_eofblocks() is somewhat optional. Is that true?

Not really. It needs to be done on rw filesystems[*].

Besides, this isn't really an XFS specific problem - the shrinkers
need to operate on frozen filesystems. e.g. run a directory
traversal of a frozen filesystem, and if you stop the shrinkers from
running then the dcache/icache growth will run the system out of
memory.

Sure, blocking a shrinker can be considered antisocial, but if you
can't free memory because it will dirty inodes on a frozen
filesystem then it is better to block the shrinker until the
filesystem is unfrozen than it is to allow the shrinker to fail and
eventually OOM the machine. Freezing filesystems is supposed to be a
short term, workload-invisible operation. Kicking the OOM killer
because you can't free inodes is far more antisocial than having
those applications pause while the fs is frozen.

[*] Right now we are discussing adding a new background/async worker
that does speculative preallocation removal to make it occur
faster than it currently does. So skipping it is not really an
option and this is exactly the sort of "new feature needs to
consider how to deal with freeze" problem I'm talking about.

> > > But in the end, if you guys really feel strongly about it and decide that
> > > XFS wants to keep its mechanism of freezing on transaction start, then I
> > > won't stop you. Although it would mean XFS would have to have three
> > > counters to protect freezing while other filesystems will need only two.
> >
> > There are two counters only to avoid lockdep problems because the
> > different entry points to the outer shell have different locking
> > orders. I fail to see why adding a third counter to provide an
> > innermost freeze lock that retains the current level of protection
> > is a big deal because all that it really needs to work is a
> > different lock order.
> The two counters are needed to avoid deadlocks, not just because of
> lockdep... The third counter can be added but per-cpu counters are not
> exactly cheap (in terms of memory) so I don't want to add it without a good
> reason. Lock ordering constraints in shrinkers might be this good reason.

An additional percpu counter is far, far cheaper than the cost of
always needing to consider how not to break filesystem freezing for
the foreseeable future.

> Also fitting the XFS transaction system to freezing was a bit ugly (because
> you need to log a dummy transaction on a frozen filesystem

Sure - that's to ensure the log is dirty so that open-but-unlinked
inodes are handled correctly by recovery if the system crashes while the
filesystem is frozen. But the last patch in your series would prevent
freezing filesystems while there are open-but-unlinked files
present, so all that grot could be removed.

Indeed, when that patch goes to mainline, the XFS code for
handling remount,ro and freeze can be made identical....

> and you have
> things like xfs_trans_dup() which requires us to bump the counter even
> though the filesystem is already in freezing process) so I'd be happier if
> we could avoid it...

The only additional infrastructure that you need to provide is a
"start_nested" operation for xfs_trans_dup() that increments the
percpu counter. I don't think that's a difficult problem to solve.
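
Something along these lines, perhaps (a sketch only; the name and the counter
it bumps are illustrative - with a separate transaction-level counter it would
increment that one instead):

static void sb_start_write_nested(struct super_block *sb)
{
	/*
	 * The caller already holds freeze protection, so just take another
	 * reference without re-checking s_writers.frozen. That way
	 * xfs_trans_dup() cannot block behind a freeze it is itself
	 * holding off.
	 */
	percpu_counter_inc(&sb->s_writers.counter[SB_FREEZE_WRITE-1]);
}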

Cheers,

Dave.
--
Dave Chinner
dchinner@redhat.com
On Mon, 5 Mar 2012 17:00:59 +0100
Jan Kara <[email protected]> wrote:

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1759,8 +1759,28 @@ page_not_uptodate:
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct page *page = vmf->page;
> + struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> + int ret = VM_FAULT_LOCKED;
> +
> + file_update_time(vma->vm_file);
> + lock_page(page);
> + if ((page->mapping != inode->i_mapping) ||
> + (page_offset(page) > i_size_read(inode))) {

Would benefit from a comment explaining how the page can come to be
outside i_size, and why we fail in that case.

I don't think i_mutex is held here, so this test is rather meaningless
and racy anyway?

> + unlock_page(page);
> + ret = VM_FAULT_NOPAGE;
> + goto out;
> + }
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(filemap_page_mkwrite);
On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> On Mon, 5 Mar 2012 17:00:59 +0100
> Jan Kara <[email protected]> wrote:
>
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > }
> > EXPORT_SYMBOL(filemap_fault);
> >
> > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct page *page = vmf->page;
> > + struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > + int ret = VM_FAULT_LOCKED;
> > +
> > + file_update_time(vma->vm_file);
> > + lock_page(page);
> > + if ((page->mapping != inode->i_mapping) ||
> > + (page_offset(page) > i_size_read(inode))) {
>
> Would benefit from a comment explaining how the page can come to be
> outside i_size, and why we fail in that case.
>
> I don't think i_mutex is held here, so this test is rather meaningless
> and racy anyway?
The i_size test is racy, if that's what you mean by "this test". I did
the test this way just because it's like this in other places, and I figured
truncate_pagecache() can take a relatively long time so the test has some
effect. But if you think it's not worth it, I can remove it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, 27 Mar 2012 09:55:27 +0200
Jan Kara <[email protected]> wrote:

> On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > On Mon, 5 Mar 2012 17:00:59 +0100
> > Jan Kara <[email protected]> wrote:
> >
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > }
> > > EXPORT_SYMBOL(filemap_fault);
> > >
> > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > +{
> > > + struct page *page = vmf->page;
> > > + struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > + int ret = VM_FAULT_LOCKED;
> > > +
> > > + file_update_time(vma->vm_file);
> > > + lock_page(page);
> > > + if ((page->mapping != inode->i_mapping) ||
> > > + (page_offset(page) > i_size_read(inode))) {
> >
> > Would benefit from a comment explaining how the page can come to be
> > outside i_size, and why we fail in that case.

This?

> > I don't think i_mutex is held here, so this test is rather meaningless
> > and racy anyway?
> The i_size test is racy, if that's what you mean by "this test". I did
> the test this way just because it's like this in other places, and I figured
> truncate_pagecache() can take a relatively long time so the test has some
> effect. But if you think it's not worth it, I can remove it.

It bugs me when we copy-n-paste code without remembering why we had it
there in the first place :( iirc, mmapped pages outside i_size can and
do happen in some race situations, and are benign. But it's several
years since I thought about it and all the details have evaporated and
it would take a lot of work to reinstantiate it all. argh.

Also, it's off-by-one, isn't it? Should be page_offset(page) >= i_size?
On Tue 27-03-12 14:38:15, Andrew Morton wrote:
> On Tue, 27 Mar 2012 09:55:27 +0200
> Jan Kara <[email protected]> wrote:
> > On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > > On Mon, 5 Mar 2012 17:00:59 +0100
> > > Jan Kara <[email protected]> wrote:
> > >
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > > }
> > > > EXPORT_SYMBOL(filemap_fault);
> > > >
> > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > +{
> > > > + struct page *page = vmf->page;
> > > > + struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > > + int ret = VM_FAULT_LOCKED;
> > > > +
> > > > + file_update_time(vma->vm_file);
> > > > + lock_page(page);
> > > > + if ((page->mapping != inode->i_mapping) ||
> > > > + (page_offset(page) > i_size_read(inode))) {
> > >
> > > Would benefit from a comment explaining how the page can come to be
> > > outside i_size, and why we fail in that case.
>
> This?
>
> > > I don't think i_mutex is held here, so this test is rather meaningless
> > > and racy anyway?
> > The i_size test is racy, if that's what you mean by "this test". I did
> > the test this way just because it's like this in other places, and I figured
> > truncate_pagecache() can take a relatively long time so the test has some
> > effect. But if you think it's not worth it, I can remove it.
>
> It bugs me when we copy-n-paste code without remembering why we had it
> there in the first place :( iirc, mmapped pages outside i_size can and
> do happen in some race situations, and are benign.
Yeah. Certainly there can be pages beyond i_size because we first set
file size and then go and remove pages beyond new i_size one by one when we
do truncate. We must be careful not to create any new pages beyond i_size
but that's what filemap_fault() takes care of. So I think i_size check in
->page_mkwrite() isn't strictly needed.
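
A rough sketch of the ordering described above (illustrative only; the helper
name is made up, and the real truncate path goes through the filesystem's
setattr code with i_mutex held):

#include <linux/fs.h>
#include <linux/mm.h>

static void truncate_ordering_sketch(struct inode *inode, loff_t newsize)
{
	/* 1: the new size becomes visible to everyone first */
	i_size_write(inode, newsize);
	/* 2: only then are pages beyond the new EOF removed, one by one */
	truncate_inode_pages(inode->i_mapping, newsize);
}

Between the two steps, pages instantiated before the truncate can still sit
in the page cache beyond the new i_size, which is why ->page_mkwrite() may
legitimately see them.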

> But it's several
> years since I thought about it and all the details have evaporated and
> it would take a lot of work to reinstantiate it all. argh.
>
> Also, it's off-by-one, isn't it? Should be page_offset(page) >= i_size?
Yes, it is. If page_offset(page) equals i_size, the page starts exactly at EOF
and holds no data within the file, so the comparison would have to be >=.
But I'll just remove the check.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Mar 28, 2012 at 12:08:19AM +0200, Jan Kara wrote:
> On Tue 27-03-12 14:38:15, Andrew Morton wrote:
> > On Tue, 27 Mar 2012 09:55:27 +0200
> > Jan Kara <[email protected]> wrote:
> > > On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > > > On Mon, 5 Mar 2012 17:00:59 +0100
> > > > Jan Kara <[email protected]> wrote:
> > > >
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > > > }
> > > > > EXPORT_SYMBOL(filemap_fault);
> > > > >
> > > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > +{
> > > > > + struct page *page = vmf->page;
> > > > > + struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > > > + int ret = VM_FAULT_LOCKED;
> > > > > +
> > > > > + file_update_time(vma->vm_file);
> > > > > + lock_page(page);
> > > > > + if ((page->mapping != inode->i_mapping) ||
> > > > > + (page_offset(page) > i_size_read(inode))) {
> > > >
> > > > Would benefit from a comment explaining how the page can come to be
> > > > outside i_size, and why we fail in that case.
> >
> > This?
> >
> > > > I don't think i_mutex is held here, so this test is rather meaningless
> > > > and racy anyway?
> > > The i_size test is racy, if that's what you mean by "this test". I did
> > > the test this way just because it's like this in other places, and I figured
> > > truncate_pagecache() can take a relatively long time so the test has some
> > > effect. But if you think it's not worth it, I can remove it.
> >
> > It bugs me when we copy-n-paste code without remembering why we had it
> > there in the first place :( iirc, mmapped pages outside i_size can and
> > do happen in some race situations, and are benign.
> Yeah. Certainly there can be pages beyond i_size because we first set
> file size and then go and remove pages beyond new i_size one by one when we
> do truncate. We must be careful not to create any new pages beyond i_size
> but that's what filemap_fault() takes care of. So I think i_size check in
> ->page_mkwrite() isn't strictly needed.

Actually, I think it is. In __do_fault(), we drop the page lock
between the .fault call and the .page_mkwrite() call, so the size
checks in .fault for the given page being faulted are no longer
valid as truncate serialises only on the page lock. Hence we have to
repeat the truncate race checks again in .page_mkwrite() after we
relock the page.
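
For reference, a heavily simplified outline of the window being described
(this is an illustration of the ordering only, not the real mm/memory.c code;
COW handling, retries and error paths are all omitted):

#include <linux/mm.h>
#include <linux/pagemap.h>

static int do_fault_window_sketch(struct vm_area_struct *vma,
				  struct vm_fault *vmf)
{
	int ret;

	/* assume ->fault() returns VM_FAULT_LOCKED, i.e. the page comes back locked */
	ret = vma->vm_ops->fault(vma, vmf);
	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
		return ret;

	/* the page lock is dropped here, so truncate can run in this window */
	unlock_page(vmf->page);

	/* ->page_mkwrite() must therefore relock and re-validate the page itself */
	return vma->vm_ops->page_mkwrite(vma, vmf);
}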

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
On Wed 28-03-12 09:45:53, Dave Chinner wrote:
> On Wed, Mar 28, 2012 at 12:08:19AM +0200, Jan Kara wrote:
> > On Tue 27-03-12 14:38:15, Andrew Morton wrote:
> > > On Tue, 27 Mar 2012 09:55:27 +0200
> > > Jan Kara <[email protected]> wrote:
> > > > On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > > > > On Mon, 5 Mar 2012 17:00:59 +0100
> > > > > Jan Kara <[email protected]> wrote:
> > > > >
> > > > > > --- a/mm/filemap.c
> > > > > > +++ b/mm/filemap.c
> > > > > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > > > > }
> > > > > > EXPORT_SYMBOL(filemap_fault);
> > > > > >
> > > > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > > +{
> > > > > > + struct page *page = vmf->page;
> > > > > > + struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > > > > + int ret = VM_FAULT_LOCKED;
> > > > > > +
> > > > > > + file_update_time(vma->vm_file);
> > > > > > + lock_page(page);
> > > > > > + if ((page->mapping != inode->i_mapping) ||
> > > > > > + (page_offset(page) > i_size_read(inode))) {
> > > > >
> > > > > Would benefit from a comment explaining how the page can come to be
> > > > > outside i_size, and why we fail in that case.
> > >
> > > This?
> > >
> > > > > I don't think i_mutex is held here, so this test is rather meaningless
> > > > > and racy anyway?
> > > > The i_size test is racy, if that's what you mean by "this test". I did
> > > > the test this way just because it's like this in other places, and I figured
> > > > truncate_pagecache() can take a relatively long time so the test has some
> > > > effect. But if you think it's not worth it, I can remove it.
> > >
> > > It bugs me when we copy-n-paste code without remembering why we had it
> > > there in the first place :( iirc, mmapped pages outside i_size can and
> > > do happen in some race situations, and are benign.
> > Yeah. Certainly there can be pages beyond i_size because we first set
> > file size and then go and remove pages beyond new i_size one by one when we
> > do truncate. We must be careful not to create any new pages beyond i_size
> > but that's what filemap_fault() takes care of. So I think i_size check in
> > ->page_mkwrite() isn't strictly needed.
>
> Actually, I think it is. In __do_fault(), we drop the page lock
> between the .fault call and the .page_mkwrite() call, so the size
> checks in .fault for the given page being faulted are no longer
> valid as truncate serialises only on the page lock. Hence we have to
> repeat the truncate race checks again in .page_mkwrite() after we
> relock the page.
I was thinking about this scenario as well but I think it doesn't cause any
problems. Maybe there's just some confusion about what the condition actually
is. So I currently have:
	if (page->mapping != inode->i_mapping)
		bail..
	do stuff.
And I think that is enough. Only filemap_fault() creates new pages;
->page_mkwrite() only uses the page reference from vmf->page. So the only
thing that can happen after __do_fault() drops the page lock is that
truncate_pagecache() will go and truncate the page (thus page->mapping will
be NULL) and the condition I currently have should be enough to catch that.
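
For illustration, the reduced check being described would look roughly like
this (a sketch following the quoted patch minus the i_size test, with a
hypothetical helper name; it is not the final committed code):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int page_mkwrite_check_truncate_sketch(struct vm_area_struct *vma,
					      struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;

	file_update_time(vma->vm_file);
	lock_page(page);
	/*
	 * Truncate detaches the page from the mapping under the page lock,
	 * so after relocking this comparison is enough to detect it.
	 */
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;		/* let the fault be retried */
	}
	return VM_FAULT_LOCKED;			/* page stays locked for the caller */
}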

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR