Add missed ufsi->i_dir_start_lookup initialization in ufs_read_inode in
UFS2 case. Also it cleans ufs_read_inode function to prevent such kind of
situation in the future: it move depend on UFS type parts of code into
separate functions and leaves in ufs_read_inode only generic code. It
cleans code and avoids duplication.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
locking init cleanups:
- convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
- convert rwlocks in a similar manner
this patch was generated automatically.
Motivation:
- cleanliness
- lockdep needs control of lock initialization, which the open-coded
variants do not give
- it's also useful for -rt and for lock debugging in general
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- add a proper prototype for the following global function:
- buffer_init()
- make the following needlessly global function static:
- end_buffer_async_write()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.
Use these constants in core arch., mm, driver, and fs code.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the i386 VDSO down into a vma and thus randomize it.
Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.
It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).
There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO. Newer
distributions (using glibc 2.3.3 or later) can turn this option off. Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.
There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.
(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)
This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and i completed it.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revernt MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
made me look at this code (bug id #344). We only return with
XFS_ERROR(EINVAL) if mp->m_rtdev_targp is valid and pass it otherwise to
xfs_read_buf() where some function calls later it gets dereferenced by an
assert.
SGI-PV: 954266
SGI-Modid: xfs-linux-melb:xfs-kern:26363a
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Builds on ARM report link problems with common configurations like
statically linked NFS (for nfsroot). The symptom is that __init
section code references __exit section code; that won't work since
the exit sections are discarded (since they can never be called).
The best fix for these particular cases would be an "__init_or_exit"
section annotation.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Drop '&& !JFFS2_FS_WRITEBUFFER' from fs/Kconfig.
The series of previous patches enables to use those
functionality at same time.
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
In jffs2_release_xattr_datum(), it refers xd->refcnt to ensure
whether releasing xd is allowed or not.
But we can't hold xattr_sem since this function is called under
spin_lock(&c->erase_completion_lock). Thus we have to refer it
without any locking.
This patch redefine xd->refcnt as atomic_t. It enables to refer
xd->refcnt without any locking.
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
If xattr_ref is associated with an orphan inode_cache
on filesystem mounting, those xattr_refs are not
released even if this inode_cache is released.
This patch enables to call jffs2_xattr_delete_inode()
for such a irregular inode_cachde too.
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
In the followinf situation, an explicit delete marker is not
necessary, because we can certainlly detect those obsolete
xattr_datum or xattr_ref on next mounting.
- When to delete xattr_datum node.
- When to delete xattr_ref node on removing inode.
- When to delete xattr_ref node on updating xattr.
This patch rids writing delete marker in those situations.
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
This patch enable to handle the case when updating null xattr
by null ACL.
When we try to set NULL into NULL xattr, xattr subsystem returns
-ENODATA. This patch enables to handle this error code.
[2/3] jffs2-xattr-v6-02-fix_posixacl_bug.patch
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
- When xdatum is removed, a new xdatum with 'delete marker' is
written. (version==0xffffffff means 'delete marker')
- When xref is removed, a new xref with 'delete marker' is written.
(odd-numbered xseqno means 'delete marker')
- delete_xattr_(datum/xref)_delay() are new deletion functions
are added. We can only use them if we can detect the target
obsolete xdatum/xref as a orphan or errir one.
(e.g when inode deletion, or detecting crc error)
[1/3] jffs2-xattr-v6-01-delete_marker.patch
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
d_instantiate, due to fast transaction committal removing the last
remaining reference before we were all done.
SGI-PV: 953287
SGI-Modid: xfs-linux-melb:xfs-kern:26347a
Signed-off-by: Nathan Scott <nathans@sgi.com>
unused * ->t_ag_freeblks_delta, ->t_ag_flist_delta, ->t_ag_btree_delta
are debugging aid -- wrap them in everyone's favourite way. As a
result, cut "xfs_trans" slab object size from 592 to 572 bytes here.
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26319a
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
flags and mode of final inode are looked at. Pass original inode
instead. * Two occurences of bhv_vnode_t go out.
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26298a
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
This patch #if 0's the no longer used dlm_dump_lock_resources().
Since this makes dlmdebug.h empty, this patch also removes this header.
Additionally, the needlessly global dlm_is_node_recovered() is made
static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The work that is done can block for long periods of time and so is not
appropriate for keventd.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Before checking for a nonexistent lock, make sure the lockres is not marked
RECOVERING. The caller will just retry and the state should be fixed up when
recovery completes.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We cannot restart recovery. Once we begin to recover a node, keep the state
of the recovery intact and follow through, regardless of any other node
deaths that may occur.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
If the previous master of the recovery lock dies, let calc_usage take it
down completely and let the caller completely redo the dlmlock() call.
Otherwise, there will never be an opportunity to re-master the lockres and
recovery wont be able to progress.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Use the existing structure for blocking migrations when ASTs are pending to
achieve the same result. If we can catch the assert before it goes on the
wire, just cancel it and let the migration continue.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Now we never change the owner of a lock resource until unmount or node
death. This will be re-enabled once some issues in the algorithm used have
been resolved.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
In dlmlock_remote(), do not call purge_lockres until the lock resource
actually changes. otherwise, the mastery info on the lockres will go away
underneath the caller.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
When mastering non-recovery lock resources, additional time was frequently
needed to allow the disk heartbeat to catch up with the network timeout. the
recovery lock resource is time critical and avoids this path.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Recovery will spin in dlm_pre_master_reco_lockres if we do not ignore
timed-out network responses from dead nodes.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Change behavior of dlm_restart_lock_mastery() when a node goes down. Dump
all responses that have been collected and start over.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This is an error on the sending side, so gracefully error out on the
receiving end.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Immediately purge a lockress that the local node is not the master of.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Makes it easier for the recovery process to deal with node death.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Take a reference on lockres structures while they are on the recovery list.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
handle errors during lock assert master by either killing self or other node
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The check for an empty lvb should check the entire buffer not just the first
byte.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Recovery may have happened and it may now be mastered locally.
Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The OCFS2 DLM allocates a number of pages for a hash to lookup locks.
There was a bug where a PAGE_SIZE bigger than the hash size (eg, 64K
pages) would result in zero pages allocated.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This allows us to have a hash table greater than a single page which greatly
improves dlm performance on some tests.
Signed-off-by: Daniel Phillips <phillips@google.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Gains us a bit of performance on loads which heavily hit the lockres hash.
Patch suggested by Daniel Phillips <phillips@google.com>.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
typo fixes
Clean up 'inline is not at beginning' warnings for usb storage
Storage class should be first
i386: Trivial typo fixes
ixj: make ixj_set_tone_off() static
spelling fixes
fix paniced->panicked typos
Spelling fixes for Documentation/atomic_ops.txt
move acknowledgment for Mark Adler to CREDITS
remove the bouncing email address of David Campbell
This is the first patch in a series of patches that removes devfs
support from the kernel. This patch removes the core devfs code, and
its private header file.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Sometimes e.g. with crashme the compat layer warnings can be noisy.
Add a way to turn them off by gating all output through compat_printk
that checks a global sysctl. The default is not changed.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
[SPARC]: Add iomap interfaces.
[OPENPROM]: Rewrite driver to use in-kernel device tree.
[OPENPROMFS]: Rewrite using in-kernel device tree and seq_file.
[SPARC]: Add unique device_node IDs and a ".node" property.
[SPARC]: Add of_set_property() interface.
[SPARC64]: Export auxio_register to modules.
[SPARC64]: Add missing interfaces to dma-mapping.h
[SPARC64]: Export _PAGE_IE to modules.
[SPARC64]: Allow floppy driver to build modular.
[SPARC]: Export x_bus_type to modules.
[RIOWATCHDOG]: Fix the build.
[CPWATCHDOG]: Fix the build.
[PARPORT] sunbpp: Fix typo.
[MTD] sun_uflash: Port to new EBUS device layer.
This patch optimizes zap_threads() for the case when there are no ->mm
users except the current's thread group. In that case we can avoid
'for_each_process()' loop.
It also adds a useful invariant: SIGNAL_GROUP_EXIT (if checked under
->siglock) always implies that all threads (except may be current) have
pending SIGKILL.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is a preparation for the next patch. No functional changes.
Basically, this patch moves '->flags & SIGNAL_GROUP_EXIT' check into
zap_threads(), and 'complete(vfork_done)' into coredump_wait outside of
->mmap_sem protected area.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes tasklist_lock from zap_threads().
This is safe wrt:
do_exit:
The caller holds mm->mmap_sem. This means that task which
shares the same ->mm can't pass exit_mm(), so it can't be
unhashed from init_task.tasks or ->thread_group lists.
fork:
None of sub-threads can fork after zap_process(leader). All
processes which were created before this point should be
visible to zap_threads() because copy_process() adds the new
process to the tail of init_task.tasks list, and ->siglock
lock/unlock provides a memory barrier.
de_thread:
It does list_replace_rcu(&leader->tasks, ¤t->tasks).
So zap_threads() will see either old or new leader, it does
not matter. However, it can change p->sighand, so we should
use lock_task_sighand() in zap_process().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to
the thread group. This means that a TASK_TRACED task
1. Will be awakened by signal_wake_up(1)
2. Can't sleep again via ptrace_notify()
3. Can't go to do_signal_stop() after return
from ptrace_stop() in get_signal_to_deliver()
So we can remove all ptrace related stuff from coredump path.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With this patch a thread group is killed atomically under ->siglock. This is
faster because we can use sigaddset() instead of force_sig_info() and this is
used in further patches.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
zap_threads() iterates over all threads to find those ones which share
current->mm. All threads in the thread group share the same ->mm, so we can
skip entire thread group if it has another ->mm.
This patch shifts the killing of thread group into the newly added
zap_process() function. This looks as unnecessary complication, but it is
used in further patches.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We should keep the value of old_leader->tasks.next in de_thread, otherwise
we can't do for_each_process/do_each_thread without tasklist_lock held.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Below is a patch to add a new /proc/self/attr/sockcreate A process may write a
context into this interface and all subsequent sockets created will be labeled
with that context. This is the same idea as the fscreate interface where a
process can specify the label of a file about to be created. At this time one
envisioned user of this will be xinetd. It will be able to better label
sockets for the actual services. At this time all sockets take the label of
the creating process, so all xinitd sockets would just be labeled the same.
I tested this by creating a tcp sender and listener. The sender was able to
write to this new proc file and then create sockets with the specified label.
I am able to be sure the new label was used since the avc denial messages
kicked out by the kernel included both the new security permission
setsockcreate and all the socket denials were for the new label, not the label
of the running process.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Try to make next_tid() a bit more readable and deletes unnecessary
"pid_alive(pos)" check.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
first_tid:
/* If nr exceeds the number of threads there is nothing todo */
if (nr) {
if (nr >= get_nr_threads(leader))
goto done;
}
This is not reliable: sub-threads can exit after this check, so the
'for' loop below can overlap and proc_task_readdir() can return an
already filldir'ed dirents.
for (; pos && pid_alive(pos); pos = next_thread(pos)) {
if (--nr > 0)
continue;
Off-by-one error, will return 'leader' when nr == 1.
This patch tries to fix these problems and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is just like my previous removal of tasklist_lock from first_tgid, and
next_tgid. It simply had to wait until it was rcu safe to walk the thread
list.
This should be the last instance of the tasklist_lock in proc. So user
processes should not be able to influence the tasklist lock hold times.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In process of getting proc_fd_access_allowed to work it has developed a few
warts. In particular the special case that always allows introspection and
the special case to allow inspection of kernel threads.
The special case for introspection is needed for /proc/self/mem.
The special case for kernel threads really should be overridable
by security modules.
So consolidate these checks into ptrace.c:may_attach().
The check to always allow introspection is trivial.
The check to allow access to kernel threads, and zombies is a little
trickier. mem_read and mem_write already verify an mm exists so it isn't
needed twice. proc_fd_access_allowed only doesn't want a check to verify
task->mm exits, s it prevents all access to kernel threads. So just move
the task->mm check into ptrace_attach where it is needed for practical
reasons.
I did a quick audit and none of the security modules in the kernel seem to
care if they are passed a task without an mm into security_ptrace. So the
above move should be safe and it allows security modules to come up with
more restrictive policy.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since 2.2 we have been doing a chroot check to see if it is appropriate to
return a read or follow one of these magic symlinks. The chroot check was
asking a question about the visibility of files to the calling process and
it was actually checking the destination process, and not the files
themselves. That test was clearly bogus.
In my first pass through I simply fixed the test to check the visibility of
the files themselves. That naive approach to fixing the permissions was
too strict and resulted in cases where a task could not even see all of
it's file descriptors.
What has disturbed me about relaxing this check is that file descriptors
are per-process private things, and they are occasionaly used a user space
capability tokens. Looking a little farther into the symlink path on /proc
I did find userid checks and a check for capability (CAP_DAC_OVERRIDE) so
there were permissions checking this.
But I was still concerned about privacy. Besides /proc there is only one
other way to find out this kind of information, and that is ptrace. ptrace
has been around for a long time and it has a well established security
model.
So after thinking about it I finally realized that the permission checks
that make sense are the permission checks applied to ptrace_attach. The
checks are simple per process, and won't cause nasty surprises for people
coming from less capable unices.
Unfortunately there is one case that the current ptrace_attach test does
not cover: Zombies and kernel threads. Single stepping those kinds of
processes is impossible. Being able to see which file descriptors are open
on these tasks is important to lsof, fuser and friends. So for these
special processes I made the rule you can't find out unless you have
CAP_SYS_PTRACE.
These proc permission checks should now conform to the principle of least
surprise. As well as using much less code to implement :)
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The code doesn't need to sleep to when making this check so I can just do the
comparison and not worry about the reference counts.
TODO: While looking at this I realized that my original cleanup did not push
the permission check far enough down into the stack. The call of
proc_check_dentry_visible needs to move out of the generic proc
readlink/follow link code and into the individual get_link instances.
Otherwise the shared resources checks are not quite correct (shared
files_struct does not require a shared fs_struct), and there are races with
unshare.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
that they work with struct pid instead of struct task_ref.
Mostly this is a straight 1-1 substitution.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Every inode in /proc holds a reference to a struct task_struct. If a
directory or file is opened and remains open after the the task exits this
pinning continues. With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.
Normally I would figure a reasonable per user process limit is about 100
processes. With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.
This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer. The so the pinning of dead
tasks does not happen.
The code now has to contend with the fact that the task may now exit at any
time. Which is a little but not muh more complicated.
With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer. Much better.
[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently in /proc at several different places we define buffers to hold a
process id, or a file descriptor . In most of them we use either a hard coded
number or a different define. Modify them all to use PROC_NUMBUF, so the code
has a chance of being maintained.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Like the bug Oleg spotted in first_tid there was also a small off by one
error in first_tgid, when a seek was done on the /proc directory. This
fixes that and changes the code structure to make it a little more obvious
what is going on.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since we no longer need the tasklist_lock for get_task_struct the lookup
methods no longer need the tasklist_lock.
This just depends on my previous patch that makes get_task_struct() rcu
safe.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We don't need the tasklist_lock to safely iterate through processes
anymore.
This depends on my previous to task patches that make get_task_struct rcu
safe, and that make next_task() rcu safe. I haven't gotten
first_tid/next_tid yet only because next_thread is missing an
rcu_dereference.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a couple of problems this patch addresses.
- /proc/<tgid>/task currently does not work correctly if you stop reading
in the middle of a directory.
- /proc/ currently requires a full pass through the task list with
the tasklist lock held, to determine there are no more processes to read.
- The hand rolled integer to string conversion does not properly running
out of buffer space.
- We seem to be batching reading of pids from the tasklist without reason,
and complicating the logic of the code.
This patch addresses that by changing how tasks are processed. A
first_<task_type> function is built that handles restarts, and a
next_<task_type> function is built that just advances to the next task.
first_<task_type> when it detects a restart usually uses find_task_by_pid. If
that doesn't work because there has been a seek on the directory, or we have
already given a complete directory listing, it first checks the number tasks
of that type, and only if we are under that count does it walk through all of
the tasks to find the one we are interested in.
The code that fills in the directory is simpler because there is only a single
for loop.
The hand rolled integer to string conversion is replaced by snprintf which
should handle the the out of buffer case correctly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
proc_lookup and task exiting are not synchronized, although some of the
previous code may have suggested that. Every time before we reuse a dentry
namei.c calls d_op->derevalidate which prevents us from reusing a stale dcache
entry. Unfortunately it does not prevent us from returning a stale dcache
entry. This race has been explicitly plugged in proc_pid_lookup but there is
nothing to confine it to just that proc lookup function.
So to prevent the race I call revalidate explictily in all of the proc lookup
functions after I call d_add, and report an error if the revalidate does not
succeed.
Years ago Al Viro did something similar but those changes got lost in the
churn.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To keep the dcache from filling up with dead /proc entries we flush them on
process exit. However over the years that code has gotten hairy with a
dentry_pointer and a lock in task_struct and misdocumented as a correctness
feature.
I have rewritten this code to look and see if we have a corresponding entry in
the dcache and if so flush it on process exit. This removes the extra fields
in the task_struct and allows me to trivially handle the case of a
/proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All of the functions for proc_maps_operations are already defined in
task_mmu.c so move the operations structure to keep the functionality
together.
Since task_nommu.c implements a dummy version of /proc/<pid>/maps give it a
simplified version of proc_maps_operations that it can modify to best suit its
needs.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use getattr to get an accurate link count when needed. This is cheaper and
more accurate than trying to derive it by walking the thread list of a
process.
Especially as it happens when needed stat instead of at readdir time.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Long ago and far away in 2.2 we started checking to ensure the files we
displayed in /proc were visible to the current process. It was an
unsophisticated time and no one was worried about functions full of FIXMES in
a stable kernel. As time passed the function became sacred and was enshrined
in the shrine of how things have always been. The fixes came in but only to
keep the function working no one really remembering or documenting why we did
things that way.
The intent and the functionality make a lot of sense. Don't let /proc be an
access point for files a process can see no other way. The implementation
however is completely wrong.
We are currently checking the root directories of the two processes, we are
not checking the actual file descriptors themselves.
We are strangely checking with a permission method instead of just when we use
the data.
This patch fixes the logic to actually check the file descriptors and make a
note that implementing a permission method for this part of /proc almost
certainly indicates a bug in the reasoning.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The inode operations only exist to support the proc_permission function.
Currently mem_read and mem_write have all the same permission checks as
ptrace. The fs check makes no sense in this context, and we can trivially get
around it by calling ptrace.
So simply the code by killing the strange weird case.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
First we can access every /proc/<tgid>/task/<pid> directory as /proc/<pid> so
proc_task_permission is not usefully limiting visibility.
Second having related filesystems information should have nothing to do with
process visibility. kill does not implement any checks like that.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sole renaming use of proc_inode.type is to discover the file descriptor
number, so just store the file descriptor number and don't wory about
processing this field. This removes any /proc limits on the maximum number of
file descriptors, and clears the path to make the hard coded /proc inode
numbers go away.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently in /proc if the task is dumpable all of files are owned by the tasks
effective users. Otherwise the files are owned by root. Unless it is the
/proc/<tgid>/ or /proc/<tgid>/task/<pid> directory in that case we always make
the directory owned by the effective user.
However the special case for directories is pointless except as a way to read
the effective user, because the permissions on both of those directories are
world readable, and executable.
/proc/<tgid>/status provides a much better way to read a processes effecitve
userid, so it is silly to try to provide that on the directory.
So this patch simplifies the code by removing a pointless special case and
gets us one step closer to being able to remove the hard coded /proc inode
numbers.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The removed fields are already set by proc_alloc_inode. Initializing them in
proc_alloc_inode implies they need it for proper cleanup. At least ei->pde
was not set on all paths making it look like proc_alloc_inode was buggy. So
just remove the redundant assignments.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We already call everything except do_proc_readlink outside of the BKL in
proc_pid_followlink, and there appears to be nothing in do_proc_readlink that
needs any special protection.
So remove this leftover from one of the BKL cleanup efforts.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I noticed recently that my CONFIG_CRYPTO_MD5 turned into a y again instead
of m. It turns out that CONFIG_NFSD_V4 is selecting it to be y even though
I've chosen to compile nfsd as a module.
In general when we have a bool sitting under a tristate it is better to
select things you need from the tristate rather than the bool since that
allows the things you select to be modules.
The following patch does it for nfsd.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add the IOCTLs of the Gigaset drivers to compat_ioctl.h in order to make
them available for 32 bit programs on 64 bit platforms. Please merge.
Signed-off-by: Hansjoerg Lipp <hjlipp@web.de>
Acked-by: Tilman Schmidt <tilman@imap.cc>
Cc: Karsten Keil <kkeil@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds "-o bh" option to force use of buffer_heads. This option
is needed when we make "nobh" as default - and if we run into problems.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a /proc/<pid>/attr/keycreate entry that stores the appropriate context for
newly-created keys. Modify the selinux_key_alloc hook to make use of the new
entry. Update the flask headers to include a new "setkeycreate" permission
for processes. Update the flask headers to include a new "create" permission
for keys. Use the create permission to restrict which SIDs each task can
assign to newly-created keys. Add a new parameter to the security hook
"security_key_alloc" to indicate whether it is being invoked by the kernel, or
from userspace. If it is being invoked by the kernel, the security hook
should never fail. Update the documentation to reflect these changes.
Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the combination of list_del(A) and list_add(A, B) to
list_move(A, B) under fs/.
Cc: Ian Kent <raven@themaw.net>
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Hans Reiser <reiserfs-dev@namesys.com>
Cc: Urban Widmark <urban@teststation.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the combination of list_del(A) and list_add(A, B) to
list_move(A, B).
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts list_add(A, B.prev) to list_add_tail(A, &B) for
readability.
Acked-by: Karsten Keil <kkeil@suse.de>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Acked-by: Jan Kara <jack@suse.cz>
AOLed-by: David Woodhouse <dwmw2@infradead.org>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
CIFS takes/releases f_owner.lock - why? It does not change anything in the
fowner state. Remove this locking.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We lose property writing functionality for the time being, but
that will be easy to add back. The code and framework is so
much simpler now.
Signed-off-by: David S. Miller <davem@davemloft.net>
binfmt_flat.c calls set_personality with PER_LINUX as the personality.
On the arm architecture this results in the program running in 26bit
usermode. PER_LINUX_32BIT should be used instead. This doesn't affect
other architectures that use binfmt_flat.
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Trond had apparently merged the same patch twice, causing a duplicate
include of the "internal.h" file, with resulting obvious confusion.
Tssk. I'm the only one allowed to send out trees that don't even
compile! Who does this Trond guy think he is?
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (51 commits)
nfs: remove nfs_put_link()
nfs-build-fix-99
git-nfs-build-fixes
Merge branch 'odirect'
NFS: alloc nfs_read/write_data as direct I/O is scheduled
NFS: Eliminate nfs_get_user_pages()
NFS: refactor nfs_direct_free_user_pages
NFS: remove user_addr, user_count, and pos from nfs_direct_req
NFS: "open code" the NFS direct write rescheduler
NFS: Separate functions for counting outstanding NFS direct I/Os
NLM: Fix reclaim races
NLM: sem to mutex conversion
locks.c: add the fl_owner to nlm_compare_locks
NFS: Display the chosen RPCSEC_GSS security flavour in /proc/mounts
NFS: Split fs/nfs/inode.c
NFS: Fix typo in nfs_do_clone_mount()
NFS: Fix compile errors introduced by referrals patches
NFSv4: Ensure that referral mounts bind to a reserved port
NFSv4: A root pathname is sent as a zero component4
NFSv4: Follow a referral
...
* master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb: (244 commits)
V4L/DVB (4210b): git-dvb: tea575x-tuner build fix
V4L/DVB (4210a): git-dvb versus matroxfb
V4L/DVB (4209): Added some BTTV PCI IDs for newer boards
Fixes some sync issues between V4L/DVB development and GIT
V4L/DVB (4206): Cx88-blackbird: always set encoder height based on tvnorm->id
V4L/DVB (4205): Merge tda9887 module into tuner.
V4L/DVB (4203): Explicitly set the enum values.
V4L/DVB (4202): allow selecting CX2341x port mode
V4L/DVB (4200): Disable bitrate_mode when encoding mpeg-1.
V4L/DVB (4199): Add cx2341x-specific control array to cx2341x.c
V4L/DVB (4198): Avoid newer usages of obsoleted experimental MPEGCOMP API
V4L/DVB (4197): Port new MPEG API to saa7134-empress with saa6752hs
V4L/DVB (4196): Port cx88-blackbird to the new MPEG API.
V4L/DVB (4193): Update cx2341x fw encoding API doc.
V4L/DVB (4192): Use control helpers for saa7115, cx25840, msp3400.
V4L/DVB (4191): Add CX2341X MPEG encoder module.
V4L/DVB (4190): Add helper functions for control processing to v4l2-common.
V4L/DVB (4189): Add videodev support for VIDIOC_S/G/TRY_EXT_CTRLS.
V4L/DVB (4188): Add new MPEG control/ioctl definitions to videodev2.h
V4L/DVB (4186): Add support for the DNTV Live! mini DVB-T card.
...
likely profiling shows that the following is a miss.
After boot:
[+- ] Type | # True | # False | Function:Filename@Line
+unlikely | 1074| 0 prune_dcache()@:fs/dcache.c@409
After a bonnie++ run:
+unlikely | 66716| 19584 prune_dcache()@:fs/dcache.c@409
So remove it.
Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Things which force me think a little: why so?
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the linkat() syscall was added the flag parameter was added in the
last minute but it wasn't used so far. The following patch should change
that. My tests show that this is all that's needed.
If OLDNAME is a symlink setting the flag causes linkat to follow the
symlink and create a hardlink with the target. This is actually the
behavior POSIX demands for link() as well but Linux wisely does not do
this. With this flag (which will most likely be in the next POSIX
revision) the programmer can choose the behavior, defaulting to the safe
variant. As a side effect it is now possible to implement a
POSIX-compliant link(2) function for those who are interested.
touch file
ln -s file symlink
linkat(fd, "symlink", fd, "newlink", 0)
-> newlink is hardlink of symlink
linkat(fd, "symlink", fd, "newlink", AT_SYMLINK_FOLLOW)
-> newlink is hardlink of file
The value of AT_SYMLINK_FOLLOW is determined by the definition we already
use in glibc.
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If you do a poll() call with timeout -1, the wait will be a big number
(depending on HZ) instead of infinite wait, since -1 is passed to the
msecs_to_jiffies function.
Signed-off-by: Frode Isaksen <frode.isaksen@gmail.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
%s is only valid if array is known to contain NUL or precision is given and
does not exceed the size of array.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update smbiod to use kthread instead of deprecated kernel_thread.
[akpm@osdl.org: cleanup]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
coverity found two needless checks in vfs_inode.c (cid #1165 and #1164)
In both cases inode is always NULL when we goto error; either because it
is still initialized to NULL or is set to NULL explicitly. This patch
simply removes these checks to save some code.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Acked-by: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@lanl.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
VFS uses current->files pointer as lock owner ID, and it wouldn't be
prudent to expose this value to userspace. So scramble it with XTEA using
a per connection random key, known only to the kernel. Only one direction
needs to be implemented, since the ID is never sent in the reverse
direction.
The XTEA algorithm is implemented inline since it's simple enough to do so,
and this adds less complexity than if the crypto API were used.
Thanks to Jesper Juhl for the idea.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add synchronous request interruption. This is needed for file locking
operations which have to be interruptible. However filesystem may implement
interruptibility of other operations (e.g. like NFS 'intr' mount option).
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename the 'interrupted' flag to 'aborted', since it indicates exactly that,
and next patch will introduce an 'interrupted' flag for a
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All POSIX locks owned by the current task are removed on close(). If the
FLUSH request resulting initiated by close() fails to reach userspace, there
might be locks remaining, which cannot be removed.
The only reason it could fail, is if allocating the request fails. In this
case use the request reserved for RELEASE, or if that is currently used by
another FLUSH, wait for it to become available.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds POSIX file locking support to the fuse interface.
This implementation doesn't keep any locking state in kernel. Unlocking on
close() is handled by the FLUSH message, which now contains the lock owner id.
Mandatory locking is not supported. The filesystem may enfoce mandatory
locking in userspace if needed.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a control filesystem to fuse, replacing the attributes currently exported
through sysfs. An empty directory '/sys/fs/fuse/connections' is still created
in sysfs, and mounting the control filesystem here provides backward
compatibility.
Advantages of the control filesystem over the previous solution:
- allows the object directory and the attributes to be owned by the
filesystem owner, hence letting unpriviled users abort the
filesystem connection
- does not suffer from module unload race
[akpm@osdl.org: fix this fs for recent dhowells depredations]
[akpm@osdl.org: fix 64-bit printk warnings]
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't put requests into the background when a fatal interrupt occurs while the
request is in userspace. This removes a major wart from the implementation.
Backgrounding of requests was introduced to allow breaking of deadlocks.
However now the same can be achieved by aborting the filesystem through the
'abort' sysfs attribute.
This is a change in the interface, but should not cause problems, since these
kinds of deadlocks never happen during normal operation.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I've found a case where invalid dentrys in a mount tree, waiting to be
cleaned up by d_invalidate, prevent the expected expire.
In this case dentrys created during a lookup for which a mount fails or has
no entry in the mount map contribute to the d_count of the parent dentry.
These dentrys may not be invalidated prior to comparing the interanl usage
count of valid autofs dentrys against the dentry d_count which makes a
mount tree appear busy so it doesn't expire.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In the course of trying to track down a bug where a file mtime was not
being updated correctly, it was discovered that the m/ctime updates were
not quite being handled correctly for ftruncate() calls.
Quoth SUSv3:
open(2):
If O_TRUNC is set and the file did previously exist, upon
successful completion, open() shall mark for update the st_ctime
and st_mtime fields of the file.
truncate(2):
Upon successful completion, if the file size is changed, this
function shall mark for update the st_ctime and st_mtime fields
of the file, and the S_ISUID and S_ISGID bits of the file mode
may be cleared.
ftruncate(2):
Upon successful completion, if fildes refers to a regular file,
the ftruncate() function shall mark for update the st_ctime and
st_mtime fields of the file and the S_ISUID and S_ISGID bits of
the file mode may be cleared. If the ftruncate() function is
unsuccessful, the file is unaffected.
The open(O_TRUNC) and truncate cases were being handled correctly, but the
ftruncate case was being handled like the truncate case. The semantics of
truncate and ftruncate don't quite match, so ftruncate needs to be handled
slightly differently.
The attached patch addresses this issue for ftruncate(2).
My thanx to Stephen Tweedie and Trond Myklebust for their help in
understanding the situation and semantics.
Signed-off-by: Peter Staubach <staubach@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The variables nlen and rlen are defined/initialized but not used in
ext3_add_entry().
Signed-off-by: Johann Lombardi <johann.lombardi@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A few days ago Arjan signaled a lockdep red flag on epoll locks, and
precisely between the epoll's device structure lock (->lock) and the wait
queue head lock (->lock).
Like I explained in another email, and directly to Arjan, this can't happen
in reality because of the explicit check at eventpoll.c:592, that does not
allow to drop an epoll fd inside the same epoll fd. Since lockdep is
working on per-structure locks, it will never be able to know of policies
enforced in other parts of the code.
It was decided time ago of having the ability to drop epoll fds inside
other epoll fds, that triggers a very trick wakeup operations (due to
possibly reentrant callback-driven wakeups) handled by the
ep_poll_safewake() function. While looking again at the code though, I
noticed that all the operations done on the epoll's main structure wait
queue head (->wq) are already protected by the epoll lock (->lock), so that
locked-style functions can be used to manipulate the ->wq member. This
makes both a lock-acquire save, and lockdep happy.
Running totalmess on my dual opteron for a while did not reveal any problem
so far:
http://www.xmailserver.org/totalmess.c
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes EXT2_DEBUG work again. Due to lack of proper include
file, EXT2_DEBUG was undefined in bitmap.c and ext2_count_free() is left
out. Moved to balloc.c and removed bitmap.c entirely.
Second, debug versions of ext2_count_free_{inodes/blocks} reacquires
superblock lock. Moved lock into callers.
Signed-off-by: Val Henson <val_henson@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make procfs non-optional unless EMBEDDED is set, just like sysfs. procfs
is already de facto required for a large subset of Linux functionality.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Convert the ext3 in-kernel filesystem blocks to ext3_fsblk_t. Convert the
rest of all unsigned long type in-kernel filesystem blocks to ext3_fsblk_t,
and replace the printk format string respondingly.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some of the in-kernel ext3 block variable type are treated as signed 4 bytes
int type, thus limited ext3 filesystem to 8TB (4kblock size based). While
trying to fix them, it seems quite confusing in the ext3 code where some
blocks are filesystem-wide blocks, some are group relative offsets that need
to be signed value (as -1 has special meaning). So it seem saner to define
two types of physical blocks: one is filesystem wide blocks, another is
group-relative blocks. The following patches clarify these two types of
blocks in the ext3 code, and fix the type bugs which limit current 32 bit ext3
filesystem limit to 8TB.
With this series of patches and the percpu counter data type changes in the mm
tree, we are able to extend exts filesystem limit to 16TB.
This work is also a pre-request for the recent >32 bit ext3 work, and makes
the kernel to able to address 48 bit ext3 block a lot easier: Simply redefine
ext3_fsblk_t from unsigned long to sector_t and redefine the format string for
ext3 filesystem block corresponding.
Two RFC with a series patches have been posted to ext2-devel list and have
been reviewed and discussed:
http://marc.theaimsgroup.com/?l=ext2-devel&m=114722190816690&w=2http://marc.theaimsgroup.com/?l=ext2-devel&m=114784919525942&w=2
Patches are tested on both 32 bit machine and 64 bit machine, <8TB ext3 and
>8TB ext3 filesystem(with the latest to be released e2fsprogs-1.39). Tests
includes overnight fsx, tiobench, dbench and fsstress.
This patch:
Defines ext3_fsblk_t and ext3_grpblk_t, and the printk format string for
filesystem wide blocks.
This patch classifies all block group relative blocks, and ext3_fsblk_t blocks
occurs in the same function where used to be confusing before. Also include
kernel bug fixes for filesystem wide in-kernel block variables. There are
some fileystem wide blocks are treated as int/unsigned int type in the kernel
currently, especially in ext3 block allocation and reservation code. This
patch fixed those bugs by converting those variables to ext3_fsblk_t(unsigned
long) type.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The problem is that when we write to a file, the copy from userspace to
pagecache is first done with preemption disabled, so if the source address is
not immediately available the copy fails *and* *zeros* *the* *destination*.
This is a problem because a concurrent read (which admittedly is an odd thing
to do) might see zeros rather that was there before the write, or what was
there after, or some mixture of the two (any of these being a reasonable thing
to see).
If the copy did fail, it will immediately be retried with preemption
re-enabled so any transient problem with accessing the source won't cause an
error.
The first copying does not need to zero any uncopied bytes, and doing so
causes the problem. It uses copy_from_user_atomic rather than copy_from_user
so the simple expedient is to change copy_from_user_atomic to *not* zero out
bytes on failure.
The first of these two patches prepares for the change by fixing two places
which assume copy_from_user_atomic does zero the tail. The two usages are
very similar pieces of code which copy from a userspace iovec into one or more
page-cache pages. These are changed to remove the assumption.
The second patch changes __copy_from_user_inatomic* to not zero the tail.
Once these are accepted, I will look at similar patches of other architectures
where this is important (ppc, mips and sparc being the ones I can find).
This patch:
There is a problem with __copy_from_user_inatomic zeroing the tail of the
buffer in the case of an error. As it is called in atomic context, the error
may be transient, so it results in zeros being written where maybe they
shouldn't be.
In the usage in filemap, this opens a window for a well timed read to see data
(zeros) which is not consistent with any ordering of reads and writes.
Most cases where __copy_from_user_inatomic is called, a failure results in
__copy_from_user being called immediately. As long as the latter zeros the
tail, the former doesn't need to. However in *copy_from_user_iovec
implementations (in both filemap and ntfs/file), it is assumed that
copy_from_user_inatomic will zero the tail.
This patch removes that assumption, so that after this patch it will
be safe for copy_from_user_inatomic to not zero the tail.
This patch also adds some commentary to filemap.h and asm-i386/uaccess.h.
After this patch, all architectures that might disable preempt when
kmap_atomic is called need to have their __copy_from_user_inatomic* "fixed".
This includes
- powerpc
- i386
- mips
- sparc
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was reported as Debian bug #336604.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The variable i is guaranteed to be the same as db_count given the previous
for loop. So get rid of it since it's dead code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If ext3 filesystem is larger than 2TB, and sector_t is a u32 (i.e.
CONFIG_LBD not defined in the kernel), the calculation of the disk sector
will overflow. Add check at ext3_fill_super() and ext3_group_extend() to
prevent mount/remount/resize >2TB ext3 filesystem if sector_t size is 4
bytes.
Verified this patch on a 32 bit platform without CONFIG_LBD defined
(sector_t is 32 bits long), mount refuse to mount a 10TB ext3.
Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
"Move" "common code" out to PTR_NOD, which does the conversion from private
pointer to node number. This is to reduce potential casting/conversion errors
due to redundancy. (The naming PTR_NOD follows PTR_ERR, turning a pointer
into xyz.)
[akpm@osdl.org: cleanups]
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
tchars is not '\0'-terminated so the strtoul may run into problems. Fix that.
Also make tchars as big as a long in hexadecimal form would take rather than
just 16.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make two needlessly global functions static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In ufs code there is function: ubh_ll_rw_block, it has parameter how many
ufs_buffer_head it should handle, but it always called with "1" on the place
of this parameter. This patch removes unused parameter of "ubh_ll_wr_block".
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ufs super block contains some statistic about file systems, like amount of
directories, free blocks, inodes and so on.
UFS1 hold this information in one location and uses 32bit integers for such
information, UFS2 hold statistic in another location and uses 64bit integers.
There is transition variant, if UFS1 has type 44BSD and flags field in super
block has some special value this mean that we work with statistic like UFS2
does. and this also means that nobody care about old(UFS1) statistic.
So if start fsck against such file system, after usage linux ufs driver, it
found error: at now only UFS1 like statistic is updated.
This patch should fix this. Also it contains some minor cleanup: CodingSytle
and remove unused variables.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently ufs doesn't support "fsync", this make some applications unhappy,
for example vim. This patch fixes this situation.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Super block of UFS usually has size >512, because of fragment size may be 512,
this cause some problems.
Currently, there are two methods to work with ufs super block:
1) split structure which describes ufs super blocks into structures with
size <=512
2) use one structure which describes ufs super block, and hope that array
of "buffer_head" which holds "super block", has such construction:
bh[n]->b_data + bh[n]->b_size == bh[n + 1]->b_data
The second variant may cause some problems in the future, and usage of two
variants cause unnecessary code duplication.
This patch remove the second variant. Also patch contains some CodingStyle
fixes.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes two bugs, which introduced by previous patches:
1) Missed "brelse"
2) Sometimes "baseblk" may be wrongly calculated, if i_size is equal to
zero, which lead infinite cycle in "mpage_writepages".
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fs/ufs/super.c: In function `ufs_print_super_stuff':
fs/ufs/super.c:103: warning: unsigned int format, different type arg (arg 2) fs/ufs/super.c: In function `ufs2_print_super_stuff': fs/ufs/super.c:147: warning: unsigned int format, different type arg (arg 2) fs/ufs/super.c: In function `ufs_print_cylinder_stuff':
fs/ufs/super.c:175: warning: unsigned int format, different type arg (arg 2)
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently if we allocate several "metadata" blocks (pointers to indirect
blocks for example), we fill with zeroes only the first block. This cause
some problems in "truncate" function. Also this patch remove some unused
arguments from several functions and add comments.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ufs_free_blocks function looks now in so way:
if (err)
goto failed;
lock_super();
failed:
unlock_super();
So if error happen we'll unlock not locked super.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At now UFS code uses DQUOT_* mechanism, but it also update inode->i_blocks
manually, this cause wrong i_blocks value.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch make little optimization of ufs_find_entry like "ext2" does. Save
number of page and reuse it again in the next call.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently to turn on debug mode "user" has to edit ~10 files, to turn off he
has to do it again.
This patch introduce such changes:
1)turn on(off) debug messages via ".config"
2)remove unnecessary duplication of code
3)make "UFSD" macros more similar to function
4)fix some compiler warnings
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
To find new bugs, I suggest revert this patch:
http://lkml.org/lkml/2006/1/31/275 in -mm tree.
So others can test "write support" of UFS.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The writing to UFS file system with block/fragment!=8 may cause bogus
behaviour. The problem in "ufs_bitmap_search" function, which doesn't work
correctly in "block/fragment!=8" case. The idea is stolen from BSD code.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are two ugly macros in ufs code:
#define UCPI_UBH ((struct ufs_buffer_head *)ucpi)
#define USPI_UBH ((struct ufs_buffer_head *)uspi)
when uspi looks like
struct {
struct ufs_buffer_head ;
}
and USPI_UBH has some sence,
ucpi looks like
struct {
struct not_ufs_buffer_head;
}
To prevent bugs in future, this patch convert macros to inline function and
fix "ucpi" structure.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change function in fs/ufs/dir.c and fs/ufs/namei.c to work with pages
instead of straight work with blocks. It fixed such bugs:
* for i in `seq 1 1000`; do touch $i; done - crash system
* mkdir create directory without "." and ".." entries
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This series of patches finished "bugs fixing" mentioned
here http://lkml.org/lkml/2006/1/31/275 .
The main bugs:
* for i in `seq 1 1000`; do touch $i; done - crash system
* mkdir create directory without "." and ".." entries
The suggested solution is work with page cache instead of straight work
with blocks. Such solution has following advantages
* reduce code size and its complexity
* some global locks go away
* fix bugs
The most part of code is stolen from ext2, because of it has similar
directory structure.
Patches testes with UFS1 and UFS2 file systems.
This patch installs i_mapping->a_ops for directory inodes and removes some
duplicated code.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
First of all some necessary notes about UFS by it self: To avoid waste of disk
space the tail of file consists not from blocks (which is ordinary big enough,
16K usually), it consists from fragments(which is ordinary 2K). When file is
growing its tail occupy 1 fragment, 2 fragments... At some stage decision to
allocate whole block is made and all fragments are moved to one block.
How this situation was handled before:
ufs_prepare_write
->block_prepare_write
->ufs_getfrag_block
->...
->ufs_new_fragments:
bh = sb_bread
bh->b_blocknr = result + i;
mark_buffer_dirty (bh);
This is wrong solution, because:
- it didn't take into consideration that there is another cache: "inode page
cache"
- because of sb_getblk uses not b_blocknr, (it uses page->index) to find
certain block, this breaks sb_getblk.
How this situation is handled now: we go though all "page inode cache", if
there are no such page in cache we load it into cache, and change b_blocknr.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* After block allocation, we map it on the same "address" as 8 others
blocks
* We nullify block several times: once in ufs/block.c and once in
block_*write_full_page, and use different "caches" for this.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, ufs write support have two sets of problems: work with files and
work with directories.
This series of patches should solve the first problem.
This patch is similar to http://lkml.org/lkml/2006/1/17/61 this patch
complements it.
The situation the same: in ufs_trunc_(not direct), we read block, check if
count of links to it is equal to one, if so we finish cycle, if not
continue. Because of "count of links" always >=2 this operation cause
infinite cycle and hang up the kernel.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix of some spelling errors in fs/freevxfs error messages and comments
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix various problems with nfs4 disabled. And various other things.
In file included from fs/nfs/inode.c:50:
fs/nfs/internal.h:24: error: static declaration of 'nfs_do_refmount' follows non-static declaration
include/linux/nfs_fs.h:320: error: previous declaration of 'nfs_do_refmount' was here
fs/nfs/internal.h:65: warning: 'struct nfs4_fs_locations' declared inside parameter list
fs/nfs/internal.h:65: warning: its scope is only this definition or declaration, which is probably not what you want
fs/nfs/internal.h: In function 'nfs4_path':
fs/nfs/internal.h:97: error: 'struct nfs_server' has no member named 'mnt_path'
fs/nfs/inode.c: In function 'init_once':
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'open_states'
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'delegation'
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'delegation_state'
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'rwsem'
distcc[26452] ERROR: compile fs/nfs/inode.c on g5/64 failed
make[1]: *** [fs/nfs/inode.o] Error 1
make: *** [fs/nfs/inode.o] Error 2
make: *** Waiting for unfinished jobs....
In file included from fs/nfs/nfs3xdr.c:26:
fs/nfs/internal.h:24: error: static declaration of 'nfs_do_refmount' follows non-static declaration
include/linux/nfs_fs.h:320: error: previous declaration of 'nfs_do_refmount' was here
fs/nfs/internal.h:65: warning: 'struct nfs4_fs_locations' declared inside parameter list
fs/nfs/internal.h:65: warning: its scope is only this definition or declaration, which is probably not what you want
fs/nfs/internal.h: In function 'nfs4_path':
fs/nfs/internal.h:97: error: 'struct nfs_server' has no member named 'mnt_path'
distcc[26486] ERROR: compile fs/nfs/nfs3xdr.c on g5/64 failed
make[1]: *** [fs/nfs/nfs3xdr.o] Error 1
make: *** [fs/nfs/nfs3xdr.o] Error 2
In file included from fs/nfs/nfs3proc.c:24:
fs/nfs/internal.h:24: error: static declaration of 'nfs_do_refmount' follows non-static declaration
include/linux/nfs_fs.h:320: error: previous declaration of 'nfs_do_refmount' was here
fs/nfs/internal.h:65: warning: 'struct nfs4_fs_locations' declared inside parameter list
fs/nfs/internal.h:65: warning: its scope is only this definition or declaration, which is probably not what you want
fs/nfs/internal.h: In function 'nfs4_path':
fs/nfs/internal.h:97: error: 'struct nfs_server' has no member named 'mnt_path'
distcc[26469] ERROR: compile fs/nfs/nfs3proc.c on bix/32 failed
make[1]: *** [fs/nfs/nfs3proc.o] Error 1
make: *** [fs/nfs/nfs3proc.o] Error 2
**FAILED**
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Cc: Andy Adamson <andros@citi.umich.edu>
Cc: Chuck Lever <cel@netapp.com>
Cc: David Howells <dhowells@redhat.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Manoj Naik <manoj@almaden.ibm.com>
Cc: Marc Eshel <eshel@almaden.ibm.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The ioctl were removed by:
V4L/DVB (3727): Remove DMX_GET_EVENT and associated data structures
due to the ioctl DMX_GET_EVENT has never been implemented, and also
scrambling events can't be generated in a useful way by the hardware.
This patch removes the corresponding entry at fs/compat_ioctl.c
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Re-arrange the logic in the NFS direct I/O path so that nfs_read/write_data
structs are allocated just before they are scheduled, rather than
allocating them all at once before we start scheduling requests.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Neil Brown observed that the kmalloc() in nfs_get_user_pages() is more
likely to fail if the I/O is large enough to require the allocation of more
than a single page to keep track of all the pinned pages in the user's
buffer.
Instead of tracking one large page array per dreq/iocb, track pages per
nfs_read/write_data, just like the cached I/O path does. An array for
pages is already allocated for us by nfs_readdata_alloc() (and the write
and commit equivalents).
This is also required for adding support for vectored I/O to the NFS direct
I/O path.
The original reason to pin the user buffer and allocate all the NFS data
structures before trying to schedule I/O was to ensure all needed resources
are allocated on the client before starting to send requests. This reduces
the chance that resource exhaustion on the client will cause a short read
or write.
On the other hand, for an application making very large application I/O
requests, this means that it will be nearly impossible for the application
to make forward progress on a resource-limited client.
Thus, moving the buffer pinning functionality into the I/O scheduling
loops should be good for scalability. The next patch will do the same for
NFS data structure allocation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up and fix a minor bug: the logic was dirtying page cache pages on
both read and write operations.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the user_addr, user_count, and pos parameters explicit to the
scheduler routines, and remove the fields from nfs_direct_req. The
iovec API will be passing in a series of these, not just one set.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An NFSv3/v4 client must reschedule on-the-wire writes if the writes are
UNSTABLE, and the server reboots before the client can complete a
subsequent COMMIT request.
To support direct asynchronous scatter-gather writes, the write
rescheduler in fs/nfs/direct.c must not depend on the I/O parameters
in the controlling nfs_direct_req structure. iovecs can be somewhat
arbitrarily complex, so there could be an unbounded amount of information
to save for a rarely encountered requirement.
Refactor the direct write rescheduler so it uses information from each
nfs_write_data structure to reschedule writes, instead of caching that
information in the controlling nfs_direct_req structure.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out the logic that increments and decrements the outstanding I/O
count. This will be a commonly used bit of code in upcoming patches.
Also make this an atomic_t again, since it will be very often manipulated
outside dreq->spin lock.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Otherwise we could be racing with truncate/mapping removal.
Problem found/fixed by Nick Piggin <npiggin@suse.de>, logic rewritten
by me.
Signed-off-by: Jens Axboe <axboe@suse.de>
A process flag to indicate whether we are doing sync io is incredibly
ugly. It also causes performance problems when one does a lot of async
io and then proceeds to sync it. Part of the io will go out as async,
and the other part as sync. This causes a disconnect between the
previously submitted io and the synced io. For io schedulers such as CFQ,
this will cause us lost merges and suboptimal behaviour in scheduling.
Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
the O_DIRECT path just directly indicate that the writes are sync
by using WRITE_SYNC instead.
Signed-off-by: Jens Axboe <axboe@suse.de>
Sometimes partitions claim to be larger than the reported capacity of a
disk device. This patch makes the kernel warn about those partitions.
We still permit these patitions to be used. Quoting Andries Brouwer
<Andries.Brouwer@cwi.nl>:
Case 1: The kernel is mistaken about the size of the disk. (There are
commands to clip a disk to a certain capacity, there are jumpers to tell a
disk that it should report a certain capacity etc. Usually this is because
of BIOS bugs. In bad cases the machine will crash in the BIOS and hence fail
to boot if the disk reports full capacity.) In such cases actually accessing
the blocks of the partition may work fine, or may work fine after running an
unclip utility. I wrote "setmax" some years ago precisely for this reason.
Case 2: There was a messy partition table (maybe just a rounding error) but
the actual filesystem on the partition is contained in the physical disk.
Now using the filesystem goes without problem.
Case 3: Both partition and filesystem extend beyond the end of the disk. In
forensic or debugging situations one often uses a copy of the start of a
disk. Now access beyond the end gives an expected I/O error.
Signed-off-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Stephen Cameron <steve.cameron@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Split the checkpoint list of the transaction into two lists. In the first
list we keep the buffers that need to be submitted for IO. In the second
list are kept buffers that were already submitted and we just have to wait
for the IO to complete. This should simplify a handling of checkpoint
lists a bit and can eventually be also a performance gain.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1. We can use list_replace_init() instead.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The percpu counter data type are changed in this set of patches to support
more users like ext3 who need more than 32 bit to store the free blocks
total in the filesystem.
- Generic perpcu counters data type changes. The size of the global counter
and local counter were explictly specified using s64 and s32. The global
counter is changed from long to s64, while the local counter is changed from
long to s32, so we could avoid doing 64 bit update in most cases.
- Users of the percpu counters are updated to make use of the new
percpu_counter_init() routine now taking an additional parameter to allow
users to pass the initial value of the global counter.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Do a CodingStyle cleanup of fs/binfmt_elf.c and also remove some pointless
casts of kmalloc() return values in the same file.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Steven Rostedt <rostedt@goodmis.org> points out that `rsv' here is usually
NULL, so we should avoid calling kfree().
Also, fix up some nearby whitespace damage.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There are a couple of places where JBD has to check to see whether an unneeded
memory allocation was performed. Usually it _was_ needed, so we end up
calling kfree(NULL). We can micro-optimise that by checking the pointer
before calling kfree().
Thanks to Steven Rostedt <rostedt@goodmis.org> for identifying this.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix possible assertion failure in journal_commit_transaction() on
jh->b_next_transaction == NULL (when we are processing BJ_Forget list and
buffer is not jbddirty).
!jbddirty buffers can be placed on BJ_Forget list for example by
journal_forget() or by __dispose_buffer() - generally such buffer means
that it has been freed by this transaction.
Freed buffers should not be reallocated until the transaction has committed
(that's why we have the assertion there) but they *can* be reallocated when
the transaction has already been committed to disk and we are just
processing the BJ_Forget list (as soon as we remove b_committed_data from
the bitmap bh, ext3 will be able to reallocate buffers freed by the
committing transaction). So we have to also count with the case that the
buffer has been reallocated and b_next_transaction has been already set.
And one more subtle point: it can happen that we manage to reallocate the
buffer and also mark it jbddirty. Then we also add the freed buffer to the
checkpoint list of the committing trasaction. But that should do no harm.
Non-jbddirty buffers should be filed to BJ_Reserved and not BJ_Metadata
list. It can actually happen that we refile such buffers during the commit
phase when we reallocate in the running transaction blocks deleted in
committing transaction (and that can happen if the committing transaction
already wrote all the data and is just cleaning up BJ_Forget list).
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The "count" and "pt" variables are declared and modified by do_poll(), as
well as accessed and written indirectly in the do_pollfd() subroutine.
This patch pulls all handling of these variables into the do_poll()
function, thereby eliminating the odd use of indirection in do_pollfd().
This is done by pulling the "struct pollfd" traversal loop from do_pollfd()
into its only caller do_poll(). As an added bonus, the patch saves a few
clock cycles, and also adds comments to make the code easier to follow.
Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We can now make posix_locks_deadlock() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass the POSIX lock owner ID to the flush operation.
This is useful for filesystems which don't want to store any locking state
in inode->i_flock but want to handle locking/unlocking POSIX locks
internally. FUSE is one such filesystem but I think it possible that some
network filesystems would need this also.
Also add a flag to indicate that a POSIX locking request was generated by
close(), so filesystems using the above feature won't send an extra locking
request in this case.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
locks_remove_posix() can use posix_lock_file() instead of doing the lock
removal by hand. posix_lock_file() now does exacly the same.
The comment about pids no longer applies, posix_lock_file() takes only the
owner into account.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
posix_lock_file() always allocates new locks in advance, even if it's easy to
determine that no allocations will be needed.
Optimize these cases:
- FL_ACCESS flag is set
- Unlocking the whole range
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
posix_lock_file() was too cautious, failing operations on OOM, even if they
didn't actually require an allocation.
This has the disadvantage, that a failing unlock on process exit could lead to
a memory leak. There are two possibilites for this:
- filesystem implements .lock() and calls back to posix_lock_file(). On
cleanup of files_struct locks_remove_posix() is called which should remove all
locks belonging to files_struct. However if filesystem calls
posix_lock_file() which fails, then those locks will never be freed.
- if a file is closed while a lock is blocked, then after acquiring
fcntl_setlk() will undo the lock. But this unlock itself might fail on OOM,
again possibly leaking the lock.
The solution is to move the checking of the allocations until after it is sure
that they will be needed. This will solve the above problem since unlock will
always succeed unless it splits an existing region.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add read_mapping_page() which is used for callers that pass
mapping->a_ops->readpage as the filler for read_cache_page. This removes
some duplication from filesystem code.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Disable Ext2 XIP if the kernel is configured in no-MMU mode as the former
won't build.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement an LSM hook for setting a task's IO priority, similar to the hook
for setting a tasks's nice value.
A previous version of this LSM hook was included in an older version of
multiadm by Jan Engelhardt, although I don't recall it being submitted
upstream.
Also included is the corresponding SELinux hook, which re-uses the setsched
permission in the proccess class.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Current hugetlb strict accounting for shared mapping always assume mapping
starts at zero file offset and reserves pages between zero and size of the
file. This assumption often reserves (or lock down) a lot more pages then
necessary if application maps at none zero file offset. libhugetlbfs is
one example that requires proper reservation on shared mapping starts at
none zero offset.
This patch extends the reservation and hugetlb strict accounting to support
any arbitrary pair of (offset, len), resulting a much more robust and
accurate scheme. More importantly, it won't lock down any hugetlb pages
outside file mapping.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu.
in xfs.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Enable XFS to limit the statfs() results to the project quota covering the
dentry used as a base for call.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.
This complements the get_sb() patch. That reduced the significance of
sb->s_root, allowing NFS to place a fake root there. However, NFS does
require a dentry to use as a target for the statfs operation. This permits
the root in the vfsmount to be used instead.
linux/mount.h has been added where necessary to make allyesconfig build
successfully.
Interest has also been expressed for use with the FUSE and XFS filesystems.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Add description of d_lock handling to comments over prune_one_dentry().
- It has three callsites - uninline it, saving 200 bytes of text.
Cc: Jan Blunck <jblunck@suse.de>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Olaf Hering <olh@suse.de>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The race is that the shrink_dcache_memory shrinker could get called while a
filesystem is being unmounted, and could try to prune a dentry belonging to
that filesystem.
If it does, then it will call in to iput on the inode while the dentry is
no longer able to be found by the umounting process. If iput takes a
while, generic_shutdown_super could get all the way though
shrink_dcache_parent and shrink_dcache_anon and invalidate_inodes without
ever waiting on this particular inode.
Eventually the superblock gets freed anyway and if the iput tried to touch
it (which some filesystems certainly do), it will lose. The promised
"Self-destruct in 5 seconds" doesn't lead to a nice day.
The race is closed by holding s_umount while calling prune_one_dentry on
someone else's dentry. As a down_read_trylock is used,
shrink_dcache_memory will no longer try to prune the dentry of a filesystem
that is being unmounted, and unmount will not be able to start until any
such active prune_one_dentry completes.
This requires that prune_dcache *knows* which filesystem (if any) it is
doing the prune on behalf of so that it can be careful of other
filesystems. shrink_dcache_memory isn't called it on behalf of any
filesystem, and so is careful of everything.
shrink_dcache_anon is now passed a super_block rather than the s_anon list
out of the superblock, so it can get the s_anon list itself, and can pass
the superblock down to prune_dcache.
If prune_dcache finds a dentry that it cannot free, it leaves it where it
is (at the tail of the list) and exits, on the assumption that some other
thread will be removing that dentry soon. To try to make sure that some
work gets done, a limited number of dnetries which are untouchable are
skipped over while choosing the dentry to work on.
I believe this race was first found by Kirill Korotaev.
Cc: Jan Blunck <jblunck@suse.de>
Acked-by: Kirill Korotaev <dev@openvz.org>
Cc: Olaf Hering <olh@suse.de>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes the steal_locks() function.
steal_locks() doesn't work correctly with any filesystem that does it's own
lock management, including NFS, CIFS, etc.
In addition it has weird semantics on local filesystems in case tasks
sharing file-descriptor tables are doing POSIX locking operations in
parallel to execve().
The steal_locks() function has an effect on applications doing:
clone(CLONE_FILES)
/* in child */
lock
execve
lock
POSIX locks acquired before execve (by "child", "parent" or any further
task sharing files_struct) will after the execve be owned exclusively by
"child".
According to Chris Wright some LSB/LTP kind of suite triggers without the
stealing behavior, but there's no known real-world application that would
also fail.
Apps using NPTL are not affected, since all other threads are killed before
execve.
Apps using LinuxThreads are only affected if they
- have multiple threads during exec (LinuxThreads doesn't kill other
threads, the app may do it with pthread_kill_other_threads_np())
- rely on POSIX locks being inherited across exec
Both conditions are documented, but not their interaction.
Apps using clone() natively are affected if they
- use clone(CLONE_FILES)
- rely on POSIX locks being inherited across exec
The above scenarios are unlikely, but possible.
If the patch is vetoed, there's a plan B, that involves mostly keeping the
weird stealing semantics, but changing the way lock ownership is handled so
that network and local filesystems work consistently.
That would add more complexity though, so this solution seems to be
preferred by most people.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This race became a cause of oops, and can reproduce by the following.
while true; do
dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
done
This race condition was between __sync_single_inode() and iput().
cpu0 (fs's inode) cpu1 (bdev's inode)
----------------- -------------------
close("/dev/hda2")
[...]
__sync_single_inode()
/* copy the bdev's ->i_mapping */
mapping = inode->i_mapping;
generic_forget_inode()
bdev_clear_inode()
/* restre the fs's ->i_mapping */
inode->i_mapping = &inode->i_data;
/* bdev's inode was freed */
destroy_inode(inode);
if (wait) {
/* dereference a freed bdev's mapping->host */
filemap_fdatawait(mapping); /* Oops */
Since __sync_single_inode() is only taking a ref-count of fs's inode, the
another process can be close() and freeing the bdev's inode while writing
fs's inode. So, __sync_signle_inode() accesses the freed ->i_mapping,
oops.
This patch takes a ref-count on the bdev's inode for the fs's inode before
setting a ->i_mapping, and the clear_inode() of the fs's inode does iput() on
the bdev's inode. So if the fs's inode is still living, bdev's inode
shouldn't be freed.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many thanks to Pauline Ng for the detailed bug report and analysis!
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://oss.sgi.com:8090/xfs-2.6: (43 commits)
[XFS] Remove files from the build that are now unused.
[XFS] Fix a Makefile issue related to exports.o handling.
[XFS] Remove version 1 directory code. Never functioned on Linux, just
[XFS] Map EFSCORRUPTED to an actual error code, not just a made up one
[XFS] Kill direct access to ->count in valusema(); all we ever use it for
[XFS] Remove unneeded conditional code on NFS export interface related
[XFS] Remove an incorrect use of unlikely() on a relatively likely code
[XFS] Push some common code out of write path into core XFS code for
[XFS] Remove unnecessary local from open_exec dmapi path.
[XFS] Minor XFS documentation updates.
[XFS] Fix broken const use inside local suffix_strtoul routine.
[XFS] Fix nused counter. It's currently getting set to -1 rather than
[XFS] Fix mismerge of the fs_writable cleanup patch causing a freeze/thaw
[XFS] Fix up debug code so that bulkstat wont generate thousands of
[XFS] Remove unused parameter from di2xflags routine.
[XFS] Cleanup a missed porting conversion, and freezing.
[XFS] Resolve a namespace collision on remaining vtypes for FreeBSD
[XFS] Resolve a namespace collision on vnode/vnodeops for FreeBSD porters.
[XFS] Resolve a namespace collision on vfs/vfsops for FreeBSD porters.
[XFS] statvfs component of directory/project quota support, code
...
Like the SUBSYTEM= key we find in the environment of the uevent, this
creates a generic "subsystem" link in sysfs for every device. Userspace
usually doesn't care at all if its a "class" or a "bus" device. This
provides an unified way to determine the subsytem of a device, regardless
of the way the driver core has created it.
Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.infradead.org/~dwmw2/rbtree-2.6:
[RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
Update UML kernel/physmem.c to use rb_parent() accessor macro
[RBTREE] Update hrtimers to use rb_parent() accessor macro.
[RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
[RBTREE] Merge colour and parent fields of struct rb_node.
[RBTREE] Remove dead code in rb_erase()
[RBTREE] Update JFFS2 to use rb_parent() accessor macro.
[RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
[RBTREE] Update key.c to use rb_parent() accessor macro.
[RBTREE] Update ext3 to use rb_parent() accessor macro.
[RBTREE] Change rbtree off-tree marking in I/O schedulers.
[RBTREE] Add accessor macros for colour and parent fields of rb_node
* git://git.infradead.org/mtd-2.6: (199 commits)
[MTD] NAND: Fix breakage all over the place
[PATCH] NAND: fix remaining OOB length calculation
[MTD] NAND Fixup NDFC merge brokeness
[MTD NAND] S3C2410 driver cleanup
[MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation.
[JFFS2] Check CRC32 on dirent and data nodes each time they're read
[JFFS2] When retiring nextblock, allocate a node_ref for the wasted space
[JFFS2] Mark XATTR support as experimental, for now
[JFFS2] Don't trust node headers before the CRC is checked.
[MTD] Restore MTD_ROM and MTD_RAM types
[MTD] assume mtd->writesize is 1 for NOR flashes
[MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles
[MTD] Prepare physmap for 64-bit-resources
[JFFS2] Fix more breakage caused by janitorial meddling.
[JFFS2] Remove stray __exit from jffs2_compressors_exit()
[MTD] Allow alternate JFFS2 mount variant for root filesystem.
[MTD] Disconnect struct mtd_info from ABI
[MTD] replace MTD_RAM with MTD_GENERIC_TYPE
[MTD] replace MTD_ROM with MTD_GENERIC_TYPE
[MTD] remove a forgotten MTD_XIP
...
Allow callers to remove watches from their event handler via
inotify_remove_watch_locked(). This functionality can be used to
achieve IN_ONESHOT-like functionality for a subset of events in the
mask.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add inotify_init_watch() so caller can use inotify_watch refcounts
before calling inotify_add_watch().
Add inotify_find_watch() to find an existing watch for an (ih,inode)
pair. This is similar to inotify_find_update_watch(), but does not
update the watch's mask if one is found.
Add inotify_rm_watch() to remove a watch via the watch pointer instead
of the watch descriptor.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When an inotify event includes a dentry name, also include the inode
associated with that name.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The following series of patches introduces a kernel API for inotify,
making it possible for kernel modules to benefit from inotify's
mechanism for watching inodes. With these patches, inotify will
maintain for each caller a list of watches (via an embedded struct
inotify_watch), where each inotify_watch is associated with a
corresponding struct inode. The caller registers an event handler and
specifies for which filesystem events their event handler should be
called per inotify_watch.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(990). Turns out some ye-olde unices used EUCLEAN as
Filesystem-needs-cleaning, so now we use that too.
SGI-PV: 953954
SGI-Modid: xfs-linux-melb:xfs-kern:26286a
Signed-off-by: Nathan Scott <nathans@sgi.com>
is check if semaphore is actually locked, which can be trivially done in
portable way. Code gets more reabable, while we are at it...
SGI-PV: 953915
SGI-Modid: xfs-linux-melb:xfs-kern:26274a
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Also, make sure dirents are marked REF_UNCHECKED when we 'discover' them
through eraseblock summary.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Failing to do so makes the calculated length of the last node incorrect,
when we're not using eraseblock summaries.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Especially when summary code is used, we can have in-memory data
structures referencing certain nodes without them actually being readable
on the flash. Discard the nodes gracefully in that case, rather than
triggering a BUG().
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
If get_user_pages() returns less pages than what we asked for, we jump
to out_unmap which will return ERR_PTR(ret). But ret can contain a
positive number just smaller than local_nr_pages, so be sure to set it
to -EFAULT always.
Problem found and diagnosed by Damien Le Moal <damien@sdl.hitachi.co.jp>
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If flock_lock_file() failed to allocate flock with locks_alloc_lock()
then "error = 0" is returned. Need to return some non-zero.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
jffs2_zlib_exit() and free_workspaces() shouldn't be marked __exit because
they get called in the error case from the init functions.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Currently it is possible for a task to remove its locks at the same time as
the NLM recovery thread is trying to recover them. This quickly leads to an
Oops.
Protect the locks using an rw semaphore while they are being recovered.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As fs/nfs/inode.c is rather large, heterogenous and unwieldy, the attached
patch splits it up into a number of files:
(*) fs/nfs/inode.c
Strictly inode specific functions.
(*) fs/nfs/super.c
Superblock management functions for NFS and NFS4, normal access, clones
and referrals. The NFS4 superblock functions _could_ move out into a
separate conditionally compiled file, but it's probably not worth it as
there're so many common bits.
(*) fs/nfs/namespace.c
Some namespace-specific functions have been moved here.
(*) fs/nfs/nfs4namespace.c
NFS4-specific namespace functions (this could be merged into the previous
file). This file is conditionally compiled.
(*) fs/nfs/internal.h
Inter-file declarations, plus a few simple utility functions moved from
fs/nfs/inode.c.
Additionally, all the in-.c-file externs have been moved here, and those
files they were moved from now includes this file.
For the most part, the functions have not been changed, only some multiplexor
functions have changed significantly.
I've also:
(*) Added some extra banner comments above some functions.
(*) Rearranged the function order within the files to be more logical and
better grouped (IMO), though someone may prefer a different order.
(*) Reduced the number of #ifdefs in .c files.
(*) Added missing __init and __exit directives.
Signed-Off-By: David Howells <dhowells@redhat.com>
Respond to a moved error on NFS lookup by setting up the referral.
Note: We don't actually follow the referral during lookup/getattr, but
later when we detect fsid mismatch in inode revalidation (similar to the
processing done for cloning submounts). Referrals will have fake attributes
until they are actually followed or traversed.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up mountpoint when hitting a referral on moved error by getting
fs_locations.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move existing code into a separate function so that it can be also used by
referral code.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is (similar to getattr bitmap) but includes fs_locations and
mounted_on_fileid attributes. Use this bitmap for encoding in fs_locations
requests.
Note: We can probably do better by requesting locations as part of fsinfo
itself.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Per referral draft, only fs_locations, fsid, and mounted_on_fileid can be
requested in a GETATTR on referrals.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is ignored if fileid is also requested. This will be used on referrals
(fs_locations).
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use component4-style formats for decoding list of servers and pathnames in
fs_locations.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 allows for the fact that filesystems may be replicated across
several servers or that they may be migrated to a backup server in case of
failure of the primary server.
fs_locations is an NFSv4 operation for retrieving information about the
location of migrated and/or replicated filesystems.
Based on an initial implementation by Jiaying Zhang <jiayingz@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make automounted partitions expire using the mark_mounts_for_expiry()
function. The timeout is controlled via a sysctl.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This should enable us to detect if we are crossing a mountpoint in the
case where the server is exporting "nohide" mounts.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow filesystems to decide to perform pre-umount processing whether or not
MNT_FORCE is set.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow a submount to be marked as being 'shrinkable' by means of the
vfsmount->mnt_flags, and then add a function 'shrink_submounts()' which
attempts to recursively unmount these submounts.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace all module uses with the new vfs_kern_mount() interface, and fix up
simple_pin_fs().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>