Always use do {} while (0). Failing to do so can cause subtle compile
failures or bugs.
Cc: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Raise the maximum error number in IS_ERR_VALUE to 4095.
o Make that number available as a new constant MAX_ERRNO.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes buggy behaviour of UFS
in such kind of scenario:
open(, O_TRUNC...)
ftruncate(, 1024)
ftruncate(, 0)
Such a scenario causes ufs_panic and remount read-only. This happen
because of according to specification UFS should always allocate block for
last byte, and many parts of our implementation rely on this, but
`ufs_truncate' doesn't care about this.
To make possible return error code and to know about old size, this patch
removes `truncate' from ufs inode_operations and uses `setattr' method to
call ufs_truncate.
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make __copy_*_user_inatomic really atomic to avoid "Sleeping function called in
atomic context" warnings, especially from futex code.
This is made by adding another kmap_atomic slot and making copy_*_user_skas
use kmap_atomic; also copy_*_user() becomes atomic, but that's true and is not
a problem for i386 (and we can always add might_sleep there as done
elsewhere). For TT mode kmap is not used, so there's no need for this.
I've had to use another slot since both KM_USER0 and KM_USER1 are used
elsewhere and could cause conflicts. Till now we reused the kmap_atomic slot
list from the subarch, but that's not needed as that list must contain the
common ones (used by generic code) + the ones used in architecture specific
code (and Uml till now used none); so I've taken the i386 one after comparing
it with ones from other archs, and added KM_UML_USERCOPY.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hide the magic in alternative.h and provide some dummy inline functions
for the UP case (gcc should manage to optimize away these calls). No
changes in module.c.
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
[SPARC64]: Kill sun4v virtual device layer.
[SERIAL] sunhv: Convert to of_driver layer.
[SPARC64]: Mask out top 8-bits in physical address when building resources.
[SERIAL] sunsu: Missing return statement in su_probe().
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
[IPV6]: Added GSO support for TCPv6
[NET]: Generalise TSO-specific bits from skb_setup_caps
[IPV6]: Added GSO support for TCPv6
[IPV6]: Remove redundant length check on input
[NETFILTER]: SCTP conntrack: fix crash triggered by packet without chunks
[TG3]: Update version and reldate
[TG3]: Add TSO workaround using GSO
[TG3]: Turn on hw fix for ASF problems
[TG3]: Add rx BD workaround
[TG3]: Add tg3_netif_stop() in vlan functions
[TCP]: Reset gso_segs if packet is dodgy
* git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6/:
[PATCH] pcmcia: fix deadlock in pcmcia_parse_events
[PATCH] com20020_cs: more device support
[PATCH] au1xxx: pcmcia: fix __init called from non-init
[PATCH] kill open-coded offsetof in cm4000_cs.c ZERO_DEV()
[PATCH] pcmcia: convert pcmcia_cs to kthread
[PATCH] pcmcia: fix kernel-doc function name
[PATCH] pcmcia: hostap_cs.c - 0xc00f,0x0000 conflicts with pcnet_cs
[PATCH] pcmcia: at91_cf suspend/resume/wakeup
[PATCH] pcmcia: Make ide_cs work with the memory space of CF-Cards if IO space is not available
[PATCH] pcmcia: TI PCIxx12 CardBus controller support
[PATCH] pcmcia: warn if driver requests exclusive, but gets a shared IRQ
[PATCH] pcmcia: expose tool in pcmcia/Documentation/pcmcia/
[PATCH] pcmcia: another ID for serial_cs.c
[PATCH] yenta: fix hidden PCI bus numbers
[PATCH] yenta: do power-up only after socket is configured
* master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb:
V4L/DVB (4290): Add support for the TCL M2523_3DB_E tuner.
V4L/DVB (4289): Missing statement in drivers/media/dvb/frontends/cx22700.c
V4L/DVB (4288): Clean out a zillion sparse warnings in pvrusb2
V4L/DVB (4287): Pvrusb2/: possible cleanups
V4L/DVB (4285): Cx88: add support for Geniatech Digistar / Digiwave 103g
V4L/DVB (4284): Cx24123: fix set_voltage function according to the specs
V4L/DVB (4282): Fix: use swzigzag for swalgo
V4L/DVB (4281): TDA9887_SET_CONFIG should only be handled by the tda9887.
V4L/DVB (4277): Fix CI interface on PRO KNC1 cards
V4L/DVB (4276): Fix CI on old KNC1 DVBC cards
V4L/DVB (4275): The FE_SET_FRONTEND_TUNE_MODE ioctl always returns EOPNOTSUPP
V4L/DVB (4274): Eliminate use of tda9887 from pvrusb2 driver
V4L/DVB (4273): Always log pvrusb2 device register / unregister events
V4L/DVB (4272): Fix tveeprom supported standards
V4L/DVB (4270): Add tda9887-specific tuner configuration
V4L/DVB (4269): Subject: videocodec: make 1-bit fields unsigned
V4L/DVB (4267): Remove all instances of request_module("tda9887")
V4L/DVB (4264): Cx88-blackbird: implement VIDIOC_QUERYCTRL and VIDIOC_QUERYMENU
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (25 commits)
ACPI: Kconfig: ACPI_SRAT depends on ACPI
ACPI: drivers/acpi/scan.c: make acpi_bus_type static
ACPI: fixup memhotplug debug message
ACPI: ACPICA 20060623
ACPI: C-States: only demote on current bus mastering activity
ACPI: C-States: bm_activity improvements
ACPI: C-States: accounting of sleep states
ACPI: additional blacklist entry for ThinkPad R40e
ACPI: restore comment justifying 'extra' P_LVLx access
ACPI: fix battery on HP NX6125
ACPIPHP: prevent duplicate slot numbers when no _SUN
ACPI: static-ize handle_hotplug_event_func()
ACPIPHP: use ACPI dock driver
ACPI: dock driver
KEVENT: add new uevent for dock
ACPI: asus_acpi_init: propagate correct return value
[ACPI] Print error message if remove/install notify handler fails
ACPI: delete tracing macros from drivers/acpi/*.c
ACPI: HW P-state coordination support
ACPI: un-export ACPI_ERROR() -- use printk(KERN_ERR...)
...
This patch adds GSO support for IPv6 and TCPv6. This is based on a patch
by Ananda Raju <Ananda.Raju@neterion.com>. His original description is:
This patch enables TSO over IPv6. Currently Linux network stacks
restricts TSO over IPv6 by clearing of the NETIF_F_TSO bit from
"dev->features". This patch will remove this restriction.
This patch will introduce a new flag NETIF_F_TSO6 which will be used
to check whether device supports TSO over IPv6. If device support TSO
over IPv6 then we don't clear of NETIF_F_TSO and which will make the
TCP layer to create TSO packets. Any device supporting TSO over IPv6
will set NETIF_F_TSO6 flag in "dev->features" along with NETIF_F_TSO.
In case when user disables TSO using ethtool, NETIF_F_TSO will get
cleared from "dev->features". So even if we have NETIF_F_TSO6 we don't
get TSO packets created by TCP layer.
SKB_GSO_TCPV4 renamed to SKB_GSO_TCP to make it generic GSO packet.
SKB_GSO_UDPV4 renamed to SKB_GSO_UDP as UFO is not a IPv4 feature.
UFO is supported over IPv6 also
The following table shows there is significant improvement in
throughput with normal frames and CPU usage for both normal and jumbo.
--------------------------------------------------
| | 1500 | 9600 |
| ------------------|-------------------|
| | thru CPU | thru CPU |
--------------------------------------------------
| TSO OFF | 2.00 5.5% id | 5.66 20.0% id |
--------------------------------------------------
| TSO ON | 2.63 78.0 id | 5.67 39.0% id |
--------------------------------------------------
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generalises the TSO-specific bits from sk_setup_caps by adding
the sk_gso_type member to struct sock. This makes sk_setup_caps generic
so that it can be used by TCPv6 or UFO.
The only catch is that whoever uses this must provide a GSO implementation
for their protocol which I think is a fair deal :) For now UFO continues to
live without a GSO implementation which is OK since it doesn't use the sock
caps field at the moment.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds GSO support for IPv6 and TCPv6.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch below adds support for the TI PCIxx12 CardBus controllers.
This seems to be sufficient to detect the cardbus bridge on an HP nc6320
and works with an orinoco wifi card.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Many tda9887 settings depend on the chosen tuner. Expand the tuner parameters
to include these tda9887 settings.
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Add a rq_sendfile_ok flag to svc_rqst which will be cleared in the privacy
case so that the wrapping code will get copies of the read data instead of
real page cache pages. This makes life simpler when we encrypt the response.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add __acquire annotations to rcu_read_lock and rcu_read_lock_bh, and add
__release annotations to rcu_read_unlock and rcu_read_unlock_bh. This
allows sparse to detect improperly paired calls to these functions.
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This corrects the comments describing the 'enabled' and 'pending' flags in
struct rtc_wkalrm of include/linux/rtc.h.
Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The x86_64 build requires a definition for __raw_writeq.
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Presently, smp_processor_id() isn't necessarily set up until setup_arch().
But it's used in boot_cpu_init() and printk() and perhaps in other places,
prior to setup_arch() being called.
So provide a new smp_setup_processor_id() which is called before anything
else, wire it up for Voyager (which boots on a CPU other than #0, and broke).
Cc: James Bottomley <James.Bottomley@steeleye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new security hook definition for the sys_ioprio_get operation. At
present, the SELinux hook function implementation for this hook is
identical to the getscheduler implementation but a separate hook is
introduced to allow this check to be specialized in the future if
necessary.
This patch also creates a helper function get_task_ioprio which handles the
access check in addition to retrieving the ioprio value for the task.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds a call to the extended security_task_kill hook introduced by
the prior patch to the kill_proc_info_as_uid function so that these signals
can be properly mediated by security modules. It also updates the existing
hook call in check_kill_permission.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch extends the security_task_kill hook to handle signals sent by AIO
completion. In this case, the secid of the task responsible for the signal
needs to be obtained and saved earlier, so a security_task_getsecid() hook is
added, and then this saved value is passed subsequently to the extended
task_kill hook for use in checking.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The remaining counters in page_state after the zoned VM counter patches
have been applied are all just for show in /proc/vmstat. They have no
essential function for the VM.
We use a simple increment of per cpu variables. In order to avoid the most
severe races we disable preempt. Preempt does not prevent the race between
an increment and an interrupt handler incrementing the same statistics
counter. However, that race is exceedingly rare, we may only loose one
increment or so and there is no requirement (at least not in kernel) that
the vm event counters have to be accurate.
In the non preempt case this results in a simple increment for each
counter. For many architectures this will be reduced by the compiler to a
single instruction. This single instruction is atomic for i386 and x86_64.
And therefore even the rare race condition in an interrupt is avoided for
both architectures in most cases.
The patchset also adds an off switch for embedded systems that allows a
building of linux kernels without these counters.
The implementation of these counters is through inline code that hopefully
results in only a single instruction increment instruction being emitted
(i386, x86_64) or in the increment being hidden though instruction
concurrency (EPIC architectures such as ia64 can get that done).
Benefits:
- VM event counter operations usually reduce to a single inline instruction
on i386 and x86_64.
- No interrupt disable, only preempt disable for the preempt case.
Preempt disable can also be avoided by moving the counter into a spinlock.
- Handling is similar to zoned VM counters.
- Simple and easily extendable.
- Can be omitted to reduce memory use for embedded use.
References:
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=113512330605497&w=2
RFC http://marc.theaimsgroup.com/?l=linux-kernel&m=114988082814934&w=2
local_t http://marc.theaimsgroup.com/?l=linux-kernel&m=114991748606690&w=2
V2 http://marc.theaimsgroup.com/?t=115014808400007&r=1&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767022346&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115047968808926&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The numa statistics are really event counters. But they are per node and
so we have had special treatment for these counters through additional
fields on the pcp structure. We can now use the per zone nature of the
zoned VM counters to realize these.
This will shrink the size of the pcp structure on NUMA systems. We will
have some room to add additional per zone counters that will all still fit
in the same cacheline.
Bits Prior pcp size Size after patch We can add
------------------------------------------------------------------
64 128 bytes (16 words) 80 bytes (10 words) 48
32 76 bytes (19 words) 56 bytes (14 words) 8 (64 byte cacheline)
72 (128 byte)
Remove the special statistics for numa and replace them with zoned vm
counters. This has the side effect that global sums of these events now
show up in /proc/vmstat.
Also take the opportunity to move the zone_statistics() function from
page_alloc.c into vmstat.c.
Discussions:
V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No callers.
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_bounce to a per zone counter
nr_bounce is only used for proc output. So it could be left as an event
counter. However, the event counters may not be accurate and nr_bounce is
categorizing types of pages in a zone. So we really need this to also be a
per zone counter.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_unstable to a per zone counter
We need to do some special modifications to the nfs code since there are
multiple cases of disposition and we need to have a page ref for proper
accounting.
This converts the last critical page state of the VM and therefore we need to
remove several functions that were depending on GET_PAGE_STATE_LAST in order
to make the kernel compile again. We are only left with event type counters
in page state.
[akpm@osdl.org: bugfixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_writeback to per zone counter.
This removes the last page_state counter from arch/i386/mm/pgtable.c so we
drop the page_state from there.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes nr_dirty a per zone counter. Looping over all processors is
avoided during writeback state determination.
The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones. Someone more familiar with
NFS should probably review what I have done.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_page_table_pages to a per zone counter
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Allows reclaim to access counter without looping over processor counts.
- Allows accurate statistics on how many pages are used in a zone by
the slab. This may become useful to balance slab allocations over
various zones.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone. Therefore we had to scan in
intervals to figure out if any pages were unmapped.
With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone. So we can simply skip the
reclaim if there is an insufficient number of unmapped pages. We use
SWAP_CLUSTER_MAX as the boundary.
Drop all support for /proc/sys/vm/zone_reclaim_interval.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
calculation as the number of mapped pagecache pages. However, that is not
true. NR_FILE_MAPPED includes the mapped anonymous pages. This patch
separates those and therefore allows an accurate tracking of the anonymous
pages per zone.
It then becomes possible to determine the number of unmapped pages per zone
and we can avoid scanning for unmapped pages if there are none.
Also it may now be possible to determine the mapped/unmapped ratio in
get_dirty_limit. Isnt the number of anonymous pages irrelevant in that
calculation?
Note that this will change the meaning of the number of mapped pages reported
in /proc/vmstat /proc/meminfo and in the per node statistics. This may affect
user space tools that monitor these counters! NR_FILE_MAPPED works like
NR_FILE_DIRTY. It is only valid for pagecache pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently a single atomic variable is used to establish the size of the page
cache in the whole machine. The zoned VM counters have the same method of
implementation as the nr_pagecache code but also allow the determination of
the pagecache size per zone.
Remove the special implementation for nr_pagecache and make it a zoned counter
named NR_FILE_PAGES.
Updates of the page cache counters are always performed with interrupts off.
We can therefore use the __ variant here.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
nr_mapped is important because it allows a determination of how many pages of
a zone are not mapped, which would allow a more efficient means of determining
when we need to reclaim memory in a zone.
We take the nr_mapped field out of the page state structure and define a new
per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
from NR_MAPPED in the next patch).
We replace the use of nr_mapped in various kernel locations. This avoids the
looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
zone reclaim).
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Per zone counter infrastructure
The counters that we currently have for the VM are split per processor. The
processor however has not much to do with the zone these pages belong to. We
cannot tell f.e. how many ZONE_DMA pages are dirty.
So we are blind to potentially inbalances in the usage of memory in various
zones. F.e. in a NUMA system we cannot tell how many pages are dirty on a
particular node. If we knew then we could put measures into the VM to balance
the use of memory between different zones and different nodes in a NUMA
system. For example it would be possible to limit the dirty pages per node so
that fast local memory is kept available even if a process is dirtying huge
amounts of pages.
Another example is zone reclaim. We do not know how many unmapped pages exist
per zone. So we just have to try to reclaim. If it is not working then we
pause and try again later. It would be better if we knew when it makes sense
to reclaim unmapped pages from a zone. This patchset allows the determination
of the number of unmapped pages per zone. We can remove the zone reclaim
interval with the counters introduced here.
Futhermore the ability to have various usage statistics available will allow
the development of new NUMA balancing algorithms that may be able to improve
the decision making in the scheduler of when to move a process to another node
and hopefully will also enable automatic page migration through a user space
program that can analyse the memory load distribution and then rebalance
memory use in order to increase performance.
The counter framework here implements differential counters for each processor
in struct zone. The differential counters are consolidated when a threshold
is exceeded (like done in the current implementation for nr_pageache), when
slab reaping occurs or when a consolidation function is called.
Consolidation uses atomic operations and accumulates counters per zone in the
zone structure and also globally in the vm_stat array. VM functions can
access the counts by simply indexing a global or zone specific array.
The arrangement of counters in an array also simplifies processing when output
has to be generated for /proc/*.
Counters can be updated by calling inc/dec_zone_page_state or
_inc/dec_zone_page_state analogous to *_page_state. The second group of
functions can be called if it is known that interrupts are disabled.
Special optimized increment and decrement functions are provided. These can
avoid certain checks and use increment or decrement instructions that an
architecture may provide.
We also add a new CONFIG_DMA_IS_NORMAL that signifies that an architecture can
do DMA to all memory and therefore ZONE_NORMAL will not be populated. This is
only currently set for IA64 SGI SN2 and currently only affects
node_page_state(). In the best case node_page_state can be reduced to
retrieving a single counter for the one zone on the node.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: export vm_stat[] for filesystems]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
NOTE: ZVC are *not* the lightweight event counters. ZVCs are reliable whereas
event counters do not need to be.
Zone based VM statistics are necessary to be able to determine what the state
of memory in one zone is. In a NUMA system this can be helpful for local
reclaim and other memory optimizations that may be able to shift VM load in
order to get more balanced memory use.
It is also useful to know how the computing load affects the memory
allocations on various zones. This patchset allows the retrieval of that data
from userspace.
The patchset introduces a framework for counters that is a cross between the
existing page_stats --which are simply global counters split per cpu-- and the
approach of deferred incremental updates implemented for nr_pagecache.
Small per cpu 8 bit counters are added to struct zone. If the counter exceeds
certain thresholds then the counters are accumulated in an array of
atomic_long in the zone and in a global array that sums up all zone values.
The small 8 bit counters are next to the per cpu page pointers and so they
will be in high in the cpu cache when pages are allocated and freed.
Access to VM counter information for a zone and for the whole machine is then
possible by simply indexing an array (Thanks to Nick Piggin for pointing out
that approach). The access to the total number of pages of various types does
no longer require the summing up of all per cpu counters.
Benefits of this patchset right now:
- Ability for UP and SMP configuration to determine how memory
is balanced between the DMA, NORMAL and HIGHMEM zones.
- loops over all processors are avoided in writeback and
reclaim paths. We can avoid caching the writeback information
because the needed information is directly accessible.
- Special handling for nr_pagecache removed.
- zone_reclaim_interval vanishes since VM stats can now determine
when it is worth to do local reclaim.
- Fast inline per node page state determination.
- Accurate counters in /sys/devices/system/node/node*/meminfo. Current
counters are counting simply which processor allocated a page somewhere
and guestimate based on that. So the counters were not useful to show
the actual distribution of page use on a specific zone.
- The swap_prefetch patch requires per node statistics in order to
figure out when processors of a node can prefetch. This patch provides
some of the needed numbers.
- Detailed VM counters available in more /proc and /sys status files.
References to earlier discussions:
V1 http://marc.theaimsgroup.com/?l=linux-kernel&m=113511649910826&w=2
V2 http://marc.theaimsgroup.com/?l=linux-kernel&m=114980851924230&w=2
V3 http://marc.theaimsgroup.com/?l=linux-kernel&m=115014697910351&w=2
V4 http://marc.theaimsgroup.com/?l=linux-kernel&m=115024767318740&w=2
Performance tests with AIM7 did not show any regressions. Seems to be a tad
faster even. Tested on ia64/NUMA. Builds fine on i386, SMP / UP. Includes
fixes for s390/arm/uml arch code.
This patch:
Move counter code from page_alloc.c/page-flags.h to vmstat.c/h.
Create vmstat.c/vmstat.h by separating the counter code and the proc
functions.
Move the vm_stat_text array before zoneinfo_show.
[akpm@osdl.org: s390 build fix]
[akpm@osdl.org: HOTPLUG_CPU build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
[TIPC]: Initial activation message now includes TIPC version number
[TIPC]: Improve response to requests for node/link information
[TIPC]: Fixed skb_under_panic caused by tipc_link_bundle_buf
[IrDA]: Fix the AU1000 FIR dependencies
[IrDA]: Fix RCU lock pairing on error path
[XFRM]: unexport xfrm_state_mtu
[NET]: make skb_release_data() static
[NETFILTE] ipv4: Fix typo (Bugzilla #6753)
[IrDA]: MCS7780 usb_driver struct should be static
[BNX2]: Turn off link during shutdown
[BNX2]: Use dev_kfree_skb() instead of the _irq version
[ATM]: basic sysfs support for ATM devices
[ATM]: [suni] change suni_init to __devinit
[ATM]: [iphase] should be __devinit not __init
[ATM]: [idt77105] should be __devinit not __init
[BNX2]: Add NETIF_F_TSO_ECN
[NET]: Add ECN support for TSO
[AF_UNIX]: Datagram getpeersec
[NET]: Fix logical error in skb_gso_ok
[PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec >= 1000000
...
skb_release_data() no longer has any users in other files.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>