linux

Author	SHA1	Message	Date
Joel Becker	13723d00e3	ocfs2: Use metadata-specific ocfs2_journal_access_() functions. The per-metadata-type ocfs2_journal_access_() functions hook up jbd2 commit triggers and allow us to compute metadata ecc right before the buffers are written out. This commit provides ecc for inodes, extent blocks, group descriptors, and quota blocks. It is not safe to use extended attributes and metaecc at the same time yet. The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide the type of block at their root. Before, it didn't matter, but now the root block must use the appropriate ocfs2_journal_access_*() function. To keep this abstract, the structures now have a pointer to the matching journal_access function and a wrapper call to call it. A few places use naked ocfs2_write_block() calls instead of adding the blocks to the journal. We make sure to calculate their checksum and ecc before the write. Since we pass around the journal_access functions, let's typedef them in ocfs2.h. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:32 -08:00
Joel Becker	d6b32bbb3e	ocfs2: block read meta ecc. Add block check calls to the read_block validate functions. This is almost all of the read-side checking of metaecc. xattr buckets are not checked yet. Writes are also unchecked, and so a read-write mount will quickly fail. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:40:31 -08:00
Joel Becker	970e4936d7	ocfs2: Validate metadata only when it's read from disk. Add an optional validation hook to ocfs2_read_blocks(). Now the validation function is only called when a block was actually read off of disk. It is not called when the buffer was in cache. We add a buffer state bit BH_NeedsValidate to flag these buffers. It must always be one higher than the last JBD2 buffer state bit. The dinode, dirblock, extent_block, and xattr_block validators are lifted to this scheme directly. The group_descriptor validator needs to be split into two pieces. The first part only needs the gd buffer and is passed to ocfs2_read_block(). The second part requires the dinode as well, and is called every time. It's only 3 compares, so it's tiny. This also allows us to clean up the non-fatal gd check used by resize.c. It now has no magic argument. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	4203530613	ocfs2: Morph the haphazard OCFS2_IS_VALID_GROUP_DESC() checks. Random places in the code would check a group descriptor bh to see if it was valid. The previous commit unified descriptor block reads, validating all block reads in the same place. Thus, these checks are no longer necessary. Rather than eliminate them, however, we change them to BUG_ON() checks. This ensures the assumptions remain true. All of the code paths to these checks have been audited to ensure they come from a validated descriptor read. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	68f64d471b	ocfs2: Wrap group descriptor reads in a dedicated function. We have a clean call for validating group descriptors, but every place that wants one always does a read_block()+validate() call pair. Create a toplevel ocfs2_read_group_descriptor() that does the right thing. This allows us to leverage the single call point later for fancier handling. We also add validation of gd->bg_generation against the superblock and gd->bg_blkno against the block we thought we read. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	57e3e79711	ocfs2: Consolidate validation of group descriptors. Currently the validation of group descriptors is directly duplicated so that one version can error the filesystem and the other (resize) can just report the problem. Consolidate to one function that takes a boolean. Wrap that function with the old call for the old users. This is in preparation for lifting the read+validate step into a single function. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:53 -08:00
Joel Becker	10995aa245	ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks. Random places in the code would check a dinode bh to see if it was valid. Not only did they do different levels of validation, they handled errors in different ways. The previous commit unified inode block reads, validating all block reads in the same place. Thus, these haphazard checks are no longer necessary. Rather than eliminate them, however, we change them to BUG_ON() checks. This ensures the assumptions remain true. All of the code paths to these checks have been audited to ensure they come from a validated inode read. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-01-05 08:36:52 -08:00
Joel Becker	0fcaa56a2a	ocfs2: Simplify ocfs2_read_block() More than 30 callers of ocfs2_read_block() pass exactly OCFS2_BH_CACHED. Only six pass a different flag set. Rather than have every caller care, let's make ocfs2_read_block() take no flags and always do a cached read. The remaining six places can call ocfs2_read_blocks() directly. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:51:57 -07:00
Joel Becker	31d33073ca	ocfs2: Require an inode for ocfs2_read_block(s)(). Now that synchronous readers are using ocfs2_read_blocks_sync(), all callers of ocfs2_read_blocks() are passing an inode. Use it unconditionally. Since it's there, we don't need to pass the ocfs2_super either. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-14 11:43:29 -07:00
Mark Fasheh	a81cb88b64	ocfs2: Don't check for NULL before brelse() This is pointless as brelse() already does the check. Signed-off-by: Mark Fasheh	2008-10-13 17:02:44 -07:00
Joel Becker	12462f1d9f	ocfs2: Add the 'inode64' mount option. Now that ocfs2 limits inode numbers to 32bits, add a mount option to disable the limit. This parallels XFS. 64bit systems can handle the larger inode numbers. [ Added description of inode64 mount option in ocfs2.txt. --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:08 -07:00
Joel Becker	1187c96885	ocfs2: Limit inode allocation to 32bits. ocfs2 inode numbers are block numbers. For any filesystem with less than 2^32 blocks, this is not a problem. However, when ocfs2 starts using JBD2, it will be able to support filesystems with more than 2^32 blocks. This would result in inode numbers higher than 2^32. The problem is that stat(2) can't handle those numbers on 32bit machines. The simple solution is to have ocfs2 allocate all inodes below that boundary. The suballoc code is changed to honor an optional block limit. Only the inode suballocator sets that limit - all other allocations stay unlimited. The biggest trick is to grow the inode suballocator beneath that limit. There's no point in allocating block groups that are above the limit, then rejecting their elements later on. We want to prevent the inode allocator from ever having block groups above the limit. This involves a little gyration with the local alloc code. If the local alloc window is above the limit, it signals the caller to try the global bitmap but does not disable the local alloc file (which can be used for other allocations). [ Minor cleanup - removed an ML_NOTICE comment. --Mark ] Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:07 -07:00
Joel Becker	f99b9b7ccf	ocfs2: Make ocfs2_extent_tree the first-class representation of a tree. We now have three different kinds of extent trees in ocfs2: inode data (dinode), extended attributes (xattr_tree), and extended attribute values (xattr_value). There is a nice abstraction for them, ocfs2_extent_tree, but it is hidden in alloc.c. All the calling functions have to pick amongst a varied API and pass in type bits and often extraneous pointers. A better way is to make ocfs2_extent_tree a first-class object. Everyone converts their object to an ocfs2_extent_tree() via the ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all tree calls to alloc.c. This simplifies a lot of callers, making for readability. It also provides an easy way to add additional extent tree types, as they only need to be defined in alloc.c with a ocfs2_get_<new>_extent_tree() function. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Tiger Yang	cf1d6c763f	ocfs2: Add extended attribute support This patch implements storing extended attributes both in inode or a single external block. We only store EA's in-inode when blocksize > 512 or that inode block has free space for it. When an EA's value is larger than 80 bytes, we will store the value via b-tree outside inode or block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tao Ma	f56654c435	ocfs2: Add extent tree operation for xattr value btrees Add some thin wrappers around ocfs2_insert_extent() for each of the 3 different btree types, ocfs2_dinode_insert_extent(), ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The last is for the xattr index btree, which will be used in a followup patch. All the old callers in file.c etc will call ocfs2_dinode_insert_extent(), while the other two handle the xattr issue. And the init of the extent tree is handled by these functions. When storing an xattr value which is too large, we will allocate some clusters for it and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In order to re-use the b-tree operation code, a new parameter named "private" is added into ocfs2_extent_tree and it is used to indicate the root of ocfs2_extent_list. The reason is that we can't deduce the root from the buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse, in any place in an ocfs2_xattr_bucket. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:01 -07:00
Tao Ma	e7d4cb6bc1	ocfs2: Abstract ocfs2_extent_tree in b-tree operations. In the old extent tree operation, we take the hypothesis that we are using the ocfs2_extent_list in ocfs2_dinode as the tree root. As xattr will also use ocfs2_extent_list to store large value for a xattr entry, we refactor the tree operation so that xattr can use it directly. The refactoring includes 4 steps: 1. Abstract set/get of last_eb_blk and update_clusters since they may be stored in different location for dinode and xattr. 2. Add a new structure named ocfs2_extent_tree to indicate the extent tree the operation will work on. 3. Remove all the use of fe_bh and di, use root_bh and root_el in extent tree instead. So now all the fe_bh is replaced with et->root_bh, el with root_el accordingly. 4. Make ocfs2_lock_allocators generic. Now it is limited to be only used in file extend allocation. But the whole function is useful when we want to store large EAs. Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used for anything other than truncate inode data btrees. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Tao Ma	811f933df1	ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode. ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and ocfs2_reserve_new_metadata() are all useful for extent tree operations. But they are all limited to an inode btree because they use a struct ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list (the part of an ocfs2_dinode they actually use) so that the xattr btree code can use these functions. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Mark Fasheh	9c7af40b21	ocfs2: throttle back local alloc when low on disk space Ocfs2's local allocator disables itself for the duration of a mount point when it has trouble allocating a large enough area from the primary bitmap. That can cause performance problems, especially for disks which were only temporarily full or fragmented. This patch allows for the allocator to shrink its window first, before being disabled. Later, it can also be re-enabled so that any performance drop is minimized. To do this, we allow the value of osb->local_alloc_bits to be shrunk when needed. The default value is recorded in a mostly read-only variable so that we can re-initialize when required. Locking had to be updated so that we could protect changes to local_alloc_bits. Mostly this involves protecting various local alloc values with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which is used when the local allocator has shrunk, but is not disabled. If the available space dips below 1 megabyte, the local alloc file is disabled. In either case, local alloc is re-enabled 30 seconds after the event, or when an appropriate amount of bits is seen in the primary bitmap. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Tao Ma	4d0ddb2ce2	ocfs2: Add inode stealing for ocfs2_reserve_new_inode Inode allocation is modified to look in other nodes' allocators during extreme out of space situations. We retry our own slot when space is freed back to the global bitmap, or whenever we've allocated more than 1024 inodes from another slot. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Tao Ma	a4a4891164	ocfs2: Add ac_alloc_slot in ocfs2_alloc_context In inode stealing, we no longer restrict the allocation to happen in the local node. So it is necessary for us to add a new member in ocfs2_alloc_context to indicate which slot we are using for allocation. We also modify the process of local alloc so that this member can be used there also. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Tao Ma	ffda89a3bf	ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits In some cases (inode stealing from other nodes), we may not want ocfs2_reserve_suballoc_bits to allocate new groups from the global_bitmap since it may already be full. So add a new parameter for this. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-04-18 08:56:10 -07:00
Joe Perches	c78bad11fb	fs/: Spelling fixes Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2008-02-03 17:33:42 +02:00
Sunil Mushran	2fbe8d1ebe	ocfs2: Local alloc window size changeable via mount option Local alloc is a performance optimization in ocfs2 in which a node takes a window of bits from the global bitmap and then uses that for all small local allocations. This window size is fixed to 8MB currently. This patch allows users to specify the window size in MB including disabling it by passing in 0. If the number specified is too large, the fs will use the default value of 8MB. mount -o localalloc=X /dev/sdX /mntpoint Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2008-01-25 15:05:43 -08:00
Tao Ma	d659072f73	[PATCH 1/2] ocfs2: Add group extend for online resize This patch adds the ability for a userspace program to request an extend of last cluster group on an Ocfs2 file system. The request is made via ioctl, OCFS2_IOC_GROUP_EXTEND. This is derived from EXT3_IOC_GROUP_EXTEND, but is obviously Ocfs2 specific. tunefs.ocfs2 would call this for an online-resize operation if the last cluster group isn't full. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2008-01-25 14:53:35 -08:00
Mark Fasheh	e63aecb651	ocfs2: Rename ocfs2_meta_[un]lock Call this the "inode_lock" now, since it covers both data and meta data. This patch makes no functional changes. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2008-01-25 14:46:01 -08:00
Mark Fasheh	415cb80037	ocfs2: Allow smaller allocations during large writes The ocfs2 write code loops through a page much like the block code, except that ocfs2 allocation units can be any size, including larger than page size. Typically it's equal to or larger than page size - most kernels run 4k pages, the minimum ocfs2 allocation (cluster) size. Some changes introduced during 2.6.23 changed the way writes to pages are handled, and inadvertently broke support for > 4k page size. Instead of just writing one cluster at a time, we now handle the whole page in one pass. This means that multiple (small) separate allocations might happen in the same pass. The allocation code however typically optimizes by getting the maximum which was reserved. This triggered a BUG_ON in the extend code where it'd ask for a single bit (for one part of a > 4k page) and get back more than it asked for. Fix this by providing a variant of the high level allocation function which allows the caller to specify a maximum. The traditional function remains and just calls the new one with a maximum determined from the initial reservation. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-09-20 15:06:09 -07:00
Mark Fasheh	1f6697d072	ocfs2: use all extent block suballocators Now that we have a method to deallocate blocks from them, each node should allocate extent blocks from their local suballocator file. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-07-10 17:31:56 -07:00
Mark Fasheh	59a5e416d1	ocfs2: plug truncate into cached dealloc routines Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-07-10 17:31:55 -07:00
Mark Fasheh	2b604351bc	ocfs2: simplify deallocation locking Deallocation of suballocator blocks, most notably extent blocks, might involve multiple suballocator inodes. The locking for this can get extremely complicated, especially when the suballocator inodes to delete from aren't known until deep within an unrelated codepath. Implement a simple scheme for recording the blocks to be unlinked so that the actual deallocation can be done in a context which won't deadlock. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-07-10 17:31:54 -07:00
Mark Fasheh	1ca1a111b1	ocfs2: fix sparse warnings in fs/ocfs2 None of these are actually harmful, but the noise makes looking for real problems difficult. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-05-02 15:08:08 -07:00
Mark Fasheh	8110b073a9	ocfs2: Fix up i_blocks calculation to know about holes Older file systems which didn't support holes did a dumb calculation of i_blocks based on i_size. This is no longer accurate, so fix things up to take actual allocation into account. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2007-04-26 15:07:40 -07:00
Robert P. J. Day	cd86128088	[PATCH] Fix numerous kcalloc() calls, convert to kzalloc() All kcalloc() calls of the form "kcalloc(1,...)" are converted to the equivalent kzalloc() calls, and a few kcalloc() calls with the incorrect ordering of the first two arguments are fixed. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: Jeff Garzik <jeff@garzik.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Adam Belay <ambx1@neo.rr.com> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Greg KH <greg@kroah.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>	2006-12-13 09:05:52 -08:00
Mark Fasheh	1fabe1481f	ocfs2: Remove struct ocfs2_journal_handle in favor of handle_t This is mostly a search and replace as ocfs2_journal_handle is now no more than a container for a handle_t pointer. ocfs2_commit_trans() becomes very straight forward, and we remove some out of date comments / code. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-12-01 18:28:28 -08:00
Mark Fasheh	65eff9ccf8	ocfs2: remove handle argument to ocfs2_start_trans() All callers either pass in NULL directly, or a local variable that is already set to NULL. The internals of ocfs2_start_trans() get a nice cleanup as a result. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-12-01 18:28:23 -08:00
Mark Fasheh	02dc1af44e	ocfs2: pass ocfs2_super * into ocfs2_commit_trans() This sets us up to remove handle->journal. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-12-01 18:28:08 -08:00
Mark Fasheh	4bcec1847a	ocfs2: remove unused handle argument from ocfs2_meta_lock_full() Now that this is unused and all callers pass NULL, we can safely remove it. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-12-01 18:28:05 -08:00
Mark Fasheh	da5cbf2f9d	ocfs2: don't use handle for locking in allocation functions Instead we record our state on the allocation context structure which all callers already know about and lifetime correctly. This means the reservation functions don't need a handle passed in any more, and we can also take it off the alloc context. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-12-01 18:27:49 -08:00
Mark Fasheh	c161f89be7	ocfs2: remove ocfs2_journal_handle flags field Callers can set h_sync directly on the handle_t, whether a transaction has been started or not can be determined via the existence of the handle_t on the struct ocfs2_journal_handle. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-12-01 18:27:06 -08:00
Mark Fasheh	883d4cae4a	ocfs2: allocation hints Record the most recently used allocation group on the allocation context, so that subsequent allocations can attempt to optimize for contiguousness. Local alloc especially should benefit from this as the current chain search tends to let it spew across the disk. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-08-07 11:07:01 -07:00
Mark Fasheh	7bf72edee6	ocfs2: better group descriptor consistency checks Try to catch corrupted group descriptors with some stronger checks placed in a couple of strategic locations. Detect a failed resizefs and refuse to allocate past what bitmap i_clusters allows. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-08-07 11:04:38 -07:00
Mark Fasheh	b0697053f9	ocfs2: don't use MLF* in the file system Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>	2006-03-24 14:58:28 -08:00
Mark Fasheh	ccd979bdbc	[PATCH] OCFS2: The Second Oracle Cluster Filesystem The OCFS2 file system module. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Kurt Hackel <kurt.hackel@oracle.com>	2006-01-03 11:45:47 -08:00

42 Commits