Open buckets on the partial list should not count as allocated when
we're trying to allocate from the partial list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a filesystem flag to indicate whether we did a clean recovery -
using c->sb.clean after we've gone rw is incorrect, since c->sb is
updated whenever we write the superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_SB_ERRS() has a field for the actual enum val so that we can reorder
and reorganize the errors, but the way BCH_SB_ERR_MAX was defined didn't
allow for this.
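For illustration of the pitfall (hypothetical macro names, not the actual
BCH_SB_ERRS() list): when x-macro entries carry explicit values, a MAX
enumerator appended inside the enum ends up as "last listed value + 1",
which goes wrong as soon as the list is reordered; counting the entries
instead doesn't care about order:

  #define EXAMPLE_ERRS()          \
          x(bad_magic,     0)     \
          x(bad_checksum,  2)     \
          x(bad_seq,       1)

  enum example_err {
  #define x(t, n) EXAMPLE_ERR_##t = n,
          EXAMPLE_ERRS()
  #undef x
          EXAMPLE_ERR_MAX_BROKEN  /* == 2: wrong once the list is reordered */
  };

  enum {
  #define x(t, n) + 1
          EXAMPLE_ERR_NR = 0 EXAMPLE_ERRS()  /* == 3, reorder-proof */
  #undef x
  };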
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The standard vfs inode hash table suffers from painful lock contention -
switching away from it is long overdue.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes two problems in the handling of negative times:
• rem is signed, but the rem * c->sb.nsec_per_time_unit operation
produced a bogus unsigned result, because s32 * u32 = u32.
• The timespec was not normalized (it could contain more than a
billion nanoseconds).
For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).
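For reference, a standalone userspace demonstration of the promotion issue
(simplified, not the actual conversion code; the values are chosen to be
consistent with the example above):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          int32_t  rem  = -250000000;   /* signed remainder, in ns */
          uint32_t unit = 1;            /* nsec_per_time_unit      */

          uint64_t bad  = rem * unit;             /* s32 * u32 is done as u32 */
          int64_t  good = (int64_t) rem * unit;   /* multiply widened first   */

          /* prints bad=4044967296 good=-250000000 */
          printf("bad=%llu good=%lld\n",
                 (unsigned long long) bad, (long long) good);
          return 0;
  }

The multiply has to be done in 64 bits, and the result then needs
normalizing so tv_nsec stays in [0, NSEC_PER_SEC);
set_normalized_timespec64() is the stock kernel helper for the latter.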
Cc: stable@vger.kernel.org
Fixes: 595c1e9bab ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Limit these messages to once every 2 minutes to avoid spamming the logs;
with multiple devices the output adds up quickly.
Also, up the default timeout to 30 seconds from 10 seconds.
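Sketch of the throttling, using the stock ratelimit API for illustration
(bcachefs has its own helpers; names and message here are illustrative):

  static DEFINE_RATELIMIT_STATE(dev_wait_rs, 120 * HZ, 1);

  if (__ratelimit(&dev_wait_rs))
          pr_notice("still waiting for devices, %u seconds until timeout\n",
                    seconds_remaining);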
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gc_lock is now only for synchronization between check_alloc_info and
interior btree updates - nothing else
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replica entries for the percpu allocator's maximum
practical allocation size.
Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.
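Roughly the shape of the per-counter allocation (struct and function names
here are hypothetical):

  #include <linux/percpu.h>

  /* one percpu allocation per accounting entry, instead of one giant
   * percpu array sized by the total number of entries */
  struct mem_counter {
          unsigned        nr_counters;    /* u64s per cpu for this entry */
          u64 __percpu    *v;
  };

  static int mem_counter_init(struct mem_counter *c, unsigned nr)
  {
          c->nr_counters = nr;
          c->v = __alloc_percpu(nr * sizeof(u64), sizeof(u64));
          return c->v ? 0 : -ENOMEM;
  }

  static u64 mem_counter_read(struct mem_counter *c, unsigned i)
  {
          u64 ret = 0;
          int cpu;

          for_each_possible_cpu(cpu)
                  ret += per_cpu_ptr(c->v, cpu)[i];
          return ret;
  }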
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all triggers in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.
This patch special cases the device counters; when we update in-memory
accounting we also update the old-style percpu counters if it's a device
counter update.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percpu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in-memory accounting requires a single set of percpu counters - not
multiple for each in-flight journal buffer - and in the future we'll
probably also have counters that don't use in-memory percpu counters at
all, since they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
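A sketch of the commit-time hot path under the new scheme (names are
hypothetical; the lookup corresponds to the eytzinger search used by
bch2_accounting_mem_read()):

  struct mem_acct_entry {
          struct bpos     pos;            /* accounting key position */
          unsigned        nr_counters;    /* u64s per cpu            */
          u64 __percpu    *v;
  };

  struct mem_acct {
          struct mem_acct_entry   *entries;       /* kept in eytzinger order */
          unsigned                nr;
  };

  /* eytzinger search by position; returns m->nr if not found */
  unsigned mem_acct_lookup(struct mem_acct *m, struct bpos pos);

  /* called at transaction commit for each accounting update */
  static int mem_acct_mod(struct mem_acct *m, struct bpos pos,
                          const s64 *d, unsigned nr)
  {
          unsigned idx = mem_acct_lookup(m, pos);

          if (idx >= m->nr)
                  return -ENOENT;         /* caller adds a new entry first */

          for (unsigned i = 0; i < nr; i++)
                  this_cpu_add(m->entries[idx].v[i], d[i]);
          return 0;
  }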
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.
Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.
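The accumulate step itself is just elementwise addition (hypothetical
types; counters past ->nr are assumed to be zero):

  struct acct_update {
          struct bpos     pos;    /* same pos => same set of counters */
          unsigned        nr;
          s64             d[8];
  };

  /* instead of dst = src ("newest wins"), buffered accounting updates
   * to the same position are summed */
  static void acct_accumulate(struct acct_update *dst,
                              const struct acct_update *src)
  {
          for (unsigned i = 0; i < src->nr; i++)
                  dst->d[i] += src->d[i];
          dst->nr = max(dst->nr, src->nr);
  }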
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.
Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.
Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
LRUs only have 48 bits for the time field (i.e. LRU order); thus we need
overflow checks and guards.
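One form the guard can take is clamping to the 48-bit field before the
time is packed into an LRU key position (helper name is illustrative):

  #define LRU_TIME_BITS   48
  #define LRU_TIME_MAX    ((1ULL << LRU_TIME_BITS) - 1)

  static inline u64 lru_time_clamp(u64 time)
  {
          return min_t(u64, time, LRU_TIME_MAX);
  }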
Reported-by: syzbot+df3bf3f088dcaa728857@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Split the workqueues for btree read completions and btree write
submissions; we don't want concurrency control on btree read
completions, but we do want concurrency control on write submissions,
else blocking in submit_bio() will cause a ton of kworkers to be
allocated.
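Illustrative allocation of the two queues (names and flags are
approximate; the point is the difference in max_active):

  struct workqueue_struct *read_complete_wq, *write_submit_wq;

  /* read completions: effectively no concurrency limit */
  read_complete_wq = alloc_workqueue("btree_read_complete",
                                     WQ_HIGHPRI | WQ_MEM_RECLAIM, 512);

  /* write submissions may block in submit_bio(), so cap concurrency
   * instead of letting the workqueue spawn piles of kworkers */
  write_submit_wq = alloc_workqueue("btree_write_submit",
                                    WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);

  if (!read_complete_wq || !write_submit_wq)
          return -ENOMEM;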
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Compatibility fix - we no longer have a separate table specifying the
order in which gc walks btrees; instead, gc special cases the stripes
btree directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Setting this flag on a filesystem results in validity checks being
skipped when writing bkeys. This flag will be used by tooling that
deliberately injects corruption into a filesystem in order to exercise
fsck. It shouldn't be set outside of testing/debugging code.
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We now have a small bitmap in the member info section of the superblock
for "regions that have btree nodes", so that if we ever have to scan for
btree nodes in repair we don't have to scan the whole device(s).
This tweaks the allocator to prefer allocating from regions that are
already marked in this bitmap.
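Sketch of the allocator-side check (field and helper names hypothetical):
each member carries a shift plus a 64-bit map over equal-sized regions of
the device, and a candidate bucket is preferred when its region's bit is
set:

  static bool bucket_in_btree_region(u64 btree_allocated_bitmap,
                                     unsigned btree_bitmap_shift,
                                     u64 bucket_start_sector)
  {
          u64 region = bucket_start_sector >> btree_bitmap_shift;

          return region < 64 && (btree_allocated_bitmap & BIT_ULL(region));
  }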
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed for the next patch - the write submit path has to be able
to allocate a replica bio even when we weren't able to get a ref on the
device.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This unifies the online and offline btree gc passes; we're not yet
running it online.
We now iterate over one level of the btree at a time - the same as
check_extents_to_backpointers(); this ordering preserves order of keys
regardless of btree splits and merges, which will be important when we
re-enable online gc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Looping when we change a bucket gen is not ideal - we risk failing
entirely if the loop never terminates, and it's better to make forward
progress even if fsck doesn't fix everything.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since btree_ptr_v2, we no longer require the journal seq blacklist table
for skipping blacklisted bsets (btree node entries); the pointer to a
given node indicates how much data is present.
Therefore there's no longer any need for journal seq blacklist gc to
walk the btree - we can prune entries older than journal last_seq.
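The pruning itself reduces to a simple filter over the in-memory table
(hypothetical types, sketch only):

  struct blacklist_entry {
          u64     start;
          u64     end;
  };

  struct blacklist_table {
          unsigned                nr;
          struct blacklist_entry  entries[];
  };

  /* drop every entry that ended before the journal's last_seq */
  static void blacklist_prune(struct blacklist_table *t, u64 last_seq)
  {
          unsigned dst = 0;

          for (unsigned i = 0; i < t->nr; i++)
                  if (t->entries[i].end >= last_seq)
                          t->entries[dst++] = t->entries[i];
          t->nr = dst;
  }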
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a nice cleanup - and we've also been having problems with
kthread creation in the mount path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the snapshots btree is gone, we'll have to delete huge amounts of
data - unless we can reconstruct it by looking at the keys that refer to
it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree root or interior btree node goes bad, we're going to lose a
lot of data, unless we can recover the nodes that it pointed to by
scanning.
Fortunately btree node headers are fully self describing, and
additionally the magic number is xored with the filesystem UUID, so we
can do so safely.
This implements the scanning - next patch will rework topology repair to
make use of the found nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.
So it's worth splitting this out into its own file; this code is pretty
different from the code in recovery.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).
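The accessors end up looking something like this (sketch; snapshot IDs
are allocated downward from U32_MAX, so the table index is U32_MAX - id):

  struct snapshot_t {
          u32     parent;
          u32     depth;
          /* other per-snapshot fields elided */
  };

  struct snapshot_table {
          u32                     nr;
          struct snapshot_t       s[];
  };

  static inline struct snapshot_t *snapshot_t_or_null(struct snapshot_table *t,
                                                      u32 id)
  {
          u32 idx = U32_MAX - id;

          return t && idx < t->nr ? &t->s[idx] : NULL;
  }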
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a deadlock due to using btree_interior_update_worker for non
interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, struct time_stats has the optional ability to quantize the
information that it collects. This is /probably/ useful for callers who
want to see quantized information, but it more than doubles the size of
the structure from 224 bytes to 464. For users who don't care about
that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per
counter, split the two into separate pieces.
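The shape of the split (field layout here is illustrative only):

  /* the cheap part every caller gets */
  struct time_stats {
          u64     min_duration;
          u64     max_duration;
          u64     total_duration;
          /* mean/variance accumulators etc. elided -- ~224 bytes total */
  };

  /* callers that want quantiles opt in by embedding the base struct */
  struct time_stats_quantiles {
          struct time_stats       stats;
          struct quantiles {
                  struct quantile_entry {
                          u64     m;
                          u64     step;
                  }               entries[15];    /* the ~240 extra bytes */
          }                       quantiles;
  };

Code handed a bare struct time_stats pointer can get back to the wrapper
with container_of(), provided something records that the quantiles are
actually present.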
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.
Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.
We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache means those writes only cost bus bandwidth.
So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.
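Sketch of the claim step shared by both paths (structure is hypothetical;
the point is that membership on the in-flight list is the lock):

  struct discard_in_flight {
          struct list_head        list;
          u64                     bucket;
  };

  struct dev_discard_state {
          spinlock_t              lock;
          struct list_head        in_flight;
  };

  /* both the fast path and the regular discard worker claim a bucket
   * here before issuing a discard; whoever finds it already on the
   * list backs off */
  static bool discard_claim_bucket(struct dev_discard_state *d,
                                   struct discard_in_flight *i, u64 bucket)
  {
          struct discard_in_flight *j;
          bool claimed = true;

          spin_lock(&d->lock);
          list_for_each_entry(j, &d->in_flight, list) {
                  if (j->bucket == bucket) {
                          claimed = false;
                          break;
                  }
          }
          if (claimed) {
                  i->bucket = bucket;
                  list_add(&i->list, &d->in_flight);
          }
          spin_unlock(&d->lock);

          return claimed;
  }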
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>