1
Commit Graph

299 Commits

Author SHA1 Message Date
Kent Overstreet
778ac324cc bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
Open buckets on the partial list should not count as allocated when
we're trying to allocate from the partial list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Kent Overstreet
1c0ee43b2c bcachefs: BCH_FS_clean_recovery
Add a filesystem flag to indicate whether we did a clean recovery -
using c->sb.clean after we've got rw is incorrect, since c->sb is
updated whenever we write the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:22 -04:00
Kent Overstreet
8d65b15f8d bcachefs: Fix BCH_SB_ERRS() so we can reorder
BCH_SB_ERRS() has a field for the actual enum val so that we can reorder
to reorganize, but the way BCH_SB_ERR_MAX was defined didn't allow for
this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
83ccd9b31d bcachefs: bch_fs.rw_devs_change_count
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:49 -04:00
Kent Overstreet
4f19a60c32 bcachefs: Options for recovery_passes, recovery_passes_exclude
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-21 11:39:48 -04:00
Kent Overstreet
2c6a7bff2a bcachefs: Switch gc bucket array to a genradix
A user with a 30 tb device is overflowing the INT_MAX limit on vmalloc
allocations...

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:49 -04:00
Kent Overstreet
c7652f253a bcachefs: promote_whole_extents is now a normal option
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:48 -04:00
Kent Overstreet
112d21fd1a bcachefs: switch to rhashtable for vfs inodes hash
the standard vfs inode hash table suffers from painful lock contention -
this is long overdue

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Alyssa Ross
a3ed1cc413 bcachefs: Fix negative timespecs
This fixes two problems in the handling of negative times:

 • rem is signed, but the rem * c->sb.nsec_per_time_unit operation
   produced a bogus unsigned result, because s32 * u32 = u32.

 • The timespec was not normalized (it could contain more than a
   billion nanoseconds).

For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).

Cc: stable@vger.kernel.org
Fixes: 595c1e9bab ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:47 -04:00
Kent Overstreet
06a8693b89 bcachefs: Add a time_stat for blocked on key cache flush
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
cecf72798b bcachefs: Make allocator stuck timeout configurable, ratelimit messages
Limit these messages to once every 2 minutes to avoid spamming logs;
with multiple devices the output can be quite significant.

Also, up the default timeout to 30 seconds from 10 seconds.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Kent Overstreet
b0d3ab531f bcachefs: Reduce the scope of gc_lock
gc_lock is now only for synchronization between check_alloc_info and
interior btree updates - nothing else

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
2574e95a8b bcachefs: Refactor disk accounting data structures
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replica entries to for the percpu allocator's max
practical size.

Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:15 -04:00
Kent Overstreet
36008d5d01 bcachefs: Plumb more logging through stdio redirect
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:14 -04:00
Kent Overstreet
fb23d57a6d bcachefs: Convert gc to new accounting
Rewrite fsck/gc for the new accounting scheme.

This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all trigger in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
4c4a7d48bd bcachefs: Kill replicas_journal_res
More dead code deletion

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
8bb8d683a4 bcachefs: Delete journal-buf-sharded old style accounting
More deletion of dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
3afb8dbf03 bcachefs: kill bch2_fs_usage_read()
With bch2_ioctl_fs_usage(), this is now dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
f5095b9f85 bcachefs: dev_usage updated by new accounting
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.

This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a deice
counter update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
1d16c605cc bcachefs: Disk space accounting rewrite
Main part of the disk accounting rewrite.

This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.

With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.

Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.

An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5d9667d1d6 bcachefs: btree write buffer knows how to accumulate bch_accounting keys
Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.

Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
7773df19c3 bcachefs: metadata version bucket_stripe_sectors
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.

Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
26a170aa61 bcachefs: add capacity, reserved to fs_alloc_debug_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
64ee1431cc bcachefs: Discard, invalidate workers are now per device
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.

Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-25 18:47:55 -04:00
Kent Overstreet
cff07e2739 bcachefs: Guard against overflowing LRU_TIME_BITS
LRUs only have 48 bits for the time field (i.e. LRU order); thus we need
overflow checks and guards.

Reported-by: syzbot+df3bf3f088dcaa728857@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19 18:27:24 -04:00
Kent Overstreet
161f73c2c7 bcachefs: Split out btree_write_submit_wq
Split the workqueues for btree read completions and btree write
submissions; we don't want concurrency control on btree read
completions, but we do want concurrency control on write submissions,
else blocking in submit_bio() will cause a ton of kworkers to be
allocated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-10 13:17:15 -04:00
Kent Overstreet
088d0de812 bcachefs: btree_gc can now handle unknown btrees
Compatibility fix - we no longer have a separate table for which order
gc walks btrees in, and special case the stripes btree directly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-28 11:29:26 -04:00
Thomas Bertschinger
07f9a27f19 bcachefs: add no_invalid_checks flag
Setting this flag on a filesystem results in validity checks being
skipped when writing bkeys. This flag will be used by tooling that
deliberately injects corruption into a filesystem in order to exercise
fsck. It shouldn't be set outside of testing/debugging code.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-09 16:24:30 -04:00
Kent Overstreet
c670509134 bcachefs: Allocator prefers not to expand mi.btree_allocated bitmap
We now have a small bitmap in the member info section of the superblock
for "regions that have btree nodes", so that if we ever have to scan for
btree nodes in repair we don't have to scan the whole device(s).

This tweaks the allocator to prefer allocating from regions that are
already marked in this bitmap.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:24 -04:00
Kent Overstreet
3858aa4268 bcachefs: ptr_stale() -> dev_ptr_stale()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
dbd0408087 bcachefs: move replica_set from bch_dev to bch_fs
This is needed for the next patch - the write submit path has to be able
to allocate a replica bio even when we weren't able to get a ref on the
device.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:23 -04:00
Kent Overstreet
552aa54865 bcachefs: Debug asserts for ca->ref
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:22 -04:00
Kent Overstreet
24b27975a9 bcachefs: Kill gc_init_recurse()
This unifies the online and offline btree gc passes; we're not yet
running it online.

We now iterate over one level of the btree at a time - the same as
check_extents_to_backpointers(); this ordering preserves order of keys
regardless of btree splits and merges, which will be important when we
re-enable online gc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:21 -04:00
Kent Overstreet
930e1a92d6 bcachefs: kill gc looping for bucket gens
looping when we change a bucket gen is not ideal - it means we risk
failing if we'd go into an infinite loop, and it's better to make
forward progress even if fsck doesn't fix everything.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:20 -04:00
Kent Overstreet
f04158290d bcachefs: journal seq blacklist gc no longer has to walk btree
Since btree_ptr_v2, we no longer require the journal seq blacklist table
for skipping blacklisted bsets (btree node entries); the pointer to a
given node indicates how much data is present.

Therefore there's no longer any need for journal seq blacklist gc to
walk the btree - we can prune entries older than journal last_seq.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:20 -04:00
Kent Overstreet
103304021e bcachefs: Move gc of bucket.oldest_gen to workqueue
This is a nice cleanup - and we've also been having problems with
kthread creation in the mount path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:20 -04:00
Kent Overstreet
2f724563fc bcachefs: member helper cleanups
Some renaming for better consistency

bch2_member_exists	-> bch2_member_alive
bch2_dev_exists		-> bch2_member_exists
bch2_dev_exsits2	-> bch2_dev_exists
bch_dev_locked		-> bch2_dev_locked
bch_dev_bkey_exists	-> bch2_dev_bkey_exists

new helper - bch2_dev_safe

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:19 -04:00
Kent Overstreet
d155272b6e bcachefs: bucket_valid()
cut out a branch from doing it the obvious way

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:19 -04:00
Kent Overstreet
5577881455 bcachefs: add btree_node_merging_disabled debug param
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:18 -04:00
Kent Overstreet
9e203c43dc bcachefs: Fix missing write refs in fs fio paths
bch2_journal_flush_seq requires us to have a write ref

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-13 22:48:17 -04:00
Kent Overstreet
a292be3b68 bcachefs: Reconstruct missing snapshot nodes
When the snapshots btree is going, we'll have to delete huge amounts of
data - unless we can reconstruct it by looking at the keys that refer to
it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
55936afe11 bcachefs: Flag btrees with missing data
We need this to know when we should attempt to reconstruct the snapshots
btree

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:46:51 -04:00
Kent Overstreet
4409b8081d bcachefs: Repair pass for scanning for btree nodes
If a btree root or interior btree node goes bad, we're going to lose a
lot of data, unless we can recover the nodes that it pointed to by
scanning.

Fortunately btree node headers are fully self describing, and
additionally the magic number is xored with the filesytem UUID, so we
can do so safely.

This implements the scanning - next patch will rework topology repair to
make use of the found nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
d2554263ad bcachefs: Split out recovery_passes.c
We've grown a fair amount of code for managing recovery passes; tracking
which ones we're running, which ones need to be run, and flagging in the
superblock which ones need to be run on the next recovery.

So it's worth splitting out into its own file, this code is pretty
different from the code in recovery.c.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
63332394c7 bcachefs: Move snapshot table size to struct snapshot_table
We need to add bounds checking for snapshot table accesses - it turns
out there are cases where we do need to use the snapshots table before
fsck checks have completed (and indeed, fsck may not have been run).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
a0a466ea98 bcachefs: Split out btree_node_rewrite_worker
This fixes a deadlock due to using btree_interior_update_worker for non
interior updates - async btree node rewrites were blocking, and then
blocking other interior updates.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-17 20:53:12 -04:00
Darrick J. Wong
273960b8f3 bcachefs: time_stats: split stats-with-quantiles into a separate structure
Currently, struct time_stats has the optional ability to quantize the
information that it collects.  This is /probably/ useful for callers who
want to see quantized information, but it more than doubles the size of
the structure from 224 bytes to 464.  For users who don't care about
that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per
counter, split the two into separate pieces.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:38:01 -04:00
Kent Overstreet
f1ca1abfb0 bcachefs: pull out time_stats.[ch]
prep work for lifting out of fs/bcachefs/

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:30:35 -04:00
Kent Overstreet
894d062254 bcachefs: Rename journal_keys.d -> journal_keys.data
This will let us use some darray helpers in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
a393f33123 bcachefs: Split out discard fastpath
Buckets usually can't be discarded until the transaction that made them
empty has been committed in the journal.

Tracing has indicated that we're queuing the discard worker excessively,
only for it to skip over many buckets that are still waiting on a
journal commit, discarding only one or two buckets per iteration.

We want to switch to only queuing the discard worker after a journal
flush write, but there's an important optimization we need to preserve:
if a bucket becomes empty and it was never committed in the journal
while it was in use, we want to discard it and reuse it right away -
since overwriting it before the previous writes are flushed from the
device cache eans those writes only cost bus bandwidth.

So, this patch implements a fast path for buckets that can be discarded
right away. We need new locking between the two discard workers; the new
list of buckets being discarded provides that locking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00