Use try_cmpxchg() family of functions instead of
cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).
Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On a new filesystem or device we have to allocate the journal with a
bump allocator, because allocation info isn't ready yet - but when
hot-adding a device that doesn't have a journal, we don't want to use
that path.
Reported-by: syzbot+24a867cb90d8315cccff@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouln't be running the journal shutdown sequence if we never fully
initialized the journal.
Reported-by: syzbot+ffd2270f0bca3322ee00@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we block on the allocator for more than 10 seconds, print out some
useful debugging info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
prep work for making the allocator try to keep btree nodes within the
existing member info btree allocated bitmap
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Combine iter/update/trigger/str_hash flags into a single enum, and
x-macroize them for a to_text() function later.
These flags are all for a specific iter/key/update context, so it makes
sense to group them together - iter/update/trigger flags were already
given distinct bits, this cleans up and unifies that handling.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Consolidate mark_superblock() and trans_mark_superblock(), like we did
with the other trigger paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're using mutex_lock() inside a wait_event() conditional -
prepare_to_wait() has already flipped task state, so potentially
blocking ops need annotation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to check for journal shutdown first in __journal_res_get() -
after the journal is shutdown, j->watermark won't be changing anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Improved journal pipelining broke journal_noflush_seq(); it implicitly
assumed only the oldest outstanding journal buf could be in flight, but
that's no longer true.
Make this more straightforward by just setting buf->must_flush whenever
we know a journal buf is going to be flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recently a severe performance regression was discovered, which bisected
to
a6548c8b5e bcachefs: Avoid flushing the journal in the discard path
It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.
The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.
This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.
This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previosuly, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.
This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.
Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.
We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for, if it becomes an issue in practice we can do
additional mitigation.
The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces a new helper for connecting time_stats to state changes,
i.e. when taking journal reservations is blocked for some reason.
We use this to track separately the different reasons the journal might
be blocked - i.e. space in the journal full, or the journal pin fifo
full.
Also do some cleanup and improvements on the time stats code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since outstanding journal buffers hold a journal pin, when flushing all
pins we need to close the current journal entry if necessary so its pin
can be released.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
journal_cur_seq() can legitimately be used outside of the journal lock,
where this assert can race
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than 3/4s full we only allow
journal reclaim to get new journal reservations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't mark device superblocks or allocate journal on a device that
isn't online.
That means we may need to do this on every mount, because we may have
formatted a new filesystem and then done the first mount
(bch2_fs_initialize()) in degraded mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Instead of using token pasting to generate methods for each superblock
section, just make the type a parameter to bch2_sb_field_get().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs freeze testing via fstests generic/390 occasionally
reproduces the following BUG from bch2_fs_read_only():
BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));
This indicates that one or more dirty key cache keys still exist
after the attempt to flush and quiesce the fs. The sequence that
leads to this problem actually occurs on unfreeze (ro->rw), and
looks something like the following:
- Task A begins a transaction commit and acquires journal_res for
the current seq. This transaction intends to perform key cache
insertion.
- Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends
up in journal_entry_want_write(), which closes the current journal
entry and drops the reference to the pin list created on entry open.
The pin put pops the front of the journal via fast reclaim since the
reference count has dropped to 0.
- Task A attempts to set the journal pin for the associated cached
key, but bch2_journal_pin_set() skips the pin insert because the
seq of the transaction reservation is behind the front of the pin
list fifo.
The end result is that the pin associated with the cached key is not
added, which prevents a subsequent reclaim from processing the key
and thus leaves it dangling at freeze time. The fundamental cause of
this problem is that the front of the journal is allowed to pop
before a transaction with outstanding reservation on the associated
journal seq is able to add a pin. The count for the pin list
associated with the seq drops to zero and is prematurely reclaimed
as a result.
The logical fix for this problem lies in how the journal buffer is
managed in similar scenarios where the entry might have been closed
before a transaction with outstanding reservations happens to be
committed.
When a journal entry is opened, the current sequence number is
bumped, the associated pin list is initialized with a reference
count of 1, and the journal buffer reference count is bumped (via
journal_state_inc()). When a journal reservation is acquired, the
reservation also acquires a reference on the associated buffer. If
the journal entry is closed in the meantime, it drops both the pin
and buffer references held by the open entry, but the buffer still
has references held by outstanding reservation. After the associated
transaction commits, the reservation release drops the associated
buffer references and the buffer is written out once the reference
count has dropped to zero.
The fundamental problem here is that the lifecycle of the pin list
reference held by an open journal entry is too short to cover the
processing of transactions with outstanding reservations. The
simplest way to address this is to expand the pin list reference to
the lifecycle of the buffer vs. the shorter lifecycle of the open
journal entry. This ensures the pin list for a seq with outstanding
reservation cannot be popped and reclaimed before all outstanding
reservations have been released, even if the associated journal
entry has been closed for further reservations.
Move the pin put from journal entry close to where final processing
of the journal buffer occurs. Create a duplicate helper to cover the
case where the caller doesn't already hold the journal lock. This
allows generic/390 to pass reliably.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs freeze testing has uncovered some raciness between journal
entry open/close and pin list reference count management. The
details of the problem are described in a separate patch. In
preparation for the associated fix, refactor the journal buffer put
path a bit to allow it to eventually handle dropping the pin list
reference currently held by an open journal entry.
Retain the journal write dispatch helper since the closure code is
inlined and we don't want to increase the amount of inline code in
the transaction commit path, but rename the function to reflect
the purpose of final processing of the journal buffer.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a couple journal pin put helpers to handle cases where the
journal lock is already held or not. Refactor the helpers to lock
and reclaim from the highest level and open code the reclaim from
the one caller of the internal variant. The latter call will be
moved into the journal buf release helper in a later patch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>