1
Commit Graph

112 Commits

Author SHA1 Message Date
Alex Elder
f42299e6c3 libceph: small refactor in write_partial_kvec()
Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call.  Same functionality, just blocked out
a little differently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
fe3ad593e2 libceph: do crc calculations outside loop
Move blocks of code out of loops in read_partial_message_section()
and read_partial_message().  They were only was getting called at
the end of the last iteration of the loop anyway.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
a9a0c51af4 libceph: separate CRC calculation from byte swapping
Calculate CRC in a separate step from rearranging the byte order
of the result, to improve clarity and readability.

Use offsetof() to determine the number of bytes to include in the
CRC calculation.

In read_partial_message(), switch which value gets byte-swapped,
since the just-computed CRC is already likely to be in a register.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
bca064d236 libceph: use "do" in CRC-related Boolean variables
Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.

Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages():  the value of "do_crc"
assigned appears to be the opposite of what it should be.  No
attempt to fix this is made here; this change preserves the
erroneous behavior.  The problem I found is documented here:
    http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
cffaba15cd ceph: ensure Boolean options support both senses
Many ceph-related Boolean options offer the ability to both enable
and disable a feature.  For all those that don't offer this, add
a new option so that they do.

Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
d3002b974c libceph: a few small changes
This gathers a number of very minor changes:
    - use %hu when formatting the a socket address's address family
    - null out the ceph_msgr_wq pointer after the queue has been
      destroyed
    - drop a needless cast in ceph_write_space()
    - add a WARN() call in ceph_state_change() in the event an
      unrecognized socket state is encountered
    - rearrange the logic in ceph_con_get() and ceph_con_put() so
      that:
        - the reference counts are only atomically read once
	- the values displayed via dout() calls are known to
	  be meaningful at the time they are formatted

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
41617d0c9c libceph: make ceph_tcp_connect() return int
There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller.  Instead, have it return an error code,
which tidies things up a bit.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:51 -05:00
Alex Elder
6173d1f02f libceph: encapsulate some messenger cleanup code
Define a helper function to perform various cleanup operations.  Use
it both in the exit routine and in the init routine in the event of
an error.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
e0f43c9419 libceph: make ceph_msgr_wq private
The messenger workqueue has no need to be public.  So give it static
scope.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
859eb79948 libceph: encapsulate connection kvec operations
Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array.  Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.

Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
963be4d770 libceph: move prepare_write_banner()
One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.

This was split out from the next patch to highlight this change in
logic.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder
ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder
99f0f3b2c4 ceph: eliminate some abusive casts
This fixes some spots where a type cast to (void *) was used as
as a universal type hiding mechanism.  Instead, properly cast the
type to the intended target type.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
bd40614512 ceph: eliminate some needless casts
This eliminates type casts in some places where they are not
required.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
f64a93172b ceph: kill addr_str_lock spinlock; use atomic instead
A spinlock is used to protect a value used for selecting an array
index for a string used for formatting a socket address for human
consumption.  The index is reset to 0 if it ever reaches the maximum
index value.

Instead, use an ever-increasing atomic variable as a sequence
number, and compute the array index by masking off all but the
sequence number's lowest bits.  Make the number of entries in the
array a power of two to allow the use of such a mask (to avoid jumps
in the index value when the sequence number wraps).

The length of these strings is somewhat arbitrarily set at 60 bytes.
The worst-case length of a string produced is 54 bytes, for an IPv6
address that can't be shortened, e.g.:
    [1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
Change it so we arbitrarily use 64 bytes instead; if nothing else
it will make the array of these line up better in hex dumps.

Rename a few things to reinforce the distinction between the number
of strings in the array and the length of individual strings.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
a5bc3129a2 ceph: make use of "else" where appropriate
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
re-testing a value with consecutive "if" statements.  Don't record a
connection's socket pointer unless the connect operation is
successful.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Alex Elder
5766651971 ceph: use a shared zero page rather than one per messenger
Each messenger allocates a page to be used when writing zeroes
out in the event of error or other abnormal condition.  Instead,
use the kernel ZERO_PAGE() for that purpose.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Xi Wang
6448669777 libceph: fix overflow check in crush_decode()
The existing overflow check (n > ULONG_MAX / b) didn't work, because
n = ULONG_MAX / b would both bypass the check and still overflow the
allocation size a + n * b.

The correct check should be (n > (ULONG_MAX - a) / b).

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:45 -05:00
Jim Schutt
182fac2689 net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer
The Ceph messenger would sometimes queue multiple work items to write
data to a socket when the socket buffer was full.

Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the
same way that net/core/stream.c:sk_stream_write_space() does, i.e.,
clearing it only when sufficient space is available in the socket buffer.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:45 -05:00
Linus Torvalds
6c073a7ee2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix safety of rbd_put_client()
  rbd: fix a memory leak in rbd_get_client()
  ceph: create a new session lock to avoid lock inversion
  ceph: fix length validation in parse_reply_info()
  ceph: initialize client debugfs outside of monc->mutex
  ceph: change "ceph.layout" xattr to be "ceph.file.layout"
2012-02-02 15:47:33 -08:00
Sage Weil
ab434b60ab ceph: initialize client debugfs outside of monc->mutex
Initializing debufs under monc->mutex introduces a lock dependency for
sb->s_type->i_mutex_key, which (combined with several other dependencies)
leads to an annoying lockdep warning.  There's no particular reason to do
the debugfs setup under this lock, so move it out.

It used to be the case that our first monmap could come from the OSD; that
is no longer the case with recent servers, so we will reliably set up the
client entry during the initial authentication.

We don't have to worry about racing with debugfs teardown by
ceph_debugfs_client_cleanup() because ceph_destroy_client() calls
ceph_msgr_flush() first, which will wait for the message dispatch work
to complete (and the debugfs init to complete).

Fixes: #1940
Signed-off-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:01 -08:00
Sage Weil
56e925b677 libceph: remove useless return value for osd_client __send_request()
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:57:03 -08:00
Sage Weil
e11b05d31f crush: fix force for non-root TAKE
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:56:57 -08:00
Thomas Meyer
186482560f ceph: Use kmemdup rather than duplicating its implementation
Use kmemdup rather than duplicating its implementation

The semantic patch that makes this change is available
in scripts/coccinelle/api/memdup.cocci.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-10 08:56:54 -08:00
Linus Torvalds
653f42f6b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: add missing spin_unlock at ceph_mdsc_build_path()
  ceph: fix SEEK_CUR, SEEK_SET regression
  crush: fix mapping calculation when force argument doesn't exist
  ceph: use i_ceph_lock instead of i_lock
  rbd: remove buggy rollback functionality
  rbd: return an error when an invalid header is read
  ceph: fix rasize reporting by ceph_show_options
2011-12-13 14:59:42 -08:00
Sage Weil
f1932fc1a6 crush: fix mapping calculation when force argument doesn't exist
If the force argument isn't valid, we should continue calculating a
mapping as if it weren't specified.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-12 09:09:45 -08:00
Linus Torvalds
c292fe4aae Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: Allocate larger oid buffer in request msgs
  ceph: initialize root dentry
  ceph: fix iput race when queueing inode work
2011-11-21 12:11:13 -08:00
Stratos Psomadakis
224736d911 libceph: Allocate larger oid buffer in request msgs
ceph_osd_request struct allocates a 40-byte buffer for object names.
RBD image names can be up to 96 chars long (100 with the .rbd suffix),
which results in the object name for the image being truncated, and a
subsequent map failure.

Increase the oid buffer in request messages, in order to avoid the
truncation.

Signed-off-by: Stratos Psomadakis <psomas@grnet.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-11-11 09:50:19 -08:00
Paul Gortmaker
bc3b2d7fb9 net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:30 -04:00
Sage Weil
38d6453ca3 libceph: force resend of osd requests if we skip an osdmap
If we skip over one or more map epochs, we need to resend all osd requests
because it is possible they remapped to other servers and then back.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:17 -07:00
Noah Watkins
ee3b56f265 ceph: use kernel DNS resolver
Change ceph_parse_ips to take either names given as
IP addresses or standard hostnames (e.g. localhost).
The DNS lookup is done using the dns_resolver facility
similar to its use in AFS, NFS, and CIFS.

This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER
that controls if this feature is on or off.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Noah Watkins
49d9224c04 ceph: fix ceph_monc_init memory leak
failure clean up does not consider ceph_auth_init.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Sage Weil
f0ed1b7cef libceph: warn on msg allocation failures
Any non-masked msg allocation failure should generate a warning and stack
trace to the console.  All of these need to eventually be replaced by
safe preallocation or msgpools.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:16 -07:00
Sage Weil
b61c27636f libceph: don't complain on msgpool alloc failures
The pool allocation failures are masked by the pool; there is no need to
spam the console about them.  (That's the whole point of having the pool
in the first place.)

Mark msg allocations whose failure is safely handled as such.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
f6a2f5be07 libceph: always preallocate mon connection
Allocate the mon connection on init.  We already reuse it across
reconnects.  Remove now unnecessary (and incomplete) NULL checks.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Sage Weil
6ab00d465a libceph: create messenger with client
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Linus Torvalds
92bb062fe3 Merge branch 'for-linus' of git://github.com/NewDreamNetwork/ceph-client
* 'for-linus' of git://github.com/NewDreamNetwork/ceph-client:
  libceph: fix pg_temp mapping update
  libceph: fix pg_temp mapping calculation
  libceph: fix linger request requeuing
  libceph: fix parse options memory leak
  libceph: initialize ack_stamp to avoid unnecessary connection reset
2011-09-29 19:58:58 -07:00
Sage Weil
8adc8b3d78 libceph: fix pg_temp mapping update
The incremental map updates have a record for each pg_temp mapping that is
to be add/updated (len > 0) or removed (len == 0).  The old code was
written as if the updates were a complete enumeration; that was just wrong.
Update the code to remove 0-length entries and drop the rbtree traversal.

This avoids misdirected (and hung) requests that manifest as server
errors like

[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11

Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-28 10:13:35 -07:00
Sage Weil
782e182e91 libceph: fix pg_temp mapping calculation
We need to apply the modulo pg_num calculation before looking up a pgid in
the pg_temp mapping rbtree.  This fixes pg_temp mappings, and fixes
(some) misdirected requests that result in messages like

[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11

on the server and stall make the client block without getting a reply (at
least until the pg_temp mapping goes way, but that can take a long long
time).

Reorder calc_pg_raw() a bit to make more sense.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-28 10:13:31 -07:00
Sage Weil
935b639a04 libceph: fix linger request requeuing
The r_req_lru_item list node moves between several lists, and that cycle
is not directly related (and does not begin) with __register_request().
Initialize it in the request constructor, not __register_request(). This
fixes later badness (below) when OSDs restart underneath an rbd mount.

Crashes we've seen due to this include:

[  213.974288] kernel BUG at net/ceph/messenger.c:2193!

and

[  144.035274] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[  144.035278] IP: [<ffffffffa036c053>] con_work+0x1463/0x2ce0 [libceph]

Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-16 11:13:17 -07:00
Noah Watkins
1cad78932a libceph: fix parse options memory leak
ceph_destroy_options does not free opt->mon_addr that
is allocated in ceph_parse_options.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-16 09:19:53 -07:00
Jim Schutt
c0d5f9db1c libceph: initialize ack_stamp to avoid unnecessary connection reset
Commit 4cf9d54463 recorded when an outgoing ceph message was ACKed,
in order to avoid unnecessary connection resets when an OSD is busy.

However, ack_stamp is uninitialized, so there is a window between
when the message is sent and when it is ACKed in which handle_timeout()
interprets the unitialized value as an expired timeout, and resets
the connection unnecessarily.

Close the window by initializing ack_stamp.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-09-16 09:16:22 -07:00
Linus Torvalds
0d20fbbe82 Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client
* 'for-linus' of git://ceph.newdream.net/git/ceph-client:
  libceph: fix leak of osd structs during shutdown
  ceph: fix memory leak
  ceph: fix encoding of ino only (not relative) paths
  libceph: fix msgpool
2011-09-09 15:48:34 -07:00
Sage Weil
aca420bc51 libceph: fix leak of osd structs during shutdown
We want to remove all OSDs, not just those on the idle LRU.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-31 15:22:46 -07:00
Sage Weil
5185352c16 libceph: fix msgpool
There were several problems here:

 1- we weren't tagging allocations with the pool, so they were never
    returned to the pool.
 2- msgpool_put didn't add back to the mempool, even it were called.
 3- msgpool_release didn't clear the pool pointer, so it would have looped
    had #1 not been broken.

These may or may not have been responsible for #1136 or #1381 (BUG due to
non-empty mempool on umount).  I can't seem to trigger the crash now using
the method I was using before.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-08-09 15:26:17 -07:00
Linus Torvalds
ba5b56cb3e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: document unlocked d_parent accesses
  ceph: explicitly reference rename old_dentry parent dir in request
  ceph: document locking for ceph_set_dentry_offset
  ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
  ceph: protect d_parent access in ceph_d_revalidate
  ceph: protect access to d_parent
  ceph: handle racing calls to ceph_init_dentry
  ceph: set dir complete frag after adding capability
  rbd: set blk_queue request sizes to object size
  ceph: set up readahead size when rsize is not passed
  rbd: cancel watch request when releasing the device
  ceph: ignore lease mask
  ceph: fix ceph_lookup_open intent usage
  ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
  ceph: fix bad parent_inode calc in ceph_lookup_open
  ceph: avoid carrying Fw cap during write into page cache
  libceph: don't time out osd requests that haven't been received
  ceph: report f_bfree based on kb_avail rather than diffing.
  ceph: only queue capsnap if caps are dirty
  ceph: fix snap writeback when racing with writes
  ...
2011-07-26 13:38:50 -07:00
Sage Weil
4cf9d54463 libceph: don't time out osd requests that haven't been received
Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing).  Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-26 11:27:24 -07:00
David S. Miller
033b1142f4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/bluetooth/l2cap_core.c
2011-07-21 13:38:42 -07:00
Sage Weil
38be7a79f7 ceph: fix file mode calculation
open(2) must always include one of O_RDONLY, O_WRONLY, or O_RDWR.  No need
for any O_APPEND special case.

Passing O_WRONLY|O_RDWR is undefined according to the man page, but the
Linux VFS interprets this as O_RDWR, so we'll do the same.

This fixes open(2) with flags O_RDWR|O_APPEND, which was incorrectly being
translated to readonly.

Reported-by: Fyodor Ustinov <ufm@ufm.su>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-07-19 11:25:04 -07:00
David S. Miller
6a7ebdf2fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/bluetooth/l2cap_core.c
2011-07-14 07:56:40 -07:00