Always get Split brain after reboot of both nodes

Hello,
I always get a split-brain situation on one DRBD device after rebooting both nodes, and I wonder why this doesn't happen on the second DRBD device.

On the peer node the logfile contains

[drbd0_receiver/5137] sock_sendmsg time expired, ko = 5

messages, but I checked network connectivity on the sync interface (crossover, 100 Mbit full duplex, identical NICs) from both sides, and I get around 11.5 MB/s every time I test with iperf.

I also tuned the TCP stack with sysctl using the following parameters:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

I don't know whether these values are right for my setup, but the same behaviour occurs with the Ubuntu 8.04 server defaults.

How can I track down what the problem is with this device? And why is the other device not affected by these network timeouts?

Thanks,
Chris

The log shows:

Aug 15 17:28:51 parastore01 kernel: [ 49.368122] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:51 parastore01 kernel: [ 49.368132] drbd0: Starting worker thread (from cqueue/0 [3899])
Aug 15 17:28:51 parastore01 kernel: [ 49.425995] drbd0: Found 31 transactions (565 active extents) in activity log.
Aug 15 17:28:51 parastore01 kernel: [ 49.426005] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:51 parastore01 kernel: [ 49.426012] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:51 parastore01 kernel: [ 49.428212] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:51 parastore01 kernel: [ 49.428223] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:51 parastore01 kernel: [ 49.506257] drbd0: reading of bitmap took 8 jiffies
Aug 15 17:28:51 parastore01 kernel: [ 49.508871] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:51 parastore01 kernel: [ 49.508878] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:51 parastore01 kernel: [ 49.509167] drbd0: Marked additional 2192 MB as out-of-sync based on AL.
Aug 15 17:28:52 parastore01 kernel: [ 49.717365] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [ 49.717377] drbd0: Writing meta data super block now.
Aug 15 17:28:52 parastore01 kernel: [ 49.876601] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:52 parastore01 kernel: [ 49.876762] drbd0: Starting receiver thread (from drbd0_worker [5090])
Aug 15 17:28:52 parastore01 kernel: [ 49.877852] drbd0: receiver (re)started
Aug 15 17:28:52 parastore01 kernel: [ 49.877864] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:52 parastore01 kernel: [ 49.972310] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:52 parastore01 kernel: [ 50.004672] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:52 parastore01 kernel: [ 50.004684] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:52 parastore01 kernel: [ 50.004690] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:28:52 parastore01 kernel: [ 50.008915] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [ 50.008932] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [ 98.531749] drbd0: meta connection shut down by peer.
Aug 15 17:29:40 parastore01 kernel: [ 98.531848] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:40 parastore01 kernel: [ 98.531865] drbd0: asender terminated
Aug 15 17:29:40 parastore01 kernel: [ 98.531868] drbd0: Terminating asender thread
Aug 15 17:29:40 parastore01 kernel: [ 98.532383] drbd0: role( Secondary -> Primary )
Aug 15 17:29:40 parastore01 kernel: [ 98.532395] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [ 98.533186] drbd0: sock_sendmsg returned -104
Aug 15 17:29:40 parastore01 kernel: [ 98.533251] drbd0: short sent ReportState size=12 sent=0
Aug 15 17:29:40 parastore01 kernel: [ 98.534146] drbd0: tl_clear()
Aug 15 17:29:40 parastore01 kernel: [ 98.534152] drbd0: Connection closed
Aug 15 17:29:40 parastore01 kernel: [ 98.534157] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:40 parastore01 kernel: [ 98.534161] drbd0: receiver terminated
Aug 15 17:29:40 parastore01 kernel: [ 98.534164] drbd0: receiver (re)started
Aug 15 17:29:40 parastore01 kernel: [ 98.534167] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:41 parastore01 kernel: [ 98.830269] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:41 parastore01 kernel: [ 98.862770] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:41 parastore01 kernel: [ 98.862790] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:41 parastore01 kernel: [ 98.862796] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:29:41 parastore01 kernel: [ 98.863560] drbd0: Split-Brain detected, dropping connection!
Aug 15 17:29:41 parastore01 kernel: [ 98.863632] drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [ 98.863636] drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [ 98.863642] drbd0: conn( WFReportParams -> Disconnecting )
Aug 15 17:29:41 parastore01 kernel: [ 98.863648] drbd0: helper command: /sbin/drbdadm split-brain
Aug 15 17:29:41 parastore01 kernel: [ 98.869163] drbd0: error receiving ReportState, l: 4!
Aug 15 17:29:41 parastore01 kernel: [ 98.869395] drbd0: asender terminated
Aug 15 17:29:41 parastore01 kernel: [ 98.869401] drbd0: Terminating asender thread
Aug 15 17:29:41 parastore01 kernel: [ 98.870023] drbd0: tl_clear()
Aug 15 17:29:41 parastore01 kernel: [ 98.870030] drbd0: Connection closed
Aug 15 17:29:41 parastore01 kernel: [ 98.870043] drbd0: conn( Disconnecting -> StandAlone )
Aug 15 17:29:41 parastore01 kernel: [ 98.870049] drbd0: receiver terminated
Aug 15 17:29:41 parastore01 kernel: [ 98.870052] drbd0: Terminating receiver thread

The log of the peer node:

Aug 15 17:28:43 parastore02 kernel: [ 66.035432] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:43 parastore02 kernel: [ 66.035442] drbd0: Starting worker thread (from cqueue/0 [3890])
Aug 15 17:28:43 parastore02 kernel: [ 66.074118] drbd0: Found 6 transactions (6 active extents) in activity log.
Aug 15 17:28:43 parastore02 kernel: [ 66.074127] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:43 parastore02 kernel: [ 66.074134] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:43 parastore02 kernel: [ 66.076351] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:43 parastore02 kernel: [ 66.076362] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:43 parastore02 kernel: [ 66.131997] drbd0: reading of bitmap took 6 jiffies
Aug 15 17:28:43 parastore02 kernel: [ 66.134610] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:43 parastore02 kernel: [ 66.134615] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:43 parastore02 kernel: [ 66.134644] drbd0: Marked additional 24 MB as out-of-sync based on AL.
Aug 15 17:28:43 parastore02 kernel: [ 66.149767] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:43 parastore02 kernel: [ 66.149778] drbd0: Writing meta data super block now.
Aug 15 17:28:43 parastore02 kernel: [ 66.314471] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:43 parastore02 kernel: [ 66.314636] drbd0: Starting receiver thread (from drbd0_worker [5118])
Aug 15 17:28:43 parastore02 kernel: [ 66.315752] drbd0: receiver (re)started
Aug 15 17:28:43 parastore02 kernel: [ 66.315764] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:44 parastore02 kernel: [ 67.017585] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:44 parastore02 kernel: [ 67.018675] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:44 parastore02 kernel: [ 67.018687] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:44 parastore02 kernel: [ 67.018706] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:28:44 parastore02 kernel: [ 67.064460] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:44 parastore02 kernel: [ 67.064476] drbd0: Writing meta data super block now.
Aug 15 17:29:02 parastore02 kernel: [ 85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
Aug 15 17:29:08 parastore02 kernel: [ 91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
Aug 15 17:29:14 parastore02 kernel: [ 97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
Aug 15 17:29:20 parastore02 kernel: [ 103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
Aug 15 17:29:26 parastore02 kernel: [ 109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
Aug 15 17:29:32 parastore02 kernel: [ 115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:32 parastore02 kernel: [ 115.575829] drbd0: short sent ReportBitMap size=4096 sent=3216
Aug 15 17:29:32 parastore02 kernel: [ 115.575925] drbd0: error receiving ReportBitMap, l: 0!
Aug 15 17:29:32 parastore02 kernel: [ 115.576429] drbd0: role( Secondary -> Primary )
Aug 15 17:29:32 parastore02 kernel: [ 115.576441] drbd0: Creating new current UUID
Aug 15 17:29:32 parastore02 kernel: [ 115.576451] drbd0: Writing meta data super block now.
Aug 15 17:29:32 parastore02 kernel: [ 115.576548] drbd0: asender terminated
Aug 15 17:29:32 parastore02 kernel: [ 115.576554] drbd0: Terminating asender thread
Aug 15 17:29:32 parastore02 kernel: [ 115.577220] drbd0: tl_clear()
Aug 15 17:29:32 parastore02 kernel: [ 115.577226] drbd0: Connection closed
Aug 15 17:29:32 parastore02 kernel: [ 115.577233] drbd0: conn( Timeout -> Unconnected )
Aug 15 17:29:32 parastore02 kernel: [ 115.577237] drbd0: receiver terminated
Aug 15 17:29:32 parastore02 kernel: [ 115.577240] drbd0: receiver (re)started
Aug 15 17:29:32 parastore02 kernel: [ 115.577243] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:33 parastore02 kernel: [ 115.875697] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:33 parastore02 kernel: [ 115.876347] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:33 parastore02 kernel: [ 115.876359] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:33 parastore02 kernel: [ 115.876364] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:29:33 parastore02 kernel: [ 115.915155] drbd0: meta connection shut down by peer.
Aug 15 17:29:33 parastore02 kernel: [ 115.915226] drbd0: conn( WFReportParams -> NetworkFailure )
Aug 15 17:29:33 parastore02 kernel: [ 115.915236] drbd0: asender terminated
Aug 15 17:29:33 parastore02 kernel: [ 115.915239] drbd0: Terminating asender thread
Aug 15 17:29:33 parastore02 kernel: [ 115.916116] drbd0: tl_clear()
Aug 15 17:29:33 parastore02 kernel: [ 115.916122] drbd0: Connection closed
Aug 15 17:29:33 parastore02 kernel: [ 115.916130] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:33 parastore02 kernel: [ 115.916134] drbd0: receiver terminated
Aug 15 17:29:33 parastore02 kernel: [ 115.916163] drbd0: receiver (re)started
Aug 15 17:29:33 parastore02 kernel: [ 115.916168] drbd0: conn( Unconnected -> WFConnection )

Config of drbd0:

disk {
    size 0s _is_default; # bytes
    on-io-error detach;
    fencing dont-care _is_default;
}
net {
    timeout 60 _is_default; # 1/10 seconds
    max-epoch-size 2048 _is_default;
    max-buffers 2048 _is_default;
    unplug-watermark 128 _is_default;
    connect-int 10 _is_default; # seconds
    ping-int 10 _is_default; # seconds
    sndbuf-size 131070 _is_default; # bytes
    ko-count 6;
    allow-two-primaries;
    cram-hmac-alg "md5";
    shared-secret "Para2008Store";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect _is_default;
    rr-conflict disconnect _is_default;
    ping-timeout 5 _is_default; # 1/10 seconds
}
syncer {
    rate 5120k; # bytes/second
    after -1 _is_default;
    al-extents 1801;
}
protocol C;
_this_host {
    device "/dev/drbd0";
    disk "/dev/sda4";
    meta-disk internal;
    address 192.168.99.2:7788;
}
_remote_host {
    address 192.168.99.1:7788;
}

--
"The greatest proof that intelligent life other than humans exists in the universe is that none of it has tried to contact us!"
_______________________________________________
drbd-user mailing list
drbd-user@...
http://lists.linbit.com/mailman/listinfo/drbd-user
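A quick way to check whether the "sock_sendmsg time expired" countdown in the parastore02 log matches the configured `timeout 60` (units of 1/10 second, so 6 s) and `ko-count 6` is to measure the gaps between the kernel timestamps. The sketch below is only an illustration: the log excerpt is copied from this message, while the helper names `kernel_timestamps` and `gaps_between` are made up for the example.

```python
import re

# Kernel log excerpt copied from the parastore02 log in this message.
LOG = """\
[ 85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
[ 91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
[ 97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
[ 103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
[ 109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
[ 115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
"""

# "timeout 60" in the net section is given in tenths of a second.
TIMEOUT_S = 60 / 10


def kernel_timestamps(log):
    """Return the seconds-since-boot stamp of every log line (hypothetical helper)."""
    return [float(m.group(1)) for m in re.finditer(r"^\[\s*(\d+\.\d+)\]", log, re.M)]


def gaps_between(stamps):
    """Return the time difference between each pair of consecutive stamps."""
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


stamps = kernel_timestamps(LOG)
gaps = gaps_between(stamps)
for gap in gaps:
    print(f"gap {gap:.3f} s (configured timeout {TIMEOUT_S:.1f} s)")
```

Each gap comes out at just under 6 s, so the peer appears to have been unable to send anything for six consecutive timeout periods before it declared the connection dead. That is consistent with a stalled bitmap exchange rather than a slow link, but it does not by itself say why the send stalled.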

Re: Always get Split brain after reboot of both nodes

On Fri, Aug 15, 2008 at 06:09:28PM +0200, Chris Joelly wrote:
> Hello,
>
> i always get a Split Brain situation on one drbd device after a reboot
> of both nodes is done.

What is your DRBD version?

--
: Lars Ellenberg
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list -- I'm subscribed

Re: Always get Split brain after reboot of both nodes

Hello,

The versions are:

paraadm@parastore02:~$ dpkg -l | grep drbd
ii drbd8-utils 2:8.0.11-0ubuntu3 RAID 1 over tcp/ip for Linux utilities

paraadm@parastore02:~$ modinfo drbd
filename: /lib/modules/2.6.24-19-server/ubuntu/block/drbd/drbd.ko
alias: block-major-147-*
license: GPL
description: drbd - Distributed Replicated Block Device v8.0.11
author: Philipp Reisner <phil@...>, Lars Ellenberg <lars@...>
srcversion: 619CD61F5E18B8116CEF6DB
depends: cn
vermagic: 2.6.24-19-server SMP mod_unload 686
parm: minor_count:Maximum number of drbd devices (1-255) (int)
parm: allow_oos:DONT USE! (bool)
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: trace_level:int
parm: trace_type:int
parm: trace_devs:int
parm: usermode_helper:string

Thanks,
Chris

On Sat, Aug 16, 2008 at 05:05:57PM +0200, Lars Ellenberg wrote:
> On Fri, Aug 15, 2008 at 06:09:28PM +0200, Chris Joelly wrote:
> > Hello,
> >
> > i always get a Split Brain situation on one drbd device after a reboot
> > of both nodes is done.
>
> what is your DRBD version?

--
Only the small mind keeps order; the genius masters chaos. ;-)

Re: Always get Split brain after reboot of both nodes

On Fri, Aug 22, 2008 at 10:11:15PM +0200, Chris Joelly wrote:
> Hello,
>
> versions are:
>
> paraadm@parastore02:~$ dpkg -l | grep drbd
> ii drbd8-utils 2:8.0.11-0ubuntu3 RAID 1 over tcp/ip for Linux utilities

From the ChangeLog of drbd-8.0.13:

8.0.13 (api:86/proto:86)
--------
...
* Fixed a possible deadlock in case "become-primary-on-both" is used,
  and a resync starts
...

Maybe you are a victim of that one?

--
: Lars Ellenberg
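
To make the version argument explicit: the module reported v8.0.11, and the ChangeLog entry quoted in this reply belongs to 8.0.13. A minimal sketch of that comparison follows; `parse_version` is a hypothetical helper written for this example, not a DRBD tool, and it assumes plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted version such as '8.0.11' into a tuple of ints (hypothetical helper)."""
    return tuple(int(part) for part in text.split("."))


installed = parse_version("8.0.11")  # from the modinfo output earlier in the thread
fixed_in = parse_version("8.0.13")   # release whose ChangeLog mentions the deadlock fix

# Tuples compare element by element, so 11 < 13 is compared numerically,
# which a plain string comparison would not guarantee for versions like 8.0.9.
needs_upgrade = installed < fixed_in
print("installed release predates the fix:", needs_upgrade)
```

This only shows that the installed release is older than the one carrying the fix; whether that deadlock is really what happens on these nodes is still the open question in the thread.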

Re: Always get Split brain after reboot of both nodes

On Sat, Aug 23, 2008 at 01:43:47AM +0200, Lars Ellenberg wrote:
> On Fri, Aug 22, 2008 at 10:11:15PM +0200, Chris Joelly wrote:
> > Hello,
> >
> > versions are:
> >
> > paraadm@parastore02:~$ dpkg -l | grep drbd
> > ii drbd8-utils 2:8.0.11-0ubuntu3 RAID 1 over tcp/ip for Linux utilities
>
> from the ChangeLog of drbd-8.0.13
> 8.0.13 (api:86/proto:86)
> --------
> ...
> * Fixed a possible deadlock in case "become-primary-on-both" is used,
> and a resync starts
> ...
>
> maybe you are a victime of that one?

Yes, that sounds reasonable. But I wonder why a resync is needed at all after rebooting both nodes, since I assume both sides are in sync, at least at the moment I issue the reboot command on both nodes, and there is no filesystem mounted on the resource ...

--
Only the small mind keeps order; the genius masters chaos. ;-)
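
One possible reading of the attach messages earlier in the thread bears on this question: both nodes found active extents in their activity log (AL) and marked the corresponding areas out-of-sync, which is what makes a resync start. The arithmetic below is a rough sketch under one stated assumption, namely that each AL extent covers 4 MiB; the extent counts, the "Marked additional ... MB" figures, and the syncer rate are taken from the logs and config in the first message, and the function names are made up for the example.

```python
AL_EXTENT_MIB = 4  # assumption: one activity-log extent covers 4 MiB


def al_upper_bound_mib(active_extents):
    """Largest area the AL could mark out-of-sync for this many extents (hypothetical helper)."""
    return active_extents * AL_EXTENT_MIB


def resync_seconds(out_of_sync_mib, rate_kib_per_s):
    """Rough resync duration at a fixed syncer rate, ignoring all protocol overhead."""
    return out_of_sync_mib / (rate_kib_per_s / 1024)


# (active extents, MB reported as marked out-of-sync), both from the attach log lines.
nodes = {"parastore01": (565, 2192), "parastore02": (6, 24)}

for name, (extents, marked_mb) in nodes.items():
    bound = al_upper_bound_mib(extents)
    print(f"{name}: {extents} extents, at most {bound} MiB, log reports {marked_mb} MB")

# "rate 5120k" in the syncer section is read here as 5120 KiB/s.
estimate = resync_seconds(2192, 5120)
print(f"resync of 2192 MiB at 5120 KiB/s: about {estimate / 60:.1f} minutes")
```

For parastore02 the bound matches the logged 24 MB exactly; for parastore01 the logged 2192 MB is somewhat below the 565-extent bound. Why the activity log still held active extents after what was meant to be a clean reboot is not answered by this arithmetic.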