Always get Split brain after reboot of both nodes

by Chris Joelly-4

Hello,

I always get a split-brain situation on one DRBD device after both
nodes are rebooted. I'm wondering why this doesn't happen on the
second DRBD device?

On the peer node there are

[drbd0_receiver/5137] sock_sendmsg time expired, ko = 5

messages in the logfile, but I checked network connectivity on the sync
interface (crossover, 100 Mbit full duplex, identical NICs) from both
sides, and I get around 11.5 MB/s every time I try with iperf.
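
As a quick sanity check on that iperf figure, here is a minimal sketch. It assumes "100mbit" means 100 x 10^6 bit/s and that the iperf result is in megabytes per second; the function names are only illustrative.

```python
LINK_MBIT = 100        # nominal link speed from the post (100 Mbit/s)
MEASURED_MB_S = 11.5   # iperf result from the post, assumed to be MB/s


def line_rate_mb_s(mbit: float) -> float:
    """Convert a nominal link speed in Mbit/s to MB/s (8 bits per byte)."""
    return mbit / 8


def utilisation(measured_mb_s: float, mbit: float) -> float:
    """Fraction of the nominal line rate that the measurement reaches."""
    return measured_mb_s / line_rate_mb_s(mbit)


if __name__ == "__main__":
    print(f"line rate:   {line_rate_mb_s(LINK_MBIT):.1f} MB/s")
    print(f"utilisation: {utilisation(MEASURED_MB_S, LINK_MBIT):.0%}")
```

Under those assumptions the link is running close to its nominal rate, so raw bandwidth alone does not look like the bottleneck.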

I also tuned the TCP stack with sysctl, using the following parameters:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

I don't know whether these values are right for my setup, but the same
behaviour occurs with the Ubuntu 8.04 server defaults ...

How can I track down what the problem is with this device? And why is
the other device not affected by these network timeouts?

thx,

Chris

the log shows:

Aug 15 17:28:51 parastore01 kernel: [   49.368122] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:51 parastore01 kernel: [   49.368132] drbd0: Starting worker thread (from cqueue/0 [3899])
Aug 15 17:28:51 parastore01 kernel: [   49.425995] drbd0: Found 31 transactions (565 active extents) in activity log.
Aug 15 17:28:51 parastore01 kernel: [   49.426005] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:51 parastore01 kernel: [   49.426012] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:51 parastore01 kernel: [   49.428212] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:51 parastore01 kernel: [   49.428223] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:51 parastore01 kernel: [   49.506257] drbd0: reading of bitmap took 8 jiffies
Aug 15 17:28:51 parastore01 kernel: [   49.508871] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:51 parastore01 kernel: [   49.508878] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:51 parastore01 kernel: [   49.509167] drbd0: Marked additional 2192 MB as out-of-sync based on AL.
Aug 15 17:28:52 parastore01 kernel: [   49.717365] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [   49.717377] drbd0: Writing meta data super block now.
Aug 15 17:28:52 parastore01 kernel: [   49.876601] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:52 parastore01 kernel: [   49.876762] drbd0: Starting receiver thread (from drbd0_worker [5090])
Aug 15 17:28:52 parastore01 kernel: [   49.877852] drbd0: receiver (re)started
Aug 15 17:28:52 parastore01 kernel: [   49.877864] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:52 parastore01 kernel: [   49.972310] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:52 parastore01 kernel: [   50.004672] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:52 parastore01 kernel: [   50.004684] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:52 parastore01 kernel: [   50.004690] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:28:52 parastore01 kernel: [   50.008915] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:52 parastore01 kernel: [   50.008932] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [   98.531749] drbd0: meta connection shut down by peer.
Aug 15 17:29:40 parastore01 kernel: [   98.531848] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:40 parastore01 kernel: [   98.531865] drbd0: asender terminated
Aug 15 17:29:40 parastore01 kernel: [   98.531868] drbd0: Terminating asender thread
Aug 15 17:29:40 parastore01 kernel: [   98.532383] drbd0: role( Secondary -> Primary )
Aug 15 17:29:40 parastore01 kernel: [   98.532395] drbd0: Writing meta data super block now.
Aug 15 17:29:40 parastore01 kernel: [   98.533186] drbd0: sock_sendmsg returned -104
Aug 15 17:29:40 parastore01 kernel: [   98.533251] drbd0: short sent ReportState size=12 sent=0
Aug 15 17:29:40 parastore01 kernel: [   98.534146] drbd0: tl_clear()
Aug 15 17:29:40 parastore01 kernel: [   98.534152] drbd0: Connection closed
Aug 15 17:29:40 parastore01 kernel: [   98.534157] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:40 parastore01 kernel: [   98.534161] drbd0: receiver terminated
Aug 15 17:29:40 parastore01 kernel: [   98.534164] drbd0: receiver (re)started
Aug 15 17:29:40 parastore01 kernel: [   98.534167] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:41 parastore01 kernel: [   98.830269] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:41 parastore01 kernel: [   98.862770] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:41 parastore01 kernel: [   98.862790] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:41 parastore01 kernel: [   98.862796] drbd0: Starting asender thread (from drbd0_receiver [5138])
Aug 15 17:29:41 parastore01 kernel: [   98.863560] drbd0: Split-Brain detected, dropping connection!
Aug 15 17:29:41 parastore01 kernel: [   98.863632] drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [   98.863636] drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
Aug 15 17:29:41 parastore01 kernel: [   98.863642] drbd0: conn( WFReportParams -> Disconnecting )
Aug 15 17:29:41 parastore01 kernel: [   98.863648] drbd0: helper command: /sbin/drbdadm split-brain
Aug 15 17:29:41 parastore01 kernel: [   98.869163] drbd0: error receiving ReportState, l: 4!
Aug 15 17:29:41 parastore01 kernel: [   98.869395] drbd0: asender terminated
Aug 15 17:29:41 parastore01 kernel: [   98.869401] drbd0: Terminating asender thread
Aug 15 17:29:41 parastore01 kernel: [   98.870023] drbd0: tl_clear()
Aug 15 17:29:41 parastore01 kernel: [   98.870030] drbd0: Connection closed
Aug 15 17:29:41 parastore01 kernel: [   98.870043] drbd0: conn( Disconnecting -> StandAlone )
Aug 15 17:29:41 parastore01 kernel: [   98.870049] drbd0: receiver terminated
Aug 15 17:29:41 parastore01 kernel: [   98.870052] drbd0: Terminating receiver thread
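
To read the two UUID lines in the log above, a small sketch can help. It is a simplified illustration, not DRBD's actual decision procedure. The assumptions are that the four fields are current, bitmap and two history UUIDs in that order, and that the lowest bit of each value is a flag to be ignored when comparing.

```python
import re

# The two UUID lines from the parastore01 log above (syslog prefix dropped).
LOG = """\
drbd0: self D86BD7893327D85B:5F076D68071F86E5:A7706C3E4205FA52:3E7B3E38C51EC4FF
drbd0: peer 60C952BD7240404F:5F076D68071F86E4:A7706C3E4205FA53:3E7B3E38C51EC4FF
"""

UUID_LINE = re.compile(
    r"drbd\d+: (self|peer) ((?:[0-9A-F]{16}:){3}[0-9A-F]{16})"
)
# Assumed field order; the names are illustrative.
FIELDS = ("current", "bitmap", "history1", "history2")


def parse_uuids(log: str) -> dict:
    """Return {'self': {...}, 'peer': {...}} with the low flag bit masked off."""
    result = {}
    for match in UUID_LINE.finditer(log):
        values = [int(part, 16) & ~1 for part in match.group(2).split(":")]
        result[match.group(1)] = dict(zip(FIELDS, values))
    return result


def diverged_from_common_ancestor(uuids: dict) -> bool:
    """True if the current UUIDs differ while the bitmap UUIDs match."""
    mine, theirs = uuids["self"], uuids["peer"]
    return (mine["current"] != theirs["current"]
            and mine["bitmap"] == theirs["bitmap"] != 0)


if __name__ == "__main__":
    print(diverged_from_common_ancestor(parse_uuids(LOG)))
```

With the values from this log, both nodes have a different current UUID but share the same older UUIDs, which is consistent with each side having become Primary on its own while disconnected.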

the log of the peer node:

Aug 15 17:28:43 parastore02 kernel: [   66.035432] drbd0: disk( Diskless -> Attaching )
Aug 15 17:28:43 parastore02 kernel: [   66.035442] drbd0: Starting worker thread (from cqueue/0 [3890])
Aug 15 17:28:43 parastore02 kernel: [   66.074118] drbd0: Found 6 transactions (6 active extents) in activity log.
Aug 15 17:28:43 parastore02 kernel: [   66.074127] drbd0: max_segment_size ( = BIO size ) = 32768
Aug 15 17:28:43 parastore02 kernel: [   66.074134] drbd0: drbd_bm_resize called with capacity == 95551624
Aug 15 17:28:43 parastore02 kernel: [   66.076351] drbd0: resync bitmap: bits=11943953 words=373250
Aug 15 17:28:43 parastore02 kernel: [   66.076362] drbd0: size = 45 GB (47775812 KB)
Aug 15 17:28:43 parastore02 kernel: [   66.131997] drbd0: reading of bitmap took 6 jiffies
Aug 15 17:28:43 parastore02 kernel: [   66.134610] drbd0: recounting of set bits took additional 0 jiffies
Aug 15 17:28:43 parastore02 kernel: [   66.134615] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Aug 15 17:28:43 parastore02 kernel: [   66.134644] drbd0: Marked additional 24 MB as out-of-sync based on AL.
Aug 15 17:28:43 parastore02 kernel: [   66.149767] drbd0: disk( Attaching -> UpToDate )
Aug 15 17:28:43 parastore02 kernel: [   66.149778] drbd0: Writing meta data super block now.
Aug 15 17:28:43 parastore02 kernel: [   66.314471] drbd0: conn( StandAlone -> Unconnected )
Aug 15 17:28:43 parastore02 kernel: [   66.314636] drbd0: Starting receiver thread (from drbd0_worker [5118])
Aug 15 17:28:43 parastore02 kernel: [   66.315752] drbd0: receiver (re)started
Aug 15 17:28:43 parastore02 kernel: [   66.315764] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:28:44 parastore02 kernel: [   67.017585] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:28:44 parastore02 kernel: [   67.018675] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:28:44 parastore02 kernel: [   67.018687] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:28:44 parastore02 kernel: [   67.018706] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:28:44 parastore02 kernel: [   67.064460] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 15 17:28:44 parastore02 kernel: [   67.064476] drbd0: Writing meta data super block now.
Aug 15 17:29:02 parastore02 kernel: [   85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
Aug 15 17:29:08 parastore02 kernel: [   91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
Aug 15 17:29:14 parastore02 kernel: [   97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
Aug 15 17:29:20 parastore02 kernel: [  103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
Aug 15 17:29:26 parastore02 kernel: [  109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
Aug 15 17:29:32 parastore02 kernel: [  115.575814] drbd0: peer( Secondary -> Unknown ) conn( WFBitMapT -> Timeout ) pdsk( UpToDate -> DUnknown )
Aug 15 17:29:32 parastore02 kernel: [  115.575829] drbd0: short sent ReportBitMap size=4096 sent=3216
Aug 15 17:29:32 parastore02 kernel: [  115.575925] drbd0: error receiving ReportBitMap, l: 0!
Aug 15 17:29:32 parastore02 kernel: [  115.576429] drbd0: role( Secondary -> Primary )
Aug 15 17:29:32 parastore02 kernel: [  115.576441] drbd0: Creating new current UUID
Aug 15 17:29:32 parastore02 kernel: [  115.576451] drbd0: Writing meta data super block now.
Aug 15 17:29:32 parastore02 kernel: [  115.576548] drbd0: asender terminated
Aug 15 17:29:32 parastore02 kernel: [  115.576554] drbd0: Terminating asender thread
Aug 15 17:29:32 parastore02 kernel: [  115.577220] drbd0: tl_clear()
Aug 15 17:29:32 parastore02 kernel: [  115.577226] drbd0: Connection closed
Aug 15 17:29:32 parastore02 kernel: [  115.577233] drbd0: conn( Timeout -> Unconnected )
Aug 15 17:29:32 parastore02 kernel: [  115.577237] drbd0: receiver terminated
Aug 15 17:29:32 parastore02 kernel: [  115.577240] drbd0: receiver (re)started
Aug 15 17:29:32 parastore02 kernel: [  115.577243] drbd0: conn( Unconnected -> WFConnection )
Aug 15 17:29:33 parastore02 kernel: [  115.875697] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 15 17:29:33 parastore02 kernel: [  115.876347] drbd0: Peer authenticated using 16 bytes of 'md5' HMAC
Aug 15 17:29:33 parastore02 kernel: [  115.876359] drbd0: conn( WFConnection -> WFReportParams )
Aug 15 17:29:33 parastore02 kernel: [  115.876364] drbd0: Starting asender thread (from drbd0_receiver [5137])
Aug 15 17:29:33 parastore02 kernel: [  115.915155] drbd0: meta connection shut down by peer.
Aug 15 17:29:33 parastore02 kernel: [  115.915226] drbd0: conn( WFReportParams -> NetworkFailure )
Aug 15 17:29:33 parastore02 kernel: [  115.915236] drbd0: asender terminated
Aug 15 17:29:33 parastore02 kernel: [  115.915239] drbd0: Terminating asender thread
Aug 15 17:29:33 parastore02 kernel: [  115.916116] drbd0: tl_clear()
Aug 15 17:29:33 parastore02 kernel: [  115.916122] drbd0: Connection closed
Aug 15 17:29:33 parastore02 kernel: [  115.916130] drbd0: conn( NetworkFailure -> Unconnected )
Aug 15 17:29:33 parastore02 kernel: [  115.916134] drbd0: receiver terminated
Aug 15 17:29:33 parastore02 kernel: [  115.916163] drbd0: receiver (re)started
Aug 15 17:29:33 parastore02 kernel: [  115.916168] drbd0: conn( Unconnected -> WFConnection )
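
The spacing of the "time expired" messages in the peer log can be checked against the configured timeout. This is a minimal sketch that only parses the kernel timestamps quoted above; the value 60 is the `timeout` setting from the posted config (in 1/10 seconds), and the helper names are illustrative.

```python
import re

# Excerpt of the parastore02 kernel log above (syslog prefix dropped).
PEER_LOG = """\
[   85.589243] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 5
[   91.586558] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 4
[   97.583871] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 3
[  103.581185] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 2
[  109.578499] drbd0: [drbd0_receiver/5137] sock_sendmsg time expired, ko = 1
"""

TIMEOUT_DECISECONDS = 60  # "timeout 60" from the posted net section

KO_LINE = re.compile(
    r"\[\s*(\d+\.\d+)\] drbd\d+: \[\S+\] sock_sendmsg time expired, ko = (\d+)"
)


def ko_events(log: str) -> list:
    """Return (kernel_timestamp_seconds, ko_counter) for each matching line."""
    return [(float(t), int(ko)) for t, ko in KO_LINE.findall(log)]


def intervals(events: list) -> list:
    """Seconds between consecutive events."""
    times = [t for t, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    events = ko_events(PEER_LOG)
    print("configured timeout:", TIMEOUT_DECISECONDS / 10, "s")
    for gap in intervals(events):
        print(f"gap between messages: {gap:.3f} s")
```

The gaps come out at roughly six seconds each, matching the configured timeout, and the counter runs down from 5 to 1 before the connection state changes to Timeout.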

config of drbd0:

disk {
        size             0s _is_default; # bytes
        on-io-error     detach;
        fencing         dont-care _is_default;
}
net {
        timeout         60 _is_default; # 1/10 seconds
        max-epoch-size   2048 _is_default;
        max-buffers     2048 _is_default;
        unplug-watermark 128 _is_default;
        connect-int     10 _is_default; # seconds
        ping-int         10 _is_default; # seconds
        sndbuf-size     131070 _is_default; # bytes
        ko-count         6;
        allow-two-primaries;
        cram-hmac-alg   "md5";
        shared-secret   "Para2008Store";
        after-sb-0pri   discard-zero-changes;
        after-sb-1pri   discard-secondary;
        after-sb-2pri   disconnect _is_default;
        rr-conflict     disconnect _is_default;
        ping-timeout     5 _is_default; # 1/10 seconds
}
syncer {
        rate             5120k; # bytes/second
        after           -1 _is_default;
        al-extents       1801;
}
protocol C;
_this_host {
        device "/dev/drbd0";
        disk "/dev/sda4";
        meta-disk internal;
        address 192.168.99.2:7788;
}
_remote_host {
        address 192.168.99.1:7788;
}
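
For a rough feel of what the `syncer` rate above means in practice, here is a back-of-the-envelope sketch. It assumes `5120k` means 5120 KiB per second and that a resync runs at exactly that rate; the device size and the 2192 MB figure are taken from the parastore01 log.

```python
SYNC_RATE_KIB_S = 5120          # "rate 5120k" from the syncer section (assumed KiB/s)
DEVICE_SIZE_KIB = 47775812      # "size = 45 GB (47775812 KB)" from the log
AL_OUT_OF_SYNC_MIB = 2192       # "Marked additional 2192 MB as out-of-sync"


def seconds_to_sync(amount_kib: float, rate_kib_s: float) -> float:
    """Time to transfer amount_kib at a constant rate_kib_s."""
    return amount_kib / rate_kib_s


if __name__ == "__main__":
    full = seconds_to_sync(DEVICE_SIZE_KIB, SYNC_RATE_KIB_S)
    partial = seconds_to_sync(AL_OUT_OF_SYNC_MIB * 1024, SYNC_RATE_KIB_S)
    print(f"full device:      {full / 3600:.2f} h")
    print(f"AL-marked blocks: {partial / 60:.1f} min")
```

This only estimates transfer time at the configured rate; it says nothing about the bitmap exchange during which the timeouts in the log occur.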


--
"The greatest proof that intelligent life other than humans exists in
 the universe is that none of it has tried to contact us!"

_______________________________________________
drbd-user mailing list
drbd-user@...
http://lists.linbit.com/mailman/listinfo/drbd-user

Re: Always get Split brain after reboot of both nodes

by Lars Ellenberg

On Fri, Aug 15, 2008 at 06:09:28PM +0200, Chris Joelly wrote:
> Hello,
>
> i always get a Split Brain situation on one drbd device after a reboot
> of both nodes is done.

what is your DRBD version?

--
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed

Re: Always get Split brain after reboot of both nodes

by Chris Joelly-4

Hello,

versions are:

paraadm@parastore02:~$ dpkg -l | grep drbd
ii  drbd8-utils 2:8.0.11-0ubuntu3 RAID 1 over tcp/ip for Linux utilities
paraadm@parastore02:~$ modinfo drbd
filename:       /lib/modules/2.6.24-19-server/ubuntu/block/drbd/drbd.ko
alias:          block-major-147-*
license:        GPL
description:    drbd - Distributed Replicated Block Device v8.0.11
author:         Philipp Reisner <phil@...>, Lars Ellenberg <lars@...>
srcversion:     619CD61F5E18B8116CEF6DB
depends:        cn
vermagic:       2.6.24-19-server SMP mod_unload 686
parm:           minor_count:Maximum number of drbd devices (1-255) (int)
parm:           allow_oos:DONT USE! (bool)
parm:           enable_faults:int
parm:           fault_rate:int
parm:           fault_count:int
parm:           fault_devs:int
parm:           trace_level:int
parm:           trace_type:int
parm:           trace_devs:int
parm:           usermode_helper:string

thanks, chris

On Sat, Aug 16, 2008 at 05:05:57PM +0200, Lars Ellenberg wrote:

> On Fri, Aug 15, 2008 at 06:09:28PM +0200, Chris Joelly wrote:
> > Hello,
> >
> > i always get a Split Brain situation on one drbd device after a reboot
> > of both nodes is done.
>
> what is your DRBD version?

--
Only the small-minded keep order; the genius masters chaos. ;-)

Re: Always get Split brain after reboot of both nodes

by Lars Ellenberg

On Fri, Aug 22, 2008 at 10:11:15PM +0200, Chris Joelly wrote:
> Hello,
>
> versions are:
>
> paraadm@parastore02:~$ dpkg -l | grep drbd
> ii  drbd8-utils 2:8.0.11-0ubuntu3 RAID 1 over tcp/ip for Linux utilities

from the ChangeLog of drbd-8.0.13
8.0.13 (api:86/proto:86)
--------
  ...
 * Fixed a possible deadlock in case "become-primary-on-both" is used,
   and a resync starts
  ...

maybe you are a victim of that one?
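
Whether the installed package predates that fix can be checked with a small comparison. This is only a sketch: it assumes the Debian version layout `[epoch:]upstream[-revision]` and plain dotted numeric upstream versions, and the helper names are illustrative.

```python
INSTALLED_PACKAGE = "2:8.0.11-0ubuntu3"  # from the dpkg output in the thread
FIXED_IN = "8.0.13"                      # release named in the quoted ChangeLog


def upstream_version(debian_version: str) -> str:
    """Strip an optional epoch prefix and Debian revision suffix."""
    without_epoch = debian_version.split(":", 1)[-1]
    return without_epoch.rsplit("-", 1)[0]


def parse_version(text: str) -> tuple:
    """Turn '8.0.11' into (8, 0, 11) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


if __name__ == "__main__":
    installed = parse_version(upstream_version(INSTALLED_PACKAGE))
    print("predates fix:", installed < parse_version(FIXED_IN))
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would order "8.0.9" after "8.0.13".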


Re: Always get Split brain after reboot of both nodes

by Chris Joelly-4

On Sat, Aug 23, 2008 at 01:43:47AM +0200, Lars Ellenberg wrote:

>
> On Fri, Aug 22, 2008 at 10:11:15PM +0200, Chris Joelly wrote:
> > Hello,
> >
> > versions are:
> >
> > paraadm@parastore02:~$ dpkg -l | grep drbd
> > ii  drbd8-utils 2:8.0.11-0ubuntu3 RAID 1 over tcp/ip for Linux utilities
>
> from the ChangeLog of drbd-8.0.13
> 8.0.13 (api:86/proto:86)
> --------
>   ...
>  * Fixed a possible deadlock in case "become-primary-on-both" is used,
>    and a resync starts
>   ...
>
> maybe you are a victim of that one?
>

Yes, that sounds reasonable. But I'm wondering why a resync is needed
after rebooting both nodes, as I assume that both sides are in sync, at
least at the moment I issue the reboot command on both nodes, and there
is no filesystem mounted on the resource ...
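
One possible reading of the logs posted earlier in the thread: at attach time each node reports active extents in its activity log (AL) and then marks data as out-of-sync based on them, which would give DRBD something to resync. The sketch below only checks the arithmetic. It assumes 4 MiB per AL extent; the node names and numbers are copied from the logs.

```python
AL_EXTENT_MIB = 4  # assumption: size of one activity-log extent

# (active extents found, MB marked out-of-sync) per node, from the logs
NODES = {
    "parastore01": (565, 2192),
    "parastore02": (6, 24),
}


def upper_bound_mib(active_extents: int) -> int:
    """Most data the given number of AL extents could cover."""
    return active_extents * AL_EXTENT_MIB


if __name__ == "__main__":
    for name, (extents, marked_mib) in NODES.items():
        bound = upper_bound_mib(extents)
        print(f"{name}: {extents} extents cover up to {bound} MiB, "
              f"log reports {marked_mib} MB marked")
```

For parastore02 the bound equals the logged figure; for parastore01 the logged figure is somewhat below the bound, which this sketch does not try to explain.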

--
Only the small-minded keep order; the genius masters chaos. ;-)