Details
Type: Bug
Status: Closed
Priority: High
Resolution: Cannot Reproduce
Affects Version/s: Dublin Release
Fix Version/s: None
Sprint: SDNC Fr Sp4:1/6-1/24
Description
MariaDB Galera pods fail with connection timeout errors and repeatedly go into CrashLoopBackOff.
Interestingly, if the data0, data1 and data2 directories are deleted from the /dockerdata-nfs/rel-mariadb-galera folder, the MariaDB cluster pods come up successfully. However, after a few hours they go into the error state again.
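At startup mysqld reports "Found saved state: 5b27b8a6-a77d-11e9-a00b-26960ebe383d:-1, safe_to_bootsrap: 0" (see the log below). Galera reads that saved state from grastate.dat in its data directory (base_dir = /var/lib/mysql/ in the log). Assuming each of data0, data1 and data2 is the persisted /var/lib/mysql of one pod, deleting them also deletes grastate.dat, which may explain why a fresh bootstrap then succeeds. Before deleting anything, comparing the three files can show which node, if any, is a reasonable bootstrap source. The Python below is a minimal sketch under those assumptions: parse_grastate and pick_bootstrap_candidate are hypothetical helpers, the "version" line in the sample is illustrative, and the selection order (safe_to_bootstrap: 1 first, then the highest seqno, otherwise recover the position with mysqld --wsrep-recover) follows general Galera recovery practice rather than anything in the ONAP charts.

```python
from typing import Dict, Optional

# Shape of a grastate.dat file. uuid/seqno/safe_to_bootstrap are the values this
# node reported at startup; the "version" line is illustrative (assumption).
SAMPLE_GRASTATE = """\
# GALERA saved state
version: 2.1
uuid:    5b27b8a6-a77d-11e9-a00b-26960ebe383d
seqno:   -1
safe_to_bootstrap: 0
"""


def parse_grastate(text: str) -> Dict[str, object]:
    """Parse the 'key: value' lines of a Galera grastate.dat file (hypothetical helper)."""
    state: Dict[str, object] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# GALERA saved state" header
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    # seqno -1 means the node did not record its last committed position cleanly.
    state["seqno"] = int(state.get("seqno", -1))
    state["safe_to_bootstrap"] = int(state.get("safe_to_bootstrap", 0))
    return state


def pick_bootstrap_candidate(states: Dict[str, Dict[str, object]]) -> Optional[str]:
    """Return the node that looks safest to bootstrap from, or None (hypothetical helper)."""
    # 1. A node explicitly marked safe_to_bootstrap: 1.
    for node in sorted(states):
        if states[node]["safe_to_bootstrap"] == 1:
            return node
    # 2. Otherwise the node with the highest recorded seqno.
    known = {node: st["seqno"] for node, st in states.items() if st["seqno"] >= 0}
    if known:
        return max(known, key=known.get)
    # 3. No usable position anywhere: it must be recovered first (e.g. mysqld --wsrep-recover).
    return None


if __name__ == "__main__":
    state = parse_grastate(SAMPLE_GRASTATE)
    print(state["uuid"], state["seqno"], state["safe_to_bootstrap"])
    print(pick_bootstrap_candidate({"dev-mariadb-galera-mariadb-galera-0": state}))
```

With only the values shown in this node's log (seqno -1, safe_to_bootstrap 0), the sketch returns None, meaning the position would have to be recovered before choosing a node to bootstrap from.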
The SO pods also do not come up because the MariaDB cluster is in an error state.
See the logs below from one of the pods. Similar errors are present in the other pods' logs.
+ CONTAINER_SCRIPTS_DIR=/usr/share/container-scripts/mysql
+ EXTRA_DEFAULTS_FILE=/etc/my.cnf.d/galera.cnf
+ '[' -z onap ']'
+ echo 'Galera: Finding peers'
Galera: Finding peers
++ hostname -f
++ cut -d. -f2
+ K8S_SVC_NAME=mariadb-galera
+ echo 'Using service name: mariadb-galera'
+ cp /usr/share/container-scripts/mysql/galera.cnf /etc/my.cnf.d/galera.cnf
Using service name: mariadb-galera
+ /usr/bin/peer-finder -on-start=/usr/share/container-scripts/mysql/configure-galera.sh -service=mariadb-galera
2019/07/17 03:35:22 Peer list updated was [] now [dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local]
2019/07/17 03:35:22 execing: /usr/share/container-scripts/mysql/configure-galera.sh with stdin: dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local
2019/07/17 03:35:22
2019/07/17 03:35:23 Peer finder exiting
+ '[' '!' -d /var/lib/mysql/mysql ']'
+ exec mysqld
2019-07-17 3:35:23 140449607362816 [Note] mysqld (mysqld 10.1.24-MariaDB) starting as process 1 ...
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Read nil XID from storage engines, skipping position init
2019-07-17 3:35:23 140449607362816 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2019-07-17 3:35:23 140449607362816 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy <info@codership.com> loaded successfully.
2019-07-17 3:35:23 140449607362816 [Note] WSREP: CRC-32C: using hardware acceleration.
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Found saved state: 5b27b8a6-a77d-11e9-a00b-26960ebe383d:-1, safe_to_bootsrap: 0
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.versi
2019-07-17 3:35:23 140449607362816 [Note] WSREP: GCache history reset: old(5b27b8a6-a77d-11e9-a00b-26960ebe383d:0) -> new(5b27b8a6-a77d-11e9-a00b-26960ebe383d:-1)
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2019-07-17 3:35:23 140449607362816 [Note] WSREP: wsrep_sst_grab()
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Start replication
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2019-07-17 3:35:23 140449607362816 [Note] WSREP: protonet asio version 0
2019-07-17 3:35:23 140449607362816 [Note] WSREP: Using CRC-32C for message checksums.
2019-07-17 3:35:23 140449607362816 [Note] WSREP: backend: asio
2019-07-17 3:35:23 140449607362816 [Note] WSREP: gcomm thread scheduling priority set to other:0
2019-07-17 3:35:23 140449607362816 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2019-07-17 3:35:23 140449607362816 [Note] WSREP: restore pc from disk failed
2019-07-17 3:35:23 140449607362816 [Note] WSREP: GMCast version 0
2019-07-17 3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2019-07-17 3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2019-07-17 3:35:23 140449607362816 [Note] WSREP: EVS version 0
2019-07-17 3:35:23 140449607362816 [Note] WSREP: gcomm: connecting to group 'mariadb-galera', peer 'dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local:,dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local:,dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local:'
2019-07-17 3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection established to e8fea524 tcp://10.42.6.57:4567
2019-07-17 3:35:23 140449607362816 [Warning] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') address 'tcp://10.42.6.57:4567' points to own listening address, blacklisting
2019-07-17 3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection established to d87d55d6 tcp://10.42.5.67:4567
2019-07-17 3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2019-07-17 3:35:24 140449607362816 [Note] WSREP: declaring d87d55d6 at tcp://10.42.5.67:4567 stable
2019-07-17 3:35:24 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
2019-07-17 3:35:24 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,d87d55d6,2) memb { d87d55d6,0 e8fea524,0 } joined { } left { } partitioned { })
2019-07-17 3:35:26 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection to peer e8fea524 with addr tcp://10.42.6.57:4567 timed out, no messages seen in PT3S
2019-07-17 3:35:27 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting off
2019-07-17 3:35:29 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.42.5.67:4567
2019-07-17 3:35:30 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') reconnecting to d87d55d6 (tcp://10.42.5.67:4567), attempt 0
2019-07-17 3:35:33 140449607362816 [Note] WSREP: evs::proto(e8fea524, OPERATIONAL, view_id(REG,d87d55d6,2)) suspecting node: d87d55d6
2019-07-17 3:35:33 140449607362816 [Note] WSREP: evs::proto(e8fea524, OPERATIONAL, view_id(REG,d87d55d6,2)) suspected node without join message, declaring inactive
2019-07-17 3:35:34 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,d87d55d6,2) memb { e8fea524,0 } joined { } left { } partitioned { d87d55d6,0 })
2019-07-17 3:35:34 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
2019-07-17 3:35:34 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,e8fea524,3) memb { e8fea524,0 } joined { } left { } partitioned { d87d55d6,0 })
2019-07-17 3:35:34 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection established to ef8a1c31 tcp://10.42.5.67:4567
2019-07-17 3:35:34 140449607362816 [Note] WSREP: remote endpoint tcp://10.42.5.67:4567 changed identity d87d55d6 -> ef8a1c31
2019-07-17 3:35:35 140449607362816 [Note] WSREP: declaring ef8a1c31 at tcp://10.42.5.67:4567 stable
2019-07-17 3:35:35 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
2019-07-17 3:35:35 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,e8fea524,4) memb { e8fea524,0 ef8a1c31,0 } joined { } left { } partitioned { d87d55d6,0 })
2019-07-17 3:35:37 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting off
2019-07-17 3:35:54 140449607362816 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158
2019-07-17 3:35:54 140449607362816 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2019-07-17 3:35:54 140449607362816 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'mariadb-galera' at 'gcomm://dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local': -110 (Connection timed out)
2019-07-17 3:35:54 140449607362816 [ERROR] WSREP: gcs connect failed: Connection timed out
2019-07-17 3:35:54 140449607362816 [ERROR] WSREP: wsrep::connect(gcomm://dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local) failed: 7
2019-07-17 3:35:54 140449607362816 [ERROR] Aborting
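Every view this node reaches in the log above is NON_PRIM, and it keeps logging "no nodes coming from prim view, prim not possible" until the gcomm connect gives up with "failed to reach primary view" and mysqld aborts. When comparing the logs of all three pods, a small scan for these markers can make the pattern easier to see. The Python below is a minimal sketch, not an ONAP or Galera tool: summarize_wsrep_log, ViewChange and Summary are hypothetical names, and the regular expression is an assumption about the view(...) format, written to accept both the flattened single-line form pasted in this report and output where memb { ... } is split across lines.

```python
import re
from typing import List, NamedTuple

# Excerpt copied verbatim from the pod log in this report.
SAMPLE_LOG = """\
2019-07-17 3:35:24 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
2019-07-17 3:35:24 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,d87d55d6,2) memb { d87d55d6,0 e8fea524,0 } joined { } left { } partitioned { })
2019-07-17 3:35:34 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,e8fea524,3) memb { e8fea524,0 } joined { } left { } partitioned { d87d55d6,0 })
2019-07-17 3:35:54 140449607362816 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158
"""

# view(view_id(<TYPE>,<uuid>,<seq>) memb { <uuid>,<n> ... }  (whitespace-tolerant)
VIEW_RE = re.compile(r"view\(view_id\((\w+),\w+,\d+\)\s*memb\s*\{([^}]*)\}")


class ViewChange(NamedTuple):
    kind: str           # view type as printed, e.g. "NON_PRIM"
    members: List[str]  # node UUID prefixes listed under memb { }


class Summary(NamedTuple):
    views: List[ViewChange]
    no_prim_warnings: int
    failed_primary_view: bool


def summarize_wsrep_log(text: str) -> Summary:
    """Collect view changes and primary-component failure markers from a WSREP log."""
    views = [
        ViewChange(m.group(1), [entry.split(",")[0] for entry in m.group(2).split()])
        for m in VIEW_RE.finditer(text)
    ]
    return Summary(
        views=views,
        no_prim_warnings=text.count("no nodes coming from prim view"),
        failed_primary_view="failed to reach primary view" in text,
    )


if __name__ == "__main__":
    summary = summarize_wsrep_log(SAMPLE_LOG)
    for view in summary.views:
        print(view.kind, view.members)
    print("no-prim warnings:", summary.no_prim_warnings)
    print("failed to reach primary view:", summary.failed_primary_view)
```

Run over the full log above, the same function would report four NON_PRIM views, three "no nodes coming from prim view" warnings and the final primary-view failure; no PRIM view ever forms on this node.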