ONAP Operations Manager / OOM-1995

MariaDB Galera cluster pods keep failing


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: High
    • Frankfurt Release
    • Dublin Release
    • None
    • SDNC Fr Sp4:1/6-1/24

      MariaDB Galera pods fail with connection timeout errors and repeatedly go into CrashLoopBackOff.

      Interestingly, if the data0, data1 and data2 directories are deleted from the /dockerdata-nfs/rel-mariadb-galera folder, the MariaDB cluster pods come up successfully, but after a few hours they go into the error state again.
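
Before deleting the data directories, it may help to record the Galera state saved in each of them. The pod log in this report prints "Found saved state: ... safe_to_bootsrap: 0", and Galera keeps that state in a grastate.dat file in the MySQL data directory. The following is a minimal Python sketch, not a confirmed diagnosis procedure. It assumes that each of data0, data1 and data2 holds grastate.dat at its top level (this layout is not confirmed in this report). The function names and the sample text are hypothetical, and the sample is reconstructed from the values in the log line rather than copied from a real file.

```python
from pathlib import Path

# Illustrative sample only: values taken from the "Found saved state" log line,
# laid out in the usual grastate.dat key/value form (assumed, not copied from disk).
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    5b27b8a6-a77d-11e9-a00b-26960ebe383d
seqno:   -1
safe_to_bootstrap: 0
"""


def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def summarize(data_root, names=("data0", "data1", "data2")):
    """Return {directory name: parsed state, or None if grastate.dat is missing}."""
    result = {}
    for name in names:
        path = Path(data_root) / name / "grastate.dat"
        result[name] = parse_grastate(path.read_text()) if path.is_file() else None
    return result


# Parse the sample and show the fields that the log line also reports.
sample_state = parse_grastate(SAMPLE)
print(sample_state["uuid"], sample_state["seqno"], sample_state["safe_to_bootstrap"])
```

On the NFS host, summarize("/dockerdata-nfs/rel-mariadb-galera") would return one entry per directory, which makes it possible to compare uuid, seqno and safe_to_bootstrap across the three nodes before any data is removed.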

      The SO pods also fail to come up because the MariaDB cluster is in an error state.

      See the logs below from one of the pods. Similar errors are present in the other pods' logs.

       

      + CONTAINER_SCRIPTS_DIR=/usr/share/container-scripts/mysql
      + EXTRA_DEFAULTS_FILE=/etc/my.cnf.d/galera.cnf
      + '[' -z onap ']'
      + echo 'Galera: Finding peers'
      Galera: Finding peers
      ++ hostname -f
      ++ cut -d. -f2
      + K8S_SVC_NAME=mariadb-galera
      + echo 'Using service name: mariadb-galera'
      + cp /usr/share/container-scripts/mysql/galera.cnf /etc/my.cnf.d/galera.cnf
      Using service name: mariadb-galera
      + /usr/bin/peer-finder -on-start=/usr/share/container-scripts/mysql/configure-galera.sh -service=mariadb-galera
      2019/07/17 03:35:22 Peer list updated
      was []
      now [dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local]
      2019/07/17 03:35:22 execing: /usr/share/container-scripts/mysql/configure-galera.sh with stdin: dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local
      dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local
      dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local
      2019/07/17 03:35:22 
      2019/07/17 03:35:23 Peer finder exiting
      + '[' '!' -d /var/lib/mysql/mysql ']'
      + exec mysqld
      2019-07-17  3:35:23 140449607362816 [Note] mysqld (mysqld 10.1.24-MariaDB) starting as process 1 ...
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Read nil XID from storage engines, skipping position init
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy <info@codership.com> loaded successfully.
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: CRC-32C: using hardware acceleration.
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Found saved state: 5b27b8a6-a77d-11e9-a00b-26960ebe383d:-1, safe_to_bootsrap: 0
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.versi
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: GCache history reset: old(5b27b8a6-a77d-11e9-a00b-26960ebe383d:0) -> new(5b27b8a6-a77d-11e9-a00b-26960ebe383d:-1)
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: wsrep_sst_grab()
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Start replication
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: protonet asio version 0
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: Using CRC-32C for message checksums.
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: backend: asio
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: gcomm thread scheduling priority set to other:0 
      2019-07-17  3:35:23 140449607362816 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: restore pc from disk failed
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: GMCast version 0
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: EVS version 0
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: gcomm: connecting to group 'mariadb-galera', peer 'dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local:,dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local:,dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local:'
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection established to e8fea524 tcp://10.42.6.57:4567
      2019-07-17  3:35:23 140449607362816 [Warning] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') address 'tcp://10.42.6.57:4567' points to own listening address, blacklisting
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection established to d87d55d6 tcp://10.42.5.67:4567
      2019-07-17  3:35:23 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
      2019-07-17  3:35:24 140449607362816 [Note] WSREP: declaring d87d55d6 at tcp://10.42.5.67:4567 stable
      2019-07-17  3:35:24 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
      2019-07-17  3:35:24 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,d87d55d6,2) memb {
      	d87d55d6,0
      	e8fea524,0
      } joined {
      } left {
      } partitioned {
      })
      2019-07-17  3:35:26 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection to peer e8fea524 with addr tcp://10.42.6.57:4567 timed out, no messages seen in PT3S
      2019-07-17  3:35:27 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2019-07-17  3:35:29 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.42.5.67:4567 
      2019-07-17  3:35:30 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') reconnecting to d87d55d6 (tcp://10.42.5.67:4567), attempt 0
      2019-07-17  3:35:33 140449607362816 [Note] WSREP: evs::proto(e8fea524, OPERATIONAL, view_id(REG,d87d55d6,2)) suspecting node: d87d55d6
      2019-07-17  3:35:33 140449607362816 [Note] WSREP: evs::proto(e8fea524, OPERATIONAL, view_id(REG,d87d55d6,2)) suspected node without join message, declaring inactive
      2019-07-17  3:35:34 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,d87d55d6,2) memb {
      	e8fea524,0
      } joined {
      } left {
      } partitioned {
      	d87d55d6,0
      })
      2019-07-17  3:35:34 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
      2019-07-17  3:35:34 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,e8fea524,3) memb {
      	e8fea524,0
      } joined {
      } left {
      } partitioned {
      	d87d55d6,0
      })
      2019-07-17  3:35:34 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') connection established to ef8a1c31 tcp://10.42.5.67:4567
      2019-07-17  3:35:34 140449607362816 [Note] WSREP: remote endpoint tcp://10.42.5.67:4567 changed identity d87d55d6 -> ef8a1c31
      2019-07-17  3:35:35 140449607362816 [Note] WSREP: declaring ef8a1c31 at tcp://10.42.5.67:4567 stable
      2019-07-17  3:35:35 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
      2019-07-17  3:35:35 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,e8fea524,4) memb {
      	e8fea524,0
      	ef8a1c31,0
      } joined {
      } left {
      } partitioned {
      	d87d55d6,0
      })
      2019-07-17  3:35:37 140449607362816 [Note] WSREP: (e8fea524, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2019-07-17  3:35:54 140449607362816 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
      	 at gcomm/src/pc.cpp:connect():158
      2019-07-17  3:35:54 140449607362816 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
      2019-07-17  3:35:54 140449607362816 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'mariadb-galera' at 'gcomm://dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local': -110 (Connection timed out)
      2019-07-17  3:35:54 140449607362816 [ERROR] WSREP: gcs connect failed: Connection timed out
      2019-07-17  3:35:54 140449607362816 [ERROR] WSREP: wsrep::connect(gcomm://dev-mariadb-galera-mariadb-galera-0.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-1.mariadb-galera.onap.svc.cluster.local,dev-mariadb-galera-mariadb-galera-2.mariadb-galera.onap.svc.cluster.local) failed: 7
      2019-07-17  3:35:54 140449607362816 [ERROR] Aborting
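
Since similar errors appear in the other pods, a short script can make the logs easier to compare. The following Python sketch extracts the cluster views and the ERROR lines from a log in the format shown above, and reports whether a PRIM view was ever logged. The function name and the returned dictionary layout are hypothetical, and the sample consists of four lines copied from the log above.

```python
import re

# Matches view announcements such as "view(view_id(NON_PRIM,d87d55d6,2) memb {".
VIEW_RE = re.compile(r"view\(view_id\((\w+),(\w+),(\d+)\)")
# Matches the severity tag and captures the rest of the message.
LEVEL_RE = re.compile(r"\[(Note|Warning|ERROR)\]\s+(.*)")

SAMPLE = """\
2019-07-17  3:35:24 140449607362816 [Warning] WSREP: no nodes coming from prim view, prim not possible
2019-07-17  3:35:24 140449607362816 [Note] WSREP: view(view_id(NON_PRIM,d87d55d6,2) memb {
2019-07-17  3:35:54 140449607362816 [ERROR] WSREP: gcs connect failed: Connection timed out
2019-07-17  3:35:54 140449607362816 [ERROR] Aborting
"""


def summarize_log(text):
    """Collect (type, node id, sequence) for each view, plus all ERROR messages."""
    views = []
    errors = []
    for line in text.splitlines():
        view = VIEW_RE.search(line)
        if view:
            views.append((view.group(1), view.group(2), int(view.group(3))))
        level = LEVEL_RE.search(line)
        if level and level.group(1) == "ERROR":
            errors.append(level.group(2))
    return {
        "views": views,
        "errors": errors,
        # True only if some view in the log was a primary component.
        "reached_primary": any(kind == "PRIM" for kind, _, _ in views),
    }


summary = summarize_log(SAMPLE)
print(summary["views"])
print(summary["errors"])
print(summary["reached_primary"])
```

The same function can be applied to the saved log text of each pod. In the excerpt above, every view is NON_PRIM, so reached_primary would be False for this pod.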
      
      

       

            sdesbure
            divyang.patel