OOM-2057

race condition: readiness-check pod ready.py script fails to detect shared Cassandra is ready


      Referring to the problem posted on https://lists.onap.org/g/onap-discuss/topic/32859796

      Background:

      • a Dublin environment was deployed and used for demos
      • a problem was detected where AAI returned errors
      • presumably some attempt was made to fix AAI by re-deploying
      • further investigation showed that the shared Cassandra failed with multiple problems (out of memory, error in commit log, out of disk space)
      • the AAI pods and other dependent pods were left in Init state
      • the Cassandra problems were eventually rectified by restarting the pods and deleting the offending commit log file (causing some data loss)

      Problem:

      • after the Cassandra problems were fixed, the Cassandra pods were in Running state and healthy according to their logs, but the dependent AAI pods and others were still stuck in Init state
      • the kubectl logs command could not display any data for pods in Init state
      • the docker logs command showed that the readiness-check pod was failing (see below)
      2019-08-14 05:05:22,283 - INFO - Checking if cassandra  is ready
      2019-08-14 05:05:22,963 - INFO - Statefulset dev-cassandra-cassandra  is not ready
      2019-08-14 05:05:22,971 - WARNING - timed out waiting for 'cassandra' to be ready
      

      Inspection of the ready.py script in the readiness-check pod shows that wait_for_statefulset_complete has this condition:

              s = response.status
              if (s.updated_replicas == response.spec.replicas and
                      s.replicas == response.spec.replicas and
                      s.ready_replicas == response.spec.replicas and
                      s.current_replicas == response.spec.replicas and
                      s.observed_generation == response.metadata.generation):
                  log.info("Statefulset " + statefulset_name + "  is ready")
                  return True
      

      but the status reported by kubectl contained only these fields:

      # kubectl -n onap get statefulset/dev-cassandra-cassandra -o yaml
      ...
      status:
        collisionCount: 0
        currentReplicas: 3
        currentRevision: dev-cassandra-cassandra-84f4d86c9f
        observedGeneration: 1
        readyReplicas: 3
        replicas: 3
        updateRevision: dev-cassandra-cassandra-84f4d86c9f
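
      For illustration, here is a minimal sketch (not part of ready.py; it assumes the kubernetes Python client that ready.py already uses, plus kubeconfig or in-cluster access) that reads the same fields the check compares. On this StatefulSet, updated_replicas comes back as None because the field is absent from the status, so the equality against spec.replicas can never be satisfied:

      # hypothetical inspection script, for illustration only
      from kubernetes import client, config

      config.load_kube_config()          # or config.load_incluster_config() inside a pod
      apps = client.AppsV1Api()

      sts = apps.read_namespaced_stateful_set("dev-cassandra-cassandra", "onap")
      s = sts.status
      print("spec.replicas           :", sts.spec.replicas)
      print("status.replicas         :", s.replicas)
      print("status.ready_replicas   :", s.ready_replicas)
      print("status.current_replicas :", s.current_replicas)
      print("status.updated_replicas :", s.updated_replicas)   # None when absent from the status
      print("observed_generation     :", s.observed_generation, "/ generation:", sts.metadata.generation)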
      

      Attempted workarounds:

      • Using the command (with a couple of different replica counts): kubectl -n onap scale statefulsets dev-cassandra-cassandra --replicas=3

      this did change the number of replicas but did not add an "updatedReplicas" field to the status.

      • Using the command (to make a minor change to a timeout-seconds value): kubectl -n onap edit statefulset/dev-cassandra-cassandra -o yaml

      this did introduce an "updatedReplicas" field whose value changed as the update propagated through the system, e.g.

      status:
        collisionCount: 0
        currentReplicas: 1
        currentRevision: dev-cassandra-cassandra-84f4d86c9f
        observedGeneration: 4
        readyReplicas: 2
        replicas: 3
        updateRevision: dev-cassandra-cassandra-6fb846979d
        updatedReplicas: 2
      

      however, the change was transient and the "updatedReplicas" field disappeared again once the update had completed:

      status:
        collisionCount: 0
        currentReplicas: 3
        currentRevision: dev-cassandra-cassandra-6fb846979d
        observedGeneration: 4
        readyReplicas: 3
        replicas: 3
        updateRevision: dev-cassandra-cassandra-6fb846979d
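
      To confirm the transience, a hypothetical watch loop (again using the kubernetes Python client; the StatefulSet name and namespace are taken from this deployment, the rest is illustrative) can print the rollout fields as they appear and disappear during the rolling update:

      # hypothetical observation loop, for illustration only
      from kubernetes import client, config, watch

      config.load_kube_config()
      apps = client.AppsV1Api()

      w = watch.Watch()
      for event in w.stream(apps.list_namespaced_stateful_set,
                            namespace="onap",
                            field_selector="metadata.name=dev-cassandra-cassandra",
                            timeout_seconds=600):
          s = event["object"].status
          print(event["type"],
                "updated_replicas =", s.updated_replicas,
                "ready_replicas =", s.ready_replicas,
                "observed_generation =", s.observed_generation)
      w.stop()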
      

      Observations:

      • each Cassandra node took about 3 minutes to update
      • the podManagementPolicy of OrderedReady and the updateStrategy of RollingUpdate mean that each pod is updated in turn
      • it took nearly 10 minutes for the changes to fully propagate through the system
      • this is a long time to wait for such a trivial parameter tweak

      Conclusion:

      • the AAI pods were still stuck in Init state
      • the Cassandra pods were updated and in Running state
      • the ready.py script bases its readiness condition on a transient status field ("updatedReplicas"); one possible adjustment is sketched after this list
      • there is a race condition between the transient status values appearing in the StatefulSet and the readiness-check pod observing those values to satisfy the condition
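
      One possible direction, sketched against the structure of the quoted excerpt (the api, namespace and log names are assumed to match the rest of ready.py, and this is a sketch rather than a committed fix), is to treat the rollout-progress fields as satisfied when they are absent, so that a quiescent StatefulSet with all replicas ready is reported as ready:

      def wait_for_statefulset_complete(statefulset_name):
          response = api.read_namespaced_stateful_set(statefulset_name, namespace)
          s = response.status
          spec_replicas = response.spec.replicas
          # "updatedReplicas" was absent from the quiescent StatefulSet shown above,
          # so a missing value should not be allowed to block readiness indefinitely
          if (s.replicas == spec_replicas and
                  s.ready_replicas == spec_replicas and
                  (s.current_replicas is None or s.current_replicas == spec_replicas) and
                  (s.updated_replicas is None or s.updated_replicas == spec_replicas) and
                  s.observed_generation == response.metadata.generation):
              log.info("Statefulset " + statefulset_name + "  is ready")
              return True
          log.info("Statefulset " + statefulset_name + "  is not ready")
          return False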

      This should be treated as a High/Highest priority problem, as it is blocking all use of ONAP at this point.
