Type: Bug
Resolution: Done
Priority: Highest
Fix Version: Dublin Release
Labels: None
Referring to the problem posted on https://lists.onap.org/g/onap-discuss/topic/32859796
Background:
- a Dublin environment was deployed and used for demos
- a problem was detected where AAI returned errors
- presumably some attempt was made to fix AAI by re-deploying
- further investigation showed that the shared Cassandra had failed with multiple problems (out of memory, errors in the commit log, out of disk space)
- the AAI pods and other dependent pods were left in Init state
- the Cassandra problems were eventually rectified by restarting the pods and deleting the offending commit log file (causing some data loss)
Problem:
- after the Cassandra problems were fixed, the Cassandra pods were in Running state and healthy according to their logs, but the dependent AAI pods and others were still stuck in Init state
- the kubectl logs command could not display any data for pods in Init state
- the docker logs command showed that the readiness-check container was failing (see below)
2019-08-14 05:05:22,283 - INFO - Checking if cassandra is ready
2019-08-14 05:05:22,963 - INFO - Statefulset dev-cassandra-cassandra is not ready
2019-08-14 05:05:22,971 - WARNING - timed out waiting for 'cassandra' to be ready
Inspection of the ready.py script in the readiness-check pod shows that wait_for_statefulset_complete uses this condition:
s = response.status
if (s.updated_replicas == response.spec.replicas and
        s.replicas == response.spec.replicas and
        s.ready_replicas == response.spec.replicas and
        s.current_replicas == response.spec.replicas and
        s.observed_generation == response.metadata.generation):
    log.info("Statefulset " + statefulset_name + " is ready")
    return True
but the status reported by kubectl contained only these fields:
# kubectl -n onap get statefulset/dev-cassandra-cassandra -o yaml
...
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: dev-cassandra-cassandra-84f4d86c9f
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updateRevision: dev-cassandra-cassandra-84f4d86c9f
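Putting the two together: the status above has no "updatedReplicas" field, which the Kubernetes Python client surfaces as None, so the equality check in ready.py can never succeed. A minimal sketch of the failing comparison (the Status class is illustrative; the field values are taken from the kubectl output above):

```python
# Minimal reproduction of the ready.py comparison. The StatefulSet status
# omits "updatedReplicas", so the client reports it as None.
class Status:
    replicas = 3
    ready_replicas = 3
    current_replicas = 3
    updated_replicas = None  # field absent from the StatefulSet status

spec_replicas = 3
observed_generation = 1
metadata_generation = 1

s = Status()
ready = (s.updated_replicas == spec_replicas and
         s.replicas == spec_replicas and
         s.ready_replicas == spec_replicas and
         s.current_replicas == spec_replicas and
         observed_generation == metadata_generation)

print(ready)  # False: None == 3 fails, so the StatefulSet is never "ready"
```

Every other field matches the spec; the single missing field is enough to keep the check failing forever.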
Attempted workarounds:
- Using the command (with a couple of different replica counts):
  kubectl -n onap scale statefulsets dev-cassandra-cassandra --replicas=3
  did change the number of replicas but did not add an "updatedReplicas" field to the status.
- Using the command (to make a minor change in timeout seconds):
  kubectl -n onap edit statefulset/dev-cassandra-cassandra -o yaml
  did introduce an "updatedReplicas" field whose value changed as the update propagated through the system, e.g.
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: dev-cassandra-cassandra-84f4d86c9f
  observedGeneration: 4
  readyReplicas: 2
  replicas: 3
  updateRevision: dev-cassandra-cassandra-6fb846979d
  updatedReplicas: 2
however, the change was transient and the "updatedReplicas" field disappeared again once the update completed:
status:
  collisionCount: 0
  currentReplicas: 3
  currentRevision: dev-cassandra-cassandra-6fb846979d
  observedGeneration: 4
  readyReplicas: 3
  replicas: 3
  updateRevision: dev-cassandra-cassandra-6fb846979d
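This is why the edit workaround only helps transiently: the readiness check polls the status, and "updatedReplicas" is present only while the rolling update is in flight. A sketch of the polling problem, using a hypothetical sequence of status samples modelled on the kubectl outputs above:

```python
# Hypothetical status samples as a rolling update starts and finishes.
# "updatedReplicas" (None when absent) exists only mid-rollout, so the
# full equality check can pass only if a poll lands in a narrow window.
samples = [
    {"replicas": 3, "readyReplicas": 3, "currentReplicas": 3,
     "updatedReplicas": None},   # before the edit: field absent
    {"replicas": 3, "readyReplicas": 2, "currentReplicas": 1,
     "updatedReplicas": 2},      # mid-rollout: field present, counts short
    {"replicas": 3, "readyReplicas": 3, "currentReplicas": 3,
     "updatedReplicas": None},   # rollout done: field has disappeared again
]

def ready(status, want=3):
    """Simplified stand-in for the ready.py equality check."""
    return all(status[k] == want for k in
               ("replicas", "readyReplicas", "currentReplicas",
                "updatedReplicas"))

print([ready(s) for s in samples])  # [False, False, False]
```

At none of the sampled moments do all four conditions hold, so whether the check ever passes depends entirely on polling timing.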
Observations:
- each Cassandra node took about 3 minutes to update
- the podManagementPolicy of OrderedReady and updateStrategy of RollingUpdate mean that the pods are updated one at a time
- it took nearly 10 minutes for the changes to fully propagate through the system
- this is a long time to wait for such a trivial parameter tweak
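The timing observations above follow directly from the one-pod-at-a-time rollout; a back-of-the-envelope estimate (using the observed per-node time):

```python
# Rough rollout-time estimate for an OrderedReady StatefulSet with a
# RollingUpdate strategy: pods are replaced sequentially, so total time
# is roughly replicas * per-pod update time.
replicas = 3
minutes_per_pod = 3  # observed time for each Cassandra node to update

total_minutes = replicas * minutes_per_pod
print(total_minutes)  # 9 - consistent with the ~10 minutes observed
```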
Conclusion:
- the AAI pods were still stuck in Init state
- the Cassandra pods were updated and in Running state
- the ready.py script bases its readiness condition on a transient status field ("updatedReplicas")
- there is a race condition between the transient status values appearing on the StatefulSet and the readiness-check pod observing those values at a moment when they satisfy the conditions
This should be considered a High/Highest priority problem, as it is blocking all usage of ONAP at this point.
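One possible direction for a fix (a sketch only, under the assumption that a missing "updatedReplicas" means no rollout is in progress; this is not the actual ONAP patch):

```python
from types import SimpleNamespace as NS

def statefulset_complete(response):
    """Hypothetical readiness condition that does not hinge on the
    transient updatedReplicas field: treat an absent field (None) as
    satisfied, and only require it to match during an actual rollout."""
    s = response.status
    spec = response.spec
    rollout_done = (s.updated_replicas is None
                    or s.updated_replicas == spec.replicas)
    return (s.observed_generation == response.metadata.generation
            and s.replicas == spec.replicas
            and s.ready_replicas == spec.replicas
            and s.current_replicas == spec.replicas
            and rollout_done)

# Example: the steady-state status reported by kubectl above, which has
# no updatedReplicas field, would now be accepted as ready.
resp = NS(
    metadata=NS(generation=1),
    spec=NS(replicas=3),
    status=NS(replicas=3, ready_replicas=3, current_replicas=3,
              updated_replicas=None, observed_generation=1),
)
print(statefulset_complete(resp))  # True
```

A variant would be to compare currentRevision against updateRevision instead, which is how the StatefulSet controller itself tracks rollout completion; either way the check no longer depends on a field that disappears between rollouts.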