OOM-2057

race condition: readiness-check pod ready.py script fails to detect shared Cassandra is ready


      Referring to the problem posted on https://lists.onap.org/g/onap-discuss/topic/32859796

      Background:

      • a Dublin environment was deployed and used for demos
      • a problem was detected where AAI returned errors
      • presumably some attempt was made to fix AAI by re-deploying
      • further investigation showed that the shared Cassandra failed with multiple problems (out of memory, error in commit log, out of disk space)
      • the AAI pods and other dependent pods were left in Init state
      • the Cassandra problems were eventually rectified by restarting the pods and deleting the offending commit log file (causing some data loss)

      Problem:

      • after the Cassandra problems were fixed, the Cassandra pods were in Running state and healthy according to their logs, but the dependent AAI pods and others were still stuck in Init state
      • the kubectl logs command could not display any data for pods in Init state
      • the docker logs command showed that the readiness-check pod was failing (see below)
      2019-08-14 05:05:22,283 - INFO - Checking if cassandra  is ready
      2019-08-14 05:05:22,963 - INFO - Statefulset dev-cassandra-cassandra  is not ready
      2019-08-14 05:05:22,971 - WARNING - timed out waiting for 'cassandra' to be ready
      

      Inspection of the ready.py script in the readiness-check pod shows that wait_for_statefulset_complete has this condition:

              s = response.status
              if (s.updated_replicas == response.spec.replicas and
                      s.replicas == response.spec.replicas and
                      s.ready_replicas == response.spec.replicas and
                      s.current_replicas == response.spec.replicas and
                      s.observed_generation == response.metadata.generation):
                  log.info("Statefulset " + statefulset_name + "  is ready")
                  return True
      

      but the status reported by kubectl contained only these fields:

      # kubectl -n onap get statefulset/dev-cassandra-cassandra -o yaml
      ...
      status:
        collisionCount: 0
        currentReplicas: 3
        currentRevision: dev-cassandra-cassandra-84f4d86c9f
        observedGeneration: 1
        readyReplicas: 3
        replicas: 3
        updateRevision: dev-cassandra-cassandra-84f4d86c9f
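
      For illustration, here is a minimal sketch (not part of ready.py; it assumes the kubernetes Python client that ready.py already uses, plus kubeconfig or in-cluster access) that reads the same fields the check compares. On this StatefulSet, updated_replicas comes back as None because the field is absent from the status, so the equality against spec.replicas can never be satisfied:

      # hypothetical inspection script, for illustration only
      from kubernetes import client, config

      config.load_kube_config()          # or config.load_incluster_config() inside a pod
      apps = client.AppsV1Api()

      sts = apps.read_namespaced_stateful_set("dev-cassandra-cassandra", "onap")
      s = sts.status
      print("spec.replicas           :", sts.spec.replicas)
      print("status.replicas         :", s.replicas)
      print("status.ready_replicas   :", s.ready_replicas)
      print("status.current_replicas :", s.current_replicas)
      print("status.updated_replicas :", s.updated_replicas)   # None when absent from the status
      print("observed_generation     :", s.observed_generation, "/ generation:", sts.metadata.generation)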
      

      Attempted workarounds:

      • Using the command (with a couple of different replica counts): kubectl -n onap scale statefulsets dev-cassandra-cassandra --replicas=3

      this did change the number of replicas but did not add an "updatedReplicas" field to the status.

      • Using the command (to make a minor change to a timeout-seconds value): kubectl -n onap edit statefulset/dev-cassandra-cassandra -o yaml

      this did introduce an "updatedReplicas" field whose value changed as the update propagated through the system, e.g.

      status:
        collisionCount: 0
        currentReplicas: 1
        currentRevision: dev-cassandra-cassandra-84f4d86c9f
        observedGeneration: 4
        readyReplicas: 2
        replicas: 3
        updateRevision: dev-cassandra-cassandra-6fb846979d
        updatedReplicas: 2
      

      however, the change was transient and the "updatedReplicas" field disappeared again once the update had completed:

      status:
        collisionCount: 0
        currentReplicas: 3
        currentRevision: dev-cassandra-cassandra-6fb846979d
        observedGeneration: 4
        readyReplicas: 3
        replicas: 3
        updateRevision: dev-cassandra-cassandra-6fb846979d
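
      To confirm the transience, a hypothetical watch loop (again using the kubernetes Python client; the StatefulSet name and namespace are taken from this deployment, the rest is illustrative) can print the rollout fields as they appear and disappear during the rolling update:

      # hypothetical observation loop, for illustration only
      from kubernetes import client, config, watch

      config.load_kube_config()
      apps = client.AppsV1Api()

      w = watch.Watch()
      for event in w.stream(apps.list_namespaced_stateful_set,
                            namespace="onap",
                            field_selector="metadata.name=dev-cassandra-cassandra",
                            timeout_seconds=600):
          s = event["object"].status
          print(event["type"],
                "updated_replicas =", s.updated_replicas,
                "ready_replicas =", s.ready_replicas,
                "observed_generation =", s.observed_generation)
      w.stop()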
      

      Observations:

      • each Cassandra node took about 3 minutes to update
      • the podManagementPolicy of OrderedReady and the updateStrategy of RollingUpdate mean that each pod is updated in turn
      • it took nearly 10 minutes for the changes to fully propagate through the system
      • this is a long time to wait for such a trivial parameter tweak

      Conclusion:

      • the AAI pods were still stuck in Init state
      • the Cassandra pods were updated and in Running state
      • the ready.py script bases its readiness condition on a transient status field ("updatedReplicas"); one possible adjustment is sketched after this list
      • there is a race condition between the transient status values appearing in the StatefulSet and the readiness-check pod observing those values to satisfy the condition
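
      One possible direction, sketched against the structure of the quoted excerpt (the api, namespace and log names are assumed to match the rest of ready.py, and this is a sketch rather than a committed fix), is to treat the rollout-progress fields as satisfied when they are absent, so that a quiescent StatefulSet with all replicas ready is reported as ready:

      def wait_for_statefulset_complete(statefulset_name):
          response = api.read_namespaced_stateful_set(statefulset_name, namespace)
          s = response.status
          spec_replicas = response.spec.replicas
          # "updatedReplicas" was absent from the quiescent StatefulSet shown above,
          # so a missing value should not be allowed to block readiness indefinitely
          if (s.replicas == spec_replicas and
                  s.ready_replicas == spec_replicas and
                  (s.current_replicas is None or s.current_replicas == spec_replicas) and
                  (s.updated_replicas is None or s.updated_replicas == spec_replicas) and
                  s.observed_generation == response.metadata.generation):
              log.info("Statefulset " + statefulset_name + "  is ready")
              return True
          log.info("Statefulset " + statefulset_name + "  is not ready")
          return False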

      This should be treated as a High/Highest priority problem, as it is blocking all use of ONAP at this point.
