Details
- Type: Bug
- Status: Closed
- Priority: Medium
- Resolution: Done
- Casablanca Release
Description
Symptoms:
- the aai-resources REST API was initially working as expected, with GET, PUT, etc. calls returning successfully and quickly
- a couple of days later, the same GET, PUT, etc. calls were returning incorrect output and responding extremely slowly. In this condition, a PUT returns an internal error:
{ "requestError": { "serviceException": { "messageId": "SVC3002", "text": "Error writing output performing %1 on %2 (msg=%3) (ec=%4)", "variables": [ "PUT", "v14/network/sdwan-bandwidth-policies/sdwan-bandwidth-policy/test-policy-1", "Internal Error:java.lang.IndexOutOfBoundsException: Index: 0, Size: 0", "ERR.5.4.4000" ] } } }
while the GET incorrectly shows no data:
{ "requestError": { "serviceException": { "messageId": "SVC3001", "text": "Resource not found for %1 using id %2 (msg=%3) (ec=%4)", "variables": [ "GET", "network/sdwan-bandwidth-policies", "Node Not Found:No Node of type sdwan-bandwidth-policy found at: network/sdwan-bandwidth-policies", "ERR.5.4.6114" ] } } }
Investigation:
- As this was an OOM deployment of the Casablanca release, kubectl was used to get some information about the pods:
root@kubectl:~# kubectl get pods -n onap | grep aai
dev-aai-5f58b6bb4f-sj4pj                    1/1   Running      0    25d
dev-aai-babel-5767fc855c-zxkmr              2/2   Running      0    25d
dev-aai-cassandra-0                         1/1   Running      46   25d
dev-aai-cassandra-1                         1/1   Running      0    25d
dev-aai-cassandra-2                         1/1   Running      0    25d
dev-aai-champ-f48fbb795-8bt9c               2/2   Running      0    25d
dev-aai-data-router-7c549f657-tt5tv         2/2   Running      0    25d
dev-aai-elasticsearch-67cfb8dfc5-p77z9      1/1   Running      0    25d
dev-aai-gizmo-6c9465c596-97f4x              2/2   Running      0    25d
dev-aai-graphadmin-7747f78755-458gs         2/2   Running      0    25d
dev-aai-graphadmin-create-db-schema-4mhj9   0/1   Completed    0    25d
dev-aai-graphadmin-create-db-schema-b9496   0/1   Init:Error   0    25d
dev-aai-modelloader-9fdb4df5c-p9wfb         2/2   Running      0    25d
dev-aai-resources-d5465675d-msjlf           2/2   Running      0    25d
dev-aai-search-data-bcfb67bdc-74pn8         2/2   Running      0    25d
dev-aai-sparky-be-6b7d5b5dc4-dl44d          2/2   Running      0    25d
dev-aai-spike-56f8768d9c-mwm62              2/2   Running      0    25d
dev-aai-traversal-bdf87969c-pkzn9           2/2   Running      0    25d
dev-aai-traversal-update-query-data-zsxnm   0/1   Completed    0    25d
dev-pomba-aaictxbuilder-7c5b94fd9d-cq9w6    2/2   Running      0    25d
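For any pod whose RESTARTS column looks suspicious, the reason for the most recent restart can be pulled from the pod's last termination state, for example (standard kubectl commands, not part of the original transcript):
# Full pod description, including last termination reason and exit code
kubectl describe pod dev-aai-cassandra-0 -n onap
# Or just the last termination state of the first container
kubectl get pod dev-aai-cassandra-0 -n onap -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'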
However, a closer look at the Cassandra pods showed that dev-aai-cassandra-0 had 46 restarts while the other two pods had zero restarts, yet the logs of those two pods showed:
root@kubectl:~# kubectl logs dev-aai-cassandra-1 -n onap | tail -20
WARN  00:55:14 Stopping gossip by operator request
INFO  00:55:14 Announcing shutdown
INFO  00:55:14 Node /172.16.0.58 state jump to shutdown
ERROR 00:55:16 Stopping RPC server
INFO  00:55:16 Stop listening to thrift clients
ERROR 00:55:16 Stopping native transport
INFO  00:55:16 Stop listening for CQL clients
ERROR 00:55:16 Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:343) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:196) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: java.io.IOException: No space left on device
        at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_181]
        at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_181]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.20.jar:2.1.20]
        ... 3 common frames omitted
INFO  01:20:22 Enqueuing flush of peers: 152 (0%) on-heap, 0 (0%) off-heap
INFO  01:20:22 Enqueuing flush of size_estimates: 5797533 (4%) on-heap, 0 (0%) off-heap
and
root@kubectl:~# kubectl logs dev-aai-cassandra-2 -n onap | tail -20
WARN  00:59:00 Stopping gossip by operator request
INFO  00:59:00 Announcing shutdown
INFO  00:59:00 Node /172.16.0.75 state jump to shutdown
ERROR 00:59:02 Stopping RPC server
INFO  00:59:02 Stop listening to thrift clients
ERROR 00:59:02 Stopping native transport
INFO  00:59:02 Stop listening for CQL clients
ERROR 00:59:02 Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:343) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:196) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: java.io.IOException: No space left on device
        at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_181]
        at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_181]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.20.jar:2.1.20]
        ... 3 common frames omitted
INFO  01:28:09 Enqueuing flush of peers: 152 (0%) on-heap, 0 (0%) off-heap
INFO  01:28:09 Enqueuing flush of size_estimates: 5790141 (4%) on-heap, 0 (0%) off-heap
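The stack traces show the commit log sync failing with "No space left on device". A quick way to confirm the disk-full condition from outside the containers would be something like the following (a sketch only; it assumes the Cassandra data/commitlog volume is mounted under /var/lib/cassandra, which may differ per deployment):
# Check free space on the Cassandra data volume in each pod
for pod in dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2; do
  echo "== $pod =="
  kubectl exec $pod -n onap -- df -h /var/lib/cassandra
done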
Conclusion:
- the AAI microservices (e.g. aai-resources) continued to respond to REST API calls, although very slowly and with incorrect output, even though the Cassandra cluster was down
- the Cassandra pods still appeared to be running according to kubectl, even though the logs showed that Cassandra had shut down due to errors
- the empty database is probably a result of the single surviving Cassandra node behaving as its own quorum, even though the other 2 nodes had shut down with errors (see the nodetool check sketched after this list)
- this behaviour is a bug
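To verify how many Cassandra nodes were actually participating in the cluster, rather than trusting the pod STATUS column, a check along these lines could be run against the surviving node (a sketch; nodes that have shut down should appear as DN rather than UN):
# Ask the remaining Cassandra node for its view of cluster membership
kubectl exec dev-aai-cassandra-0 -n onap -- nodetool status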
Suggestions:
- in this case the disk was full, so the system should fall back to a read-only mode that allows GET calls to succeed but rejects PUT calls with an appropriate message; AAI should never serve up false results
- the AAI pods/microservices should detect that the Cassandra service is down
- the Cassandra pods should reflect their error conditions in health checks (a minimal sketch of such a check follows this list)
- if 2 out of 3 Cassandra nodes are down, the Cassandra service should probably shut down rather than continue to run as a single node
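A minimal sketch of the kind of health command that would have caught this condition is below. It is hypothetical, not the current OOM chart configuration; the idea is that a readiness/liveness probe on the Cassandra container runs a trivial CQL query, which fails once the node stops listening for CQL clients (as happened here):
# Fails once the node stops accepting CQL connections; suitable as a probe command
cqlsh -e "SELECT release_version FROM system.local" 127.0.0.1 9042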
Workaround:
- Delete the pods and allow kubernetes to replace them
- Run createDBSchema script in graphadmin
- Restart the AAI containers
root@kubectl:~# kubectl delete pods dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2 -n onap
pod "dev-aai-cassandra-0" deleted
pod "dev-aai-cassandra-1" deleted
pod "dev-aai-cassandra-2" deleted
root@kubectl:~# kubectl get pods -n onap | grep aai-cas
dev-aai-cassandra-0   1/1   Running   0   1m
dev-aai-cassandra-1   1/1   Running   0   50s
dev-aai-cassandra-2   0/1   Running   0   31s
[root@onap-84484 tests]# docker exec `docker ps --filter name=k8s_aai-graphadmin --quiet` su aaiadmin -c /opt/app/aai-graphadmin/bin/createDBSchema.sh
[root@onap-84484 tests]# docker restart `docker ps --filter name=k8s_aai-resources --quiet`
[root@onap-84484 tests]# docker restart `docker ps --filter name=k8s_aai-traversal --quiet`
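Note that dev-aai-cassandra-2 was still 0/1 Ready in the listing above. If scripting this workaround, it would be safer to wait for all three Cassandra pods to become Ready before re-running createDBSchema, for example (assuming a kubectl recent enough to support kubectl wait):
# Wait up to 10 minutes for all three Cassandra pods to report Ready
kubectl wait --for=condition=ready pod dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2 -n onap --timeout=600s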
A quick test with curl confirms that the REST API responds quickly again and returns the expected data.
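For reference, a verification GET along these lines should return the sdwan-bandwidth-policy data rather than the "Node Not Found" error shown above. The host, port and credentials below are placeholders, not the values used in the original test, and will vary per deployment:
# Hypothetical example; replace host/port/credentials for your deployment
curl -sk -u AAI:AAI \
  -H "X-FromAppId: jira-test" \
  -H "X-TransactionId: jira-test-1" \
  -H "Accept: application/json" \
  "https://<aai-resources-host>:8443/aai/v14/network/sdwan-bandwidth-policies"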
Issue Links
- relates to
  - AAI-1928 AAI-Internal server Error:500 (Closed)
  - AAF-722 aaf continues to be available to aai-resources even though aaf database appears to be down (Closed)
  - OOM-1542 ONAP vid/aai pods crashes automatically after sometime (Closed)
  - AAI-2092 aai-resources does excessive amounts of logging (Closed)
  - AAI-2474 AAF healthcheck passes but AAI receives 403 error during authentication and certificate_unknown (Closed)