Details
- Type: Bug
- Status: Closed
- Priority: Medium
- Resolution: Done
- Casablanca Release
Description
Symptoms:
- the aai-resources REST API was initially working as expected, with GET, PUT, etc. calls returning successfully and quickly
- a couple of days later, the same GET, PUT, etc. calls were returning incorrect output and responding extremely slowly. In this condition, a PUT returns an internal error:
{ "requestError": { "serviceException": { "messageId": "SVC3002", "text": "Error writing output performing %1 on %2 (msg=%3) (ec=%4)", "variables": [ "PUT", "v14/network/sdwan-bandwidth-policies/sdwan-bandwidth-policy/test-policy-1", "Internal Error:java.lang.IndexOutOfBoundsException: Index: 0, Size: 0", "ERR.5.4.4000" ] } } }
while the GET incorrectly shows no data:
{ "requestError": { "serviceException": { "messageId": "SVC3001", "text": "Resource not found for %1 using id %2 (msg=%3) (ec=%4)", "variables": [ "GET", "network/sdwan-bandwidth-policies", "Node Not Found:No Node of type sdwan-bandwidth-policy found at: network/sdwan-bandwidth-policies", "ERR.5.4.6114" ] } } }
Investigation:
- As this was an OOM deployment of the Casablanca release, kubectl was used to get some information about the pods:
root@kubectl:~# kubectl get pods -n onap | grep aai
dev-aai-5f58b6bb4f-sj4pj                    1/1   Running      0    25d
dev-aai-babel-5767fc855c-zxkmr              2/2   Running      0    25d
dev-aai-cassandra-0                         1/1   Running      46   25d
dev-aai-cassandra-1                         1/1   Running      0    25d
dev-aai-cassandra-2                         1/1   Running      0    25d
dev-aai-champ-f48fbb795-8bt9c               2/2   Running      0    25d
dev-aai-data-router-7c549f657-tt5tv         2/2   Running      0    25d
dev-aai-elasticsearch-67cfb8dfc5-p77z9      1/1   Running      0    25d
dev-aai-gizmo-6c9465c596-97f4x              2/2   Running      0    25d
dev-aai-graphadmin-7747f78755-458gs         2/2   Running      0    25d
dev-aai-graphadmin-create-db-schema-4mhj9   0/1   Completed    0    25d
dev-aai-graphadmin-create-db-schema-b9496   0/1   Init:Error   0    25d
dev-aai-modelloader-9fdb4df5c-p9wfb         2/2   Running      0    25d
dev-aai-resources-d5465675d-msjlf           2/2   Running      0    25d
dev-aai-search-data-bcfb67bdc-74pn8         2/2   Running      0    25d
dev-aai-sparky-be-6b7d5b5dc4-dl44d          2/2   Running      0    25d
dev-aai-spike-56f8768d9c-mwm62              2/2   Running      0    25d
dev-aai-traversal-bdf87969c-pkzn9           2/2   Running      0    25d
dev-aai-traversal-update-query-data-zsxnm   0/1   Completed    0    25d
dev-pomba-aaictxbuilder-7c5b94fd9d-cq9w6    2/2   Running      0    25d
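For any pod whose RESTARTS column looks suspicious, the reason for the most recent restart can be pulled from the pod's last termination state, for example (standard kubectl commands, not part of the original transcript):
# Full pod description, including last termination reason and exit code
kubectl describe pod dev-aai-cassandra-0 -n onap
# Or just the last termination state of the first container
kubectl get pod dev-aai-cassandra-0 -n onap -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'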
However, a closer look at the Cassandra pods showed that dev-aai-cassandra-0 had 46 restarts while the other two pods had zero restarts, yet the logs of those two pods showed:
root@kubectl:~# kubectl logs dev-aai-cassandra-1 -n onap | tail -20
WARN  00:55:14 Stopping gossip by operator request
INFO  00:55:14 Announcing shutdown
INFO  00:55:14 Node /172.16.0.58 state jump to shutdown
ERROR 00:55:16 Stopping RPC server
INFO  00:55:16 Stop listening to thrift clients
ERROR 00:55:16 Stopping native transport
INFO  00:55:16 Stop listening for CQL clients
ERROR 00:55:16 Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:343) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:196) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: java.io.IOException: No space left on device
        at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_181]
        at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_181]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.20.jar:2.1.20]
        ... 3 common frames omitted
INFO  01:20:22 Enqueuing flush of peers: 152 (0%) on-heap, 0 (0%) off-heap
INFO  01:20:22 Enqueuing flush of size_estimates: 5797533 (4%) on-heap, 0 (0%) off-heap
and
root@kubectl:~# kubectl logs dev-aai-cassandra-2 -n onap | tail -20
WARN  00:59:00 Stopping gossip by operator request
INFO  00:59:00 Announcing shutdown
INFO  00:59:00 Node /172.16.0.75 state jump to shutdown
ERROR 00:59:02 Stopping RPC server
INFO  00:59:02 Stop listening to thrift clients
ERROR 00:59:02 Stopping native transport
INFO  00:59:02 Stop listening for CQL clients
ERROR 00:59:02 Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:343) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:196) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.20.jar:2.1.20]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
Caused by: java.io.IOException: No space left on device
        at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_181]
        at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_181]
        at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.20.jar:2.1.20]
        ... 3 common frames omitted
INFO  01:28:09 Enqueuing flush of peers: 152 (0%) on-heap, 0 (0%) off-heap
INFO  01:28:09 Enqueuing flush of size_estimates: 5790141 (4%) on-heap, 0 (0%) off-heap
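The stack traces show the commit log sync failing with "No space left on device". A quick way to confirm the disk-full condition from outside the containers would be something like the following (a sketch only; it assumes the Cassandra data/commitlog volume is mounted under /var/lib/cassandra, which may differ per deployment):
# Check free space on the Cassandra data volume in each pod
for pod in dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2; do
  echo "== $pod =="
  kubectl exec $pod -n onap -- df -h /var/lib/cassandra
done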
Conclusion:
- the AAI microservices (e.g. aai-resources) continued to respond to REST API calls, although very slowly and with incorrect output, even though the Cassandra cluster was down
- the Cassandra pods still appeared to be running according to kubectl, even though the logs showed that Cassandra had shut down due to errors
- the empty database is probably a result of the single surviving Cassandra node behaving as its own quorum, even though the other 2 nodes had shut down with errors (see the nodetool check sketched after this list)
- this behaviour is a bug
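To verify how many Cassandra nodes were actually participating in the cluster, rather than trusting the pod STATUS column, a check along these lines could be run against the surviving node (a sketch; nodes that have shut down should appear as DN rather than UN):
# Ask the remaining Cassandra node for its view of cluster membership
kubectl exec dev-aai-cassandra-0 -n onap -- nodetool status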
Suggestions:
- in this case the disk was full, so the system should fall back to a read-only mode that allows GET calls to succeed but rejects PUT calls with an appropriate message; AAI should never serve up false results
- the AAI pods/microservices should detect that the Cassandra service is down
- the Cassandra pods should reflect their error conditions in health checks (a minimal sketch of such a check follows this list)
- if 2 out of 3 Cassandra nodes are down, the Cassandra service should probably shut down rather than continue to run as a single node
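A minimal sketch of the kind of health command that would have caught this condition is below. It is hypothetical, not the current OOM chart configuration; the idea is that a readiness/liveness probe on the Cassandra container runs a trivial CQL query, which fails once the node stops listening for CQL clients (as happened here):
# Fails once the node stops accepting CQL connections; suitable as a probe command
cqlsh -e "SELECT release_version FROM system.local" 127.0.0.1 9042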
Workaround:
- Delete the pods and allow kubernetes to replace them
- Run createDBSchema script in graphadmin
- Restart the AAI containers
root@kubectl:~# kubectl delete pods dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2 -n onap
pod "dev-aai-cassandra-0" deleted
pod "dev-aai-cassandra-1" deleted
pod "dev-aai-cassandra-2" deleted
root@kubectl:~# kubectl get pods -n onap | grep aai-cas
dev-aai-cassandra-0   1/1   Running   0   1m
dev-aai-cassandra-1   1/1   Running   0   50s
dev-aai-cassandra-2   0/1   Running   0   31s
[root@onap-84484 tests]# docker exec `docker ps --filter name=k8s_aai-graphadmin --quiet` su aaiadmin -c /opt/app/aai-graphadmin/bin/createDBSchema.sh
[root@onap-84484 tests]# docker restart `docker ps --filter name=k8s_aai-resources --quiet`
[root@onap-84484 tests]# docker restart `docker ps --filter name=k8s_aai-traversal --quiet`
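Note that dev-aai-cassandra-2 was still 0/1 Ready in the listing above. If scripting this workaround, it would be safer to wait for all three Cassandra pods to become Ready before re-running createDBSchema, for example (assuming a kubectl recent enough to support kubectl wait):
# Wait up to 10 minutes for all three Cassandra pods to report Ready
kubectl wait --for=condition=ready pod dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2 -n onap --timeout=600s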
A quick test with curl confirms that the REST API responds quickly again and returns the expected data.
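For reference, a verification GET along these lines should return the sdwan-bandwidth-policy data rather than the "Node Not Found" error shown above. The host, port and credentials below are placeholders, not the values used in the original test, and will vary per deployment:
# Hypothetical example; replace host/port/credentials for your deployment
curl -sk -u AAI:AAI \
  -H "X-FromAppId: jira-test" \
  -H "X-TransactionId: jira-test-1" \
  -H "Accept: application/json" \
  "https://<aai-resources-host>:8443/aai/v14/network/sdwan-bandwidth-policies"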
Issue Links
- relates to
  - AAI-1928 AAI-Internal server Error:500 (Closed)
  - AAF-722 aaf continues to be available to aai-resources even though aaf database appears to be down (Closed)
  - OOM-1542 ONAP vid/aai pods crashes automatically after sometime (Closed)
  - AAI-2092 aai-resources does excessive amounts of logging (Closed)
  - AAI-2474 AAF healthcheck passes but AAI receives 403 error during authentication and certificate_unknown (Closed)