Uploaded image for project: 'Active and Available Inventory'
  1. Active and Available Inventory
  2. AAI-2082

aai-resources gives incorrect output when aai-cassandra has shutdown with failure

    Details

      Description

      Symptoms:

      • aai-resources REST API was working successfully with GET, PUT, etc, calls returning successfully and quickly, as expected
      • a couple of days later, the aai-resources REST API was giving incorrect output to the same GET, PUT, etc, calls and also being extremely slow to respond.  In this condition, the PUT call gives an error message about an internal error:
      {
        "requestError": {
          "serviceException": {
            "messageId": "SVC3002",
            "text": "Error writing output performing %1 on %2 (msg=%3) (ec=%4)",
            "variables": [
              "PUT",
              "v14/network/sdwan-bandwidth-policies/sdwan-bandwidth-policy/test-policy-1",
              "Internal Error:java.lang.IndexOutOfBoundsException: Index: 0, Size: 0",
              "ERR.5.4.4000"
            ]
          }
        }
      }
       

      while the GET incorrectly shows no data:

      {
        "requestError": {
          "serviceException": {
            "messageId": "SVC3001",
            "text": "Resource not found for %1 using id %2 (msg=%3) (ec=%4)",
            "variables": [
              "GET",
              "network/sdwan-bandwidth-policies",
              "Node Not Found:No Node of type sdwan-bandwidth-policy found at: network/sdwan-bandwidth-policies",
              "ERR.5.4.6114"
            ]
          }
        }
      }
       

      Investigation:

      • As this was an OOM deployment of Casablanca release, used kubectl to get some information about the pods:
       root@kubectl:~# kubectl get pods -n onap | grep aai
      dev-aai-5f58b6bb4f-sj4pj                                1/1       Running            0          25d
      dev-aai-babel-5767fc855c-zxkmr                          2/2       Running            0          25d
      dev-aai-cassandra-0                                     1/1       Running            46         25d
      dev-aai-cassandra-1                                     1/1       Running            0          25d
      dev-aai-cassandra-2                                     1/1       Running            0          25d
      dev-aai-champ-f48fbb795-8bt9c                           2/2       Running            0          25d
      dev-aai-data-router-7c549f657-tt5tv                     2/2       Running            0          25d
      dev-aai-elasticsearch-67cfb8dfc5-p77z9                  1/1       Running            0          25d
      dev-aai-gizmo-6c9465c596-97f4x                          2/2       Running            0          25d
      dev-aai-graphadmin-7747f78755-458gs                     2/2       Running            0          25d
      dev-aai-graphadmin-create-db-schema-4mhj9               0/1       Completed          0          25d
      dev-aai-graphadmin-create-db-schema-b9496               0/1       Init:Error         0          25d
      dev-aai-modelloader-9fdb4df5c-p9wfb                     2/2       Running            0          25d
      dev-aai-resources-d5465675d-msjlf                       2/2       Running            0          25d
      dev-aai-search-data-bcfb67bdc-74pn8                     2/2       Running            0          25d
      dev-aai-sparky-be-6b7d5b5dc4-dl44d                      2/2       Running            0          25d
      dev-aai-spike-56f8768d9c-mwm62                          2/2       Running            0          25d
      dev-aai-traversal-bdf87969c-pkzn9                       2/2       Running            0          25d
      dev-aai-traversal-update-query-data-zsxnm               0/1       Completed          0          25d
      dev-pomba-aaictxbuilder-7c5b94fd9d-cq9w6                2/2       Running            0          25d
      

       

      However, a closer look at the cassandra pods showed that dev-aai-cassandra-0 had 46 restarts, while the other 2 pods had zero restarts, but the logs showed:

      root@kubectl:~# kubectl logs dev-aai-cassandra-1 -n onap | tail -20
      WARN  00:55:14 Stopping gossip by operator request
      INFO  00:55:14 Announcing shutdown
      INFO  00:55:14 Node /172.16.0.58 state jump to shutdown
      ERROR 00:55:16 Stopping RPC server
      INFO  00:55:16 Stop listening to thrift clients
      ERROR 00:55:16 Stopping native transport
      INFO  00:55:16 Stop listening for CQL clients
      ERROR 00:55:16 Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
      org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
              at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:343) ~[apache-cassandra-2.1.20.jar:2.1.20]
              at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:196) ~[apache-cassandra-2.1.20.jar:2.1.20]
              at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.20.jar:2.1.20]
              at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
      Caused by: java.io.IOException: No space left on device
              at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_181]
              at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_181]
              at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.20.jar:2.1.20]
              ... 3 common frames omitted
      INFO  01:20:22 Enqueuing flush of peers: 152 (0%) on-heap, 0 (0%) off-heap
      INFO  01:20:22 Enqueuing flush of size_estimates: 5797533 (4%) on-heap, 0 (0%) off-heap
       

      and

      root@kubectl:~# kubectl logs dev-aai-cassandra-2 -n onap | tail -20
      WARN  00:59:00 Stopping gossip by operator request
      INFO  00:59:00 Announcing shutdown
      INFO  00:59:00 Node /172.16.0.75 state jump to shutdown
      ERROR 00:59:02 Stopping RPC server
      INFO  00:59:02 Stop listening to thrift clients
      ERROR 00:59:02 Stopping native transport
      INFO  00:59:02 Stop listening for CQL clients
      ERROR 00:59:02 Failed to persist commits to disk. Commit disk failure policy is stop; terminating thread
      org.apache.cassandra.io.FSWriteError: java.io.IOException: No space left on device
              at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:343) ~[apache-cassandra-2.1.20.jar:2.1.20]
              at org.apache.cassandra.db.commitlog.CommitLog.sync(CommitLog.java:196) ~[apache-cassandra-2.1.20.jar:2.1.20]
              at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:81) ~[apache-cassandra-2.1.20.jar:2.1.20]
              at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
      Caused by: java.io.IOException: No space left on device
              at java.nio.MappedByteBuffer.force0(Native Method) ~[na:1.8.0_181]
              at java.nio.MappedByteBuffer.force(MappedByteBuffer.java:203) ~[na:1.8.0_181]
              at org.apache.cassandra.db.commitlog.CommitLogSegment.sync(CommitLogSegment.java:329) ~[apache-cassandra-2.1.20.jar:2.1.20]
              ... 3 common frames omitted
      INFO  01:28:09 Enqueuing flush of peers: 152 (0%) on-heap, 0 (0%) off-heap
      INFO  01:28:09 Enqueuing flush of size_estimates: 5790141 (4%) on-heap, 0 (0%) off-heap
       

       

      Conclusion:

      • the AAI microservices (e.g. aai-resources) continued to respond to REST API calls (although very slowly) even though the Cassandra cluster was down and gave incorrect output
      • the Cassandra pods still appeared to be running in kubectl, even though the logs showed that the system was shutdown due to errors
      • the empty database is probably a result of the single Cassandra node behaving as its own quorum, even though the 2 other nodes were shutdown with errors
      • this behaviour is a bug

      Suggestions:

      • in this case, the disk was full, so the system should fallback to a read-only mode that allows GET calls to succeed but rejects PUT calls with an appropriate message. AAI should never be serving up false results
      • the AAI pods/microservices should detect that the Cassandra service is down
      • the Cassandra pods should reflect their error conditions in health checks
      • if 2 out of 3 Cassandra nodes are down, the Cassandra service should probably shutdown rather than continuing to run as a single node

       Workaround:

      • Delete the pods and allow kubernetes to replace them
      • Run createDBSchema script in graphadmin
      • Restart the AAI containers
      root@kubectl:~# kubectl delete pods dev-aai-cassandra-0 dev-aai-cassandra-1 dev-aai-cassandra-2 -n onap
      pod "dev-aai-cassandra-0" deleted
      pod "dev-aai-cassandra-1" deleted
      pod "dev-aai-cassandra-2" deleted
      
      root@kubectl:~# kubectl get pods -n onap | grep aai-cas
      dev-aai-cassandra-0                                     1/1       Running            0          1m
      dev-aai-cassandra-1                                     1/1       Running            0          50s
      dev-aai-cassandra-2                                     0/1       Running            0          31s
      
      [root@onap-84484 tests]# docker exec `docker ps --filter name=k8s_aai-graphadmin --quiet` su aaiadmin -c /opt/app/aai-graphadmin/bin/createDBSchema.sh
      [root@onap-84484 tests]# docker restart `docker ps --filter name=k8s_aai-resources --quiet`
      [root@onap-84484 tests]# docker restart `docker ps --filter name=k8s_aai-traversal --quiet` 

      A quick test with curl confirms that the REST API responds quickly again and with the data expected in the responses.

       

        Attachments

          Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

            Activity

              People

              • Assignee:
                jimmydot James Forsyth
                Reporter:
                keong Keong Lim
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: