Policy Framework / POLICY-872

Investigate potential race conditions during rules version upgrades under call load


    • Type: Bug
    • Resolution: Done
    • Priority: Medium
    • Casablanca Release
    • Beijing Release, Casablanca Release
    • None
    • SB07

      It has occasionally been observed that when rules version upgrades occur during transaction loads, some locks are left over and are never released by the safeguard timers provisioned in the drools application template used by the use cases. This was seen in the context of CLAMP testing, with multiple upgrades to the rules taking place during a load.

      The issue seems more repeatable when the upgrade scenario under load involves: 1) upgrade 1 - remove a set of control loop rules, 2) upgrade 2 - add it back. In this scenario, a targetlock and an operationtimer appear to have been left orphaned in the working session.
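
      To make the orphaned-lock symptom concrete, here is a minimal Python sketch, assuming a lock table in which a held lock can only be cleared by its owner or by a safeguard sweep. The names LockTable, acquire, release, and sweep are hypothetical and are not taken from the drools application template. The point is only that if the safeguard never fires, every later request for the same target is refused.

```python
from dataclasses import dataclass, field


@dataclass
class LockTable:
    # Maps target -> (owner, deadline). Hypothetical model, not Policy code.
    locks: dict = field(default_factory=dict)

    def acquire(self, target, owner, now, hold_secs):
        """Grant the lock only if nobody currently holds the target."""
        if target in self.locks:
            return False
        self.locks[target] = (owner, now + hold_secs)
        return True

    def release(self, target, owner):
        """Normal path: the owner releases its own lock."""
        held = self.locks.get(target)
        if held is not None and held[0] == owner:
            del self.locks[target]
            return True
        return False

    def sweep(self, now):
        """Safeguard path: drop every lock whose deadline has passed."""
        expired = [t for t, (_, deadline) in self.locks.items() if deadline <= now]
        for target in expired:
            del self.locks[target]
        return expired


table = LockTable()
first = table.acquire("vm-1", "request-1", now=0, hold_secs=30)
# request-1 never releases and no sweep runs, so the lock is orphaned.
blocked = table.acquire("vm-1", "request-2", now=100, hold_secs=30)
swept = table.sweep(now=100)
retry = table.acquire("vm-1", "request-2", now=100, hold_secs=30)
print(first, blocked, swept, retry)  # True False ['vm-1'] True
```

      In this model the second request is refused until the sweep runs, which is consistent with the symptom described above, though the real template may behave differently.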

      Simpler tests have been run by the policy team, but they have not been able to reproduce the problem.

      For the case where rules are removed and added under load, the recommendation is to use alternative methods to suspend a control loop, since the operation is very disruptive under load. Probably the best option is the is-closed-loop-disabled flag in the AAI settings, which is honored by Policy. Another approach may be to use the filtering API of the PDP-D to drop ONSETs for a specific control loop.
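
      As an illustration of the second approach, the sketch below drops ONSET events for a suspended control loop before they reach the rules. It is not the PDP-D filtering API. The functions should_drop and filter_events are hypothetical, and the field names closedLoopControlName and closedLoopEventStatus are assumed to match the control loop event format in use.

```python
def should_drop(event, suspended_loops):
    """Return True for ONSET events that belong to a suspended control loop."""
    return (
        event.get("closedLoopEventStatus") == "ONSET"
        and event.get("closedLoopControlName") in suspended_loops
    )


def filter_events(events, suspended_loops):
    """Keep every event except ONSETs for the suspended control loops."""
    return [e for e in events if not should_drop(e, suspended_loops)]


# Hypothetical events; "loop-A" and "loop-B" are placeholder names.
events = [
    {"closedLoopControlName": "loop-A", "closedLoopEventStatus": "ONSET"},
    {"closedLoopControlName": "loop-A", "closedLoopEventStatus": "ABATED"},
    {"closedLoopControlName": "loop-B", "closedLoopEventStatus": "ONSET"},
]
kept = filter_events(events, suspended_loops={"loop-A"})
```

      Here only the ONSET for loop-A is dropped; other event types for that loop, and all events for other loops, pass through unchanged.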

      It's important to note that the templates, as they are, are not production quality and would have to be conditioned to support these situations if the intent is to make them resilient to these scenarios.

      Another note to take into account is that there is an effort, from Casablanca onward, to rearchitect several aspects of the system. The current mode of operation of the brmsgw with PDP-D is problematic.

      These are some notes from rshacham on this matter:

      I believe I have reproduced an issue several times where the restart is not being done, apparently because of the target being locked.

      Some of these details are not so relevant to you, but it is the way the situation comes about.

      1. A TCA is deployed and its config policy has threshold 20
      2. VES agent on vgmux is set manually to create packet-loss of 25
      3. Messages are passed from collector to TCA to Policy, etc. and restart works
      4. On VM restart the vgmux is back to 0 packet-loss, so I manually set it to 25 again
      5. The above can be repeated, and the VM is restarted each time
      6. The TCA threshold is bumped up to 28
      7. No ONSET messages are sent because packet-loss is below threshold
      8. TCA threshold is lowered to 23, through CLAMP->Policy->DCAE
      9. With threshold lowered, the TCA emits ONSET again (since vgmux is still sending 25)
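
      The threshold changes in the steps above can be checked with a short Python sketch. It assumes the TCA emits an ONSET when the reported packet-loss is strictly greater than the configured threshold; the exact comparison is an assumption, and emits_onset is a hypothetical helper. The values 20, 25, 28, and 23 come from the steps.

```python
def emits_onset(packet_loss, threshold):
    """Assumed TCA rule: ONSET when packet-loss exceeds the threshold."""
    return packet_loss > threshold


PACKET_LOSS = 25  # value set manually on the vgmux VES agent

# (steps, TCA threshold in effect during those steps)
scenario = [("steps 1-5", 20), ("steps 6-7", 28), ("steps 8-9", 23)]

results = [
    (steps, threshold, emits_onset(PACKET_LOSS, threshold))
    for steps, threshold in scenario
]
for steps, threshold, onset in results:
    print(f"{steps}: threshold={threshold} onset={onset}")
```

      Under this assumption ONSETs are emitted with thresholds 20 and 23 and suppressed with threshold 28, matching the sequence reported.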


      When this ONSET is sent, there is no restart until I manually restart the PDP.

      I have noticed that in this scenario, there are many ONSET messages suddenly sent at step 9.

      I wonder if this is because there is some backlog of messages being handled by TCA at that point.

      Could it be the volume of ONSET messages for the same VM that is causing problems?


      The vgmux sends messages every 10 seconds. I don’t know if this would happen if it were set to send every 2 minutes.

       

            jhh jhh
            jhh jhh