We recently upgraded from QV10 SR4 to QV11.2 SR2 and have noticed a lot of odd behaviour in the new QDS environment with regard to task scheduling, task execution, and task result status reporting in the QMC. We have raised cases with QT support, but I thought I'd see if anyone else is seeing similar issues.
The main oddities we have noticed:
- Tasks not firing. I have observed some tasks which have simply not been executed at their scheduled time (e.g. a daily schedule that works fine for a week and then just misses a day).
- Tasks not reporting error status back to the QMC correctly. For example, a task reaches its timeout and fails, but does not show up as failed in the QMC. This makes it extremely difficult to manage the production environment.
- Intermittent and seemingly random COM Exception Errors. We are seeing about 10 of these per day. This problem is coupled with issue 2 above - i.e. the failed statuses are not always reported back. Here is an example: QDSMain.Exceptions.DistributionFailedException: Distribute failed with errors to follow. ---> QDSMain.Exceptions.ReloadFailedException: Reload failed ---> QDSMain.Exceptions.LogBucketErrorException: The sourcedocument failed to reload.. Exception=System.Runtime.InteropServices.COMException (0x800706BE): The remote procedure call failed. (Exception from HRESULT: 0x800706BE)
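As a side note on the error code in that trace: an HRESULT packs a failure bit, a facility and an error code into 32 bits, so 0x800706BE can be split apart to see which Windows error it wraps. The sketch below is only an illustration (the function name `decode_hresult` is made up for this post, and it is not part of QlikView). It shows facility 7 (Win32) and code 1726, which is the Win32 error RPC_S_CALL_FAILED, matching the "remote procedure call failed" text. That often means the process on the other end of the COM call went away mid-call, but the code by itself does not say why.

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its standard fields (hypothetical helper)."""
    hresult &= 0xFFFFFFFF                      # treat the value as unsigned 32-bit
    return {
        "failure": bool(hresult >> 31),        # bit 31: severity (1 = failure)
        "facility": (hresult >> 16) & 0x7FF,   # bits 16-26: facility code
        "code": hresult & 0xFFFF,              # bits 0-15: error code
    }


# The HRESULT from the exception quoted above
info = decode_hresult(0x800706BE)
print(info)
```

Facility 7 is FACILITY_WIN32, so the low 16 bits can be looked up as an ordinary Win32 error number.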
We had exactly the same configuration on QV10 in terms of task schedules and dependencies (i.e. the QV10 documents and tasks were migrated across "as is"), and we did not see any of these issues on the QV10 SR4 QDS.
The new environment is clustered (although we are currently running only one hot QDS node), and the servers are 40 core / 256GB RAM. The max number of QDS engines has been set to 40, and the heap size has been increased as per QV Support's recommendations based on the hardware configuration. The server doesn't appear to be resource-bound. It would be great to know if anyone else is experiencing any strange behaviour with QV11.2 SR2.
Any feedback would be great, even if it's just a "we're seeing no problems on version X".