Tuesday, November 26, 2013

R12: Bouncing Workflow agent process

I've regaled you with tales of our DBAs being proactive and of our Workflow Mailer going down. To continue yesterday's story, the two came together when we went to R12: the DBAs developed a new alert to tell us when Workflow agent processes go down, so we can better support the users of our application. During our first week on R12, the new alert started going off at odd, random times, and both we and the DBAs began researching what was going wrong. Is it the alert? Nope. We check manually, and the Workflow Deferred Agent Listener service component really is bouncing sporadically. The DBAs think we need to reference MOS Notes 958178.1 and 953103.1, but before we can run the associated SQL scripts to see what's going on, the alert clears because the service component is Running again. Odd, to say the least.

The next time the component starts bouncing, we get the chance to run the SQL scripts. They tell us there are items in the WF_DEFERRED Advanced Queue, but nothing that matches the problems described in the MOS notes, so we keep digging. MOS note 736898.1 says to use the same re-build notification queue script we already have, with an additional step to re-create an index that does not currently exist in our environment. We run the re-build notification queue script, but the service component's behavior does not change. I keep digging through MOS and come across a really good document for Workflow issues (1191125.1), which directs me to the WF_ERROR table. That table is filled with entries for the oracle.apps.ar.applications.CashApp.apply business event, which suggests this event is the cause. This seems to match what the log file is saying:

oracle.apps.fnd.wf.bes.AgentListenerProcessor.listen()]:WF_EVENT.listen returned processed count: 500, error count: 0

oracle.apps.fnd.wf.bes.AgentListenerProcessor.listen()]:WF_EVENT.listen returned processed count: 223, error count: 10

oracle.apps.fnd.wf.bes.AgentListenerProcessor.read()]:10consecutive errors occurred

Looking in WF_ERROR, I can eyeball the errors. With the following script I was also able to see what was being queued and which errors were associated with each queued item:

select * from wf_deferred a, wf_error b
where a.enq_tid = b.enq_tid
order by a.enq_tid

So here is what is happening: the Workflow Deferred Agent Listener service component processes hundreds of events without trouble, and then all of a sudden it hits a batch of records with 10 or more errors in a row (the "10consecutive errors occurred" line in the log). At that point it begins the dance of restart death, which continues until the events are dequeued from WF_DEFERRED and WF_ERROR and the component restarts successfully all by itself.
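To make that stop-and-restart behavior concrete, here is a toy Python model of a listener that gives up after a run of consecutive errors. It is only a sketch based on the log lines above, not Oracle's actual implementation: the function name run_listener, the parameter max_consecutive_errors, and the sample queue contents are all hypothetical.

```python
from typing import Iterable, Tuple


def run_listener(outcomes: Iterable[bool],
                 max_consecutive_errors: int = 10) -> Tuple[int, int, bool]:
    """Toy model of one listener pass (hypothetical, not Oracle code).

    outcomes: True for an event that processed cleanly, False for one that errored.
    Returns (processed, errored, stopped_early).
    """
    processed = errored = streak = 0
    for ok in outcomes:
        if ok:
            processed += 1
            streak = 0  # any success resets the run of consecutive errors
        else:
            errored += 1
            streak += 1
            if streak >= max_consecutive_errors:
                # Threshold reached: the listener stops mid-queue
                return processed, errored, True
    return processed, errored, False


# Made-up queue: 200 good events, a run of 12 bad ones, then 50 more good events
queue = [True] * 200 + [False] * 12 + [True] * 50

print(run_listener(queue, max_consecutive_errors=10))   # stops inside the bad run
print(run_listener(queue, max_consecutive_errors=100))  # rides through the bad run
```

With the lower threshold the model stops partway through the bad run and leaves the rest of the queue untouched; with the higher one it reaches the end. Errors that are scattered among successes never trip the threshold in this model, since each success resets the streak.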

What did we do to address this? First, we increased the error count threshold so the service component wouldn't stumble over just 10 errored items, no matter which business event they came from. Then we turned off the offending business event subscription.
