Tuesday, September 24, 2013

What happens when your Concurrent Manager goes crazy?

To answer the question I posed, I'll share with you a scenario we ran into a few months ago.

On a day like any other, at 11 AM we received an alert we had created to tell us when users with end-dated FND_USER records still had reports queued up. We built this alert as a result of an issue we encountered after upgrading our database to 11.2.0.2 when we moved to Exadata, detailed in this My Oracle Support note:

“Scheduled Requests Running for END_DATED Users 100% CPU and Giving ORA-01002 [ID 1080047.1]”

But the alert showed that the reports were Pending, and the bug is obviously about Running reports, right? So later in the day, when the DBAs asked why Concurrency was so high, we didn't connect it to these reports or the alert we had received, since the alert came in at 11 AM and the graph showed 5 AM.
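The alert itself boiled down to a join between the request queue and the user table. Here is a hedged sketch of that kind of check (column names come from FND_CONCURRENT_REQUESTS and FND_USER; PHASE_CODE 'P' is Pending; our actual alert query may have differed):

```sql
-- Sketch of the alert: Pending requests owned by end-dated users.
SELECT r.request_id, u.user_name, u.end_date
  FROM fnd_concurrent_requests r,
       fnd_user u
 WHERE r.requested_by = u.user_id
   AND r.phase_code   = 'P'
   AND u.end_date     < SYSDATE;
```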


At the time, I suspected the problem might be with the manager itself, since this nasty SQL behind the sql_id was taking up the Concurrency time:

SELECT R.Conc_Login_Id, R.Request_Id, R.Phase_Code, R.Status_Code, P.Application_ID, P.Concurrent_Program_ID, P.Concurrent_Program_Name, R.Enable_Trace, R.Restart, DECODE(R.Increment_Dates, 'Y', 'Y', 'N'), R.NLS_Compliant, R.OUTPUT_FILE_TYPE, E.Executable_Name, E.Execution_File_Name, A2.Basepath, DECODE(R.Stale, 'Y', 'C', P.Execution_Method_Code), P.Print_Flag, P.Execution_Options, DECODE(P.Srs_Flag, 'Y',  'Y', 'Q', 'Y', 'N'), P.Argument_Method_Code, R.Print_Style, R.Argument_Input_Method_Code, R.Queue_Method_Code, R.Responsibility_ID, R.Responsibility_Application_ID, R.Requested_By, R.Number_Of_Copies, R.Save_Output_Flag, R.Printer, R.Print_Group, R.Priority, U.User_Name, O.Oracle_Username, O.Encrypted_Oracle_Password, R.Cd_Id, A.Basepath, A.Application_Short_Name, TO_CHAR(R.Requested_Start_Date,'YYYY/MM/DD HH24:MI:SS'), R.Nls_Language, R.Nls_Territory, DECODE(R.Parent_Request_ID, NULL, 0, R.Parent_Request_ID), R.Priority_Request_ID, R.Single_Thread_Flag, R.Has_Sub_Request, R.Is_Sub_Request, R.Req_Information, R.Description, R.Resubmit_Time, TO_CHAR(R.Resubmit_Interval), R.Resubmit_Interval_Type_Code, R.Resubmit_Interval_Unit_Code, TO_CHAR(R.Resubmit_End_Date,'YYYY/MM/DD HH24:MI:SS'), Decode(E.Execution_File_Name, NULL, 'N', Decode(E.Subroutine_Name, NULL, Decode(E.Execution_Method_Code, 'I', 'Y', 'J', 'Y', 'N'), 'Y')), R.Argument1,  R.Argument2,  R.Argument3,  R.Argument4,  R.Argument5, R.Argument6,  R.Argument7,  R.Argument8,  R.Argument9,  R.Argument10, R.Argument11, R.Argument12, R.Argument13, R.Argument14, R.Argument15, R.Argument16, R.Argument17, R.Argument18, R.Argument19, R.Argument20, R.Argument21, R.Argument22, R.Argument23, R.Argument24, R.Argument25, X.Argument26, X.Argument27, X.Argument28, X.Argument29, X.Argument30, X.Argument31, X.Argument32, X.Argument33, X.Argument34, X.Argument35, X.Argument36, X.Argument37, X.Argument38, X.Argument39, X.Argument40, X.Argument41, X.Argument42, X.Argument43, X.Argument44, X.Argument45, X.Argument46, X.Argument47, 
X.Argument48, X.Argument49, X.Argument50, X.Argument51, X.Argument52, X.Argument53, X.Argument54, X.Argument55, X.Argument56, X.Argument57, X.Argument58, X.Argument59, X.Argument60, X.Argument61, X.Argument62, X.Argument63, X.Argument64, X.Argument65, X.Argument66, X.Argument67, X.Argument68, X.Argument69, X.Argument70, X.Argument71, X.Argument72, X.Argument73, X.Argument74, X.Argument75, X.Argument76, X.Argument77, X.Argument78, X.Argument79, X.Argument80, X.Argument81, X.Argument82, X.Argument83, X.Argument84, X.Argument85, X.Argument86, X.Argument87, X.Argument88, X.Argument89, X.Argument90, X.Argument91, X.Argument92, X.Argument93, X.Argument94, X.Argument95, X.Argument96, X.Argument97, X.Argument98, X.Argument99, X.Argument100, R.number_of_arguments, C.CD_Name, NVL(R.Security_Group_ID, 0) FROM fnd_concurrent_requests R, fnd_concurrent_programs P, fnd_application A, fnd_user U, fnd_oracle_userid O, fnd_conflicts_domain C, fnd_concurrent_queues Q, fnd_application A2, fnd_executables E, fnd_conc_request_arguments X WHERE R.Status_code = 'I' And ((R.OPS_INSTANCE is null) or (R.OPS_INSTANCE = -1) or   (R.OPS_INSTANCE =     decode(:dcp_on,1,FND_CONC_GLOBAL.OPS_INST_NUM,R.OPS_INSTANCE))) And R.Request_ID = X.Request_ID(+) And R.Program_Application_Id = P.Application_Id(+) And R.Concurrent_Program_Id  = P.Concurrent_Program_Id(+) And R.Program_Application_Id = A.Application_Id(+) And P.Executable_Application_Id = E.Application_Id(+) And P.Executable_Id             = E.Executable_Id(+) And P.Executable_Application_Id = A2.Application_Id(+) And R.Requested_By   = U.User_Id(+) And R.Cd_Id = C.Cd_Id(+) And R.Oracle_Id = O.Oracle_Id(+) And Q.Application_Id = :q_applid And Q.Concurrent_Queue_Id = :queue_id And (P.Enabled_Flag is NULL OR P.Enabled_Flag = 'Y') And R.Hold_Flag = 'N' And R.Requested_Start_Date <= Sysdate And ( R.Enforce_Seriality_Flag = 'N' OR  ( C.RunAlone_Flag = P.Run_Alone_Flag  And (P.Run_Alone_Flag = 'N'  OR Not Exists  (Select Null From 
Fnd_Concurrent_Requests Sr  Where Sr.Status_Code In ('R', 'T')  And Sr.Enforce_Seriality_Flag = 'Y'  And Sr.CD_id = C.CD_Id))))  And Q.Running_Processes <= Q.Max_Processes  And R.Rowid = :reqname  And ((P.Execution_Method_Code != 'S' OR (R.PROGRAM_APPLICATION_ID,R.CONCURRENT_PROGRAM_ID) IN ((0,98),(0,100),(0,31721),(0,31722),(0,31757))) AND ((R.PROGRAM_APPLICATION_ID,R.CONCURRENT_PROGRAM_ID) NOT IN ((0,36888),(111,31715),(140,20445),(140,20810),(140,38315),(200,20532),(200,36067),(200,36068),(201,32439),(201,32460),(201,32469),(201,32844),(222,20865),(222,20893),(222,31738),(222,31893),(222,38640),(510,43959),(510,43960),(510,43966),(510,43967),(530,49102),(530,49106),(535,43927),(535,43928),(535,43929),(20004,31959),(20004,31961),(20004,31971),(20004,32058),(20004,33549),(20004,33625),(20004,41677),(20009,31993),(20009,32032),(20009,32164),(20009,32243),(20009,32250),(20009,32349),(20009,33564),(20009,33605),(20009,33679),(20009,33689),(20009,34482),(20009,34924),(20009,35325),(20009,35385),(20009,37049),(20009,43079),(20009,46390),(20009,56086),(20033,33685),(20033,39049),(20033,48498),(20033,48499),(20033,48518),(20033,48519),(20033,48539),(20033,48540),(20033,48558),(20033,48560),(20154,37533),(20154,37538),(20154,37909),(20154,39109),(20154,39169),(20154,39189),(20154,39457),(20154,39497),(20154,39517),(20154,39677),(20154,40377),(20154,40517),(20154,40537),(20154,40737),(20154,40818),(20154,41798),(20154,41857),(20154,41858),(20154,41897),(20154,42171),(20154,42173),(20154,42417),(20154,42697),(20154,42838),(20154,43039),(20154,43077),(20154,43297),(20154,43558),(20154,43777),(20154,43797),(20154,43877),(20154,45870),(20154,45872),(20154,45990),(20154,46070),(20154,46694),(20154,46695),(20154,46710),(20154,47030),(20154,47211),(20154,48418),(20154,50610),(20154,50611),(20154,50651),(20154,50880),(20154,50901),(20154,51141),(20154,51183),(20154,51242),(20154,51244),(20154,51323),(20154,51361),(20154,51463),(20154,51481),(20154,51801),(20154,51946),(20154,52383)
,(20154,52384),(20154,52385),(20154,52386),(20154,52388),(20154,53643),(20154,54063),(20154,54104),(20154,55482),(20154,55500),(20154,55881),(20154,55900),(20154,56360),(20154,56521),(20154,56930),(20154,56931),(20154,56932),(20154,56933),(20154,56934),(20154,56935),(20154,56936),(20154,56937),(20154,56938),(20154,56939),(20154,56940),(20154,56941),(20154,56942),(20154,56943),(20154,56944),(20154,56962),(20154,56963),(20154,57158),(20154,57159),(20154,57248),(20154,57284),(20154,57333),(20154,57334),(20154,57507),(20214,39804),(20214,39838),(20214,40937)) AND R.REQUESTED_BY NOT IN (20829,21465,192077))) FOR UPDATE OF R.status_code NoWait

This looks like the manager itself locking (FOR UPDATE) the records as they came in ready to run. Yet when I looked at the sessions, new sessions were popping in and out with recent LOGON_TIME values, all from applmgr on Concurrent Manager node 2, and many of the wait events were "library cache: mutex X" (references here and here for articles I DID find that had some relevance). The next day I decided to look at this from a different viewpoint, since querying the database to find what was holding a lock had gotten me nowhere. I looked closer at the OS on CM node 2, the only CM server showing locked processes, ran top, and found four different FNDLIBR processes each pinning a CPU at 100%:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3165 applmgr   25   0 3069m 3.0g 2904 R 100.1  4.7 496:35.38 FNDLIBR
19021 applmgr   25   0 3069m 3.0g 2840 R 100.1  4.7 496:35.03 FNDLIBR
21783 applmgr   25   0 3068m 3.0g 2912 R 100.1  4.7 493:46.01 FNDLIBR
29837 applmgr   25   0 3068m 3.0g 2828 R 100.1  4.7 493:28.17 FNDLIBR
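As a quick sanity check, the offending PIDs can be pulled out of a saved top snapshot with awk (a sketch; the field positions assume this particular top layout, where field 1 is PID, field 9 is %CPU, and field 12 is COMMAND):

```shell
# Extract PIDs of FNDLIBR processes pegged at ~100% CPU from saved top output.
awk '$12 == "FNDLIBR" && $9 + 0 >= 99 {print $1}' <<'EOF'
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3165 applmgr   25   0 3069m 3.0g 2904 R 100.1  4.7 496:35.38 FNDLIBR
19021 applmgr   25   0 3069m 3.0g 2840 R 100.1  4.7 496:35.03 FNDLIBR
EOF
# prints 3165 and 19021
```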

Using those PIDs, I then ran this query:

select * from fnd_concurrent_requests
where os_process_id = &OS_PID;

What was odd was that each PID returned somewhere around 25-50 rows when I was expecting a single one, until I realized that each PID corresponded to a dedicated concurrent manager process running in Financials. Going into the application, I opened the Administer Concurrent Managers form (FNDCPVCM) and looked for which managers CM node 2 was running, and found that the Standard Manager was Active on that node. Clicking on this entry and then Processes presented me with all the processes (Active, Deactivated, or Terminated), where I noticed that the values under System looked an awful lot like our OS PIDs. So I queried on one, and lo and behold: (a) it returned an Active process, and (b) the log file of that process showed it was thrashing on a user with inactive responsibilities, just like in the My Oracle Support note above. The error message was initially:
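The same System-to-PID mapping the form shows can be checked from SQL*Plus. Here is a hedged sketch (column names from FND_CONCURRENT_PROCESSES and FND_CONCURRENT_QUEUES; PROCESS_STATUS_CODE 'A' is Active; OS_PROCESS_ID is stored as a string, hence the quotes):

```sql
-- Sketch: map an OS PID from top back to its manager process and log file.
SELECT p.concurrent_process_id,
       q.concurrent_queue_name,
       p.process_status_code,
       p.logfile_name
  FROM fnd_concurrent_processes p,
       fnd_concurrent_queues q
 WHERE p.concurrent_queue_id  = q.concurrent_queue_id
   AND p.queue_application_id = q.application_id
   AND p.os_process_id        = '&OS_PID';
```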

Cause:  UPDATE_STATUS failed due to ORA-01002: fetch out of sequence.

Knowing it was related to an end-dated user, I tried extending the end date of the responsibilities by an extra day, but instead of making everything happy in CM land, it just changed the error message being produced to:

Cause:  UPDATE_STATUS failed due to ORA-01008: not all variables bound

with the SQL statement in the error changing from one against FND_CONCURRENT_REQUESTS (which was held in exclusive mode due to the FOR UPDATE clause) to one against FND_USER. I don't know whether this was because all four threads were still trying to hold all the server's resources and would have returned to normal after burning through the statements they were working on, but this is the point where, once there were no long-running requests in the CM, I deactivated the Standard Manager. After monitoring for a few minutes with no change once all Running jobs had completed, I had to Terminate the manager, because those threads still showed as Active and were still taking 100% of the CPU on the server. Once it was Terminated, the four processes no longer showed as Active, and the thrashing the DBA had alerted us to ceased. With the processes gone, I could put the user requests that caused all of this On Hold so the CM wouldn't find itself in the same situation once we turned everything back on. I then Activated the Standard Manager, verified that no threads were hitting 100% on the CM node, and called it a job well done, after making sure the user still needed their reports; they had transitioned to another position in the company, which is why their account was end dated in the first place.
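Placing the requests On Hold is normally done from the Requests form, but under the covers it comes down to the HOLD_FLAG column. A hedged sketch of the equivalent SQL (the &USER_NAME substitution variable is illustrative; commit only after verifying the rows):

```sql
-- Sketch: hold the offending user's Pending requests so the manager
-- stops trying to pick them up.
UPDATE fnd_concurrent_requests
   SET hold_flag = 'Y'
 WHERE requested_by = (SELECT user_id
                         FROM fnd_user
                        WHERE user_name = '&USER_NAME')
   AND phase_code   = 'P';
```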

The moral of the story: even though all the requests were Pending, so I assumed there wasn't a problem, the bug we created this alert for affects the system not only when a request is Running and can't finish, but also when the system merely TRIES to run something and the user and responsibilities are end dated!
