Thursday, February 7, 2013

RAC nodes with unknown code differences

As I said several months ago, when we moved to Exadata we also moved to a RAC platform, and since then I had yet to encounter any pitfalls with this technology stack... until now!  After applying some Exadata-related patches we bounced our system, as we've done several times since moving to RAC and just as we did in our QA mock deployment drills.  But when the user community got a chance a few days later to put the system through its paces, we received reports of various errors in both canned and custom functionality.  Drilling into the issue further, we realized it had something to do with our tax solution: we found "com.evermind.server.http.HttpIOException: Broken pipe" in our application server logs and this in the error logging table:

ORA-29270: too many open HTTP requests
ORA-29273: HTTP request failed
ORA-06512: at "SYS.UTL_HTTP", line 1130
ORA-29270: too many open HTTP requests


It is important to note that one of the patches upgraded us from 11.2.0.2.0 to 11.2.0.3.0, and subsequent red herrings from My Oracle Support (notes 1301699.1 and 1293056.1) had us all assuming the issue resided in new Access Control List functionality that might have been introduced with the patching.  Eventually we found that the code being run to get a URL response would run successfully on one of the RAC nodes but not the other, which had nothing to do with ACLs at all.  With this we continued to research and found that the package being called was actually corrupt on one node but not the other.  It showed no outward sign of being INVALID or otherwise corrupted, yet the same code ran successfully when compiled as a new object on the "bad" node.

This could have been the end of our story, if not for the fact that after we recompiled the object we were still receiving the "too many open HTTP requests" error.  At this point we dug deeper into the system, and one item from My Oracle Support note 961468.1 stuck out to us:

What is the reason of ORA-29270?

The database server has a hardcoded limit of 5 open HTTP connections per session. When you attempt to open a 6th http connection, this error is thrown.
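To make that limit concrete, here is a toy Python model of a session that allows only 5 open HTTP requests. It is not how Oracle implements anything, and it is not what we proved was happening in our case; the `Session` class, `fetch` and `simulate` are made-up names for illustration only. It simply shows one plausible way to get stuck: if a request fails mid-flight (say, with a broken pipe) and nothing closes it, the slot is never returned, and the sixth attempt fails.

```python
MAX_OPEN_REQUESTS = 5  # per-session limit quoted in note 961468.1


class TooManyOpenRequests(Exception):
    """Stand-in for ORA-29270 in this toy model."""


class Session:
    """Hypothetical session that only tracks how many requests are open."""

    def __init__(self):
        self.open_requests = 0

    def begin_request(self):
        # Refuse a new request once the per-session limit is reached.
        if self.open_requests >= MAX_OPEN_REQUESTS:
            raise TooManyOpenRequests("too many open HTTP requests")
        self.open_requests += 1

    def end_request(self):
        if self.open_requests > 0:
            self.open_requests -= 1


def fetch(session, fail=False, cleanup=True):
    """Open a request, optionally fail mid-flight, optionally clean up."""
    session.begin_request()
    try:
        if fail:
            raise IOError("Broken pipe")  # simulated mid-request failure
    except IOError:
        if cleanup:
            session.end_request()  # release the slot on the error path
        raise
    session.end_request()  # normal path: always release the slot


def simulate(failures, cleanup):
    """Run some failing requests, then report (open slots, next request ok)."""
    session = Session()
    for _ in range(failures):
        try:
            fetch(session, fail=True, cleanup=cleanup)
        except IOError:
            pass
    leaked = session.open_requests
    try:
        fetch(session)
        next_ok = True
    except TooManyOpenRequests:
        next_ok = False
    return leaked, next_ok
```

Calling `simulate(5, cleanup=False)` leaves all five slots occupied and the next request is refused, while `simulate(5, cleanup=True)` leaves none occupied. In PL/SQL terms the cleanup step would correspond to closing the request or response (for example with `UTL_HTTP.END_RESPONSE`) in the exception handler, though I can't say that was the trigger here.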

Now, we hadn't changed the application, or the other database we communicate with in this scenario, over the weekend, but somehow the database received (or otherwise caused) corrupted communications and held onto these 5 HTTP connections without releasing them for further communications.  Running lsof and netstat showed several hundred connections from the other database stuck on the node in question.  Once a kill was issued for them and they died off, the natural order was restored to the universe and connections worked again without issue.
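For next time, here is a rough sketch of how the netstat check could be summarized with a few lines of Python, counting TCP connections per remote host so that a pile-up from one peer stands out. It assumes Linux-style `netstat -an` columns (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function name `count_connections`, the addresses, and the sample output are all invented for illustration; this is not the actual output from our node, and the CLOSE_WAIT state shown is only an example.

```python
from collections import Counter


def count_connections(netstat_output, state=None):
    """Count TCP connections per remote host in `netstat -an`-style text.

    If `state` is given (for example "CLOSE_WAIT"), only connections in
    that state are counted.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row with a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote, conn_state = fields[4], fields[5]
        if state is not None and conn_state != state:
            continue
        # Drop the port so connections are grouped by remote host.
        host = remote.rsplit(":", 1)[0]
        counts[host] += 1
    return counts


# Made-up sample data, shaped like Linux `netstat -an` output.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.11:1521          0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.11:45012         10.0.0.50:8080          CLOSE_WAIT
tcp        0      0 10.0.0.11:45013         10.0.0.50:8080          CLOSE_WAIT
tcp        0      0 10.0.0.11:45014         10.0.0.50:8080          ESTABLISHED
tcp        0      0 10.0.0.11:45020         10.0.0.60:443           ESTABLISHED
"""
```

Something like `count_connections(SAMPLE).most_common()` lists the busiest remote hosts first, and passing `state="CLOSE_WAIT"` narrows the count to a single state. On a real node the text would come from the netstat command itself rather than a pasted sample.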

This was quite an interesting problem, as the root cause was nebulous and hard to pin down, and it added a few tools to my toolbox for the next time we have something like this!
