Wednesday, October 17, 2012

System Wait Events Part 2

For my first follow-up post, I thought it would be appropriate to show how good alerting can help you save the day.  It just so happens that on the day we were closing A/R for the year, we started seeing some very odd numbers in a new alert we have.

Wait Event                           Wait Time
-------------------------------  -------------
gc buffer busy acquire            17253.907174
read by other session             16173.301272
cell single block physical read    4533.532144
Disk file operations I/O           3961.591694
gc buffer busy release              3783.63651
RMAN backup & recovery I/O         2962.854376
buffer busy waits                  1803.283808
gc current block busy              1652.088758
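
The alert query itself isn't shown here, but a sketch along these lines against GV$SYSTEM_EVENT would give the same shape of output.  Treat it as illustrative only: GV$SYSTEM_EVENT is cumulative since instance startup, so a real alert would compare two snapshots over an interval rather than report raw totals.

-- Illustrative only: top non-idle wait events by total wait time,
-- summed across RAC instances.  A production alert would diff two
-- snapshots of these counters instead of using the cumulative values.
SELECT *
  FROM (SELECT event,
               SUM(total_waits)             AS waits,
               SUM(time_waited_micro) / 1e6 AS wait_time_sec
          FROM gv$system_event
         WHERE wait_class <> 'Idle'
         GROUP BY event
         ORDER BY wait_time_sec DESC)
 WHERE ROWNUM <= 10;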

Around the same time, we started seeing a multitude of these entries in our database logs:

Tue Sep 04 13:15:04 2012
OSPID: 19311: connect: ossnet: connection failed to server x.x.x.x, result=5 (login: sosstcpreadtry failed) (difftime=1953)

Anecdotal evidence that something was going wrong: at this time I observed that closing a form took 20-30 seconds for one user, and another business user indicated that for the last hour or so he had been getting a "circle mouse" (the form waiting) whenever he tried to move between records on a student account.  Knowing that "something is going on," I started drilling in to investigate the sessions making up some of the time in the alert buckets above, and the breakdown of wait times over 2 seconds for one of those sessions shows the following (a sketch of the kind of query involved follows the table):

WAIT_EVENT                       TOT_WAITS  WAIT_PCT  WAIT_SECONDS  WAIT_MINUTES  TIME_PCT
-------------------------------  ---------  --------  ------------  ------------  --------
gc buffer busy acquire                1033      0.09        418.49          6.97      0.03
gc cr block 2-way                    31312      2.71          3.21          0.05         0
gc cr block busy                      6460      0.56          3.24          0.05         0
SQL*Net message from client         532955     46.14    1526240.71      25437.35     99.96
cell single block physical read       1972      0.17           2.8          0.05         0
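
For reference, a query roughly like the following against GV$SESSION_EVENT for the suspect session produces that shape of breakdown.  It is a sketch rather than the exact report I used: the :inst_id/:sid binds and the percentage calculations are illustrative.

-- Illustrative only: wait counts and times for a single session, with
-- percent-of-total columns computed via RATIO_TO_REPORT.
SELECT event                                                      AS wait_event,
       total_waits                                                AS tot_waits,
       ROUND(RATIO_TO_REPORT(total_waits) OVER () * 100, 2)       AS wait_pct,
       ROUND(time_waited_micro / 1e6, 2)                          AS wait_seconds,
       ROUND(time_waited_micro / 6e7, 2)                          AS wait_minutes,
       ROUND(RATIO_TO_REPORT(time_waited_micro) OVER () * 100, 2) AS time_pct
  FROM gv$session_event
 WHERE inst_id = :inst_id
   AND sid     = :sid
 ORDER BY time_waited_micro DESC;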

I don’t know what the session was doing, or if it was successful, but that doesn’t look like a normal session wait time profile on our system.  Later we saw this as well (but it was related to our DBA trying to kill the RMAN jobs):
Tue Sep 04 14:18:52 2012
ARCH: Possible network disconnect with primary database

So, this is all nice and everything, but what actually happened?

Answer:  Our storage engineer started researching the issue and found no problems with our IB switches or other Exadata components, but when he checked our 7420 storage array it was a completely different matter altogether.  There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through the GUI or ssh, and even a simple df against an NFS mount served by that head hung.  When he attempted a takeover to move NFS control over to head 1, head 2 complained about lacking the resources to fork a process, and the only way he could force the takeover was to power down head 2.  What I learned while researching this article about what a 7420 head configuration can look like is described here, in Illustration 20 and Illustration 21 and the paragraph that follows them.
