Wait Event | Wait Time
gc buffer busy acquire | 17253.907174
read by other session | 16173.301272
cell single block physical read | 4533.532144
Disk file operations I/O | 3961.591694
gc buffer busy release | 3783.63651
RMAN backup & recovery I/O | 2962.854376
buffer busy waits | 1803.283808
gc current block busy | 1652.088758
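For reference, a summary of this shape can be pulled straight from the database. The sketch below is not the exact query behind the alert buckets above (those figures come from our monitoring tool and cover a specific interval); it is just a rough equivalent against GV$SYSTEM_EVENT, which reports cumulative waits since instance startup:

-- Rough sketch only: top non-idle wait events across all RAC instances,
-- cumulative since startup (the alert figures above are interval-based).
SELECT *
  FROM (SELECT event                                       AS wait_event,
               SUM(total_waits)                            AS waits,
               ROUND(SUM(time_waited_micro) / 1000000, 2)  AS wait_time_sec
          FROM gv$system_event
         WHERE wait_class <> 'Idle'
         GROUP BY event
         ORDER BY wait_time_sec DESC)
 WHERE ROWNUM <= 10;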
At this time, we see in our DB logs a multitude of these entries:
Tue Sep 04 13:15:04 2012
OSPID: 19311: connect: ossnet: connection failed to server x.x.x.x, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Anecdotal evidence of something going wrong: at this time I observed that closing a form took 20-30 seconds for one individual, and another business user indicated that in the last hour or so he had been getting a "circle mouse" (the waiting cursor) on the form when he tried to move between records on a student account. So, knowing that "something is going on," I started drilling in some more to investigate the sessions making up some of the time in the above alert buckets, and this breakdown of one session's wait events with over 2 seconds of total wait time shows:
WAIT_EVENT | TOT_WAITS | WAIT_PCT | WAIT_SECONDS | WAIT_MINUTES | TIME_PCT
gc buffer busy acquire | 1033 | 0.09 | 418.49 | 6.97 | 0.03
gc cr block 2-way | 31312 | 2.71 | 3.21 | 0.05 | 0
gc cr block busy | 6460 | 0.56 | 3.24 | 0.05 | 0
SQL*Net message from client | 532955 | 46.14 | 1526240.71 | 25437.35 | 99.96
cell single block physical read | 1972 | 0.17 | 2.8 | 0.05 | 0
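I didn't keep the exact query that produced that breakdown, but something along these lines against GV$SESSION_EVENT gives the same shape of output (the :inst and :sid binds and the 2-second cut-off here are my reconstruction, not necessarily what was run at the time):

-- Hypothetical reconstruction: per-session wait-event breakdown with
-- percentage-of-waits and percentage-of-time columns, filtered to events
-- with more than 2 seconds of total wait time. Percentages are computed
-- over all of the session's events before the filter is applied.
SELECT *
  FROM (SELECT event                                                       AS wait_event,
               total_waits                                                 AS tot_waits,
               ROUND(RATIO_TO_REPORT(total_waits) OVER () * 100, 2)        AS wait_pct,
               ROUND(time_waited_micro / 1000000, 2)                       AS wait_seconds,
               ROUND(time_waited_micro / 60000000, 2)                      AS wait_minutes,
               ROUND(RATIO_TO_REPORT(time_waited_micro) OVER () * 100, 2)  AS time_pct
          FROM gv$session_event
         WHERE inst_id = :inst
           AND sid     = :sid)
 WHERE wait_seconds > 2
 ORDER BY wait_seconds DESC;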
I don’t know what the session was doing, or if it was successful, but that doesn’t look like a normal session wait time profile on our system. Later we saw this as well (but it was related to our DBA trying to kill the RMAN jobs):
Tue Sep 04 14:18:52 2012
ARCH: Possible network disconnect with primary database
So, this is all nice and everything, but what happened?
Answer: Our storage engineer started researching the issue and found that our IB switches and the other Exadata components had no issues, but when he checked out our 7420 storage array it was a completely different matter altogether. There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through the GUI or ssh, and a simple df against an NFS mount served from that head hung as well. When he performed a takeover to move NFS control over to head 1, head 2 complained about lacking resources when attempting to fork something, and the only way he could force the takeover was to power down head 2. What I learned about what a 7420 head configuration can look like while researching for this article can be found here, in Illustration 20 and Illustration 21 and the following paragraph.