Wait Event | Wait Time
gc buffer busy acquire | 17253.907174
read by other session | 16173.301272
cell single block physical read | 4533.532144
Disk file operations I/O | 3961.591694
gc buffer busy release | 3783.63651
RMAN backup & recovery I/O | 2962.854376
buffer busy waits | 1803.283808
gc current block busy | 1652.088758
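For reference, a summary of this shape can be pulled straight from the database. The sketch below is not the exact query behind the alert buckets above (those figures come from our monitoring tool and cover a specific interval); it is just a rough equivalent against GV$SYSTEM_EVENT, which reports cumulative waits since instance startup:

-- Rough sketch only: top non-idle wait events across all RAC instances,
-- cumulative since startup (the alert figures above are interval-based).
SELECT *
  FROM (SELECT event                                       AS wait_event,
               SUM(total_waits)                            AS waits,
               ROUND(SUM(time_waited_micro) / 1000000, 2)  AS wait_time_sec
          FROM gv$system_event
         WHERE wait_class <> 'Idle'
         GROUP BY event
         ORDER BY wait_time_sec DESC)
 WHERE ROWNUM <= 10;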
At this time, we see in our DB logs a multitude of these entries:
Tue Sep 04 13:15:04 2012
OSPID: 19311: connect: ossnet: connection failed to server x.x.x.x, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Anecdotal evidence of something going wrong: at this time I observed that closing a form took 20-30 seconds for one individual, and another business user indicated that in the last hour or so he had been getting a "circle mouse" (the waiting cursor) on the form when he tried to move between records on a student account. So, knowing that "something is going on," I started drilling in some more to investigate the sessions making up some of the time in the above alert buckets, and this breakdown of one session's wait events with over 2 seconds of total wait time shows:
WAIT_EVENT | TOT_WAITS | WAIT_PCT | WAIT_SECONDS | WAIT_MINUTES | TIME_PCT
gc buffer busy acquire | 1033 | 0.09 | 418.49 | 6.97 | 0.03
gc cr block 2-way | 31312 | 2.71 | 3.21 | 0.05 | 0
gc cr block busy | 6460 | 0.56 | 3.24 | 0.05 | 0
SQL*Net message from client | 532955 | 46.14 | 1526240.71 | 25437.35 | 99.96
cell single block physical read | 1972 | 0.17 | 2.8 | 0.05 | 0
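I didn't keep the exact query that produced that breakdown, but something along these lines against GV$SESSION_EVENT gives the same shape of output (the :inst and :sid binds and the 2-second cut-off here are my reconstruction, not necessarily what was run at the time):

-- Hypothetical reconstruction: per-session wait-event breakdown with
-- percentage-of-waits and percentage-of-time columns, filtered to events
-- with more than 2 seconds of total wait time. Percentages are computed
-- over all of the session's events before the filter is applied.
SELECT *
  FROM (SELECT event                                                       AS wait_event,
               total_waits                                                 AS tot_waits,
               ROUND(RATIO_TO_REPORT(total_waits) OVER () * 100, 2)        AS wait_pct,
               ROUND(time_waited_micro / 1000000, 2)                       AS wait_seconds,
               ROUND(time_waited_micro / 60000000, 2)                      AS wait_minutes,
               ROUND(RATIO_TO_REPORT(time_waited_micro) OVER () * 100, 2)  AS time_pct
          FROM gv$session_event
         WHERE inst_id = :inst
           AND sid     = :sid)
 WHERE wait_seconds > 2
 ORDER BY wait_seconds DESC;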
I don’t know what the session was doing, or if it was successful, but that doesn’t look like a normal session wait time profile on our system. Later we saw this as well (but it was related to our DBA trying to kill the RMAN jobs):
Tue Sep 04 14:18:52 2012
ARCH: Possible network disconnect with primary database
So, this is all nice and everything, but what happened?
Answer: Our storage engineer started researching the issue and found that our IB switches and the other Exadata components had no issues, but when he checked out our 7420 storage array it was a completely different matter altogether. There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through the GUI or ssh, and a simple df against an NFS mount served from that head hung as well. When he performed a takeover to move NFS control over to head 1, head 2 complained about lacking resources when attempting to fork something, and the only way he could force the takeover was to power down head 2. What I learned about what a 7420 head configuration can look like while researching for this article can be found here, in Illustration 20 and Illustration 21 and the following paragraph.