Thursday, July 12, 2012

Proactive monitoring

Reading a post by Jonathan Lewis got me to thinking about something I'd been meaning to write up, and there is no time like the present!

In the past year or so I have been growing ever more curious as to how the system is doing what it's doing, and even more importantly to my niche, what is the system doing and why is it doing it. 

Before going to the wonderful HOTSOS Symposium this year, my boss and I were chewing over a problem plaguing us regularly of the system seeming to be doing a lot of work and churning information over in the v$session_longops table quite a bit.  Since we never had a lot of exposure to this view before, we weren't sure how to react to some of the things we saw in there and then we had a crisis come up, and this was research was shelved for a short time.  After the Symposium, and the Training Day with Jonathan, we came back to the office rejuvenated a bit and dug into this with some ferocity; therefore, it was with some chagrin that we realized we had stumbled upon something we wish had been around for several years at our shop.  We now had a window into the SQL running in our system, so we could determine not only what was running long but more importantly if it should be running long.  Now we had a powerful tool in our toolbox, and were able to wield it as one would a sword through a thick jungle to cut down many of the vines in our path but were starting to run out of vines and wanted to turn our sword into a scalpel.  Our choice was to create an Oracle Alert running the below 10.2.0.4 script every 15 minutes alerting our PROD Support team:

select vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork


After weeks of running this, we were able to capture several performance issues which never would have come to light unless we went looking in our system and I'll be sharing those over the next weeks!  Going even further, we expanded the alert to read across DB links to other 10.2.0.4 instances:

select 'INSTANCE1' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork

union
select 'INSTANCE2' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE2 vsl, v$session@INSTANCE2 vs, v$sqlarea@INSTANCE2 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork 

union
select 'INSTANCE3' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE3 vsl, v$session@INSTANCE3 vs, v$sqlarea@INSTANCE3 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork

order by "SERVER", "% WORK_REMAINING" desc, module

No comments:

Post a Comment