Thursday, July 19, 2012

Performance

Performance.  A word that's strong enough to strike fear into anybody dealing with an Oracle database system.  That's not to say that Oracle performs badly, but that seldom do people and organizations understand how to get the best performance out of the system they have put into place.  A DBA will tell you performance is one thing, with a certain way to measure it, a developer will look at performance in a totally different way with completely different tools to understand it, while your production support team will tell you they are looking for yet another set of metrics to be able to anticipate performance (or a lack of) from the system.

This complexity is why I consider performance to be one of my Main Tenants of Support, as typically the amount of performance can be boiled down to this simple formula:

Actual system performance capacity = User expectations - lack of development performance planning + amount of DBA hands on assistance + framework for performance measurement

So.  What can we do about this?  Obviously for somebody in the production support team the only piece you usually can directly affect is your user expectations, while getting outputs from the framework you have to measure your system performance, but this only goes so far.  Performance needs to be an institutional truism that gets baked into your system to address the other 3 parts of the equation:
  • Developers need to understand how both the system and users operate in order to effectively design and produce results.
  • DBAs should be understanding that reactively doing something like reading through the DB log after "something happens" is far too late in the game.
  • Support has to make sure they participate in the feedback loop about what the system (or users) are doing and fight to get their hands on a performance framework, or create their own measurement system.
Yes, I'm THAT guy.  The one that believes we can start to change our patterns of behavior so we stop repeating the same mistakes we've made time and time again.  If we know that writing to a table before a user opens a form takes x milliseconds, but realize that thousands of people are opening the table every day, and the table growth will almost be exponential based on your system growth, you'll have a potential hazard introduced into the ecosystem which HAS to have a way to be able to get ferreted out in production.  Subpar development outputs, combined with subpar performance measurement, combined with a subpar maintained Oracle system is a recipe for disaster.

Why should we care about performance?  Nobody should want your system to operate below maximum, right?  So why do we allow it to happen on a daily basis?!?  I'll tell you, listening to a presentation from Cary Millsap about being able to measure everything your system is doing makes SO much sense sitting there; yet being able to implement a new system, or break out of your norms, all on your own is pretty much impossible.  Performance is a grassroots revolution which we all need to enroll in.

Thursday, July 12, 2012

Proactive monitoring

Reading a post by Jonathan Lewis got me to thinking about something I'd been meaning to write up, and there is no time like the present!

In the past year or so I have been growing ever more curious as to how the system is doing what it's doing, and even more importantly to my niche, what is the system doing and why is it doing it. 

Before going to the wonderful HOTSOS Symposium this year, my boss and I were chewing over a problem plaguing us regularly of the system seeming to be doing a lot of work and churning information over in the v$session_longops table quite a bit.  Since we never had a lot of exposure to this view before, we weren't sure how to react to some of the things we saw in there and then we had a crisis come up, and this was research was shelved for a short time.  After the Symposium, and the Training Day with Jonathan, we came back to the office rejuvenated a bit and dug into this with some ferocity; therefore, it was with some chagrin that we realized we had stumbled upon something we wish had been around for several years at our shop.  We now had a window into the SQL running in our system, so we could determine not only what was running long but more importantly if it should be running long.  Now we had a powerful tool in our toolbox, and were able to wield it as one would a sword through a thick jungle to cut down many of the vines in our path but were starting to run out of vines and wanted to turn our sword into a scalpel.  Our choice was to create an Oracle Alert running the below 10.2.0.4 script every 15 minutes alerting our PROD Support team:

select vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork


After weeks of running this, we were able to capture several performance issues which never would have come to light unless we went looking in our system and I'll be sharing those over the next weeks!  Going even further, we expanded the alert to read across DB links to other 10.2.0.4 instances:

select 'INSTANCE1' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork

union
select 'INSTANCE2' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE2 vsl, v$session@INSTANCE2 vs, v$sqlarea@INSTANCE2 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork 

union
select 'INSTANCE3' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE3 vsl, v$session@INSTANCE3 vs, v$sqlarea@INSTANCE3 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork

order by "SERVER", "% WORK_REMAINING" desc, module