Having taken a short (unintended) break from blogging in November, it was astonishing to see that while it took this blog 5 months to get 100 views it only took another 2 months to get an additional 50 views with only posting twice in that time!! I can't tell you how much this motivates me to keep this blog going, but also to rededicate myself to the "art of Oracle" so that I can learn things that can be of use to the community as a whole. So to everybody that follows what I'm doing here, or stumbles upon this site searching for something Oracle related, thank you!!!
Also, I want to wish everybody out there a Happy Holiday no matter which you choose (or not) to celebrate and here's to looking forward to ringing in a Happy New Year with all the challenges and learning we hope to encounter!!
-The Oracle EMT
Monday, December 31, 2012
Thursday, December 13, 2012
Timestamp across DB versions
So the other day I decided to write up an alert to let us know when a DB crashes, and nobody hears it, but I was struggling with joining our instances for some unknown reason due to this error message:
ORA-30175: invalid type given for an argument
I shelved it for a day or two until I had a chunk of time to dig into this a bit deeper, and with the help of an Oracle forums post I realized what my problem was. You see, the source of my confusion resided in the fact that the below script was able to be run individually for each instance and I only had problems when starting to union them together:
select 'INST1' AS "SERVER", process_id, host_id, substr(reason, 0, 40) AS STATUS, creation_time from DBA_ALERT_HISTORY@inst1
where creation_time > sysdate -1
and object_type = 'INSTANCE'
union all
select 'INST2' AS "SERVER", process_id, host_id, substr(reason, 0, 40) AS STATUS, creation_time from DBA_ALERT_HISTORY@inst2
where creation_time > sysdate -1
and object_type = 'INSTANCE'
Looks benign right? Well I realized that I was joining 11g and 10g DB tables together and that the timestamp object has apparently changed a bit, so that even with the table definitions not changing our system just wouldn't let me join the tables together no matter what I did when I do a value comparison on creation_time! Finding that post was the key to getting this changed, because when I updated the script to be:
select 'INST1' AS "SERVER", process_id, host_id, substr(reason, 0, 40) AS STATUS, creation_time from DBA_ALERT_HISTORY@inst1
where creation_time > to_timestamp(trunc(systimestamp))
and object_type = 'INSTANCE'
union all
select 'INST2' AS "SERVER", process_id, host_id, substrDBA_ALERT_HISTORY@inst2
where creation_time > to_timestamp(trunc(systimestamp))
and object_type = 'INSTANCE'
everything was right with the world, and no more error messages were returned. Oddly enough, or not as I've already explained, the same script with creation_time > to_timestamp(trunc(systimestamp)) -1 starts to bomb out again because the timestamp object is again having a comparison done to it.
ORA-30175: invalid type given for an argument
I shelved it for a day or two until I had a chunk of time to dig into this a bit deeper, and with the help of an Oracle forums post I realized what my problem was. You see, the source of my confusion resided in the fact that the below script was able to be run individually for each instance and I only had problems when starting to union them together:
select 'INST1' AS "SERVER", process_id, host_id, substr(reason, 0, 40) AS STATUS, creation_time from DBA_ALERT_HISTORY@inst1
where creation_time > sysdate -1
and object_type = 'INSTANCE'
union all
select 'INST2' AS "SERVER", process_id, host_id, substr(reason, 0, 40) AS STATUS, creation_time from DBA_ALERT_HISTORY@inst2
where creation_time > sysdate -1
and object_type = 'INSTANCE'
Looks benign right? Well I realized that I was joining 11g and 10g DB tables together and that the timestamp object has apparently changed a bit, so that even with the table definitions not changing our system just wouldn't let me join the tables together no matter what I did when I do a value comparison on creation_time! Finding that post was the key to getting this changed, because when I updated the script to be:
select 'INST1' AS "SERVER", process_id, host_id, substr(reason, 0, 40) AS STATUS, creation_time from DBA_ALERT_HISTORY@inst1
where creation_time > to_timestamp(trunc(systimestamp))
and object_type = 'INSTANCE'
union all
select 'INST2' AS "SERVER", process_id, host_id, substrDBA_ALERT_HISTORY@inst2
where creation_time > to_timestamp(trunc(systimestamp))
and object_type = 'INSTANCE'
everything was right with the world, and no more error messages were returned. Oddly enough, or not as I've already explained, the same script with creation_time > to_timestamp(trunc(systimestamp)) -1 starts to bomb out again because the timestamp object is again having a comparison done to it.
Thursday, December 6, 2012
Troubleshooting latch contention by Tanel
In trying to find out what was holding latches in our new shiny 11.2.0.2 system, I started digging around on the web and lo and behold I ran across a really good article by Tanel Poder a few years ago not only explaining latches, and their contention, but giving us tools to be able to try and catch them in the act!
P.S. I know I've taken a short break from the blog, but I've got some really good posts coming up over the next few weeks so stay tuned.
P.S. I know I've taken a short break from the blog, but I've got some really good posts coming up over the next few weeks so stay tuned.
Wednesday, October 31, 2012
Thank you!
I started this blog with some small hope that it would find a place online to grow, and tonight it has hit 100 page views! To me:
-The Oracle EMT
- that is really amazing especially since I've kept it to myself and told nobody in my personal or professional life at this point, let alone share it online at some of the other blogs I've been frequenting
- it means that what I have to say, or share, is relevant or of some value to the general Oracle space
- this gives me some motivation to keep going forward and building this site up more
-The Oracle EMT
Data (historical and backups)
I had found this post a little while ago on cleaning up historical data, but I wanted to save blogging about it until I had something real to say about it. This is that time as it has become very apparent that another one of our Main Tenants should be something to do with data, maybe Data Consumption or Data Usage but SOMETHING because data is just piling up in our systems every day. Audit data, user data, this data, and that data and how much is being purged, how much can be purged, who has a strategy for keeping track of all of this and where does that fit in with a corporate strategy?
Skip several years later, and the database had grown to be huge for that time. Retrievals were taking longer and longer, and talking to some of the sysadmins late at night around the coffee machine, I learned that the backup was getting too large to even back it up within a week. The application was starting to hit pretty much a ‘brick wall’; it started to fall further and further behind, and something really needed to be done.
Does this sound familiar? How big is your system? Has it hit 1 TB? Are you approaching 5, 10, 50, 100, or more TB of space for your footprint? How do you approach your backup system? Do the people that manage your backups tell you when you're having a problem, or do you have to figure it out yourself? Questions like this are going through my head as I re-read that link because lately in the morning I'm seeing something like this in our new event monitoring alert:
Skip several years later, and the database had grown to be huge for that time. Retrievals were taking longer and longer, and talking to some of the sysadmins late at night around the coffee machine, I learned that the backup was getting too large to even back it up within a week. The application was starting to hit pretty much a ‘brick wall’; it started to fall further and further behind, and something really needed to be done.
Does this sound familiar? How big is your system? Has it hit 1 TB? Are you approaching 5, 10, 50, 100, or more TB of space for your footprint? How do you approach your backup system? Do the people that manage your backups tell you when you're having a problem, or do you have to figure it out yourself? Questions like this are going through my head as I re-read that link because lately in the morning I'm seeing something like this in our new event monitoring alert:
Waits Wait Time Server
Backup: MML create a backup piece 79291.40622 X1
Backup: sbtbackup 3937.101761 X2
Backup: sbtbackup 17308.812339 X3
This sure seems like our systems are suddenly "wasting" a lot of cycles per hour on backups, especially when these events are from backups that run into our morning several hours and sometimes even into the afternoon. So begins the investigation!
Wednesday, October 17, 2012
System Wait Events Part 2
For my first follow up post, I thought it would be appropriate to show why ample alerting can help you to save the day. It just so happens that on the day that we were closing A/R for the year, we started seeing some very odd things in a new alert we have.
At this time, we see in our DB logs a multitude of these entries:
Tue Sep 04 13:15:04 2012
So, this is all nice and everything but what happened?
Answer: Our storage engineer started researching the issue and found that our IB switches and other Exadata components had no issues, but when he checked our 7420 storage array out it was a completely different matter all together. There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through GUI or ssh with also a simple df command hanging on an NFS mount coming from that head. When he performed a takeover to move NFS control over to head 1, head 2 complained about lacking resources when attempting to fork something and the only way he could force a takeover was to power down head 2. What I learned by researching for this article can be found here about what a 7420 head configuration could look like in Illustration 20 and Illustration 21 and the following paragraph as well.
Waits Wait Time
gc buffer busy acquire 17253.907174
read by other session 16173.301272
cell single block physical read 4533.532144
Disk file operations I/O 3961.591694
gc buffer busy release 3783.63651
RMAN backup & recovery I/O 2962.854376
buffer busy waits 1803.283808
gc current block busy 1652.088758
At this time, we see in our DB logs a multitude of these entries:
Tue Sep 04 13:15:04 2012
OSPID: 19311: connect: ossnet: connection failed to server x.x.x.x, result=5 (login: sosstcpreadtry failed) (difftime=1953)
Anecdotal evidence of something going wrong is that at this time I observed that closing a form took 20-30 seconds for an individual, and I had another business user indicate that in the last hour or so he's been getting a "circle mouse" indicating waiting on the form when he tries to move between records on a student account. So knowing that "something is going on" I start drilling in some more to investigate sessions which are making up some of the time in the above alert buckets and this session breakdown of wait times over 2 seconds shows:
Anecdotal evidence of something going wrong is that at this time I observed that closing a form took 20-30 seconds for an individual, and I had another business user indicate that in the last hour or so he's been getting a "circle mouse" indicating waiting on the form when he tries to move between records on a student account. So knowing that "something is going on" I start drilling in some more to investigate sessions which are making up some of the time in the above alert buckets and this session breakdown of wait times over 2 seconds shows:
WAIT_EVENT
|
TOT_WAITS
|
WAIT_PCT
|
WAIT_SECONDS
|
WAIT_MINUTES
|
TIME_PCT
|
gc buffer busy acquire
|
1033
|
0.09
|
418.49
|
6.97
|
0.03
|
gc cr block 2-way
|
31312
|
2.71
|
3.21
|
0.05
|
0
|
gc cr block busy
|
6460
|
0.56
|
3.24
|
0.05
|
0
|
SQL*Net message from client
|
532955
|
46.14
|
1526240.71
|
25437.35
|
99.96
|
cell single block physical read
|
1972
|
0.17
|
2.8
|
0.05
|
0
|
I don’t know what the session was doing, or if it was successful, but that doesn’t look like a normal session wait time profile on our system. Later we saw this as well (but it was related to our DBA trying to kill the RMAN jobs):
Tue Sep 04 14:18:52 2012
ARCH: Possible network disconnect with primary database
So, this is all nice and everything but what happened?
Answer: Our storage engineer started researching the issue and found that our IB switches and other Exadata components had no issues, but when he checked our 7420 storage array out it was a completely different matter all together. There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through GUI or ssh with also a simple df command hanging on an NFS mount coming from that head. When he performed a takeover to move NFS control over to head 1, head 2 complained about lacking resources when attempting to fork something and the only way he could force a takeover was to power down head 2. What I learned by researching for this article can be found here about what a 7420 head configuration could look like in Illustration 20 and Illustration 21 and the following paragraph as well.
Wednesday, October 3, 2012
Parallel Executions
I mentioned earlier that we implemented Exadata, and I learned something about parallel executions as a result! Ignoring the fact that too much parallel execution on our system after the first few hours almost brought our system down, one of our developers had found that a simple query of a table with 54 rows in 8 blocks was taking too long to process versus our old environment.
The 0 values in the first example were actually recorded to be 0.000079s and when we were seeing it then taking 2-3 seconds at time for each call, he started drilling down to see why there was such bad for our online performance. He found that the default INSTANCES setting on the table was causing our Optimizer to compute a parallel execution plan which never caused us a problem because we didn't have an exponential number of potential slaves available for the parallel execution to take up the overhead of our RAC nodes. The sins of our past, marking tables and indexes for parallel execution without needing to or really utilizing the feature, had caught up with us and Exadata didn't CAUSE the flaws it merely REVEALED them to us. So we obviously fixed this, as well as limited our parallel executions on our DB nodes, but what else did we do?
We learned from our failure by creating an alert to track the current (and future) items that were runningmarked as parallel in our application just in case we had other objects running marked similarly (or implemented new objects in the future):
SELECT sql_id,
substr(sql_text, 0, 200) AS sql_text,
SUM(px_servers_executions) AS sum_px_exec,
to_char(MAX(last_active_time), 'DD-MON-YYYY HH24:MI:SS') AS last_active_time
INTO
&SQLID,
&TEXT,
&EXEC,
&TIME
FROM gv$sqlstats
WHERE px_servers_executions > 0
AND upper(sql_text) NOT LIKE '%GV$%'
AND upper(sql_text) NOT LIKE '%ORACLEOEM%'
AND upper(sql_text) NOT LIKE '%GV%DB%'
AND upper(sql_text) NOT LIKE '%SYS%DB%'
AND last_active_time > SYSDATE - 1
GROUP BY
sql_id,
sql_text
order by
sum_px_exec desc
Obviously you can see I "commented" out a lot of the returns from gv$sqlstats since many objects owned by the system are parallel which we really don't care too much about. Here are a few more scripts you can use to identify the objects in your system which have parallelism activated:
Global Stats
=======================================
| Elapsed | Other | Fetch | Buffer |
| Time(s) | Waits(s) | Calls | Gets |
=======================================
| 0.00 | 0.00 | 1 | 3 |
=======================================
Global Stats
===============================================================
| Elapsed | Cpu | Concurrency | Other | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls | Gets |
===============================================================
| 1.92 | 0.02 | 0.86 | 1.04 | 1 | 5 |
===============================================================
The 0 values in the first example were actually recorded to be 0.000079s and when we were seeing it then taking 2-3 seconds at time for each call, he started drilling down to see why there was such bad for our online performance. He found that the default INSTANCES setting on the table was causing our Optimizer to compute a parallel execution plan which never caused us a problem because we didn't have an exponential number of potential slaves available for the parallel execution to take up the overhead of our RAC nodes. The sins of our past, marking tables and indexes for parallel execution without needing to or really utilizing the feature, had caught up with us and Exadata didn't CAUSE the flaws it merely REVEALED them to us. So we obviously fixed this, as well as limited our parallel executions on our DB nodes, but what else did we do?
We learned from our failure by creating an alert to track the current (and future) items that were running
SELECT sql_id,
substr(sql_text, 0, 200) AS sql_text,
SUM(px_servers_executions) AS sum_px_exec,
to_char(MAX(last_active_time), 'DD-MON-YYYY HH24:MI:SS') AS last_active_time
INTO
&SQLID,
&TEXT,
&EXEC,
&TIME
FROM gv$sqlstats
WHERE px_servers_executions > 0
AND upper(sql_text) NOT LIKE '%GV$%'
AND upper(sql_text) NOT LIKE '%ORACLEOEM%'
AND upper(sql_text) NOT LIKE '%GV%DB%'
AND upper(sql_text) NOT LIKE '%SYS%DB%'
AND last_active_time > SYSDATE - 1
GROUP BY
sql_id,
sql_text
order by
sum_px_exec desc
Obviously you can see I "commented" out a lot of the returns from gv$sqlstats since many objects owned by the system are parallel which we really don't care too much about. Here are a few more scripts you can use to identify the objects in your system which have parallelism activated:
select owner, table_name, degree
from dba_tables
where TRIM(DEGREE) > TO_CHAR(1);
select owner, table_name, index_name, degree
from dba_indexes
where TRIM(DEGREE) > TO_CHAR(1);
Edit 10/4/2012: Re-read this today and realized that I mispoke about what our alert was all about, so I corrected it by changing a few words.
Edit 10/4/2012: Re-read this today and realized that I mispoke about what our alert was all about, so I corrected it by changing a few words.
Thursday, September 13, 2012
How to learn Oracle?
Tom Kyte has recently posted a great article about how to start learning Oracle databases, and I have a lot of room to grow so this is absolutely welcome advice by one of the masters which builds on some comments he has in his Expert Oracle Database Architecture book about how he either created or edited all the content in one of the Oracle server architecture guides. Astounding!
Thursday, September 6, 2012
Exadata on TV!
Thursday, August 30, 2012
CPUs vs PSUs
I thought this was an interesting article which breaks down the difference between a Critical Patch Update and Patch Set Updates, with an official word from Oracle included.
http://www.dbspecialists.com/blog/best-practices/patching-your-oracle-database-%e2%80%93-critical-patch-update-cpu-or-patch-set-update-psu/
http://www.dbspecialists.com/blog/best-practices/patching-your-oracle-database-%e2%80%93-critical-patch-update-cpu-or-patch-set-update-psu/
Thursday, August 16, 2012
Concurrent:Active Request Limit (or how I finally learned to know what my system was saying)
The other day I saw a few jobs queuing in the Conflict Resolution Manager, and investigating I saw the following message:
This request is pending because the profile option 'Concurrent:Active Request Limit' is set.
This profile option controls how many requests a given user may have running at one time.
User X submitted this request on 09-AUG-2012 08:00:00.
The Conflict Resolution Manager will release this request when the number of running requests for this user is less than the maximum allowed by the profile option.
Now I’d seen this quite a few times but I always thought it was a bug and the CRM thought we couldn’t let anything run, since I interrepted it as being set at a user level, yet today when I looked at the message I wondered if we had it set at a site level and sure enough we did. It is entirely possible that with the old server hardware we were on a few years ago there was a reason this was setup but since it doesn't even look like we're hitting this limit I don’t think we need it set in our environment today since requests appear to be slowed down by this.
This request is pending because the profile option 'Concurrent:Active Request Limit' is set.
This profile option controls how many requests a given user may have running at one time.
User X submitted this request on 09-AUG-2012 08:00:00.
The Conflict Resolution Manager will release this request when the number of running requests for this user is less than the maximum allowed by the profile option.
Now I’d seen this quite a few times but I always thought it was a bug and the CRM thought we couldn’t let anything run, since I interrepted it as being set at a user level, yet today when I looked at the message I wondered if we had it set at a site level and sure enough we did. It is entirely possible that with the old server hardware we were on a few years ago there was a reason this was setup but since it doesn't even look like we're hitting this limit I don’t think we need it set in our environment today since requests appear to be slowed down by this.
Thursday, August 9, 2012
Four words that strike terror in the hearts of men (and women)
After having a major system uplift, there are four words I’ve heard over and over again which are stalking my dreams and turning them into nightmares lately. “We didn’t test that.” Well, that doesn’t seem too dire really, but the repercussions of this are huge and cross every team that worked on the uplift in the first place.
Who is impacted?
Developers - In some cases, these are the unsung heroes of your uplift and what is their reward after it is completed? To keep their noses to the grindstone and work a few more weeks off on the fires that keep cropping up. Doesn’t sound like a real promising job huh?
Projects - Not just the project people, which I'll get back to, but you're now putting off being able to work on your next projects. It doesn't matter if it is old production fixes or currently pipelined new projects, delays will drain your political capital because you are now missing target deployment dates and the business thinks you are pretty much worthless for not catching all the things you are now responsible for fixing. Project personnel start feeling burned out just like your developers because they've been on the project forever, and the bigger the project the longer you'll want to keep them around to play clean up for all the new debris coming up.
Support - How many fires do you get on a daily basis? Multiply that by a factor of 2 or 3 after doing an uplift like this for about a month, and you’ll see what kind of volume you’ll be slowly crushed under when you have people in the new system doing their daily work which they never really replicated in the test environment.
DBAs - During the project and throughout the implementation we've been wearing them thin, and they are now getting no rest by making them look at applying patches and identifying system issues which should’ve been caught months ago so they aren't able to instead be able to focus on things like building new monitoring tools. Fundamental things like missing critical DB bugs can easily happen because everything in the system is being thrown against the wall without a whole lot of investigation possible for each item.
How can we do better?
We need to be more proactive about showing the business and project teams what has been run in the last 30, 60, 180 days in your system and by whom. Having these snapshots would be able to show the majority of the daily, monthly and quarterly reporting goals that the project needs to meet.
We can do a better job about documenting out our system flow, so we know what all our parts and pieces are which are required to provide full service to the business. This way you won't get surprised by a report pushing output to a third party in-house group, to a server you didn't even know was connected to your system.
I'll probably add more as I encounter them, but what would you add to this list?
Who is impacted?
Developers - In some cases, these are the unsung heroes of your uplift and what is their reward after it is completed? To keep their noses to the grindstone and work a few more weeks off on the fires that keep cropping up. Doesn’t sound like a real promising job huh?
Projects - Not just the project people, which I'll get back to, but you're now putting off being able to work on your next projects. It doesn't matter if it is old production fixes or currently pipelined new projects, delays will drain your political capital because you are now missing target deployment dates and the business thinks you are pretty much worthless for not catching all the things you are now responsible for fixing. Project personnel start feeling burned out just like your developers because they've been on the project forever, and the bigger the project the longer you'll want to keep them around to play clean up for all the new debris coming up.
Support - How many fires do you get on a daily basis? Multiply that by a factor of 2 or 3 after doing an uplift like this for about a month, and you’ll see what kind of volume you’ll be slowly crushed under when you have people in the new system doing their daily work which they never really replicated in the test environment.
DBAs - During the project and throughout the implementation we've been wearing them thin, and they are now getting no rest by making them look at applying patches and identifying system issues which should’ve been caught months ago so they aren't able to instead be able to focus on things like building new monitoring tools. Fundamental things like missing critical DB bugs can easily happen because everything in the system is being thrown against the wall without a whole lot of investigation possible for each item.
How can we do better?
We need to be more proactive about showing the business and project teams what has been run in the last 30, 60, 180 days in your system and by whom. Having these snapshots would be able to show the majority of the daily, monthly and quarterly reporting goals that the project needs to meet.
We can do a better job about documenting out our system flow, so we know what all our parts and pieces are which are required to provide full service to the business. This way you won't get surprised by a report pushing output to a third party in-house group, to a server you didn't even know was connected to your system.
I'll probably add more as I encounter them, but what would you add to this list?
Thursday, August 2, 2012
OS Changes - Commands
So we've upgraded from SUN Solaris to RedHat Linux, and I've found some key differences so far. The command prstat no longer really works well to find out how healthy your server/CPUs are, as it needs to be combined with the new top command to get a fuller understanding of how healthy your system is and this site will help you understand why there is a difference between not only the commands but how the OS is operating as well.
To investigate printing issues I grew accustomed to the lpstat command combined with the -lp switches so that a command lpstat -lp PRINTERNAME would give me information as to how the printer was connected to our system and if the printer queue was up or not. I've found out that issuing this command in a RedHat Linux environment doesn't provide that old type of feedback, so we've had to alter our approach and instead use the -a switch so that the entire command looks like lpstat -a PRINTERNAME now.
To investigate printing issues I grew accustomed to the lpstat command combined with the -lp switches so that a command lpstat -lp PRINTERNAME would give me information as to how the printer was connected to our system and if the printer queue was up or not. I've found out that issuing this command in a RedHat Linux environment doesn't provide that old type of feedback, so we've had to alter our approach and instead use the -a switch so that the entire command looks like lpstat -a PRINTERNAME now.
Thursday, July 19, 2012
Performance
Performance. A word that's strong enough to strike fear into anybody dealing with an Oracle database system. That's not to say that Oracle performs badly, but that seldom do people and organizations understand how to get the best performance out of the system they have put into place. A DBA will tell you performance is one thing, with a certain way to measure it, a developer will look at performance in a totally different way with completely different tools to understand it, while your production support team will tell you they are looking for yet another set of metrics to be able to anticipate performance (or a lack of) from the system.
This complexity is why I consider performance to be one of my Main Tenants of Support, as typically the amount of performance can be boiled down to this simple formula:
Actual system performance capacity = User expectations - lack of development performance planning + amount of DBA hands on assistance + framework for performance measurement
So. What can we do about this? Obviously for somebody in the production support team the only piece you usually can directly affect is your user expectations, while getting outputs from the framework you have to measure your system performance, but this only goes so far. Performance needs to be an institutional truism that gets baked into your system to address the other 3 parts of the equation:
Why should we care about performance? Nobody should want your system to operate below maximum, right? So why do we allow it to happen on a daily basis?!? I'll tell you, listening to a presentation from Cary Millsap about being able to measure everything your system is doing makes SO much sense sitting there; yet being able to implement a new system, or break out of your norms, all on your own is pretty much impossible. Performance is a grassroots revolution which we all need to enroll in.
This complexity is why I consider performance to be one of my Main Tenants of Support, as typically the amount of performance can be boiled down to this simple formula:
Actual system performance capacity = User expectations - lack of development performance planning + amount of DBA hands on assistance + framework for performance measurement
So. What can we do about this? Obviously for somebody in the production support team the only piece you usually can directly affect is your user expectations, while getting outputs from the framework you have to measure your system performance, but this only goes so far. Performance needs to be an institutional truism that gets baked into your system to address the other 3 parts of the equation:
- Developers need to understand how both the system and users operate in order to effectively design and produce results.
- DBAs should be understanding that reactively doing something like reading through the DB log after "something happens" is far too late in the game.
- Support has to make sure they participate in the feedback loop about what the system (or users) are doing and fight to get their hands on a performance framework, or create their own measurement system.
Why should we care about performance? Nobody should want your system to operate below maximum, right? So why do we allow it to happen on a daily basis?!? I'll tell you, listening to a presentation from Cary Millsap about being able to measure everything your system is doing makes SO much sense sitting there; yet being able to implement a new system, or break out of your norms, all on your own is pretty much impossible. Performance is a grassroots revolution which we all need to enroll in.
Thursday, July 12, 2012
Proactive monitoring
Reading a post by Jonathan Lewis got me to thinking about something I'd been meaning to write up, and there is no time like the present!
In the past year or so I have been growing ever more curious as to how the system is doing what it's doing, and even more importantly to my niche, what is the system doing and why is it doing it.
Before going to the wonderful HOTSOS Symposium this year, my boss and I were chewing over a problem plaguing us regularly of the system seeming to be doing a lot of work and churning information over in the v$session_longops table quite a bit. Since we never had a lot of exposure to this view before, we weren't sure how to react to some of the things we saw in there and then we had a crisis come up, and this was research was shelved for a short time. After the Symposium, and the Training Day with Jonathan, we came back to the office rejuvenated a bit and dug into this with some ferocity; therefore, it was with some chagrin that we realized we had stumbled upon something we wish had been around for several years at our shop. We now had a window into the SQL running in our system, so we could determine not only what was running long but more importantly if it should be running long. Now we had a powerful tool in our toolbox, and were able to wield it as one would a sword through a thick jungle to cut down many of the vines in our path but were starting to run out of vines and wanted to turn our sword into a scalpel. Our choice was to create an Oracle Alert running the below 10.2.0.4 script every 15 minutes alerting our PROD Support team:
select vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
After weeks of running this, we were able to capture several performance issues which never would have come to light unless we went looking in our system and I'll be sharing those over the next weeks! Going even further, we expanded the alert to read across DB links to other 10.2.0.4 instances:
select 'INSTANCE1' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
union
select 'INSTANCE2' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE2 vsl, v$session@INSTANCE2 vs, v$sqlarea@INSTANCE2 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
union
select 'INSTANCE3' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE3 vsl, v$session@INSTANCE3 vs, v$sqlarea@INSTANCE3 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
order by "SERVER", "% WORK_REMAINING" desc, module
In the past year or so I have been growing ever more curious as to how the system is doing what it's doing, and even more importantly to my niche, what is the system doing and why is it doing it.
Before going to the wonderful HOTSOS Symposium this year, my boss and I were chewing over a problem plaguing us regularly of the system seeming to be doing a lot of work and churning information over in the v$session_longops table quite a bit. Since we never had a lot of exposure to this view before, we weren't sure how to react to some of the things we saw in there and then we had a crisis come up, and this was research was shelved for a short time. After the Symposium, and the Training Day with Jonathan, we came back to the office rejuvenated a bit and dug into this with some ferocity; therefore, it was with some chagrin that we realized we had stumbled upon something we wish had been around for several years at our shop. We now had a window into the SQL running in our system, so we could determine not only what was running long but more importantly if it should be running long. Now we had a powerful tool in our toolbox, and were able to wield it as one would a sword through a thick jungle to cut down many of the vines in our path but were starting to run out of vines and wanted to turn our sword into a scalpel. Our choice was to create an Oracle Alert running the below 10.2.0.4 script every 15 minutes alerting our PROD Support team:
select vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
After weeks of running this, we were able to capture several performance issues which never would have come to light unless we went looking in our system and I'll be sharing those over the next weeks! Going even further, we expanded the alert to read across DB links to other 10.2.0.4 instances:
select 'INSTANCE1' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops vsl, v$session vs, v$sqlarea vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
union
select 'INSTANCE2' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE2 vsl, v$session@INSTANCE2 vs, v$sqlarea@INSTANCE2 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
union
select 'INSTANCE3' AS "SERVER",
vs.sid,
vs.serial#,
vs.username,
vs.osuser,
vs.module,
vs.action,
substr(vsa.sql_text, 0, 80) AS SQL_PART,
vs.status,
vsl.opname,
vs.event,
vs.blocking_session_status,
to_char(vsl.start_time, 'DD-MON-YY HH24:MI:SS') AS "START TIME",
vsl.last_update_time,
(round((1 - (vsl.elapsed_seconds / (vsl.elapsed_seconds + vsl.time_remaining) ) ),2)) * 100 AS "% WORK_REMAINING",
round((vsl.elapsed_seconds/60), 2) as "MINUTES ELAPSED",
round((vsl.time_remaining/60), 2) as "MINUTES REMAINING",
vsl.message,
vs.sql_id,
vsa.users_executing
from v$session_longops@INSTANCE3 vsl, v$session@INSTANCE3 vs, v$sqlarea@INSTANCE3 vsa
where vsl.sql_id = vs.sql_id
and vs.sql_id = vsa.sql_id
and vsl.sid = vs.sid
and vsl.sofar != vsl.totalwork
order by "SERVER", "% WORK_REMAINING" desc, module
Thursday, June 28, 2012
Users do what users do
If I had to write a book, the first chapter would be titled after this post. Maybe more accurately, users do what users can do. Last week I was going to write something up about this type of thing, but it makes SO much more sense after this week and introduces what I'll call Main Tenants of Support.
These Main Tenants are things we support individuals hold to be self evident, even if developers and project personnel don't understand a damn thing about them.
Our case: We were given scripts from Oracle to drop and recreate a queue table that we were told by our developers wasn't being used since it had built up over 10 years of non-use.
rem start queue: exec dbms_aqadm.start_queue(‘AR.AR_REV_REC_Q’);
So instead of just purging it once and then letting another 10 years go by before we purge it again, we decided to wrap it into a concurrent program and let it run under 2 parameters:
What we didn't know: The queue table is used in base Oracle functionality for invoice processing. So what you say? The API for the drop queue was designed to delete all the rows AFTER it locked the object.
Error: Unable to create invoice due to an error: ORA-04021: timeout occurred while waiting to lock object AR.AR_REV_REC_Q
ORA-06512: at "APPS.RA_CUSTOMER_TRX
Why was this a problem: That meant when we kicked off the concurrent program 10 PM Friday night, it prevented receipts from getting created in Financials from our web payment front end until we saw an alert on Monday shortly before noon that was telling us we had a backlog in receipt creation. Killing the concurrent request didn't do anything because the process was still alive, so we had to mark the session for kill but the session was still alive. It did stop what it was doing after getting past the 9 millionth record, but then it popped out to our shell of the concurrent program which was a problem because it found itself in a loop that it had no way out of. I researched a bit and found a good article from Tanel Poder breaking down a situation a lot like ours, and I had some tricks in my bag from a previous issue involving the Revenue Recognition, to realize that the OS process needed to be killed manually.
What this shows us about our Main Tenant: We were able to make the mistake of locking the queue. Oracle told us to do this, so why would we believe it would have an issue? We did exactly what the application allowed us to do, so we did it. The same can be said of forms, fields, and just about anything else in the application so we need to be aware of how developers and project analysts (and Oracle) allow users to interact with the application. Did we build a form that allows blind queries? Does a report spend 3 days figuring out a Hash Join before running the actual meat and potatoes of the report in 15 minutes time? It is the aftermath of many of these unasked, or unanswered, questions prior to implementation that I have to deal with on a daily basis.
Edit: 8/16/12 - Needed to change the link to Tanel's article since it was somehow linking the search engine URL instead.
These Main Tenants are things we support individuals hold to be self evident, even if developers and project personnel don't understand a damn thing about them.
Our case: We were given scripts from Oracle to drop and recreate a queue table that we were told by our developers wasn't being used since it had built up over 10 years of non-use.
rem stop queue: exec dbms_aqadm.stop_queue(‘AR.AR_REV_REC_Q’);
rem drop queue: exec dbms_aqadm.drop_queue(‘AR.AR_REV_REC_Q’);
rem drop queue table: exec dbms_aqadm.drop_queue_table(‘AR.AR_REV_REC_QT’);
rem create new queue table: begin DBMS_AQADM.CREATE_QUEUE_TABLE( Queue_table => ‘AR.AR_REV_REC_QT’, Queue_payload_type => ‘SYSTEM.AR_REV_REC_TYP’, storage_clause => ‘TABLESPACE AR’, Sort_list => ‘ENQ_TIME’, Multiple_consumers => TRUE, Compatible => ‘10.2’, comment => ‘Revenue Recognition Table’); end; /
rem create queue: exec dbms_aqadm.create_queue(‘AR.AR_REV_REC_Q’,’AR.AR_REV_REC_QT’);
So instead of just purging it once and then letting another 10 years go by before we purge it again, we decided to wrap it into a concurrent program and let it run under 2 parameters:
- drop and recreate queue
- delete records from the queue
What we didn't know: The queue table is used in base Oracle functionality for invoice processing. So what you say? The API for the drop queue was designed to delete all the rows AFTER it locked the object.
Error: Unable to create invoice due to an error: ORA-04021: timeout occurred while waiting to lock object AR.AR_REV_REC_Q
ORA-06512: at "APPS.RA_CUSTOMER_TRX
Why was this a problem: That meant when we kicked off the concurrent program 10 PM Friday night, it prevented receipts from getting created in Financials from our web payment front end until we saw an alert on Monday shortly before noon that was telling us we had a backlog in receipt creation. Killing the concurrent request didn't do anything because the process was still alive, so we had to mark the session for kill but the session was still alive. It did stop what it was doing after getting past the 9 millionth record, but then it popped out to our shell of the concurrent program which was a problem because it found itself in a loop that it had no way out of. I researched a bit and found a good article from Tanel Poder breaking down a situation a lot like ours, and I had some tricks in my bag from a previous issue involving the Revenue Recognition, to realize that the OS process needed to be killed manually.
What this shows us about our Main Tenant: We were able to make the mistake of locking the queue. Oracle told us to do this, so why would we believe it would have an issue? We did exactly what the application allowed us to do, so we did it. The same can be said of forms, fields, and just about anything else in the application so we need to be aware of how developers and project analysts (and Oracle) allow users to interact with the application. Did we build a form that allows blind queries? Does a report spend 3 days figuring out a Hash Join before running the actual meat and potatoes of the report in 15 minutes time? It is the aftermath of many of these unasked, or unanswered, questions prior to implementation that I have to deal with on a daily basis.
Edit: 8/16/12 - Needed to change the link to Tanel's article since it was somehow linking the search engine URL instead.
Thursday, June 21, 2012
Every blog begins with a first post.
This is mine. This came about after reading a great post by Tom Kyte where he posts a bit from one of his books comparing what we do in our daily Oracle lives to different kinds of doctors in the field. Really, go read it. Okay now that you're back, I'll tell you why I decided on being The Oracle EMT.
While he identified many types of physicians, I didn't really feel like any of them fit with what I do in my Production support role because I'm trying to do all of them at different times. I more empathize with a fire-fighter, or a police detective at times, but tying back to the example the EMT is somebody going out into the world never knowing what they will encounter from day to day. And they are dealing with critical life threatening issues. Their symbol is the Star of Life for Pete's sake! That's how I feel constantly. While I might sit at my desk and be a target for work to come in, and be like the emergency room doctor, it really is wide open for what will come my way and feels like I'm out in that world with anything around the next corner. Deciding what to work on today, might not be a huge deal, but picking large projects to invest my time into could have a small issue languish for weeks, which in its own right become a hair on fire problem that I'll have to address immediately when I'm in the middle of doing something completely else.
Lining up my thoughts a bit more, I am looking for this blog to be directed at the everyday Oracle applications specialist that had to be out there in the field. I'm not looking for work, I'm perfectly happy learning from the job I have right now with the technology we're dealing with, so this isn't a launching pad for some consulting company outing and eventually I'd like to see more people contributing to the endeavor once it has some steam going. I want to be able to vent or show things here, with individuals helping each other to make this a safe place to talk and share ideas in the comments. Nothing is taboo. Not even me.
-The Oracle EMT
While he identified many types of physicians, I didn't really feel like any of them fit with what I do in my Production support role because I'm trying to do all of them at different times. I more empathize with a fire-fighter, or a police detective at times, but tying back to the example the EMT is somebody going out into the world never knowing what they will encounter from day to day. And they are dealing with critical life threatening issues. Their symbol is the Star of Life for Pete's sake! That's how I feel constantly. While I might sit at my desk and be a target for work to come in, and be like the emergency room doctor, it really is wide open for what will come my way and feels like I'm out in that world with anything around the next corner. Deciding what to work on today, might not be a huge deal, but picking large projects to invest my time into could have a small issue languish for weeks, which in its own right become a hair on fire problem that I'll have to address immediately when I'm in the middle of doing something completely else.
Lining up my thoughts a bit more, I am looking for this blog to be directed at the everyday Oracle applications specialist that had to be out there in the field. I'm not looking for work, I'm perfectly happy learning from the job I have right now with the technology we're dealing with, so this isn't a launching pad for some consulting company outing and eventually I'd like to see more people contributing to the endeavor once it has some steam going. I want to be able to vent or show things here, with individuals helping each other to make this a safe place to talk and share ideas in the comments. Nothing is taboo. Not even me.
-The Oracle EMT
Subscribe to:
Posts (Atom)