Saturday, December 28, 2013

The 10 days of Blogging past: Day 7

As we hit the home stretch of the 10 days of Blogging past, we will review a rather straightforward follow-up post, "Learning more about DBA_ALERT_HISTORY" from 2/14/13, with 86 views.

As is often the case, I learn the most about our systems when something "bad" happens or things go "wrong" in ways we had not anticipated.  I don't know if that is your typical experience supporting an application, but it is certainly mine!  Moving to Exadata gave us the opportunity to uplift our entire infrastructure footprint, and one way we did that was by consolidating onto a multi-database RAC configuration instead of the single-database, single-instance servers we had been running.  We (and our DBAs) knew the flaws of the old design, but we had no idea of all the challenges that would come with running a RAC instance, and this is obviously one of them.

As a result, I was able to learn more about what DBA_ALERT_HISTORY would (and wouldn't) log and the timing behind it.  It seems obvious that a node which is offline or disconnected can't write to a table it can't reach, but what was not obvious is that the surviving node would reconfigure as if the other node were gone AND log nothing to the table about it.  Sure, once the server comes back online it logs to the table, but what happens during that window if you are assuming DBA_ALERT_HISTORY will tell you the facts in real time?  You'll get alerted far too late, and oddly enough, the original message was logged to the table AFTER a new message had been entered saying everything was fine.
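
If you want to see this ordering quirk for yourself, something like the query below is a minimal sketch: it pulls recent alerts and lets you compare the sequence they were written in against their creation timestamps. The column names come from the standard DBA_ALERT_HISTORY view definition; the two-day window and the date format are just illustrative choices.

-- Compare write order (sequence_id) against creation_time for recent alerts.
-- If a node was down, you may see an "older" alert land after a newer one.
SELECT sequence_id,
       object_type,
       object_name,
       reason,
       resolution,
       TO_CHAR(creation_time, 'MM/DD/YYYY HH24:MI:SS') AS created
  FROM dba_alert_history
 WHERE creation_time > SYSTIMESTAMP - INTERVAL '2' DAY
 ORDER BY creation_time, sequence_id;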

This dovetails nicely with something many other Oracle experts describe: having a hypothesis, or suspecting something might be true, and then actually testing it to verify whether the statement holds.
