Monday, August 19, 2013

Network canary in the coal shaft

Following part 1 and part 2 of getting to the bottom of wait events in your system: when we had a "network event" at work the other day, we could see the aftereffect of the disruption on our system, because the wait event "TCP Socket (KGAS)" sat at the top of our alert.  Seems like a "no duh" moment, right?  Yet it got me thinking that this wait event really is the canary in the coal shaft for your network.

What if we modified our previous alert to run every minute, looking only for spikes in this wait event? Wouldn't we get proactive notifications for incidents like this, or be able to more easily identify "network spikes" and other disruptions in communication, whether between our systems themselves or between the systems and other parts of the infrastructure?  I think that just might be the case, so I'm going to create a simple alert and mock it up. The next time we have a network disruption, we can start fine-tuning the alert to get at the measurements that really matter in a scenario like this.
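As a rough illustration of what "looking only for spikes" could mean, here is a minimal sketch in Python. It is not the alert from the earlier parts. It assumes that some job samples the cumulative time waited on "TCP Socket (KGAS)" once a minute (for example from a counter such as the one in v$system_event) and hands the samples over as a plain list. The function names, the three-sigma rule, and the `min_history` and `min_jump` parameters are all hypothetical choices that would need tuning against a real incident.

```python
from statistics import mean, stdev

# Wait event we treat as the network canary (name taken from the post).
EVENT_NAME = "TCP Socket (KGAS)"


def per_interval_deltas(cumulative):
    """Turn cumulative wait-time samples into per-interval increases.

    A negative difference is clamped to 0; this is an assumption meant to
    cover a counter reset such as an instance restart.
    """
    return [max(later - earlier, 0)
            for earlier, later in zip(cumulative, cumulative[1:])]


def is_spike(deltas, min_history=5, sigmas=3.0, min_jump=0.0):
    """Return True when the newest delta is far above the earlier baseline.

    min_history, sigmas and min_jump are hypothetical tuning knobs.
    """
    if len(deltas) < min_history + 1:
        return False  # not enough history to judge yet
    *history, latest = deltas
    baseline = mean(history)
    spread = stdev(history)
    # Require the latest value to clear the baseline by a margin.
    threshold = baseline + max(sigmas * spread, min_jump)
    return latest > threshold


if __name__ == "__main__":
    # Made-up samples: steady waits, then a sudden jump in the last minute.
    samples = [0, 100, 210, 300, 405, 500, 5000]
    if is_spike(per_interval_deltas(samples)):
        print(f"ALERT: spike in wait event {EVENT_NAME}")
```

Comparing the latest minute against its own recent history, rather than against a fixed number, is one way to keep the alert quiet on a system that normally shows some of this wait event; a fixed threshold would be simpler but needs to be set per system.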
