Tuesday, August 6, 2013

Memory leak causing database crashes

After upgrading our database to 11.2.0.3 and applying both the BP11 and EX10 Exadata patches, we started having database node crashes every two to three weeks.  That prompted us to look more closely at the monitoring data collected before each crash.  The first thing we identified was that only a single database node was running out of virtual memory (swap space), in what appeared to be a random pattern, yet it seemed to happen most often around 9 PM for reasons we couldn't fathom at the time.  We even had an outlier on a Saturday where free memory was exhausted in the morning, recovered around 8 PM, and then ran out again roughly an hour later!

Time       Swap    Free Mem(G)   Usable Mem(G)
7:47 AM    2317    16            43
8:02 AM    3136    1             44
8:17 AM    4508    0             45
8:47 AM    6315    0             45
9:23 AM    7036    0             45
9:47 AM    7865    0             45
10:17 AM   8915    0             47
12:17 PM   9822    0             47
2:02 PM    10254   0             48
3:47 PM    10250   1             48
8:02 PM    9625    0             47
8:17 PM    9605    16            45
9:02 PM    9515    17            46
9:17 PM    9536    5             46
9:23 PM    9581    0             46
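
The numbers above came from our monitoring, but you can collect the same kind of samples by hand with nothing more than free and a loop.  This is only a sketch (the log path and 15-minute interval are placeholders, not what we actually ran):

#!/bin/bash
# Sketch: append a timestamped sample of swap used (MB) and free memory (MB)
# to a log every 15 minutes.  Log path and interval are placeholders.
LOG=/tmp/mem_samples.log
while true; do
    free -m | awk -v ts="$(date '+%a %H:%M')" \
        '/^Mem:/  {freemem=$4}
         /^Swap:/ {swapused=$3}
         END      {print ts, swapused, freemem}' >> "$LOG"
    sleep 900
done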

Another mystery was why our "usable" memory never really wavered, even though the system was clearly under pressure.  Regardless, we all agreed this was probably not an OS bug, since the symptom was only presenting itself on a single node.  One option Oracle offered was changing a sysctl parameter to increase the memory allocated to the database processes, in the hope of easing the pressure on swap, but a nagging question kept coming up: had we added some new stress or load that was causing this one node to underperform?  We didn't think so, but we kept digging to see whether a rogue ETL process or an external partner was hitting us at the same time every night, and a pattern of the problem occurring on Wednesday and Saturday nights started to emerge.  After digging into this a few times, we widened the scope of the investigation to better understand the problem, which is why database load was added to the next trend analysis:

Time       Swap    Free Mem(G)   Usable Mem(G)   DB load (1 m avg)
9:02 PM    8353    29            37              1.19
9:17 PM    8455    1             37              4.04
9:22 PM    9097    0             38              4.39
9:47 PM    11144   0             38              5.65
10:02 PM   11840   0             38              3.19
10:17 PM   12160   0             39              3.19
10:23 PM   12142   0             39              1.72
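
If you want to capture the load column by hand as well, the same sampling idea extends easily.  A sketch, assuming the "1 m avg" figure is the OS 1-minute load average from /proc/loadavg (if it is a database-side metric it would come from OEM/AWR instead), with the top memory consumers added to each sample to help spot suspects:

# Sketch: one sample block with timestamp, 1-minute load average, memory/swap
# totals, and the five largest resident-memory processes at that moment.
{
    echo "=== $(date '+%a %H:%M')  load(1m): $(cut -d' ' -f1 /proc/loadavg)"
    free -m | grep -E '^(Mem|Swap):'
    ps aux --sort=-rss | head -6
} >> /tmp/mem_samples.log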

This kind of load jump was abnormal, and it led the DBAs to take a closer look at likely suspects such as cron jobs, where they found this entry:

5 21 * * 3,6 /s_u01/app/applmgr/scripts/tar_tops.sh > /s_u01/app/applmgr/scripts/tar_tops.logs
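
An easy way to hunt for a job like this is to walk every user's crontab as root and flag anything scheduled for 21:xx on Wednesday (3) or Saturday (6).  A rough sketch along those lines (not necessarily how our DBAs spotted it):

# Sketch: list every OS user's crontab and print entries whose hour field is 21
# and whose day-of-week field includes 3 (Wed) or 6 (Sat).  Run as root.
for u in $(cut -d: -f1 /etc/passwd); do
    crontab -l -u "$u" 2>/dev/null | awk -v user="$u" \
        '$2 == 21 && $5 ~ /(^|,)(3|6)(,|$)/ {print user ": " $0}'
done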

Since the 3,6 in the day-of-week field meant the script only ran at 21:05 (just after 9 PM) on Wednesdays and Saturdays, QA testing commenced.  Guess what we found?  The script was the culprit behind our memory depletion, because after it had run for a while free memory dropped to 0!  Here is how QA looked before the cron job was run:

[applmgr@server midtier]$ free -g
             total       used       free     shared    buffers     cached
Mem:            94         68         26          0          1         14
-/+ buffers/cache:         51         42
Swap:           23          4         19

After the cron job had been running in QA for a while, the difference was obvious:

[applmgr@server midtier]$ free -g
             total       used       free     shared    buffers     cached
Mem:            94         94          0          0          1         40
-/+ buffers/cache:         52         42
Swap:           23          5         18

Notice that the cached column grew from 14 GB to 40 GB while the -/+ buffers/cache figures barely moved, which lined up with the script being the culprit.  Once the DBAs commented this job out of the PROD crontab, the regular database crashes stopped.
