Time     | Swap  | Free Mem (G) | Usable Mem (G)
7:47 AM  | 2317  | 16           | 43
8:02 AM  | 3136  | 1            | 44
8:17 AM  | 4508  | 0            | 45
8:47 AM  | 6315  | 0            | 45
9:23 AM  | 7036  | 0            | 45
9:47 AM  | 7865  | 0            | 45
10:17 AM | 8915  | 0            | 47
12:17 PM | 9822  | 0            | 47
2:02 PM  | 10254 | 0            | 48
3:47 PM  | 10250 | 1            | 48
8:02 PM  | 9625  | 0            | 47
8:17 PM  | 9605  | 16           | 45
9:02 PM  | 9515  | 17           | 46
9:17 PM  | 9536  | 5            | 46
9:23 PM  | 9581  | 0            | 46
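As an aside, a trend like the one above can be captured with a simple polling loop. The sketch below is only illustrative of that kind of sampling (our actual collection method isn't shown here); it assumes the classic procps free output with the "-/+ buffers/cache" line and treats "usable" as free plus buffers/cache:

#!/bin/bash
# Illustrative sampler: log swap used (MB) and free/usable memory (GB) every 15 minutes.
LOG=/tmp/mem_trend.log                                  # hypothetical log location
while true; do
    ts=$(date '+%-I:%M %p')
    swap_used=$(free -m | awk '/^Swap:/ {print $3}')    # swap used, in MB
    free_gb=$(free -g | awk '/^Mem:/ {print $4}')       # "free" column
    usable_gb=$(free -g | awk '/^-\/\+/ {print $4}')    # free + buffers/cache
    echo "$ts | $swap_used | $free_gb | $usable_gb" >> "$LOG"
    sleep 900
done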
Another mystery was why our "usable" memory never really wavered at all, even though the system was under pressure. Regardless, we all agreed this was likely not an OS bug, since the symptom was presenting itself on only a single node. One option Oracle presented to us was changing a sysctl parameter to increase the memory allocated to the database processes, in an attempt to relieve the pressure on swap, but a nagging question kept coming up: had we added any new stress or load to the system that would cause this node to underperform? We didn't think so, but we kept digging to see whether a rogue ETL process or some other external partner was hitting us at the same time every night, and a pattern of the problem occurring on Wednesday and Saturday nights started to emerge. Having dug into this a few times, we expanded our scope to better understand the problem (which is why database load was added to this next trend analysis):
Time     | Swap  | Free Mem (G) | Usable Mem (G) | DB load (1 m avg)
9:02 PM  | 8353  | 29           | 37             | 1.19
9:17 PM  | 8455  | 1            | 37             | 4.04
9:22 PM  | 9097  | 0            | 38             | 4.39
9:47 PM  | 11144 | 0            | 38             | 5.65
10:02 PM | 11840 | 0            | 38             | 3.19
10:17 PM | 12160 | 0            | 39             | 3.19
10:23 PM | 12142 | 0            | 39             | 1.72
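The extra DB load column can be sampled alongside the memory numbers. The snippet below is a sketch that assumes it was simply the host's 1-minute load average; if it actually came from inside Oracle (say, v$osstat), a sqlplus query would take the place of the /proc/loadavg read:

# Illustrative one-shot sample with the extra load column; assumes
# "DB load (1 m avg)" is the host's 1-minute load average.
ts=$(date '+%-I:%M %p')
swap_used=$(free -m | awk '/^Swap:/ {print $3}')
free_gb=$(free -g | awk '/^Mem:/ {print $4}')
usable_gb=$(free -g | awk '/^-\/\+/ {print $4}')
load_1m=$(awk '{print $1}' /proc/loadavg)
echo "$ts | $swap_used | $free_gb | $usable_gb | $load_1m"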
This kind of load jump was abnormal, and it led the DBAs to look more closely at possible suspects such as cron jobs, where they found this entry:
5 21 * * 3,6 /s_u01/app/applmgr/scripts/tar_tops.sh > /s_u01/app/applmgr/scripts/tar_tops.logs
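For anyone who doesn't read crontab fields at a glance, the schedule breaks down like this:

# minute  hour  day-of-month  month  day-of-week   command
#   5      21        *          *        3,6       -> 21:05 on Wednesdays (3) and Saturdays (6)
5 21 * * 3,6 /s_u01/app/applmgr/scripts/tar_tops.sh > /s_u01/app/applmgr/scripts/tar_tops.logs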
Since the 3,6 in the day-of-week field meant the script was only running on Wednesday and Saturday nights at 21:05 (9:05 PM), QA testing commenced. Guess what we found? The script was the culprit for our memory depletion: after a while, free memory dropped to 0! Here is how QA looked before the cron job was run:
[applmgr@server midtier]$ free -g
             total       used       free     shared    buffers     cached
Mem:            94         68         26          0          1         14
-/+ buffers/cache:          51         42
Swap:           23          4         19
After the cron job had been running in QA for a while, you can see the difference:
[applmgr@server midtier]$ free -g
             total       used       free     shared    buffers     cached
Mem:            94         94          0          0          1         40
-/+ buffers/cache:          52         42
Swap:           23          5         18
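Note the cached column climbing from 14 G to 40 G while free fell to 0: the script, presumably tarring up large files, was flooding the page cache. Reproducing the observation in QA is essentially just running the script by hand and watching memory. A minimal sketch, using the script path from the crontab entry above (the script's contents aren't shown here):

# Kick off the suspect job manually and watch memory drain as it runs.
/s_u01/app/applmgr/scripts/tar_tops.sh > /tmp/tar_tops_qa.log 2>&1 &
watch -n 60 free -g    # "free" falls toward 0 while "cached" grows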
Once the DBAs commented out this cron job in PROD, the regular database crashes stopped.