Saturday, August 31, 2013

Windows R12 certs Part Deux

Crossing the streams between my first two spotlights, Steven Chan's blog and the EBS Support blog, is an article related to something I touched on last month: the release of EBS 12.1 on Windows systems.  This month brings a reinforcement of the same certification, which may seem odd, but it was this post that helped me realize that even though the two blogs are similar and may share content from time to time, they are really complementary.  They may have different audiences, but the best way to enjoy them is in conjunction.  I hope by the end of this spotlight, you'll feel the same way!

Friday, August 30, 2013

Newsworthy A/R Articles

Continuing the spotlight on the EBS Support blog, I wanted to focus on several articles related to different modules and tonight is the Accounts Receivable module.  Have you ever encountered problems with the Autoinvoice?  If so, this article is for you!  Even if you've never had an issue with the Autoinvoice, the article has a link to the Oracle Receivables User Guide which may have some interesting information in it for you.

Have you heard of the Trading Community Architecture?  The TCA is the data model for the relationship between Parties and Customers, so it should be interesting to read an article about the way it is changing with R12, right?  That's what I thought too!  Plus it has more links to help you understand the Customer Form in R12, what troubleshooting methods are available in R12, and some common known issues too.

Last, but not least, is a compilation of notes for A/R ranging from XLA, to Create Accounting, to receipt application, to closing periods and duplicating Receipt Numbers!  Noteworthy notes indeed!

Thursday, August 29, 2013

EBS Analyzers

Today in the spotlight on the EBS Support blog I share articles which have a common thread: system analyzers.  First up is a new analyzer for PO Approvals, which I'm eager to use at work because when I ran the Concurrent Processing analyzer it told me SO much about our system and some of the data which still lingered in it.  Finally, there is the Workflow analyzer, which I'm quite fond of because it is the one that started me on a journey that has taken me to giving a presentation in Pittsburgh, and it really opened my eyes to the BIG types of issues residing in systems that take a lot of work, effort, and mental elbow grease to solve.


Wednesday, August 28, 2013

R12 Procurement Update

For those of you heading to R12, our spotlight on the EBS Support blog has several articles you should read about Procurement.  First, the new look of Procurement with the Buyer Work Center which provides a one-stop shop for Buyers so they can perform their daily tasks all in one place with ease.  Next, primarily for DBAs, are articles on several invalid R12 objects which are the cause of at least 25 known issues for Procurement and a new Rollup Patch for Procurement Approvals and AME which is not included in the RPCs I posted about yesterday.

Tuesday, August 27, 2013

R12 EBS RPCs

Continuing the spotlight on the EBS Support blog is a very comprehensive article detailing all of the Recommended Patch Collections released in March for several modules of the EBS system.  Not only do you get the big listing on this page, but it also includes the My Oracle Support note which tracks these RPCs, and you can add that note as a favorite to stay on top of content changes such as new RPCs being released!  It might be too late in our R12 implementation to deploy some of these, but I won't be the least bit surprised if we have to apply some of them to our system in the first few months after go live.

Monday, August 26, 2013

Troubleshooting PO Output

After doing my first blog spotlight last month, I'm convinced that it is a good idea, so this week I'm going to spotlight the Oracle EBS Support blog.  You may be asking yourself how this blog is different from Steven Chan's blog.  Well, Steven is in charge of the Applications Technology Group at Oracle, and as a result his blog is focused on the EBS from the ATG perspective, whereas the EBS Support blog is specifically aimed at the EBS and the modules that comprise it.

A few months ago, the Master Troubleshooting Guide for POXPOPDF (PO Output for Communication) was released on My Oracle Support.  This guide includes setup checks, targeted patching, common issues, and additional troubleshooting help so it is well worth putting into your toolbox and procedures in order to avoid issues!

Sunday, August 25, 2013

Proactive DBAs and TEMP space

On the same day that we ran out of FLASH, we later saw our database running out of TEMP space and thought this might be related to the same archive job.  In reality it was a user report which had been run with invalid parameters, so it was just sitting there consuming TEMP space while trying to complete.  In addition to resolving the issue by terminating the session, the DBAs created an alert that day which checks whether any session is taking up more than 6 GB of TEMP space and, if anything is found, e-mails the DBA and Support teams.  This critically important, proactive alert has helped us several times to identify potential issues before they could exhaust TEMP space and halt the database.
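
A minimal sketch of the check behind that alert, assuming the per-session TEMP usage has already been pulled from the database into (session id, user, bytes) rows.  The row layout, the function name, and the sample values are hypothetical; only the 6 GB threshold comes from the alert described above.

```python
GIB = 1024 ** 3  # bytes in one gigabyte (binary)

def sessions_over_limit(rows, limit_gb=6):
    """Return the rows whose TEMP usage exceeds limit_gb, largest first."""
    limit_bytes = limit_gb * GIB
    offenders = [row for row in rows if row[2] > limit_bytes]
    return sorted(offenders, key=lambda row: row[2], reverse=True)

# Hypothetical sample data: (session id, user name, TEMP bytes in use)
sample = [
    (101, "REPORT_USER", 9 * GIB),
    (102, "BATCH_USER", 2 * GIB),
    (103, "ETL_USER", 7 * GIB),
]

flagged = sessions_over_limit(sample)
for sid, user, used in flagged:
    print(f"session {sid} ({user}) is using {used / GIB:.1f} GB of TEMP")
```

In a real alert the flagged rows would feed the e-mail to the DBA and Support teams; a session sitting at exactly 6 GB is not flagged, matching the "more than 6 GB" wording.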

Saturday, August 24, 2013

Weekend reading: Fusion discussion with Steve Miranda

Even though the interview is almost a year old, if you click on the link within the article and read the whole story it is really quite interesting.  While we aren't using Oracle Fusion, I've been hearing rumblings about it for several years, and even a few weeks ago Steven Chan had to clear up that Fusion isn't an upgrade path for Oracle EBS, so it was really enlightening to hear what Oracle is planning and how they are approaching their marketing.  The best takeaway for me was that even Oracle considers a product complete and successful internally only after they do a few month end closes!  Why does that matter?  Our group has come to the same conclusion and we're already looking at the calendar for staffing needs during our first few post-R12 closes.

Friday, August 23, 2013

Weekend Homework: Oracle Virtual Networking

When I initially looked over the article about Oracle Virtual Networking, from the OTN Garage, I was impressed with all of the information and diagrams but it did seem a bit overwhelming until I had a chance to focus on it and now I want more.  Of course there are some links in here to the Oracle Fabric Interconnect page and other good content, yet the content of the article is more than enough to engage the reader and get those mental juices flowing so one can think about their current installation for how they might be able to tweak or outright change it.  Do yourself a favor this weekend and spend a few minutes with this homework, and let me know what you think.

Thursday, August 22, 2013

What's the difference in a column?

This blog is filled with the times I'm just so brilliant, but this is not one of those times.  In our due diligence verifying that all our items, objects, and processes will stand up to the R12 upgrade, an analyst started asking us some questions about an alert which just so happens to be one I created.  The statement being made was that "it does not pull up anything", which was curious to me since I usually don't send alerts out unless I can test them successfully (often many, many times) before putting my team on them.  Sure enough, I went back and saw that I DID test the alert and it DID show information, so now I was really curious and started digging into the statements being made about actual_start_time being selected as the criteria for the alert.

Seems innocent enough, right?  I had wanted to find out about a Journal Import within the first hour of it ending in a non-normal status (Cancelled, Terminated, Error, you get the idea), but in fact that isn't what I asked the alert to do.  Because I used actual_start_time, and not actual_completion_time, I was asking the alert to tell me if a Journal Import ended in a non-normal state only if it started within the past hour.  How about that, huh?  So I looked back in the history at when the alert should have come out, and sure enough there was a Journal Import that started at 8 AM and didn't end abnormally until after 10 AM, so when the alert ran it wasn't picked up.  A good lesson learned: even with good intentions you can do something completely wrong, and not find out about it until too late.
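
To make the difference concrete, here is a small sketch of the two filters.  It assumes the request rows are plain dictionaries; the date, the alert run time, and the function name are made up for illustration, while the column names and the 8 AM start / after 10 AM failure come from the story above.

```python
from datetime import datetime, timedelta

ABNORMAL = {"Cancelled", "Terminated", "Error"}

def alert_rows(requests, now, time_field, window=timedelta(hours=1)):
    """Select abnormal requests whose time_field falls within the last window."""
    return [
        r for r in requests
        if r["status"] in ABNORMAL and now - window <= r[time_field] <= now
    ]

# Hypothetical Journal Import: started at 8 AM, failed a little after 10 AM.
journal_import = {
    "status": "Error",
    "actual_start_time": datetime(2013, 8, 20, 8, 0),
    "actual_completion_time": datetime(2013, 8, 20, 10, 15),
}
run_at = datetime(2013, 8, 20, 11, 0)  # assumed time of the alert run

by_start = alert_rows([journal_import], run_at, "actual_start_time")
by_completion = alert_rows([journal_import], run_at, "actual_completion_time")
print(len(by_start), len(by_completion))  # 0 1
```

Filtering on the start time silently drops any request that ran for longer than the alert window, which is exactly the case the alert was meant to catch.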

Wednesday, August 21, 2013

(M)obile responsibilities

When users were assigned responsibilities and they did not show up in the Navigator or "pick list" within the Financials application, it was a strange phenomenon which took some investigating to figure out.  We have had some experience with several tables getting out of sync with what the assigned responsibility tables show, but a quick glance at these problem tables told us that we were okay from that aspect, so I was quickly running out of ideas.  I even went to the end users to make sure the responsibility wasn't showing up, and sure enough it wasn't there.  I mentioned this to a co-worker to bounce ideas off of, and he reminded me that we had some problems a year or two back with a report or responsibility not showing the right information because some table level details were mixed up.  Thinking it was worth a shot, I then looked at the FND_RESPONSIBILITY_VL view for the problem child in question and was puzzled by the fact that the VERSION column held an 'M'.

Versions for most things are numbers, and when I compared it against other responsibilities they had the number '4' listed, so I looked at the form and couldn't see anything out of the ordinary.  Back to the table I went to run a query looking for other items marked with an M, and I found several other responsibilities, almost all of them having "Field" or "Mobile" in the name of the responsibility, which did not seem to make sense because this responsibility wasn't part of some Field or Mobile installation of Oracle Apps.  Well, back to the form I go, and there is the answer staring me straight in the face this time: there is an option to pick what responsibilities are "Available From."  At some point an individual accidentally picked Oracle Mobile Applications when they were trying to make another change (likely a menu change, since this is right above the Menu entry field) and didn't notice the radio button selection change when they misclicked, so they saved this bad change along with the changes they were planning on making.  After changing the radio button selection back to Oracle Applications, life was good as the users could once again access the responsibility.
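
The hunt for other mis-set responsibilities can be sketched as a simple filter.  This assumes the name and VERSION values have already been queried into pairs; the responsibility names below are invented, and the "Field"/"Mobile" naming rule is only the heuristic I noticed above, not an Oracle rule.

```python
def suspicious_mobile(responsibilities):
    """Return names flagged 'M' that lack 'Field' or 'Mobile' in the name."""
    return [
        name for name, version in responsibilities
        if version == "M" and not any(word in name for word in ("Field", "Mobile"))
    ]

# Hypothetical (responsibility name, VERSION) pairs
sample = [
    ("Payables Manager", "4"),
    ("Mobile Field Service", "M"),
    ("Receivables Inquiry", "M"),
]
print(suspicious_mobile(sample))  # ['Receivables Inquiry']
```

Anything this returns is worth a look in the form to confirm what "Available From" is really set to.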

Tuesday, August 20, 2013

This week in Training!

We'll just ignore that my play on the "This Week in Baseball" gives me the acronym of TWIT!

There are some very good training sessions over the next week, including one on the 21st, that I want to share with you just in case you're interested.  Hopefully I'll start having more time to read up on the blogs in my blog list, so I can help provide these on a more frequent and consistent basis.

8/21
Seeing that I have not heard of OSH, I want to get up to speed on it and how I might be able to leverage it in potentially finding more workflow issues so the session "Webcast: Using The Oracle Internet Expenses Setup Helper (OSH) To Troubleshoot Oracle Internet Expenses (OIE) Setup Issues" sounds right up my alley.

8/27
I know that we have some RHEL machines in our environment and if we're told to switch I want to be prepared so that is why I'm going to attend the session "Migration Made Easy: Switch from Red Hat Enterprise Linux to Oracle Linux in Minutes".

8/28
While I put this under the 28th, the first offering "Plug into Cloud Consolidation with Oracle Database 12c" is the first of four in a series which will launch each month.  I signed up for all of them, as I'd like to learn more about 12c and this seems like a very good outlet to do so.

Monday, August 19, 2013

Network canary in the coal shaft

Following part 1 and part 2 of getting to the bottom of wait events in your system, we had a "network event" at work the other day and could see the after effect of this disruption on our system: the wait event "TCP Socket (KGAS)" was at the top of our alert.  Seems like a "no duh?" moment, right?  Yet it got me thinking that this wait event is really the canary in the coal shaft for your network.

What if we modified our previous alert to run every minute, looking only for spikes in this wait event?  Wouldn't we get proactive notifications for incidents like this, or be able to more easily identify "network spikes" or other disruptions in communication between our systems themselves or between the systems and other parts of the infrastructure?  I think that just might be the case, so I'm going to create a simple alert and mock it up so the next time we have a network disruption we can start fine tuning the alert to capture the measurements that really matter in a scenario like this.
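
As a first mock-up of the spike logic, here is a rough sketch that compares the latest per-minute sample of the wait event against a recent baseline.  The factor of 3, the floor of 10, and the sample numbers are arbitrary assumptions to be tuned after the next real disruption; how the samples are collected is left out entirely.

```python
from statistics import mean

def is_spike(history, current, factor=3.0, floor=10):
    """Flag current if it exceeds both floor and factor times the recent mean."""
    baseline = mean(history) if history else 0.0
    return current > floor and current > factor * baseline

recent_waits = [2, 3, 1, 4, 2]      # hypothetical per-minute samples
print(is_spike(recent_waits, 3))    # False
print(is_spike(recent_waits, 40))   # True
```

The floor keeps a quiet system from alerting on tiny jumps, while the factor keeps a busy system from alerting on its normal level.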

Sunday, August 18, 2013

How to secure your EBS platform

Did you know that Oracle provides you documentation on how to successfully configure the Oracle EBS for security?  It's true!  My Oracle Support note 403537.1 details the items for R12, while note 189367.1 goes over what you need to do on the older R11 version of EBS.

Since this is Security Week, I'll go out with a bang and tell you that while you're thinking security you should visit notes 564125.1 and 443353.1 as well to find out how to set up password security and how to change the GUEST password, respectively.  Now let me ask a question.  Did you enjoy Security Week or did it fail to deliver for you?  Let me know in the comments!

Saturday, August 17, 2013

Don't mix and match your GUEST passwords

Continuing Security Week here, My Oracle Support note 602425.1 tells you what happens when you change the password for the GUEST account to all lowercase or mixed characters in R12.  This type of situation reminds me of a time when we went to a single sign-on application but didn't test any permutations of accepted user passwords, so there was a range of symbols which went untested before release.  The switch to the new solution was made in conjunction with a localization patch, which made it hard to understand what was causing a very small group of users not to be able to log into the EBS system.  It wasn't until we did additional research, like enabling logs and asking specific questions about what was in their passwords (without actually getting the passwords of course), that we realized the users having the problem all had special characters in their passwords.  The translation of passwords from one system to the other wasn't set up to allow those special characters, and it stayed that way until we could release a hot fix to the system.
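
The test we skipped could have started from something as small as this sketch, which builds one candidate password per ASCII punctuation character so each symbol gets exercised at least once.  The base string and function name are made up, and which symbols a given system actually accepts is an assumption you would need to check against its own password policy.

```python
import string

def symbol_test_passwords(base="Passw0rd"):
    """Build one candidate password per ASCII punctuation character."""
    return {ch: base + ch for ch in string.punctuation}

cases = symbol_test_passwords()
print(len(cases))   # 32 candidates, one per symbol
print(cases["#"])   # Passw0rd#
```

Feeding each candidate through both sides of a password translation would show which symbols fail to survive the round trip.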

Friday, August 16, 2013

The password is right there!

In an interesting post, My Oracle Support note 458064.1 has Oracle telling you right where to find a file that holds some of your critical and sensitive passwords.  Sure, this might be filed under "for emergencies only", but I think it is dangerous that individuals with either server access or the technical know-how to reach the file can get a hold of it and the associated passwords.  It is a good note to read, since the first step in battling security holes is to know where they currently exist.

Thursday, August 15, 2013

EBS password size

This is a little off course from our usual "Security Week" type of post, but I think it is important information to know anyway in your efforts to secure the EBS application as well as your environment.  Per My Oracle Support note 1550649.1, users can encounter errors on the R12 platform if passwords are longer than 30 characters, as some functions like FNDCPASS and AFPASSWD do not behave properly.  The page also has a few interesting bits about what characters are NOT allowed in the password, other error messages that may occur for your end users, and the Enhancement bug which has been logged to get the UI updated so passwords can no longer be set to this length.

Wednesday, August 14, 2013

Unauthenticated setting of a profile value

You read the title right.  Oracle has documented in My Oracle Support note 364503.1 how you can set a profile value without being authenticated to the EBS application at all.  Sure, it is supposed to be used in an emergency manner and only for those that want to go and do some good or fix their systems.  What if it is used by individuals that don't exactly have good things in mind?  Would it be possible for somebody to get an authenticated database session, maybe using other things I talk about during Security Week, to your system and then really start to take over your EBS application?  Maybe they don't even try to take over your EBS system, but just start to collect usernames and passwords by changing something like where your authentication system directs to which can then be leveraged for more sensitive systems that you might have.

Of course if they get an authenticated database session with enough privileges it might make this a moot point, but if you visit that note you might find out a new way of setting your profile values at least!

Tuesday, August 13, 2013

APPLSYSPUB security - Part 2

Following up my first article on securing the APPLSYSPUB account is this new article, which tells you about yet another place where the password is hidden in your system, and it is one Oracle will ask you for.  When the report Diagnostics: Apps Check (OMCHECK.sql) is run with the parameter of Application Object Library it will find all of the associated Profile Option Values for the Application Object Library module, which doesn't seem like a bad thing until you realize that it will list the value for the profile option "Gateway User ID".

Not only does this profile option value have the username/password combination for APPLSYSPUB hard coded, but the Diagnostics: Apps Check report has no further security validation built into it so once it is available to a responsibility it is available to run for ANY module.  Ever have to add this to a responsibility to give Oracle some data?  Did you make sure to remove it?  Do you know who is running this in your system?

Monday, August 12, 2013

Security Week - GUEST during password change

This started out perfectly innocently as I needed to go into a test environment and reset a user account password so that they could get into the application and test something out.  Imagine my surprise when I encountered this scenario!

USERA logs into the application, in a second window USERB logs into the application but is prompted for a password change; going back to the first window and logging out USERA provides you with a returning User ID of GUEST for some strange reason and not USERA.  This means to me that before USERB gets authenticated, the Oracle EBS utilizes the GUEST user account and gets an authenticated session with that user ID.

This pretty much blew my mind, and led me to do some more investigation of security issues related to user security for EBS, so this week I give you "Security Week" after finding quite a few interesting nuggets on My Oracle Support.

Even better to prove my original hypothesis, going back to the window for USERB and trying to complete the password change attempt results in this message being displayed:

Error : You do not have a current session, please log in before visiting this URL.

Sunday, August 11, 2013

Now on Twitter!

In a departure from a very technical posting week, I wanted to let you know that I'm now a part of the Twitterverse!  I created the account a few weeks ago, and started tweeting while at the conference, but I thought I'd just make sure you knew and were up to speed if you had missed it.  You can find me @TheOracleEMT, and while I'll start following people next week you can always take this chance to follow me so you can stay up to date!  Well, that is after I find out why LinkedIn isn't always posting to Twitter and THEN you can be completely up to date.  :}

Saturday, August 10, 2013

OVC - Reclaiming space

In running our Gather Schema for ALL schemas some interesting results popped up, one of which was the error message “ORA-00900: invalid SQL statement” for a few objects.  Researching what the items were by just running a select on the tables (OVCX_FINP_HLD_INVALIDS_0_0 and OVCX_FINP_HLD_PROFILES_0_0), I found that the items in question were actually tables in our system whose columns held HTML code which made up webpages when put together!!

Research on My Oracle Support finds the article "OVC Diagnostic Readme and Available Parameters" [ID 252733.1], which explains that these tables are artifacts of an older version of the Oracle Diagnostics, deprecated in favor of the current Oracle Diagnostics bundle, and that the tables were generated from the ovc.zip file which is still provided by Oracle via MOS Notes 297613.1 and 398834.1.  We only have a few of these objects in our system, so the space reclaimed by dropping the tables won't be a whole lot for us by percentage, but every byte is crucial, and multiplied across every environment it adds up toward your total storage budget.  Maybe you don't have any of these objects lying around in your system, but are you sure of it?  Might be worth your time to go looking, don't you think?

Thursday, August 8, 2013

Concurrent Manager bottleneck

In a continuation of yesterday's post, we had several "overnight" reports that normally start anywhere between 7 PM and 3 AM actually wait until 4 AM to start running in the Concurrent Manager.  I was even alerted on our on-call phone about a backlog between 3 and 4 AM one day, but I figured it had something to do with our user's requests that had been sitting there for a few days and assumed that after cancelling them the issue would resolve itself.

Good news?  It did, but I was still curious about the why of it so I looked at Autoinvoice Import Program reports, which should have kicked off but didn’t, until I ran into the same report scheduled by a user after 4 PM which didn’t kick off until after 4 AM.  Very curious right?  Why would the system just stop for 12 hours?  Suddenly it hit me.  What did the work shifts look like for the Standard Manager (assuming that’s where they went, and not Quick Manager)?


Our user in question had 7 of 10 slots in the Standard Manager with his stuck reports running before 4 PM and once the shift changed on the first night the process maximum was 7, which didn’t let anything else actually run the following morning until another shift change at 4 AM!  Then later on the second night he had 8 of 10 slots before 4 PM, and again after the shift change nothing was able to process on the third morning until another shift change but that night was a different story because we had his reports complete during the day so there was no lingering backlog to stop up the managers the next morning.
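
The arithmetic behind the bottleneck is simple enough to sketch.  The function name is made up; the 10 and 7 process limits and the 7 and 8 stuck reports are the numbers from the two nights described above.

```python
def free_slots(process_limit, stuck_requests):
    """Slots left for new requests once long-running ones are counted."""
    return max(0, process_limit - stuck_requests)

# First night: 7 stuck reports
print(free_slots(10, 7))  # 3 slots free under the 10-process shift
print(free_slots(7, 7))   # 0 slots free once the shift drops the limit to 7

# Second night: 8 stuck reports
print(free_slots(7, 8))   # 0 again, so nothing new starts until 4 AM
```

As soon as the number of stuck requests reaches the smaller shift's limit, the manager has nothing left to give until the next shift change.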

As a byproduct of allowing user reports to sit running for too long, we identified a bottleneck in our system since usually the only time you bother looking at your Concurrent Manager setting is when you have a problem right?  If the system has run for years, and nothing is wrong, you won't go looking for problems since there are none reported!  I find it really interesting that this shift setup is almost completely backwards from our bigger EBS platform where we let the Concurrent Managers ramp up at night until we throttle them back when we have more of a user load on the system, but here we throttle it back at night and then ramp up the processes when the users are in the system.  Maybe this group of users is more report intensive, but I doubt it so off on the hunt for details and a good long term solution I go!  (Sure I can just change the processes here so we avoid that kind of bottleneck in the future, but is that really a solution?)

Wednesday, August 7, 2013

Concurrent reports not completing for a single person

Sounds like a weird situation right?  That's what we thought too, but we weren't really laughing about it since it happened on our A/R Close night and we still had it happening after the DBAs bounced the Concurrent Manager.  A few days later jobs by this person are still running and nothing is completing but I don't see any locks at the database and the OS_PROCESS_IDs from FND_CONCURRENT_REQUESTS aren't actually alive on the server.  So what's going on?  Well, the log files tell the story.

Upon closer inspection, it appears that the reports are actually completing all of their work before passing off the job for printing as all of the reports have the same last lines since the user is trying to PRINT TO THE SAME PRINTER.  Alright.  We have a Solaris OS machine here which we aren't running CUPS on, so lpstat -lp <printer name> should give us some details about the printer yet when I issue the command nothing happens.  No detail results.  No command prompt returned.  Nothing.  So it appears that the concurrent manager hands off the reports for printing, but if the printer exists and can't complete the request (in this case just hanging out there) the concurrent request will never continue to Completed.

After we get the printer issue resolved, I try to cancel the reports in Running status but the application locks up on me.  Looking at the locks in the system I can see something is holding me up, but I don't think it is the concurrent request session but the concurrent manager itself so I can't kill it but I took the PROCESS number 14517 from GV$SESSION and went to the server:

[me@server ~]$ ps -eaf | grep 14517
 applmgr  9895 14517   0   Aug 06 ?           0:00 /bin/sh -c lp -c -dPRINTERNAME -n1 -t"USERNAME.REQUESTID" /u01/app/applmgr/11i/i
 applmgr 14517  3854   0   Aug 06 ?           0:07 FNDLIBR FND Concurrent_Processor

You can see that this process number corresponded to the requests which were running under the USERNAME (FND_USER.USER_NAME), REQUESTID (FND_CONCURRENT_REQUESTS.REQUEST_ID), and the PRINTERNAME (FND_PRINTER.PRINTER_NAME), so we can look at the other OS processes with that information:

[me@server ~]$ ps -eaf | grep USERNAME
 applmgr  9895 14517   0   Aug 06 ?           0:00 /bin/sh -c lp -c -dPRINTERNAME -n1 -t"USERNAME.REQUESTID" /u01/app/applmgr/11i/i
 applmgr  9896  9895   0   Aug 06 ?           0:00 lp -c -dPRINTERNAME -n1 -tUSERNAME.REQUESTID /u01/app/applmgr/11i/i
 applmgr  7434  3303   0   Aug 02 ?           0:00 /bin/sh -c lp -c -dPRINTERNAME -n1 -t"USERNAME.REQUESTID2" /u01/app/applmgr/11i/i
 applmgr  7435  7434   0   Aug 02 ?           0:00 lp -c -dPRINTERNAME -n1 -tUSERNAME.REQUESTID2 /u01/app/applmgr/11i/i

What I find interesting is that in the first two lines process ID 9895 is listed twice (once as a process ID and once as a parent process ID), and the same goes for the last two lines and process ID 7434, so we've learned something about Unix here: the printer OS processes are marked as children of the main concurrent request OS process.  Next, I issued kill -9 <OS process ID> commands for 9896 and 7435 as applmgr, and after I killed the first one, process 9895 disappeared and the application was no longer locked up on me trying to cancel the report.  Why?  Killing the print job that was hanging at the OS level allowed the concurrent OS process to complete believing it was done printing, which then allowed it to report back to the application that it was done and allowed my session to "complete".  Finally, I killed all of the other processes that had the lp command for our user, and when I went back to the application to search for their Running reports there were no longer any reports running.
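
The parent/child walk I did by eye can be sketched in a few lines.  This assumes the `ps -eaf` rows are fed in as text with the process ID and parent process ID in the second and third columns, as in the listings above; the function names are made up, and the sample rows are copied from the first listing (truncated paths included).

```python
def parse_ps(lines):
    """Map each PID to (parent PID, full row) from `ps -eaf` style rows."""
    procs = {}
    for line in lines:
        fields = line.split()
        pid, ppid = int(fields[1]), int(fields[2])
        procs[pid] = (ppid, line.strip())
    return procs

def descendants(procs, root):
    """Return every PID whose parent chain leads back to root."""
    found = []
    frontier = [root]
    while frontier:
        parent = frontier.pop()
        for pid, (ppid, _) in procs.items():
            if ppid == parent:
                found.append(pid)
                frontier.append(pid)
    return sorted(found)

ps_output = [
    ' applmgr  9895 14517   0   Aug 06 ?           0:00 /bin/sh -c lp -c -dPRINTERNAME -n1 -t"USERNAME.REQUESTID" /u01/app/applmgr/11i/i',
    ' applmgr  9896  9895   0   Aug 06 ?           0:00 lp -c -dPRINTERNAME -n1 -tUSERNAME.REQUESTID /u01/app/applmgr/11i/i',
    ' applmgr 14517  3854   0   Aug 06 ?           0:07 FNDLIBR FND Concurrent_Processor',
]

procs = parse_ps(ps_output)
print(descendants(procs, 14517))  # [9895, 9896]
```

Starting from the concurrent manager process, this turns up the shell wrapper and then the hung lp command beneath it, which is the one that needed the kill.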

Tuesday, August 6, 2013

Memory leak causing database crashes

After upgrading our database to 11.2.0.3 and applying both BP11 and EX10 Exadata patches we started having database node crashes every 2 to 3 weeks.  As a result we started looking closer at the monitoring information that was being shown before the crashes.  First we identified that only a single database node was running out of virtual memory (swap space) in what appeared to be a random pattern, yet it happened most often at 9 PM for some reason we couldn't fathom.  We even had an outlier happen on a Saturday morning, where the system recovered free memory around 8 PM and then ran out of free memory again an hour later!

Time        Swap     Free Mem(G)   Usable Mem(G)
7:47 AM     2317     16            43
8:02 AM     3136     1             44
8:17 AM     4508     0             45
8:47 AM     6315     0             45
9:23 AM     7036     0             45
9:47 AM     7865     0             45
10:17 AM    8915     0             47
12:17 PM    9822     0             47
2:02 PM     10254    0             48
3:47 PM     10250    1             48
8:02 PM     9625     0             47
8:17 PM     9605     16            45
9:02 PM     9515     17            46
9:17 PM     9536     5             46
9:23 PM     9581     0             46

Another mystery was why our "usable" memory never really wavered at all, even though the system was under pressure.  Regardless of that, we all agreed that this was likely not an OS bug since the symptom was only presenting itself on a single node.  An option presented to us by Oracle was changing the sysctl parameter to increase the memory allocated to the database processes in order to try and eliminate the stress on swap space usage, but a nagging question being asked was whether we had added any new stress or load to the system which would cause this node to underperform.  Obviously, we didn't think so, but we kept digging to see if maybe a rogue ETL process or other external partner was hitting us at the same time every night, and that is when a pattern of this occurring on a Wednesday or Saturday night started to emerge.  Having dug into this a few times, we expanded our scope to try to understand the problem (this is why database load was added to this next trend analysis):

Time        Swap     Free Mem(G)   Usable Mem(G)   DB load (1 m avg)
9:02 PM     8353     29            37              1.19
9:17 PM     8455     1             37              4.04
9:22 PM     9097     0             38              4.39
9:47 PM     11144    0             38              5.65
10:02 PM    11840    0             38              3.19
10:17 PM    12160    0             39              3.19
10:23 PM    12142    0             39              1.72

This type of load jump was abnormal and led the DBAs to look closer at possible suspects like cron jobs to find this:

5 21 * * 3,6 /s_u01/app/applmgr/scripts/tar_tops.sh > /s_u01/app/applmgr/scripts/tar_tops.logs

Since the 5 21 fields and the 3,6 day-of-week field meant the script was only running on Wednesday and Saturday at 21:05 (just after 9:00 PM), QA testing commenced.  Guess what we found out?  The script is the culprit for our memory depletion because after a while free memory became 0!!  Here is how QA looked before the cron job was run:

[applmgr@server midtier]$ free -g
             total       used       free     shared    buffers     cached
Mem:            94         68         26          0          1         14
-/+ buffers/cache:         51         42
Swap:           23          4         19

After some time of the cron job running in QA, you can see the difference:

[applmgr@server midtier]$ free -g
             total       used       free     shared    buffers     cached
Mem:            94         94          0          0          1         40
-/+ buffers/cache:         52         42
Swap:           23          5         18

Once the DBAs changed the crontab to comment out this job in PROD, we stopped having regular database crashes.
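
One possible reading of the two free -g snapshots from QA, and of why "usable" memory barely moved, is that the memory went into the filesystem cache rather than disappearing.  In this older free layout the "-/+ buffers/cache" free figure is roughly free + buffers + cached, which this small sketch recomputes from the numbers shown; the function name is made up, and the 1 GB gap from the reported 42 is just rounding to whole gigabytes.

```python
def usable_gb(free, buffers, cached):
    """Approximate the '-/+ buffers/cache' free figure from the Mem: row."""
    return free + buffers + cached

before = usable_gb(free=26, buffers=1, cached=14)  # snapshot before the cron job
after = usable_gb(free=0, buffers=1, cached=40)    # snapshot while it was running
print(before, after)  # 41 41
```

Free memory dropped from 26 to 0 while cached grew from 14 to 40, so the total stayed flat even though the "free" column hit zero.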

Monday, August 5, 2013

Thanks somewhere between 600 and 2000 times!

Confused yet?  Well let me help you understand!  This is one of those "stop and smell the roses" posts that you see every once in a while on this blog, so please sit back and relax your brain because there won't be anything technical shared today.

The previous high for my blog over a month's time was 324 views which was amazing and the following month I posted almost a third less but I still had over 200 views and this was inspiring because it felt like I didn't NEED to post all the time for individuals to find value from my blog.  The following month I decided to post once a day and also opened up about the blog publicly since I was going to the OAUG conference, and as a result I hit 685 views last month and just the other day I crossed the threshold of 2000 views!  I want to thank each and every one of you that continue to visit, that believe I provide some measure of sanity or are hanging on waiting for me to provide one of those moments.  :}

Sunday, August 4, 2013

Lesson learned: Rebooting a server

Following yesterday's post about encountering an issue with adding space via ASM, we learned a very valuable lesson about rebooting a server: don't leave your servers up for hundreds of days without a reboot.  Why do I say that?  What happens when you have a problem with your server and it needs to be rebooted, but you have had it up for 300, 400, 500, 800, 1000 days?  When you reboot, Linux may run fsck against the filesystems listed in /etc/fstab, and when you've left your server up that long the check has more to do.  You've also given possible corruption a chance to creep into your system unnoticed, because without regular reboots those preventative filesystem checks never ran at regularly scheduled times.

So now you're rebooting a server that has a problem, and the reboot is taking longer than normal (which needs to be planned for in your change plans as well).  On top of that, you have no confidence that the filesystem won't be corrupted when it comes back up, which can exacerbate an already critical situation.
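One way to keep this from sneaking up on you is to alert on uptime before it reaches the hundreds of days.  Below is a minimal Python sketch.  The function names and the 180-day threshold are my own assumptions, not something we had in place; the only system detail it relies on is that /proc/uptime on Linux starts with the uptime in seconds.

```python
SECONDS_PER_DAY = 86400


def uptime_days(proc_uptime_text):
    """Return whole days of uptime from the contents of /proc/uptime.

    The file holds two numbers; the first is seconds since boot.
    """
    seconds_up = float(proc_uptime_text.split()[0])
    return int(seconds_up // SECONDS_PER_DAY)


def needs_planned_reboot(days_up, max_days=180):
    """True once uptime passes the threshold (180 days is a made-up default)."""
    return days_up > max_days


# Sample contents standing in for open("/proc/uptime").read() on a real server.
sample = "43200000.55 170000000.10"
days = uptime_days(sample)
if needs_planned_reboot(days):
    print(f"Up {days} days: schedule a maintenance reboot")
```

The point of the threshold is to make the reboot a scheduled change with time budgeted for filesystem checks, rather than something forced on you in the middle of an outage.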

Saturday, August 3, 2013

Bug while adding space with ASM

We were attempting to add space to a volume on the fly with ASM, yet the change wasn't successful because the system would hang on the “service oracleasm scandisks” command.  After a second attempt hung as well, the DBA found a bug related to this on My Oracle Support, so we had our Storage team open an SR with Oracle to confirm that we were encountering the bug detailed below:

Bug 7576680 : ASM - ORACLEASM SCANDISKS (ASMSCAN) HANGS

Oracle confirmed that we were encountering the issue in the bug, and we were required to reboot our servers before they would be able to see the volumes being presented by ASM.  So a. be warned and b. tomorrow we'll explore why this would be a big issue for us.

Friday, August 2, 2013

Database FLASH exhaustion

Have you seen a database exhaust FLASH before?  Well if not, here's an introduction!

New errors detected in "/u01/app/oracle/diag/rdbms/db/dbnode/trace/dbname_arc3_17851.trc":
===========================================================
ORA-19816: WARNING: Files may exist in db_recovery_file_dest that are not known to database.
ORA-17502: ksfdcre:4 Failed to create file +FLASH
ORA-15041: diskgroup "FLASH" space exhausted

During this time, our flashback logs were using up more than a TB of space, which our DBA found by going into ASMCMD and seeing how little room was left in the FLASH diskgroup:

[oracle@server trace]$ asmcmd
ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  15593472  6107124          5197824          454650              0             N  DATA/
MOUNTED  NORMAL  N         512   4096  4194304    894720   893272           298240          297516              0             Y  DBFS_DG/
MOUNTED  NORMAL  N         512   4096  4194304   3896064   626852          1298688         -335918              0             N  FLASH/
ASMCMD> exit
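The negative Usable_file_MB for FLASH in the listing above is not a typo.  For a NORMAL redundancy diskgroup, ASM reports usable space as the free space minus the space it reserves to re-mirror after a failure, divided by two for the mirror copy.  Here is a minimal sketch of that arithmetic in Python; the function name is hypothetical, and the numbers are copied straight from the lsdg output.

```python
def usable_file_mb(free_mb, req_mir_free_mb):
    """Usable_file_MB for a NORMAL (two-way mirrored) diskgroup.

    Goes negative once free space drops below the reserve ASM wants
    to keep for restoring redundancy.
    """
    return (free_mb - req_mir_free_mb) / 2


# (Free_MB, Req_mir_free_MB) per diskgroup, from the lsdg listing.
diskgroups = {
    "DATA": (6107124, 5197824),
    "DBFS_DG": (893272, 298240),
    "FLASH": (626852, 1298688),
}

for name, (free_mb, reserve_mb) in diskgroups.items():
    print(f"{name:8} {usable_file_mb(free_mb, reserve_mb):>10.0f} MB usable")
```

All three results match the Usable_file_MB column, including the -335918 for FLASH.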

Here the DBA can see how much space is used by the flashback logs:

ASMCMD> du FLASHBACK/
Used_MB      Mirror_used_MB
891468             1782936

What does this mean to us?  Well, this is a typical snapshot of our system at this time (with our handy dandy alert we created months earlier):

Waits                                             Wait Time
  Backup: MML write backup piece                  5482.197026
  SQL*Net message from dblink                     4727.746732
  TCP Socket (KGAS)                               762.1687

This is a snapshot of our system when it starts running out of FLASH:

Waits                                             Wait Time
  statement suspended, wait error to be cleared   212154.60987
  inactive transaction branch                     11951.353872
  SQL*Net message from dblink                     2742.039774
  TCP Socket (KGAS)                               1175.376008
  SQL*Net break/reset to client                   418.624794

The DBA cleaned up archive logs older than 24 hours, so now we have space to grow, and the sessions that were suspended are able to resume activity:

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  4194304  15593472  6107108          5197824          454642              0             N  DATA/
MOUNTED  NORMAL  N         512   4096  4194304    894720   893272           298240          297516              0             Y  DBFS_DG/
MOUNTED  NORMAL  N         512   4096  4194304   3896064  1393468          1298688           47390              0             N  FLASH/
ASMCMD>

As the DBA tells us, if we use our archive application to delete 300 GB of data from the DATA diskgroup, that deletion lands in the FLASH diskgroup's flashback logs as 600 GB with mirroring.  This results in the 1.7 TB of usage by flashback logs, which is abnormal and, as you can see, takes up too much space.
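The doubling is easy to check against the du output earlier in this post.  Here is a minimal Python sketch, assuming two copies of every extent for a NORMAL redundancy diskgroup; the function names are mine, for illustration only.

```python
MIRROR_COPIES = 2  # NORMAL redundancy keeps two copies of each extent


def mirrored_size(size, copies=MIRROR_COPIES):
    """Space consumed on disk once every extent is mirrored."""
    return size * copies


def mb_to_tb(size_mb):
    """Convert MB to TB using 1024-based units."""
    return size_mb / 1024 / 1024


# A 300 GB delete recorded in the flashback logs, as the DBA described.
print(mirrored_size(300), "GB on disk for a 300 GB delete")

# Used_MB from "du FLASHBACK/" and what it costs with mirroring.
flashback_used_mb = 891468
flashback_on_disk_mb = mirrored_size(flashback_used_mb)
print(flashback_on_disk_mb, "MB mirrored, about",
      round(mb_to_tb(flashback_on_disk_mb), 2), "TB")
```

The mirrored figure matches the Mirror_used_MB value that du reported, which is where the 1.7 TB comes from.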