Failed garbage collection retries can bring down your Application

Oh boy, can they ever!

Yesterday I came in to find my Inbox full of messages from users. (That'll teach me for not checking email on the weekend!)

One after another, Instructors were reporting that over the weekend they couldn’t do one thing or another: add WebLinks, upload files, enroll TAs, add an Assessment, edit their syllabus… I didn’t read the emails, just scanned the subject lines.

Logging in to our production 4.1.2 servers, I found the JMS servers all over the place: Integration still on NodeA, Mail on NodeB, and Chat on NodeC. Bb Support told me, “JMS is currently not running causing this error and it is not migrating.” When I told them I saw the JMS servers running in the WebLogic Console, they said, “It’s not running though. It can’t process the messages to the weblogic admin node and then to the JMS node.” Because I believe my TSM even when I don’t get what he’s getting at, I restarted the cluster.

Whatever it was… it happened again today.

And I finally got it. I think.

Here’s what probably happened: We are migrating Vista3 course sections to Vista4. We restore the archive on this server, then reset it to wipe out all student data. Then we change the sourcedID and source to match its original incarnation, and voila! our Banner SSB customization works. (Of course, it could be argued we don’t really need it in Vista4 like we did in Vista3 because Instructors can now copy any content associated with their netID on the server to the course they are just starting… but let’s not make them change horses midstream).

Where was I?

Oh… So Garbage Collection was running, and parts of it were failing: the parts created by our restore and reset actions. These failed Garbage Collections kept retrying and, in the process, used up all the memory for the JVM. This effectively brought down the JMS servers, even though WebLogic reported them as running.
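How Vista actually schedules its Garbage Collection retries is not something this post can show, so what follows is only a general sketch, in Python, of the pattern that avoids this kind of pile-up: cap the number of retries, back off between attempts, and park whatever keeps failing instead of queueing it again. Every name in it (`run_with_bounded_retries`, `drain`, the task names) is hypothetical and not part of Vista or WebLogic.

```python
import time


def run_with_bounded_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Returns (succeeded, attempts_used). The wait doubles after each failure.
    Hypothetical helper for illustration only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True, attempt
        except Exception:
            # Wait before the next try, but not after the final attempt.
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return False, max_attempts


def drain(queue, max_attempts=3, sleep=time.sleep):
    """Run each (name, task) pair through the bounded retry exactly once.

    Tasks that never succeed are returned by name rather than re-queued,
    so one permanently broken item cannot keep the worker busy forever.
    """
    parked = []
    for name, task in queue:
        ok, _ = run_with_bounded_retries(task, max_attempts, sleep=sleep)
        if not ok:
            parked.append(name)
    return parked
```

Calling `drain` with a list of cleanup tasks hands back the names of the ones that gave up, which someone can then look at by hand. The point of the design is only that a failure costs a fixed, small amount of work instead of an open-ended amount.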

Remediation? Turn off Garbage Collection. It doesn’t really work in 4.1.2 anyway.

Now we wait. For the version in which Garbage Collection is fixed. We bloat the database. We turn off the search tool. We hang on. And we act like people of faith. Or not. According to our individual predilections.

One response to “Failed garbage collection retries can bring down your Application”

  1. In reviewing a long-running ticket when I first started working with the University System of Georgia, I learned that WebCT (pre-merger) had determined that an easily triggered perfect storm of processes using a large amount of JVM memory can cause these kinds of issues. When asked what kinds of processes, the list was kind of unsettling: Garbage Collection, System Integration API (siapi.sh), tracking reports, section archive creations and restores, and failing mail messages. That wasn’t an exhaustive list, as it’s from memory. 😦
    It’s also for Vista 3. The more we hear about brave souls like you who have migrated to Vista 4 and are finding the same problems we encountered in Vista 3, the less certain we are that Vista 4 will actually “solve [insert issue here]”. Garbage Collection, one of those issues that was supposedly better, is waiting for fixes.
    See you at the conference. 🙂
