Banner ICGORLDI extract causes stuck thread

Since registration for spring opens Nov. 16th, and we are integrated with the Registrar’s system through the Luminis Message Broker (LMB), it is vital that spring courses and sections exist in Bb Vista before the flood of events that open registration causes.

So, Notre Dame’s standard procedure is:


1. Registrar (Don Steinke): Add the term to ACTIVE_TERM on GORICCR (event generation begins).

2. Registrar (Don again): Run GURIROL for the STUDENT and FACULTY roles (the only roles whose assignment depends on the active term). GURIROL doesn’t take a term parameter, so it processes whichever persons are active at the time.

The CMS admin and Luminis admin do these steps once the RFC appears in our ‘to-do’ list and the Registrar has signaled readiness:

3. Luminis: Stop LMB from processing events (events begin to queue)

4. CMS: Extract ICGORLDI from Banner for term

5. Luminis: Import ICGORLDI to Luminis

6. CMS: Import ICGORLDI to Concourse

7. Luminis: Restart LMB event processing (queued events first)

8. CMS: Reconnect Concourse to LMB



The ICGORLDI_XXXXXXXX.xml this time was 76 MB in size. I used my standard perl scripts to chop and filter it into smaller files, figuring I could get away with importing 25 MB at a time. It turned out that almost 50 MB of the XML extract was <person> tags, so I ran those two files first, with standard results. Each one took about an hour and a half. This may or may not be an improved time now that our JVM startup option for MaxPermSize has been raised from 192m to 256m.
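For reference, the chop step can be sketched in a few lines (a minimal Python sketch, not my actual perl scripts; the 25 MB cap and a flat <enterprise> root holding top-level <person>/<group>/<membership> elements are the only assumptions):

```python
import xml.etree.ElementTree as ET

def split_extract(path, max_bytes=25 * 1024 * 1024):
    """Split an IMS Enterprise extract into size-capped chunk files,
    keeping each top-level element (<person>, <group>, <membership>) whole.
    Assumes a plain <enterprise> root with no attributes/namespaces."""
    root = ET.parse(path).getroot()
    chunk, size, n = [], 0, 0
    files = []

    def flush():
        nonlocal chunk, size, n
        if not chunk:
            return
        n += 1
        out = f"{path}.part{n}.xml"
        with open(out, "wb") as f:
            # Re-wrap the fragments so each chunk is well-formed on its own.
            f.write(b"<enterprise>")
            for frag in chunk:
                f.write(frag)
            f.write(b"</enterprise>")
        files.append(out)
        chunk, size = [], 0

    for elem in list(root):
        frag = ET.tostring(elem)
        if size + len(frag) > max_bytes:
            flush()
        chunk.append(frag)
        size += len(frag)
    flush()
    return files
```

The point of splitting on element boundaries is that every chunk remains a well-formed IMS Enterprise document in its own right, rather than a file cut mid-tag.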

But the real weirdness happened on the 3rd import, which I kicked off at 3:30pm before leaving the office an hour later. Around 6pm the siapi import command finished, but a STUCK thread message began to be logged; though I was unaware of it at the time, the import was still being processed.

All night long, the logging (which I still normally keep at DEBUG for our system integration components) chewed up hard drive space… 24 GB by the time Operations called me at 6:30am the next morning (must call them to set the monitoring threshold higher… there was only 600 MB left on the volume by that point!).

At 6:30am I freed up 27 GB. By 8am at the office, 10 GB had been consumed by logging again.
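Worth writing down for next time: those numbers give a quick time-to-full estimate (a back-of-the-envelope sketch using only the figures in this post):

```python
def hours_until_full(free_gb, rate_gb_per_hr):
    """Estimate hours until a volume fills at an observed logging rate."""
    return free_gb / rate_gb_per_hr

# Overnight: ~24 GB logged between ~6pm and 6:30am -> ~1.9 GB/hr
overnight_rate = 24 / 12.5
# Morning: 10 GB logged between 6:30am and 8am -> ~6.7 GB/hr
morning_rate = 10 / 1.5

# At the morning rate, the 27 GB just freed buys only about four hours.
headroom_hours = hours_until_full(27, morning_rate)
```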

So, you would think the node would have to be restarted in order to stabilize, right? And because it’s the JMS node, the whole cluster should be restarted to avoid a failed JMS migration.

This was not the case.

I thought it would be, but I wanted to wait until all of the XML was committed to the database, which, judging by webct.log, was still ongoing.

I grabbed the sourcedId from webct.log for the LC context change currently importing on a cross-listed section, grepped the original XML to see how far down the million-line file the system was now working, and realized that only 10,000 lines of the XML remained to be processed. I confirmed that in the UI by getting a count of sections titled “Cross Listed Section Group” with sourcedIds ending in the term code I was importing, and figured I could manage the logging until the process finished and then restart the cluster. Meanwhile, the Linux utility “top” was displaying two processes eating up 45% of memory, but no pegging of the CPU and very low I/O wait. ps -ef confirmed that there was no 3rd process, like siapi, running.
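That progress check is easy to script for next time (a hedged sketch of what the grep was doing; the sourcedId value is simply whatever webct.log last reported):

```python
def lines_remaining(xml_path, sourced_id):
    """Find the line on which the currently importing sourcedId appears
    and report how many lines of the extract remain after it."""
    hit = None
    total = 0
    with open(xml_path, "r", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            total = lineno
            if hit is None and sourced_id in line:
                hit = lineno
    if hit is None:
        raise ValueError(f"{sourced_id!r} not found in {xml_path}")
    return total - hit
```

From the shell, `grep -n <sourcedId> ICGORLDI_XXXXXXXX.xml` plus `wc -l` gives the same answer.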

After the <group> tags in an ICGORLDI XML file come the <membership> tags. These are both child-section memberships in a parent section, and person memberships (enrollments, both teacher and student) in a section.
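Roughly, the two flavors look like this in IMS Enterprise terms (a hand-written illustration, not lines from the actual extract; the sourcedid values and role/status codes are made up for the example, though idtype 1 = person and 2 = group follow the IMS Enterprise convention):

```xml
<!-- a child section belonging to a cross-listed parent section -->
<membership>
  <sourcedid><source>Banner</source><id>PARENT.201110</id></sourcedid>
  <member>
    <sourcedid><source>Banner</source><id>CHILD.201110</id></sourcedid>
    <idtype>2</idtype>  <!-- 2 = the member is a group (section) -->
    <role><status>1</status></role>
  </member>
</membership>

<!-- a person enrolled in a section -->
<membership>
  <sourcedid><source>Banner</source><id>CHILD.201110</id></sourcedid>
  <member>
    <sourcedid><source>Banner</source><id>jdoe123</id></sourcedid>
    <idtype>1</idtype>  <!-- 1 = the member is a person -->
    <role roletype="01"><status>1</status></role>
  </member>
</membership>
```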

I cannot say with certainty when this long-running STUCK thread got ‘unstuck’, since I had changed the logging level for framework.ejb to FATAL on the fly in the WebLogic console so that I didn’t have to watch hard drive space so closely. I suspect it changed as soon as the last of the XML import was committed to the database.

And the node continues on. And the cluster continues on.

All’s well that ends well.

Note to self: Test JMS node failover in Test on our current 8.0.3 when Node A is shut down first. I could really use this tool in my toolkit for future incidents.

5 thoughts on “Banner ICGORLDI extract causes stuck thread”

  1. I’m adding this comment because this SAME event happened just yesterday. We are now on Blackboard Vista 8.0.3.

    However, I am reimporting today after the DBA added a new 30 GB .dbf file to the WEBCT_LOBS tablespace (previously there were two .dbf files which were NOT at capacity; between the two of them, our DBA said, we already had 30 GB).

    The undo and temp tablespaces were fine.

    But I did make another change, also (like the one above) at the recommendation of Matt Brady, a Tier II engineer with Blackboard: I altered the /customconfig/ COMMAND_LINE_JVM settings on my JMS node to give command-line runs a bit more wiggle room.

    # PREVIOUS
    COMMAND_LINE_JVM_OPTIONS=-ms128m -mx1024m -XX:NewSize=32m -XX:MaxNewSize=128m -XX:+JavaMonitorsInStackTrace -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseConcMarkSweepGC -Dcom.sun.xml.namespace.QName.useCompatibleSerialVersionUID=1.0 -Duser.language=en

    # NEW
    COMMAND_LINE_JVM_OPTIONS=-ms256m -mx1536m -XX:NewSize=32m -XX:MaxNewSize=256m -XX:+JavaMonitorsInStackTrace -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseConcMarkSweepGC -Dcom.sun.xml.namespace.QName.useCompatibleSerialVersionUID=1.0 -Duser.language=en

  2. As I recall, the largest client in our consortium has thousands of cross-listed courses. Years ago, with the first introduction of cross-listed courses in Vista 3, their group files started to consume all of the memory and swap. Somehow (I do not recall how) I decided the issue was that every cross-listed parent was named the same thing, which meant the process had to try each permutation from “Cross Listed Section Group 1” to “Cross Listed Section Group n”, and that was chewing up the memory and swap.

    The first thing I got implemented was splitting the extract into 10 MB files. Some testing found that each time the file doubled in size, processing took four times longer. The client also happened to want the name to be more meaningful, so they changed their Banner setup to put a unique identifier in the name. Between these two changes, the issue disappeared.
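    That 2x-file/4x-time observation is the signature of quadratic behavior: if every new group has to be checked against all the groups already created, total work grows with the square of the group count (a sketch of the arithmetic, not Vista’s actual code):

```python
def total_lookups(n_groups):
    """If group k must be checked against the k-1 names already present,
    importing n groups costs 0 + 1 + ... + (n-1) comparisons in total."""
    return n_groups * (n_groups - 1) // 2

# Doubling the input roughly quadruples the total work,
# which matches the observed 2x-file -> 4x-time behavior.
ratio = total_lookups(2000) / total_lookups(1000)
```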

    Hope this helps.

    P.S. The JMS node is the one node I would not want to crash, because of what it does to the cluster. Maybe you need a virtual-machine managed node to process this stuff?

  3. Post-mortem for 3/8/2011 issue:
    1. App server resources appeared fine during the entire 48-hour period. No JVM memory concerns, per grep “java.lang.OutOfMemoryError” *.log* or grep “Available memory” webct_monitor.log.
    2. DBA reported two partially full WEBCT_LOBS .dbf files (the max on our system is set to 30 GB each). A cursory review of db server .trc files shows nothing.
    3. Import files were bigger than 10 MB. The one which started the transaction rollback errors was 18 MB in size and composed entirely of <group> tags, which DO entail creation of those “Cross List Section Group n” parents.
    4. The next line of analysis is to confirm that background jobs have been running successfully. Not sure whether there are index-optimization routines which might help.

  4. Not even an index can help a truly inefficient method. Having to check, one by one, whether a candidate name already exists and then moving on to the next is very inefficient, i.e. slow. Think of it this way: say you have 3,000 of these already. At half a second per lookup, the process checks 3,000 names to get to number 3,001, so it takes 1,500 seconds. That is 25 minutes to add one item, rather than half a second for a name that does not already exist. As you enter more cross-listed sections, the problem just gets worse. The best-optimized index could get you to a tenth of a second per lookup, but even then you are still at 5 minutes per item added, and it gets worse as you add thousands more.
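    The commenter’s arithmetic, worked out (a sketch; the half-second and tenth-of-a-second per-lookup figures are the estimates above, not measurements):

```python
def minutes_to_add_next(existing, per_lookup_s):
    """Minutes to add one more name when every existing name
    must be scanned before a free number is found."""
    return existing * per_lookup_s / 60

# 3,000 existing groups, 0.5 s per lookup -> 25 minutes for one insert.
unindexed = minutes_to_add_next(3000, 0.5)
# A well-tuned index at 0.1 s per lookup still costs 5 minutes per insert.
indexed = minutes_to_add_next(3000, 0.1)
```

    Either way the per-item cost scales linearly with the number of existing groups, so the total import cost stays quadratic; only unique names (one lookup each) make it cheap.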

Comments are closed.