Since registration for spring opens Nov. 16th and we're integrated with the Registrar's system through the Luminis Message Broker, it is vital that spring courses and sections exist in Bb Vista before the flood of events that open registration causes.
So, Notre Dame’s standard procedure is:
1. Registrar (Don Steinke): Add the term to ACTIVE_TERM on GORICCR (event generation begins).
2. Registrar (Don again): Run GURIROL for the STUDENT and FACULTY roles, the only roles whose assignment depends on the active term. GURIROL has no term parameter, so it processes whichever persons are active at the time.
CMS Admin and Luminis Admin do these steps once the RFC appears in our 'to-do' list and the Registrar has signaled readiness:
3. Luminis: Stop LMB from processing events (events begin to queue)
4. CMS: Extract ICGORLDI from Banner for term
5. Luminis: Import ICGORLDI to Luminis
6. CMS: Import ICGORLDI to Concourse
7. Luminis: Restart LMB event processing (queued events first)
8. CMS: Reconnect Concourse to LMB
The ICGORLDI_XXXXXXXX.xml this time was 76 MB in size. I used my standard perl scripts to chop and filter it into smaller files, figuring I could get away with importing 25 MB at a time. It turned out that almost 50 MB of the xml extract was <person> tags, so I ran those 2 files first with standard results. Each one took about an hour and a half, which may or may not be an improved time now that our JVM startup option for MaxPermSize has been raised from 192m to 256m.
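For what it's worth, the chopping doesn't have to be fancy. This is a minimal sketch of the approach, not my actual script; it assumes the extract is pretty-printed with each <person>, <group>, or <membership> record opening and closing on its own line, and that the importer accepts a bare <enterprise> envelope:

    #!/usr/bin/perl
    # split_icgorldi.pl -- chop an ICGORLDI extract into ~25 MB chunks,
    # keeping whole <person>/<group>/<membership> records together.
    use strict;
    use warnings;

    my $infile    = shift or die "usage: $0 ICGORLDI_XXXXXXXX.xml\n";
    my $max_bytes = 25 * 1024 * 1024;    # target chunk size

    my ($buf, $type, $bytes, $n, $out) = ('', '', 0, 0, undef);

    open my $in, '<', $infile or die "can't read $infile: $!\n";
    while (my $line = <$in>) {
        # A record opens: start buffering and remember its element name.
        if (!$type and $line =~ m{^\s*<(person|group|membership)\b}) {
            ($buf, $type) = ($line, $1);
            next;
        }
        next unless $type;    # skip envelope lines outside any record
        $buf .= $line;
        # The record closed: flush the whole buffered record to a chunk.
        if ($line =~ m{^\s*</$type>}) {
            if (!$out or $bytes + length($buf) > $max_bytes) {
                finish($out) if $out;
                open $out, '>', sprintf('chunk_%03d.xml', ++$n)
                    or die "can't write chunk: $!\n";
                print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n<enterprise>\n};
                $bytes = 0;
            }
            print {$out} $buf;
            $bytes += length($buf);
            ($buf, $type) = ('', '');
        }
    }
    finish($out) if $out;
    close $in;

    sub finish {
        my ($fh) = @_;
        print {$fh} "</enterprise>\n";
        close $fh or die "close failed: $!\n";
    }

Since the <person> records come first in the extract, the early chunks land as all-person files, which is how the 2 person files I ran first fell out naturally.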
But the real weirdness happened on the 3rd import, which I kicked off at 3:30pm before leaving the office an hour later. Around 6pm the siapi import command finished, but a STUCK thread message began to be logged (WebLogic flags a thread STUCK once it runs longer than the configured StuckThreadMaxTime, 600 seconds by default); unaware though I was at the time, the import was still being processed.
All night long the logging, which I still normally keep at DEBUG for our system integration stuff, chewed up hard drive space… 24 GB by the time Operations called me at 6:30am the next morning (must call them to set the monitoring threshold higher… there was only 600 MB left on the volume by that point!).
At 6:30am I freed up 27 GB. By 8, back at the office, another 10 GB had been consumed by logging.
So, you would think the node would have to be restarted in order to stabilize, right? And because it's the JMS node, the whole cluster should be restarted to avoid a failed JMS migration.
This was not the case.
I thought it would be, but I wanted to wait until all of the xml was committed to the database, which, judging by the webct.log, was still ongoing.
I grabbed the sourcedId from the webct.log for the LC context change currently importing on a cross-listed section, then grepped the original xml to see how far down the million-line file the system was working: only 10,000 lines remained to be processed. I confirmed that in the UI by counting sections titled "Cross Listed Section Group" with sourcedIds ending in the term code I was importing, and figured I could manage the logging until the process finished, then restart the cluster. Meanwhile, the Linux utility "top" showed two processes eating up 45% of memory, but no pegging of the cpu and very low I/O wait. ps -ef confirmed that there was no 3rd process, like siapi, running.
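Locating the spot was a one-liner, something like this (the sourcedId below is a made-up placeholder, not the real one):

    # Print the line where the sourcedId appears, plus the total line count,
    # so the remainder is simple subtraction. The sourcedId is hypothetical.
    perl -ne 'print "found at line $.\n" if /ND\.201220\.CLSG\.12345/;
              END { print "total lines: $.\n" }' ICGORLDI_XXXXXXXX.xml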
In an ICGORLDI xml file, the <membership> tags come after the <group> tags; they carry both child section memberships in a parent section and person memberships, or enrollments, both teacher and student, in a section (a rough sketch of the two flavors follows).
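Schematically, the records follow IMS Enterprise 1.1 conventions, so something like this quick tally script can tell the two flavors apart. The structure in the comments is illustrative; the exact attributes in a Banner extract may differ:

    #!/usr/bin/perl
    # count_memberships.pl -- tally the two flavors of <membership> record.
    # Each record looks roughly like (IMS Enterprise 1.1 style):
    #   <membership>
    #     <sourcedid><id>PARENT_SECTION_ID</id></sourcedid>
    #     <member>
    #       <sourcedid><id>CHILD_SECTION_OR_PERSON_ID</id></sourcedid>
    #       <idtype>2</idtype>              <!-- 1 = person, 2 = group -->
    #       <role roletype="01">...</role>  <!-- 01 = Learner, 02 = Instructor -->
    #     </member>
    #   </membership>
    use strict;
    use warnings;

    my %count;
    while (<>) {
        $count{'child section in parent'}++ if m{<idtype>\s*2\s*</idtype>};
        $count{'person in section'}++       if m{<idtype>\s*1\s*</idtype>};
        $count{'learner role'}++            if m{<role\s+roletype="01"};
        $count{'instructor role'}++         if m{<role\s+roletype="02"};
    }
    printf "%-24s %d\n", $_, ($count{$_} || 0) for sort keys %count;

Run it against a chunk (perl count_memberships.pl chunk_003.xml) to see how many of each kind are waiting to import.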
I cannot say with certainty when this long running STUCK thread got 'unstuck', since I had changed the logging level for framework.ejb to FATAL on the fly in the WebLogic console; that way I didn't have to watch hard drive space so closely. I suspect it changed as soon as the last of the xml import was committed to the database.
And the node continues on. And the cluster continues on.
All’s well that ends well.
Note to self: Test JMS node failover in Test on our current 8.0.3 when Node A is shut down first. I could really use this tool in my toolkit for future incidents.