Category Archives: Course Management Server-side

All the Connecting Dots

We’re planning an upgrade of our LMS. Consider these “dots” and their connections:

  • The old gradebook is out. No further development. When do we take it away from our instructors? How many times do we notify them first?
  • We’re in the middle of changing video on demand (VOD) providers. The old one was to have been available ’til Jan. 2019, but it does not provide a standard LTI integration and therefore isn’t ready for the upgraded LMS version.
  • Also, there is no straight migration of content between the old VOD and the new one. Our instructors want to take their content with them. There are 576 sites which have media that needs to be migrated. This needs to be decoupled from any and all LMS upgrades. I wish.
  • We’ve been running this LMS since 2011 with that aforementioned gradebook. There is no archival method for old sites in this LMS, nor in any other LMS I’m aware of. Nor should anyone ever expect that an LMS is a system of record. But instructors do. So when we upgrade to the new version without the old gradebook, none of their old sites will have grades. Just a blank screen. Can we live with that?
  • There’s one best time frame for an LMS upgrade, even one that mostly streamlines performance and usability without adding new features or seriously changing workflows. That time is Commencement weekend, a mere 5 and a half weeks from now.

Instead of focusing on the technical details, all the splotches that need to be corralled, tested, organized, cleaned up and formed into a pleasing whole… let’s see if it helps to imagine this from our constituents’ point of view. What do they need to know and when?

  • Their current gradebook is going away. Even in old course sites they haven’t seen in a long time. Now would be a good time to check their records and export gradebooks from long ago, since instructors are responsible for these things, not the LMS, nor those who run it.
    • From Fall 2018 course sites on, the old gradebook will no longer be available.
    • From that same time, old course sites also won’t have grades records.
    • We’re working on a copy of all those sites to run on the old server software version, so that we can take requests from anyone who missed these messages and get them spreadsheet exports of those old gradebooks. But that old server software won’t be able to stay up forever. We’re thinking maybe another year?
  • They’re going to like the new gradebook. And we’ll help. In fact, we’ve been offering it as an alternative since last May, so even if it wasn’t in use by any given instructor, LMS support staff already knows a good deal about it and the kinds of questions people have.
  • The upgrade to the LMS? Not really of consequence this time. People will care more about their gradebooks and their embedded video service.
  • How fast can we migrate their content from the old (“K”) to the new (“P”) video on demand/playback service? Because of course we can’t take away K without delivering on an alternative!

Sometime in August I hope to deliver on a cohesive orchestrated totality.


Bringing Blackboard Vista 8 into the Oracle 11g world…

Just a few notes, probably only of significance to myself, on the work involved when the technology around a piece of software changes even as that software is being brought to its end of life…

Blackboard’s CE and Vista product is scheduled to be de-supported in January of 2013. Notre Dame will continue to run it as we come alongside it with another system and move our faculty and students to it.

  • Meanwhile Oracle has de-supported (as of June 2010) its database software version Oracle 10g, on which many institutions have been running their Bb Vista databases.
  • Meanwhile Oracle has acquired Java and issued Update 29 of Java 6 (available for PCs Oct 10th and for Mac 10.6.8 and Mac 10.7.1 shortly thereafter) with which Bb Vista 8 doesn’t play well.
  • Meanwhile Oracle 11g is being deployed as a database cluster – RAC, that is, a feature some of this older software wouldn’t have dreamed of.

This just makes keeping the old girl running that much more of an effort.

This week here at Notre Dame we validated our Bb Vista 8 Dev environment to service pack 6 (SP6) on our Oracle 11g RAC database farm. Here were our tests:

  • NDCustom copy content tool (uses siapi) – perl, cron, DB link to Banner, permissions, UI display: all looks good. Passed.
  • Created supersections with our NDCustom job (uses siapi) – same as above. Passed.
  • Took a quiz while stopping the database on one node (no failover) – System Exception error, but the session remained open and saved answers were preserved. When the database ‘returned’, saves continued. Repeated logged messages as the app tried to reconnect to the database. Passed.
  • Took a quiz while gracefully failing over database nodes – no system exception error. The session remained open and everything saved correctly. The only indications the db node had failed over were watching netstat -a close connections on one db node and open them on another, plus one unpinned connection error in the logs. RAC works!
  • JMS real-time messaging server failover – still fails, same behavior as always. Recommend the Weblogic setting that leaves Target set to a single non-migratable node.
  • Background job, Garbage Collection – deleted hundreds of courses and checked timing and completion. No essential change in performance; GC completed and took over an hour. Our live system job averages 2 hours nightly on 10g to complete, so we now anticipate the same on 11g.
  • Background job, Content Index Search – no essential change between 10g and 11g. Works. Passes.
  • Background job, Tracking Event – no essential change between 10g and 11g. Works. Passes.
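
For those failover tests, by the way, the tell was just watching the Oracle connections move. Something along these lines is the idea – a rough sketch, assuming the usual 1521 listener port and that you’re watching from an app node:

    # during a graceful failover, connections to one db host close and new ones open to the other
    watch -n 5 "netstat -an | grep :1521"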

Pithy Truisms on Blackboard Vista and Chat and Wimba under SSL

Those who’ve followed, referenced or read this blog for any length of time know that my posts follow my thoughts – one day pondering the ineffable, another day contemplating market changes in the LMS space, and the very next mired in the nuts and bolts of maintaining one of those systems.

Today is one of those days. Maybe these observations will help someone else; at least they’ll be breadcrumbs for me.

So far, in the last week on Bb Vista 8.0.5 I have confirmed these things:

  • Blackboard Collaborate (the former Wimba) changed out the cert at our site, https://notredamevoice.wimba.com/, on April 14th without notifying us, effectively breaking SSL.
  • The cert, key and ca files in the /WebCTDomain/userdir referenced by Weblogic are only read when the application is started. In other words, overwriting their contents while the application is running does not constitute a valid test unless you restart the app node.
  • In these days, cert renewals are being complicated by the fact that 2048-bit is the new standard but your old cert is probably still based on 1024-bit encryption. This makes a difference if you’re chaining certs: make certain you don’t mix 1024 with 2048 …  (I can’t say if it makes a difference to keystores. I would think you could import both types into a keystore.)
  • Configuring Chat for end-to-end encryption means nothing more than sharing the key and cert files from your load balancer and pointing to them in the Chat config file and Weblogic > Server (incl. Admin) > SSL tab.
  • Configuring end-to-end encryption with a 3rd party server such as Wimba means constructing a ca chain which includes their cert, the intermediate and the root. Don’t worry if your cert vendor’s intermediate and root are not there – focus on theirs.
  • In order to encrypt both, I ended up chaining our cert vendor’s intermediate cert to the chat cert in /WebCTDomain/userdir AND chaining Wimba’s cert vendor’s intermediate cert into the ca.pem file located in that same directory, /WebCTDomain/userdir. (A rough sketch of the commands follows below.)
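
For my own breadcrumbs, here’s a minimal sketch of that chaining plus a sanity check. The file names are hypothetical (the real ones live in /WebCTDomain/userdir), so adjust to taste:

    # chain our cert vendor's intermediate onto the chat cert Weblogic reads at startup
    cat chat-cert.pem our-vendor-intermediate.pem > chat-cert-chained.pem

    # chain Wimba's vendor's intermediate (and root) into the ca.pem in that same directory
    cat wimba-intermediate.pem wimba-root.pem >> ca.pem

    # sanity checks: does Wimba's cert verify against the new ca.pem, and what are they actually serving?
    openssl verify -CAfile ca.pem wimba-server.pem
    openssl s_client -connect notredamevoice.wimba.com:443 -showcerts </dev/null

And remember the point above: those files are only read at startup, so none of this counts as a valid test until the app node is restarted.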

Bb Vista Health Checks 101

1. The load balancer health checks are configured on the load balancer. You see them in your webserver.log file as

Date/Time IP Address 200 GET /webct/checkStatusForLb   guest

The interval is usually 10 seconds.

They’re used by the load balancer to determine whether a user requesting a session should be sent to that node or not. If the node doesn’t respond, end user sessions aren’t sent there.
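
If you want to see what the load balancer sees, you can hit the same URL yourself. A quick sketch – the node name and port are made up, substitute your own:

    # returns the HTTP status the probe gets back; 200 means "send users here"
    curl -s -o /dev/null -w "%{http_code}\n" http://app-node-1:80/webct/checkStatusForLb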

Weblogic also does health checks. A couple of different kinds it turns out. Here’s where you find them in your Weblogic Console:

2. Peer Connectivity. How are my sister nodes? Are they available to replicate my end user sessions if I’m unable to handle them all?

Left Panel > Environment > Servers > Click on the Configuration tab, then the Tuning sub-tab, and hit the Advanced portion at the bottom of the screen to reveal all. We’re interested in the Period Length and the Idle Periods Until Timeout.

By default, as I mentioned Friday, in Bb Vista (8.0.3 is the one I checked) this is set to broadcast once a minute (60000 milliseconds) and to consider the peer unavailable if 4 messages are lost – in other words, about four minutes of silence before a peer is written off.

As far as I can tell, our cluster has suffered no ill consequences if a node thinks its peer is gone. Probably because our loads are quite small and because the load balancer’s health checks are still indicating the node is able to receive end user sessions.

3. Domain or Cluster Health. The admin is also asking managed nodes whether they are still participating in the cluster, and probably whether the designated JMS node is still running the services the Admin believes it to be.

Left Panel > Environment > Servers > Click on Configuration tab, Health Monitoring sub-tab (scroll to the right)

By default the health check interval is every 180 seconds. Auto-restart is also enabled so that the Node Manager application can automatically attempt to restart the node services or the server itself if the node does not report appropriately. (I think that’s the best way of saying it.) I have a feeling this setting is used by other settings…
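
And if you want to confirm from the OS that Node Manager is actually around to do that restart, something like this works (5556 is the stock Node Manager port; adjust if yours differs):

    ps -ef | grep -i "[N]odeManager"    # is the Node Manager process running?
    netstat -an | grep 5556             # is it listening on the default port?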

Your Bb Vista cluster config.xml file

Tip: Keep a ‘last known good’ config.xml file on your admin node.

With even a small cluster, anytime a change to a setting in the Weblogic console causes a bad write-out to this file (like almost every other time!), all you have to do to recover is copy your last known good config.xml file over the faulty one, create an empty file named REFRESH on each node (including admin) and rename all of the ../WebCTDomain/server/ directories (all nodes, including admin) so that a new /server directory gets created.

It shouldn’t happen. It sounds goofy. But it’s a lifesaver when you need it!

*In some cases the Vista_WLSstore database table also needs to be renamed so that it gets recreated on application startup.
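
For my own reference, the recovery boils down to something like this on each node (the paths here are hypothetical – substitute your actual WebCTDomain location and wherever you stash the good copy):

    cp /u01/WebCTDomain/config.xml /u01/WebCTDomain/config.xml.bad          # keep the bad one for forensics
    cp /u01/backups/config.xml.lastgood /u01/WebCTDomain/config.xml         # restore the last known good copy
    touch /u01/WebCTDomain/REFRESH                                          # empty REFRESH file, every node incl. admin
    mv /u01/WebCTDomain/server /u01/WebCTDomain/server.old.$(date +%Y%m%d)  # a fresh /server gets created at startup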

Weblogic 9.2 and Bb Vista cluster node migration 101

There’ve been a couple of recent incidents at Notre Dame requiring our Bb Vista cluster to be restarted. In follow-up root cause analysis, the culprit has been dropped multicast healthcheck packets.

The admin, for reasons I haven’t discovered yet, has not received the communication, and believing the node to be unavailable, has begun a migration process which failed. But what exactly is it attempting to migrate? Where is the failure?

I do know WHEN the failure happens, with catastrophic consequence. It happens when a configured number of JMS health checks are not received by the admin. The setting is configurable in your Weblogic Console here (not that I’d mess with these!):

Left Panel > Environment > Servers > Click on the Configuration tab, then the Tuning sub-tab, and hit the Advanced portion at the bottom of the screen to reveal all. We’re interested in the Period Length and the Idle Periods Until Timeout.

The default settings are 1 health check every 60000 milliseconds (1 per minute) with a period interval of 4 before migration starts. In our case, packets drop a few times a week but rarely 4 in a row, so the migration process doesn’t begin (then fail and hose the cluster) more than once or twice a year. Too many times a year for my tastes.
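
If you suspect the multicast traffic itself, Weblogic ships a little utility for exercising it. A sketch – the multicast address and port here are made up, so use the ones from your own cluster configuration, and weblogic.jar has to be on the classpath (source setWLSEnv.sh first):

    # run on each node; gaps in the received sequence numbers are dropped packets,
    # the same thing the admin's heartbeat listener would be missing
    java utils.MulticastTest -n node1 -a 239.192.0.10 -p 7001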

And you can see the evidence in your weblogic.log on the admin node. Check your logs for the word ‘connect’. You’ll get the connected AND disconnected events that happen WITHOUT attempted migrations. Are there any? If so, should you be worried?

(I dunno the answer to this yet. I don’t think I have to worry unless the migration attempt happens, as in the BEA-141073 event, aka the dreaded JMS node migration attempt.)
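
Pulling those events out is just a grep. The log path below is an assumption – point it at wherever your admin node actually writes weblogic.log:

    grep -i connect /u01/WebCTDomain/logs/weblogic.log        # all the cluster connect AND disconnect events
    grep -ci disconnect /u01/WebCTDomain/logs/weblogic.log    # a quick count, to decide whether to worry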

In the testing I’ve done, however, I haven’t been simulating dropped healthchecks; instead I’ve been purposely shutting down the JMS node first, out of sequence, to see what happens to the rest of the cluster. And that has proven helpful in my understanding of the process…

…because the JMS node does migrate correctly. In fact, I’ve been part way through a quiz on a server I’m shutting down, and my session even fails over. So far, I haven’t been able to make any kind of migration I know about fail.

Banner ICGORLDI extract causes stuck thread

Since registration for spring opens Nov. 16th and we’re integrated through the Luminis Message Broker with the Registrar’s system, it is vital that spring courses and sections exist in Bb Vista prior to the flood of events open registration causes.

So, Notre Dame’s standard procedure is:

1. Registrar (Don Steinke): Add the term to ACTIVE_TERM on GORICCR (event generation begins).

2. Registrar (Don again): Run GURIROL for the STUDENT and FACULTY roles (these are the only ones that depend on the active term for assignment). GURIROL doesn’t have a term param, so it picks up whatever persons are active at the time.

CMS Admin and Luminis Admin do these steps once the RFC appears in our ‘to-do’ list and the Registrar has signaled readiness:

3. Luminis: Stop LMB from processing events (events begin to queue)

4. CMS: Extract ICGORLDI from Banner for term

5. Luminis: Import ICGORLDI to Luminis

6. CMS: Import ICGORLDI to Concourse

7. Luminis: Restart LMB processing events (queue first)

8. CMS: Reconnect Concourse to LMB

The ICGORLDI_XXXXXXXX.xml this time was 76 MB in size. I used my standard perl scripts to chop and filter it into smaller files, figuring I could get away with importing 25 MB at a time. It turned out that almost 50 MB of the xml extract were <person> tags, so I ran those 2 files first with standard results. Each one took about an hour and a half. Which may or may not be an improved time now that our JVM startup option for MaxPermSize has been altered from 192m to 256m.
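
Those perl scripts aren’t anything magical – the gist is just slicing one record type at a time out of the extract and re-wrapping each chunk as a valid import file. A rough sketch of the slicing part (the filename is hypothetical, and the real scripts also reattach the enclosing enterprise-level tags so each chunk is a well-formed import file):

    # pull only the <person> records out of the big extract
    awk '/<person[ >]/,/<\/person>/' ICGORLDI_big.xml > persons.xml

    # how many person records, and how big did that slice end up?
    grep -c '</person>' persons.xml
    ls -lh persons.xml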

But the real weirdness happened on the 3rd import, which I kicked off at 3:30pm, leaving the office an hour later. Around 6pm the siapi import command finished, but a STUCK thread message began to be logged – and, unbeknownst to me, the import was still being processed.

All night long, the logging (which I still normally keep at DEBUG for our system integration stuff) chewed up hard drive space… 24 gig by the time Operations called me at 6:30am the next morning (must call them to set the monitoring threshold higher… there was only 600 MB left on the volume by that point!).

At 6:30am I freed up 27 gig. By 8, at the office, 10 gig had been consumed by logging again.

So, you would think the node would have to be restarted in order to stabilize, right? And because it’s the JMS node, to avoid a failed JMS migration, the cluster should be restarted.

This was not the case.

I thought it would be, but I wanted to wait until all of the xml was committed to the database, which seemed to still be ongoing based on the webct.log.

I grabbed the sourcedID from the webct.log for the currently importing LC context change on a cross-listed section, grepped the original xml to see how far down the million-line file the system was now working, and recognized that only 10,000 lines of the xml remained to be processed. I confirmed that in the UI by getting a count of how many sections had the title “Cross Listed Section Group” and sourcedIDs ending in the term code I was importing, and figured I could manage the logging until the process finished and then restart the cluster. Meanwhile, the Linux utility “top” was displaying two processes eating up 45% of memory, but no pegging of the cpu and very low I/O wait. ps -ef confirmed that there was no 3rd process, like siapi, running.
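
That “how far along are we” check is just line arithmetic. A sketch with a made-up sourcedID and filename:

    # where in the extract is the record the import is currently chewing on?
    grep -n '201210.CROSSLIST.12345' ICGORLDI_big.xml | head -1

    # total lines in the file; the difference is roughly what's left to process
    wc -l ICGORLDI_big.xml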

After the <group> tags in an ICGORLDI xml file, the <membership> tags come next; these are both child-section memberships in a parent section and person memberships (enrollments, both teacher and student) in a section.

I cannot say with certainty when this long-running STUCK thread got ‘unstuck’, since I had changed the logging level for framework.ejb to FATAL on the fly in the Weblogic console. That way I didn’t have to watch hard drive space so closely. I suspect it changed as soon as the last of the xml import was committed to the database.

And the node continues on. And the cluster continues on.

All’s well that ends well.

Note to self: Test JMS node failover in Test on our current 8.0.3 when Node A is shut down first. I really could use this tool in my toolkit for future incidents.