There have been a couple of recent incidents at Notre Dame that required our Bb Vista cluster to be restarted. In the follow-up root cause analysis, the culprit turned out to be dropped multicast healthcheck packets.
The admin node, for reasons I haven’t discovered yet, did not receive the healthchecks and, believing the node to be unavailable, began a migration process which then failed. But what exactly is it attempting to migrate? Where is the failure?
I do know WHEN the failure happens, with catastrophic consequences. It happens when a configured number of JMS health checks are not received by the admin. The setting is configurable in your WebLogic Console here (not that I’d mess with these!):
Left Panel > Environment > Servers > click the Configuration tab, then the Tuning sub-tab, and hit the Advanced portion at the bottom of the screen to reveal all. We’re interested in the Period Length and the Idle Period Until Timeout.
The default settings are 1 health check every 60000 milliseconds (1 per minute), with 4 idle periods allowed before migration starts, so roughly 4 minutes of silence. In our case, packets drop a few times a week but rarely 4 in a row, so the migration process doesn’t begin (then fail and hose the cluster) more than once or twice a year. Too many times a year for my tastes.
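To make the arithmetic concrete, here is a minimal Python sketch of the rule as I've described it: migration kicks in only after a run of consecutive missed health checks reaches the configured count. The function names are mine, and this models my reading of the two settings, not WebLogic's actual implementation.

```python
def migration_delay_ms(period_length_ms: int, idle_periods: int) -> int:
    """Silence (in ms) tolerated before migration would start."""
    return period_length_ms * idle_periods


def max_consecutive_misses(received) -> int:
    """Longest run of missed health checks in a sequence of True/False results."""
    longest = current = 0
    for ok in received:
        # A received check resets the run; a missed one extends it.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def would_migrate(received, idle_periods: int = 4) -> bool:
    """Assumption: migration starts once idle_periods checks in a row are missed."""
    return max_consecutive_misses(received) >= idle_periods
```

With the defaults, `migration_delay_ms(60000, 4)` comes out to 240000 ms, or 4 minutes. A history like `[True, False, False, False, True]` has three drops in a row and stays under the threshold, which matches what we see most weeks.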
And you can see the evidence in your weblogic.log on the admin node. Check your logs for the word ‘connect’. You’ll get the connected AND disconnected events that happen WITHOUT attempted migrations. Are there any? If so, should you be worried?
(I dunno the answer to this yet. I don’t think I have to worry unless the migration attempt happens, as in the BEA-141073 event, aka the dreaded JMS node migration attempt.)
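If you'd rather not eyeball the log, the check above can be sketched in a few lines of Python. This is only a rough sketch: it assumes nothing about the log format beyond the word 'connect' appearing on the relevant lines, and it assumes the migration event ID shows up written as BEA-141073. The function names are mine.

```python
import re


def connect_events(lines):
    """Return log lines mentioning 'connect' (catches connected and disconnected)."""
    pattern = re.compile(r"connect", re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def summarize(lines, migration_marker="BEA-141073"):
    """Count connect/disconnect lines and lines carrying the migration event ID.

    migration_marker is an assumption about how the event ID is written in the log.
    """
    lines = list(lines)  # allow a file object to be scanned twice
    events = connect_events(lines)
    disconnected = sum(1 for line in events if "disconnect" in line.lower())
    return {
        "connect_lines": len(events),
        "disconnected": disconnected,
        "connected": len(events) - disconnected,
        "migration_attempts": sum(1 for line in lines if migration_marker in line),
    }
```

To try it, open weblogic.log on the admin node and pass the file object to `summarize`. Plenty of disconnects with zero migration attempts is the "should I be worried?" case I'm asking about.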
In the testing I’ve done, however, I haven’t been simulating dropped healthchecks. Instead, I’ve been purposely shutting down the JMS node first, out of sequence, to see what happens to the rest of the cluster. And that has proven helpful in my understanding of the process…
…because the JMS node does migrate correctly. In fact, I’ve been part way through a quiz on a server I’m shutting down, and my session even fails over. So far, I haven’t been able to make any kind of migration I know about fail.