I woke up at 6 a.m. this morning to a phone call from my boss. Barely awake, I answered in a very tired voice, “Hello.” His response: “The Xen environment is down.” I mean, it's read-only Friday; this isn't supposed to happen. Our environment is pretty small, but it runs all of our critical services: DNS, DHCP, AD, monitoring, and file storage. I had my boss SSH into the box and check for zombie processes. Sure enough, there were some. We had seen processes become zombies before, when our log files filled the log partition and made everything choke, but that was on XenServer 5.6 and things are different now. I was able to assign myself a static IP from my room and get into XenCenter, only to find all of our hosts in maintenance mode. That would explain the zombie processes and why no VMs were running on the hosts. I attempted to bring each one out of maintenance mode, but received an error (see below).
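For the curious, spotting zombie processes over SSH takes one line; this is roughly the check my boss ran (it works on any Linux box, not just a XenServer dom0):

```shell
# Print the ps header plus any process whose state starts with "Z"
# (zombie/defunct). A non-empty list below the header is what we saw.
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```

Zombies are already-dead children waiting for a parent to reap them, so a pile of them usually points at a wedged parent process rather than the zombies themselves.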
I got into the console of the pool master and ran xe pool-ha-disable, and boom, I was able to spin up all of our VMs. Once we restored service to the city, I tried to work out what had caused the issue. It was obviously related to HA, but why would that stop all of our VMs? Part of the answer was found in the alerts section of XenCenter.
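For reference, the recovery from the master's console amounted to roughly the following. This is a sketch, not a transcript of that morning; the --minimal loops are my shorthand for clicking through each host and VM:

```shell
# On the pool master: turn off High Availability for the whole pool.
xe pool-ha-disable

# Bring each host out of maintenance mode (host-enable lets it run VMs again).
xe host-list --minimal | tr ',' '\n' | while read -r uuid; do
    xe host-enable uuid="$uuid"
done

# Start the halted VMs, skipping the control domains (dom0).
xe vm-list power-state=halted is-control-domain=false --minimal |
    tr ',' '\n' | while read -r uuid; do
        xe vm-start uuid="$uuid"
    done
```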
After cycling through each host, HA ran out of working hosts to fail over to, so it just killed all of our VMs and placed every server into maintenance mode. Since I was still really tired and wanted to get some sleep before my classes, I told my boss to open a case with Citrix and have them dig through the logs. I went back to bed.
It turns out our NIC drivers were out of date, which caused instability on our hosts. The resolution was to install updated drivers from XenServer 6.2; it would seem the upgrade to 6.5 wiped the drivers we had already updated, and they needed to be re-installed. Woot! That same day I made my drive back up to Sandy, did late-night BIOS upgrades on our IBM and Dell hosts, and installed the updated Broadcom and Intel NIC drivers, following the guide from a Citrix support page. The upgrade itself took no time at all, but migrating the VMs over a 1 Gbps connection was more than slow. With each host rebooted, the new drivers should resolve our issue. This work was performed on 4/3/15, and we have not had any reported issues yet.
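For anyone repeating this, the per-host flow looks roughly like the sketch below. The ISO name is a placeholder for whichever Broadcom/Intel driver disk the Citrix guide points you at, and <host-uuid> is the host you're working on:

```shell
# Disable the host, then migrate its VMs elsewhere in the pool.
# Evacuation is the slow part over a 1 Gbps link.
xe host-disable uuid=<host-uuid>
xe host-evacuate uuid=<host-uuid>

# Install the driver disk as a supplemental pack, then reboot.
# driver-disk.iso is a placeholder name, not the actual file we used.
xe-install-supplemental-pack driver-disk.iso
xe host-reboot uuid=<host-uuid>

# Once the host is back, let it take VMs again.
xe host-enable uuid=<host-uuid>
```

Doing this one host at a time keeps the pool serving VMs while each member is patched and rebooted.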
We still have a case open with Citrix, though, and we have not re-enabled HA just yet; I am waiting to find time to call and chat with them. According to my boss, if I call back in, they can assist and help get HA configured, tested, and stabilized. I'll update this post when that is completed and show the results and process.
In addition, I made a post over at /r/citrix about my frustrations. The responses weren't quite what I was looking for, but they were nonetheless interesting.