Bug hunt with Citrix

This is the (short) story of how we discovered a bug using Xen Orchestra and XenServer, and how it was fixed very quickly by the Citrix team.

This is me, calmly but firmly saying that we found a noticeable bug.

Symptoms

In short: losing consoles in HVM guests.

HVM guests use QEMU for a lot of things. One of these things is QEMU-VNC, which exposes a VNC console for the virtual machine.

In a lot of cases, it works perfectly. You can start a Windows VM, even install one. Or display a running Linux HVM.

But when you try to install a brand new Linux as an HVM guest, the console just vanishes after the initial menu (you know, where you can change the language or choose between Install or Live etc.).

And you can't get it back unless you reboot the guest, at which point you'll hit the same issue again. And again. Also, crashing parts of QEMU under a guest is not exactly good news.

Consoles are tricky

Our initial reaction was: "console issues, again, meh", but we decided to dig in to find the cause. With more debugging, we found that the console was being closed on the XenServer side.

And you know what? We couldn't reproduce the problem with XenCenter. So was it our "fault" or not?

Thanks to one of our users, we got more details from a XenServer host while the bug was happening:

kernel: [718079.110218] qemu-dm[29404]: segfault at 1591000 ip 00000000004a309b sp 00007fff47ec4880 error 6 in qemu-dm[400000+122000]

Figure 1: ouch.

So, we (XO) are triggering a QEMU segfault by doing some VNC stuff on a VM.
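The kernel line above carries a bit more information than it first appears. As a rough sketch (the regular expression and the `parse_segfault` helper are ours, written for illustration, and the bit meanings are the standard x86 page-fault error code, not something reported by Citrix), it can be decoded like this:

```python
import re

# Kernel segfault line copied from the host log above.
LOG_LINE = (
    "kernel: [718079.110218] qemu-dm[29404]: segfault at 1591000 "
    "ip 00000000004a309b sp 00007fff47ec4880 error 6 in qemu-dm[400000+122000]"
)

# Hypothetical helper regex for the usual x86 Linux "segfault at" message.
SEGFAULT_RE = re.compile(
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Decode a kernel segfault line into a small dictionary."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"])
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped binary.
        "offset_in_object": ip - base,
        # x86 page-fault error code bits: present, write, user mode.
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = parse_segfault(LOG_LINE)
print(hex(info["offset_in_object"]), info["write_access"], info["page_present"])
```

Read this way, "error 6" means a user-mode write to a page that was not mapped, and the offset inside `qemu-dm` is the kind of detail a developer with the matching debug symbols could use to locate the faulting code.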

We quickly reported the issue on the XenServer bug tracker.

Citrix's quick answer

Thanks to our contacts at Citrix, the issue was very quickly raised to their dev team: by using XO on their side, they could reproduce the problem (and not only with XO, but with other VNC clients such as TightVNC and Vinagre).

It seems that when the VNC window size changes to something very small (even for a few seconds), it just kills the QEMU VNC process. Guess when this happens? In a lot of Linux installers.

The fix itself was done very fast, and the result was incorporated in this hotfix.

Why is this not happening in XenCenter? In fact, it's possible that their VNC implementation never asks for a partial frame update, which is what triggers the crash.
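For readers who have never looked at the VNC protocol: a client asks for a region of the screen with a FramebufferUpdateRequest message, defined in the RFB protocol (RFC 6143). The sketch below only shows what such a request looks like on the wire. It assumes that "partial frame update" refers to a request covering a sub-rectangle of the screen, which is our reading and not something confirmed by Citrix, and the function name is made up for illustration. It does not reproduce the bug.

```python
import struct

# Client-to-server message type for FramebufferUpdateRequest in RFC 6143.
FRAMEBUFFER_UPDATE_REQUEST = 3


def framebuffer_update_request(x, y, width, height, incremental=True):
    """Build the 10-byte RFB request for one rectangle of the framebuffer.

    Layout (big-endian): message-type U8, incremental U8,
    x U16, y U16, width U16, height U16.
    """
    return struct.pack(
        ">BBHHHH",
        FRAMEBUFFER_UPDATE_REQUEST,
        int(incremental),
        x,
        y,
        width,
        height,
    )


# A request for a small 64x48 rectangle at (16, 32), i.e. part of the screen.
partial = framebuffer_update_request(16, 32, 64, 48)
print(partial.hex())
```

A client that always requests the whole screen would never send a rectangle like this one, which would fit the theory about XenCenter.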

Conclusion

Thanks to our usage of Xen Orchestra, combined with our joint work with the Citrix XenServer teams (and their very fast reaction), we helped fix a QEMU issue. The fix is now backported to the following XenServer versions:

  • Citrix XenServer 6.5 SP1
  • Citrix XenServer 6.5
  • Citrix XenServer 6.2 SP1
  • Citrix XenServer 6.1
  • Citrix XenServer 6.0.2
  • Citrix XenServer 6.0

Not bad, is it?

Working more closely with Citrix helped improve both XenServer and Xen Orchestra. The user experience with this great stack is now better!

A big thanks for their quick reaction :)