Continuous Replication starts failing



  • That's what the log says. Replication seems to use more RAM than other operations.



  • @Danp do you still experience VDI I/O errors? I think we managed to detect a XenServer issue there.



  • @olivierlambert Yes. I recently reinstalled XS on this server and also updated XO to v4.16. Now I can't get a CR job to run successfully for this one VM. I don't know where the issue lies at this time (software, hardware, etc.), but I can assist with gathering some logs if that would help.



  • My steps to reproduce it:

    • set a job to run every minute
    • works fine for the first ~30 replications
    • then starts to fail due to a VHD export issue on XenServer
    • but sometimes still works (around 10% of the time)

    Do you have the same numbers?

    edit: at least now, a failure no longer "breaks" the replication: if a job succeeds on the XenServer side, the replication works for that run.



  • Not exactly. My job was set to run every 5 minutes. It would take approx 1 min to complete and could run for days / weeks before it began failing. Once it began failing, the source snapshots would not be deleted properly until you went and cleaned up everything manually. Not sure if that's been addressed in 4.16.

    When examining the destination VM, one or more of the VDIs would be missing.



  • We improved this in 4.16: the snapshot is now removed if its export fails, so failed runs no longer fill up the snapshot chain \o/

    But indeed, the faulty VDI will be absent on the destination, because XO couldn't download it (due to the export error on the origin host).

    So now, with 4.16, you could retry often and see if it sometimes works again after failing.
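
    The "retry often and count how many runs succeed" idea can be sketched as below. This is only an illustration, not XO code: `run_once` is a hypothetical callable standing in for whatever triggers one CR run and reports success, and `retry_replication` and `success_rate` are names made up for this example.

```python
import time
from typing import Callable, List


def retry_replication(
    run_once: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Run a (hypothetical) replication trigger several times and record outcomes.

    run_once is an assumption: any callable that returns True when a run
    succeeded and False (or raises) when it failed.
    """
    results: List[bool] = []
    for i in range(attempts):
        try:
            ok = bool(run_once())
        except Exception:
            # Treat any error (e.g. an export failure) as a failed run.
            ok = False
        results.append(ok)
        # Wait between runs, but not after the last one.
        if i < attempts - 1:
            sleep(delay_s)
    return results


def success_rate(results: List[bool]) -> float:
    """Fraction of successful runs; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0
```

    Comparing the rate over a long series of runs is what makes numbers like "around 10% of the time" comparable between setups.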



  • @Danp seems to work fine with Dundee (on the source host, destination is okay even in 6.5), but I'll let the schedule run for a while.



  • @olivierlambert Would be good if they could backport the fix to 6.5. 😉



  • It doesn't work like that. It works better because they upgraded a ton of libraries. Plus delta is not officially supported, so there's no chance to have this backported.



  • Okay so I successfully reached 1000 replications without any errors.

