
Veeam Community Forums Digest for dandvo [Mar 26 - Apr 1, 2018]


Veeam Community Forums Digest March 26 - April 1, 2018

THE WORD FROM GOSTEV
What can be worse than a new vSphere changed block tracking (CBT) bug? A CBT bug that even an Active Full backup does not protect against! And unfortunately, we have just confirmed that such a bug exists. Now, I recognize this may look like an April Fool's joke, so to be clear – this is NOT one. We've already demonstrated this bug to VMware support using direct API calls alone, so everyone is at risk no matter what vSphere backup product they are using. However, there is hope that the bug is isolated to a particular storage model and/or Virtual Volumes (VVols) only – otherwise, we'd probably have far more customers reporting failed recoveries.

It all started with a support case from a fresh vSphere 6.0 deployment running VMs with thin disks hosted on VVols backed by Nimble storage. The customer was experiencing classic data corruption issues after full VM restores – the restored VMs had Windows firing up chkdsk, and checkdb was reporting corruption in Microsoft SQL Server databases. This normally points at storage corruption – but the production VMs did not have these issues, while the backup files' content matched its checksums. That made CBT the next suspect, but further troubleshooting revealed that the corruption could occur even when restoring from an active full backup! On the other hand, the issue would not reproduce when CBT was disabled completely. Magic, eh?

But our genius support folks did come up with a way to nail this problem down. They first changed permissions on the vCenter Server account used by Veeam Backup & Replication to make it unable to delete the working snapshots created by a backup job. Then, after reproducing the issue again, they cloned the VM from the corresponding working snapshot, mounted the VMDK of the clone and the VMDK from the full backup file on a Linux box, and did a binary compare – which, not surprisingly, showed a mismatch in some disk areas. And finally, by referring to the debug log of the corresponding job run, they found that the differences were in disk areas that were NOT returned by the QueryChangedDiskAreas() function call with the changeID * parameter.
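The comparison step described above can be sketched roughly as follows. This is only an illustration, not the tooling Veeam support actually used: the helper names, the 4 KB block size, and the in-memory byte strings standing in for the two mounted VMDKs are all assumptions.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start offset, length) in bytes


def mismatched_blocks(a: bytes, b: bytes, block: int = 4096) -> List[int]:
    """Return the start offsets of fixed-size blocks whose contents differ."""
    size = max(len(a), len(b))
    return [off for off in range(0, size, block)
            if a[off:off + block] != b[off:off + block]]


def overlaps(offset: int, length: int, extents: List[Extent]) -> bool:
    """True if [offset, offset + length) overlaps any reported extent."""
    return any(s < offset + length and offset < s + n for s, n in extents)


def unreported_mismatches(snapshot_img: bytes, backup_img: bytes,
                          reported: List[Extent], block: int = 4096) -> List[int]:
    """Mismatching blocks that lie entirely outside the reported extents."""
    return [off for off in mismatched_blocks(snapshot_img, backup_img, block)
            if not overlaps(off, block, reported)]
```

A mismatch inside a reported extent would point at the data path; a mismatch outside every reported extent, as in this case, points at the extent list itself.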

Now, let me step back and explain what this vSphere API function does. It is the cornerstone CBT function, used to query used and changed VMDK blocks. During an incremental run, a backup job passes this function the changeID of the snapshot created by the previous run, and thus gets all blocks changed since the last backup – a very simple concept. During an initial run (that is, a full backup), when there is no previous backup run to reference yet, the special * parameter is passed instead, which makes the function return allocated VMDK blocks only. This dramatically accelerates full backups by avoiding reads through TBs of unallocated (and thus obviously empty) VMDK blocks. But even if a backup vendor chooses not to use this functionality for full backups, this query will still be issued by ESXi itself when CBT is first initialized on a VM – meaning there's no way to avoid it.
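As a rough sketch of the two call patterns just described: the code below picks the changeID and then pages through the disk, collecting extents. The `query` callable is a hypothetical stand-in; with pyVmomi it would presumably wrap `vm.QueryChangedDiskAreas(snapshot=..., deviceKey=..., startOffset=..., changeId=...)`, whose result carries `startOffset`, `length` and a `changedArea` list. `make_fake_query` exists only so the sketch runs without a vCenter.

```python
from types import SimpleNamespace
from typing import Callable, List, Optional, Tuple

Extent = Tuple[int, int]  # (start offset, length) in bytes


def pick_change_id(previous_change_id: Optional[str]) -> str:
    """A full run (no previous backup) uses "*"; an incremental run uses the stored ID."""
    return previous_change_id if previous_change_id else "*"


def collect_areas(query: Callable[[int, str], object],
                  disk_size: int, change_id: str) -> List[Extent]:
    """Page through the disk, gathering every (start, length) extent returned."""
    areas: List[Extent] = []
    offset = 0
    while offset < disk_size:
        info = query(offset, change_id)
        areas.extend((e.start, e.length) for e in info.changedArea)
        if info.length <= 0:  # defensive: never loop forever on an empty answer
            break
        offset = info.startOffset + info.length  # continue where this answer ended
    return areas


def make_fake_query(extents: List[Extent], page: int = 1 << 20):
    """Hypothetical stand-in for the real call: serves extents one page at a time."""
    def query(start_offset: int, change_id: str):
        end = start_offset + page
        hits = [SimpleNamespace(start=s, length=n)
                for s, n in extents if start_offset <= s < end]
        return SimpleNamespace(startOffset=start_offset, length=page, changedArea=hits)
    return query
```

For example, `collect_areas(make_fake_query(extents), disk_size, pick_change_id(None))` walks the whole disk with `"*"`, which is the full-backup case discussed here.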

Bear with me, we're almost there now! There's one key difference in the QueryChangedDiskAreas() logic between the two scenarios I explained above. Using a changeID belonging to a previous VM snapshot will return all changed blocks that were tracked by the ESXi host itself in the CTK file. This functionality has had its own share of bugs in past years, but all of them were fixed, and by now we can be fairly confident that modern ESXi versions track changes reliably, respecting disk resizes, vMotions and so on (you know, all those bugs we've been through). However, when using that special changeID * parameter, this function returns allocated blocks based on the data provided by the storage itself. And at least in this particular case, it appears that the storage provides invalid allocation data.
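To see why bad allocation data defeats even a full backup, here is a toy simulation. It assumes a backup that copies only the extents reported as allocated and treats everything else as zeros; the `full_backup` function, the tiny three-block disk, and both allocation maps are made up for illustration.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start offset, length) in bytes
BLOCK = 4096


def full_backup(disk: bytes, allocated: List[Extent]) -> bytes:
    """Copy only the extents reported as allocated; the rest is assumed to be zeros."""
    image = bytearray(len(disk))
    for start, length in allocated:
        image[start:start + length] = disk[start:start + length]
    return bytes(image)


# Two blocks hold data; the third is genuinely empty.
disk = b"\x01" * BLOCK + b"\x02" * BLOCK + b"\x00" * BLOCK

correct_map = [(0, 2 * BLOCK)]   # what the storage should report
faulty_map = [(0, BLOCK)]        # the storage omits the second block
whole_disk = [(0, len(disk))]    # CBT disabled: read everything

good = full_backup(disk, correct_map)
bad = full_backup(disk, faulty_map)
everything = full_backup(disk, whole_disk)
```

Here `good` and `everything` match the disk, while `bad` silently holds zeros where the second block should be. That fits the observation that the issue did not reproduce with CBT disabled.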

According to the latest update, VMware is now working on a tool that should help confirm that this bug does indeed lie with the particular storage. I will keep you posted as we learn more – meanwhile, as always, remember to test your backups! And a big shout-out to the many VMware teams involved – SDK support, the VADP team and the VVOL team, to name a few! We had an absolutely incredible collaboration working with them on this issue, receiving very prompt responses and seeing great involvement with what in the end appears most likely to be a 3rd party vendor bug. A very refreshing experience, for sure!

BEST POST OF THE WEEK
Re: Version 10 has disappeared   [BY: Joey Famiglietti • LIKED: 5 times]
Hello Gostev!

In the genuinely nicest, most sincerely respectful, and unwaveringly gratitude-filled manner which can possibly be conveyed through a forum post...please read rgreen83's post repeatedly until the "no more version numbers" idea ends up in the dumpster out back.

The idea that version numbers are passé is the result of letting marketing run the show. Everyone running on the latest release is not going to happen. There will always be someone like me around who has specific requirements, and if at any point deprecating compatibility or API calls is planned, there will always be someone who needs whatever it is you deprecated. As always, this has been described by the folks over at XKCD: https://xkcd.com/1172/.

But don't take it from me. Spend an hour or three with the tier one support reps and ask them how many issues are identified by knowing what version number the person is on. I guarantee you that it is a massive help.

"But Voyager, the solution is simple: we only support the most recent release. If users don't update, we can't provide support, so that can be solved with the 'check for updates' command; users can call back when it says 'no updates available'." Alright, fine. That goes back to the issue of deprecating compatibility. Either remaining compatible with everything will be an albatross, or there will be users who are happy to remain on older releases to retain their needed compatibility. If given a choice between "upgrade your production software" and "use a different backup software", one of those is far easier than the other. It's not a threat, it's just a reality. Backups are the sort of thing that can be messed with during the business day. Production systems are not.

"But Voyager, all the Veeam systems you deal with are Server 2012 or newer, or VMware 6.0 or newer, both of which will be supported by Veeam well into the next decade. Why are you so worried about something that is very unlikely to impact you?" Because agile development and constant updating keep making a foundational assumption: what's newer is more desirable than what is already in place. If this were universally true, sites like oldversion.com and oldapps.com wouldn't exist. Windows Updates have caused more problems than they've solved, have not added a single useful feature since the RTM release, and have wasted millennia of man hours installing and breaking. Oracle keeps updating the JVM for Windows, but all I have gotten with every release is a "security change" that just makes it progressively more difficult to manage my JetDirect cards and iDRAC interfaces, the only two things in my environments that still use it. Agile development is a mindset that end users didn't ask for. The rest of us want stability, the ability to schedule maintenance windows, and to know that we won't have to choose between 'support' and 'a stable environment'.

"But Voyager! We're not an OS, we've got a perfect update system in place, our support reps are ready to do their magic every time, and we don't let the UX department rearrange things when they feel like it! You can trust us on this one, you'll never have the issues you've experienced with everyone else, because we thought of everything you're worried about and won't be removing version numbers until our system avoids every problem you've had!" I'll believe it when I see it.

Okay, I'm done with my rant because I've got issues to fix. Thanks for reading.

-V5
TOP CONTENT
Physical proxy, Server 2012 R2 or Server 2016?   [VIEWS: 193 • REPLIES: 7]
Just received a brand new HPE server that will be a Veeam physical proxy server for SAN based backups
2 CPUs, 16 cores each, 32 cores total, 2.60GHz, 64GB RAM, 10Gbe connectivity
On our current 6 year old proxy server we are running Server 2012 R2
Backup Optimization: Storage, block sizes, RAID, backups...   [VIEWS: 173 • REPLIES: 6]
I am on a quest to speed up our backup times without changing our backup targets (we just got the Synology targets last year).... Below is a summary thus far... I definitely need more eyes on this as I am killing myself and my wife has become a widow due to the amount of time I am spending on this.
Access to VCSP Forum   [VIEWS: 168 • REPLIES: 10]
Hello, guys.
My request for joining in CSP group for access to Cloud & Service Providers Forum was rejected.
How can I confirm my affiliation with the providers of cloud services?
Disk space drops on Veeam server   [VIEWS: 164 • REPLIES: 5]
I've noticed over the 2 years we've been running Veeam, that twice there have been big drops in disk space on C: the drive where Veeam is installed, and the disk space has never recovered.
I've discovered two files matching the size and dates of these drops:
veeamflr-000043e300001b02-0000.
Veeam Agent for Windows paid needs VBR?   [VIEWS: 157 • REPLIES: 6]
Our company is using KVM for virtualization. We want to use the paid Veeam Agent for Windows on our Exchange and AD servers. Do we need a VBR license to be able to use shared storage for backups, or is VBR free enough?
We will install VBR in a Windows VM.
YOUR CONTENT
None of the topics you have contributed to has been updated this week.


