Last week, we updated the Veeam Agent for Windows update servers with version 2.1, so checking for an update will now show the new version as available. One of the top questions we get about the agent is why an existing installation does not see the latest version: we simply like to take time to ensure the stability of new installations before pushing the update to our entire install base. And this time we waited longer than usual, because 2.1 introduces the new optional changed block tracking driver for servers.

The official announcement of Windows Server 2019 is out, so now we know what the new LTSC release of Windows Server will be called. It also looks like it is scheduled to ship at the same time as the RS5 SAC release, towards the end of this year. But before that we will see RS4, which I understand should become available any day now. Our planned release vehicle for RS4 support is what I previously referred to as Update 3a (name subject to change). This will be a fairly "light" update centered around new platform support, including VMware vSphere 6.7 and vCloud Director 9.1, which also means its timeline will largely depend on when the last of these platforms becomes generally available.

If you saw my Data Corruption 101 or Backup Repository Best Practices breakout sessions at past VeeamONs, you know we don't recommend SMB backup targets, because they are the top source of backup corruption cases that we see in support. But what exactly is the issue here? I had to explain it to someone last week, so I thought the information would be valuable to many, especially since the issue can affect any application.

Here is what causes it. When an application writes data to an SMB share using WinAPI, it receives Success for the I/O operation prematurely, right after the corresponding data is placed into the Microsoft SMB client's sending queue. If the connection to the share is subsequently lost, the queue remains in memory and the SMB client waits for the share to become available again so it can finish writing the cached data. However, if the connection is not restored in a timely manner, the contents of the queue are lost forever. Apparently, a corresponding event is even logged to the system event log when this happens, but I was unable to track one down quickly. As a result, the application thinks the data was written to disk when in reality it was not. In their stress tests, our QC folks saw up to 3 GB of data lost to this issue by comparing the data that our data mover thought was successfully written to the backup file with what actually landed on the storage backing the share. And perhaps even scarier, WinAPI kept returning Success on writes even AFTER the share had already been made unavailable.

However, this does not automatically make every SMB share a bad candidate for hosting backups. The issue above was actually one of the reasons why Microsoft introduced SMB Transparent Failover in Windows Server 2012. Basically, when using file shares with the Continuous Availability (CA) flag set, the Microsoft SMB client will not return Success until the I/O has landed on stable media on the server side (see the sketch below). As an added benefit, active SMB handles survive a failover to another node, so the application may experience only a temporary stall in I/O, but no timeouts or broken handles.
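To make the difference tangible, here is a minimal WinAPI sketch in C, purely illustrative and not our data mover code. The UNC path is made up; the API calls (CreateFileA, WriteFile, FlushFileBuffers) are standard WinAPI. The point is that a plain buffered WriteFile can be acknowledged from the local SMB client's cache, whereas FILE_FLAG_WRITE_THROUGH plus an explicit flush asks for the data to reach stable storage before success is trusted, which is essentially the semantic that CA shares enforce, as described above.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical UNC path, replace with a real share if you want to test this. */
    const char *path = "\\\\fileserver\\backups\\test.bin";

    /* FILE_FLAG_WRITE_THROUGH asks the stack (including the SMB client) to push
       writes through rather than acknowledge them from cache alone. Without it,
       WriteFile to an SMB path may report Success as soon as the data is queued. */
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    char buf[4096] = {0};
    DWORD written = 0;
    if (!WriteFile(h, buf, sizeof(buf), &written, NULL)) {
        fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());
        CloseHandle(h);
        return 1;
    }

    /* FlushFileBuffers forces any remaining buffered data to stable storage;
       a failure here is the signal that Success on WriteFile alone was not enough. */
    if (!FlushFileBuffers(h)) {
        fprintf(stderr, "FlushFileBuffers failed: %lu\n", GetLastError());
        CloseHandle(h);
        return 1;
    }

    CloseHandle(h);
    printf("Wrote and flushed %lu bytes\n", written);
    return 0;
}
```

Whether write-through is honored end to end still depends on the server implementation, which is why the CA flag on the share side matters as much as what the application requests.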
One of our customers who runs such a Continuously Available backup repository did a "chaos monkey" exercise, resetting cluster nodes while Veeam backup jobs were running, and could not cause any data loss or even a backup job interruption no matter what. He was very impressed. So if you need a highly available backup repository, this could be the way to go, although the setup does require Windows Failover Clustering, of course.

Last but not least... before some of you walk away thinking "meh, that's why I use NFS instead of SMB", I have something for you too! Apparently, things can go very wrong in that kingdom as well. We recently had to research a data corruption issue that was driving us insane: full VM restores over the Direct NFS transport from known-good backups would produce corrupted virtual disk content. We were convinced our data movers simply could not have such bad bugs after so many years, but it seemed like the only possible explanation! Finally, after some hardcore troubleshooting, the root cause turned out to be the NAS appliance caching WRITE operations on its end, keyed on the combination of client IP, client port and xid (the RPC transaction ID). So, with a parallel disk restore carried out by multiple data movers, only the first one would "win" the write, while subsequent matching write requests coming from other data movers running on the same proxy were simply ignored. The NFS clients, however, would still receive Success from the NFS server for the corresponding write operations! We did manage to work around this particular peculiarity by making the xid unique within the backup proxy (illustrated in the sketch below), but who knows what other "optimizations" are lurking out there in all the different NFS server implementations?

All of these issues are basically why I will never get tired of recommending general-purpose servers instead of NAS appliances for backup repositories... having the data mover run locally removes those extra protocol stacks from the equation entirely, and that can't be bad.
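To illustrate the NFS story, here is a small self-contained C sketch of a server-side duplicate request cache keyed only on (client IP, client port, xid). It is a toy model, not any particular NAS vendor's implementation, and all names (drc_is_duplicate, write_request) and values in it are made up, including the simplifying assumption that both data movers share one client port. It shows how two data movers on the same proxy that reuse an xid both receive Success while only the first write is applied, and why making the xid unique per data mover avoids the collision.

```c
#include <stdio.h>
#include <stdint.h>

/* Toy model of an NFS server's duplicate request cache (DRC) that keys
   replies on (client IP, client port, xid) only. Real servers differ. */
#define DRC_SIZE 64

typedef struct {
    uint32_t ip;    /* client IP address            */
    uint16_t port;  /* client source port           */
    uint32_t xid;   /* RPC transaction ID           */
    int      used;
} drc_entry;

static drc_entry drc[DRC_SIZE];
static int drc_next = 0;

/* Returns 1 if the request looks like a retransmission: the server replays
   the cached "Success" reply and does NOT apply the write again. */
static int drc_is_duplicate(uint32_t ip, uint16_t port, uint32_t xid)
{
    for (int i = 0; i < DRC_SIZE; i++) {
        if (drc[i].used && drc[i].ip == ip && drc[i].port == port && drc[i].xid == xid)
            return 1;
    }
    drc[drc_next] = (drc_entry){ .ip = ip, .port = port, .xid = xid, .used = 1 };
    drc_next = (drc_next + 1) % DRC_SIZE;
    return 0;
}

static void write_request(const char *mover, uint32_t ip, uint16_t port, uint32_t xid)
{
    if (drc_is_duplicate(ip, port, xid))
        printf("%s: xid %u -> cached 'Success' replayed, data SILENTLY DROPPED\n",
               mover, (unsigned)xid);
    else
        printf("%s: xid %u -> write applied, reply 'Success'\n", mover, (unsigned)xid);
}

int main(void)
{
    uint32_t proxy_ip = 0x0A000001;  /* both data movers run on the same proxy */

    /* Two data movers start their xid counters from the same seed... */
    write_request("mover A", proxy_ip, 700, 1);
    write_request("mover B", proxy_ip, 700, 1);  /* collides: dropped, yet "Success" */

    /* ...making the xid unique per data mover avoids the collision. */
    write_request("mover B", proxy_ip, 700, 0x80000001u);
    return 0;
}
```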