
Data integrity and validation

John Rusk [MSFT] edited this page Feb 18, 2020 · 17 revisions

HTTPS

For security, we recommend using HTTPS, rather than HTTP, whenever you use AzCopy. HTTPS is a tamper-resistant protocol that protects against entities on the network deliberately changing data. Because of this tamper resistance, it also protects against accidental network-level errors.

MD5 Hashes

AzCopyV10 supports MD5 hashes to validate the integrity of file contents. To opt in to this mechanism, include --put-md5 on the command line when uploading to Azure. Note that the actual check does not happen until the uploaded blob is used (i.e. downloaded) by AzCopy or another MD5-aware tool.

The overall process looks like this:

  1. At upload time, the hash of the original disk file is computed and stored against the blob.
  2. At download time, when the file is written to disk, a new hash is computed. This new "download time" hash is compared to the original hash from the time of upload. If they match, the downloaded file, as written to disk, matches the original file as read at the time of upload. By default, AzCopy signals a failure if they don't match. This behavior can be configured with the --check-md5 flag. The default is to check hashes for all blobs that have them, and to do no check on blobs that have no hash.
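
The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not AzCopy's actual implementation: the function names `md5_base64` and `check_download` and the `fail_if_missing` parameter are hypothetical, and the base64 encoding is an assumption chosen to mirror how an MD5 is commonly stored in a blob's Content-MD5 property.

```python
import base64
import hashlib
import io


def md5_base64(stream, chunk_size=4 * 1024 * 1024):
    """Hash a binary stream in chunks and return the base64-encoded MD5 digest."""
    digest = hashlib.md5()
    # Read in fixed-size chunks so large files never need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def check_download(stored_md5, downloaded_stream, fail_if_missing=False):
    """Return True if the download passes the check, False if it fails.

    stored_md5 is the hash recorded at upload time, or None if the blob has none.
    fail_if_missing=False imitates the default described above (no check when
    there is no hash); True imitates the stricter behavior (hypothetical name).
    """
    if stored_md5 is None:
        return not fail_if_missing
    return md5_base64(downloaded_stream) == stored_md5


original = b"example file contents"

# Step 1: at upload time, hash the source and record it against the blob.
stored = md5_base64(io.BytesIO(original))

# Step 2: at download time, hash what was written and compare.
intact = check_download(stored, io.BytesIO(original))
corrupted = check_download(stored, io.BytesIO(b"example file c0ntents"))

# Blobs with no stored hash: skipped by default, a failure in strict mode.
no_hash_default = check_download(None, io.BytesIO(original))
no_hash_strict = check_download(None, io.BytesIO(original), fail_if_missing=True)
```

In this sketch a single changed byte is enough to make the comparison fail, which is the property the download-time check relies on.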

Checking lots of blobs

If you have uploaded a large amount of data, and want to check the MD5 hashes without the time and cost of downloading it, AzCopyV10 offers a shortcut that may help. Instead of downloading to your own premises, you can download to an Azure VM, and configure AzCopy to check the hashes but not actually save any data to the VM. Because it's not saving any data, it can run very fast, and you can check as much data as needed without provisioning any disks.

To check MD5s in this way, use a command line like this:

  azcopy copy <url to the data you want to check> NUL --check-md5 FailIfDifferentOrMissing

Use NUL on Windows and /dev/null on Linux and macOS. Two points matter in this command line: the destination is NUL (or /dev/null), so the data never gets saved; and the --check-md5 flag is set to its strictest setting, which reports a failure for any blob that does not have an MD5. (That's stricter than the default, which does not report a missing MD5 as an error.)
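
If you drive AzCopy from a script, the platform-specific destination can be chosen automatically. The following is a minimal sketch, assuming `azcopy` is on the PATH; the helper names `null_destination`, `build_check_command`, and `run_check` are hypothetical, and the URL in the example is a placeholder.

```python
import os
import subprocess


def null_destination():
    """Return the discard destination named above: NUL on Windows, /dev/null elsewhere."""
    return "NUL" if os.name == "nt" else "/dev/null"


def build_check_command(source_url, azcopy_path="azcopy"):
    """Build the argument list for a hash-only check of source_url.

    Mirrors the command line shown above; nothing is written to disk because
    the destination is the null device.
    """
    return [
        azcopy_path,
        "copy",
        source_url,
        null_destination(),
        "--check-md5",
        "FailIfDifferentOrMissing",
    ]


def run_check(source_url):
    """Run the check; raises subprocess.CalledProcessError on a non-zero exit code."""
    subprocess.run(build_check_command(source_url), check=True)


# Placeholder URL for illustration only; substitute the data you want to check.
command = build_check_command("https://example.invalid/container")
print(" ".join(command))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with URLs that contain characters such as `&` or `?`.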

If you have terabytes of data to check, you should use a relatively large VM (e.g. 16 cores) to maximize throughput. If the VM is in the same region as your Storage account, you won't be charged any data egress fees when running this check.
