Disk and service constraint indicators in AzCopy

John Rusk [MSFT] edited this page May 29, 2020 · 21 revisions

This page explains when AzCopy v10 displays the messages ‘Disk may be limiting speed’, ‘Service may be limiting speed’, or ‘PageBlobService may be limiting speed’. It covers the on-screen messages; see the final heading below for how to get the same information out of the logs.

Disk may be limiting speed

Never displays in the first 30 seconds of a job, since we want things to stabilize before we consider displaying this message.

After the first 30 seconds of a job, it displays in the following cases:

  1. If data is moving over the network faster than it can be read from or written to disk. When uploading, we detect this by measuring the queue of file chunks that have been read from disk but not yet sent out over the network. If that queue is full, disk is faster than network (so no message is displayed); if the queue is empty, network is faster than disk, so we display the message. (In practice, the queue is virtually never half full. It tends to be either full or empty.) Similar logic applies in reverse for downloading. Suggested steps in this case:

    1. If uploading a small number of large files on Linux, first try this
    2. Explain to the end user that their speed appears to be limited by disk performance
    3. There is currently only one option in AzCopy v10 to tune disk read behavior. It is the environment variable AZCOPY_CONCURRENT_FILES. It defaults to 64 and represents the number of files that AzCopy should read from disk concurrently. (It does not control the number of TCP connections that are used. That number is controlled separately by AZCOPY_CONCURRENCY_VALUE). In most cases, adjusting AZCOPY_CONCURRENT_FILES is not likely to make a significant difference to performance, but if desired you can experiment with increasing or decreasing it (e.g. to 16 or 256).
    4. For uploads, use the azcopy benchmark command to test how fast AzCopy can upload a payload of random data, which is generated without touching disk. If that's much faster than what you see when reading from disk, it provides added confirmation that disk is the issue.
    5. For downloads, first check whether the benchmark command supports downloads in your version of AzCopy. (As of May 2020, this feature is in development.) If you don't have benchmark support for downloads, you can test how fast the tool can go without the disk constraint by downloading to /dev/null (on Linux) or to NUL (on Windows). Note that there is only one L in NUL on Windows. Downloading to this destination just throws the data away after it is received, so it's a true test of download speed without any disk constraint.
  2. If scanning is so slow that we can’t feed new files quickly enough to the rest of AzCopy. This can happen if there are millions of small files. It might also happen when reading from an older SAN or NAS where enumerating directory contents is very slow. We detect it using the same queue-based method as #1. The difference here is that AzCopy’s on-screen status line will say “scanning” and the number of "Pending Files" reported on screen will dip down close to zero. You can also tell from the logs that scanning is still happening: while scanning is in progress, we log lines that look like this: scheduling JobID=, Part#=x, Transfer#=y, priority=0. Suggested support steps in this case:

    1. Update to version 10.5 or later, since it will support parallel enumeration of local file systems. This should help in cases where the data is in a hierarchy of folders, but it won't help if the data structure is flat (e.g. everything in the root directory).
    2. If running on Linux with version 10.5 or later:
      1. Experiment with setting the environment variable AZCOPY_PARALLEL_STAT_FILES to true and/or experiment with setting relatively high values for AZCOPY_CONCURRENT_SCAN (e.g. 128 or 256).
      2. If reading from a network location (e.g. NAS) experiment with using the SMB protocol instead of Linux's default NFS protocol. Most Linux distros support operating as an SMB client, using the smbfs package (or similar). We have some reports of SMB being faster with AzCopy for very high numbers of small files, but not enough info yet to know whether it's useful general advice, or just unique to the environment(s) where it was reported.
    3. Explain to the end user that the cause appears to be due to the time taken to scan the file system.
    4. Contact the product team, because if these cases are happening we need to know about that, so that we can make decisions about whether additional performance tuning is necessary.
    5. If possible, supply the AzCopy log file to the product group.
  3. (This last case only affects customers with unusually fast networks.) If the number of files in the job is fewer than 10, the user has included --put-md5 on the command line, and the available network bandwidth is greater than about 3 Gbps, there’s currently an error where it can say “Disk may be limiting speed” when really it’s our MD5 computations that are limiting the speed.

    1. Suggested steps in this case
      • Does the user have lots of other files that also need uploading? If so, they should use larger upload jobs (i.e. more than 10 files per job). If they do that, this problem will almost always go away.
      • Also, if the user would like to run a test without --put-md5, that will show them the true disk performance without the overhead of MD5 calculations. For some customers, this will be just a test, because for production usage they will need the MD5 hash. But others may not need the hash, and they can run in this faster mode as a solution to the problem. See more details on this page.
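
The queue-based check described in cases 1 and 2 above can be sketched roughly as follows. This is an illustrative Python model, not AzCopy's actual code: the names `QueueSample` and `disk_may_be_limiting` and the two fill thresholds are hypothetical, and the download branch assumes that "in reverse" means a full queue of received-but-unwritten chunks indicates slow disk.

```python
from dataclasses import dataclass

WARMUP_SECONDS = 30  # from the text: no message in the first 30 seconds

# Hypothetical thresholds. The text only says the queue tends to be either
# full or empty, so these exact cut-offs are assumptions for illustration.
NEARLY_EMPTY = 0.1
NEARLY_FULL = 0.9


@dataclass
class QueueSample:
    elapsed_seconds: float  # time since the job started
    queued_chunks: int      # chunks currently waiting in the queue
    capacity: int           # maximum number of chunks the queue can hold


def disk_may_be_limiting(sample: QueueSample, uploading: bool) -> bool:
    """Return True when the queue state suggests disk is the bottleneck."""
    if sample.elapsed_seconds < WARMUP_SECONDS:
        return False  # let things stabilize before judging
    fill = sample.queued_chunks / sample.capacity
    if uploading:
        # Queue holds chunks read from disk that are waiting for the network.
        # An empty queue means the network drains it faster than disk fills it.
        return fill <= NEARLY_EMPTY
    # Download (assumed mirror image): queue holds chunks received from the
    # network that are waiting to be written. A full queue means disk lags.
    return fill >= NEARLY_FULL
```

For example, `disk_may_be_limiting(QueueSample(60, 0, 100), uploading=True)` models an upload whose queue is empty one minute into the job, which is the situation where the message would appear.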

PageBlobService may be limiting speed

Never displays in the first 30 seconds of a job, since we want things to stabilize before we consider displaying this message.

After the first 30 seconds of a job, it displays in the following case:

  1. If the job includes at least one page blob and the Storage Service is limiting the transfer rate for that page blob. Page blobs can have per-blob throughput limits, and in many cases those limits are stricter than the overall throughput limit on the storage account. If those limits are affecting the transfer of any file in the job, this message will be displayed.

    1. Suggested steps in this case
      • Check the max throughput for a page blob of similar size to yours. (Check the units of speed there: they may be megabytes per second (capital B), whereas AzCopy displays megabits per second (lower-case b).) Note that specific limits are published on that page only for Premium Page Blobs. If you are using Standard Page Blobs, you will see similar behavior from AzCopy if and when the Storage Service returns Server Busy responses.
      • Explain the situation to the end user
      • If the user needs to upload multiple page blobs to the same container, try to upload them all simultaneously in a single AzCopy job (rather than run separate jobs, one per file). Processing them all together should give you speed equal to the total throughput limit for all the blobs (assuming no other constraints, such as disk or network).
      • If you want to see the decisions that AzCopy is making about speed, search the log file for “Page blob”. This will show you two types of log lines: lines where the Service returned a 503 status (telling us to slow down), and lines that report the speed that AzCopy has chosen as a result of the 503s. AzCopy chooses a separate speed for each page blob, based on the 503s for that particular blob. The speed will drop after a burst of 503 messages, climb back towards its last-known-best speed, linger there for a while, then probe upwards to see if it can go any faster.

If you want to extract all the relevant log lines to a separate file, you can use a PowerShell command like this one (substituting the correct name for your log file):

select-string "Page blob" .\858ebb30-4796-914f-632e-dc355cda0e1c.log | Out-File pageBlobLines.txt
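
When comparing a documented per-blob limit with the speed AzCopy shows, the unit caveat above (bytes versus bits) is easy to get wrong. The small Python sketch below shows the conversion, and also the combined limit you could expect when several page blobs are uploaded in one job, assuming no other constraint applies. The function names are hypothetical, and the numbers in the usage note are made-up placeholders, not published Azure limits.

```python
from typing import Iterable


def megabytes_to_megabits_per_sec(mb_per_sec: float) -> float:
    """Convert megabytes per second to megabits per second (1 byte = 8 bits)."""
    return mb_per_sec * 8


def combined_limit_megabits_per_sec(per_blob_limits_mb_per_sec: Iterable[float]) -> float:
    """Sum per-blob limits (given in megabytes/s) and express the total in megabits/s.

    Assumes the blobs are transferred together in one job and that nothing
    else (disk, network, account limit) is the bottleneck.
    """
    return sum(megabytes_to_megabits_per_sec(x) for x in per_blob_limits_mb_per_sec)
```

For instance, a placeholder limit of 100 megabytes per second corresponds to 800 megabits per second on AzCopy's display, and two blobs with placeholder limits of 100 and 200 megabytes per second give a combined 2400 megabits per second.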

Service may be limiting speed

(Not to be confused with PageBlobService may be limiting speed, above)

Never displays in the first 30 seconds of a job, since we want things to stabilize before we consider displaying this message.

This message is displayed if AzCopy is receiving "Server Busy" responses from Azure (i.e. HTTP status 503). Seeing a few Server Busy messages is fine. AzCopy will automatically retry the affected operations, with an exponential delay.
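
To illustrate what "exponential delay" means in general, here is a minimal Python sketch of a capped exponential backoff schedule. It is not AzCopy's retry policy: the function name, the 1-second base, and the 60-second cap are all hypothetical values chosen for the example, and real implementations often add random jitter as well.

```python
from typing import List


def retry_delays(attempts: int, base_seconds: float = 1.0,
                 max_seconds: float = 60.0) -> List[float]:
    """Return the wait before each retry: base * 2^n, capped at max_seconds.

    base_seconds and max_seconds are assumed values for illustration only.
    """
    return [min(base_seconds * 2 ** n, max_seconds) for n in range(attempts)]
```

With these assumed values, the wait doubles on each retry until it reaches the cap, so a short burst of 503s costs little time while a sustained one backs off quickly.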

Suggested steps in this case:

  1. If the message is displayed only infrequently and intermittently, it's probably normal and can be ignored.
  2. Otherwise, check your throughput against the publicly documented throughput and IOPS limits for the particular type of Storage Service that you are using. (If using AzCopy at the default logging level, the number of IOPS achieved is shown near the end of the log file.) If you are approaching the throughput or IOPS limits, then seeing this message is expected.
  3. If you are not approaching the throughput or IOPS limits, and you are seeing this message persistently, then check the documentation on High Throughput Block Blobs and Premium Block Blobs. If, after checking that documentation, it appears that you should not be seeing this message, please contact the product team.

Note on use of log files, and brief tips on debugging other performance issues not covered above

IMPORTANT: Before attempting to diagnose performance issues from the log files, try using AzCopy's benchmark mode. It will conduct diagnostic tests for you, and report on most common performance issues automatically.

When searching AzCopy 10.0.9 log files, it is advisable to use grep (if on Linux) or select-string (if in PowerShell on Windows). For large log files, that’s much more practical than trying to open the whole file in a text editor. For example, in PowerShell here’s how to extract all the performance information from a log file into a separate file named “perfLines.txt”:

select-string PERF .\858ebb30-4796-914f-632e-dc355cda0e1c.log | Out-File perfLines.txt

The Linux equivalent is

grep PERF ./858ebb30-4796-914f-632e-dc355cda0e1c.log > perfLines.txt

The performance log lines look something like this:

858ebb30-4796-914f-632e-dc355cda0e1c.log:94549:2019/04/03 06:42:16 PERF: primary performance constraint is Unknown. States: R: 0, D: 289, W: 1046, F: 0, B: 12, T: 1347

The section that says “primary performance constraint is” corresponds to the messages described above. It may say that the primary constraint is:

  • Disk (this is what gets logged when the screen says “Disk may be limiting speed”)
  • PageBlobService (this gets logged when the screen says “PageBlobService may be limiting speed”)
  • Service (this gets logged when the screen says “Service may be limiting speed”)
  • Unknown. This is logged in all other cases. i.e. when there is no limit message on screen.
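
If you have many extracted PERF lines, a short script can pull out the constraint and the state counts for comparison. The Python sketch below is based only on the single sample line shown above, so the regular expression is an assumption about the format; other AzCopy versions may log different fields. The name `parse_perf_line` is hypothetical.

```python
import re

# Pattern inferred from the one sample line in this page (an assumption).
PERF_RE = re.compile(
    r"PERF: primary performance constraint is (\w+)\. States: (.*)$"
)


def parse_perf_line(line: str):
    """Return (constraint, {state_letter: count}) or None if not a PERF line."""
    match = PERF_RE.search(line)
    if match is None:
        return None
    constraint = match.group(1)
    states = {}
    # The states section looks like "R: 0, D: 289, W: 1046, ..."
    for part in match.group(2).split(","):
        key, _, value = part.partition(":")
        states[key.strip()] = int(value.strip())
    return constraint, states


sample = ("858ebb30-4796-914f-632e-dc355cda0e1c.log:94549:2019/04/03 06:42:16 "
          "PERF: primary performance constraint is Unknown. "
          "States: R: 0, D: 289, W: 1046, F: 0, B: 12, T: 1347")
constraint, states = parse_perf_line(sample)
```

Here `constraint` holds the word after "primary performance constraint is", and `states` maps each letter to its count, which makes it easy to track a value such as "B" across many lines.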

When it says “Unknown” that doesn’t mean that there is nothing limiting throughput (there’s always something). It just means that AzCopy hasn’t figured out exactly what the constraint is. Possible constraints include:

  • CPU (to diagnose, check CPU usage on the AzCopy machine, using Task Manager or similar. Solution is to run on a machine with more CPUs, if possible).
  • Memory (to diagnose, check memory usage on the AzCopy machine, using Task Manager or similar. Solution is to ensure no other apps are using a lot of RAM on the same machine).
  • Specs of network card/interface
    • check specs of network adapter (including both host and VM if virtualized). E.g. you won’t get more than 1 Gbps if you only have a 1 Gbps network card.
    • If an Azure VM, check max documented network throughput for the size of VM that is being used. E.g. see public docs such as: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general
    • Configuration of network interface (downloads only). For downloads, sometimes it can help to configure the network adapter/interface to use larger buffers.
    • Provisioned network bandwidth (if used on-premises). To diagnose, discuss with networking staff at the customer. Consider both the pipe out to the internet and the internal network. E.g. you can’t fill a 10 Gbps internet pipe if the machine is connected to a portion of the internal network that only supports 1 Gbps.
    • Available network bandwidth (this is provisioned bandwidth, minus bandwidth used by other traffic). This can be a tricky one to diagnose. The easiest way is probably to ask networking staff at the customer if they have records or telemetry of typical throughput when AzCopy is not running.
  • AzCopy concurrency value. In the PERF lines from the log, extracted as above, look at the number after “B” (which stands for HTTP Body). If uploading, use the “B” value on its own; but if downloading, use “H” plus “B”. If that number is typically about the same in all the PERF lines, then see if it is approximately equal to the AZCOPY_CONCURRENCY_VALUE. That value defaults to 32 when the machine has 4 or fewer CPUs, 300 when the machine has more than 18 CPUs, and 16 * number-of-CPUs in all other cases. If the number you find in the logs seems to be consistently equal to the AZCOPY_CONCURRENCY_VALUE, you can try a higher value by setting an environment variable called AZCOPY_CONCURRENCY_VALUE to a higher number (e.g. 500).
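
The default rule for the concurrency value described in the last bullet can be written out as a small Python sketch, together with a rough check for whether the counts seen in the PERF lines are sitting at that value. The function names and the 10% tolerance are hypothetical; only the 32 / 16-per-CPU / 300 rule comes from the text above.

```python
from typing import Sequence


def default_concurrency(cpu_count: int) -> int:
    """Default AZCOPY_CONCURRENCY_VALUE as described in the text above."""
    if cpu_count <= 4:
        return 32
    if cpu_count > 18:
        return 300
    return 16 * cpu_count


def concurrency_looks_saturated(active_counts: Sequence[int], concurrency: int,
                                tolerance: float = 0.1) -> bool:
    """True if every sampled count is within `tolerance` of the concurrency value.

    active_counts is the "B" value per PERF line for uploads, or "H" plus "B"
    for downloads. The tolerance is an assumed value for illustration.
    """
    if not active_counts:
        return False
    return all(abs(count - concurrency) <= tolerance * concurrency
               for count in active_counts)
```

For example, on an 8-CPU machine the default works out to 128, so PERF lines that consistently show about 128 would suggest trying a higher AZCOPY_CONCURRENCY_VALUE.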

Other

Supply the extracted PERF lines to the product team for analysis. And, in the case of page blobs, also supply the extracted page blob perf info, like this:

(Windows)

select-string "Page blob" .\858ebb30-4796-914f-632e-dc355cda0e1c.log | Out-File pageBlobLines.txt

(Linux)

grep "Page blob" ./858ebb30-4796-914f-632e-dc355cda0e1c.log > pageBlobLines.txt
