-
Notifications
You must be signed in to change notification settings - Fork 76
RFC-0025 Improving incremental builds #39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,243 @@ | ||
# Improving incremental builds | ||
**Authors:** | ||
@peterbell10 | ||
|
||
## **Summary** | ||
|
||
This RFC proposes changes to ATen that will allow more granular header | ||
dependencies and tools to enforce their usage across the codebase. | ||
This should greatly improve incremental and cached build performance. | ||
|
||
## **Motivation** | ||
|
||
In ATen, certain generated headers files include the definitions of | ||
every operator in PyTorch. Consequently any new operator, or update | ||
to any of ATen's over 1000 existing operators in | ||
`native_functions.yaml`, requires a full rebuild of ATen and torch. | ||
These problematic headers include `Tensor.h`, `Functions.h`, | ||
`NativeFunctions.h`, which are all widely used and currently | ||
unavoidable. | ||
|
||
For example, adding a new operator that isn't referenced anywhere | ||
results in 1000+ files being rebuilt over many hours of CPU time | ||
(single threaded). After the proposed changes you can expect at most a | ||
small handful of files to be rebuilt, most of which should contain | ||
code generated specifically for that new operator. | ||
|
||
## **Proposed Implementation** | ||
|
||
This RFC proposes two changes to allow breaking up these header | ||
dependencies, and avoid unnecessary header dependencies in ATen. | ||
|
||
1. Add a base class to `Tensor` called `TensorBase` which has no | ||
method operators but is otherwise fully interoperable. This is | ||
particularly useful for kernel code which often doesn't call any | ||
operators but still needs to pass `Tensor`s around. | ||
2. Split code generated header files up on a per-operator basis, so | ||
only necessary operators need to be included. These headers still | ||
depend on `Tensor.h` so will depend on method operators. | ||
|
||
Additionally, two macros are introduced which can be defined to | ||
enforce strict adoption of these more granular headers: | ||
|
||
- `TORCH_ASSERT_NO_OPERATORS` when defined, will make inclusion of any | ||
code generated from the operator list into a compiler error. In | ||
practice this limits you to only use `TensorBase` and `TensorIterator`. | ||
- `TORCH_ASSERT_ONLY_METHOD_OPERATORS` which allows `Tensor.h` to be | ||
included, but not the other umbrella headers. Instead specific | ||
operators must be included directly as they are used. | ||
|
||
These will be implemented by inserting preprocessor guards into the | ||
relevant header files, with `ATen/core/TensorBody.h` as an example | ||
```cpp | ||
#ifdef TORCH_ASSERT_NO_OPERATORS | ||
#error This change adds a dependency on native_functions.yaml, \ | ||
meaning the file will need to be re-compiled every time an operator \ | ||
is changed or added. Consider if your change would be better placed in \ | ||
another file, or if a more specific header might achieve the same goal. \ | ||
See NOTE: [Tensor vs. TensorBase] | ||
#endif | ||
``` | ||
|
||
This assists in first detecting problematic includes, even when | ||
buried in layers of transitive includes. Then further with | ||
maintaining the header hygiene into the future, as developers get | ||
immediate feedback when undesirable headers are added in the future. | ||
|
||
### TensorBase | ||
`Tensor` serves two related but separate functions. It is both a | ||
user-friendly handle to the underlying `TensorImpl` data structure, | ||
and an interface to call operators with the `variant: method` | ||
property. `TensorBase` acts only as a handle without any of the | ||
code-generated methods and lives in it's own header | ||
(`ATen/core/TensorBase.h`). | ||
|
||
`Tensor` inherits publicly from `TensorBase` to make `TensorBase&` | ||
arguments implicitly callable with `Tensor` without touching the | ||
reference count. Similarly, an implicit move constructor | ||
`Tensor(TensorBase &&)` allows `Tensor` to be implicitly created from | ||
`TensorBase` so long as there is no reference count change. This way | ||
the two types cleanly inter-operate without silently degrading | ||
performance with reference count increments. | ||
|
||
The primary use case for `TensorBase` is kernel code which is often | ||
the most expensive to compile yet often don't reference operators. | ||
Many kernels simply hand tensors off to `TensorIterator` or call | ||
`data_ptr<>` and access the data directly. To enable this use case, | ||
`TensorIterator.h` and the `cpu/Loops.h` and `cuda/Loops.cuh` headers | ||
must be kept clean of any `Tensor` includes (forward declarations are | ||
fine though). | ||
|
||
|
||
### Per-Operator Headers | ||
|
||
The contents of code-generated "umbrella" headers will be split up | ||
into separate headers so all they do is include from one header per | ||
operator. The actual generated code will not be changed, only moved to | ||
a new header. These new headers will live in the `ATen/ops` folder. | ||
`at::add` for example will generate 8 new headers, | ||
|
||
```cpp | ||
#include <ATen/ops/add.h> // Split from Functions.h | ||
#include <ATen/ops/add_meta.h> // Split from NativeMetaFunctions.h | ||
#include <ATen/ops/add_native.h> // Split from NativeFunctions.h | ||
#include <ATen/ops/add_ops.h> // Split from Operators.h | ||
|
||
// Plus static dispatch functions split from {DispatchKey}Functions.h | ||
#include <ATen/ops/add_compositeexplicitautograd_dispatch.h> | ||
#include <ATen/ops/add_cpu_dispatch.h> | ||
#include <ATen/ops/add_cuda_dispatch.h> | ||
#include <ATen/ops/add_meta_dispatch.h> | ||
``` | ||
|
||
Note that `add` has 4 overloads which are included in the same header | ||
files. This way the caller doesn't need to know which overload is | ||
relevant to their use case, only the basic operator name. | ||
|
||
Unfortunately, per-operator headers aren't compatible with Meta's | ||
internal `buck` builds because the codegen output list depends | ||
dynamically on the contents of an input file (namely | ||
`native_functions.yaml`). So these headers can only be included when | ||
`AT_PER_OPERATOR_HEADERS` is defined which is the default for CMake | ||
and bazel builds. So, they must be included behind an `#ifdef` guard, | ||
e.g. | ||
|
||
```cpp | ||
#ifndef AT_PER_OPERATOR_HEADERS | ||
#include <ATen/Functions.h> | ||
#else | ||
#include <ATen/ops/add.h> | ||
#include <ATen/ops/sum.h> | ||
#endif | ||
``` | ||
|
||
### Enforcing adoption | ||
These tools can only improve incremental builds if they are used | ||
widely throughout the codebase and continue to be enforced as the code | ||
develops. So, adoption will be enforced through the | ||
`TORCH_ASSERT_NO_OPERATORS` and `TORCH_ASSERT_ONLY_METHOD_OPERATORS` | ||
macros. | ||
|
||
First, `ATen` and `torch` will be migrated to using the per-operator | ||
headers, defining `TORCH_ASSERT_ONLY_METHOD_OPERATORS` in each file as | ||
they are updated. Then, once most of the codebase is compliant, the | ||
build system will be changed to define | ||
`TORCH_ASSERT_ONLY_METHOD_OPERATORS` by default in all new files. In | ||
the rare occasion where umbrella headers are strictly necessary, these | ||
files can opt-out by explicitly `#undef`ing the macro making it | ||
`grep`-able and clearly visible in code-review. One such example is | ||
`autograd/generated/Functions.cpp` which includes verbatim code from | ||
`derivatives.yaml` so doesn't know which operators are being called. | ||
This should be extremely rare though. | ||
|
||
The initial migration can be assisted by | ||
[`include-what-you-use`](https://include-what-you-use.org/) which is a | ||
clang-based tool to find the necessary headers based on what symbols | ||
are actually used in a file. For most cpp files, it is as simple as | ||
copying the `<ATen/ops/*>` headers from the tool output. | ||
|
||
`TORCH_ASSERT_NO_OPERATORS` cannot be enforced as strictly because | ||
compliance often requires significantly restructuring the code. | ||
However, where the cost-benefit is favorable, this macro should be | ||
defined manually in the source file. This will include files that | ||
already use `TensorIterator` exclusively and just need their includes | ||
pruned; and more costly to compile files where taking time to | ||
restructure them is worthwhile. | ||
|
||
## **Metrics** | ||
|
||
Some key metrics that should be improved by these changes: | ||
- Time to rebuild after adding editing a functional operator | ||
- Time to rebuild after adding editing a method operator | ||
- `sccache` miss rate in open source CI | ||
|
||
For example, we can compare against a branch with the | ||
`ATen/native/cpu` and `ATen/native/cuda` folders migrated to | ||
per-operator headers, and some key files edited to use `TensorBase`. | ||
Despite only partial migration, the table below shows ~2x speedup in | ||
incremental build for functional operators and ~1.4x for method | ||
operators. | ||
|
||
| Variant | Jobs (before) | Jobs (after) | Time (before) | Time (after) | | ||
|----------|:-------------:|--------------|---------------|--------------| | ||
| Function | 1403 | 775 | 10 m 00 s | 4 m 50 s | | ||
| Method | 1403 | 1184 | 10 m 00 s | 7 m 15 s | | ||
|
||
We can expect even further improvements once all of ATen and torch | ||
have adopted per-operator headers. | ||
|
||
## **Drawbacks** | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some of our users include internal file and rely on the fact that most of ATen is available at that point. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suppose we could do something like
However, I think that user code should really just be including There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that the fix is really on the other side there. |
||
There is no denying that it's more convenient to just include | ||
`<ATen/ATen.h>` at the top of the file and not have to think about it. | ||
This will require additional effort on the part of developers to | ||
maintain the list of operator includes. | ||
|
||
C++ users who include internal header files without including | ||
`ATen/ATen.h` may also find their compilation failing as unnecessary | ||
includes are trimmed from internal headers. | ||
|
||
## **Alternatives** | ||
[pytorch/pytorch#61915](https://github.com/pytorch/pytorch/pull/61915) | ||
proposed generating new operators into separate headers called | ||
`*_Delta.h` (e.g. `Functions_Delta.h`) during local development. This | ||
improves incremental builds locally with comparatively fewer changes | ||
to the ATen codebase. However, | ||
|
||
1. It only helps when adding operators, not changing existing ones | ||
2. It does nothing for clean builds using a cache (like in CI) | ||
3. Incremental builds no long reproduce a from-scratch build | ||
|
||
[pytorch/pytorch#62258](https://github.com/pytorch/pytorch/issues/62258) | ||
proposed removing operator header dependencies from CUDA kernels | ||
specifically. This is aligned with the aims of `TensorBase`, but | ||
leaves out several hundred non-kernel source files. Individually these | ||
files take around 10 seconds each to compile, but together account for | ||
2-3 hours of (serial) CPU time. | ||
|
||
|
||
## **Prior Art** | ||
Not aware of any relevant prior art. | ||
|
||
## **How we teach this** | ||
This change doesn't effect end-users. | ||
|
||
For PyTorch developers, a developer wiki entry should outline the | ||
motivation from this RFC and describe how to write code in compliance | ||
with the two enforcement macros: | ||
- How to include the new per-operator headers | ||
- Using the include-what-you-use tool | ||
- Structuring headers to minimize unnecessary dependencies | ||
- Modifying kernels to be `TORCH_ASSERT_NO_OPERATORS` compliant | ||
|
||
This wiki entry should also be linked from the enforcement macro error | ||
messages to increase visibility. | ||
|
||
## **Unresolved questions** | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that most of the above is done today. Do we want to extend this rfc to try and answer such questions? |
||
- Is it possible/desirable to enforce `TORCH_ASSERT_NO_OPERATORS` automatically? | ||
e.g. for all `.cu` files, or all files over a certain compile time. | ||
- Can `include-what-you-use` completely automate operator includes? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I could imagine a daily/weekly automatic CI that automatically runs/fix this similar to how we do for not important python lint with pytorch/pytorch@1edf6f5 Do you think that could work? Any danger of BC-breaking change for c++ users? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think the |
||
Tools exist for strictly managing _all_ includes, but that would be | ||
a significant change from existing include style. | ||
- Is it worth adopting `TensorBase` more widely than just kernels? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we could answer this by seeing how much benefit we will get from doing that? Is there an easy way to see how this will change the metrics discussed above? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know of an easy way but some numbers to put it into perspective: 1184 files include However, I suspect that methods change less often and that more development effort will be going into new operators which are mostly functional-only. So in practice it may be much more the 60% that are fast. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds great! |
||
e.g. if the operators themselves used `TensorBase` then you no | ||
longer have to include hundreds of method operators. |
Uh oh!
There was an error while loading. Please reload this page.