FusionLoops based on MZ's LoopNest #233
Open
lly-zero-one wants to merge 290 commits into bertmaher:pytorch_fusion from lly-zero-one:loop_nest
Conversation
Enable Werror
…pper accessors to make the code more explicit. (pytorch#186)
* Remove wrapper function accessors from TensorNode: instead access function_'s members directly through function().
* Remove TensorNode class.
* Remove TensorOperationNode class.
* formatted guard elimination
* initial impl of symbolic shapes
…ytorch#191)
* Remove BaseStmtNode class.
* Use `const BaseExprNode*` instead of Expr in classes from ir.h.
* Rename Expr->ExprHandler, Var->VarHandler, BaseExprNode->Expr, Variable->Var.
* Fixup CUDA build.
* Rename {Expr,Var}Handler to {Expr,Var}Handle.
* Fixup after rebase.
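For readers following along, here is a minimal standalone sketch of the handle-over-node split this commit lands. The names match the commit message, but the class bodies are illustrative, not the actual ir.h code:

```cpp
// The immutable IR node is now `Expr` (was BaseExprNode); user-facing code
// passes around a thin value-semantics wrapper, `ExprHandle` (briefly named
// ExprHandler mid-commit). Classes in ir.h store the raw node pointer.
struct Expr {
  virtual ~Expr() = default;
};

struct Var : Expr {};  // was Variable

class ExprHandle {     // was Expr, then ExprHandler, finally ExprHandle
 public:
  explicit ExprHandle(const Expr* node) : node_(node) {}
  const Expr* node() const { return node_; }
 private:
  const Expr* node_;   // `const BaseExprNode*` in the old naming
};

class VarHandle : public ExprHandle {  // was Var, then VarHandler
 public:
  explicit VarHandle(const Var* node) : ExprHandle(node) {}
};

int main() {
  Var v;
  VarHandle h(&v);  // user code works with handles, not raw nodes
  return h.node() == &v ? 0 : 1;
}
```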
…tBinaryOp. (pytorch#192)
* Backport a clang-tidy fix: replace BINARY_ACCEPT with IRPrinter::visitBinaryOp.
* Make visitBinaryOp a local function rather than a method of IRPrinter.
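As a rough illustration of the shape of that fix (illustrative names, not the real IRPrinter code), replacing a per-op macro with one file-local template looks like this:

```cpp
#include <iostream>

// Stand-in binary node; the real IR has Add, Sub, Mul, ... visited the same way.
struct Add {
  int lhs;
  int rhs;
};

// One file-local template replaces the BINARY_ACCEPT macro stamped into every
// visit method; it is a free function, not a method of the printer class.
template <typename Op>
static void visitBinaryOp(std::ostream& os, const Op& op, const char* symbol) {
  os << "(" << op.lhs << " " << symbol << " " << op.rhs << ")";
}

int main() {
  visitBinaryOp(std::cout, Add{1, 2}, "+");  // prints: (1 + 2)
  std::cout << "\n";
}
```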
…ts. (pytorch#194) All `test_*` functions are now moved into a test-class (with no changes to them).
* Add rand benchmark.
* Add an option to disable texpr fuser.
…remainder (pytorch#198)
* Add the cast_float, backward ops and also fix the remainder
* fix the conflict
* change expr to exprhandle
* formatting
* fix the linter
* Fix some IR printer bugs
* also true_stmt
…rch#142)
* Enable axis splitting and GPU grid binding with variable shapes (sketched below)
* Farewell ExprStmt, we hardly knew ye
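A standalone C++ sketch of what splitting plus grid binding means when the extent n is only known at runtime: unlike the static SplitWithTail demo later in this PR, the tail cannot be peeled at compile time, so the split is rounded up and a guard masks out-of-range iterations. The loop variables stand in for blockIdx.x and threadIdx.x; all names here are illustrative:

```cpp
#include <cstdio>
#include <vector>

// Split the i axis by blockSize and bind the two new axes to the GPU grid;
// `block` and `thread` stand in for blockIdx.x and threadIdx.x.
void add_one(std::vector<float>& out, const std::vector<float>& in, int n) {
  const int blockSize = 64;
  for (int block = 0; block < (n + blockSize - 1) / blockSize; block++) {
    for (int thread = 0; thread < blockSize; thread++) {
      int i = block * blockSize + thread;
      if (i < n) {  // mask: n is variable, so the ragged last block is guarded
        out[i] = in[i] + 1.0f;
      }
    }
  }
}

int main() {
  int n = 100;  // variable shape, only known at runtime
  std::vector<float> in(n, 1.0f), out(n);
  add_one(out, in, n);
  std::printf("%g\n", out[n - 1]);  // prints 2
}
```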
* Add workflow.md.
* Remove the suggestions from the doc.
* Add language reference.
* Address some of the comments.
* Add bitwise integer ops: &, ^, <<, >>
* Add the cast_float, backward ops and also fix the remainder
  - fix the conflict
  - change expr to exprhandle
  - formatting
  - fix the linter
  - add the type_as support
* fix the threshold failure
* ATen op: where. This requires a helper function that promotes types for the condition expression.
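The promotion idea, sketched standalone (this mirrors ordinary C++ arithmetic promotion and is not the actual PR helper): both operands of the condition are first converted to a common type, and only then compared to select a branch.

```cpp
#include <cstdio>
#include <type_traits>

// Promote both condition operands to a common type before comparing, then
// select between the two branch values; a scalar stand-in for aten::where.
template <typename A, typename B, typename T>
T where_lt(A lhs, B rhs, T if_true, T if_false) {
  using C = std::common_type_t<A, B>;  // e.g. (int, float) -> float
  return static_cast<C>(lhs) < static_cast<C>(rhs) ? if_true : if_false;
}

int main() {
  // 3 (int) and 3.5f (float) are both promoted to float before the compare.
  std::printf("%d\n", where_lt(3, 3.5f, 1, 0));  // prints 1
}
```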
* LLVM codegen for fmod, remainder
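These need separate lowerings because their semantics differ for mixed signs: fmod truncates toward zero (the result takes the dividend's sign), while aten::remainder follows Python's convention (the result takes the divisor's sign). A quick standalone check of the two semantics (the helper name is mine, not the PR's):

```cpp
#include <cmath>
#include <cstdio>

// Python-style remainder: a - b * floor(a / b), so the sign follows b.
double py_remainder(double a, double b) {
  return a - b * std::floor(a / b);
}

int main() {
  std::printf("fmod(-5, 3)      = %g\n", std::fmod(-5.0, 3.0));    // -2
  std::printf("remainder(-5, 3) = %g\n", py_remainder(-5.0, 3.0)); //  1
}
```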
* fix testATengeInt
This reverts commit 5bf52fa.
The moved code wasn't changed.
…t way. LoopNest is my attempt to simplify our core abstraction. The main idea behind this change is to merge two classes: `TensorExprNode` and `For` (derived from `Stmt`). Currently they represent basically the same thing, but in slightly different ways: `TensorExprNode` attaches some metadata and provides a different way of traversing through siblings/parents/children, while `For` represents the same structure without any metadata. Once a kernel is lowered to `For` statements, they are immediately consumed by a codegen, which lowers them to LLVM IR or prints them as a CUDA string.

This PR adds some functionality to `For` statements (and to other types of statements as well) and implements `SplitWithTail` and `ComputeInline` using only those. The implementation is just a proof of concept: it doesn't cover all corner cases, but they are trivial to add.

As a demo, I added a test where we create a simple tensor expression, split one of its axes, and then lower it to a Stmt. The demo shows that we're producing exactly the same result. For reference, below is the output of the test (Root stmt is produced by the new implementation, Ref stmt by the existing one):

```
[ RUN ] TensorExprTest.LoopNest_LLVM
Root stmt:
for (int n = 0; n < N; n++) {
  for (int i = 0; i < 1024; i++) {
    for (int j_outer = 0; j_outer < ((256 - 0) / 17); j_outer++) {
      for (int j_inner = 0; j_inner < 17; j_inner++) {
        g[(((n * (1024 * 256)) + (i * 256)) + (((j_outer * 17) + j_inner) * 1))] = (((A[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))] + B[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))]) + C[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))]) + D[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))]);
      }
    }
    for (int j_tail = 0; j_tail < ((256 - 0) % 17); j_tail++) {
      g[(((n * (1024 * 256)) + (i * 256)) + ((j_tail + (((256 - 0) / 17) * 17)) * 1))] = (((A[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))] + B[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))]) + C[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))]) + D[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))]);
    }
  }
}
Ref stmt:
for (int n = 0; n < N; n++) {
  for (int i = 0; i < 1024; i++) {
    for (int j_outer = 0; j_outer < ((256 - 0) / 17); j_outer++) {
      for (int j_inner = 0; j_inner < 17; j_inner++) {
        g[(((n * (1024 * 256)) + (i * 256)) + (((j_outer * 17) + j_inner) * 1))] = (((A[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))] + B[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))]) + C[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))]) + D[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + ((j_outer * 17) + j_inner))]);
      }
    }
    for (int j_tail = 0; j_tail < ((256 - 0) % 17); j_tail++) {
      g[(((n * (1024 * 256)) + (i * 256)) + ((j_tail + (((256 - 0) / 17) * 17)) * 1))] = (((A[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))] + B[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))]) + C[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))]) + D[(((n * ((1 * 256) * 1024)) + (i * (1 * 256))) + (j_tail + (((256 - 0) / 17) * 17)))]);
    }
  }
}
[ OK ] TensorExprTest.LoopNest_LLVM (3 ms)
```
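Since the LoopNest API here is explicitly a proof of concept, the sketch below demonstrates the transformation itself rather than the API: plain C++ loops computing the same values through the original schedule and through the split-with-tail schedule shown in the demo (extent 256, factor 17, matching the output above).

```cpp
#include <cassert>
#include <vector>

int main() {
  const int M = 256;      // extent of the axis being split
  const int factor = 17;  // split factor, as in the demo
  std::vector<int> a(M), g_ref(M), g_split(M);
  for (int j = 0; j < M; j++) a[j] = j * 3;

  // Original schedule: a single axis j over [0, M).
  for (int j = 0; j < M; j++) g_ref[j] = a[j] + 1;

  // After SplitWithTail(j, 17): a (j_outer, j_inner) nest covering the
  // divisible prefix, plus a j_tail loop for the remaining M % 17 iterations.
  for (int j_outer = 0; j_outer < M / factor; j_outer++) {
    for (int j_inner = 0; j_inner < factor; j_inner++) {
      int j = j_outer * factor + j_inner;
      g_split[j] = a[j] + 1;
    }
  }
  for (int j_tail = 0; j_tail < M % factor; j_tail++) {
    int j = j_tail + (M / factor) * factor;
    g_split[j] = a[j] + 1;
  }

  // The two schedules compute exactly the same values, which is what the
  // Root stmt / Ref stmt comparison in the test demonstrates.
  for (int j = 0; j < M; j++) assert(g_ref[j] == g_split[j]);
  return 0;
}
```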
ZolotukhinM force-pushed the pytorch_fusion branch from 6628d0f to 36e8a6f on March 4, 2020 00:16