-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955
base: develop
Are you sure you want to change the base?
Conversation
8871aa9
to
12b9a65
Compare
12b9a65
to
1ea2cc4
Compare
Isn't that the {
"language": "Yul",
"sources":
{
"C":
{
"content": "{ let x := mload(0) sstore(add(x, 0), 0) }"
}
},
"settings":
{
"outputSelection":
{
"*": { "*": ["evm.assembly"], "": [ "*" ] }
}
}
} |
The equivalent that is showing the |
1ea2cc4
to
c20a453
Compare
libevmasm/Assembly.h
Outdated
Assembly const& sub(SubAssemblyID const _sub) const | ||
{ | ||
solAssert(_sub.value <= std::numeric_limits<size_t>::max()); | ||
return *m_subs.at(static_cast<size_t>(_sub.value)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something that I'd implicitly assume to be true for such a type. I think that asserting it once in the definition of SubAssemblyID
would be enough:
static_assert(std::numeric_limits<value_type>::max() <= std::numeric_limits<size_t>::max());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, isn't uint64_t
implicitly convertible to size_t
? You have static casts to size_t
all over the place and I think they're not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail for emscripten, as there size_t
is 32 bits. :)
libevmasm/Assembly.cpp
Outdated
assertThrow(item.data() <= std::numeric_limits<size_t>::max(), AssemblyException, ""); | ||
auto s = subAssemblyById(static_cast<size_t>(item.data()))->assemble().bytecode.size(); | ||
assertThrow(item.data() <= std::numeric_limits<SubAssemblyID::value_type>::max(), AssemblyException, ""); | ||
auto s = subAssemblyById({static_cast<SubAssemblyID::value_type>(item.data())})->assemble().bytecode.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assembling code would be more concise if we defined a conversion method in SubAssemblyID()
that does the cast and asserts that the value is in the range of the value_type
.
We could then use that in several other places, e.g. CompilerContext.h
or EthAssemblyAdapter.cpp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, almost all of these assertThrow()
s should really be solAssert()
s. The few that do represent proper validations should be replaced with solRequire()
.
I'd recommend converting any of them that you touch in your PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we defined a conversion method
I have added a conversion constructor, that is indeed a good idea, makes the code much more readable and the assertions more concentrated in the place where they're needed.
almost all of these
assertThrow()
s should really besolAssert()
s.
Right, solAssert
and/or solRequire
is also much easier to digest than assertThrow
libevmasm/Assembly.cpp
Outdated
Assembly const* currentAssembly = this; | ||
for (size_t currentSubId: subIds) | ||
for (auto [subIDIndex]: subIDs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, isn't this an ID rather than an index? You're iterating a vector
of SubAssemblyID
s.
Though I also see that it gets cast size_t
without extracting the .value
. Why does that even work?
This would be much clearer without auto
obscuring the type.
for (auto [subIDIndex]: subIDs) | |
for (SubAssemblyID [subID]: subIDs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also have some other loops using size_t
subIDIndex
where just using SubAssemblyID
for the counter would make things simpler. E.g. in optimiseInternal()
I think you could just do this:
for (SubAssemblyID subID = 0; subID.value < m_subs.size(); ++subID.value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works because you can unpack structs much like tuples, auto [subIDIndex] = subIDs.front()
already refers to the contained value
member. Although subIDIndex
was not a good name, should have been subIDValue
.
In any case, I have changed the loop now to just iterate over the SubAssemblyID
instances instead of contained members and then extract the index with a new asIndex
method that does static casting and size checking if needed. As an added benefit it limits the amount of static casts being littered about.
@@ -50,7 +50,7 @@ std::string toStringInHex(u256 _value) | |||
|
|||
} | |||
|
|||
AssemblyItem AssemblyItem::toSubAssemblyTag(size_t _subId) const | |||
AssemblyItem AssemblyItem::toSubAssemblyTag(SubAssemblyID _subId) const | |||
{ | |||
assertThrow(data() < (u256(1) << 64), util::Exception, "Tag already has subassembly set."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assert correct? I think it will fail when _subID
is zero, but zero is a valid ID and an empty one is represented with max()
instead. The message indicates that the function considers max()
to be set and 0
to be unset, which does not seem right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, now that the sub ID is well-defined as 64-bits, I think we should have asserts that the upper bits of data are actually zeros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assert correct? I think it will fail when
_subID
is zero, but zero is a valid ID and an empty one is represented withmax()
instead. The message indicates that the function considersmax()
to be set and0
to be unset, which does not seem right.
You are not the only whose head hurts by this stuff :) the condition data() < (u256(1) << 64)
checks just if data()
fits into 64 bits, as we have
u256(std::numeric_limits<uint64_t>::max()) + 1 == (u256(1) << 64)
So it would pass if data()
is zero and it indeed already asserts that the upper bits are zero. Or am I misunderstanding your concern?
evmasm::AssemblyItem addSubroutine(evmasm::AssemblyPointer const& _assembly) { return m_asm->appendSubroutine(_assembly); } | ||
/// Pushes the size of the subroutine. | ||
void pushSubroutineSize(size_t _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); } | ||
void pushSubroutineSize(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); } | ||
/// Pushes the offset of the subroutine. | ||
void pushSubroutineOffset(size_t _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); } | ||
void pushSubroutineOffset(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's the first time I see assemblies referred to as subroutines. It really stretches the definition. From Wikipedia:
In computer programming, a function (also procedure, method, subroutine, routine, or subprogram) is a callable unit[1] of software logic that has a well-defined interface and behavior and can be invoked multiple times.
It's going to become very confusing with EOF, where we have both functions (code sections) and assemblies (containers). I'd by default interpret "subroutine" to mean to former (there was even a competing "simple subroutines" EIP). We should really rename this at some point.
@@ -88,6 +88,6 @@ class EthAssemblyAdapter: public AbstractAssembly | |||
|
|||
evmasm::Assembly& m_assembly; | |||
std::map<SubID, u256> m_dataHashBySubId; | |||
size_t m_nextDataCounter = std::numeric_limits<size_t>::max() / 2; | |||
SubID::value_type m_nextDataCounter = std::numeric_limits<SubID::value_type>::max() / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, I thought I understood what's happening and then I ran into this. WTF are we doing? It just makes my head hurt. If I understand correctly:
Assembly
keeps track of chunks of data and subassemblies separately.- Assemblies are stored in
m_subs
vector. - Data is stored by hash in
m_data
.- Except for metadata, which is singled out and stored in
m_auxiliaryData
.
- Except for metadata, which is singled out and stored in
- Assemblies are stored in
EthAssemblyAdapter
holds the assembly and independently keeps track of all the appended data chunks inm_dataHashBySubId
.- It gives them fake IDs in the upper half of the subassembly ID space
- When a Yul object is being compiled,
EVMObjectCompiler
adds subassemblies and data chunks via the adapter, receiving both real and fake IDs.- It stores the IDs in
BuiltinContext
, each one associated with a Yul node by name. - This is done only for children of the object. Anything nested deeper does not get an ID.
- It stores the IDs in
BuiltinContext
is passed to Yul code transform and later to builtins defined inEVMDialect
.- Builtins like
datasize
then useBuiltinContext
to map Yul node names to sub IDs.- If the name is present in
BuiltinContext
, they take the ID stored there. - If not, they assume the name is a dotted path and convert it to a sequence of IDs.
- In either case the sequence is passed into
EthAssemblyAdapter::appendDataSize()
.- It assumes that if the first ID in the sequence is present in
m_dataHashBySubId
, it must be a fake ID assigned to a data chunk. - Otherwise it converts the path to a sub ID using
Assembly::encodeSubPath()
and assumes it must be a subassembly.- There seems to be an assumption that Yul
Object
tree structure matches theAssembly
tree, which I'm not sure is true. For example we sometimes we move things (e.g. long Yul strings) to data chunks and the transition happens in evmasm optimizer, which means that they do not have correspondingdata
nodes at Yul level.
- There seems to be an assumption that Yul
- I actually don't understand how paths to nested data are handled. I don't think they're are added to
BuiltinContext
, but then how do they not end up treated as assemblies byappendDataSize()
?
- It assumes that if the first ID in the sequence is present in
- If the name is present in
- Builtins like
Overall, looks like we have several kinds of sub IDs sharing the same ID space, without any checks for conflicts (just assuming the space is large enough that they won't happen):
- Normal IDs assigned to subassemblies, starting at
0
.- EOF
ContainerID
s are a subset of this and are limited touint8_t
.
- EOF
- Empty ID at
max()
. - Fake data IDs starting at
max() / 2
. - "Negative" IDs representing deeply nested paths, starting at
max()
and going down.
I think we should assign a non-overlapping region to each of them and have asserts enforcing that.
The way we assign IDs should also be more prominently documented. For example m_subPaths
does not even say that the "negative" IDs mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up?
Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up? Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.
Yes, it is here: https://github.com/ethereum/solidity/wiki/Assembly-Indices
@@ -47,9 +48,9 @@ using AssemblyPointer = std::shared_ptr<Assembly>; | |||
|
|||
class Assembly | |||
{ | |||
using TagRefs = std::map<size_t, std::pair<size_t, size_t>>; | |||
using TagRefs = std::map<size_t, std::pair<SubAssemblyID, size_t>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use size_t
for tag ID but that should really be uint64_t
as well. Would be nice to change that too at some point.
The tag+sub pair would also be better off as a proper struct with methods for converting to/from u256
. We wouldn't then have to hard-code all those masks and 64s all over the place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it would also make it easier to reason about the whole program and data flow.
ee90452
to
6d3ec74
Compare
6d3ec74
to
7881236
Compare
When computing object ids for referencing subassemblies, these ids are currently determined as negative DFS enumeration based on
size_t
:solidity/libevmasm/Assembly.cpp
Line 1825 in a652292
This PR changes this to
uint64_t
, so that assembly text - in particularPUSH #[$]
instructions - is consistent over different type sizes ofsize_t
.Since these indices are stored in a map and later referenced for exporting and importing, as well as (currently)
numeric_limits<size_t>::max()
is used to indicate the root object or an empty state which is used in, eg, various asserts, I found it helpful to wrap theuint64_t
into a struct so that these places become apparent and dealing with the sizes is explicit. This caused a whole avalanche of changes, as such object ids (or more generally:SubAssemblyID
s) are used in many places. In particular, alsoyul::Object::subId
s are subject to the type.We were lacking a cmdline test that has non-zero
PUSH #[$]
es, so I added one. Since emscripten builds do not execute them, we have no direct regression test by this but I have added one to thesolc-js
tests. While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.Fixes #15953