Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

clonker
Copy link
Member

@clonker clonker commented Mar 19, 2025

When computing object ids for referencing subassemblies, these ids are currently determined as negative DFS enumeration based on size_t:

size_t objectId = std::numeric_limits<size_t>::max() - m_subPaths.size();

This PR changes this to uint64_t, so that assembly text - in particular PUSH #[$] instructions - is consistent over different type sizes of size_t.

Since these indices are stored in a map and later referenced for exporting and importing, as well as (currently) numeric_limits<size_t>::max() is used to indicate the root object or an empty state which is used in, eg, various asserts, I found it helpful to wrap the uint64_t into a struct so that these places become apparent and dealing with the sizes is explicit. This caused a whole avalanche of changes, as such object ids (or more generally: SubAssemblyIDs) are used in many places. In particular, also yul::Object::subIds are subject to the type.

We were lacking a cmdline test that has non-zero PUSH #[$]es, so I added one. Since emscripten builds do not execute them, we have no direct regression test by this but I have added one to the solc-js tests. While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Fixes #15953

@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 2 times, most recently from 8871aa9 to 12b9a65 Compare March 19, 2025 11:29
@ethereum ethereum deleted a comment from stackenbotten Mar 19, 2025
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 12b9a65 to 1ea2cc4 Compare March 19, 2025 11:32
@r0qs
Copy link
Member

r0qs commented Mar 19, 2025

While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Isn't that the evm.assembly option?

{
	"language": "Yul",
	"sources":
	{
		"C":
		{
			"content": "{ let x := mload(0) sstore(add(x, 0), 0) }"
		}
	},
	"settings":
	{
		"outputSelection":
		{
			"*": { "*": ["evm.assembly"], "": [ "*" ] }
		}
	}
}

@clonker clonker requested a review from aarlt March 19, 2025 11:45
@clonker
Copy link
Member Author

clonker commented Mar 19, 2025

Isn't that the evm.assembly option?

The equivalent that is showing the PUSH #[$]es would be evm.legacyAssembly

@clonker clonker requested a review from r0qs March 19, 2025 11:47
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 1ea2cc4 to c20a453 Compare March 21, 2025 08:09
Comment on lines 85 to 89
Assembly const& sub(SubAssemblyID const _sub) const
{
solAssert(_sub.value <= std::numeric_limits<size_t>::max());
return *m_subs.at(static_cast<size_t>(_sub.value));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that I'd implicitly assume to be true for such a type. I think that asserting it once in the definition of SubAssemblyID would be enough:

static_assert(std::numeric_limits<value_type>::max() <= std::numeric_limits<size_t>::max());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, isn't uint64_t implicitly convertible to size_t? You have static casts to size_t all over the place and I think they're not necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail for emscripten, as there size_t is 32 bits. :)

assertThrow(item.data() <= std::numeric_limits<size_t>::max(), AssemblyException, "");
auto s = subAssemblyById(static_cast<size_t>(item.data()))->assemble().bytecode.size();
assertThrow(item.data() <= std::numeric_limits<SubAssemblyID::value_type>::max(), AssemblyException, "");
auto s = subAssemblyById({static_cast<SubAssemblyID::value_type>(item.data())})->assemble().bytecode.size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assembling code would be more concise if we defined a conversion method in SubAssemblyID() that does the cast and asserts that the value is in the range of the value_type.

We could then use that in several other places, e.g. CompilerContext.h or EthAssemblyAdapter.cpp.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, almost all of these assertThrow()s should really be solAssert()s. The few that do represent proper validations should be replaced with solRequire().

I'd recommend converting any of them that you touch in your PRs.

Copy link
Member Author

@clonker clonker Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we defined a conversion method

I have added a conversion constructor, that is indeed a good idea, makes the code much more readable and the assertions more concentrated in the place where they're needed.

almost all of these assertThrow()s should really be solAssert()s.

Right, solAssert and/or solRequire is also much easier to digest than assertThrow

Assembly const* currentAssembly = this;
for (size_t currentSubId: subIds)
for (auto [subIDIndex]: subIDs)
Copy link
Member

@cameel cameel Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, isn't this an ID rather than an index? You're iterating a vector of SubAssemblyIDs.
Though I also see that it gets cast size_t without extracting the .value. Why does that even work?

This would be much clearer without auto obscuring the type.

Suggested change
for (auto [subIDIndex]: subIDs)
for (SubAssemblyID [subID]: subIDs)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also have some other loops using size_t subIDIndex where just using SubAssemblyID for the counter would make things simpler. E.g. in optimiseInternal() I think you could just do this:

for (SubAssemblyID subID = 0; subID.value < m_subs.size(); ++subID.value)

Copy link
Member Author

@clonker clonker Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works because you can unpack structs much like tuples, auto [subIDIndex] = subIDs.front() already refers to the contained value member. Although subIDIndex was not a good name, should have been subIDValue.

In any case, I have changed the loop now to just iterate over the SubAssemblyID instances instead of contained members and then extract the index with a new asIndex method that does static casting and size checking if needed. As an added benefit it limits the amount of static casts being littered about.

@@ -50,7 +50,7 @@ std::string toStringInHex(u256 _value)

}

AssemblyItem AssemblyItem::toSubAssemblyTag(size_t _subId) const
AssemblyItem AssemblyItem::toSubAssemblyTag(SubAssemblyID _subId) const
{
assertThrow(data() < (u256(1) << 64), util::Exception, "Tag already has subassembly set.");
Copy link
Member

@cameel cameel Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this assert correct? I think it will fail when _subID is zero, but zero is a valid ID and an empty one is represented with max() instead. The message indicates that the function considers max() to be set and 0 to be unset, which does not seem right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, now that the sub ID is well-defined as 64-bits, I think we should have asserts that the upper bits of data are actually zeros.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this assert correct? I think it will fail when _subID is zero, but zero is a valid ID and an empty one is represented with max() instead. The message indicates that the function considers max() to be set and 0 to be unset, which does not seem right.

You are not the only whose head hurts by this stuff :) the condition data() < (u256(1) << 64) checks just if data() fits into 64 bits, as we have

u256(std::numeric_limits<uint64_t>::max()) + 1 == (u256(1) << 64)

So it would pass if data() is zero and it indeed already asserts that the upper bits are zero. Or am I misunderstanding your concern?

Comment on lines 230 to +232
evmasm::AssemblyItem addSubroutine(evmasm::AssemblyPointer const& _assembly) { return m_asm->appendSubroutine(_assembly); }
/// Pushes the size of the subroutine.
void pushSubroutineSize(size_t _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
void pushSubroutineSize(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
/// Pushes the offset of the subroutine.
void pushSubroutineOffset(size_t _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }
void pushSubroutineOffset(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }
Copy link
Member

@cameel cameel Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's the first time I see assemblies referred to as subroutines. It really stretches the definition. From Wikipedia:

In computer programming, a function (also procedure, method, subroutine, routine, or subprogram) is a callable unit[1] of software logic that has a well-defined interface and behavior and can be invoked multiple times.

It's going to become very confusing with EOF, where we have both functions (code sections) and assemblies (containers). I'd by default interpret "subroutine" to mean to former (there was even a competing "simple subroutines" EIP). We should really rename this at some point.

@@ -88,6 +88,6 @@ class EthAssemblyAdapter: public AbstractAssembly

evmasm::Assembly& m_assembly;
std::map<SubID, u256> m_dataHashBySubId;
size_t m_nextDataCounter = std::numeric_limits<size_t>::max() / 2;
SubID::value_type m_nextDataCounter = std::numeric_limits<SubID::value_type>::max() / 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I thought I understood what's happening and then I ran into this. WTF are we doing? It just makes my head hurt. If I understand correctly:

  • Assembly keeps track of chunks of data and subassemblies separately.
    • Assemblies are stored in m_subs vector.
    • Data is stored by hash in m_data.
      • Except for metadata, which is singled out and stored in m_auxiliaryData.
  • EthAssemblyAdapter holds the assembly and independently keeps track of all the appended data chunks in m_dataHashBySubId.
    • It gives them fake IDs in the upper half of the subassembly ID space
  • When a Yul object is being compiled, EVMObjectCompiler adds subassemblies and data chunks via the adapter, receiving both real and fake IDs.
    • It stores the IDs in BuiltinContext, each one associated with a Yul node by name.
    • This is done only for children of the object. Anything nested deeper does not get an ID.
  • BuiltinContext is passed to Yul code transform and later to builtins defined in EVMDialect.
    • Builtins like datasize then use BuiltinContext to map Yul node names to sub IDs.
      • If the name is present in BuiltinContext, they take the ID stored there.
      • If not, they assume the name is a dotted path and convert it to a sequence of IDs.
        • BTW, this is a pretty questionable assumption: #13794, #15540.
      • In either case the sequence is passed into EthAssemblyAdapter::appendDataSize().
        • It assumes that if the first ID in the sequence is present in m_dataHashBySubId, it must be a fake ID assigned to a data chunk.
        • Otherwise it converts the path to a sub ID using Assembly::encodeSubPath() and assumes it must be a subassembly.
          • There seems to be an assumption that Yul Object tree structure matches the Assembly tree, which I'm not sure is true. For example we sometimes we move things (e.g. long Yul strings) to data chunks and the transition happens in evmasm optimizer, which means that they do not have corresponding data nodes at Yul level.
        • I actually don't understand how paths to nested data are handled. I don't think they're are added to BuiltinContext, but then how do they not end up treated as assemblies by appendDataSize()?

Overall, looks like we have several kinds of sub IDs sharing the same ID space, without any checks for conflicts (just assuming the space is large enough that they won't happen):

  • Normal IDs assigned to subassemblies, starting at 0.
    • EOF ContainerIDs are a subset of this and are limited to uint8_t.
  • Empty ID at max().
  • Fake data IDs starting at max() / 2.
  • "Negative" IDs representing deeply nested paths, starting at max() and going down.

I think we should assign a non-overlapping region to each of them and have asserts enforcing that.

The way we assign IDs should also be more prominently documented. For example m_subPaths does not even say that the "negative" IDs mean.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up?
Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up? Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Yes, it is here: https://github.com/ethereum/solidity/wiki/Assembly-Indices

@@ -47,9 +48,9 @@ using AssemblyPointer = std::shared_ptr<Assembly>;

class Assembly
{
using TagRefs = std::map<size_t, std::pair<size_t, size_t>>;
using TagRefs = std::map<size_t, std::pair<SubAssemblyID, size_t>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use size_t for tag ID but that should really be uint64_t as well. Would be nice to change that too at some point.

The tag+sub pair would also be better off as a proper struct with methods for converting to/from u256. We wouldn't then have to hard-code all those masks and 64s all over the place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it would also make it easier to reason about the whole program and data flow.

@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 6 times, most recently from ee90452 to 6d3ec74 Compare March 24, 2025 11:32
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 6d3ec74 to 7881236 Compare March 24, 2025 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent subassembly object ID for different sizes of size_t
3 participants