Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

clonker · 2025-03-19T09:58:05Z

When computing object IDs for referencing subassemblies, the IDs are currently assigned as a "negative" DFS enumeration based on size_t:

solidity/libevmasm/Assembly.cpp

Line 1825 in a652292

size_t objectId = std::numeric_limits<size_t>::max() - m_subPaths.size();

This PR changes this to uint64_t, so that assembly text - in particular PUSH #[$] instructions - is consistent over different type sizes of size_t.
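To make the inconsistency concrete, here is a minimal sketch, assuming only the `max() - m_subPaths.size()` formula quoted above. The helper name `negativeObjectId` is hypothetical, and `uint32_t` merely stands in for `size_t` on a 32-bit target.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper: mirrors `max() - m_subPaths.size()` for an arbitrary
// unsigned width and renders the result the way it would appear as text.
template <typename UInt>
std::string negativeObjectId(UInt _numSubPaths)
{
	UInt objectId = std::numeric_limits<UInt>::max() - _numSubPaths;
	return std::to_string(objectId);
}
```

With two sub paths already registered, the 32-bit variant yields 4294967293 and the 64-bit variant yields 18446744073709551613, so any text output that embeds the raw ID differs between the two builds.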

These indices are stored in a map and later referenced when exporting and importing. In addition, numeric_limits<size_t>::max() is (currently) used to indicate the root object or an empty state, which is relied on in, e.g., various asserts. I therefore found it helpful to wrap the uint64_t in a struct, so that these places become apparent and dealing with the sizes is explicit. This caused a whole avalanche of changes, as such object IDs (or more generally: SubAssemblyIDs) are used in many places. In particular, yul::Object::subIds is also subject to the type change.
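A rough sketch of what such a wrapper could look like, not the PR's actual definition. Only `value` and `value_type` appear in the review snippets; `isRoot`, the default initializer, and the comparison operators are assumptions added for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <map>

// Hypothetical fixed-width ID wrapper. The width no longer depends on size_t.
struct SubAssemblyID
{
	using value_type = std::uint64_t;

	// Assumption: a default-constructed ID carries the max() marker that the
	// description says denotes the root object or an empty state.
	value_type value = std::numeric_limits<value_type>::max();

	bool isRoot() const { return value == std::numeric_limits<value_type>::max(); }

	// Ordering and equality so the type can be used as a std::map key.
	friend bool operator<(SubAssemblyID _a, SubAssemblyID _b) { return _a.value < _b.value; }
	friend bool operator==(SubAssemblyID _a, SubAssemblyID _b) { return _a.value == _b.value; }
};
```

Because the struct does not convert implicitly to or from integers, every place that previously passed a bare `size_t` has to be touched, which is consistent with the "avalanche of changes" mentioned above.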

We were lacking a cmdline test that has non-zero PUSH #[$]es, so I added one. Since emscripten builds do not execute cmdline tests, that alone gives us no direct regression test there, so I have also added one to the solc-js tests. While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Fixes #15953

r0qs · 2025-03-19T11:44:36Z

> While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Isn't that the evm.assembly option?

{
	"language": "Yul",
	"sources":
	{
		"C":
		{
			"content": "{ let x := mload(0) sstore(add(x, 0), 0) }"
		}
	},
	"settings":
	{
		"outputSelection":
		{
			"*": { "*": ["evm.assembly"], "": [ "*" ] }
		}
	}
}

clonker · 2025-03-19T11:46:49Z

> Isn't that the evm.assembly option?

The equivalent that is showing the PUSH #[$]es would be evm.legacyAssembly

cameel · 2025-03-21T21:04:36Z

libevmasm/Assembly.h

+	Assembly const& sub(SubAssemblyID const _sub) const
+	{
+		solAssert(_sub.value <= std::numeric_limits<size_t>::max());
+		return *m_subs.at(static_cast<size_t>(_sub.value));
+	}


This is something that I'd implicitly assume to be true for such a type. I think that asserting it once in the definition of SubAssemblyID would be enough:

static_assert(std::numeric_limits<value_type>::max() <= std::numeric_limits<size_t>::max());

Also, isn't uint64_t implicitly convertible to size_t? You have static casts to size_t all over the place and I think they're not necessary.

This will fail for emscripten, as there size_t is 32 bits. :)
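The point about emscripten can be illustrated with a small sketch. `SizeT32` is a hypothetical stand-in for `size_t` on a 32-bit target, and both helper names are invented for this example.

```cpp
#include <cstdint>
#include <limits>

// Assumption for illustration: size_t is 32 bits wide, as on emscripten.
using SizeT32 = std::uint32_t;

// On such a target the suggested static_assert cannot hold, because the
// 64-bit value_type has a larger maximum than size_t:
static_assert(
	std::numeric_limits<std::uint64_t>::max() > std::numeric_limits<SizeT32>::max(),
	"a 64-bit value_type exceeds a 32-bit size_t"
);

// Unchecked conversion: compiles, but silently keeps only the low 32 bits.
inline SizeT32 truncatingIndex(std::uint64_t _value)
{
	return static_cast<SizeT32>(_value);
}

// Range check that a runtime assertion would perform before casting.
inline bool fitsIndex(std::uint64_t _value)
{
	return _value <= std::numeric_limits<SizeT32>::max();
}
```

So the conversion is indeed implicit in C++, but on a 32-bit build it is a narrowing one, which is why a runtime range check remains useful where the value is used as a container index.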

libevmasm/AssemblyItem.cpp

cameel · 2025-03-21T22:10:49Z

libevmasm/Assembly.cpp

-				assertThrow(item.data() <= std::numeric_limits<size_t>::max(), AssemblyException, "");
-				auto s = subAssemblyById(static_cast<size_t>(item.data()))->assemble().bytecode.size();
+				assertThrow(item.data() <= std::numeric_limits<SubAssemblyID::value_type>::max(), AssemblyException, "");
+				auto s = subAssemblyById({static_cast<SubAssemblyID::value_type>(item.data())})->assemble().bytecode.size();


This assembling code would be more concise if we defined a conversion method in SubAssemblyID() that does the cast and asserts that the value is in the range of the value_type.

We could then use that in several other places, e.g. CompilerContext.h or EthAssemblyAdapter.cpp.

BTW, almost all of these assertThrow()s should really be solAssert()s. The few that do represent proper validations should be replaced with solRequire().

I'd recommend converting any of them that you touch in your PRs.

> if we defined a conversion method

I have added a conversion constructor, that is indeed a good idea, makes the code much more readable and the assertions more concentrated in the place where they're needed.

> almost all of these assertThrow()s should really be solAssert()s.

Right, solAssert and/or solRequire are also much easier to digest than assertThrow.

cameel · 2025-03-21T22:35:30Z

libevmasm/Assembly.cpp

 	Assembly const* currentAssembly = this;
-	for (size_t currentSubId: subIds)
+	for (auto [subIDIndex]: subIDs)


Wait, isn't this an ID rather than an index? You're iterating a vector of SubAssemblyIDs.
Though I also see that it gets cast to size_t without extracting the .value. Why does that even work?

This would be much clearer without auto obscuring the type.

Suggested change

-	for (auto [subIDIndex]: subIDs)
+	for (SubAssemblyID [subID]: subIDs)

You also have some other loops using size_t subIDIndex where just using SubAssemblyID for the counter would make things simpler. E.g. in optimiseInternal() I think you could just do this:

for (SubAssemblyID subID = 0; subID.value < m_subs.size(); ++subID.value)

It works because you can unpack structs much like tuples: auto [subIDIndex] = subIDs.front() already refers to the contained value member. That said, subIDIndex was not a good name; it should have been subIDValue.

In any case, I have now changed the loop to iterate over the SubAssemblyID instances instead of the contained members and then extract the index with a new asIndex method that does the static cast and the size check where needed. As an added benefit, it limits the number of static casts littered about.
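Both loop styles can be compared in a minimal sketch. The struct below is a hypothetical stand-in rather than the PR's definition: `value` and `asIndex` are named in the thread, while the body of `asIndex` and the two summing helpers are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for the PR's type.
struct SubAssemblyID
{
	using value_type = std::uint64_t;
	value_type value;

	// Checked conversion to a container index (assumed behaviour).
	std::size_t asIndex() const
	{
		assert(value <= std::numeric_limits<std::size_t>::max());
		return static_cast<std::size_t>(value);
	}
};

// Structured bindings: `raw` names the single `value` member directly,
// which is why the original loop compiled without an explicit `.value`.
inline std::uint64_t sumViaBinding(std::vector<SubAssemblyID> const& _ids)
{
	std::uint64_t sum = 0;
	for (auto [raw]: _ids)
		sum += raw;
	return sum;
}

// Iterating over whole IDs and converting explicitly where an index is needed.
inline std::uint64_t sumViaAsIndex(std::vector<SubAssemblyID> const& _ids)
{
	std::uint64_t sum = 0;
	for (SubAssemblyID id: _ids)
		sum += id.asIndex();
	return sum;
}
```

The second form keeps the type visible at the loop header and concentrates the cast and its range check in one method.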

cameel · 2025-03-21T22:43:54Z

libevmasm/AssemblyItem.cpp

@@ -50,7 +50,7 @@ std::string toStringInHex(u256 _value)

 }

-AssemblyItem AssemblyItem::toSubAssemblyTag(size_t _subId) const
+AssemblyItem AssemblyItem::toSubAssemblyTag(SubAssemblyID _subId) const
 {
 	assertThrow(data() < (u256(1) << 64), util::Exception, "Tag already has subassembly set.");


Is this assert correct? I think it will fail when _subID is zero, but zero is a valid ID and an empty one is represented with max() instead. The message indicates that the function considers max() to be set and 0 to be unset, which does not seem right.

Also, now that the sub ID is well-defined as 64-bits, I think we should have asserts that the upper bits of data are actually zeros.

> Is this assert correct? I think it will fail when _subID is zero, but zero is a valid ID and an empty one is represented with max() instead. The message indicates that the function considers max() to be set and 0 to be unset, which does not seem right.

You are not the only one whose head hurts from this stuff :) The condition data() < (u256(1) << 64) just checks whether data() fits into 64 bits, as we have

u256(std::numeric_limits<uint64_t>::max()) + 1 == (u256(1) << 64)

So it would pass if data() is zero and it indeed already asserts that the upper bits are zero. Or am I misunderstanding your concern?
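The same reasoning can be checked on a scaled-down model. This is only an analogue under stated assumptions: a 64-bit `Word` stands in for u256 and a 32-bit `Half` for the 64-bit field, and both function names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Scaled-down model (assumption): Word plays the role of u256,
// Half plays the role of the 64-bit tag / sub ID field.
using Word = std::uint64_t;
using Half = std::uint32_t;

// Analogue of `data() < (u256(1) << 64)`.
inline bool fitsLowerHalf(Word _data)
{
	return _data < (Word(1) << 32);
}

// Equivalent formulation: every bit above the lower half is zero.
inline bool upperBitsZero(Word _data)
{
	return (_data >> 32) == 0;
}

// Analogue of `u256(max<uint64_t>) + 1 == (u256(1) << 64)`.
static_assert(
	Word(std::numeric_limits<Half>::max()) + 1 == (Word(1) << 32),
	"the bound is exactly one past the largest lower-half value"
);
```

In this model the check accepts 0 and every value up to the lower-half maximum, and rejects anything with an upper bit set, so the two formulations agree.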

cameel · 2025-03-21T22:57:54Z

libsolidity/codegen/CompilerContext.h

 	evmasm::AssemblyItem addSubroutine(evmasm::AssemblyPointer const& _assembly) { return m_asm->appendSubroutine(_assembly); }
 	/// Pushes the size of the subroutine.
-	void pushSubroutineSize(size_t _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
+	void pushSubroutineSize(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
 	/// Pushes the offset of the subroutine.
-	void pushSubroutineOffset(size_t _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }
+	void pushSubroutineOffset(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }


Oh, it's the first time I see assemblies referred to as subroutines. It really stretches the definition. From Wikipedia:

In computer programming, a function (also procedure, method, subroutine, routine, or subprogram) is a callable unit[1] of software logic that has a well-defined interface and behavior and can be invoked multiple times.

It's going to become very confusing with EOF, where we have both functions (code sections) and assemblies (containers). I'd by default interpret "subroutine" to mean the former (there was even a competing "simple subroutines" EIP). We should really rename this at some point.

cameel · 2025-03-22T00:20:34Z

libyul/backends/evm/EthAssemblyAdapter.h

@@ -88,6 +88,6 @@ class EthAssemblyAdapter: public AbstractAssembly

 	evmasm::Assembly& m_assembly;
 	std::map<SubID, u256> m_dataHashBySubId;
-	size_t m_nextDataCounter = std::numeric_limits<size_t>::max() / 2;
+	SubID::value_type m_nextDataCounter = std::numeric_limits<SubID::value_type>::max() / 2;


Ugh, I thought I understood what's happening and then I ran into this. WTF are we doing? It just makes my head hurt. If I understand correctly:

Assembly keeps track of chunks of data and subassemblies separately.

Assemblies are stored in m_subs vector.

Data is stored by hash in m_data.

Except for metadata, which is singled out and stored in m_auxiliaryData.

EthAssemblyAdapter holds the assembly and independently keeps track of all the appended data chunks in m_dataHashBySubId.

It gives them fake IDs in the upper half of the subassembly ID space.

When a Yul object is being compiled, EVMObjectCompiler adds subassemblies and data chunks via the adapter, receiving both real and fake IDs.

It stores the IDs in BuiltinContext, each one associated with a Yul node by name.

This is done only for children of the object. Anything nested deeper does not get an ID.

BuiltinContext is passed to Yul code transform and later to builtins defined in EVMDialect.

Builtins like datasize then use BuiltinContext to map Yul node names to sub IDs.

If the name is present in BuiltinContext, they take the ID stored there.

If not, they assume the name is a dotted path and convert it to a sequence of IDs.

BTW, this is a pretty questionable assumption: #13794, #15540.

In either case the sequence is passed into EthAssemblyAdapter::appendDataSize().

It assumes that if the first ID in the sequence is present in m_dataHashBySubId, it must be a fake ID assigned to a data chunk.

Otherwise it converts the path to a sub ID using Assembly::encodeSubPath() and assumes it must be a subassembly.

There seems to be an assumption that the Yul Object tree structure matches the Assembly tree, which I'm not sure is true. For example, we sometimes move things (e.g. long Yul strings) to data chunks, and the transition happens in the evmasm optimizer, which means that they do not have corresponding data nodes at the Yul level.

I actually don't understand how paths to nested data are handled. I don't think they are added to BuiltinContext, but then how do they not end up treated as assemblies by appendDataSize()?

Overall, looks like we have several kinds of sub IDs sharing the same ID space, without any checks for conflicts (just assuming the space is large enough that they won't happen):

Normal IDs assigned to subassemblies, starting at 0.

EOF ContainerIDs are a subset of this and are limited to uint8_t.

Empty ID at max().

Fake data IDs starting at max() / 2.

"Negative" IDs representing deeply nested paths, starting at max() and going down.

I think we should assign a non-overlapping region to each of them and have asserts enforcing that.

The way we assign IDs should also be more prominently documented. For example, m_subPaths does not even say what the "negative" IDs mean.
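One possible shape for the non-overlapping regions suggested here, as a sketch only. The boundaries, the enum, and the function name are all hypothetical and chosen purely for illustration; nothing in the thread fixes concrete values beyond max() for the empty ID and max() / 2 for the start of the fake data IDs.

```cpp
#include <cstdint>
#include <limits>

using IDValue = std::uint64_t;

enum class IDRegion { SubAssembly, Data, NestedPath, Empty };

constexpr IDValue maxID = std::numeric_limits<IDValue>::max();
// Hypothetical boundaries; a real partition would need to be agreed on.
constexpr IDValue dataBegin = maxID / 2;
constexpr IDValue nestedPathBegin = maxID / 4 * 3;

// The regions are ordered and disjoint by construction.
static_assert(dataBegin < nestedPathBegin && nestedPathBegin < maxID, "regions must be ordered");

// Maps a raw ID to the region it would belong to under this partition.
inline IDRegion classify(IDValue _id)
{
	if (_id == maxID)
		return IDRegion::Empty;
	if (_id >= nestedPathBegin)
		return IDRegion::NestedPath;
	if (_id >= dataBegin)
		return IDRegion::Data;
	return IDRegion::SubAssembly;
}
```

With something like this in place, each site that hands out an ID could assert that the new ID falls into the expected region instead of relying on the space being large enough.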

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up?
Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

> Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up? Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Yes, it is here: https://github.com/ethereum/solidity/wiki/Assembly-Indices

cameel · 2025-03-22T01:34:48Z

libevmasm/Assembly.h

@@ -47,9 +48,9 @@ using AssemblyPointer = std::shared_ptr<Assembly>;

 class Assembly
 {
-	using TagRefs = std::map<size_t, std::pair<size_t, size_t>>;
+	using TagRefs = std::map<size_t, std::pair<SubAssemblyID, size_t>>;


We use size_t for tag ID but that should really be uint64_t as well. Would be nice to change that too at some point.

The tag+sub pair would also be better off as a proper struct with methods for converting to/from u256. We wouldn't then have to hard-code all those masks and 64s all over the place.
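A scaled-down sketch of what such a struct might look like. Everything here is an assumption for illustration: the name `TagRef`, the 32-bit fields standing in for the 64-bit tag and sub IDs, the 64-bit word standing in for u256, and the layout with the sub ID in the upper half. The real encoding may differ in detail.

```cpp
#include <cstdint>

// Hypothetical tag + sub pair. The shift and mask live in one place
// instead of being repeated at every call site.
struct TagRef
{
	std::uint32_t sub; // stands in for the 64-bit sub ID
	std::uint32_t tag; // stands in for the 64-bit tag ID

	// Assumed layout: sub ID in the upper half, tag in the lower half.
	std::uint64_t pack() const
	{
		return (std::uint64_t(sub) << 32) | tag;
	}

	static TagRef unpack(std::uint64_t _data)
	{
		return {
			static_cast<std::uint32_t>(_data >> 32),
			static_cast<std::uint32_t>(_data & 0xffffffffu)
		};
	}
};
```

Callers would then round-trip through `pack()` and `unpack()` and never see the bit layout.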

Agree, it would also make it easier to reason about the whole program and data flow.

clonker force-pushed the fix_asm_subpath_obj_id branch 2 times, most recently from 8871aa9 to 12b9a65 Compare March 19, 2025 11:29

ethereum deleted a comment from stackenbotten Mar 19, 2025

clonker force-pushed the fix_asm_subpath_obj_id branch from 12b9a65 to 1ea2cc4 Compare March 19, 2025 11:32

clonker requested a review from aarlt March 19, 2025 11:45

clonker requested a review from r0qs March 19, 2025 11:47

Add test with non-zero PUSH #[$] asm instructions

e0de766

clonker force-pushed the fix_asm_subpath_obj_id branch from 1ea2cc4 to c20a453 Compare March 21, 2025 08:09

cameel reviewed Mar 22, 2025


clonker force-pushed the fix_asm_subpath_obj_id branch 6 times, most recently from ee90452 to 6d3ec74 Compare March 24, 2025 11:32

clonker added 2 commits March 24, 2025 12:43

Make subassembly IDs based on fixed-size 64 bit uint

36e6199

Add test with non-zero PUSH #[$] to solc-js tests

7881236

clonker force-pushed the fix_asm_subpath_obj_id branch from 6d3ec74 to 7881236 Compare March 24, 2025 11:43
