
Stash objcore references until the end of the task to avoid copies #4269

Open, wants to merge 4 commits into master

Conversation

Member

@nigoroll nigoroll commented Feb 11, 2025

This proposal was motivated by #3768, which avoids copying constant strings by special-casing them. This PR does not yet include one additional detail from #3768 (see footnote 1), but it solves the underlying root cause.

Context

Any reference handled in VCL needs to have at least PRIV_TASK lifetime. We notoriously shied away from formalizing this definition, but it is a factual consequence of how workspaces work. Rollbacks reset the respective task's workspace and thus finalize all PRIV_TASKs.

One way of ensuring PRIV_TASK lifetime is by copying the referenced value (usually a string) to the task's workspace, and we do this today.

Yet this is wasteful, because static strings from VCL and pointers to memory on the heap already outlive the task lifetime.

The only objects which did not already have PRIV_TASK lifetime were attributes from objects, because object references got returned before restarts. b92.vtc illustrates this case.

For this reason and this reason only do we currently copy all strings to the respective workspace.
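To make the copy-to-workspace approach concrete, here is a minimal sketch of the idea, not the actual Varnish code. The struct toy_ws type and toy_ws_copy() are hypothetical stand-ins for a task workspace and its string copy: the value is duplicated into task-owned memory, so it stays valid even after the source (for example an object attribute) goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a task workspace: a fixed-size bump allocator. */
struct toy_ws {
	char	buf[64];
	size_t	used;
};

/* Copy a NUL-terminated string into the workspace; NULL on overflow. */
static const char *
toy_ws_copy(struct toy_ws *ws, const char *s)
{
	size_t l;
	char *p;

	assert(ws != NULL && s != NULL);
	l = strlen(s) + 1;
	if (l > sizeof ws->buf - ws->used)
		return (NULL);		/* workspace overflow */
	p = ws->buf + ws->used;
	memcpy(p, s, l);
	ws->used += l;
	return (p);
}
```

Every copy consumes workspace for the rest of the task, which is the cost this proposal tries to avoid for values that already live long enough.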

Avoid copies by giving object references task lifetime

This proposal gives object references the same lifetime as PRIV_TASK and removes the then unnecessary workspace copies. As a side effect, it also solves the case for #3768 (see footnote 1), because this also avoids most copies of static strings to workspace.

In the past, an argument had been made (IIRC by @mbgrydeland) that keeping object references until the end of the task would increase their lifetime by too much, but restarts in VCL really should be done within milliseconds in most cases - and if keeping references is an actual problem in specific situations, those can be avoided by either not restarting or rolling back also. In general, until now we have charged all Varnish-Cache users with the cost for specific use cases, but we should rather only charge the specific use cases instead.


Footnotes

  1. When we set an existing header to a new header with the same name, as in set req.http.name = resp.http.name, we currently create a new HEADER on the workspace. This could be avoided by allowing a HEADER argument to some new SetHdr() variant, similar to "vcc: Teach HEADER symbols to accept constant strings" (#3768).

Member

@dridi dridi left a comment

In the past, an argument had been made (IIRC by @mbgrydeland) that keeping object references until the end of the task would increase their lifetime by too much, but restarts in VCL really should be done within milliseconds in most cases - and if keeping references is an actual problem in specific situations, those can be avoided by either not restarting or rolling back also.

A restart occurring after vcl_miss can take orders of magnitude longer than milliseconds.

I'm not convinced this is a good idea. I think objcores are resources we should always try to hold onto sparingly, and workspace allocations are an acceptable trade off.

@nigoroll nigoroll force-pushed the ocstash_no_string_copy branch 2 times, most recently from a055c50 to 15c1510 Compare February 11, 2025 13:26
@nigoroll
Member Author

A restart occurring after vcl_miss can take orders of magnitude longer than milliseconds.

For this scenario to be relevant, it would need to be a return(restart) from sub vcl_deliver {}, which then ends up being a miss. But yes, I am aware and I was specifically referring to this argument in the last paragraph of the initial comment.

I think objcores are resources we should always try to hold onto sparingly

What would be the convincing argument for that?

I brought my case, so if you disagree, what is yours?

@nigoroll nigoroll force-pushed the ocstash_no_string_copy branch 2 times, most recently from a3881c1 to ed965b6 Compare February 11, 2025 13:53
Member

@dridi dridi left a comment

But yes, I am aware and I was specifically referring to this argument in the last paragraph of the initial comment.

And I quoted the last paragraph, except the last sentence I don't understand.

This proposal gives object references the same lifetime as PRIV_TASK and removes the then unnecessary workspace copies. As a side effect, it also solves the case for #3768, because this also avoids most copies of static strings to workspace.

It was a quick glance, but I don't see how ensuring an at-least-task lifetime to objcores solves anything for static strings. In #3768 I added a new function to the runtime that sets a header with a string guaranteed to outlive the task (user's responsibility) and taught libvcc to better keep track of constant expressions to favor the new function when possible.

I would argue that with objcores guaranteed to outlive VCL execution, we could teach libvcc to recognize obj.stuff as a constant expression:

set req.http.foo = "foo"; # 3768 would avoid a workspace copy
set req.http.foo = obj.http.foo; # 3768 could avoid a workspace copy
set req.http.foo = "bar, " + obj.http.foo; # cannot avoid copy

Your change is only affecting core code and it would easily compose with #3768 to extend obj.* "constness" to VCL code (though not very common). But it certainly doesn't solve workspace overflows on constant string assignments: I have seen cases where a significant number of headers are set in vcl_deliver and vcl_synth (CORS and other policies).

I brought my case, so if you disagree, what is yours?

My main concern is that object storage is a more critical resource than workspaces. If we need a lot of workspace, we can lower task concurrency. When storage is saturated and in constant churn (when you eventually reach full capacity) it becomes crucial that dying objects actually go away. Increasing latency here means that your churn throughput will increase, reducing your cache/storage efficiency.

Forget what I said about restarts. Unlike retries, I think they should generally be avoided and be the exception instead of the norm, so I don't have that much of a problem adding the objcore retention caveat to this feature.

I had a closer look and in the normal case we stash nothing and drop the objcore reference in cnt_finish(), right? In that case I'm more open to it, and it would be a good occasion to revisit #3768 with more "task-constant" expressions. But I really don't like how the stash is put together. Please also note that dropping the ref in cnt_finish() makes the normal case "not quite task-scoped" but is it good enough? If the answer is yes, then the stash should be cleared alongside.

Comment on lines 130 to 132
stash = WS_Alloc(ws, (unsigned)stash_sz(l));
if (stash == NULL)
	stash = malloc(stash_sz(l));
Member

@dridi dridi Feb 11, 2025

This should be allocated from the workspace once and for all similarly to the req->top allocation. Especially since we fix max_restarts when req is initialized. We should stop performing heap allocations for task-scoped allocations (we could submit a pull request to always allocate dynamic privs from workspace). What is the point of giving a fixed budget if we double dip in the heap?

In fact, shouldn't the stash belong to struct reqtop?

edit: never mind my last question, I crossed two streams.

@@ -684,8 +751,7 @@ cnt_lookup(struct worker *wrk, struct req *req)
WRONG("Illegal return from vcl_hit{}");
}

/* Drop our object, we won't need it */
(void)HSH_DerefObjCore(wrk, &req->objcore, HSH_RUSH_POLICY);
stash_oc(&req->ocstash, &req->objcore, req->ws, req->max_restarts + 1);
Member

Did you mean req->restarts + 1?

Member Author

No. The stash is only allocated once for the maximum capacity the task might need.

Member

Please don't resolve review comments on my behalf.

I crossed two streams again. By the time I left my comment I had convinced myself that the stash should be allocated upfront, and misinterpreted the meaning of "restarts" here.

Anyway, what I'm seeing here is duplication and spread of logic. Both stash_oc() calls are identical, and both force the call site to know about internal logic. They should instead look like this:

Req_StashObjcore(req);
Req_StashObjcore(req, &req->objcore); // if we ever grow the need to stash another oc

Member Author

Yes, we can change the signature. The current generic arguments come from the idea that we might want to reuse the facility somewhere else, but we do not have that use case at the moment.

Member Author

@nigoroll nigoroll Feb 12, 2025

Do you like a5e8ebc better?

Member

I actually think this all belongs in cache_req.c, and I'm thinking that clearing the stash could fit well in Req_Cleanup().

Member Author

@nigoroll nigoroll Feb 12, 2025

Right now, the code is declared static in cache_req_fsm.c because that's where it is used.
I think otherwise it would live in cache_hash.c, because HSH_DerefObjCore() lives there.

ocstash_fini() / ReqFiniObjcoreStash() is called where VCL_TaskLeave() is called, because the two are very much related. Req_Cleanup() is concerned with the lifetime of struct req, which is longer than PRIV_TASK.

side note: #3994 will probably also need this facility

Member

Req_Cleanup() barely outlives the task's lifetime, and at least calling it from there guarantees that privs were actually freed *before* clearing the stash. And a cleanup doesn't prevent req from being reused for keep-alive, you are confusing it with Req_Release().

side note: #3994 will probably also need this facility

You are just agreeing with me that cache_req_fsm.c is not a good place for this. At least with what I'm suggesting we can move the stash logic from cache_req.c to cache_obj.c and keep the call site in Req_Cleanup().

We can cross that bridge when obj_stale makes its appearance.

Member Author

We can cross that bridge when obj_stale makes its appearance.

That's where we agree.

With the slight interface modifications in #4271 and in particular calling the cleanup from Req_*.c, I do fully agree with the move of the code. The two important questions, however, are whether the new location to call fini/cleanup is better and whether we really should have the upfront allocation.

@nigoroll
Member Author

But yes, I am aware and I was specifically referring to this argument in the last paragraph of the initial comment.

And I quoted the last paragraph, except the last sentence I don't understand.

The sentence is:

In general, until now we have charged all Varnish-Cache users with the cost for specific use cases, but we should rather only charge the specific use cases instead.

My point is: This patch makes the common case cheaper.

This proposal gives object references the same lifetime as PRIV_TASK and removes the then unnecessary workspace copies. As a side effect, it also solves the case for #3768, because this also avoids most copies of static strings to workspace.

It was a quick glance, but I don't see how ensuring an at-least-task lifetime to objcores solves anything for static strings.

Without this patch, static strings need to be copied always. Without this patch, the set HEADER = HEADER optimization is not possible in the general case.

In #3768 I added a new function to the runtime that sets a header with a string guaranteed to outlive the task (user's responsibility) and taught libvcc to better keep track of constant expressions to favor the new function when possible.

I think I understand what you are doing, and I think we should first have a more general improvement. We should first make sure that every reference in VCL at least has the same lifetime as PRIV_TASK, then we can avoid copying no matter the source. Your set HEADER = HEADER optimization makes sense (see footnote of the initial comment), but we do not need to mark vcl statics constant, because they are just a special case of "at least PRIV_TASK lifetime".

I would argue that with objcores guaranteed to outlive VCL execution, we could teach libvcc to recognize obj.stuff as a constant expression:

Now you seem confused to me. Also with #3768, set req.http.foo = obj.http.foo can not save a workspace copy unless we also have this PR. And yes, again, a set HEADER = HEADER optimization can save copies for the special case that the header names match, and for statics from VCL we can make it so that they do.

Your change is only affecting core code and it would easily compose with #3768 to extend obj.* "constness" to VCL code

It is not just affecting core code, because vmods suffer from the same copying (see footnote 1). And I have said all the time that it would compose with the set HEADER = HEADER optimization from #3768. The main point is that we do not avoid copying by marking an additional special case, the main point is that we avoid it by making sure that "all things referenced" have PRIV_TASK equivalent lifetime.

(Oh man, I am repeating myself for the fifth time now or so.)

My main concern is that object storage is a more critical resource than workspaces. If we need a lot of workspace, we can lower task concurrency. When storage is saturated and in constant churn (when you eventually reach full capacity) it becomes crucial that dying objects actually go away. Increasing latency here means that your churn throughput will increase, reducing your cache/storage efficiency.

I understand what an object reference implies. Also we should note that for most objects on a typical Varnish-Cache installation, the bulk of object references will be held because of body delivery.

Here the main point is that all the scenarios where the reference would be kept longer than before the patch are special cases, which can be avoided by not restarting, or by rolling back in addition to the restart. None of these cases is typical, and the argument that for some special case some object would be held for somewhat longer than before the patch really does not seem significant.

I had a closer look [...]

So was the first half of your comment from "before you looked closer" and the last paragraph from after?

in the normal case we stash nothing and drop the objcore reference in cnt_finish(), right?

Yes

But I really don't like how the stash is put together. Please also note that dropping the ref in cnt_finish() makes the normal case "not quite task-scoped" but is it good enough? If the answer is yes, then the stash should be cleared alongside.

The stash is only allocated if needed. The call to ocstash_fini() is a noop if stash_oc() was not called. It only gets called for the "exception path", of which restart is most relevant.

Footnotes

  1. Actually vmod_objvar probably benefits the most, because taskvar.string replacing use cases of HEADER will then not use any copies for STRING inputs and could be extended to also support STRANDS to avoid even more copying.

@dridi
Member

dridi commented Feb 12, 2025

The sentence is:

In general, until now we have charged all Varnish-Cache users with the cost for specific use cases, but we should rather only charge the specific use cases instead.

My point is: This patch makes the common case cheaper.

I can parse it now, got it. I generally agree with tradeoffs in favor of the common cases.

Without this patch, static strings need to be copied always. Without this patch, the set HEADER = HEADER optimization is not possible in the general case.

Agreed, with the understanding that "this patch" refers to #3768. But I think you are reading too much into my patch series, see below.

I think I understand what you are doing, [...] Your set HEADER = HEADER optimization makes sense

I don't think I can take credit for the set HEADER = HEADER optimization, or I should look at #3768 to make sure. My optimization was set HEADER = "literal string". Then I suggested marking obj header as constants to treat them like literal strings in libvcc, and then you came up with a generalization (assuming guaranteed task lifetime of objcores).

I would argue that with objcores guaranteed to outlive VCL execution, we could teach libvcc to recognize obj.stuff as a constant expression:

Now you seem confused to me. Also with #3768, set req.http.foo = obj.http.foo can not save a workspace copy unless we also have this PR.

Not confused, right? The idea was assuming "objcores guaranteed to outlive VCL execution", but your set HEADER = HEADER idea would encompass obj headers.

Your change is only affecting core code and it would easily compose with #3768 to extend obj.* "constness" to VCL code

It is not just affecting core code, because vmods suffer from the same copying. And I have said all the time that it would compose with the set HEADER = HEADER optimization from #3768.

We actually agree, this is poor wording on my end. I think both changes would complement each other well and open the door to a new optimization.

So was the first half of your comment from "before you looked closer" and the last paragraph from after?

Hours before the beginning and the end of my review...

The stash is only allocated if needed. The call to ocstash_fini() is a noop if stash_oc() was not called. It only gets called for the "exception path", of which restart is most relevant.

Then considering the workspace gains we can expect, we should avoid the complications of a just-in-time allocation, and certainly not fall back to a heap allocation. I think we should give it the reqtop treatment and simply always make room for the stash.

@nigoroll
Member Author

As the rest of the discussion looks like it might be resolved, I will only respond to the last paragraph:

we should avoid the complications of a just-in-time allocation

Again, I wanted to keep impact on the common case minimal.

not fall back to a heap allocation.

This is unlikely and simplifies the calling code.

@nigoroll nigoroll force-pushed the ocstash_no_string_copy branch from 81391a5 to a5e8ebc Compare February 12, 2025 14:39
@dridi
Member

dridi commented Feb 12, 2025

Again, I wanted to keep impact on the common case minimal.

The impact is already a negative workspace footprint.

This is unlikely and simplifies the calling code.

So does a small systematic allocation (56B by default) during the struct req setup.

@nigoroll
Member Author

nigoroll commented Feb 12, 2025

So does a small systematic allocation (56B by default) during the struct req setup.

It is not like I had not considered this option. My worry is that it will impact users with high max_restarts values substantially, and likely needlessly.

On the malloc fallback, I think it is generally a good idea to not fail in our supporting facilities when we could be in an exception code path.

@dridi
Member

dridi commented Feb 12, 2025

On the malloc fallback, I think it is generally a good idea to not fail in our supporting facilities.

And I disputed this in a previous comment (#4269 (comment)). If we have dedicated allocators in the form of workspaces for tasks, then task allocations should be performed there and respect the configured limits.

It's doable, I already offered to submit such a change.

And then there are the cases where a workspace allocation may not be appropriate (for example gzip_buffer, h2 stream window etc). A VMOD author is still free to allocate PRIV_TASK data from the heap.

@nigoroll
Member Author

nigoroll commented Feb 12, 2025

Sure, the workspace allocation failure handling is doable.

The point here is that if we run into the ws overflow at this point, a subsequent request will already fail, and we will induce massive overhead basically everywhere, from handling the error, possibly running into a restart etc. etc.
The administrator will hopefully notice the ws overflow and do something about it.

But at any rate, doing a heap allocation for this exception path is, I think, offset by the gain in simplicity: Because we might already be on an exception path when we stash an oc, we save ourselves from complicating it further.

Do not get me wrong: Yes, we should use the workspace whenever feasible.

FTR on the other topic: The allocation of the struct vrt_priv is not to be confused with the user controlled (struct vmod_priv).priv member.

@dridi
Member

dridi commented Feb 12, 2025

But at any rate, doing a heap allocation for this exception path is, I think, offset by the gain in simplicity: Because we might already be on an exception path when we stash an oc, we save ourselves from complicating it further.

The simplest approach is a preemptive allocation of the stash that completely removes the need for just-in-time allocations and gets rid of all allocation-related branches.

Ignoring a failed workspace allocation here is just delaying the actual workspace failure somewhere in vcl_synth for the synth/fail cases (reminder, the built-in vcl_synth makes allocations, but we have a candidate fix for that). In the restart case, there is no point processing an entire task again with an overflowed workspace.

This is actually increasing the distance between the root cause and the symptoms.

Again, reserving a stash upfront:

  • gets rid of branches
  • avoids the worst-case scenario of a heap allocation
    • and some malloc overhead to perform, track and free the allocation
    • and a potential delayed failure caused by the workspace overflow
  • does not introduce error handling at the call site
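The points in the list above can be illustrated with a small sketch. This is not the code from this pull request or from #4271: struct sketch_ws, struct sketch_stash and the sketch_*() functions are hypothetical names, and the workspace is modelled as a simple bump allocator. The only assumption taken from the discussion is that the stash holds max_restarts + 1 slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bump allocator standing in for the request workspace. */
struct sketch_ws {
	union {
		void		*align;		/* forces pointer alignment */
		unsigned char	bytes[256];
	} buf;
	size_t	used;
};

struct sketch_stash {
	unsigned	n, cap;
	void		*slot[];	/* cap slots follow the header */
};

static void *
sketch_ws_alloc(struct sketch_ws *ws, size_t l)
{
	void *p;

	/* round up so the next allocation stays pointer-aligned */
	l = (l + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
	if (l > sizeof ws->buf.bytes - ws->used)
		return (NULL);
	p = ws->buf.bytes + ws->used;
	ws->used += l;
	return (p);
}

/* Reserve the whole stash at request setup: the only step that can fail. */
static struct sketch_stash *
sketch_stash_reserve(struct sketch_ws *ws, unsigned max_restarts)
{
	unsigned cap = max_restarts + 1;
	struct sketch_stash *st;

	st = sketch_ws_alloc(ws, sizeof *st + cap * sizeof st->slot[0]);
	if (st == NULL)
		return (NULL);
	st->n = 0;
	st->cap = cap;
	return (st);
}

/* Constant-time push without allocation, bounded by the reserved slots. */
static void
sketch_stash_push(struct sketch_stash *st, void *ref)
{
	assert(st->n < st->cap);
	st->slot[st->n++] = ref;
}
```

With this shape, a workspace overflow surfaces when the request is set up rather than in the middle of an exception path, and the push itself has no error return for the call site to handle.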

I do agree in the dynamic priv case that it brings error handling to the caller. In fact VRT_priv_task() used to be fallible so it would only bring error handling back. It would also bring dynamic privs back to respecting the task's memory budget.

FTR on the other topic: The allocation of the struct vrt_priv is not to be confused with the user controlled (struct vmod_priv).priv member.

That's exactly the distinction I was making. The struct should be allocated from the relevant workspace, the member is up to the VMOD author.

Quoting myself:

A VMOD author is still free to allocate PRIV_TASK data from the heap.

Replace my "data" with your more accurate "(struct vmod_priv).priv member".

👎 for not noticing the pun :]

I think the simplest actually is:

  • manage stash in cache_req.c
  • allocate stash upfront from mempool (like reqtop)
  • export Req_StashObjcore()
  • clear stash in Req_Cleanup()

Pros and cons:

  • less moving parts, better encapsulation
  • no new error paths
  • fixed tiny workspace overhead
    • paying dividends as soon as you set up a resp for vcl_deliver

And if the stash turns out to have a prohibitive cost (see footnote 1) then we can complicate the picture and refactor the allocation to make it just-in-time. The extra care you added is premature optimization at this point.

Footnotes

  1. unlike a reqtop systematically allocated upfront even in the absence of sub-requests

@nigoroll
Member Author

I disagree that it is the better option, because it charges all users with a workspace allocation which will not be needed in most cases. The struct reqtop case is different, because that struct is of fixed size and we actually do need it always to not add special casing all over the place for vcl0, VCL req.top. and PRIV_TOP. Yet you are right that we allocate it also for ESI subrequests, which would not be needed.
I have commented on the Req_Cleanup and cache_req.c suggestions further up.

@nigoroll
Member Author

fixed tiny workspace overhead

It is not fixed. It depends on max_restarts.

paying dividends as soon as you set up a resp for vcl_deliver

No. The straight path does not need it. The common cases where it is needed are return(restart) and return(synth) from vcl_deliver{} and vcl_hit{}.

nigoroll added a commit to nigoroll/varnish-cache that referenced this pull request Feb 12, 2025
nigoroll added a commit to nigoroll/varnish-cache that referenced this pull request Feb 12, 2025
@dridi
Member

dridi commented Feb 13, 2025

It is not fixed. It depends on max_restarts.

Sorry, I meant fixed for the duration of the task.

paying dividends as soon as you set up a resp for vcl_deliver

No. The straight path does not need it. The common cases where it is needed are return(restart) and return(synth) from vcl_deliver{} and vcl_hit{}.

I meant that your change will pay dividends with the workspace savings outweighing the upfront allocation even for the common cases.

I have commented on the Req_Cleanup and cache_req.c suggestions further up.

I submitted #4271 to illustrate what I meant about better encapsulation. What leaks out of cache_req.c is minimal.

Comment on lines 1284 to 1217
if (nxt == REQ_FSM_DONE) {
ReqFiniObjcoreStash(req);
INIT_OBJ(ctx, VRT_CTX_MAGIC);
VCL_Req2Ctx(ctx, req);
if (IS_TOPREQ(req)) {
Member

It's not safe to clear the stash before the privs free callbacks run.

Member Author

Hm. The free callbacks should not do anything with references which the priv might still hold, but still I agree that changing the order is the safer option. Thank you for your valid point and good suggestion!

Member

What I have in mind is collecting something during the task, coming from a header for example, and logging it when the task ends. We replaced the free() callback with a fini() one that takes a VRT_CTX to allow privs to do things.
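The ordering concern can be sketched with a toy model. Everything here is hypothetical (struct demo_obj, struct demo_priv and the demo_*() functions are illustration only, not Varnish APIs): a priv keeps a pointer into an object instead of a copy, and its fini callback reads that pointer when the task ends, so the reference has to be released afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical object: a reference count and one header value. */
struct demo_obj {
	unsigned	refcnt;
	char		hdr[16];
};

/* Dropping the last reference invalidates the object's attributes. */
static void
demo_obj_deref(struct demo_obj *o)
{
	assert(o->refcnt > 0);
	if (--o->refcnt == 0)
		memset(o->hdr, 0, sizeof o->hdr);
}

/* Hypothetical priv: remembers a pointer into the object, no copy made. */
struct demo_priv {
	const char	*seen;
	char		logged[16];
};

/* fini callback: "logs" the collected value when the task ends. */
static void
demo_priv_fini(struct demo_priv *p)
{
	strncpy(p->logged, p->seen, sizeof p->logged - 1);
	p->logged[sizeof p->logged - 1] = '\0';
}

/* End of task: run priv fini callbacks first, release references last. */
static void
demo_task_end(struct demo_priv *p, struct demo_obj *stashed)
{
	demo_priv_fini(p);
	demo_obj_deref(stashed);
}
```

Swapping the two calls in demo_task_end() would make the fini callback read an attribute whose object reference has already been dropped.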

Comment on lines 356 to 290
ReqFiniObjcoreStash(req);

INIT_OBJ(ctx, VRT_CTX_MAGIC);
VCL_Req2Ctx(ctx, req);
Member

It's not safe to clear the stash before the privs free callbacks run.

@nigoroll
Member Author

nigoroll commented Feb 13, 2025

@dridi On the ReqFiniObjcoreStash() / ocstash_fini() / ocstash_clear() call site (called "the cleanup" in what follows to avoid the bikeshed):

I agree that calling it in the middle of Req_Rollback() after the VCL_TaskLeave() is better.
I now also agree that calling it in CNT_Request() for the REQ_FSM_DONE case is wrong, because I had overlooked that filters (VDPs) can also have a look at headers from their init callback, and because headers could reference object storage, the cleanup needs to run after the vdp init.
But calling it in Req_Cleanup() is also wrong: If we moved the cleanup there, the concern that we hold onto oc references for "way too long" would become valid, because Req_Cleanup() is called after all of the body processing is done (edit: this aspect is actually even broader, because VMODs could allocate returned HEADER values which become invalid after their free callback...)

@dridi
Member

dridi commented Feb 13, 2025

I'm not following you here, isn't REQ_FSM_DONE reached from cnt_finish() for the normal object delivery case? In that case we already delivered the body in cnt_transmit().

That's why I said that Req_Cleanup() is called soon after leaving the state machine in a previous comment (#4269 (comment)).

@dridi
Member

dridi commented Feb 13, 2025

FWIW I'm fine assuming longer objcore retention for very specific use cases:

  • return restart/synth from hit/deliver steps
  • return pass from hit step

Failing from deliver stashes the objcore for a very short time in #4271 because of the rollback happening before entering vcl_synth, and pushing to the stack is a cheap constant-time fail-safe operation with the upfront allocation. All that to justify why there isn't a "failure in hit/deliver steps" bullet point.

I don't remember ever seeing more than one restart. By far the most common case I have seen is the purge+restart combo, and it's not affected by this change. The second most common case I have seen is a request changed to perform a (usually cacheable) request to check authorizations before delivering privileged cache hits, and since req needs to be restored to perform the original request, a rollback would naturally prevent objcore retention.

I have seen cases where the first authorization request grabs a token so a rollback becomes impossible.

I have never seen a real-world case for a pass-from-hit (yes, intended).

So I'm fine with these changes:

  • objcore stash (this pull request or 4271)
  • skipping the workspace copies (this pull request or 4271)
  • skipping workspace copies for literal strings (3768)
  • skipping workspace copies for set HEADER = HEADER;

@nigoroll nigoroll marked this pull request as draft February 13, 2025 12:22
@nigoroll
Member Author

@dridi thank you for your excellent summary.
I am now going to continue here and would like to ask you to please excuse my confusion when I wrote "But calling it in Req_Cleanup() is also wrong" here. Yes, we do need to keep all stashed ocs until delivery is complete, because references to them might be used by filters basically anywhere (a filter could also add any task scoped data to the body).

"almost 64k restarts ought to be enough for everyone"
@nigoroll nigoroll force-pushed the ocstash_no_string_copy branch from a5e8ebc to 8c8e19c Compare February 14, 2025 14:37
nigoroll and others added 3 commits February 14, 2025 15:37
Context:

Only while we hold a reference to an object are we allowed to access any
attributes from it. With restarts, we might want to keep references to
attributes of objects from previous restart iterations. Before this patch, this
could lead to use-after-dereference, which was "fixed" by always copying object
variables into the request's workspace (though this was problematic if VMODs did
not copy always, as will be shown by a follow-up vtc addition).

But that copy is wasteful, so we should rather make sure that we keep
references until we are done with the task and not copy (which is the second
next commit).

An argument had been made in the past that keeping object references until the
end of the task would increase their lifetime by too much. It is true that, to
support all possible use cases, we need to keep any oc references until the end of
the task, which might include a final body delivery. But keeping the additional
references can be avoided by either not restarting, or rolling back also.

In general, until now we have charged all Varnish-Cache users with the cost for
specific use cases, but we should rather only charge the specific use cases
instead.

Implementation:

struct ocstash basically is a Variable Length Array (VLA) on the workspace,
allocated when the request is created.

For each invocation, ocstash_push() copies the passed objcore to one of
max_restarts + 1 slots and clears the original location as HSH_DerefObjCore()
would.

Alternatives considered:

* Dynamically allocating space for each objcore pointer was dismissed due to the
  substantial overhead.

* Using a TASK_PRIV was tried and dismissed because this would require setting
  up a VRT_CTX or pulling the VRT_CTX out one level to the core of the FSM,
  which was intrusive and increased complexity substantially.

* The original iteration of this patch would dynamically allocate the workspace
  for struct ocstash only when needed, but after some involved discussion between
  Dridi and Nils this idea was given up in favor of the simpler upfront
  allocation.

Co-authored-by: Dridi Boukelmoune <[email protected]>
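
The stash described in the commit message above can be sketched roughly as follows. This is a toy model, not the code from the patch: `toy_objcore`, `toy_ocstash`, and the three functions are hypothetical names, the reference count stands in for whatever HSH_DerefObjCore() does, and `calloc()` stands in for the workspace allocation made when the request is created.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct objcore: only a reference count. */
struct toy_objcore {
	unsigned		refcnt;
};

/* Hypothetical stash: a small header followed by the pointer slots. */
struct toy_ocstash {
	uint16_t		n;	/* slots in use */
	uint16_t		max;	/* max_restarts + 1 */
	struct toy_objcore	*oc[];	/* flexible array member */
};

/* Upfront allocation; the real code would use the request workspace. */
static struct toy_ocstash *
toy_ocstash_new(uint16_t max_restarts)
{
	struct toy_ocstash *st;
	size_t slots = (size_t)max_restarts + 1;

	st = calloc(1, sizeof *st + slots * sizeof st->oc[0]);
	if (st != NULL)
		st->max = (uint16_t)slots;
	return (st);
}

/* Take over the caller's reference and clear the caller's pointer. */
static void
toy_ocstash_push(struct toy_ocstash *st, struct toy_objcore **ocp)
{
	assert(st->n < st->max);
	assert(*ocp != NULL);
	st->oc[st->n++] = *ocp;
	*ocp = NULL;
}

/* End of task: only now are the stashed references dropped. */
static void
toy_ocstash_clear(struct toy_ocstash *st)
{
	while (st->n > 0) {
		st->n--;
		st->oc[st->n]->refcnt--;
		st->oc[st->n] = NULL;
	}
}
```

The point of the sketch is the ownership transfer: after a push the caller no longer holds the pointer, but the reference count is untouched until the stash is cleared, so attribute pointers taken earlier stay valid across restarts.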
@nigoroll nigoroll force-pushed the ocstash_no_string_copy branch from 8c8e19c to 191be48 Compare February 14, 2025 14:37
@nigoroll
Member Author

@dridi I have now taken your version of the patch in slightly modified form and added you as Co-Author using the de facto standard form (I should have looked this up for some other commits added recently and done the same...).

Regarding the JIT vs. upfront allocation, one argument came to my mind which, I think, had not been mentioned, but which made me change my opinion: in a scenario with a substantial number of restarts which only happen under some circumstances, users might be taken by surprise if their code "suddenly" needs more workspace. Since they need to increase the workspace size anyway to make their use case work, it is better that they notice up front rather than by surprise. So, simply put, I now agree that this is the better option by the argument of "predictability before maximum efficiency".

A diff to your version can be found in 75e88ca. I have done the following:

  • made max_restarts fit into uint16_t to make room for a magic value without spending another sizeof(void *).
  • Added CHECK_OBJ_NOTNULL accordingly
  • Modified Req_New() slightly to make Flexelint happy and avoid zeroing memory twice (while that is still done for the other struct allocated...)
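
One way to read the uint16_t point is sketched below. The layout is an assumption made for illustration (the field names and the 32-bit magic are hypothetical, not taken from the patch): with two 16-bit counters, a 32-bit magic fits into the same eight bytes that precede the pointer array, and the 65534 cap keeps max_restarts + 1 within uint16_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_objcore;	/* opaque, hypothetical */

/* Hypothetical header layout: magic and both counters share 8 bytes. */
struct toy_ocstash_hdr {
	uint32_t		magic;
	uint16_t		n;	/* slots in use */
	uint16_t		max;	/* max_restarts + 1 */
	struct toy_objcore	*oc[];	/* pointer slots follow the header */
};

#define TOY_MAX_RESTARTS 65534	/* (1<<16)-2, the cap from the parameter table */

/* The slot count max_restarts + 1 must still fit into uint16_t. */
static_assert(TOY_MAX_RESTARTS + 1 <= UINT16_MAX,
    "slot count must fit uint16_t");

/* Bytes needed up front for a given max_restarts setting. */
static size_t
toy_ocstash_size(uint16_t max_restarts)
{
	return (sizeof(struct toy_ocstash_hdr) +
	    ((size_t)max_restarts + 1) * sizeof(struct toy_objcore *));
}
```

Under this assumed layout, the default of 4 restarts costs the header plus five pointers per request.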

@nigoroll nigoroll marked this pull request as ready for review February 14, 2025 14:48
Member

@dridi dridi left a comment

The change LGTM, and apologies if the review got on your nerves. You can otherwise tell me I'm wrong as much as you want. As long as you are saying it in good faith I can't take offense.

In the end, I'm happy with the result of our discussion.

Comment on lines 767 to 772
  /* name */ max_restarts,
  /* type */ uint,
  /* min */ "0",
- /* max */ NULL,
+ /* max */ "65534", // (1<<16)-2 #4269
  /* def */ "4",
  /* units */ "restarts",
Member

@dridi dridi Feb 14, 2025

If it was up to me, the limit would be 20. I already have a very hard time justifying more than one restart.

Member Author

I agree, but if there is no technical reason, I guess we should not restrict it. Who knows what people might be doing with it...
I at least remember having had a previous iteration of r02618.vtc using a ridiculously high max_restarts before getting the idea to use the predictable vtc xids.

Comment on lines 778 to 783
  /* name */ max_retries,
  /* type */ uint,
  /* min */ "0",
- /* max */ NULL,
+ /* max */ "65534", // (1<<16)-2 #4269
  /* def */ "4",
  /* units */ "retries",
Member

If it was up to me, the limit would be 20.

Comment on lines +49 to +51
/*--------------------------------------------------------------------
* Facility to keep obcore references until the end of the task across restarts
*/
Member

My suggestion:

Facility to keep references of discarded objcores exposed to VCL code.

Member Author

"discarded objcores" to me sounds like they would be removed from cache.

Member

@dridi dridi left a comment

I need to reduce my workspace saving expectations here. I somehow convinced myself that this change was preventing copies from objcore to workspace once the objcore is guaranteed to outlive VCL processing.

It turns out we already skip copies, most notably during resp setup. HTTP_Decode() already dips directly into the objcore without a detour through the workspace.

So I still think this is a good idea, but we need to deal with set HEADER = HEADER to actually reap benefits.

Comment on lines -96 to -102
if (reason && !WS_Allocated(ctx->ws, reason, -1)) {
reason = WS_Copy(ctx->ws, reason, -1);
if (!reason) {
VRT_fail(ctx, "Workspace overflow");
return;
}
}
Member

This might not be safe actually:

return (synth(200, some_vmod.some_string()));

In that case we can't assume a lifetime beyond the VRT_synth() call for the reason argument.
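
The concern can be illustrated with a toy model of the guard removed in the lines quoted above. Nothing here is Varnish API: `toy_ws`, `toy_ws_allocated()`, `toy_ws_copy()`, and `toy_keep_reason()` are hypothetical stand-ins for the workspace, WS_Allocated(), WS_Copy(), and the removed block, written under the assumption that a string returned by a VMOD might live in a buffer that is reused after the call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy workspace: a fixed buffer with a bump allocator. */
struct toy_ws {
	char	buf[256];
	size_t	used;
};

/* Is the NUL-terminated string p fully inside the used workspace part? */
static int
toy_ws_allocated(const struct toy_ws *ws, const char *p)
{
	uintptr_t b = (uintptr_t)ws->buf;
	uintptr_t e = b + ws->used;
	uintptr_t x = (uintptr_t)p;

	if (x < b || x >= e)
		return (0);
	return (x + strlen(p) + 1 <= e);
}

/* Copy s onto the workspace; NULL signals overflow. */
static const char *
toy_ws_copy(struct toy_ws *ws, const char *s)
{
	size_t l = strlen(s) + 1;
	char *d;

	assert(ws->used <= sizeof ws->buf);
	if (l > sizeof ws->buf - ws->used)
		return (NULL);
	d = ws->buf + ws->used;
	memcpy(d, s, l);
	ws->used += l;
	return (d);
}

/* The removed pattern: copy unless the string is already on the workspace. */
static const char *
toy_keep_reason(struct toy_ws *ws, const char *reason)
{
	if (reason != NULL && !toy_ws_allocated(ws, reason))
		reason = toy_ws_copy(ws, reason);
	return (reason);
}
```

With the guard in place, the caller's buffer may change after the call without affecting the retained string. Without it, the retained pointer is only as good as the lifetime the VMOD actually provides, which is the question raised here.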

Member Author

see #4269 (comment) regarding this and the next four review comments.

Comment on lines -448 to +444
-if (q == NULL) {
-	if (h == NULL)
-		return ("");
-	if (WS_Allocated(ws, h, -1))
-		return (h);
-} else if (h == NULL && WS_Allocated(ws, q, -1)) {
+if (q == NULL && h == NULL)
+	return ("");
+if (q == NULL)
+	return (h);
+if (h == NULL) {
Member

Likewise, we can't assume a task lifetime of the arguments.

Comment on lines 647 to -649
r = VRT_StrandsWS(ctx->ws, NULL, s);
if (r != NULL && *r != '\0')
AN(WS_Allocated(ctx->ws, r, strlen(r) + 1));
Member

If our concern is coverage for VRT_StrandsWS() then we need to keep a check that the result effectively belongs to the workspace.

Comment on lines -660 to -661
if (r != NULL && *r != '\0')
AN(WS_Allocated(ctx->ws, r, strlen(r) + 1));
Member

See previous.

Comment on lines -693 to -694
if (*b != '\0')
AN(WS_Allocated(hp->ws, b, strlen(b) + 1));
Member

See previous.

@nigoroll
Member Author

So I still think this is a good idea, but we need to deal with set HEADER = HEADER to actually reap benefits.

b00092.vtc shows how copying from an object is skipped when restarting:

**** v1    vsl|       1001 Debug           c resp.http.method[0]: (oc) 0x7ffb8a60709c O1_M...
**** v1    vsl|       1003 Debug           c resp.http.url[0]: (oc) 0x7ffb8a607199 /o2_...
**** v1    vsl|       1005 Debug           c req.method[0]: (?) 0x7ffb8a60709c O1_M...
**** v1    vsl|       1005 Debug           c req.url[0]: (?) 0x7ffb8a607199 /o2_...

Yes, this is no copy to HEADER, just from HEADER. I think we discussed this topic sufficiently.

But the main benefit is that we do not need to copy for all the other cases, which you questioned again. After this PR, the contract would officially become that all pointers passed around for VCL have to have at least PRIV_TASK lifetime. This was already the de-facto contract, with the exception of the case addressed here.

Regarding your code comments: I do not want to add 4 more comments to this, but what kind of pointer do you think the vmod would return for return (synth(200, some_vmod.some_string()));?

nigoroll added a commit to nigoroll/varnish-cache that referenced this pull request Feb 17, 2025