-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update upstream serially between all processes in dyups module #1780
base: master
Are you sure you want to change the base?
Conversation
问题: 在nginx中保存ip数量在10000左右,且开启健康检查情况下,当通过dyups更新一个ip数量在1300以上的upstream情况下,根据测试,发现每个进程大约会阻塞200ms,若机器有50+个worker进程,进程间阻塞加锁更新的方式将会导致正常请求得不到处理。 解决方式: 通过添加串行更新指令,让nginx的每个进程串行更新upstream,当每个进程更新完成后,对原子变量进行+1操作,这样下一个进程才会去更新。 进程间的串行更新,避免了多进程同时阻塞去更新upstream。
|
这个行为和nginx的mutex try lock行为一致。内部实现也是将pid存在lock上,如果被其他进程枷锁后则本进程无法强到锁。dyups有个指令可以开启这个功能。你可以试一下,看看是否满足你的需求:https://github.com/alibaba/tengine/blob/master/docs/modules/ngx_http_upstream_dyups_module.md#dyups_trylock This behavior is consistent with the mutex locking behavior used by nginx. The internal implementation stores the process ID (PID) of the process that currently holds the lock. If the lock is held by another process, the current process cannot acquire the lock. The dyups module directive |
@chobits 只是有个问题,极端场景下,某个或者某几个进程可能一直ngx_shmtx_trylock失败,这样不能保证在规定时间内100%的进程得到更新,甚至最坏情况下,某些进程ngx_shmtx_trylock从来没有成功过。 而串行更新能够保证在规定的时间内所有进程完成更新,所有进程更新完毕时间为worker_processes*dyups_read_msg_timeout。 There is just one problem. In extreme cases, one or more processes may fail The serial update ensures that all processes are updated within a specified period of time, which is worker_processes*dyups_read_msg_timeout. |
@@ -2129,13 +2168,15 @@ ngx_http_dyups_read_msg_locked(ngx_event_t *ev) | |||
ngx_dyups_msg_t *msg; | |||
ngx_dyups_shctx_t *sh; | |||
ngx_dyups_status_t *status; | |||
ngx_http_dyups_main_conf_t *dmcf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useless logic, cannot be compiled.
From github workflow: https://github.com/alibaba/tengine/actions/runs/5052359059/jobs/9175451294?pr=1780
In file included from src/core/ngx_core.h:63,
from src/http/ngx_http.h:13,
from modules/ngx_http_upstream_dyups_module/ngx_http_dyups_module.c:6:
modules/ngx_http_upstream_dyups_module/ngx_http_dyups_module.c: In function ‘ngx_http_dyups_read_msg_locked’:
src/core/ngx_log.h:92:5: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘if’
92 | if ((log)->log_level & level) \
| ^~
src/core/ngx_log.h:138:9: note: in expansion of macro ‘ngx_log_debug’
138 | ngx_log_debug(level, log, err, fmt, arg1)
| ^~~~~~~~~~~~~
modules/ngx_http_upstream_dyups_module/ngx_http_dyups_module.c:2173:5: note: in expansion of macro ‘ngx_log_debug1’
2173 | ngx_log_debug1(NGX_LOG_DEBUG_HTTP, ev->log, 0,
| ^~~~~~~~~~~~~~
modules/ngx_http_upstream_dyups_module/ngx_http_dyups_module.c:2179:5: error: ‘dmcf’ undeclared (first use in this function)
2179 | dmcf = ev->data;
| ^~~~
modules/ngx_http_upstream_dyups_module/ngx_http_dyups_module.c:2179:5: note: each undeclared identifier is reported only once for each function it appears in
make[1]: *** [objs/Makefile:2711: objs/addon/ngx_http_upstream_dyups_module/ngx_http_dyups_module.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/home/runner/work/tengine/tengine'
make: *** [Makefile:10: build] Error 2
Error: Process completed with exit code 2.
trylock也能保证几乎每个进程都能公平,抢到锁的进程自己更新完之后会设置一个 所以我觉得trylock即可解决 但是跟极端的情况可能是:第一个进程(假设a)自己更新完之后等待超时之后,继续参与抢锁,此时如果还有进程前一轮更新没有完成,进程a又拿到锁有失公平,相当于第二轮更新任务来的时候 第一轮的进程a把别的进程第一轮更新任务的权限抢走开始更新自己第二轮了。 ==> 但是这种情况出现其实我觉得已经不是 串行加锁还是随机加锁的问题了,本身更新已经超出负荷了,如果这个负荷一直持续 其实 就算公平串行抢锁后续更新堆积越来越多了。更好的方法是加到readmsg timeout 或者看看是不是 被动健康检查溢出了,一般我们业务如果后端peers数量巨大都不推荐开被动健康检查了。会有一个统一的中心server让他来做主动健康检查 所以trylock 还是 (PR里)串行lock都只是解决进程阻塞(这个是lock本身的问题)。(我看串行lock逻辑其实也是一个 串行+trylock的逻辑),解决不了任务堆积过多的问题。 如果这里一定要加这个逻辑,其实所有nginx 的ngx_shm_trylock都有类似问题,更好的方式可能是实现一个 ngx_shm_serial_trylock的api,这样直接使用即可。所以我觉得最好这个pr保留在这即可。
|
fix a compiler bug caused by missing symbols
问题:
当nginx中保存ip数量在10000左右,且开启健康检查情况下,通过dyups模块更新一个ip数量在1300以上的upstream。根据测试,发现每个进程大约会阻塞200ms,若机器有50+个worker进程,进程间阻塞加锁更新方式将会导致大量正常请求得不到处理。
解决方式:
添加串行更新指令,让nginx的每个进程串行更新upstream。具体为设置flag原子变量,当且仅当进程ngx_worker==flag时,该进程才有权加锁更新。当该进程更新完成,对原子变量进行+1操作,下一个进程将会去更新。 进程间的串行更新,避免了多进程同时阻塞去更新upstream。每个进程检查原子变量flag的时间间隔依赖于dyups_read_msg_timeout参数。