Fix jq healthcheck to account for nulls #370

Merged Apr 15, 2025 (1 commit)

@peixian (Member) commented Apr 13, 2025

This healthcheck has been broken for a while now, although I think we usually just don't notice it.

How this started

I noticed that we were seeing pretty frequent healthcheck errors when this got deployed

[Screenshot: healthcheck errors, 2025-04-13 6:48:54 PM]

However, the pods themselves were clearly healthy, which was odd: manual inspection of the pod metrics looked fine, and WA was clearly not down.

Debugging

On a pod:

wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$ cat /tmp/stats.txt
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,wrew,connect,reuse,cache_lookups,cache_hits,srv_icur,src_ilim,qtime_max,ctime_max,rtime_max,ttime_max,eint,idle_conn_cur,safe_conn_cur,used_conn_cur,need_conn_est,uweight,agg_server_status,agg_server_check_status,agg_check_status,srid,sess_other,h1sess,h2sess,h3sess,req_other,h1req,h2req,h3req,proto,-,ssl_sess,ssl_reused_sess,ssl_failed_handshake,h2_headers_rcvd,h2_data_rcvd,h2_settings_rcvd,h2_rst_stream_rcvd,h2_goaway_rcvd,h2_detected_conn_protocol_errors,h2_detected_strm_protocol_errors,h2_rst_stream_resp,h2_goaway_resp,h2_open_connections,h2_backend_open_streams,h2_total_connections,h2_backend_total_streams,h1_open_connections,h1_open_streams,h1_total_connections,h1_total_streams,h1_bytes_in,h1_bytes_out,h1_spliced_bytes_in,h1_spliced_bytes_out,
stats,FRONTEND,,,2,3,27500,141258,1233762,76321709,0,0,127118,,,,,OPEN,,,,,,,,,1,1,0,,,,0,1,0,3,,,,0,14250,0,127118,0,0,,1,3,141369,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,1,3,141258,14251,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,141258,0,0,0,141369,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,141258,14252,1192261,102995133,0,0,
stats,BACKEND,0,0,0,0,2750,0,1233762,76321709,0,0,,0,0,0,0,UP,0,0,0,,0,424098,,,1,1,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,0,,,0,0,0,6497,,,,,,,,,,,,,,http,,,,,,,,0,0,0,0,0,,,0,0,0,30008,0,,,,,0,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_http,FRONTEND,,,2890,3550,27495,341922,141496598,585822497,0,0,6743,,,,,OPEN,,,,,,,,,1,2,0,,,,0,36,0,131,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,36,131,341922,,0,0,0,,,,,,,,,,,0,,,,,,,,,,341922,0,0,0,0,0,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_https,FRONTEND,,,4201,4967,27495,45545,266546615,843378279,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,11,0,44,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,34,103,292364,,0,0,0,,,,,,,,,,,0,,,,,,,,,,45545,0,0,0,0,0,0,0,,-,45487,66,199589,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_xmpp,FRONTEND,,,4139,4669,27495,291565,221990895,714391600,0,0,6888,,,,,OPEN,,,,,,,,,1,4,0,,,,0,38,0,106,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,38,106,291565,,0,0,0,,,,,,,,,,,0,,,,,,,,,,291565,0,0,0,0,0,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_whatsapp_net,FRONTEND,,,0,5,27495,127177,20912,757601,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,1,0,3,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,1,3,127177,,0,0,0,,,,,,,,,,,0,,,,,,,,,,127177,0,0,0,0,0,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wa_whatsapp_net,whatsapp_net_443,0,0,0,5,,127758,20912,757601,,0,,34,0,581,0,UP,1,1,0,1,0,424098,0,,1,6,1,,127177,,2,1,,5,L4OK,,1,,,,,,,0,,,,18,7,,,,,1,,,0,2002,0,2005,,,,Layer4 check passed,,2,3,4,,,,,,tcp,,,,,,,,0,127758,0,,,0,,0,18141,0,71328,0,0,0,0,2,1,,,,0,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
wa_whatsapp_net,BACKEND,0,0,0,5,2750,127177,20912,757601,0,0,,34,0,581,0,UP,1,1,0,,0,424098,0,,1,6,0,,127177,,1,1,,3,,,,,,,,,,,,,,18,7,0,0,0,0,1,,,0,2002,0,2005,,,,,,,,,,,,,,tcp,,,,,,,,0,127758,0,,,,,0,18141,0,71328,0,,,,,1,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wa,g_whatsapp_net_5222,0,0,6082,6318,,436152,488537510,1557769879,,0,,8143,4877,105930,0,UP,1,1,0,1,0,424098,0,,1,7,1,,330222,,2,56,,196,L4OK,,2,,,,,,,0,,,,144164,15673,,,,,0,,,1787,1064,0,129783,,,,Layer4 check passed,,2,3,4,,,,,,tcp,,,,,,,,0,436152,0,,,0,,199919,18228,0,3822038,0,0,0,0,2,1,,,,0,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
wa,BACKEND,0,0,6082,6318,5499,337110,488537510,1557769879,0,0,,8143,4877,105930,0,UP,1,1,0,,0,424098,0,,1,7,0,,330222,,1,50,,117,,,,,,,,,,,,,,151052,15673,0,0,0,0,0,,,1787,1064,0,129783,,,,,,,,,,,,,,tcp,,,,,,,,0,436152,0,,,,,199919,18228,0,3822038,0,,,,,1,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wa_http,g_whatsapp_net_80,0,0,2890,3550,,388030,141496598,585822497,,0,,4012,2986,52851,0,UP,1,1,0,4,0,424098,0,,1,8,1,,335179,,2,37,,135,L4OK,,1,,,,,,,0,,,,181169,7115,,,,,0,,,0,667,0,75233,,,,Layer4 check passed,,2,3,4,,,,,,tcp,,,,,,,,0,388030,0,,,0,,0,18171,0,3720402,0,0,0,0,2,1,,,,0,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
wa_http,BACKEND,0,0,2890,3550,2750,341922,141496598,585822497,0,0,,4012,2986,52851,0,UP,1,1,0,,0,424098,0,,1,8,0,,335179,,1,36,,131,,,,,,,,,,,,,,187912,7115,0,0,0,0,0,,,0,667,0,75233,,,,,,,,,,,,,,tcp,,,,,,,,0,388030,0,,,,,0,18171,0,3720402,0,,,,,1,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$ tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c '. | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)'
jq: error (at <stdin>:1836): null (null) and string ("whatsapp_net") cannot have their containment checked

A null here means at least one line has no second field, and selecting the raw fields confirms it:

wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$ tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c '. | {field1: .[1], isNull: (.[1] == null)}'
{"field1":"svname","isNull":false}
{"field1":"FRONTEND","isNull":false}
{"field1":"BACKEND","isNull":false}
{"field1":"FRONTEND","isNull":false}
{"field1":"FRONTEND","isNull":false}
{"field1":"FRONTEND","isNull":false}
{"field1":"FRONTEND","isNull":false}
{"field1":"whatsapp_net_443","isNull":false}
{"field1":"BACKEND","isNull":false}
{"field1":"g_whatsapp_net_5222","isNull":false}
{"field1":"BACKEND","isNull":false}
{"field1":"g_whatsapp_net_80","isNull":false}
{"field1":"BACKEND","isNull":false}
{"field1":null,"isNull":true}

{"field1":null,"isNull":true} is our problem.
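To make the failure mode concrete, here is a minimal reproduction. It rests on an assumption the PR does not confirm: that the null comes from a line with fewer than two comma-separated fields, such as an empty line in the stats dump. In jq, `split(",")` on an empty string yields an empty array, so `.[1]` is null, and `contains` then fails on it. `field1_report` is a hypothetical helper name used only for this sketch.

```shell
#!/bin/sh
# Sketch only: field1_report is a hypothetical helper name.
# Assumption: the null comes from a line with fewer than two
# comma-separated fields, such as an empty line in the stats dump.
field1_report() {
  jq -R -c 'split(",") | {field1: .[1], isNull: (.[1] == null)}'
}

# One normal row followed by an empty line; the second object reports a null.
printf 'wa,g_whatsapp_net_80\n\n' | field1_report

# Guarding with select(.[1] != null) drops the short line before contains() runs.
printf 'wa,g_whatsapp_net_80\n\n' |
  jq -R -r 'split(",") | select(.[1] != null) | select(.[1] | contains("whatsapp_net")) | .[1]'
```

A non-empty line without any comma behaves the same way, since it splits into a single-element array.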

Skipping over the nulls gives the expected output:

wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$ tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]'
Layer4 check passed
Layer4 check passed
Layer4 check passed

We can see that this health check now succeeds:

wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$ tail -n +1 /tmp/stats.txt | jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)'
wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$  echo $?
0
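One caveat when reading the exit status: without `-e`/`--exit-status`, jq exits 0 whether or not the filter emits anything, so the signal here is the empty output rather than `$?` alone. The sketch below shows one way to turn "no failing lines" into an explicit exit status. The actual healthcheck script is not shown in this PR, so `failing_backends`, `check_backends`, and `sample_row` are hypothetical names, and the sample rows are synthetic 66-field lines rather than real HAProxy output.

```shell
#!/bin/sh
# Sketch only: all function names and sample rows here are hypothetical.

# Print one "svname: check_desc" line per whatsapp_net server whose
# check_desc (field 65) is not "Layer4 check passed".
failing_backends() {
  jq -R -r '
    split(",")
    | select(.[1] != null)                      # skip lines without an svname field
    | select(.[1] | contains("whatsapp_net"))
    | select((.[65] // "") | test("Layer4 check passed") | not)
    | "\(.[1]): \(.[65])"
  ' "$1"
}

# Return 0 when every matching server passed, 1 when any did not,
# and 2 when jq itself failed (so a parse error is never read as healthy).
check_backends() {
  out=$(failing_backends "$1") || return 2
  [ -z "$out" ]
}

# Build a 66-field sample row: pxname, svname, 63 empty fields, check_desc.
sample_row() {
  jq -rn --arg sv "$1" --arg desc "$2" \
    '["wa", $sv] + [range(63) | ""] + [$desc] | join(",")'
}

stats=$(mktemp)
# A passing row followed by an empty line, which the null guard skips.
{ sample_row g_whatsapp_net_80 "Layer4 check passed"; echo; } > "$stats"
check_backends "$stats" && echo healthy || echo unhealthy

# Appending a failing row flips the result.
sample_row whatsapp_net_443 "Layer4 check failed" >> "$stats"
check_backends "$stats" && echo healthy || echo unhealthy
rm -f "$stats"
```

Run as written, this is expected to print `healthy` and then `unhealthy`. Printing the server name alongside the description keeps the output non-empty even when field 65 is missing, so a short row counts as a failure instead of slipping through.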

Hasn't this been around forever?

Yep. Which meant that pods would sometimes get restarted in Kubernetes for seemingly random reasons. I suspect the root cause is that when HAProxy is under heavy load, the stats file comes out slightly different, which adds this null value. We should switch to Prometheus-based metrics for health checks when they're available.
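Since the concern is the stats file changing shape, it may help to note that the hard-coded `.[65]` in the commands above corresponds to the `check_desc` column of the header row. A possible way to make that explicit is to look the column up by name. This is only a sketch and not part of the PR: `desc_by_name` is a hypothetical helper, it assumes the first line of the file is the `# pxname,svname,...` header, and the sample file is a shortened four-column stand-in for the real dump.

```shell
#!/bin/sh
# Sketch only: desc_by_name is a hypothetical helper, not part of the PR.
# Assumes the first line of the file is the "# pxname,svname,..." header
# and that the header contains a check_desc column.
desc_by_name() {
  jq -R -r -n '
    # Read the header line and find the position of check_desc.
    (input | ltrimstr("# ") | split(",") | index("check_desc")) as $col
    # Process the remaining lines with the same guards as before.
    | inputs
    | split(",")
    | select(.[1] != null)
    | select(.[1] | contains("whatsapp_net"))
    | "\(.[1]): \(.[$col])"
  ' "$1"
}

stats=$(mktemp)
# Shortened sample: header, one server row, and a trailing empty line.
printf '%s\n' \
  '# pxname,svname,status,check_desc' \
  'wa,g_whatsapp_net_80,UP,Layer4 check passed' \
  '' > "$stats"
desc_by_name "$stats"
rm -f "$stats"
```

The trade-off is a slightly longer filter in exchange for not depending on the column's position.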

@peixian peixian requested a review from eozturk1 April 13, 2025 22:58
@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 13, 2025
@peixian peixian requested a review from BenWunderlich April 13, 2025 22:58
@eozturk1 (Contributor) left a comment

Can we also test a failing case?

@peixian (Member, Author) commented Apr 15, 2025

@eozturk1

Failing case

pxw@pxw-mbp ~/c/scratch> cat stats.txt
wapox-chat-whatsapp-proxy-chart-7cc466897c-jfcbs:/$ cat /tmp/stats.txt
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,agent_status,agent_code,agent_duration,check_desc,agent_desc,check_rise,check_fall,check_health,agent_rise,agent_fall,agent_health,addr,cookie,mode,algo,conn_rate,conn_rate_max,conn_tot,intercepted,dcon,dses,wrew,connect,reuse,cache_lookups,cache_hits,srv_icur,src_ilim,qtime_max,ctime_max,rtime_max,ttime_max,eint,idle_conn_cur,safe_conn_cur,used_conn_cur,need_conn_est,uweight,agg_server_status,agg_server_check_status,agg_check_status,srid,sess_other,h1sess,h2sess,h3sess,req_other,h1req,h2req,h3req,proto,-,ssl_sess,ssl_reused_sess,ssl_failed_handshake,h2_headers_rcvd,h2_data_rcvd,h2_settings_rcvd,h2_rst_stream_rcvd,h2_goaway_rcvd,h2_detected_conn_protocol_errors,h2_detected_strm_protocol_errors,h2_rst_stream_resp,h2_goaway_resp,h2_open_connections,h2_backend_open_streams,h2_total_connections,h2_backend_total_streams,h1_open_connections,h1_open_streams,h1_total_connections,h1_total_streams,h1_bytes_in,h1_bytes_out,h1_spliced_bytes_in,h1_spliced_bytes_out,
stats,FRONTEND,,,2,3,27500,141258,1233762,76321709,0,0,127118,,,,,OPEN,,,,,,,,,1,1,0,,,,0,1,0,3,,,,0,14250,0,127118,0,0,,1,3,141369,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,http,,1,3,141258,14251,0,0,0,,,0,0,,,,,,,0,,,,,,,,,,0,141258,0,0,0,141369,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,141258,14252,1192261,102995133,0,0,
stats,BACKEND,0,0,0,0,2750,0,1233762,76321709,0,0,,0,0,0,0,UP,0,0,0,,0,424098,,,1,1,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,0,0,0,0,0,0,0,0,,,0,0,0,6497,,,,,,,,,,,,,,http,,,,,,,,0,0,0,0,0,,,0,0,0,30008,0,,,,,0,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_http,FRONTEND,,,2890,3550,27495,341922,141496598,585822497,0,0,6743,,,,,OPEN,,,,,,,,,1,2,0,,,,0,36,0,131,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,36,131,341922,,0,0,0,,,,,,,,,,,0,,,,,,,,,,341922,0,0,0,0,0,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_https,FRONTEND,,,4201,4967,27495,45545,266546615,843378279,0,0,0,,,,,OPEN,,,,,,,,,1,3,0,,,,0,11,0,44,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,34,103,292364,,0,0,0,,,,,,,,,,,0,,,,,,,,,,45545,0,0,0,0,0,0,0,,-,45487,66,199589,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_xmpp,FRONTEND,,,4139,4669,27495,291565,221990895,714391600,0,0,6888,,,,,OPEN,,,,,,,,,1,4,0,,,,0,38,0,106,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,38,106,291565,,0,0,0,,,,,,,,,,,0,,,,,,,,,,291565,0,0,0,0,0,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
haproxy_v4_whatsapp_net,FRONTEND,,,0,5,27495,127177,20912,757601,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,1,0,3,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,tcp,,1,3,127177,,0,0,0,,,,,,,,,,,0,,,,,,,,,,127177,0,0,0,0,0,0,0,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wa_whatsapp_net,whatsapp_net_443,0,0,0,5,,127758,20912,757601,,0,,34,0,581,0,UP,1,1,0,1,0,424098,0,,1,6,1,,127177,,2,1,,5,L4OK,,1,,,,,,,0,,,,18,7,,,,,1,,,0,2002,0,2005,,,,Layer4 check failed,,2,3,4,,,,,,tcp,,,,,,,,0,127758,0,,,0,,0,18141,0,71328,0,0,0,0,2,1,,,,0,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
wa_whatsapp_net,BACKEND,0,0,0,5,2750,127177,20912,757601,0,0,,34,0,581,0,UP,1,1,0,,0,424098,0,,1,6,0,,127177,,1,1,,3,,,,,,,,,,,,,,18,7,0,0,0,0,1,,,0,2002,0,2005,,,,,,,,,,,,,,tcp,,,,,,,,0,127758,0,,,,,0,18141,0,71328,0,,,,,1,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wa,g_whatsapp_net_5222,0,0,6082,6318,,436152,488537510,1557769879,,0,,8143,4877,105930,0,UP,1,1,0,1,0,424098,0,,1,7,1,,330222,,2,56,,196,L4OK,,2,,,,,,,0,,,,144164,15673,,,,,0,,,1787,1064,0,129783,,,,Layer4 check passed,,2,3,4,,,,,,tcp,,,,,,,,0,436152,0,,,0,,199919,18228,0,3822038,0,0,0,0,2,1,,,,0,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
wa,BACKEND,0,0,6082,6318,5499,337110,488537510,1557769879,0,0,,8143,4877,105930,0,UP,1,1,0,,0,424098,0,,1,7,0,,330222,,1,50,,117,,,,,,,,,,,,,,151052,15673,0,0,0,0,0,,,1787,1064,0,129783,,,,,,,,,,,,,,tcp,,,,,,,,0,436152,0,,,,,199919,18228,0,3822038,0,,,,,1,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
wa_http,g_whatsapp_net_80,0,0,2890,3550,,388030,141496598,585822497,,0,,4012,2986,52851,0,UP,1,1,0,4,0,424098,0,,1,8,1,,335179,,2,37,,135,L4OK,,1,,,,,,,0,,,,181169,7115,,,,,0,,,0,667,0,75233,,,,Layer4 check failed,,2,3,4,,,,,,tcp,,,,,,,,0,388030,0,,,0,,0,18171,0,3720402,0,0,0,0,2,1,,,,0,,,,,,,,,,-,0,0,0,,,,,,,,,,,,,,,,,,,,,,
wa_http,BACKEND,0,0,2890,3550,2750,341922,141496598,585822497,0,0,,4012,2986,52851,0,UP,1,1,0,,0,424098,0,,1,8,0,,335179,,1,36,,131,,,,,,,,,,,,,,187912,7115,0,0,0,0,0,,,0,667,0,75233,,,,,,,,,,,,,,tcp,,,,,,,,0,388030,0,,,,,0,18171,0,3720402,0,,,,,1,0,0,0,,,,,,,,,,,-,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
pxw@pxw-mbp ~/c/scratch> tail -n +1 stats.txt | jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]'
Layer4 check failed
Layer4 check passed
Layer4 check failed
pxw@pxw-mbp ~/c/scratch> tail -n +1 stats.txt | jq -R 'split(",")' | jq -c 'select(.[1] != null) | select(.[1] | contains("whatsapp_net"))' | jq --raw-output '.[65]| select(. | test("Layer4 check passed") | not)'

Layer4 check failed
Layer4 check failed

@peixian peixian merged commit ccb8989 into WhatsApp:main Apr 15, 2025
3 checks passed
@Bank-bri

Bank-bri commented Apr 16, 2025 via email

@malek3988

q
