# client
m
Hi! We seem to have a bit of an issue with Netmaker. Every 5-ish minutes connections halt for around 10 seconds. The issue seems to be consistent between different peers. Is there some kind of operation that Netmaker does every 5 minutes, or do you have other ideas what could be the cause?
Sometimes the gap between halts can be as long as 10 minutes, and the pause can last as long as 20 seconds
b
the server sends a peer update to all nodes every 5 minutes --- a peer update does not bring the wireguard interface up/down so connections should not be halted
m
yeah. there have not been any changes to configs, yet for example pinging sees losses for a 10-20 second period every 5 minutes or so
and the issue is consistent across different peers in different places, with the netmaker server being in AWS
b
the peer update is sent unconditionally
m
the wireguard config file update times seem to line up with the gaps, so it seems highly likely that netclient is halting the connection
so it halts the connection even though there is no interface up/down procedure?
b
it should not be halting the interface.. will need to dig into the code to see why it is happening
m
Yeah. If you need any extra info just ask!
In our usage scenario the netclient platform is either Raspbian (RPi OS, 32-bit) or Rocky Linux on amd64
we discovered that the netclient is calling "wg set interface peer" every five minutes, even if there are no changes. is this maybe the cause? could there be a way to diff the old and new config before calling wg?
b
actually we currently do that: when the message is received from the server, we check to see if it is identical to a message received within the last TIME_PERIOD, and if so, discard it. We may have to adjust the TIME_PERIOD.
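For anyone reading along, the dedup idea above can be sketched in a few lines of shell. This is a hypothetical illustration, not the actual netclient code, and it ignores the TIME_PERIOD window for simplicity: hash the incoming peer update and skip re-applying when it matches the last one applied.

```shell
# Hypothetical sketch (not netclient code): only re-apply a peer update
# when it differs from the last one we applied.

last_hash_file=$(mktemp)   # stores the hash of the last applied update

apply_if_changed() {
    new_hash=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
    if [ "$new_hash" = "$(cat "$last_hash_file")" ]; then
        echo "unchanged: skipping wg set"
    else
        printf '%s' "$new_hash" > "$last_hash_file"
        echo "changed: would run 'wg set' here"
    fi
}

update="peer ABC123 endpoint 1.2.3.4:51820 keepalive 20"
apply_if_changed "$update"   # prints "changed: would run 'wg set' here"
apply_if_changed "$update"   # prints "unchanged: skipping wg set"
```

The real fix would also need the time window from the message above, but the core of it is just "compare before applying".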
m
yeah, disconnects every 5 minutes are quite annoying usage wise, so any kind of solution, like adjusting the interval, would be nice
b
we are thinking of updating it for the next_release
m
okay. now, i know this is an annoying question, but do you have estimated date for that? 😄
b
hopefully next week, but it may be the week after that, have some HA bugs to fix
m
Do we have any update? My boss is asking questions 😄
b
today or tomorrow: TIMEPERIOD will be 24 hours
f
Hello! The issue seems to be even worse now after the updates. Here is a picture of the data loss
b
Ok, i will have to dig into it
f
--- 192.168.192.1 ping statistics ---
413 packets transmitted, 239 received, 42.1307% packet loss, time 416659ms
rtt min/avg/max/mdev = 67.377/113.252/172.114/28.433 ms
m
Okay, the pauses now seem to happen every 4-5 minutes AND whenever we do any change to any node in UI. @flat-alarm-21130 will post a screenshot..
Disabling dynamic endpoints & ports helped a bit, but the issue persists. the update interval now seems to be 10 minutes
f
So same problem still persists after the update
b
i can't seem to reproduce your issue regarding updates to nodes causing drop out. I have node1 pinging node2 and I make a change to node3.... obviously node1 and node2 receive peer updates and I experience 0 packet loss on the ping
can I ask how you are gathering your connectivity data?
if I make a change to node2 ... the ping hangs (as expected) and needs to be restarted
m
we use prometheus and thanos
j
@flat-alarm-21130 @miniature-judge-91363 if you have the time we'd like to hop on a voice call / screen share to take a look at what's going on
m
but now we discovered something new! the issue might be with keepalive: we had two nodes, one with a constant ping to the server and one without. as you can see from the picture there are pauses on the node without constant ping activity
and we ofc tried switching the pinging node and then the pauses switched from one node to another
we are atm using the default keepalive of 20
and yeah, that might be option tomorrow! today we likely won't make it
and yeah, after the update, whenever we make a change to an ingress node, the external clients of that node won't work until a manual netclient pull
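Side note for context: the "keepalive of 20" being discussed maps to WireGuard's PersistentKeepalive setting, which exists precisely because a NAT'd peer with no traffic eventually loses its NAT mapping, matching the "pauses on the node without constant ping" symptom. A minimal conf-file fragment showing the setting (the key, endpoint, and addresses are placeholders):

```ini
; Hypothetical [Peer] fragment; values are made up for illustration
[Peer]
PublicKey = <peer-public-key>
Endpoint = 203.0.113.10:51820
AllowedIPs = 192.168.192.1/32
; send a keepalive packet every 20 s so the NAT mapping stays open
PersistentKeepalive = 20
```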
j
can you provide recreation steps? @echoing-controller-96073 will test
m
okay, so we have the NM server running on docker, 3 main ingress nodes running the ubuntu netclient, and the rest of the nodes running either raspbian or rocky linux. all of the netclients are installed from your repositories (dnf on rocky, apt on ubuntu and raspbian). the changes we made were for example changing keepalives for all of the nodes (20 -> 25), and after that handshakes (external client -> ingress node, or between nodes) did not work. handshakes between nodes and the netmaker server do work, however. a manual netclient pull on either end of a node connection pair fixes handshaking. the connection pauses now seem consistent in the sense that if we have any kind of nonstop connection (ping or ssh for example) in the tunnel, the pauses do not happen.
j
ok, we're going to look into the ingress issue now
for the connection issues, what would be a good time tomorrow? (we're all in US ET time zone)
m
in the morning US ET would be fine, as we operate on Eastern European Summer Time (EEST). maybe 07.00 ET (14 EEST), but by 9.00 am ET (16 EEST) at the latest if possible, as we have a previously agreed programme at 17.00 EEST
f
Ok, so we have narrowed the problem down: in our case NODE 1 (192.168.192.1) has an NM node and Prometheus + Thanos in Docker. When we keep pinging it from our "outer node" (let's call it NODE 2, with IP 192.168.192.14; this has raspbian and NM installed from APT), the data loss is gone. So it seems that not all the connections between nodes are actually kept alive; they get "paused" after 5 min without constant pinging.
m
Also for the sake of curiosity: how do you define the IP for the interface on linux, as Address doesn't seem to exist under [Interface] in the WG conf file? I need this info for the sake of science 😄 EDIT: i skimmed through the source, and it seems the answer is "ip address add dev"
(Especially because the IPv6 address appears in the config, but the IPv4 one doesn't)
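For reference, the same addressing can be reproduced by hand with iproute2; a hypothetical sequence (interface name, addresses, and conf path are placeholder values, and these need root):

```shell
# Hypothetical reproduction of addressing the interface outside the conf file:
ip link add dev netmaker type wireguard          # create the WG interface
ip address add 192.168.192.14/32 dev netmaker    # IPv4 set via iproute2, not via Address=
wg setconf netmaker /etc/netclient/netmaker.conf # keys/peers come from the conf file
ip link set up dev netmaker
```

This is why the conf file can omit Address= entirely: wg(8) itself never touches addresses, only keys and peers.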
f
Also we noticed that pinging with node to node fixes this issue. Issue still persists if we ping from ext client to node.
b
Do all of the nodes have a static public ip or are some behind NAT?
m
The 3 central nodes which all of the rest connect to have public static IPs, the rest are behind NAT
b
Any correlation between connectivity interruptions and dhcp renewals?
m
Nope; and the pauses happen consistently at same time with different nodes on different locations and IP's
f
And different ISP
b
All nodes pause at same time?
m
Seems that way. We have currently compared only 2 at a time, and there is a really small difference due to different scrape intervals (at most 5 sec)
b
Ok that is interesting. Not sure what to make of it
j
It would be interesting to see with a larger set if they all have the pause within the same time span, or if it's asynchronous
unfortunately I dont think we'll be able to meet today, I will PM to plan a meeting
m
Yeah, and the most interesting part: we used to have the same setup running on plain Wireguard before, and it worked like a charm. Sooooooo... 😄
f
Ok, we will try to get you a graph from prometheus/thanos of all nodes that share the same scrape configuration. We currently have two configs there, for lab testing nodes and field testing nodes. These are IoT devices based on the RPi.
In lab we have also the Rocky Linux machine with intel based hardware
But yeah at least now we can leave ping running there in systemd or something 😄
b
if you are not pinging, how do you detect a connectivity interruption?
f
From data, Prometheus cannot access the device and will get timeout
b
how often does Prometheus collect data?
f
Every 5s
b
thanks, I will do some experiments
j
FYI what version of server/client are you running?
f
We have updated all to 0.14.5
b
and clients are running the hotfix for 0.14.5
m
latest provided by your repo, so supposedly yes
b
thanks
how long does it take prometheus to gather the data? Is it possible that prometheus times out because a previous instance is still collecting data?
m
One scrape takes 50-250ms and scrape interval is 5s so nope
b
hmmmm.... I duplicated your setup (except running on a vps rather than a pi) and ran a task that ssh'd into a node every 5 secs... Ran for an hour+ with no loss of connectivity, except when changing parameters on a node or gateway, which would result in the interface going up/down (as expected)
m
Okay, a long time later: it seems that we finally pinpointed the issue, which was actually a combination of jitter, the prometheus scrape interval, and other variables. Long story short, all issues vanish when we use the default keepalive of 20 sec (or any keepalive) and scrape at 10 sec intervals. This was likely due to prometheus timing out connections with extra jitter, from 4G or otherwise less reliable links. Thanks for the help during the debug process, and sorry for the trouble we caused 😄
@flat-alarm-21130 can post more detailed analysis later
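For anyone who lands here with the same symptoms, the fix described above is plain Prometheus scrape tuning plus the WireGuard keepalive. A sketch of the relevant scrape settings (the job name and the exact timeout value are assumptions, not from this thread):

```yaml
# Hypothetical prometheus.yml fragment matching the resolution above
scrape_configs:
  - job_name: "iot-nodes"    # placeholder job name
    scrape_interval: 10s     # was 5s; 10s rides out tunnel jitter on 4G links
    scrape_timeout: 8s       # must stay <= scrape_interval
```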
b
Glad you figured it out (and that it wasn't due to netmaker) 😄