# client
m
Hi! We seem to have a bit of an issue with Netmaker. Every 5-ish minutes connections halt for around 10 seconds. The issue seems to be consistent between different peers. Is there some kind of operation that Netmaker does every 5 minutes, or do you have other ideas what could be the cause?
Sometimes the gap between halts can be as long as 10 minutes, and the pause can last as long as 20 seconds
b
the server sends a peer update to all nodes every 5 minutes --- a peer update does not bring the wireguard interface up/down so connections should not be halted
m
yeah. there have not been any changes to configs, yet for example pinging sees losses for a 10-20 second period every 5 minutes or so
and the issue is consistent across different peers in different places, with the netmaker server being in AWS
b
the peer update is sent unconditionally
m
the wireguard config file update times seem to line up with the gaps, so it seems highly likely that netclient is halting the connection
so it halts the connection even though there is no interface up/down procedure?
b
it should not be halting the interface.. will need to dig into the code to see why it is happening
m
Yeah. If you need any extra info just ask!
In our usage scenario the netclient platform is either Raspbian (RPi OS, 32-bit) or Rocky Linux on amd64
we discovered that the netclient is calling "wg set interface peer" every five minutes, even if there are no changes. is this maybe the cause? could there be a way to diff the old and new config before calling wg?
b
actually we currently do that: when the message is received from the server, we check to see if it is identical to a message received within the last TIME_PERIOD, and if so, discard it. We may have to adjust the TIME_PERIOD.
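For anyone reading along, the dedup idea above can be sketched in a few lines of shell. This is a hypothetical illustration, not the actual netclient code, and it ignores the TIME_PERIOD window for simplicity: hash the incoming peer update and skip re-applying when it matches the last one applied.

```shell
# Hypothetical sketch (not netclient code): only re-apply a peer update
# when it differs from the last one we applied.

last_hash_file=$(mktemp)   # stores the hash of the last applied update

apply_if_changed() {
    new_hash=$(printf '%s' "$1" | sha256sum | cut -d' ' -f1)
    if [ "$new_hash" = "$(cat "$last_hash_file")" ]; then
        echo "unchanged: skipping wg set"
    else
        printf '%s' "$new_hash" > "$last_hash_file"
        echo "changed: would run 'wg set' here"
    fi
}

update="peer ABC123 endpoint 1.2.3.4:51820 keepalive 20"
apply_if_changed "$update"   # prints "changed: would run 'wg set' here"
apply_if_changed "$update"   # prints "unchanged: skipping wg set"
```

The real fix would also need the time window from the message above, but the core of it is just "compare before applying".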
m
yeah, disconnects every 5 minutes are quite annoying usage wise, so any kind of solution, like adjusting the interval, would be nice
b
we are thinking of updating it for the next_release
m
okay. now, i know this is an annoying question, but do you have estimated date for that? 😄
b
hopefully next week, but it may be the week after that, have some HA bugs to fix
m
Do we have any update? My boss is asking questions 😄
b
today or tomorrow: TIMEPERIOD will be 24 hours
f
Hello! The issue seems to be even worse now after the updates. Here is a picture of the data loss
b
Ok, i will have to dig into it
f
--- 192.168.192.1 ping statistics ---
413 packets transmitted, 239 received, 42.1307% packet loss, time 416659ms
rtt min/avg/max/mdev = 67.377/113.252/172.114/28.433 ms
m
Okay, the pauses now seem to happen every 4-5 minutes AND whenever we do any change to any node in UI. @flat-alarm-21130 will post a screenshot..
Disabling dynamic endpoints & ports helped a bit, but the issue persists. the update interval now seems to be 10 minutes
f
So same problem still persists after the update
b
i can't seem to reproduce your issue regarding updates to nodes causing drop out. I have node1 pinging node2 and I make a change to node3.... obviously node1 and node2 receive peer updates and I experience 0 packet loss on the ping
can I ask how you are gathering your connectivity data?
if I make a change to node2 ... the ping hangs (as expected) and needs to be restarted
m
we use prometheus and thanos
j
@flat-alarm-21130 @miniature-judge-91363 if you have the time we'd like to hop on a voice call / screen share to take a look at what's going on
m
but now we discovered something new! the issue might be with keepalive: we had two nodes, one with a constant ping to the server and one without. as you can see from the picture there are pauses on the node without constant ping activity
and we ofc tried switching the pinging node and then the pauses switched from one node to another
we are atm using the default keepalive of 20
and yeah, that might be option tomorrow! today we likely won't make it
and yeah, after the update, whenever we make a change to an ingress node, the external clients of that node won't work until a manual netclient pull
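Side note for context: the "keepalive of 20" being discussed maps to WireGuard's PersistentKeepalive setting, which exists precisely because a NAT'd peer with no traffic eventually loses its NAT mapping, matching the "pauses on the node without constant ping" symptom. A minimal conf-file fragment showing the setting (the key, endpoint, and addresses are placeholders):

```ini
; Hypothetical [Peer] fragment; values are made up for illustration
[Peer]
PublicKey = <peer-public-key>
Endpoint = 203.0.113.10:51820
AllowedIPs = 192.168.192.1/32
; send a keepalive packet every 20 s so the NAT mapping stays open
PersistentKeepalive = 20
```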
j
can you provide recreation steps? @echoing-controller-96073 will test
m
okay, so we have the NM server running on docker, 3 main ingress nodes running the ubuntu netclient, and the rest of the nodes running either raspbian or rocky linux. all of the netclients are installed from your repositories (dnf on rocky, apt on ubuntu and raspbian). the changes we made were for example changing keepalives for all of the nodes (20 -> 25), and after that handshakes (external client -> ingress node, or between nodes) did not work. handshakes between nodes and the netmaker server do work, however. a manual netclient pull on either end of a node connection pair fixes handshaking. the connection pauses now seem consistent in the sense that if we have any kind of nonstop connection (ping or ssh for example) in the tunnel, the pauses do not happen.
j
ok, we're going to look into the ingress issue now
for the connection issues, what would be a good time tomorrow? (we're all in US ET time zone)
m
in the morning US ET would be fine, as we operate on Eastern European Summer Time (EEST). maybe 07.00 ET (14 EEST), but by 9.00 am ET (16 EEST) at the latest if possible, as we have a previously agreed programme at 17.00 EEST
f
Ok, so we have narrowed the problem down: in our case NODE 1 (192.168.192.1) has an NM node and Prometheus + Thanos in Docker. When we keep pinging it from our "outer node" (let's call it NODE 2, with IP 192.168.192.14; this has raspbian and NM installed from APT), the data loss is gone. So it seems that not all the connections between nodes are actually kept alive; they get "paused" after 5 min without constant pinging.
m
Also for the sake of curiosity: how do you define the IP for the interface on linux, as Address doesn't seem to exist under [Interface] in the WG conf file? I need this info for the sake of science 😄 EDIT: i skimmed through the source, and it seems the answer is "ip address add dev"
(Especially because the IPv6 address appears in the config, but the IPv4 one doesn't)
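For reference, the same addressing can be reproduced by hand with iproute2; a hypothetical sequence (interface name, addresses, and conf path are placeholder values, and these need root):

```shell
# Hypothetical reproduction of addressing the interface outside the conf file:
ip link add dev netmaker type wireguard          # create the WG interface
ip address add 192.168.192.14/32 dev netmaker    # IPv4 set via iproute2, not via Address=
wg setconf netmaker /etc/netclient/netmaker.conf # keys/peers come from the conf file
ip link set up dev netmaker
```

This is why the conf file can omit Address= entirely: wg(8) itself never touches addresses, only keys and peers.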
f
Also we noticed that pinging with node to node fixes this issue. Issue still persists if we ping from ext client to node.
b
Do all of the nodes have a static public ip or are some behind NAT?
m
The 3 central nodes which all of the rest connect to have public static IPs, the rest are behind NAT
b
Any correlation between connectivity interruptions and dhcp renewals?
m
Nope; and the pauses happen consistently at same time with different nodes on different locations and IP's
f
And different ISP
b
All nodes pause at same time?
m
Seems that way. We have currently compared only 2 at a time, and there is a really small difference due to different scrape intervals (at most 5 sec)
b
Ok that is interesting. Not sure what to make of it
j
It would be interesting to see with a larger set if they all have the pause within the same time span, or if it's asynchronous
unfortunately I dont think we'll be able to meet today, I will PM to plan a meeting
m
Yeah, and the most interesting part: we used to have the same setup running on plain Wireguard before, and it worked like a charm. Sooooooo... 😄
f
Ok, we will try to get you a graph from prometheus/thanos of all nodes that share the same scrape configuration. We currently have two configs there, for lab testing nodes and field testing nodes. These are IoT devices based on the RPi.
In lab we have also the Rocky Linux machine with intel based hardware
But yeah at least now we can leave ping running there in systemd or something 😄
b
if you are not pinging, how do you detect a connectivity interruption?
f
From data, Prometheus cannot access the device and will get timeout
b
how often does Prometheus collect data?
f
Every 5s
b
thanks, I will do some experiments
j
FYI what version of server/client are you running?
f
We have updated all to 0.14.5
b
and clients are running the hotfix for 0.14.5
m
latest provided by your repo, so supposedly yes
b
thanks
how long does it take prometheus to gather the data? Is it possible that prometheus times out because a previous instance is still collecting data?
m
One scrape takes 50-250ms and scrape interval is 5s so nope
b
hmmmm.... I duplicated your setup (except running on a vps rather than a pi) and ran a task that ssh'd into a node every 5 secs... Ran for an hour+ with no loss of connectivity, except when changing parameters on a node or gateway, which would result in the interface going up/down (as expected)
m
Okay, a long time later: it seems that we finally pinpointed the issue, which was actually a combination of jitter, the prometheus scrape interval, and other variables. Long story short, all issues vanish when we use the default keepalive of 20 sec (or any keepalive) and scrape at 10 sec intervals. This was likely due to prometheus timing out connections with extra jitter, from 4G or otherwise less reliable links. Thanks for the help during the debug process, and sorry for the trouble we caused 😄
@flat-alarm-21130 can post more detailed analysis later
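For anyone who lands here with the same symptoms, the fix described above is plain Prometheus scrape tuning plus the WireGuard keepalive. A sketch of the relevant scrape settings (the job name and the exact timeout value are assumptions, not from this thread):

```yaml
# Hypothetical prometheus.yml fragment matching the resolution above
scrape_configs:
  - job_name: "iot-nodes"    # placeholder job name
    scrape_interval: 10s     # was 5s; 10s rides out tunnel jitter on 4G links
    scrape_timeout: 8s       # must stay <= scrape_interval
```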
b
Glad you figured it out (and that it wasn't due to netmaker) 😄