j
also, can you please post the output of the following?
Copy code
cat /etc/netclient/netclient.yml | grep firewallinuse
iptables --version
nft --version
iptables-legacy --version
iptables-legacy-save --version
iptables-nft-save --version
s
Copy code
firewallinuse: nftables
iptables v1.8.7 (legacy)
nftables v1.0.2 (Lester Gooch)
iptables v1.8.7 (legacy)
iptables-save v1.8.7 (legacy)
iptables-nft-save v1.8.7 (nf_tables)
Part of how this server is configured is to run update-alternatives for iptables to use
/usr/sbin/iptables-legacy
I believe that was due to issues with netclient and docker
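(For reference: on Debian-based systems that switch is made with update-alternatives; a minimal sketch, assuming the legacy binaries are present at those paths:)
Copy code
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy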
j
^ you ran this on the ingress machine, correct?
s
Yeah
Any system that has netclient does that
I'm creating a new instance without that change for troubleshooting
On a fresh instance with only netclient installed:
Copy code
firewallinuse: iptables
iptables v1.8.7 (nf_tables)
nftables v1.0.2 (Lester Gooch)
iptables v1.8.7 (legacy)
iptables-save v1.8.7 (legacy)
iptables-nft-save v1.8.7 (nf_tables)
j
hmmm this output is different from the above
s
Yeah
j
can you try making this one the ingress?
s
I'm creating a new Ingress on this node to test as well
j
if it works, that will confirm my suspicions
s
hmm, weird, I can't scan the QR code with the new UI
j
it's a bit of a bug and is finicky, but you should be able to get it
s
I just downloaded it and used
qrencode
haha
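(For anyone following along: qrencode can render a downloaded WireGuard config as a QR code in the terminal; a minimal sketch, where the filename is an assumption:)
Copy code
# reads the config from stdin and prints a scannable QR code
qrencode -t ansiutf8 < client.conf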
OK using the Ingress on this test node appears to be working with no further changes
so that iptables-legacy change must be causing it
j
perfect, this explains it
All of our test boxes match the configuration of your current machine, so we missed this. I think our nftables commands are not functioning correctly.
s
Yeah, I forgot about that change completely - sorry about that. Thanks for the help troubleshooting this. I'll look into why that change was even needed to begin with - I think it had some conflict with docker or something
j
it also should probably be using iptables, even if it's iptables-legacy. Our firewall-chooser is missing that and is picking nftables instead (see the sketch below)
no need to be sorry on your part, we should be thanking you. Pinpointing this issue on our own would have taken an eternity.
we may need some additional details to help with this. Knowing how you set up that environment would help, so we can replicate it
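(A minimal sketch of the kind of check a firewall chooser could make, keying off the mode in the iptables version banner - hypothetical, not netclient's actual logic:)
Copy code
# "(nf_tables)" vs "(legacy)" in the banner shows which backend the
# iptables binary on PATH is driving; legacy should still mean iptables
if iptables --version | grep -q 'nf_tables'; then
    echo "firewallinuse: nftables"
else
    echo "firewallinuse: iptables"
fi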
s
Within the
ansible
folder is how the systems are configured
netclient
is preloaded into systems so they're pre-connected into the network (join on first-boot)
j
also, to confirm, on the non-functioning system, does
iptables --version
output anything?
s
On the non-functioning one, yes:
iptables v1.8.7 (legacy)
OK so hold on - I just ran the standard setup on it (update the system, install docker, etc.) and now the issue is back again.
j
does this update change the output of the above commands?
Copy code
cat /etc/netclient/netclient.yml | grep firewallinuse
iptables --version
nft --version
iptables-legacy --version
iptables-legacy-save --version
iptables-nft-save --version
s
Copy code
firewallinuse: iptables
iptables v1.8.7 (nf_tables)
nftables v1.0.2 (Lester Gooch)
iptables v1.8.7 (legacy)
iptables-save v1.8.7 (legacy)
iptables-nft-save v1.8.7 (nf_tables)
j
what commands exactly do you run for the "standard setup"?
s
It's all the stuff in that
ansible
folder in the repo I linked. It's a whole bunch of commands... mainly setting up mounts, users, hosts, packages, docker, etc.
I'll try to narrow down which of the steps is causing it
j
yeah, that would be very helpful... my initial suspicion is docker; perhaps try uninstalling it
s
That's what I'm trying right now; I think you're right. It does mess with iptables after installation
I think it's docker. I uninstalled it and rebooted to flush iptables - there's a clear difference and it's working again.
ok, so this is also interesting. Installing it again doesn't immediately cause the issue, because the
netmakerfilter
is at the bottom of the list I think? https://gist.github.com/IAreKyleW00t/01cdd2bbc7ae42f6a8b128fd44abb4e5
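(Rule order in FORWARD matters here; the positions are visible with line numbers:)
Copy code
sudo iptables -L FORWARD -n --line-numbers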
j
in your ansible script, do you install netclient first, or docker?
s
netclient is installed on the OS before anything else, so I can connect to it and configure it via ansible
b
so if you compare before and after docker, the default policy on FORWARD is set to DROP...
s
docker will redo iptables each time the service starts though
which is probably why after a reboot the issue comes up again
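(Docker does set the FORWARD policy to DROP when it manages iptables; a quick one-liner to check:)
Copy code
sudo iptables -L FORWARD -n | head -n 1
# prints e.g. "Chain FORWARD (policy DROP)" once docker has started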
b
i think this is the issue
s
oh didnt even notice that!
yeah the default being DROP may be it
b
yup
s
Yeah, if I manually make it ACCEPT then it works
I'm checking if the issue comes back after a reboot again. It didn't when I manually restarted the docker service, which would have messed with iptables
Yup, it went back to DROP after a reboot. Just to make double sure, I'm gonna remove docker again
j
yeah, it's a part of how docker does things, apparently: https://docs.docker.com/network/iptables/#docker-on-a-router
s
Thanks, I'm looking into their suggested fix right now
iptables -I DOCKER-USER -j ACCEPT
got things working!
Here's a more targeted solution:
sudo iptables -I DOCKER-USER -i netmaker -j ACCEPT
Alternative is to add a rule in the FORWARD chain:
iptables -I FORWARD -i netmaker -j ACCEPT
So perhaps if a node is marked as an Ingress, always ensure this forward rule exists? I'm not sure if there are other things to consider with that
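(If an Ingress node were to always ensure that rule, an idempotent form - a sketch using iptables' -C check flag - avoids inserting duplicates on every run:)
Copy code
# -C returns non-zero if the rule is absent; only then insert it
sudo iptables -C DOCKER-USER -i netmaker -j ACCEPT 2>/dev/null \
  || sudo iptables -I DOCKER-USER -i netmaker -j ACCEPT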
b
@stale-judge-54185 when you add ext clients to the ingress node do you see any rules under
Chain netmakerfilter
s
I have 1 client on the test Ingress and there are no rules under that chain
Copy code
Chain netmakerfilter (1 references)
 pkts bytes target     prot opt in     out     source               destination
b
hmm, this could be the actual issue. In the FORWARD chain we jump to the netmakerfilter chain if the packet is from the netmaker interface, and there should be an ACCEPT rule for the ext client under the
netmakerfilter chain
it would be helpful if you could paste the client logs
s
I tried creating a new Ext Client and there are still no rules
Sure, how can I get the logs? Is there a log file?
or would that be from the server?
b
journalctl -fu netclient
run this on ingress client
s
Copy code
ubuntu@test-1:~$ journalctl -fu netclient
May 19 03:20:21 test-1 netclient[930]: [GIN-debug] POST   /leave/:net               --> github.com/gravitl/netclient/functions.leave (3 handlers)
May 19 03:20:21 test-1 netclient[930]: [GIN-debug] GET    /servers                  --> github.com/gravitl/netclient/functions.servers (3 handlers)
May 19 03:20:21 test-1 netclient[930]: [GIN-debug] POST   /uninstall                --> github.com/gravitl/netclient/functions.uninstall (3 handlers)
May 19 03:20:21 test-1 netclient[930]: [GIN-debug] GET    /pull/:net                --> github.com/gravitl/netclient/functions.pull (3 handlers)
May 19 03:20:21 test-1 netclient[930]: [GIN-debug] POST   /nodepeers                --> github.com/gravitl/netclient/functions.nodePeers (3 handlers)
May 19 03:20:21 test-1 netclient[930]: [netclient] 2023-05-19 03:20:21 mqtt connect handler
May 19 03:20:21 test-1 netclient[930]: [netclient] 2023-05-19 03:20:21 processing node update for network k2net
May 19 03:20:21 test-1 netclient[930]: [netclient] 2023-05-19 03:20:21 network: k2net received message to update node 3a1964a8-023e-4f33-bd59-a9446d196827
May 19 03:20:21 test-1 netclient[930]: [netclient] 2023-05-19 03:20:21 published host turn register signal to server: net.kyle.systems
May 19 03:20:21 test-1 netclient[930]: [netclient] 2023-05-19 03:20:21 adding addresses to netmaker interface
b
can you increase the verbosity for this host on the UI to
4
?
then add a new extclient and share the logs
b
is this from the ingress node? I don't see any firewall-related logs at all
s
Yes it is
This is what happens each time an Ext Client is added on that node: https://gist.github.com/IAreKyleW00t/8e23245807895d4ca9e6daf78de15f8a
That's all I see
b
could you run a pull on this client
netclient pull
and try this again
b
on running pull, your daemon should restart
s
It did, yeah. Sorry, I can include those logs too - I thought you just wanted what showed up when adding the client
b
journalctl -u netclient
I think it will give all the logs
s
Yeah, one sec, I'll give you a better collection
This is starting from a
netclient pull
and then immediately adding an Ext Client, https://gist.github.com/IAreKyleW00t/17cb9c82f479992d1e370ebb4edc1da0
b
failed to create proxy, check if stun list is configured correctly on your server
can you check if
STUN_LIST
is set on your server env?
s
Copy code
ubuntu@netmaker-1:/mnt/docker/netmaker$ cat docker-compose.yml | grep STUN_LIST
      - STUN_LIST=stun.${NM_DOMAIN}:${STUN_PORT},stun1.netmaker.io:3478,stun2.netmaker.io:3478,stun1.l.google.com:19302,stun2.l.google.com:19302
Copy code
ubuntu@netmaker-1:/mnt/docker/netmaker$ sudo docker exec -it netmaker sh
~ # echo $STUN_LIST
stun.net.kyle.systems:3478,stun1.netmaker.io:3478,stun2.netmaker.io:3478,stun1.l.google.com:19302,stun2.l.google.com:19302
b
can you send output of
cat /etc/netclient/servers.yml
from your ingress node
b
can you remove this entry from your stun list on the server
stun.net.kyle.systems:3478,
then:
1. on the server, run docker-compose down && docker-compose up -d
2. on the client, once the server is up and ready, run netclient pull
s
so just:
stun1.netmaker.io:3478,stun2.netmaker.io:3478,stun1.l.google.com:19302,stun2.l.google.com:19302
in the list?
b
yes
s
Does the STUN port need to be publicly accessible?
b
yes
s
OK, that was not the case before - I can fix that going forward
I wasn't aware of that
I have restarted it without that entry
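(Exposing STUN means allowing UDP 3478 inbound; a sketch of the host-firewall side, assuming ufw - a cloud security-group rule would be the equivalent here:)
Copy code
sudo ufw allow 3478/udp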
b
should be fine for now since we removed your stun domain from the list
s
Yeah it's exposed to the network, but not to the world
as in, I don't have a FW rule in place for it (security group)
b
oh alright
did you run pull on the client, once the server is up?
s
doing that now
I've made sure that STUN port is exposed now btw
b
okay there seems to be an issue with this domain currently
stun2.netmaker.io
can you just keep these two in the stun list
stun1.l.google.com:19302,stun2.l.google.com:19302
...
and repeat the above steps again, sorry for the trouble
s
sure thing, no worries!
b
did you run pull on client?
cat /etc/netclient/servers.yml
s
yeah I did
are there any other ports that need to be publicly accessible?
b
for some reason your UDP hole punch is failing; that's the reason you are seeing all these issues
s
is it related to connectivity between nodes, or with the server?
b
this is related to the node; the UDP hole punch happens on the node
s
Copy code
ubuntu@test-1:~$ nc -v -u -z -w 3 stun.net.kyle.systems 3478
Connection to stun.net.kyle.systems (3.135.131.84) 3478 port [udp/*] succeeded!
still seeing that failure in the logs though
I'll need to be hopping off for the night, but please let me now if there's anything else you'd like me to test. and thanks for the help!
t
@stale-judge-54185 what's the output of 1.
docker-compose config | grep STUN_LIST
2.
docker-compose -v
thx
s
@tall-room-55783 - here you go; I put the STUN_LIST back to normal after the troubleshooting efforts last night. the
docker compose -v
is just normal output, but I ran
docker compose version
too for you
Copy code
ubuntu@netmaker-1:/mnt/docker/netmaker$ sudo docker compose config | grep STUN_LIST
      STUN_LIST: stun.net.kyle.systems:3478,stun1.netmaker.io:3478,stun2.netmaker.io:3478,stun1.l.google.com:19302,stun2.l.google.com:19302
ubuntu@netmaker-1:/mnt/docker/netmaker$ sudo docker compose -v

Usage:  docker compose [OPTIONS] COMMAND

Docker Compose

...
ubuntu@netmaker-1:/mnt/docker/netmaker$ docker compose version
Docker Compose version v2.17.3
And docker version info too, just in case
Copy code
ubuntu@netmaker-1:/mnt/docker/netmaker$ sudo docker version
Client: Docker Engine - Community
 Version:           24.0.0
 API version:       1.43
 Go version:        go1.20.4
 Git commit:        98fdcd7
 Built:             Mon May 15 18:49:22 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.0
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       1331b8c
  Built:            Mon May 15 18:49:22 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.21
  GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
 runc:
  Version:          1.1.7
  GitCommit:        v1.1.7-0-g860f061
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
t
thanks, glad it works for you
s
No prob, let me know if you need any more information collected
OK, so I wanted to try something again and spun up a new test server (with no modifications other than installing and configuring netclient), and it looks like STUN is working now? https://gist.github.com/IAreKyleW00t/8f186a5be87ae909e0c779f1474b89c7
it did fail when trying to set up iptables for IPv6 addresses though
this is set with my own STUN server too
I'm checking if docker is causing some other issue
I don't think it's docker, but something with my standard system setup is causing something weird to happen. The only thing I can see is after ansible runs, STUN fails and the "Static Endpoint" for the client in the UI keeps being set to blank. Before, it kept the value I put in it (which wasn't any different from what it auto-detected anyways, but it's acting different)
j
@bored-solstice-58967 found some serious issues with the netclient that can cause this issue under certain circumstances. We have a ticket created and it is high priority to fix. Would love it if you could be a guinea pig once a branch is up.
s
Absolutely, just let me know when you're ready for me to test things out! I think I've narrowed it down to something with
/etc/resolv.conf
- after those changes, and a reboot, the STUN proxy fails. Trying one more thing to confirm that is actually what is causing it
(That file is being set to use my DNS servers, systemd-resolved is disabled, then it's being write-locked using
chattr
)
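(The write-lock in question is chattr's immutable flag; +i sets it, -i clears it again for edits:)
Copy code
sudo chattr +i /etc/resolv.conf   # lock the file against writes
sudo chattr -i /etc/resolv.conf   # unlock before editing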
Both those DNS servers would be IPs over the nm network, one of which is a "normal" node, another is a Pi behind the Egress on my home network that does not have the netclient installed. Hitting both of those on the Ingress node doesn't appear to have any issues, but that's after netclient has started. I'm not sure if there's some race condition with it attempting to resolve DNS using those IPs before it's actually connected
Yeah, so if I add a public DNS server into resolv.conf, then reboot the system (just changing it and re-pulling with netclient didn't seem to work; maybe I need to restart the service), then the STUN connection worked again
Adding an Ext Client works, once STUN is connected properly. However, the ip6tables rules don't seem to be working
Copy code
ubuntu@test-1:~$ sudo ip6tables -L -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain netmakerfilter (0 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DROP       all      any    any     anywhere             anywhere
    0     0 RETURN     all      any    any     anywhere             anywhere
Copy code
May 19 16:17:33 test-1 netclient[682]: [netclient] 2023-05-19 16:17:33 [iptables_linux.go-355] InsertIngressRoutingRules(): failed to add rule: [-s 100.100.100.252/32 -d fde7:76ae:f7c1:10::/64 -j ACCEPT], Err: running [/usr/sbin/iptables -t filter -I netmakerfilter 1 -s 100.100.100.252/32 -d fde7:76ae:f7c1:10::/64>
May 19 16:17:33 test-1 netclient[682]: Try `iptables -h' or 'iptables --help' for more information.
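(The failing rule pairs an IPv4 source with an IPv6 destination; neither iptables nor ip6tables accepts that mix, so rules have to be split per address family - a sketch of that dispatch, hypothetical, not netclient's actual code:)
Copy code
# choose the binary by the destination's address family
dst="fde7:76ae:f7c1:10::/64"
case "$dst" in
  *:*) ipt=ip6tables ;;   # IPv6 CIDR
  *)   ipt=iptables  ;;   # IPv4 CIDR
esac
sudo "$ipt" -t filter -I netmakerfilter 1 -d "$dst" -j ACCEPT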
I'm testing this now with docker installed, to see if that other FORWARD fix is needed anymore if the STUN connection works and it adds those iptables entries.
Seems to be working fine with docker installed and no other changes
@User / @User - So it looks like the root cause is: the Ingress node cannot resolve DNS for the STUN server during netclient startup. This results in the STUN proxy setup failing, which causes the UDP hole punching to fail (along with its related iptables rules, which allow the traffic through).
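(One way to sidestep the startup-ordering problem on the client side - a sketch, assuming systemd and that the unit is named netclient as in the journalctl commands above - is to make the service wait until the network, and thus DNS, is up:)
Copy code
# drop-in created via: sudo systemctl edit netclient
[Unit]
Wants=network-online.target
After=network-online.target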
j
yes, this was our analysis as well. Thank you for confirming
There is no reason for STUN or firewall management to be tied to the proxy; it is legacy code. So we need to remove that, and it should be fine.
s
Awesome, glad we were on the same track! Quite the interesting rabbit hole to go down for this haha
j
yup...kinda crazy that this ended up being the issue. Seems so irrelevant. I was certain it was our nftables rules...and then I was certain it was docker
but this was some fantastically done root cause analysis
should print and laminate this thread