error broker port is blank - upgrading from 0.14.1...
# client
a
error broker port is blank - upgrading from 0.14.1 to 0.14.2
wondering if i missed a step on the upgrade? I updated my netmaker & ui docker images to v0.14.2, and all my 0.14.1 clients were still happy
so then i upgraded one netclient to 0.14.2 that one could not check in... giving an error
netclient[59819]: [netclient] 2022-06-02 14:25:12 error publishing ping, mq setup error error: broker port is blank
j
it should fall back and retrieve the port if you give it a few minutes
b
you may have to wait for a couple of minutes for the client to do a pull
a
i noticed that the
/etc/netclient/config/netconfig-main
had this block
server:
    corednsaddr: ""
    apihost: ""
    apiport: ""
    clientmode: ""
    dnsmode: ""
    version: v0.14.2
    mqport: ""
    server: broker.netmaker.MYDOMAIN
b
or you could do a manual pull
a
hmm... ok, i'll try another node to verify that ... i worked around it on this node by manually setting the port
hmmm
# netclient pull -v
[netclient] 2022-06-02 19:41:46 No network selected. Running Pull for all networks.
[netclient] 2022-06-02 19:41:46 Error pulling network config for network:  family
 Post "https:///api/nodes/adm/family/authenticate": http: no Host in request URL
[netclient] 2022-06-02 19:41:46 Error pulling network config for network:  main
 Post "https:///api/nodes/adm/main/authenticate": http: no Host in request URL
[netclient] 2022-06-02 19:41:46 register at https:///api/server/register
[netclient] 2022-06-02 19:41:47 restarting netclient.service
[netclient] 2022-06-02 19:41:48 reset network and peer configs
and my systemd logs for the netclient unit:
Jun 02 19:43:53 tunnel netclient[143179]: [netclient] 2022-06-02 19:43:53 initializing network main
Jun 02 19:43:53 tunnel netclient[143179]: [netclient] 2022-06-02 19:43:53 netclient daemon started for server:  broker.netmaker.MYDOMAIN
Jun 02 19:43:53 tunnel netclient[143179]: 2022/06/02 19:43:53 could not read client cert/key tls: private key does not match public key
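Three distinct error strings have shown up in the logs so far, and each one points at a different cause. A minimal sketch for sorting log lines by those strings follows; the function name `classify` and the hint wording are made up for illustration, and only the matched substrings come from the logs pasted in this thread.

```python
# Hypothetical helper: maps error substrings quoted in this thread to a short hint.
KNOWN_ERRORS = {
    "broker port is blank": "mqport is empty in the server section of the netconfig file",
    "no Host in request URL": "the API host is empty, so a pull cannot reach the server",
    "private key does not match public key": "client cert and key are out of sync",
}


def classify(lines):
    """Return a sorted list of hints for every known error found in the lines."""
    hints = set()
    for line in lines:
        for needle, hint in KNOWN_ERRORS.items():
            if needle in line:
                hints.add(hint)
    return sorted(hints)
```

One way to use it would be to save the output of `journalctl -u netclient --no-pager` to a file and pass the file's lines to `classify`.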
b
interesting that your certs/key got out of sync... brute force way to recover ... on server delete files in /root/certs and restart docker containers
a
btw, i can totally work around this... just trying to highlight the issue and see if there are "best practice" steps for the upgrade and/or the move from port 8883 to 443
j
@bored-island-21407 I wonder if the change I made to retrieving the broker address resets the whole server section of the config. That would explain it
but I dont think I did that...
b
no ...
a
@jolly-london-20127 that sounds promising,... because here's a node that's still on 14.1
server:
    corednsaddr: ""
    accesskey: SOMESTUFF
    server: broker.netmaker.MYDOMAIN
    api: api.netmaker.MYDOMAIN:443
b
when i was testing the upgrade scenario prior to release this am ... i did the same steps .. it took awhile but the node eventually recovered
a
i'm going to downgrade a client and try to reproduce this
j
it may be an issue of doing too much to it before it has a chance to recover automatically
when you attempt to reproduce, please try leaving the client for ~5min to see if it's able to reset its configs automatically
a
will do.... I'm keeping one node in known good state from 14.1 another in my broken state on 14.2 and reverting a broken one to 14.1
the revert was successful, though i had to manually put back the
server: api:
field as it was before... then a
netclient pull
was good and that node is communicating fine with broker again on 8883
b
what os are your nodes running?
a
so to recap the steps here... My docker-compose was on 0.14.1 and thus did not have an MQ_PORT set...
1) upgrade docker-compose netmaker/ui to 0.14.2 (do NOT set up mqtt over traefik, just using bare 8883 port, still no MQ_PORT)
2) (linux node) systemctl stop netclient
3) (linux node) wget https://github.com/gravitl/netmaker/releases/download/v0.14.2/netclient to /sbin/netclient, chmod 755 /sbin/netclient
4) (linux node) systemctl restart netclient; journalctl -f -u netclient
now i'm watching
most are linux (ubuntu 22.04 servers and one fedora desktop) ... plus one windows machine, but i've only been troubleshooting on linux because its easier for me
b
🥂 you did see my avatar, right
a
yes 🙂 i was ashamed to tell you about the windows box
ok, i don't think this will recover
pasting logs and config to show why
b
did you add an MQ_PORT in the env for netmaker
a
no
b
but it should default to 8883 if it isn't set
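The fallback described here (use the configured value, otherwise 8883) can be sketched as below. This is only an illustration of the behavior as stated in the thread, not Netmaker's server code; the function name `mq_port` and the treatment of a whitespace-only value as unset are assumptions.

```python
import os

DEFAULT_MQ_PORT = "8883"  # the default named in this thread


def mq_port(env=None):
    """Return MQ_PORT from the environment, or the default if unset or blank."""
    env = os.environ if env is None else env
    value = env.get("MQ_PORT", "").strip()
    return value or DEFAULT_MQ_PORT
```

Passing a plain dict instead of `os.environ` makes it easy to try both cases, for example `mq_port({})` versus `mq_port({"MQ_PORT": "443"})`.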
a
that's what i figured... i'm effectively trying to do what was in the announcement: "If you'd like to keep your existing Caddy proxy, you can just update the images to 0.14.2 and run as-is (with port 8883)."
Jun 02 15:10:50 MYNODE systemd[1]: Started netclient.service - Netclient Daemon.
Jun 02 15:10:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:10:50 initializing network family
Jun 02 15:10:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:10:50 started daemon for server  broker.netmaker.MYDOMAIN
Jun 02 15:10:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:10:50 netclient daemon started for server:  broker.netmaker.MYDOMAIN
Jun 02 15:11:20 MYNODE netclient[71487]: [netclient] 2022-06-02 15:11:20 unable to connect to broker, retrying ...
Jun 02 15:11:20 MYNODE netclient[71487]: [netclient] 2022-06-02 15:11:20 unable to connect to broker error: broker port is blank
Jun 02 15:11:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:11:50 local port has changed from  42624  to  41916
Jun 02 15:12:20 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:20 unable to connect to broker, retrying ...
Jun 02 15:12:20 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:20 could not publish local port change
Jun 02 15:12:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:50 unable to connect to broker, retrying ...
Jun 02 15:12:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:50 error publishing ping, mq setup error error: broker port is blank
Jun 02 15:12:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:50 running pull on family to reconnect
Jun 02 15:12:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:50 could not run pull on family, error: Post "https:///api/nodes/adm/family/authenticate": http: no Host in request URL
Jun 02 15:12:50 MYNODE netclient[71487]: [netclient] 2022-06-02 15:12:50 checkin for family complete
i'm not sure where, but sometime in that period... my
netconfig-family
changed from having
server:
    corednsaddr: ""
    accesskey: ""
    server: broker.netmaker.MYDOMAIN
    api: api.netmaker.MYDOMAIN:443
to...
server:
    corednsaddr: ""
    apihost: ""
    apiport: ""
    clientmode: ""
    dnsmode: ""
    version: ""
    mqport: ""
    server: broker.netmaker.MYDOMAIN
which is why there's no api hostname to pull from
so pulls fail
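A quick way to spot a node in this state is to check whether the fields the 0.14.2 client needs are empty in the `server:` block. The sketch below is a hand-rolled line scan, not a real YAML parser, and the function name `blank_server_fields` and the choice of `apihost` and `mqport` as the required keys are assumptions based on the configs pasted in this thread.

```python
def blank_server_fields(text, required=("apihost", "mqport")):
    """List required keys that are missing or empty in the top-level server: block."""
    values = {}
    in_server = False
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            # Any unindented line starts a new top-level block.
            in_server = raw.strip() == "server:"
            continue
        if in_server and ":" in raw:
            # Split on the first colon only, so host:port values stay intact.
            key, _, value = raw.strip().partition(":")
            values[key] = value.strip().strip('"')
    return [key for key in required if not values.get(key)]
```

An empty list means both fields are populated; anything else names the fields that would need to be filled in by hand before a pull can work.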
i'm going to try setting
MQ_PORT: "8883"
in my docker-compose and then re-attempt the upgrade
same results
b
hmmm , well that is a scenario I didn't test ... i did all my testing with traefik rather than caddy but i am a bit baffled as to the root cause of the issue
a
yeah, i don't think it's a traefik/caddy related issue
it seems to be related to the model change of the config file for netclient 14.1 vs 14.2
and in my case ... it's traefik -> traefik, 14.1 -> 14.2 , but not even changing the MQ port
and i can confirm, if i wipe out the /etc/netclient/config/* for my node, and manually delete it from netmaker-ui i can cleanly join a 14.2 netclient
that's why i think it really seems to be a problem in the client config upgrade
b
your compose file has SERVER_API_CONN_STRING?
a
yep
environment:
      SERVER_NAME: "broker.${NM_BASE_DOMAIN}"
      SERVER_HOST: "${NM_PUBLIC_IP}"
      SERVER_API_CONN_STRING: "api.${NM_BASE_DOMAIN}:443"
      COREDNS_ADDR: "${NM_PUBLIC_IP}"
      DNS_MODE: "on"
      SERVER_HTTP_HOST: "api.${NM_BASE_DOMAIN}"
      API_PORT: "8081"
      CLIENT_MODE: "on"
      MASTER_KEY: "${NM_MASTER_KEY}"
      CORS_ALLOWED_ORIGIN: "*"
      DISPLAY_KEYS: "on"
      DATABASE: "sqlite"
      NODE_ID: "netmaker-server-1"
      MQ_HOST: "mq"
      #MQ_PORT: "443"
      HOST_NETWORK: "off"
      VERBOSITY: "1"
      MANAGE_IPTABLES: "on"
      PORT_FORWARD_SERVICES: "dns"
b
I am trying to determine where the api is getting set to blank
a
this is my netclient config on the clean 14.2 install
server:
    corednsaddr: MYIP
    apihost: api.netmaker.MYDOMAIN:443
    apiport: "8081"
    clientmode: ""
    dnsmode: "on"
    version: v0.14.2
    mqport: "8883"
    server: broker.netmaker.MYDOMAIN
ok, i have a workaround at least
b
ok i have a 14.1 client connected to a server running the test build (aka 14.2)
I am going to update the client
a
manually add
    apihost: api.netmaker.MYDOMAIN:443
    apiport: "8081"
to the
server
section of the netconfig-netname config file in /etc/netclient/config/ then
netclient pull -v
and the node started working for me again
(i'm not sure if apiport was needed)
i'll confirm
b
it should not be
a
yeah, i didn't think so
yeah, 0.14.1 expected
server: api: HOSTNAME_OF_NETMAKER:443
and 0.14.2 expects
server: apihost: HOSTNAME_OF_NETMAKER:443
so that's why that bit is broken...
maybe
i'm guessing
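If that guess is right, a config that is still in the 0.14.1 layout (for example a backup taken before the upgrade) only needs the `api:` key under `server:` renamed to `apihost:`. A minimal sketch of that rename follows; it is not netclient's own recovery logic, the function name `migrate_server_block` is hypothetical, and it works on raw text lines rather than parsed YAML.

```python
def migrate_server_block(text):
    """Rename a 0.14.1-style `api:` key under `server:` to `apihost:`."""
    out = []
    in_server = False
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped and not raw[0].isspace():
            # Any unindented line starts a new top-level block.
            in_server = stripped == "server:"
        elif in_server and stripped.startswith("api:"):
            # Keep the original indentation and value, swap only the key.
            indent = raw[: len(raw) - len(raw.lstrip())]
            raw = indent + "apihost:" + stripped[len("api:"):]
        out.append(raw)
    return "\n".join(out)
```

Running it twice gives the same result as running it once, since `apihost:` and `apiport:` do not start with `api:`. It cannot help a config where the value has already been blanked; in that case the hostname has to be typed back in by hand, as in the workaround above.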
b
yes I remember that being changed but i thought we had a recover in place
a
ok, well, good luck... i gotta get back to my actual job 😉 if i can help test something specific, let me know
b
will do ... thanks a bunch for helping with this ( and for the traefik stuff)
a
my pleasure
f
I tried adding apihost: api.netmaker.MYDOMAIN:443 to the netconfig- file, and the pull on the client worked, but the status still shows error...saw in the docker logs mq "sslv3 alert bad certificate" so I tried wiping the certificates in /root/certs/ on the server and restarted...but now I get no new certs at all in there? 🤔
b
Restart the netmaker container
f
I did
b
Netmaker will gen certs on startup if they are missing
f
Ok, but they're not in the /root/certs folder :/
Hmm, I saved my old docker-compose.yml before changing to the docker-compose.traefik.yml. In the old the mq volumes look like
volumes:
      - /root/mosquitto.conf:/mosquitto/config/mosquitto.conf
      - /root/certs/:/mosquitto/certs/
      - mosquitto_data:/mosquitto/data
      - mosquitto_logs:/mosquitto/log
in the new traefik based one:
volumes:
      - /root/mosquitto.conf:/mosquitto/config/mosquitto.conf
      - mosquitto_data:/mosquitto/data
      - mosquitto_logs:/mosquitto/log
      - shared_certs:/mosquitto/certs
is that relevant?
b
In that case you need to delete them from the shared certs docker volume
f
Thanks, looks like that solved it.. But why was the volume for certs changed? 😄
b
It was a community submitted PR
f
Ah. the certs are maybe less prone to be accidentally deleted that way.
Time to sleep, will continue to fiddle with this tomorrow 😄
a
the docker-compose yaml files provided are really intended to be a guide, not a production solution... specifically related to the volume definitions...
volumes:
  traefik_certs: {}
  shared_certs: {}
  sqldata: {}
  dnsconfig: {}
  mosquitto_data: {}
  mosquitto_logs: {}
this isn't a recommended way to actually do volumes in docker... it works, but it's really more like a placeholder.
at least, that's my opinion 😉
with that default config, the volumes are assigned to some location as specified by the docker daemon configuration... which, if your system is linux with stock configs, usually means it's buried somewhere under
/var/lib/docker
but it's not exactly obvious where your data was stored
i really like this method for standalone servers like the small virtual machine where i run netmaker... https://docs.docker.com/storage/bind-mounts/#use-a-bind-mount-with-compose
you can see how i've used that in my personal repo (which has not yet been updated to 14.2) https://github.com/bsherman/netmaker-traefik/blob/main/docker-compose.yml#L140
anyway, i also should apologize... I was responsible for the change of
/root/certs
to
shared_certs:
in my contribution of the
docker-compose.traefik.yml
i kept the simple default volumes to avoid complications for folks upgrading from caddy to traefik, but didn't think about the complication with respect to the changed one.
j
@few-airline-95046 @average-helicopter-96869 we think we've narrowed down the issue. Did you upgrade the clients before upgrading the server?
f
One node might have been updated before, but not the second one that i tried later yesterday night
j
can you share your docker-compose (before and after)? would help with recreating the issue
f
I can do it in a couple of hours probably
@jolly-london-20127, I see that you've released new binaries, so you don't need the docker-compose files any more? 😄
Hmm, updated to the latest netclient binary, but still seeing
root@Cradle:/boot/config/netclient# netclient pull --vvv --daemon off
[netclient] 2022-06-03 23:08:54 No network selected. Running Pull for all networks.
[netclient] 2022-06-03 23:08:54 Error pulling network config for network:  xxx
 Post "https:///api/nodes/adm/xxx/authenticate": http: no Host in request URL
[netclient] 2022-06-03 23:08:54 register at https:///api/server/register
[netclient] 2022-06-03 23:08:55 restarting netclient.service
[netclient] 2022-06-03 23:08:56 error running command: systemctl restart netclient.service
[netclient] 2022-06-03 23:08:56
[netclient] 2022-06-03 23:08:56 reset network and peer configs
And then when trying to run netclient in daemon mode (no systemd on that machine) I get
root@Cradle:/boot/config/netclient# netclient daemon
[netclient] 2022-06-03 23:11:55 initializing network xxx
[netclient] 2022-06-03 23:11:55 started daemon for server  broker.netmaker.xxx.se
[netclient] 2022-06-03 23:11:55 netclient daemon started for server:  broker.netmaker.xxx.se
2022/06/03 23:11:55 could not read client cert/key tls: private key does not match public key
j
If you already had that issue, updating will not solve it. Once the api address is missing you need to add it manually
f
Alright, but I have already modified /etc/netclient/config/netconfig-mydomain to have apihost set to my proper api url
a
in my case, I'd updated the server to 0.14.2 first, let things settle... and all my clients were working on 0.14.1.... then experienced the problem when updating a client to 0.14.2
my docker compose was literally: https://github.com/bsherman/netmaker-traefik/blob/main/docker-compose.yml and then upgraded by changing 0.14.1 to .2
also, apologies for my delay, i was in airports all day yesterday... travelling so not very accessible for a few days.
j
no worries, we put a hotfix in the release which should solve this issue