FS#66451 - [openvswitch] with netctl results in systemd timing gamble if it works or not
Attached to Project:
Community Packages
Opened by Oliver Dzombic (layer7gmbh) - Tuesday, 28 April 2020, 16:08 GMT
Last edited by Buggy McBugFace (bugbot) - Saturday, 25 November 2023, 20:01 GMT
Details
Description:
Using netctl in combination with openvswitch, following the ArchWiki how-tos.

Lucky shot (the stop of the service is not clean, but at least the start is):

-- Reboot --
Apr 28 17:24:01 systemd[1]: Starting ovsbr...
Apr 28 17:24:01 network[756]: Starting network profile 'ovsbr'...
Apr 28 17:24:02 ovs-vsctl[807]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port ovsbr enp3s0
Apr 28 17:24:04 systemd[1]: Started ovsbr.
Apr 28 17:24:04 network[756]: Started network profile 'ovsbr'
Apr 28 17:53:35 systemd[1]: Stopping ovsbr...
Apr 28 17:53:35 network[39544]: Stopping network profile 'ovsbr'...
Apr 28 17:53:35 ovs-vsctl[39591]: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-br ovsbr
Apr 28 17:53:35 ovs-vsctl[39591]: ovs|00002|db_ctl_base|ERR|unix:/run/openvswitch/db.sock: database connection failed (No such file or directory)
Apr 28 17:53:35 network[39591]: ovs-vsctl: unix:/run/openvswitch/db.sock: database connection failed (No such file or directory)
Apr 28 17:53:35 network[39544]: Failed to bring the network down for profile 'ovsbr'
Apr 28 17:53:35 systemd[1]: netctl@ovsbr.service: Control process exited, code=exited, status=1/FAILURE
Apr 28 17:53:35 systemd[1]: netctl@ovsbr.service: Failed with result 'exit-code'.
Apr 28 17:53:35 systemd[1]: Stopped ovsbr.

Now the machine comes up after the reboot, with a not so lucky shot:

-- Reboot --
Apr 28 17:56:17 systemd[1]: Starting ovsbr...
Apr 28 17:56:17 network[720]: Starting network profile 'ovsbr'...
Apr 28 17:56:18 ovs-vsctl[734]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port ovsbr enp3s0
Apr 28 17:56:18 network[720]: /usr/lib/netctl/interface: line 46: /sys/class/net/ovsbr/flags: No such file or directory
Apr 28 17:56:23 network[841]: Error: Nexthop has invalid gateway.
Apr 28 17:56:23 network[720]: Could not set gateway '192.168.178.1' on interface 'ovsbr'
Apr 28 17:56:23 network[720]: Failed to bring the network up for profile 'ovsbr'
Apr 28 17:56:23 systemd[1]: netctl@ovsbr.service: Main process exited, code=exited, status=1/FAILURE
Apr 28 17:56:23 systemd[1]: netctl@ovsbr.service: Failed with result 'exit-code'.
Apr 28 17:56:23 systemd[1]: Failed to start ovsbr.

The netctl config:

Description="ovsbr"
Interface=ovsbr
Connection=openvswitch
BindsToInterfaces=(enp3s0)
IP=static
Address=('192.168.178.10/24')
Gateway='192.168.178.1'
DNS=('1.1.1.1 1.0.0.1')
WaitOnline=yes

Versions:

netctl version 1.21
ovs-vsctl (Open vSwitch) 2.13.0
DB Schema 8.2.0

To me it seems that openvswitch sometimes manages to start up fast enough before netctl takes action, and sometimes not. In the failing case netctl manages to configure the IP address, but it does not manage to configure the gateway.
Closed by Buggy McBugFace (bugbot)
Saturday, 25 November 2023, 20:01 GMT
Reason for closing: Moved
Additional comments about closing: https://gitlab.archlinux.org/archlinux/packaging/packages/openvswitch/issues/1
https://git.archlinux.org/netctl.git/tree/src/lib/connections/openvswitch
I contacted the author, Jonathan Hudson (stronnag), but he replied that he has moved away from using openvswitch and is unable to help.
Maybe something in the proper use of openvswitch has changed. If so, yourself and Sergej Pupykin (sergej) are the most likely people to spot what should change. Here is my take on your log, but note that I know nothing about openvswitch.
-- First start --
Seems fine. I note that the connection script does not attempt to start any database server. Should it?
-- Stop --
The problem seems to be simply that openvswitch fails to do its thing because of some missing database. The weird thing is that this error did not show up on start.
-- Second start --
This one is weird. It looks like at the point the connection script tries to bring the openvswitch interface up (line 25), the interface does not (fully) exist yet. It is a mystery to me why ovs-vsctl apparently had no trouble adding a port to the openvswitch interface right before this failure.
Thank you for your time!
In my humble opinion, there is simply a systemd timing problem.
At the first start everything runs fine.
At the stop:
1. systemd tries to stop ovsbr (the netctl interface)
2. network (netctl) gets the command to actually do it
3. netctl realizes it is an openvswitch connection and starts using ovs-vsctl (the CLI for ovs)
4. This fails because ovs-vsctl cannot establish the database connection
The reason is that a system reboot took place here. (Restarting the network does not work at all, so I always have to reboot if I want to test whether something works.)
So ovs was actually taken down by systemd before netctl/network/systemd could do its job, and everything that followed failed.
At the reboot (second start):
1. systemd again gives the order to start ovsbr
2. network (netctl) gets the order to do it
3. this time the ovs database was already reachable by the ovs-vsctl command (ovs had already booted), and the command to add the physical port enp3s0 to the virtual interface ovsbr was issued successfully... unfortunately...
4. network continued its procedure before ovs had actually added the physical port enp3s0 to the virtual interface ovsbr, resulting in
5. an error finding ovsbr, and again everything that followed failed because of this
No matter what I do in systemd to define which services run before and after, and which service needs or wants which other service, there is no way to prevent this situation where ovs is still adding the physical port to the virtual interface and netctl proceeds too early with its start routine.
So out of 10 reboots you get maybe 1-2 lucky shots where ovs is fast enough to bring up the virtual port before netctl continues.
At start:
- Is ovs-vsctl supposed to work without a database-server running? If not, is it supposed to start one if none is found? In other words: is another process required before netctl openvswitch connections can be used?
- Is ovs-vsctl br-exists supposed to succeed if the interface sysfs tree is not present?
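The second question could be checked empirically by comparing what the OVS database reports with what the kernel exposes at the moment of failure. The following is only a sketch: `check_bridge` is a hypothetical helper, and the `OVS_VSCTL` and `SYSFS_NET` overrides are assumptions added so that it can be tried on a machine without Open vSwitch. It relies on `ovs-vsctl br-exists` exiting with status 0 when the bridge is present in the database.

```shell
#!/bin/sh
# Hypothetical diagnostic helper, not part of netctl or Open vSwitch.
# Reports whether a bridge is known to the OVS database and whether the
# kernel already exposes it under sysfs.
check_bridge() {
    bridge=$1
    vsctl=${OVS_VSCTL:-ovs-vsctl}        # override is an assumption, for testing only
    sysfs=${SYSFS_NET:-/sys/class/net}   # override is an assumption, for testing only

    # br-exists exits 0 if the bridge is in the database, non-zero otherwise
    if "$vsctl" br-exists "$bridge"; then in_db=yes; else in_db=no; fi

    # The kernel side: does the network device exist yet?
    if [ -e "$sysfs/$bridge" ]; then in_kernel=yes; else in_kernel=no; fi

    echo "db=$in_db kernel=$in_kernel"
}
```

Calling `check_bridge ovsbr` right after the failing `add-port` would print `db=yes kernel=no` if the database is ahead of the kernel device, which is the situation the log seems to suggest.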
At stop:
I don't understand what you mean by
> A systemreboot took place here. ( restarting the network does not work at all, so i have to reboot always if i want to test, if something works there)
> So ovs was actually put down by systemd before netctl/network/systemd could do its job.
Are you saying that another ovs service is running and that it is stopped before the netctl profile? The modus operandi of netctl is that no other services should be running. If a database server is needed for netctl openvswitch connections, then netctl should be in charge of starting and stopping this server.
There is a service, vswitchd, which has to start before netctl can do anything (via ovs-vsctl). ovs-vsctl is nothing but a CLI for the ovs system (which has its own internal database).
Normally it should go:
vswitchd service starts
netctl starts
which is the case.
The problem is:
netctl tells ovs-vsctl to create the ovsbr interface and assign the physical port to it, but netctl does not wait until ovs-vsctl has done its job. That's why we see the "No such file or directory" error for /sys/class/net/ovsbr/flags.
---
netctl is stupid and very simple. It uses ovs-vsctl to create the virtual interface and connect it to the physical port according to its configuration.
It does not care whether the Open vSwitch system is actually ready, nor does it check whether ovs-vsctl actually did what netctl told it to do.
systemd has to make sure that openvswitch is ready, and that is actually the case.
netctl just does not wait until /sys/class/net/ovsbr/flags has actually been created by ovs-vsctl. And that's the root problem, in my opinion.
So most probably, if someone would like to fix this, we should take a look at what the code in /usr/lib/netctl/interface actually does.
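If the missing wait really is the root problem, a fix would amount to polling for the sysfs entry after ovs-vsctl returns and before the interface is brought up. Below is a minimal sketch of such a wait. It is not the actual netctl code: the function name `wait_for_iface`, the default timeout of 5 seconds, and the `SYSFS_NET` override are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: wait until the kernel exposes <iface>/flags in sysfs.
# Usage: wait_for_iface <iface> [timeout_seconds]
# Returns 0 as soon as the file exists, non-zero if the timeout expires.
wait_for_iface() {
    iface=$1
    timeout=${2:-5}                      # default of 5 seconds is an assumption
    sysfs=${SYSFS_NET:-/sys/class/net}   # override is an assumption, for testing only
    tries=$((timeout * 10))              # poll every 0.1 s

    while [ "$tries" -gt 0 ]; do
        [ -e "$sysfs/$iface/flags" ] && return 0
        sleep 0.1                        # fractional sleep: GNU coreutils and busybox accept it
        tries=$((tries - 1))
    done

    # Final check after the last sleep; its status is the return value
    [ -e "$sysfs/$iface/flags" ]
}
```

In the failing sequence this would sit between the `ovs-vsctl --may-exist add-port ovsbr enp3s0` call and the step that reads /sys/class/net/ovsbr/flags, for example `wait_for_iface ovsbr 5 || exit 1`.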
=====
And as to the shutdown (the stop): here it seems to me that the order in which ovs is shut down and the interface is brought down by netctl is not properly timed.
Anyway, my (dirty) fix was to switch netctl from openvswitch handling to regular ethernet handling (Connection=ethernet) and to add IPCustom=('link set enp3s0 up'), which will run ip link set $physical_interface up.
Open vSwitch already has the ovsbr interface in its database, together with the physical port (independent of netctl), as this kind of configuration is persistent in the ovs db.
This way things seem to work reliably. But it is of course just a dirty workaround for the non-working netctl ovs implementation.
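For reference, the workaround profile would look roughly like the following. This is a reconstruction from the description above and the values of the original profile, not a copy of the file actually in use. It assumes that the bridge and its port already exist in the OVS database and that vswitchd is started by its own service. Whether BindsToInterfaces and WaitOnline were kept is not stated, so they are left out here.

```shell
# /etc/netctl/ovsbr (sketch of the workaround, reconstructed, not the real file)
Description="ovsbr"
Interface=ovsbr
# Treat the already existing OVS bridge as a plain ethernet device
Connection=ethernet
IP=static
Address=('192.168.178.10/24')
Gateway='192.168.178.1'
DNS=('1.1.1.1 1.0.0.1')
# Each entry is passed as arguments to ip(8); this one brings the physical port up
IPCustom=('link set enp3s0 up')
```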
1: netctl should take control of starting/stopping vswitchd (which should in that case not be started via its own service file).
2a: a netctl-generated service file for an openvswitch connection should contain "After=ovs-vswitchd".
2b: The openvswitch service should contain something like "Before=netctl@" or "Before=network-pre.target".
Only option 2b is easy to implement and only the second variant of it stands a chance of being upstreamed (although I see that the service files are provided by Arch, and not by upstream). The problem with option 1 is that it could get difficult when you want to run multiple openvswitch profiles (although maybe there can be multiple database-server instances?).
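Option 2a can be tried locally without patching netctl, by adding a systemd drop-in for the generated unit. The sketch below writes such a drop-in. The helper name `write_ordering_dropin` and the file name `ovs-order.conf` are made up for illustration, and `ovs-vswitchd.service` is assumed to be the unit name shipped by the Arch package. Since systemd stops units in the reverse of their start order, the same `After=` line should also make the profile stop before vswitchd on shutdown. It would not help if ovs-vsctl returns before the interface exists.

```shell
#!/bin/sh
# Hypothetical helper: write an ordering drop-in for netctl@ovsbr.service.
# Usage: write_ordering_dropin [unit_dir]   (defaults to /etc/systemd/system)
write_ordering_dropin() {
    unit_dir=${1:-/etc/systemd/system}
    dropin_dir="$unit_dir/netctl@ovsbr.service.d"

    mkdir -p "$dropin_dir" || return 1

    # Order the netctl profile after vswitchd (unit name is an assumption)
    cat > "$dropin_dir/ovs-order.conf" <<'EOF'
[Unit]
After=ovs-vswitchd.service
EOF
}
```

To apply it, run `write_ordering_dropin` as root and then `systemctl daemon-reload`.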
Additionally, if ovs-vsctl returns before it finishes creating an interface/port, I think that is a bug in ovs-vsctl.