My Home Assistant had been misbehaving lately and I could not figure out why. It took time, effort, a lot of cursing, troubleshooting and head scratching, and in the end it was solved by the typical "this can NOT be the problem, but there is nothing else to try!" approach.
My setup
I set up Home Assistant on an Intel NUC Kit NUC5PPYH. The system is quite powerful: it has 4 GB of RAM and a fast 256 GB SSD. On the system I have installed openSUSE with VirtualBox, and a virtual machine on it runs a HassOS OVA.
I bought a zig-a-zig-ah! stick to set up a Zigbee network with several smart devices. This stick, besides being super easy to program, is one of the recommended Zigbee coordinators according to zigbee2mqtt, my choice for managing the Zigbee network. The USB stick is automatically attached to the VM through a USB filter, and the VM autostarts when the system boots. Overall, with this setup everything should work flawlessly.
Ensuring the availability
The NUC is set up to automatically power on after a power failure, because it will be installed in a house with potential power drops and interruptions. A WireGuard client add-on provides connectivity to my own WireGuard network, so that I can be sure my Home Assistant will be operational and reachable as long as there is power (and network connectivity) where it is installed.
What I did not know...
To be on the safe side, I bought a couple of supported Zigbee devices, such as the IKEA TRADFRI motion sensor, which should pair without an issue. The firmware on the zig-a-zig-ah! is the one written by Koenkk himself. What could ever go wrong?
Well, pairing did not work. Although zigbee2mqtt identified the USB stick as the coordinator, and the coordinator could scan the network, I could not reliably pair any devices to it. Strange, as I chose this particular coordinator because it has an external antenna and supports tens of devices - a deliberate choice so that I don't run out of capacity.
Despite all my attempts, pairing was as unstable as ever. Sometimes pairing would work. Un-pairing a device would work without an issue, always (which is one of the most annoying parts!). But pairing would either completely fail (nothing in the logs), or the interview would fail (partial pairing). Or pairing would happen but the device would be unstable with messages such as "Failed to execute LQI for 'IKEA Motion Sensor'" while other sensors would get proper LQI.
I tried more Zigbee devices, from different manufacturers. To no avail. I tried to confirm that the coordinator is recognized by zigbee2mqtt, by removing it. Not plugging in the coordinator makes zigbee2mqtt unable to start; removing the device while in operation is properly registered in the logs. All as expected.
Furthermore, I learned about Zigbee and WiFi interference, which led me to reconfigure the Zigbee network and do some WiFi scanning around the house. Again, no difference.
A lot of head scratching and frustration followed, until I reached a magical link where the poster said that all they had to do was plug the coordinator into a different USB port. Could this really be the case?
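For anyone doing the same interference check, here is a minimal Python sketch of the arithmetic behind it. It is not something from my setup, just the standard 2.4 GHz channel plan: Zigbee channels 11 to 26 are centred at 2405 MHz plus 5 MHz per channel, and WiFi channels 1 to 13 at 2407 MHz plus 5 MHz per channel. The function names and the channel widths (about 2 MHz for Zigbee, about 22 MHz for WiFi) are my assumptions for illustration, and real-world interference also depends on signal strength and distance.

```python
"""Rough 2.4 GHz overlap check between Zigbee and WiFi channels."""

# Assumed half-widths in MHz: a Zigbee channel is about 2 MHz wide,
# a classic WiFi channel about 22 MHz wide.
ZIGBEE_HALF_WIDTH_MHZ = 1
WIFI_HALF_WIDTH_MHZ = 11


def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Zigbee channel (11 to 26)."""
    if not 11 <= channel <= 26:
        raise ValueError(f"Zigbee channel out of range: {channel}")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz WiFi channel (1 to 13)."""
    if not 1 <= channel <= 13:
        raise ValueError(f"WiFi channel out of range: {channel}")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int) -> bool:
    """True if the two channels share spectrum under the assumed widths."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < ZIGBEE_HALF_WIDTH_MHZ + WIFI_HALF_WIDTH_MHZ


def clear_zigbee_channels(wifi_channels: list[int]) -> list[int]:
    """Zigbee channels that overlap none of the given WiFi channels."""
    return [
        z for z in range(11, 27)
        if not any(overlaps(z, w) for w in wifi_channels)
    ]


if __name__ == "__main__":
    # WiFi channels seen during a scan (example values, not my actual scan).
    print(clear_zigbee_channels([1, 6, 11]))
```

With WiFi on channels 1, 6 and 11, this leaves Zigbee channels 15, 20, 25 and 26 as the ones that do not overlap.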
According to Intel's IO technical specifications the ports are the same - they are all USB 3.0. Nevertheless, one of them is yellow while the others are blue. I did not know the difference between these, but some googling informed me that the yellow ports are "always on". The system BIOS mentions "USB ports that are colored yellow support a Portable Device Charging Mode with higher maximum current", a setting that I have set to OFF. This shouldn't make any difference: we are talking about a system that should be always on. In addition, the blue USB ports obviously have power, since the coordinator communicates with the software. Yet, moving the coordinator to the yellow port made all the difference in the world.
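If you want to double-check which port a stick is actually on, and what the host thinks of it, a small Python sketch like the following can list the USB devices the Linux host sees. It reads the standard sysfs attributes `product`, `speed` and `bMaxPower` under `/sys/bus/usb/devices`; the function names and the output layout are my own, and this will not explain why one port behaves differently from another. It only shows where each device is attached.

```python
"""List USB devices as seen by a Linux host, using sysfs."""
from pathlib import Path


def read_attr(device_dir: Path, name: str) -> str:
    """Read one sysfs attribute, returning "" if it is missing or unreadable."""
    try:
        return (device_dir / name).read_text().strip()
    except OSError:
        return ""


def list_usb_devices(sysfs_root: str = "/sys/bus/usb/devices") -> list[dict]:
    """Return one dict per USB device (interfaces are skipped)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    devices = []
    for entry in sorted(root.iterdir()):
        # Entries such as "1-2:1.0" are interfaces, not devices.
        if ":" in entry.name:
            continue
        devices.append({
            "port": entry.name,                       # bus-port path, e.g. "1-2"
            "product": read_attr(entry, "product"),   # product string, if any
            "speed_mbps": read_attr(entry, "speed"),  # negotiated speed in Mbit/s
            "max_power": read_attr(entry, "bMaxPower"),  # requested current
        })
    return devices


if __name__ == "__main__":
    for dev in list_usb_devices():
        print(f'{dev["port"]:10} {dev["speed_mbps"]:>6} Mbit/s '
              f'{dev["max_power"]:>6}  {dev["product"]}')
```

Running it before and after moving the stick shows which bus-port path changes, which makes it easier to note down which physical port is which.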
More problems
I still don't know the root cause, but here are some interesting facts that I had not paid attention to before.
Keyboard
When I installed the system, I noticed the USB wireless keyboard did not work as expected. There were delays in sending keystrokes; some of them would be sent twice or three times, some of them would never appear. I tried to troubleshoot the Linux keyboard driver for a while, but I quickly gave up as I couldn't find anything that would make sense. At the time I thought it was a problem with the keyboard's batteries, and as soon as I had the O/S installed, I enabled SSH and didn't have to worry about using a keyboard on the system any more. I do not remember which ports I used for the USB keyboard, but I do remember trying different ports.
Bluetooth
The NUC has a Bluetooth controller. At some point I "lost" it. I could not see it in the O/S and, interestingly, I could not see it in the BIOS either. I even asked Intel Support, but never got an answer. The Bluetooth controller re-appeared as soon as I moved the zig-a-zig-ah! stick to a different USB port (for totally unrelated reasons - probably space considerations). I do not remember which port the zig-a-zig-ah! was connected to at the time, but I can confirm that with the stick on the yellow USB port, the internal on-board Bluetooth controller is operational. Good, because I will need it to connect more things to my Home Assistant, some of them being Bluetooth.
Conclusion
Since I moved the USB stick to the yellow USB port, I have had no problems. Pairing of new devices works flawlessly on the first attempt. Communicating with the devices - and not only the coordinator - works as expected. I can see packets registered in the logs, and every single movement is registered correctly. The internal Bluetooth device is recognized and operational.
The likely conclusion is that there is something wrong with the blue USB ports on the NUC. Or it is something with the power of the ports - although this would not explain the keyboard and Bluetooth problems. It is really a pity. I bought this system because I wanted something more powerful and extensible than a Raspberry Pi. "Losing" 3 USB ports and being left with only one limits me, but I need to accept that the specific model or unit is probably problematic. And judging by the fact that there are other posts pointing to a USB port change making a difference, I would assume it is not just my unit that is problematic.