fix: service control buttons and AP-mode SSH lockout post-install (#326)

* fix: service control buttons and AP-mode SSH lockout post-install

Two user-reported issues after fresh install:

1. All service buttons (Start/Stop/Restart Display, Restart Web Service)
   failed silently — only Reboot worked.

   Root cause: sudoers rules use `ledmatrix.service` (with suffix) but
   api_v3.py called `sudo systemctl start ledmatrix` (no suffix). sudo
   does exact string matching, so every service action was rejected with
   returncode=1. Also missing from sudoers: ledmatrix-web, journalctl,
   and is-active entries.

   Fix:
   - Add `.service` suffix to all 8 sudo systemctl call sites in
     api_v3.py (_ensure_display_service_running, _stop_display_service,
     and all execute_system_action branches).
   - Add timeout=15 to all subprocess.run calls in execute_system_action
     (previously could hang indefinitely).
   - Add missing sudoers rules to first_time_install.sh and
     configure_web_sudo.sh: ledmatrix-web.service start/stop/restart,
     is-active for both name forms, and journalctl -u/-t ledmatrix rules.
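
The suffix mismatch described above can be illustrated with a toy model. This is a simplified sketch, not sudo's real matcher: the `ALLOWED` set and `is_allowed` helper are hypothetical, and real sudoers rules also support wildcards, which are ignored here.

```python
# Simplified model of sudoers matching: a rule without wildcards allows a
# command only when the full command string is identical to the rule.
# ALLOWED and is_allowed are hypothetical names used for illustration.
ALLOWED = {
    "/usr/bin/systemctl start ledmatrix.service",
    "/usr/bin/systemctl stop ledmatrix.service",
    "/usr/bin/systemctl restart ledmatrix.service",
}


def is_allowed(command: str) -> bool:
    """Return True only when the command matches a rule exactly."""
    return command in ALLOWED


# The pre-fix call site omitted the suffix, so no rule matched.
print(is_allowed("/usr/bin/systemctl start ledmatrix"))          # False
# The post-fix call site uses the same unit name as the rule.
print(is_allowed("/usr/bin/systemctl start ledmatrix.service"))  # True
```

Under this model the fix can go either way (change the caller or add a second rule); the commit changes the callers and adds rules only where both name forms are still in use.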

2. SSH and web UI became inaccessible after ~1 hour even though the
   display kept running.

   Root cause: wifi_monitor_daemon restarts NetworkManager after 5
   consecutive internet failures (~2.5 min). Each NM restart drops WiFi
   briefly. During that window check_and_manage_ap_mode() increments
   _disconnected_checks but the daemon never reset it after the restart.
   After 3 such NM-restart cycles, _disconnected_checks reached 3 and
   AP mode activated — changing the Pi from WiFi client to hotspot
   (192.168.4.1) and killing SSH on the old IP.

   Fix:
   - Reset wifi_manager._disconnected_checks = 0 in the daemon
     immediately after a successful NM restart so the brief drop it
     causes doesn't count toward AP-mode activation.
   - Increase _disconnected_checks_required from 3 to 6 (90s → 3min)
     as an additional buffer against transient network flaps.
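
The accumulation described in the root cause can be sketched as a small simulation. It assumes, as the root cause implies, that each NM restart contributes exactly one "disconnected" check and that nothing else clears the counter between restarts; `cycles_until_ap_mode` is a hypothetical helper, not code from wifi_manager.py.

```python
from typing import Optional


def cycles_until_ap_mode(reset_after_restart: bool,
                         cycles: int = 10,
                         required: int = 3) -> Optional[int]:
    """Return the NM-restart cycle on which AP mode would activate, or None.

    Toy model: every restart causes one check to observe "disconnected".
    """
    disconnected_checks = 0
    for cycle in range(1, cycles + 1):
        # The restart briefly drops WiFi, so one check sees a disconnect.
        disconnected_checks += 1
        if reset_after_restart:
            # The fix: discard the drop that the restart itself caused.
            disconnected_checks = 0
        if disconnected_checks >= required:
            return cycle
    return None


print(cycles_until_ap_mode(reset_after_restart=False))  # 3
print(cycles_until_ap_mode(reset_after_restart=True))   # None
```

With the reset in place the counter never grows from restart-induced drops, so in this model the threshold value does not affect the lockout scenario.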

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* revert: restore AP-mode grace period to 90s (3 checks)

The counter reset after NM restart already fully prevents the SSH-lockout
cascade: _disconnected_checks can never accumulate across NM restarts
because it is reset to 0 before the next daemon iteration runs.

The 3→6 increase provided no additional fix for the described problem and
caused a UX regression: fresh Pi devices with no WiFi configured would
wait 3 minutes instead of 90 seconds for the LEDMatrix-Setup hotspot to
appear.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address six valid review findings; skip two

Fixed:
- march-madness/requirements.txt: Pillow>=10.3.0 (patches CVE-2024-28219;
  10.3.0 is the actual fix version — reviewer cited 12.2.0 but that risks
  breaking API changes without test coverage)
- wifi_monitor_daemon.py: add missing `import subprocess`; subprocess.run
  and CalledProcessError would raise NameError at runtime on the NM restart path
- wifi_manager.py: validate ap_idle_timeout_minutes before arithmetic —
  coerce to int, clamp 1–1440, fall back to 15 on bad config values
- wifi_manager.py: call _remove_nm_dnsmasq_captive_conf() on all three
  rollback paths in _enable_ap_mode_nmcli_hotspot() and in the top-level
  except block so stale dnsmasq drop-ins are never left behind
- api_v3.py: fix wrong_password prefix strip — removeprefix("wrong_password:")
  then lstrip() handles both "wrong_password: msg" and "wrong_password:msg"
- plugins_manager.js: add .catch() to loadInstalledPlugins().then() to
  surface failures instead of silently dropping unhandled rejections
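
The ap_idle_timeout_minutes validation listed above follows a coerce, clamp, fall back pattern. A minimal standalone sketch, with `sanitize_idle_timeout` as a hypothetical helper name (the actual change is inline in wifi_manager.py):

```python
def sanitize_idle_timeout(raw, default: int = 15,
                          low: int = 1, high: int = 1440) -> int:
    """Coerce a config value to int minutes, clamped to [low, high].

    Falls back to `default` when the value cannot be converted.
    """
    try:
        return max(low, min(high, int(raw)))
    except (TypeError, ValueError):
        # None, lists, and non-numeric strings land here.
        return default


print(sanitize_idle_timeout("30"))    # 30
print(sanitize_idle_timeout(0))       # 1
print(sanitize_idle_timeout(99999))   # 1440
print(sanitize_idle_timeout(None))    # 15
print(sanitize_idle_timeout("soon"))  # 15
```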

Skipped:
- WiFiManager AP state persistence: architectural overhaul; _is_ap_mode_active()
  already derives from live system state, not in-memory variables
- Absolute subprocess paths in api_v3.py: paths vary by distro (/usr/bin vs
  /bin); web service has a normal PATH; sudoers already use resolved paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address five review findings (NM retry loop, start_display message, code quality)

- wifi_monitor_daemon: reset _consecutive_internet_failures = 0 in both
  NM-restart exception handlers; previously both left the counter at threshold,
  causing an immediate retry on the next iteration instead of waiting another
  full backoff period

- api_v3: fix start_display failure message — when mode is set and systemctl
  returns non-zero, message now includes the failure reason and a hint rather
  than always reporting success phrasing

- wifi_manager: move _redirect_backend from class variable to instance variable
  in __init__ alongside _ap_enabled_at; the class-level default behaved correctly
  in practice (instance assignments shadow it, and there is a single instance)
  but was misleading

- wifi_manager: narrow broad except Exception in _check_internet_connectivity
  to (subprocess.SubprocessError, OSError) for ping and OSError for HTTP
  (urllib.error.URLError is an OSError subclass in Python 3)

- wifi_manager: remove redundant local 'import re as _re' in _validate_ap_config;
  re is already imported at module level (line 37)
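
The exception narrowing above relies on the standard-library class hierarchy. A quick check, using a hypothetical `would_be_caught` helper, shows which exception types the narrowed tuple still covers:

```python
import subprocess
import urllib.error

# The tuple used for the ping path after narrowing.
NARROWED = (subprocess.SubprocessError, OSError)


def would_be_caught(exc_type: type) -> bool:
    """Return True if `except NARROWED` would handle this exception type."""
    return issubclass(exc_type, NARROWED)


print(would_be_caught(subprocess.TimeoutExpired))  # True
print(would_be_caught(FileNotFoundError))          # True
print(would_be_caught(urllib.error.URLError))      # True
print(would_be_caught(NameError))                  # False
```

Programming errors such as NameError now propagate instead of being swallowed, which is the stated goal of the narrowing.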

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address five review findings (Pillow CVEs, daemon exception narrowing, timeout handling, plugin store)

- march-madness/requirements.txt: Pillow>=12.2.0 (patches CVE-2026-42308
  and CVE-2026-42310; previous floor of 10.3.0 was insufficient)

- wifi_monitor_daemon: narrow final except Exception to
  (subprocess.SubprocessError, OSError) so programming errors in the NM
  restart block are no longer silently swallowed

- api_v3/execute_system_action: add explicit subprocess.TimeoutExpired
  handler before the generic Exception catch; returns action-specific
  message with 'status','message','returncode','stdout','stderr' fields
  so the UI receives a precise, actionable payload instead of the generic
  'Failed to execute system action' string

- plugins_manager.js: move searchPluginStore into .finally() so the
  plugin store renders regardless of whether loadInstalledPlugins succeeds
  or fails; .catch() still logs the error

- first_time_install.sh: add safe_plugin_rm.sh NOPASSWD rule to the
  /tmp/ledmatrix_web_sudoers block; configure_web_sudo.sh had this rule
  but the standalone installer never granted it, leaving plugin removal
  broken after first-time install
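
The TimeoutExpired handling described above can be sketched outside Flask. `run_action` and the success message text are hypothetical; the error fields mirror the ones named in the commit message, and the demo uses the current Python interpreter as a stand-in for systemctl.

```python
import subprocess
import sys


def run_action(cmd, action: str, timeout: float) -> dict:
    """Run a command and return a payload with the five documented fields."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        # Handled before any generic Exception handler so the caller gets
        # an action-specific message instead of a generic failure string.
        return {"status": "error", "message": f"Action {action} timed out",
                "returncode": -1, "stdout": "", "stderr": "timeout"}
    return {"status": "success" if result.returncode == 0 else "error",
            "message": f"Action {action} finished",  # hypothetical wording
            "returncode": result.returncode,
            "stdout": result.stdout, "stderr": result.stderr}


# A child that sleeps longer than the timeout triggers the timeout branch.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_action(slow, "demo_sleep", timeout=0.5)["message"])
```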

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(api): resolve sudo/systemctl/reboot/poweroff paths at startup

Use shutil.which() with safe fallbacks for the four privileged binaries
instead of relying on bare command names being looked up on PATH each time
a subprocess is spawned. Paths are resolved once at module load rather than
per call.
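
A minimal sketch of the resolve-once pattern, assuming a hypothetical `resolve_bin` helper (the commit inlines the `shutil.which(...) or fallback` expression for each binary):

```python
import shutil


def resolve_bin(name: str, fallback: str) -> str:
    """Return the absolute path found on PATH, or the fallback if not found."""
    # shutil.which returns None when the name is not on PATH.
    return shutil.which(name) or fallback


# Resolved once at import time and reused by every call site.
SUDO_BIN = resolve_bin("sudo", "/usr/bin/sudo")
SYSTEMCTL_BIN = resolve_bin("systemctl", "/usr/bin/systemctl")

print(SUDO_BIN, SYSTEMCTL_BIN)
```

The fallback keeps module import from failing on systems where PATH is minimal; whether the fallback path actually exists is only discovered when the command runs.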

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Chuck <chuck@example.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Chuck
2026-05-12 17:58:51 -04:00
committed by GitHub
parent b361866679
commit 1a0f1c8015
7 changed files with 80 additions and 40 deletions

@@ -1086,6 +1086,7 @@ SYSTEMCTL_PATH=$(which systemctl)
REBOOT_PATH=$(which reboot)
POWEROFF_PATH=$(which poweroff)
BASH_PATH=$(which bash)
JOURNALCTL_PATH=$(which journalctl 2>/dev/null || true)
# Create sudoers content
cat > /tmp/ledmatrix_web_sudoers << EOF
@@ -1101,10 +1102,23 @@ $ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH restart ledmatrix.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH enable ledmatrix.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH disable ledmatrix.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH status ledmatrix.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH is-active ledmatrix
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH is-active ledmatrix.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH start ledmatrix-web.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH stop ledmatrix-web.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH restart ledmatrix-web.service
$ACTUAL_USER ALL=(ALL) NOPASSWD: $PYTHON_PATH $PROJECT_ROOT_DIR/display_controller.py
$ACTUAL_USER ALL=(ALL) NOPASSWD: $BASH_PATH $PROJECT_ROOT_DIR/start_display.sh
$ACTUAL_USER ALL=(ALL) NOPASSWD: $BASH_PATH $PROJECT_ROOT_DIR/stop_display.sh
$ACTUAL_USER ALL=(ALL) NOPASSWD: $BASH_PATH $PROJECT_ROOT_DIR/scripts/fix_perms/safe_plugin_rm.sh *
EOF
if [ -n "$JOURNALCTL_PATH" ]; then
cat >> /tmp/ledmatrix_web_sudoers << EOF
$ACTUAL_USER ALL=(ALL) NOPASSWD: $JOURNALCTL_PATH -u ledmatrix.service *
$ACTUAL_USER ALL=(ALL) NOPASSWD: $JOURNALCTL_PATH -u ledmatrix *
$ACTUAL_USER ALL=(ALL) NOPASSWD: $JOURNALCTL_PATH -t ledmatrix *
EOF
fi
if [ -f "$SUDOERS_FILE" ] && cmp -s /tmp/ledmatrix_web_sudoers "$SUDOERS_FILE"; then
echo "Sudoers configuration already up to date"

@@ -1,4 +1,4 @@
requests>=2.28.0
Pillow>=9.1.0
Pillow>=12.2.0
pytz>=2022.1
numpy>=1.24.0

@@ -89,9 +89,9 @@ TEMP_SUDOERS="/tmp/ledmatrix_web_sudoers_$$"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH status ledmatrix.service"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH is-active ledmatrix"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH is-active ledmatrix.service"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH start ledmatrix-web"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH stop ledmatrix-web"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH restart ledmatrix-web"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH start ledmatrix-web.service"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH stop ledmatrix-web.service"
echo "$WEB_USER ALL=(ALL) NOPASSWD: $SYSTEMCTL_PATH restart ledmatrix-web.service"
# Optional: journalctl (non-critical — skip if not found)
if [ -n "$JOURNALCTL_PATH" ]; then

@@ -10,6 +10,7 @@ import sys
import time
import logging
import signal
import subprocess
from pathlib import Path
# Add project root to path (parent of scripts/utils/)
@@ -146,12 +147,18 @@ class WiFiMonitorDaemon:
capture_output=True, timeout=20, check=True
)
self._consecutive_internet_failures = 0
# NM restart causes a brief WiFi drop; reset the AP-mode grace
# counter so that transient disconnect doesn't count toward
# triggering AP mode.
self.wifi_manager._disconnected_checks = 0
except subprocess.CalledProcessError as e:
logger.error(f"NetworkManager restart failed (rc={e.returncode}); "
"keeping failure counter unchanged")
except Exception as e:
"resetting failure counter to avoid tight retry loop")
self._consecutive_internet_failures = 0
except (subprocess.SubprocessError, OSError) as e:
logger.error(f"NetworkManager restart error: {e}; "
"keeping failure counter unchanged")
"resetting failure counter to avoid tight retry loop")
self._consecutive_internet_failures = 0
else:
self._consecutive_internet_failures = 0
else:

@@ -144,6 +144,8 @@ class WiFiManager:
# Timestamp set when AP mode is enabled; used for the idle-timeout check
self._ap_enabled_at: Optional[float] = None
# Which redirect backend was used (iptables/nftables/None); set per-instance
self._redirect_backend: Optional[str] = None
logger.info(f"WiFi Manager initialized - nmcli: {self.has_nmcli}, iwlist: {self.has_iwlist}, "
f"hostapd: {self.has_hostapd}, dnsmasq: {self.has_dnsmasq}, "
@@ -691,9 +693,8 @@ class WiFiManager:
def _validate_ap_config(self) -> Tuple[str, int]:
"""Return a sanitized (ssid, channel) pair from config, falling back to defaults."""
import re as _re
ssid = str(self.config.get("ap_ssid", DEFAULT_AP_SSID))
if not ssid or len(ssid) > 32 or not _re.match(r'^[\x20-\x7E]+$', ssid):
if not ssid or len(ssid) > 32 or not re.match(r'^[\x20-\x7E]+$', ssid):
logger.warning(f"AP SSID '{ssid}' is invalid, falling back to default")
ssid = DEFAULT_AP_SSID
try:
@@ -705,10 +706,6 @@ class WiFiManager:
channel = DEFAULT_AP_CHANNEL
return ssid, channel
# Tracks which redirect backend was used so teardown uses the same one.
# Value is "iptables", "nftables", or None (not set up).
_redirect_backend: Optional[str] = None
def _setup_iptables_redirect(self) -> bool:
"""
Add port 80 → 5000 redirect rules for the captive portal.
@@ -936,14 +933,14 @@ class WiFiManager:
if r.returncode == 0:
logger.debug("Internet connectivity confirmed via ping 8.8.8.8")
return True
except Exception:
except (subprocess.SubprocessError, OSError):
pass
try:
import urllib.request as _ureq
_ureq.urlopen("http://connectivity-check.ubuntu.com/", timeout=timeout)
logger.debug("Internet connectivity confirmed via HTTP check")
return True
except Exception:
except OSError:
pass
logger.debug("Internet connectivity check failed (both ping and HTTP)")
return False
@@ -2074,6 +2071,7 @@ class WiFiManager:
if up_result.returncode != 0:
error_msg = up_result.stderr.strip() or up_result.stdout.strip()
logger.error(f"Failed to bring up AP connection: {error_msg}")
self._remove_nm_dnsmasq_captive_conf()
subprocess.run(["nmcli", "connection", "delete", "LEDMatrix-Setup-AP"],
capture_output=True, timeout=10)
self._show_led_message("AP mode failed", duration=5)
@@ -2085,6 +2083,7 @@ class WiFiManager:
# need to add the iptables port-redirect rules for the captive portal.
if not self._setup_iptables_redirect():
logger.error("Captive-portal redirect setup failed; rolling back AP profile")
self._remove_nm_dnsmasq_captive_conf()
subprocess.run(["nmcli", "connection", "down", "LEDMatrix-Setup-AP"],
capture_output=True, timeout=10)
subprocess.run(["nmcli", "connection", "delete", "LEDMatrix-Setup-AP"],
@@ -2102,6 +2101,7 @@ class WiFiManager:
else:
logger.error("AP mode started but not verified by status check — rolling back")
self._teardown_iptables_redirect()
self._remove_nm_dnsmasq_captive_conf()
subprocess.run(["nmcli", "connection", "down", "LEDMatrix-Setup-AP"],
capture_output=True, timeout=10)
subprocess.run(["nmcli", "connection", "delete", "LEDMatrix-Setup-AP"],
@@ -2111,6 +2111,7 @@ class WiFiManager:
except Exception as e:
logger.error(f"Error starting AP mode with nmcli: {e}")
self._remove_nm_dnsmasq_captive_conf()
self._show_led_message("Setup mode error", duration=5)
return False, str(e)
@@ -2498,7 +2499,10 @@ address=/detectportal.firefox.com/192.168.4.1
# Idle-timeout check: disable AP if no client has connected within the window.
# Only applies when AP is active and we haven't just decided to enable/disable it.
if ap_active and self._ap_enabled_at is not None:
idle_timeout_min = self.config.get("ap_idle_timeout_minutes", 15)
try:
idle_timeout_min = max(1, min(1440, int(self.config.get("ap_idle_timeout_minutes", 15))))
except (TypeError, ValueError):
idle_timeout_min = 15
elapsed = time.time() - self._ap_enabled_at
if elapsed > idle_timeout_min * 60 and not self._has_ap_clients():
logger.info(

@@ -2,6 +2,7 @@ from flask import Blueprint, request, jsonify, Response, send_from_directory
import json
import os
import re
import shutil
import socket
import sys
import subprocess
@@ -16,6 +17,11 @@ from typing import Optional, Tuple, Dict, Any, Type
logger = logging.getLogger(__name__)
SUDO_BIN = shutil.which("sudo") or "/usr/bin/sudo"
SYSTEMCTL_BIN = shutil.which("systemctl") or "/usr/bin/systemctl"
REBOOT_BIN = shutil.which("reboot") or "/usr/sbin/reboot"
POWEROFF_BIN = shutil.which("poweroff") or "/usr/sbin/poweroff"
# Import new infrastructure
from src.web_interface.api_helpers import success_response, error_response, validate_request_json
from src.web_interface.errors import ErrorCode
@@ -218,7 +224,7 @@ def _ensure_display_service_running():
if status.get('active'):
status['started'] = False
return status
result = _run_systemctl_command(['sudo', 'systemctl', 'start', 'ledmatrix'])
result = _run_systemctl_command([SUDO_BIN, SYSTEMCTL_BIN, 'start', 'ledmatrix.service'])
service_status = _get_display_service_status()
result['started'] = result.get('returncode') == 0
result['active'] = service_status.get('active')
@@ -227,7 +233,7 @@ def _ensure_display_service_running():
def _stop_display_service():
"""Stop the ledmatrix display service."""
result = _run_systemctl_command(['sudo', 'systemctl', 'stop', 'ledmatrix'])
result = _run_systemctl_command([SUDO_BIN, SYSTEMCTL_BIN, 'stop', 'ledmatrix.service'])
status = _get_display_service_status()
result['active'] = status.get('active')
result['status'] = status
@@ -1716,33 +1722,34 @@ def execute_system_action():
if mode:
# For on-demand modes, we would need to integrate with the display controller
# For now, just start the display service
result = subprocess.run(['sudo', 'systemctl', 'start', 'ledmatrix'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'start', 'ledmatrix.service'],
capture_output=True, text=True, timeout=15)
return jsonify({
'status': 'success' if result.returncode == 0 else 'error',
'message': f'Started display in {mode} mode',
'message': f'Started display in {mode} mode' if result.returncode == 0
else f'Failed to start display in {mode} mode: {result.stderr.strip() or "check sudo systemctl status ledmatrix.service"}',
'returncode': result.returncode,
'stdout': result.stdout,
'stderr': result.stderr
})
else:
result = subprocess.run(['sudo', 'systemctl', 'start', 'ledmatrix'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'start', 'ledmatrix.service'],
capture_output=True, text=True, timeout=15)
elif action == 'stop_display':
result = subprocess.run(['sudo', 'systemctl', 'stop', 'ledmatrix'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'stop', 'ledmatrix.service'],
capture_output=True, text=True, timeout=15)
elif action == 'enable_autostart':
result = subprocess.run(['sudo', 'systemctl', 'enable', 'ledmatrix'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'enable', 'ledmatrix.service'],
capture_output=True, text=True, timeout=15)
elif action == 'disable_autostart':
result = subprocess.run(['sudo', 'systemctl', 'disable', 'ledmatrix'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'disable', 'ledmatrix.service'],
capture_output=True, text=True, timeout=15)
elif action == 'reboot_system':
result = subprocess.run(['sudo', 'reboot'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, REBOOT_BIN],
capture_output=True, text=True, timeout=10)
elif action == 'shutdown_system':
result = subprocess.run(['sudo', 'poweroff'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, POWEROFF_BIN],
capture_output=True, text=True, timeout=10)
elif action == 'git_pull':
# Use PROJECT_ROOT instead of hardcoded path
project_dir = str(PROJECT_ROOT)
@@ -1823,12 +1830,11 @@ def execute_system_action():
'stderr': result.stderr
})
elif action == 'restart_display_service':
result = subprocess.run(['sudo', 'systemctl', 'restart', 'ledmatrix'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'restart', 'ledmatrix.service'],
capture_output=True, text=True, timeout=15)
elif action == 'restart_web_service':
# Try to restart the web service (assuming it's ledmatrix-web.service)
result = subprocess.run(['sudo', 'systemctl', 'restart', 'ledmatrix-web'],
capture_output=True, text=True)
result = subprocess.run([SUDO_BIN, SYSTEMCTL_BIN, 'restart', 'ledmatrix-web.service'],
capture_output=True, text=True, timeout=15)
else:
return jsonify({'status': 'error', 'message': f'Unknown action: {action}'}), 400
@@ -1840,6 +1846,13 @@ def execute_system_action():
'stderr': result.stderr
})
except subprocess.TimeoutExpired:
if action == 'start_display' and mode:
msg = f'Failed to start display in {mode} mode: timed out'
else:
msg = f'Action {action} timed out'
logger.warning("[System] execute_system_action timed out: action=%s", action)
return jsonify({'status': 'error', 'message': msg, 'returncode': -1, 'stdout': '', 'stderr': 'timeout'}), 500
except Exception as e:
logger.exception("[System] execute_system_action failed")
return jsonify({'status': 'error', 'message': 'Failed to execute system action'}), 500
@@ -7136,7 +7149,7 @@ def connect_wifi():
# Propagate structured error type so the captive portal UI can show
# "Wrong password — try again" instead of a generic failure message.
error_type = "wrong_password" if (message or "").startswith("wrong_password:") else "connection_failed"
clean_message = (message or "").removeprefix("wrong_password: ") or "Failed to connect to network"
clean_message = (message or "").removeprefix("wrong_password:").lstrip() or "Failed to connect to network"
return jsonify({
'status': 'error',
'message': clean_message,

@@ -1225,7 +1225,9 @@ function initializePlugins() {
window.pluginManager._reswap = false;
// Await the installed-plugins fetch so window.installedPlugins is populated before
// searchPluginStore renders Installed/Reinstall badges against it.
loadInstalledPlugins().then(() => {
loadInstalledPlugins().catch(err => {
console.error('[PluginStore] loadInstalledPlugins failed:', err);
}).finally(() => {
searchPluginStore(!isReswapWarm);
});