Files
LEDMatrix/web_interface/app.py
Chuck 39ccdcf00d fix(plugins): stop reconciliation install loop, slow plugin list, uninstall resurrection (#309)
* fix(plugins): stop reconciliation install loop, slow plugin list, and uninstall resurrection

Three interacting bugs reported by a user (Discord/ericepe) on a fresh install:

1. The state reconciler retried failed auto-repairs on every HTTP request,
   pegging CPU and flooding logs with "Plugin not found in registry: github
   / youtube". Root cause: ``_run_startup_reconciliation`` reset
   ``_reconciliation_started`` to False on any unresolved inconsistency, so
   ``@app.before_request`` re-fired the entire pass on the next request.
   Fix: run reconciliation exactly once per process; cache per-plugin
   unrecoverable failures inside the reconciler so even an explicit
   re-trigger stays cheap; add a registry pre-check to skip the expensive
   GitHub fetch when we already know the plugin is missing; expose
   ``force=True`` on ``/plugins/state/reconcile`` so users can retry after
   fixing the underlying issue.

2. Uninstalling a plugin via the UI succeeded but the plugin reappeared.
   Root cause: a race between ``store_manager.uninstall_plugin`` (removes
   files) and ``cleanup_plugin_config`` (removes config entry) — if
   reconciliation fired in the gap it saw "config entry with no files" and
   reinstalled. Fix: reorder uninstall to clean config FIRST, drop a
   short-lived "recently uninstalled" tombstone on the store manager that
   the reconciler honors, and pass ``store_manager`` to the manual
   ``/plugins/state/reconcile`` endpoint (it was previously omitted, which
   silently disabled auto-repair entirely).

3. ``GET /plugins/installed`` was very slow on a Pi4 (UI hung on
   "connecting to display" for minutes, ~98% CPU). Root causes: per-request
   ``discover_plugins()`` + manifest re-read + four ``git`` subprocesses per
   plugin (``rev-parse``, ``--abbrev-ref``, ``config``, ``log``). Fix:
   mtime-gate ``discover_plugins()`` and drop the per-plugin manifest
   re-read in the endpoint; cache ``_get_local_git_info`` keyed on
   ``.git/HEAD`` mtime so subprocesses only run when the working copy
   actually moved; bump registry cache TTL from 5 to 15 minutes and fall
   back to stale cache on transient network failure.

Tests: 16 reconciliation cases (including 5 new ones covering the
unrecoverable cache, force-reconcile path, transient-failure handling, and
recently-uninstalled tombstone) and 8 new store_manager cache tests
covering tombstone TTL, git-info mtime cache hit/miss, and the registry
stale-cache fallback. All 24 pass; the broader 288-test suite continues to
pass with no new failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(plugins): parallelize Plugin Store browse and extend metadata cache TTLs

Follow-up to the previous commit addressing the Plugin Store browse path
specifically. Most users install plugins via the store (ZIP extraction,
no .git directory) so the git-info mtime cache from the previous commit
was a no-op for them; their pain was coming from /plugins/store/list.

Root cause. search_plugins() enriched each returned plugin with three
serial GitHub fetches: _get_github_repo_info (repo API), _get_latest_commit_info
(commits API), _fetch_manifest_from_github (raw.githubusercontent.com).
Fifteen plugins × three requests × serial HTTP = 30–45 sequential round
trips on every cold browse. On a Pi4 over WiFi that translated directly
into the "connecting to display" hang users reported. The commit and
manifest caches had a 5-minute TTL, so even a brief absence re-paid the
full cost.

Changes.

- ``search_plugins``: fan out per-plugin enrichment through a
  ``ThreadPoolExecutor`` (max 10 workers, stays well under unauthenticated
  GitHub rate limits). Apply category/tag/query filters before enrichment
  so we never waste requests on plugins that will be filtered out.
  ``executor.map`` preserves input order, which the UI depends on.
- ``commit_cache_timeout`` and ``manifest_cache_timeout``: 5 min → 30 min.
  Keeps the cache warm across a realistic session while still picking up
  upstream updates in a reasonable window.
- ``_get_github_repo_info`` and ``_get_latest_commit_info``: stale-on-error
  fallback. On a network failure or a 403 we now prefer a previously-
  cached value over the zero-default, matching the pattern already in
  ``fetch_registry``. Flaky Pi WiFi no longer causes star counts to flip
  to 0 and commit info to disappear.

Tests (5 new in test_store_manager_caches.py).

- ``test_results_preserve_registry_order`` — the parallel map must still
  return plugins in input order.
- ``test_filters_applied_before_enrichment`` — category/tag/query filters
  run first so we don't waste HTTP calls.
- ``test_enrichment_runs_concurrently`` — peak-concurrency check plus a
  wall-time bound that would fail if the code regressed to serial.
- ``test_repo_info_stale_on_network_error`` — repo info falls back to
  stale cache on RequestException.
- ``test_commit_info_stale_on_network_error`` — commit info falls back to
  stale cache on RequestException.

All 29 tests (16 reconciliation, 13 store_manager caches) pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(plugins): drop redundant per-plugin manifest.json fetch in search_plugins

Benchmarking the previous parallelization commit on a real Pi4 revealed
that the 10x speedup I expected was only ~1.1x. Profiling showed two
plugins (football-scoreboard, ledmatrix-flights) each spent 5 seconds
inside _fetch_manifest_from_github — not on the initial HTTP call, but
on the three retries in _http_get_with_retries with exponential backoff
after transient DNS failures. Even with the thread pool, those 5-second
tail latencies stayed in the wave and dominated wall time.

The per-plugin manifest fetch in search_plugins is redundant anyway.
The registry's plugins.json already carries ``description`` (it is
generated from each plugin's manifest by update_registry.py at release
time), and ``last_updated`` is filled in from the commit info that we
already fetch in the same loop. Dropping the manifest fetch eliminates
one of the three per-plugin HTTPS round trips entirely, which also
eliminates the DNS-retry tail.

The _fetch_manifest_from_github helper itself is preserved — it is
still used by the install path.

Tests unchanged (the search_plugins tests mock all three helpers and
still pass); this drop only affects the hot-path call sequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: lock down install/update/uninstall invariants

Regression guard for the caching and tombstone changes in this PR:

- ``install_plugin`` must not be gated by the uninstall tombstone. The
  tombstone only exists to keep the state reconciler from resurrecting a
  freshly-uninstalled plugin; explicit user-initiated installs via the
  store UI go straight to ``install_plugin()`` and must never be blocked.
  Test: mark a plugin recently uninstalled, stub out the download, call
  ``install_plugin``, and assert the download step was reached.

- ``get_plugin_info(force_refresh=True)`` must forward force_refresh
  through to both ``_get_latest_commit_info`` and ``_fetch_manifest_from_github``,
  so that install_plugin and update_plugin (both of which call
  get_plugin_info with force_refresh=True) continue to bypass the 30-min
  cache TTLs introduced in c03eb8db. Without this, bumping the commit
  cache TTL could cause users to install or update to a commit older than
  what GitHub actually has.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): address review findings — transactional uninstall, registry error propagation, payload hardening

Three real bugs surfaced by review, plus one nitpick. Each was verified
against the current code before fixing.

1. fetch_registry silently swallowed network errors, breaking the
   reconciler (CONFIRMED BUG).

   The stale-cache fallback I added in c03eb8db made fetch_registry
   return {"plugins": []} on network failure when no cache existed —
   which is exactly the state on a fresh boot with flaky WiFi. The
   reconciler's _auto_repair_missing_plugin code assumed an exception
   meant "transient, don't mark unrecoverable" and expected to never
   see a silent empty-dict result. With the silent fallback in place
   on a fresh boot, it would see "no candidates in registry" and
   mark every config-referenced plugin permanently unrecoverable.

   Fix: add ``raise_on_failure: bool = False`` to fetch_registry. UI
   callers keep the stale-cache-fallback default. The reconciler's
   _auto_repair_missing_plugin now calls it with raise_on_failure=True
   so it can distinguish a genuine registry miss from a network error.

2. Uninstall was not transactional (CONFIRMED BUG).

   Two distinct failure modes silently left the system in an
   inconsistent state:

   (a) If ``cleanup_plugin_config`` raised, the code logged a warning
       and proceeded to delete files anyway, leaving an orphan install
       with no config entry.
   (b) If ``uninstall_plugin`` returned False or raised AFTER cleanup
       had already succeeded, the config was gone but the files were
       still on disk — another orphan state.

   Fix: introduce ``_do_transactional_uninstall`` shared by both the
   queue and direct paths. Flow:
     - snapshot plugin's entries in main config + secrets
     - cleanup_plugin_config; on failure, ABORT before touching files
     - uninstall_plugin; on failure, RESTORE the snapshot, then raise
   Both queue and direct endpoints now delegate to this helper and
   surface clean errors to the user instead of proceeding past failure.

3. /plugins/state/reconcile crashed on non-object JSON bodies
   (CONFIRMED BUG).

   The previous code did ``payload.get('force', False)`` after
   ``request.get_json(silent=True) or {}``. If a client sent a bare
   string or array as the JSON body, payload would be that string or
   list and .get() would raise AttributeError. Separately,
   ``bool("false")`` is True, so string-encoded booleans were
   mis-handled.

   Fix: guard ``isinstance(payload, dict)`` and route the value
   through the existing ``_coerce_to_bool`` helper.

4. Nitpick: use ``assert_called_once_with`` in
   test_force_reconcile_clears_unrecoverable_cache. The existing test
   worked in practice (we call reset_mock right before) but the stricter
   assertion catches any future regression where force=True might
   double-fire the install.

Tests added (19 new, 48 total passing):

- TestFetchRegistryRaiseOnFailure (4): flag propagates both
  RequestException and JSONDecodeError, wins over stale cache, and
  the default behavior is unchanged for existing callers.
- test_real_store_manager_empty_registry_on_network_failure (1): the
  key regression test — uses the REAL PluginStoreManager (not a Mock)
  with ConnectionError at the HTTP helper layer, and verifies the
  reconciler does NOT poison _unrecoverable_missing_on_disk.
- TestTransactionalUninstall (4): cleanup failure aborts before file
  removal; file removal failure (both False return and raise) restores
  the config snapshot; happy path still succeeds.
- TestReconcileEndpointPayload (8): bare string / array / null JSON
  bodies, missing force key, boolean true/false, and string-encoded
  "true"/"false" all handled correctly.

All 342 tests in the broader sweep still pass (2 pre-existing
TestDottedKeyNormalization failures reproduce on main and are unrelated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: address review nitpicks in store_manager + test

Four small cleanups, each verified against current code:

1. ``_git_info_cache`` type annotation was ``Dict[str, tuple]`` — too
   loose. Tightened to ``Dict[str, Tuple[float, Dict[str, str]]]`` to
   match what ``_get_local_git_info`` actually stores (mtime + the
   sha/short_sha/branch/... dict it returns). Added ``Tuple`` to the
   typing imports.

2. The ``search_plugins`` early-return condition
   ``if len(filtered) == 1 or not fetch_commit_info and len(filtered) < 4``
   parses correctly under Python's precedence (``and`` > ``or``) but is
   visually ambiguous. Added explicit parentheses to make the intent —
   "single plugin, OR small batch that doesn't need commit info" —
   obvious at a glance. Semantics unchanged.

3. Replaced a Unicode multiplication sign (×) with ASCII 'x' in the
   commit_cache_timeout comment.

4. Removed a dead ``concurrent_workers = []`` declaration from
   ``test_enrichment_runs_concurrently``. It was left over from an
   earlier sketch of the concurrency check — the final test uses only
   ``peak_lock`` and ``peak``.

All 48 tests still pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): address second review pass — cache correctness and rollback

Verified each finding against the current code. All four inline issues
were real bugs; nitpicks 5-7 were valid improvements.

1. _get_latest_commit_info overwrote a good cached value with None on
   all-branches-404 (CONFIRMED BUG).

   The final line of the branch loop unconditionally wrote
   ``self.commit_info_cache[cache_key] = (time.time(), None)``, which
   clobbered any previously-good entry on a single transient failure
   (e.g. an odd 5xx, a temporary DNS hiccup during the branches_to_try
   loop). Fix: if there's already a good prior value, bump its
   timestamp into the backoff window and return it instead. Only
   cache None when we never had a good value.

2. _get_local_git_info cache did not invalidate on fast-forward
   (CONFIRMED BUG).

   Caching on ``.git/HEAD`` mtime alone is wrong: a ``git pull`` that
   fast-forwards the current branch updates ``.git/refs/heads/<branch>``
   (or packed-refs) but leaves HEAD's contents and mtime untouched.
   The cache would then serve a stale SHA indefinitely.

   Fix: introduce ``_git_cache_signature`` which reads HEAD contents,
   resolves ``ref: refs/heads/<name>`` to the corresponding loose ref
   file, and builds a signature tuple of (head_contents, head_mtime,
   resolved_ref_mtime, packed_refs_mtime). A fast-forward bumps the
   ref file's mtime, which invalidates the signature and re-runs git.

3. test_install_plugin_is_not_blocked_by_tombstone swallowed all
   exceptions (CONFIRMED BUG in test).

   ``try: self.sm.install_plugin("bar") except Exception: pass`` could
   hide a real regression in install_plugin that happens to raise.
   Fix: the test now writes a COMPLETE valid manifest stub (id, name,
   class_name, display_modes, entry_point) and stubs _install_dependencies,
   so install_plugin runs all the way through and returns True. The
   assertion is now ``assertTrue(result)`` with no exception handling.

4. Uninstall rollback missed unload/reload (CONFIRMED BUG).

   Previous flow: cleanup → unload (outside try/except) → uninstall →
   rollback config on failure. Problem: if ``unload_plugin`` raised,
   the exception propagated without restoring config. And if
   ``uninstall_plugin`` failed after a successful unload, the rollback
   restored config but left the plugin unloaded at runtime —
   inconsistent.

   Fix: record ``was_loaded`` before touching runtime state, wrap
   ``unload_plugin`` in the same try/except that covers
   ``uninstall_plugin``, and on any failure call a ``_rollback`` local
   that (a) restores the config snapshot and (b) calls
   ``load_plugin`` to reload the plugin if it was loaded before we
   touched it.

5. Nitpick: ``_unrecoverable_missing_on_disk: set`` → ``Set[str]``.
   Matches the existing ``Dict``/``List`` style in state_reconciliation.py.

6. Nitpick: stale-cache fallbacks in _get_github_repo_info and
   _get_latest_commit_info now bump the cached entry's timestamp by a
   60s failure backoff. Without this, a cache entry whose TTL just
   expired would cause every subsequent request to re-hit the network
   until it came back, amplifying the failure. Introduced
   ``_record_cache_backoff`` helper and applied it consistently.

7. Nitpick: replaced the flaky wall-time assertion in
   test_enrichment_runs_concurrently with just the deterministic
   ``peak["count"] >= 2`` signal. ``peak["count"]`` can only exceed 1
   if two workers were inside the critical section simultaneously,
   which is definitive proof of parallelism. The wall-time check was
   tight enough (<200ms) to occasionally fail on CI / low-power boxes.

Tests (6 new, 54 total passing):

- test_cache_invalidates_on_fast_forward_of_current_branch: builds a
  loose-ref layout under a temp .git/, verifies a first call populates
  the cache, a second call with unchanged state hits the cache, and a
  simulated fast-forward (overwriting ``.git/refs/heads/main`` with a
  new SHA and mtime) correctly re-runs git.
- test_commit_info_preserves_good_cache_on_all_branches_404: seeds a
  good cached entry, mocks requests.get to always return 404, and
  verifies the cache still contains the good value afterwards.
- test_repo_info_stale_bumps_timestamp_into_backoff: seeds an expired
  cache, triggers a ConnectionError, then verifies a second lookup
  does NOT re-hit the network (proves the timestamp bump happened).
- test_repo_info_stale_on_403_also_backs_off: same for the 403 path.
- test_file_removal_failure_reloads_previously_loaded_plugin:
  plugin starts loaded, uninstall_plugin returns False, asserts
  load_plugin was called during rollback.
- test_unload_failure_restores_config_and_does_not_call_uninstall:
  unload_plugin raises, asserts uninstall_plugin was never called AND
  config was restored AND load_plugin was NOT called (runtime state
  never changed, so no reload needed).

Broader test sweep: 348/348 pass (2 pre-existing
TestDottedKeyNormalization failures reproduce on main, unrelated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): address third review pass — cache signatures, backoff, isolation

All four findings verified as real issues against the current code.

1. _git_cache_signature was missing .git/config (CONFIRMED GAP).

   The cached ``result`` dict from _get_local_git_info includes
   ``remote_url``, which is read from ``.git/config``. But the cache
   signature only tracked HEAD + refs — so a config-only change (e.g.
   ``git remote set-url origin https://...``) would leave the stale
   URL cached indefinitely. This matters for the monorepo-migration
   detection in update_plugin.

   Fix: add ``config_contents`` and ``config_mtime`` to the signature
   tuple. Config reads use the same OSError-guarded pattern as the
   HEAD read.

2. fetch_registry stale fallback didn't bump registry_cache_time
   (CONFIRMED BUG).

   The other caches already had the failure-backoff pattern added in
   the previous review pass (via ``_record_cache_backoff``), but the
   registry cache's stale-fallback branches silently returned the
   cached payload without updating ``registry_cache_time``. Next
   request saw the same expired TTL, re-hit the network, failed
   again — amplifying the original transient failure.

   Fix: bump ``self.registry_cache_time`` forward by the existing
   ``self._failure_backoff_seconds`` (reused — no new constant
   needed) in both the RequestException and JSONDecodeError stale
   branches. Kept the ``raise_on_failure=True`` path untouched so the
   reconciler still gets the exception.

3. _make_client() in the uninstall/reconcile test helper leaked
   MagicMocks into the api_v3 singleton (CONFIRMED RISK).

   Every test call replaced api_v3.config_manager, .plugin_manager,
   .plugin_store_manager, etc. with MagicMocks and never restored them.
   If any later test in the same pytest run imported api_v3 expecting
   original state (or None), it would see the leftover mocks.

   Fix: _make_client now snapshots the original attributes (with a
   sentinel to distinguish "didn't exist" from "was None") and returns
   a cleanup callable. Both setUp methods call self.addCleanup(cleanup)
   so state is restored even if the test raises. On cleanup, sentinel
   entries trigger delattr rather than setattr to preserve the
   "attribute was never set" case.

4. Snapshot helpers used broad ``except Exception`` (CONFIRMED).

   _snapshot_plugin_config caught any exception from
   get_raw_file_content, which could hide programmer errors (TypeError,
   AttributeError) behind the "best-effort snapshot" fallback. The
   legitimate failure modes are filesystem errors (covered by OSError;
   FileNotFoundError is a subclass, IOError is an alias in Python 3)
   and ConfigError (what config_manager wraps all load failures in).

   Fix: narrow to ``(OSError, ConfigError)`` in both snapshot blocks.
   ConfigError was already imported at line 20 of api_v3.py.

Tests added (4 new, 58 total passing):

- test_cache_invalidates_on_git_config_change: builds a realistic
  loose-ref layout, writes .git/config with an "old" remote URL,
  exercises _get_local_git_info, then rewrites .git/config with a
  "new" remote URL + new mtime, calls again, and asserts the cache
  invalidated and returned the new URL.
- test_stale_fallback_bumps_timestamp_into_backoff: seeds an expired
  registry cache, triggers ConnectionError, verifies first call
  serves stale, then asserts a second call makes ZERO new HTTP
  requests (proves registry_cache_time was bumped forward).
- test_snapshot_survives_config_read_error: raises ConfigError from
  get_raw_file_content and asserts the uninstall still completes
  successfully — the narrow exception list still catches this case.
- test_snapshot_does_not_swallow_programmer_errors: raises a
  TypeError from get_raw_file_content (not in the narrow list) and
  asserts it propagates up to a 500, AND that uninstall_plugin was
  never called (proves the exception was caught at the right level).

Broader test sweep: 352/352 pass (2 pre-existing
TestDottedKeyNormalization failures reproduce on main, unrelated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Chuck <chuck@example.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 12:33:54 -04:00

730 lines
28 KiB
Python

from flask import Flask, Blueprint, render_template, request, redirect, url_for, flash, jsonify, Response, send_from_directory
import json
import logging
import os
import sys
import subprocess
import time
from pathlib import Path
from datetime import datetime, timedelta
# Add parent directory to path for imports
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.config_manager import ConfigManager
from src.exceptions import ConfigError
from src.plugin_system.plugin_manager import PluginManager
from src.plugin_system.store_manager import PluginStoreManager
from src.plugin_system.saved_repositories import SavedRepositoriesManager
from src.plugin_system.schema_manager import SchemaManager
from src.plugin_system.operation_queue import PluginOperationQueue
from src.plugin_system.state_manager import PluginStateManager
from src.plugin_system.operation_history import OperationHistory
from src.plugin_system.health_monitor import PluginHealthMonitor
from src.wifi_manager import WiFiManager
# Create Flask app
app = Flask(__name__)
app.secret_key = os.urandom(24)
config_manager = ConfigManager()
# CSRF protection disabled for local-only application
# CSRF is designed for internet-facing web apps to prevent cross-site request forgery.
# For a local-only Raspberry Pi application, the threat model is different:
# - If an attacker has network access to perform CSRF, they have other attack vectors
# - All API endpoints are programmatic (HTMX/fetch) and don't include CSRF tokens
# - Forms use HTMX which doesn't automatically include CSRF tokens
# If you need CSRF protection (e.g., exposing to internet), properly implement CSRF tokens in HTMX forms
csrf = None
# Initialize rate limiting (prevent accidental abuse, not security)
try:
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
limiter = Limiter(
app=app,
key_func=get_remote_address,
default_limits=["1000 per minute"], # Generous limit for local use
storage_uri="memory://" # In-memory storage for simplicity
)
except ImportError:
# flask-limiter not installed, rate limiting disabled
limiter = None
pass
# Import cache functions from separate module to avoid circular imports
from web_interface.cache import get_cached, set_cached, invalidate_cache
# Initialize plugin managers - read plugins directory from config
config = config_manager.load_config()
plugin_system_config = config.get('plugin_system', {})
plugins_dir_name = plugin_system_config.get('plugins_directory', 'plugin-repos')
# Resolve plugin directory - handle both absolute and relative paths
if os.path.isabs(plugins_dir_name):
plugins_dir = Path(plugins_dir_name)
else:
# If relative, resolve relative to the project root (LEDMatrix directory)
project_root = Path(__file__).parent.parent
plugins_dir = project_root / plugins_dir_name
plugin_manager = PluginManager(
plugins_dir=str(plugins_dir),
config_manager=config_manager,
display_manager=None, # Not needed for web interface
cache_manager=None # Not needed for web interface
)
plugin_store_manager = PluginStoreManager(plugins_dir=str(plugins_dir))
saved_repositories_manager = SavedRepositoriesManager()
# Initialize schema manager
schema_manager = SchemaManager(
plugins_dir=plugins_dir,
project_root=project_root,
logger=None
)
# Initialize operation queue for plugin operations
# Use lazy_load=True to defer file loading until first use (improves startup time)
operation_queue = PluginOperationQueue(
history_file=str(project_root / "data" / "plugin_operations.json"),
max_history=500,
lazy_load=True
)
# Initialize plugin state manager
# Use lazy_load=True to defer file loading until first use (improves startup time)
plugin_state_manager = PluginStateManager(
state_file=str(project_root / "data" / "plugin_state.json"),
auto_save=True,
lazy_load=True
)
# Initialize operation history
# Use lazy_load=True to defer file loading until first use (improves startup time)
operation_history = OperationHistory(
history_file=str(project_root / "data" / "operation_history.json"),
max_records=1000,
lazy_load=True
)
# Initialize health monitoring (if health tracker is available)
# Deferred until first request to improve startup time
health_monitor = None
_health_monitor_initialized = False
# Plugin discovery is deferred until first API request that needs it
# This improves startup time - endpoints will call discover_plugins() when needed
# Register blueprints
from web_interface.blueprints.pages_v3 import pages_v3
from web_interface.blueprints.api_v3 import api_v3
# Initialize managers in blueprints
pages_v3.config_manager = config_manager
pages_v3.plugin_manager = plugin_manager
pages_v3.plugin_store_manager = plugin_store_manager
pages_v3.saved_repositories_manager = saved_repositories_manager
api_v3.config_manager = config_manager
api_v3.plugin_manager = plugin_manager
api_v3.plugin_store_manager = plugin_store_manager
api_v3.saved_repositories_manager = saved_repositories_manager
api_v3.schema_manager = schema_manager
api_v3.operation_queue = operation_queue
api_v3.plugin_state_manager = plugin_state_manager
api_v3.operation_history = operation_history
api_v3.health_monitor = health_monitor
# Initialize cache manager for API endpoints
from src.cache_manager import CacheManager
api_v3.cache_manager = CacheManager()
app.register_blueprint(pages_v3, url_prefix='/v3')
app.register_blueprint(api_v3, url_prefix='/api/v3')
# Route to serve plugin asset files (registered on main app, not blueprint, for /assets/... path)
@app.route('/assets/plugins/<plugin_id>/uploads/<path:filename>', methods=['GET'])
def serve_plugin_asset(plugin_id, filename):
"""Serve uploaded asset files from assets/plugins/{plugin_id}/uploads/"""
try:
# Build the asset directory path
assets_dir = project_root / 'assets' / 'plugins' / plugin_id / 'uploads'
assets_dir = assets_dir.resolve()
# Security check: ensure the assets directory exists and is within project_root
if not assets_dir.exists() or not assets_dir.is_dir():
return jsonify({'status': 'error', 'message': 'Asset directory not found'}), 404
# Ensure we're serving from within the assets directory (prevent directory traversal)
# Use proper path resolution instead of string prefix matching to prevent bypasses
assets_dir_resolved = assets_dir.resolve()
project_root_resolved = project_root.resolve()
# Check that assets_dir is actually within project_root using commonpath
try:
common_path = os.path.commonpath([str(assets_dir_resolved), str(project_root_resolved)])
if common_path != str(project_root_resolved):
return jsonify({'status': 'error', 'message': 'Invalid asset path'}), 403
except ValueError:
# commonpath raises ValueError if paths are on different drives (Windows)
return jsonify({'status': 'error', 'message': 'Invalid asset path'}), 403
# Resolve the requested file path
requested_file = (assets_dir / filename).resolve()
# Security check: ensure file is within the assets directory using proper path comparison
# Use commonpath to ensure assets_dir is a true parent of requested_file
try:
common_path = os.path.commonpath([str(requested_file), str(assets_dir_resolved)])
if common_path != str(assets_dir_resolved):
return jsonify({'status': 'error', 'message': 'Invalid file path'}), 403
except ValueError:
# commonpath raises ValueError if paths are on different drives (Windows)
return jsonify({'status': 'error', 'message': 'Invalid file path'}), 403
# Check if file exists
if not requested_file.exists() or not requested_file.is_file():
return jsonify({'status': 'error', 'message': 'File not found'}), 404
# Determine content type based on file extension
content_type = 'application/octet-stream'
if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
content_type = 'image/jpeg' if filename.lower().endswith(('.jpg', '.jpeg')) else 'image/png'
elif filename.lower().endswith('.gif'):
content_type = 'image/gif'
elif filename.lower().endswith('.bmp'):
content_type = 'image/bmp'
elif filename.lower().endswith('.webp'):
content_type = 'image/webp'
elif filename.lower().endswith('.svg'):
content_type = 'image/svg+xml'
elif filename.lower().endswith('.json'):
content_type = 'application/json'
elif filename.lower().endswith('.txt'):
content_type = 'text/plain'
# Use send_from_directory to serve the file
return send_from_directory(str(assets_dir), filename, mimetype=content_type)
except Exception as e:
# Log the exception with full traceback server-side
import traceback
app.logger.exception('Error serving plugin asset file')
# Return generic error message to client (avoid leaking internal details)
# Only include detailed error information when in debug mode
if app.debug:
return jsonify({
'status': 'error',
'message': str(e),
'traceback': traceback.format_exc()
}), 500
else:
return jsonify({
'status': 'error',
'message': 'Internal server error'
}), 500
# Cached AP mode check — avoids creating a WiFiManager per request
_ap_mode_cache = {'value': False, 'timestamp': 0}
_AP_MODE_CACHE_TTL = 5 # seconds
def is_ap_mode_active():
"""
Check if access point mode is currently active (cached, 5s TTL).
Uses a direct systemctl check instead of instantiating WiFiManager.
"""
now = time.time()
if (now - _ap_mode_cache['timestamp']) < _AP_MODE_CACHE_TTL:
return _ap_mode_cache['value']
try:
result = subprocess.run(
['systemctl', 'is-active', 'hostapd'],
capture_output=True, text=True, timeout=2
)
active = result.stdout.strip() == 'active'
_ap_mode_cache['value'] = active
_ap_mode_cache['timestamp'] = now
return active
except (subprocess.SubprocessError, OSError) as e:
logging.getLogger('web_interface').error(f"AP mode check failed: {e}")
return _ap_mode_cache['value']
# Captive portal detection endpoints
# When AP mode is active, return responses that TRIGGER the captive portal popup.
# When not in AP mode, return normal "success" responses so connectivity checks pass.
@app.route('/hotspot-detect.html')
def hotspot_detect():
"""iOS/macOS captive portal detection endpoint"""
if is_ap_mode_active():
# Non-"Success" title triggers iOS captive portal popup
return redirect(url_for('pages_v3.captive_setup'), code=302)
return '<HTML><HEAD><TITLE>Success</TITLE></HEAD><BODY>Success</BODY></HTML>', 200
@app.route('/generate_204')
def generate_204():
"""Android captive portal detection endpoint"""
if is_ap_mode_active():
# Android expects 204 = "internet works". Non-204 triggers portal popup.
return redirect(url_for('pages_v3.captive_setup'), code=302)
return '', 204
@app.route('/connecttest.txt')
def connecttest_txt():
"""Windows captive portal detection endpoint"""
if is_ap_mode_active():
return redirect(url_for('pages_v3.captive_setup'), code=302)
return 'Microsoft Connect Test', 200
@app.route('/success.txt')
def success_txt():
"""Firefox captive portal detection endpoint"""
if is_ap_mode_active():
return redirect(url_for('pages_v3.captive_setup'), code=302)
return 'success', 200
# Initialize logging
try:
from web_interface.logging_config import setup_web_interface_logging, log_api_request
# Use JSON logging in production, readable logs in development
use_json_logging = os.environ.get('LEDMATRIX_JSON_LOGGING', 'false').lower() == 'true'
setup_web_interface_logging(level='INFO', use_json=use_json_logging)
except ImportError:
# Logging config not available, use default
log_api_request = None
pass
# Request timing and logging middleware
@app.before_request
def before_request():
"""Track request start time for logging."""
from flask import request
request.start_time = time.time()
@app.after_request
def after_request_logging(response):
"""Log API requests after response."""
if log_api_request:
try:
from flask import request
duration_ms = (time.time() - getattr(request, 'start_time', time.time())) * 1000
ip_address = request.remote_addr if hasattr(request, 'remote_addr') else None
log_api_request(
method=request.method,
path=request.path,
status_code=response.status_code,
duration_ms=duration_ms,
ip_address=ip_address
)
except Exception:
pass # Don't break response if logging fails
return response
# Global error handlers
@app.errorhandler(404)
def not_found_error(error):
"""Handle 404 errors."""
return jsonify({
'status': 'error',
'error_code': 'NOT_FOUND',
'message': 'Resource not found',
'path': request.path
}), 404
@app.errorhandler(500)
def internal_error(error):
"""Handle 500 errors."""
import traceback
error_details = traceback.format_exc()
# Log the error
import logging
logger = logging.getLogger('web_interface')
logger.error(f"Internal server error: {error}", exc_info=True)
# Return user-friendly error (hide internal details in production)
return jsonify({
'status': 'error',
'error_code': 'INTERNAL_ERROR',
'message': 'An internal error occurred',
'details': error_details if app.debug else None
}), 500
@app.errorhandler(Exception)
def handle_exception(error):
"""Handle all unhandled exceptions."""
import traceback
import logging
logger = logging.getLogger('web_interface')
logger.error(f"Unhandled exception: {error}", exc_info=True)
return jsonify({
'status': 'error',
'error_code': 'UNKNOWN_ERROR',
'message': str(error) if app.debug else 'An error occurred',
'details': traceback.format_exc() if app.debug else None
}), 500
# Captive portal redirect middleware
@app.before_request
def captive_portal_redirect():
"""
Redirect all HTTP requests to WiFi setup page when AP mode is active.
This creates a captive portal experience where users are automatically
directed to the WiFi configuration page.
"""
# Check if AP mode is active
if not is_ap_mode_active():
return None # Continue normal request processing
# Get the request path
path = request.path
# List of paths that should NOT be redirected (allow normal operation)
allowed_paths = [
'/v3', # Main interface and all sub-paths (includes /v3/setup)
'/api/v3/', # All API endpoints
'/static/', # Static files (CSS, JS, images)
'/hotspot-detect.html', # iOS/macOS detection
'/generate_204', # Android detection
'/connecttest.txt', # Windows detection
'/success.txt', # Firefox detection
'/favicon.ico', # Favicon
]
for allowed_path in allowed_paths:
if path.startswith(allowed_path):
return None
# Redirect to lightweight captive portal setup page (not the full UI)
return redirect(url_for('pages_v3.captive_setup'), code=302)
# Add security headers and caching to all responses
@app.after_request
def add_security_headers(response):
"""Add security headers and caching to all responses"""
# Only set standard security headers - avoid Permissions-Policy to prevent browser warnings
# about unrecognized features
response.headers['X-Content-Type-Options'] = 'nosniff'
response.headers['X-Frame-Options'] = 'SAMEORIGIN'
response.headers['X-XSS-Protection'] = '1; mode=block'
# Add caching headers for static assets
if request.path.startswith('/static/'):
# Cache static assets for 1 year (with versioning via query params)
response.headers['Cache-Control'] = 'public, max-age=31536000, immutable'
response.headers['Expires'] = (datetime.now() + timedelta(days=365)).strftime('%a, %d %b %Y %H:%M:%S GMT')
elif request.path.startswith('/api/v3/'):
# Short cache for API responses (5 seconds) to allow for quick updates
# but reduce server load for repeated requests
if request.method == 'GET' and 'stream' not in request.path:
response.headers['Cache-Control'] = 'private, max-age=5, must-revalidate'
else:
# No cache for HTML pages to ensure fresh content
response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
response.headers['Pragma'] = 'no-cache'
response.headers['Expires'] = '0'
return response
# SSE helper function
def sse_response(generator_func):
"""Helper to create SSE responses"""
def generate():
for data in generator_func():
yield f"data: {json.dumps(data)}\n\n"
return Response(generate(), mimetype='text/event-stream')
# System status generator for SSE
def system_status_generator():
"""Generate system status updates"""
while True:
try:
# Try to import psutil for system stats
try:
import psutil
cpu_percent = round(psutil.cpu_percent(interval=1), 1)
memory = psutil.virtual_memory()
memory_used_percent = round(memory.percent, 1)
# Try to get CPU temperature (Raspberry Pi specific)
cpu_temp = 0
try:
with open('/sys/class/thermal/thermal_zone0/temp', 'r') as f:
cpu_temp = round(float(f.read()) / 1000.0, 1)
except (OSError, ValueError):
pass
except ImportError:
cpu_percent = 0
memory_used_percent = 0
cpu_temp = 0
# Check if display service is running
service_active = False
try:
result = subprocess.run(['systemctl', 'is-active', 'ledmatrix'],
capture_output=True, text=True, timeout=2)
service_active = result.stdout.strip() == 'active'
except (subprocess.SubprocessError, OSError):
pass
status = {
'timestamp': time.time(),
'uptime': 'Running',
'service_active': service_active,
'cpu_percent': cpu_percent,
'memory_used_percent': memory_used_percent,
'cpu_temp': cpu_temp,
'disk_used_percent': 0
}
yield status
except Exception as e:
yield {'error': str(e)}
time.sleep(10) # Update every 10 seconds (reduced frequency for better performance)
# Display preview generator for SSE
def display_preview_generator():
"""Generate display preview updates from snapshot file"""
import base64
from PIL import Image
import io
snapshot_path = "/tmp/led_matrix_preview.png"
last_modified = None
# Get display dimensions from config
try:
main_config = config_manager.load_config()
cols = main_config.get('display', {}).get('hardware', {}).get('cols', 64)
chain_length = main_config.get('display', {}).get('hardware', {}).get('chain_length', 2)
rows = main_config.get('display', {}).get('hardware', {}).get('rows', 32)
parallel = main_config.get('display', {}).get('hardware', {}).get('parallel', 1)
width = cols * chain_length
height = rows * parallel
except (KeyError, TypeError, ValueError, ConfigError):
width = 128
height = 64
while True:
try:
# Check if snapshot file exists and has been modified
if os.path.exists(snapshot_path):
current_modified = os.path.getmtime(snapshot_path)
# Only read if file is new or has been updated
if last_modified is None or current_modified > last_modified:
try:
# Read and encode the image
with Image.open(snapshot_path) as img:
# Convert to PNG and encode as base64
buffer = io.BytesIO()
img.save(buffer, format='PNG')
img_str = base64.b64encode(buffer.getvalue()).decode('utf-8')
preview_data = {
'timestamp': time.time(),
'width': width,
'height': height,
'image': img_str
}
last_modified = current_modified
yield preview_data
except Exception as read_err:
# File might be being written, skip this update
pass
else:
# No snapshot available
yield {
'timestamp': time.time(),
'width': width,
'height': height,
'image': None
}
except Exception as e:
yield {'error': str(e)}
time.sleep(0.5) # Check 2 times per second (reduced frequency for better performance)
# Logs generator for SSE
def logs_generator():
"""Generate log updates from journalctl"""
while True:
try:
# Get recent logs from journalctl (simplified version)
# Note: User should be in systemd-journal group to read logs without sudo
try:
result = subprocess.run(
['journalctl', '-u', 'ledmatrix.service', '-n', '50', '--no-pager'],
capture_output=True, text=True, timeout=5
)
if result.returncode == 0:
logs_text = result.stdout.strip()
if logs_text:
logs_data = {
'timestamp': time.time(),
'logs': logs_text
}
yield logs_data
else:
# No logs available
logs_data = {
'timestamp': time.time(),
'logs': 'No logs available from ledmatrix service'
}
yield logs_data
else:
# journalctl failed
error_data = {
'timestamp': time.time(),
'logs': f'journalctl failed with return code {result.returncode}: {result.stderr.strip()}'
}
yield error_data
except subprocess.TimeoutExpired:
# Timeout - just skip this update
pass
except Exception as e:
error_data = {
'timestamp': time.time(),
'logs': f'Error running journalctl: {str(e)}'
}
yield error_data
except Exception as e:
error_data = {
'timestamp': time.time(),
'logs': f'Unexpected error in logs generator: {str(e)}'
}
yield error_data
time.sleep(5) # Update every 5 seconds (reduced frequency for better performance)
# SSE endpoints
@app.route('/api/v3/stream/stats')
def stream_stats():
return sse_response(system_status_generator)
@app.route('/api/v3/stream/display')
def stream_display():
return sse_response(display_preview_generator)
@app.route('/api/v3/stream/logs')
def stream_logs():
return sse_response(logs_generator)
# Exempt SSE streams from CSRF and add rate limiting
if csrf:
csrf.exempt(stream_stats)
csrf.exempt(stream_display)
csrf.exempt(stream_logs)
# Note: api_v3 blueprint is exempted above after registration
if limiter:
limiter.limit("20 per minute")(stream_stats)
limiter.limit("20 per minute")(stream_display)
limiter.limit("20 per minute")(stream_logs)
# Main route - redirect to v3 interface as default
@app.route('/')
def index():
"""Redirect to v3 interface"""
return redirect(url_for('pages_v3.index'))
@app.route('/favicon.ico')
def favicon():
"""Return 204 No Content for favicon to avoid 404 errors"""
return '', 204
def _initialize_health_monitor():
"""Initialize health monitoring after server is ready to accept requests."""
global health_monitor, _health_monitor_initialized
if _health_monitor_initialized:
return
if health_monitor is None and hasattr(plugin_manager, 'health_tracker') and plugin_manager.health_tracker:
try:
health_monitor = PluginHealthMonitor(
health_tracker=plugin_manager.health_tracker,
check_interval=60.0, # Check every minute
degraded_threshold=0.5,
unhealthy_threshold=0.8,
max_response_time=5.0
)
health_monitor.start_monitoring()
print("✓ Plugin health monitoring started")
except Exception as e:
print(f"⚠ Could not start health monitoring: {e}")
_health_monitor_initialized = True
_reconciliation_done = False
_reconciliation_started = False
import threading as _threading
_reconciliation_lock = _threading.Lock()
def _run_startup_reconciliation() -> None:
"""Run state reconciliation in background to auto-repair missing plugins.
Reconciliation runs exactly once per process lifetime, regardless of
whether every inconsistency could be auto-fixed. Previously, a failed
auto-repair (e.g. a config entry referencing a plugin that no longer
exists in the registry) would reset ``_reconciliation_started`` to False,
causing the ``@app.before_request`` hook to re-trigger reconciliation on
every single HTTP request — an infinite install-retry loop that pegged
the CPU and flooded the log. Unresolved issues are now left in place for
the user to address via the UI; the reconciler itself also caches
per-plugin unrecoverable failures internally so repeated reconcile calls
stay cheap.
"""
global _reconciliation_done
from src.logging_config import get_logger
_logger = get_logger('reconciliation')
try:
from src.plugin_system.state_reconciliation import StateReconciliation
reconciler = StateReconciliation(
state_manager=plugin_state_manager,
config_manager=config_manager,
plugin_manager=plugin_manager,
plugins_dir=plugins_dir,
store_manager=plugin_store_manager
)
result = reconciler.reconcile_state()
if result.inconsistencies_found:
_logger.info("[Reconciliation] %s", result.message)
if result.inconsistencies_fixed:
plugin_manager.discover_plugins()
if not result.reconciliation_successful:
_logger.warning(
"[Reconciliation] Finished with %d unresolved issue(s); "
"will not retry automatically. Use the Plugin Store or the "
"manual 'Reconcile' action to resolve.",
len(result.inconsistencies_manual),
)
except Exception as e:
_logger.error("[Reconciliation] Error: %s", e, exc_info=True)
finally:
# Always mark done — we do not want an unhandled exception (or an
# unresolved inconsistency) to cause the @before_request hook to
# retrigger reconciliation on every subsequent request.
_reconciliation_done = True
# Initialize health monitor and run reconciliation on first request
@app.before_request
def check_health_monitor():
"""Ensure health monitor is initialized; launch reconciliation in background."""
global _reconciliation_started
if not _health_monitor_initialized:
_initialize_health_monitor()
with _reconciliation_lock:
if not _reconciliation_started:
_reconciliation_started = True
_threading.Thread(target=_run_startup_reconciliation, daemon=True).start()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000, debug=True)