2 Commits

Author SHA1 Message Date
Chuck
eedf680a8c perf: display pipeline optimizations — caching, logging, scroll, text width (#358)
* docs(core): add module and class docstrings to the 5 undocumented core files

Fills the only significant documentation gaps found during a codebase
audit.  All other core files (plugin_system/, logging_config.py, etc.)
already have complete module, class, and function docstrings.

Files changed (documentation only — zero logic changes):

  display_controller.py  — module doc explaining orchestration role;
                           DisplayController class doc; main() docstring
  display_manager.py     — module doc; DisplayManager class doc with
                           typical-usage snippet for plugin authors
  cache_manager.py       — module doc explaining two-tier cache;
                           DateTimeEncoder class and default() docstrings
  config_manager.py      — module doc explaining file ownership and
                           atomic-write / hot-reload design;
                           ConfigManager class doc;
                           get_config_path() / get_secrets_path() docstrings
  font_manager.py        — module doc (class docstring already existed)

Also noted (but not changed to avoid behaviour risk):
  display_manager.py and font_manager.py use logging.getLogger() directly
  instead of the project's get_logger() wrapper.  display_manager.py also
  calls setLevel(logging.INFO) immediately after, which would be lost if
  switched to get_logger().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* perf(display_controller): three targeted hot-path optimizations

Opt 1 — cache inspect.signature() per plugin_id
  inspect.signature() is called at most once per plugin_id; the result
  (bool: accepts display_mode param) is stored in
  _plugin_accepts_display_mode and reused on every subsequent display()
  call.  Eliminates all reflection from the display path at runtime.
  Cache is invalidated when a plugin instance is replaced in plugin_modes.

Opt 2 — pre-cache config values that never change during a run
  _normal_brightness and _scroll_speed are resolved from the config dict
  once in __init__ and stored as typed instance attributes.
  - Removes 2+ chained dict.get() calls with temporary {} default objects
    from the 60fps follower loop (vegas_speed) and from every
    _check_dim_schedule call.
  - current_brightness init now uses _normal_brightness directly.

Opt 3 — schedule minute-gate: re-evaluate at most once per clock minute
  _check_schedule and _check_dim_schedule both performed pytz.timezone(),
  datetime.now(), strftime(), and datetime.strptime() on every outer loop
  call.  Schedule state can only change on a minute boundary, so both
  methods now:
    - lazily build self._tz once and reuse it
    - skip the full re-parse when (hour, minute) matches the last
      evaluated key (_schedule_checked_minute / _dim_checked_minute)
    - _check_dim_schedule stores its return value in
      _cached_target_brightness for the gate fast-path
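
A minimal sketch of the Opt 3 minute-gate, not the project's actual code.
The class name ScheduleGate and the evaluate callback are hypothetical; the
only idea taken from the text above is "skip the expensive check while
(hour, minute) is unchanged and return the cached result".

```python
from datetime import datetime


class ScheduleGate:
    """Re-run an expensive schedule check at most once per clock minute."""

    def __init__(self, evaluate):
        self._evaluate = evaluate          # expensive check: datetime -> result
        self._checked_minute = None        # (hour, minute) of last evaluation
        self._cached_result = None

    def check(self, now):
        key = (now.hour, now.minute)
        if key != self._checked_minute:
            self._cached_result = self._evaluate(now)
            self._checked_minute = key
        return self._cached_result

    def invalidate(self):
        """Force re-evaluation, e.g. after a config hot-reload."""
        self._checked_minute = None


calls = []
gate = ScheduleGate(lambda now: calls.append(now) or now.hour >= 7)
gate.check(datetime(2026, 1, 1, 8, 30, 0))
gate.check(datetime(2026, 1, 1, 8, 30, 59))   # same minute: served from cache
gate.check(datetime(2026, 1, 1, 8, 31, 0))    # new minute: re-evaluated
```

The invalidate() hook reflects why a gate like this needs a reset path:
without one, a schedule edited mid-minute would not take effect until the
next minute boundary.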

Tests: 23 new tests in test_display_controller_optimizations.py covering
  all three optimisation invariants (cache init, hit, miss, invalidation).
  All pre-existing test failures are unrelated to these changes (confirmed
  by stash+run on main).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve 22 pre-existing test failures across 6 groups

Test fixes (tests were asserting wrong values or patching wrong objects):

  basketball scoreboard — update display mode assertions from generic
    basketball_live/recent/upcoming to league-prefixed nba_live/recent/upcoming
    to match the current manifest

  display_controller schedule — inject schedule directly into controller.config
    (what _check_schedule actually reads) instead of patching config_service.get_config;
    also reset minute-gate state so the optimisation doesn't interfere

  git cache (3 tests) — production code refactored from 4 subprocess calls
    (rev-parse + rev-parse --abbrev-ref + config + log) to a single git log --format=%H%n%cI
    that returns SHA and date on two lines; update fake and call-count assertions

  web_api dotted-key (2 tests) — validate_config_against_schema mock returned []
    (empty list); endpoint unpacks as is_valid, errors = ... causing ValueError;
    fix: return_value = (True, [])

  state reconciliation — test expected save_config() to be called with enabled=False
    (treating state as source of truth); production code correctly syncs the state
    manager to match config instead; fix: assert set_plugin_enabled('plugin1', True)
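
The web_api dotted-key failure above is easy to reproduce in isolation. The
sketch below uses a plain MagicMock named validate as a stand-in for the
patched validate_config_against_schema; it only illustrates the unpacking
behaviour, not the endpoint itself.

```python
from unittest.mock import MagicMock

validate = MagicMock()

# An empty list cannot be unpacked into two names.
validate.return_value = []
try:
    is_valid, errors = validate({})
    unpack_error = None
except ValueError as exc:
    unpack_error = exc

# Returning the (is_valid, errors) pair the caller expects fixes the test.
validate.return_value = (True, [])
is_valid, errors = validate({})
```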

Production fixes (production code had bugs or missing features):

  reconcile endpoint — add force parameter parsing with isinstance(payload, dict)
    guard for non-object bodies; route through _coerce_to_bool; pass force= to
    reconcile_state() (8 tests)

  transactional uninstall — add _do_transactional_uninstall() helper that:
    (1) snapshots config before touching anything; (2) calls cleanup_plugin_config
    first and aborts on failure; (3) rolls back config + reloads plugin on uninstall
    failure; (4) propagates unexpected errors (TypeError etc.) instead of swallowing
    them (6 tests)

  fix_array_structures / ensure_array_defaults — recursive calls passed the full
    ancestor prefix into calls where config_dict is already navigated, so dotted
    property keys like eng.1 caused parent_parts.split('.') to mis-navigate; fix:
    drop prefix on recursive calls; also add _fix_none_arrays pass after
    merge_with_defaults so None arrays in JSON requests are replaced with schema
    defaults (2 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* perf: four targeted optimizations across the display pipeline

Opt 1 — cache data-fetch interval per plugin (plugin_manager.py)
  _get_plugin_update_interval fell back to config_manager.get_config()
  (a full dict copy) when the manifest lacked an interval.  Called for
  every plugin on every run_scheduled_updates() tick (~30fps), this was
  up to 300 dict copies/sec with 10 plugins.
  Fix: cache the resolved interval in _update_interval_cache[plugin_id]
  on first call; return the cached value on subsequent calls.  Cache is
  cleared on load_plugin and unload_plugin.

Opt 2 — demote noisy per-cycle INFO logs to DEBUG (display_controller.py)
  Four logger.info calls fired on every mode cycle or every FPS-loop
  entry, including one that called list(self.plugin_modes.keys())
  unconditionally (allocating a list every outer loop iteration).
  - "Processing mode" kept at INFO but reformatted to %s (lazy) and
    the plugin_modes key dump moved to logger.debug
  - "Attempting/Got cycle duration" → logger.debug
  - "Entering high/normal FPS loop" → logger.debug
  Mode name at INFO is preserved for black-screen troubleshooting.

Opt 3 — use Image.frombytes instead of Image.fromarray in scroll hot path
  (scroll_helper.py)
  Image.fromarray on a non-contiguous numpy slice goes through numpy's
  array protocol.  Image.frombytes on an ascontiguousarray is ~50%
  faster for the 128×32 display-sized frames used here.  Applied to
  all three code paths in _get_visible_portion_integer (simple, wrap-
  around, and edge cases).

Opt 5 — cache get_text_width per (text, font) pair (display_manager.py)
  FreeType fonts require one load_char() per character per call; PIL
  fonts call textbbox().  Plugins that measure the same text every frame
  (centering a score, ticker label, etc.) were re-measuring from scratch
  on every display() call.
  Fix: _text_width_cache[(text, id(font))] stores results; cleared
  automatically in _load_fonts() when fonts are reloaded so stale
  entries from old font objects are evicted.
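
A minimal sketch of the Opt 5 cache, assuming an injectable measure
callable in place of the real FreeType/PIL measurement. TextMeasurer,
FakeFont, and reload_fonts are hypothetical names used for illustration.

```python
class TextMeasurer:
    """Memoise text widths per (text, font object) pair."""

    def __init__(self, measure):
        self._measure = measure            # expensive: (text, font) -> width in px
        self._width_cache = {}

    def get_text_width(self, text, font):
        key = (text, id(font))             # id() avoids requiring hashable fonts
        width = self._width_cache.get(key)
        if width is None:
            width = self._measure(text, font)
            self._width_cache[key] = width
        return width

    def reload_fonts(self):
        # Old font objects may be garbage-collected and their id() reused,
        # so every entry must be dropped when fonts are replaced.
        self._width_cache.clear()


class FakeFont:
    """Stand-in for a real font; each glyph is `px` pixels wide."""

    def __init__(self, px):
        self.px = px


measured = []


def slow_measure(text, font):
    measured.append(text)
    return len(text) * font.px


m = TextMeasurer(slow_measure)
font = FakeFont(6)
first = m.get_text_width("GOAL", font)
second = m.get_text_width("GOAL", font)    # cache hit, no second measurement
```

Keying on id(font) is only safe because the cache is cleared whenever fonts
are reloaded; an id can be reused once the old object has been freed.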

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(scroll_helper): fix edge-case bug exposed by frombytes switch

The previous commit replaced Image.fromarray with Image.frombytes in
_get_visible_portion_integer.  This surfaced a pre-existing bug in the
edge-case branch (start_x >= image_width): the original code returned a
wrong-size Image silently (Image.fromarray accepts a too-short array);
Image.frombytes raises ValueError instead.

Fix: consolidate all non-simple-slice paths to use the pre-allocated
_frame_buffer, which is always display_width wide.  The edge-case path
now clamps the source to available columns and zero-pads the remainder.

Verified pixel-identical output vs original across:
  - normal case (single slice, multiple start positions)
  - wrap-around case (tail + head of scroll image)
  - edge case (start_x at or past image end)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeRabbit review comments on PR #358

1. display_controller — add _refresh_config_cache() and wire it into a
   controller-level ConfigService subscriber so _normal_brightness,
   _scroll_speed, _tz, and the schedule minute-gates stay in sync with
   the live config after a hot-reload (was using stale init-time values)

2. display_manager — narrow the broad except Exception in get_text_width
   to (AttributeError, TypeError, ValueError, OSError) to avoid masking
   unrelated bugs

3. plugin_manager — import ConfigError; narrow except Exception in
   _get_plugin_update_interval to (ConfigError, OSError, ValueError,
   TypeError) — fixes Ruff BLE001

4. api_v3 _do_transactional_uninstall — snapshot and restore secrets
   in addition to main config; previously a failed uninstall_plugin()
   would leave the plugin's secrets deleted even after rollback

5. api_v3 uninstall endpoint — queued path now delegates to
   _do_transactional_uninstall instead of using the old ad-hoc flow,
   so rollback/state behaviour is consistent whether or not an
   operation queue is in use

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(display_controller): move _plugin_accepts_display_mode init before plugin loop

Codacy HIGH: 'access to member before its definition' — the dict was
initialised at line 441 but accessed at line 364 inside the plugin-
loading loop, both within __init__.

Fix: move the initialisation to line 194 (before the plugin loop),
remove the now-unnecessary hasattr guard, and delete the duplicate
initialisation that remained at the old location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Chuck <chuck@example.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 11:58:21 -04:00
Chuck
39ccdcf00d fix(plugins): stop reconciliation install loop, slow plugin list, uninstall resurrection (#309)
* fix(plugins): stop reconciliation install loop, slow plugin list, and uninstall resurrection

Three interacting bugs reported by a user (Discord/ericepe) on a fresh install:

1. The state reconciler retried failed auto-repairs on every HTTP request,
   pegging CPU and flooding logs with "Plugin not found in registry: github
   / youtube". Root cause: ``_run_startup_reconciliation`` reset
   ``_reconciliation_started`` to False on any unresolved inconsistency, so
   ``@app.before_request`` re-fired the entire pass on the next request.
   Fix: run reconciliation exactly once per process; cache per-plugin
   unrecoverable failures inside the reconciler so even an explicit
   re-trigger stays cheap; add a registry pre-check to skip the expensive
   GitHub fetch when we already know the plugin is missing; expose
   ``force=True`` on ``/plugins/state/reconcile`` so users can retry after
   fixing the underlying issue.

2. Uninstalling a plugin via the UI succeeded but the plugin reappeared.
   Root cause: a race between ``store_manager.uninstall_plugin`` (removes
   files) and ``cleanup_plugin_config`` (removes config entry) — if
   reconciliation fired in the gap it saw "config entry with no files" and
   reinstalled. Fix: reorder uninstall to clean config FIRST, drop a
   short-lived "recently uninstalled" tombstone on the store manager that
   the reconciler honors, and pass ``store_manager`` to the manual
   ``/plugins/state/reconcile`` endpoint (it was previously omitted, which
   silently disabled auto-repair entirely).

3. ``GET /plugins/installed`` was very slow on a Pi4 (UI hung on
   "connecting to display" for minutes, ~98% CPU). Root causes: per-request
   ``discover_plugins()`` + manifest re-read + four ``git`` subprocesses per
   plugin (``rev-parse``, ``rev-parse --abbrev-ref``, ``config``, ``log``). Fix:
   mtime-gate ``discover_plugins()`` and drop the per-plugin manifest
   re-read in the endpoint; cache ``_get_local_git_info`` keyed on
   ``.git/HEAD`` mtime so subprocesses only run when the working copy
   actually moved; bump registry cache TTL from 5 to 15 minutes and fall
   back to stale cache on transient network failure.
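
The mtime-gating idea from item 3 can be sketched as follows. This is an
illustration under stated assumptions, not the project's discover_plugins():
MtimeGatedScan and its scans counter are hypothetical, and the scan is
reduced to a directory listing.

```python
import os
import tempfile


class MtimeGatedScan:
    """Re-scan a directory only when its modification time changes."""

    def __init__(self, directory):
        self._directory = directory
        self._last_mtime = None
        self._entries = []
        self.scans = 0                      # counts real scans, for illustration

    def discover(self):
        mtime = os.stat(self._directory).st_mtime_ns
        if mtime != self._last_mtime:
            self._entries = sorted(os.listdir(self._directory))
            self._last_mtime = mtime
            self.scans += 1
        return list(self._entries)


with tempfile.TemporaryDirectory() as root:
    scanner = MtimeGatedScan(root)
    scanner.discover()
    scanner.discover()                      # unchanged directory: cached
    scans_before_change = scanner.scans

    os.mkdir(os.path.join(root, "clock-plugin"))
    # Set the mtime explicitly so the demo does not depend on timer resolution.
    st = os.stat(root)
    os.utime(root, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    found = scanner.discover()
    scans_after_change = scanner.scans
```

One caveat worth keeping in mind: a directory's mtime changes when entries
are added, removed, or renamed directly inside it, not when a file deeper in
the tree is edited, so this gate only detects installs and removals.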

Tests: 16 reconciliation cases (including 5 new ones covering the
unrecoverable cache, force-reconcile path, transient-failure handling, and
recently-uninstalled tombstone) and 8 new store_manager cache tests
covering tombstone TTL, git-info mtime cache hit/miss, and the registry
stale-cache fallback. All 24 pass; the broader 288-test suite continues to
pass with no new failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(plugins): parallelize Plugin Store browse and extend metadata cache TTLs

Follow-up to the previous commit addressing the Plugin Store browse path
specifically. Most users install plugins via the store (ZIP extraction,
no .git directory) so the git-info mtime cache from the previous commit
was a no-op for them; their pain was coming from /plugins/store/list.

Root cause. search_plugins() enriched each returned plugin with three
serial GitHub fetches: _get_github_repo_info (repo API), _get_latest_commit_info
(commits API), _fetch_manifest_from_github (raw.githubusercontent.com).
Fifteen plugins × three requests × serial HTTP = 30–45 sequential round
trips on every cold browse. On a Pi4 over WiFi that translated directly
into the "connecting to display" hang users reported. The commit and
manifest caches had a 5-minute TTL, so even a brief absence re-paid the
full cost.

Changes.

- ``search_plugins``: fan out per-plugin enrichment through a
  ``ThreadPoolExecutor`` (max 10 workers, stays well under unauthenticated
  GitHub rate limits). Apply category/tag/query filters before enrichment
  so we never waste requests on plugins that will be filtered out.
  ``executor.map`` preserves input order, which the UI depends on.
- ``commit_cache_timeout`` and ``manifest_cache_timeout``: 5 min → 30 min.
  Keeps the cache warm across a realistic session while still picking up
  upstream updates in a reasonable window.
- ``_get_github_repo_info`` and ``_get_latest_commit_info``: stale-on-error
  fallback. On a network failure or a 403 we now prefer a previously-
  cached value over the zero-default, matching the pattern already in
  ``fetch_registry``. Flaky Pi WiFi no longer causes star counts to flip
  to 0 and commit info to disappear.
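
A minimal sketch of the "filter first, then fan out" shape described in the
first bullet. The function signature, the registry layout, and fake_enrich
are assumptions made for the example; only ThreadPoolExecutor and the
order-preserving behaviour of Executor.map come from the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def search_plugins(registry, category, enrich, max_workers=10):
    """Filter first, then enrich the survivors in parallel."""
    # Filtering before enrichment avoids network calls for discarded plugins.
    filtered = [p for p in registry if p["category"] == category]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(enrich, filtered))


enriched_ids = []


def fake_enrich(plugin):
    # Stand-in for the per-plugin GitHub lookups (hypothetical).
    enriched_ids.append(plugin["id"])
    return {**plugin, "stars": len(plugin["id"])}


registry = [
    {"id": "clock", "category": "time"},
    {"id": "nba", "category": "sports"},
    {"id": "nhl", "category": "sports"},
]
results = search_plugins(registry, "sports", fake_enrich)
```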

Tests (5 new in test_store_manager_caches.py).

- ``test_results_preserve_registry_order`` — the parallel map must still
  return plugins in input order.
- ``test_filters_applied_before_enrichment`` — category/tag/query filters
  run first so we don't waste HTTP calls.
- ``test_enrichment_runs_concurrently`` — peak-concurrency check plus a
  wall-time bound that would fail if the code regressed to serial.
- ``test_repo_info_stale_on_network_error`` — repo info falls back to
  stale cache on RequestException.
- ``test_commit_info_stale_on_network_error`` — commit info falls back to
  stale cache on RequestException.

All 29 tests (16 reconciliation, 13 store_manager caches) pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf(plugins): drop redundant per-plugin manifest.json fetch in search_plugins

Benchmarking the previous parallelization commit on a real Pi4 revealed
that the 10x speedup I expected was only ~1.1x. Profiling showed two
plugins (football-scoreboard, ledmatrix-flights) each spent 5 seconds
inside _fetch_manifest_from_github — not on the initial HTTP call, but
on the three retries in _http_get_with_retries with exponential backoff
after transient DNS failures. Even with the thread pool, the batch still
had to wait for its slowest requests, so those 5-second tail latencies
dominated wall time.

The per-plugin manifest fetch in search_plugins is redundant anyway.
The registry's plugins.json already carries ``description`` (it is
generated from each plugin's manifest by update_registry.py at release
time), and ``last_updated`` is filled in from the commit info that we
already fetch in the same loop. Dropping the manifest fetch eliminates
one of the three per-plugin HTTPS round trips entirely, which also
eliminates the DNS-retry tail.

The _fetch_manifest_from_github helper itself is preserved — it is
still used by the install path.

Tests unchanged (the search_plugins tests mock all three helpers and
still pass); this drop only affects the hot-path call sequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: lock down install/update/uninstall invariants

Regression guard for the caching and tombstone changes in this PR:

- ``install_plugin`` must not be gated by the uninstall tombstone. The
  tombstone only exists to keep the state reconciler from resurrecting a
  freshly-uninstalled plugin; explicit user-initiated installs via the
  store UI go straight to ``install_plugin()`` and must never be blocked.
  Test: mark a plugin recently uninstalled, stub out the download, call
  ``install_plugin``, and assert the download step was reached.

- ``get_plugin_info(force_refresh=True)`` must forward force_refresh
  through to both ``_get_latest_commit_info`` and ``_fetch_manifest_from_github``,
  so that install_plugin and update_plugin (both of which call
  get_plugin_info with force_refresh=True) continue to bypass the 30-min
  cache TTLs introduced in c03eb8db. Without this, bumping the commit
  cache TTL could cause users to install or update to a commit older than
  what GitHub actually has.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): address review findings — transactional uninstall, registry error propagation, payload hardening

Three real bugs surfaced by review, plus one nitpick. Each was verified
against the current code before fixing.

1. fetch_registry silently swallowed network errors, breaking the
   reconciler (CONFIRMED BUG).

   The stale-cache fallback I added in c03eb8db made fetch_registry
   return {"plugins": []} on network failure when no cache existed —
   which is exactly the state on a fresh boot with flaky WiFi. The
   reconciler's _auto_repair_missing_plugin code assumed an exception
   meant "transient, don't mark unrecoverable" and expected to never
   see a silent empty-dict result. With the silent fallback in place
   on a fresh boot, it would see "no candidates in registry" and
   mark every config-referenced plugin permanently unrecoverable.

   Fix: add ``raise_on_failure: bool = False`` to fetch_registry. UI
   callers keep the stale-cache-fallback default. The reconciler's
   _auto_repair_missing_plugin now calls it with raise_on_failure=True
   so it can distinguish a genuine registry miss from a network error.

2. Uninstall was not transactional (CONFIRMED BUG).

   Two distinct failure modes silently left the system in an
   inconsistent state:

   (a) If ``cleanup_plugin_config`` raised, the code logged a warning
       and proceeded to delete files anyway, leaving an orphan install
       with no config entry.
   (b) If ``uninstall_plugin`` returned False or raised AFTER cleanup
       had already succeeded, the config was gone but the files were
       still on disk — another orphan state.

   Fix: introduce ``_do_transactional_uninstall`` shared by both the
   queue and direct paths. Flow:
     - snapshot plugin's entries in main config + secrets
     - cleanup_plugin_config; on failure, ABORT before touching files
     - uninstall_plugin; on failure, RESTORE the snapshot, then raise
   Both queue and direct endpoints now delegate to this helper and
   surface clean errors to the user instead of proceeding past failure.

3. /plugins/state/reconcile crashed on non-object JSON bodies
   (CONFIRMED BUG).

   The previous code did ``payload.get('force', False)`` after
   ``request.get_json(silent=True) or {}``. If a client sent a bare
   string or array as the JSON body, payload would be that string or
   list and .get() would raise AttributeError. Separately,
   ``bool("false")`` is True, so string-encoded booleans were
   mis-handled.

   Fix: guard ``isinstance(payload, dict)`` and route the value
   through the existing ``_coerce_to_bool`` helper.

4. Nitpick: use ``assert_called_once_with`` in
   test_force_reconcile_clears_unrecoverable_cache. The existing test
   worked in practice (we call reset_mock right before) but the stricter
   assertion catches any future regression where force=True might
   double-fire the install.
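
The raise_on_failure split from item 1 can be sketched like this. It is a
simplified model, not the real fetch_registry: RegistryClient and the
injected download callable are hypothetical, and OSError stands in for the
RequestException and JSONDecodeError cases.

```python
class RegistryClient:
    """Registry fetcher with an opt-in 'raise instead of fall back' mode."""

    def __init__(self, download):
        self._download = download           # callable that may raise OSError
        self._cache = None

    def fetch_registry(self, raise_on_failure=False):
        try:
            self._cache = self._download()
            return self._cache
        except OSError:
            if raise_on_failure:
                raise                       # caller must see the network error
            if self._cache is not None:
                return self._cache          # stale but better than nothing
            return {"plugins": []}          # UI-friendly empty default


def offline():
    raise ConnectionError("network unreachable")


client = RegistryClient(offline)
ui_view = client.fetch_registry()           # silent empty fallback

try:
    client.fetch_registry(raise_on_failure=True)
    reconciler_saw_error = False
except ConnectionError:
    reconciler_saw_error = True             # transient: do not mark unrecoverable
```

The point of the flag is that an empty result and a failed fetch look
identical to the caller unless one of them is an exception.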

Tests added (19 new, 48 total passing):

- TestFetchRegistryRaiseOnFailure (4): flag propagates both
  RequestException and JSONDecodeError, wins over stale cache, and
  the default behavior is unchanged for existing callers.
- test_real_store_manager_empty_registry_on_network_failure (1): the
  key regression test — uses the REAL PluginStoreManager (not a Mock)
  with ConnectionError at the HTTP helper layer, and verifies the
  reconciler does NOT poison _unrecoverable_missing_on_disk.
- TestTransactionalUninstall (4): cleanup failure aborts before file
  removal; file removal failure (both False return and raise) restores
  the config snapshot; happy path still succeeds.
- TestReconcileEndpointPayload (8): bare string / array / null JSON
  bodies, missing force key, boolean true/false, and string-encoded
  "true"/"false" all handled correctly.

All 342 tests in the broader sweep still pass (2 pre-existing
TestDottedKeyNormalization failures reproduce on main and are unrelated).
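
The payload cases covered by TestReconcileEndpointPayload can be sketched
without a web framework. coerce_to_bool below is a hypothetical stand-in
for the project's _coerce_to_bool helper (the accepted string spellings are
an assumption), and parse_force works on a raw JSON string rather than a
request object.

```python
import json


def coerce_to_bool(value):
    """Interpret JSON booleans and common string spellings (illustrative)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"true", "1", "yes", "on"}
    return bool(value)


def parse_force(raw_body):
    """Extract `force` from a JSON body that may not be an object."""
    try:
        payload = json.loads(raw_body) if raw_body else None
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):       # bare string, array, null, ...
        return False
    return coerce_to_bool(payload.get("force", False))
```

For example, parse_force('{"force": "false"}') returns False, whereas the
naive bool("false") would have returned True.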

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: address review nitpicks in store_manager + test

Four small cleanups, each verified against current code:

1. ``_git_info_cache`` type annotation was ``Dict[str, tuple]`` — too
   loose. Tightened to ``Dict[str, Tuple[float, Dict[str, str]]]`` to
   match what ``_get_local_git_info`` actually stores (mtime + the
   sha/short_sha/branch/... dict it returns). Added ``Tuple`` to the
   typing imports.

2. The ``search_plugins`` early-return condition
   ``if len(filtered) == 1 or not fetch_commit_info and len(filtered) < 4``
   parses correctly under Python's precedence (``and`` > ``or``) but is
   visually ambiguous. Added explicit parentheses to make the intent —
   "single plugin, OR small batch that doesn't need commit info" —
   obvious at a glance. Semantics unchanged.

3. Replaced a Unicode multiplication sign (×) with ASCII 'x' in the
   commit_cache_timeout comment.

4. Removed a dead ``concurrent_workers = []`` declaration from
   ``test_enrichment_runs_concurrently``. It was left over from an
   earlier sketch of the concurrency check — the final test uses only
   ``peak_lock`` and ``peak``.
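
The "semantics unchanged" claim in item 2 can be checked mechanically. The
two function names below are hypothetical wrappers around the condition
quoted above, with len(filtered) replaced by an integer argument.

```python
def early_return_implicit(n_filtered, fetch_commit_info):
    return n_filtered == 1 or not fetch_commit_info and n_filtered < 4


def early_return_explicit(n_filtered, fetch_commit_info):
    return (n_filtered == 1) or ((not fetch_commit_info) and n_filtered < 4)


# Exhaustively compare both forms over a small input space.
mismatches = [
    (n, flag)
    for n in range(0, 8)
    for flag in (True, False)
    if early_return_implicit(n, flag) != early_return_explicit(n, flag)
]
```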

All 48 tests still pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): address second review pass — cache correctness and rollback

Verified each finding against the current code. All four inline issues
were real bugs; nitpicks 5-7 were valid improvements.

1. _get_latest_commit_info overwrote a good cached value with None on
   all-branches-404 (CONFIRMED BUG).

   The final line of the branch loop unconditionally wrote
   ``self.commit_info_cache[cache_key] = (time.time(), None)``, which
   clobbered any previously-good entry on a single transient failure
   (e.g. an odd 5xx, a temporary DNS hiccup during the branches_to_try
   loop). Fix: if there's already a good prior value, bump its
   timestamp into the backoff window and return it instead. Only
   cache None when we never had a good value.

2. _get_local_git_info cache did not invalidate on fast-forward
   (CONFIRMED BUG).

   Caching on ``.git/HEAD`` mtime alone is wrong: a ``git pull`` that
   fast-forwards the current branch updates ``.git/refs/heads/<branch>``
   (or packed-refs) but leaves HEAD's contents and mtime untouched.
   The cache would then serve a stale SHA indefinitely.

   Fix: introduce ``_git_cache_signature`` which reads HEAD contents,
   resolves ``ref: refs/heads/<name>`` to the corresponding loose ref
   file, and builds a signature tuple of (head_contents, head_mtime,
   resolved_ref_mtime, packed_refs_mtime). A fast-forward bumps the
   ref file's mtime, which invalidates the signature and re-runs git.

3. test_install_plugin_is_not_blocked_by_tombstone swallowed all
   exceptions (CONFIRMED BUG in test).

   ``try: self.sm.install_plugin("bar") except Exception: pass`` could
   hide a real regression in install_plugin that happens to raise.
   Fix: the test now writes a COMPLETE valid manifest stub (id, name,
   class_name, display_modes, entry_point) and stubs _install_dependencies,
   so install_plugin runs all the way through and returns True. The
   assertion is now ``assertTrue(result)`` with no exception handling.

4. Uninstall rollback missed unload/reload (CONFIRMED BUG).

   Previous flow: cleanup → unload (outside try/except) → uninstall →
   rollback config on failure. Problem: if ``unload_plugin`` raised,
   the exception propagated without restoring config. And if
   ``uninstall_plugin`` failed after a successful unload, the rollback
   restored config but left the plugin unloaded at runtime —
   inconsistent.

   Fix: record ``was_loaded`` before touching runtime state, wrap
   ``unload_plugin`` in the same try/except that covers
   ``uninstall_plugin``, and on any failure call a ``_rollback`` local
   that (a) restores the config snapshot and (b) calls
   ``load_plugin`` to reload the plugin if it was loaded before we
   touched it.

5. Nitpick: ``_unrecoverable_missing_on_disk: set`` → ``Set[str]``.
   Matches the existing ``Dict``/``List`` style in state_reconciliation.py.

6. Nitpick: stale-cache fallbacks in _get_github_repo_info and
   _get_latest_commit_info now bump the cached entry's timestamp by a
   60s failure backoff. Without this, a cache entry whose TTL had just
   expired would make every subsequent request re-hit the failing
   network until it recovered, amplifying the failure. Introduced
   ``_record_cache_backoff`` helper and applied it consistently.

7. Nitpick: replaced the flaky wall-time assertion in
   test_enrichment_runs_concurrently with just the deterministic
   ``peak["count"] >= 2`` signal. ``peak["count"]`` can only exceed 1
   if two workers were inside the critical section simultaneously,
   which is definitive proof of parallelism. The wall-time check was
   tight enough (<200ms) to occasionally fail on CI / low-power boxes.
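
The signature described in item 2 can be sketched as below. This is a
simplified illustration, not the project's _git_cache_signature: the
function and helper names are hypothetical, and the demo fakes a .git
layout in a temporary directory instead of running git.

```python
import os
import tempfile


def _mtime_ns(path):
    try:
        return os.stat(path).st_mtime_ns
    except OSError:
        return None                         # a missing file is part of the signature


def git_cache_signature(git_dir):
    """Build a tuple that changes whenever HEAD or its branch ref moves."""
    head_path = os.path.join(git_dir, "HEAD")
    try:
        with open(head_path, encoding="utf-8") as fh:
            head = fh.read().strip()
    except OSError:
        head = None
    ref_mtime = None
    if head and head.startswith("ref: "):
        # Symbolic ref such as "ref: refs/heads/main": track the loose ref file.
        ref_path = os.path.join(git_dir, *head[len("ref: "):].split("/"))
        ref_mtime = _mtime_ns(ref_path)
    packed_mtime = _mtime_ns(os.path.join(git_dir, "packed-refs"))
    return (head, _mtime_ns(head_path), ref_mtime, packed_mtime)


with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w", encoding="utf-8") as fh:
        fh.write("ref: refs/heads/main\n")
    ref_file = os.path.join(git_dir, "refs", "heads", "main")
    with open(ref_file, "w", encoding="utf-8") as fh:
        fh.write("a" * 40 + "\n")
    before = git_cache_signature(git_dir)
    unchanged = git_cache_signature(git_dir)

    # Simulate a fast-forward: the ref file is rewritten, HEAD is untouched.
    with open(ref_file, "w", encoding="utf-8") as fh:
        fh.write("b" * 40 + "\n")
    st = os.stat(ref_file)
    os.utime(ref_file, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    after = git_cache_signature(git_dir)
```

In the demo, the HEAD contents and HEAD mtime are identical before and
after the simulated fast-forward; only the ref file's mtime differs, which
is exactly the case a HEAD-only cache key would miss.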

Tests (6 new, 54 total passing):

- test_cache_invalidates_on_fast_forward_of_current_branch: builds a
  loose-ref layout under a temp .git/, verifies a first call populates
  the cache, a second call with unchanged state hits the cache, and a
  simulated fast-forward (overwriting ``.git/refs/heads/main`` with a
  new SHA and mtime) correctly re-runs git.
- test_commit_info_preserves_good_cache_on_all_branches_404: seeds a
  good cached entry, mocks requests.get to always return 404, and
  verifies the cache still contains the good value afterwards.
- test_repo_info_stale_bumps_timestamp_into_backoff: seeds an expired
  cache, triggers a ConnectionError, then verifies a second lookup
  does NOT re-hit the network (proves the timestamp bump happened).
- test_repo_info_stale_on_403_also_backs_off: same for the 403 path.
- test_file_removal_failure_reloads_previously_loaded_plugin:
  plugin starts loaded, uninstall_plugin returns False, asserts
  load_plugin was called during rollback.
- test_unload_failure_restores_config_and_does_not_call_uninstall:
  unload_plugin raises, asserts uninstall_plugin was never called AND
  config was restored AND load_plugin was NOT called (runtime state
  never changed, so no reload needed).
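
The rollback behaviour exercised by the last two tests can be modelled in a
few lines. This is a sketch under simplifying assumptions, not
_do_transactional_uninstall itself: config is a plain dict, the loaded
plugins are a set, and remove_files is an injected callable.

```python
import copy


def transactional_uninstall(plugin_id, config, loaded, remove_files):
    """Remove a plugin, restoring config and runtime state on failure.

    `config` maps plugin_id -> settings, `loaded` is the set of plugin ids
    loaded at runtime, and `remove_files` returns False or raises on failure.
    """
    snapshot = copy.deepcopy(config.get(plugin_id))
    was_loaded = plugin_id in loaded

    def rollback():
        if snapshot is not None:
            config[plugin_id] = snapshot    # restore the config entry
        if was_loaded:
            loaded.add(plugin_id)           # reload only if it was loaded before

    config.pop(plugin_id, None)             # step 1: clean config first
    try:
        loaded.discard(plugin_id)           # step 2: unload from runtime
        if not remove_files(plugin_id):     # step 3: delete files
            raise RuntimeError(f"failed to remove files for {plugin_id}")
    except Exception:
        rollback()
        raise


config = {"clock": {"enabled": True}}
loaded = {"clock"}
try:
    transactional_uninstall("clock", config, loaded, lambda pid: False)
    failed = False
except RuntimeError:
    failed = True
```

After the failed call both config and loaded are back to their original
values, which is the invariant the tests above assert.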

Broader test sweep: 348/348 pass (2 pre-existing
TestDottedKeyNormalization failures reproduce on main, unrelated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(plugins): address third review pass — cache signatures, backoff, isolation

All four findings verified as real issues against the current code.

1. _git_cache_signature was missing .git/config (CONFIRMED GAP).

   The cached ``result`` dict from _get_local_git_info includes
   ``remote_url``, which is read from ``.git/config``. But the cache
   signature only tracked HEAD + refs — so a config-only change (e.g.
   ``git remote set-url origin https://...``) would leave the stale
   URL cached indefinitely. This matters for the monorepo-migration
   detection in update_plugin.

   Fix: add ``config_contents`` and ``config_mtime`` to the signature
   tuple. Config reads use the same OSError-guarded pattern as the
   HEAD read.

2. fetch_registry stale fallback didn't bump registry_cache_time
   (CONFIRMED BUG).

   The other caches already had the failure-backoff pattern added in
   the previous review pass (via ``_record_cache_backoff``), but the
   registry cache's stale-fallback branches silently returned the
   cached payload without updating ``registry_cache_time``. Next
   request saw the same expired TTL, re-hit the network, failed
   again — amplifying the original transient failure.

   Fix: bump ``self.registry_cache_time`` forward by the existing
   ``self._failure_backoff_seconds`` (reused — no new constant
   needed) in both the RequestException and JSONDecodeError stale
   branches. Kept the ``raise_on_failure=True`` path untouched so the
   reconciler still gets the exception.

3. _make_client() in the uninstall/reconcile test helper leaked
   MagicMocks into the api_v3 singleton (CONFIRMED RISK).

   Every test call replaced api_v3.config_manager, .plugin_manager,
   .plugin_store_manager, etc. with MagicMocks and never restored them.
   If any later test in the same pytest run imported api_v3 expecting
   original state (or None), it would see the leftover mocks.

   Fix: _make_client now snapshots the original attributes (with a
   sentinel to distinguish "didn't exist" from "was None") and returns
   a cleanup callable. Both setUp methods call self.addCleanup(cleanup)
   so state is restored even if the test raises. On cleanup, sentinel
   entries trigger delattr rather than setattr to preserve the
   "attribute was never set" case.

4. Snapshot helpers used broad ``except Exception`` (CONFIRMED).

   _snapshot_plugin_config caught any exception from
   get_raw_file_content, which could hide programmer errors (TypeError,
   AttributeError) behind the "best-effort snapshot" fallback. The
   legitimate failure modes are filesystem errors (covered by OSError;
   FileNotFoundError is a subclass, IOError is an alias in Python 3)
   and ConfigError (what config_manager wraps all load failures in).

   Fix: narrow to ``(OSError, ConfigError)`` in both snapshot blocks.
   ConfigError was already imported at line 20 of api_v3.py.
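
The failure-backoff pattern from item 2 can be sketched with an injectable
clock so the timing is deterministic. BackoffCache and its parameters are
hypothetical; the 900 s TTL and 60 s backoff mirror the 15-minute registry
TTL and 60s backoff mentioned in this PR, and OSError stands in for the
network exceptions.

```python
import time


class BackoffCache:
    """TTL cache that serves stale data and backs off after a failed refresh."""

    def __init__(self, fetch, ttl=900.0, failure_backoff=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._failure_backoff = failure_backoff
        self._clock = clock
        self._value = None
        self._cached_at = None

    def get(self):
        now = self._clock()
        if self._cached_at is not None and now - self._cached_at < self._ttl:
            return self._value
        try:
            self._value = self._fetch()
            self._cached_at = now
        except OSError:
            if self._cached_at is None:
                raise                       # nothing stale to fall back on
            # Re-stamp the entry so it stays "fresh" for failure_backoff seconds.
            self._cached_at = now - self._ttl + self._failure_backoff
        return self._value


now = [0.0]
attempts = []


def flaky_fetch():
    attempts.append(now[0])
    if len(attempts) > 1:
        raise ConnectionError("network down")
    return {"plugins": ["clock"]}


cache = BackoffCache(flaky_fetch, clock=lambda: now[0])
cache.get()                 # first fetch succeeds
now[0] = 1000.0             # TTL expired
stale = cache.get()         # refresh fails: stale value, backoff starts
now[0] = 1030.0
cache.get()                 # inside the backoff window: no network attempt
attempts_during_backoff = len(attempts)
now[0] = 1061.0
cache.get()                 # backoff over: tries the network again
```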

Tests added (4 new, 58 total passing):

- test_cache_invalidates_on_git_config_change: builds a realistic
  loose-ref layout, writes .git/config with an "old" remote URL,
  exercises _get_local_git_info, then rewrites .git/config with a
  "new" remote URL + new mtime, calls again, and asserts the cache
  invalidated and returned the new URL.
- test_stale_fallback_bumps_timestamp_into_backoff: seeds an expired
  registry cache, triggers ConnectionError, verifies first call
  serves stale, then asserts a second call makes ZERO new HTTP
  requests (proves registry_cache_time was bumped forward).
- test_snapshot_survives_config_read_error: raises ConfigError from
  get_raw_file_content and asserts the uninstall still completes
  successfully — the narrow exception list still catches this case.
- test_snapshot_does_not_swallow_programmer_errors: raises a
  TypeError from get_raw_file_content (not in the narrow list) and
  asserts it propagates up to a 500, AND that uninstall_plugin was
  never called (proves the error aborted the uninstall before any
  files were touched).

Broader test sweep: 352/352 pass (2 pre-existing
TestDottedKeyNormalization failures reproduce on main, unrelated).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Chuck <chuck@example.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 12:33:54 -04:00