LEDMatrix/src/plugin_system/health_monitor.py
Chuck 05b3fa56cb fix: Codacy security fixes, CVE dependency bumps, and code quality cleanup (#331)
* fix(deps): bump minimum versions to address CVEs

Pillow 10.4.0 → 12.2.0: CVE-2026-40192 (DoS via FITS decompression bomb),
CVE-2026-25990 (OOB write via PSD image), CVE-2026-42311/42308/42310

requests 2.32.0 → 2.33.0: CVE-2026-25645 (temp file security bypass),
CVE-2024-47081 (.netrc credentials leak)

werkzeug 3.0.0 → 3.1.6: CVE-2023-46136, CVE-2024-49766/49767,
CVE-2025-66221, CVE-2026-21860/27199 (DoS, path traversal, safe_join bypass)

Flask 3.0.0 → 3.1.3: CVE-2026-27205 (session data caching info disclosure)

spotipy 2.24.0 → 2.25.2: CVE-2025-27154, CVE-2025-66040

python-socketio 5.11.0 → 5.14.0: CVE-2025-61765

pytest 7.4.0 → 9.0.3: CVE-2025-71176 (insecure temp dir handling)

Updated in requirements.txt, web_interface/requirements.txt,
plugin-repos/starlark-apps/requirements.txt, and
plugin-repos/march-madness/requirements.txt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve Pylint errors in executor, data service, and odds call

Rename TimeoutError to PluginTimeoutError in plugin_executor.py to
avoid shadowing the built-in; no external callers affected.
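
A minimal sketch of why the rename helps, assuming a plugin-specific exception class similar to the one described above (the class body and the `classify` helper are hypothetical, not taken from plugin_executor.py). With a distinct name, `except TimeoutError` keeps referring to the built-in:

```python
import builtins


class PluginTimeoutError(Exception):
    """Hypothetical plugin-specific timeout; the name does not collide with a built-in."""


def classify(exc: BaseException) -> str:
    """Report which kind of timeout was caught."""
    try:
        raise exc
    except PluginTimeoutError:
        return "plugin timeout"
    except TimeoutError:  # still the built-in, because nothing in this module shadows it
        return "built-in timeout"


# The built-in name is untouched by the plugin-specific class.
assert TimeoutError is builtins.TimeoutError
print(classify(PluginTimeoutError("update() took too long")))
print(classify(TimeoutError("socket timed out")))
```

Had the class been named `TimeoutError`, the second handler would have referred to the module's own class and built-in timeouts would have gone uncaught.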

Remove dead try/except in BackgroundDataService.shutdown: executor.shutdown()
never accepted a timeout kwarg so the try branch always raised TypeError.
Simplify to a direct shutdown(wait=wait) call.
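
The behaviour described above can be checked with the standard library alone. This is a sketch, not the BackgroundDataService code; `shutdown_executor` and `timeout_kwarg_rejected` are illustrative names:

```python
from concurrent.futures import ThreadPoolExecutor


def shutdown_executor(executor: ThreadPoolExecutor, wait: bool = True) -> None:
    """Shut the pool down using only an argument the method accepts."""
    executor.shutdown(wait=wait)


def timeout_kwarg_rejected() -> bool:
    """Return True if shutdown(timeout=...) raises TypeError."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        executor.shutdown(wait=True, timeout=1.0)  # 'timeout' is not a parameter
    except TypeError:
        return True
    finally:
        # Always release the worker thread, whichever branch ran.
        executor.shutdown(wait=True)
    return False


print(timeout_kwarg_rejected())
shutdown_executor(ThreadPoolExecutor(max_workers=1))
```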

Remove is_live kwarg from odds_manager.get_odds() call in sports.py;
BaseOddsManager.get_odds() has no such parameter. The live update interval
is already encoded in the update_interval_seconds argument passed alongside.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: MD5→SHA-256, shellcheck warnings, and broken doc links

config_service.py: replace MD5 with SHA-256 for config change detection;
same semantics (equality comparison), no stored hashes affected.
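
A sketch of hash-based change detection with SHA-256, assuming the config is a JSON-serialisable dict. The function names and the canonical-JSON step are assumptions for illustration; config_service.py may serialise differently:

```python
import hashlib
import json
from typing import Any, Dict


def config_checksum(config: Dict[str, Any]) -> str:
    """Hex SHA-256 digest of a canonical JSON encoding (sorted keys)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def config_changed(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """Digests are only compared for equality, so the algorithm swap is transparent."""
    return config_checksum(old) != config_checksum(new)


print(config_changed({"brightness": 80}, {"brightness": 90}))
```

Because the digest is only compared for equality and never persisted, swapping the hash function changes no observable behaviour.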

Shell scripts — shellcheck warnings:
- diagnose_web_interface.sh: remove useless cat (SC2002)
- dev_plugin_setup.sh: restructure A&&B||C into if/then (SC2015)
- fix_assets_permissions.sh: remove unused REAL_HOME block (SC2034)
- install_web_service.sh: remove unused USER_HOME assignment (SC2034)
- diagnose_web_ui.sh: remove unused SUDO assignments (SC2034)
- diagnose_plugin_permissions.sh: remove unused BLUE color var (SC2034)
- first_time_install.sh: remove unused CLEAR var, PACKAGE_NAME
  assignment, and replace loop variable with _ (SC2034)

docs/PLUGIN_ARCHITECTURE_SPEC.md: fix 10 broken TOC anchor links to
include section numbers matching the actual headings (MD051).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove unused imports and bare exception aliases (pyflakes F401/F841)

Remove unused imports across 86 files in src/, web_interface/, test/,
and scripts/ using autoflake. No logic changes — only dead import
statements and unused names in from-imports are removed.

Also remove bare exception aliases where the variable is never
referenced in the handler body:
- src/cache/disk_cache.py: except (IOError, OSError, PermissionError) as e
- src/cache_manager.py: except (OSError, IOError, PermissionError) as perm_error
- src/plugin_system/resource_monitor.py: except Exception as e
- web_interface/app.py: except Exception as read_err

86 files changed, 205 lines removed, 18 pre-existing test failures unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove unused local variable assignments (pyflakes F841)

Dead assignments removed across src/ and web_interface/:

- background_data_service: drop future= on fire-and-forget executor.submit
- base_classes/baseball: drop font= (all rendering uses self.fonts['time'])
- base_classes/hockey: drop status_short= (never referenced after assignment)
- common/cli: drop game_helper=/config_helper= bindings in import-test block;
  constructors called for instantiation-only validation
- common/display_helper: drop text_width= (x_position uses display_width
  directly); drop draw= in create_error_image (uses _draw_centered_text)
- config_manager: remove dead secrets_content loading block in migration path
  (comment already noted save_config_atomic handles secrets internally)
- display_manager: drop setup_start= (timing was never completed or read)
- font_manager: drop target_path= (catalog uses font_file_path directly);
  drop face=/font= bindings in validate_font (validation by construction —
  TypeError on failure is the signal, not the return value)
- font_test_manager: drop width=/height= (draw_text uses display_manager directly)
- plugin_system/state_reconciliation: drop manager= (only config/disk/state_mgr used)
- plugin_system/store_manager: drop result= on pip install subprocess.run
  (check=True raises on failure; stdout unused)
- web_interface/blueprints/pages_v3: drop main_config_path=""/secrets_config_path=""
  (render_template uses config_manager.get_*_path() inline)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(js): resolve ESLint no-undef warnings across 6 JS files

Three distinct patterns:

1. Vendor library globals — htmx is injected by <script> before these
   extension files load; ESLint lints files in isolation and doesn't know.
   Fix: add /* global htmx */ to htmx-sse.js and htmx-json-enc.js.

2. Cross-file globals — showNotification is defined as window.showNotification
   in app.js/notification.js but called bare in app.js and error_handler.js.
   ESLint doesn't connect window.X = Y with a bare call to X.
   Fix: add /* global showNotification */ to app.js and error_handler.js.

3. Forward-reference window.* functions — in array-table.js, checkbox-group.js,
   and custom-feeds.js, functions like removeArrayTableRow are called early
   inside event-handler closures but assigned to window.* later in the file.
   At runtime this works (the handler fires after the assignment), but ESLint
   sees the bare name at the call site.
   Fix: change bare calls to window.removeArrayTableRow(this) etc. so the
   reference is explicit and ESLint-safe.

Also guard the updateSystemStats call in app.js reconnectSSE: the function
is called but defined nowhere in the codebase. Guard with typeof check so
it won't throw ReferenceError if the reconnect path is hit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(js): resolve Biome lint warnings across 9 JS files

noUnusedVariables (catch bindings → optional catch syntax):
- app.js, file-upload.js, timezone-selector.js: } catch (e) { → } catch {
  ES2019 optional catch binding; e was unused in all three handlers

noUnusedVariables (dead assignments):
- app.js: remove const data= in display SSE stub (handler does nothing yet)
- api_client.js: remove const timeoutId= (setTimeout ID never used to cancel)
- custom-feeds.js: remove const oldIndex= (getAttribute result never read)
- schedule-picker.js: remove const compactMode= (never used in HTML build)
- select-dropdown.js: remove const icons= (icons not yet rendered in options)

noPrototypeBuiltins:
- day-selector.js: DAY_LABELS.hasOwnProperty(x) →
  Object.prototype.hasOwnProperty.call(DAY_LABELS, x)
  Safe form that works even on null-prototype objects

useIterableCallbackReturn:
- file-upload.js, notification.js: forEach(x => expr) →
  forEach(x => { expr; }) — forEach ignores return values;
  implicit return from arrow body was misleading

htmx-sse.js is a vendor extension file with old-style var/== patterns
that are correct for it; 18 Biome issues suppressed via Codacy API
rather than modifying the vendor source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): escape user input in raw HTML responses in pages_v3.py

plugin_id comes directly from the URL path
(/partials/plugin-config/<plugin_id>) and was interpolated into an HTML
fragment without escaping. A crafted URL like
/partials/plugin-config/<script>alert(1)</script> would inject that
tag into the DOM via the HTMX partial response.

Fix: wrap all user-controlled values in markupsafe.escape() before
embedding in raw HTML strings. Affects the plugin-not-found 404
response and both error 500 responses in the plugin config partial.
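
A sketch of the escaping step. To stay dependency-free it uses the standard library's `html.escape` as a stand-in for `markupsafe.escape`, which the fix itself uses; the helper name and the HTML fragment are hypothetical:

```python
from html import escape  # stand-in for markupsafe.escape in this sketch


def plugin_not_found_fragment(plugin_id: str) -> str:
    """Build a raw HTML error fragment with the user-controlled value escaped."""
    safe_id = escape(plugin_id)
    return f'<div class="error">Plugin {safe_id} not found</div>'


print(plugin_not_found_fragment("<script>alert(1)</script>"))
```

The angle brackets in the crafted path segment are emitted as character references, so the browser renders them as text instead of parsing a script element.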

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Bandit B108/B110 across production code

B110 (try/except/pass):
- display_controller.py: narrow 'except Exception' to 'except AttributeError'
  for get_offset_frame() — plugins not having this optional method is the
  expected case, not all exceptions
- config_manager.py: B110 already resolved by the earlier removal of the
  dead secrets-loading block (the except/pass was inside it)
- All other except/pass blocks in src/ and web_interface/ are intentional
  (last-resort recovery, best-effort fallbacks, non-critical startup probes).
  Annotated each with # nosec B110 and a brief inline reason so the decision
  is explicit for future reviewers.
- Test files and plugin-repos B110 suppressed via Codacy API (not prod code).

B108 (/tmp usage):
- permission_utils.py: /tmp listed to PREVENT permission changes on it — not
  used as a temp path. Annotated # nosec B108.
- display_manager.py: fixed snapshot path is intentional (web UI reads same
  path); path-check guard also annotated.
- wifi_manager.py: named /tmp files match the sudoers allowlist installed with
  the system (the paths are hard-coded in both places by design). Annotated
  all six open/cp references # nosec B108.
- scripts/render_plugin.py: dev script default overridable by user. Annotated.
- web_interface/app.py: reads the same fixed path written by display_manager.
  Annotated # nosec B108.
- Test files suppressed via Codacy API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address remaining Codacy security findings

Flask debug=True (real fix):
- web_interface/app.py: debug=True in __main__ block exposes the Werkzeug
  interactive debugger (arbitrary code execution). Changed to
  os.environ.get('FLASK_DEBUG', '0') == '1' — off by default, opt-in
  via environment variable for local development.
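
The opt-in check can be sketched as a small helper. The `debug_enabled` name and the injectable `env` mapping are assumptions added to make the logic testable; the comparison itself matches the expression quoted above:

```python
import os
from typing import Mapping, Optional


def debug_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Debug mode is on only when FLASK_DEBUG is exactly '1'."""
    env = os.environ if env is None else env
    return env.get("FLASK_DEBUG", "0") == "1"


print(debug_enabled({}))                      # unset: debugger stays off
print(debug_enabled({"FLASK_DEBUG": "1"}))    # explicit opt-in
```

Note that only the literal string `1` enables debugging under this comparison; values such as `true` leave it off.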

nosec annotations (accepted risk with documented rationale):
- disk_cache.py: os.chmod(0o660) is intentional — web UI and LED matrix
  service share a group, 660 gives group write while denying world access
  (B103 + Semgrep insecure-file-permissions suppressed in Codacy)
- wifi_manager.py: urlopen to hardcoded connectivity-check.ubuntu.com URL
  (B310 — no user input involved)
- font_manager.py: urlretrieve URL comes from user's own config file on
  their local device (B310)
- start_web_conditionally.py: os.execvp with both sys.executable and a
  fixed PROJECT_DIR-relative constant (B606)

Confirmed false positives suppressed via Codacy API (15 issues):
- SSRF (3x): client-side JS fetch — SSRF is server-side; browser fetch
  is CORS-restricted to same origin
- B105 (3x): test fixtures use dummy secrets by design; store_manager
  checks for the placeholder string, it is not itself a secret
- PMD numeric literal (2x): 10000000 is within Number.MAX_SAFE_INTEGER
- Prototype pollution (1x): read-only schema traversal, no writes
- no-unsanitized_method (1x): dynamic import() is CORS-restricted
- detect-unsafe-regex (1x): operates on server-controlled config values
- plugin-repos B103 (1x): vendor code chmod on executable
- Semgrep insecure-file-permissions (3x): same disk_cache 0o660 as above

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove unnecessary f prefix from f-strings without placeholders (F541)

Pyflakes F541 flags f-strings that contain no {} interpolation. They evaluate
to the same value as plain strings, so the f prefix is redundant and misleading.

Fixed in production code:
- src/base_classes/data_sources.py (2 debug log calls)
- src/logo_downloader.py (1 error log)
- src/plugin_system/store_manager.py (5 strings across 3 log calls)
- src/web_interface/validators.py (1 return value)
- src/wifi_manager.py (4 log/message strings)
- web_interface/start.py (1 print)

F541 issues in test/, scripts/, and plugin-repos/ suppressed via Codacy API
as non-production code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(dev): add Pillow compatibility smoke test script

Covers all Pillow APIs used in LEDMatrix — image creation, drawing,
font metrics, LANCZOS resampling, paste/alpha_composite, and PNG I/O.
Run after any Pillow version bump to catch regressions before deploy.

    python3 scripts/dev/test_pillow_compat.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve 8 new Codacy issues introduced by PR changes

shellcheck SC2034:
- first_time_install.sh: 'type' loop variable also unused in the wifi
  status loop (we previously fixed 'device' → '_' but left 'type').
  Changed to '_ _ state' since neither device nor type is referenced.

ESLint no-undef:
- app.js: typeof guards don't satisfy no-undef; added updateSystemStats
  to the /* global */ declaration alongside showNotification.

nosec annotation:
- web_interface/app.py: app.run(host='0.0.0.0') line changed when we
  fixed debug=True, giving it a new issue ID. Re-added # nosec B104.

pyflakes F401:
- scripts/dev/test_pillow_compat.py: ImageFilter was imported but never
  used in the smoke test. Removed from the import.

Codacy API suppressions (false positives on changed lines):
- disk_cache.py 0o660 chmod (2x): lines changed when # nosec B103 was
  added, producing new Semgrep issue IDs. Re-suppressed.
- pages_v3.py raw-html-concat: Semgrep does not recognise escape() as
  a sanitizer; the escape() call IS the correct fix.
- app.py flask 0.0.0.0: same line as B104 above; Semgrep rule also
  re-suppressed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address PR review findings

Fix (10 of 15 findings):

plugin-repos/march-madness/requirements.txt:
  Add urllib3>=1.26.0 — manager.py directly imports from urllib3; it was
  an undeclared transitive dependency via requests.

scripts/dev/dev_plugin_setup.sh:
  Restore subshell form (cd "$target_dir" && git pull --rebase) || true
  so the shell's working directory is not permanently changed after the
  if-cd block. Previous fix for SC2015 leaked cwd into the remainder of
  the script.

src/base_classes/sports.py:
  Narrow 'except Exception' to 'except RuntimeError as e' and log via
  self.logger.debug — Path.home() raises only RuntimeError for service
  users; other exceptions should not be silently swallowed.

src/config_service.py:
  Fix stale "MD5 checksum" in ConfigVersion.__init__ docstring (line 40);
  the implementation uses SHA-256 since the Codacy fix.

src/wifi_manager.py:
  Log the last-resort AP enable failure with exc_info=True instead of
  silently passing — failure here means the device may be unreachable.

web_interface/blueprints/pages_v3.py:
  Log the outer metadata pre-load exception at debug level instead of
  swallowing it silently; schema still loads fully below.

src/background_data_service.py:
  Remove unused 'timeout' parameter from shutdown() — executor.shutdown()
  does not accept timeout; update __del__ caller accordingly.

src/font_manager.py:
  Validate URL scheme before urlretrieve — reject non-http/https schemes
  (e.g. file://) to prevent reading local files from config-supplied URLs.
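
A sketch of a scheme allowlist check that could run before `urlretrieve`. The helper name, the constant, and the extra host check are assumptions for illustration, not the font_manager.py code:

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}  # hypothetical allowlist constant


def is_safe_download_url(url: str) -> bool:
    """Accept only http/https URLs that also name a host."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


for candidate in ("https://example.com/font.ttf", "file:///etc/passwd"):
    print(candidate, is_safe_download_url(candidate))
```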

src/plugin_system/plugin_executor.py:
  Simplify redundant except tuple: (PluginTimeoutError, PluginError,
  Exception) → Exception, which already covers the others.

test/test_display_controller.py:
  Mark empty test_plugin_discovery_and_loading as @pytest.mark.skip with
  reason. Move duplicate 'from datetime import datetime' to module header
  and remove the stray mid-module copy.

Skip (5 of 15 findings, with reasons):
  - pytest 9.0.3 concerns: full suite already verified (467 pass, 18 pre-existing)
  - Pillow 12.2.0 API concerns: no deprecated APIs in codebase; tests + Pi smoke test pass
  - diagnose_web_ui.sh sudo validation: set -e already ensures fail-fast on any sudo failure
  - app.py request-logging except: must stay silent (recursive logging risk); annotated
  - app.py SSE file-read except: genuinely transient I/O; annotated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Chuck <chuck@example.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 10:19:55 -04:00

320 lines
11 KiB
Python

"""
Enhanced plugin health monitoring with background checks and auto-recovery.

Builds on existing PluginHealthTracker to provide:
- Background health checks
- Health status determination (healthy/degraded/unhealthy)
- Auto-recovery suggestions
- Health metrics aggregation
"""

import threading
import time
from typing import Dict, Any, Optional, List, Callable
from datetime import datetime
from enum import Enum
from dataclasses import dataclass

from src.logging_config import get_logger


class HealthStatus(Enum):
    """Overall health status of a plugin."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


@dataclass
class HealthMetrics:
    """Health metrics for a plugin."""
    plugin_id: str
    status: HealthStatus
    last_successful_update: Optional[datetime]
    error_rate: float  # 0.0 to 1.0
    average_response_time: Optional[float]  # seconds
    consecutive_failures: int
    total_failures: int
    total_successes: int
    success_rate: float  # 0.0 to 1.0
    last_error: Optional[str]
    circuit_breaker_state: str
    recovery_suggestions: List[str]


class PluginHealthMonitor:
    """
    Enhanced health monitoring for plugins.

    Provides:
    - Background health checks
    - Health status determination
    - Auto-recovery suggestions
    - Health metrics aggregation
    """

    def __init__(
        self,
        health_tracker,
        check_interval: float = 60.0,
        degraded_threshold: float = 0.5,  # 50% error rate
        unhealthy_threshold: float = 0.8,  # 80% error rate
        max_response_time: float = 5.0  # seconds
    ):
        """
        Initialize health monitor.

        Args:
            health_tracker: PluginHealthTracker instance
            check_interval: Interval between background health checks (seconds)
            degraded_threshold: Error rate threshold for degraded status
            unhealthy_threshold: Error rate threshold for unhealthy status
            max_response_time: Maximum acceptable response time (seconds)
        """
        self.health_tracker = health_tracker
        self.check_interval = check_interval
        self.degraded_threshold = degraded_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.max_response_time = max_response_time
        self.logger = get_logger(__name__)

        # Background check thread
        self._monitor_thread: Optional[threading.Thread] = None
        self._stop_event = threading.Event()

        # Health check callbacks
        self._health_check_callbacks: List[Callable[[str], Dict[str, Any]]] = []

    def start_monitoring(self) -> None:
        """Start background health monitoring."""
        if self._monitor_thread and self._monitor_thread.is_alive():
            return

        self._stop_event.clear()
        self._monitor_thread = threading.Thread(
            target=self._monitor_loop,
            daemon=True,
            name="PluginHealthMonitor"
        )
        self._monitor_thread.start()
        self.logger.info("Started plugin health monitoring")

    def stop_monitoring(self) -> None:
        """Stop background health monitoring."""
        self._stop_event.set()
        if self._monitor_thread and self._monitor_thread.is_alive():
            self._monitor_thread.join(timeout=5.0)
        self.logger.info("Stopped plugin health monitoring")

    def register_health_check(self, callback: Callable[[str], Dict[str, Any]]) -> None:
        """
        Register a callback for health checks.

        Callback should accept plugin_id and return dict with health info.
        """
        self._health_check_callbacks.append(callback)

    def get_plugin_health_status(self, plugin_id: str) -> HealthStatus:
        """
        Determine overall health status for a plugin.

        Args:
            plugin_id: Plugin identifier

        Returns:
            HealthStatus enum value
        """
        if not self.health_tracker:
            return HealthStatus.UNKNOWN

        summary = self.health_tracker.get_health_summary(plugin_id)
        if not summary:
            return HealthStatus.UNKNOWN

        # Check circuit breaker state
        circuit_state = summary.get('circuit_state', 'closed')
        if circuit_state == 'open':
            return HealthStatus.UNHEALTHY

        # Check error rate
        success_rate = summary.get('success_rate', 100.0)
        error_rate = 1.0 - (success_rate / 100.0)

        if error_rate >= self.unhealthy_threshold:
            return HealthStatus.UNHEALTHY
        elif error_rate >= self.degraded_threshold:
            return HealthStatus.DEGRADED
        else:
            return HealthStatus.HEALTHY

    def get_plugin_health_metrics(self, plugin_id: str) -> HealthMetrics:
        """
        Get comprehensive health metrics for a plugin.

        Args:
            plugin_id: Plugin identifier

        Returns:
            HealthMetrics object
        """
        if not self.health_tracker:
            return HealthMetrics(
                plugin_id=plugin_id,
                status=HealthStatus.UNKNOWN,
                last_successful_update=None,
                error_rate=0.0,
                average_response_time=None,
                consecutive_failures=0,
                total_failures=0,
                total_successes=0,
                success_rate=0.0,
                last_error=None,
                circuit_breaker_state="unknown",
                recovery_suggestions=[]
            )

        summary = self.health_tracker.get_health_summary(plugin_id)
        if not summary:
            return HealthMetrics(
                plugin_id=plugin_id,
                status=HealthStatus.UNKNOWN,
                last_successful_update=None,
                error_rate=0.0,
                average_response_time=None,
                consecutive_failures=0,
                total_failures=0,
                total_successes=0,
                success_rate=0.0,
                last_error=None,
                circuit_breaker_state="unknown",
                recovery_suggestions=[]
            )

        # Calculate metrics
        success_rate = summary.get('success_rate', 100.0) / 100.0
        error_rate = 1.0 - success_rate

        # Parse last success time
        last_success_time = None
        if summary.get('last_success_time'):
            try:
                last_success_time = datetime.fromisoformat(summary['last_success_time'])
            except (ValueError, TypeError):
                pass

        # Determine status
        status = self.get_plugin_health_status(plugin_id)

        # Get recovery suggestions
        recovery_suggestions = self._get_recovery_suggestions(plugin_id, summary, status)

        return HealthMetrics(
            plugin_id=plugin_id,
            status=status,
            last_successful_update=last_success_time,
            error_rate=error_rate,
            average_response_time=None,  # Would need resource monitor for this
            consecutive_failures=summary.get('consecutive_failures', 0),
            total_failures=summary.get('total_failures', 0),
            total_successes=summary.get('total_successes', 0),
            success_rate=success_rate,
            last_error=summary.get('last_error'),
            circuit_breaker_state=summary.get('circuit_state', 'closed'),
            recovery_suggestions=recovery_suggestions
        )

    def get_all_plugin_health(self) -> Dict[str, HealthMetrics]:
        """
        Get health metrics for all tracked plugins.

        Returns:
            Dictionary mapping plugin_id to HealthMetrics
        """
        if not self.health_tracker:
            return {}

        summaries = self.health_tracker.get_all_health_summaries()
        health_metrics = {}
        for plugin_id in summaries.keys():
            health_metrics[plugin_id] = self.get_plugin_health_metrics(plugin_id)
        return health_metrics

    def _get_recovery_suggestions(
        self,
        plugin_id: str,
        summary: Dict[str, Any],
        status: HealthStatus
    ) -> List[str]:
        """
        Generate recovery suggestions based on health status.

        Args:
            plugin_id: Plugin identifier
            summary: Health summary from tracker
            status: Current health status

        Returns:
            List of suggested recovery actions
        """
        suggestions = []

        if status == HealthStatus.UNHEALTHY:
            suggestions.append("Plugin is unhealthy - check plugin logs for errors")
            suggestions.append("Verify plugin configuration is correct")
            suggestions.append("Check if plugin dependencies are installed")

            if summary.get('circuit_state') == 'open':
                suggestions.append("Circuit breaker is open - plugin is being skipped")
                suggestions.append("Wait for cooldown period or manually reset health")

            if summary.get('consecutive_failures', 0) > 0:
                suggestions.append(f"Plugin has {summary['consecutive_failures']} consecutive failures")
                suggestions.append("Consider disabling plugin temporarily")

        elif status == HealthStatus.DEGRADED:
            suggestions.append("Plugin is degraded - experiencing intermittent failures")
            suggestions.append("Monitor plugin performance")
            suggestions.append("Check for resource constraints (CPU, memory)")

            error_rate = (1.0 - (summary.get('success_rate', 100.0) / 100.0)) * 100
            suggestions.append(f"Current error rate: {error_rate:.1f}%")

        elif status == HealthStatus.HEALTHY:
            suggestions.append("Plugin is healthy - no action needed")

        # Add specific suggestions based on last error
        last_error = summary.get('last_error')
        if last_error:
            if "timeout" in last_error.lower():
                suggestions.append("Last error was a timeout - plugin may be slow or unresponsive")
            elif "import" in last_error.lower() or "module" in last_error.lower():
                suggestions.append("Last error suggests missing dependencies")
            elif "permission" in last_error.lower() or "access" in last_error.lower():
                suggestions.append("Last error suggests permission issues")

        return suggestions

    def _monitor_loop(self) -> None:
        """Background monitoring loop."""
        while not self._stop_event.is_set():
            try:
                # Run health checks for all plugins
                if self._health_check_callbacks:
                    # Get list of plugin IDs (would need plugin manager reference)
                    # For now, just wait
                    pass

                # Sleep until next check
                self._stop_event.wait(self.check_interval)
            except Exception as e:
                self.logger.error(f"Error in health monitor loop: {e}", exc_info=True)
                # Continue monitoring even if there's an error
                time.sleep(self.check_interval)
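

The loop above leaves the per-plugin checks as a stub because it has no source of plugin IDs. One way the missing step could look is sketched below as standalone functions. Everything here is an assumption for illustration: `get_plugin_ids` stands in for the plugin manager reference the comment mentions, and `on_results` is a hypothetical sink for the collected data.

```python
import threading
from typing import Any, Callable, Dict, Iterable, List

HealthCheck = Callable[[str], Dict[str, Any]]
Results = Dict[str, List[Dict[str, Any]]]


def run_checks_once(plugin_ids: Iterable[str], callbacks: List[HealthCheck]) -> Results:
    """Call every callback for every plugin; a failing callback is recorded, not raised."""
    results: Results = {}
    for plugin_id in plugin_ids:
        results[plugin_id] = []
        for callback in callbacks:
            try:
                results[plugin_id].append(callback(plugin_id))
            except Exception as exc:
                results[plugin_id].append({"error": str(exc)})
    return results


def monitor_loop(
    stop_event: threading.Event,
    interval: float,
    get_plugin_ids: Callable[[], Iterable[str]],  # hypothetical ID provider
    callbacks: List[HealthCheck],
    on_results: Callable[[Results], None],  # hypothetical result sink
) -> None:
    """Run checks until stop_event is set; Event.wait acts as an interruptible sleep."""
    while not stop_event.is_set():
        on_results(run_checks_once(get_plugin_ids(), callbacks))
        stop_event.wait(interval)


# Usage: run a single pass, stopping the loop from inside the result sink.
stop = threading.Event()
seen: List[Results] = []


def collect(results: Results) -> None:
    seen.append(results)
    stop.set()


monitor_loop(stop, 0.01, lambda: ["clock"], [lambda pid: {"plugin": pid, "ok": True}], collect)
print(seen)
```

Waiting on the event instead of calling `time.sleep` lets a stop request interrupt the pause immediately, which is the same reason the original loop uses `self._stop_event.wait` on its normal path.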