day view on mobile bug fix

2026-03-25 08:49:14 -04:00
parent 163d71d505
commit 941d216f38
2 changed files with 519 additions and 8 deletions

CLAUDE.md

@@ -4,7 +4,7 @@
**RosterChirp** is a self-hosted, closed-source, full-stack Progressive Web App for team messaging. It supports both single-tenant (selfhost) and multi-tenant (host) deployments.
**Current version:** 0.11.26
**Current version:** 0.12.27
---
@@ -41,7 +41,7 @@ rosterchirp/
│ │ └── auth.js ← JWT auth, teamManagerMiddleware
│ ├── models/
│ │ ├── db.js ← Postgres pool, query helpers, migrations, seeding
│ │ └── migrations/ ← 001–006 SQL files, auto-applied on startup
│ │ └── migrations/ ← 001–008 SQL files, auto-applied on startup
│ ├── routes/
│ │ ├── auth.js
│ │ ├── groups.js ← receives io
@@ -106,7 +106,7 @@ rosterchirp/
## Version Bump — Files to Update
When bumping the version (e.g. 0.11.26 → 0.11.27), update **all three**:
When bumping the version (e.g. 0.12.27 → 0.12.28), update **all three**:
```
backend/package.json "version": "X.Y.Z"
@@ -116,7 +116,7 @@ build.sh VERSION="${1:-X.Y.Z}"
One-liner:
```bash
OLD=0.11.26; NEW=0.11.27
OLD=0.12.27; NEW=0.12.28
sed -i "s/\"version\": \"$OLD\"/\"version\": \"$NEW\"/" backend/package.json frontend/package.json
sed -i "s/VERSION=\"\${1:-$OLD}\"/VERSION=\"\${1:-$NEW}\"/" build.sh
```
@@ -184,6 +184,8 @@ const onlineUsers = new Map(); // `${schema}:${userId}` → Set<socketId>
**Critical:** The map key is `${schema}:${userId}` — not bare `userId`. Integer IDs are per-schema, so two tenants can have the same user ID. Without the schema prefix, push notifications and online presence would leak across tenants.
**Scale note:** This in-process Map is a single-server construct. See Phase 2 (Redis) for the multi-instance replacement.
---
## Active Sessions
@@ -395,6 +397,493 @@ Use `/debug` to confirm tokens are registered. Use `/test` to verify end-to-end
---
## Scale Architecture
### Context
RosterChirp-Host is expected to grow to 100,000+ tenants with some tenants having 300+ users — potentially millions of concurrent users total. The current single-process, single-database architecture has well-understood ceilings. This section documents what those ceilings are, what needs to change, and exactly how to implement each phase.
### How Messages Are Currently Loaded (No Problem Here)
Messages are **not** pre-loaded into server memory. The backend uses cursor-based pagination:
- On conversation open: fetches the most recent **50 messages** via `ORDER BY created_at DESC LIMIT 50`
- "Load older messages" button: fetches the next 50 using `before={oldest_message_id}` as a cursor
- Each fetch is a fast indexed Postgres query; the Node process returns results and discards them immediately
The `messages` array grows in the **browser tab** as users scroll back (each "load more" prepends 50 items to React state). At extreme history depth this affects browser memory and scroll performance — a virtual scroll window would fix it — but this is a client-side concern, not a server concern.
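For reference, each page reduces to a single indexed query. A minimal sketch, assuming the `query(schema, sql, params)` helper from `db.js`; the table and column names are illustrative:
```js
// Sketch only: column names and the 50-message page size mirror the description above
async function fetchMessages(schema, conversationId, before = null) {
  const params = [conversationId];
  let sql = 'SELECT * FROM messages WHERE conversation_id = $1';
  if (before) {
    sql += ' AND id < $2';   // cursor: only messages older than the last one already loaded
    params.push(before);
  }
  sql += ' ORDER BY created_at DESC LIMIT 50';
  return query(schema, sql, params);   // rows go straight back to the client, nothing is cached server-side
}
```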
### Current Architecture Ceilings
| Resource | Current Config | Approximate Ceiling |
|---|---|---|
| Node.js processes | 1 | ~10,000–30,000 concurrent sockets |
| Postgres connections | Pool max 20 | Saturates under concurrent load |
| `onlineUsers` Map | In-process JavaScript Map | Lost on restart; not shared across instances |
| `tenantDomainCache` | In-process JavaScript Map | Stale on other instances after update |
| File storage | `/app/uploads` (container volume) | Not accessible across multiple instances |
### Scale Targets by Phase
| Phase | Concurrent Users | Architecture |
|---|---|---|
| Current | ~5,000–10,000 | Single Node, single Postgres |
| Phase 1 (PgBouncer) | ~20,000–40,000 | + connection pooler, no code changes |
| Phase 2 (Redis) | ~200,000–500,000 | + Redis, multiple Node instances |
| Phase 3 (Read replicas) | ~500,000–1,000,000 | + Postgres streaming replication |
| Phase 4 (Sharding) | 1,000,000+ | Multiple Postgres clusters, regional deploy |
---
## Phase 1 — PgBouncer (Implement Now)
### What It Does
PgBouncer sits between the Node app and Postgres as a connection pooler. Instead of Node holding up to 20 long-lived Postgres connections, PgBouncer maintains a pool of e.g. 100 server-side Postgres connections and multiplexes thousands of short application requests onto them. Postgres itself stays healthy; query throughput increases significantly under concurrent load.
**This requires zero code changes.** It is purely an infrastructure addition.
### Why It Matters Now
The current pool `max: 20` means at most 20 queries run simultaneously across all tenants. Under load (many tenants posting messages simultaneously), requests queue up waiting for a free connection. PgBouncer resolves this without touching a line of application code.
### Implementation
**Step 1: Add PgBouncer service to `docker-compose.host.yaml`**
```yaml
pgbouncer:
  image: edoburu/pgbouncer:latest
  container_name: ${PROJECT_NAME:-rosterchirp}_pgbouncer
  restart: unless-stopped
  environment:
    - DATABASE_URL=postgres://${DB_USER:-rosterchirp}:${DB_PASSWORD}@db:5432/${DB_NAME:-rosterchirp}
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=1000
    - DEFAULT_POOL_SIZE=100
    - MIN_POOL_SIZE=10
    - RESERVE_POOL_SIZE=20
    - RESERVE_POOL_TIMEOUT=5
    - SERVER_IDLE_TIMEOUT=600
    - LOG_CONNECTIONS=0
    - LOG_DISCONNECTIONS=0
  depends_on:
    db:
      condition: service_healthy
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -h localhost -p 5432 -U ${DB_USER:-rosterchirp}"]
    interval: 10s
    timeout: 5s
    retries: 5
```
**Step 2: Point the Node app at PgBouncer instead of Postgres directly**
In `docker-compose.host.yaml`, change the `jama` service environment:
```yaml
- DB_HOST=pgbouncer # was: db
- DB_PORT=5432
```
Add `pgbouncer` to the `jama` service's `depends_on` list.
**Step 3: Tune Postgres `max_connections`**
Add to the `db` service in `docker-compose.host.yaml`:
```yaml
command: >
  postgres
  -c max_connections=200
  -c shared_buffers=256MB
  -c effective_cache_size=768MB
  -c work_mem=4MB
  -c maintenance_work_mem=64MB
  -c checkpoint_completion_target=0.9
  -c wal_buffers=16MB
  -c random_page_cost=1.1
**Step 4: Increase the Node pool size**
In `backend/src/models/db.js`, increase `max` since PgBouncer multiplexes efficiently:
```js
const pool = new Pool({
  host: process.env.DB_HOST || 'db',
  port: parseInt(process.env.DB_PORT || '5432'),
  database: process.env.DB_NAME || 'rosterchirp',
  user: process.env.DB_USER || 'rosterchirp',
  password: process.env.DB_PASSWORD || '',
  max: 100,                      // was 20 — PgBouncer handles the actual Postgres pool
  idleTimeoutMillis: 10000,      // was 30000 — release faster, PgBouncer manages persistence
  connectionTimeoutMillis: 5000,
});
```
**Important caveat — transaction mode:** PgBouncer in `POOL_MODE=transaction` releases the server connection after every transaction. Statements sent outside an explicit transaction are each their own implicit transaction, so a bare `SET search_path` followed by a separate query can be served by two different server connections. The `SET search_path` that `db.js` runs before every query must therefore share a transaction with the query itself (for example `SET LOCAL search_path` inside `BEGIN`/`COMMIT`), as sketched below. Do **not** rely on any other session-level state or on `LISTEN/NOTIFY` through PgBouncer; neither survives transaction mode.
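A minimal sketch of a transaction-pooling-safe version of the helper; the real `db.js` signature may differ, and schema names are assumed to be trusted internal identifiers:
```js
const { Pool } = require('pg');
const pool = new Pool({ /* same config as above */ });

// SET LOCAL is scoped to the enclosing transaction, so the search_path and the
// query are guaranteed to run on the same PgBouncer server connection.
async function query(schema, text, params) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query(`SET LOCAL search_path TO ${schema}, public`);
    const result = await client.query(text, params);
    await client.query('COMMIT');
    return result.rows;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```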
**Step 5: Add `PGBOUNCER_` vars to `.env.example`**
```
PGBOUNCER_MAX_CLIENT_CONN=1000
PGBOUNCER_DEFAULT_POOL_SIZE=100
```
**Step 6: Verify**
After deploying:
```bash
# Connect to PgBouncer admin console
docker compose exec pgbouncer psql -h localhost -p 5432 -U pgbouncer pgbouncer  # same port PgBouncer listens on (5432 in this setup)
SHOW POOLS; -- shows active/idle/waiting connections
SHOW STATS; -- shows requests/sec
```
### Expected Outcome
With PgBouncer in place, the database connection bottleneck is effectively eliminated for the near term. 1,000 simultaneous tenant requests queue through PgBouncer's pool of 100 server connections rather than waiting for Node's pool of 20 application-level connections. Throughput improves roughly 5× at moderate load.
---
## Phase 2 — Redis (Horizontal Scaling)
### What It Does
Redis enables multiple Node.js instances to share state that currently lives in each process's memory:
1. **Socket.io Redis Adapter** — allows `io.to(room).emit()` to reach sockets on any instance
2. **Shared `onlineUsers`** — replaces the in-process Map with a Redis `SADD`/`SREM`/`SMEMBERS` structure
3. **Shared `tenantDomainCache`** — replaces the in-process Map with a Redis hash with TTL
Without Redis, running two Node instances would mean:
- A message emitted on Instance A can't reach a user connected to Instance B
- User A on Instance 1 shows as offline to User B on Instance 2
- A custom domain update on Instance 1 isn't reflected on Instance 2
### Prerequisites
Phase 1 (PgBouncer) should be deployed and stable first. Phase 2 is a significant code change — plan for a maintenance window.
### npm Packages Required
```bash
npm install @socket.io/redis-adapter redis
```
Add to `backend/package.json` dependencies.
### Step 1: Add Redis to docker-compose.host.yaml
```yaml
redis:
  image: redis:7-alpine
  container_name: ${PROJECT_NAME:-rosterchirp}_redis
  restart: unless-stopped
  command: >
    redis-server
    --maxmemory 512mb
    --maxmemory-policy allkeys-lru
    --save ""
    --appendonly no
  volumes:
    - rosterchirp_redis:/data
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5

volumes:
  rosterchirp_redis:
    driver: local
```
Add `REDIS_URL=redis://redis:6379` to the `jama` service environment and to `.env.example`.
### Step 2: Socket.io Redis Adapter (index.js)
Replace the current `new Server(server, ...)` block:
```js
const { createAdapter } = require('@socket.io/redis-adapter');
const { createClient } = require('redis');   // node-redis v4 client (matches the camelCase calls below)
const REDIS_URL = process.env.REDIS_URL || 'redis://localhost:6379';
// Two Redis clients required by the adapter (pub + sub)
const pubClient = createClient({ url: REDIS_URL });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);
io.adapter(createAdapter(pubClient, subClient));
console.log('[Server] Socket.io Redis adapter connected');
```
This must be done **before** `io.on('connection', ...)` registers. With this in place, `io.to(room).emit(...)` fans out via Redis pub/sub to every Node instance — no other route code changes required.
### Step 3: Replace onlineUsers Map with Redis (index.js)
Current in-process Map:
```js
const onlineUsers = new Map(); // `${schema}:${userId}` → Set<socketId>
```
Replace with Redis operations. Create a dedicated Redis client for presence (separate from the adapter clients):
```js
const presenceClient = createClient({ url: REDIS_URL });
await presenceClient.connect();
// Key structure: presence:{schema}:{userId} → Set of socketIds
// TTL of 24h prevents stale keys if a server crashes without cleanup
const PRESENCE_TTL = 86400; // seconds
async function addPresence(schema, userId, socketId) {
  const key = `presence:${schema}:${userId}`;
  await presenceClient.sAdd(key, socketId);
  await presenceClient.expire(key, PRESENCE_TTL);
}
async function removePresence(schema, userId, socketId) {
  const key = `presence:${schema}:${userId}`;
  await presenceClient.sRem(key, socketId);
  // Return remaining count — 0 means user is now offline
  return presenceClient.sCard(key);
}
async function isOnline(schema, userId) {
  const key = `presence:${schema}:${userId}`;
  return (await presenceClient.sCard(key)) > 0;
}
async function getOnlineUserIds(schema) {
  // Scan keys matching presence:{schema}:* and return user IDs of non-empty sets.
  // Note: KEYS is O(N) over the whole keyspace; switch to SCAN if key counts grow large.
  const pattern = `presence:${schema}:*`;
  const keys = await presenceClient.keys(pattern);
  const online = [];
  for (const key of keys) {
    if ((await presenceClient.sCard(key)) > 0) {
      online.push(parseInt(key.split(':')[2]));
    }
  }
  return online;
}
```
Then replace all `onlineUsers.has/get/set/delete` calls in the `io.on('connection')` handler with the async Redis equivalents. This requires making the connection handler and its sub-handlers `async` where they aren't already.
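For reference, the connect side mirrors the helpers above. This is a sketch only: the `socket.data` fields, the `R()` room helper, and the `user:online` event name are assumptions, not confirmed project code.
```js
io.on('connection', async (socket) => {
  const { schema, userId } = socket.data;            // assumed to be set by the auth middleware
  const firstSocket = !(await isOnline(schema, userId));
  await addPresence(schema, userId, socket.id);
  if (firstSocket) {
    io.to(R(schema, 'all')).emit('user:online', { userId });
  }
  // ... existing room joins and event handlers ...
});
```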
**Disconnect handler becomes:**
```js
socket.on('disconnect', async () => {
  const remaining = await removePresence(schema, userId, socket.id);
  if (remaining === 0) {
    exec(schema, 'UPDATE users SET last_online=NOW() WHERE id=$1', [userId]).catch(() => {});
    io.to(R(schema, 'all')).emit('user:offline', { userId });
  }
});
```
**users:online handler becomes:**
```js
socket.on('users:online', async () => {
  const userIds = await getOnlineUserIds(schema);
  socket.emit('users:online', { userIds });
});
```
### Step 4: Replace tenantDomainCache with Redis (db.js)
Current in-process Map:
```js
const tenantDomainCache = new Map();
```
Replace with a Redis hash with TTL:
```js
let redisClient = null; // set externally after Redis connects
function setRedisClient(client) { redisClient = client; }
async function resolveSchema(req) {
  // ... existing logic up to custom domain lookup ...
  // Custom domain lookup — Redis first, fallback to DB
  if (redisClient) {
    const cached = await redisClient.hGet('tenantDomainCache', host);
    if (cached) return cached;
  }
  // DB fallback
  const tenant = await queryOne('public',
    'SELECT schema_name FROM tenants WHERE custom_domain=$1 AND status=$2',
    [host, 'active']
  );
  if (tenant) {
    if (redisClient) await redisClient.hSet('tenantDomainCache', host, tenant.schema_name);
    return tenant.schema_name;
  }
  throw new Error(`Unknown tenant for host: ${host}`);
}
async function refreshTenantCache(tenants) {
  if (!redisClient) return;
  // Rebuild the entire hash (wrap in MULTI if the rebuild must be atomic)
  await redisClient.del('tenantDomainCache');
  for (const t of tenants) {
    if (t.custom_domain && t.schema_name) {
      await redisClient.hSet('tenantDomainCache', t.custom_domain.toLowerCase(), t.schema_name);
    }
  }
  await redisClient.expire('tenantDomainCache', 3600); // 1h TTL as safety net
}
```
Export `setRedisClient` and call it from `index.js` after Redis connects, before `initDb()`.
When a custom domain is updated via the host control panel (`host.js`), call `refreshTenantCache` to invalidate immediately.
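A wiring sketch for `index.js`; it assumes `db.js` exports `setRedisClient` and `initDb` under those names, as described above:
```js
const { setRedisClient, initDb } = require('./models/db');

const cacheClient = createClient({ url: REDIS_URL });
await cacheClient.connect();
setRedisClient(cacheClient);   // db.js can now read/write the Redis-backed tenantDomainCache
await initDb();                // migrations and seeding run after the cache client is wired in
```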
### Step 5: File Storage — Move to Object Storage
With multiple Node instances, each container has its own `/app/uploads` volume. An avatar uploaded to Instance A isn't accessible from Instance B.
**Recommended: Cloudflare R2** (S3-compatible, free egress, affordable storage)
```bash
npm install @aws-sdk/client-s3 @aws-sdk/s3-request-presigner
```
Changes to `backend/src/routes/users.js` (avatar upload) and `backend/src/routes/settings.js` (logo/icon upload):
```js
const { S3Client, PutObjectCommand, DeleteObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT, // https://<account>.r2.cloudflarestorage.com
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY,
  },
});
async function uploadToR2(buffer, key, contentType) {
  await s3.send(new PutObjectCommand({
    Bucket: process.env.R2_BUCKET,
    Key: key,
    Body: buffer,
    ContentType: contentType,
  }));
  return `${process.env.R2_PUBLIC_URL}/${key}`; // R2 public bucket URL
}
```
All `avatarUrl` and `logoUrl` values stored in the DB become full `https://` URLs rather than `/uploads/...` paths. The frontend already renders them via `<img src={url}>` so no frontend changes are needed.
Add to `.env.example`:
```
R2_ENDPOINT=
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET=
R2_PUBLIC_URL= # e.g. https://assets.yourdomain.com
```
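A hypothetical avatar handler showing `uploadToR2` in context. The multer field name (with memory storage), the `req.schema`/`req.user` shape, and the `users.avatar_url` column are assumptions for illustration:
```js
router.post('/avatar', upload.single('avatar'), async (req, res) => {
  const key = `avatars/${req.schema}/${req.user.id}-${Date.now()}`;
  const url = await uploadToR2(req.file.buffer, key, req.file.mimetype);
  await exec(req.schema, 'UPDATE users SET avatar_url=$1 WHERE id=$2', [url, req.user.id]);
  res.json({ avatarUrl: url });
});
```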
### Step 6: Load Balancing Multiple Node Instances
With Redis adapter in place, run multiple Node containers behind Caddy:
In `docker-compose.host.yaml`, add additional app instances:
```yaml
rosterchirp_1:
  image: rosterchirp:${ROSTERCHIRP_VERSION:-latest}
  <<: *rosterchirp-base          # use YAML anchors for shared config
  container_name: rosterchirp_1

rosterchirp_2:
  image: rosterchirp:${ROSTERCHIRP_VERSION:-latest}
  <<: *rosterchirp-base
  container_name: rosterchirp_2
```
**Caddyfile update:**
```
{HOST_DOMAIN} {
  reverse_proxy rosterchirp_1:3000 rosterchirp_2:3000 {
    lb_policy round_robin
    health_uri /api/health
    health_interval 15s
  }
}
```
**Critical — sticky sessions:** The Redis adapter handles cross-instance messaging, but Socket.io's default connection starts with HTTP long-polling, and every polling request in a session must reach the same instance. Once a connection has upgraded to WebSocket it naturally stays on one backend; the risk is the polling handshake under `lb_policy round_robin`. Either make the load balancing sticky by swapping the policy above:
```
lb_policy ip_hash   # or: lb_policy cookie
```
Or force WebSocket-only transport in the Socket.io client config (eliminates the polling concern entirely):
```js
// frontend/src/contexts/SocketContext.jsx
const socket = io({ transports: ['websocket'] });
```
### Step 7: Verify Redis Phase
After deploying:
```bash
# Check adapter is working — should see Redis keys
docker compose exec redis redis-cli keys '*'
# Check presence tracking
docker compose exec redis redis-cli keys 'presence:*'
# Check tenant cache
docker compose exec redis redis-cli hgetall tenantDomainCache
# Monitor real-time Redis traffic during a test message send
docker compose exec redis redis-cli monitor
```
### Phase 2 Summary — Files Changed
| File | Change |
|---|---|
| `backend/src/index.js` | Redis adapter, presence helpers replacing onlineUsers Map |
| `backend/src/models/db.js` | Redis-backed tenantDomainCache, setRedisClient export |
| `backend/src/routes/users.js` | R2 upload for avatars |
| `backend/src/routes/settings.js` | R2 upload for logos/icons |
| `backend/package.json` | Add `@socket.io/redis-adapter`, `redis`, `@aws-sdk/client-s3` |
| `docker-compose.host.yaml` | Add Redis service, multiple app instances, Caddy lb |
| `frontend/src/contexts/SocketContext.jsx` | Force WebSocket transport |
| `.env.example` | Add `REDIS_URL`, `R2_*` vars |
---
## Phase 3 — Read Replicas (Future)
When write load on Postgres becomes a bottleneck (typically >100,000 concurrent active users):
1. Configure Postgres streaming replication — one primary, 1–2 standbys
2. In `db.js`, maintain two pools: `primaryPool` (writes) and `replicaPool` (reads)
3. Route `query()` to `replicaPool`, `exec()`/`queryResult()` to `primaryPool`
4. `withTransaction()` always uses `primaryPool`
This is entirely within `db.js` — no route changes needed if the abstraction is preserved.
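A sketch of what that split could look like inside `db.js`. The `DB_REPLICA_HOST` variable and the shared `run()`/`runTx()` helpers are hypothetical; the exported API stays exactly as routes already use it:
```js
const { Pool } = require('pg');

const primaryPool = new Pool({ host: process.env.DB_HOST, /* ...shared config... */ });
const replicaPool = new Pool({ host: process.env.DB_REPLICA_HOST || process.env.DB_HOST });

const query       = (schema, text, params) => run(replicaPool, schema, text, params); // reads
const exec        = (schema, text, params) => run(primaryPool, schema, text, params); // writes
const queryResult = (schema, text, params) => run(primaryPool, schema, text, params);
const withTransaction = (schema, fn) => runTx(primaryPool, schema, fn);               // always primary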
---
## Phase 4 — Tenant Sharding (Future)
When a single Postgres cluster can't handle the write volume (millions of active tenants):
1. Assign each tenant to a shard (DB cluster) at provisioning time — store in the `tenants` table as `shard_id`
2. `resolveSchema()` in `db.js` looks up the tenant's shard and returns both schema name and DB host
3. Maintain a pool per shard rather than one global pool
4. `host.js` provisioning logic assigns shards using a round-robin or least-loaded strategy
This is a significant architectural change. Do not implement until clearly needed.
---
## Outstanding / Deferred Work
### iOS Push Notifications
@@ -402,7 +891,17 @@ Use `/debug` to confirm tokens are registered. Use `/test` to verify end-to-end
### WebSocket Reconnect on Focus
**Status:** Deferred. Socket drops when Android PWA is backgrounded.
**Fix:** Frontend-only — listen for `visibilitychange` in `SocketContext.jsx`, reconnect socket when `document.visibilityState === 'visible'`.
**Fix:** Frontend-only — listen for `visibilitychange` in `SocketContext.jsx`, reconnect socket when `document.visibilityState === 'visible'`. Note: forcing WebSocket-only transport (Phase 2 Step 6) may affect reconnect behaviour — implement reconnect-on-focus at the same time as the transport change.
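A sketch of the listener, assumed to live inside the `SocketContext.jsx` provider where a `socket` instance is already in scope:
```js
useEffect(() => {
  const onVisible = () => {
    if (document.visibilityState === 'visible' && socket && !socket.connected) {
      socket.connect();   // re-establish the WebSocket when the PWA returns to the foreground
    }
  };
  document.addEventListener('visibilitychange', onVisible);
  return () => document.removeEventListener('visibilitychange', onVisible);
}, [socket]);
```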
### Message History — Browser Memory
**Status:** Future. The `messages` array in `ChatWindow` grows unbounded as a user scrolls back through history. At extreme depth (thousands of messages in one session), this affects browser scroll performance.
**Fix:** Virtual scroll window — discard messages scrolled far out of view, re-fetch on demand. This is a non-trivial frontend refactor (react-virtual or similar). Not needed until users regularly have very long scrollback sessions.
### Orphaned Image Cleanup
**Status:** Future. Deleted messages null `image_url` in DB but leave the file on disk (or in R2 after Phase 2). A background job that periodically deletes image files with no corresponding DB row would prevent unbounded storage growth.
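One possible shape for that job, per tenant schema. This is a sketch: `listUploadedKeys()` and `deleteUpload()` are hypothetical helpers wrapping either the `/app/uploads` directory or the R2 bucket, and the `messages.image_url` column reference is assumed from the description above:
```js
async function cleanOrphanedImages(schema) {
  const rows = await query(schema, 'SELECT image_url FROM messages WHERE image_url IS NOT NULL');
  const referenced = new Set(rows.map(r => r.image_url));
  for (const key of await listUploadedKeys(schema)) {
    if (!referenced.has(key)) await deleteUpload(key);   // no DB row points at this file
  }
}
```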
### hasMore Heuristic
**Status:** Minor. `hasMore` is set to `true` when `messages.length >= 50`. If a conversation has exactly 50 messages total, this shows a "Load older" button that returns nothing. Fix: return a `total` count from the backend GET messages route, or check `older.length < 50` to detect end of history.
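A sketch of the second option on the frontend; the API helper, endpoint path, and state setters are assumptions:
```js
const older = await api.get(`/conversations/${id}/messages?before=${oldestId}&limit=50`);
setMessages(prev => [...older, ...prev]);
setHasMore(older.length === 50);   // a short page means the start of history was reached
```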
---
@@ -414,7 +913,7 @@ APP_TYPE=selfhost|host
HOST_DOMAIN= # host mode only
HOST_ADMIN_KEY= # host mode only
JWT_SECRET=
DB_HOST=db
DB_HOST=db # set to 'pgbouncer' after Phase 1
DB_NAME=rosterchirp
DB_USER=rosterchirp
DB_PASSWORD= # avoid ! (shell interpolation issue with docker-compose)
@@ -434,6 +933,18 @@ FIREBASE_MESSAGING_SENDER_ID= # FCM web app config
FIREBASE_APP_ID= # FCM web app config
FIREBASE_VAPID_KEY= # FCM Web Push certificate public key
FIREBASE_SERVICE_ACCOUNT= # FCM service account JSON (stringified, backend only)
# Phase 1 (PgBouncer)
PGBOUNCER_MAX_CLIENT_CONN=1000
PGBOUNCER_DEFAULT_POOL_SIZE=100
# Phase 2 (Redis + R2)
REDIS_URL=redis://redis:6379
R2_ENDPOINT= # https://<account>.r2.cloudflarestorage.com
R2_ACCESS_KEY_ID=
R2_SECRET_ACCESS_KEY=
R2_BUCKET=
R2_PUBLIC_URL= # https://assets.yourdomain.com
```
---
@@ -454,4 +965,4 @@ Build sequence: `build.sh` → Docker build → `npm run build` (Vite) → `dock
## Session History
Development continues in Claude Code from v0.11.26 (rebranded from jama to RosterChirp).
Development continues in Claude Code from v0.11.26 (rebranded from jama to RosterChirp). Scale architecture analysis and Phase 1/2 implementation specs added based on planned growth to 100,000+ tenants.