How it works

LiveScoreJob is the heartbeat of the system. It runs on a cron loop with two cadences:
  • 2.5 seconds when a match is live (actively in play)
  • 60 seconds when idle (pre-match, innings break, rain delay)
The job calls SportMonksClient, which makes HTTP requests to the SportMonks Cricket API to fetch the current match state.
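As a sketch, the two cadences reduce to a simple interval selector. The `MatchPhase` type and constant names here are illustrative, not the job's actual code:

```typescript
// Illustrative sketch of the two-cadence polling interval.
// MatchPhase and the constants are assumptions, not LiveScoreJob's real types.
type MatchPhase = "live" | "pre-match" | "innings-break" | "rain-delay";

const LIVE_INTERVAL_MS = 2_500;  // actively in play
const IDLE_INTERVAL_MS = 60_000; // pre-match, innings break, rain delay

function pollIntervalMs(phase: MatchPhase): number {
  return phase === "live" ? LIVE_INTERVAL_MS : IDLE_INTERVAL_MS;
}
```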

Reconciliation

LivescoreProvider reconciles the raw API response into a normalized internal state. This handles quirks in the upstream API: missing fields, delayed score updates, inconsistent innings numbering.
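A minimal sketch of the reconciliation idea, assuming a hypothetical raw-innings shape (the real provider's types and field names will differ): coalesce missing fields so downstream code never sees `undefined`.

```typescript
// Hypothetical shape of a raw SportMonks response fragment; illustrative only.
interface RawInnings { runs?: number; wickets?: number; overs?: string }
interface NormalizedInnings { runs: number; wickets: number; overs: string }

// Coalesce missing upstream fields into safe defaults.
function reconcileInnings(raw: RawInnings): NormalizedInnings {
  return {
    runs: raw.runs ?? 0,
    wickets: raw.wickets ?? 0,
    overs: raw.overs ?? "0.0",
  };
}
```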

Deduplication via state hashing

StateHashService computes a hash of the reconciled state and compares it against the last known hash. If nothing changed, the entire downstream pipeline is skipped. This is critical for avoiding unnecessary DynamoDB writes and push notifications, especially during drinks breaks or rain delays where the API returns the same data repeatedly. The hash covers: runs, wickets, overs, batting/bowling stats, match status, and innings state.
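The dedup check can be sketched with Node's crypto module. The state shape and field order here are assumptions based on the list above, not the real StateHashService schema:

```typescript
import { createHash } from "crypto";

// Illustrative state shape; the real reconciled state has more fields.
interface ReconciledState {
  runs: number; wickets: number; overs: string;
  battingStats: unknown; bowlingStats: unknown;
  status: string; innings: number;
}

function computeStateHash(state: ReconciledState): string {
  // Hashing a fixed-order array keeps the hash stable regardless of
  // object key insertion order.
  const canonical = JSON.stringify([
    state.runs, state.wickets, state.overs,
    state.battingStats, state.bowlingStats,
    state.status, state.innings,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

function hasChanged(state: ReconciledState, lastHash: string | null): boolean {
  return computeStateHash(state) !== lastHash;
}
```

If `hasChanged` returns false, the downstream write and push steps are skipped entirely.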

Fixture update

When a genuine state change is detected, FixtureService writes the updated fixture record to the fixtures table in DynamoDB.

Push pipeline

ActivityPushJob picks up the change and queries the cricket-activity-subscriptions table to find every user currently watching this match via a Live Activity. From there:
  • MatchDisplayService formats the score into a team-centric perspective: if a user is following Team A, they see Team A's score prominently. This is a display-layer concern — the underlying data is the same.
  • APNs Service builds the Live Activity update payload. Apple enforces a strict 4 KB limit on Live Activity payloads, so the service carefully prunes the payload to fit.
  • The push is sent over HTTP/2 to Apple's APNs servers. iOS receives the push and ActivityKit updates both the Dynamic Island and the Lock Screen widget.
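The 4 KB pruning step can be sketched as dropping low-priority fields until the serialized payload fits. The drop-order mechanism and field names are assumptions for illustration, not the real APNs Service logic:

```typescript
// APNs rejects Live Activity payloads over 4 KB, so prune until it fits.
const MAX_PAYLOAD_BYTES = 4096;

function prunedPayload(
  payload: Record<string, unknown>,
  dropOrder: string[], // least important fields first (illustrative)
): Record<string, unknown> {
  const out = { ...payload };
  for (const field of dropOrder) {
    if (Buffer.byteLength(JSON.stringify(out), "utf8") <= MAX_PAYLOAD_BYTES) break;
    delete out[field]; // drop the next least-important field and re-measure
  }
  return out;
}
```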

Resilience

Circuit breaker: The SportMonks client uses cockatiel for circuit breaking. If the API returns repeated 5xx errors, the circuit opens and requests are short-circuited for a cooldown period. This prevents hammering a degraded upstream.
Distributed locking: LiveScoreJob acquires a distributed lock before polling to prevent duplicate processing when multiple instances are running. If the lock is held by another instance, the job skips that cycle.
Display cache: MatchDisplayService caches formatted display states to avoid recomputing the same display output for users following the same team in the same match.
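The breaker behaviour can be sketched as a small hand-rolled state machine. The real client uses the cockatiel library; the threshold, cooldown, and injectable clock here are illustrative:

```typescript
// Minimal sketch of a consecutive-failure circuit breaker.
// Not the cockatiel API; thresholds are illustrative.
class ConsecutiveBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  canRequest(): boolean {
    if (this.openedAt === null) return true;
    // Half-open: allow a probe request once the cooldown has elapsed.
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```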

Key tables

| Table | Purpose |
| --- | --- |
| fixtures | Canonical match state, updated by FixtureService |
| cricket-activity-subscriptions | Maps users + push tokens to matches they are watching |

Key jobs

| Job | Cadence | Purpose |
| --- | --- | --- |
| LiveScoreJob | 2.5s / 60s | Polls SportMonks for match state |
| ActivityPushJob | Triggered by state change | Pushes updates to all subscribers |

Areas of improvement

The following items were identified from a code-level audit of ScorePoller.ts, activityPushJob.ts, apnsService.ts, stateHashService.ts, lockService.ts, and SportMonksClient.ts.

1. Sequential match processing in ActivityPushJob

Scalability bottleneck: activityPushJob.ts processes matches in a serial for loop (line 817). Each iteration calls dataService.getLiveMatchStateWithBalls, queries subscriptions, builds display state per team group, and sends APNs pushes — all sequentially. With 5+ concurrent live matches and hundreds of subscribers each, a single tick can take several seconds, causing stale updates for matches processed later in the queue. Consider processing matches in parallel with a bounded concurrency pool (e.g., p-limit), or splitting into per-match workers.
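The bounded pool can be sketched without a dependency; this is the same idea p-limit packages up. The worker count and callback signature are illustrative:

```typescript
// Run fn over items in parallel, but never more than `limit` at once.
// A sketch of the bounded-concurrency pattern, not the p-limit API.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because workers only advance it between awaits
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Each live match would become one `fn` invocation, so a slow match no longer blocks every match behind it.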

2. Unbounded Promise.all in APNs batch sends

Memory and connection pressure: Every batch function in apnsService.ts (sendLiveActivityUpdateBatch, sendPushToStartBatch, etc.) fires Promise.all across all tokens simultaneously (e.g., line 975). For a popular match with thousands of subscribers, this creates thousands of concurrent HTTP/2 streams on a single connection. HTTP/2 has a max concurrent streams setting (typically 100-1000 on APNs), and exceeding it will cause stream resets. Introduce chunked concurrency (e.g., batches of 50-100 concurrent streams) to stay within HTTP/2 flow control limits.
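Chunked sends can be sketched like this; CHUNK_SIZE and the send signature are illustrative, not the apnsService.ts API:

```typescript
// Cap concurrent HTTP/2 streams by sending tokens in fixed-size chunks,
// awaiting each chunk before starting the next. CHUNK_SIZE is illustrative.
const CHUNK_SIZE = 100;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function sendInChunks(
  tokens: string[],
  send: (token: string) => Promise<void>, // hypothetical per-token sender
): Promise<void> {
  for (const group of chunk(tokens, CHUNK_SIZE)) {
    await Promise.all(group.map(send)); // at most CHUNK_SIZE streams in flight
  }
}
```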

3. No retry on transient APNs failures

Reliability gap: apnsService.ts sendPush (line 432) makes a single attempt per token. If APNs returns a transient error (e.g., HTTP 503, stream timeout, GOAWAY mid-request), the push is silently lost. The only retry mechanism is that the next poll cycle will compute a new hash — but if the state hasn’t changed, the hash matches and no update is sent, leaving the user’s lock screen stale until the next ball. Consider adding a retry with backoff for 5xx/timeout errors (distinct from invalid-token errors which should not be retried).
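A sketch of the suggested policy, separating transient from permanent failures. The `attempt` signature returning an HTTP status is an assumption for illustration:

```typescript
// Only 5xx responses are worth retrying; token errors (e.g., 410) are permanent.
function isRetryable(status: number): boolean {
  return status >= 500 && status < 600;
}

async function sendWithRetry(
  attempt: () => Promise<number>, // hypothetical: resolves to an HTTP status
  maxAttempts = 3,
  baseDelayMs = 200,
): Promise<number> {
  let status = 0;
  for (let i = 0; i < maxAttempts; i++) {
    status = await attempt();
    if (!isRetryable(status)) return status; // success or permanent failure
    if (i < maxAttempts - 1) {
      // Exponential backoff between transient failures.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  return status;
}
```

Timeouts and GOAWAY resets would map onto the retryable branch the same way a 503 does.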

4. State hash does not cover key events or recent balls

Data consistency: stateHashService.ts computeStateHash (line 11) only hashes score, overs, batsmen, bowler, status, and innings. It does not include recentBalls, keyEvents, footerLine, runRate, or chase fields (target, runsNeeded, ballsRemaining). This means:
  • A wicket that doesn’t change the score/overs (e.g., timed out) won’t trigger a push.
  • Run rate changes without score changes (rounding) are silently dropped.
  • The recentBalls array can update without triggering a push, so the lock screen shows stale ball-by-ball data.
Expand the hash to include all fields that appear in the Live Activity UI.

5. DynamoDB write failure between hash update and APNs send

Data consistency: In activityPushJob.ts, the flow is: send APNs push, then updateStoredHash (line 699). If the APNs send succeeds but the DynamoDB hash write fails, the next tick will see the old hash, detect a “change,” and re-send the same update. This is mostly harmless (idempotent pushes), but the reverse order would be worse — so the current ordering is the safer choice. However, there is no logging or metric when updateStoredHash throws, since the error propagates up to the catch-all in runOnce which only logs the match ID, not which step failed. Add explicit error handling around updateStoredHash with a distinguishing log message.

6. selfHealAttempts is an in-memory Map with no size bound

Memory leak potential: activityPushJob.ts line 58 declares selfHealAttempts as an unbounded Map<string, number>. The cleanup function (cleanupSelfHealAttempts, line 64) only runs at the end of attemptSelfHealRecovery, so if self-heal is never triggered (e.g., all users have subscriptions), stale entries from previous matches accumulate. Over weeks of uptime, this map grows without bound. Consider using an LRU cache or adding a periodic sweep independent of the self-heal path.
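A periodic sweep independent of the self-heal path could look like this sketch. The TTL and the entry shape (tracking a last-touched timestamp) are assumptions, not the current Map<string, number>:

```typescript
// Sketch: entries record when they were last touched, and a sweeper
// (run on its own interval) drops anything older than a TTL.
const ENTRY_TTL_MS = 6 * 60 * 60 * 1000; // illustrative: 6 hours

type AttemptEntry = { attempts: number; lastTouchedMs: number };

function sweepStale(
  map: Map<string, AttemptEntry>,
  nowMs: number,
  ttlMs: number = ENTRY_TTL_MS,
): number {
  let removed = 0;
  for (const [key, entry] of map) {
    if (nowMs - entry.lastTouchedMs > ttlMs) {
      map.delete(key); // deleting during iteration is safe for Map
      removed++;
    }
  }
  return removed;
}
```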

7. Single HTTP/2 session for all APNs traffic

Single point of failure: apnsService.ts maintains exactly one http2Session (line 289). If that session enters a degraded state (e.g., receiving GOAWAY but not yet closed), all in-flight pushes fail until the close event fires and the next request triggers reconnection. During a GOAWAY drain, new streams may be rejected but the session isn’t yet closed, so getSession() returns it as healthy. Consider checking for GOAWAY state in getSession() and proactively reconnecting, or maintaining a small pool of sessions.
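The suggested health check can be sketched against a stand-in session shape. In real code the flag would be set by a listener on Node's http2 'goaway' session event; the interface here is illustrative:

```typescript
// Stand-in for the tracked HTTP/2 session state; illustrative only.
// goAwayReceived would be set by a 'goaway' event listener on the session.
interface TrackedSession {
  closed: boolean;
  destroyed: boolean;
  goAwayReceived: boolean;
}

function isUsable(session: TrackedSession | null): boolean {
  return (
    session !== null &&
    !session.closed &&
    !session.destroyed &&
    !session.goAwayReceived // treat a draining session as dead, reconnect early
  );
}
```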

8. Lock renewal race in ActivityPushJob

Race condition: In activityPushJob.ts, lockRenewTimer (line 770) runs on a setInterval of 15 seconds (half the 30-second TTL). But runOnce can take longer than 15 seconds if many matches are being processed sequentially (see item 1). If runOnce takes 35+ seconds, the lock expires before the renewal fires, another instance acquires it, and both instances process the same matches concurrently — leading to duplicate APNs pushes. The ScorePoller handles this better by calling extendLock at the end of each tick (line 275). Consider aligning ActivityPushJob to the same pattern, or extending the lock within the match processing loop.
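Extending the lock inside the processing loop can be sketched as follows. The Lock interface, TTL constant, and injectable clock are illustrative, not the real lockService API:

```typescript
// Sketch: renew the lock between matches whenever more than half the
// TTL has elapsed, so a long tick can never outlive the lock.
const LOCK_TTL_MS = 30_000;

interface Lock {
  extend(ttlMs: number): Promise<void>; // hypothetical renewal method
}

async function processMatches(
  matchIds: string[],
  processOne: (id: string) => Promise<void>,
  lock: Lock,
  now: () => number = Date.now, // injectable for testing
): Promise<void> {
  let lastRenewal = now();
  for (const id of matchIds) {
    await processOne(id);
    if (now() - lastRenewal > LOCK_TTL_MS / 2) {
      await lock.extend(LOCK_TTL_MS); // renew before the TTL can lapse
      lastRenewal = now();
    }
  }
}
```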

9. No metrics or structured observability

Observability gap: The codebase uses console.log and console.error throughout. ScorePoller.ts emits structured JSON logs (good), but activityPushJob.ts uses unstructured template strings like [ActivityPushJob] Processing match ${match.id}. There are no metrics emitted for:
  • APNs push latency (p50/p99)
  • Push success/failure rates per match
  • State hash hit/miss ratio
  • Lock contention frequency
  • Time-to-lock-screen (end-to-end latency from SportMonks response to APNs send)
Consider adding a metrics library (e.g., CloudWatch EMF, StatsD) to track these values, and standardizing all logs to structured JSON.
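As a sketch, a CloudWatch EMF metric is just a structured JSON log line that CloudWatch parses into a metric. The namespace, metric names, and dimensions here are illustrative:

```typescript
// Build one CloudWatch Embedded Metric Format (EMF) log line.
// Namespace and names are illustrative, not the service's real metrics.
function emfMetricLine(
  metricName: string,
  value: number,
  unit: string,
  dimensions: Record<string, string>,
): string {
  return JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: "LiveScore",
        Dimensions: [Object.keys(dimensions)],
        Metrics: [{ Name: metricName, Unit: unit }],
      }],
    },
    [metricName]: value, // the metric value lives at the top level
    ...dimensions,       // dimension values live at the top level too
  });
}

// Printing the line to stdout is enough for CloudWatch to extract it:
// console.log(emfMetricLine("ApnsPushLatencyMs", 42, "Milliseconds", { matchId: "123" }));
```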

10. SportMonks client has no rate-limit backoff

Resilience gap: SportMonksClient.ts detects HTTP 429 and throws HttpRateLimitError (line 124), but does not read the Retry-After header or implement any backoff. The ScorePoller circuit breaker treats 429 the same as any other error — it counts toward the consecutive failure breaker. A sustained rate limit could open the circuit breaker unnecessarily, blocking all polling for the cooldown period (30 seconds) even if the rate limit resets in 1 second. Consider handling 429 specifically: read the Retry-After header and delay the next poll accordingly, rather than letting it trip the circuit breaker.
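Honoring Retry-After can be sketched as a small parser; per the HTTP spec the header carries either delta-seconds or an HTTP date:

```typescript
// Parse a Retry-After header into a delay in milliseconds.
// Returns null when the header is absent or unparseable, so the caller
// can fall back to a default backoff.
function retryAfterMs(header: string | undefined, nowMs: number): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000); // delta-seconds form
  const dateMs = Date.parse(header); // HTTP-date form
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}
```

The poller could then schedule its next tick after this delay on a 429 instead of counting the response toward the breaker.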

11. Existing TODOs in the codebase

The following open TODOs are relevant to this flow:
| Location | TODO |
| --- | --- |
| SportMonksClient.ts:36 | #254: Replace hardcoded SPORTMONKS_SEASON_ID with active league seasons from LeagueService |
| SportMonksClient.ts:307 | Add man_of_match_id to fixture schema so archival captures MOTM |
| detectKeyEvents.ts:52-53 | Implement significant_over and target event detectors |
| adminV2.ts:1121 | Job status endpoint returns process-local state, not cluster-wide state in multi-instance deploys |