IonHour Docs

Check Lifecycle

The complete state machine behind check status transitions, validation timing, and multi-region consensus.

Every check in IonHour follows a deterministic state machine. Understanding the lifecycle helps you configure checks correctly and interpret status changes.

Status States

| Status | Meaning |
| --- | --- |
| NEW | Check was just created. No signals received yet. |
| OK | Check is healthy. Signals are arriving on schedule. |
| LATE | A signal is overdue. The schedule window has passed but the grace period hasn't expired yet. |
| DOWN | The check has failed. The grace period expired without a signal (inbound) or consecutive failures exceeded the threshold (outbound). |
| PAUSED | Check is paused. No validation runs. No alerts. |
| RESUMED | Check was just resumed. Waiting for the first validation cycle. |

Inbound Check Transitions

Inbound (heartbeat) checks transition based on signal timing relative to their schedule and grace period.

[Diagram: inbound check state machine]

Timing Calculations

Given a check with intervalSeconds = 300 (5 minutes) and graceSeconds = 30:

  1. Ping arrives at T=0. Status: OK. nextDueAt set to T + 300s.
  2. T + 300s passes with no ping. Validation fires (once the 2-second validation buffer has elapsed). Status: LATE. A SUSPECT signal is recorded.
  3. T + 330s passes with no ping. Validation fires again (also after the buffer). Status: DOWN. A FAIL signal is recorded. An incident is created.
  4. Ping arrives at any point during LATE or DOWN. Status: OK. Incident is resolved. Cycle restarts.
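
The four steps above can be sketched as a small state machine. This is an illustrative sketch only, not IonHour's implementation: the `InboundCheck` class, its method names, and the `incident_open` flag are hypothetical, and the validation buffer is left out to keep the arithmetic obvious.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InboundCheck:
    """Hypothetical model of an inbound (heartbeat) check."""
    interval_seconds: int
    grace_seconds: int
    status: str = "NEW"
    last_ping_at: Optional[float] = None
    incident_open: bool = False
    signals: List[str] = field(default_factory=list)

    def on_ping(self, now: float) -> None:
        # Any ping returns the check to OK and resolves an open incident.
        self.status = "OK"
        self.last_ping_at = now
        self.incident_open = False
        self.signals.append("SUCCESS")

    def on_validation(self, now: float) -> None:
        # Only OK and LATE checks are waiting on a ping.
        if self.last_ping_at is None or self.status not in ("OK", "LATE"):
            return
        overdue = now - self.last_ping_at
        if self.status == "OK" and overdue >= self.interval_seconds:
            self.status = "LATE"
            self.signals.append("SUSPECT")
        if self.status == "LATE" and overdue >= self.interval_seconds + self.grace_seconds:
            self.status = "DOWN"
            self.incident_open = True
            self.signals.append("FAIL")
```

Feeding it the example timeline (ping at 0, validations at 300 and 330, ping at 400) moves the status through OK, LATE, DOWN, and back to OK, recording SUCCESS, SUSPECT, FAIL, and SUCCESS signals along the way.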

Validation Buffer

IonHour adds a 2-second buffer to all validation timers to account for network jitter. A check with a 300-second interval won't be validated until 302 seconds after the last ping.

Validation Scheduling

IonHour doesn't poll checks on a fixed interval. Instead, it schedules a precise validation job for each check based on its current state:

| Current Status | Next Validation At |
| --- | --- |
| OK | lastPingAt + intervalSeconds + 2s |
| LATE | lastPingAt + intervalSeconds + graceSeconds + 2s |
| RESUMED | now + intervalSeconds + 2s |
| DOWN | No validation scheduled |
| NEW | No validation scheduled (waiting for first ping) |
| PAUSED | No validation scheduled |

When a ping arrives, the validation job is rescheduled based on the new lastPingAt. This means validation is always relative to the most recent signal, not a fixed clock.
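
As a rough illustration, the scheduling table can be expressed as one function. The function name, the parameter names, and the use of epoch seconds are assumptions made for this sketch; only the formulas come from the table.

```python
from typing import Optional

VALIDATION_BUFFER_SECONDS = 2  # the jitter buffer from the Validation Buffer section


def next_validation_at(
    status: str,
    last_ping_at: Optional[float],
    now: float,
    interval_seconds: int,
    grace_seconds: int,
) -> Optional[float]:
    """Return when the next validation should run, or None if none is scheduled."""
    if status == "OK" and last_ping_at is not None:
        return last_ping_at + interval_seconds + VALIDATION_BUFFER_SECONDS
    if status == "LATE" and last_ping_at is not None:
        return last_ping_at + interval_seconds + grace_seconds + VALIDATION_BUFFER_SECONDS
    if status == "RESUMED":
        return now + interval_seconds + VALIDATION_BUFFER_SECONDS
    # DOWN, NEW, and PAUSED checks have no validation scheduled.
    return None
```

Rescheduling on a new ping then amounts to calling the function again with the updated `last_ping_at`.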

A background sweep runs every 5 minutes to catch any orphaned checks that might have missed their scheduled validation.

Outbound Check Transitions

Outbound checks transition based on consecutive failure/success counts evaluated across probe regions, not individual probe results.

[Diagram: outbound check state machine with multi-region consensus]

Per-Region State Tracking

Each region maintains independent counters:

  • consecutiveFailures — resets to 0 on success
  • consecutiveSuccesses — resets to 0 on failure

A successful probe increments consecutiveSuccesses and resets consecutiveFailures. A failed probe does the opposite.
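
A minimal sketch of that bookkeeping, assuming one small object per region (the `RegionState` class and its `record` method are hypothetical names, not IonHour's API):

```python
from dataclasses import dataclass


@dataclass
class RegionState:
    """Hypothetical per-region counters for one outbound check."""
    consecutive_failures: int = 0
    consecutive_successes: int = 0

    def record(self, success: bool) -> None:
        # Each probe result extends one streak and resets the other.
        if success:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
```

Because each result resets the opposite counter, at most one of the two counters is ever non-zero for a region.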

Consensus Algorithm

IonHour doesn't rely on a single region's opinion. The consensus algorithm evaluates all regions together:

To go DOWN (from OK): A majority of regions must have consecutiveFailures >= failThreshold.

To recover (from DOWN to OK): All regions must have consecutiveSuccesses >= resolveThreshold.

No consensus: If neither condition is met, the check stays in its current state. This prevents thrashing on borderline cases.

Example (3 regions, failThreshold=3, resolveThreshold=2)

| us-east-1 | eu-west-1 | ap-southeast-1 | Tally (majority = 2) | Consensus |
| --- | --- | --- | --- | --- |
| 3 failures | 3 failures | 1 failure | 2 regions failed | DOWN |
| 3 failures | 1 failure | 0 failures | 1 region failed | No change |
| 2 successes | 2 successes | 2 successes | All recovered | OK |
| 2 successes | 2 successes | 0 successes | Not all recovered | No change |

The asymmetry is intentional: going DOWN requires only a majority (fast failure detection), while recovering requires all regions (conservative recovery to avoid premature resolution).
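
The rules above can be sketched as a single pure function. This is a sketch under stated assumptions rather than IonHour's code: `RegionCounters` and `evaluate_consensus` are hypothetical names, "majority" is read as strictly more than half of the regions, and starting states other than OK and DOWN are simply left unchanged because the rules above don't cover them.

```python
from typing import NamedTuple, Sequence


class RegionCounters(NamedTuple):
    consecutive_failures: int
    consecutive_successes: int


def evaluate_consensus(
    current: str,
    regions: Sequence[RegionCounters],
    fail_threshold: int,
    resolve_threshold: int,
) -> str:
    """Return the check status after evaluating all regions together."""
    failing = sum(r.consecutive_failures >= fail_threshold for r in regions)
    recovered = sum(r.consecutive_successes >= resolve_threshold for r in regions)

    # OK -> DOWN needs a strict majority of regions past the fail threshold.
    if current == "OK" and failing * 2 > len(regions):
        return "DOWN"
    # DOWN -> OK needs every region past the resolve threshold.
    if current == "DOWN" and regions and recovered == len(regions):
        return "OK"
    # No consensus: keep the current state.
    return current
```

With failThreshold=3 and resolveThreshold=2, the four rows of the example table give DOWN, no change, OK, and no change respectively.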

Probe Execution

Each probe follows this sequence:

  1. DNS resolution with SSRF validation (private IPs blocked)
  2. TCP connection (2s connect timeout)
  3. TLS handshake for HTTPS (3s timeout, certificate captured)
  4. HTTP request (configurable timeout, max 256 KB response body)
  5. Redirect following (if enabled, up to max hops, HTTPS downgrade blocked by default)
  6. Status validation against expected range

Failed probes are retried once with a 300ms delay. Certain transient errors (502/503/504, temporary DNS failures) get an additional retry.
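
One way to picture the retry policy is the sketch below. It is not IonHour's implementation: `ProbeResult`, `probe_with_retry`, and the `transient` flag are hypothetical, and the sketch assumes that the additional retry also waits 300ms and is granted when the first retry fails with a transient error.

```python
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    transient: bool = False  # e.g. 502/503/504 or a temporary DNS failure


def probe_with_retry(
    probe: Callable[[], ProbeResult],
    delay_seconds: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> ProbeResult:
    """Run a probe with one retry, plus one more for transient errors."""
    result = probe()
    if result.ok:
        return result

    # Every failed probe gets one retry after a short delay.
    sleep(delay_seconds)
    result = probe()
    if result.ok or not result.transient:
        return result

    # Transient errors get one additional retry (assumed: same delay).
    sleep(delay_seconds)
    return probe()
```

Under these assumptions a probe runs at most three times, and only the final result would feed the per-region counters.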

Concurrency Control

  • Only one probe can run at a time per check (Redis lock, 180s TTL).
  • Deduplication prevents the same scheduled probe from running twice (Redis lock, 120s TTL).
  • Up to 20 probes execute concurrently across all checks.
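
The two locks can be illustrated without a Redis server by using an in-memory stand-in that behaves like a key with an expiry that can only be set when absent. Everything here is an assumption for illustration: the `TtlLocks` class, the key names, and the choice to leave the deduplication lock to expire on its own are not taken from IonHour.

```python
import time
from typing import Callable, Dict


class TtlLocks:
    """In-memory stand-in for a Redis 'set if absent, with expiry' lock."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires_at: Dict[str, float] = {}
        self._clock = clock

    def acquire(self, key: str, ttl_seconds: float) -> bool:
        now = self._clock()
        expires_at = self._expires_at.get(key)
        if expires_at is not None and expires_at > now:
            return False  # someone else still holds the lock
        self._expires_at[key] = now + ttl_seconds
        return True

    def release(self, key: str) -> None:
        self._expires_at.pop(key, None)


def try_start_probe(locks: TtlLocks, check_id: str, scheduled_at: int) -> bool:
    # Deduplication: the same scheduled probe never runs twice (120s TTL).
    if not locks.acquire(f"probe:dedup:{check_id}:{scheduled_at}", 120):
        return False
    # Mutual exclusion: one running probe per check (180s TTL).
    return locks.acquire(f"probe:run:{check_id}", 180)


def finish_probe(locks: TtlLocks, check_id: str) -> None:
    # Release the per-check lock early; the TTL covers crashed workers.
    locks.release(f"probe:run:{check_id}")
```

The TTLs act as a safety net: if a worker dies mid-probe, the per-check lock frees itself after 180 seconds instead of blocking the check indefinitely.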

Signal Types

IonHour records different signal types depending on how they were generated:

| Type | Source | When |
| --- | --- | --- |
| SUCCESS | User ping or successful outbound probe | Heartbeat received or probe passed |
| SUSPECT | System-generated | Check transitioned to LATE |
| FAIL | System-generated | Check transitioned to DOWN |
| DEPLOYMENT | System-generated | Deployment window started or ended |

System-generated signals (SUSPECT, FAIL, DEPLOYMENT) appear in the signal history alongside user heartbeats, giving you a complete audit trail.

Pause and Resume

Pausing

When a check is paused (manually or by a deployment):

  1. Status set to PAUSED.
  2. Validation job is removed from the queue.
  3. Outbound probe jobs are removed (if outbound check).
  4. All open incidents are resolved.
  5. No alerts are sent while paused.

Resuming

When a check is resumed:

  1. Status set to RESUMED.
  2. A new validation job is scheduled for now + intervalSeconds + 2s (the standard validation buffer).
  3. Outbound probe jobs are recreated (if outbound check).
  4. The first signal or probe after resume transitions the check to OK.

Deployment Suppression

During an active deployment window, even if a check is not paused, IonHour suppresses DOWN/LATE transitions. Missed signals during the deployment are ignored. This provides a lighter-touch alternative to full pause — checks keep running, but transient failures don't trigger incidents.
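
Conceptually, suppression is a guard in front of the state transition. The sketch below assumes a hypothetical `apply_transition` helper and a boolean `deployment_active` flag; how IonHour tracks deployment windows internally is not described here.

```python
def apply_transition(current: str, proposed: str, deployment_active: bool) -> str:
    """Return the status to store, suppressing LATE/DOWN during a deployment."""
    if deployment_active and proposed in ("LATE", "DOWN"):
        # Missed signals are ignored: the check keeps its current status.
        return current
    return proposed
```

Transitions to OK still pass through the guard, so a check can recover during a deployment window but cannot degrade.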

Event Flow

Every state transition emits events that drive the rest of the system.

[Diagram: event flow]

Dependency Impact Re-evaluation

Dependency impact is re-evaluated:

  1. On every signal — when a service check receives a heartbeat, its dependencies are checked.
  2. On dependency status change — when a dependency check goes DOWN/LATE, all service checks that depend on it are re-evaluated.
  3. Every 5 minutes — a background sweep re-evaluates all service checks with dependencies to catch any missed transitions.
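
The second trigger implies a lookup from a dependency to the service checks that rely on it. A minimal sketch, assuming dependencies are stored as a mapping from service check to its direct dependencies (the function names and the data shape are hypothetical, and only direct dependents are considered):

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_dependents(depends_on: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert 'service check -> dependencies' into 'dependency -> service checks'."""
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for service_check, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].add(service_check)
    return dict(dependents)


def checks_to_reevaluate(
    changed_check: str,
    new_status: str,
    dependents: Dict[str, Set[str]],
) -> Set[str]:
    """Return the service checks to re-evaluate after a dependency changes status."""
    if new_status not in ("DOWN", "LATE"):
        return set()
    return set(dependents.get(changed_check, ()))
```

The 5-minute sweep would then simply re-evaluate every key of the original `depends_on` mapping, regardless of recent events.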