Check Lifecycle
The complete state machine behind check status transitions, validation timing, and multi-region consensus.
Every check in IonHour follows a deterministic state machine. Understanding the lifecycle helps you configure checks correctly and interpret status changes.
Status States
| Status | Meaning |
|---|---|
| NEW | Check was just created. No signals received yet. |
| OK | Check is healthy. Signals are arriving on schedule. |
| LATE | A signal is overdue. The schedule window has passed but the grace period hasn't expired yet. |
| DOWN | The check has failed. The grace period expired without a signal (inbound) or consecutive failures exceeded the threshold (outbound). |
| PAUSED | Check is paused. No validation runs. No alerts. |
| RESUMED | Check was just resumed. Waiting for the first validation cycle. |
Inbound Check Transitions
Inbound (heartbeat) checks transition based on signal timing relative to their schedule and grace period.
Timing Calculations
Given a check with intervalSeconds = 300 (5 minutes) and graceSeconds = 30:
- Ping arrives at T=0. Status: OK. `nextDueAt` set to T + 300s.
- T + 300s passes with no ping. Validation fires. Status: LATE. A `SUSPECT` signal is recorded.
- T + 330s passes with no ping. Validation fires. Status: DOWN. A `FAIL` signal is recorded. An incident is created.
- Ping arrives at any point during LATE or DOWN. Status: OK. The incident is resolved and the cycle restarts.
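The timeline above can be expressed as a small pure function of elapsed time. This is a minimal sketch, not the IonHour implementation; `evaluate_inbound` and the `Status` enum are hypothetical names chosen for illustration:

```python
from enum import Enum

class Status(Enum):
    OK = "OK"
    LATE = "LATE"
    DOWN = "DOWN"

def evaluate_inbound(now: float, last_ping_at: float,
                     interval_seconds: int, grace_seconds: int) -> Status:
    """Derive an inbound check's status from the time since its last ping."""
    elapsed = now - last_ping_at
    if elapsed <= interval_seconds:
        return Status.OK      # signal arrived within the schedule window
    if elapsed <= interval_seconds + grace_seconds:
        return Status.LATE    # overdue, but still inside the grace period
    return Status.DOWN        # grace period expired without a signal

# Walking through the example (intervalSeconds=300, graceSeconds=30):
assert evaluate_inbound(200, 0, 300, 30) is Status.OK
assert evaluate_inbound(315, 0, 300, 30) is Status.LATE
assert evaluate_inbound(331, 0, 300, 30) is Status.DOWN
```

Because the function depends only on the gap since the last ping, any new ping effectively resets the clock, which matches the "cycle restarts" behavior above.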
Validation Buffer
IonHour adds a 2-second buffer to all validation timers to account for network jitter. A check with a 300-second interval won't be validated until 302 seconds after the last ping.
Validation Scheduling
IonHour doesn't poll checks on a fixed interval. Instead, it schedules a precise validation job for each check based on its current state:
| Current Status | Next Validation At |
|---|---|
| OK | lastPingAt + intervalSeconds + 2s |
| LATE | lastPingAt + intervalSeconds + graceSeconds + 2s |
| RESUMED | now + intervalSeconds + 2s |
| DOWN | No validation scheduled |
| NEW | No validation scheduled (waiting for first ping) |
| PAUSED | No validation scheduled |
When a ping arrives, the validation job is rescheduled based on the new lastPingAt. This means validation is always relative to the most recent signal, not a fixed clock.
A background sweep runs every 5 minutes to catch any orphaned checks that might have missed their scheduled validation.
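The scheduling table maps directly to a lookup function. The sketch below assumes the names `next_validation_at` and a 2-second `buffer` default; these are illustrative, not IonHour's actual API:

```python
def next_validation_at(status: str, last_ping_at: float,
                       interval_seconds: int, grace_seconds: int,
                       now: float, buffer: float = 2.0):
    """Return the absolute time of the next validation job, or None
    if no validation is scheduled for the current status."""
    if status == "OK":
        return last_ping_at + interval_seconds + buffer
    if status == "LATE":
        return last_ping_at + interval_seconds + grace_seconds + buffer
    if status == "RESUMED":
        return now + interval_seconds + buffer
    return None  # DOWN, NEW, PAUSED: no validation scheduled

# intervalSeconds=300, graceSeconds=30, lastPingAt=1000:
assert next_validation_at("OK",   1000, 300, 30, now=1100) == 1302
assert next_validation_at("LATE", 1000, 300, 30, now=1400) == 1332
assert next_validation_at("DOWN", 1000, 300, 30, now=1500) is None
```

Note that only the RESUMED case anchors on `now`; every other scheduled case anchors on `lastPingAt`, which is what makes validation relative to the most recent signal rather than a fixed clock.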
Outbound Check Transitions
Outbound checks transition based on consecutive failure/success counts evaluated across probe regions, not individual probe results.
Per-Region State Tracking
Each region maintains independent counters:
- `consecutiveFailures` — resets to 0 on success
- `consecutiveSuccesses` — resets to 0 on failure
A successful probe increments consecutiveSuccesses and resets consecutiveFailures. A failed probe does the opposite.
Consensus Algorithm
IonHour doesn't rely on a single region's opinion. The consensus algorithm evaluates all regions together:
To go DOWN (from OK): A majority of regions must have consecutiveFailures >= failThreshold.
To recover (from DOWN to OK): All regions must have consecutiveSuccesses >= resolveThreshold.
No consensus: If neither condition is met, the check stays in its current state. This prevents thrashing on borderline cases.
Example (3 regions, failThreshold=3, resolveThreshold=2)
| us-east-1 | eu-west-1 | ap-southeast-1 | Majority (2) | Consensus |
|---|---|---|---|---|
| 3 failures | 3 failures | 1 failure | 2 regions failed | DOWN |
| 3 failures | 1 failure | 0 failures | 1 region failed | No change |
| 2 successes | 2 successes | 2 successes | All recovered | OK |
| 2 successes | 2 successes | 0 successes | Not all recovered | No change |
The asymmetry is intentional: going DOWN requires only a majority (fast failure detection), while recovering requires all regions (conservative recovery to avoid premature resolution).
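The majority-down / all-up rules can be sketched in a few lines. This is an illustrative reconstruction from the rules above, assuming hypothetical names (`consensus`, per-region `(failures, successes)` tuples), not IonHour's source:

```python
def consensus(regions, current: str,
              fail_threshold: int, resolve_threshold: int) -> str:
    """Apply the consensus rules to per-region counters.

    regions: list of (consecutive_failures, consecutive_successes) tuples.
    Returns the new check status.
    """
    majority = len(regions) // 2 + 1
    failed = sum(1 for f, _ in regions if f >= fail_threshold)
    recovered = sum(1 for _, s in regions if s >= resolve_threshold)
    if current == "OK" and failed >= majority:
        return "DOWN"    # a majority of regions agree the target is failing
    if current == "DOWN" and recovered == len(regions):
        return "OK"      # every region must confirm recovery
    return current       # no consensus: hold the current state

# Rows from the example table (failThreshold=3, resolveThreshold=2):
assert consensus([(3, 0), (3, 0), (1, 0)], "OK", 3, 2) == "DOWN"
assert consensus([(3, 0), (1, 0), (0, 0)], "OK", 3, 2) == "OK"
assert consensus([(0, 2), (0, 2), (0, 2)], "DOWN", 3, 2) == "OK"
assert consensus([(0, 2), (0, 2), (0, 0)], "DOWN", 3, 2) == "DOWN"
```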
Probe Execution
Each probe follows this sequence:
- DNS resolution with SSRF validation (private IPs blocked)
- TCP connection (2s connect timeout)
- TLS handshake for HTTPS (3s timeout, certificate captured)
- HTTP request (configurable timeout, max 256 KB response body)
- Redirect following (if enabled, up to max hops, HTTPS downgrade blocked by default)
- Status validation against expected range
Failed probes are retried once with a 300ms delay. Certain transient errors (502/503/504, temporary DNS failures) get an additional retry.
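The retry policy amounts to one unconditional retry plus one extra attempt for transient errors. A minimal sketch, assuming a `probe` callable that returns `(ok, http_status)`; the function name and shape are hypothetical:

```python
import time

TRANSIENT_STATUSES = {502, 503, 504}

def run_probe_with_retry(probe, retry_delay: float = 0.3) -> bool:
    """Run a probe; retry once after a short delay, and give
    transient errors (502/503/504) one additional attempt."""
    ok, status = probe()
    if ok:
        return True
    time.sleep(retry_delay)           # single retry after 300 ms
    ok, status = probe()
    if ok:
        return True
    if status in TRANSIENT_STATUSES:  # transient error: one extra retry
        time.sleep(retry_delay)
        ok, _ = probe()
    return ok
```

Temporary DNS failures would be classified the same way as the transient HTTP statuses; the sketch keys only on status codes for brevity.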
Concurrency Control
- Only one probe can run at a time per check (Redis lock, 180s TTL).
- Deduplication prevents the same scheduled probe from running twice (Redis lock, 120s TTL).
- Up to 20 probes execute concurrently across all checks.
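Both locks follow the standard Redis acquire-with-TTL pattern (`SET key value NX EX ttl`). The in-memory class below sketches that pattern without a Redis dependency; `TTLLock` and its methods are illustrative names, not IonHour's API:

```python
import time

class TTLLock:
    """In-memory sketch of a Redis-style lock with expiry (SET NX EX)."""
    def __init__(self):
        self._locks = {}  # key -> expiry timestamp

    def acquire(self, key: str, ttl: float) -> bool:
        now = time.monotonic()
        expiry = self._locks.get(key)
        if expiry is not None and expiry > now:
            return False             # another holder; lock not yet expired
        self._locks[key] = now + ttl # take (or re-take) the lock
        return True

    def release(self, key: str) -> None:
        self._locks.pop(key, None)

locks = TTLLock()
assert locks.acquire("probe:check-42", ttl=180)      # first probe wins
assert not locks.acquire("probe:check-42", ttl=180)  # duplicate blocked
locks.release("probe:check-42")
assert locks.acquire("probe:check-42", ttl=180)      # free again
```

The TTL acts as a safety valve: if a worker crashes mid-probe and never releases, the lock expires on its own (180s for the per-check lock, 120s for deduplication).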
Signal Types
IonHour records different signal types depending on how they were generated:
| Type | Source | When |
|---|---|---|
| SUCCESS | User ping or successful outbound probe | Heartbeat received or probe passed |
| SUSPECT | System-generated | Check transitioned to LATE |
| FAIL | System-generated | Check transitioned to DOWN |
| DEPLOYMENT | System-generated | Deployment window started or ended |
System-generated signals (SUSPECT, FAIL, DEPLOYMENT) appear in the signal history alongside user heartbeats, giving you a complete audit trail.
Pause and Resume
Pausing
When a check is paused (manually or by a deployment):
- Status set to PAUSED.
- Validation job is removed from the queue.
- Outbound probe jobs are removed (if outbound check).
- All open incidents are resolved.
- No alerts are sent while paused.
Resuming
When a check is resumed:
- Status set to RESUMED.
- A new validation job is scheduled for now + intervalSeconds.
- Outbound probe jobs are recreated (if outbound check).
- The first signal or probe after resume transitions the check to OK.
Deployment Suppression
During an active deployment window, even if a check is not paused, IonHour suppresses DOWN/LATE transitions. Missed signals during the deployment are ignored. This provides a lighter-touch alternative to full pause — checks keep running, but transient failures don't trigger incidents.
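Suppression can be modeled as a filter applied to proposed transitions. A minimal sketch, assuming the hypothetical helper `apply_transition`:

```python
def apply_transition(current: str, proposed: str,
                     in_deployment_window: bool) -> str:
    """Drop DOWN/LATE transitions while a deployment window is active."""
    if in_deployment_window and proposed in ("DOWN", "LATE"):
        return current   # missed signals during the deployment are ignored
    return proposed      # all other transitions pass through unchanged

assert apply_transition("OK", "LATE", in_deployment_window=True) == "OK"
assert apply_transition("OK", "LATE", in_deployment_window=False) == "LATE"
assert apply_transition("DOWN", "OK", in_deployment_window=True) == "OK"
```

Recovery transitions still pass through during a deployment window, so a check that was already DOWN can come back to OK as soon as a healthy signal arrives.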
Event Flow
Every state transition emits events that drive the rest of the system.
Dependency Impact Re-evaluation
Dependency impact is re-evaluated:
- On every signal — when a service check receives a heartbeat, its dependencies are checked.
- On dependency status change — when a dependency check goes DOWN/LATE, all service checks that depend on it are re-evaluated.
- Every 5 minutes — a background sweep re-evaluates all service checks with dependencies to catch any missed transitions.