Incident Management

Understand how incidents are created, tracked, and resolved across the full lifecycle.

When a check detects a problem, IonHour automatically creates an incident to track it from detection through resolution. Incidents provide a timeline of events, support acknowledgment, and capture notes for post-incident review.

Incident Lifecycle

Every incident follows this lifecycle:

ACTIVE → RESOLVED
| State | Meaning |
| --- | --- |
| ACTIVE | Something is wrong. The incident is open and being tracked. |
| RESOLVED | The issue has been fixed. The incident is closed. |

Incidents are created automatically when a check goes down and resolved automatically when the check recovers. You can also acknowledge incidents to signal that someone is looking into them.
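The two-state lifecycle can be sketched as a minimal state machine. This is an illustrative sketch, not IonHour's actual implementation; the `Incident` class and `resolve` method are hypothetical names:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Incident:
    """Sketch of the ACTIVE -> RESOLVED lifecycle described above."""
    title: str
    state: str = "ACTIVE"   # incidents always open in ACTIVE
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    resolved_at: Optional[datetime] = None

    def resolve(self) -> None:
        """RESOLVED is terminal: resolving an already-closed incident is a no-op."""
        if self.state == "ACTIVE":
            self.state = "RESOLVED"
            self.resolved_at = datetime.now(timezone.utc)

incident = Incident(title="Service api-gateway is down")
incident.resolve()
```

Because there are only two states, the only valid transition is ACTIVE to RESOLVED; acknowledgment (covered below) annotates the incident rather than changing its state.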

How Incidents Are Created

Service Down

When a check transitions to LATE or DOWN, IonHour creates an incident with:

  • Reason: SERVICE_DOWN
  • Severity: CRITICAL
  • Title: "Service {check name} is down"

The incident includes the time of the last successful signal, so you can immediately see how long the service has been unresponsive.

For outbound checks, the incident is only created after the configured number of consecutive failures (default: 3). This prevents incidents from single transient errors.
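The consecutive-failure threshold can be sketched as a simple counter. The 3-failure default comes from the text above; the class and method names here are hypothetical:

```python
class OutboundCheck:
    """Sketch: only open an incident after N consecutive failures (default 3)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.incident_open = False

    def record_probe(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0   # any success resets the streak
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.incident_open = True       # a single transient error never reaches here

check = OutboundCheck()
check.record_probe(False)   # 1 failure: no incident yet
check.record_probe(False)   # 2 failures: still no incident
check.record_probe(False)   # 3 failures: incident opens
```

Note that the streak resets on any success, so intermittent failures that never string together three in a row will not open an incident.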

Dependency Impact

When a check's dependencies are reported as unavailable (either through ping payloads or check-based dependency monitoring), IonHour creates a separate incident with:

  • Reason: DEPENDENCY_DOWN
  • Severity: WARNING
  • Title: "Service {check name} is impacted by {dependency name}"

Dependency incidents are lower severity because the monitored service itself may still be running — it's just degraded due to an external dependency.

Incident list view showing active and resolved incidents

Incident Timeline

Each incident has an event timeline that tracks every state change. Events are immutable — they form an audit log of what happened and when.

| Event Type | When it's added |
| --- | --- |
| OPENED | Incident created |
| ACKNOWLEDGED | A team member acknowledged the incident |
| ALERT_SENT | A notification was dispatched (email or Slack) |
| RESOLVED | Incident was resolved (auto or manual) |

The timeline gives you a complete picture for post-incident reviews: when the issue started, how long until someone acknowledged it, when alerts were sent, and when it was resolved.
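An immutable timeline is naturally modeled as an append-only log of frozen events. This sketch (with hypothetical names) shows the idea: events can be added but never edited or removed once written:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)   # frozen: an event can never be mutated after creation
class IncidentEvent:
    type: str             # OPENED, ACKNOWLEDGED, ALERT_SENT, RESOLVED
    at: datetime
    detail: str = ""

class Timeline:
    """Append-only audit log of incident events."""

    def __init__(self):
        self._events = []

    def append(self, type: str, detail: str = "") -> IncidentEvent:
        event = IncidentEvent(type=type, at=datetime.now(timezone.utc), detail=detail)
        self._events.append(event)
        return event

    @property
    def events(self):
        return tuple(self._events)   # expose a read-only view

timeline = Timeline()
timeline.append("OPENED")
timeline.append("ALERT_SENT", detail="slack:#ops")
timeline.append("ACKNOWLEDGED", detail="user:alice")
timeline.append("RESOLVED")
```

Because events carry timestamps, metrics like time-to-acknowledge fall out of the log directly: subtract the OPENED event's timestamp from the ACKNOWLEDGED event's.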

Incident detail view with event timeline

Acknowledging Incidents

Acknowledging an incident signals to your team that someone is actively investigating the issue. It doesn't resolve the incident — it's an "I'm on it" marker.

When you acknowledge an incident, IonHour records:

  • Who acknowledged it (your user identity)
  • When it was acknowledged

This is visible in the incident detail view and the timeline. Acknowledgment is useful for on-call workflows where multiple people might see an alert — the first person to acknowledge claims ownership.
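The first-acknowledger-claims-ownership behavior can be sketched like this (hypothetical names; a real implementation would also append an ACKNOWLEDGED event to the timeline):

```python
from datetime import datetime, timezone
from typing import Optional

class AckState:
    """Sketch of first-acknowledger-wins semantics."""

    def __init__(self):
        self.acknowledged_by: Optional[str] = None
        self.acknowledged_at: Optional[datetime] = None

    def acknowledge(self, user: str) -> bool:
        """Record the first acknowledger; later attempts are no-ops."""
        if self.acknowledged_by is not None:
            return False                      # already claimed
        self.acknowledged_by = user
        self.acknowledged_at = datetime.now(timezone.utc)
        return True

ack = AckState()
ack.acknowledge("alice")    # alice claims ownership
ack.acknowledge("bob")      # bob's attempt is ignored; alice keeps ownership
```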

Incident Notes

You can add notes to an incident to document your investigation, root cause findings, or remediation steps. Notes are attached to the incident timeline and attributed to the author.

  • Any workspace member can add notes to an incident.
  • Only the author can edit or delete their own notes.
  • Deleted notes are soft-deleted (retained for audit purposes).

Notes are especially valuable for post-incident reviews. Instead of piecing together what happened from Slack threads and memory, you have a structured record directly on the incident.
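The three note rules above — anyone can add, only the author can delete, deletions are soft — can be sketched as follows (class and field names are illustrative, not IonHour's schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Note:
    author: str
    body: str
    deleted_at: Optional[datetime] = None   # soft delete: the record is kept for audit

class IncidentNotes:
    """Sketch of the note permission and soft-delete rules."""

    def __init__(self):
        self._notes = []

    def add(self, author: str, body: str) -> Note:
        note = Note(author=author, body=body)   # any workspace member may add
        self._notes.append(note)
        return note

    def delete(self, note: Note, requester: str) -> None:
        if requester != note.author:
            raise PermissionError("only the author can delete their own note")
        note.deleted_at = datetime.now(timezone.utc)   # retained, just flagged

    def visible(self):
        return [n for n in self._notes if n.deleted_at is None]

notes = IncidentNotes()
n = notes.add("alice", "DB connection pool exhausted; bumped max connections")
notes.delete(n, "alice")
```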

Automatic Resolution

Incidents are resolved automatically when the underlying check recovers:

For heartbeat checks: The next successful ping transitions the check from DOWN to OK, which resolves all active SERVICE_DOWN incidents for that check.

For outbound checks: After the configured number of consecutive successes (default: 2), the check transitions back to OK and incidents are resolved.

When an incident is resolved, IonHour records:

  • resolvedAt — When the incident was marked resolved
  • recoveredAt — When the service actually recovered (based on the signal timestamp)
  • Duration — Calculated from startedAt to resolvedAt

Recovery alerts are sent to all configured notification channels, including the downtime duration.
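The distinction between resolvedAt and recoveredAt matters for accurate downtime numbers: recovery is timestamped from the signal itself, while resolution is timestamped when the resolver processes it. A hedged sketch of how those fields relate (the helper is hypothetical):

```python
from datetime import datetime, timedelta, timezone

def resolve_incident(started_at: datetime, recovery_signal_at: datetime,
                     now: datetime) -> dict:
    """Sketch of the resolution fields described above.

    recoveredAt comes from the recovery signal's timestamp, resolvedAt from
    when the resolver runs, and Duration spans startedAt -> resolvedAt.
    """
    return {
        "resolvedAt": now,
        "recoveredAt": recovery_signal_at,
        "duration": now - started_at,
    }

started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
signal = datetime(2024, 1, 1, 12, 40, tzinfo=timezone.utc)   # recovery ping arrives
now = datetime(2024, 1, 1, 12, 41, tzinfo=timezone.utc)      # resolver runs a minute later
result = resolve_incident(started, signal, now)
```

This is also why the best practices below recommend letting auto-resolution run: a manual resolve has no recovery signal to anchor recoveredAt to.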

Filtering Incidents

You can filter the incident list by:

| Filter | Description |
| --- | --- |
| State | Show only active or resolved incidents |
| Check | Filter to incidents for a specific check |
| Project | Filter to incidents within a project |
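The three filters compose: each one narrows the result set further. A minimal sketch of that composition, using illustrative field names rather than IonHour's actual API:

```python
def filter_incidents(incidents, state=None, check_id=None, project_id=None):
    """Sketch: each filter, when given, narrows the result set further."""
    results = incidents
    if state is not None:
        results = [i for i in results if i["state"] == state]
    if check_id is not None:
        results = [i for i in results if i["check_id"] == check_id]
    if project_id is not None:
        results = [i for i in results if i["project_id"] == project_id]
    return results

incidents = [
    {"id": 1, "state": "ACTIVE",   "check_id": "c1", "project_id": "p1"},
    {"id": 2, "state": "RESOLVED", "check_id": "c1", "project_id": "p1"},
    {"id": 3, "state": "ACTIVE",   "check_id": "c2", "project_id": "p2"},
]
active_in_p1 = filter_incidents(incidents, state="ACTIVE", project_id="p1")
```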

The incident list view enriches each incident with additional context:

  • The check name and token that triggered the incident
  • The project the check belongs to
  • The last signal received (type, time, and source)
  • Time since last success — how long the service has been unresponsive
  • Acknowledged by — who is investigating (if acknowledged)

Incident Severity

| Severity | When it's used |
| --- | --- |
| CRITICAL | Service is down — the check itself has failed |
| WARNING | Service is degraded — a dependency is down, but the service may still be partially functional |

Severity is set automatically based on the incident reason and cannot be changed manually. This keeps severity consistent and meaningful across your workspace.
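Since severity derives deterministically from the incident reason, it amounts to a fixed lookup, as this sketch illustrates (the mapping follows the table above; the names are illustrative):

```python
# Severity derives from the incident reason with no manual override path,
# which is what keeps it consistent across the workspace.
SEVERITY_BY_REASON = {
    "SERVICE_DOWN": "CRITICAL",     # the check itself has failed
    "DEPENDENCY_DOWN": "WARNING",   # the service is degraded, not down
}

def severity_for(reason: str) -> str:
    return SEVERITY_BY_REASON[reason]
```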

Best Practices

  • Acknowledge promptly. Even if you can't fix the issue immediately, acknowledging tells your team someone is looking at it and prevents duplicate investigation.
  • Add notes as you investigate. Document what you find in real time. Your future self (and your teammates) will thank you during the post-incident review.
  • Review resolved incidents. After an incident is resolved, look at the full timeline: detection time, acknowledgment time, resolution time. These metrics help you identify gaps in your monitoring and response processes.
  • Use projects to organize checks. Filtering incidents by project is much easier than scanning a flat list. Group related checks into projects that map to your services or teams.
  • Let auto-resolution work. Don't manually resolve incidents that will auto-resolve on recovery. Manual resolution loses the accurate recovery timestamp and downtime calculation.