IonHour Docs

Monitoring Patterns

Best practices for schedule sizing, grace period tuning, and avoiding false positives.

Getting the most out of IonHour means configuring checks that are sensitive enough to catch real problems without creating noise from transient blips. This guide covers practical patterns for both inbound and outbound monitoring.

Choosing the Right Interval

The interval should match how often your job runs or how frequently you need to verify availability.

Inbound (Heartbeat) Checks

Job Frequency | Recommended Interval | Why
Every minute | 300s (5 min, minimum) | Inbound minimum is 5 minutes. If your job runs more frequently, ping once per interval cycle.
Every 5 minutes | 300s | 1:1 match — ping after every run.
Every 15 minutes | 900s | 1:1 match.
Every hour | 3600s | 1:1 match.
Multiple times per hour | Match to the expected cadence | Pick the interval closest to your actual run frequency.

Don't set the interval shorter than your job's actual frequency. If your job runs every 15 minutes but you set a 5-minute interval, you'll get false LATE alerts between runs.
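For example, a job that runs every 15 minutes pairs with a 900s check interval. The crontab entry below is illustrative; the script path and TOKEN are placeholders:

```shell
# Runs at :00, :15, :30, :45; pair this with a 900s inbound interval.
*/15 * * * * /opt/jobs/sync.sh && curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
```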

Outbound (HTTP) Checks

Service Type | Recommended Interval | Why
Production API | 60–300s | Catch outages quickly. Shorter intervals = faster detection.
Internal tool | 300–600s | Less critical; longer intervals reduce probe volume.
Static site | 300–900s | Rarely goes down; less frequent probing is fine.
Third-party dependency | 300s | Match to your own service's check interval so incidents correlate.

Tuning Grace Periods

The grace period is the buffer between "signal is overdue" and "something is wrong." Too short and you get false alarms. Too long and you detect real outages late.

For Inbound Checks

Scenario | Grace Period | Rationale
Fast, predictable jobs (simple cron, fixed schedule) | 5–15s | Low variance in execution time means a short grace period is safe.
Variable-duration jobs (ETL, batch processing) | 15–30s | Job runtime fluctuates; give it headroom.
Jobs on shared infrastructure (CI runners, spot instances) | 30–60s | Startup delays, resource contention, and scheduling jitter add up.
Jobs sending pings over unreliable networks | 30–60s | Network latency and packet loss can delay pings.

The Grace Period Formula

A good starting point:

grace_period = (typical job duration variance) + (network latency) + 5s safety margin

If your job usually takes 2–8 seconds and network latency is under 1 second, a grace period of (8 - 2) + 1 + 5 = 12 seconds is reasonable.
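As a sanity check, the formula can be scripted. The numbers below are the hypothetical ones from this example (2–8s runs, under 1s of latency):

```shell
# grace period = duration variance + network latency + 5s safety margin
MAX_DURATION=8   # slowest observed run, seconds
MIN_DURATION=2   # fastest observed run, seconds
LATENCY=1        # worst-case ping latency, seconds
SAFETY=5         # fixed safety margin, seconds

GRACE=$(( (MAX_DURATION - MIN_DURATION) + LATENCY + SAFETY ))
echo "grace period: ${GRACE}s"   # grace period: 12s
```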

Avoiding False Positives

False positives erode team trust in monitoring. Every false alert teaches your team to ignore alerts.

Inbound Checks

Problem: Job completes but ping fails.

  • Use curl -sf with retry: curl -sf --retry 2 https://...
  • Place the ping at the very end of your script, after all work is done.
  • If your job has cleanup steps that can fail independently, ping before cleanup.
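The bullets above can be sketched as one wrapper. The three commands are placeholders you substitute; the ping command would be your usual curl line:

```shell
# Order matters: work first, heartbeat immediately after, cleanup last,
# so an independently flaky cleanup step can never suppress the heartbeat.
run_job() {
  work_cmd="$1"; ping_cmd="$2"; cleanup_cmd="$3"
  $work_cmd || return 1   # job failed: skip the ping so IonHour alerts
  $ping_cmd               # heartbeat fires right after the work that matters
  $cleanup_cmd || true    # cleanup failures are a separate concern
}

# Usage:
# run_job ./my-job.sh "curl -sf --retry 2 https://app.failsignal.com/api/signals/ping/TOKEN" ./cleanup.sh
```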

Problem: Clock drift between server and IonHour.

  • Use NTP to keep your server's clock synchronized.
  • The 2-second validation buffer handles minor drift, but large skew (10+ seconds) can cause false alerts.

Problem: Interval mismatch with cron.

  • Cron runs at exact minutes (:00, :05, :10). If your job starts at :00 but takes 3 minutes, the ping arrives at :03. The next run starts at :05, ping at :08. The interval between pings is only 5 minutes, but the next gap might be 7 minutes if the job is slow. Set the grace period to accommodate this variance.

Outbound Checks

Problem: Single-probe false positives.

  • Use multiple regions. A single region can have network issues to your endpoint. Two or more regions, combined with the majority consensus algorithm, prevent single-point false alarms.

Problem: Transient 5xx errors.

  • The default failAfterConsecutive = 3 handles this. A single 503 during a deployment won't trigger an incident. Only 3 consecutive failures will.
  • IonHour also retries probes once with a 300ms delay before recording a failure, and retries again for 502/503/504 responses.

Problem: Slow responses triggering timeouts.

  • Increase the timeoutMs to match your endpoint's expected response time. If your API sometimes takes 5 seconds, raise the timeout to 10 seconds rather than leaving the default 8 seconds, which gives only 3 seconds of headroom.
  • Set a latencyWarnMs threshold to get alerted about degraded performance separately from outages.
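Pulling the two settings together: the field names (timeoutMs, latencyWarnMs, failAfterConsecutive) come from this guide, but the JSON shape around them is illustrative, not IonHour's documented API schema:

```shell
# Illustrative check definition for an endpoint that can take ~5s to respond.
# The field names are the settings discussed above; the surrounding JSON
# structure and the URL are assumptions for the example.
cat <<'EOF' > check.json
{
  "url": "https://api.example.com/slow-report",
  "timeoutMs": 10000,
  "latencyWarnMs": 5000,
  "failAfterConsecutive": 3
}
EOF
```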

Monitoring What Matters

Do Monitor

  • Critical paths: Payment processing, authentication, core API endpoints.
  • Scheduled jobs: Anything that runs on a cron — backups, data syncs, report generation.
  • Health check endpoints: Dedicated /health or /ready endpoints that verify database connectivity and downstream dependencies.
  • External dependencies: Use dependency tracking to monitor services you depend on but don't control.

Don't Monitor

  • Every endpoint: Focus on health check endpoints, not every route. Monitoring 50 individual routes creates noise without proportionally more signal.
  • Internal-only services from outbound checks: If a service isn't publicly accessible, outbound probes can't reach it. Use inbound heartbeats from an internal health check script instead.
  • Rapidly changing services: If a service is deployed many times per day, use deployment windows to suppress alerts during releases.
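The internal-heartbeat idea from the second bullet can be sketched as a small helper run from cron inside your network. The internal URL and TOKEN in the usage line are placeholders:

```shell
# Probe the private /health endpoint locally, and only send the IonHour
# heartbeat when it answers. A missed heartbeat then means either the
# service or the probe host is unhealthy, and both are worth an alert.
probe_and_ping() {
  health_url="$1"; ping_url="$2"
  curl -sf --max-time 5 "$health_url" > /dev/null && curl -sf "$ping_url" > /dev/null
}

# Usage (from cron, inside the network):
# probe_and_ping http://internal-service.local/health https://app.failsignal.com/api/signals/ping/TOKEN
```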

Heartbeat Patterns

Fire-and-Forget

The simplest pattern. Append a curl to your job:

./my-job.sh && curl -sf https://app.failsignal.com/api/signals/ping/TOKEN

If the job fails, the ping never fires, and IonHour detects the missing heartbeat.

Success-Only with Error Context

Ping on success, but include error information when things fail:

if ./my-job.sh 2>error.log; then
  curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
else
  # Job failed — don't ping. IonHour will detect the missed heartbeat.
  # Optionally log the error for your own records.
  cat error.log >> /var/log/job-failures.log
fi

Rich Payload

Send structured metadata with each heartbeat:

START=$(date +%s%3N)   # epoch milliseconds (GNU date)
./my-job.sh
EXIT_CODE=$?
END=$(date +%s%3N)
DURATION=$((END - START))   # duration in milliseconds

if [ $EXIT_CODE -eq 0 ]; then
  curl -sf -X POST https://app.failsignal.com/api/signals/ping/TOKEN \
    -H "Content-Type: application/json" \
    -d "{\"duration\": $DURATION, \"exit_code\": $EXIT_CODE}"
fi

The duration field is extracted by IonHour and tracked as avgDurationMs in the check's rolling analytics.

Wrapper Function

For repeated use, wrap the ping in a function:

failsignal_ping() {
  local token="$1"
  curl -sf --max-time 10 --retry 2 \
    "https://app.failsignal.com/api/signals/ping/$token" > /dev/null 2>&1
}

# Usage
./backup.sh && failsignal_ping "YOUR_TOKEN"
./etl.sh && failsignal_ping "ANOTHER_TOKEN"

Multi-Region Strategy

For outbound checks, use multiple regions to distinguish between your service being down and a network issue between one probe region and your service.

Service Type | Regions | Why
Single-region service | 1 region (colocated) | Probe from the same region your service runs in. Adding distant regions introduces latency variance that can cause false timeouts.
Multi-region service | 2–3 regions | Probe from each region your service is deployed in.
Global service | All 3 regions | us-east-1, eu-west-1, ap-southeast-1 for worldwide coverage.

Consensus Behavior

With 3 regions and default thresholds (failAfterConsecutive=3, resolveAfterConsecutiveSuccess=2):

  • A network issue in one region won't trigger an incident (only 1 of 3 regions fails).
  • A real outage affecting 2+ regions triggers an incident (majority consensus reached).
  • Recovery requires all 3 regions to see 2 consecutive successes (conservative, prevents premature resolution).
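As a toy model of the first two bullets (assuming "majority" means strictly more than half of the configured regions reporting failure):

```shell
# An incident opens only when failing regions form a strict majority.
majority_down() {
  failing="$1"; total="$2"
  [ $(( failing * 2 )) -gt "$total" ]
}

majority_down 1 3 && echo incident || echo ok   # prints "ok"
majority_down 2 3 && echo incident || echo ok   # prints "incident"
```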