Monitoring Patterns
Best practices for schedule sizing, grace period tuning, and avoiding false positives.
Getting the most out of IonHour means configuring checks that are sensitive enough to catch real problems without creating noise from transient blips. This guide covers practical patterns for both inbound and outbound monitoring.
Choosing the Right Interval
The interval should match how often your job runs or how frequently you need to verify availability.
Inbound (Heartbeat) Checks
| Job Frequency | Recommended Interval | Why |
|---|---|---|
| Every minute | 300s (5 min, minimum) | Inbound minimum is 5 minutes. If your job runs more frequently, ping once per interval cycle. |
| Every 5 minutes | 300s | 1:1 match — ping after every run. |
| Every 15 minutes | 900s | 1:1 match. |
| Every hour | 3600s | 1:1 match. |
| Multiple times per hour | Match to the expected cadence | Pick the interval closest to your actual run frequency. |
Don't set the interval shorter than your job's actual frequency. If your job runs every 15 minutes but you set a 5-minute interval, you'll get false LATE alerts between runs.
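As a sketch, a crontab entry for a job that runs every 15 minutes, paired with a 900s check interval so each run sends exactly one ping (the script path and TOKEN are placeholders):

```shell
*/15 * * * * /opt/jobs/sync.sh && curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
```

The `&&` ensures the ping only fires when the job exits successfully, so a failed run produces a missed heartbeat rather than a false "all clear."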
Outbound (HTTP) Checks
| Service Type | Recommended Interval | Why |
|---|---|---|
| Production API | 60–300s | Catch outages quickly. Shorter intervals = faster detection. |
| Internal tool | 300–600s | Less critical; longer intervals reduce probe volume. |
| Static site | 300–900s | Rarely goes down; less frequent probing is fine. |
| Third-party dependency | 300s | Match to your own service's check interval so incidents correlate. |
Tuning Grace Periods
The grace period is the buffer between "signal is overdue" and "something is wrong." Too short and you get false alarms. Too long and you detect real outages late.
For Inbound Checks
| Scenario | Grace Period | Rationale |
|---|---|---|
| Fast, predictable jobs (simple cron, fixed schedule) | 5–15s | Low variance in execution time means a short grace period is safe. |
| Variable-duration jobs (ETL, batch processing) | 15–30s | Job runtime fluctuates; give it headroom. |
| Jobs on shared infrastructure (CI runners, spot instances) | 30–60s | Startup delays, resource contention, and scheduling jitter add up. |
| Jobs sending pings over unreliable networks | 30–60s | Network latency and packet loss can delay pings. |
The Grace Period Formula
A good starting point:
grace_period = (typical job duration variance) + (network latency) + 5s safety margin
If your job usually takes 2–8 seconds (6 seconds of variance) and network latency is under 1 second, a grace period of 6 + 1 + 5 = 12 seconds is reasonable.
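The formula above can be worked through in shell arithmetic. The input values here are the example figures from this section, not defaults:

```shell
# Grace period calculation from observed job behavior (example values).
MIN_DURATION=2   # fastest observed run, seconds
MAX_DURATION=8   # slowest observed run, seconds
NET_LATENCY=1    # worst-case ping latency, seconds
SAFETY=5         # fixed safety margin, seconds

# grace = duration variance + network latency + safety margin
GRACE=$(( (MAX_DURATION - MIN_DURATION) + NET_LATENCY + SAFETY ))
echo "grace_period=${GRACE}s"   # → grace_period=12s
```

Re-measure the min/max durations periodically; as the job's workload grows, the variance term grows with it.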
Avoiding False Positives
False positives erode team trust in monitoring. Every false alert teaches your team to ignore alerts.
Inbound Checks
Problem: Job completes but ping fails.
- Use `curl -sf` with retry: `curl -sf --retry 2 https://...`
- Place the ping at the very end of your script, after all work is done.
- If your job has cleanup steps that can fail independently, ping before cleanup.
Problem: Clock drift between server and IonHour.
- Use NTP to keep your server's clock synchronized.
- The 2-second validation buffer handles minor drift, but large skew (10+ seconds) can cause false alerts.
Problem: Interval mismatch with cron.
- Cron runs at exact minutes (`:00`, `:05`, `:10`). If your job starts at `:00` but takes 3 minutes, the ping arrives at `:03`. The next run starts at `:05` and pings at `:08`. The gap between pings is only 5 minutes, but the next gap might be 7 minutes if the job is slow. Set the grace period to accommodate this variance.
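One way to size the grace period for this variance is to measure the worst-case gap from a log of ping timestamps. A sketch, assuming each run appends an epoch-seconds timestamp to a log (the path and timestamps below are illustrative):

```shell
# Simulated ping log: pings at :00, :03, :08, :13, :20 (epoch seconds).
printf '%s\n' 0 180 480 780 1200 > /tmp/ping-times.txt

# Find the largest gap between consecutive pings.
MAX_GAP=$(awk 'NR > 1 { gap = $1 - prev; if (gap > max) max = gap }
               { prev = $1 }
               END { print max }' /tmp/ping-times.txt)
echo "worst-case gap: ${MAX_GAP}s"   # → worst-case gap: 420s
```

The grace period then needs to cover at least `MAX_GAP` minus the configured interval, plus a safety margin.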
Outbound Checks
Problem: Single-probe false positives.
- Use multiple regions. A single region can have network issues to your endpoint. Two or more regions plus the majority consensus algorithm prevents single-point false alarms.
Problem: Transient 5xx errors.
- The default `failAfterConsecutive = 3` handles this. A single 503 during a deployment won't trigger an incident; only 3 consecutive failures will.
- IonHour also retries probes once with a 300ms delay before recording a failure, and retries again for 502/503/504 responses.
Problem: Slow responses triggering timeouts.
- Increase `timeoutMs` to match your endpoint's expected response time. If your API sometimes takes 5 seconds, set the timeout to 8–10 seconds instead of relying on the default.
- Set a `latencyWarnMs` threshold to get alerted about degraded performance separately from outages.
Monitoring What Matters
Do Monitor
- Critical paths: Payment processing, authentication, core API endpoints.
- Scheduled jobs: Anything that runs on a cron — backups, data syncs, report generation.
- Health check endpoints: Dedicated
/healthor/readyendpoints that verify database connectivity and downstream dependencies. - External dependencies: Use dependency tracking to monitor services you depend on but don't control.
Don't Monitor
- Every endpoint: Focus on health check endpoints, not every route. Monitoring 50 individual routes creates noise without proportionally more signal.
- Internal-only services from outbound checks: If a service isn't publicly accessible, outbound probes can't reach it. Use inbound heartbeats from an internal health check script instead.
- Rapidly changing services: If a service is deployed many times per day, use deployment windows to suppress alerts during releases.
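For internal-only services, the heartbeat pattern looks like this sketch: a script inside the private network probes the service locally, and only a healthy probe triggers the inbound ping. The health URL and token are placeholders:

```shell
# Run on a cron inside the private network. Only a healthy local probe
# sends the heartbeat; an unhealthy service produces a missed heartbeat.
heartbeat_if_healthy() {
  local health_url="$1" ping_url="$2"
  if curl -sf --max-time 5 "$health_url" > /dev/null; then
    curl -sf "$ping_url" > /dev/null
  else
    return 1   # unhealthy: skip the ping so IonHour flags the missed heartbeat
  fi
}
```

Usage might look like `heartbeat_if_healthy "http://localhost:8080/health" "https://app.failsignal.com/api/signals/ping/TOKEN"`, scheduled at the check's interval.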
Heartbeat Patterns
Fire-and-Forget
The simplest pattern. Append a curl to your job:
./my-job.sh && curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
If the job fails, the ping never fires, and IonHour detects the missing heartbeat.
Success-Only with Error Context
Ping on success, but include error information when things fail:
if ./my-job.sh 2>error.log; then
curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
else
# Job failed — don't ping. IonHour will detect the missed heartbeat.
# Optionally log the error for your own records.
cat error.log >> /var/log/job-failures.log
fi
Rich Payload
Send structured metadata with each heartbeat:
START=$(date +%s)
./my-job.sh
EXIT_CODE=$?
END=$(date +%s)
DURATION=$((END - START))
if [ $EXIT_CODE -eq 0 ]; then
curl -sf -X POST https://app.failsignal.com/api/signals/ping/TOKEN \
-H "Content-Type: application/json" \
-d "{\"duration\": $DURATION, \"exit_code\": $EXIT_CODE}"
fi
The duration field is extracted by IonHour and tracked as avgDurationMs in the check's rolling analytics.
Wrapper Function
For repeated use, wrap the ping in a function:
failsignal_ping() {
local token="$1"
curl -sf --max-time 10 --retry 2 \
"https://app.failsignal.com/api/signals/ping/$token" > /dev/null 2>&1
}
# Usage
./backup.sh && failsignal_ping "YOUR_TOKEN"
./etl.sh && failsignal_ping "ANOTHER_TOKEN"
Multi-Region Strategy
For outbound checks, use multiple regions to distinguish between your service being down and a network issue between one probe region and your service.
Recommended Configuration
| Service Type | Regions | Why |
|---|---|---|
| Single-region service | 1 region (colocated) | Probe from the same region your service runs in. Adding distant regions introduces latency variance that can cause false timeouts. |
| Multi-region service | 2–3 regions | Probe from each region your service is deployed in. |
| Global service | All 3 regions | us-east-1, eu-west-1, ap-southeast-1 for worldwide coverage. |
Consensus Behavior
With 3 regions and default thresholds (failAfterConsecutive=3, resolveAfterConsecutiveSuccess=2):
- A network issue in one region won't trigger an incident (only 1 of 3 regions fails).
- A real outage affecting 2+ regions triggers an incident (majority consensus reached).
- Recovery requires all 3 regions to see 2 consecutive successes (conservative, prevents premature resolution).
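The majority-consensus rule above can be sketched as a vote count: an incident requires failures from more than half of the configured regions. The per-region results here are simulated:

```shell
# Majority consensus sketch across 3 regions (results are simulated).
REGIONS=3
FAILED=0
for RESULT in fail ok fail; do   # us-east-1, eu-west-1, ap-southeast-1
  [ "$RESULT" = fail ] && FAILED=$((FAILED + 1))
done

if [ "$FAILED" -gt $((REGIONS / 2)) ]; then
  echo "consensus: incident"    # 2 of 3 regions failing is a majority
else
  echo "consensus: healthy"
fi
```

With the single-region failure from the first bullet (`fail ok ok`), `FAILED` would be 1, the majority test would not pass, and no incident would open.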