Monitoring Patterns
Best practices for schedule sizing, grace period tuning, and avoiding false positives.
Getting the most out of IonHour means configuring checks that are sensitive enough to catch real problems without creating noise from transient blips. This guide covers practical patterns for both inbound and outbound monitoring.
Choosing the Right Interval
The interval should match how often your job runs or how frequently you need to verify availability.
Inbound (Heartbeat) Checks
| Job Frequency | Recommended Interval | Why |
|---|---|---|
| Every minute | 300s (5 min, minimum) | Inbound minimum is 5 minutes. If your job runs more frequently, ping once per interval cycle. |
| Every 5 minutes | 300s | 1:1 match — ping after every run. |
| Every 15 minutes | 900s | 1:1 match. |
| Every hour | 3600s | 1:1 match. |
| Multiple times per hour | Match to the expected cadence | Pick the interval closest to your actual run frequency. |
Don't set the interval shorter than your job's actual frequency. If your job runs every 15 minutes but you set a 5-minute interval, you'll get false LATE alerts between runs.
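As a sketch, a crontab entry for a job that runs every 15 minutes, paired with a 900s check interval so each run sends exactly one ping (the script path and TOKEN are placeholders):

```shell
*/15 * * * * /opt/jobs/sync.sh && curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
```

The `&&` ensures the ping only fires when the job exits successfully, so a failed run produces a missed heartbeat rather than a false "all clear."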
Outbound (HTTP) Checks
| Service Type | Recommended Interval | Why |
|---|---|---|
| Production API | 60–300s | Catch outages quickly. Shorter intervals = faster detection. |
| Internal tool | 300–600s | Less critical; longer intervals reduce probe volume. |
| Static site | 300–900s | Rarely goes down; less frequent probing is fine. |
| Third-party dependency | 300s | Match to your own service's check interval so incidents correlate. |
Tuning Grace Periods
The grace period is the buffer between "signal is overdue" and "something is wrong." Too short and you get false alarms. Too long and you detect real outages late.
For Inbound Checks
| Scenario | Grace Period | Rationale |
|---|---|---|
| Fast, predictable jobs (simple cron, fixed schedule) | 5–15s | Low variance in execution time means a short grace period is safe. |
| Variable-duration jobs (ETL, batch processing) | 15–30s | Job runtime fluctuates; give it headroom. |
| Jobs on shared infrastructure (CI runners, spot instances) | 30–60s | Startup delays, resource contention, and scheduling jitter add up. |
| Jobs sending pings over unreliable networks | 30–60s | Network latency and packet loss can delay pings. |
The Grace Period Formula
A good starting point:
grace_period = (typical job duration variance) + (network latency) + 5s safety margin
If your job usually takes 2–8 seconds (6 seconds of variance) and network latency is under 1 second, a grace period of 6 + 1 + 5 = 12 seconds is reasonable.
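The formula above can be worked through in shell arithmetic. The input values here are the example figures from this section, not defaults:

```shell
# Grace period calculation from observed job behavior (example values).
MIN_DURATION=2   # fastest observed run, seconds
MAX_DURATION=8   # slowest observed run, seconds
NET_LATENCY=1    # worst-case ping latency, seconds
SAFETY=5         # fixed safety margin, seconds

# grace = duration variance + network latency + safety margin
GRACE=$(( (MAX_DURATION - MIN_DURATION) + NET_LATENCY + SAFETY ))
echo "grace_period=${GRACE}s"   # → grace_period=12s
```

Re-measure the min/max durations periodically; as the job's workload grows, the variance term grows with it.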
Avoiding False Positives
False positives erode team trust in monitoring. Every false alert teaches your team to ignore alerts.
Inbound Checks
Problem: Job completes but ping fails.
- Use `curl -sf` with retry: `curl -sf --retry 2 https://...`
- Place the ping at the very end of your script, after all work is done.
- If your job has cleanup steps that can fail independently, ping before cleanup.
Problem: Clock drift between server and IonHour.
- Use NTP to keep your server's clock synchronized.
- The 2-second validation buffer handles minor drift, but large skew (10+ seconds) can cause false alerts.
Problem: Interval mismatch with cron.
- Cron runs at exact minutes (`:00`, `:05`, `:10`). If your job starts at `:00` but takes 3 minutes, the ping arrives at `:03`. The next run starts at `:05` and pings at `:08`. The gap between pings is only 5 minutes, but the next gap might be 7 minutes if the job is slow. Set the grace period to accommodate this variance.
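One way to size the grace period for this variance is to measure the worst-case gap from a log of ping timestamps. A sketch, assuming each run appends an epoch-seconds timestamp to a log (the path and timestamps below are illustrative):

```shell
# Simulated ping log: pings at :00, :03, :08, :13, :20 (epoch seconds).
printf '%s\n' 0 180 480 780 1200 > /tmp/ping-times.txt

# Find the largest gap between consecutive pings.
MAX_GAP=$(awk 'NR > 1 { gap = $1 - prev; if (gap > max) max = gap }
               { prev = $1 }
               END { print max }' /tmp/ping-times.txt)
echo "worst-case gap: ${MAX_GAP}s"   # → worst-case gap: 420s
```

The grace period then needs to cover at least `MAX_GAP` minus the configured interval, plus a safety margin.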
Outbound Checks
Problem: Single-probe false positives.
- Use multiple regions. A single region can have network issues to your endpoint. Two or more regions plus the majority consensus algorithm prevents single-point false alarms.
Problem: Transient 5xx errors.
- The default `failAfterConsecutive = 3` handles this. A single 503 during a deployment won't trigger an incident; only 3 consecutive failures will.
- IonHour also retries probes once with a 300ms delay before recording a failure, and retries again for 502/503/504 responses.
Problem: Slow responses triggering timeouts.
- Increase `timeoutMs` to match your endpoint's expected response time. If your API sometimes takes 5 seconds, set the timeout to 8–10 seconds instead of relying on the default.
- Set a `latencyWarnMs` threshold to get alerted about degraded performance separately from outages.
Monitoring What Matters
Do Monitor
- Critical paths: Payment processing, authentication, core API endpoints.
- Scheduled jobs: Anything that runs on a cron — backups, data syncs, report generation.
- Health check endpoints: Dedicated
/healthor/readyendpoints that verify database connectivity and downstream dependencies. - External dependencies: Use dependency tracking to monitor services you depend on but don't control.
Don't Monitor
- Every endpoint: Focus on health check endpoints, not every route. Monitoring 50 individual routes creates noise without proportionally more signal.
- Internal-only services from outbound checks: If a service isn't publicly accessible, outbound probes can't reach it. Use inbound heartbeats from an internal health check script instead.
- Rapidly changing services: If a service is deployed many times per day, use deployment windows to suppress alerts during releases.
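For internal-only services, the heartbeat pattern looks like this sketch: a script inside the private network probes the service locally, and only a healthy probe triggers the inbound ping. The health URL and token are placeholders:

```shell
# Run on a cron inside the private network. Only a healthy local probe
# sends the heartbeat; an unhealthy service produces a missed heartbeat.
heartbeat_if_healthy() {
  local health_url="$1" ping_url="$2"
  if curl -sf --max-time 5 "$health_url" > /dev/null; then
    curl -sf "$ping_url" > /dev/null
  else
    return 1   # unhealthy: skip the ping so IonHour flags the missed heartbeat
  fi
}
```

Usage might look like `heartbeat_if_healthy "http://localhost:8080/health" "https://app.failsignal.com/api/signals/ping/TOKEN"`, scheduled at the check's interval.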
Heartbeat Patterns
Fire-and-Forget
The simplest pattern. Append a curl to your job:
./my-job.sh && curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
If the job fails, the ping never fires, and IonHour detects the missing heartbeat.
Success-Only with Error Context
Ping on success, but include error information when things fail:
if ./my-job.sh 2>error.log; then
curl -sf https://app.failsignal.com/api/signals/ping/TOKEN
else
# Job failed — don't ping. IonHour will detect the missed heartbeat.
# Optionally log the error for your own records.
cat error.log >> /var/log/job-failures.log
fi
Rich Payload
Send structured metadata with each heartbeat:
START=$(date +%s)
./my-job.sh
EXIT_CODE=$?
END=$(date +%s)
DURATION=$((END - START))
if [ $EXIT_CODE -eq 0 ]; then
curl -sf -X POST https://app.failsignal.com/api/signals/ping/TOKEN \
-H "Content-Type: application/json" \
-d "{\"duration\": $DURATION, \"exit_code\": $EXIT_CODE}"
fi
The duration field is extracted by IonHour and tracked as avgDurationMs in the check's rolling analytics.
Wrapper Function
For repeated use, wrap the ping in a function:
failsignal_ping() {
local token="$1"
curl -sf --max-time 10 --retry 2 \
"https://app.failsignal.com/api/signals/ping/$token" > /dev/null 2>&1
}
# Usage
./backup.sh && failsignal_ping "YOUR_TOKEN"
./etl.sh && failsignal_ping "ANOTHER_TOKEN"
Multi-Region Strategy
For outbound checks, use multiple regions to distinguish between your service being down and a network issue between one probe region and your service.
Recommended Configuration
| Service Type | Regions | Why |
|---|---|---|
| Single-region service | 1 region (colocated) | Probe from the same region your service runs in. Adding distant regions introduces latency variance that can cause false timeouts. |
| Multi-region service | 2–3 regions | Probe from each region your service is deployed in. |
| Global service | All 3 regions | us-east-1, eu-west-1, ap-southeast-1 for worldwide coverage. |
Consensus Behavior
With 3 regions and default thresholds (failAfterConsecutive=3, resolveAfterConsecutiveSuccess=2):
- A network issue in one region won't trigger an incident (only 1 of 3 regions fails).
- A real outage affecting 2+ regions triggers an incident (majority consensus reached).
- Recovery requires all 3 regions to see 2 consecutive successes (conservative, prevents premature resolution).
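The majority-consensus rule above can be sketched as a vote count: an incident requires failures from more than half of the configured regions. The per-region results here are simulated:

```shell
# Majority consensus sketch across 3 regions (results are simulated).
REGIONS=3
FAILED=0
for RESULT in fail ok fail; do   # us-east-1, eu-west-1, ap-southeast-1
  [ "$RESULT" = fail ] && FAILED=$((FAILED + 1))
done

if [ "$FAILED" -gt $((REGIONS / 2)) ]; then
  echo "consensus: incident"    # 2 of 3 regions failing is a majority
else
  echo "consensus: healthy"
fi
```

With the single-region failure from the first bullet (`fail ok ok`), `FAILED` would be 1, the majority test would not pass, and no incident would open.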