IonHour Docs

Troubleshooting

Diagnosing common issues with missed pings, false alerts, and unexpected check behavior.

Inbound Check Issues

Check Stuck in NEW

Symptom: Check was created but never transitions to OK.

Cause: No heartbeat signal has been received.

Fix:

  • Verify the ping URL is correct. The token is case-sensitive.
  • Send a test ping manually: curl -v https://app.failsignal.com/api/signals/ping/YOUR_TOKEN
  • Check that your job is actually running and reaching the ping URL.
  • Ensure there are no firewall or proxy rules blocking outbound HTTPS from your server.
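A common pattern is to chain the ping onto the job itself so a failing job surfaces as a missed heartbeat. A sketch of a crontab entry (path and schedule are illustrative):

```shell
# Illustrative crontab entry: run the job, ping only if it exited successfully.
# -f makes curl treat HTTP errors as failures; --retry smooths transient network blips.
0 3 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://app.failsignal.com/api/signals/ping/YOUR_TOKEN > /dev/null
```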

Check Goes LATE Between Runs

Symptom: Check alternates between OK and LATE, even though the job is running on schedule.

Cause: The interval is too short for the job's actual cadence, or the grace period is too tight.

Fix:

  • Check the actual time between pings in the signal history. If your cron fires every 5 minutes but the job pings on completion and its runtime varies, the gap between pings drifts above and below 5 minutes even though the schedule is regular.
  • Increase the grace period to cover the variance in your job's execution time.
  • Make sure the interval matches your cron schedule exactly.
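To size the grace period from real data, a small helper like the following can find the worst observed gap (a sketch; it assumes you export ping timestamps from the signal history as epoch seconds, one per line):

```shell
# Print the largest gap, in seconds, between consecutive ping timestamps.
# Input: one epoch-second timestamp per line on stdin.
max_ping_gap() {
  sort -n | awk 'NR > 1 && $1 - prev > max { max = $1 - prev } { prev = $1 } END { print max }'
}
# Usage: max_ping_gap < pings.txt
```

If the largest gap is, say, 312 seconds on a 300-second interval, a grace period comfortably above the 12-second overshoot avoids spurious LATE transitions.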

Missed Pings After Deployment

Symptom: Check goes DOWN after a deployment, even though the service recovered quickly.

Cause: The service was briefly unavailable during the deploy, and the missed heartbeat exceeded the grace period.

Fix:

  • Use deployment windows with autoPause: true to pause checks during releases.
  • Even without auto-pause, an active deployment window suppresses DOWN/LATE transitions.
  • Integrate deployment window creation into your CI/CD pipeline so it happens automatically.
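A minimal CI sketch for the last point. The /api/deployment-windows endpoint, its field names, and the environment variables are assumptions for illustration (based on the autoPause option above), not a documented API:

```shell
# Hypothetical: build the deployment-window request body in a CI step.
deploy_window_payload() {
  # $1 = duration in minutes, $2 = reason string
  printf '{"autoPause": true, "durationMinutes": %d, "reason": "%s"}' "$1" "$2"
}

# Posting it is left to your pipeline, e.g.:
# curl -fsS -X POST https://app.failsignal.com/api/deployment-windows \
#   -H "Authorization: Bearer $IONHOUR_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(deploy_window_payload 15 "deploy $GIT_COMMIT")"
```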

Duplicate Signals

Symptom: Multiple signals appear at the same timestamp in the history.

Cause: Your job is sending the ping multiple times (e.g., retries, duplicate cron entries).

Fix:

  • IonHour deduplicates signals within a 1-second window, so exact duplicates are already handled.
  • If you see signals a few seconds apart, check your cron configuration for duplicate entries.
  • If you use retry logic around the curl call, each attempt that reaches the server creates a new signal (this is usually fine).

Outbound Check Issues

Check Shows DOWN but Endpoint is Reachable

Symptom: You can access the endpoint from your browser, but IonHour says it's DOWN.

Possible causes:

  1. IP-based blocking. Your endpoint's firewall or WAF blocks IonHour's probe IPs. Check your access logs for denied requests from unknown IPs.

  2. Geographic restrictions. Your endpoint only accepts traffic from certain regions. IonHour probes from us-east-1, eu-west-1, and ap-southeast-1. If your service restricts access by region, only configure probe regions that can reach your service.

  3. Expected status mismatch. Your endpoint returns a status code outside the configured range. A 204 No Content falls inside the default 200-399 range and passes, but a 401 from an auth-protected health endpoint fails every probe. Check the run history for the actual status code.

  4. Timeout. Your endpoint responds correctly but takes longer than the configured timeoutMs. Check the run history for latency values and increase the timeout if needed.

  5. Redirect issues. Your endpoint redirects (e.g., HTTP to HTTPS), and the redirect chain exceeds maxRedirects or hits an HTTPS-to-HTTP downgrade (blocked by default).
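Several of these causes can be distinguished with a single curl invocation that mimics the probe (the host is a placeholder; the flags are standard curl options):

```shell
# Show the final status code, how many redirects were followed, and total time.
curl -sS -o /dev/null \
  -w 'status=%{http_code} redirects=%{num_redirects} time=%{time_total}s\n' \
  --location --max-redirs 5 --max-time 10 \
  https://yoursite.com/health
```

A status outside your expected range points to cause 3, a timeout to cause 4, and an unexpectedly high redirect count to cause 5.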

High Latency Alerts but Endpoint Seems Fast

Symptom: Latency breach alerts fire, but your endpoint responds quickly when tested locally.

Cause: IonHour measures latency from the probe region to your endpoint. This includes:

  • DNS resolution time
  • TCP connection time
  • TLS handshake time
  • Time to first byte
  • Full response read time

If your service is in us-west-2 and the probe is in eu-west-1, the round-trip network latency adds 100+ ms that you don't see when testing locally.

Fix:

  • Configure probes from the region closest to your service.
  • Set latencyWarnMs to account for geographic distance.
  • Check the run history for a breakdown of where time is spent.
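curl's --write-out timers map roughly onto the phases listed above and can show where the time goes when probing from a distant region (the host is a placeholder):

```shell
# Per-phase latency breakdown: DNS, TCP connect, TLS handshake,
# time to first byte, and total.
curl -sS -o /dev/null \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  https://yoursite.com/
```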

SSL Expiry Warnings for Auto-Renewed Certificates

Symptom: SSL expiry alerts fire even though you have auto-renewal configured.

Cause: Auto-renewal may have silently failed. IonHour reads the actual certificate expiry date from the TLS handshake — it reports what's really there, not what's expected.

Fix:

  • Verify the certificate: openssl s_client -connect yoursite.com:443 -servername yoursite.com < /dev/null 2>/dev/null | openssl x509 -noout -dates
  • Check your certificate renewal process (certbot, ACME client, cloud provider).
  • The default 14-day warning gives you time to fix it.
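To script the same check, a helper that prints days-to-expiry for a PEM certificate can be useful (a sketch; assumes openssl and GNU date are available):

```shell
# Print the days remaining on a PEM certificate read from stdin.
cert_days_left() {
  local not_after
  # Extract the notAfter date from the certificate.
  not_after=$(openssl x509 -noout -enddate | cut -d= -f2)
  # Convert to epoch seconds and compute whole days from now.
  echo $(( ( $(date -d "$not_after" +%s) - $(date +%s) ) / 86400 ))
}

# Check the live certificate, e.g.:
# openssl s_client -connect yoursite.com:443 -servername yoursite.com </dev/null 2>/dev/null | cert_days_left
```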

SSRF_BLOCKED Error

Symptom: Probe returns SSRF_BLOCKED reason in the run history.

Cause: The target URL resolves to a private or internal IP address (e.g., 127.0.0.1, 10.x.x.x, 192.168.x.x). IonHour blocks these to prevent server-side request forgery.

Fix:

  • Outbound checks are designed for externally reachable endpoints only.
  • For internal services, use inbound heartbeat checks with an internal health check script.
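A sketch of such a script: check a local health endpoint, and ping the heartbeat only if it responds. The /healthz path is an assumption; swap in your service's endpoint and the check's real ping URL:

```shell
# Ping the heartbeat URL only when the internal health URL responds.
ping_if_healthy() {
  # $1 = internal health URL, $2 = IonHour ping URL
  curl -fsS --max-time 5 "$1" > /dev/null && curl -fsS "$2" > /dev/null
}

# Run from cron via a wrapper script, e.g.:
# ping_if_healthy http://localhost:8080/healthz https://app.failsignal.com/api/signals/ping/YOUR_TOKEN
```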

Incident Issues

Incident Not Auto-Resolving

Symptom: The check shows OK, but the incident is still ACTIVE.

Cause: This can happen if:

  • The check recovered but the incident resolution event didn't fire (rare, usually due to transient DB error).
  • The incident was created for a different reason (e.g., DEPENDENCY_DOWN) and the dependency hasn't recovered yet.

Fix:

  • Check the incident's reason field. If it's DEPENDENCY_DOWN, the dependent service needs to recover, not just the check itself.
  • For SERVICE_DOWN incidents, receiving a successful signal should auto-resolve. If it didn't, the next signal will typically catch it.

Too Many Dependency Incidents

Symptom: A single dependency outage creates incidents on many service checks.

Cause: This is expected behavior. Every service check that depends on the down dependency gets its own DEPENDENCY_DOWN incident.

Fix:

  • This is working as designed — each service is independently impacted.
  • Use the incident list filters to narrow the view by check or project.
  • If the noise is too high, consider whether all those services truly depend on the impacted dependency. Remove dependency links for services that are only loosely coupled.

Alert Issues

Not Receiving Email Alerts

Symptom: Check goes DOWN but no email arrives.

Checklist:

  1. Is the alert channel enabled? Check Settings > Alerts in your workspace.
  2. Is the recipient email correct? Verify the email addresses on the channel.
  3. Are your personal email notifications enabled? Check Settings > Profile > Email notifications.
  4. Is the check muted? Check the muteNotifications setting on the check itself.
  5. Check your spam/junk folder. IonHour sends from [email protected].

Not Receiving Slack Alerts

Symptom: Check goes DOWN but no Slack message appears.

Checklist:

  1. Is the Slack channel enabled? Check Settings > Alerts.
  2. Is the webhook URL still valid? Slack webhooks can be revoked if the app is removed from the workspace.
  3. Use the Test button on the alert channel to verify the connection.
  4. Re-authorize the Slack integration if the test fails.

Alert Fatigue (Too Many Alerts)

Symptom: Team is receiving so many alerts that they're ignoring them.

Fix:

  • Increase grace periods on inbound checks. A longer grace period means fewer LATE transitions.
  • Increase failAfterConsecutive on outbound checks. Requiring 5 consecutive failures instead of 3 suppresses alerts caused by transient blips.
  • Mute noisy checks temporarily while you tune them. But don't leave them muted — fix the root cause.
  • Use deployment windows so planned downtime doesn't trigger alerts.
  • Review check intervals. If you're monitoring a service every 60 seconds but don't need sub-minute detection, increase to 300 seconds.
  • Latency and SSL alerts have a 1-hour cooldown per check. If you're still getting too many, raise the thresholds.
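As a back-of-envelope guide when tuning, worst-case time-to-alert grows roughly as the probe interval times the consecutive-failure threshold:

```shell
# Rough time-to-alert after an outage begins: interval * consecutive failures.
interval=60   # probe interval in seconds
fails=5       # failAfterConsecutive
echo "worst-case detection: ~$(( interval * fails ))s"
```

A 60-second interval with failAfterConsecutive of 5 still detects an outage in about 5 minutes, which is a reasonable trade for far fewer transient alerts.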

General Issues

API Returns 429 Too Many Requests

Cause: You've exceeded the rate limit (default: 100 requests per 60 seconds).

Fix:

  • Space out your API calls.
  • For bulk operations, batch requests rather than sending them individually.
  • The ping endpoint shares the global rate limit. If you have many checks pinging from the same IP, consider staggering them.
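One way to stagger a fleet (a sketch, not an IonHour feature): derive a stable per-host offset and sleep it before pinging, so hosts sharing an egress IP don't all fire in the same second:

```shell
# Map a host name to a stable 0-59 second delay.
stagger_delay() {
  # $1 = host name (defaults to this machine's hostname)
  echo $(( $(printf '%s' "${1:-$(hostname)}" | cksum | cut -d' ' -f1) % 60 ))
}

# In the cron wrapper:
# sleep "$(stagger_delay)" && curl -fsS https://app.failsignal.com/api/signals/ping/YOUR_TOKEN
```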

API Returns 401 Unauthorized

Checklist:

  1. Is the token in the Authorization header? Format: Authorization: Bearer ionh_...
  2. Is the API key enabled? Disabled keys return 401.
  3. Is the API key for the correct workspace? Keys are workspace-scoped.
  4. For OAuth tokens: has the access token expired? Use the refresh endpoint to get a new one.

Data Not Showing in Dashboard

Symptom: Pings are being sent (curl returns OK) but the dashboard shows no signals.

Cause: You might be looking at the wrong workspace or the wrong check.

Fix:

  • Verify the token matches the check you're viewing. Each check has a unique token.
  • Check that you're in the correct workspace in the dashboard.
  • Signals are stored in MongoDB. If the MongoDB connection is down, pings succeed (the endpoint returns OK) but signals aren't persisted. This is a server-side issue — contact support.