Analytics & Uptime Reporting
Track uptime, latency, and reliability metrics across your workspace to measure SLA compliance and identify trends.
Ionhour collects signals from every check in your workspace and computes analytics that help you understand your infrastructure's health over time. These metrics range from workspace-wide reliability scores down to per-check latency percentiles, giving you the data you need for SLA reporting, capacity planning, and incident retrospectives.
Analytics Overview
Ionhour provides analytics at three levels:
| Level | What it covers | Key metrics |
|---|---|---|
| Workspace | All checks across all projects | Reliability score, total checks, incident count, MTTR |
| Project | All checks within a project | Uptime percentage, signal volume, check health overview |
| Check | Individual check performance | Uptime %, latency (avg, P50, P95), drift stats, downtime duration |
Each level builds on the one below it. Workspace reliability is derived from individual check uptimes. Project health is an aggregate of its checks. And check-level metrics are computed from raw signal data.
Workspace-Level Metrics
Reliability Score
The workspace reliability score is an aggregate uptime percentage across all checks that have received signals in the selected time period. It answers the question: "What percentage of the time were my services available?"
The score is calculated by:
- Computing the uptime ratio for each check (successful signals / total signals).
- Averaging those ratios across all checks with signal data.
- Rounding to two decimal places.
A workspace with 10 checks, where 9 have 100% uptime and 1 has 90% uptime, would show a reliability score of 99.00%.
Checks with no signals in the selected period are excluded from the reliability calculation. A newly created check that hasn't received its first ping won't drag your reliability score down.
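The calculation above can be sketched in a few lines. The field names on the input records (`successSignals`, `failSignals`) are illustrative, not the actual API schema:

```python
# Sketch of the reliability-score calculation: average per-check uptime
# ratio, excluding checks with no signals, rounded to two decimal places.

def reliability_score(checks: list[dict]) -> float:
    ratios = []
    for check in checks:
        total = check["successSignals"] + check["failSignals"]
        if total == 0:
            continue  # checks with no signals are excluded
        ratios.append(check["successSignals"] / total)
    if not ratios:
        return 0.0
    return round(sum(ratios) / len(ratios) * 100, 2)

# The worked example from above: 9 checks at 100% uptime, 1 at 90%.
checks = [{"successSignals": 100, "failSignals": 0} for _ in range(9)]
checks.append({"successSignals": 90, "failSignals": 10})
print(reliability_score(checks))  # 99.0
```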
You can query the workspace reliability score via the API or through the Ionhour MCP server:
```bash
# Via API — using the checks-stats endpoint with workspace scope
curl "https://api.ionhour.com/api/checks-stats?workspaceId=1" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

The response includes:

| Field | Description |
|---|---|
| checksWidget.activeChecks | Total number of active checks in the workspace |
| checksWidget.activeChecksDiffPercentage | Month-over-month change in check count |
| signalsWidget.totalSignals | Total signals received across all checks |
| signalsWidget.uptimePercentage | Workspace-wide uptime for the current month |
| signalsWidget.signalsDiffPercentage | Month-over-month change in signal volume |
| newSignals | Signals received in the last 24 hours |
| alertsTriggered | Total incidents created for workspace checks |
Workspace Summary
The checks overview endpoint provides a health breakdown of all checks in the workspace:
```bash
curl "https://api.ionhour.com/api/checks-stats/overview?workspaceId=1" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

| Field | Description |
|---|---|
| totalChecks | Total checks in scope |
| downChecks | Checks currently in DOWN status |
| dependencyImpactedChecks | Checks impacted by a dependency outage |
| unstableChecks | Checks with a high SUSPECT/LATE signal ratio (>20% over 5+ samples) |
| impactedChecks | Union of down, dependency-impacted, and unstable checks |
This overview is the data behind the dashboard health widgets. Use it to get a quick pulse on your infrastructure without drilling into individual checks.
MTTR (Mean Time to Resolution)
The workspace reliability report includes MTTR, calculated from resolved incidents in the selected period:
```
MTTR = sum(resolvedAt - startedAt) / count(resolved incidents)
```

MTTR is one of the most important incident response metrics. A decreasing MTTR over time indicates improving incident response processes. An increasing MTTR may signal alert fatigue, staffing gaps, or growing system complexity.
MTTR only includes resolved incidents. Active incidents are excluded because their final duration is unknown. If you have long-running active incidents, your MTTR may appear artificially low.
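A minimal sketch of the MTTR formula, assuming incident records carry `startedAt`/`resolvedAt` timestamps (Unix seconds here) and active incidents have no `resolvedAt`:

```python
# MTTR over resolved incidents only; active incidents are excluded
# because their final duration is unknown.
from datetime import timedelta

def mttr_seconds(incidents: list[dict]) -> float:
    resolved = [i for i in incidents if i.get("resolvedAt") is not None]
    if not resolved:
        return 0.0
    total = sum(i["resolvedAt"] - i["startedAt"] for i in resolved)
    return total / len(resolved)

incidents = [
    {"startedAt": 0, "resolvedAt": 600},      # resolved in 10 minutes
    {"startedAt": 1000, "resolvedAt": 2800},  # resolved in 30 minutes
    {"startedAt": 5000, "resolvedAt": None},  # still active: excluded
]
print(timedelta(seconds=mttr_seconds(incidents)))  # 0:20:00
```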
Project-Level Metrics
Signal Stats
Project-level signal stats aggregate across all checks in a project:
```bash
curl "https://api.ionhour.com/api/checks-stats?projectId=1" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

The response includes the same fields as the workspace-level query, but scoped to a single project. This is useful for per-team or per-service reporting when your projects map to organizational boundaries.
Response Time Overview
For projects with inbound (heartbeat) checks, Ionhour tracks signal drift — the difference between when a signal was expected and when it actually arrived.
```bash
curl "https://api.ionhour.com/api/checks-stats/response-time/overview?projectId=1" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

| Field | Description |
|---|---|
| avgDriftMs | Average drift across all checks in the project |
| checksCount | Number of checks included |
| onTimeRate | Percentage of signals arriving within tolerance |
| degradedChecks | Checks where average drift exceeds 50% of their schedule interval |
This helps identify checks that are consistently running late but haven't yet triggered incidents — a leading indicator of capacity issues.
Check-Level Uptime
Inbound Check Uptime
For inbound (heartbeat) checks, uptime is calculated from the ratio of SUCCESS signals to total signals (SUCCESS + FAIL) for the current calendar month:

```
uptime = successSignals / (successSignals + failSignals) * 100
```

Access per-check stats through the check detail view in the dashboard, or via the API:

```bash
# Dashboard stats endpoint returns uptime for a specific check
curl "https://api.ionhour.com/api/checks-stats?projectId=1" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Per-check stats include:

| Field | Description |
|---|---|
| uptime | Uptime percentage for the current month |
| avgDrift | Average signal drift in milliseconds |
| totalPings | Total signals received since check creation |
| currentMonthDowntime | Total downtime in milliseconds for the current month |
| lastMonthDowntime | Total downtime in milliseconds for the previous month |
Downtime is calculated by summing the duration of all overlapping incidents within the time period, not from signal gaps. This means a check that goes down for 10 minutes, recovers, and goes down again for 5 minutes would show 15 minutes of downtime.
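One reasonable implementation of incident-based downtime is sketched below: clip each incident to the reporting window and merge overlapping intervals so concurrent incidents aren't double-counted. This is an illustration, not Ionhour's exact algorithm:

```python
# Sum incident durations within a window, merging overlaps.
# Incidents are (started, resolved) pairs in Unix seconds.

def downtime_seconds(incidents, window_start, window_end):
    intervals = []
    for started, resolved in incidents:
        start = max(started, window_start)   # clip to the window
        end = min(resolved, window_end)
        if start < end:
            intervals.append((start, end))
    intervals.sort()
    total, current_end = 0, None
    for start, end in intervals:
        if current_end is None or start > current_end:
            total += end - start             # disjoint interval
            current_end = end
        elif end > current_end:
            total += end - current_end       # extends an overlapping interval
            current_end = end
    return total

# 10-minute outage, recovery, then a 5-minute outage -> 15 minutes total.
incidents = [(100, 700), (2000, 2300)]
print(downtime_seconds(incidents, 0, 10_000))  # 900
```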
Outbound Check Uptime
For outbound checks, uptime is calculated from probe results:

```
uptimePercent = successfulProbes / totalProbes * 100
```

Query outbound stats with a configurable time range:

```bash
curl "https://api.ionhour.com/api/checks/42/outbound-stats?rangeHours=24" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

The response provides:

| Field | Description |
|---|---|
| uptimePercent | Percentage of successful probes |
| totalProbes | Total probe attempts in the range |
| successCount | Probes that returned an OK result |
| failureCount | Probes that failed for any reason |
| avgLatencyMs | Mean response time across all probes |
| p50LatencyMs | Median (50th percentile) response time |
| p95LatencyMs | 95th percentile response time |
| maxLatencyMs | Maximum observed response time |
You can filter by region using the probeId parameter to see per-region performance:

```bash
# Stats from the EU West probe only
curl "https://api.ionhour.com/api/checks/42/outbound-stats?rangeHours=24&probeId=eu-west-1" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Outbound Performance Metrics
Latency Percentiles
Ionhour computes P50 (median) and P95 latency for outbound checks. These percentiles are more useful than averages for understanding user experience:
- P50 tells you the typical response time experienced by most users.
- P95 tells you the response time that the slowest 5% of requests exceed. This is the number you should use for SLA reporting.
- Average can be misleading when a few slow outliers skew the mean.
For example, if your P50 is 120ms and your P95 is 800ms, that means half your probes complete in under 120ms, but 5% of probes take over 800ms. If your SLA promises sub-500ms responses, you have a problem that the average (which might be 180ms) would hide.
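To make the percentile math concrete, here is a nearest-rank percentile sketch; the interpolation method Ionhour actually uses isn't specified here, and the latency samples are hypothetical:

```python
# Nearest-rank percentile: the smallest sample covering p% of the data.
import math

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Mostly-fast probes with two slow outliers.
latencies = [110, 115, 120, 118, 122, 119, 121, 117, 450, 800]
print(percentile(latencies, 50))  # 119 -- the typical probe
print(percentile(latencies, 95))  # 800 -- the tail the average hides
```

Note how the two outliers barely move the median but dominate the P95, which is exactly why averages are a poor basis for latency SLAs.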
Outbound Timeline
The timeline endpoint provides bucketed probe results over time, enabling trend visualization:
```bash
curl "https://api.ionhour.com/api/checks/42/outbound-timeline?rangeHours=24&bucketMinutes=5" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

| Parameter | Default | Description |
|---|---|---|
| rangeHours | 24 | How many hours of data to return |
| bucketMinutes | 5 | Size of each time bucket in minutes |
| probeId | all | Filter to a specific probe region |
Each bucket in the response contains:
| Field | Description |
|---|---|
| timestamp | Start of the time bucket |
| avgLatencyMs | Average response time in this bucket |
| maxLatencyMs | Maximum response time in this bucket |
| successCount | Number of successful probes |
| failureCount | Number of failed probes |
Use the timeline to identify patterns like:
- Periodic latency spikes that correlate with batch job schedules.
- Gradual latency increases indicating capacity degradation.
- Failure clusters that suggest network or DNS issues at specific times.
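A sketch of cluster detection over timeline buckets, using the `successCount`/`failureCount` fields from the table above; the 50% failure threshold is an arbitrary choice for illustration:

```python
# Find runs of consecutive buckets whose failure rate exceeds a threshold.
# Returns (start_index, end_index) pairs into the bucket list.

def failure_clusters(buckets, threshold=0.5):
    clusters, start = [], None
    for i, b in enumerate(buckets):
        total = b["successCount"] + b["failureCount"]
        bad = total > 0 and b["failureCount"] / total > threshold
        if bad and start is None:
            start = i                        # cluster begins
        elif not bad and start is not None:
            clusters.append((start, i - 1))  # cluster ends
            start = None
    if start is not None:
        clusters.append((start, len(buckets) - 1))
    return clusters

buckets = [
    {"successCount": 10, "failureCount": 0},
    {"successCount": 2, "failureCount": 8},
    {"successCount": 1, "failureCount": 9},
    {"successCount": 10, "failureCount": 0},
]
print(failure_clusters(buckets))  # [(1, 2)]
```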
Inbound Check Response Time
Drift Statistics
For heartbeat checks, Ionhour measures drift — how late (or early) each signal arrives relative to the expected schedule. This is distinct from latency, which measures HTTP response time.
```bash
curl "https://api.ionhour.com/api/checks-stats/42/response-time?range=24h" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

| Field | Description |
|---|---|
| avgDriftMs | Average drift across all signals in the range |
| p50DriftMs | Median drift |
| p95DriftMs | 95th percentile drift |
| p99DriftMs | 99th percentile drift |
| minDriftMs | Smallest observed drift |
| maxDriftMs | Largest observed drift |
| onTimeRate | Percentage of signals arriving within 5% of the schedule |
| avgDurationMs | Average execution time reported by the signal payload |
| sampleCount | Number of signals included in the calculation |
Available time ranges: 1h, 24h, 7d, 30d.
Drift Timeline
The drift timeline provides bucketed drift statistics for trend visualization:
```bash
curl "https://api.ionhour.com/api/checks-stats/42/response-time/timeline?range=24h" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Bucket sizes vary by range:

| Range | Bucket size |
|---|---|
| 1h | 5 minutes |
| 24h | 1 hour |
| 7d | 6 hours (uses pre-aggregated hourly data) |
| 30d | 1 day (uses pre-aggregated daily data) |
For the 7d and 30d ranges, Ionhour uses pre-aggregated statistics rather than querying raw signals. This keeps queries fast even for checks that generate thousands of signals per day.
Uptime Timeline
The uptime timeline provides a time-bucketed view of signal health across all checks in a project or workspace:
```bash
# Workspace-wide, last 7 days
curl "https://api.ionhour.com/api/checks-stats/uptime?workspaceId=1&range=7d" \
  -H "Authorization: Bearer YOUR_TOKEN"

# Single check, last 24 hours
curl "https://api.ionhour.com/api/checks-stats/uptime?workspaceId=1&range=24h&checkId=42" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Each bucket contains counts of SUCCESS, FAIL, SUSPECT, and DEPLOYMENT signals. This is the data behind the uptime visualization bars in the dashboard.

| Signal Type | Meaning |
|---|---|
| SUCCESS | Check reported healthy |
| FAIL | Check missed its schedule or returned an error |
| SUSPECT | Signal arrived late but within the grace period |
| DEPLOYMENT | A deployment event was recorded |
Status Page Uptime Bars
If you use status pages, each component can display an uptime bar showing availability history over a configurable period (up to 365 days). These bars are powered by the same signal data used in the analytics endpoints.
When a status page component is linked to a check, the uptime bar reflects that check's actual signal history. Each day segment is colored based on the ratio of successful to failed signals that day:
- Green — 100% successful signals
- Yellow/Orange — Partial failures
- Red — Majority or all signals failed
- Gray — No signals received
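The coloring rules above could be sketched as follows; the exact cutoff Ionhour applies between yellow and red isn't documented here, so the 50% threshold is an assumption:

```python
# Map a day's success/fail counts to an uptime-bar color.
# The 0.5 yellow/red boundary is an illustrative assumption.

def day_color(success: int, failed: int) -> str:
    total = success + failed
    if total == 0:
        return "gray"    # no signals received
    ratio = success / total
    if ratio == 1.0:
        return "green"   # 100% successful signals
    if ratio >= 0.5:
        return "yellow"  # partial failures
    return "red"         # majority or all signals failed

print(day_color(100, 0), day_color(95, 5), day_color(10, 90), day_color(0, 0))
```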
Configure the uptime bar length per-component in the status page settings. See the Status Pages guide for details.
Using Analytics for SLA Reporting
Ionhour's analytics map directly to common SLA metrics:
| SLA Metric | Ionhour Data Source |
|---|---|
| Availability (uptime %) | Workspace reliability score or per-check uptime percentage |
| Response time (P95) | Outbound check P95 latency |
| Incident count | Workspace or project incident count |
| MTTR | Workspace reliability MTTR calculation |
| Downtime duration | Per-check currentMonthDowntime |
Generating an SLA Report
To produce a monthly SLA report, query these endpoints:
Get workspace reliability for the reporting period. Use the checks-stats endpoint with your workspace ID to get the overall uptime percentage, incident count, and signal volume.
Get per-check downtime for each critical service. The check stats endpoint returns currentMonthDowntime and lastMonthDowntime in milliseconds, which you can convert to minutes of downtime.
Get outbound latency percentiles for any HTTP-monitored endpoints. The outbound-stats endpoint returns P50 and P95 latency, which map directly to response time SLAs.
Calculate SLA compliance. Compare the measured uptime and latency against your SLA thresholds. For example, if your SLA promises 99.9% uptime (roughly 43 minutes of downtime per month), compare the total downtime against that threshold.
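Step 4 can be sketched as a downtime-budget comparison; the helper names below are hypothetical, and the inputs are the millisecond downtime fields returned by the check stats endpoint:

```python
# Compare measured downtime against the budget implied by an SLA target.

def downtime_budget_ms(sla_percent: float, period_days: int = 30) -> float:
    period_ms = period_days * 24 * 60 * 60 * 1000
    return period_ms * (1 - sla_percent / 100)

def sla_met(measured_downtime_ms: float, sla_percent: float) -> bool:
    return measured_downtime_ms <= downtime_budget_ms(sla_percent)

budget = downtime_budget_ms(99.9)
print(budget / 60000)             # ~43.2 minutes allowed per 30-day month
print(sla_met(15 * 60000, 99.9))  # 15 minutes of downtime: within budget
print(sla_met(60 * 60000, 99.9))  # 60 minutes: SLA breached
```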
Using Analytics for Capacity Planning
Analytics trends over time reveal capacity issues before they become incidents:
- Rising P95 latency on outbound checks suggests your service is approaching capacity limits. If P95 is climbing while P50 stays flat, a subset of requests is hitting a bottleneck.
- Increasing drift on inbound checks means your cron jobs are taking longer to complete. This often indicates growing data volumes or resource contention.
- Declining on-time rate below 95% signals that your jobs are routinely running late. Consider increasing the schedule interval or allocating more resources.
- Growing unstable check count in the workspace overview means more checks are showing intermittent issues. Investigate before they become persistent failures.
Use the 30d time range for capacity planning analysis. Shorter ranges (1h, 24h) show too much noise from transient spikes. The 30-day view reveals sustained trends.
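One simple way to separate a sustained trend from noise is a least-squares slope over daily samples; the P95 series below is hypothetical:

```python
# Least-squares slope of a metric against its day index (units per day).
# A persistently positive slope on P95 latency suggests a capacity trend.

def slope(values: list[float]) -> float:
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

p95_daily = [420, 425, 431, 440, 438, 452, 461]  # ms, one sample per day
print(f"{slope(p95_daily):.1f} ms/day")  # positive: P95 latency is climbing
```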
Dashboard Widgets
The Ionhour dashboard surfaces analytics through several widgets:
| Widget | What it shows |
|---|---|
| Stats Segment | Active checks, total signals, uptime %, and month-over-month trends |
| Uptime Statistics | Time-bucketed signal chart (success/fail/suspect) with range selector |
| Live Status Badge | Current health status based on checks overview |
| Incidents Widget | Active incidents with severity and duration |
These widgets update in real time via SSE. When a new signal arrives or a check status changes, the dashboard reflects the change without requiring a page refresh.
Best Practices
- Monitor trends, not snapshots. A single 99.5% uptime reading doesn't tell you much. Track uptime weekly to spot declining trends before they breach your SLA.
- Use P95, not averages, for latency SLAs. Averages hide tail latency issues. If your SLA says "95th percentile response time under 500ms," measure P95 directly.
- Compare month-over-month. Ionhour provides month-over-month diffs for check count and signal volume. A sudden drop in signal volume might mean a check stopped running, not that everything is healthy.
- Set up outbound checks for SLA-critical endpoints. Inbound checks measure whether your cron jobs run. Outbound checks measure whether your users can reach your service. For SLA reporting, you usually need outbound data.
- Review MTTR trends quarterly. MTTR reflects your team's incident response effectiveness. If it's increasing, investigate whether it's due to more complex incidents, slower acknowledgment, or insufficient escalation rules.
- Export data for external reporting. Use the API endpoints documented above to pull analytics data into your reporting tools, internal dashboards, or customer-facing SLA reports.
- Filter by region for global services. If you run outbound checks from multiple regions, always break down latency and uptime by region. A global average can mask regional problems.