GitQuick Metrics Documentation
GitQuick analyzes GitHub Pull Request metadata to surface code review performance metrics. This site documents every metric that appears in the GitQuick UI and the exact calculation method used to produce it.
Contents
- Signals Guide — what the Signals tab shows, how to use it, and how each signal is calculated.
- Metrics Reference — full list of metrics, definitions, and formulas.
How data is collected
For each analyzed repository, GitQuick fetches Pull Requests and their associated reviews, commits, and timestamps via the GitHub REST and GraphQL APIs. The raw records are persisted in Postgres, and metrics are computed on demand from that stored dataset when an org run's results are viewed.
Scope of an "org run"
A metric is always computed over the set of PRs included in a single org run. Each PR contributes at most once to each metric. PRs created or merged outside the configured time window of the run are excluded.
Percentiles
Wherever "p50" (median) or "p90" is reported, GitQuick uses discrete percentile semantics (percentile_disc). For a sorted sample, the p‑th percentile is the smallest value such that at least p% of the sample is at or below it. A null percentile indicates that the sample was empty or that every value was invalid.
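The discrete percentile rule can be illustrated with a minimal Python sketch. The function name and the integer-percent argument are assumptions for illustration, not GitQuick's code; the product itself relies on percentile_disc over the stored data.

```python
def percentile_disc(values, percent):
    """Smallest sample value with at least `percent`% of the sample at or below it.

    Hypothetical helper: null and negative values are dropped first, and an
    empty sample yields None.
    """
    clean = sorted(v for v in values if v is not None and v >= 0)
    if not clean:
        return None
    # Ceiling of percent * n / 100 using integer arithmetic (1-based rank).
    rank = max(1, (percent * len(clean) + 99) // 100)
    return clean[rank - 1]


hours = [2, 4, 6, 8, 50]
median = percentile_disc(hours, 50)  # always an actual value from the sample
p90 = percentile_disc(hours, 90)
```

Because the result is always one of the observed values, no interpolation between neighbouring PRs takes place.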
Sample sizes
Every time‑based and size‑based metric is accompanied by a sample_size, which is the number of PRs that contributed a valid value to that metric (after excluding null, negative, or otherwise invalid records). A metric with a low sample size should be interpreted with care.
Signals Guide
The Signals tab is the fastest way to understand what is going wrong in your code review process. Instead of reading every chart and number on the Metrics tab, GitQuick runs your data through six automated checks and surfaces only the problems that matter — with severity, context, and suggested next steps.
This guide explains what Signals are, how to use them, and exactly how each one is calculated.
What are Signals?
Signals are interpreted findings, not raw numbers.
GitQuick first computes metrics from your GitHub pull request data (review times, throughput, PR size, and so on). The Signals engine then compares those metrics against healthy thresholds and flags patterns that usually indicate a real process problem.
Think of it this way:
| Tab | What you get |
|---|---|
| Signals | "Here is what looks unhealthy, how serious it is, and what to try next." |
| Metrics | "Here are the raw numbers that back it up." |
Signals are deterministic: the same run data always produces the same signals. There is no AI or randomness involved in detection — only fixed rules and thresholds.
How to use the Signals tab
1. Open a completed org run
Signals appear after GitQuick finishes analyzing your repositories. Select a run from your history, then click the Signals tab (it is the first tab by default).
2. Read the cards top to bottom
Only triggered signals are shown. They are sorted by:
- Severity — Critical first, then High, Medium, Low
- Confidence — Higher-confidence signals appear first within the same severity
If nothing is wrong, you will see:
✓ No signals detected in this run — all headline metrics are within healthy thresholds.
That is a good outcome. It means none of the six checks found a pattern above its warning threshold.
3. Expand a card for the full story
Each signal card has a collapsed header and an expandable body.
Collapsed header shows:
- Severity badge — Low, Medium, High, or Critical
- Title and one-line summary
- Confidence score — How strongly the data supports this finding (0–100%)
- Evidence preview — One or two key numbers (e.g. median review time)
Expanded body shows:
- Interpretation — What the pattern means in plain language
- Why it matters — Business and delivery impact
- Likely causes — Common reasons teams see this pattern
- Suggested next actions — Practical things to investigate or change
- Supporting metrics — Clickable links that jump to the relevant section on the Metrics tab
- Evidence — All numeric values used in the evaluation
4. Drill into the Metrics tab
Use Supporting metrics links (e.g. "Time to First Review (median, p90) ↓") to see the underlying data. GitQuick switches to the Metrics tab and scrolls to the matching chart, briefly highlighting it.
5. Change scope when needed
Use the scope selector (top right) to view signals for:
- The whole organization
- A single repository
- A custom repo group
Signals recalculate for the selected scope. A problem visible at org level may disappear when you zoom into one repo — or the opposite.
6. Compare runs over time
Signals reflect a single run. To see whether a problem is new or getting worse, use the Trend tab alongside Signals. A signal that appears in run after run is worth prioritizing.
How Signals are calculated (overview)
For each org run, GitQuick:
- Computes all metrics from stored PR data (see Metrics Reference)
- Runs six signal definitions against those metrics
- Evaluates multiple rules per signal (each rule checks one condition)
- Assigns severity and confidence
- Returns only signals whose status is triggered
A signal is triggered when at least one of its rules fires. If every rule passes (or cannot be evaluated because data is missing), the signal is considered healthy and is not shown.
Severity
Each triggered rule contributes a severity level. The signal's overall severity is the highest severity among its triggered rules.
Some signals have compound rules — conditions that only fire when multiple problems happen together (e.g. both review latency and approval latency are elevated). When a compound rule triggers alongside other rules, severity is escalated by one level (Low → Medium → High → Critical, capped at Critical).
Within a single rule, severity scales with how far the observed value exceeds the threshold:
| How far above the warning threshold | Severity |
|---|---|
| Just above threshold | Low |
| Roughly one-third of the way to critical | Medium |
| Approaching critical | High |
| At or above critical threshold | Critical |
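The aggregation step (highest rule severity, then a one-level escalation when a compound rule fires) could be sketched as follows. The enum and function names are hypothetical, and the sketch deliberately does not model how severity scales within a single rule, since the table above describes that only approximately.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def signal_severity(rule_severities, compound_triggered=False):
    """Overall severity for one signal, or None when no rule fired.

    rule_severities: severities of the rules that triggered (assumed input).
    compound_triggered: True when a compound rule fired alongside them.
    """
    if not rule_severities:
        return None
    overall = max(rule_severities)
    if compound_triggered:
        # Escalate by one level, capped at Critical.
        overall = Severity(min(overall + 1, Severity.CRITICAL))
    return overall
```

For example, a signal whose triggered rules are Low and Medium reports Medium on its own and High when a compound rule also fires.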
Confidence
Confidence (shown as 0–100%) reflects how much you should trust the finding. It combines three factors:
| Factor | Weight | Meaning |
|---|---|---|
| Data completeness | 40% | Were all rules for this signal evaluable, or were some metrics missing? |
| Corroboration | 35% | How many rules triggered vs. how many were checked? More corroborating rules = higher confidence. |
| Sample size | 25% | Is there enough PR data? Signals based on very few PRs get lower confidence. |
Sample size is compared against a minimum of 10 PRs. Below that, confidence is penalized proportionally.
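One plausible way to combine the three weighted factors is shown below. Only the weights and the 10-PR minimum come from this guide; how each factor is normalized to a 0–1 value (evaluated ÷ total rules, triggered ÷ evaluated rules, sample size ÷ 10) is an assumption made for illustration, and all names are hypothetical.

```python
MIN_SAMPLE = 10  # minimum PR count before the sample-size factor is penalized


def confidence(rules_total, rules_evaluated, rules_triggered, sample_size):
    """Confidence as a whole percentage (0-100), or None if nothing was evaluable."""
    if rules_total == 0 or rules_evaluated == 0:
        return None
    completeness = rules_evaluated / rules_total       # assumed normalization
    corroboration = rules_triggered / rules_evaluated   # assumed normalization
    sample = min(1.0, sample_size / MIN_SAMPLE)          # proportional penalty below 10 PRs
    score = 0.40 * completeness + 0.35 * corroboration + 0.25 * sample
    return round(100 * score)
```

Under these assumptions, a signal with all three rules evaluated, one triggered, and only 5 PRs scores noticeably lower than the same signal with all rules triggered on 50 PRs.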
What you will not see
- Healthy signals — If a check passes, it does not appear. No news is good news.
- Signals with no evaluable data — If every rule for a signal is skipped (all required metrics are missing), the entire signal is omitted.
The six signals
GitQuick evaluates exactly six signals on every run. Each one maps to a common class of engineering delivery problem.
1. Review Latency Problem
What it detects: Pull requests are waiting too long for someone to review them or approve them.
Why it matters: Slow reviews stall delivery, let context go stale, and pull reviewers back to old work.
Rules and thresholds
| Rule | Condition | Warning threshold | Critical threshold |
|---|---|---|---|
| First review median | Median time to first review | > 8 hours | ≥ 24 hours |
| Approval median | Median time to approval | > 24 hours | ≥ 72 hours |
| First review tail ratio | p90 ÷ median for time to first review | > 3.0× | ≥ 6.0× |
| Both review and approval elevated (compound) | Both medians above their thresholds | — | Escalates severity +1 |
Tail ratio explained: A high p90-to-median ratio means a few PRs are stuck for much longer than typical. The median can look fine while a subset of PRs waits days.
Evidence shown: First review median/p90, approval median/p90 (in hours).
Sample size used for confidence: Number of PRs with valid time-to-first-review data.
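To make the tail ratio concrete, here is a small sketch of the threshold comparison used in the table above (warning is strictly greater than, critical is greater than or equal to). The helper names and the "ok"/"warning"/"critical" labels are illustrative assumptions, not product identifiers.

```python
def classify(value, warning, critical):
    """Return 'critical', 'warning', or 'ok'; None when the value is unavailable."""
    if value is None:
        return None
    if value >= critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def tail_ratio(median_hours, p90_hours):
    """p90 divided by median; None when either is missing or the median is not positive."""
    if median_hours is None or p90_hours is None or median_hours <= 0:
        return None
    return p90_hours / median_hours


# A healthy-looking median can hide a heavy tail.
median_status = classify(5, warning=8, critical=24)
tail_status = classify(tail_ratio(5, 40), warning=3.0, critical=6.0)
```

Here a 5-hour median passes the median rule, while the 40-hour p90 produces a ratio of 8.0 and trips the tail rule.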
2. Merge Pipeline Friction
What it detects: PRs are approved but not merging quickly. The bottleneck is after review — not during it.
Why it matters: Approved work sitting idle increases merge conflicts, stale CI, and deployment delays even when review throughput looks healthy.
Rules and thresholds
| Rule | Condition | Warning threshold | Critical threshold |
|---|---|---|---|
| Approval-to-merge median | Median time from last approval to merge | > 4 hours | ≥ 16 hours |
| Dominates cycle (compound) | Approval-to-merge ÷ total time-to-merge | > 40% of cycle | Escalates severity +1 |
| Approval-to-merge tail ratio | p90 ÷ median for approval-to-merge | > 3.0× | ≥ 6.0× |
Dominance fraction explained: If 60% of a PR's total lifetime is spent waiting after approval, review is not the problem — merge gating, CI, or release process likely is.
Evidence shown: Approval-to-merge median/p90, time-to-merge median, approval-to-merge as a fraction of total cycle.
Sample size used for confidence: Number of merged PRs with valid approval-to-merge data.
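The dominance check can be sketched as a simple ratio. This sketch assumes the fraction is taken between the two medians listed in the evidence (approval-to-merge median over time-to-merge median); the guide does not spell out which aggregates are divided, and the function names are hypothetical.

```python
def dominance_fraction(approval_to_merge_median_h, time_to_merge_median_h):
    """Share of the total cycle spent waiting after approval, or None if not computable."""
    if approval_to_merge_median_h is None or not time_to_merge_median_h:
        return None
    return approval_to_merge_median_h / time_to_merge_median_h


def dominates_cycle(fraction, threshold=0.40):
    """True when post-approval waiting exceeds the 40% threshold."""
    return fraction is not None and fraction > threshold


fraction = dominance_fraction(12, 20)  # 12 h of a 20 h cycle is spent after approval
```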
3. Backlog / Throughput Imbalance
What it detects: More PRs are being opened each week than are being merged. The review queue is growing.
Why it matters: Even if individual PR cycle times look acceptable, a growing backlog means older PRs go stale, conflicts mount, and reviewer attention fragments.
Rules and thresholds
The core metric is merge efficiency:
merge efficiency = avg PRs opened per week ÷ avg PRs merged per week
| Value | Meaning |
|---|---|
| 1.0 | Intake matches output — backlog is stable |
| 1.2 | 20% more opened than merged — backlog is growing |
| 2.0+ | Intake is double output — backlog is growing fast |
| Rule | Condition | Warning threshold | Critical threshold |
|---|---|---|---|
| Merge efficiency above threshold | merge efficiency | > 1.2 | ≥ 2.0 |
| Merge efficiency critical | merge efficiency | — | > 2.0 (always Critical) |
Evidence shown: Avg opened per week, avg merged per week, merge efficiency ratio.
Sample size used for confidence: Total PRs opened in the run.
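As a worked illustration of the ratio and its thresholds, the following sketch computes merge efficiency from the two weekly averages. The function names and status labels are assumptions; returning None when nothing was merged is also an assumption, since the guide does not describe that case.

```python
def merge_efficiency(avg_opened_per_week, avg_merged_per_week):
    """Opened-per-week divided by merged-per-week; None when it cannot be computed."""
    if avg_opened_per_week is None or not avg_merged_per_week:
        return None
    return avg_opened_per_week / avg_merged_per_week


def backlog_status(efficiency):
    """Map the ratio onto the default thresholds (> 1.2 warning, >= 2.0 critical)."""
    if efficiency is None:
        return None
    if efficiency >= 2.0:
        return "critical"
    if efficiency > 1.2:
        return "warning"
    return "ok"


status = backlog_status(merge_efficiency(15, 10))  # 15 opened vs 10 merged per week
```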
4. Review Quality Risk
What it detects: Human review may be bypassed or diluted — the safety net is weaker than raw review counts suggest.
Why it matters: Code reaching main without human eyes increases regression risk and weakens knowledge sharing.
Rules and thresholds
| Rule | Condition | Warning threshold | Critical threshold |
|---|---|---|---|
| Merged without review | % of merged PRs with zero reviews | > 10% | ≥ 30% |
| Bot review share | % of all reviews from bot accounts | > 50% | ≥ 80% |
| Both bypass and bot elevated (compound) | Both conditions triggered | — | Escalates severity +1 |
Bot classification: GitQuick identifies bots by login patterns (e.g. [bot] suffix, dependabot-style names). See Metrics Reference — Bot vs Human Reviews.
Evidence shown: Merged-without-review percentage, bot review percentage.
Sample size used for confidence: Total merged PRs in the run.
5. PR Size / Complexity Risk
What it detects: Pull requests are too large for effective human review.
Why it matters: Research suggests reviewers struggle to give thorough feedback above ~200–400 changed lines. Oversized PRs tend to get rubber-stamped, hide bugs, and slow the whole pipeline.
PR size uses total lines (additions + deletions) per PR.
Rules and thresholds
| Rule | Condition | Warning threshold | Critical threshold |
|---|---|---|---|
| Median size | Median total lines per PR | > 400 lines | ≥ 1,000 lines |
| p90 size | 90th percentile total lines | > 1,000 lines | ≥ 2,500 lines |
| Median and p90 both elevated (compound) | Both size rules triggered | — | Escalates severity +1 |
When both median and p90 are elevated, oversized PRs are the norm — not just a few outliers.
Evidence shown: PR size median and p90 (total lines).
Sample size used for confidence: Number of PRs with valid size data.
6. Development Cycle Delay
What it detects: Engineers are keeping work on local branches too long before opening a PR for review.
Why it matters: Long local incubation correlates with larger PRs, harder reviews, more rebasing, and delayed feedback from CI and teammates.
Rules and thresholds
This signal uses the average (not the median) number of hours from first commit to PR open.
| Rule | Condition | Warning threshold | Critical threshold |
|---|---|---|---|
| First commit to open | Avg hours from first commit to PR open | > 24 hours | ≥ 72 hours |
| First commit to open critical | Same metric | — | > 72 hours (always Critical) |
Evidence shown: Avg first-commit-to-open hours, sample size for first-commit metrics.
Sample size used for confidence: Number of PRs with first-commit data.
Threshold reference (quick lookup)
All default thresholds in one place:
| Signal | Metric | Warning | Critical |
|---|---|---|---|
| Review Latency | First review median (hours) | > 8 | ≥ 24 |
| Review Latency | Approval median (hours) | > 24 | ≥ 72 |
| Review Latency | First review p90/median ratio | > 3.0 | ≥ 6.0 |
| Merge Friction | Approval-to-merge median (hours) | > 4 | ≥ 16 |
| Merge Friction | Approval-to-merge share of cycle | > 40% | — |
| Merge Friction | Approval-to-merge p90/median ratio | > 3.0 | ≥ 6.0 |
| Backlog Imbalance | Opened ÷ merged per week | > 1.2 | ≥ 2.0 |
| Review Quality | Merged without review (%) | > 10 | ≥ 30 |
| Review Quality | Bot review share (%) | > 50 | ≥ 80 |
| PR Size | Total lines median | > 400 | ≥ 1,000 |
| PR Size | Total lines p90 | > 1,000 | ≥ 2,500 |
| Dev Cycle Delay | Avg first commit → open (hours) | > 24 | ≥ 72 |
These are industry-reasonable starting points, not calibrated to your specific team. A mature team with strict review SLAs may want stricter thresholds; a small team shipping infrequently may legitimately trigger fewer signals.
Practical tips for new users
Start with severity, then read confidence. A Critical signal with 90% confidence deserves immediate attention. A Low signal with 40% confidence may reflect a small sample — verify on the Metrics tab before changing process.
One signal often explains another. PR Size Risk frequently drives Review Latency. Dev Cycle Delay often precedes PR Size Risk. When multiple signals fire, look for a root cause rather than treating each in isolation.
Use scope to find where the problem lives. If Review Latency fires at org level but not for the repository you are viewing, the issue is concentrated in other repositories. Check Repos Performance or narrow the scope to find them.
Empty Signals ≠ perfect engineering. Thresholds are conservative. You can have meaningful improvement opportunities that do not yet trigger a signal. The Metrics and Trend tabs still add value.
Signals complement AI reports. When AI Insights are enabled, executive reports incorporate triggered signals into their narrative. Signals give you the structured detection; AI reports add broader context and wording for stakeholders.
Related documentation
- Metrics Reference — definitions and formulas for every underlying metric
- GitQuick Metrics Documentation — how org runs, percentiles, and sample sizes work
Metrics Reference
All metrics below are computed from the set of Pull Requests included in a given org run. "PR" means a single GitHub Pull Request record.
Percentiles. Median (p50) and p90 are discrete percentiles: they always correspond to an actual PR in the sample. GitQuick computes some rollups in one pipeline and the primary-language breakdown in another; rounding and how durations are grouped can make median/p90 differ slightly between the main run view and the per-language view for the same underlying PRs.
1. Time to First Review
What it measures: how long a PR waits before any human or bot posts its first review.
Source fields: pr_created_at, first_review_at.
Per-PR value:
time_to_first_review_ms = first_review_at − pr_created_at
A PR contributes to the sample only when both timestamps exist and the difference is non‑negative.
Reported statistics:
- sample_size: number of PRs contributing a valid value
- median_hours: p50 of the sample, converted to hours
- p90_hours: p90 of the sample, converted to hours
- spread: counts of PRs in three buckets: ≤ median, > median and ≤ p90, and > p90
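The per-PR value and the three spread buckets might be computed along these lines. This is a sketch with hypothetical function names that takes Python datetime objects as input; it is not the stored-data query GitQuick runs.

```python
from datetime import datetime, timedelta, timezone


def time_to_first_review_ms(pr_created_at, first_review_at):
    """Milliseconds from PR creation to first review; None if missing or negative."""
    if pr_created_at is None or first_review_at is None:
        return None
    delta_ms = (first_review_at - pr_created_at) // timedelta(milliseconds=1)
    return delta_ms if delta_ms >= 0 else None


def spread(values, median, p90):
    """Counts of values in (<= median, > median and <= p90, > p90)."""
    at_or_below_median = sum(1 for v in values if v <= median)
    between = sum(1 for v in values if median < v <= p90)
    above_p90 = sum(1 for v in values if v > p90)
    return at_or_below_median, between, above_p90


created = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
reviewed = datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc)
wait_ms = time_to_first_review_ms(created, reviewed)  # 4.5 hours in milliseconds
```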
2. Time to Approval
What it measures: how long from PR open until the first approving review.
Source fields: pr_created_at, first_approval_at.
Per-PR value:
time_to_approval_ms = first_approval_at − pr_created_at
A PR contributes only when both timestamps exist and the difference is non‑negative.
Reported statistics: sample_size, median_hours, p90_hours, spread.
3. Approval‑to‑Merge Time
What it measures: how long a PR sits between its last approval and actually being merged. This isolates "waiting to merge" latency from review latency.
Source fields: last_approval_at, merged_at.
Per-PR value:
approval_to_merge_ms = merged_at − last_approval_at
Only merged PRs with a recorded approval contribute. PRs where merged_at < last_approval_at (force-merges or data anomalies) are excluded.
Reported statistics: sample_size, median_hours, p90_hours, spread.
4. Time to Merge
What it measures: total wall-clock time from PR open to merge.
Source fields: pr_created_at, merged_at.
Per-PR value:
time_to_merge_ms = merged_at − pr_created_at
Only merged PRs with a non-negative difference contribute.
Reported statistics: sample_size, median_hours, p90_hours, spread.
5. PR Throughput
What it measures: opened and merged PR volume per week, over the run window.
Main run view:
- For each PR, assign an opened week from pr_created_at and a merged week from merged_at when merged. Weeks are Sunday-aligned calendar buckets (local date normalization, then a single date label per week).
- Count PRs per week for opened and merged.
- Take the set of weeks that have at least one opened or merged PR (weeks_analyzed).
- Compute simple averages:
avg_opened_per_week = sum(weekly_opened) / weeks_analyzed
avg_merged_per_week = sum(weekly_merged) / weeks_analyzed
A week with no opened PRs (or no merged PRs) still counts toward the denominator as long as it appears in the other weekly series. Weeks with no activity in either series are not counted.
Primary-language breakdown assigns weeks using a Monday-aligned week boundary. Throughput averages by language therefore do not use the same week buckets as the main run totals for identical timestamps.
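The effect of the two week alignments can be seen in a short sketch. The function names and the tuple input shape are assumptions for illustration, and the sketch works on plain calendar dates, so it skips the local date normalization step.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day, first_weekday="sunday"):
    """Date label of the week containing `day` (Sunday- or Monday-aligned)."""
    offset = (day.weekday() + 1) % 7 if first_weekday == "sunday" else day.weekday()
    return day - timedelta(days=offset)


def throughput(prs, first_weekday="sunday"):
    """prs: list of (created_date, merged_date_or_None) tuples."""
    opened = Counter(week_start(c, first_weekday) for c, _ in prs)
    merged = Counter(week_start(m, first_weekday) for _, m in prs if m is not None)
    weeks_analyzed = len(set(opened) | set(merged))  # weeks with any activity
    if weeks_analyzed == 0:
        return None
    return {
        "weeks_analyzed": weeks_analyzed,
        "avg_opened_per_week": sum(opened.values()) / weeks_analyzed,
        "avg_merged_per_week": sum(merged.values()) / weeks_analyzed,
    }


prs = [
    (date(2024, 3, 3), date(2024, 3, 4)),   # opened on a Sunday
    (date(2024, 3, 5), None),               # still open
    (date(2024, 3, 9), date(2024, 3, 10)),  # merged on the following Sunday
]
main_view = throughput(prs)                 # Sunday-aligned buckets
language_view = throughput(prs, "monday")   # Monday-aligned buckets
```

The PR opened on Sunday 3 March lands in the week labeled 3 March under Sunday alignment but in the week labeled 26 February under Monday alignment, which is why the two views can bucket identical timestamps differently.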
6. Merged Without Review
What it measures: fraction of merged PRs that were merged without any recorded review.
Calculation:
merged_without_review.count = count(pr where merged_at IS NOT NULL AND review_count = 0)
merged_without_review.percentage = 100 × count / total_merged
total_merged is the number of PRs in the run that have a merged_at.
7. Reviewer Participation
What it measures: how many distinct reviewers typically engage on each PR.
Calculation:
- distinct_reviewers per PR is the number of unique non-null reviewer logins on that PR.
- PRs with distinct_reviewers = 0 are excluded from the average.
avg_reviewers_per_pr = mean(distinct_reviewers where distinct_reviewers > 0)
The primary-language breakdown also reports median_reviewers_per_pr (p50) and prs_with_reviewers (count where distinct_reviewers > 0).
8. Reviewer Load
What it measures: how review work is distributed across reviewers.
Calculation: for each distinct reviewer login in the run, count the number of reviews they authored. Reviewers are shown as reviewer login plus count, sorted descending by count. Bot accounts are included as-is (see Bot vs Human Reviews to split them).
9. Review Rounds
What it measures: breakdown of review activity by review state.
Source fields (per PR): approved_count, changes_requested_count, commented_count.
Calculation:
approved = Σ approved_count
changes_requested = Σ changes_requested_count
commented = Σ commented_count
total = approved + changes_requested + commented
avg_per_pr = total / pr_count
pr_count here is the total number of PRs in the run (not only PRs with reviews).
10. Re-review Rate
What it measures: among PRs that received at least one changes-requested review, how often there is some approving review after the first changes-requested review (any approver; not necessarily the same person who requested changes).
Calculation:
prs_with_changes_requested = count(pr where changes_requested_count > 0)
prs_with_rereview_approval = count(pr where changes_requested_count > 0 AND has_rereview_approval = true)
rereview_rate = 100 × prs_with_rereview_approval / prs_with_changes_requested
has_rereview_approval is true when an APPROVED review exists with submitted_at strictly after the earliest CHANGES_REQUESTED review on that PR. null is returned if no PRs had changes requested.
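The rule for has_rereview_approval and the resulting rate can be sketched as follows. The input shape (a list of (state, submitted_at) tuples per PR) and the function names other than has_rereview_approval are assumptions for illustration.

```python
from datetime import datetime


def has_rereview_approval(reviews):
    """True when an APPROVED review is strictly after the earliest CHANGES_REQUESTED."""
    changes = [ts for state, ts in reviews if state == "CHANGES_REQUESTED"]
    if not changes:
        return False
    first_change = min(changes)
    return any(state == "APPROVED" and ts > first_change for state, ts in reviews)


def rereview_rate(prs):
    """prs: list of per-PR review lists. Returns a percentage, or None."""
    with_changes = [r for r in prs if any(s == "CHANGES_REQUESTED" for s, _ in r)]
    if not with_changes:
        return None  # no PR had changes requested
    approved_after = sum(1 for r in with_changes if has_rereview_approval(r))
    return 100 * approved_after / len(with_changes)


def at(hour):
    return datetime(2024, 1, 1, hour)


pr_a = [("CHANGES_REQUESTED", at(9)), ("APPROVED", at(11))]  # approved after changes
pr_b = [("APPROVED", at(8)), ("CHANGES_REQUESTED", at(9))]   # approval came first
pr_c = [("APPROVED", at(8))]                                 # not in the denominator
rate = rereview_rate([pr_a, pr_b, pr_c])
```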
11. PR Size
What it measures: code churn characteristics of PRs in the run.
Source fields (per PR): additions, deletions, changed_files.
Calculation: for each of the three fields independently, compute p50 and p90 over non-null values, plus:
total_churn = sum of all non-null additions + sum of all non-null deletions
sample_size = count(pr where additions IS NOT NULL)
12. First-Commit Metrics
What it measures: how long code lives locally before being opened as a PR, and the relationship between "coding time" and "review/merge time".
Source fields (per PR): time_from_first_commit_to_open_ms, time_from_first_commit_to_merge_ms, commit_count (from GitHub commit metadata when the run is collected).
When data is first stored, durations are computed as:
time_from_first_commit_to_open_ms = max(0, pr_created_at − first_commit_at) when first_commit_at exists
time_from_first_commit_to_merge_ms = max(0, merged_at − first_commit_at) when both exist
Downstream metrics use these stored values. Negative intervals are clamped to zero when stored, rather than dropped from aggregates.
Per-PR ratio:
ratio_open_over_merge = time_from_first_commit_to_open_ms / time_from_first_commit_to_merge_ms
ratio_open_over_merge is only included when both values exist and time_from_first_commit_to_merge_ms > 0. It expresses the share of a PR's total lifetime that was spent before opening the PR — a value near 1.0 means most of the time was coding, near 0.0 means most of the time was review.
Reported statistics:
- sample_size: number of PRs with commit_count present and non-negative
- avg_commits_per_pr: mean of commit_count
- avg_first_commit_to_open_hours: mean of open duration, converted to hours
- avg_first_commit_to_merge_hours: mean of merge duration, converted to hours
- avg_ratio_open_over_merge: mean of the per-PR ratio
Note: these are means (not medians). Outliers will influence them.
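A minimal sketch of the clamping and the per-PR ratio, assuming epoch-millisecond inputs and hypothetical function names:

```python
HOUR_MS = 3_600_000


def first_commit_durations(first_commit_at_ms, pr_created_at_ms, merged_at_ms):
    """Return (to_open_ms, to_merge_ms); negative spans are clamped to zero."""
    if first_commit_at_ms is None:
        return None, None
    to_open = max(0, pr_created_at_ms - first_commit_at_ms)
    to_merge = None if merged_at_ms is None else max(0, merged_at_ms - first_commit_at_ms)
    return to_open, to_merge


def ratio_open_over_merge(to_open_ms, to_merge_ms):
    """Share of the first-commit-to-merge span spent before the PR was opened."""
    if to_open_ms is None or to_merge_ms is None or to_merge_ms <= 0:
        return None
    return to_open_ms / to_merge_ms


# First commit at t=0, PR opened 30 h later, merged 40 h after the first commit.
to_open, to_merge = first_commit_durations(0, 30 * HOUR_MS, 40 * HOUR_MS)
ratio = ratio_open_over_merge(to_open, to_merge)
```

In this example three quarters of the span elapsed before the PR was opened. A commit timestamp later than the PR creation time yields a zero open duration rather than being dropped.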
13. Bot vs Human Reviews
What it measures: share of review activity attributed to bot-like reviewer logins versus the rest, based on login patterns (not GitHub account-type metadata from the API).
Calculation: reviewer logins are classified using heuristics such as a [bot] suffix, dependabot-style names, and patterns like -bot at the end or -bot- in the string. Review counts from the reviewer load breakdown are summed into bot vs human totals and shown as counts and percentages.
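The login heuristics could look roughly like the sketch below. The exact regular expressions, the case-insensitive matching, and the function names are assumptions based on the patterns named above; GitQuick's actual rule set may differ.

```python
import re

# Illustrative patterns only: "[bot]" suffix, dependabot-style names,
# "-bot" at the end, and "-bot-" anywhere in the login.
BOT_PATTERNS = [
    re.compile(r"\[bot\]$"),
    re.compile(r"^dependabot"),
    re.compile(r"-bot$"),
    re.compile(r"-bot-"),
]


def is_bot(login):
    lowered = login.lower()
    return any(pattern.search(lowered) for pattern in BOT_PATTERNS)


def bot_vs_human(review_counts):
    """review_counts: dict of reviewer login -> number of reviews authored."""
    total = sum(review_counts.values())
    if total == 0:
        return None
    bot = sum(n for login, n in review_counts.items() if is_bot(login))
    human = total - bot
    return {
        "bot": bot,
        "human": human,
        "bot_pct": 100 * bot / total,
        "human_pct": 100 * human / total,
    }


split = bot_vs_human({"alice": 6, "dependabot[bot]": 3, "ci-bot": 1})
```

Because classification is pattern-based, a human account whose login happens to match a pattern would be counted as a bot, and a bot with an unusual login would be counted as human.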
14. Merge predictability (tail risk)
What it measures: how heavy the slow tail of merge times is relative to the median, combined with how wide the p90–p50 gap is. The product surfaces this as merge predictability / long-tail risk (not a simple ratio of p90 to median).
Calculation (using p50 and p90 of time-to-merge, in hours):
gap_component = 1 − exp(−(p90 − p50) / τ)
shape_component = 1 − p50 / p90
tail_score = clamp(gap_component × shape_component, 0, ~1)
Default τ = 168 hours (seven days), in the same units as p90 − p50. If data are missing or p50 ≤ 0, p90 ≤ 0, or p90 < p50, no score is shown.
The score is mapped to semantic bands (for example healthy through systemic long-tail risk) for display alongside throughput.
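The formula above translates directly into a few lines of Python. The function name is hypothetical, and the upper clamp is taken as exactly 1.0 here, which is an assumption given the "~1" in the formula; the mapping to semantic bands is not shown because its cut-offs are not documented here.

```python
import math


def tail_score(p50_hours, p90_hours, tau_hours=168.0):
    """Long-tail risk score from time-to-merge p50 and p90; None when not computable."""
    if p50_hours is None or p90_hours is None:
        return None
    if p50_hours <= 0 or p90_hours <= 0 or p90_hours < p50_hours:
        return None
    gap_component = 1 - math.exp(-(p90_hours - p50_hours) / tau_hours)
    shape_component = 1 - p50_hours / p90_hours
    return min(max(gap_component * shape_component, 0.0), 1.0)


# p50 of one day and p90 of eight days: the gap is exactly one tau (168 h).
score = tail_score(24, 192)
```

With these inputs the gap component is 1 − e⁻¹ ≈ 0.632 and the shape component is 0.875, giving a score of about 0.553. When p90 equals p50 both components are zero, so the score is 0.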
15. Primary-Language Breakdown
Repositories are grouped by primary language from each repository’s language metadata (dominant language on the default branch as reported by GitHub).
Per-language metrics use the same definitions as above, but because week boundaries and percentile pipelines differ from the main run view (see the Percentiles note at the top of this reference and §5), figures are not guaranteed to match a manual recomputation from the org-wide numbers.
Caveats and known limitations
- Null handling. Metrics skip PRs that lack required fields for that statistic. A low sample_size is the honest signal that a metric is thinly supported.
- Negative intervals. Latency metrics exclude negative differences between timestamps. First-commit spans are clamped to zero when stored if commit time would imply a negative interval to open or merge.
- Bots in latency metrics. A bot review counts as "first review" for Time to First Review. For a human-focused view, interpret reviewer load using the bot vs human split.
- Discrete percentiles. Reported medians/p90s correspond to values from the finite sample. The main view and primary-language breakdown may differ slightly (see intro).
- Weeks analyzed. PR throughput averages divide by weeks that appear in the activity series, not by the full calendar span of the run window. A run spanning 10 calendar weeks with activity in 6 of them will average over 6 weeks in the main view; the language breakdown builds its own week set per language.
If something here disagrees with what you see in the product, trust the app and let us know so this reference can be corrected.