This page is the implementation-level view of how dupcanon evaluates issues and PRs in v1:
  • what data is collected
  • how it moves through each layer
  • where decisions are made
  • how quality is measured before any mutation

Scope and core principles

This flow applies to both item types:
  • issue
  • pr
with strict type separation (issues are only compared to issues, PRs only to PRs). Core v1 principles:
  1. DB-first, auditable pipeline (state persisted at every major stage)
  2. Semantic retrieval first, LLM second
  3. Deterministic veto gates over model output
  4. Human-gated mutation path (plan-close -> reviewed apply-close --yes)
  5. Precision-first online classification for new items (detect-new)

What data we use

1) Source data from GitHub

For each item we ingest:
  • identity: repo, type, number, url
  • textual content: title, body
  • state + actors: state, author_login, assignees
  • metadata: labels, comment_count, review_comment_count
  • timestamps: created_at_gh, updated_at_gh, closed_at_gh

Issue vs PR differences

  • Issues populate comment_count from their issue comments.
  • PRs also populate comment_count from issue-style comments; review_comment_count is filled in when PR-specific review counts are available from the fetch path.
  • PR online detection only (detect-new) additionally fetches changed files + bounded patch excerpts for judge context.
  • Modeling for retrieval and batch judge defaults to intent cards (derived from title + body, plus PR context when available); raw title/body is used instead when --source raw is selected.

2) Derived modeling data

From item content we derive:
  • content_hash = hash of normalized {type, title, body}
  • content_version = monotonic counter incremented only when title/body content changes
  • embedded_content_hash in embeddings to detect stale vectors

3) Runtime decision data

The system also persists/uses:
  • candidate snapshots and similarity scores (candidate_sets, candidate_set_members)
  • judge outputs + final gate status (judge_decisions)
  • planning/apply state (close_runs, close_run_items)
  • audit run outputs (judge_audit_runs, judge_audit_run_items)

System layers and how data flows

Layer A — Ingestion & normalization (sync, refresh)

Inputs

  • GitHub API (via gh api) for issues/PRs

Processing

  • sync upserts repo metadata in repos (while refresh expects the repo to already exist in DB)
  • sync upserts full item rows in items; refresh discovers new items and optionally refreshes known-item metadata
  • Recompute content_hash on upsert paths that include title/body
  • Increment content_version only on semantic content change (title/body)
  • On content change, mark existing fresh candidate sets for that item as stale

Why this matters

This is what makes downstream retrieval/judging reproducible and freshness-aware.

Layer B — Intent extraction (analyze-intent)

Inputs

  • items rows for selected repo/type

Processing

  • LLM extracts intent cards from title/body (and PR context when available)
  • Upsert into intent_cards with schema/prompt version
  • With --only-changed, items whose intent hash is unchanged are skipped

Why this matters

Intent cards are the default representation for embed, candidates, judge, and detect-new.
If you want raw embeddings, skip analyze-intent and use --source raw downstream.

Layer C — Embedding substrate (embed)

Inputs

  • intent_cards rows for selected repo/type (default) or items when --source raw
  • provider/model config

Text used for embeddings

  • intent-card text by default
  • raw mode uses title + body, normalized and truncated
  • current raw limits in code:
    • title excerpt: 300 chars
    • body excerpt: 7700 chars
    • combined cap: 8000 chars
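A minimal sketch of the raw-mode text assembly, assuming simple whitespace normalization and character-level truncation at the documented caps (the function name and joining format are illustrative, not the actual implementation):

```python
def build_raw_embedding_text(title: str, body: str,
                             title_cap: int = 300,
                             body_cap: int = 7700,
                             total_cap: int = 8000) -> str:
    """Normalize whitespace, excerpt title/body, then enforce the combined cap."""
    title_part = " ".join((title or "").split())[:title_cap]
    body_part = " ".join((body or "").split())[:body_cap]
    combined = f"{title_part}\n\n{body_part}".strip()
    return combined[:total_cap]
```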

Processing

  • Batch embed queued items
  • Upsert into embeddings with:
    • model
    • dim (v1: 3072)
    • vector payload
    • embedded_content_hash

Freshness behavior

  • With --only-changed, rows where embedded_content_hash == content_hash are skipped

Layer D — Retrieval snapshot (candidates)

Inputs

  • source items + intent_embeddings/embeddings
  • retrieval params (k, min_score, include_states)

Candidate query semantics

For each source item:
  • same repo
  • same type (issue vs pr isolated)
  • exclude self
  • state filter (open/closed/all)
  • cosine similarity score >= min_score
  • top k neighbors
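The query semantics above can be sketched as a filter-and-rank over embedded items. In production this runs against the database; the in-memory version below is illustrative only, with assumed dict keys:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve_candidates(source, corpus, k, min_score, include_states):
    """Same repo + same type, exclude self, state filter, score floor, top-k."""
    scored = [
        (cosine(source["vec"], item["vec"]), item)
        for item in corpus
        if item["repo"] == source["repo"]
        and item["type"] == source["type"]          # issue vs pr isolated
        and item["number"] != source["number"]      # exclude self
        and item["state"] in include_states
    ]
    scored = [(s, it) for s, it in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```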

Persistence

For each source item, the command:
  1. marks prior fresh candidate sets stale for that source
  2. creates a new candidate_set row (query params + source content version)
  3. writes ranked members to candidate_set_members
This means every judge run has an explicit retrieval snapshot to audit against.

Layer E — Semantic judgment + deterministic policy (judge)

Inputs

  • latest candidate set per open source item (fresh by default; stale only if --allow-stale)
  • source and candidate title/body context
  • provider/model/thinking settings

Prompted task

Model must return strict JSON:
  • duplicate or not
  • selected candidate number (if duplicate)
  • confidence
  • structured relation fields (relation, root_cause_match, scope_relation, path_match, certainty)

Pre-LLM skip conditions

  • source has no candidates
  • source appears too vague (short/generic/low-signal)
  • source already has an accepted edge and --rejudge is not set

Deterministic acceptance gates

Even if the model says duplicate, an edge is accepted only if all of the following pass:
  1. parse/shape valid JSON
  2. selected target is in candidate set
  3. structural duplicate veto checks pass (relation/root-cause/scope/path/certainty)
  4. bug-vs-feature mismatch veto does not trigger
  5. target is open
  6. confidence >= min_edge (default 0.85)
  7. candidate score gap gate passes:
    • selected_score - best_alternative_score >= 0.015

Persistence

Each evaluated set writes a judge_decisions record with:
  • raw model polarity (model_is_duplicate)
  • final outcome (accepted | rejected | skipped)
  • selected target (if any)
  • confidence + reasoning
  • structured decision fields
  • veto_reason if demoted/rejected/skipped
  • provider/model, run metadata

Edge policy

  • first accepted outgoing edge per source wins by default
  • explicit --rejudge required to supersede

Layer F — Canonical resolution (canonicalize)

Inputs

  • accepted edges from judge_decisions
  • item metadata (state, author, activity, timestamps)
  • maintainer set from GitHub collaborators

Processing

  • build connected components from accepted edges (undirected for clustering)
  • choose canonical per cluster using ordered preference:
    1. if any open item exists, canonical must be open
    2. prefer likely-English content
    3. prefer maintainer-authored item
    4. tie-break by activity, then age, then number
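The ordered preference can be sketched as a single sort key, assuming illustrative field names (likely_english, activity, created_at are placeholders for the real metadata):

```python
def pick_canonical(cluster, maintainers):
    """Ordered preference: open > likely-English > maintainer-authored,
    then more activity, then older, then lowest number. Sketch only."""
    any_open = any(item["state"] == "open" for item in cluster)
    pool = [i for i in cluster if i["state"] == "open"] if any_open else list(cluster)

    def key(item):
        return (
            not item.get("likely_english", True),   # English content first
            item["author"] not in maintainers,      # maintainer-authored first
            -item.get("activity", 0),               # more activity first
            item["created_at"],                     # older first
            item["number"],                         # lowest number breaks ties
        )

    return min(pool, key=key)
```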

Output

  • v1 canonicalize emits stats only (it does not persist a canonical mapping table)
  • same selection logic is reused by plan-close

Layer G — Governed action planning (plan-close)

Inputs

  • accepted edges + confidence
  • canonical selection result per cluster
  • maintainer identities

Per-item action logic (non-canonical nodes)

An item is close-eligible only if:
  1. source item is open
  2. author is known and not maintainer
  3. assignees are known and none is maintainer
  4. edge evidence satisfies the selected target policy:
    • canonical-only (default): direct accepted edge exists from source to canonical
    • direct-fallback: if source->canonical is missing, allow source->direct-accepted-target
  5. selected edge confidence >= min_close (default 0.90)
Otherwise action is skip with explicit reason:
  • not_open
  • uncertain_maintainer_identity
  • maintainer_author
  • maintainer_assignee
  • missing_accepted_edge
  • low_confidence
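The eligibility checks and their skip reasons can be sketched as one ordered function (field names are assumptions; the real implementation reads planning state from the DB):

```python
def plan_item(item, maintainers, edge, min_close=0.90):
    """Returns ("close", None) or ("skip", reason), mirroring the ordered checks.
    edge is the selected accepted edge for this item, or None if missing."""
    if item["state"] != "open":
        return "skip", "not_open"
    if item.get("author") is None or item.get("assignees") is None:
        return "skip", "uncertain_maintainer_identity"
    if item["author"] in maintainers:
        return "skip", "maintainer_author"
    if any(a in maintainers for a in item["assignees"]):
        return "skip", "maintainer_assignee"
    if edge is None:
        return "skip", "missing_accepted_edge"
    if edge["confidence"] < min_close:
        return "skip", "low_confidence"
    return "close", None
```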

Persistence

  • non-dry-run creates close_runs(mode=plan)
  • writes close_run_items with action and skip reasons

Layer H — Controlled mutation (apply-close)

Gate

Must satisfy both:
  • input run exists and is mode=plan
  • explicit --yes

Processing

  • create new close_runs(mode=apply)
  • copy planned rows into apply run
  • execute only action=close rows against GitHub
  • close message template:
    • Closing as duplicate of #{}. If this is incorrect, please contact us.
  • persist API results per item (gh_result, applied_at)
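The gate and copy-then-execute flow can be sketched as follows; close_fn stands in for the gh close call, and all names here are illustrative rather than the actual API:

```python
def apply_close(plan_run, yes_flag, close_fn):
    """Sketch of the plan->apply handoff: gate, copy rows, execute only close actions."""
    if plan_run.get("mode") != "plan" or not yes_flag:
        raise SystemExit("refusing: need a plan-mode run and explicit --yes")
    # copy planned rows into the new apply run
    apply_items = [dict(row) for row in plan_run["items"]]
    for row in apply_items:
        if row["action"] == "close":
            # execute against GitHub and persist the per-item result
            row["gh_result"] = close_fn(row["number"], row["canonical"])
    return {"mode": "apply", "items": apply_items}
```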

Online path for newly opened issues/PRs (detect-new)

This is the single-item classifier used for workflow automation.

Input and persistence behavior

  1. fetch source issue/PR from GitHub
  2. upsert source into items
  3. ensure source intent-card + intent-embedding freshness (extract/embed when stale/missing; fall back to raw embeddings on intent failure)
  4. retrieve open same-type neighbors from intent embeddings by default (k=8, min_score=0.75 defaults)

PR-specific online context

For PRs only, judge context appends bounded diff info:
  • up to 30 changed files
  • per-file patch excerpt cap: 2000 chars
  • total patch excerpt cap: 12000 chars
This PR diff context improves online judgment quality but is not persisted as modeled corpus text.
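The bounding can be sketched as a two-level budget: a per-file excerpt cap applied first, then a running total cap. The tuple shape and truncation order are assumptions:

```python
def bound_pr_context(files, max_files=30, per_file_cap=2000, total_cap=12000):
    """Trim changed files and patch excerpts to the online judge context budget.
    files: list of (path, patch_text) tuples. Sketch only."""
    excerpts = []
    used = 0
    for path, patch in files[:max_files]:
        chunk = (patch or "")[:per_file_cap]          # per-file excerpt cap
        chunk = chunk[: max(0, total_cap - used)]     # remaining total budget
        if chunk:
            excerpts.append((path, chunk))
            used += len(chunk)
    return excerpts
```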

Classification mapping

After judge output and guardrails:
  • duplicate
    • confidence >= duplicate threshold (default 0.92)
    • top retrieval score >= 0.90
    • strict guardrails all pass
  • maybe_duplicate
    • model leans duplicate but strict guardrails fail, or confidence/retrieval support is weaker
  • not_duplicate
    • model says non-duplicate, or evidence too weak

Strict online guardrails (downgrade path)

A duplicate verdict is downgraded to maybe_duplicate when any of the following fails:
  • duplicate veto checks
  • bug/feature mismatch
  • strict structure requirements:
    • relation = same_instance
    • root_cause_match = same
    • scope_relation = same_scope
    • certainty = sure
  • score-gap guardrail (>= 0.015)

Parse-failure fallback behavior

If judge output is invalid:
  • if nearest score is strong (>= max(min_score, maybe_threshold)), return maybe_duplicate
  • else not_duplicate
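Putting the mapping, guardrail downgrade, and parse-failure fallback together, a hedged sketch (maybe_threshold and the argument shapes are assumed names, not the actual API; the exact handling of weak duplicate evidence is simplified):

```python
def classify(judged, top_score, guardrails_pass,
             dup_threshold=0.92, retrieval_floor=0.90,
             min_score=0.75, maybe_threshold=0.75):
    """Map judge output to duplicate / maybe_duplicate / not_duplicate.
    judged is None on parse failure, else {"is_duplicate": bool, "confidence": float}."""
    if judged is None:
        # parse-failure fallback: strong nearest neighbor -> maybe
        if top_score >= max(min_score, maybe_threshold):
            return "maybe_duplicate"
        return "not_duplicate"
    if not judged["is_duplicate"]:
        return "not_duplicate"
    if (guardrails_pass
            and judged["confidence"] >= dup_threshold
            and top_score >= retrieval_floor):
        return "duplicate"
    return "maybe_duplicate"   # leans duplicate but support is weaker
```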

Online persistence constraints (v1)

detect-new persists:
  • repos upsert
  • source items upsert
  • source intent_cards upsert (intent default)
  • source intent_embeddings update (when stale/missing)
  • source embeddings update when --source raw is selected or fallback occurs
It does not persist:
  • candidate set snapshots
  • judge decision rows
  • close actions

How issues and PRs are evaluated differently

| Concern | Issues | PRs |
| --- | --- | --- |
| Corpus type partitioning | issue-only | pr-only |
| Base modeled content | title + body | title + body |
| Extra online context | none | changed files + patch excerpts in detect-new |
| Activity signal in canonical tie-break | comment_count | comment_count + review_comment_count (as stored; review count can be 0 when unavailable from fetch path) |
| Close execution command | gh issue close | gh pr close |
Everything else (retrieval/judge/gate/policy) is intentionally symmetric across types.

Evaluation methodology (quality measurement)

There are two quality loops in v1.

1) Operational run metrics (per command)

Each stage emits counters (discovered/processed/accepted/rejected/skipped/failed etc.), which show throughput and immediate quality posture. Examples:
  • candidates: missing embeddings, stale marked, members written
  • judge: accepted/rejected/skipped classes, invalid responses, veto categories
  • plan-close: close vs skip mix and skip-reason distribution
This is the real-time operational signal.

2) Sampled cheap-vs-strong audit (judge-audit + report-audit)

Sampling policy

  • latest fresh, non-empty candidate sets
  • source state=open
  • random uniform sample with deterministic seed

Two-lane evaluation

  • cheap lane (cost-optimized profile)
  • strong lane (higher-quality reference profile)
  • both lanes run through the same audit gates (vague-source skip, structural vetoes, bug/feature veto, min_edge, score-gap)
  • note: judge-audit is close to, but not identical to, operational judge gating (for example, it does not enforce the target-must-be-open veto)

Outcome classes

  • tp: both accepted, same target
  • fp: cheap accepted, strong not accepted
  • fn: cheap not accepted, strong accepted
  • tn: both not accepted
  • conflict: both accepted, different targets
  • incomplete: skipped/error lane outcome

Core metrics

  • precision = tp / (tp + fp)
  • recall = tp / (tp + fn)
  • conflict count (target disagreement risk)
  • incomplete count (runtime/data quality issues)
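The outcome classification and core metrics follow directly from the definitions above; the sketch models a lane outcome as an (accepted, target) tuple, or None for a skipped/error lane, which is an assumed shape:

```python
def audit_outcome(cheap, strong):
    """Classify one sampled item given cheap-lane and strong-lane outcomes."""
    if cheap is None or strong is None:
        return "incomplete"
    c_acc, c_target = cheap
    s_acc, s_target = strong
    if c_acc and s_acc:
        return "tp" if c_target == s_target else "conflict"
    if c_acc and not s_acc:
        return "fp"
    if not c_acc and s_acc:
        return "fn"
    return "tn"


def metrics(counts):
    """precision = tp/(tp+fp), recall = tp/(tp+fn); None when undefined."""
    tp, fp, fn = counts.get("tp", 0), counts.get("fp", 0), counts.get("fn", 0)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall
```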
report-audit also supports non-LLM gate simulation to estimate metric trade-offs when tightening:
  • rank constraints
  • score minimum
  • gap minimum

Why this flow is safe by design

  1. Type-isolated retrieval prevents issue/PR cross-contamination.
  2. State snapshots (candidate_sets) make judging reproducible.
  3. LLM output is advisory, not authoritative.
  4. Deterministic vetoes block high-confidence but risky decisions.
  5. Plan/apply split ensures human review before mutation.
  6. Per-item audit rows provide post-hoc explainability and threshold tuning.

Practical end-to-end sequence

# 1) keep corpus fresh
uv run dupcanon refresh --repo <org/repo> --refresh-known
uv run dupcanon analyze-intent --repo <org/repo> --type issue --only-changed
uv run dupcanon analyze-intent --repo <org/repo> --type pr --only-changed
uv run dupcanon embed --repo <org/repo> --type issue --only-changed
uv run dupcanon embed --repo <org/repo> --type pr --only-changed

# 2) build retrieval snapshots
uv run dupcanon candidates --repo <org/repo> --type issue --include open
uv run dupcanon candidates --repo <org/repo> --type pr --include open

# 3) semantic judgment with policy gates
uv run dupcanon judge --repo <org/repo> --type issue
uv run dupcanon judge --repo <org/repo> --type pr

# 4) planning (safe)
uv run dupcanon plan-close --repo <org/repo> --type issue --dry-run
uv run dupcanon plan-close --repo <org/repo> --type pr --dry-run

# 5) optional audit loop
uv run dupcanon judge-audit --repo <org/repo> --type issue --sample-size 100 --seed 42
uv run dupcanon report-audit --run-id <id> --simulate-gates --gate-gap-min 0.02

  • /architecture — command/state architecture map
  • /get-started — setup and first run
  • docs/internal/operator_runbook_v1.md — deeper operator playbook
  • docs/internal/online_duplicate_detection_pipeline_design_doc_v1.md — online-specific design details