This page is the implementation-level view of how dupcanon evaluates issues and PRs in v1:
- what data is collected
- how it moves through each layer
- where decisions are made
- how quality is measured before any mutation
Scope and core principles
This flow applies to both item types: `issue` and `pr`.
- DB-first, auditable pipeline (state persisted at every major stage)
- Semantic retrieval first, LLM second
- Deterministic veto gates over model output
- Human-gated mutation path (`plan-close` -> reviewed `apply-close --yes`)
- Precision-first online classification for new items (`detect-new`)
What data we use
1) Source data from GitHub
For each item we ingest:
- identity: `repo`, `type`, `number`, `url`
- textual content: `title`, `body`
- state + actors: `state`, `author_login`, `assignees`
- metadata: `labels`, `comment_count`, `review_comment_count`
- timestamps: `created_at_gh`, `updated_at_gh`, `closed_at_gh`
Issue vs PR differences
- Issues use `comments` as `comment_count`.
- PRs use `comments` as `comment_count`; `review_comment_count` is populated when PR-specific review counts are available from the fetch path.
- PR online detection only (`detect-new`) additionally fetches changed files + bounded patch excerpts for judge context.
- Modeling for retrieval and batch judge now defaults to intent cards (derived from title + body, plus PR context when available). Raw title/body remains the fallback when `--source raw` is selected.
2) Derived modeling data
From item content we derive:
- `content_hash` = hash of normalized `{type, title, body}`
- `content_version` = monotonic counter, incremented only when title/body content changes
- `embedded_content_hash` in `embeddings` to detect stale vectors
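A minimal sketch of this derivation (the hashing scheme and function names here are illustrative assumptions, not the actual implementation):

```python
import hashlib
import json

def content_hash(item_type: str, title: str, body: str) -> str:
    """Hash the normalized {type, title, body} payload (illustrative sketch)."""
    normalized = {
        "type": item_type,
        "title": (title or "").strip(),
        "body": (body or "").strip(),
    }
    # Canonical JSON serialization so the hash is stable across runs.
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def bump_version(old_hash: str, new_hash: str, version: int) -> int:
    """content_version increments only when title/body content actually changes."""
    return version + 1 if new_hash != old_hash else version
```

Because the hash covers only semantic content, metadata-only refreshes (labels, comment counts) leave both `content_hash` and `content_version` untouched.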
3) Runtime decision data
The system also persists/uses:
- candidate snapshots and similarity scores (`candidate_sets`, `candidate_set_members`)
- judge outputs + final gate status (`judge_decisions`)
- planning/apply state (`close_runs`, `close_run_items`)
- audit run outputs (`judge_audit_runs`, `judge_audit_run_items`)
System layers and how data flows
Layer A — Ingestion & normalization (sync, refresh)
Inputs
- GitHub API (via `gh api`) for issues/PRs
Processing
- `sync` upserts repo metadata in `repos` (while `refresh` expects the repo to already exist in DB)
- `sync` upserts full item rows in `items`; `refresh` discovers new items and optionally refreshes known-item metadata
- Recompute `content_hash` on upsert paths that include title/body
- Increment `content_version` only on semantic content change (title/body)
- On content change, mark existing fresh candidate sets for that item as `stale`
Why this matters
This is what makes downstream retrieval/judging reproducible and freshness-aware.
Layer B — Intent extraction (analyze-intent)
Inputs
- `items` rows for selected repo/type
Processing
- LLM extracts intent cards from title/body (and PR context when available)
- Upsert into `intent_cards` with schema/prompt version
- With `--only-changed`, items whose intent hash is unchanged are skipped
Why this matters
Intent cards are the default representation for `embed`, `candidates`, `judge`, and `detect-new`.
If you want raw embeddings, skip `analyze-intent` and use `--source raw` downstream.
Layer C — Embedding substrate (embed)
Inputs
- `intent_cards` rows for selected repo/type (default), or `items` when `--source raw`
- provider/model config
Text used for embeddings
- intent-card text by default
- raw mode uses title + body, normalized and truncated
- current raw limits in code:
  - title excerpt: 300 chars
  - body excerpt: 7700 chars
  - combined cap: 8000 chars
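Under those limits, the raw-mode text assembly can be sketched as follows (function name and whitespace normalization are assumptions; the caps come from the list above):

```python
def raw_embedding_text(title: str, body: str) -> str:
    """Build raw-mode embedding text with per-field and combined caps (sketch)."""
    TITLE_CAP, BODY_CAP, TOTAL_CAP = 300, 7700, 8000
    # Collapse runs of whitespace, then truncate each field independently.
    title_part = " ".join((title or "").split())[:TITLE_CAP]
    body_part = " ".join((body or "").split())[:BODY_CAP]
    combined = f"{title_part}\n{body_part}" if body_part else title_part
    # Combined cap is enforced last, so very long titles cannot starve the body.
    return combined[:TOTAL_CAP]
```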
Processing
- Batch embed queued items
- Upsert into `embeddings` with:
  - `model`
  - `dim` (v1: 3072)
  - vector payload
  - `embedded_content_hash`
Freshness behavior
- With `--only-changed`, rows where `embedded_content_hash == content_hash` are skipped
Layer D — Retrieval snapshot (candidates)
Inputs
- source items + `intent_embeddings`/`embeddings`
- retrieval params (`k`, `min_score`, `include_states`)
Candidate query semantics
For each source item:
- same repo
- same type (`issue` vs `pr` isolated)
- exclude self
- state filter (`open`/`closed`/`all`)
- cosine similarity score >= `min_score`
- top `k` neighbors
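The same filter-then-rank semantics can be sketched in Python (a plain-list scan for illustration only; the real system queries stored embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def candidates(source, corpus, k, min_score, include_states):
    """Apply the retrieval filters above, then rank and truncate to top-k."""
    scored = []
    for item in corpus:
        if item["repo"] != source["repo"]:
            continue  # same repo only
        if item["type"] != source["type"]:
            continue  # issue vs pr corpora are isolated
        if item["number"] == source["number"]:
            continue  # exclude self
        if include_states != "all" and item["state"] != include_states:
            continue  # open/closed/all state filter
        score = cosine(source["vec"], item["vec"])
        if score >= min_score:
            scored.append((score, item["number"]))
    scored.sort(reverse=True)
    return scored[:k]
```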
Persistence
For each source item, the command:
- marks prior fresh candidate sets stale for that source
- creates a new `candidate_sets` row (query params + source content version)
- writes ranked members to `candidate_set_members`
Layer E — Semantic judgment + deterministic policy (judge)
Inputs
- latest candidate set per open source item (fresh by default; stale only if `--allow-stale`)
- source and candidate title/body context
- provider/model/thinking settings
Prompted task
The model must return strict JSON with:
- duplicate or not
- selected candidate number (if duplicate)
- confidence
- structured relation fields (`relation`, `root_cause_match`, `scope_relation`, `path_match`, `certainty`)
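An illustrative shape for that strict-JSON contract, with a defensive parser. The top-level field names (`is_duplicate`, `duplicate_of`) are assumptions; the relation fields follow the list above:

```python
import json

def parse_judge_output(raw: str):
    """Return the parsed decision dict, or None when the shape is invalid (sketch)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    required = {"is_duplicate", "confidence", "relation",
                "root_cause_match", "scope_relation", "path_match", "certainty"}
    if not required.issubset(data):
        return None  # missing structured fields -> invalid shape
    if data["is_duplicate"] and "duplicate_of" not in data:
        return None  # a duplicate verdict must name a selected candidate number
    return data
```

Returning `None` instead of raising lets the caller route invalid output to the parse-failure fallback path rather than crashing the run.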
Pre-LLM skip conditions
- source has no candidates
- source appears too vague (short/generic/low-signal)
- source already has accepted edge and
--rejudgeis not set
Deterministic acceptance gates
Even if the model says duplicate, the edge is accepted only if all of these pass:
- parse/shape valid JSON
- selected target is in candidate set
- structural duplicate veto checks pass (relation/root-cause/scope/path/certainty)
- bug-vs-feature mismatch veto does not trigger
- target is open
- `confidence >= min_edge` (default 0.85)
- candidate score gap gate passes: `selected_score - best_alternative_score >= 0.015`
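The confidence and score-gap gates can be sketched as a small function (structural vetoes and the open-target check are elided for brevity; names are illustrative):

```python
def accept_edge(decision, candidate_scores, min_edge=0.85, min_gap=0.015):
    """Deterministic gates applied after the model says 'duplicate' (sketch).

    candidate_scores maps candidate number -> retrieval similarity score.
    Returns (accepted, veto_reason).
    """
    target = decision.get("duplicate_of")
    if target not in candidate_scores:
        return False, "target_not_in_candidate_set"
    if decision["confidence"] < min_edge:
        return False, "low_confidence"
    # Score-gap gate: the selected candidate must clearly beat the runner-up.
    others = [s for n, s in candidate_scores.items() if n != target]
    best_alt = max(others, default=float("-inf"))
    if candidate_scores[target] - best_alt < min_gap:
        return False, "score_gap"
    return True, None
```

The gap gate rejects model picks that retrieval cannot distinguish from a near-tied alternative, which is exactly the case where target selection is least trustworthy.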
Persistence
Each evaluated set writes a `judge_decisions` record with:
- raw model polarity (`model_is_duplicate`)
- final outcome (`accepted`|`rejected`|`skipped`)
- selected target (if any)
- confidence + reasoning
- structured decision fields
- `veto_reason` if demoted/rejected/skipped
- provider/model, run metadata
Edge policy
- first accepted outgoing edge per source wins by default
- explicit `--rejudge` required to supersede
Layer F — Canonical resolution (canonicalize)
Inputs
- accepted edges from `judge_decisions`
- item metadata (`state`, `author`, activity, timestamps)
- maintainer set from GitHub collaborators
Processing
- build connected components from accepted edges (undirected for clustering)
- choose canonical per cluster using ordered preference:
  - if any open item exists, canonical must be open
  - prefer likely-English content
  - prefer maintainer-authored item
  - tie-break by activity, then age, then number
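One way to express the ordered preference is a composite sort key (a sketch; field names such as `is_english` and `activity` are assumed precomputed inputs):

```python
def pick_canonical(cluster):
    """Order the cluster by the preference list above and take the first (sketch)."""
    def key(item):
        return (
            item["state"] != "open",        # open items sort first
            not item["is_english"],          # then likely-English content
            not item["maintainer_author"],   # then maintainer-authored items
            -item["activity"],               # then most activity
            item["created_at"],              # then oldest
            item["number"],                  # then lowest number
        )
    return min(cluster, key=key)
```

Tuple comparison in Python evaluates the criteria left to right, so each later criterion only breaks ties left by the earlier ones, mirroring the ordered preference list.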
Output
- v1 `canonicalize` emits stats only (does not persist a canonical mapping table)
- the same selection logic is reused by `plan-close`
Layer G — Governed action planning (plan-close)
Inputs
- accepted edges + confidence
- canonical selection result per cluster
- maintainer identities
Per-item action logic (non-canonical nodes)
An item is close-eligible only if:
- source item is open
- author is known and not a maintainer
- assignees are known and none is a maintainer
- edge evidence satisfies the selected target policy:
  - `canonical-only` (default): a direct accepted edge exists from source to canonical
  - `direct-fallback`: if source->canonical is missing, allow source->direct-accepted-target
- selected edge confidence >= `min_close` (default 0.90)
Otherwise the item is skipped with an explicit reason:
- `not_open`
- `uncertain_maintainer_identity`
- `maintainer_author`
- `maintainer_assignee`
- `missing_accepted_edge`
- `low_confidence`
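The eligibility checks and their skip reasons map naturally onto a single decision function (a sketch under assumed input shapes, not the actual implementation):

```python
def plan_action(item, edge_conf, has_accepted_edge, maintainers, min_close=0.90):
    """Mirror the close-eligibility checks above.

    Returns ("close", None) or ("skip", reason). item["author"] is None and
    item["assignees"] is None stand in for unresolvable identities.
    """
    if item["state"] != "open":
        return "skip", "not_open"
    if item["author"] is None or item["assignees"] is None:
        return "skip", "uncertain_maintainer_identity"
    if item["author"] in maintainers:
        return "skip", "maintainer_author"
    if any(a in maintainers for a in item["assignees"]):
        return "skip", "maintainer_assignee"
    if not has_accepted_edge:
        return "skip", "missing_accepted_edge"
    if edge_conf < min_close:
        return "skip", "low_confidence"
    return "close", None
```

Checks are ordered from cheapest to most evidence-dependent, so each skip reason names the first gate that failed.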
Persistence
- non-dry-run creates `close_runs` (mode=plan)
- writes `close_run_items` with action and skip reasons
Layer H — Controlled mutation (apply-close)
Gate
Must satisfy both:
- input run exists and is `mode=plan`
- explicit `--yes`
Processing
- create a new `close_runs` row (mode=apply)
- copy planned rows into the apply run
- execute only `action=close` rows against GitHub
- close message template: `Closing as duplicate of #{}. If this is incorrect, please contact us.`
- persist API results per item (`gh_result`, `applied_at`)
Online path for newly opened issues/PRs (detect-new)
This is the single-item classifier used for workflow automation.
Input and persistence behavior
- fetch source issue/PR from GitHub
- upsert source into `items`
- ensure source intent-card + intent-embedding freshness (extract/embed when stale/missing; fall back to raw embeddings on intent failure)
- retrieve open same-type neighbors from intent embeddings by default (`k=8`, `min_score=0.75` defaults)
PR-specific online context
For PRs only, judge context appends bounded diff info:
- up to 30 changed files
- per-file patch excerpt cap: 2000 chars
- total patch excerpt cap: 12000 chars
Classification mapping
After judge output and guardrails:
- `duplicate`:
  - confidence >= duplicate threshold (default 0.92)
  - top retrieval score >= 0.90
  - strict guardrails all pass
- `maybe_duplicate`: model leans duplicate but strict guardrails fail, or confidence/retrieval support is weaker
- `not_duplicate`: model says non-duplicate, or evidence too weak
Strict online guardrails (downgrade path)
A duplicate is downgraded to maybe when any of these fails:
- duplicate veto checks
- bug/feature mismatch veto
- strict structure requirements:
  - `relation = same_instance`
  - `root_cause_match = same`
  - `scope_relation = same_scope`
  - `certainty = sure`
- score-gap guardrail (`>= 0.015`)
Parse-failure fallback behavior
If judge output is invalid:
- if the nearest score is strong (`>= max(min_score, maybe_threshold)`), return `maybe_duplicate`
- else return `not_duplicate`
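Putting the mapping, guardrails, and parse-failure fallback together (a sketch; the parameter names and defaults are taken from the thresholds listed above, with `maybe_threshold` an assumed name for the fallback cutoff):

```python
def classify(model_says_dup, confidence, top_score, guardrails_pass, parsed_ok,
             dup_threshold=0.92, score_floor=0.90,
             min_score=0.75, maybe_threshold=0.85):
    """Map judge output + guardrails to the three online labels (sketch)."""
    if not parsed_ok:
        # Parse-failure fallback: decide from retrieval strength alone.
        if top_score >= max(min_score, maybe_threshold):
            return "maybe_duplicate"
        return "not_duplicate"
    if model_says_dup:
        if (confidence >= dup_threshold
                and top_score >= score_floor
                and guardrails_pass):
            return "duplicate"
        # Model leans duplicate but support is weaker or a guardrail failed.
        return "maybe_duplicate"
    return "not_duplicate"
```

Note the asymmetry: a guardrail failure can only downgrade `duplicate` to `maybe_duplicate`, never flip a non-duplicate verdict, which is what makes the online path precision-first.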
Online persistence constraints (v1)
`detect-new` persists:
- `repos` upsert
- source `items` upsert
- source `intent_cards` upsert (intent default)
- source `intent_embeddings` update (when stale/missing)
- source `embeddings` update when `--source raw` is selected or fallback occurs
- candidate set snapshots
- judge decision rows
- close actions
How issues and PRs are evaluated differently
| Concern | Issues | PRs |
|---|---|---|
| Corpus type partitioning | issue-only | pr-only |
| Base modeled content | title + body | title + body |
| Extra online context | none | changed files + patch excerpts in detect-new |
| Activity signal in canonical tie-break | comment_count | comment_count + review_comment_count (as stored; review count can be 0 when unavailable from fetch path) |
| Close execution command | gh issue close | gh pr close |
Evaluation methodology (quality measurement)
There are two quality loops in v1.
1) Operational run metrics (per command)
Each stage emits counters (discovered/processed/accepted/rejected/skipped/failed, etc.), which show throughput and immediate quality posture. Examples:
- `candidates`: missing embeddings, stale marked, members written
- `judge`: accepted/rejected/skipped classes, invalid responses, veto categories
- `plan-close`: close vs skip mix and skip-reason distribution
2) Sampled cheap-vs-strong audit (judge-audit + report-audit)
Sampling policy
- latest fresh, non-empty candidate sets
- source state=open
- random uniform sample with deterministic seed
Two-lane evaluation
- cheap lane (cost-optimized profile)
- strong lane (higher-quality reference profile)
- both lanes run through the same audit gates (vague-source skip, structural vetoes, bug/feature veto, `min_edge`, score-gap)
- note: `judge-audit` is close to, but not identical with, operational `judge` gating (for example, it does not enforce the "target must be open" veto)
Outcome classes
- tp: both accepted, same target
- fp: cheap accepted, strong not accepted
- fn: cheap not accepted, strong accepted
- tn: both not accepted
- conflict: both accepted, different targets
- incomplete: skipped/error lane outcome
Core metrics
- `precision = tp / (tp + fp)`
- `recall = tp / (tp + fn)`
- conflict count (target disagreement risk)
- incomplete count (runtime/data quality issues)
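Computed from the outcome classes above, the core metrics reduce to a few counts (illustrative sketch):

```python
def audit_metrics(outcomes):
    """Compute the core audit metrics from a list of per-sample outcome labels."""
    tp = outcomes.count("tp")
    fp = outcomes.count("fp")
    fn = outcomes.count("fn")
    return {
        # None when a denominator is zero (no accepted/strong-accepted samples).
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "conflicts": outcomes.count("conflict"),
        "incomplete": outcomes.count("incomplete"),
    }
```

Conflicts sit outside precision/recall because both lanes accepted an edge; the disagreement is about the target, which is a distinct risk to track.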
report-audit also supports non-LLM gate simulation to estimate metric trade-offs when tightening:
- rank constraints
- score minimum
- gap minimum
Why this flow is safe by design
- Type-isolated retrieval prevents issue/PR cross-contamination.
- State snapshots (`candidate_sets`) make judging reproducible.
- LLM output is advisory, not authoritative.
- Deterministic vetoes block high-confidence but risky decisions.
- Plan/apply split ensures human review before mutation.
- Per-item audit rows provide post-hoc explainability and threshold tuning.
Practical end-to-end sequence
Related docs
- `/architecture` — command/state architecture map
- `/get-started` — setup and first run
- `docs/internal/operator_runbook_v1.md` — deeper operator playbook
- `docs/internal/online_duplicate_detection_pipeline_design_doc_v1.md` — online-specific design details