Graph Audit

Sweeps the graph looking for drift and violations at scales Node Validate cannot reach. Node Validate checks one node against its Form Contract and reports findings local to that node; Graph Audit walks the corpus, aggregates findings across files, and reports patterns — vocabularies accreting without documentation, ghost links accumulating into planning debt, edges that never got their annotation, predicates that forbid themselves appearing at scale.

The skill's output is not a comprehensive list of every small violation. It is a compressed audit report grouped by category, naming the scale of each finding and the remediation path. A scion author reading the report decides which categories warrant a cleanup pass now, which deserve their own follow-up work, and which are acceptable drift for the current stage.

The audit is intentionally read-only. Graph Audit does not fix violations, does not delete stale nodes, and does not propose edits. Fixing is the work that follows the audit — usually direct editing, sometimes Node Validate per flagged node, sometimes Predicate Propose for a vocabulary gap the audit surfaced.

Steps

Step 1: Scope the audit

Ask the user the scope unless one is already named. Common scopes:

Report the scope back to the user before running the sweep. A broad scope costs more context; narrow scopes are often what the user actually wanted.

Step 2: Mechanical sweep

Run mechanical checks with rg via Bash. These are fast, deterministic, and catch clear violations.

Filename rules — em-dashes and en-dashes forbidden in filenames:

find nodes -name '*—*' -o -name '*–*' 2>/dev/null

Forbidden predicate — relates_to:: is prohibited per No Generic relates_to Predicate. Anchor the pattern to ^- so prose discussions (which frequently mention the forbidden predicate in backticks) are excluded:

rg -n -I -- '^- relates_to::' nodes/
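To see why the ^- anchor matters, here is a minimal Python sketch of the same pattern applied to an in-memory string. The helper name forbidden_edge_lines and the sample text are illustrative assumptions, not part of the skill.

```python
import re
from typing import List

# Same idea as the rg pattern: only a top-level bullet at column 0 counts.
FORBIDDEN_EDGE = re.compile(r"^- relates_to::", re.MULTILINE)


def forbidden_edge_lines(text: str) -> List[int]:
    """Return 1-based line numbers where a relates_to:: edge bullet starts."""
    return [text.count("\n", 0, m.start()) + 1 for m in FORBIDDEN_EDGE.finditer(text)]


sample = (
    "Prose that mentions `relates_to::` in backticks is fine.\n"
    "- relates_to::[[Some Node]]\n"
    "  - an annotation that quotes relates_to:: is also fine\n"
)
print(forbidden_edge_lines(sample))
```

Only the second line of the sample is a real edge bullet; the prose mention and the indented annotation are ignored.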

Identity block presence — every .md node under nodes/ should have a conforms_to:: line. Use find -print0 | while IFS= read -r -d '' so filenames with spaces (Gloss and Predicate filenames use -- separators) survive the loop intact:

find nodes -type f -name '*.md' -print0 | while IFS= read -r -d '' file; do
  grep -q '^- conforms_to::' "$file" || echo "missing conforms_to: $file"
done

YAML basics — per Markdown Node Contract, every node SHOULD carry tagline: in its YAML frontmatter; the build pipeline surfaces it as the row summary on each taxonomy's index page, and a missing tagline renders the row silent. brief_summary: is genuinely optional; report presence as an informational stat rather than a Shortfall.

find nodes -type f -name '*.md' -print0 | while IFS= read -r -d '' file; do
  awk '/^---$/{c++; next} c==1 && /^tagline:/{print "yes"; exit} c>=2{exit}' "$file" \
    | grep -q yes || echo "missing tagline: $file"
done

Report missing-tagline hits as Shortfall (SHOULD violation, not Violation). Form-specific Contracts MAY strengthen tagline: to MUST when the form's role makes the absence load-bearing — Contracts and Skills are the canonical cases; for those forms, missing-tagline becomes Violation.
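The severity rule can be sketched in Python. This is a hedged illustration: the form labels "Contract" and "Skill", the function names, and the idea of passing the form in as a string are assumptions; the skill only says that Contracts and Skills are the canonical forms that strengthen tagline: to MUST.

```python
from typing import Optional

# Assumption: these labels stand in for the forms whose Contracts make tagline: a MUST.
MUST_TAGLINE_FORMS = {"Contract", "Skill"}


def has_tagline(text: str) -> bool:
    """Mirror the awk check: look for tagline: inside the first frontmatter block."""
    fences = 0
    for line in text.splitlines():
        if line == "---":
            fences += 1
            if fences >= 2:
                return False  # frontmatter closed without a tagline
            continue
        if fences == 1 and line.startswith("tagline:"):
            return True
    return False


def tagline_finding(form: str, text: str) -> Optional[str]:
    """Return 'Violation', 'Shortfall', or None when the tagline is present."""
    if has_tagline(text):
        return None
    return "Violation" if form in MUST_TAGLINE_FORMS else "Shortfall"


with_tagline = "---\ntagline: A short summary\n---\nBody\n"
without_tagline = "---\ntitle: X\n---\ntagline: not in frontmatter\n"
print(tagline_finding("Decision", without_tagline))
print(tagline_finding("Skill", without_tagline))
```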

Count brief_summary: presence per taxonomy as an informational stat — useful for noticing when a taxonomy's nodes have drifted away from the form's typical body shape (e.g., Decision nodes typically benefit from brief_summary: because their bodies are long; if many Decisions lack it, the form's authoring habit may have drifted). The audit does not flag absence as a finding.

Report any hits as Violations or Shortfalls per the categories above.

Step 2.5: Currency drift candidates

Currency cannot be checked mechanically — whether a tagline still describes the node's current claim is semantic judgment, not regex work. Word-overlap heuristics between tagline and H1 produce mostly false positives because taglines describe what the H1 names using deliberately complementary vocabulary; a Form Contract whose H1 is "Decision Form Contract" will have a tagline that talks about commitments, choices, and alternatives without the word "decision," and that is the tagline doing its job.

What the audit can flag mechanically is staleness relative to body edits — a tagline that hasn't been touched in months while the body has been substantially rewritten. The signal is direct: if the body's framing has shifted, the tagline that hasn't moved with it is a candidate for review.

Tagline staleness via git-blame — find the timestamp of the last edit to each node's tagline: line and compare against the timestamp of the node's most recent body-affecting commit. A gap larger than N months (start at 3 months) is a candidate.

# Implementation pattern: per-file line history (git log -L) on the tagline line,
# compared to the file's last commit. Threshold-based candidate flagging.
find nodes -type f -name '*.md' -print0 | while IFS= read -r -d '' file; do
  tagline_line=$(awk '/^---$/{c++; next} c==1 && /^tagline:/{print NR; exit}' "$file")
  [ -z "$tagline_line" ] && continue
  # git log -L names the file itself and rejects an extra pathspec, so there is
  # no trailing -- "$file" here; with one, the command fails and nothing is flagged.
  tagline_date=$(git log -L "${tagline_line},${tagline_line}:$file" --format='%cs' -n 1 2>/dev/null | head -1)
  body_date=$(git log -1 --format='%cs' -- "$file" 2>/dev/null)
  # Flag when body_date - tagline_date exceeds threshold (date math via Python).
  # Values are passed as arguments so quotes in filenames cannot break the script.
  python3 - "$tagline_date" "$body_date" "$file" <<'PY' 2>/dev/null
from datetime import date
import sys
t, b, path = sys.argv[1:4]
if t and b and t < b:
    days = (date.fromisoformat(b) - date.fromisoformat(t)).days
    if days > 90:
        print(f'  stale {days}d: {path} (tagline {t}, body {b})')
PY
done

Stale decided_on:: for Decisions — a Decision's decided_on:: date is its commitment timestamp. When the body has been substantially edited months later, the rendered Decision may carry framing that no longer matches what was decided. Flag when most-recent-body-commit minus decided_on:: exceeds N months (start at 6 months for Decisions, since their bodies legitimately evolve).
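The threshold comparison for Decisions might look like the following Python sketch. It assumes both dates are already available as ISO strings (YYYY-MM-DD); how the decided_on:: value is written in a node and how the last body commit date is obtained are not specified here, and the function names are hypothetical.

```python
from datetime import date


def whole_months_between(earlier: date, later: date) -> int:
    """Count completed calendar months from earlier to later."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not complete yet
    return months


def is_stale_decision(decided_on: str, last_body_commit: str, threshold_months: int = 6) -> bool:
    """Flag when the body was edited more than threshold_months after decided_on."""
    gap = whole_months_between(
        date.fromisoformat(decided_on), date.fromisoformat(last_body_commit)
    )
    return gap > threshold_months


print(is_stale_decision("2024-01-15", "2024-09-01"))
print(is_stale_decision("2024-01-15", "2024-07-15"))
```

A flagged Decision is still only a candidate for review, as with tagline staleness.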

Report any flagged candidates as a separate "Currency drift candidates" section in Step 8's aggregate report — distinct from Violations and Shortfalls, and distinct from automated findings. Each candidate is "consider reviewing this node's tagline / brief_summary"; the scion author decides whether each is real drift, acceptable evolution, or evidence the body itself has moved past what the surrounding metadata still claims. The audit does NOT classify these as failures.

If a more sophisticated semantic check is desired, that work belongs in /node-validate per node (where the full Form Contract context applies) or in a domain-aware reading pass, not in this graph-scope sweep. Word-overlap, sentiment analysis, or LLM-classifier approaches at graph scale produce noise too high to act on.

Step 3: Vocabulary audit

List every predicate in use across the graph, count occurrences, and identify which have backing Predicate nodes:

rg -o -I -- '^- [a-z_]+::' nodes/ | sort | uniq -c | sort -rn

Cross-reference against the backing nodes:

ls nodes/Predicates/

Split the vocabulary into three tiers:

Report provisional predicates as candidates for /predicate-propose work — each is either drift to consolidate or vocabulary to codify.
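The cross-reference between predicates in use and predicates with backing nodes can be sketched in Python. This is an illustration under assumptions: the set of backed predicate names is passed in directly, because the mapping from files under nodes/Predicates/ to predicate names is not spelled out here, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Same shape as the rg pattern: a top-level bullet that opens with predicate::
EDGE = re.compile(r"^- ([a-z_]+)::", re.MULTILINE)


def predicate_counts(texts: Iterable[str]) -> Counter:
    """Count every predicate used as a top-level edge bullet."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(EDGE.findall(text))
    return counts


def unbacked_predicates(counts: Counter, backed: Set[str]) -> List[Tuple[str, int]]:
    """Predicates in use with no backing node, most frequent first."""
    return [(name, n) for name, n in counts.most_common() if name not in backed]


texts = [
    "- grounded_in::[[A]]\n- informs_downstream::[[B]]\nprose grounded_in:: mention\n",
    "- grounded_in::[[C]]\n- see_also::[[D]]\n",
]
counts = predicate_counts(texts)
print(unbacked_predicates(counts, {"grounded_in", "informs_downstream"}))
```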

Step 4: Ghost-link inventory

Extract every wikilink target and compare against the set of existing files. The difference is the ghost-link list.

rg -o -I -r '$1' -- '\[\[([^|\]]+)(?:\|[^\]]*)?\]\]' nodes/ | sort -u > /tmp/graph-audit-targets.txt
find nodes -name '*.md' -type f -exec basename {} .md \; | sort -u > /tmp/graph-audit-files.txt

Each target in the targets file that does not match a file stem (or the concept side of a " -- "-suffixed file) is a candidate ghost link.
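The comparison itself is a set difference. A minimal Python sketch, assuming node texts and file stems are already loaded into memory and that the concept side of a stem is everything before the first " -- " separator; the names are illustrative:

```python
import re
from typing import Iterable, List, Set

# Capture the link target only; an optional |alias part is skipped.
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]")


def concept_of(stem: str) -> str:
    """Return the concept side of a ' -- '-suffixed file stem."""
    return stem.split(" -- ", 1)[0]


def ghost_candidates(texts: Iterable[str], stems: Set[str]) -> List[str]:
    """Wikilink targets that match neither a file stem nor a concept side."""
    known = set(stems) | {concept_of(s) for s in stems}
    targets = {t.strip() for text in texts for t in WIKILINK.findall(text)}
    return sorted(targets - known)


texts = ["[[Alpha]] and [[Beta|the beta]] and [[Gamma]]", "[[Delta]]"]
stems = {"Alpha", "Beta -- a gloss"}
print(ghost_candidates(texts, stems))
```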

Filter out template-token false positives before reporting. Contract bodies and skill bodies use [[X]], [[<Domain>]], [[Target]], [[Editor]], [[Principal]], [[Downstream Node]], [[X Form Contract]], [[<placeholder>]], and similar as syntactic placeholders in examples — these are documentation shapes, not intended wikilinks. A candidate whose target is a single uppercase letter, contains < or >, matches X Form Contract, or appears inside backtick-fenced content in its source is a false positive. Filter them out before producing the ghost-link list, or name them as "template-token false positives" in a separate bucket so the scion author knows to skip them.
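The filter rules above can be sketched in Python. The three lexical rules are taken from the text; stripping backtick content before extracting links is one assumed way to apply the fourth rule, and the function names are hypothetical. Placeholders such as [[Target]] or [[Editor]] are only caught when they sit inside backticks, since nothing in their spelling marks them as templates.

```python
import re
from typing import List

WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]")
FENCED = re.compile(r"```.*?```", re.DOTALL)
INLINE = re.compile(r"`[^`\n]*`")


def strip_code(text: str) -> str:
    """Drop fenced blocks first, then inline backtick spans."""
    return INLINE.sub("", FENCED.sub("", text))


def is_template_token(target: str) -> bool:
    """Single uppercase letter, angle brackets, or the literal X Form Contract."""
    return (
        (len(target) == 1 and target.isupper())
        or "<" in target
        or ">" in target
        or target == "X Form Contract"
    )


def real_link_targets(text: str) -> List[str]:
    """Wikilink targets left after removing template-token false positives."""
    found = {t.strip() for t in WIKILINK.findall(strip_code(text))}
    return sorted(t for t in found if not is_template_token(t))


sample = (
    "Use `[[Target]]` as a placeholder.\n"
    "```\n[[Editor]]\n```\n"
    "See [[Real Node]] and [[X]] and [[<Domain>]] and [[X Form Contract]].\n"
)
print(real_link_targets(sample))
```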

Ghost links are not violations — they are planning signals per Markdown Node Contract's Named-edge syntax Requirement. Not every ghost is equal, though; four buckets sharpen the signal:

Deliberate ghosts — bare wikilinks in Predicate Crescent H3 headings (### Against [[predicate]]) and the paired contrasts_with::[[predicate]] edges in Relations, targeting predicates that deliberately don't have Predicate nodes. Categories include prohibited predicates (relates_to, is_a — forbidden by Decision, so no node will ever exist), base-contract predicates (authored_by, has_lifecycle, has_commitment and similar — introduced by Contracts rather than by Predicate nodes), and adjacent-graph vocabulary (derived_from, contradicts — predicates used in other graphs the Crescent is contrasting against). Deliberate ghosts are NOT drift; the contrasts_with:: edge annotations typically acknowledge them explicitly ("Ghost link; target is prohibited..."). Report them as informational — no action needed.

Drift ghosts — bare wikilinks to what looks like a node that should exist and doesn't. Usually arises from a file rename that didn't update incoming references, or from a reference to a planned node that was never created. These are actionable: fix the reference, create the target, or explicitly demote the reference to deliberate-ghost status with an annotation.

Vocabulary-value ghosts — identity-predicate values ([[Seed Stage]], [[Working Draft]], [[Provisional Commitment]], [[Empirical Observation]], etc.) that every node points to via has_lifecycle::, has_curation::, has_commitment::, or has_epistemic_status:: but which have no corresponding Gloss. These carry the highest-inbound-count ghosts in a new graph and represent the largest self-documentation gap. A healthy graph seeds Glosses for them early; an unhealthy graph accumulates identity predicates pointing to undefined values.

Planning-surface ghosts — single- or low-count bare wikilinks to genuinely-unfinished concepts ([[Convention Overhead vs Graph Quality]], a person's Gloss, a future Decision). These are the scion author's to-write list.

Report each bucket separately, with the bucket name in the report. Deliberate and template-token buckets are informational; vocabulary-value and drift buckets are actionable; planning-surface is curation. A ghost whose target appears in MORE than one bucket (e.g., a predicate-name bare wikilink that's also a provisional predicate) counts as Drift until the author promotes it to a Predicate node, at which point it resolves.

Step 5: Un-annotated edge sweep

Every top-level bullet under ## Relations should be followed by an indented sub-bullet annotation per Annotate Edges With Why-They-Matter. An un-annotated edge is tag spaghetti.

The check pattern: within the ## Relations section of each node, for every top-level bullet matching ^- [a-z_]+::, the following non-blank line MUST begin with "  -" (two-space indent, then hyphen). Implementations vary: awk with getline, a short Python script over Path.read_text().split("\n"), or a two-line rg over the section's range are all reasonable. Choose what the agent can ship reliably.

Report un-annotated edges as Shortfall findings, grouped by file. If the count is large enough that the fix is a dedicated curation pass rather than quick edits, flag the scale in the summary rather than listing every instance.
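One way to implement the check is the short Python variant mentioned above. This is a sketch under assumptions: the Relations section is taken to run from the "## Relations" heading to the next H1 or H2 heading (or end of file), and the function name is illustrative. For a real node, pass in Path(file).read_text().

```python
import re
from typing import List

EDGE = re.compile(r"^- [a-z_]+::")
SECTION_HEADING = re.compile(r"^#{1,2} ")


def unannotated_edges(text: str) -> List[int]:
    """Return 1-based line numbers of Relations edges lacking an indented annotation."""
    lines = text.split("\n")
    in_relations = False
    findings = []
    for i, line in enumerate(lines):
        if SECTION_HEADING.match(line):
            in_relations = line.strip() == "## Relations"
            continue
        if not in_relations or not EDGE.match(line):
            continue
        # The next non-blank line must be a two-space-indented sub-bullet.
        following = next((l for l in lines[i + 1:] if l.strip()), "")
        if not following.startswith("  -"):
            findings.append(i + 1)
    return findings


sample = (
    "# Title\n"
    "- fake::[[Not In Relations]]\n"
    "## Relations\n"
    "- grounded_in::[[A]]\n"
    "  - Why it matters.\n"
    "- informs_downstream::[[B]]\n"
    "\n"
    "- extends_contract::[[C]]\n"
    "  - Annotated.\n"
)
print(unannotated_edges(sample))
```

In the sample, only the informs_downstream edge is flagged: the bullet outside the Relations section is ignored, and the other two edges are followed by an annotation.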

Step 6: Reciprocal-edge sweep

When a node carries a forward edge like informs_downstream::[[X]] in its Relations, the target X typically carries a reciprocal edge back. Bidirectional edges are the convention across Decisions, References, Contracts, and Predicates; missing reciprocals are cascade failures — a forward edge was added without the back-edge being wired on the target.

The conventional pairs:

| Forward edge | Reciprocal on target |
| --- | --- |
| informs_downstream::[[X]] | X has informed_by:: or grounded_in:: back |
| grounded_in::[[X]] | X has informs_downstream:: back |
| informed_by::[[X]] | X has informs_downstream:: back |
| extends_contract::[[X]] | X has extended_by:: back |
| supersedes::[[X]] | X has superseded_by:: back |
| contrasts_with::[[X]] | X has contrasts_with:: back (symmetric) |

The check pattern: for each forward edge in every node's ## Relations section, resolve the target to its file, and search that file for the corresponding reciprocal pointing back at the source node. Implementations vary; an rg sweep for forward edges piped into a per-target check is one pattern. Some forward edges are legitimately one-directional — an external reference marked with does not live in the graph, and a ghost link's target does not yet exist; skip externals and ghost-link targets.
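The pairs and the per-target check can be sketched in Python as follows. The simplifications are assumptions made for brevity: nodes are held in a dict keyed by file stem, edges are read from the whole text rather than only the ## Relations section, targets are matched by exact stem, and externals are not modelled. A target that is not in the dict is treated as a ghost link and skipped.

```python
import re
from typing import Dict, List, Set, Tuple

# Forward predicate -> predicates that count as its reciprocal on the target.
RECIPROCALS = {
    "informs_downstream": {"informed_by", "grounded_in"},
    "grounded_in": {"informs_downstream"},
    "informed_by": {"informs_downstream"},
    "extends_contract": {"extended_by"},
    "supersedes": {"superseded_by"},
    "contrasts_with": {"contrasts_with"},
}
EDGE = re.compile(r"^- ([a-z_]+)::\s*\[\[([^|\]]+)", re.MULTILINE)


def edges(text: str) -> Set[Tuple[str, str]]:
    """All (predicate, target) pairs written as top-level bullets."""
    return {(pred, target.strip()) for pred, target in EDGE.findall(text)}


def missing_reciprocals(nodes: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (source, predicate, target) for forward edges with no back-edge."""
    parsed = {name: edges(text) for name, text in nodes.items()}
    missing = []
    for source in sorted(nodes):
        for pred, target in sorted(parsed[source]):
            wanted = RECIPROCALS.get(pred)
            if wanted is None or target not in nodes:
                continue  # no conventional pair, or a ghost-link target
            if not any((w, source) in parsed[target] for w in wanted):
                missing.append((source, pred, target))
    return missing


nodes = {
    "A": "- informs_downstream::[[B]]\n- supersedes::[[C]]\n- contrasts_with::[[Ghost]]\n",
    "B": "- grounded_in::[[A]]\n",
    "C": "",
}
print(missing_reciprocals(nodes))
```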

Report missing reciprocals as Shortfall findings, grouped by the source node. Most missing reciprocals are quick Edit fixes — add the reciprocal on the target with an annotation explaining the relationship.

The sweep catches a specific failure pattern: when a new node is added and its forward edges are wired, it is easy to miss wiring the corresponding back-edges on older, established target nodes. The sweep makes the gap visible at graph scope rather than relying on per-node validation to notice.

Step 7: Orphan detection

An orphan node has no incoming edges from any other node. Some orphans are intentional (the landing page; founding documents that are only linked-to from outside the graph). Others are drift (a node that was written but never wired into the graph).

For each node, search for incoming references:

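find nodes -type f -name '*.md' -print0 | while IFS= read -r -d '' file; do
  stem=$(basename "$file" .md)
  concept=${stem%% -- *}
  # -F matches literally, so stems with regex metacharacters are safe;
  # grep -v drops the node's own file so self-references do not count.
  count=$(rg -l -F -e "[[$stem]]" -e "[[$stem|" -e "[[$concept]]" -e "[[$concept|" nodes/ 2>/dev/null \
    | grep -vxF -- "$file" | wc -l)
  if [ "$count" -eq 0 ]; then
    echo "orphan: $file"
  fi
done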

Report orphans with their form (from conforms_to::) and lifecycle stage. A Seed Stage orphan is usually a work-in-progress; an Evergreen orphan is a candidate for either promotion (adding incoming edges) or demotion (stepping back the lifecycle).
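For larger graphs, inbound counts can also be computed in a single pass instead of one search per node. The Python sketch below assumes nodes are loaded into a dict keyed by file stem and that a link may target either the full stem or the concept side before " -- "; the names are illustrative, and looking up form and lifecycle stage is left out.

```python
import re
from typing import Dict, List

WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]")


def orphans(nodes: Dict[str, str]) -> List[str]:
    """Return stems of nodes with no incoming wikilink from any other node."""
    inbound = {stem: 0 for stem in nodes}
    by_concept: Dict[str, List[str]] = {}
    for stem in nodes:
        by_concept.setdefault(stem.split(" -- ", 1)[0], []).append(stem)
    for source, text in nodes.items():
        for target in {t.strip() for t in WIKILINK.findall(text)}:
            hits = set(by_concept.get(target, []))
            if target in nodes:
                hits.add(target)
            for hit in hits:
                if hit != source:  # self-references do not count
                    inbound[hit] += 1
    return sorted(stem for stem, n in inbound.items() if n == 0)


nodes = {
    "Home": "See [[Gloss A]].",
    "Gloss A -- short": "Refers to itself: [[Gloss A -- short]].",
    "Lonely": "No links here.",
}
print(orphans(nodes))
```

In the sample, Home is reported as an orphan because nothing links to it, which matches the note above that a landing page can be an intentional orphan.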

Step 8: Aggregate and report

Group findings by category, not by file. The report structure:

Report each category compressed. Do not dump every finding; sample representative cases per category and name the total count. A scion author wants to know "there are 40 ghost links, the top five are…", not 40 individual list items.
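The compression can be sketched in Python: group findings by category, print the total, and show only a few samples. The category names, the (category, detail) tuple shape, and the function name are illustrative assumptions rather than the report's actual structure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def compress(findings: List[Tuple[str, str]], sample_size: int = 5) -> List[str]:
    """Group findings by category, largest first, keeping a few samples each."""
    by_category: Dict[str, List[str]] = defaultdict(list)
    for category, detail in findings:
        by_category[category].append(detail)
    lines = []
    for category in sorted(by_category, key=lambda c: -len(by_category[c])):
        items = by_category[category]
        lines.append(f"{category}: {len(items)} total")
        lines.extend(f"  e.g. {detail}" for detail in items[:sample_size])
    return lines


findings = [("Ghost links", f"g{i}") for i in range(7)] + [("Orphans", "o")]
for line in compress(findings, sample_size=2):
    print(line)
```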

Step 9: Name the follow-ups

End the report by naming which follow-up skills or operations would address which categories:

The follow-up naming lets the scion author route the audit's findings without re-deriving what each category asks for.

Relations