Architecture

Why System Design Determines Whether Intelligence Compounds

Whether intelligence compounds or resets is determined by system design, not by model quality alone. Systems that fail to preserve context, assumptions, and decision lineage degrade over time, regardless of how capable individual components may be.



How Accordia Works — Technical Architecture (Canonical)

Architectural Premise

Accordia is built on the premise that analytical reliability is a systems problem, not a prompt or model problem.

Most AI systems fail under organizational use because they:

  • treat documents as flat text,
  • treat memory as conversational history,
  • conflate retrieval with reasoning,
  • and rely on prompt construction to compensate for missing structure.

Accordia instead decomposes intelligence into explicit, inspectable system layers that jointly determine analytical depth, recall, and explainability.


1. Ingestion Pipeline: From Raw Text to Governed Context

1.1 Quality Gating and Signal Filtering

Before any semantic processing, documents pass through a two-tier quality assessment pipeline:

Tier 1 (fast heuristics)

  • Shannon entropy scoring (information density)
  • Alphabetic and symbol ratios
  • Language detection

Tier 2 (model-based validation)

  • Perplexity scoring using lightweight or transformer language models
  • Adaptive thresholds by document type (prose vs technical)

Low-quality, corrupted, or metadata-heavy text is filtered or repaired before indexing.

Why this matters:
Retrieval precision is bounded by ingestion quality. No downstream retrieval or ranking can compensate for polluted context.
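As a rough illustration of the Tier 1 heuristics, a minimal gate might combine Shannon entropy with an alphabetic-character ratio. The threshold values below are illustrative assumptions, not Accordia's actual settings:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character; low values indicate repetitive, low-signal text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def passes_tier1(text: str,
                 min_entropy: float = 3.0,
                 min_alpha_ratio: float = 0.6) -> bool:
    """Fast heuristic gate: information density plus alphabetic ratio."""
    if not text:
        return False
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    return shannon_entropy(text) >= min_entropy and alpha_ratio >= min_alpha_ratio
```

Documents failing this gate would either be repaired or skipped before the more expensive Tier 2 perplexity check runs.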


1.2 Semantic Chunking (Meaning-Preserving Segmentation)

Documents are segmented using embedding-based semantic chunking, not fixed token or character windows.

Process:

  1. Split text into sentences
  2. Generate embeddings per sentence
  3. Slide a window over sentence groups
  4. Detect semantic similarity drops (cosine distance)
  5. Insert boundaries where topic shifts occur

Enhancements:

  • Gradient-based boundary detection (not static thresholds)
  • Low-density start detection (TOC / index stripping)
  • Adaptive chunk sizing based on information density

Result: Chunks represent ideas, not storage artifacts.
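The five steps above can be sketched as follows. The bag-of-words `toy_embed` is a stand-in for a real sentence-embedding model, and the fixed similarity threshold is a simplification of the gradient-based boundary detection described above:

```python
import math
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    # Stand-in for a real sentence-embedding model: bag-of-words counts.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, boundary_threshold=0.1):
    """Insert a boundary wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    embs = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(embs, embs[1:], sentences[1:]):
        if cosine(prev, cur) < boundary_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With real embeddings, the similarity drop between topically unrelated sentences is what places the chunk boundary at a genuine topic shift.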


1.3 Content-Rich Sentence Classification

Within each chunk, sentences are scored for content richness using multiple signals:

  • entropy (information density)
  • sentence length and structure
  • verb presence and punctuation complexity
  • stop-word ratios
  • position-based weighting (secondary)

Only content-rich sentences are emphasized during:

  • keyphrase extraction
  • contextualization
  • embedding generation

This prevents TOC, headers, and metadata from dominating semantic representations.
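A simplified version of such a multi-signal richness score might look like this; the weights and the tiny stop-word list are illustrative assumptions, not the production values:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for"}

def richness_score(sentence: str) -> float:
    """Blend entropy, length, and stop-word signals into one richness score."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    counts = Counter(sentence)
    total = len(sentence)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    stop_ratio = sum(w.strip(".,") in STOP_WORDS for w in words) / len(words)
    length_signal = min(len(words) / 20.0, 1.0)  # saturate at ~20 words
    return 0.5 * (entropy / 5.0) + 0.3 * length_signal + 0.2 * (1.0 - stop_ratio)
```

A TOC line scores low on both entropy and length, so it is de-emphasized during keyphrase extraction and embedding generation.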


1.4 Ingestion-Stage RAG Optimizations (Extension)

To ensure ingestion does not silently degrade downstream reasoning:

  • Near-duplicate detection (LSH-based) prevents semantic collapse from repeated content
  • Structural preservation retains document boundaries and section order
  • Canonical representations normalize heterogeneous formats (PDF, DOC, HTML, transcripts)

These controls ensure ingestion produces a high-signal analytical substrate, not a raw text archive.
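The near-duplicate check can be sketched with word shingles and a MinHash signature. A real LSH index would additionally bucket these signatures into bands for sub-linear lookup, which is omitted here:

```python
import hashlib

def shingles(text: str, k: int = 4):
    """Overlapping k-word shingles as the unit of comparison."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes: int = 64):
    """One min-hash value per seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Chunks whose estimated similarity exceeds a dedup threshold would be collapsed to a single canonical copy before indexing.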


2. Contextualization: Making Chunks Self-Situating

2.1 Context Prefixing

Each semantic chunk is augmented with a context prefix generated at ingestion time.

The prefix:

  • situates the chunk within the source document
  • captures surrounding thematic scope
  • preserves section-level intent

Contextualized chunks are used consistently for:

  • embeddings
  • lexical indexing (BM25)

This allows each chunk to be retrieved independently without reconstructing document context at query time.
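As a minimal sketch, a contextualized chunk might carry its prefix alongside its text; the field names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ContextualizedChunk:
    document: str
    section: str
    text: str

    def indexable_text(self) -> str:
        # The prefix situates the chunk so it can be retrieved in isolation.
        prefix = f"[Document: {self.document} | Section: {self.section}]"
        return f"{prefix} {self.text}"
```

The same `indexable_text` output would feed both the embedding model and the BM25 index, keeping the two retrieval paths consistent.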


2.2 Contextual Retrieval Design

Contextualization ensures:

  • semantic retrieval remains stable as documents grow
  • long documents do not dominate similarity scores
  • retrieval precision does not degrade with scale

This aligns retrieval quality with document meaning rather than document length.


2.3 Long-Context Stability Improvements

Instead of expanding prompt context arbitrarily:

  • document-level meaning is compressed into contextual prefixes
  • retrieval selects meaningful units, not token windows
  • long-context inference is avoided unless explicitly required

This stabilizes reasoning and reduces hallucination pressure.


3. Embedding and Retrieval Layer

3.1 Label-Aware Embeddings

Accordia uses label-aware embeddings to align query and document representations:

  • Queries:
    search_query: {text}

  • Documents / chunks:
    search_document: {content + headings + context prefix}

This leverages the embedding model's training-time task labels, so query and document similarity scores are semantically aligned without any change to the retrieval scoring itself.
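Under this convention, the query and document sides are labeled asymmetrically before embedding. The helper names below are illustrative; the exact label strings follow the embedding model's documented format:

```python
def embed_input_for_query(text: str) -> str:
    # Query-side task label, matching the model's training-time convention.
    return f"search_query: {text}"

def embed_input_for_document(content: str, headings: str = "", context_prefix: str = "") -> str:
    # Document-side label; headings and context prefix travel with the content.
    parts = [p for p in (context_prefix, headings, content) if p]
    return "search_document: " + " ".join(parts)
```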


3.2 Multi-Query Expansion

Each user query is expanded into multiple semantic variants:

  • raw question
  • keyword-focused form
  • entity-anchored form
  • section-level abstraction

Each variant is embedded and retrieved independently.
Results are merged using reciprocal rank fusion (RRF).

Effect: higher recall without sacrificing precision.
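Reciprocal rank fusion itself is compact: with the conventional smoothing constant k = 60, each document's fused score sums 1/(k + rank) across the per-variant result lists:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked near the top by several variants outranks one ranked first by only a single variant, which is why expansion raises recall without flooding the final list with noise.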


3.3 Hybrid Retrieval Stack

Accordia combines retrieval signals from:

  • vector similarity search
  • BM25 lexical matching
  • pattern matching (IDs, codes, exact strings)
  • metadata and scope filters

Signals are fused and re-ranked so no single retrieval mode dominates.
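One simple way to fuse heterogeneous signals is to normalize each score set to [0, 1] and take a weighted sum; the weights below are illustrative, and a production system might use RRF or a learned re-ranker instead:

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize so no retrieval mode dominates by raw scale."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_scores(vector: dict, bm25: dict, pattern_hits: set,
                  weights=(0.5, 0.35, 0.15)) -> dict:
    """Weighted fusion of vector, lexical, and exact-match signals."""
    wv, wb, wp = weights
    nv, nb = normalize(vector), normalize(bm25)
    docs = set(vector) | set(bm25) | set(pattern_hits)
    return {d: wv * nv.get(d, 0.0)
               + wb * nb.get(d, 0.0)
               + wp * (1.0 if d in pattern_hits else 0.0)
            for d in docs}
```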


3.4 REFRAG: Retrieval-Time Context Control

REFRAG operates as a retrieval-time optimization, not an ingestion shortcut.

Stages:

  1. Compress — retrieved chunks are compactly represented
  2. Sense — micro-units are scored for utility (semantic, structural, diversity signals)
  3. Expand — only high-utility fragments are expanded to full text

This increases effective context capacity while controlling token cost and latency.
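A highly simplified sketch of the three stages, using first-sentence truncation as the compressed representation and query-term overlap as the utility signal; both are stand-ins for the real compression and scoring:

```python
def compress(chunk: str) -> str:
    # Stage 1: represent each retrieved chunk compactly (here, its first sentence).
    return chunk.split(".")[0]

def sense(compressed: str, query_terms: set) -> float:
    # Stage 2: cheap utility score computed on the compressed form only.
    words = set(compressed.lower().split())
    return len(words & query_terms) / max(len(query_terms), 1)

def expand(chunks: list, query_terms: set, budget: int = 2) -> list:
    # Stage 3: expand only the highest-utility chunks to full text.
    ranked = sorted(chunks, key=lambda c: sense(compress(c), query_terms), reverse=True)
    return ranked[:budget]
```

Because scoring happens on compressed forms and only the winners are expanded, the prompt carries far fewer tokens than naively concatenating every retrieved chunk.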


4. Memory Model: Persistent, Scoped, Curated

Accordia does not rely on conversational memory.

Instead, it maintains workstream-scoped memory that persists:

  • analytical intents
  • synthesized conclusions
  • assumptions and constraints
  • source linkages
  • refinement history

Memory is:

  • explicitly written
  • selectively retained
  • structurally linked

It is not an append-only chat log.
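A schematic of such a memory store, with explicit writes and supersession links rather than an append-only log; the entry kinds and field names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryEntry:
    workstream: str
    kind: str                 # e.g. "intent" | "conclusion" | "assumption"
    content: str
    sources: list = field(default_factory=list)
    supersedes: Optional[str] = None  # refinement history, not a chat transcript

class WorkstreamMemory:
    def __init__(self):
        self._entries = {}

    def write(self, entry_id: str, entry: MemoryEntry):
        # Explicit writes only; nothing is persisted implicitly.
        self._entries[entry_id] = entry

    def active(self, workstream: str) -> list:
        # Superseded entries remain for lineage but are excluded from active recall.
        superseded = {e.supersedes for e in self._entries.values() if e.supersedes}
        return [eid for eid, e in self._entries.items()
                if e.workstream == workstream and eid not in superseded]
```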


5. Workflow-Aware Execution

All reasoning occurs inside explicit workflows that define:

  • scope boundaries
  • expected artifacts
  • persistence rules
  • reviewability requirements

Outputs are automatically:

  • versioned
  • attached to workstream memory
  • linked to upstream evidence

This converts reasoning into institutional capability, not ephemeral assistance.
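A minimal sketch of workflow-scoped artifact publishing, with automatic versioning and evidence links; the class and field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    version: int
    content: str
    evidence_ids: list = field(default_factory=list)  # links to upstream sources

class Workflow:
    def __init__(self, scope: str):
        self.scope = scope
        self._versions = {}  # artifact name -> list of versions

    def publish(self, name: str, content: str, evidence_ids: list) -> Artifact:
        # Every output is versioned and linked to its evidence, never overwritten.
        history = self._versions.setdefault(name, [])
        history.append(Artifact(name, len(history) + 1, content, evidence_ids))
        return history[-1]
```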


How the System Preserves Context Across Decisions

Preventing intelligence from resetting requires context to persist beyond individual interactions. This depends on the explicit architectural mechanisms described above: quality-gated ingestion, self-situating contextualization, traceable hybrid retrieval, curated workstream memory, and workflow-aware execution that preserves decision lineage.