Reader and Extraction

OfficeIMO.Reader provides a single extraction surface for Office and adjacent document formats. Instead of maintaining separate parsing pipelines for .docx, .xlsx, .pptx, Markdown, PDF, or text-like files, you can normalize them into one chunk model and then feed that output into indexing, search, and AI workflows.

Best fit scenarios

Build ingestion pipelines for RAG, semantic search, or compliance review.
Normalize mixed document folders into one extraction and chunking model.
Preserve headings, citations, token estimates, and source hashes while preparing content for downstream tools.
Run extraction in background workers, containers, Azure Functions, or scheduled jobs.

Core workflow

1. Extract a file into ReaderChunk instances with text, markdown, tables, visuals, and source information.
2. Tune ReaderOptions so emitted slices stay deterministic and sized for search or AI prompts.
3. Store chunks, citations, and source identifiers in your vector store, search index, or audit trail.
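
To make the idea of deterministic, size-bounded slices concrete, here is a minimal sketch of one way such chunking can work. It is not OfficeIMO.Reader's actual algorithm: `ParagraphChunker` and its `Split` method are hypothetical names introduced only for this illustration, and the sketch assumes plain text with blank-line paragraph breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of OfficeIMO.Reader.
public static class ParagraphChunker
{
    // Greedily packs paragraphs into chunks of at most maxChars characters.
    // The result depends only on the input text and maxChars, so repeated
    // runs over the same document yield identical chunk boundaries.
    public static List<string> Split(string text, int maxChars)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));

        var chunks = new List<string>();
        var current = new StringBuilder();

        // Normalize line endings so Windows and Unix files chunk the same way.
        var paragraphs = text.Replace("\r\n", "\n")
            .Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var raw in paragraphs)
        {
            var paragraph = raw.Trim();
            if (paragraph.Length == 0) continue;

            // Close the current chunk if this paragraph plus its separator would overflow it.
            if (current.Length > 0 && current.Length + 2 + paragraph.Length > maxChars)
            {
                chunks.Add(current.ToString());
                current.Clear();
            }

            // Hard-split a single paragraph that is longer than the limit.
            while (paragraph.Length > maxChars)
            {
                chunks.Add(paragraph.Substring(0, maxChars));
                paragraph = paragraph.Substring(maxChars);
            }

            if (current.Length > 0) current.Append("\n\n");
            current.Append(paragraph);
        }

        if (current.Length > 0) chunks.Add(current.ToString());
        return chunks;
    }
}
```

Because the packing is greedy and uses no randomness or timestamps, re-running it on unchanged input reproduces the same chunks, which is the property that keeps chunk identifiers and citations stable between indexing runs.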

Quick start

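using System;
using System.Linq;
using OfficeIMO.Reader;

// Read the document into chunks, capping chunk size and enabling
// footnote extraction and hash computation.
var chunks = DocumentReader.Read("proposal.docx", new ReaderOptions
{
    MaxChars = 4_000,
    IncludeWordFootnotes = true,
    ComputeHashes = true
}).ToList();

// Print each chunk's identifier, kind, location, and token estimate.
foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.Id} :: {chunk.Kind}");
    // Fall back to the source path when no heading path is available.
    Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path);
    Console.WriteLine(chunk.TokenEstimate);
}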

Formats and behavior

Input	Typical use
Word (`.docx`)	Rich business documents, reports, contracts, and templates
Excel (`.xlsx`)	Workbook content, tabular reports, and structured exports
PowerPoint (`.pptx`)	Slide decks, speaker notes, and presentation narratives
Markdown	Documentation, changelogs, developer notes, and generated content
PDF	Published exports, archival documents, and third-party handoffs

Design goals

Deterministic chunking so repeated runs produce stable chunk boundaries.
Heading-aware extraction so downstream systems retain document structure.
Citation-friendly location data so search and AI responses can reference original sources.
Incremental indexing support through source IDs, hashes, and per-document chunk summaries.
Container-friendly execution with no Office installation requirements.
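
The incremental indexing goal can be illustrated with a small, library-independent sketch: hash each source's content, compare it with the hash stored from the previous run, and re-index only what changed. `IncrementalIndex`, `Hash`, and `FindChanged` are hypothetical names for this example, and SHA-256 is an assumption here; the text above does not say which hash OfficeIMO.Reader computes.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of OfficeIMO.Reader.
public static class IncrementalIndex
{
    // Returns a lowercase hex SHA-256 digest of the UTF-8 bytes of the content.
    public static string Hash(string content)
    {
        using var sha = SHA256.Create();
        var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(content));
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }

    // Compares freshly computed hashes against those stored from the previous
    // run and returns the source IDs that are new or whose content changed.
    public static List<string> FindChanged(
        IReadOnlyDictionary<string, string> previousHashes,
        IReadOnlyDictionary<string, string> currentContent)
    {
        var changed = new List<string>();
        foreach (var pair in currentContent)
        {
            var hash = Hash(pair.Value);
            if (!previousHashes.TryGetValue(pair.Key, out var oldHash) || oldHash != hash)
            {
                changed.Add(pair.Key);
            }
        }

        // Sort so the result does not depend on dictionary enumeration order.
        changed.Sort(StringComparer.Ordinal);
        return changed;
    }
}
```

In a real pipeline the previous hashes would come from the search index or a small state store, keyed by source ID, so that unchanged documents skip extraction and embedding entirely.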

Related documentation

OfficeIMO.Word for producing .docx content before ingestion.
OfficeIMO.Markdown for rendering, transforming, and re-emitting extracted content.
AOT and Trimming for lean deployment guidance.