OfficeIMO.Reader

Why OfficeIMO.Reader?

OfficeIMO.Reader provides a single API to extract structured content from common document formats. Feed it Word, Excel, PowerPoint, Markdown, PDF, or text-like inputs and get back normalized chunks with location data, source hashes, and token estimates. It is purpose-built for RAG pipelines, search indexing, and any workflow where you need reproducible document slices instead of ad-hoc parsers.

Features

Extract from Word, Excel, PowerPoint, Markdown & PDF -- one API for all major document formats
Deterministic extraction chunks -- emit stable chunk boundaries with configurable character and row limits
Heading-aware extraction with citations -- preserve document structure and source locations on each chunk
Token estimates per chunk -- budget prompts and indexing payloads without an extra preprocessing pass
Folder batch processing -- process entire directories with progress callbacks, skip reporting, and cancellation support
Pluggable handler registration -- register custom extractors for proprietary or domain-specific formats
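
To make "deterministic chunk boundaries" and "token estimates" concrete, here is a minimal, self-contained sketch of the general idea. It is not OfficeIMO.Reader's actual algorithm: SplitParagraphs, EstimateTokens, and the four-characters-per-token heuristic are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: hypothetical helpers, not part of OfficeIMO.Reader.

// Greedily packs paragraphs into chunks of at most maxChars characters.
// The same input and limit always produce the same boundaries.
static List<string> SplitParagraphs(string text, int maxChars)
{
    if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));

    var chunks = new List<string>();
    var current = "";
    var paragraphs = text.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);

    foreach (var paragraph in paragraphs)
    {
        var candidate = current.Length == 0 ? paragraph : current + "\n\n" + paragraph;
        if (candidate.Length <= maxChars)
        {
            current = candidate;
            continue;
        }

        // The paragraph does not fit: close the current chunk first.
        if (current.Length > 0) chunks.Add(current);

        // Hard-split a paragraph that is longer than the limit on its own.
        var rest = paragraph;
        while (rest.Length > maxChars)
        {
            chunks.Add(rest.Substring(0, maxChars));
            rest = rest.Substring(maxChars);
        }
        current = rest;
    }

    if (current.Length > 0) chunks.Add(current);
    return chunks;
}

// Assumed heuristic: roughly four characters per token, rounded up.
static int EstimateTokens(string chunk) => (chunk.Length + 3) / 4;

var sample = "Intro paragraph.\n\nSecond paragraph with more text.\n\nThird.";
var first = SplitParagraphs(sample, 40);
var second = SplitParagraphs(sample, 40);

Console.WriteLine($"chunks={first.Count}, stable={first.SequenceEqual(second)}");
foreach (var chunk in first)
{
    Console.WriteLine($"~{EstimateTokens(chunk)} tokens, {chunk.Length} chars");
}
```

Because the boundaries depend only on the input text and the limit, two runs over an unchanged document yield identical chunks, which is what makes re-runs comparable.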

What teams build with Reader

Workflow	Output	Why Reader fits
Knowledge ingestion services	Chunked text plus source IDs and token estimates for vector stores and semantic search	One extractor handles mixed Office and adjacent formats with the same result model
Compliance and review pipelines	Searchable evidence bundles with headings and citations	Stable chunk boundaries make reviews and re-runs easier to compare
File-share indexing jobs	Normalized documents ready for Lucene, Elasticsearch, or Azure AI Search	Batch extraction works well in workers, scheduled jobs, and containers
Content migration tools	Markdown, JSON, or sidecar artifacts derived from legacy documents	Structured extraction keeps enough source context to transform before re-emitting

Quick start

using System;
using System.Linq;
using OfficeIMO.Reader;

var chunks = DocumentReader.Read("report.docx", new ReaderOptions
{
    MaxChars = 4_000,
    IncludeWordFootnotes = true,
    ComputeHashes = true
}).ToList();

foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.Id} :: {chunk.Kind} :: ~{chunk.TokenEstimate ?? 0} tokens");
    Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path ?? "unknown");
    Console.WriteLine(chunk.Markdown ?? chunk.Text);
    Console.WriteLine();
}

var documents = DocumentReader.ReadFolderDocuments(
    folderPath: "./documents",
    folderOptions: new ReaderFolderOptions
    {
        Recurse = true,
        DeterministicOrder = true,
        MaxFiles = 500
    },
    options: new ReaderOptions
    {
        MaxChars = 4_000,
        ComputeHashes = true
    },
    onProgress: progress =>
        Console.WriteLine($"{progress.Kind}: scanned={progress.FilesScanned}, parsed={progress.FilesParsed}, skipped={progress.FilesSkipped}, chunks={progress.ChunksProduced}")
).ToList();

Console.WriteLine($"Processed {documents.Count} files");
Console.WriteLine($"Parsed {documents.Count(d => d.Parsed)} files");
Console.WriteLine($"Returned {documents.Sum(d => d.ChunksProduced)} chunks");
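
The token estimates printed above can drive a simple prompt budget. The sketch below is self-contained and uses hypothetical (Id, TokenEstimate) tuples in place of real chunks. Treating a missing estimate as 0 mirrors the `?? 0` fallback in the sample above and is an assumption, not documented library behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for chunk IDs and their nullable token estimates.
var estimates = new List<(string Id, int? TokenEstimate)>
{
    ("c1", 900),
    ("c2", 1500),
    ("c3", null),
    ("c4", 800)
};

const int budget = 2500;
var selected = new List<string>();
var used = 0;

foreach (var (id, tokens) in estimates)
{
    var cost = tokens ?? 0;

    // Stop at the first chunk that does not fit so document order is preserved.
    if (used + cost > budget) break;

    selected.Add(id);
    used += cost;
}

Console.WriteLine($"{string.Join(",", selected)} using ~{used} tokens");
```

Stopping at the first chunk that overflows keeps the selection contiguous. Skipping it and continuing would fit more text but could leave gaps in the source order.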

Typical ingestion flow

Detect the source format and extract chunks, headings, tables, visuals, and source information with one API call.
Normalize the result into a shape your pipeline understands, regardless of whether the input was Word, Excel, PowerPoint, Markdown, PDF, or text.
Tune ReaderOptions so citations and downstream embeddings stay stable across repeated runs.
Store the chunks, source hashes, and source references in your vector store, search index, or audit trail.
Reuse the same extractor in local tools, hosted services, or CI jobs without changing the document pipeline.
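
The storage step above can be sketched as a hash-keyed upsert, so that a repeated run re-embeds only chunks whose content changed. This is a minimal illustration under stated assumptions: the in-memory dictionary stands in for a vector store or search index, and Sha256Hex, Upsert, and the "report.docx#0" ID format are hypothetical names, not OfficeIMO.Reader APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical in-memory index: chunk ID -> content hash.
var index = new Dictionary<string, string>();

// Hex-encoded SHA-256 of the chunk text (UTF-8).
static string Sha256Hex(string text)
{
    using var sha = SHA256.Create();
    var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
    return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
}

// Returns true when the chunk is new or changed and should be re-embedded.
bool Upsert(string chunkId, string text)
{
    var hash = Sha256Hex(text);
    if (index.TryGetValue(chunkId, out var existing) && existing == hash)
    {
        return false; // unchanged since the last run
    }

    index[chunkId] = hash;
    return true;
}

var firstRun = Upsert("report.docx#0", "Quarterly summary");
var rerun = Upsert("report.docx#0", "Quarterly summary");
var edited = Upsert("report.docx#0", "Quarterly summary, revised");

Console.WriteLine($"{firstRun} {rerun} {edited}"); // prints "True False True"
```

When the extractor already supplies hashes (ComputeHashes = true in the quick start), those values could take the place of the locally computed hash.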

Compatibility

Target Framework	Supported
.NET 10.0	Yes
.NET 8.0	Yes
.NET Standard 2.0	Yes
.NET Framework 4.7.2	Yes

OfficeIMO.Reader targets the same cross-platform .NET runtimes as the packages it builds on, and the core extraction flow is a good fit for containers, hosted services, and server-side indexing jobs. As with any mixed-format pipeline, validate your exact deployment shape, input set, and runtime targets before treating it as broadly portable across every environment.

Guide	Description
Reader documentation	Learn the core extraction model, chunking workflow, and ingestion patterns.
AOT and trimming	Review runtime and deployment guidance for lean extraction services.
Reader tutorial	Walk through chunk inspection, folder ingestion, and indexing-oriented extraction patterns.
OfficeIMO.Markdown	Pair extraction with markdown rendering and transformation workflows.