OfficeIMO.Reader

Unified document extraction from Word, Excel, PowerPoint, Markdown, PDF, and text-like inputs. Deterministic chunks for indexing and AI workflows.

dotnet add package OfficeIMO.Reader

AI-ready extraction pipeline

Mixed documents in, stable chunks out

A single ingestion surface for Office documents, Markdown, and PDFs. Normalize once, then feed the same output into search, RAG, and review workflows.

Input families: DOCX/XLSX/PPTX/PDF
Typical chunk profile: 512 + overlap
Downstream targets: Search + AI

  1. Extract text, headings, and metadata from mixed document folders.
  2. Chunk with deterministic settings so re-runs keep stable citations.
  3. Ship normalized results into vector stores, indexes, or audit workflows.
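
To illustrate why fixed settings give repeatable boundaries, here is a minimal sketch of fixed-size chunking with overlap. It is not OfficeIMO.Reader's chunking algorithm: `ChunkWithOverlap` is a hypothetical helper, and the 512/64 values assume the "512 + overlap" profile is measured in characters, which this page does not state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of OfficeIMO.Reader:
// cut text into fixed-size windows that share `overlap` characters.
static List<(int Start, string Text)> ChunkWithOverlap(string text, int size, int overlap)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));

    var chunks = new List<(int Start, string Text)>();
    int step = size - overlap; // distance between consecutive window starts
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(size, text.Length - start);
        chunks.Add((start, text.Substring(start, length)));
        if (start + length >= text.Length) break; // last window reached the end
    }
    return chunks;
}

// Same input and same settings produce the same boundaries on every run.
var sample = new string('a', 1200);
var first = ChunkWithOverlap(sample, 512, 64);
var second = ChunkWithOverlap(sample, 512, 64);

Console.WriteLine(string.Join(", ", first.Select(c => c.Start)));
Console.WriteLine(first.SequenceEqual(second));
```

Because the boundaries depend only on the input and the two settings, a citation that points at a chunk start offset still resolves after a re-run, provided the source text and settings are unchanged.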

Why OfficeIMO.Reader?

OfficeIMO.Reader provides a single API to extract structured content from common document formats. Feed it Word, Excel, PowerPoint, Markdown, PDF, or text-like inputs and get back normalized chunks with location data, source hashes, and token estimates. It is purpose-built for RAG pipelines, search indexing, and any workflow where you need reproducible document slices instead of ad-hoc parsers.

Features

  • Extract from Word, Excel, PowerPoint, Markdown & PDF -- one API for all major document formats
  • Deterministic extraction chunks -- emit stable chunk boundaries with configurable character and row limits
  • Heading-aware extraction with citations -- preserve document structure and source locations on each chunk
  • Token estimates per chunk -- budget prompts and indexing payloads without an extra preprocessing pass
  • Folder batch processing -- process entire directories with progress callbacks, skip reporting, and cancellation support
  • Pluggable handler registration -- register custom extractors for proprietary or domain-specific formats
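
As one way to use per-chunk token estimates, the sketch below keeps chunks in document order until a prompt budget is spent. `TakeWithinBudget` and the sample data are hypothetical; in a real pipeline the ids and estimates would presumably come from `chunk.Id` and `chunk.TokenEstimate`, as used in the quick start.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of OfficeIMO.Reader:
// keep chunks in order until adding the next one would exceed the budget.
static List<string> TakeWithinBudget(IEnumerable<(string Id, int? TokenEstimate)> chunks, int budget)
{
    var selected = new List<string>();
    int used = 0;
    foreach (var (id, estimate) in chunks)
    {
        int cost = estimate ?? 0; // same `?? 0` fallback the quick start uses
        if (used + cost > budget) break;
        used += cost;
        selected.Add(id);
    }
    return selected;
}

// Illustrative data only; ids and estimates are made up.
var candidates = new List<(string Id, int? TokenEstimate)>
{
    ("intro", 300),
    ("methods", 900),
    ("results", 700),
    ("appendix", null),
};

var picked = TakeWithinBudget(candidates, 1500);
Console.WriteLine(string.Join(", ", picked));
```

The helper stops at the first chunk that does not fit instead of skipping ahead, which keeps the selected context contiguous. A pipeline that ranks chunks by relevance might prefer to skip oversized chunks and continue.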

What teams build with Reader

| Workflow | Output | Why Reader fits |
|---|---|---|
| Knowledge ingestion services | Chunked text plus source IDs and token estimates for vector stores and semantic search | One extractor handles mixed Office and adjacent formats with the same result model |
| Compliance and review pipelines | Searchable evidence bundles with headings and citations | Stable chunk boundaries make reviews and re-runs easier to compare |
| File-share indexing jobs | Normalized documents ready for Lucene, Elasticsearch, or Azure AI Search | Batch extraction works well in workers, scheduled jobs, and containers |
| Content migration tools | Markdown, JSON, or sidecar artifacts derived from legacy documents | Structured extraction keeps enough source context to transform before re-emitting |

Quick start

using System;
using System.Linq;
using OfficeIMO.Reader;

var chunks = DocumentReader.Read("report.docx", new ReaderOptions
{
    MaxChars = 4_000,
    IncludeWordFootnotes = true,
    ComputeHashes = true
}).ToList();

foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.Id} :: {chunk.Kind} :: ~{chunk.TokenEstimate ?? 0} tokens");
    Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path ?? "unknown");
    Console.WriteLine(chunk.Markdown ?? chunk.Text);
    Console.WriteLine();
}

var documents = DocumentReader.ReadFolderDocuments(
    folderPath: "./documents",
    folderOptions: new ReaderFolderOptions
    {
        Recurse = true,
        DeterministicOrder = true,
        MaxFiles = 500
    },
    options: new ReaderOptions
    {
        MaxChars = 4_000,
        ComputeHashes = true
    },
    onProgress: progress =>
        Console.WriteLine($"{progress.Kind}: scanned={progress.FilesScanned}, parsed={progress.FilesParsed}, skipped={progress.FilesSkipped}, chunks={progress.ChunksProduced}")
).ToList();

Console.WriteLine($"Processed {documents.Count} files");
Console.WriteLine($"Parsed {documents.Count(d => d.Parsed)} files");
Console.WriteLine($"Returned {documents.Sum(d => d.ChunksProduced)} chunks");

Typical ingestion flow

  1. Detect the source format and extract chunks, headings, tables, visuals, and source information with one API call.
  2. Normalize the result into a shape your pipeline understands, regardless of whether the input was Word, Excel, PowerPoint, Markdown, PDF, or text.
  3. Tune ReaderOptions so citations and downstream embeddings stay stable across repeated runs.
  4. Store the chunks, source hashes, and source references in your vector store, search index, or audit trail.
  5. Reuse the same extractor in local tools, hosted services, or CI jobs without changing the document pipeline.
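
Step 4 can be sketched as a change check between runs: store a content hash per chunk, then re-index only chunks that are new or whose hash differs. This is an assumption-laden illustration, not the library's hashing. `HashText`, the `report.docx#N` ids, and the sample texts are hypothetical, and the hash is computed here with SHA-256 from the standard library, not read from Reader's own hash output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: hex-encoded SHA-256 of a chunk's text.
static string HashText(string text)
{
    using var sha = SHA256.Create();
    byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
    return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
}

// Chunk id -> hash recorded by the previous run (illustrative data).
var previous = new Dictionary<string, string>
{
    ["report.docx#1"] = HashText("Introduction text"),
    ["report.docx#2"] = HashText("Old findings"),
};

// Chunk id -> text produced by the current run (illustrative data).
var current = new Dictionary<string, string>
{
    ["report.docx#1"] = "Introduction text",
    ["report.docx#2"] = "Revised findings",
    ["report.docx#3"] = "New appendix",
};

// Re-index only chunks that are new or whose content hash changed.
var toReindex = current
    .Where(kv => !previous.TryGetValue(kv.Key, out var old) || old != HashText(kv.Value))
    .Select(kv => kv.Key)
    .OrderBy(id => id, StringComparer.Ordinal)
    .ToList();

Console.WriteLine(string.Join(", ", toReindex));
```

This check only works if chunk ids stay the same between runs, which is the reason to keep chunking settings fixed once an index has been built.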

Compatibility

| Target Framework | Supported |
|---|---|
| .NET 10.0 | Yes |
| .NET 8.0 | Yes |
| .NET Standard 2.0 | Yes |
| .NET Framework 4.7.2 | Yes |

OfficeIMO.Reader targets the same cross-platform .NET runtimes as the packages it builds on, and the core extraction flow is a good fit for containers, hosted services, and server-side indexing jobs. As with any mixed-format pipeline, test your own deployment environment, input formats, and runtime targets before assuming it behaves identically everywhere.

| Guide | Description |
|---|---|
| Reader documentation | Learn the core extraction model, chunking workflow, and ingestion patterns. |
| AOT and trimming | Review runtime and deployment guidance for lean extraction services. |
| Reader tutorial | Walk through chunk inspection, folder ingestion, and indexing-oriented extraction patterns. |
| OfficeIMO.Markdown | Pair extraction with markdown rendering and transformation workflows. |