Skip to main content

Reader and Extraction

Edit on GitHub

Overview of the OfficeIMO.Reader package for unified document extraction and AI-ready chunking workflows.

Reader and Extraction

OfficeIMO.Reader provides a single extraction surface for the built-in formats currently handled in the repo, plus optional adapters for adjacent document types. Instead of maintaining separate parsing pipelines for .docx, .xlsx, .pptx, Markdown, PDF, CSV, JSON, XML, HTML, EPUB, ZIP, or text-like files, you can normalize them into one chunk model and then feed that output into indexing, search, and AI workflows.

Best fit scenarios

  • Build ingestion pipelines for RAG, semantic search, or compliance review.
  • Normalize mixed document folders into one extraction and chunking model.
  • Preserve headings, citations, token estimates, and source hashes while preparing content for downstream tools.
  • Run extraction in background workers, containers, Azure Functions, or scheduled jobs.

Core workflow

  1. Extract a file into ReaderChunk instances with text, markdown, tables, visuals, and source information.
  2. Tune ReaderOptions so emitted slices stay deterministic and sized for search or AI prompts.
  3. Store chunks, citations, and source identifiers in your vector store, search index, or audit trail.

Quick start

using OfficeIMO.Reader;

var chunks = DocumentReader.Read("proposal.docx", new ReaderOptions
{
    MaxChars = 4_000,
    IncludeWordFootnotes = true,
    ComputeHashes = true
}).ToList();

foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.Id} :: {chunk.Kind}");
    Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path);
    Console.WriteLine(chunk.TokenEstimate);
}

Formats and behavior

InputTypical use
Word (.docx)Rich business documents, reports, contracts, and templates
Excel (.xlsx)Workbook content, tabular reports, and structured exports
PowerPoint (.pptx)Slide decks, speaker notes, and presentation narratives
MarkdownDocumentation, changelogs, developer notes, and generated content
PDFPublished exports, archival documents, and third-party handoffs
CSV/TSVStructured flat files through OfficeIMO.Reader.Csv
JSONAPI exports and structured payloads through OfficeIMO.Reader.Json
XMLElement-oriented documents through OfficeIMO.Reader.Xml
HTMLWeb pages and exported fragments through OfficeIMO.Reader.Html
EPUBBooks and packaged publications through OfficeIMO.Reader.Epub
ZIPArchive traversal through OfficeIMO.Reader.Zip

Adapter packages

The core package covers the main Office-oriented extraction flow. Add adapter packages only when the deployment needs those input families:

  • OfficeIMO.Reader.Csv for CSV and TSV ingestion.
  • OfficeIMO.Reader.Json for JSON payloads.
  • OfficeIMO.Reader.Xml for XML documents.
  • OfficeIMO.Reader.Text when CSV, JSON, and XML adapters should travel together.
  • OfficeIMO.Reader.Html for HTML normalization through the Markdown pipeline.
  • OfficeIMO.Reader.Epub for EPUB publications.
  • OfficeIMO.Reader.Zip for safe archive traversal.

Use the Reader API reference for the core DocumentReader, ReaderOptions, chunk, manifest, and handler registration types. Adapter-specific APIs are documented through their package README files until the website generates separate adapter reference sections.

Design goals

  • Deterministic chunking so repeated runs produce stable chunk boundaries.
  • Heading-aware extraction so downstream systems retain document structure.
  • Citation-friendly location data so search and AI responses can reference original sources.
  • Incremental indexing support through source IDs, hashes, and per-document chunk summaries.
  • Container-friendly execution with no Office installation requirements.