Reader and Extraction
Overview of the OfficeIMO.Reader package for unified document extraction and AI-ready chunking workflows.
OfficeIMO.Reader provides a single extraction surface for Office and adjacent document formats. Instead of maintaining separate parsing pipelines for .docx, .xlsx, .pptx, Markdown, PDF, or text-like files, you can normalize them into one chunk model and then feed that output into indexing, search, and AI workflows.
Best fit scenarios
- Build ingestion pipelines for RAG, semantic search, or compliance review.
- Normalize mixed document folders into one extraction and chunking model.
- Preserve headings, citations, token estimates, and source hashes while preparing content for downstream tools.
- Run extraction in background workers, containers, Azure Functions, or scheduled jobs.
Core workflow
- Extract a file into ReaderChunk instances with text, Markdown, tables, visuals, and source information.
- Tune ReaderOptions so emitted slices stay deterministic and sized for search or AI prompts.
- Store chunks, citations, and source identifiers in your vector store, search index, or audit trail.
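The storage step can be sketched without the library. The example below is a minimal illustration, not part of OfficeIMO.Reader: `IndexedChunk` and `InMemoryChunkStore` are hypothetical names, the dictionary stands in for a real vector store or search index, and the fields are an assumed flattening of whatever chunk data you choose to persist.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record: a flattened copy of the chunk fields worth persisting.
// It is not a type from OfficeIMO.Reader.
public sealed record IndexedChunk(
    string SourceId,   // stable identifier of the source document
    string ChunkId,    // chunk identifier within that source
    string Citation,   // heading path or file path shown to users
    string Text,       // extracted text to embed or index
    string Hash);      // content hash used to detect changes

// Hypothetical in-memory stand-in for a vector store or search index.
public sealed class InMemoryChunkStore
{
    private readonly Dictionary<string, Dictionary<string, IndexedChunk>> _bySource = new();

    // Replaces every chunk stored for one source so stale chunks never linger.
    // Chunks that belong to a different source are ignored.
    public void ReplaceSource(string sourceId, IEnumerable<IndexedChunk> chunks)
    {
        _bySource[sourceId] = chunks
            .Where(c => c.SourceId == sourceId)
            .ToDictionary(c => c.ChunkId);
    }

    // Returns the citation for a stored chunk, or null when it is unknown.
    public string? FindCitation(string sourceId, string chunkId)
    {
        return _bySource.TryGetValue(sourceId, out var chunks)
            && chunks.TryGetValue(chunkId, out var chunk)
            ? chunk.Citation
            : null;
    }

    // Total number of chunks across all sources.
    public int Count => _bySource.Values.Sum(chunks => chunks.Count);
}
```

Replacing all chunks for a source in one call keeps the store consistent when a document shrinks between runs. A production store would more likely upsert by chunk ID and delete only the IDs that disappeared.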
Quick start
```csharp
using System;
using System.Linq;
using OfficeIMO.Reader;

// Read the document into chunks and materialize them once.
var chunks = DocumentReader.Read("proposal.docx", new ReaderOptions
{
    MaxChars = 4_000,
    IncludeWordFootnotes = true,
    ComputeHashes = true
}).ToList();

foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.Id} :: {chunk.Kind}");
    // Prefer the heading path; fall back to the plain path when it is null.
    Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path);
    Console.WriteLine(chunk.TokenEstimate);
}
```
Formats and behavior
| Input | Typical use |
|---|---|
| Word (.docx) | Rich business documents, reports, contracts, and templates |
| Excel (.xlsx) | Workbook content, tabular reports, and structured exports |
| PowerPoint (.pptx) | Slide decks, speaker notes, and presentation narratives |
| Markdown | Documentation, changelogs, developer notes, and generated content |
| PDF | Published exports, archival documents, and third-party handoffs |
Design goals
- Deterministic chunking so repeated runs produce stable chunk boundaries.
- Heading-aware extraction so downstream systems retain document structure.
- Citation-friendly location data so search and AI responses can reference original sources.
- Incremental indexing support through source IDs, hashes, and per-document chunk summaries.
- Container-friendly execution with no Office installation requirements.
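To make the determinism and incremental-indexing goals concrete, the sketch below shows one way stable chunk boundaries and hash-based change detection can work together. It is not OfficeIMO.Reader's implementation: `ChunkingSketch`, `SplitByParagraph`, `HashOf`, and `ChangedIndexes` are hypothetical names, and splitting on blank lines under a character budget is an assumed strategy chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class ChunkingSketch
{
    // Greedily packs paragraphs into chunks of at most maxChars characters.
    // The same input and budget always yield the same boundaries.
    // A single paragraph longer than maxChars becomes its own oversized chunk.
    public static List<string> SplitByParagraph(string text, int maxChars)
    {
        var chunks = new List<string>();
        var current = new StringBuilder();
        var paragraphs = text
            .Replace("\r\n", "\n")
            .Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var paragraph in paragraphs)
        {
            var trimmed = paragraph.Trim();
            if (trimmed.Length == 0) continue;

            // +2 accounts for the blank line that joins paragraphs in a chunk.
            if (current.Length > 0 && current.Length + 2 + trimmed.Length > maxChars)
            {
                chunks.Add(current.ToString());
                current.Clear();
            }
            if (current.Length > 0) current.Append("\n\n");
            current.Append(trimmed);
        }
        if (current.Length > 0) chunks.Add(current.ToString());
        return chunks;
    }

    // SHA-256 of the chunk's UTF-8 bytes, as lowercase hex.
    public static string HashOf(string chunk)
    {
        using var sha = SHA256.Create();
        var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(chunk));
        return string.Concat(bytes.Select(b => b.ToString("x2")));
    }

    // Returns the positions whose hash differs from the previous run or is new.
    public static List<int> ChangedIndexes(
        IReadOnlyList<string> previousHashes,
        IReadOnlyList<string> currentChunks)
    {
        var changed = new List<int>();
        for (var i = 0; i < currentChunks.Count; i++)
        {
            var hash = HashOf(currentChunks[i]);
            if (i >= previousHashes.Count || previousHashes[i] != hash) changed.Add(i);
        }
        return changed;
    }
}
```

Calling `SplitByParagraph` twice on the same text gives identical chunks, so comparing stored hashes against a fresh run with `ChangedIndexes` flags only the chunks that need re-embedding. The comparison here is positional, so an insertion near the top of a document shifts every later chunk; keying hashes by a stable chunk ID avoids that.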
Related packages
- OfficeIMO.Word for producing .docx content before ingestion.
- OfficeIMO.Markdown for rendering, transforming, and re-emitting extracted content.
- AOT and Trimming for lean deployment guidance.