Reader and Extraction
Edit on GitHubOverview of the OfficeIMO.Reader package for unified document extraction and AI-ready chunking workflows.
Reader and Extraction
OfficeIMO.Reader provides a single extraction surface for the built-in formats currently handled in the repo, plus optional adapters for adjacent document types. Instead of maintaining separate parsing pipelines for .docx, .xlsx, .pptx, Markdown, PDF, CSV, JSON, XML, HTML, EPUB, ZIP, or text-like files, you can normalize them into one chunk model and then feed that output into indexing, search, and AI workflows.
Best fit scenarios
- Build ingestion pipelines for RAG, semantic search, or compliance review.
- Normalize mixed document folders into one extraction and chunking model.
- Preserve headings, citations, token estimates, and source hashes while preparing content for downstream tools.
- Run extraction in background workers, containers, Azure Functions, or scheduled jobs.
Core workflow
- Extract a file into
ReaderChunkinstances with text, markdown, tables, visuals, and source information. - Tune
ReaderOptionsso emitted slices stay deterministic and sized for search or AI prompts. - Store chunks, citations, and source identifiers in your vector store, search index, or audit trail.
Quick start
using OfficeIMO.Reader;
var chunks = DocumentReader.Read("proposal.docx", new ReaderOptions
{
MaxChars = 4_000,
IncludeWordFootnotes = true,
ComputeHashes = true
}).ToList();
foreach (var chunk in chunks)
{
Console.WriteLine($"{chunk.Id} :: {chunk.Kind}");
Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path);
Console.WriteLine(chunk.TokenEstimate);
}
Formats and behavior
| Input | Typical use |
|---|---|
Word (.docx) | Rich business documents, reports, contracts, and templates |
Excel (.xlsx) | Workbook content, tabular reports, and structured exports |
PowerPoint (.pptx) | Slide decks, speaker notes, and presentation narratives |
| Markdown | Documentation, changelogs, developer notes, and generated content |
| Published exports, archival documents, and third-party handoffs | |
| CSV/TSV | Structured flat files through OfficeIMO.Reader.Csv |
| JSON | API exports and structured payloads through OfficeIMO.Reader.Json |
| XML | Element-oriented documents through OfficeIMO.Reader.Xml |
| HTML | Web pages and exported fragments through OfficeIMO.Reader.Html |
| EPUB | Books and packaged publications through OfficeIMO.Reader.Epub |
| ZIP | Archive traversal through OfficeIMO.Reader.Zip |
Adapter packages
The core package covers the main Office-oriented extraction flow. Add adapter packages only when the deployment needs those input families:
OfficeIMO.Reader.Csvfor CSV and TSV ingestion.OfficeIMO.Reader.Jsonfor JSON payloads.OfficeIMO.Reader.Xmlfor XML documents.OfficeIMO.Reader.Textwhen CSV, JSON, and XML adapters should travel together.OfficeIMO.Reader.Htmlfor HTML normalization through the Markdown pipeline.OfficeIMO.Reader.Epubfor EPUB publications.OfficeIMO.Reader.Zipfor safe archive traversal.
Use the Reader API reference for the core DocumentReader, ReaderOptions, chunk, manifest, and handler registration types. Adapter-specific APIs are documented through their package README files until the website generates separate adapter reference sections.
Design goals
- Deterministic chunking so repeated runs produce stable chunk boundaries.
- Heading-aware extraction so downstream systems retain document structure.
- Citation-friendly location data so search and AI responses can reference original sources.
- Incremental indexing support through source IDs, hashes, and per-document chunk summaries.
- Container-friendly execution with no Office installation requirements.
Related packages
- OfficeIMO.Word for producing
.docxcontent before ingestion. - OfficeIMO.Markdown for rendering, transforming, and re-emitting extracted content.
- AOT and Trimming for lean deployment guidance.