Unified document extraction from Word, Excel, PowerPoint, Markdown, PDF, and text-like inputs. Deterministic chunks for indexing and AI workflows.

Heading-aware text and metadata: title, author, headings, page count, and normalized body text ready for chunking.

```shell
dotnet add package OfficeIMO.Reader
```

An AI-ready extraction pipeline: a single ingestion surface for Office documents, Markdown, and PDFs. Normalize once, then feed the same output into search, RAG, and review workflows.
OfficeIMO.Reader provides a single API to extract structured content from common document formats. Feed it Word, Excel, PowerPoint, Markdown, PDF, or text-like inputs and get back normalized chunks with location data, source hashes, and token estimates. It is purpose-built for RAG pipelines, search indexing, and any workflow where you need reproducible document slices instead of ad-hoc parsers.
| Workflow | Output | Why Reader fits |
|---|---|---|
| Knowledge ingestion services | Chunked text plus source IDs and token estimates for vector stores and semantic search | One extractor handles mixed Office and adjacent formats with the same result model |
| Compliance and review pipelines | Searchable evidence bundles with headings and citations | Stable chunk boundaries make reviews and re-runs easier to compare |
| File-share indexing jobs | Normalized documents ready for Lucene, Elasticsearch, or Azure AI Search | Batch extraction works well in workers, scheduled jobs, and containers |
| Content migration tools | Markdown, JSON, or sidecar artifacts derived from legacy documents | Structured extraction keeps enough source context to transform before re-emitting |
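The extractor-to-index mapping in the first row can be sketched by projecting chunks into flat records. A minimal illustration, assuming the chunk properties used in the examples below (`Id`, `Location.Path`, `Markdown`, `Text`, `TokenEstimate`); the `IndexRecord` shape is hypothetical and should be adapted to your Lucene, Elasticsearch, or Azure AI Search schema:

```csharp
using OfficeIMO.Reader;

// Hypothetical flat record for a search index; adapt fields to your schema.
record IndexRecord(string Id, string SourcePath, string Text, int Tokens);

var records = DocumentReader
    .Read("policy.docx", new ReaderOptions { MaxChars = 4_000, ComputeHashes = true })
    .Select(chunk => new IndexRecord(
        chunk.Id,
        chunk.Location.Path ?? "unknown",
        chunk.Markdown ?? chunk.Text,
        chunk.TokenEstimate ?? 0))
    .ToList();
```

Because the same result model covers every input format, one mapping like this serves the whole mixed file share.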
```csharp
using OfficeIMO.Reader;

// Read a single document into normalized chunks.
var chunks = DocumentReader.Read("report.docx", new ReaderOptions
{
    MaxChars = 4_000,
    IncludeWordFootnotes = true,
    ComputeHashes = true
}).ToList();

// Each chunk carries an ID, kind, token estimate, location, and text.
foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.Id} :: {chunk.Kind} :: ~{chunk.TokenEstimate ?? 0} tokens");
    Console.WriteLine(chunk.Location.HeadingPath ?? chunk.Location.Path ?? "unknown");
    Console.WriteLine(chunk.Markdown ?? chunk.Text);
    Console.WriteLine();
}
```
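Because each chunk carries a token estimate, budgeting context for a RAG prompt is a simple fold over the list. A sketch continuing from the `chunks` list above; the 3,000-token budget is an arbitrary example, not a library default:

```csharp
// Greedily pack chunks until the (illustrative) token budget is exhausted.
var budget = 3_000;
var used = 0;
var context = new List<string>();

foreach (var chunk in chunks)
{
    var cost = chunk.TokenEstimate ?? 0;
    if (used + cost > budget) break;
    used += cost;
    context.Add(chunk.Markdown ?? chunk.Text);
}

Console.WriteLine($"Selected {context.Count} chunks (~{used} tokens)");
```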
```csharp
// Walk a folder tree and extract every supported document in a deterministic order.
var documents = DocumentReader.ReadFolderDocuments(
    folderPath: "./documents",
    folderOptions: new ReaderFolderOptions
    {
        Recurse = true,
        DeterministicOrder = true,
        MaxFiles = 500
    },
    options: new ReaderOptions
    {
        MaxChars = 4_000,
        ComputeHashes = true
    },
    onProgress: progress =>
        Console.WriteLine($"{progress.Kind}: scanned={progress.FilesScanned}, parsed={progress.FilesParsed}, skipped={progress.FilesSkipped}, chunks={progress.ChunksProduced}")
).ToList();

Console.WriteLine($"Processed {documents.Count} files");
Console.WriteLine($"Parsed {documents.Count(d => d.Parsed)} files");
Console.WriteLine($"Returned {documents.Sum(d => d.ChunksProduced)} chunks");
```

Reuse the same ReaderOptions so citations and downstream embeddings stay stable across repeated runs.

| Target Framework | Supported |
|---|---|
| .NET 10.0 | Yes |
| .NET 8.0 | Yes |
| .NET Standard 2.0 | Yes |
| .NET Framework 4.7.2 | Yes |
OfficeIMO.Reader targets the same cross-platform .NET runtimes as the packages it builds on, and the core extraction flow is a good fit for containers, hosted services, and server-side indexing jobs. As with any mixed-format pipeline, validate your exact deployment shape, input set, and runtime targets before treating it as broadly portable across every environment.
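For hosted services, the folder API drops naturally into a background worker. A sketch assuming the standard `Microsoft.Extensions.Hosting` `BackgroundService` pattern; the folder path, one-hour schedule, and `PushToIndex` hook are illustrative, not part of OfficeIMO.Reader:

```csharp
using Microsoft.Extensions.Hosting;
using OfficeIMO.Reader;

public sealed class IndexingWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Re-scan the share on a fixed interval; deterministic order keeps
            // repeated runs comparable.
            var documents = DocumentReader.ReadFolderDocuments(
                folderPath: "/data/documents",
                folderOptions: new ReaderFolderOptions { Recurse = true, DeterministicOrder = true },
                options: new ReaderOptions { MaxChars = 4_000, ComputeHashes = true }
            ).ToList();

            // PushToIndex is a placeholder for your indexing call.
            // PushToIndex(documents);

            await Task.Delay(TimeSpan.FromHours(1), stoppingToken);
        }
    }
}
```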
| Guide | Description |
|---|---|
| Reader documentation | Learn the core extraction model, chunking workflow, and ingestion patterns. |
| AOT and trimming | Review runtime and deployment guidance for lean extraction services. |
| Reader tutorial | Walk through chunk inspection, folder ingestion, and indexing-oriented extraction patterns. |
| OfficeIMO.Markdown | Pair extraction with markdown rendering and transformation workflows. |