2026-02-23 · 5 min read · AI Engineering · Agentic Systems #2

Smart File Organizer (Part 2): Scaling Context

How to handle large documents in local LLMs using Map-Reduce summarization, token counting, and production-grade configuration.

In the first part of this series, I detailed how I built a local-first, AI-driven file organizer using Go and Llama-3 (via MLX). By leveraging a streaming pipeline and a worker pool, the system could rapidly categorize thousands of PDFs and invoices without sending a single byte of sensitive data to the cloud.

But as I fed the system larger and more complex datasets, a fundamental limitation of local, small-parameter models reared its head: The Context Window Chokepoint.

Here is how I architected the next evolution of the system—moving from naive text extraction to a recursive Map-Reduce summarization pipeline, and hardening the service for production.


1. The Context Window Chokepoint

When you run a small local model like Llama-3.2-1B, your context window is constrained. If you feed a 50-page research paper directly into the prompt to ask "What category does this belong in?", the model will either throw an error or truncate the input, losing the end of the document entirely.

Initially, my workaround was simple but flawed: Character Truncation. I would just grab the first N characters of a file and hope the necessary context was in the introduction.

But what if the classification context is buried in the conclusion? What if the document is a massive log file where the relevant error is at the bottom? Naive truncation doesn't work for a production-grade organizer.

2. Thinking in Tokens, Not Characters

The first step was to stop treating text as just "strings" and start treating it the way the LLM does: as Tokens.

I integrated a Go port of tiktoken to accurately measure the size of the textual payload before it ever hits the AI engine.

func (t *Tokenizer) CountTokens(text string) int {
	// Encode with the same BPE vocabulary the model uses; the
	// slice length is the exact number of tokens the text consumes.
	return len(t.bpe.Encode(text, nil, nil))
}

Now, the pipeline doesn't just guess if a text is too long. It knows exactly how much of the context budget a document will consume.

3. Map-Reduce Summarization (The Star Feature)

When a document exceeds the allowed token limit, we need to compress it without losing the semantic meaning required for categorization. The solution is a Recursive Map-Reduce pipeline.

Instead of classifying the entire document at once, the system breaks it down and summarizes it in phases:

Phase 1: Chunking

We split the massive text into smaller, token-safe chunks.
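A minimal sketch of the chunker: it packs paragraphs greedily into chunks that stay under the token budget, so no summary prompt cuts a thought mid-paragraph. The token counter is stubbed with a word count here; the real pipeline would use the BPE-based `CountTokens`.

```go
package main

import (
	"fmt"
	"strings"
)

// countTokens stubs the BPE counter with a word count for illustration.
func countTokens(text string) int {
	return len(strings.Fields(text))
}

// chunkText greedily packs paragraphs into chunks that each stay
// under maxTokens, starting a new chunk when the budget would be
// exceeded.
func chunkText(text string, maxTokens int) []string {
	paragraphs := strings.Split(text, "\n\n")
	var chunks []string
	var current strings.Builder
	currentTokens := 0
	for _, p := range paragraphs {
		pTokens := countTokens(p)
		if currentTokens+pTokens > maxTokens && current.Len() > 0 {
			chunks = append(chunks, current.String())
			current.Reset()
			currentTokens = 0
		}
		if current.Len() > 0 {
			current.WriteString("\n\n")
		}
		current.WriteString(p)
		currentTokens += pTokens
	}
	if current.Len() > 0 {
		chunks = append(chunks, current.String())
	}
	return chunks
}

func main() {
	doc := strings.TrimSpace(strings.Repeat("some paragraph with five words\n\n", 10))
	chunks := chunkText(doc, 12)
	fmt.Println(len(chunks)) // each chunk holds at most two 5-word paragraphs
}
```

One design note: splitting on paragraph boundaries rather than raw token offsets keeps each chunk semantically coherent, which noticeably improves the quality of the per-chunk summaries.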

Phase 2: Mapping (Concurrent Summarization)

We iterate over the chunks. For each chunk, we ask the local LLM to generate a dense, information-rich summary. Because these chunks are small, the inference is fast and accurate.

func (e *MLXEngine) summarizeChunk(ctx context.Context, text string, index, total int) (string, error) {
	prompt := fmt.Sprintf("Summarize the following document part (%d/%d). Keep key technical details, names, and core topics relevant for categorization:\n\n%s", index, total, text)
	// ... API call to local MLX instance ...
}

Phase 3: Reducing (Recursive Aggregation)

Once all chunks are summarized, we combine the summaries into a single new document.

But what if combined summaries still exceed the token budget? We apply the magic of recursion. The MapReduceSummarize function checks the token count of the combined summaries and, if necessary, sends it right back through the pipeline to be chunked and summarized again.

func (e *MLXEngine) MapReduceSummarize(ctx context.Context, text string, limit int) (string, error) {
	// ... (Chunking and Mapping logic) ...

	// 3. Reduce: Combine and recursively summarize if needed
	combined := strings.Join(summaries, "\n\n")
	combinedTokens := e.ctxMgr.tokenizer.CountTokens(combined)

	if combinedTokens > limit {
		// Recursive reduction
		return e.MapReduceSummarize(ctx, combined, limit)
	}

	return combined, nil
}

This ensures that even a 1,000-page document is distilled down into a dense, context-rich representation that perfectly fits the classification prompt.

4. Hardening Extraction and Memory

While building the Map-Reduce flow, I realized the initial file extraction logic was allocating massive buffers indiscriminately. To make the pipeline truly scalable, the extractor package was updated to dynamically size its read buffers based on the actual file size.

// Cap the read buffer at the file's actual size so small files
// don't allocate the full default limit.
stat, err := f.Stat()
if err == nil && int(stat.Size()) < limit {
	limit = int(stat.Size())
}
buf := make([]byte, limit)

It’s a small detail, but in a streaming pipeline processing thousands of files concurrently, these micro-optimizations prevent the nightmare of memory bloat.

5. Moving to Production-Grade Configs

A system stops being a "script" and starts being a "platform" when you decouple its configuration from its code.

I stripped out the standard Go flag parsing and integrated Viper. Now, the entire pipeline—from context window limits to API URLs and worker counts—is driven by a structured config.yaml and environment variables.

# config.yaml (Example)
pipeline:
  source_dir: "/Users/me/Downloads"
  dest_dir: "/Users/me/Organized"
  workers: 5
 
ai:
  api_url: "http://localhost:8080/v1"
  model_name: "mlx-community/Llama-3.2-1B-Instruct-4bit"
  context_window: 8000
  extract_limit: 100000

This adheres to standard 12-factor application principles, making the engine much easier to deploy as a background daemon or run in different environments without altering command-line arguments.
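The YAML above maps onto plain Go structs that a `viper.Unmarshal` call can fill directly. The struct layout and `mapstructure` tags below are my sketch of how such a binding typically looks (the field names match the example config; Viper itself is omitted so the snippet stays self-contained):

```go
package main

import "fmt"

// Config mirrors config.yaml; with Viper, viper.Unmarshal(&cfg)
// populates it using the mapstructure tags.
type Config struct {
	Pipeline PipelineConfig `mapstructure:"pipeline"`
	AI       AIConfig       `mapstructure:"ai"`
}

type PipelineConfig struct {
	SourceDir string `mapstructure:"source_dir"`
	DestDir   string `mapstructure:"dest_dir"`
	Workers   int    `mapstructure:"workers"`
}

type AIConfig struct {
	APIURL        string `mapstructure:"api_url"`
	ModelName     string `mapstructure:"model_name"`
	ContextWindow int    `mapstructure:"context_window"`
	ExtractLimit  int    `mapstructure:"extract_limit"`
}

func main() {
	// Hand-filled here for illustration; Viper would read these
	// values from config.yaml or environment variables.
	cfg := Config{
		Pipeline: PipelineConfig{Workers: 5},
		AI:       AIConfig{ContextWindow: 8000},
	}
	fmt.Println(cfg.Pipeline.Workers, cfg.AI.ContextWindow)
}
```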

Conclusion

Building this second phase highlighted an important reality of AI Engineering: The models are just the engine; you still have to build the car.

Handling context windows isn't about finding a bigger model; it's about building intelligent data pipelines (like Map-Reduce) that curate the context before the inference happens. By adding token awareness, dynamic extraction buffers, and structured configuration, the File Organizer has officially graduated from a neat weekend project into a resilient, production-grade backend service.

References

  • MLX Framework - Apple's machine learning research framework for Apple Silicon.
  • tiktoken-go - A Go port of OpenAI's tiktoken for accurate BPE token counting.
  • Map-Reduce Pattern - The foundational distributed computing model adapted here for LLM context summarization.
  • Viper - Go configuration with fangs, used for production-grade settings management.
  • The Twelve-Factor App - Methodology for building backend applications, specifically regarding configuration.

Found this insight useful?

Follow me on X/Twitter for daily systems engineering updates.
