A Series B legal tech company deployed an AI agent to handle contract review escalations. The agent had access to every support ticket, every customer email thread, and a 200-page knowledge base.
Day one: impressive. The agent was catching edge cases, flagging risks, providing accurate guidance.
Day three: confused. The agent started contradicting itself across threads.
Day seven: confidently telling customers things that directly contradicted decisions made two weeks earlier in email exchanges it couldn't parse.
The problem wasn't the model. GPT-5 is excellent at contract analysis when you feed it a clean contract. The problem was that the agent had no idea what had actually happened. It couldn't reconstruct the conversation history. It couldn't tell that when the VP of Product said "let's hold off on this" in message 6 of an 18-email thread, that decision superseded everything that came before. It couldn't detect that three days of silence after "I'll look into this" meant the issue had been abandoned, not resolved.
The agent was brilliant in isolation and completely lost in context.
Here's what breaks most enterprise AI projects before they even ship:
Your CRM is structured. Your dashboards are structured. Your task lists are structured.
None of that is where real decisions actually happen.
Real decisions happen in email threads where the conclusion evolves across 47 replies, in Slack debates where someone says "nvm" and reverses three days of planning, in Google Docs with comment wars buried in the margins, in forwarded chains where the actual decision is in message 3 of 11 and everything else is just context you need to understand why.
This is messy, recursive, full of implied meaning and unstated intent. Humans navigate it fine because we track narrative continuity automatically. We know that when Sarah says "I'll handle this" in one thread and then goes silent for three weeks in a related thread, there's a blocker we need to surface.
AI does not know this. AI sees tokens, not narrative. It sees text, not story.
Email is brutally difficult for the same reasons it's brutally valuable:
Replies include half-quoted fragments, creating recursive nested structure. Forwards create thread forks where conversations branch into parallel timelines. Participants join mid-context, so "we decided" means different groups at different points. Tone shifts signal risk: three "sounds good" replies followed by "actually, quick question" usually means a deal is unraveling. Attachments carry business logic but are referenced only indirectly. People say "I'll send it Friday" instead of "task assigned with deadline November 22."
Email is not text. Email is conversation architecture wrapped around text.
Understanding it requires reconstructing conversation logic, not just processing sentences. That's where most AI breaks.
So everyone tries the same four solutions. They all fail for the same reason.
Solution one: stuff everything into the prompt. The theory: give the LLM all the context and let it figure it out.
The result: slow, expensive, brittle, hallucination-prone.
LLMs don't get better with more tokens—they drown. A 50-email thread has maybe 3 emails that matter and 47 that are conversational scaffolding. The model can't tell the difference. It weighs everything equally, gets confused by contradictions, and invents a conclusion that sounds plausible but reflects nothing that actually happened.
Solution two: RAG. The theory: retrieve relevant emails, let semantic search handle the rest.
The result: great for documents, terrible for conversations.
RAG can retrieve the five most relevant emails. But it can't tell you that the reply on line 47 contradicts the conclusion at the top. It can't detect that "sounds good" from the CFO means approval while "sounds good" from an intern means nothing. It can't model that this thread forked into three parallel conversations and the decision in fork B invalidates discussion in fork A.
RAG gives you pieces. You need narrative. Those aren't the same thing.
Solution three: fine-tuning. The theory: train the model on your communication patterns.
The result: a smarter parrot, not a better historian.
Fine-tuning can make an LLM better at extracting action items from your team's phrasing. But it won't help the model understand that when Sarah commits to something in Thread A and then goes silent in Thread B about the same topic for three weeks, there's a blocker you need to know about.
You can't fine-tune your way into understanding live, constantly changing, multi-participant conversations that span weeks and branch across tools. Fine-tuning optimizes for patterns. Conversations are graphs.
Solution four: custom classifiers. We tried this. Everyone tries this.
You end up building a zoo of weak micro-detectors: sentiment classifiers, task extractors, decision markers, owner identifiers, deadline parsers, risk signals, tone analyzers. They're individually okay. Together they're fragile, contradictory, and they break the moment someone writes "sure, that works" instead of "approved" or "not sure about this" instead of "I have concerns."
The classifiers don't talk to each other. They don't share context. They don't understand that the same phrase means different things depending on who says it and when. You spend six months building and tuning them, and they still miss the thing that matters: the narrative arc of the conversation.
None of these solutions address the actual problem. Human communication is not explicit. It has to be reconstructed.
Ask an LLM what your team decided last week. It can't tell you. Not because it's bad at summarization, but because it doesn't have the assumptions required to interpret what happened.
Without those assumptions, harmless emails look angry. A routine "following up on this" gets flagged as urgent when it's not. Major commitments go unnoticed because they're phrased as casual agreements. Tasks slip silently because "I'll take a look" isn't recognized as a soft commitment that needs tracking. Deals stall because the agent doesn't detect that three polite emails in a row with no concrete next steps means the prospect is ghosting.
Humans track backstory naturally. We know the relationships. We know the history. We know that this person always says "let me think about it" when they mean no, and that person says "yeah maybe" when they mean yes. We weight recency against contradiction. We notice when someone who's usually responsive goes silent.
Machines need help. Specifically, they need structure.
We stopped trying to make LLMs magically understand raw email. Instead, we built an engine that transforms unstructured communication into structured intelligence before it ever touches a model.
Think of it as a preprocessor for human conversation.
The first layer handles OAuth sync, real-time pull, attachment linking, message normalization.
The second layer is where it gets hard: parsing nested replies, forwards, inline quoting, participant changes, time gaps, reference resolution. When someone says "see attached," the system needs to know which attachment from which message sent by which person at which point. This is conversation archaeology.
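To make the reference-resolution step concrete, here's a minimal sketch of one heuristic: resolving a bare "see attached" to the most recent attachment the same sender included earlier in the thread. The data classes and the heuristic are simplified illustrations, not the production parser.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Simplified records for illustration; a real parser works from MIME parts,
# quoted bodies, and forward headers.
@dataclass
class Attachment:
    filename: str

@dataclass
class Message:
    sender: str
    sent_at: datetime
    body: str
    attachments: list[Attachment] = field(default_factory=list)

def resolve_see_attached(thread: list[Message], referring: Message) -> Attachment | None:
    """One heuristic: 'see attached' most likely points at the latest attachment
    the same sender included at or before the referring message."""
    prior = sorted(
        (m for m in thread if m.sender == referring.sender and m.sent_at <= referring.sent_at),
        key=lambda m: m.sent_at,
    )
    for message in reversed(prior):
        if message.attachments:
            return message.attachments[-1]
    return None
```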
The reasoning layer models conversation as a graph, not a list. Each message is a node. Replies create edges. Forwards create new subgraphs. The system tracks sentiment over time as trends, not static labels. It tracks commitments and whether they're followed up on. It detects when tone shifts from collaborative to defensive. It flags when someone makes a decision and then contradicts it three days later. It notices when a task is assigned and then silently dropped.
It extracts tasks as commitments with owners, implied deadlines, and context. It extracts decisions as outcomes with history, dissent tracked, follow-through monitored.
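As a rough sketch of what "conversation as a graph, not a list" means in code, assuming standard Message-ID / In-Reply-To headers are available (the field names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    message_id: str
    sender: str
    replies: list["Node"] = field(default_factory=list)

def build_reply_graph(emails: list[dict]) -> list[Node]:
    """Link messages into reply trees. Messages whose parent isn't in the thread
    (forwards, forks, the original message) become new roots."""
    nodes = {e["message_id"]: Node(e["message_id"], e["sender"]) for e in emails}
    roots = []
    for e in emails:
        parent = nodes.get(e.get("in_reply_to"))
        if parent:
            parent.replies.append(nodes[e["message_id"]])
        else:
            roots.append(nodes[e["message_id"]])
    return roots
```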
It understands that "I'm not sure this is right" means different things depending on who says it and when. From a junior engineer two days before launch, it's flag-for-review. From the CTO three weeks into a project, it's stop-and-rethink. The system needs to know both role and timing to interpret correctly.
The engine returns clean, predictable JSON: decisions with timestamps and participants, tasks with owners and deadlines, risks with severity scores and trends, sentiment analysis showing how discussions evolve, blockers when commitments go silent.
Now downstream systems can reason over it. Instead of trying to interpret "let's revisit this next week," they get a structured task with an implied deadline and a flag that this is soft postponement, not hard commitment.
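For a sense of what "reasoning-ready" looks like, here's an illustrative sketch of that output as a Python dict. The field names and values are invented for this example, not the actual schema:

```python
# Illustrative only: the real schema and field names may differ.
structured_signals = {
    "decisions": [
        {
            "summary": "Hold off on the pricing change",
            "decided_by": "vp.product@example.com",
            "decided_at": "2025-03-14T16:02:00Z",
            "supersedes": ["earlier proposal in the same thread"],
        }
    ],
    "tasks": [
        {
            "summary": "Revisit the proposal",
            "owner": "sarah@example.com",
            "implied_deadline": "2025-03-21",
            "commitment_strength": "soft_postponement",  # "let's revisit this next week"
        }
    ],
    "risks": [
        {"type": "stalled_deal", "severity": 0.7, "trend": "worsening"}
    ],
    "sentiment_trend": {"direction": "declining", "window_days": 14},
    "blockers": [
        {"task": "Revisit the proposal", "silent_for_days": 6}
    ],
}
```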
Half of business communication is polite ambiguity. "Got it." "Works for me." "Let's revisit this." None are explicit commitments. All imply something, but what they imply depends on context you can't get from text alone.
The fix wasn't better pattern matching. It was building a system that reconstructs context first, then interprets patterns within that context.
Reply trees fork. Forwards create alternate timelines. Someone CCs a new person, and now there are two parallel discussions in what looks like one thread.
You have to reconstruct the entire graph, not read sequentially. You can't process email as a list. You have to process it as a directed acyclic graph with multiple roots, tracking which branches are active and which are abandoned.
Email Thread Structure (What AI Actually Sees)

```
Message 1 ─┐
           ├─ Reply 2 ── Reply 4 ── Reply 7
           └─ Reply 3 ──┐
                        ├─ Forwarded Chain → Reply 5
                        └─ Reply 6 (new participant) ── Reply 8

Active branches: 7, 8
Abandoned: 5
Decision made in: 7 (contradicts discussion in branch 3→6)
```
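Deciding which branches are active versus abandoned can start from something as simple as recency per branch tip. A toy sketch, with the five-day threshold and the branch labels invented for illustration:

```python
from datetime import datetime, timedelta, timezone

def classify_branches(branch_last_activity: dict[str, datetime],
                      now: datetime,
                      stale_after: timedelta = timedelta(days=5)) -> dict[str, str]:
    """Mark each branch tip active if it saw a message recently, abandoned otherwise.
    A toy heuristic; the threshold is invented."""
    return {
        branch: "active" if now - last_seen < stale_after else "abandoned"
        for branch, last_seen in branch_last_activity.items()
    }

now = datetime.now(timezone.utc)
print(classify_branches(
    {"reply_7": now - timedelta(days=1),    # decision branch, still moving
     "reply_8": now - timedelta(days=2),    # new participant, still moving
     "reply_5": now - timedelta(days=14)},  # forwarded chain, gone quiet
    now,
))
# {'reply_7': 'active', 'reply_8': 'active', 'reply_5': 'abandoned'}
```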
A single calm email means nothing. A downward trend across weeks means everything.
The signal isn't in the individual message—it's in the trajectory. Three "sounds good" emails followed by "actually, quick question" is a leading indicator that a deal is unraveling. The system needed to track slope, not state.
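One way to track slope rather than state is a plain least-squares fit over per-message sentiment scores. The scores and time units below are invented for illustration:

```python
from statistics import mean

def sentiment_slope(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of sentiment over time.
    Each point is (days since thread start, sentiment score in [-1, 1])."""
    if len(points) < 2:
        return 0.0
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom

# Three "sounds good" replies, then a hesitant "actually, quick question":
print(sentiment_slope([(0, 0.6), (3, 0.6), (7, 0.5), (12, -0.2)]))  # negative slope
```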
This is why AI copilots feel smart on day one and stupid by day ten. They don't remember what happened. They don't track how decisions evolved. They treat every conversation as isolated, when every conversation is part of a larger story.
The fix was building memory that persists across conversations and tools. Not just "here's what we discussed," but "here's what we decided, who committed to what, what's still open, what changed, what got dropped."
Story continuity is the difference between an AI that helps and an AI that confuses.
You cannot parse email threads with regex. Conversation structure is too complex, too recursive, too contextual for pattern matching. You need graph reconstruction.
Narrative continuity matters more than token count. Stuffing 50 emails into a prompt gives the model noise, not context. It needs to know what happened, in what order, and why it matters.
Without structured context, agents drift. They'll be brilliant on day one and incoherent by day ten because they have no memory of decisions, no tracking of commitments, no awareness of how conversations evolved.
The bottleneck isn't the model. GPT-5 is excellent at reasoning when you give it clean, structured input. The bottleneck is turning unstructured communication into that input.
This layer has to exist somewhere. You either build it yourself (months of work, ongoing maintenance, endless edge cases) or you use infrastructure that already handles it.
If you're building with LangChain, LangGraph, LlamaIndex, or custom agent frameworks, you eventually hit the same brick wall: the model needs structured context, not raw text. You can chain prompts and implement sophisticated RAG pipelines, but none of that solves reconstructing narrative from unstructured communication.
Every AI product that touches human communication needs this. Customer support AI that can't track escalation history is useless. Legal AI that can't reconstruct contract negotiation history can't assess risk. Sales AI that can't detect when a deal is stalling can't help close.
Everything breaks without structured context. This is the missing layer.
We spent three years building it because email is our core product. Most developers don't have three years. They need this layer to exist so they can build on top of it.
The system we built is available as the Email Intelligence API. It takes raw email and returns structured, reasoning-ready signals.
You call a single endpoint. You get back tasks with owners and deadlines, decisions with participants and history, risks scored and tracked over time, sentiment trends, blockers identified when commitments go silent.
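A hedged sketch of what an integration might look like. The endpoint URL, auth scheme, and field names here are placeholders, not the documented contract:

```python
import requests

# Placeholder endpoint, auth, and field names; the real request shape will differ.
response = requests.post(
    "https://api.example.com/v1/email-intelligence",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"thread": [{"raw_mime": "..."}]},  # raw messages for one thread
    timeout=30,
)
signals = response.json()

for task in signals.get("tasks", []):
    print(task.get("owner"), task.get("summary"), task.get("implied_deadline"))
```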
No prompt chains. No stitching RAG results. No building custom classifiers for six months.
We've been running this in production for two years. Developers integrate it in under a day. It processes millions of emails monthly with 90%+ accuracy on decision extraction and task identification.
If you're building AI tools that touch email, chat, or docs, this is the layer you don't want to build yourself.
The next wave of AI won't be about bigger models. It'll be about better context.
Most teams are still trying to improve prompts, trying to get GPT-5 to be 5% better at summarizing messy email threads. That's the wrong problem.
The bottleneck isn't the model. The bottleneck is that the model has no idea what's going on. It's blind to your history, your relationships, your decisions, your commitments. It's analyzing text when what it needs is story.
Context doesn't come from the web. Context doesn't come from bigger models. Context comes from your work—and your work is trapped in unstructured communication that AI can't parse without help.
Fix that, and AI stops sounding smart and starts being useful.
The Email Intelligence API is part of iGPT's context engine for AI developers. If this is the problem you're solving, we've already built the infrastructure.