Shape the Field: AI Agents in Product
Speaker notes and outline
Opening (1 min)
"I've been running AI agents as my actual operating system for about a year now - across startups, an advisory role, and now a large enterprise. Not experimenting. Operating. And I built the whole thing with a product manager's lens - backlogs, feedback loops, interaction contracts, value checks. That framing turned out to matter more than the AI itself. I want to share what I learned across three phases."
Frame the three acts:
- What I was building when I first started (the solo phase)
- How my thinking evolved (the shift to enterprise)
- What the space looks like 12 months from now
Act 1: Early Use Cases (3 min)
Key message: Agents started as task automation. The interesting part is when they stopped being tools and became domain-specific operating systems.
Where everyone starts
Task automation. Auto-bookers, calendar sync, receipt scanners. The agent does what a human would do, just faster. This is where most orgs are right now.
Where it gets interesting: agents for real businesses
- An early-stage startup: Built a full operating layer - an email pipeline that triages VIP contacts, classifies messages, and ingests them into a 200+ doc knowledge base. A data room agent that takes a one-line email and turns it into investor doc edits + deck uploads + confirmation drafts. Financial models, vendor comparisons, location-level unit economics. An autonomous PM agent that runs every 6 hours and pushes back when deadlines slip.
- A compliance review firm: Built a vision-based evaluator that reviews field documentation photos against industry standards. The unlock was sending full document sets instead of samples - the model started catching things humans missed (location mismatches, signature gaps, backdated records). 29-submission batch, correctly triaged pass / hold / incomplete. Heading to production.
- Agents that build software: Set up a plan-build-eval loop. Agent reads context and backlog, writes an implementation plan, executes it (code, tests, commits), then a second agent reviews the diff and feeds back. Loop runs until eval passes - human approves the merge. Multiple production apps shipped this way. The agent doesn't just help you manage work - it does the work. You become the evaluator, not the operator.
- The pattern across all three: I approached the agent the way a PM approaches a product. What's the user need? What's the feedback loop? Where does the system break? What's the minimum that delivers value? The value wasn't in any single prompt - it was in the accumulated system.
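For anyone who asks what the plan-build-eval loop looks like in practice, here is a minimal sketch, not the actual implementation. The names (`plan_build_eval`, `EvalResult`, the `demo_*` stubs) are hypothetical, the three agents are passed in as plain callables, and the `max_rounds` cap is an assumption added so the loop cannot run forever. The merge step is left out on purpose because the human approves it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    passed: bool
    feedback: str = ""


def plan_build_eval(
    plan: Callable[[str], str],
    build: Callable[[str], str],
    evaluate: Callable[[str], EvalResult],
    backlog_item: str,
    max_rounds: int = 5,  # assumption: a safety cap, not from the talk
) -> tuple[str, int]:
    """Run plan -> build -> eval until the reviewing agent passes the diff.

    Returns the approved diff and the number of rounds used.
    Merging is not done here: that stays with the human.
    """
    context = backlog_item
    for round_no in range(1, max_rounds + 1):
        implementation_plan = plan(context)
        diff = build(implementation_plan)
        result = evaluate(diff)
        if result.passed:
            return diff, round_no
        # Feed the reviewer's comments into the next planning pass.
        context = f"{backlog_item}\nReviewer feedback: {result.feedback}"
    raise RuntimeError(f"eval did not pass within {max_rounds} rounds")


# Stub agents standing in for real model calls (hypothetical).
def demo_plan(context: str) -> str:
    return f"PLAN[{context}]"


def demo_build(implementation_plan: str) -> str:
    return f"DIFF[{implementation_plan}]"


def demo_evaluate(diff: str) -> EvalResult:
    # The stub reviewer only passes once its feedback has been incorporated.
    if "Reviewer feedback" in diff:
        return EvalResult(passed=True)
    return EvalResult(passed=False, feedback="add tests")


approved_diff, rounds = plan_build_eval(
    demo_plan, demo_build, demo_evaluate, "Add export button"
)
```

With these stubs the first round fails review and the second passes, which is the shape of the loop: the builder never grades its own work, and the human only sees a diff that already cleared the evaluator.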
The turning point: memory
- Added persistent memory, a task system the agent reads and writes to, strategic context that loads every session
- Built a PM layer on top: every completed task gets a value check (what artifact was produced, what decision was made, what metric moved). Tasks that produce no tangible output get flagged.
- Session 1 and session 250 feel like the same conversation
- The agent stopped being a tool I use and became a system I work inside
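The value check can be sketched in a few lines, assuming each completed task is a record with three optional fields. `CompletedTask`, `value_check`, and `flag_no_output` are hypothetical names for illustration; the real task system's schema is not shown in these notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletedTask:
    title: str
    artifact: Optional[str] = None      # what artifact was produced
    decision: Optional[str] = None      # what decision was made
    metric_moved: Optional[str] = None  # what metric moved


def value_check(task: CompletedTask) -> bool:
    """True if the task produced at least one tangible output."""
    outputs = (task.artifact, task.decision, task.metric_moved)
    return any(value and value.strip() for value in outputs)


def flag_no_output(tasks: list[CompletedTask]) -> list[str]:
    """Titles of completed tasks that produced nothing tangible."""
    return [task.title for task in tasks if not value_check(task)]


done = [
    CompletedTask("Vendor comparison", artifact="vendors.xlsx"),
    CompletedTask("Think about pricing"),
    CompletedTask("Pick launch city", decision="Start with one location"),
]
flagged = flag_no_output(done)
```

Here only "Think about pricing" is flagged: it was marked complete but has no artifact, decision, or metric attached.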
"All of this was for me and small teams. Full control, fast iteration. Then I took a role at a large company and everything broke."
Act 2: What Changed (7-8 min)
Key message: The pattern survived the transition. The implementation had to be completely rewritten. Three weeks to get it load-bearing.
The 3-week bootstrap
| Week | Mode | What happened |
|---|---|---|
| 1 | Bootstrap | Setup. Basic triage. AI captures, doesn't really act. Mostly trust-building. |
| 2 | OS Overhaul | Rewrote the system mid-week. AI started doing real work through the system. First skills emerged. |
| 3 | Leverage | Load-bearing for daily ops and strategic synthesis. Started building infrastructure for other people to use. |
The shape: capture -> automate -> leverage -> share.
What "Share" actually looked like (Week 3)
- Started extracting patterns that weren't just personal - the triage pipeline, the meeting prep flow, the drafting-with-context pattern
- Built shareable skills: reusable prompt templates that encode a workflow, not just a task
- The hard part: separating personal context from transferable infrastructure
- This is where it shifts from "I'm more productive" to "I'm changing how the team operates"
The Manager OS: 8 use cases
Not a chatbot. A manager OS. Two repos: one shareable, one private. AI handles inputs and drafts; the human owns judgment and the send button.
- Daily intelligence triage - One sweep of email, chat, calendar, transcripts, shared files, work items, service health
- Meeting prep & capture - Auto-prep before, auto-capture after. Action items reconcile back to a unified backlog
- Drafting with context - Replies pulled from full thread history, KB, and prior interactions. Written in your voice. Click-to-copy HTML draft, never auto-send
- People intelligence - Every shared artifact updates a running profile: what they own, what they care about, friction points
- Knowledge base - LLM-maintained pages over curated raw sources. Humans curate inputs, AI structures the knowledge
- Strategic synthesis - Briefs, POV docs, framings generated from accumulated context, not from scratch
- Backlog & follow-through - Unified store for todos, drafts, delegated items, watch items. Auto-resolves when downstream evidence shows up
- Operational hygiene - Privacy guards on every commit, nightly lint, session memory rotation
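A rough sketch of the auto-resolve idea from the backlog use case. It assumes an upstream step has already matched each piece of downstream evidence (a sent reply, a closed work item) to a backlog item ID; how that matching works is not covered in these notes. `BacklogItem`, `Evidence`, and `auto_resolve` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BacklogItem:
    item_id: str
    kind: str  # "todo", "draft", "delegated", or "watch"
    summary: str
    status: str = "open"
    resolved_by: Optional[str] = None


@dataclass
class Evidence:
    item_id: str  # assumption: evidence arrives already linked to an item
    source: str   # where the evidence was seen, e.g. "email" or "work item"


def auto_resolve(backlog: list[BacklogItem], evidence: list[Evidence]) -> list[str]:
    """Close open items that have matching evidence; return the closed IDs."""
    by_id = {item.item_id: item for item in backlog}
    resolved = []
    for ev in evidence:
        item = by_id.get(ev.item_id)
        if item is not None and item.status == "open":
            item.status = "resolved"
            item.resolved_by = ev.source
            resolved.append(item.item_id)
    return resolved


backlog = [
    BacklogItem("t1", "delegated", "Budget numbers from finance"),
    BacklogItem("t2", "draft", "Reply to vendor"),
]
closed = auto_resolve(backlog, [Evidence("t1", "email"), Evidence("t9", "chat")])
```

Evidence for an unknown ID is ignored, and items without evidence stay open, so the human still sees everything that is actually owed.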
What works (the honest version)
- The handoff pattern. AI prepares; human reviews and sends. Never delegate the send button.
- Privacy as structure, not vigilance. Two-repo split makes leaks structurally impossible.
- Investigation before draft. Replies are grounded in real history, not generic AI tone.
- Curated > comprehensive. A 5-observation synthesis lands. A 20-page assessment doesn't.
- Always-on enrichment. Substantive work flows back into the right files automatically. Memory compounds.
Where it breaks (the honest version)
- Signal-to-noise on automation. First version of any pipeline is too noisy. Needs an allowlist pass.
- Tooling fragility. MCP servers drop. Auth walls appear constantly.
- The "drafted vs sent" gap. AI surfaces what's owed. It can't make me close the loop.
- Judgment doesn't scale. 5-10x on inputs. 2-3x on actual leverage. That's the ceiling.
- Single-user shape. A teammate can't fork without rebuilding their own private layer.
"That last point - single-user shape - is the unsolved problem. And it tells you where this is going."
Act 3: 12 Months Out (4-5 min)
Key message: The patterns are clear. The infrastructure isn't there yet. Here's what closes the gap.
Prediction 1: Memory becomes the moat, not the model
Proved this twice - once solo (250+ sessions), once at enterprise scale. Every product org will have access to the same foundation models. The differentiation is what your agent knows about this user, this workflow, this org.
CPO implication: Your agent strategy should start with "what's the memory architecture?" not "which model?"
Prediction 2: The Manager OS is the first real agent product category
Not "chat with your data." Not "autocomplete in your IDE." A system that does daily triage, preps you for meetings, drafts in your voice, tracks follow-through, and maintains a running model of your people and priorities. I built it by hand. It works. Someone will productize it.
CPO implication: If you're building agent products, look at what managers actually do all day. That's the TAM, not developer productivity.
Prediction 3: The ceiling is judgment, not automation
5-10x on inputs and drafts. 2-3x on actual leverage. That's the honest number. Products that respect this boundary will win. Products that pretend the AI replaces judgment will fail.
CPO implication: Design for the 2-3x, not the 5-10x. Build your products around the handoff, not the automation.
Prediction 4: Agent-native PMs will create an uncomfortable performance gap
The PMs who build agent systems around themselves will operate at a visibly different level. This isn't a tool adoption curve. It's a skill gap. The difference between a PM who uses agents and a PM who builds agent systems is the difference between someone who uses Excel and someone who builds models.
CPO implication: You need to decide if this is something you encourage, require, or let happen organically. In 12 months, you'll be able to tell which PMs built agent systems and which didn't - just from the quality of their work.
Close (1 min)
The shape, one more time: capture -> automate -> leverage -> share.
Every CPO in this room can start week 1 tomorrow. The question isn't whether agents work in product leadership - they do. The question is whether you're willing to invest in the trust-building to get to the point where it's load-bearing.
The hard part isn't the AI. It's designing the system around the AI - the memory, the contracts, the feedback loops, the human-agent boundaries. Building great agent systems is product work. It's PM work. The people best equipped to lead this transition are already in your org.
Anticipated Questions
"What about hallucination / trust?"
The handoff pattern handles this. AI never sends - it drafts. Investigation before draft means replies are grounded in real thread history. Trust is a design problem, not a model problem.
"Does it scale beyond one person?"
Not yet - that's the honest answer. The patterns are transferable but the infrastructure isn't forkable. That's prediction #2 - someone will productize this.
"What model do you use?"
Claude (Opus), but that's the least interesting part. The value is in the memory architecture, the interaction contracts, and the workflow design.
"How do you handle sensitive data?"
Privacy as structure: two-repo split, PII guards on every commit, session memory rotation. Make leaks structurally impossible rather than relying on vigilance.
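If someone asks what a commit-time PII guard looks like, a minimal sketch follows. It assumes the staged files are already available as a path-to-text mapping (wiring it into a commit hook is left out), and the two regex patterns are illustrative only, not the real rule set. `PII_PATTERNS`, `scan_staged`, and `commit_allowed` are hypothetical names.

```python
import re

# Illustrative patterns only; a real guard would carry a longer, tuned list.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_staged(files: dict[str, str]) -> list[str]:
    """Return one 'path:line: label' entry per PII pattern hit."""
    violations = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(line):
                    violations.append(f"{path}:{line_no}: {label}")
    return violations


def commit_allowed(files: dict[str, str]) -> bool:
    """Block the commit if any staged file contains a PII hit."""
    return not scan_staged(files)


staged = {
    "skills/triage.md": "Sweep the inbox\nEscalate to jane@example.com",
    "README.md": "No personal data here",
}
hits = scan_staged(staged)
```

The guard is a backstop for the shareable repo; the two-repo split remains the primary control, since private context never lives there in the first place.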
"What should I try first?"
Daily intelligence triage. One sweep of everything that came in overnight. Lowest risk, highest immediate value, builds the trust that unlocks everything else.