In late 2022, within weeks of getting access to GPT‑4, Notion had already shipped a writing assistant, rolled out workspace-wide Q&A features, and integrated OpenAI models deeply across its search, content, and planning tools.
But as models advanced and users began asking agents to complete entire workflows, Notion’s team saw the limits of their system architecture. The old pattern of prompting models to perform isolated tasks capped what was possible on the platform. Agents needed to make decisions, orchestrate tools, and reason through ambiguity, and that shift required more than prompt engineering.
“We didn’t want to retrofit the system. We needed an architecture that actually supports how reasoning models work.”
Instead of patching their existing stack, Notion rebuilt it. They replaced task-specific prompt chains with a central reasoning model that coordinates modular sub-agents. These agents can search across Notion, Slack, or the web; add to or edit databases; and synthesize responses using whatever tools the task requires.
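In practice, this pattern can be sketched in a few dozen lines: a planner model receives a task, decides which sub-agent to invoke via tool calls, and loops until it can synthesize a final answer. The sketch below is illustrative only, assuming the OpenAI Python SDK; the tool names, schemas, and `dispatch` routing are hypothetical, not Notion’s implementation.

```python
# A minimal sketch of a central reasoning model coordinating modular
# sub-agents through tool calls. Tool names and schemas are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

# Each sub-agent is exposed to the planner as a callable tool.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_notion",
            "description": "Search pages and databases in the user's workspace.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "edit_database",
            "description": "Add or update rows in a database.",
            "parameters": {
                "type": "object",
                "properties": {
                    "database_id": {"type": "string"},
                    "updates": {"type": "object"},
                },
                "required": ["database_id", "updates"],
            },
        },
    },
]

def run_agent(task: str, dispatch) -> str:
    """Let the planner reason, call sub-agents, and synthesize a reply.

    `dispatch(name, args)` routes a tool call to the matching sub-agent
    and returns its result as a string.
    """
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.chat.completions.create(
            model="gpt-5",
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no more tools needed: final answer
        messages.append(msg)
        for call in msg.tool_calls:
            result = dispatch(call.function.name, json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
```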
With the launch of Notion 3.0, AI isn’t just embedded in workflows; it can now run them. Users assign a broad task, such as compiling stakeholder feedback, and their agent plans, executes, and reports back. The shift toward agents that choose how to work meant designing for model autonomy from the start.
To validate the architectural shift, Notion evaluated GPT‑5 against other state-of-the-art models using actual user tasks.
Evaluations were grounded in feedback Notion had already marked as high priority, including questions that surfaced in Research Mode, long-form tasks that required multi-step reasoning, and ambiguous or outdated content where model judgment mattered.
The team used a combination of LLM-as-judge scoring, structured test fixtures, and human-labeled feedback.
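As a hedged sketch of what LLM-as-judge scoring over structured fixtures can look like (assuming the OpenAI Python SDK): the rubric, fixture fields, and judge model below are illustrative, not Notion’s actual evaluation harness.

```python
# Illustrative LLM-as-judge loop: grade each candidate output against
# human-labeled notes from real user feedback. Not Notion's harness.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Task: {task}
Answer: {answer}
Human-labeled notes: {notes}
Respond with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge(task: str, answer: str, notes: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-5",  # judge model; an assumption for this sketch
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            task=task, answer=answer, notes=notes)}],
        response_format={"type": "json_object"},  # keep output parseable
    )
    return json.loads(response.choices[0].message.content)

# A structured test fixture: a real user task plus annotator expectations.
fixtures = [
    {"task": "Compile stakeholder feedback on the Q3 launch",
     "answer": "...candidate model output...",
     "notes": "Must cover all three feedback docs and flag conflicting dates."},
]
scores = [judge(**f) for f in fixtures]
```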
Key results:
- 7.6% improvement over state-of-the-art models on outputs aligned with real user feedback
- 15% better performance on difficult Research Mode questions
- 100%+ improvement on multi-step, structured tasks like deadline updates and competitor research
- Only model to fully saturate benchmarks with conflicting or outdated inputs
These evaluations helped Notion identify where GPT‑5 added value (for example, in reasoning, ambiguity, and research) and where environment-specific tuning would improve results.
“We didn’t cherry-pick tasks. These were high-signal workflows from our product,” says Sachs. “That’s where model differences actually show up.”
Some tasks need fast responses; others don’t. By experimenting with the different reasoning levels of GPT‑5, Notion was able to tune the intelligence of its agents, balancing response quality against latency based on each task’s requirements.
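As a rough illustration, the API exposes a reasoning-effort setting that makes this trade-off explicit; the task taxonomy below is hypothetical, not Notion’s actual routing policy.

```python
# Sketch of routing tasks to different GPT-5 reasoning levels:
# minimal effort for fast lookups, high effort for deep research.
from openai import OpenAI

client = OpenAI()

# Hypothetical task taxonomy; tune per product surface.
EFFORT_BY_TASK = {
    "direct_lookup": "minimal",   # favor latency
    "summarization": "medium",
    "research_mode": "high",      # favor quality
}

def answer(task_type: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",
        reasoning_effort=EFFORT_BY_TASK.get(task_type, "medium"),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```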
Notion designed its agents to run for seconds or minutes depending on the job. Short latency is prioritized for direct lookups. Long-running agents—up to 20 minutes—are used for background workflows like summarizing content or updating databases.
What matters most to the team is how much time the user gets back, not how fast the model responds. That philosophy drives how orchestration and expectations are set across the UI.
Every Notion team uses Notion AI. That daily use generates structured feedback and direct annotation from humans when something goes wrong. If a user thumbs down a result, it enters a pipeline for trace-level debugging.
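A minimal sketch of that feedback path, assuming a simple in-memory queue; the trace structure and annotation fields are hypothetical stand-ins for Notion’s internal tooling.

```python
# Hypothetical feedback pipeline: a thumbs-down attaches the human
# annotation to the full execution trace and queues it for debugging.
from dataclasses import dataclass, field

@dataclass
class Trace:
    request_id: str
    steps: list = field(default_factory=list)  # model turns + tool calls

debug_queue: list[Trace] = []  # stand-in for a real queue or ticket system

def on_feedback(trace: Trace, thumbs_up: bool, note: str = "") -> None:
    """Route negative feedback into trace-level debugging."""
    if thumbs_up:
        return
    trace.steps.append({"annotation": note or "thumbs_down"})
    debug_queue.append(trace)  # engineers replay the full trace from here
```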
But internal use alone wasn’t enough. The team also worked with design partners—technical customers with early access to agent features—to uncover edge cases and spot blind spots.
This outside-in testing helped shape product readiness, tune orchestration behaviors, and validate where GPT‑5 really moved the needle. OpenAI also uses Notion to coordinate projects and knowledge, with Notion AI embedded in daily workflows to speed up reviews and close the loop on feedback. This mutual usage creates a unique dynamic: both teams build with each other’s products, providing constant feedback and visibility into how the work performs in practice.

Notion’s rebuild wasn’t just about launching Notion 3.0. It was about designing a system that could support new model capabilities and adapt as those models get smarter. Their approach offers a clear roadmap for other teams deploying agentic AI in production:
- Evaluate what matters. Use tasks your users actually do, not synthetic benchmarks.
- Test the hard stuff. GPT‑5 shines when information is ambiguous, outdated, or multi-step.
- Architect for autonomy. If agents are making decisions, your system has to give them room to reason and tools to act.
- Clarity drives performance. Even top models fall short without clean tool descriptions and good interface design (see the sketch after this list).
- Rebuilding is better than patching. If your system was built for completion models, it might not scale to agents.
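To make the tool-description point concrete, here is a hypothetical before-and-after of the same tool: the vague version leaves the model guessing, while the precise one states what the tool does, when to use it, and what each parameter means. Both schemas are invented for illustration.

```python
# The same tool described vaguely vs. precisely. Clear descriptions give
# the model the context it needs to call tools correctly. Hypothetical.
vague_tool = {
    "type": "function",
    "function": {
        "name": "update",
        "description": "Updates stuff.",
        "parameters": {
            "type": "object",
            "properties": {"data": {"type": "object"}},
        },
    },
}

clear_tool = {
    "type": "function",
    "function": {
        "name": "update_deadline",
        "description": (
            "Set the 'Due date' property (ISO 8601) on a task page. "
            "Use only when the user names a specific task and a date."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "page_id": {"type": "string",
                            "description": "ID of the task page to update."},
                "due_date": {"type": "string",
                             "description": "New deadline, YYYY-MM-DD."},
            },
            "required": ["page_id", "due_date"],
        },
    },
}
```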
“We’re already seeing returns from the rebuild,” says Sachs. “If the next model unlocks something new, we’ll do what it takes to support it.”


