A language model's knowledge is a frozen snapshot from training time — it doesn't know your company's internal documents, has no awareness of recent events, and forgets everything between sessions. RAG (Retrieval-Augmented Generation) and memory systems give an Agent a "living brain" — the ability to read external knowledge in real time and accumulate experience across conversations. This article breaks down how RAG works, how to choose a retrieval strategy, and the four-layer architecture of Agent memory, with practical product guidance for mobility/transportation scenarios.
Part II · Agent Building · Article 02-02 of 12 · Knowledge Management · ≈5,000 characters
On the first day a mobility platform's customer-service Agent went live, a user asked: "Can I get a refund on the high-speed rail ticket I bought yesterday?" The Agent confidently recited the refund policy — but it was the policy from the previous year. The platform had updated the rules just last month. The user trusted the answer, tried to act on it, found it wrong, and filed a complaint. The problem wasn't that the model was unintelligent — it simply had no idea the new policy existed. The model's knowledge ends at its training cutoff; everything that happened after is a blind spot.
RAG (Retrieval-Augmented Generation) solves a fundamental problem: enabling the model to access information it never saw during training when generating a response. The approach is straightforward — user question → retrieve relevant content from an external knowledge base → feed retrieved content alongside the question into the model → the model answers based on these "real-time reference materials."
RAG doesn't retrain the model — it dynamically expands the model's context window at inference time. Think of it this way: the model is a knowledgeable expert; RAG provides a research assistant. Before the user's question reaches the expert, the assistant runs to the filing room, pulls the relevant documents, and hands them to the expert. The expert's knowledge base hasn't changed — but now they have the latest reference material in hand.
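The flow above can be sketched in a few lines. Everything in this sketch is illustrative: the tiny `KNOWLEDGE_BASE`, the keyword-overlap `retrieve()` (a crude stand-in for real vector search), and the `build_prompt()` helper are assumptions for demonstration, not any real library's API.

```python
# Minimal sketch of the RAG request flow: question -> retrieve -> prompt -> model.
# KNOWLEDGE_BASE, retrieve(), and build_prompt() are illustrative placeholders;
# a real system would use an embedding model and a vector store here.

KNOWLEDGE_BASE = [
    "Refund policy (2025): high-speed rail tickets are refundable up to 2 hours before departure.",
    "Seat selection: window or aisle seats can be chosen at booking time.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Naive keyword-overlap scoring, standing in for semantic vector search."""
    q = set(question.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Inject the retrieved chunks as 'real-time reference material'."""
    refs = "\n".join(f"- {c}" for c in context)
    return (f"Answer using only the references below.\n"
            f"References:\n{refs}\n"
            f"Question: {question}")

prompt = build_prompt("What is the refund policy?",
                      retrieve("What is the refund policy?", top_k=1))
```

The prompt that reaches the model now contains the latest reference material, while the model itself is unchanged.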
What Problems Does RAG Solve
RAG solves four key problems:
① Knowledge currency: new policies, rules, and data published after the training cutoff can be injected in real time.
② Private knowledge: internal company documents, product manuals, and customer-service SOPs that never appear on the public internet can be referenced.
③ Reduced hallucination: with real documents as a grounding source, the model is far less likely to fabricate answers, and its claims become traceable.
④ Lower cost: keeping knowledge in an external store rather than fine-tuning it into the model means near-zero update cost and no retraining.
Many people think RAG is simply "search + LLM." In reality, it's a pipeline with six critical nodes. A failure at any node degrades the final answer quality. Understanding this pipeline is the prerequisite for diagnosing where things go wrong when the system underperforms.
Before documents enter the knowledge base, they must be cleaned (remove headers, footers, ads, garbled text), format-normalized (PDF/Word/web pages converted to plain text or structured format), and chunked (split into appropriately sized segments — more on this later). Many teams underestimate this step, but it directly sets the quality ceiling for the entire RAG system — garbage in, garbage out.
③ Query Understanding: What Users Say vs. What They Mean
Users' raw questions are often colloquial or incomplete. The query understanding layer rewrites the question (making it more retrieval-friendly), identifies intent (is the user looking for a policy or an order status?), and sometimes decomposes it (a single question may embed multiple sub-questions). The more precise this step, the higher the quality of what follows.
Initial retrieval typically returns 10–20 candidate chunks, but not all of them are truly useful. Reranking uses a more refined model to score candidates, keeps the most relevant ones, removes noise, and feeds the top 3–5 to the generation model. This step has low computational cost but yields a significant improvement in answer quality.
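The rerank step can be sketched as follows. A production system would use a cross-encoder model as the "more refined" scorer; the token-overlap density below is a crude, illustrative stand-in, and the candidate strings are made up for the example.

```python
# Sketch of reranking: re-score the 10-20 first-pass candidates with a finer
# scorer and keep only the top few. Token-overlap density is an illustrative
# stand-in for a real cross-encoder relevance model.

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    q_tokens = set(query.lower().split())
    def score(doc: str) -> float:
        d_tokens = set(doc.lower().split())
        # Overlap divided by chunk length: long, noisy chunks rank lower.
        return len(q_tokens & d_tokens) / max(len(d_tokens), 1)
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    "station parking rates and opening hours",
    "refund policy for high-speed rail tickets",
    "luggage allowance on intercity trains",
]
top = rerank("high-speed rail refund policy", candidates, top_n=1)
```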
"What time does train G1234 depart tomorrow?" vs. "What are the fastest ways to get to Shanghai?" — these are two fundamentally different retrieval needs. The first requires exact matching (train number); the second requires semantic relevance (Shanghai + fast + transportation mode). A single retrieval strategy that tries to handle everything often handles nothing well.
| Retrieval Method | How It Works | Best For | Weakness |
|---|---|---|---|
| Dense Retrieval (vector) | Converts text to numerical vectors and searches by semantic similarity | Semantically fuzzy natural-language questions; synonym phrasing; cross-lingual understanding | Weak at exact keyword matching; higher compute cost; requires an embedding model |
| Sparse Retrieval (keyword / BM25) | Matches documents containing the query terms, based on term-frequency statistics | Exact tokens such as order numbers, train numbers, names, and codes; short keyword searches | No semantic understanding; synonyms miss; sensitive to spelling variants |
| Hybrid Retrieval | Runs vector and keyword retrieval in parallel, merges the results, and reranks | Most real-world business scenarios; covers both exact tokens and semantic relevance | Higher system complexity; two sets of retrieval weights to tune |
In most real-world Agent deployments, hybrid retrieval is the recommended default — run both dense (vector) and sparse (keyword) retrieval in parallel, merge results, then rerank. Use pure keyword retrieval only for structured fields like order IDs, flight numbers, or phone numbers. Use pure vector retrieval for general FAQ or document Q&A scenarios.
Retrieval strategy is not a one-time configuration decision — it's an engineering parameter that requires continuous tuning based on real usage. Recommended starting point: hybrid retrieval + reranking. Then test with a real user query set, identify which question types still produce poor recall, and optimize those specifically — rather than pursuing the most complex solution from day one.
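The "merge results" step of hybrid retrieval is often implemented with Reciprocal Rank Fusion (RRF), a widely used way to combine two rankings without having to normalize their incompatible score scales. The document IDs below are made up, and k=60 is the conventional default constant, not a tuned value.

```python
# Sketch of merging dense and sparse result lists with Reciprocal Rank
# Fusion: each document's fused score is the sum of 1/(k + rank) over every
# ranking it appears in, so documents ranked well by both retrievers rise.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is a list of doc IDs, best first; returns the fused order."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # vector-search ranking
sparse = ["d1", "d4", "d3"]  # BM25 ranking
merged = rrf_merge([dense, sparse])
```

Note that "d1" wins: it was ranked by both retrievers, which RRF rewards over a single first-place finish.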
Section 04
Chunking: The Underestimated Foundation
A team spent two weeks tuning the model and refining prompts with little improvement. Eventually they discovered the problem: they had chunked an 80-page refund policy PDF into fixed 500-character segments. As a result, the critical "refund conditions" and "non-refundable exceptions" were split across two separate chunks — retrieval could only ever find half the answer. After fixing the chunking strategy — without touching a single prompt — accuracy improved by 30%.
Chunking is the process of splitting long documents into the storage units of the knowledge base. Chunking strategy directly determines retrieval precision and recall completeness, making it the most underestimated link in RAG system quality.
Too small, and individual chunks lose completeness: context is severed and the model receives fragmentary reference material, though precision is high and noise is low. Too large, and each chunk is information-rich but carries too much irrelevant content, diluting the key information and distracting the model, while each retrieval consumes more tokens and raises cost. In practice there is no universally correct chunk size; it must be calibrated to the document type, typical query length, and context window constraints.
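A minimal fixed-size chunker with an overlap window illustrates the standard mitigation for boundary splits: adjacent chunks share some content, so a rule and its exception near a boundary appear intact in at least one chunk. The sizes are illustrative defaults, not recommendations.

```python
# Fixed-size chunking with an overlap window. Adjacent chunks share `overlap`
# characters so that content near a boundary is never seen only in halves.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-char windows; adjacent windows share overlap chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefgh", chunk_size=4, overlap=2)
# "cd" and "ef" each appear in two chunks, so nothing is only ever a fragment
```

A production pipeline would usually split on semantic boundaries (headings, paragraphs, sentences) before falling back to fixed windows, but the overlap principle is the same.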
📋 PM Decision Checklist
Have you catalogued the document types in your knowledge base (policies, FAQs, workflows, tables)? Different types need different chunking strategies.
Have you measured the average complexity of typical user queries? Simple lookups suit smaller chunks; complex reasoning suits larger ones.
Have you done manual spot checks after chunking, randomly sampling 20–30 chunks to confirm no critical information was split?
Is there a knowledge-base update workflow? When a policy changes, stale chunks must be replaced promptly, or RAG will serve outdated answers.
Does each chunk carry metadata (source document, effective date, category) for downstream filtering and traceability?
RAG solves the "knowledge" problem: what the Agent knows about the world. But Agents have another need: remembering things about this specific user. The user mentioned last time that they prefer window seats — the Agent should remember this when booking next time. A user complained about a delay two weeks ago — the Agent should be more careful in related scenarios. This is not something RAG can solve — it requires a dedicated memory system.
Working memory is everything in the current conversation window: what the user said, what the Agent replied, tool call results, and intermediate reasoning steps. Its capacity is bounded by the model's context window (typically 4K–200K tokens) and is cleared when the conversation ends. The key management question is "when to compress or truncate" — as conversations grow longer, the system must intelligently decide which early content can be discarded and which must be retained.
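The "when to truncate" decision can be sketched as a budget walk from the newest message backwards: always keep the system prompt, then keep as many recent messages as fit. Splitting on whitespace is a crude stand-in for a real tokenizer, and the message format is an illustrative assumption.

```python
# Sketch of working-memory truncation: retain the system prompt, then walk
# from the most recent message backwards, keeping messages until the token
# budget is exhausted. Older content is what gets discarded first.

def truncate_history(messages: list[dict], budget: int) -> list[dict]:
    """messages[0] is the system prompt and is always retained."""
    system, rest = messages[0], messages[1:]
    used = len(system["content"].split())
    kept = []
    for msg in reversed(rest):  # newest first; oldest dropped first
        cost = len(msg["content"].split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]

history = [
    {"role": "system", "content": "You are a booking assistant"},
    {"role": "user", "content": "I want a ticket to Beijing tomorrow morning"},
    {"role": "assistant", "content": "Which departure city"},
    {"role": "user", "content": "Shanghai window seat please"},
]
trimmed = truncate_history(history, budget=13)  # oldest user message dropped
```

A more sophisticated variant would summarize the dropped messages instead of discarding them, which is exactly the compression step described for episodic memory below.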
Episodic memory stores summaries of the user's historical interactions and key facts: last travel destination, complaint history, expressed preferences, and incomplete tasks. This is stored in a database (vector or relational), with relevant memories loaded into working memory at the start of each new session. Episodic memory requires periodic "compression" — distilling extensive conversation histories into concise user profiles to prevent unbounded growth.
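A minimal episodic memory store looks like the sketch below: key facts persisted per user and recalled at the start of the next session. The in-memory dict stands in for a real vector or relational database, and all names (`EpisodicMemory`, `remember`, `recall`, the keys) are illustrative.

```python
# Sketch of an episodic memory store: per-user key facts that survive across
# sessions. Writing the same key again overwrites it, so the stored profile
# stays a compact summary rather than growing without bound.

class EpisodicMemory:
    def __init__(self) -> None:
        self._store: dict[str, dict[str, str]] = {}

    def remember(self, user_id: str, key: str, value: str) -> None:
        self._store.setdefault(user_id, {})[key] = value

    def recall(self, user_id: str) -> dict[str, str]:
        """Loaded into working memory when a new session begins."""
        return dict(self._store.get(user_id, {}))

memory = EpisodicMemory()
memory.remember("u42", "seat_preference", "window")
memory.remember("u42", "frequent_route", "Shanghai-Beijing")
memory.remember("u42", "seat_preference", "aisle")  # preference updated in place
```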
Semantic Memory: External Knowledge Base (RAG's Domain)
Semantic memory is the RAG system's knowledge base — product manuals, refund policies, city guides, traffic regulations… These are objective, shared knowledge resources not tied to any specific user. The Agent retrieves them on-demand through the RAG pipeline. This memory layer can be updated independently, without modifying the model or any other part of the Agent.
Procedural memory is the "hard-coded" behavioral patterns embedded in the Agent: role definition and behavior rules in the System Prompt, the list of callable tools and how to use them, safety guardrails and content filters. This memory layer does not change based on conversation content — it is the foundation of Agent stability and predictability.
Same booking assistant, two very different experiences:
A: "Hello, where would you like to go? What's your departure time? Do you have a seat preference?" — starting from zero every time.
B: "Hello! Based on your previous preferences, I found train G108 with a window seat, departing Shanghai Hongqiao at 08:00 tomorrow. Shall I confirm?" — resolved in one turn, feels understood.
The difference between these two experiences is whether episodic memory was properly designed.
User memory in mobility scenarios generally falls into four categories, each with different priority and storage strategy:
| Memory Type | Typical Content | When to Recall | Priority |
|---|---|---|---|
| Travel Preferences | Window/aisle seat, high-speed rail vs. flight, morning vs. evening departures, whether an invoice is needed | Auto-loaded when the user expresses a booking intent | 🔴 High |
| Frequent Routes | Shanghai–Beijing (business trips), hometown city (holidays) | Auto-filled when the user names a destination but no departure city | 🔴 High |
| Service History | Complaint records, refund records, special service needs (wheelchair, child tickets) | Consulted when the user reports a problem or files a complaint | 🟡 Medium |
| Conversation Summary | What the last session resolved and what was left unfinished | When the user returns to a past session or asks about "last time…" | 🟡 Medium |
Travel preferences and frequent routes have the highest recall priority — they directly reduce interaction friction. Service history is medium priority — mostly needed for exception handling. Conversation summaries are medium priority — recalled when users reference past interactions or pick up an unfinished task.
Not everything a user says should enter episodic memory. Storing too much causes the memory store to bloat, increases noise at retrieval, and creates privacy risk. Storing too little fails the goal of "knowing the user." A practical principle: only store information that may be useful in the next conversation, and information whose absence would cause serious experience failures. One-time complaints and casual chitchat are generally not worth storing.
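The write policy above can be made explicit as a small filter at the memory boundary. The category names are an illustrative taxonomy, not a fixed standard; the point is that the default is *not* storing, which is the privacy-safer direction.

```python
# Sketch of a memory write filter: persist only items likely to matter in a
# future conversation. Unknown categories default to not being stored.

PERSIST = {"travel_preference", "frequent_route", "special_service_need", "open_task"}
DISCARD = {"chitchat", "one_off_complaint"}

def should_store(category: str) -> bool:
    """Gate every episodic-memory write; anything unrecognized is dropped."""
    if category in DISCARD:
        return False
    return category in PERSIST
```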
RAG solves the knowledge problem. Memory systems solve the relationship problem.
80% of RAG quality issues stem from poor data preprocessing and retrieval strategy — not from the model being inadequate. Optimizing retrieval delivers more value than switching to a more powerful model.

Memory system design requires balancing personalized experience against user privacy. Travel data (itinerary records, location preferences) is sensitive personal information. From the design phase, you must define clearly: what data is within the user's authorization scope, how to store it in de-identified form, and how users can view and delete their own memory data. This is not a post-launch concern — it must be built into the architecture from the start.
Glossary
Bilingual Terminology Glossary
Core concepts from this article, with Chinese-English term pairs and concise definitions.

| 中文术语 | English Term | Definition |
|---|---|---|
| 检索增强生成 | RAG · Retrieval-Augmented Generation | A technique that retrieves relevant content from an external knowledge base before generating a response, grounding the model's output in real sources. |
| 向量检索 | Dense Retrieval / Vector Search | Converts text to numerical vectors and retrieves content by semantic similarity rather than exact keyword matching. |
| 关键词检索 | Sparse Retrieval / BM25 | Traditional term-frequency-based retrieval; excels at exact keyword and identifier matching. |
| 混合检索 | Hybrid Retrieval | Runs both vector and keyword retrieval in parallel, merges results, and reranks to combine precision and semantic coverage. |
| 文档切分 | Chunking | The process of splitting long documents into storage units for the knowledge base; the foundational determinant of RAG quality. |
| 重叠窗口 | Overlap Window | Adjacent chunks share a portion of content during splitting, preventing key information from being severed at chunk boundaries. |
| 向量嵌入 | Embedding | The process of converting text into high-dimensional numerical vectors; the foundation of vector retrieval. |
| 重排序 | Reranking | Uses a more refined model to re-score and re-order initial retrieval results, improving final precision. |
| 工作记忆 | Working Memory | The content within the current conversation's context window; cleared when the session ends. |
| 情景记忆 | Episodic Memory | Persistent cross-session summaries of a user's history and preferences; the foundation for personalized experience. |
| 语义记忆 | Semantic Memory | Structured external knowledge base shared across all users, accessed via RAG retrieval. |
| 程序记忆 | Procedural Memory | The Agent's hard-coded behavior rules, tool list, and System Prompt; the foundation of stability and predictability. |
| 上下文窗口 | Context Window | The maximum number of tokens a model can process in a single call; determines the capacity ceiling of working memory. |
| 元数据 | Metadata | Descriptive attributes attached to knowledge base chunks (source, date, category) for filtering and traceability. |
| 幂等性 | Idempotency | A property where executing the same operation multiple times produces the same result as executing it once; critical for safe Agent retries. |
| 向量数据库 | Vector Database | A database specialized for storing and querying vector embeddings (e.g., Pinecone, Weaviate, Chroma). |
| 记忆压缩 | Memory Compression / Summarization | Periodically distilling extensive conversation histories into concise summaries to prevent unbounded memory growth. |
| 知识截止日 | Knowledge Cutoff / Training Cutoff | The latest date included in the model's training data; events after this date are unknown to the model without external retrieval. |
| 出行偏好 | Travel Preferences | User travel habits expressed explicitly or inferred from behavior, such as seat preference or preferred departure time. |
| 隐私脱敏 | Data De-identification / Anonymization | Removing or replacing personally identifiable information from user data to reduce privacy exposure risk. |