A language model's knowledge is a frozen snapshot from training time — it doesn't know your company's internal documents, has no awareness of recent events, and forgets everything between sessions. RAG (Retrieval-Augmented Generation) and memory systems give an Agent a "living brain" — the ability to read external knowledge in real time and accumulate experience across conversations. This article breaks down how RAG works, how to choose a retrieval strategy, and the four-layer architecture of Agent memory, with practical product guidance for mobility/transportation scenarios.
Part II · Agent Building · Article 02-02 of 12 · Knowledge Management · ≈5,000 characters
On the first day a mobility platform's customer-service Agent went live, a user asked: "Can I get a refund on the high-speed rail ticket I bought yesterday?" The Agent confidently recited the refund policy — but it was the policy from the previous year. The platform had updated the rules just last month. The user trusted the answer, tried to act on it, found it wrong, and filed a complaint. The problem wasn't that the model was unintelligent — it simply had no idea the new policy existed. The model's knowledge ends at its training cutoff; everything that happened after is a blind spot.
RAG (Retrieval-Augmented Generation) solves a fundamental problem: enabling the model to access information it never saw during training when generating a response. The approach is straightforward — user question → retrieve relevant content from an external knowledge base → feed retrieved content alongside the question into the model → the model answers based on these "real-time reference materials."
RAG doesn't retrain the model — it dynamically expands the model's context window at inference time. Think of it this way: the model is a knowledgeable expert; RAG provides a research assistant. Before the user's question reaches the expert, the assistant runs to the filing room, pulls the relevant documents, and hands them to the expert. The expert's knowledge base hasn't changed — but now they have the latest reference material in hand.
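The flow above can be sketched in a few lines. Everything in this sketch is illustrative: the tiny `KNOWLEDGE_BASE`, the keyword-overlap `retrieve()` (a crude stand-in for real vector search), and the `build_prompt()` helper are assumptions for demonstration, not any real library's API.

```python
# Minimal sketch of the RAG request flow: question -> retrieve -> prompt -> model.
# KNOWLEDGE_BASE, retrieve(), and build_prompt() are illustrative placeholders;
# a real system would use an embedding model and a vector store here.

KNOWLEDGE_BASE = [
    "Refund policy (2025): high-speed rail tickets are refundable up to 2 hours before departure.",
    "Seat selection: window or aisle seats can be chosen at booking time.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Naive keyword-overlap scoring, standing in for semantic vector search."""
    q = set(question.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context: list[str]) -> str:
    """Inject the retrieved chunks as 'real-time reference material'."""
    refs = "\n".join(f"- {c}" for c in context)
    return (f"Answer using only the references below.\n"
            f"References:\n{refs}\n"
            f"Question: {question}")

prompt = build_prompt("What is the refund policy?",
                      retrieve("What is the refund policy?", top_k=1))
```

The prompt that reaches the model now contains the latest reference material, while the model itself is unchanged.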
What Problems Does RAG Solve
RAG solves four key problems:
① Knowledge currency: new policies, rules, and data published after the training cutoff can be injected in real time.
② Private knowledge: internal company documents, product manuals, and customer-service SOPs that never appear on the public internet can be referenced.
③ Reduced hallucination: with real documents as a grounding source, the model is far less likely to fabricate answers, and its claims become traceable.
④ Lower cost: keeping knowledge in an external store rather than fine-tuning it into the model means near-zero update cost and no retraining.
Many people think RAG is simply "search + LLM." In reality, it's a pipeline with six critical nodes. A failure at any node degrades the final answer quality. Understanding this pipeline is the prerequisite for diagnosing where things go wrong when the system underperforms.
Before documents enter the knowledge base, they must be cleaned (remove headers, footers, ads, garbled text), format-normalized (PDF/Word/web pages converted to plain text or structured format), and chunked (split into appropriately sized segments — more on this later). Many teams underestimate this step, but it directly sets the quality ceiling for the entire RAG system — garbage in, garbage out.
③ Query Understanding: What Users Say vs. What They Mean
Users' raw questions are often colloquial or incomplete. The query understanding layer rewrites the question (making it more retrieval-friendly), identifies intent (is the user looking for a policy or an order status?), and sometimes decomposes it (a single question may embed multiple sub-questions). The more precise this step, the higher the quality of what follows.
Initial retrieval typically returns 10–20 candidate chunks, but not all of them are truly useful. Reranking uses a more refined model to score candidates, keeps the most relevant ones, removes noise, and feeds the top 3–5 to the generation model. This step has low computational cost but yields a significant improvement in answer quality.
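The rerank step can be sketched as follows. A production system would use a cross-encoder model as the "more refined" scorer; the token-overlap density below is a crude, illustrative stand-in, and the candidate strings are made up for the example.

```python
# Sketch of reranking: re-score the 10-20 first-pass candidates with a finer
# scorer and keep only the top few. Token-overlap density is an illustrative
# stand-in for a real cross-encoder relevance model.

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    q_tokens = set(query.lower().split())
    def score(doc: str) -> float:
        d_tokens = set(doc.lower().split())
        # Overlap divided by chunk length: long, noisy chunks rank lower.
        return len(q_tokens & d_tokens) / max(len(d_tokens), 1)
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    "station parking rates and opening hours",
    "refund policy for high-speed rail tickets",
    "luggage allowance on intercity trains",
]
top = rerank("high-speed rail refund policy", candidates, top_n=1)
```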
"What time does train G1234 depart tomorrow?" vs. "What are the fastest ways to get to Shanghai?" — these are two fundamentally different retrieval needs. The first requires exact matching (train number); the second requires semantic relevance (Shanghai + fast + transportation mode). A single retrieval strategy that tries to handle everything often handles nothing well.
| Retrieval Method | How It Works | Best For | Weakness |
|---|---|---|---|
| Dense Retrieval (vector) | Converts text to numerical vectors and searches by semantic similarity | Semantically fuzzy natural-language questions; synonym phrasing; cross-lingual understanding | Weak at exact keyword matching; higher compute cost; requires an embedding model |
| Sparse Retrieval (keyword / BM25) | Matches documents containing the query terms, based on term-frequency statistics | Exact tokens such as order numbers, train numbers, names, and codes; short keyword searches | No semantic understanding; synonyms miss; sensitive to spelling variants |
| Hybrid Retrieval | Runs vector and keyword retrieval in parallel, merges the results, and reranks | Most real-world business scenarios; covers both exact tokens and semantic relevance | Higher system complexity; two sets of retrieval weights to tune |
In most real-world Agent deployments, hybrid retrieval is the recommended default — run both dense (vector) and sparse (keyword) retrieval in parallel, merge results, then rerank. Use pure keyword retrieval only for structured fields like order IDs, flight numbers, or phone numbers. Use pure vector retrieval for general FAQ or document Q&A scenarios.
Retrieval strategy is not a one-time configuration decision — it's an engineering parameter that requires continuous tuning based on real usage. Recommended starting point: hybrid retrieval + reranking. Then test with a real user query set, identify which question types still produce poor recall, and optimize those specifically — rather than pursuing the most complex solution from day one.
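The "merge results" step of hybrid retrieval is often implemented with Reciprocal Rank Fusion (RRF), a widely used way to combine two rankings without having to normalize their incompatible score scales. The document IDs below are made up, and k=60 is the conventional default constant, not a tuned value.

```python
# Sketch of merging dense and sparse result lists with Reciprocal Rank
# Fusion: each document's fused score is the sum of 1/(k + rank) over every
# ranking it appears in, so documents ranked well by both retrievers rise.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is a list of doc IDs, best first; returns the fused order."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # vector-search ranking
sparse = ["d1", "d4", "d3"]  # BM25 ranking
merged = rrf_merge([dense, sparse])
```

Note that "d1" wins: it was ranked by both retrievers, which RRF rewards over a single first-place finish.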
Section 04
Chunking: The Underestimated Foundation
A team spent two weeks tuning the model and refining prompts with little improvement. Eventually they discovered the problem: they had chunked an 80-page refund policy PDF into fixed 500-character segments. As a result, the critical "refund conditions" and "non-refundable exceptions" were split across two separate chunks — retrieval could only ever find half the answer. After fixing the chunking strategy — without touching a single prompt — accuracy improved by 30%.
Chunking is the process of splitting long documents into the storage units of the knowledge base. Chunking strategy directly determines retrieval precision and recall completeness, making it the most underestimated link in RAG system quality.
Too small, and individual chunks lose completeness: context is severed and the model receives fragmentary reference material, though precision is high and noise is low. Too large, and each chunk is information-rich but carries too much irrelevant content, diluting the key information and distracting the model, while each retrieval consumes more tokens and raises cost. In practice there is no universally correct chunk size; it must be calibrated to the document type, typical query length, and context window constraints.
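A minimal fixed-size chunker with an overlap window illustrates the standard mitigation for boundary splits: adjacent chunks share some content, so a rule and its exception near a boundary appear intact in at least one chunk. The sizes are illustrative defaults, not recommendations.

```python
# Fixed-size chunking with an overlap window. Adjacent chunks share `overlap`
# characters so that content near a boundary is never seen only in halves.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into chunk_size-char windows; adjacent windows share overlap chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefgh", chunk_size=4, overlap=2)
# "cd" and "ef" each appear in two chunks, so nothing is only ever a fragment
```

A production pipeline would usually split on semantic boundaries (headings, paragraphs, sentences) before falling back to fixed windows, but the overlap principle is the same.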
📋 PM Decision Checklist
Have you catalogued the document types in your knowledge base (policies, FAQs, workflows, tables)? Different types need different chunking strategies.
Have you measured the average complexity of typical user queries? Simple lookups suit smaller chunks; complex reasoning suits larger ones.
Have you done manual spot checks after chunking, randomly sampling 20–30 chunks to confirm no critical information was split?
Is there a knowledge-base update workflow? When a policy changes, stale chunks must be replaced promptly, or RAG will serve outdated answers.
Does each chunk carry metadata (source document, effective date, category) for downstream filtering and traceability?
RAG solves the "knowledge" problem: what the Agent knows about the world. But Agents have another need: remembering things about this specific user. The user mentioned last time that they prefer window seats — the Agent should remember this when booking next time. A user complained about a delay two weeks ago — the Agent should be more careful in related scenarios. This is not something RAG can solve — it requires a dedicated memory system.
Working memory is everything in the current conversation window: what the user said, what the Agent replied, tool call results, and intermediate reasoning steps. Its capacity is bounded by the model's context window (typically 4K–200K tokens) and is cleared when the conversation ends. The key management question is "when to compress or truncate" — as conversations grow longer, the system must intelligently decide which early content can be discarded and which must be retained.
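The "when to truncate" decision can be sketched as a budget walk from the newest message backwards: always keep the system prompt, then keep as many recent messages as fit. Splitting on whitespace is a crude stand-in for a real tokenizer, and the message format is an illustrative assumption.

```python
# Sketch of working-memory truncation: retain the system prompt, then walk
# from the most recent message backwards, keeping messages until the token
# budget is exhausted. Older content is what gets discarded first.

def truncate_history(messages: list[dict], budget: int) -> list[dict]:
    """messages[0] is the system prompt and is always retained."""
    system, rest = messages[0], messages[1:]
    used = len(system["content"].split())
    kept = []
    for msg in reversed(rest):  # newest first; oldest dropped first
        cost = len(msg["content"].split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]

history = [
    {"role": "system", "content": "You are a booking assistant"},
    {"role": "user", "content": "I want a ticket to Beijing tomorrow morning"},
    {"role": "assistant", "content": "Which departure city"},
    {"role": "user", "content": "Shanghai window seat please"},
]
trimmed = truncate_history(history, budget=13)  # oldest user message dropped
```

A more sophisticated variant would summarize the dropped messages instead of discarding them, which is exactly the compression step described for episodic memory below.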
Episodic memory stores summaries of the user's historical interactions and key facts: last travel destination, complaint history, expressed preferences, and incomplete tasks. This is stored in a database (vector or relational), with relevant memories loaded into working memory at the start of each new session. Episodic memory requires periodic "compression" — distilling extensive conversation histories into concise user profiles to prevent unbounded growth.
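A minimal episodic memory store looks like the sketch below: key facts persisted per user and recalled at the start of the next session. The in-memory dict stands in for a real vector or relational database, and all names (`EpisodicMemory`, `remember`, `recall`, the keys) are illustrative.

```python
# Sketch of an episodic memory store: per-user key facts that survive across
# sessions. Writing the same key again overwrites it, so the stored profile
# stays a compact summary rather than growing without bound.

class EpisodicMemory:
    def __init__(self) -> None:
        self._store: dict[str, dict[str, str]] = {}

    def remember(self, user_id: str, key: str, value: str) -> None:
        self._store.setdefault(user_id, {})[key] = value

    def recall(self, user_id: str) -> dict[str, str]:
        """Loaded into working memory when a new session begins."""
        return dict(self._store.get(user_id, {}))

memory = EpisodicMemory()
memory.remember("u42", "seat_preference", "window")
memory.remember("u42", "frequent_route", "Shanghai-Beijing")
memory.remember("u42", "seat_preference", "aisle")  # preference updated in place
```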
Semantic Memory: External Knowledge Base (RAG's Domain)
Semantic memory is the RAG system's knowledge base — product manuals, refund policies, city guides, traffic regulations… These are objective, shared knowledge resources not tied to any specific user. The Agent retrieves them on-demand through the RAG pipeline. This memory layer can be updated independently, without modifying the model or any other part of the Agent.
Procedural memory is the "hard-coded" behavioral patterns embedded in the Agent: role definition and behavior rules in the System Prompt, the list of callable tools and how to use them, safety guardrails and content filters. This memory layer does not change based on conversation content — it is the foundation of Agent stability and predictability.
Same booking assistant, two very different experiences:
A: "Hello, where would you like to go? What's your departure time? Do you have a seat preference?" — starting from zero every time.
B: "Hello! Based on your previous preferences, I found train G108 with a window seat, departing Shanghai Hongqiao at 08:00 tomorrow. Shall I confirm?" — resolved in one turn, feels understood.
The difference between these two experiences is whether episodic memory was properly designed.
User memory in mobility scenarios generally falls into four categories, each with different priority and storage strategy:
| Memory Type | Typical Content | When to Recall | Priority |
|---|---|---|---|
| Travel Preferences | Window/aisle seat, high-speed rail vs. flight, morning vs. evening departures, whether an invoice is needed | Auto-loaded when the user expresses a booking intent | 🔴 High |
| Frequent Routes | Shanghai–Beijing (business trips), hometown city (holidays) | Auto-filled when the user names a destination but no departure city | 🔴 High |
| Service History | Complaint records, refund records, special service needs (wheelchair, child tickets) | Consulted when the user reports a problem or files a complaint | 🟡 Medium |
| Conversation Summary | What the last session resolved and what was left unfinished | When the user returns to a past session or asks about "last time…" | 🟡 Medium |
Travel preferences and frequent routes have the highest recall priority — they directly reduce interaction friction. Service history is medium priority — mostly needed for exception handling. Conversation summaries are medium priority — recalled when users reference past interactions or pick up an unfinished task.
Not everything a user says should enter episodic memory. Storing too much causes the memory store to bloat, increases noise at retrieval, and creates privacy risk. Storing too little fails the goal of "knowing the user." A practical principle: only store information that may be useful in the next conversation, and information whose absence would cause serious experience failures. One-time complaints and casual chitchat are generally not worth storing.
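The write policy above can be made explicit as a small filter at the memory boundary. The category names are an illustrative taxonomy, not a fixed standard; the point is that the default is *not* storing, which is the privacy-safer direction.

```python
# Sketch of a memory write filter: persist only items likely to matter in a
# future conversation. Unknown categories default to not being stored.

PERSIST = {"travel_preference", "frequent_route", "special_service_need", "open_task"}
DISCARD = {"chitchat", "one_off_complaint"}

def should_store(category: str) -> bool:
    """Gate every episodic-memory write; anything unrecognized is dropped."""
    if category in DISCARD:
        return False
    return category in PERSIST
```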
RAG solves the knowledge problem. Memory systems solve the relationship problem.
80% of RAG quality issues stem from poor data preprocessing and retrieval strategy — not from the model being inadequate. Optimizing retrieval delivers more value than switching to a more powerful model.

Memory system design requires balancing personalized experience against user privacy. Travel data (itinerary records, location preferences) is sensitive personal information. From the design phase, you must define clearly: what data is within the user's authorization scope, how to store it in de-identified form, and how users can view and delete their own memory data. This is not a post-launch concern — it must be built into the architecture from the start.
Glossary
Bilingual Terminology Glossary
Core concepts from this article, with Chinese-English term pairs and concise definitions.

| 中文术语 | English Term | Definition |
|---|---|---|
| 检索增强生成 | RAG · Retrieval-Augmented Generation | A technique that retrieves relevant content from an external knowledge base before generating a response, grounding the model's output in real sources. |
| 向量检索 | Dense Retrieval / Vector Search | Converts text to numerical vectors and retrieves content by semantic similarity rather than exact keyword matching. |
| 关键词检索 | Sparse Retrieval / BM25 | Traditional term-frequency-based retrieval; excels at exact keyword and identifier matching. |
| 混合检索 | Hybrid Retrieval | Runs both vector and keyword retrieval in parallel, merges results, and reranks to combine precision and semantic coverage. |
| 文档切分 | Chunking | The process of splitting long documents into storage units for the knowledge base; the foundational determinant of RAG quality. |
| 重叠窗口 | Overlap Window | Adjacent chunks share a portion of content during splitting, preventing key information from being severed at chunk boundaries. |
| 向量嵌入 | Embedding | The process of converting text into high-dimensional numerical vectors; the foundation of vector retrieval. |
| 重排序 | Reranking | Uses a more refined model to re-score and re-order initial retrieval results, improving final precision. |
| 工作记忆 | Working Memory | The content within the current conversation's context window; cleared when the session ends. |
| 情景记忆 | Episodic Memory | Persistent cross-session summaries of a user's history and preferences; the foundation for personalized experience. |
| 语义记忆 | Semantic Memory | Structured external knowledge base shared across all users, accessed via RAG retrieval. |
| 程序记忆 | Procedural Memory | The Agent's hard-coded behavior rules, tool list, and System Prompt; the foundation of stability and predictability. |
| 上下文窗口 | Context Window | The maximum number of tokens a model can process in a single call; determines the capacity ceiling of working memory. |
| 元数据 | Metadata | Descriptive attributes attached to knowledge base chunks (source, date, category) for filtering and traceability. |
| 幂等性 | Idempotency | A property where executing the same operation multiple times produces the same result as executing it once; critical for safe Agent retries. |
| 向量数据库 | Vector Database | A database specialized for storing and querying vector embeddings (e.g., Pinecone, Weaviate, Chroma). |
| 记忆压缩 | Memory Compression / Summarization | Periodically distilling extensive conversation histories into concise summaries to prevent unbounded memory growth. |
| 知识截止日 | Knowledge Cutoff / Training Cutoff | The latest date included in the model's training data; events after this date are unknown to the model without external retrieval. |
| 出行偏好 | Travel Preferences | User travel habits expressed explicitly or inferred from behavior, such as seat preference or preferred departure time. |
| 隐私脱敏 | Data De-identification / Anonymization | Removing or replacing personally identifiable information from user data to reduce privacy exposure risk. |