M2-04 Ingestion & Indexing Module #24
Closed
opened 2026-05-22 21:09:49 +08:00 by wangdl
1 comment
Milestone
M2: Knowledge base main-flow closed loop (P1)
Reference: wangdl/api-server#24
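Goal
Design the 知习 document ingestion and indexing module, which parses, cleans, and chunks user-uploaded materials, generates embeddings, and writes them to the Qdrant index.
This issue covers module architecture design only; it does not implement code directly.
Background
Files uploaded by users (PDF, DOCX, TXT, MD, and so on) must be parsed before they become searchable data. The Ingestion module is the processing pipeline that turns material from "file" into "knowledge": DocumentImport manages the import job state machine, parsing produces parsed.md, chunking produces chunks, the AI Gateway is called to generate embeddings, and the results are written to the Qdrant index.
Note: in this phase Vision is reserved only as a fallback, used solely when OCR fails or for complex image pages; full multimodal document understanding is out of scope.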
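The import job state machine mentioned above can be sketched as a small transition table. This is only an illustration: the status names are taken from the implementation comment on this issue (QUEUED→PROCESSING→COMPLETED/FAILED), while the `ALLOWED` table, the `canTransition` and `transition` helpers, and the rule that a retry moves FAILED back to QUEUED are assumptions, not the project's actual code.

```typescript
// Status names follow the implementation comment on this issue.
// The transition table and helper names below are hypothetical.
type ImportStatus = "QUEUED" | "PROCESSING" | "COMPLETED" | "FAILED";

const ALLOWED: Record<ImportStatus, readonly ImportStatus[]> = {
  QUEUED: ["PROCESSING"],
  PROCESSING: ["COMPLETED", "FAILED"],
  COMPLETED: [], // terminal state
  FAILED: ["QUEUED"], // assumption: a retry re-queues a failed import
};

// Returns true when moving from `from` to `to` is listed in the table.
function canTransition(from: ImportStatus, to: ImportStatus): boolean {
  return ALLOWED[from].some((next) => next === to);
}

// Returns the new status, or throws if the move is not allowed.
function transition(from: ImportStatus, to: ImportStatus): ImportStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal DocumentImport transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the allowed moves in one table means a Worker status update and an admin retry would both go through the same check, so a COMPLETED import cannot be pushed back to PROCESSING by accident.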
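Module responsibilities
This module is responsible for:
This module is not responsible for:
Candidate data objects
Parsing pipeline design
Please design the complete import pipeline: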
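Fetch the file from COS → determine the file type → select a parser
→ PDF: extract text + OCR images (if any)
→ DOCX: extract text + OCR embedded images
→ TXT/MD: read directly
→ Clean the text → chunk
→ Call the AI Gateway to generate an embedding for each chunk
→ Batch-write to Qdrant (via the Vector & Retrieval Module)
→ Update the DocumentImport status to completed
→ Write parsed.md back to COS (via File Storage)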
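The pure, I/O-free steps of the pipeline above (parser selection, text cleaning, chunking, and batching before the Qdrant write) could look roughly like the sketch below. All names (`ParserKind`, `selectParser`, `cleanText`, `chunkText`, `toBatches`) are hypothetical, choosing the parser by file extension is an assumption about how the file type is determined, and the chunk size and overlap are placeholder values rather than settings from this issue. The COS, AI Gateway, and Qdrant calls are deliberately left out.

```typescript
// Hypothetical helpers for the pure steps of the import pipeline.
type ParserKind = "pdf" | "docx" | "plain";

// Assumption: the file type is decided by extension alone.
function selectParser(fileName: string): ParserKind {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  switch (ext) {
    case "pdf":
      return "pdf";
    case "docx":
      return "docx";
    case "txt":
    case "md":
      return "plain";
    default:
      throw new Error(`Unsupported file type: ${fileName}`);
  }
}

// Normalizes line endings and collapses runs of spaces and blank lines.
function cleanText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Fixed-size character chunks; neighbours share `overlap` characters.
// The default sizes are placeholders, not values from this issue.
function chunkText(text: string, size = 800, overlap = 100): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Groups items so embeddings can be written to the index in batches.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

For example, `chunkText("abcdefghij", 4, 1)` yields `["abcd", "defg", "ghij"]`: each chunk starts three characters after the previous one, so adjacent chunks share one character. A real design would more likely split on headings or sentences in parsed.md than on raw character offsets.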
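Infrastructure dependency assessment
API design
CAPI:
IAPI (consumed by the Worker):
AAPI:
Domain Event design
Delivery checklist
Acceptance criteria
Prohibited
Not recommended for implementation at this stage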
✅ M2-04 implementation complete
Key findings
The existing infrastructure is already very complete:
DocumentImport state machine (QUEUED→PROCESSING→COMPLETED/FAILED) + progress/heartbeat/retry
internal/rag/* IAPI already implements job claim/heartbeat/status update
Deliverables
ImportStepLog
GET /admin-api/imports (list), GET /admin-api/imports/:id (detail including step logs), POST /admin-api/imports/:id/retry (retry)
E2E (test/m2.e2e-spec.ts)
5 tests: import create, admin list + 401, detail, retry.
Run