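This tutorial targets business-document processing. It shows how to use the xParse Extract API to extract structured information directly from documents and how to validate the extracted data through an Agent.
Scenario overview
Business pain points
In finance and procurement, companies face the following challenges:
- High volume: large numbers of invoices, contracts, orders, receipts, and similar documents must be processed
- Tedious extraction: key information (amounts, tax IDs, dates, line items, and so on) has to be pulled out of each document
- Hard-to-verify data: completeness and accuracy must be checked (amount arithmetic, date plausibility, and so on)
- Mixed formats: documents arrive as PDFs, images, scans, and other non-uniform formats
- High labor cost: manual entry and cross-checking are slow and error-prone
Solution
By building a document-extraction Agent, we can achieve:
- Parsing and extraction in one step: with the xParse Extract API, a Schema definition extracts structured data directly from the document, with no need to parse first and then extract with an LLM
- Schema-driven extraction: separate Schemas for invoices, contracts, and orders give precise control over the extracted fields
- Data validation: extracted data is validated automatically (amount checks, date checks, required-field checks, and so on)
- Batch processing: large numbers of documents can be processed in batches
Architecture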
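Business documents (PDF / image / scan)
↓
[xParse Extract API]
└─ Parsing + structured extraction (in one step)
↓
[LangChain Agent]
├─ Tool 1: extract_invoice_info (invoice extraction Schema)
├─ Tool 2: extract_contract_info (contract extraction Schema)
├─ Tool 3: extract_order_info (order extraction Schema)
└─ Tool 4: validate_data (data validation)
↓
Structured data (JSON) + validation report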
Environment setup
First install the required dependencies:
python -m venv .venv && source .venv/bin/activate
pip install requests langchain langchain-community langchain-core python-dotenv dashscope
Store the configuration in a .env file:
# .env
TEXTIN_APP_ID=your-app-id
TEXTIN_SECRET_CODE=your-secret-code
DASHSCOPE_API_KEY=your-dashscope-key
Note: for TEXTIN_APP_ID and TEXTIN_SECRET_CODE, see API Key; log in to the Textin workbench to obtain them. The example uses the Tongyi Qianwen LLM; other models are used in a similar way.
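Before running the full example, it can help to confirm that the three variables are actually set. The following is a minimal sketch, not part of the original example: the helper name missing_env_vars is hypothetical, and in the real script it would be called after load_dotenv().

```python
import os

# The three variables listed in the .env file above
REQUIRED_VARS = ("TEXTIN_APP_ID", "TEXTIN_SECRET_CODE", "DASHSCOPE_API_KEY")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing configuration:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.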
Complete code example
Below is a complete example that can be run directly:
import os
import json
import base64
import glob
import re
from datetime import datetime
from dotenv import load_dotenv
import requests
from langchain_core.tools import Tool
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi
load_dotenv()
# ========== Step 1: Extract API configuration ==========
DOCS_DIR = "/your/doc/folder"
EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"


def extract_from_file(file_path: str, schema: dict, generate_citations: bool = False, stamp: bool = False) -> dict:
    """Send one file plus a Schema to the Extract API and return the `result` payload."""
    # Read the file and encode it as Base64
    with open(file_path, "rb") as f:
        file_base64 = base64.b64encode(f.read()).decode("utf-8")
    # Request body: file + Schema + options
    payload = {
        "file": {"file_base64": file_base64, "file_name": os.path.basename(file_path)},
        "schema": schema,
        "extract_options": {"generate_citations": generate_citations, "stamp": stamp}
    }
    # Authenticate through the x-ti-app-id and x-ti-secret-code headers
    headers = {
        "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
        "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
        "Content-Type": "application/json"
    }
    response = requests.post(EXTRACT_API_URL, json=payload, headers=headers)
    result = response.json()
    # Any code other than 200 in the response body is treated as a failure
    if result.get("code") != 200:
        raise Exception(f"Extract API 错误: {result.get('message', '未知错误')}")
    return result["result"]
# ========== Step 2: Define the extraction Schemas ==========
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "发票号码": {"type": ["string", "null"], "description": "发票号码"},
        "发票代码": {"type": ["string", "null"], "description": "发票代码"},
        "开票日期": {"type": ["string", "null"], "description": "开票日期"},
        "销售方名称": {"type": ["string", "null"], "description": "销售方名称"},
        "销售方税号": {"type": ["string", "null"], "description": "销售方纳税人识别号"},
        "购买方名称": {"type": ["string", "null"], "description": "购买方名称"},
        "购买方税号": {"type": ["string", "null"], "description": "购买方纳税人识别号"},
        "商品明细": {
            "type": "array", "description": "商品明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "名称": {"type": ["string", "null"], "description": "商品名称"},
                    "规格型号": {"type": ["string", "null"], "description": "规格型号"},
                    "数量": {"type": ["string", "null"], "description": "数量"},
                    "单价": {"type": ["string", "null"], "description": "单价"},
                    "金额": {"type": ["string", "null"], "description": "金额"},
                    "税率": {"type": ["string", "null"], "description": "税率"}
                },
                "required": ["名称", "金额"]
            }
        },
        "合计金额": {"type": ["string", "null"], "description": "合计金额"},
        "税额": {"type": ["string", "null"], "description": "税额"},
        "价税合计": {"type": ["string", "null"], "description": "价税合计"}
    },
    "required": ["发票号码", "开票日期", "商品明细", "价税合计"]
}

CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "合同编号": {"type": ["string", "null"], "description": "合同编号"},
        "签署日期": {"type": ["string", "null"], "description": "签署日期"},
        "生效日期": {"type": ["string", "null"], "description": "生效日期"},
        "到期日期": {"type": ["string", "null"], "description": "到期日期"},
        "甲方名称": {"type": ["string", "null"], "description": "甲方名称"},
        "乙方名称": {"type": ["string", "null"], "description": "乙方名称"},
        "甲方联系方式": {"type": ["string", "null"], "description": "甲方联系方式"},
        "乙方联系方式": {"type": ["string", "null"], "description": "乙方联系方式"},
        "合同总价": {"type": ["string", "null"], "description": "合同总价"},
        "付款方式": {"type": ["string", "null"], "description": "付款方式"},
        "付款期限": {"type": ["string", "null"], "description": "付款期限"},
        "违约责任": {"type": ["string", "null"], "description": "违约责任条款"},
        "争议解决": {"type": ["string", "null"], "description": "争议解决方式"},
        "合同期限": {"type": ["string", "null"], "description": "合同期限"}
    },
    "required": ["合同编号", "签署日期", "甲方名称", "乙方名称", "合同总价"]
}

ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "订单号": {"type": ["string", "null"], "description": "订单号"},
        "下单日期": {"type": ["string", "null"], "description": "下单日期"},
        "交货日期": {"type": ["string", "null"], "description": "交货日期"},
        "客户名称": {"type": ["string", "null"], "description": "客户名称"},
        "联系方式": {"type": ["string", "null"], "description": "联系方式"},
        "地址": {"type": ["string", "null"], "description": "地址"},
        "商品明细": {
            "type": "array", "description": "商品明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "名称": {"type": ["string", "null"], "description": "商品名称"},
                    "规格": {"type": ["string", "null"], "description": "规格"},
                    "数量": {"type": ["string", "null"], "description": "数量"},
                    "单价": {"type": ["string", "null"], "description": "单价"},
                    "金额": {"type": ["string", "null"], "description": "金额"}
                },
                "required": ["名称", "金额"]
            }
        },
        "订单总额": {"type": ["string", "null"], "description": "订单总额"},
        "运费": {"type": ["string", "null"], "description": "运费"},
        "优惠金额": {"type": ["string", "null"], "description": "优惠金额"},
        "实付金额": {"type": ["string", "null"], "description": "实付金额"}
    },
    "required": ["订单号", "下单日期", "商品明细", "实付金额"]
}

VALIDATION_SCHEMA = {
    "type": "object",
    "properties": {
        "发票号码": {"type": ["string", "null"], "description": "发票号码"},
        "发票代码": {"type": ["string", "null"], "description": "发票代码"},
        "开票日期": {"type": ["string", "null"], "description": "开票日期"},
        "销售方税号": {"type": ["string", "null"], "description": "销售方纳税人识别号"},
        "购买方税号": {"type": ["string", "null"], "description": "购买方纳税人识别号"},
        "合计金额": {"type": ["string", "null"], "description": "合计金额"},
        "税额": {"type": ["string", "null"], "description": "税额"},
        "价税合计": {"type": ["string", "null"], "description": "价税合计"},
        "合同编号": {"type": ["string", "null"], "description": "合同编号"},
        "签署日期": {"type": ["string", "null"], "description": "签署日期"},
        "合同总价": {"type": ["string", "null"], "description": "合同总价"},
        "订单号": {"type": ["string", "null"], "description": "订单号"},
        "下单日期": {"type": ["string", "null"], "description": "下单日期"},
        "实付金额": {"type": ["string", "null"], "description": "实付金额"},
        "商品明细": {
            "type": "array", "description": "商品明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "名称": {"type": ["string", "null"], "description": "商品名称"},
                    "金额": {"type": ["string", "null"], "description": "金额"}
                },
                "required": ["名称"]
            }
        }
    },
    "required": []
}
# ========== Step 3: Initialize the LLM ==========
llm = ChatTongyi(
    model="qwen-max",
    top_p=0.8,
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY")
)
# ========== Step 4: Build the LangChain Tools ==========
def extract_invoice_info(query: str) -> str:
    """Extract structured information from an invoice."""
    # The file name is whatever follows the "文件:" marker in the query
    filename = query.split("文件:")[-1].strip() if "文件:" in query else None
    if not filename:
        return "❌ 请提供文件名,格式:提取发票信息 文件:发票.pdf"
    file_path = os.path.join(DOCS_DIR, filename)
    if not os.path.exists(file_path):
        return f"❌ 文件不存在: {file_path}"
    try:
        result = extract_from_file(file_path, INVOICE_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)
    except Exception as e:
        return f"❌ 提取信息时出错:{str(e)}"


def extract_contract_info(query: str) -> str:
    """Extract key information from a contract."""
    filename = query.split("文件:")[-1].strip() if "文件:" in query else None
    if not filename:
        return "❌ 请提供文件名,格式:提取合同信息 文件:合同.pdf"
    file_path = os.path.join(DOCS_DIR, filename)
    if not os.path.exists(file_path):
        return f"❌ 文件不存在: {file_path}"
    try:
        result = extract_from_file(file_path, CONTRACT_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)
    except Exception as e:
        return f"❌ 提取信息时出错:{str(e)}"


def extract_order_info(query: str) -> str:
    """Extract information from an order."""
    filename = query.split("文件:")[-1].strip() if "文件:" in query else None
    if not filename:
        return "❌ 请提供文件名,格式:提取订单信息 文件:订单.pdf"
    file_path = os.path.join(DOCS_DIR, filename)
    if not os.path.exists(file_path):
        return f"❌ 文件不存在: {file_path}"
    try:
        result = extract_from_file(file_path, ORDER_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)
    except Exception as e:
        return f"❌ 提取信息时出错:{str(e)}"
def validate_data(query: str) -> str:
    """Extract the key fields of a document and validate them."""
    filename = query.split("文件:")[-1].strip() if "文件:" in query else None
    if not filename:
        return "❌ 请提供文件名,格式:验证数据 文件:发票.pdf"
    file_path = os.path.join(DOCS_DIR, filename)
    if not os.path.exists(file_path):
        return f"❌ 文件不存在: {file_path}"
    try:
        result = extract_from_file(file_path, VALIDATION_SCHEMA, generate_citations=True)
        data = result["extracted_schema"]
        citations = result.get("citations", {})
        checks = []

        # Check 1: required fields, chosen by document type (inferred from the file name)
        if "发票" in filename or "invoice" in filename:
            required_fields = ["发票号码", "开票日期", "价税合计"]
        elif "合同" in filename or "contract" in filename:
            required_fields = ["合同编号", "签署日期", "合同总价"]
        elif "订单" in filename or "order" in filename:
            required_fields = ["订单号", "下单日期", "实付金额"]
        else:
            required_fields = []
        missing = [f for f in required_fields if not data.get(f)]
        checks.append({
            "type": "必填项检查",
            "status": "fail" if missing else "pass",
            "message": f"缺少: {', '.join(missing)}" if missing else "所有必填项已填写"
        })

        # Check 2: amount arithmetic (subtotal + tax should equal the total)
        amount_status = "pass"
        amount_message = "金额校验通过"
        subtotal = data.get("合计金额") or data.get("订单总额")
        tax = data.get("税额")
        total = data.get("价税合计") or data.get("实付金额") or data.get("合同总价")
        if subtotal and tax and total:
            try:
                # Strip currency symbols and separators before converting
                s = float(re.sub(r"[^\d.]", "", subtotal))
                t = float(re.sub(r"[^\d.]", "", tax))
                tot = float(re.sub(r"[^\d.]", "", total))
                if abs(s + t - tot) > 0.01:
                    amount_status = "warning"
                    amount_message = f"合计金额({s}) + 税额({t}) = {s+t},与价税合计({tot})不一致"
            except ValueError:
                amount_status = "warning"
                amount_message = "金额字段包含非数字内容,无法自动校验"
        else:
            amount_status = "warning"
            amount_message = "部分金额字段缺失,无法校验"
        checks.append({"type": "金额计算验证", "status": amount_status, "message": amount_message})

        # Check 3: date plausibility (flag dates that lie in the future)
        date_status = "pass"
        date_message = "日期格式合理"
        date_fields = ["开票日期", "签署日期", "下单日期", "交货日期", "生效日期", "到期日期"]
        for field in date_fields:
            val = data.get(field)
            if val:
                # Try each supported format until one parses
                for fmt in ["%Y-%m-%d", "%Y年%m月%d日", "%Y/%m/%d", "%Y.%m.%d"]:
                    try:
                        dt = datetime.strptime(val, fmt)
                        if dt > datetime.now():
                            date_status = "warning"
                            date_message = f"{field}({val}) 为未来日期"
                        break
                    except ValueError:
                        continue
        checks.append({"type": "日期合理性检查", "status": date_status, "message": date_message})

        # Check 4: tax ID format (15 to 20 alphanumeric characters)
        format_status = "pass"
        format_message = "格式验证通过"
        for tax_field in ["销售方税号", "购买方税号"]:
            val = data.get(tax_field)
            if val and not re.match(r"^[A-Za-z0-9]{15,20}$", val):
                format_status = "warning"
                format_message = f"{tax_field}({val}) 格式可能不正确"
                break
        checks.append({"type": "格式验证", "status": format_status, "message": format_message})

        # Overall status: fail outranks warning, warning outranks pass
        overall = "pass"
        if any(c["status"] == "fail" for c in checks):
            overall = "fail"
        elif any(c["status"] == "warning" for c in checks):
            overall = "warning"
        return json.dumps({
            "file": filename,
            "checks": checks,
            "overall_status": overall,
            "extracted_data": data
        }, ensure_ascii=False, indent=2)
    except Exception as e:
        return f"❌ 验证数据时出错:{str(e)}"
def process_documents(query: str) -> str:
    """Batch-extract every document in DOCS_DIR."""
    patterns = ["*.pdf", "*.png", "*.jpg", "*.jpeg"]
    results = {}
    count = 0
    for pattern in patterns:
        for file_path in glob.glob(os.path.join(DOCS_DIR, pattern)):
            fname = os.path.basename(file_path)
            try:
                # Pick the Schema from keywords in the file name
                if "发票" in fname or "invoice" in fname:
                    result = extract_from_file(file_path, INVOICE_SCHEMA)
                elif "合同" in fname or "contract" in fname:
                    result = extract_from_file(file_path, CONTRACT_SCHEMA)
                elif "订单" in fname or "order" in fname:
                    result = extract_from_file(file_path, ORDER_SCHEMA)
                else:
                    # Unrecognized file names fall back to the invoice Schema
                    result = extract_from_file(file_path, INVOICE_SCHEMA)
                results[fname] = result["extracted_schema"]
                count += 1
            except Exception as e:
                # Record the failure and continue with the next file
                results[fname] = f"提取失败: {str(e)}"
    return json.dumps({"total": count, "results": results}, ensure_ascii=False, indent=2)
tools = [
    Tool(
        name="process_documents",
        description="批量提取所有单据文档的结构化信息。输入:'提取所有文档' 或文件名。",
        func=process_documents
    ),
    Tool(
        name="extract_invoice_info",
        description="从发票中提取结构化信息,包括发票号码、开票日期、销售方信息、购买方信息、商品明细、金额信息等。输入格式:提取发票信息 文件:发票.pdf",
        func=extract_invoice_info
    ),
    Tool(
        name="extract_contract_info",
        description="从合同中提取关键信息,包括合同编号、签署日期、签约方信息、合同金额、关键条款等。输入格式:提取合同信息 文件:合同.pdf",
        func=extract_contract_info
    ),
    Tool(
        name="extract_order_info",
        description="从订单中提取信息,包括订单号、下单日期、客户信息、商品明细、金额信息等。输入格式:提取订单信息 文件:订单.pdf",
        func=extract_order_info
    ),
    Tool(
        name="validate_data",
        description="验证提取的数据,包括必填项检查、金额计算验证、日期合理性检查、格式验证等。输入格式:验证数据 文件:发票.pdf",
        func=validate_data
    )
]
# ========== Step 5: Initialize the Agent ==========
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    agent_kwargs={
"prefix": """你是一个专业的单据处理助手。你的任务是帮助用户:
1. 从发票、合同、订单中提取关键信息
2. 验证提取的数据完整性和准确性
3. 检查数据格式和合理性
在回答时,请:
- 选择合适的工具提取信息
- 提供结构化的提取结果(JSON格式)
- 明确标注验证结果(通过/失败/警告)
- 如果发现问题,说明具体的问题和建议
- 使用工具获取准确的信息,不要猜测
"""
}
)
# ========== Step 6: Usage examples ==========
if __name__ == "__main__":
    # Example 1: extract invoice information
    print("=" * 60)
    print("示例 1: 提取发票信息")
    print("=" * 60)
    response = agent.invoke({
        "input": "从发票中提取发票号码、开票日期、金额和商品明细 文件:invoice.pdf"
    })
    print(response["output"])
    print()

    # Example 2: extract contract information
    print("=" * 60)
    print("示例 2: 提取合同信息")
    print("=" * 60)
    response = agent.invoke({
        "input": "从合同中提取合同编号、签署日期、签约方和合同金额 文件:contract.pdf"
    })
    print(response["output"])
    print()

    # Example 3: extract order information
    print("=" * 60)
    print("示例 3: 提取订单信息")
    print("=" * 60)
    response = agent.invoke({
        "input": "从订单中提取订单号、下单日期、客户信息和商品明细 文件:order.pdf"
    })
    print(response["output"])
    print()

    # Example 4: validate the data
    print("=" * 60)
    print("示例 4: 数据验证")
    print("=" * 60)
    response = agent.invoke({
        "input": "验证发票数据:检查必填项、金额计算、日期合理性、税号格式 文件:invoice.pdf"
    })
    print(response["output"])
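Code walkthrough
Step 1: Extract API configuration
extract_from_file is the core helper function. It:
- Reads the file and encodes it as Base64
- Builds the request body (file + Schema + options)
- Calls the Extract API, which parses the document and extracts structured data in one step
- Authenticates through the x-ti-app-id and x-ti-secret-code request headers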
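To make the request body easier to inspect, the payload construction can be separated from the network call. This is a minimal sketch that mirrors the fields used in extract_from_file; the helper name build_extract_payload, the demo bytes, and the one-field demo Schema are assumptions for illustration, and nothing is sent to the API.

```python
import base64
import json


def build_extract_payload(file_bytes, file_name, schema,
                          generate_citations=False, stamp=False):
    """Build the JSON body in the same shape extract_from_file sends."""
    return {
        "file": {
            # File content travels as a Base64 string
            "file_base64": base64.b64encode(file_bytes).decode("utf-8"),
            "file_name": file_name,
        },
        "schema": schema,
        "extract_options": {
            "generate_citations": generate_citations,
            "stamp": stamp,
        },
    }


# Hypothetical one-field Schema and placeholder file content
demo_schema = {
    "type": "object",
    "properties": {
        "发票号码": {"type": ["string", "null"], "description": "发票号码"}
    },
    "required": ["发票号码"],
}
payload = build_extract_payload(b"%PDF-demo", "invoice.pdf", demo_schema)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Printing the payload before sending is a quick way to confirm that the Schema and options look as intended.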
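Step 2: Schema definitions
An extraction Schema is defined for each document type:
- INVOICE_SCHEMA: invoice information (invoice number, seller/buyer, line items, amounts, and so on)
- CONTRACT_SCHEMA: contract information (contract number, contracting parties, amounts, key clauses, and so on)
- ORDER_SCHEMA: order information (order number, customer information, line items, amounts, and so on)
- VALIDATION_SCHEMA: a general-purpose Schema for validation that covers the key fields of every document type
Each Schema uses type, description, and required to define the extracted fields precisely.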
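Because every scalar field in these Schemas has the same shape (a nullable string plus a description), a small builder can cut down on repetition. The sketch below is one possible approach, not the tutorial's own code: nullable_string, object_schema, and the receipt Schema with its field names are all hypothetical.

```python
def nullable_string(description):
    """One field definition in the style used by the Schemas above."""
    return {"type": ["string", "null"], "description": description}


def object_schema(fields, required):
    """Build an object Schema from {field name: description} pairs."""
    unknown = [name for name in required if name not in fields]
    if unknown:
        # Catch typos in the required list early
        raise ValueError(f"required fields not defined: {unknown}")
    return {
        "type": "object",
        "properties": {name: nullable_string(desc) for name, desc in fields.items()},
        "required": list(required),
    }


# Hypothetical receipt Schema; the field names are illustrative only
RECEIPT_SCHEMA = object_schema(
    {"收据编号": "收据编号", "开具日期": "开具日期", "金额": "金额"},
    required=["收据编号", "金额"],
)
```

Nested array fields such as 商品明细 would still need to be written out by hand or handled by a second helper.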
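Step 4: Extraction Tools
Each Tool works as follows:
- Extract the file name from the query
- Call extract_from_file with the corresponding Schema
- Return the structured JSON result that the Extract API produces directly
Step 5: Agent configuration
The Agent automatically:
- Selects the appropriate Tool based on the user's intent
- Calls the Extract API to extract the information
- Composes the final answer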
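The three extraction Tools (extract_invoice_info, extract_contract_info, extract_order_info) differ only in the Schema they pass, so they could be generated by a factory. The following is a sketch under assumptions: make_extract_tool and parse_filename are hypothetical names, the extractor is injected so the example runs without network access, and the file-existence check from the original is omitted for brevity.

```python
import json
import os

FILE_MARKER = "文件:"  # the marker the original Tools look for in the query


def parse_filename(query):
    """Return the text after the last marker, or None if absent or empty."""
    if FILE_MARKER not in query:
        return None
    return query.split(FILE_MARKER)[-1].strip() or None


def make_extract_tool(schema, extract_fn, docs_dir):
    """Create a Tool function bound to one Schema."""
    def tool(query):
        filename = parse_filename(query)
        if not filename:
            return "missing file name"
        file_path = os.path.join(docs_dir, filename)
        try:
            result = extract_fn(file_path, schema)
            return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)
        except Exception as exc:
            return f"extraction failed: {exc}"
    return tool


def fake_extract(file_path, schema):
    """Stand-in for extract_from_file so the sketch runs offline."""
    return {"extracted_schema": {"file_name": os.path.basename(file_path)}}


invoice_tool = make_extract_tool({"type": "object"}, fake_extract, "/docs")
print(invoice_tool("提取发票信息 文件:invoice.pdf"))
```

In the real script, extract_fn would be extract_from_file and docs_dir would be DOCS_DIR.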
Usage examples
Example 1: Extract invoice information
response = agent.invoke({
    "input": "从发票中提取发票号码、开票日期、销售方税号、购买方税号、商品明细和金额 文件:invoice.pdf"
})
print(response["output"])
Example 2: Extract contract information
response = agent.invoke({
    "input": "从合同中提取合同编号、签署日期、甲方、乙方、合同金额和违约责任条款 文件:contract.pdf"
})
print(response["output"])
Example 3: Extract order information
response = agent.invoke({
    "input": "从订单中提取订单号、下单日期、客户信息和商品明细 文件:order.pdf"
})
print(response["output"])
Example 4: Validate the data
response = agent.invoke({
    "input": "验证提取的发票数据:检查必填项、金额计算、日期合理性、税号格式 文件:invoice.pdf"
})
print(response["output"])
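The amount check that validate_data performs (subtotal + tax compared against the total, within 0.01) can be isolated and tested on its own. This sketch swaps the original float arithmetic for decimal.Decimal to avoid binary rounding effects; that substitution, along with the names parse_amount and check_amounts, is an assumption and not the tutorial's code.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Keep digits and dots only, then convert.

    Like the original regex, this also drops any minus sign.
    Raises InvalidOperation when nothing numeric remains.
    """
    return Decimal(re.sub(r"[^\d.]", "", text))


def check_amounts(subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return (status, message) for the subtotal + tax == total check."""
    if not (subtotal and tax and total):
        return "warning", "some amount fields are missing"
    try:
        s, t, tot = parse_amount(subtotal), parse_amount(tax), parse_amount(total)
    except InvalidOperation:
        return "warning", "amount fields contain non-numeric content"
    if abs(s + t - tot) > tolerance:
        return "warning", f"{s} + {t} = {s + t}, which differs from {tot}"
    return "pass", "amounts are consistent"


print(check_amounts("¥1,000.00", "130.00", "1130.00"))
print(check_amounts("100", "13", "120"))
```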
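Best practices
- Schema design: define Schema fields according to actual business needs, mark mandatory fields with required, and give each field a clear description
- Document quality: for scans and images, make sure the resolution is sufficient; the Extract API has a built-in high-precision OCR engine
- Coordinate citations: enable generate_citations to get each field's position coordinates in the document, which makes manual review easier
- Data validation: validate immediately after extraction to ensure completeness and accuracy
- Batch processing: use process_documents to extract in batches and improve throughput
- Error handling: for documents that fail extraction, record the error so they can be handled manually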
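For the batch-processing and error-handling points, one option is to keep failures in a separate mapping so they can be routed to manual review. The sketch below follows the file-name keyword rules used in process_documents; pick_doc_type, run_batch, and the flaky stub extractor are hypothetical, and no real API call is made.

```python
# Keyword rules in the same order process_documents checks them
RULES = (
    ("invoice", ("发票", "invoice")),
    ("contract", ("合同", "contract")),
    ("order", ("订单", "order")),
)


def pick_doc_type(filename):
    """Infer the document type from keywords in the file name."""
    for doc_type, keywords in RULES:
        if any(keyword in filename for keyword in keywords):
            return doc_type
    return "invoice"  # same fallback as process_documents


def run_batch(filenames, extract_fn):
    """Run extract_fn on each file; collect results and failures separately."""
    results, failures = {}, {}
    for name in filenames:
        try:
            results[name] = extract_fn(name, pick_doc_type(name))
        except Exception as exc:
            failures[name] = str(exc)  # keep the reason for manual follow-up
    return results, failures


def flaky_extract(name, doc_type):
    """Stub extractor that fails for one file to show the error path."""
    if "bad" in name:
        raise RuntimeError("unreadable scan")
    return {"doc_type": doc_type}


results, failures = run_batch(
    ["发票_01.pdf", "contract_a.pdf", "bad_order.png"], flaky_extract
)
print(results)
print(failures)
```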
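FAQ
Q: How do I handle blurry scans?
A: 1) Use high-quality scans; 2) preprocess the image (denoise, increase contrast); 3) the Extract API has a built-in high-precision OCR engine that can handle most scans.
Q: How do I customize the extracted fields?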
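A: Modify the corresponding Schema definition. Schemas follow the JSON Schema specification and support types such as string, number, array, and object; use description to describe what each field means.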
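Customizing a Schema amounts to editing its properties and required entries. As a minimal sketch, the helper below adds one field to a copy of an existing Schema so the original stays untouched; with_extra_field, the base Schema, and the 备注 field are hypothetical examples.

```python
import copy


def with_extra_field(schema, name, description, field_type="string", required=False):
    """Return a copy of schema with one more nullable field."""
    new_schema = copy.deepcopy(schema)  # leave the input Schema unchanged
    new_schema.setdefault("properties", {})[name] = {
        "type": [field_type, "null"],
        "description": description,
    }
    if required:
        required_list = new_schema.setdefault("required", [])
        if name not in required_list:
            required_list.append(name)
    return new_schema


# Hypothetical base Schema with a single field
base = {
    "type": "object",
    "properties": {
        "发票号码": {"type": ["string", "null"], "description": "发票号码"}
    },
    "required": ["发票号码"],
}
extended = with_extra_field(base, "备注", "发票备注栏内容")
print(sorted(extended["properties"]))
```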
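Q: How are multi-page documents handled?
A: The Extract API handles multi-page documents automatically and extracts information from all pages.
Q: Can I use a different LLM?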
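A: Yes. Agent orchestration uses LangChain, which supports many LLMs; just replace ChatTongyi (Tongyi Qianwen) with the corresponding class, such as ChatOpenAI (OpenAI) or ChatZhipuAI (Zhipu AI). Information extraction is handled by the Extract API and does not depend on a specific LLM.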

