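This tutorial targets information extraction scenarios. It shows how to use the xParse Extract API as the data foundation for an intelligent Agent that extracts structured information from unstructured documents (invoices, medical bills, contracts, resumes, product specifications, API documentation, and so on) and organizes it automatically.
Scenario Overview
Business Pain Points
In information extraction scenarios, enterprises and developers face the following challenges:
- Diverse document formats: invoices, medical bills, contracts, resumes, product documents, technical documents, and other formats all need to be handled
- Tedious extraction: structured information (invoice details, medical expenses, contract clauses, personal information, work history, product parameters, API endpoints, and so on) has to be pulled out of unstructured documents
- Hard-to-standardize data: data from different sources arrives in inconsistent formats and needs normalization
- Batch processing needs: large volumes of documents must be processed, and manual extraction is inefficient
- Data validation: extracted data must be validated and cross-checked to ensure accuracy
- Financial compliance: invoices and medical bills must meet financial and tax requirements
- Legal risk: contract extraction must accurately identify key clauses and risk points
Solution
By building an information extraction Agent, we can achieve:
- Parsing and extraction in one step: with the xParse Extract API, document parsing and structured extraction happen in a single API call, with no separate stages
- Schema-driven extraction: a JSON Schema precisely controls the extracted fields and their format, keeping the output consistent
- Data standardization: extracted information is converted to standard formats (JSON, CSV, and so on)
- Data validation: extracted data is checked for completeness and accuracy
- Batch processing: large numbers of documents can be processed in batches
- Financial automation: invoice and medical bill information is extracted automatically, ready for integration with financial systems
- Contract analysis: key contract information is extracted, and important clauses and risk points are identified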
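Architecture
Documents (PDF/Word/Excel/images)
    ↓
[xParse Extract API]
    └─ Parse document + structured extraction (single step)
    ↓
[LangChain Agent]
    └─ Tools 1-7: call the Extract API (each with its own Schema)
    ↓
Structured data (JSON/CSV)
- Each extraction tool defines its own JSON Schema describing the fields and structure to extract
- Each tool calls the xParse Extract API with the document file and the Schema, completing parsing and structured extraction in one step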
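Environment Setup
python -m venv .venv && source .venv/bin/activate
pip install requests langchain langchain-community langchain-core \
    python-dotenv pandas
export TEXTIN_APP_ID=your-app-id # obtain by registering on the TextIn website
export TEXTIN_SECRET_CODE=your-secret-code # obtain by registering on the TextIn website
export DASHSCOPE_API_KEY=your-dashscope-api-key # this tutorial uses the Tongyi Qianwen (Qwen) model; other LLMs can be substituted
Tip: for TEXTIN_APP_ID and TEXTIN_SECRET_CODE, see the API Key page and log in to the TextIn console to obtain them. The examples use Tongyi Qianwen (Qwen) models; other models are used in a similar way.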
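Before running the later steps, it can help to confirm that the three variables exported above are actually visible to Python. The following is a minimal sketch; `REQUIRED_ENV_VARS` and `missing_env_vars` are illustrative names introduced here, not part of any SDK.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_ENV_VARS = ("TEXTIN_APP_ID", "TEXTIN_SECRET_CODE", "DASHSCOPE_API_KEY")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching the real environment.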
Step 1: Configure the Extract API
Define a general-purpose function for calling the Extract API and a helper for resolving file paths:
import os
import json
import base64
import requests
from dotenv import load_dotenv

load_dotenv()

EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"


def extract_from_file(file_path: str, schema: dict, generate_citations: bool = False, stamp: bool = False) -> dict:
    """Extract structured information from a document with the xParse Extract API"""
    # The file is sent inline as base64 together with its name
    with open(file_path, "rb") as f:
        file_base64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "file": {
            "file_base64": file_base64,
            "file_name": os.path.basename(file_path)
        },
        "schema": schema,
        "extract_options": {
            "generate_citations": generate_citations,
            "stamp": stamp
        }
    }
    headers = {
        "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
        "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
        "Content-Type": "application/json"
    }
    # timeout=120 is an assumed value (not from the API docs) so a stalled
    # connection cannot hang forever; tune it for large documents
    response = requests.post(EXTRACT_API_URL, json=payload, headers=headers, timeout=120)
    result = response.json()
    if result.get("code") != 200:
        # Message reads "Extract API error: <message or 'unknown error'>"
        raise RuntimeError(f"Extract API 错误: {result.get('message', '未知错误')}")
    return result["result"]


def _resolve_file_path(file_path: str = None) -> str:
    """Resolve a file path; return it if it exists, otherwise None"""
    # The agent may pass the literal strings "None" or "null" as tool input
    if file_path in ("None", "none", None, "", "null"):
        return None
    if os.path.exists(file_path):
        return file_path
    return None
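To inspect the request body without sending anything, the payload construction can be separated from the HTTP call. The sketch below mirrors the fields that `extract_from_file` builds above; `build_extract_payload` is a hypothetical helper name, and the demo file and schema are throwaway values invented for illustration.

```python
import base64
import os
import tempfile


def build_extract_payload(file_path, schema, generate_citations=False, stamp=False):
    """Build the JSON body that extract_from_file sends, without sending it."""
    with open(file_path, "rb") as f:
        file_base64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "file": {"file_base64": file_base64, "file_name": os.path.basename(file_path)},
        "schema": schema,
        "extract_options": {"generate_citations": generate_citations, "stamp": stamp},
    }


# Demo with a temporary file so nothing goes over the network.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_schema = {"type": "object", "properties": {"title": {"type": ["string", "null"]}}}
payload = build_extract_payload(demo_path, demo_schema)
os.remove(demo_path)

print(payload["file"]["file_name"], len(payload["file"]["file_base64"]))
```

Keeping this step separate also makes it straightforward to log or unit-test the request body independently of the network call.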
Step 2: Build the LangChain Tools
Define the extraction Schemas
Define a dedicated JSON Schema for each document type to control the extracted fields precisely:
INVOICE_SCHEMA = {
"type": "object",
"properties": {
"发票基本信息": {
"type": "object",
"properties": {
"发票代码": {"type": ["string", "null"], "description": "发票代码"},
"发票号码": {"type": ["string", "null"], "description": "发票号码"},
"开票日期": {"type": ["string", "null"], "description": "开票日期"}
}
},
"销售方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "销售方名称"},
"纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
"地址电话": {"type": ["string", "null"], "description": "地址电话"},
"开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
}
},
"购买方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "购买方名称"},
"纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
"地址电话": {"type": ["string", "null"], "description": "地址电话"},
"开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
}
},
"商品明细": {
"type": "array",
"description": "商品明细列表",
"items": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "商品名称"},
"规格型号": {"type": ["string", "null"], "description": "规格型号"},
"单位": {"type": ["string", "null"], "description": "单位"},
"数量": {"type": ["string", "null"], "description": "数量"},
"单价": {"type": ["string", "null"], "description": "单价"},
"金额": {"type": ["string", "null"], "description": "金额"},
"税率": {"type": ["string", "null"], "description": "税率"},
"税额": {"type": ["string", "null"], "description": "税额"}
},
"required": ["名称", "金额"]
}
},
"金额信息": {
"type": "object",
"properties": {
"合计金额": {"type": ["string", "null"], "description": "合计金额"},
"合计税额": {"type": ["string", "null"], "description": "合计税额"},
"价税合计": {"type": ["string", "null"], "description": "价税合计(大写)"}
}
},
"其他信息": {
"type": "object",
"properties": {
"备注": {"type": ["string", "null"], "description": "备注"},
"收款人": {"type": ["string", "null"], "description": "收款人"},
"复核人": {"type": ["string", "null"], "description": "复核人"},
"开票人": {"type": ["string", "null"], "description": "开票人"}
}
}
},
"required": ["发票基本信息", "销售方", "购买方", "商品明细", "金额信息"]
}
MEDICAL_BILL_SCHEMA = {
"type": "object",
"properties": {
"患者信息": {
"type": "object",
"properties": {
"姓名": {"type": ["string", "null"], "description": "患者姓名"},
"性别": {"type": ["string", "null"], "description": "性别"},
"年龄": {"type": ["string", "null"], "description": "年龄"},
"身份证号": {"type": ["string", "null"], "description": "身份证号"},
"医保卡号": {"type": ["string", "null"], "description": "医保卡号"}
}
},
"医疗机构信息": {
"type": "object",
"properties": {
"医院名称": {"type": ["string", "null"], "description": "医院名称"},
"科室": {"type": ["string", "null"], "description": "科室"},
"医生姓名": {"type": ["string", "null"], "description": "医生姓名"}
}
},
"就诊信息": {
"type": "object",
"properties": {
"就诊日期": {"type": ["string", "null"], "description": "就诊日期"},
"就诊类型": {"type": ["string", "null"], "description": "门诊/住院"},
"诊断结果": {"type": ["string", "null"], "description": "诊断结果"}
}
},
"费用明细": {
"type": "array",
"description": "费用明细列表",
"items": {
"type": "object",
"properties": {
"项目名称": {"type": ["string", "null"], "description": "项目名称"},
"数量": {"type": ["string", "null"], "description": "数量"},
"单价": {"type": ["string", "null"], "description": "单价"},
"金额": {"type": ["string", "null"], "description": "金额"},
"医保类型": {"type": ["string", "null"], "description": "甲类/乙类/丙类"}
},
"required": ["项目名称", "金额"]
}
},
"费用汇总": {
"type": "object",
"properties": {
"总费用": {"type": ["string", "null"], "description": "总费用"},
"自费金额": {"type": ["string", "null"], "description": "自费金额"},
"医保支付": {"type": ["string", "null"], "description": "医保支付金额"},
"个人支付": {"type": ["string", "null"], "description": "个人支付金额"}
}
},
"其他信息": {
"type": "object",
"properties": {
"发票号码": {"type": ["string", "null"], "description": "发票号码"},
"结算方式": {"type": ["string", "null"], "description": "结算方式"}
}
}
},
"required": ["患者信息", "医疗机构信息", "费用明细", "费用汇总"]
}
CONTRACT_SCHEMA = {
"type": "object",
"properties": {
"合同基本信息": {
"type": "object",
"properties": {
"合同编号": {"type": ["string", "null"], "description": "合同编号"},
"合同名称": {"type": ["string", "null"], "description": "合同名称"},
"签订日期": {"type": ["string", "null"], "description": "签订日期"},
"生效日期": {"type": ["string", "null"], "description": "生效日期"},
"到期日期": {"type": ["string", "null"], "description": "到期日期"}
}
},
"合同双方": {
"type": "object",
"properties": {
"甲方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "甲方名称"},
"地址": {"type": ["string", "null"], "description": "甲方地址"},
"法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
"联系方式": {"type": ["string", "null"], "description": "联系方式"}
}
},
"乙方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "乙方名称"},
"地址": {"type": ["string", "null"], "description": "乙方地址"},
"法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
"联系方式": {"type": ["string", "null"], "description": "联系方式"}
}
}
}
},
"合同标的": {
"type": "object",
"properties": {
"标的物": {"type": ["string", "null"], "description": "标的物或服务内容"},
"数量": {"type": ["string", "null"], "description": "数量"},
"金额": {"type": ["string", "null"], "description": "金额"}
}
},
"关键条款": {
"type": "object",
"properties": {
"付款方式": {"type": ["string", "null"], "description": "付款方式"},
"交付方式": {"type": ["string", "null"], "description": "交付方式"},
"违约责任": {"type": ["string", "null"], "description": "违约责任"},
"争议解决": {"type": ["string", "null"], "description": "争议解决方式"}
}
},
"金额信息": {
"type": "object",
"properties": {
"合同总金额": {"type": ["string", "null"], "description": "合同总金额"},
"付款计划": {"type": ["string", "null"], "description": "付款计划"},
"保证金": {"type": ["string", "null"], "description": "保证金"}
}
}
},
"required": ["合同基本信息", "合同双方", "合同标的", "关键条款", "金额信息"]
}
RESUME_SCHEMA = {
"type": "object",
"properties": {
"个人信息": {
"type": "object",
"properties": {
"姓名": {"type": ["string", "null"], "description": "姓名"},
"性别": {"type": ["string", "null"], "description": "性别"},
"年龄": {"type": ["string", "null"], "description": "年龄"},
"电话": {"type": ["string", "null"], "description": "电话"},
"邮箱": {"type": ["string", "null"], "description": "邮箱"},
"地址": {"type": ["string", "null"], "description": "地址"}
}
},
"教育经历": {
"type": "array",
"description": "教育经历列表",
"items": {
"type": "object",
"properties": {
"学校": {"type": ["string", "null"], "description": "学校名称"},
"专业": {"type": ["string", "null"], "description": "专业"},
"学历": {"type": ["string", "null"], "description": "学历(本科/硕士/博士等)"},
"入学时间": {"type": ["string", "null"], "description": "入学时间"},
"毕业时间": {"type": ["string", "null"], "description": "毕业时间"}
},
"required": ["学校"]
}
},
"工作经历": {
"type": "array",
"description": "工作经历列表",
"items": {
"type": "object",
"properties": {
"公司": {"type": ["string", "null"], "description": "公司名称"},
"职位": {"type": ["string", "null"], "description": "职位"},
"入职时间": {"type": ["string", "null"], "description": "入职时间"},
"离职时间": {"type": ["string", "null"], "description": "离职时间"},
"工作内容": {"type": ["string", "null"], "description": "主要工作内容"}
},
"required": ["公司"]
}
},
"技能": {
"type": "object",
"properties": {
"专业技能": {"type": "array", "items": {"type": "string"}, "description": "专业技能列表"},
"语言能力": {"type": "array", "items": {"type": "string"}, "description": "语言能力列表"},
"证书": {"type": "array", "items": {"type": "string"}, "description": "证书列表"}
}
}
},
"required": ["个人信息"]
}
PRODUCT_SPECS_SCHEMA = {
"type": "object",
"properties": {
"产品名称": {"type": ["string", "null"], "description": "产品名称"},
"型号": {"type": ["string", "null"], "description": "产品型号"},
"技术参数": {
"type": "array",
"description": "技术参数列表",
"items": {
"type": "object",
"properties": {
"参数名": {"type": ["string", "null"], "description": "参数名称"},
"参数值": {"type": ["string", "null"], "description": "参数值"},
"单位": {"type": ["string", "null"], "description": "单位"}
},
"required": ["参数名", "参数值"]
}
},
"功能特性": {
"type": "array",
"items": {"type": "string"},
"description": "功能特性列表"
},
"价格信息": {
"type": "object",
"properties": {
"价格": {"type": ["string", "null"], "description": "价格"},
"币种": {"type": ["string", "null"], "description": "币种"}
}
}
},
"required": ["产品名称"]
}
API_INFO_SCHEMA = {
"type": "object",
"properties": {
"接口列表": {
"type": "array",
"description": "API 接口列表",
"items": {
"type": "object",
"properties": {
"端点": {"type": ["string", "null"], "description": "API 端点 URL"},
"请求方法": {"type": ["string", "null"], "description": "GET/POST/PUT/DELETE"},
"描述": {"type": ["string", "null"], "description": "接口描述"},
"请求参数": {
"type": "array",
"items": {
"type": "object",
"properties": {
"参数名": {"type": ["string", "null"], "description": "参数名"},
"类型": {"type": ["string", "null"], "description": "参数类型"},
"必填": {"type": ["string", "null"], "description": "是否必填"},
"说明": {"type": ["string", "null"], "description": "参数说明"}
}
}
},
"响应格式": {"type": ["string", "null"], "description": "响应数据格式描述"},
"认证方式": {"type": ["string", "null"], "description": "认证方式"}
},
"required": ["端点", "请求方法"]
}
}
},
"required": ["接口列表"]
}
KEY_VALUE_SCHEMA = {
"type": "object",
"properties": {
"键值对列表": {
"type": "array",
"description": "从文档中提取的所有键值对",
"items": {
"type": "object",
"properties": {
"键": {"type": ["string", "null"], "description": "键名"},
"值": {"type": ["string", "null"], "description": "对应的值"}
},
"required": ["键", "值"]
}
}
},
"required": ["键值对列表"]
}
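The `required` lists in the Schemas above can also drive a simple completeness check on the extraction result. The following is a rough sketch, not a full JSON Schema validator: `find_missing_required` is a hypothetical helper that only follows `type`, `properties`, `items`, and `required`, and the demo schema and result are invented for illustration.

```python
def find_missing_required(data, schema, path=""):
    """Return dotted paths of required fields that are absent or null."""
    missing = []
    schema_type = schema.get("type")
    if schema_type == "object" and isinstance(data, dict):
        for key in schema.get("required", []):
            if data.get(key) is None:
                missing.append(f"{path}{key}")
        # Recurse only into children that are actually present
        for key, sub_schema in schema.get("properties", {}).items():
            if data.get(key) is not None:
                missing += find_missing_required(data[key], sub_schema, f"{path}{key}.")
    elif schema_type == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            missing += find_missing_required(item, schema.get("items", {}), f"{path}{index}.")
    return missing


demo_schema = {
    "type": "object",
    "properties": {
        "商品明细": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "名称": {"type": ["string", "null"]},
                    "金额": {"type": ["string", "null"]},
                },
                "required": ["名称", "金额"],
            },
        },
        "金额信息": {"type": "object", "properties": {"合计金额": {"type": ["string", "null"]}}},
    },
    "required": ["商品明细", "金额信息"],
}
demo_result = {"商品明细": [{"名称": "办公用品", "金额": None}]}

print(find_missing_required(demo_result, demo_schema))
```

Because the Schemas type most leaf fields as `["string", "null"]`, a null value is legal output; treating null in a required field as "missing" is a design choice of this sketch.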
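Tool 1: Extract invoice information
from langchain_core.tools import Tool


def extract_invoice_info(file_path: str = None) -> str:
    """
    Extract structured information from an invoice (using the xParse Extract API)

    Extracted content:
    - Basic invoice information (invoice code, invoice number, issue date)
    - Seller information (name, taxpayer ID, address and phone, bank and account)
    - Buyer information (name, taxpayer ID, address and phone, bank and account)
    - Line items (name, specification, unit, quantity, unit price, amount, tax rate, tax amount)
    - Amounts (total amount, total tax, total including tax)
    - Other information (remarks, payee, reviewer, issuer, and so on)

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, INVOICE_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)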
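Tool 2: Extract medical bill information
def extract_medical_bill_info(file_path: str = None) -> str:
    """
    Extract structured information from a medical bill (using the xParse Extract API)

    Extracted content:
    - Patient information (name, gender, age, ID number, medical insurance card number)
    - Medical institution information (hospital name, department, doctor name)
    - Visit information (visit date, visit type, diagnosis)
    - Expense items (item name, quantity, unit price, amount, insurance category)
    - Expense summary (total cost, self-paid amount, insurance payment, personal payment)

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, MEDICAL_BILL_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)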
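Tool 3: Extract contract information
def extract_contract_info(file_path: str = None) -> str:
    """
    Extract structured information from a contract (using the xParse Extract API)

    Extracted content:
    - Basic contract information (contract number, contract name, signing date, effective date, expiry date)
    - Contracting parties (Party A and Party B: name, address, legal representative, contact details)
    - Subject matter (subject or service, quantity, amount)
    - Key clauses (payment method, delivery method, liability for breach, dispute resolution)
    - Amounts (total contract amount, payment schedule, deposit)

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, CONTRACT_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)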
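Tool 4: Extract resume information
def extract_resume_info(file_path: str = None) -> str:
    """
    Extract structured information from a resume (using the xParse Extract API)

    Extracted content:
    - Personal information (name, gender, age, contact details)
    - Education (school, major, degree, dates)
    - Work experience (company, position, dates, responsibilities)
    - Skills (professional skills, languages, certificates, and so on)

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, RESUME_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)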
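Tool 5: Extract product specifications
def extract_product_specs(file_path: str = None) -> str:
    """
    Extract product specifications and technical parameters from a product document (using the xParse Extract API)

    Extracted content:
    - Product name and model
    - Technical parameters (dimensions, weight, performance figures, and so on)
    - Features
    - Pricing information

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, PRODUCT_SPECS_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)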
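Tool 6: Extract API information
def extract_api_info(file_path: str = None) -> str:
    """
    Extract API endpoint information from technical documentation (using the xParse Extract API)

    Extracted content:
    - API endpoints (URL paths)
    - Request methods (GET, POST, and so on)
    - Request parameters
    - Response format
    - Authentication method

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, API_INFO_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)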
Tool 7: Data formatting
import pandas as pd


def format_data(file_path: str = None) -> str:
    """
    Extract key-value pairs from a document and format them as standard formats (JSON, CSV, and so on)

    Uses the xParse Extract API to extract all key-value pairs in the document,
    then converts them to JSON and CSV.

    Args:
        file_path: path to the document
    """
    fp = _resolve_file_path(file_path)
    if not fp:
        # "Error: please provide a valid document path."
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, KEY_VALUE_SCHEMA)
    extracted = result["extracted_schema"]
    data_list = extracted.get("键值对列表", [])
    if not data_list:
        # "No data found to format"
        return "未找到可格式化的数据"
    json_output = json.dumps(data_list, ensure_ascii=False, indent=2)
    try:
        df = pd.DataFrame(data_list)
        csv_output = df.to_csv(index=False)
    except Exception:
        # "CSV formatting failed"
        csv_output = "CSV格式化失败"
    return f"JSON格式:\n{json_output}\n\nCSV格式:\n{csv_output}"
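`format_data` handles flat key-value pairs. For nested results such as the invoice line items array, a similar conversion can be done per array. Below is a minimal sketch using only the standard library `csv` module; `items_to_csv` is an illustrative helper name, and the sample rows are invented rather than real API output.

```python
import csv
import io


def items_to_csv(items, fieldnames):
    """Serialize a list of dicts to CSV text; missing or null values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow({name: ("" if item.get(name) is None else item[name]) for name in fieldnames})
    return buffer.getvalue()


# Invented sample shaped like the "商品明细" array in INVOICE_SCHEMA
sample_items = [
    {"名称": "打印纸", "数量": "2", "金额": "50.00"},
    {"名称": "墨盒", "数量": None, "金额": "120.00"},
]
csv_text = items_to_csv(sample_items, ["名称", "数量", "金额"])
print(csv_text)
```

Fixing the column order through `fieldnames` keeps the CSV layout stable even when some documents omit optional fields.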
Assemble all Tools
tools = [
Tool(
name="extract_invoice_info",
description="从发票中提取结构化信息,包括发票基本信息、销售方/购买方信息、商品明细、金额信息等。需要提供文档路径作为参数。",
func=extract_invoice_info
),
Tool(
name="extract_medical_bill_info",
description="从医疗票据中提取结构化信息,包括患者信息、医疗机构信息、就诊信息、费用明细、费用汇总等。需要提供文档路径作为参数。",
func=extract_medical_bill_info
),
Tool(
name="extract_contract_info",
description="从合同中提取结构化信息,包括合同基本信息、合同双方信息、合同标的、关键条款、金额信息等。需要提供文档路径作为参数。",
func=extract_contract_info
),
Tool(
name="extract_resume_info",
description="从简历中提取结构化信息,包括个人信息、教育经历、工作经历、技能等。需要提供文档路径作为参数。",
func=extract_resume_info
),
Tool(
name="extract_product_specs",
description="从产品文档中提取产品规格和技术参数,包括产品名称、型号、技术参数、功能特性、价格等。需要提供文档路径作为参数。",
func=extract_product_specs
),
Tool(
name="extract_api_info",
description="从技术文档中提取API接口信息,包括API端点、请求方法、请求参数、响应格式等。需要提供文档路径作为参数。",
func=extract_api_info
),
Tool(
name="format_data",
description="从文档中提取键值对信息并格式化为标准格式(JSON、CSV等)。需要提供文档路径作为参数。",
func=format_data
)
]
Step 3: Configure the LangChain Agent
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi
llm = ChatTongyi(
model="qwen-max",
dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
temperature=0.2,
)
agent_executor = initialize_agent(
tools=tools,
llm=llm,
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
verbose=True,
agent_kwargs={
"prefix": """你是一个专业的信息提取助手。你的任务是帮助用户:
1. 从文档中提取结构化信息(发票、医疗票据、合同、简历、产品规格、API接口等)
2. 将提取的信息格式化为标准格式(JSON、CSV等)
3. 验证提取数据的完整性和准确性
在回答时,请:
- 提供结构化的提取结果
- 使用JSON或表格格式展示数据
- 如果数据不完整,说明缺失的部分
- 使用工具获取准确的信息,不要猜测
- 对于财务类文档(发票、医疗票据),确保金额和税务信息的准确性
- 对于合同文档,重点关注关键条款和风险点
- 所有工具都需要提供文档文件路径作为参数
"""
}
)
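With a ReAct-style agent, each tool receives the model's action input as a single string. Depending on the model, that string may arrive wrapped in quotes or with stray whitespace, in which case `os.path.exists` would fail on an otherwise valid path. The sketch below shows one way to normalize it before resolving the path; `clean_tool_input` is a hypothetical helper, and the patterns it strips are assumptions about typical model output rather than documented behavior.

```python
def clean_tool_input(raw):
    """Strip whitespace, wrapping quotes or backticks, and a leading 'file_path=' from a tool input."""
    if raw is None:
        return ""
    text = str(raw).strip()
    if text.startswith("file_path="):
        text = text[len("file_path="):].strip()
    return text.strip("\"'`").strip()


print(clean_tool_input(' "./extraction_documents/invoice.pdf"\n'))
```

A tool function could call this on `file_path` first and then hand the result to `_resolve_file_path`.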
Step 4: Complete Example Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Complete example of the information extraction Agent
"""
import os
import json
import base64
import requests
import pandas as pd
from dotenv import load_dotenv
from langchain_core.tools import Tool
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi
load_dotenv()
EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"
INVOICE_SCHEMA = {
"type": "object",
"properties": {
"发票基本信息": {
"type": "object",
"properties": {
"发票代码": {"type": ["string", "null"], "description": "发票代码"},
"发票号码": {"type": ["string", "null"], "description": "发票号码"},
"开票日期": {"type": ["string", "null"], "description": "开票日期"}
}
},
"销售方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "销售方名称"},
"纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
"地址电话": {"type": ["string", "null"], "description": "地址电话"},
"开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
}
},
"购买方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "购买方名称"},
"纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
"地址电话": {"type": ["string", "null"], "description": "地址电话"},
"开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
}
},
"商品明细": {
"type": "array",
"description": "商品明细列表",
"items": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "商品名称"},
"规格型号": {"type": ["string", "null"], "description": "规格型号"},
"单位": {"type": ["string", "null"], "description": "单位"},
"数量": {"type": ["string", "null"], "description": "数量"},
"单价": {"type": ["string", "null"], "description": "单价"},
"金额": {"type": ["string", "null"], "description": "金额"},
"税率": {"type": ["string", "null"], "description": "税率"},
"税额": {"type": ["string", "null"], "description": "税额"}
},
"required": ["名称", "金额"]
}
},
"金额信息": {
"type": "object",
"properties": {
"合计金额": {"type": ["string", "null"], "description": "合计金额"},
"合计税额": {"type": ["string", "null"], "description": "合计税额"},
"价税合计": {"type": ["string", "null"], "description": "价税合计(大写)"}
}
},
"其他信息": {
"type": "object",
"properties": {
"备注": {"type": ["string", "null"], "description": "备注"},
"收款人": {"type": ["string", "null"], "description": "收款人"},
"复核人": {"type": ["string", "null"], "description": "复核人"},
"开票人": {"type": ["string", "null"], "description": "开票人"}
}
}
},
"required": ["发票基本信息", "销售方", "购买方", "商品明细", "金额信息"]
}
MEDICAL_BILL_SCHEMA = {
"type": "object",
"properties": {
"患者信息": {
"type": "object",
"properties": {
"姓名": {"type": ["string", "null"], "description": "患者姓名"},
"性别": {"type": ["string", "null"], "description": "性别"},
"年龄": {"type": ["string", "null"], "description": "年龄"},
"身份证号": {"type": ["string", "null"], "description": "身份证号"},
"医保卡号": {"type": ["string", "null"], "description": "医保卡号"}
}
},
"医疗机构信息": {
"type": "object",
"properties": {
"医院名称": {"type": ["string", "null"], "description": "医院名称"},
"科室": {"type": ["string", "null"], "description": "科室"},
"医生姓名": {"type": ["string", "null"], "description": "医生姓名"}
}
},
"就诊信息": {
"type": "object",
"properties": {
"就诊日期": {"type": ["string", "null"], "description": "就诊日期"},
"就诊类型": {"type": ["string", "null"], "description": "门诊/住院"},
"诊断结果": {"type": ["string", "null"], "description": "诊断结果"}
}
},
"费用明细": {
"type": "array",
"description": "费用明细列表",
"items": {
"type": "object",
"properties": {
"项目名称": {"type": ["string", "null"], "description": "项目名称"},
"数量": {"type": ["string", "null"], "description": "数量"},
"单价": {"type": ["string", "null"], "description": "单价"},
"金额": {"type": ["string", "null"], "description": "金额"},
"医保类型": {"type": ["string", "null"], "description": "甲类/乙类/丙类"}
},
"required": ["项目名称", "金额"]
}
},
"费用汇总": {
"type": "object",
"properties": {
"总费用": {"type": ["string", "null"], "description": "总费用"},
"自费金额": {"type": ["string", "null"], "description": "自费金额"},
"医保支付": {"type": ["string", "null"], "description": "医保支付金额"},
"个人支付": {"type": ["string", "null"], "description": "个人支付金额"}
}
},
"其他信息": {
"type": "object",
"properties": {
"发票号码": {"type": ["string", "null"], "description": "发票号码"},
"结算方式": {"type": ["string", "null"], "description": "结算方式"}
}
}
},
"required": ["患者信息", "医疗机构信息", "费用明细", "费用汇总"]
}
CONTRACT_SCHEMA = {
"type": "object",
"properties": {
"合同基本信息": {
"type": "object",
"properties": {
"合同编号": {"type": ["string", "null"], "description": "合同编号"},
"合同名称": {"type": ["string", "null"], "description": "合同名称"},
"签订日期": {"type": ["string", "null"], "description": "签订日期"},
"生效日期": {"type": ["string", "null"], "description": "生效日期"},
"到期日期": {"type": ["string", "null"], "description": "到期日期"}
}
},
"合同双方": {
"type": "object",
"properties": {
"甲方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "甲方名称"},
"地址": {"type": ["string", "null"], "description": "甲方地址"},
"法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
"联系方式": {"type": ["string", "null"], "description": "联系方式"}
}
},
"乙方": {
"type": "object",
"properties": {
"名称": {"type": ["string", "null"], "description": "乙方名称"},
"地址": {"type": ["string", "null"], "description": "乙方地址"},
"法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
"联系方式": {"type": ["string", "null"], "description": "联系方式"}
}
}
}
},
"合同标的": {
"type": "object",
"properties": {
"标的物": {"type": ["string", "null"], "description": "标的物或服务内容"},
"数量": {"type": ["string", "null"], "description": "数量"},
"金额": {"type": ["string", "null"], "description": "金额"}
}
},
"关键条款": {
"type": "object",
"properties": {
"付款方式": {"type": ["string", "null"], "description": "付款方式"},
"交付方式": {"type": ["string", "null"], "description": "交付方式"},
"违约责任": {"type": ["string", "null"], "description": "违约责任"},
"争议解决": {"type": ["string", "null"], "description": "争议解决方式"}
}
},
"金额信息": {
"type": "object",
"properties": {
"合同总金额": {"type": ["string", "null"], "description": "合同总金额"},
"付款计划": {"type": ["string", "null"], "description": "付款计划"},
"保证金": {"type": ["string", "null"], "description": "保证金"}
}
}
},
"required": ["合同基本信息", "合同双方", "合同标的", "关键条款", "金额信息"]
}
RESUME_SCHEMA = {
"type": "object",
"properties": {
"个人信息": {
"type": "object",
"properties": {
"姓名": {"type": ["string", "null"], "description": "姓名"},
"性别": {"type": ["string", "null"], "description": "性别"},
"年龄": {"type": ["string", "null"], "description": "年龄"},
"电话": {"type": ["string", "null"], "description": "电话"},
"邮箱": {"type": ["string", "null"], "description": "邮箱"},
"地址": {"type": ["string", "null"], "description": "地址"}
}
},
"教育经历": {
"type": "array",
"description": "教育经历列表",
"items": {
"type": "object",
"properties": {
"学校": {"type": ["string", "null"], "description": "学校名称"},
"专业": {"type": ["string", "null"], "description": "专业"},
"学历": {"type": ["string", "null"], "description": "学历"},
"入学时间": {"type": ["string", "null"], "description": "入学时间"},
"毕业时间": {"type": ["string", "null"], "description": "毕业时间"}
},
"required": ["学校"]
}
},
"工作经历": {
"type": "array",
"description": "工作经历列表",
"items": {
"type": "object",
"properties": {
"公司": {"type": ["string", "null"], "description": "公司名称"},
"职位": {"type": ["string", "null"], "description": "职位"},
"入职时间": {"type": ["string", "null"], "description": "入职时间"},
"离职时间": {"type": ["string", "null"], "description": "离职时间"},
"工作内容": {"type": ["string", "null"], "description": "主要工作内容"}
},
"required": ["公司"]
}
},
"技能": {
"type": "object",
"properties": {
"专业技能": {"type": "array", "items": {"type": "string"}, "description": "专业技能列表"},
"语言能力": {"type": "array", "items": {"type": "string"}, "description": "语言能力列表"},
"证书": {"type": "array", "items": {"type": "string"}, "description": "证书列表"}
}
}
},
"required": ["个人信息"]
}
PRODUCT_SPECS_SCHEMA = {
"type": "object",
"properties": {
"产品名称": {"type": ["string", "null"], "description": "产品名称"},
"型号": {"type": ["string", "null"], "description": "产品型号"},
"技术参数": {
"type": "array",
"description": "技术参数列表",
"items": {
"type": "object",
"properties": {
"参数名": {"type": ["string", "null"], "description": "参数名称"},
"参数值": {"type": ["string", "null"], "description": "参数值"},
"单位": {"type": ["string", "null"], "description": "单位"}
},
"required": ["参数名", "参数值"]
}
},
"功能特性": {
"type": "array",
"items": {"type": "string"},
"description": "功能特性列表"
},
"价格信息": {
"type": "object",
"properties": {
"价格": {"type": ["string", "null"], "description": "价格"},
"币种": {"type": ["string", "null"], "description": "币种"}
}
}
},
"required": ["产品名称"]
}
API_INFO_SCHEMA = {
"type": "object",
"properties": {
"接口列表": {
"type": "array",
"description": "API 接口列表",
"items": {
"type": "object",
"properties": {
"端点": {"type": ["string", "null"], "description": "API 端点 URL"},
"请求方法": {"type": ["string", "null"], "description": "GET/POST/PUT/DELETE"},
"描述": {"type": ["string", "null"], "description": "接口描述"},
"请求参数": {
"type": "array",
"items": {
"type": "object",
"properties": {
"参数名": {"type": ["string", "null"], "description": "参数名"},
"类型": {"type": ["string", "null"], "description": "参数类型"},
"必填": {"type": ["string", "null"], "description": "是否必填"},
"说明": {"type": ["string", "null"], "description": "参数说明"}
}
}
},
"响应格式": {"type": ["string", "null"], "description": "响应数据格式描述"},
"认证方式": {"type": ["string", "null"], "description": "认证方式"}
},
"required": ["端点", "请求方法"]
}
}
},
"required": ["接口列表"]
}
KEY_VALUE_SCHEMA = {
"type": "object",
"properties": {
"键值对列表": {
"type": "array",
"description": "从文档中提取的所有键值对",
"items": {
"type": "object",
"properties": {
"键": {"type": ["string", "null"], "description": "键名"},
"值": {"type": ["string", "null"], "description": "对应的值"}
},
"required": ["键", "值"]
}
}
},
"required": ["键值对列表"]
}
class InformationExtractionAgent:
    """Information extraction Agent"""

    def __init__(self):
        self.setup_llm()
        self.setup_agent()

    def setup_llm(self):
        self.llm = ChatTongyi(
            model="qwen-max",
            dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
            temperature=0,
        )

    @staticmethod
    def extract_from_file(file_path: str, schema: dict, generate_citations: bool = False, stamp: bool = False) -> dict:
        """Extract structured information from a document with the xParse Extract API"""
        with open(file_path, "rb") as f:
            file_base64 = base64.b64encode(f.read()).decode("utf-8")
        payload = {
            "file": {
                "file_base64": file_base64,
                "file_name": os.path.basename(file_path)
            },
            "schema": schema,
            "extract_options": {
                "generate_citations": generate_citations,
                "stamp": stamp
            }
        }
        headers = {
            "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
            "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
            "Content-Type": "application/json"
        }
        # timeout=120 is an assumed value (not from the API docs); tune it for large documents
        response = requests.post(EXTRACT_API_URL, json=payload, headers=headers, timeout=120)
        result = response.json()
        if result.get("code") != 200:
            # Message reads "Extract API error: <message or 'unknown error'>"
            raise RuntimeError(f"Extract API 错误: {result.get('message', '未知错误')}")
        return result["result"]

    @staticmethod
    def _resolve_file_path(file_path: str = None) -> str:
        """Resolve a file path; return it if it exists, otherwise None"""
        if file_path in ("None", "none", None, "", "null"):
            return None
        if os.path.exists(file_path):
            return file_path
        return None

    def setup_agent(self):
        tools = [
            Tool(
                name="extract_invoice_info",
                description="从发票中提取结构化信息,包括发票基本信息、销售方/购买方信息、商品明细、金额信息等。需要提供文档路径作为参数。",
                func=self.extract_invoice_info
            ),
            Tool(
                name="extract_medical_bill_info",
                description="从医疗票据中提取结构化信息,包括患者信息、医疗机构信息、就诊信息、费用明细、费用汇总等。需要提供文档路径作为参数。",
                func=self.extract_medical_bill_info
            ),
            Tool(
                name="extract_contract_info",
                description="从合同中提取结构化信息,包括合同基本信息、合同双方信息、合同标的、关键条款、金额信息等。需要提供文档路径作为参数。",
                func=self.extract_contract_info
            ),
            Tool(
                name="extract_resume_info",
                description="从简历中提取结构化信息,包括个人信息、教育经历、工作经历、技能等。需要提供文档路径作为参数。",
                func=self.extract_resume_info
            ),
            Tool(
                name="extract_product_specs",
                description="从产品文档中提取产品规格和技术参数,包括产品名称、型号、技术参数、功能特性、价格等。需要提供文档路径作为参数。",
                func=self.extract_product_specs
            ),
            Tool(
                name="extract_api_info",
                description="从技术文档中提取API接口信息,包括API端点、请求方法、请求参数、响应格式等。需要提供文档路径作为参数。",
                func=self.extract_api_info
            ),
            Tool(
                name="format_data",
                description="从文档中提取键值对信息并格式化为标准格式(JSON、CSV等)。需要提供文档路径作为参数。",
                func=self.format_data
            )
        ]
        self.agent = initialize_agent(
            tools=tools,
            llm=self.llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True,
            agent_kwargs={
                "prefix": """你是一个专业的信息提取助手。你的任务是帮助用户:
1. 从文档中提取结构化信息(发票、医疗票据、合同、简历、产品规格、API接口等)
2. 将提取的信息格式化为标准格式(JSON、CSV等)
3. 验证提取数据的完整性和准确性
在回答时,请:
- 提供结构化的提取结果
- 使用JSON或表格格式展示数据
- 如果数据不完整,说明缺失的部分
- 使用工具获取准确的信息,不要猜测
- 对于财务类文档(发票、医疗票据),确保金额和税务信息的准确性
- 对于合同文档,重点关注关键条款和风险点
- 所有工具都需要提供文档文件路径作为参数
"""
            }
        )

    def extract_invoice_info(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, INVOICE_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    def extract_medical_bill_info(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, MEDICAL_BILL_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    def extract_contract_info(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, CONTRACT_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    def extract_resume_info(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, RESUME_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    def extract_product_specs(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, PRODUCT_SPECS_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    def extract_api_info(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, API_INFO_SCHEMA)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    def format_data(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, KEY_VALUE_SCHEMA)
        extracted = result["extracted_schema"]
        data_list = extracted.get("键值对列表", [])
        if not data_list:
            return "未找到可格式化的数据"
        json_output = json.dumps(data_list, ensure_ascii=False, indent=2)
        try:
            df = pd.DataFrame(data_list)
            csv_output = df.to_csv(index=False)
        except Exception:
            csv_output = "CSV格式化失败"
        return f"JSON格式:\n{json_output}\n\nCSV格式:\n{csv_output}"

    def query(self, question: str) -> str:
        """Query the Agent and return its response"""
        response = self.agent.invoke({"input": question})
        return response["output"]


def main():
    agent = InformationExtractionAgent()
    # Uncomment the other questions to try the remaining tools
    questions = [
        "从 ./extraction_documents/invoice.pdf 中提取发票代码、发票号码、销售方和购买方信息、商品明细和金额",
        # "从 ./extraction_documents/medical_bill.pdf 中提取患者信息、医院信息、诊断结果和费用明细",
        # "从 ./extraction_documents/contract.pdf 中提取合同编号、合同双方信息、合同金额和关键条款",
        # "从 ./extraction_documents/resume.pdf 中提取所有个人信息、教育经历和工作经历",
        # "从 ./extraction_documents/product_spec.pdf 中提取产品规格和技术参数",
        # "从 ./extraction_documents/api_docs.pdf 中提取所有API接口信息",
        "将 ./extraction_documents/invoice.pdf 中的数据格式化为JSON格式"
    ]
    for question in questions:
        print(f"\n{'='*60}")
        print(f"问题: {question}")
        print(f"{'='*60}")
        answer = agent.query(question)
        print(f"\n回答:\n{answer}")


if __name__ == "__main__":
    main()
Usage Examples
Example 1: Extract invoice information
agent = InformationExtractionAgent()
response = agent.query("从 ./extraction_documents/invoice.pdf 中提取发票代码、发票号码、销售方和购买方信息、商品明细和金额")
print(response)
Example 2: Extract medical bill information
response = agent.query("从 ./extraction_documents/medical_bill.pdf 中提取患者姓名、医院名称、诊断结果、总费用和医保支付金额")
print(response)
Example 3: Extract contract information
response = agent.query("从 ./extraction_documents/contract.pdf 中提取合同编号、甲方和乙方信息、合同金额、付款方式和违约责任")
print(response)
Example 4: Extract resume information
response = agent.query("从 ./extraction_documents/resume.pdf 中提取姓名、联系方式、教育经历和工作经历")
print(response)
Example 5: Extract product specifications
response = agent.query("从 ./extraction_documents/product_spec.pdf 中提取产品名称、型号、技术参数和价格")
print(response)
Example 6: Extract API information
response = agent.query("从 ./extraction_documents/api_docs.pdf 中提取所有API端点、请求方法和参数")
print(response)
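Best Practices
- Schema design: design a dedicated JSON Schema for each document type, precisely defining the fields, types, and constraints to extract so the output format stays consistent
- Batch processing: process multiple documents in batches to improve throughput
- Format standardization: convert extracted data to standard formats (JSON, CSV) for downstream processing
- Financial documents:
  - For invoices, focus on key fields such as the invoice code, invoice number, and amounts
  - For medical bills, distinguish between expense types such as self-paid and insurance-paid amounts
  - Verify amount calculations so the data can feed into financial systems
- Contract documents:
  - Focus on the contracting parties, the contract amount, and key clauses
  - Identify important clauses such as liability for breach and dispute resolution
  - Extract the contract validity period to support contract management
- Citation tracing: for scenarios that require review, set generate_citations=True when calling extract_from_file to obtain the location of each extracted value in the source document
- Error handling: log extraction failures, route them to manual review, and check the error message returned by the API
- Performance: the Extract API performs parsing and extraction on the server side, so no local parsing engine needs to be deployed, which suits large-scale batch processing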
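The batch processing and error handling practices above can be combined in a small driver. This is a minimal sketch using `concurrent.futures` from the standard library; `extract_batch` and `fake_extract` are illustrative names, and `max_workers=4` is an arbitrary default, since the API's actual concurrency limits are not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_batch(paths, extract_one, max_workers=4):
    """Run extract_one(path) for every path with bounded concurrency.

    Returns {path: ("ok", result)} or {path: ("error", message)} so that one
    failing document does not abort the whole batch.
    """
    def safe_call(path):
        try:
            return path, ("ok", extract_one(path))
        except Exception as exc:  # record the failure for manual review
            return path, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe_call, paths))


# Stand-in extractor for the demo; in practice this could be
# lambda p: extract_from_file(p, INVOICE_SCHEMA)
def fake_extract(path):
    if path.endswith(".bad"):
        raise ValueError("unsupported file")
    return {"file": path}


results = extract_batch(["a.pdf", "b.bad"], fake_extract)
print(results)
```

Threads fit here because each call spends most of its time waiting on the network; the per-path status tuple gives a simple record of which documents need a retry or manual review.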
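FAQ
Q: How can extraction accuracy be improved?
A: 1) Refine the JSON Schema with precise field definitions and descriptions; 2) make sure documents are legible and avoid blurry or low-quality scans; 3) validate and cross-check the extracted results.
Q: How should documents with inconsistent formats be handled?
A: 1) The Extract API supports multiple document formats (PDF, Word, Excel, images, and so on) and handles format differences automatically; 2) use the Schema to unify the output format; 3) review and correct results manually where needed.
Q: How can large numbers of documents be processed in batches?
A: 1) Walk the document directory and call the extraction tool for each file; 2) process several documents in parallel (with threads or async I/O); 3) manage tasks with a queue to avoid excessive concurrency.
Q: What if invoice extraction is inaccurate?
A: 1) Make sure the invoice image is clear; 2) improve the field descriptions in the Schema; 3) enable generate_citations=True to inspect citation positions and track down problem fields; 4) for invoices in unusual layouts, adjust the Schema to fit.
Q: How are expense items on medical bills extracted?
A: 1) MEDICAL_BILL_SCHEMA already defines an expense items array with item name, quantity, unit price, amount, and insurance category; 2) the expense summary covers total cost, self-paid amount, insurance payment, and personal payment; 3) the Extract API recognizes table structures automatically.
Q: How are key contract clauses identified?
A: 1) CONTRACT_SCHEMA already defines key clause fields (payment method, delivery method, liability for breach, dispute resolution); 2) the Schema can be extended with additional clause fields to match business needs; 3) enable citations to obtain each clause's position in the source text.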
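As one concrete form of the "validate and cross-check" advice above, the line item amounts of an invoice can be compared against the extracted total. The sketch below assumes that "合计金额" equals the sum of the "金额" values in "商品明细" and that amounts come back as strings, as the Schemas declare; `parse_amount` and `line_items_match_total` are hypothetical helpers, and the sample invoice is invented.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse strings like '¥1,234.50' into Decimal; return None if not parseable."""
    if text is None:
        return None
    cleaned = str(text).replace(",", "").replace("¥", "").replace("¥", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def line_items_match_total(invoice):
    """Return True/False when comparable, or None if any amount is missing or unparseable."""
    total = parse_amount((invoice.get("金额信息") or {}).get("合计金额"))
    amounts = [parse_amount(item.get("金额")) for item in (invoice.get("商品明细") or [])]
    if total is None or not amounts or any(a is None for a in amounts):
        return None
    return sum(amounts) == total


sample_invoice = {
    "商品明细": [{"名称": "打印纸", "金额": "50.00"}, {"名称": "墨盒", "金额": "¥120.00"}],
    "金额信息": {"合计金额": "170.00"},
}
print(line_items_match_total(sample_invoice))
```

`Decimal` is used instead of `float` so currency values compare exactly; a `None` result signals that the document should go to manual review rather than being treated as a mismatch.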
Related Documentation
- Quick Start - learn the basics of using the xParse SDK
- xParse SDK Reference - details of the SDK API
- Agent Tutorial - a general approach to building Agents

