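This tutorial targets information-extraction scenarios. It shows how to use the xParse Extract API as the data foundation for an Agent that extracts structured information from unstructured documents (invoices, medical bills, contracts, resumes, product specifications, API documentation, and so on) and organizes it automatically.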

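Scenario overview

Business pain points

In information-extraction scenarios, enterprises and developers face the following challenges:
  • Diverse document formats: invoices, medical bills, contracts, resumes, product documents, technical documents, and other formats all have to be handled
  • Tedious extraction: structured information (invoice data, medical expenses, contract clauses, personal details, work history, product parameters, API endpoints, and so on) has to be pulled out of unstructured documents
  • Hard-to-standardize data: data from different sources arrives in inconsistent formats and needs normalization
  • Batch processing: large volumes of documents must be processed, and manual extraction is slow
  • Data validation: extracted data has to be validated and cross-checked for accuracy
  • Financial compliance: invoices and medical bills must meet financial and tax requirements
  • Legal risk: contract extraction has to identify key clauses and risk points accurately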

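Solution

By building an information-extraction Agent, we can achieve the following:
  • Parsing and extraction in one step: with the xParse Extract API, document parsing and structured extraction happen in a single API call, with no separate stages
  • Schema-driven extraction: a JSON Schema defines exactly which fields are extracted and in what format, which keeps the output consistent
  • Data standardization: extracted information is converted to standard formats (JSON, CSV, and so on)
  • Data validation: extracted data is checked for completeness and accuracy
  • Batch processing: large numbers of documents can be processed in bulk
  • Finance automation: invoice and medical bill information is extracted automatically and can feed financial systems
  • Contract analysis: key contract information is extracted, and important clauses and risk points are identified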

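Architecture

Documents (PDF/Word/Excel/images)
    ↓
[xParse Extract API]
    └─ Document parsing + structured extraction (in one step)
    ↓
[LangChain Agent]
    ├─ Tool 1-7: call the Extract API (each with its own Schema)
    ↓
Structured data (JSON/CSV)

Core flow
  1. Each extraction tool defines its own JSON Schema describing the fields and structure to extract
  2. The tool calls the xParse Extract API with the document file and the Schema, completing parsing and structured extraction in one step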

Environment setup

python -m venv .venv && source .venv/bin/activate
pip install requests langchain langchain-community langchain-core \
            python-dotenv pandas
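export TEXTIN_APP_ID=your-app-id # obtained by registering on the TextIn website
export TEXTIN_SECRET_CODE=your-secret-code # obtained by registering on the TextIn website
export DASHSCOPE_API_KEY=your-dashscope-api-key # this tutorial uses the Tongyi Qianwen (Qwen) LLM; another LLM can be substituted
Tip: for TEXTIN_APP_ID and TEXTIN_SECRET_CODE, see the API Key documentation and log in to the TextIn workbench to obtain them. The examples use the Tongyi Qianwen (Qwen) LLM; other models are used in a similar way.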
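
Since every later step depends on these three variables, a quick check before running anything can save a confusing authentication error. The following is a minimal sketch; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are illustrative and not part of the xParse or LangChain APIs.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("TEXTIN_APP_ID", "TEXTIN_SECRET_CODE", "DASHSCOPE_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If the variables live in a `.env` file, call `load_dotenv()` before running the check.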

Step 1: Configure the Extract API

Define a generic function that calls the Extract API, plus a helper that resolves file paths:
import os
import json
import base64
import requests
from dotenv import load_dotenv

load_dotenv()

EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"

def extract_from_file(file_path: str, schema: dict, generate_citations: bool = False, stamp: bool = False) -> dict:
    """Extract structured information from a document with the xParse Extract API."""
    with open(file_path, "rb") as f:
        file_base64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "file": {
            "file_base64": file_base64,
            "file_name": os.path.basename(file_path)
        },
        "schema": schema,
        "extract_options": {
            "generate_citations": generate_citations,
            "stamp": stamp
        }
    }

    headers = {
        "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
        "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
        "Content-Type": "application/json"
    }

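    # A timeout (60 s is an arbitrary choice) keeps a stalled connection from hanging the caller.
    response = requests.post(EXTRACT_API_URL, json=payload, headers=headers, timeout=60)
    result = response.json()

    # The API signals success with code 200 in the JSON body.
    if result.get("code") != 200:
        raise Exception(f"Extract API 错误: {result.get('message', '未知错误')}")

    return result["result"]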

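def _resolve_file_path(file_path: "str | None" = None) -> "str | None":
    """Return file_path if it names an existing path, otherwise None."""
    # The agent may pass placeholder strings such as "None" or "null" instead of a real path.
    if file_path in ("None", "none", None, "", "null"):
        return None
    if os.path.exists(file_path):
        return file_path
    return None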
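
The batch-processing goal stated earlier can be layered on top of a single-file extractor. The sketch below is one possible approach, not part of the tutorial's code: `extract_batch` is a hypothetical helper, the thread pool size of 4 is an arbitrary assumption, and the rate limits of the Extract API are not covered by this tutorial, so the concurrency may need to be lowered.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_batch(paths, extract_one, max_workers=4):
    """Run extract_one(path) for every path; return (results, errors) keyed by path."""
    results, errors = {}, {}

    def _safe(path):
        # Capture failures per file so one bad document does not abort the batch.
        try:
            return path, extract_one(path), None
        except Exception as exc:
            return path, None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, data, err in pool.map(_safe, paths):
            if err is None:
                results[path] = data
            else:
                errors[path] = err
    return results, errors
```

With the function defined above, a call might look like `extract_batch(paths, lambda p: extract_from_file(p, schema))`, where `schema` is any of the Schemas defined in Step 2.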

Step 2: Build the LangChain Tools

Define the extraction Schemas

Define a dedicated JSON Schema for each document type to control exactly which fields are extracted:
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "发票基本信息": {
            "type": "object",
            "properties": {
                "发票代码": {"type": ["string", "null"], "description": "发票代码"},
                "发票号码": {"type": ["string", "null"], "description": "发票号码"},
                "开票日期": {"type": ["string", "null"], "description": "开票日期"}
            }
        },
        "销售方": {
            "type": "object",
            "properties": {
                "名称": {"type": ["string", "null"], "description": "销售方名称"},
                "纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
                "地址电话": {"type": ["string", "null"], "description": "地址电话"},
                "开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
            }
        },
        "购买方": {
            "type": "object",
            "properties": {
                "名称": {"type": ["string", "null"], "description": "购买方名称"},
                "纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
                "地址电话": {"type": ["string", "null"], "description": "地址电话"},
                "开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
            }
        },
        "商品明细": {
            "type": "array",
            "description": "商品明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "名称": {"type": ["string", "null"], "description": "商品名称"},
                    "规格型号": {"type": ["string", "null"], "description": "规格型号"},
                    "单位": {"type": ["string", "null"], "description": "单位"},
                    "数量": {"type": ["string", "null"], "description": "数量"},
                    "单价": {"type": ["string", "null"], "description": "单价"},
                    "金额": {"type": ["string", "null"], "description": "金额"},
                    "税率": {"type": ["string", "null"], "description": "税率"},
                    "税额": {"type": ["string", "null"], "description": "税额"}
                },
                "required": ["名称", "金额"]
            }
        },
        "金额信息": {
            "type": "object",
            "properties": {
                "合计金额": {"type": ["string", "null"], "description": "合计金额"},
                "合计税额": {"type": ["string", "null"], "description": "合计税额"},
                "价税合计": {"type": ["string", "null"], "description": "价税合计(大写)"}
            }
        },
        "其他信息": {
            "type": "object",
            "properties": {
                "备注": {"type": ["string", "null"], "description": "备注"},
                "收款人": {"type": ["string", "null"], "description": "收款人"},
                "复核人": {"type": ["string", "null"], "description": "复核人"},
                "开票人": {"type": ["string", "null"], "description": "开票人"}
            }
        }
    },
    "required": ["发票基本信息", "销售方", "购买方", "商品明细", "金额信息"]
}

MEDICAL_BILL_SCHEMA = {
    "type": "object",
    "properties": {
        "患者信息": {
            "type": "object",
            "properties": {
                "姓名": {"type": ["string", "null"], "description": "患者姓名"},
                "性别": {"type": ["string", "null"], "description": "性别"},
                "年龄": {"type": ["string", "null"], "description": "年龄"},
                "身份证号": {"type": ["string", "null"], "description": "身份证号"},
                "医保卡号": {"type": ["string", "null"], "description": "医保卡号"}
            }
        },
        "医疗机构信息": {
            "type": "object",
            "properties": {
                "医院名称": {"type": ["string", "null"], "description": "医院名称"},
                "科室": {"type": ["string", "null"], "description": "科室"},
                "医生姓名": {"type": ["string", "null"], "description": "医生姓名"}
            }
        },
        "就诊信息": {
            "type": "object",
            "properties": {
                "就诊日期": {"type": ["string", "null"], "description": "就诊日期"},
                "就诊类型": {"type": ["string", "null"], "description": "门诊/住院"},
                "诊断结果": {"type": ["string", "null"], "description": "诊断结果"}
            }
        },
        "费用明细": {
            "type": "array",
            "description": "费用明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "项目名称": {"type": ["string", "null"], "description": "项目名称"},
                    "数量": {"type": ["string", "null"], "description": "数量"},
                    "单价": {"type": ["string", "null"], "description": "单价"},
                    "金额": {"type": ["string", "null"], "description": "金额"},
                    "医保类型": {"type": ["string", "null"], "description": "甲类/乙类/丙类"}
                },
                "required": ["项目名称", "金额"]
            }
        },
        "费用汇总": {
            "type": "object",
            "properties": {
                "总费用": {"type": ["string", "null"], "description": "总费用"},
                "自费金额": {"type": ["string", "null"], "description": "自费金额"},
                "医保支付": {"type": ["string", "null"], "description": "医保支付金额"},
                "个人支付": {"type": ["string", "null"], "description": "个人支付金额"}
            }
        },
        "其他信息": {
            "type": "object",
            "properties": {
                "发票号码": {"type": ["string", "null"], "description": "发票号码"},
                "结算方式": {"type": ["string", "null"], "description": "结算方式"}
            }
        }
    },
    "required": ["患者信息", "医疗机构信息", "费用明细", "费用汇总"]
}

CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "合同基本信息": {
            "type": "object",
            "properties": {
                "合同编号": {"type": ["string", "null"], "description": "合同编号"},
                "合同名称": {"type": ["string", "null"], "description": "合同名称"},
                "签订日期": {"type": ["string", "null"], "description": "签订日期"},
                "生效日期": {"type": ["string", "null"], "description": "生效日期"},
                "到期日期": {"type": ["string", "null"], "description": "到期日期"}
            }
        },
        "合同双方": {
            "type": "object",
            "properties": {
                "甲方": {
                    "type": "object",
                    "properties": {
                        "名称": {"type": ["string", "null"], "description": "甲方名称"},
                        "地址": {"type": ["string", "null"], "description": "甲方地址"},
                        "法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
                        "联系方式": {"type": ["string", "null"], "description": "联系方式"}
                    }
                },
                "乙方": {
                    "type": "object",
                    "properties": {
                        "名称": {"type": ["string", "null"], "description": "乙方名称"},
                        "地址": {"type": ["string", "null"], "description": "乙方地址"},
                        "法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
                        "联系方式": {"type": ["string", "null"], "description": "联系方式"}
                    }
                }
            }
        },
        "合同标的": {
            "type": "object",
            "properties": {
                "标的物": {"type": ["string", "null"], "description": "标的物或服务内容"},
                "数量": {"type": ["string", "null"], "description": "数量"},
                "金额": {"type": ["string", "null"], "description": "金额"}
            }
        },
        "关键条款": {
            "type": "object",
            "properties": {
                "付款方式": {"type": ["string", "null"], "description": "付款方式"},
                "交付方式": {"type": ["string", "null"], "description": "交付方式"},
                "违约责任": {"type": ["string", "null"], "description": "违约责任"},
                "争议解决": {"type": ["string", "null"], "description": "争议解决方式"}
            }
        },
        "金额信息": {
            "type": "object",
            "properties": {
                "合同总金额": {"type": ["string", "null"], "description": "合同总金额"},
                "付款计划": {"type": ["string", "null"], "description": "付款计划"},
                "保证金": {"type": ["string", "null"], "description": "保证金"}
            }
        }
    },
    "required": ["合同基本信息", "合同双方", "合同标的", "关键条款", "金额信息"]
}

RESUME_SCHEMA = {
    "type": "object",
    "properties": {
        "个人信息": {
            "type": "object",
            "properties": {
                "姓名": {"type": ["string", "null"], "description": "姓名"},
                "性别": {"type": ["string", "null"], "description": "性别"},
                "年龄": {"type": ["string", "null"], "description": "年龄"},
                "电话": {"type": ["string", "null"], "description": "电话"},
                "邮箱": {"type": ["string", "null"], "description": "邮箱"},
                "地址": {"type": ["string", "null"], "description": "地址"}
            }
        },
        "教育经历": {
            "type": "array",
            "description": "教育经历列表",
            "items": {
                "type": "object",
                "properties": {
                    "学校": {"type": ["string", "null"], "description": "学校名称"},
                    "专业": {"type": ["string", "null"], "description": "专业"},
                    "学历": {"type": ["string", "null"], "description": "学历(本科/硕士/博士等)"},
                    "入学时间": {"type": ["string", "null"], "description": "入学时间"},
                    "毕业时间": {"type": ["string", "null"], "description": "毕业时间"}
                },
                "required": ["学校"]
            }
        },
        "工作经历": {
            "type": "array",
            "description": "工作经历列表",
            "items": {
                "type": "object",
                "properties": {
                    "公司": {"type": ["string", "null"], "description": "公司名称"},
                    "职位": {"type": ["string", "null"], "description": "职位"},
                    "入职时间": {"type": ["string", "null"], "description": "入职时间"},
                    "离职时间": {"type": ["string", "null"], "description": "离职时间"},
                    "工作内容": {"type": ["string", "null"], "description": "主要工作内容"}
                },
                "required": ["公司"]
            }
        },
        "技能": {
            "type": "object",
            "properties": {
                "专业技能": {"type": "array", "items": {"type": "string"}, "description": "专业技能列表"},
                "语言能力": {"type": "array", "items": {"type": "string"}, "description": "语言能力列表"},
                "证书": {"type": "array", "items": {"type": "string"}, "description": "证书列表"}
            }
        }
    },
    "required": ["个人信息"]
}

PRODUCT_SPECS_SCHEMA = {
    "type": "object",
    "properties": {
        "产品名称": {"type": ["string", "null"], "description": "产品名称"},
        "型号": {"type": ["string", "null"], "description": "产品型号"},
        "技术参数": {
            "type": "array",
            "description": "技术参数列表",
            "items": {
                "type": "object",
                "properties": {
                    "参数名": {"type": ["string", "null"], "description": "参数名称"},
                    "参数值": {"type": ["string", "null"], "description": "参数值"},
                    "单位": {"type": ["string", "null"], "description": "单位"}
                },
                "required": ["参数名", "参数值"]
            }
        },
        "功能特性": {
            "type": "array",
            "items": {"type": "string"},
            "description": "功能特性列表"
        },
        "价格信息": {
            "type": "object",
            "properties": {
                "价格": {"type": ["string", "null"], "description": "价格"},
                "币种": {"type": ["string", "null"], "description": "币种"}
            }
        }
    },
    "required": ["产品名称"]
}

API_INFO_SCHEMA = {
    "type": "object",
    "properties": {
        "接口列表": {
            "type": "array",
            "description": "API 接口列表",
            "items": {
                "type": "object",
                "properties": {
                    "端点": {"type": ["string", "null"], "description": "API 端点 URL"},
                    "请求方法": {"type": ["string", "null"], "description": "GET/POST/PUT/DELETE"},
                    "描述": {"type": ["string", "null"], "description": "接口描述"},
                    "请求参数": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "参数名": {"type": ["string", "null"], "description": "参数名"},
                                "类型": {"type": ["string", "null"], "description": "参数类型"},
                                "必填": {"type": ["string", "null"], "description": "是否必填"},
                                "说明": {"type": ["string", "null"], "description": "参数说明"}
                            }
                        }
                    },
                    "响应格式": {"type": ["string", "null"], "description": "响应数据格式描述"},
                    "认证方式": {"type": ["string", "null"], "description": "认证方式"}
                },
                "required": ["端点", "请求方法"]
            }
        }
    },
    "required": ["接口列表"]
}

KEY_VALUE_SCHEMA = {
    "type": "object",
    "properties": {
        "键值对列表": {
            "type": "array",
            "description": "从文档中提取的所有键值对",
            "items": {
                "type": "object",
                "properties": {
                    "键": {"type": ["string", "null"], "description": "键名"},
                    "值": {"type": ["string", "null"], "description": "对应的值"}
                },
                "required": ["键", "值"]
            }
        }
    },
    "required": ["键值对列表"]
}
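
Because most leaf fields are typed `["string", "null"]`, a field the model could not find may come back as null even when it is listed under `required`. The following sketch shows one way to check an extracted result against the `required` lists of a Schema. `find_missing_fields` is a hypothetical helper, and it assumes the extracted result mirrors the Schema's nesting; it is not a full JSON Schema validator.

```python
def find_missing_fields(data, schema, path=""):
    """Recursively list required fields that are absent or null, as slash-separated paths."""
    missing = []
    schema_type = schema.get("type")
    if schema_type == "object" and isinstance(data, dict):
        # Report required keys that are missing or explicitly null.
        for key in schema.get("required", []):
            if data.get(key) is None:
                missing.append(f"{path}/{key}")
        # Descend into the nested values that are present.
        for key, sub_schema in schema.get("properties", {}).items():
            if data.get(key) is not None:
                missing.extend(find_missing_fields(data[key], sub_schema, f"{path}/{key}"))
    elif schema_type == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            missing.extend(find_missing_fields(item, schema.get("items", {}), f"{path}/{index}"))
    return missing
```

An empty list means every required field was filled. A non-empty list can be surfaced to the user or used to flag the document for manual review.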

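Tool 1: Extract invoice information

from langchain_core.tools import Tool

def extract_invoice_info(file_path: str = None) -> str:
    """
    Extract structured information from an invoice (via the xParse Extract API).

    Extracted content:
    - Basic invoice information (invoice code, invoice number, issue date)
    - Seller (name, taxpayer ID, address and phone, bank and account number)
    - Buyer (name, taxpayer ID, address and phone, bank and account number)
    - Line items (name, specification, unit, quantity, unit price, amount, tax rate, tax amount)
    - Amounts (total amount, total tax, total including tax)
    - Other information (remarks, payee, reviewer, issuer, etc.)

    Args:
        file_path: path to the document
    """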
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, INVOICE_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

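Tool 2: Extract medical bill information

def extract_medical_bill_info(file_path: str = None) -> str:
    """
    Extract structured information from a medical bill (via the xParse Extract API).

    Extracted content:
    - Patient information (name, gender, age, ID card number, medical insurance card number)
    - Medical institution (hospital name, department, doctor name)
    - Visit information (visit date, visit type, diagnosis)
    - Expense items (item name, quantity, unit price, amount, insurance category)
    - Expense summary (total cost, self-funded amount, insurance payment, personal payment)

    Args:
        file_path: path to the document
    """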
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, MEDICAL_BILL_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

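Tool 3: Extract contract information

def extract_contract_info(file_path: str = None) -> str:
    """
    Extract structured information from a contract (via the xParse Extract API).

    Extracted content:
    - Basic contract information (contract number, contract name, signing date, effective date, expiry date)
    - Parties (Party A and Party B: name, address, legal representative, contact details)
    - Subject matter (subject, quantity, amount)
    - Key clauses (payment method, delivery method, liability for breach, dispute resolution)
    - Amounts (total contract amount, payment schedule, deposit)

    Args:
        file_path: path to the document
    """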
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, CONTRACT_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

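Tool 4: Extract resume information

def extract_resume_info(file_path: str = None) -> str:
    """
    Extract structured information from a resume (via the xParse Extract API).

    Extracted content:
    - Personal information (name, gender, age, contact details)
    - Education (school, major, degree, dates)
    - Work experience (company, position, dates, responsibilities)
    - Skills (professional skills, languages, certificates, etc.)

    Args:
        file_path: path to the document
    """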
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, RESUME_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

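Tool 5: Extract product specifications

def extract_product_specs(file_path: str = None) -> str:
    """
    Extract product specifications and technical parameters from a product document (via the xParse Extract API).

    Extracted content:
    - Product name and model
    - Technical parameters (dimensions, weight, performance figures, etc.)
    - Features
    - Pricing information

    Args:
        file_path: path to the document
    """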
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, PRODUCT_SPECS_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

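Tool 6: Extract API information

def extract_api_info(file_path: str = None) -> str:
    """
    Extract API endpoint information from a technical document (via the xParse Extract API).

    Extracted content:
    - API endpoint (URL path)
    - Request method (GET, POST, etc.)
    - Request parameters
    - Response format
    - Authentication method

    Args:
        file_path: path to the document
    """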
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"
    result = extract_from_file(fp, API_INFO_SCHEMA)
    return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

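Tool 7: Data formatting

import pandas as pd

def format_data(file_path: str = None) -> str:
    """
    Extract key-value pairs from a document and format them as JSON and CSV.

    Uses the xParse Extract API to extract all key-value pairs in the document,
    then converts them to JSON and CSV.

    Args:
        file_path: path to the document
    """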
    fp = _resolve_file_path(file_path)
    if not fp:
        return "错误:请提供有效的文档路径。"

    result = extract_from_file(fp, KEY_VALUE_SCHEMA)
    extracted = result["extracted_schema"]
    data_list = extracted.get("键值对列表", [])

    if not data_list:
        return "未找到可格式化的数据"

    json_output = json.dumps(data_list, ensure_ascii=False, indent=2)

    try:
        df = pd.DataFrame(data_list)
        csv_output = df.to_csv(index=False)
    except Exception:
        csv_output = "CSV格式化失败"

    return f"JSON格式:\n{json_output}\n\nCSV格式:\n{csv_output}"
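
`format_data` works on a flat list of key-value pairs, but the other Schemas return nested objects that do not map directly onto CSV columns. The sketch below shows one way to flatten nested records before writing CSV, using only the standard library. `flatten` and `records_to_csv` are hypothetical helpers, and the dotted column names are an arbitrary convention.

```python
import csv
import io

def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def records_to_csv(records):
    """Serialize a list of (possibly nested) dicts to CSV text."""
    rows = [flatten(record) for record in records]
    # Preserve first-seen column order across all rows.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys and None values become empty cells
    return buffer.getvalue()
```

List-valued fields (for example line items) are written as their string representation here; exporting them as separate rows would need one table per list.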

Assemble all Tools

tools = [
    Tool(
        name="extract_invoice_info",
        description="从发票中提取结构化信息,包括发票基本信息、销售方/购买方信息、商品明细、金额信息等。需要提供文档路径作为参数。",
        func=extract_invoice_info
    ),
    Tool(
        name="extract_medical_bill_info",
        description="从医疗票据中提取结构化信息,包括患者信息、医疗机构信息、就诊信息、费用明细、费用汇总等。需要提供文档路径作为参数。",
        func=extract_medical_bill_info
    ),
    Tool(
        name="extract_contract_info",
        description="从合同中提取结构化信息,包括合同基本信息、合同双方信息、合同标的、关键条款、金额信息等。需要提供文档路径作为参数。",
        func=extract_contract_info
    ),
    Tool(
        name="extract_resume_info",
        description="从简历中提取结构化信息,包括个人信息、教育经历、工作经历、技能等。需要提供文档路径作为参数。",
        func=extract_resume_info
    ),
    Tool(
        name="extract_product_specs",
        description="从产品文档中提取产品规格和技术参数,包括产品名称、型号、技术参数、功能特性、价格等。需要提供文档路径作为参数。",
        func=extract_product_specs
    ),
    Tool(
        name="extract_api_info",
        description="从技术文档中提取API接口信息,包括API端点、请求方法、请求参数、响应格式等。需要提供文档路径作为参数。",
        func=extract_api_info
    ),
    Tool(
        name="format_data",
        description="从文档中提取键值对信息并格式化为标准格式(JSON、CSV等)。需要提供文档路径作为参数。",
        func=format_data
    )
]
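
Tools 1 to 6 differ only in the Schema they pass to the Extract API, so the shared body could be generated by a small factory. This is an optional refactoring sketch rather than the tutorial's own structure: `make_extract_tool` is a hypothetical helper, and the extractor and path resolver are passed in as arguments so the sketch stays independent of the network call.

```python
import json

def make_extract_tool(schema, extract, resolve_path):
    """Build a single-argument tool function bound to one schema."""
    def tool(file_path=None):
        path = resolve_path(file_path)
        if not path:
            return "Error: please provide a valid document path."
        result = extract(path, schema)
        # Same output shape as the hand-written tools: pretty-printed JSON text.
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)
    return tool
```

With the definitions from Step 1, `make_extract_tool(INVOICE_SCHEMA, extract_from_file, _resolve_file_path)` would behave like `extract_invoice_info`, apart from the English error message, and could be passed as `func` to `Tool`.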

Step 3: Configure the LangChain Agent

from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi

llm = ChatTongyi(
    model="qwen-max",
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
    temperature=0.2,
)

agent_executor = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    agent_kwargs={
        "prefix": """你是一个专业的信息提取助手。你的任务是帮助用户:
1. 从文档中提取结构化信息(发票、医疗票据、合同、简历、产品规格、API接口等)
2. 将提取的信息格式化为标准格式(JSON、CSV等)
3. 验证提取数据的完整性和准确性

在回答时,请:
- 提供结构化的提取结果
- 使用JSON或表格格式展示数据
- 如果数据不完整,说明缺失的部分
- 使用工具获取准确的信息,不要猜测
- 对于财务类文档(发票、医疗票据),确保金额和税务信息的准确性
- 对于合同文档,重点关注关键条款和风险点
- 所有工具都需要提供文档文件路径作为参数
"""
    }
)
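
Once the agent is configured, a request is a natural-language instruction that names the document path. The wrapper below is a minimal sketch: `run_extraction` and the query wording are illustrative, and it assumes a LangChain version in which the executor exposes `invoke` and returns a dict with an `"output"` key (older releases used `run` instead).

```python
def run_extraction(agent, file_path, doc_type="invoice"):
    """Ask the agent to extract structured data from one document and return its answer."""
    # The path is embedded in the query because every tool takes it as its only argument.
    query = f"Extract the structured information from the {doc_type} at {file_path}."
    response = agent.invoke({"input": query})
    return response["output"]
```

A call such as `run_extraction(agent_executor, "invoice.pdf")` would then let the agent choose the matching tool; the file name here is a placeholder.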

Step 4: Complete example code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Complete example of the information-extraction Agent
"""

import os
import json
import base64
import requests
import pandas as pd
from dotenv import load_dotenv
from langchain_core.tools import Tool
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi

load_dotenv()

EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "发票基本信息": {
            "type": "object",
            "properties": {
                "发票代码": {"type": ["string", "null"], "description": "发票代码"},
                "发票号码": {"type": ["string", "null"], "description": "发票号码"},
                "开票日期": {"type": ["string", "null"], "description": "开票日期"}
            }
        },
        "销售方": {
            "type": "object",
            "properties": {
                "名称": {"type": ["string", "null"], "description": "销售方名称"},
                "纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
                "地址电话": {"type": ["string", "null"], "description": "地址电话"},
                "开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
            }
        },
        "购买方": {
            "type": "object",
            "properties": {
                "名称": {"type": ["string", "null"], "description": "购买方名称"},
                "纳税人识别号": {"type": ["string", "null"], "description": "纳税人识别号"},
                "地址电话": {"type": ["string", "null"], "description": "地址电话"},
                "开户行及账号": {"type": ["string", "null"], "description": "开户行及账号"}
            }
        },
        "商品明细": {
            "type": "array",
            "description": "商品明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "名称": {"type": ["string", "null"], "description": "商品名称"},
                    "规格型号": {"type": ["string", "null"], "description": "规格型号"},
                    "单位": {"type": ["string", "null"], "description": "单位"},
                    "数量": {"type": ["string", "null"], "description": "数量"},
                    "单价": {"type": ["string", "null"], "description": "单价"},
                    "金额": {"type": ["string", "null"], "description": "金额"},
                    "税率": {"type": ["string", "null"], "description": "税率"},
                    "税额": {"type": ["string", "null"], "description": "税额"}
                },
                "required": ["名称", "金额"]
            }
        },
        "金额信息": {
            "type": "object",
            "properties": {
                "合计金额": {"type": ["string", "null"], "description": "合计金额"},
                "合计税额": {"type": ["string", "null"], "description": "合计税额"},
                "价税合计": {"type": ["string", "null"], "description": "价税合计(大写)"}
            }
        },
        "其他信息": {
            "type": "object",
            "properties": {
                "备注": {"type": ["string", "null"], "description": "备注"},
                "收款人": {"type": ["string", "null"], "description": "收款人"},
                "复核人": {"type": ["string", "null"], "description": "复核人"},
                "开票人": {"type": ["string", "null"], "description": "开票人"}
            }
        }
    },
    "required": ["发票基本信息", "销售方", "购买方", "商品明细", "金额信息"]
}

MEDICAL_BILL_SCHEMA = {
    "type": "object",
    "properties": {
        "患者信息": {
            "type": "object",
            "properties": {
                "姓名": {"type": ["string", "null"], "description": "患者姓名"},
                "性别": {"type": ["string", "null"], "description": "性别"},
                "年龄": {"type": ["string", "null"], "description": "年龄"},
                "身份证号": {"type": ["string", "null"], "description": "身份证号"},
                "医保卡号": {"type": ["string", "null"], "description": "医保卡号"}
            }
        },
        "医疗机构信息": {
            "type": "object",
            "properties": {
                "医院名称": {"type": ["string", "null"], "description": "医院名称"},
                "科室": {"type": ["string", "null"], "description": "科室"},
                "医生姓名": {"type": ["string", "null"], "description": "医生姓名"}
            }
        },
        "就诊信息": {
            "type": "object",
            "properties": {
                "就诊日期": {"type": ["string", "null"], "description": "就诊日期"},
                "就诊类型": {"type": ["string", "null"], "description": "门诊/住院"},
                "诊断结果": {"type": ["string", "null"], "description": "诊断结果"}
            }
        },
        "费用明细": {
            "type": "array",
            "description": "费用明细列表",
            "items": {
                "type": "object",
                "properties": {
                    "项目名称": {"type": ["string", "null"], "description": "项目名称"},
                    "数量": {"type": ["string", "null"], "description": "数量"},
                    "单价": {"type": ["string", "null"], "description": "单价"},
                    "金额": {"type": ["string", "null"], "description": "金额"},
                    "医保类型": {"type": ["string", "null"], "description": "甲类/乙类/丙类"}
                },
                "required": ["项目名称", "金额"]
            }
        },
        "费用汇总": {
            "type": "object",
            "properties": {
                "总费用": {"type": ["string", "null"], "description": "总费用"},
                "自费金额": {"type": ["string", "null"], "description": "自费金额"},
                "医保支付": {"type": ["string", "null"], "description": "医保支付金额"},
                "个人支付": {"type": ["string", "null"], "description": "个人支付金额"}
            }
        },
        "其他信息": {
            "type": "object",
            "properties": {
                "发票号码": {"type": ["string", "null"], "description": "发票号码"},
                "结算方式": {"type": ["string", "null"], "description": "结算方式"}
            }
        }
    },
    "required": ["患者信息", "医疗机构信息", "费用明细", "费用汇总"]
}

CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "合同基本信息": {
            "type": "object",
            "properties": {
                "合同编号": {"type": ["string", "null"], "description": "合同编号"},
                "合同名称": {"type": ["string", "null"], "description": "合同名称"},
                "签订日期": {"type": ["string", "null"], "description": "签订日期"},
                "生效日期": {"type": ["string", "null"], "description": "生效日期"},
                "到期日期": {"type": ["string", "null"], "description": "到期日期"}
            }
        },
        "合同双方": {
            "type": "object",
            "properties": {
                "甲方": {
                    "type": "object",
                    "properties": {
                        "名称": {"type": ["string", "null"], "description": "甲方名称"},
                        "地址": {"type": ["string", "null"], "description": "甲方地址"},
                        "法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
                        "联系方式": {"type": ["string", "null"], "description": "联系方式"}
                    }
                },
                "乙方": {
                    "type": "object",
                    "properties": {
                        "名称": {"type": ["string", "null"], "description": "乙方名称"},
                        "地址": {"type": ["string", "null"], "description": "乙方地址"},
                        "法定代表人": {"type": ["string", "null"], "description": "法定代表人"},
                        "联系方式": {"type": ["string", "null"], "description": "联系方式"}
                    }
                }
            }
        },
        "合同标的": {
            "type": "object",
            "properties": {
                "标的物": {"type": ["string", "null"], "description": "标的物或服务内容"},
                "数量": {"type": ["string", "null"], "description": "数量"},
                "金额": {"type": ["string", "null"], "description": "金额"}
            }
        },
        "关键条款": {
            "type": "object",
            "properties": {
                "付款方式": {"type": ["string", "null"], "description": "付款方式"},
                "交付方式": {"type": ["string", "null"], "description": "交付方式"},
                "违约责任": {"type": ["string", "null"], "description": "违约责任"},
                "争议解决": {"type": ["string", "null"], "description": "争议解决方式"}
            }
        },
        "金额信息": {
            "type": "object",
            "properties": {
                "合同总金额": {"type": ["string", "null"], "description": "合同总金额"},
                "付款计划": {"type": ["string", "null"], "description": "付款计划"},
                "保证金": {"type": ["string", "null"], "description": "保证金"}
            }
        }
    },
    "required": ["合同基本信息", "合同双方", "合同标的", "关键条款", "金额信息"]
}

RESUME_SCHEMA = {
    "type": "object",
    "properties": {
        "个人信息": {
            "type": "object",
            "properties": {
                "姓名": {"type": ["string", "null"], "description": "姓名"},
                "性别": {"type": ["string", "null"], "description": "性别"},
                "年龄": {"type": ["string", "null"], "description": "年龄"},
                "电话": {"type": ["string", "null"], "description": "电话"},
                "邮箱": {"type": ["string", "null"], "description": "邮箱"},
                "地址": {"type": ["string", "null"], "description": "地址"}
            }
        },
        "教育经历": {
            "type": "array",
            "description": "教育经历列表",
            "items": {
                "type": "object",
                "properties": {
                    "学校": {"type": ["string", "null"], "description": "学校名称"},
                    "专业": {"type": ["string", "null"], "description": "专业"},
                    "学历": {"type": ["string", "null"], "description": "学历"},
                    "入学时间": {"type": ["string", "null"], "description": "入学时间"},
                    "毕业时间": {"type": ["string", "null"], "description": "毕业时间"}
                },
                "required": ["学校"]
            }
        },
        "工作经历": {
            "type": "array",
            "description": "工作经历列表",
            "items": {
                "type": "object",
                "properties": {
                    "公司": {"type": ["string", "null"], "description": "公司名称"},
                    "职位": {"type": ["string", "null"], "description": "职位"},
                    "入职时间": {"type": ["string", "null"], "description": "入职时间"},
                    "离职时间": {"type": ["string", "null"], "description": "离职时间"},
                    "工作内容": {"type": ["string", "null"], "description": "主要工作内容"}
                },
                "required": ["公司"]
            }
        },
        "技能": {
            "type": "object",
            "properties": {
                "专业技能": {"type": "array", "items": {"type": "string"}, "description": "专业技能列表"},
                "语言能力": {"type": "array", "items": {"type": "string"}, "description": "语言能力列表"},
                "证书": {"type": "array", "items": {"type": "string"}, "description": "证书列表"}
            }
        }
    },
    "required": ["个人信息"]
}

PRODUCT_SPECS_SCHEMA = {
    "type": "object",
    "properties": {
        "产品名称": {"type": ["string", "null"], "description": "产品名称"},
        "型号": {"type": ["string", "null"], "description": "产品型号"},
        "技术参数": {
            "type": "array",
            "description": "技术参数列表",
            "items": {
                "type": "object",
                "properties": {
                    "参数名": {"type": ["string", "null"], "description": "参数名称"},
                    "参数值": {"type": ["string", "null"], "description": "参数值"},
                    "单位": {"type": ["string", "null"], "description": "单位"}
                },
                "required": ["参数名", "参数值"]
            }
        },
        "功能特性": {
            "type": "array",
            "items": {"type": "string"},
            "description": "功能特性列表"
        },
        "价格信息": {
            "type": "object",
            "properties": {
                "价格": {"type": ["string", "null"], "description": "价格"},
                "币种": {"type": ["string", "null"], "description": "币种"}
            }
        }
    },
    "required": ["产品名称"]
}

API_INFO_SCHEMA = {
    "type": "object",
    "properties": {
        "接口列表": {
            "type": "array",
            "description": "API 接口列表",
            "items": {
                "type": "object",
                "properties": {
                    "端点": {"type": ["string", "null"], "description": "API 端点 URL"},
                    "请求方法": {"type": ["string", "null"], "description": "GET/POST/PUT/DELETE"},
                    "描述": {"type": ["string", "null"], "description": "接口描述"},
                    "请求参数": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "参数名": {"type": ["string", "null"], "description": "参数名"},
                                "类型": {"type": ["string", "null"], "description": "参数类型"},
                                "必填": {"type": ["string", "null"], "description": "是否必填"},
                                "说明": {"type": ["string", "null"], "description": "参数说明"}
                            }
                        }
                    },
                    "响应格式": {"type": ["string", "null"], "description": "响应数据格式描述"},
                    "认证方式": {"type": ["string", "null"], "description": "认证方式"}
                },
                "required": ["端点", "请求方法"]
            }
        }
    },
    "required": ["接口列表"]
}

KEY_VALUE_SCHEMA = {
    "type": "object",
    "properties": {
        "键值对列表": {
            "type": "array",
            "description": "从文档中提取的所有键值对",
            "items": {
                "type": "object",
                "properties": {
                    "键": {"type": ["string", "null"], "description": "键名"},
                    "值": {"type": ["string", "null"], "description": "对应的值"}
                },
                "required": ["键", "值"]
            }
        }
    },
    "required": ["键值对列表"]
}


class InformationExtractionAgent:
    """Information extraction Agent: wraps the Extract API tools behind a LangChain ReAct agent."""

    def __init__(self):
        self.setup_llm()
        self.setup_agent()

    def setup_llm(self):
        self.llm = ChatTongyi(
            model="qwen-max",
            dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
            temperature=0,
        )

    @staticmethod
    def extract_from_file(file_path: str, schema: dict, generate_citations: bool = False, stamp: bool = False) -> dict:
        """Call the xParse Extract API on one file and return the "result" object of the response."""
        # The API expects the file content as a base64 string
        with open(file_path, "rb") as f:
            file_base64 = base64.b64encode(f.read()).decode("utf-8")

        payload = {
            "file": {
                "file_base64": file_base64,
                "file_name": os.path.basename(file_path)
            },
            "schema": schema,
            "extract_options": {
                "generate_citations": generate_citations,
                "stamp": stamp
            }
        }

        headers = {
            "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
            "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
            "Content-Type": "application/json"
        }

        # A timeout keeps a stalled request from hanging the agent; 120 s is an arbitrary value, tune as needed
        response = requests.post(EXTRACT_API_URL, json=payload, headers=headers, timeout=120)
        result = response.json()

        # The API reports success with code 200 in the response body
        if result.get("code") != 200:
            raise RuntimeError(f"Extract API 错误: {result.get('message', '未知错误')}")

        return result["result"]

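    @staticmethod
    def _resolve_file_path(file_path: str = None) -> str:
        """Return the path if it points to an existing file, otherwise None."""
        if file_path is None:
            return None
        # ReAct agents often pass the argument wrapped in quotes or with stray whitespace
        cleaned = file_path.strip().strip("'\"")
        if cleaned in ("", "None", "none", "null"):
            return None
        return cleaned if os.path.exists(cleaned) else None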

    def setup_agent(self):
        tools = [
            Tool(
                name="extract_invoice_info",
                description="从发票中提取结构化信息,包括发票基本信息、销售方/购买方信息、商品明细、金额信息等。需要提供文档路径作为参数。",
                func=self.extract_invoice_info
            ),
            Tool(
                name="extract_medical_bill_info",
                description="从医疗票据中提取结构化信息,包括患者信息、医疗机构信息、就诊信息、费用明细、费用汇总等。需要提供文档路径作为参数。",
                func=self.extract_medical_bill_info
            ),
            Tool(
                name="extract_contract_info",
                description="从合同中提取结构化信息,包括合同基本信息、合同双方信息、合同标的、关键条款、金额信息等。需要提供文档路径作为参数。",
                func=self.extract_contract_info
            ),
            Tool(
                name="extract_resume_info",
                description="从简历中提取结构化信息,包括个人信息、教育经历、工作经历、技能等。需要提供文档路径作为参数。",
                func=self.extract_resume_info
            ),
            Tool(
                name="extract_product_specs",
                description="从产品文档中提取产品规格和技术参数,包括产品名称、型号、技术参数、功能特性、价格等。需要提供文档路径作为参数。",
                func=self.extract_product_specs
            ),
            Tool(
                name="extract_api_info",
                description="从技术文档中提取API接口信息,包括API端点、请求方法、请求参数、响应格式等。需要提供文档路径作为参数。",
                func=self.extract_api_info
            ),
            Tool(
                name="format_data",
                description="从文档中提取键值对信息并格式化为标准格式(JSON、CSV等)。需要提供文档路径作为参数。",
                func=self.format_data
            )
        ]

        self.agent = initialize_agent(
            tools=tools,
            llm=self.llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True,
            agent_kwargs={
                "prefix": """你是一个专业的信息提取助手。你的任务是帮助用户:
1. 从文档中提取结构化信息(发票、医疗票据、合同、简历、产品规格、API接口等)
2. 将提取的信息格式化为标准格式(JSON、CSV等)
3. 验证提取数据的完整性和准确性

在回答时,请:
- 提供结构化的提取结果
- 使用JSON或表格格式展示数据
- 如果数据不完整,说明缺失的部分
- 使用工具获取准确的信息,不要猜测
- 对于财务类文档(发票、医疗票据),确保金额和税务信息的准确性
- 对于合同文档,重点关注关键条款和风险点
- 所有工具都需要提供文档文件路径作为参数
"""
            }
        )

    def _extract_as_json(self, file_path: str, schema: dict) -> str:
        """Shared tool body: resolve the path, call the Extract API, return the result as a JSON string."""
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"
        result = self.extract_from_file(fp, schema)
        return json.dumps(result["extracted_schema"], ensure_ascii=False, indent=2)

    # Each tool differs only in the Schema it passes to the Extract API
    def extract_invoice_info(self, file_path: str = None) -> str:
        return self._extract_as_json(file_path, INVOICE_SCHEMA)

    def extract_medical_bill_info(self, file_path: str = None) -> str:
        return self._extract_as_json(file_path, MEDICAL_BILL_SCHEMA)

    def extract_contract_info(self, file_path: str = None) -> str:
        return self._extract_as_json(file_path, CONTRACT_SCHEMA)

    def extract_resume_info(self, file_path: str = None) -> str:
        return self._extract_as_json(file_path, RESUME_SCHEMA)

    def extract_product_specs(self, file_path: str = None) -> str:
        return self._extract_as_json(file_path, PRODUCT_SPECS_SCHEMA)

    def extract_api_info(self, file_path: str = None) -> str:
        return self._extract_as_json(file_path, API_INFO_SCHEMA)

    def format_data(self, file_path: str = None) -> str:
        fp = self._resolve_file_path(file_path)
        if not fp:
            return "错误:请提供有效的文档路径。"

        result = self.extract_from_file(fp, KEY_VALUE_SCHEMA)
        extracted = result["extracted_schema"]
        data_list = extracted.get("键值对列表", [])

        if not data_list:
            return "未找到可格式化的数据"

        json_output = json.dumps(data_list, ensure_ascii=False, indent=2)

        try:
            df = pd.DataFrame(data_list)
            csv_output = df.to_csv(index=False)
        except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
            csv_output = "CSV格式化失败"

        return f"JSON格式:\n{json_output}\n\nCSV格式:\n{csv_output}"

    def query(self, question: str) -> str:
        """Send a question to the Agent and return its final answer."""
        response = self.agent.invoke({"input": question})
        return response["output"]


def main():
    agent = InformationExtractionAgent()

    questions = [
        "从 ./extraction_documents/invoice.pdf 中提取发票代码、发票号码、销售方和购买方信息、商品明细和金额",
        # "从 ./extraction_documents/medical_bill.pdf 中提取患者信息、医院信息、诊断结果和费用明细",
        # "从 ./extraction_documents/contract.pdf 中提取合同编号、合同双方信息、合同金额和关键条款",
        # "从 ./extraction_documents/resume.pdf 中提取所有个人信息、教育经历和工作经历",
        # "从 ./extraction_documents/product_spec.pdf 中提取产品规格和技术参数",
        # "从 ./extraction_documents/api_docs.pdf 中提取所有API接口信息",
        "将 ./extraction_documents/invoice.pdf 中的数据格式化为JSON格式"
    ]

    for question in questions:
        print(f"\n{'='*60}")
        print(f"问题: {question}")
        print(f"{'='*60}")
        answer = agent.query(question)
        print(f"\n回答:\n{answer}")

if __name__ == "__main__":
    main()

Usage Examples

Example 1: Extract invoice information

agent = InformationExtractionAgent()

response = agent.query("从 ./extraction_documents/invoice.pdf 中提取发票代码、发票号码、销售方和购买方信息、商品明细和金额")
print(response)

Example 2: Extract medical bill information

response = agent.query("从 ./extraction_documents/medical_bill.pdf 中提取患者姓名、医院名称、诊断结果、总费用和医保支付金额")
print(response)

Example 3: Extract contract information

response = agent.query("从 ./extraction_documents/contract.pdf 中提取合同编号、甲方和乙方信息、合同金额、付款方式和违约责任")
print(response)

Example 4: Extract resume information

response = agent.query("从 ./extraction_documents/resume.pdf 中提取姓名、联系方式、教育经历和工作经历")
print(response)

Example 5: Extract product specifications

response = agent.query("从 ./extraction_documents/product_spec.pdf 中提取产品名称、型号、技术参数和价格")
print(response)

Example 6: Extract API information

response = agent.query("从 ./extraction_documents/api_docs.pdf 中提取所有API端点、请求方法和参数")
print(response)

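Best Practices

  1. Schema design: define a dedicated JSON Schema for each document type that specifies the fields to extract, their types, and their constraints, so the output format stays consistent
  2. Batch processing: iterate over a set of documents and call the matching extraction tool for each one instead of extracting by hand
  3. Format standardization: convert extracted data to a standard format (JSON, CSV) so downstream steps can consume it directly
  4. Financial documents
    • For invoices, focus on key fields such as invoice code, invoice number, and amounts
    • For medical bills, distinguish between cost types such as self-paid and insurance-paid amounts
    • Verify that amounts are correct before passing them to a financial system
  5. Contracts
    • Focus on the contracting parties, the contract amount, and the key clauses
    • Identify important clauses such as liability for breach and dispute resolution
    • Extract the contract validity period to support contract management
  6. Citation tracing: for scenarios that require review, set generate_citations=True when calling extract_from_file to get the position in the source document that each extracted value was taken from
  7. Error handling: log failed extractions for manual review and inspect the error message returned by the API
  8. Performance: the Extract API performs parsing and extraction on the server side, so no parsing engine needs to be deployed locally, which suits large-scale batch processing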
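
The batch-processing and error-handling points in the list above can be combined in one small helper. The following is a minimal sketch, not part of the tutorial's code: the function name `extract_batch`, its `extract_fn` parameter, and the default of 4 workers are assumptions made for illustration, and the Extract API's actual rate limits are not stated in this document, so the worker count should be tuned conservatively.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def extract_batch(
    paths: Iterable[str],
    extract_fn: Callable[[str], dict],
    max_workers: int = 4,  # assumed default; keep it low to avoid excessive concurrency
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run extract_fn over every path with a bounded thread pool.

    Returns (results, failures): results maps path -> extracted dict,
    failures maps path -> error message, kept for manual review.
    """
    results: Dict[str, dict] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_path = {pool.submit(extract_fn, str(p)): str(p) for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch
                failures[path] = str(exc)
    return results, failures


if __name__ == "__main__":
    # Stand-in for a real call; it simulates one failing document
    def fake_extract(path: str) -> dict:
        if path.endswith("bad.pdf"):
            raise RuntimeError("simulated Extract API error")
        return {"extracted_schema": {"source": path}}

    ok, failed = extract_batch(["a.pdf", "bad.pdf"], fake_extract, max_workers=2)
    print(sorted(ok), failed)
```

With the tutorial's class, `extract_fn` could be something like `lambda p: InformationExtractionAgent.extract_from_file(p, INVOICE_SCHEMA)`. Threads are a reasonable fit here because each call spends most of its time waiting on the network.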

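FAQ

Q: How can I improve extraction accuracy?
A: 1) Refine the JSON Schema with precise field definitions and descriptions; 2) make sure documents are legible and avoid blurry or low-quality scans; 3) validate the extracted results.
Q: How do I handle documents with inconsistent formats?
A: 1) The Extract API accepts multiple document formats (PDF, Word, Excel, images, and so on) and handles the format differences automatically; 2) use the Schema to unify the output format; 3) review and correct the results manually.
Q: How do I process a large number of documents in batch?
A: 1) Iterate over the document directory and call the extraction tool for each file; 2) process several documents in parallel (with threads or async I/O); 3) manage tasks with a queue so that concurrency does not get too high.
Q: What should I do if invoice extraction is inaccurate?
A: 1) Make sure the invoice image is clear; 2) refine the field descriptions in the Schema; 3) set generate_citations=True and inspect the citation positions to find the problematic fields; 4) for invoices with unusual layouts, adapt the Schema.
Q: How are the itemized charges on a medical bill extracted?
A: 1) MEDICAL_BILL_SCHEMA already defines an itemized-charges array (费用明细) with fields such as item name, quantity, unit price, amount, and insurance type; 2) the charge summary (费用汇总) covers total cost, self-paid amount, insurance payment, and personal payment; 3) the Extract API recognizes table structure automatically.
Q: How are key contract clauses identified?
A: 1) CONTRACT_SCHEMA already defines the key-clause fields under 关键条款: 付款方式 (payment method), 交付方式 (delivery method), 违约责任 (liability for breach), and 争议解决 (dispute resolution); 2) extend the Schema with more clause fields as the business requires; 3) set generate_citations=True to get the position of each clause in the source text.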
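
Several answers above recommend validating the extracted results. Because the schemas in this tutorial declare most fields as `["string", "null"]`, one simple check is to list the fields that came back null or missing so they can be routed to manual review. The sketch below is an illustration only: `find_empty_fields` is a hypothetical helper, the sample data is made up, and it assumes the extracted data mirrors the Schema's nesting of objects and arrays.

```python
def find_empty_fields(data, schema, prefix=""):
    """Walk a JSON Schema and the extracted data together.

    Returns the paths of fields that the schema declares but that are
    missing, null, or of the wrong container type in the data.
    """
    empty = []
    schema_type = schema.get("type")
    if schema_type == "object":
        if not isinstance(data, dict):
            return [prefix or "<root>"]
        for name, sub_schema in schema.get("properties", {}).items():
            path = f"{prefix}.{name}" if prefix else name
            if data.get(name) is None:
                empty.append(path)
            else:
                empty.extend(find_empty_fields(data[name], sub_schema, path))
    elif schema_type == "array":
        if not isinstance(data, list):
            return [prefix or "<root>"]
        for index, item in enumerate(data):
            item_path = f"{prefix}[{index}]"
            empty.extend(find_empty_fields(item, schema.get("items", {}), item_path))
    # Scalar types such as ["string", "null"] need no further descent
    return empty


if __name__ == "__main__":
    # Trimmed schema in the style of RESUME_SCHEMA; the data is invented for the demo
    schema = {
        "type": "object",
        "properties": {
            "个人信息": {
                "type": "object",
                "properties": {
                    "姓名": {"type": ["string", "null"]},
                    "电话": {"type": ["string", "null"]},
                },
            },
            "工作经历": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "公司": {"type": ["string", "null"]},
                        "职位": {"type": ["string", "null"]},
                    },
                },
            },
        },
    }
    data = {
        "个人信息": {"姓名": "张三", "电话": None},
        "工作经历": [{"公司": "示例公司", "职位": None}, {"公司": "另一家公司"}],
    }
    print(find_empty_fields(data, schema))
```

A check like this only reports absent values; it does not verify that the values present are correct, so amounts and other critical fields still need their own business rules.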

Related Documentation