This tutorial targets financial-audit and compliance-review scenarios, showing how to use xParse as a data foundation to build an intelligent Agent that automatically parses financial documents, extracts key information, and performs compliance checks and anomaly detection.

Scenario Overview

Business Pain Points

In enterprise financial-audit and compliance-review scenarios, auditors face the following challenges:
  • High document volume: large numbers of financial statements, contracts, invoices, bank statements, and other documents must be processed
  • Tedious information extraction: key financial fields (amounts, dates, contract terms, etc.) must be pulled out of unstructured documents
  • Complex compliance checks: contract terms and financial data must be checked against regulations and internal policies
  • Difficult anomaly detection: amount anomalies, date conflicts, and data inconsistencies must be identified
  • Hard to trace: once an issue is found, it must be traced back to the exact location in the source document for verification

Solution

By building a financial-audit Agent, we can achieve:
  • Automated document parsing: use the xParse SDK to parse all kinds of financial documents and build a vector knowledge base
  • Intelligent information extraction: use the xParse Extract API to extract key financial data from source documents in structured form
  • Automated compliance checks: check compliance automatically against the knowledge base and historical cases
  • Automated anomaly detection: identify amount anomalies, date conflicts, and other irregularities
  • Traceable results: the Extract API supports citation tracing, preserving the original elements and their coordinates

Architecture

Financial documents (PDF/Excel/images)

xParse SDK parsing (builds the vector knowledge base)
  ├─ RecursiveCharacterTextSplitter
  └─ DashScopeEmbeddings

Vector database (Milvus/Zilliz)

xParse Extract API (structured extraction)
  └─ Extracts amounts, dates, contract info,
     invoice info, etc. from source documents

LangChain Agent
  ├─ Tool 1: extract_financial_data (structured extraction via Extract)
  ├─ Tool 2: check_compliance (compliance check)
  ├─ Tool 3: detect_anomalies (anomaly detection)
  └─ Tool 4: vector_search (retrieve historical cases)

Audit report (with citations and traceability)

Environment Setup

python -m venv .venv && source .venv/bin/activate
pip install xparse-client langchain langchain-core langchain-text-splitters \
            langchain_milvus langchain-community \
            pymilvus python-dotenv requests
export TEXTIN_APP_ID=your-app-id # obtained after registering on the TextIn website
export TEXTIN_SECRET_CODE=your-secret-code # obtained after registering on the TextIn website
export DASHSCOPE_API_KEY=your-dashscope-api-key # this tutorial uses Tongyi Qianwen (Qwen); any other LLM can be substituted
Tip: TEXTIN_APP_ID and TEXTIN_SECRET_CODE are your API key; log in to the TextIn console to obtain them. The examples use Tongyi Qianwen model capabilities; other models work similarly.

Step 1: Parse Documents with the xParse SDK

For the financial-audit scenario, we parse documents with the xParse SDK, then chunk and vectorize them with LangChain:
  • Chunking strategy: group text by page, then split with RecursiveCharacterTextSplitter; preserving page boundaries makes results traceable
  • Vectorization: use DashScopeEmbeddings
from xparse_client import XParseClient
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_milvus import Milvus
from langchain_core.documents import Document
from collections import defaultdict
import os, glob
from dotenv import load_dotenv

load_dotenv()

client = XParseClient()

def run_audit_pipeline():
    """Process audit documents and build the vector store"""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=100)

    all_chunks = []
    patterns = ["*.pdf", "*.xlsx", "*.xls", "*.png", "*.jpg", "*.txt", "*.docx", "*.doc"]
    for pattern in patterns:
        for file_path in glob.glob(os.path.join("./audit_documents", pattern)):
            with open(file_path, "rb") as f:
                result = client.parse.run(file=f, filename=os.path.basename(file_path))

            page_texts = defaultdict(list)
            for el in result.elements:
                page_texts[el.page_number].append(el.text)

            page_docs = [
                Document(
                    page_content="\n\n".join(texts),
                    metadata={"filename": os.path.basename(file_path), "page_number": pn}
                )
                for pn, texts in sorted(page_texts.items())
            ]
            chunks = text_splitter.split_documents(page_docs)
            all_chunks.extend(chunks)

    embedding = DashScopeEmbeddings(model="text-embedding-v3")
    Milvus.from_documents(
        documents=all_chunks,
        embedding=embedding,
        collection_name="audit_documents",
        connection_args={"uri": "./audit_vectors.db"},
    )

Step 2: Build LangChain Tools

Tool 1: Data Extraction (Extract)

Use the xParse Extract API to extract structured financial data directly from source documents, with no reliance on regex matching:
from langchain_core.tools import Tool
from langchain_milvus import Milvus
from langchain_community.embeddings import DashScopeEmbeddings
import base64
import requests
import re
import json
import os
import glob

EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"

FINANCIAL_SCHEMA = {
    "type": "object",
    "properties": {
        "金额列表": {
            "type": "array",
            "description": "文档中的所有金额信息",
            "items": {
                "type": "object",
                "properties": {
                    "金额": {"type": ["string", "null"], "description": "金额数值"},
                    "说明": {"type": ["string", "null"], "description": "金额对应的描述或用途"}
                },
                "required": ["金额"]
            }
        },
        "日期列表": {
            "type": "array",
            "description": "文档中的所有日期信息",
            "items": {
                "type": "object",
                "properties": {
                    "日期": {"type": ["string", "null"], "description": "日期"},
                    "说明": {"type": ["string", "null"], "description": "日期对应的描述"}
                },
                "required": ["日期"]
            }
        },
        "合同信息": {
            "type": "object",
            "description": "合同相关信息",
            "properties": {
                "甲方": {"type": ["string", "null"], "description": "甲方名称"},
                "乙方": {"type": ["string", "null"], "description": "乙方名称"},
                "合同编号": {"type": ["string", "null"], "description": "合同编号"},
                "合同金额": {"type": ["string", "null"], "description": "合同总金额"}
            }
        },
        "发票信息": {
            "type": "object",
            "description": "发票相关信息",
            "properties": {
                "发票号码": {"type": ["string", "null"], "description": "发票号码"},
                "税号": {"type": ["string", "null"], "description": "纳税人识别号"},
                "价税合计": {"type": ["string", "null"], "description": "价税合计金额"}
            }
        }
    },
    "required": ["金额列表", "日期列表"]
}

def extract_from_file(file_path: str, schema: dict, generate_citations: bool = False) -> dict:
    with open(file_path, "rb") as f:
        file_base64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "file": {"file_base64": file_base64, "file_name": os.path.basename(file_path)},
        "schema": schema,
        "extract_options": {"generate_citations": generate_citations}
    }
    headers = {
        "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
        "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
        "Content-Type": "application/json"
    }
    response = requests.post(EXTRACT_API_URL, json=payload, headers=headers)
    result = response.json()
    if result.get("code") != 200:
        raise Exception(f"Extract API error: {result.get('message', 'unknown error')}")
    return result["result"]

embedding = DashScopeEmbeddings(
    model="text-embedding-v3",
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
)
vector_store = Milvus(
    embedding_function=embedding,
    collection_name="audit_documents",
    connection_args={"uri": "./audit_vectors.db"},
)

def extract_financial_data(query: str) -> str:
    """Extract key financial data from the financial documents"""
    docs_dir = "./audit_documents"
    results = []

    if "文件:" in query:
        filename = query.split("文件:")[-1].strip()
        file_path = os.path.join(docs_dir, filename)
        if os.path.exists(file_path):
            try:
                result = extract_from_file(file_path, FINANCIAL_SCHEMA, generate_citations=True)
                results.append({
                    "file": filename,
                    "data": result["extracted_schema"],
                    "citations": result.get("citations", {})
                })
            except Exception as e:
                results.append({"file": filename, "error": str(e)})
    else:
        for pattern in ["*.pdf", "*.xlsx", "*.png", "*.jpg", "*.docx"]:
            for file_path in glob.glob(os.path.join(docs_dir, pattern)):
                try:
                    result = extract_from_file(file_path, FINANCIAL_SCHEMA, generate_citations=True)
                    results.append({
                        "file": os.path.basename(file_path),
                        "data": result["extracted_schema"]
                    })
                except Exception as e:
                    results.append({"file": os.path.basename(file_path), "error": str(e)})

    return json.dumps(results, ensure_ascii=False, indent=2)

Tool 2: Compliance Check

def check_compliance(query: str) -> str:
    """
    Check whether documents meet compliance requirements.
    
    Checks include:
    - whether contract terms meet regulatory requirements
    - whether financial data conforms to accounting standards
    - whether invoice information is complete
    - whether the approval process is compliant
    """
    docs = vector_store.similarity_search(query, k=3)
    
    compliance_checks = []
    
    for doc in docs:
        text = doc.page_content
        metadata = doc.metadata
        
        checks = {
            "file": metadata.get("filename", "unknown"),
            "page": metadata.get("page_number", "unknown"),
            "issues": []
        }
        
        amounts = re.findall(r'[\d,]+\.?\d*', text)
        for amount_str in amounts:
            try:
                amount = float(amount_str.replace(',', ''))
                if amount > 1000000:
                    checks["issues"].append(f"Amount {amount_str} exceeds 1,000,000 and requires special approval")
            except ValueError:
                pass
        
        # Required contract clauses (kept in Chinese to match the source documents)
        required_terms = ["违约责任", "争议解决", "合同期限"]
        missing_terms = [term for term in required_terms if term not in text]
        if missing_terms:
            checks["issues"].append(f"Missing required clauses: {', '.join(missing_terms)}")
        
        if checks["issues"]:
            compliance_checks.append(checks)
    
    if not compliance_checks:
        return "✅ No compliance issues found"
    
    return json.dumps(compliance_checks, ensure_ascii=False, indent=2)

Tool 3: Anomaly Detection

def detect_anomalies(query: str) -> str:
    """
    Detect anomalies in financial data.
    
    Detections include:
    - amount anomalies (too large, too small, negative, etc.)
    - date conflicts (e.g. payment date earlier than contract date)
    - data inconsistencies (the same contract showing different amounts in different documents)
    """
    docs = vector_store.similarity_search(query, k=5)
    
    anomalies = []
    
    all_amounts = []
    all_dates = []
    
    for doc in docs:
        text = doc.page_content
        metadata = doc.metadata
        
        amounts = re.findall(r'[\d,]+\.?\d*', text)
        for amount_str in amounts:
            try:
                amount = float(amount_str.replace(',', ''))
                all_amounts.append({
                    "value": amount,
                    "source": metadata.get("filename", "unknown"),
                    "page": metadata.get("page_number", "unknown")
                })
            except ValueError:
                pass
        
        dates = re.findall(r'\d{4}[-年]\d{1,2}[-月]\d{1,2}[日]?', text)
        all_dates.extend([{
            "value": date,
            "source": metadata.get("filename", "unknown"),
            "page": metadata.get("page_number", "unknown")
        } for date in dates])
    
    if all_amounts:
        positive_amounts = [a for a in all_amounts if a["value"] > 0]
        
        if len(positive_amounts) >= 2:
            amounts_values = [a["value"] for a in positive_amounts]
            max_amount = max(amounts_values)
            min_amount = min(amounts_values)
            
            if min_amount > 0:
                ratio = max_amount / min_amount
                if ratio > 1000:
                    anomalies.append({
                        "type": "amount_spread_anomaly",
                        "description": f"Unusual amount spread: the largest amount {max_amount:,.2f} is {ratio:.2f}x the smallest amount {min_amount:,.2f}, exceeding the 1000x threshold",
                        "details": [a for a in positive_amounts if a["value"] in [max_amount, min_amount]]
                    })
    
    negative_amounts = [a for a in all_amounts if a["value"] < 0]
    if negative_amounts:
        anomalies.append({
            "type": "negative_amount",
            "description": "Negative amounts found; possibly a data-entry error",
            "details": negative_amounts
        })
    
    if not anomalies:
        return "✅ No anomalies found"
    
    return json.dumps(anomalies, ensure_ascii=False, indent=2)
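The docstring above also lists date conflicts (e.g. a payment date earlier than the contract date) as a detection item, but the tool body does not implement it. Below is a minimal sketch of that check, reusing the same date regex; the function name `find_date_conflicts` and the contract-text/payment-text split are assumptions, not part of the tutorial's API:

```python
import re
from datetime import datetime

# Same date pattern as detect_anomalies (optional trailing 日)
DATE_RE = re.compile(r'\d{4}[-年]\d{1,2}[-月]\d{1,2}[日]?')

def parse_cn_date(s: str) -> datetime:
    """Normalize '2024年1月5日' / '2024-1-5' into a datetime."""
    parts = [p for p in re.split(r'[-年月日]', s) if p]
    year, month, day = (int(p) for p in parts)
    return datetime(year, month, day)

def find_date_conflicts(contract_text: str, payment_text: str) -> list:
    """Flag payment dates that fall before the earliest contract date."""
    contract_dates = [parse_cn_date(d) for d in DATE_RE.findall(contract_text)]
    payment_dates = [parse_cn_date(d) for d in DATE_RE.findall(payment_text)]
    if not contract_dates:
        return []
    earliest_contract = min(contract_dates)
    return [d for d in payment_dates if d < earliest_contract]
```

In a full implementation the two text arguments would come from chunks retrieved for the same contract, and conflicts would be appended to `anomalies` like the other checks.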

Tool 4: Historical Case Retrieval

def search_historical_cases(query: str) -> str:
    """Retrieve historical audit cases"""
    docs = vector_store.similarity_search(query, k=5)
    
    results = []
    for i, doc in enumerate(docs, 1):
        results.append({
            f"Case {i}": {
                "file": doc.metadata.get("filename", "unknown"),
                "page": doc.metadata.get("page_number", "unknown"),
                "content": doc.page_content[:300] + "...",
                "relevance": "high" if i <= 2 else "medium"  # rank-based label, not a true similarity score
            }
        })
    
    return json.dumps(results, ensure_ascii=False, indent=2)

Assembling the Tools

tools = [
    Tool(
        name="extract_financial_data",
        description="Extract key financial data (amounts, dates, contract terms, invoice info, etc.) from financial documents. Input can be '提取所有文档的财务数据' (all documents) or '提取财务数据 文件:合同.pdf' (a specific file).",
        func=extract_financial_data
    ),
    Tool(
        name="check_compliance",
        description="Check whether documents meet compliance requirements, including contract-term compliance, financial-data compliance, and invoice completeness. Input should describe the compliance items to check.",
        func=check_compliance
    ),
    Tool(
        name="detect_anomalies",
        description="Detect anomalies in financial data, including amount anomalies, date conflicts, and data inconsistencies. Input should describe the anomaly types to detect.",
        func=detect_anomalies
    ),
    Tool(
        name="search_historical_cases",
        description="Retrieve historical audit cases for reference and comparison. Input should be the case type or keywords to search for.",
        func=search_historical_cases
    ),
    Tool(
        name="vector_search",
        description="Semantic search over related document fragments. Input should be a natural-language query.",
        func=lambda q: "\n\n".join([
            f"[{i+1}] {doc.metadata.get('filename', 'unknown')}\n{doc.page_content[:300]}..."
            for i, doc in enumerate(vector_store.similarity_search(q, k=3))
        ])
    )
]

Step 3: Configure the LangChain Agent

from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi
import os

llm = ChatTongyi(
    model="qwen-max",
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
    temperature=0,
)

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

Step 4: Complete Example

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Complete financial-audit Agent example
"""

import os
import re
import json
import glob
import base64
from collections import defaultdict
from dotenv import load_dotenv
import requests
from xparse_client import XParseClient
from langchain_core.tools import Tool
from langchain_core.documents import Document
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatTongyi
from langchain_milvus import Milvus
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

EXTRACT_API_URL = "https://api.textin.com/ai/service/v3/entity_extraction"

FINANCIAL_SCHEMA = {
    "type": "object",
    "properties": {
        "金额列表": {
            "type": "array",
            "description": "文档中的所有金额信息",
            "items": {
                "type": "object",
                "properties": {
                    "金额": {"type": ["string", "null"], "description": "金额数值"},
                    "说明": {"type": ["string", "null"], "description": "金额对应的描述或用途"}
                },
                "required": ["金额"]
            }
        },
        "日期列表": {
            "type": "array",
            "description": "文档中的所有日期信息",
            "items": {
                "type": "object",
                "properties": {
                    "日期": {"type": ["string", "null"], "description": "日期"},
                    "说明": {"type": ["string", "null"], "description": "日期对应的描述"}
                },
                "required": ["日期"]
            }
        },
        "合同信息": {
            "type": "object",
            "description": "合同相关信息",
            "properties": {
                "甲方": {"type": ["string", "null"], "description": "甲方名称"},
                "乙方": {"type": ["string", "null"], "description": "乙方名称"},
                "合同编号": {"type": ["string", "null"], "description": "合同编号"},
                "合同金额": {"type": ["string", "null"], "description": "合同总金额"}
            }
        },
        "发票信息": {
            "type": "object",
            "description": "发票相关信息",
            "properties": {
                "发票号码": {"type": ["string", "null"], "description": "发票号码"},
                "税号": {"type": ["string", "null"], "description": "纳税人识别号"},
                "价税合计": {"type": ["string", "null"], "description": "价税合计金额"}
            }
        }
    },
    "required": ["金额列表", "日期列表"]
}

class AuditAgent:
    """Financial-audit Agent"""
    
    def __init__(self):
        self.client = XParseClient()
        self.setup_vector_store()
        self.setup_agent()
    
    def setup_vector_store(self):
        """Configure the vector store"""
        self.embedding = DashScopeEmbeddings(
            model="text-embedding-v3",
            dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
        )
        self.vector_store = Milvus(
            embedding_function=self.embedding,
            collection_name="audit_documents",
            connection_args={"uri": "./audit_vectors.db"},
        )
    
    def extract_from_file(self, file_path: str, schema: dict, generate_citations: bool = False) -> dict:
        with open(file_path, "rb") as f:
            file_base64 = base64.b64encode(f.read()).decode("utf-8")
        payload = {
            "file": {"file_base64": file_base64, "file_name": os.path.basename(file_path)},
            "schema": schema,
            "extract_options": {"generate_citations": generate_citations}
        }
        headers = {
            "x-ti-app-id": os.getenv("TEXTIN_APP_ID"),
            "x-ti-secret-code": os.getenv("TEXTIN_SECRET_CODE"),
            "Content-Type": "application/json"
        }
        response = requests.post(EXTRACT_API_URL, json=payload, headers=headers)
        result = response.json()
        if result.get("code") != 200:
            raise Exception(f"Extract API error: {result.get('message', 'unknown error')}")
        return result["result"]
    
    def setup_agent(self):
        """Configure the Agent and its Tools"""
        tools = [
            Tool(
                name="extract_financial_data",
                description="Extract key financial data (amounts, dates, contract terms, invoice info, etc.) from financial documents. Input can be '提取所有文档的财务数据' (all documents) or '提取财务数据 文件:合同.pdf' (a specific file).",
                func=self.extract_financial_data
            ),
            Tool(
                name="check_compliance",
                description="Compliance check: contract terms, financial-data compliance, etc.",
                func=self.check_compliance
            ),
            Tool(
                name="detect_anomalies",
                description="Anomaly detection: amount anomalies, date conflicts, etc.",
                func=self.detect_anomalies
            ),
            Tool(
                name="vector_search",
                description="Semantic search over related documents",
                func=lambda q: "\n\n".join([
                    f"[{i+1}] {doc.metadata.get('filename', 'unknown')}\n{doc.page_content[:300]}..."
                    for i, doc in enumerate(self.vector_store.similarity_search(q, k=3))
                ])
            )
        ]
        
        llm = ChatTongyi(
            model="qwen-max",
            dashscope_api_key=os.getenv("DASHSCOPE_API_KEY"),
            temperature=0,
        )
        
        self.agent = initialize_agent(
            tools=tools,
            llm=llm,
            agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True,
        )
    
    def extract_financial_data(self, query: str) -> str:
        """Extract key financial data from the financial documents"""
        docs_dir = "./audit_documents"
        results = []

        if "文件:" in query:
            filename = query.split("文件:")[-1].strip()
            file_path = os.path.join(docs_dir, filename)
            if os.path.exists(file_path):
                try:
                    result = self.extract_from_file(file_path, FINANCIAL_SCHEMA, generate_citations=True)
                    results.append({
                        "file": filename,
                        "data": result["extracted_schema"],
                        "citations": result.get("citations", {})
                    })
                except Exception as e:
                    results.append({"file": filename, "error": str(e)})
        else:
            for pattern in ["*.pdf", "*.xlsx", "*.png", "*.jpg", "*.docx"]:
                for file_path in glob.glob(os.path.join(docs_dir, pattern)):
                    try:
                        result = self.extract_from_file(file_path, FINANCIAL_SCHEMA, generate_citations=True)
                        results.append({
                            "file": os.path.basename(file_path),
                            "data": result["extracted_schema"]
                        })
                    except Exception as e:
                        results.append({"file": os.path.basename(file_path), "error": str(e)})

        return json.dumps(results, ensure_ascii=False, indent=2)
    
    def check_compliance(self, query: str) -> str:
        """Compliance check"""
        docs = self.vector_store.similarity_search(query, k=3)
        issues = []
        for doc in docs:
            text = doc.page_content
            amounts = re.findall(r'[\d,]+\.?\d*', text)
            for amount_str in amounts:
                try:
                    amount = float(amount_str.replace(',', ''))
                    if amount > 1000000:
                        issues.append({
                            "file": doc.metadata.get("filename", "unknown"),
                            "page": doc.metadata.get("page_number", "unknown"),
                            "issue": f"Amount {amount_str} exceeds 1,000,000 and requires special approval"
                        })
                except ValueError:
                    pass
        return json.dumps(issues, ensure_ascii=False, indent=2) if issues else "✅ No compliance issues found"
    
    def detect_anomalies(self, query: str) -> str:
        """Anomaly detection"""
        docs = self.vector_store.similarity_search(query, k=5)
        anomalies = []
        all_amounts = []
        for doc in docs:
            amounts = re.findall(r'[\d,]+\.?\d*', doc.page_content)
            for amount_str in amounts:
                try:
                    amount = float(amount_str.replace(',', ''))
                    if amount > 0:
                        all_amounts.append(amount)
                except ValueError:
                    pass
        
        if len(all_amounts) >= 2:
            # all_amounts only holds positive values, so min_amount > 0 always holds
            min_amount = min(all_amounts)
            max_amount = max(all_amounts)
            ratio = max_amount / min_amount
            if ratio > 1000:
                anomalies.append(f"Unusual amount spread: the largest amount {max_amount:,.2f} is {ratio:.2f}x the smallest amount {min_amount:,.2f}, exceeding the 1000x threshold")
        
        return json.dumps(anomalies, ensure_ascii=False, indent=2) if anomalies else "✅ No anomalies found"
    
    def process_documents(self):
        """Parse documents and build the vector store"""
        print("=" * 60)
        print("Processing audit documents...")
        print("=" * 60)
        
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=100)

        all_chunks = []
        patterns = ["*.pdf", "*.xlsx", "*.xls", "*.png", "*.jpg", "*.txt", "*.docx", "*.doc"]
        for pattern in patterns:
            for file_path in glob.glob(os.path.join("./audit_documents", pattern)):
                with open(file_path, "rb") as f:
                    result = self.client.parse.run(file=f, filename=os.path.basename(file_path))

                page_texts = defaultdict(list)
                for el in result.elements:
                    page_texts[el.page_number].append(el.text)

                page_docs = [
                    Document(
                        page_content="\n\n".join(texts),
                        metadata={"filename": os.path.basename(file_path), "page_number": pn}
                    )
                    for pn, texts in sorted(page_texts.items())
                ]
                chunks = text_splitter.split_documents(page_docs)
                all_chunks.extend(chunks)

        Milvus.from_documents(
            documents=all_chunks,
            embedding=self.embedding,
            collection_name="audit_documents",
            connection_args={"uri": "./audit_vectors.db"},
        )
        
        print("\nDocument processing complete!")
    
    def query(self, question: str) -> str:
        """Query the Agent"""
        response = self.agent.invoke({
            "input": question
        })
        return response["output"]

def main():
    """Entry point"""
    agent = AuditAgent()
    
    # 1. Process documents (first run only)
    agent.process_documents()
    
    # 2. Example queries
    questions = [
        "提取所有合同中的金额和签署日期",  # extract amounts and signing dates from all contracts
        "检查这些合同是否符合合规要求",    # check these contracts for compliance
        "检测是否有金额异常的情况",        # detect amount anomalies
        "检索类似的历史审计案例"           # retrieve similar historical audit cases
    ]
    
    for question in questions:
        print(f"\n{'='*60}")
        print(f"Question: {question}")
        print(f"{'='*60}")
        answer = agent.query(question)
        print(f"\nAnswer:\n{answer}")

if __name__ == "__main__":
    main()

Usage Examples

Example 1: Extract Financial Data

agent = AuditAgent()
response = agent.query("从财务报表中提取所有超过10万的金额和对应的日期")  # extract all amounts above 100,000 and their dates
print(response)

Example 2: Extract Data from a Specific File

response = agent.query("提取财务数据 文件:合同_2024Q1.pdf")  # extract data from 合同_2024Q1.pdf
print(response)

Example 3: Compliance Check

response = agent.query("检查所有合同是否符合以下要求:1) 金额超过100万需要特殊审批 2) 必须包含违约责任条款")  # approval threshold and required clauses
print(response)

Example 4: Anomaly Detection

response = agent.query("检测财务报表中是否有异常:金额为负数、日期不合理、同一合同金额不一致等")  # negative amounts, implausible dates, inconsistent amounts
print(response)

Best Practices

  1. Document preprocessing: keep document formats consistent and names well structured (e.g. 合同_2024Q1_供应商A.pdf)
  2. Chunking strategy: group by page, then split with RecursiveCharacterTextSplitter to preserve page boundaries for traceability
  3. Vectorization: use DashScopeEmbeddings with the text-embedding-v3 model for better semantic understanding
  4. Structured extraction: define a precise Schema for the Extract API to obtain structured financial data and avoid the limits of regex matching
  5. Citation tracing: enable generate_citations to get source citations for extracted values, easing audit traceability
  6. Compliance rule configuration: keep compliance rules in a config file so they are easy to update and maintain
  7. Anomaly thresholds: tune detection thresholds to your business requirements
  8. Result traceability: include filenames and page numbers in Agent answers for manual verification
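The rule-configuration point above can be sketched as follows. The file name compliance_rules.json and the field names amount_threshold / required_terms are illustrative, not part of xParse:

```python
import json
import re

# Default rules used when no config file is present; the required contract
# clauses stay in Chinese to match the source documents.
DEFAULT_RULES = {
    "amount_threshold": 1000000,  # amounts above this need special approval
    "required_terms": ["违约责任", "争议解决", "合同期限"],
}

def load_rules(path: str = "compliance_rules.json") -> dict:
    """Load compliance rules from a config file, falling back to defaults."""
    try:
        with open(path, encoding="utf-8") as f:
            return {**DEFAULT_RULES, **json.load(f)}
    except FileNotFoundError:
        return dict(DEFAULT_RULES)

def check_text(text: str, rules: dict) -> list:
    """The same two checks as check_compliance, driven by the rule dict."""
    issues = []
    for amount_str in re.findall(r"[\d,]+\.?\d*", text):
        try:
            if float(amount_str.replace(",", "")) > rules["amount_threshold"]:
                issues.append(f"Amount {amount_str} exceeds the approval threshold")
        except ValueError:
            pass
    missing = [t for t in rules["required_terms"] if t not in text]
    if missing:
        issues.append("Missing required clauses: " + ", ".join(missing))
    return issues
```

Updating a threshold or adding a clause then only touches the JSON file, not the tool code.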

FAQ

Q: How do I handle password-protected PDFs?
A: Pass the pdf_pwd parameter when calling client.parse.run(), or decrypt the documents beforehand.
Q: How can I improve extraction accuracy?
A: 1) Use the Extract API with a precise Schema for structured extraction; 2) use the text-embedding-v3 model for vector retrieval; 3) enable generate_citations to verify extracted values.
Q: Which file formats does the Extract API support?
A: Common formats including PDF, Word (docx), Excel (xlsx), and images (PNG/JPG); files are uploaded base64-encoded.
Q: How do I integrate this into an existing audit system?
A: Wrap the Agent as a REST API and call it over HTTP, or embed it in your existing audit workflow.

Related Documentation