在处理长篇技术文档、学术论文或企业规范文档时,RAG系统面临的最大挑战之一是如何理解内容的逻辑层次和上下文关系。简单的文本分块往往会破坏文档的原有结构,导致检索到的信息缺乏必要的背景context。例如:当用户询问”数据安全相关的实施要求”时,如果系统无法区分这些要求是来自”总体概述”、“技术规范”还是”合规检查”章节,就可能提供不准确或不完整的信息。 实践中通常有一种技巧,即利用文档的标题层级分chunk,然后在检索和重排序的时候也利用标题层级过滤无关的chunk,从而提升Top5召回的相关度,以便让大模型在最终回答时效果更好。 在TextIn xParse文档解析API中,我们提供了获取文档标题层级的功能,最多可支持6级标题的输出,您可以基于API的返回结果来构建完整的文档目录树。

如何获取目录树

当您想要获取文档目录树(即大纲结构)时,您可以参考以下教程和示例代码。
这里为您提供了一份Textin官方pdf示例文件,您可点击下载或使用该链接:文档解析pdf示例.pdf
  • 参考快速启动,在 options 中设置URL参数 catalog_details=1,API会在返回结果中包含目录相关信息。
  • 在main函数中添加以下示例代码,获取API输出的目录信息,并保存为 json 文件。
		# 解析JSON响应
        json_response = json.loads(response)

        if "result" in json_response and "catalog" in json_response["result"]:
            catalog = json_response["result"]["catalog"]

            # 保存为json文件
            with open("catalog.json", "w", encoding="utf-8") as f:
                json.dump(catalog, f, ensure_ascii=False, indent=2)
            print("目录已保存为 catalog.json")
        else:
            print("未检测到目录字段,可能文档没有目录或参数设置有误。")
            return
  • 请注意:为了更加灵活的支持下游业务场景,文档解析API的目录返回结果中通过hierarchy字段表示目录的标题层级,但并没有目录之间直接的父子层级关系,您可以参考以下示例代码获取标题目录间的父子层级关系,以构建目录树状结构。
        toc_list = catalog['toc']
        result = []
        parent_stack = []  # 用于跟踪当前路径上的父节点

        for item in toc_list:
            # 复制当前项目,避免修改原始数据
            current_item = item.copy()
            current_item['children'] = []

            current_level = item.get('hierarchy', 1)

            # 根据层级调整父节点栈
            # 移除层级大于等于当前层级的节点
            while parent_stack and parent_stack[-1]['hierarchy'] >= current_level:
                parent_stack.pop()

            # 如果有父节点,将当前项目添加到父节点的children中
            if parent_stack:
                parent_stack[-1]['children'].append(current_item)
            else:
                # 如果没有父节点,说明是根节点
                result.append(current_item)

            # 将当前项目添加到父节点栈中
            parent_stack.append(current_item)

        print(result)

        # 保存处理后的目录结构为json文件
        with open("processed_catalog.json", "w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
        print("处理后的目录结构已保存为 processed_catalog.json")

        return result
  • 处理后的目录结构如下,标题目录间有了父子层级关系,您可以据此构建标题目录的树状结构。
[
  {
    "title": "Textin",
    "hierarchy": 1,
    "page_id": 1,
    "paragraph_id": 0,
    "pos": [
      276,
      228,
      467,
      228,
      467,
      272,
      276,
      272
    ],
    "pos_list": [
      [
        276,
        228,
        467,
        228,
        467,
        272,
        276,
        272
      ]
    ],
    "sub_type": "text_title",
    "children": []
  },
  {
    "pos": [
      371,
      382,
      906,
      382,
      906,
      450,
      371,
      450
    ],
    "pos_list": [
      [
        371,
        382,
        906,
        382,
        906,
        450,
        371,
        450
      ]
    ],
    "sub_type": "text_title",
    "title": "标准参考样例",
    "hierarchy": 1,
    "page_id": 1,
    "paragraph_id": 2,
    "children": []
  },
  {
    "paragraph_id": 3,
    "pos": [
      214,
      564,
      974,
      564,
      974,
      657,
      214,
      657
    ],
    "pos_list": [
      [
        214,
        564,
        974,
        564,
        974,
        657,
        214,
        657
      ]
    ],
    "sub_type": "text_title",
    "title": "本科毕业论文模板",
    "hierarchy": 1,
    "page_id": 1,
    "children": []
  },
  {
    "sub_type": "text_title",
    "title": "目录",
    "hierarchy": 1,
    "page_id": 2,
    "paragraph_id": 0,
    "pos": [
      571,
      198,
      685,
      198,
      685,
      228,
      571,
      228
    ],
    "pos_list": [
      [
        571,
        198,
        685,
        198,
        685,
        228,
        571,
        228
      ]
    ],
    "children": []
  },
  {
    "title": "第一章 背景介绍",
    "hierarchy": 1,
    "page_id": 3,
    "paragraph_id": 0,
    "pos": [
      456,
      212,
      733,
      212,
      733,
      241,
      456,
      241
    ],
    "pos_list": [
      [
        456,
        212,
        733,
        212,
        733,
        241,
        456,
        241
      ]
    ],
    "sub_type": "text_title",
    "children": [
      {
        "page_id": 3,
        "paragraph_id": 2,
        "pos": [
          342,
          413,
          657,
          413,
          657,
          440,
          342,
          440
        ],
        "pos_list": [
          [
            342,
            413,
            657,
            413,
            657,
            440,
            342,
            440
          ]
        ],
        "sub_type": "text_title",
        "title": "第1节   模板使用说明",
        "hierarchy": 2,
        "children": [
          {
            "title": "1.1.如何使用样式?",
            "hierarchy": 3,
            "page_id": 3,
            "paragraph_id": 5,
            "pos": [
              179,
              798,
              411,
              798,
              411,
              820,
              179,
              820
            ],
            "pos_list": [
              [
                179,
                798,
                411,
                798,
                411,
                820,
                179,
                820
              ]
            ],
            "sub_type": "text_title",
            "children": []
          }
        ]
      },
      {
        "title": "第2节 如何刷新目录",
        "hierarchy": 2,
        "page_id": 3,
        "paragraph_id": 7,
        "pos": [
          342,
          1019,
          657,
          1019,
          657,
          1046,
          342,
          1046
        ],
        "pos_list": [
          [
            342,
            1019,
            657,
            1019,
            657,
            1046,
            342,
            1046
          ]
        ],
        "sub_type": "text_title",
        "children": [
          {
            "hierarchy": 3,
            "page_id": 3,
            "paragraph_id": 9,
            "pos": [
              177,
              1286,
              747,
              1286,
              747,
              1308,
              177,
              1308
            ],
            "pos_list": [
              [
                177,
                1286,
                747,
                1286,
                747,
                1308,
                177,
                1308
              ]
            ],
            "sub_type": "text_title",
            "title": "2.1.为什么我写了新的章节后没有新的目录项出现?",
            "children": []
          },
          {
            "title": "2.2. 如何排版文章章节",
            "hierarchy": 3,
            "page_id": 4,
            "paragraph_id": 1,
            "pos": [
              176,
              221,
              445,
              221,
              445,
              243,
              176,
              243
            ],
            "pos_list": [
              [
                176,
                221,
                445,
                221,
                445,
                243,
                176,
                243
              ]
            ],
            "sub_type": "text_title",
            "children": []
          },
          {
            "paragraph_id": 4,
            "pos": [
              176,
              522,
              423,
              522,
              423,
              543,
              176,
              543
            ],
            "pos_list": [
              [
                176,
                522,
                423,
                522,
                423,
                543,
                176,
                543
              ]
            ],
            "sub_type": "text_title",
            "title": "2.3. 其他的一些样式",
            "hierarchy": 3,
            "page_id": 4,
            "children": []
          },
          {
            "sub_type": "text_title",
            "title": "2.4.如何使用其他的高级功能?",
            "hierarchy": 3,
            "page_id": 4,
            "paragraph_id": 6,
            "pos": [
              176,
              684,
              531,
              684,
              531,
              704,
              176,
              704
            ],
            "pos_list": [
              [
                176,
                684,
                531,
                684,
                531,
                704,
                176,
                704
              ]
            ],
            "children": []
          }
        ]
      }
    ]
  },
  {
    "title": "第二章 正文要求说明",
    "hierarchy": 1,
    "page_id": 5,
    "paragraph_id": 0,
    "pos": [
      422,
      210,
      766,
      210,
      766,
      241,
      422,
      241
    ],
    "pos_list": [
      [
        422,
        210,
        766,
        210,
        766,
        241,
        422,
        241
      ]
    ],
    "sub_type": "text_title",
    "children": [
      {
        "hierarchy": 2,
        "page_id": 5,
        "paragraph_id": 2,
        "pos": [
          384,
          412,
          642,
          412,
          642,
          439,
          384,
          439
        ],
        "pos_list": [
          [
            384,
            412,
            642,
            412,
            642,
            439,
            384,
            439
          ]
        ],
        "sub_type": "text_title",
        "title": "第1节 字体和大小",
        "children": [
          {
            "hierarchy": 3,
            "page_id": 5,
            "paragraph_id": 3,
            "pos": [
              176,
              496,
              349,
              496,
              349,
              518,
              176,
              518
            ],
            "pos_list": [
              [
                176,
                496,
                349,
                496,
                349,
                518,
                176,
                518
              ]
            ],
            "sub_type": "text_title",
            "title": "1.1.文章标题",
            "children": []
          },
          {
            "pos": [
              176,
              636,
              324,
              636,
              324,
              657,
              176,
              657
            ],
            "pos_list": [
              [
                176,
                636,
                324,
                636,
                324,
                657,
                176,
                657
              ]
            ],
            "sub_type": "text_title",
            "title": "1.2. 章标题",
            "hierarchy": 3,
            "page_id": 5,
            "paragraph_id": 5,
            "children": []
          },
          {
            "hierarchy": 3,
            "page_id": 5,
            "paragraph_id": 7,
            "pos": [
              176,
              775,
              324,
              775,
              324,
              795,
              176,
              795
            ],
            "pos_list": [
              [
                176,
                775,
                324,
                775,
                324,
                795,
                176,
                795
              ]
            ],
            "sub_type": "text_title",
            "title": "1.3.  节标题",
            "children": []
          },
          {
            "title": "1.4.子节标题",
            "hierarchy": 3,
            "page_id": 5,
            "paragraph_id": 9,
            "pos": [
              176,
              913,
              349,
              913,
              349,
              934,
              176,
              934
            ],
            "pos_list": [
              [
                176,
                913,
                349,
                913,
                349,
                934,
                176,
                934
              ]
            ],
            "sub_type": "text_title",
            "children": []
          },
          {
            "title": "1.5.正文",
            "hierarchy": 3,
            "page_id": 5,
            "paragraph_id": 11,
            "pos": [
              176,
              1054,
              300,
              1054,
              300,
              1076,
              176,
              1076
            ],
            "pos_list": [
              [
                176,
                1054,
                300,
                1054,
                300,
                1076,
                176,
                1076
              ]
            ],
            "sub_type": "text_title",
            "children": []
          }
        ]
      }
    ]
  },
  {
    "title": "第三章 公式排版",
    "hierarchy": 1,
    "page_id": 6,
    "paragraph_id": 0,
    "pos": [
      457,
      211,
      733,
      211,
      733,
      241,
      457,
      241
    ],
    "pos_list": [
      [
        457,
        211,
        733,
        211,
        733,
        241,
        457,
        241
      ]
    ],
    "sub_type": "text_title",
    "children": [
      {
        "hierarchy": 2,
        "page_id": 6,
        "paragraph_id": 2,
        "pos": [
          297,
          412,
          731,
          412,
          731,
          441,
          297,
          441
        ],
        "pos_list": [
          [
            297,
            412,
            731,
            412,
            731,
            441,
            297,
            441
          ]
        ],
        "sub_type": "text_title",
        "title": "第1节 Microsoft Equation Editor",
        "children": []
      },
      {
        "page_id": 6,
        "paragraph_id": 4,
        "pos": [
          366,
          641,
          637,
          641,
          637,
          669,
          366,
          669
        ],
        "pos_list": [
          [
            366,
            641,
            637,
            641,
            637,
            669,
            366,
            669
          ]
        ],
        "sub_type": "text_title",
        "title": "第2节   MathType",
        "hierarchy": 2,
        "children": []
      }
    ]
  },
  {
    "pos_list": [
      [
        359,
        163,
        644,
        163,
        644,
        189,
        359,
        189
      ]
    ],
    "sub_type": "text_title",
    "title": "第3节    TeX/LaTeX",
    "hierarchy": 1,
    "page_id": 7,
    "paragraph_id": 0,
    "pos": [
      359,
      163,
      644,
      163,
      644,
      189,
      359,
      189
    ],
    "children": []
  },
  {
    "pos": [
      442,
      210,
      750,
      210,
      750,
      241,
      442,
      241
    ],
    "pos_list": [
      [
        442,
        210,
        750,
        210,
        750,
        241,
        442,
        241
      ]
    ],
    "sub_type": "text_title",
    "title": "第四章 图形和表格",
    "hierarchy": 1,
    "page_id": 8,
    "paragraph_id": 0,
    "children": [
      {
        "pos": [
          428,
          350,
          602,
          350,
          602,
          376,
          428,
          376
        ],
        "pos_list": [
          [
            428,
            350,
            602,
            350,
            602,
            376,
            428,
            376
          ]
        ],
        "sub_type": "text_title",
        "title": "第1节 图形",
        "hierarchy": 2,
        "page_id": 8,
        "paragraph_id": 1,
        "children": [
          {
            "sub_type": "image_title",
            "title": "图表1.1这是一幅牛的图片",
            "hierarchy": 3,
            "page_id": 8,
            "paragraph_id": 4,
            "pos": [
              471,
              847,
              719,
              847,
              719,
              865,
              471,
              865
            ],
            "pos_list": [
              [
                471,
                847,
                719,
                847,
                719,
                865,
                471,
                865
              ]
            ],
            "children": []
          }
        ]
      },
      {
        "pos": [
          401,
          1087,
          602,
          1087,
          602,
          1114,
          401,
          1114
        ],
        "pos_list": [
          [
            401,
            1087,
            602,
            1087,
            602,
            1114,
            401,
            1114
          ]
        ],
        "sub_type": "text_title",
        "title": "第2节    表格",
        "hierarchy": 2,
        "page_id": 8,
        "paragraph_id": 6,
        "children": [
          {
            "sub_type": "table_title",
            "title": "插入表格与图片类似。当然,可以使用Excel预先作一个表格,然后导入进来,但是word本身也可以胜任一部分简单表格的绘制,如:",
            "hierarchy": 3,
            "page_id": 8,
            "paragraph_id": 8,
            "pos": [
              179,
              1167,
              1011,
              1167,
              1011,
              1231,
              179,
              1231
            ],
            "pos_list": [
              [
                179,
                1167,
                1011,
                1167,
                1011,
                1231,
                179,
                1231
              ]
            ],
            "children": []
          },
          {
            "hierarchy": 3,
            "page_id": 8,
            "paragraph_id": 10,
            "pos": [
              479,
              1372,
              713,
              1372,
              713,
              1391,
              479,
              1391
            ],
            "pos_list": [
              [
                479,
                1372,
                713,
                1372,
                713,
                1391,
                479,
                1391
              ]
            ],
            "sub_type": "table_title",
            "title": "表格2.1一个简单的表格",
            "children": []
          }
        ]
      }
    ]
  },
  {
    "sub_type": "text_title",
    "title": "第五章 定理环境",
    "hierarchy": 1,
    "page_id": 9,
    "paragraph_id": 0,
    "pos": [
      456,
      208,
      734,
      208,
      734,
      242,
      456,
      242
    ],
    "pos_list": [
      [
        456,
        208,
        734,
        208,
        734,
        242,
        456,
        242
      ]
    ],
    "children": [
      {
        "pos_list": [
          [
            357,
            452,
            673,
            452,
            673,
            478,
            357,
            478
          ]
        ],
        "sub_type": "text_title",
        "title": "第1节 自定义定理环境",
        "hierarchy": 2,
        "page_id": 9,
        "paragraph_id": 2,
        "pos": [
          357,
          452,
          673,
          452,
          673,
          478,
          357,
          478
        ],
        "children": [
          {
            "page_id": 9,
            "paragraph_id": 4,
            "pos": [
              176,
              581,
              411,
              581,
              411,
              603,
              176,
              603
            ],
            "pos_list": [
              [
                176,
                581,
                411,
                581,
                411,
                603,
                176,
                603
              ]
            ],
            "sub_type": "text_title",
            "title": "定理1.1.对顶角相等。",
            "hierarchy": 3,
            "children": []
          },
          {
            "hierarchy": 3,
            "page_id": 9,
            "paragraph_id": 6,
            "pos": [
              176,
              683,
              581,
              683,
              581,
              706,
              176,
              706
            ],
            "pos_list": [
              [
                176,
                683,
                581,
                683,
                581,
                706,
                176,
                706
              ]
            ],
            "sub_type": "text_title",
            "title": "定理1.2.三边对应相等的三角形全等。",
            "children": []
          }
        ]
      },
      {
        "page_id": 9,
        "paragraph_id": 9,
        "pos": [
          371,
          925,
          631,
          925,
          631,
          951,
          371,
          951
        ],
        "pos_list": [
          [
            371,
            925,
            631,
            925,
            631,
            951,
            371,
            951
          ]
        ],
        "sub_type": "text_title",
        "title": "第2节   已有环境",
        "hierarchy": 2,
        "children": []
      },
      {
        "pos_list": [
          [
            357,
            1255,
            645,
            1255,
            645,
            1282,
            357,
            1282
          ]
        ],
        "sub_type": "text_title",
        "title": "第3节   自定义环境",
        "hierarchy": 2,
        "page_id": 9,
        "paragraph_id": 14,
        "pos": [
          357,
          1255,
          645,
          1255,
          645,
          1282,
          357,
          1282
        ],
        "children": []
      }
    ]
  }
]