Document Parsing - Textin Intelligent Document Parsing

{
  "code": 200,
  "message": "success",
  "result": {
    "markdown": "# hello markdown",
    "detail": [
      {
        "page_id": 1,
        "paragraph_id": 123,
        "outline_level": -1,
        "text": "hello markdown",
        "position": [
          217,
          390,
          1336,
          390,
          1336,
          460,
          217,
          460
        ],
        "content": 0,
        "type": "paragraph",
        "origin_position": [
          217,
          390,
          1336,
          390,
          1336,
          460,
          217,
          460
        ],
        "sub_type": "catalog",
        "image_url": "<string>",
        "tags": [
          "formula",
          "handwritten"
        ],
        "caption_id": {
          "page_id": 123,
          "paragraph_id": 123
        },
        "cells": [
          {
            "row": 123,
            "col": 123,
            "row_span": 123,
            "col_span": 123,
            "position": [
              10,
              10,
              100,
              10,
              100,
              50,
              10,
              50
            ],
            "origin_position": [
              123
            ],
            "text": "<string>",
            "type": "<string>"
          }
        ],
        "split_section_page_ids": [
          1,
          2,
          3
        ],
        "split_section_positions": [
          [
            0,
            0,
            100,
            100,
            100,
            200,
            0,
            200
          ],
          [
            0,
            0,
            100,
            100,
            100,
            200,
            0,
            200
          ],
          [
            0,
            0,
            100,
            100,
            100,
            200,
            0,
            200
          ]
        ],
        "stamp": {
          "value": "<string>",
          "stamp_shape": "<string>",
          "type": "<string>",
          "color": "<string>"
        }
      }
    ],
    "pages": [
      {
        "status": "success",
        "page_id": 0,
        "durations": 612.5,
        "image_id": "90u12adcad08r2",
        "origin_image_id": "90u12adcad08r2",
        "base64": "<string>",
        "origin_base64": "<string>",
        "width": 123,
        "height": 123,
        "angle": 123,
        "content": [
          {
            "id": 123,
            "type": "line",
            "text": "<string>",
            "pos": [
              123
            ],
            "origin_position": [
              123
            ],
            "direction": 123,
            "score": 0.5,
            "char_pos": [
              [
                123
              ]
            ]
          }
        ],
        "raw_ocr": [
          {
            "text": "这是一个例子。",
            "score": 0.99,
            "type": "text",
            "position": [
              10,
              10,
              100,
              10,
              100,
              50,
              10,
              50
            ],
            "angle": 123,
            "direction": 1,
            "handwritten": 1,
            "char_scores": [
              0.99,
              0.98,
              0.95,
              0.95,
              0.99,
              0.93,
              0.87
            ],
            "char_centers": [
              [
                20,
                10
              ],
              [
                30,
                10
              ],
              [
                40,
                10
              ],
              [
                50,
                10
              ],
              [
                60,
                10
              ],
              [
                70,
                10
              ],
              [
                80,
                10
              ]
            ],
            "char_positions": [
              [
                18,
                8,
                22,
                8,
                22,
                12,
                18,
                12
              ],
              [
                28,
                88,
                32,
                8,
                32,
                12,
                28,
                12
              ],
              [
                38,
                88,
                42,
                8,
                42,
                12,
                38,
                12
              ],
              [
                48,
                88,
                52,
                8,
                52,
                12,
                48,
                12
              ],
              [
                58,
                88,
                62,
                8,
                62,
                12,
                58,
                12
              ],
              [
                68,
                88,
                72,
                8,
                72,
                12,
                68,
                12
              ],
              [
                78,
                88,
                82,
                8,
                82,
                12,
                78,
                12
              ]
            ],
            "char_candidates": [
              [
                "这"
              ],
              [
                "是"
              ],
              [
                "一",
                "-"
              ],
              [
                "个"
              ],
              [
                "例"
              ],
              [
                "子"
              ],
              [
                "。",
                "O"
              ]
            ],
            "char_candidates_score": [
              [
                0.99
              ],
              [
                0.99
              ],
              [
                0.95,
                0.05
              ],
              [
                0.99
              ],
              [
                0.99
              ],
              [
                0.99
              ],
              [
                0.89,
                0.11
              ]
            ]
          }
        ],
        "structured": [
          {
            "type": "textblock",
            "pos": [
              123
            ],
            "content": [
              0,
              1,
              2
            ],
            "origin_position": [
              123
            ],
            "sub_type": "text",
            "continue": true,
            "next_page_id": 2,
            "next_para_id": 1,
            "text": "<string>",
            "outline_level": -1
          }
        ]
      }
    ],
    "catalog": {
      "toc": [
        [
          {
            "hierarchy": 2,
            "title": "1.公司简介和主要财务指标",
            "page_id": 3,
            "pos": [
              10,
              10,
              100,
              10,
              100,
              50,
              10,
              50
            ]
          },
          {
            "hierarchy": 3,
            "title": "1.1 公司简介",
            "page_id": 4,
            "pos": [
              10,
              10,
              100,
              10,
              100,
              50,
              10,
              50
            ]
          }
        ]
      ]
    },
    "total_page_number": 10,
    "valid_page_number": 3,
    "excel_base64": "",
    "success_count": 1,
    "elements": [
      {
        "element_id": "",
        "type": "NarrativeText",
        "text": "xParse 是一个端到端文档处理 AI 基础设施",
        "metadata": {
          "page_image_url": "https://web-api.textin.com/ocr_image/external/01a91572ca81092c.jpg",
          "angle": 0,
          "page_number": 1,
          "page_width": 600,
          "page_height": 800,
          "coordinates": [
            0.182212,
            0.231623,
            0.671714,
            0.231633,
            0.671711,
            0.273233,
            0.182244,
            0.273255
          ],
          "is_continue": false,
          "category_depth": -1,
          "parent_id": "",
          "original_image_url": "",
          "sub_type": "stamp",
          "image_url": "https://web-api.textin.com/ocr_image/external/e47f8aed69ccabce.jpg",
          "image_base64": ""
        }
      }
    ]
  },
  "version": "2.1.0",
  "duration": 999,
  "metrics": [
    {
      "page_image_width": 1024,
      "page_image_height": 768,
      "durations": 123,
      "status": "<string>",
      "page_id": 123,
      "angle": 90,
      "dpi": 72,
      "image_id": "<string>"
    }
  ]
}
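
The `position` arrays in the sample response hold eight numbers, which read as four (x, y) corner points of a quadrilateral. That reading is an assumption drawn from the sample values, not something the page states. Under it, a minimal sketch for collapsing such an array into an axis-aligned bounding box could look like this (`quad_to_bbox` is a hypothetical helper name):

```python
from typing import Sequence, Tuple


def quad_to_bbox(position: Sequence[float]) -> Tuple[float, float, float, float]:
    """Collapse an 8-number quadrilateral into (left, top, right, bottom).

    Assumes the array is laid out as x1, y1, x2, y2, x3, y3, x4, y4.
    """
    if len(position) != 8:
        raise ValueError("expected 8 numbers (four x, y pairs)")
    xs = position[0::2]  # every even index is an x coordinate
    ys = position[1::2]  # every odd index is a y coordinate
    return min(xs), min(ys), max(xs), max(ys)


# The "position" value from the sample response above.
sample = [217, 390, 1336, 390, 1336, 460, 217, 460]
print(quad_to_bbox(sample))
```

Taking the minimum and maximum rather than trusting a fixed corner order keeps the helper usable for rotated pages, where the first point may not be the top-left corner.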

POST

service

pdf_to_markdown

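Authorization

x-ti-app-id

string

header

Required

Log in to Textin and go to "Workbench - Account Settings - Developer Information" to find your x-ti-app-id.

x-ti-secret-code

string

header

Required

Log in to Textin and go to "Workbench - Account Settings - Developer Information" to find your x-ti-secret-code.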
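
Both credentials travel as request headers. The sketch below shows one way to assemble them without hard-coding secrets. The environment variable names `TEXTIN_APP_ID` and `TEXTIN_SECRET_CODE` and the helper name are assumptions made for illustration; only the two header names come from this page.

```python
import os
from typing import Dict, Mapping


def textin_auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the two authorization headers from environment-style settings.

    TEXTIN_APP_ID and TEXTIN_SECRET_CODE are hypothetical variable names.
    """
    app_id = env.get("TEXTIN_APP_ID", "")
    secret_code = env.get("TEXTIN_SECRET_CODE", "")
    if not app_id or not secret_code:
        raise RuntimeError("set TEXTIN_APP_ID and TEXTIN_SECRET_CODE first")
    return {"x-ti-app-id": app_id, "x-ti-secret-code": secret_code}
```

Failing early on an empty value avoids a round trip that would otherwise come back with status code 40101.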

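Query Parameters

parse_mode

enum<string>

Default: scan

Document parsing mode. Defaults to scan.

auto: the engine selects the mode automatically; the most widely applicable option
scan: every document is parsed as an image
lite: lightweight mode; outputs only table and text results
parse: parses the text of digital (non-scanned) documents only; the fastest mode

Available options: auto, scan, lite, parse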

pdf_pwd

string

The password, required when the PDF is encrypted. Note: if you wrap this API for a front end, you are responsible for protecting the password.

page_start

integer

Default: 0

When a PDF is uploaded, the page at which parsing starts. If omitted, parsing starts from the first page.

page_count

integer

Default: 1000

When a PDF is uploaded, the number of PDF pages to convert. The total may not exceed 1000 pages; defaults to 1000.
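
For documents longer than the 1000-page limit, `page_start` and `page_count` can be combined to parse the file in slices. The sketch below only computes the parameter pairs. It assumes `page_start` is zero-based, which is inferred from its default of 0 meaning the first page; `page_batches` is a hypothetical helper name.

```python
from typing import Iterator, Tuple

MAX_PAGES_PER_REQUEST = 1000  # limit stated for page_count


def page_batches(
    total_pages: int, batch_size: int = MAX_PAGES_PER_REQUEST
) -> Iterator[Tuple[int, int]]:
    """Yield (page_start, page_count) pairs that cover total_pages."""
    if not 1 <= batch_size <= MAX_PAGES_PER_REQUEST:
        raise ValueError("batch_size must be between 1 and 1000")
    for start in range(0, total_pages, batch_size):
        # The last slice may be shorter than batch_size.
        yield start, min(batch_size, total_pages - start)


print(list(page_batches(2500)))
# [(0, 1000), (1000, 1000), (2000, 500)]
```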

dpi

enum<integer>

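Default: 144

Coordinate basis for PDF documents. Defaults to 144 dpi and is linked to the parse_mode parameter:

parse_mode=auto: the default is chosen dynamically; 72, 144, and 216 are supported
parse_mode=scan: the default is 144; 72, 144, and 216 are supported

Available options: 72, 144, 216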
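
Because the dpi value sets the coordinate basis, positions returned at one dpi may need rescaling before they are drawn onto a page rendered at another. The sketch below assumes that returned coordinates scale linearly with the dpi value, which this page implies but does not spell out. PDF user space uses 72 units per inch, so `to_dpi=72` converts to PDF points. `rescale_position` is a hypothetical helper name.

```python
from typing import List, Sequence


def rescale_position(
    position: Sequence[float], from_dpi: int, to_dpi: int = 72
) -> List[float]:
    """Rescale a flat coordinate array from one dpi basis to another.

    Assumes coordinates grow linearly with the dpi used for parsing.
    """
    if from_dpi <= 0 or to_dpi <= 0:
        raise ValueError("dpi values must be positive")
    factor = to_dpi / from_dpi
    return [value * factor for value in position]


# Coordinates produced at the default 144 dpi, converted to PDF points.
print(rescale_position([217, 390, 1336, 390], from_dpi=144))
# [108.5, 195.0, 668.0, 195.0]
```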

apply_document_tree

enum<integer>

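Default: 1

Whether to generate heading levels in the Markdown. Defaults to 1 (generate headings).

0: do not generate headings; the catalog field is also not returned
1: generate headings

Available options: 0, 1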

table_flavor

enum<string>

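Default: html

Table format in the Markdown. Defaults to html, which outputs tables in HTML syntax.

md: output tables in Markdown syntax
html: output tables in HTML syntax
none: skip table recognition and treat table images as ordinary text paragraphs

Available options: md, html, none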

get_image

enum<string>

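Default: none

Which images to return for the Markdown. Defaults to none (no images are returned).

none: return no images
page: return the full-page image of each page, i.e. the complete image of each PDF page
objects: return the sub-images within each page, i.e. the individual images inside each PDF page
both: return both full-page images and image objects

Available options: none, page, objects, both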

image_output_type

enum<string>

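Default: default

Output type of the image objects returned by the engine. By default, sub-images are returned as URLs and page images as IDs.

base64str: return all image objects as base64 strings. Suitable for users without cloud storage, but the response body becomes very large. When the number of recognized pages (page_count) exceeds 1000, base64 output is not supported and results are returned in the default format.
default: return sub-images as image URLs and page images as image IDs

Available options: base64str, default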

paratext_mode

enum<string>

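Display mode for non-body text in the Markdown. Defaults to annotation. Non-body content includes headers, footers, and text inside sub-images.

none: do not display
annotation: insert into the Markdown as comments. For images in headers and footers, only the text is kept; the image base64 or URL is discarded.
body: insert into the Markdown as body text

Available options: none, annotation, body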

formula_level

enum<integer>

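Default: 0

Formula recognition level. Defaults to 0 (recognize all formulas). Recognized formulas are output as LaTeX expressions.

0: recognize all formulas
1: recognize display formulas only; inline formulas are not recognized
2: do not recognize formulas

Available options: 0, 1, 2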

underline_level

enum<integer>

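Default: 0

Controls the scope of underline recognition. Defaults to 0 (no recognition).

0: do not recognize underlines
1: recognize only underlines without text; available in scan mode only
2: recognize all underlines; available in scan mode only

Available options: 0, 1, 2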

apply_merge

enum<integer>

Default: 1

Whether to merge paragraphs and tables. Defaults to 1 (merge paragraphs and tables).

0: do not merge
1: merge

Available options: 0, 1

apply_image_analysis

enum<integer>

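Default: 0

Uses a large model to analyze the sub-images in the document. The analysis result is output as Markdown and replaces the recognized text of the sub-image. Defaults to 0 (no image analysis).

0: do not analyze images
1: analyze images

Available options: 0, 1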

markdown_details

enum<integer>

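Default: 1

Whether to return the detail field in the result. Defaults to 1: the detail field is returned and holds detailed information about each type of Markdown element.

0: do not generate
1: generate

Available options: 0, 1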

page_details

enum<integer>

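Default: 1

Whether to return the pages field in the result. Defaults to 1: the pages field is returned and holds more detailed parse results for each page.

Available options: 0, 1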

raw_ocr

enum<integer>

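Default: 0

Whether to return the full text recognition results (including character coordinates) in the raw_ocr field. Defaults to 0 (not returned). Linked to the page_details parameter: nothing is returned when page_details is 0 or false.

0: do not return
1: return

Available options: 0, 1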

char_details

enum<integer>

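Default: 0

Whether to return the char_pos field (the position of each character) and the char_-prefixed fields in raw_ocr. Defaults to 0 (not returned).

0: do not return
1: return

Available options: 0, 1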
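
With raw_ocr and char_details enabled, the per-character scores can be used to flag characters that may need manual review. The sketch below assumes `char_scores` lines up one-to-one with the characters of `text`, as it does in the sample response (seven scores for seven characters). The 0.9 threshold and the helper name are arbitrary choices for illustration.

```python
from typing import Any, Dict, List, Tuple


def low_confidence_chars(
    line: Dict[str, Any], threshold: float = 0.9
) -> List[Tuple[int, str, float]]:
    """Return (index, character, score) for characters scored below threshold.

    Assumes char_scores[i] belongs to text[i].
    """
    text = line.get("text", "")
    scores = line.get("char_scores", [])
    return [
        (index, char, score)
        for index, (char, score) in enumerate(zip(text, scores))
        if score < threshold
    ]


# Values taken from the raw_ocr entry in the sample response.
line = {
    "text": "这是一个例子。",
    "char_scores": [0.99, 0.98, 0.95, 0.95, 0.99, 0.93, 0.87],
}
print(low_confidence_chars(line))
```

For flagged positions, the matching entries in `char_candidates` and `char_candidates_score` list the alternative readings the engine considered.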

catalog_details

enum<integer>

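Default: 0

Whether to return the catalog field, which holds table-of-contents information. Linked to the apply_document_tree parameter: the field is not returned when apply_document_tree is 0.

0: do not return
1: return

Available options: 0, 1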
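
When the catalog field is returned, its `toc` entries can be turned into a readable outline. The sketch below follows the shape of the sample response, where `toc` is a list of groups and each entry carries `hierarchy`, `title`, and `page_id`. Indenting by `hierarchy - 1` levels is an assumption about how the hierarchy value should be read, and `render_outline` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def render_outline(catalog: Dict[str, Any]) -> List[str]:
    """Render catalog["toc"] as indented outline lines.

    Assumes toc is a list of groups, each a list of entries.
    """
    lines: List[str] = []
    for group in catalog.get("toc", []):
        for entry in group:
            indent = "  " * max(entry["hierarchy"] - 1, 0)
            lines.append(f'{indent}{entry["title"]} (page_id {entry["page_id"]})')
    return lines


# Entries taken from the catalog in the sample response (pos omitted).
catalog = {
    "toc": [
        [
            {"hierarchy": 2, "title": "1.公司简介和主要财务指标", "page_id": 3},
            {"hierarchy": 3, "title": "1.1 公司简介", "page_id": 4},
        ]
    ]
}
print("\n".join(render_outline(catalog)))
```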

get_excel

enum<integer>

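Default: 0

Whether to return the Excel result as base64 in the excel_base64 field, which can be post-processed and saved as an Excel file. Defaults to 0 (not returned).

0: do not return
1: return

Available options: 0, 1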
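
The post-processing mentioned for excel_base64 amounts to decoding the string and writing the bytes to disk. A minimal sketch, assuming the field sits under `result` as in the sample response and is an empty string when no Excel output was produced (the helper names and the output path are hypothetical):

```python
import base64
from pathlib import Path
from typing import Any, Dict


def excel_bytes(response: Dict[str, Any]) -> bytes:
    """Decode result.excel_base64; returns b"" when the field is empty."""
    encoded = response.get("result", {}).get("excel_base64", "")
    return base64.b64decode(encoded) if encoded else b""


def save_excel(response: Dict[str, Any], out_path: str) -> bool:
    """Write the decoded workbook to out_path; False if there was nothing to save."""
    data = excel_bytes(response)
    if not data:
        return False
    Path(out_path).write_bytes(data)
    return True
```

A call such as `save_excel(response, "tables.xlsx")` would then be made on a response obtained with get_excel=1.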

crop_dewarp

enum<integer>

Default: 0

Whether to apply edge cropping and dewarping. Defaults to 0 (disabled).

0: do not crop and dewarp
1: crop and dewarp

Available options: 0, 1

remove_watermark

enum<integer>

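Default: 0

Whether to remove watermarks. Defaults to 0 (watermarks are kept).

0: do not remove watermarks
1: remove watermarks

Available options: 0, 1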

apply_chart

enum<integer>

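Default: 0

Whether to enable chart recognition. When enabled, recognized charts are output as tables. Defaults to 0 (disabled).

0: disable chart recognition
1: enable chart recognition

Available options: 0, 1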

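Request Body

Two request formats are supported:

Content-Type: application/octet-stream

Supported file formats: png, jpg, jpeg, pdf, bmp, tiff, webp, doc, docx, html, mhtml, xls, xlsx, csv, ppt, pptx, txt, ofd, rtf.
- For xls/xlsx/csv files, each sheet may contain at most 2000 rows and 100 columns.
- For txt files, the file size must not exceed 100k.
- The request body is the binary stream of the local file, not FormData or any other format.
- The file size must not exceed 500M.
- Images with an aspect ratio below 2 must have a width and height within 20 to 20000 pixels; all other images must have a width and height within 20 to 10000 pixels.

Content-Type: text/plain

The request body is text containing the URL of an online file (http and https are supported).
- The online file size must not exceed 500M.
- Images with an aspect ratio below 2 must have a width and height within 20 to 20000 pixels; all other images must have a width and height within 20 to 10000 pixels.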
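
The image dimension rule listed for both request formats can be checked locally before uploading. The sketch below assumes "aspect ratio" means the longer side divided by the shorter side, which this page does not define; `image_size_ok` is a hypothetical helper name.

```python
def image_size_ok(width: int, height: int) -> bool:
    """Check the documented pixel limits for an image.

    Assumes aspect ratio = longer side / shorter side.
    """
    if width <= 0 or height <= 0:
        return False
    ratio = max(width, height) / min(width, height)
    # Ratio below 2 allows up to 20000 px per side, otherwise 10000 px.
    upper = 20000 if ratio < 2 else 10000
    return 20 <= width <= upper and 20 <= height <= upper


print(image_size_ok(15000, 12000))  # True: ratio 1.25, limit 20000
print(image_size_ok(15000, 5000))   # False: ratio 3, limit 10000
```

An image rejected here would be expected to come back from the API with status code 40304.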

The body is of type file.
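
Putting the headers, query parameters, and the two body formats together, a request can be assembled with the Python standard library. The sketch below only builds the request and does not send it. The full endpoint URL is an assumption, since this page shows only the `service` and `pdf_to_markdown` path segments; `build_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request
from typing import Dict, Optional, Union

# Assumed endpoint; the page shows only "service" and "pdf_to_markdown".
API_URL = "https://api.textin.com/ai/service/v1/pdf_to_markdown"


def build_request(
    source: Union[bytes, str],
    app_id: str,
    secret_code: str,
    params: Optional[Dict[str, object]] = None,
) -> urllib.request.Request:
    """Build a POST request for local file bytes or an online file URL."""
    if isinstance(source, bytes):
        # Local file: send the raw binary stream, not form data.
        body, content_type = source, "application/octet-stream"
    else:
        # Online file: the body is the URL as plain text.
        body, content_type = source.encode("utf-8"), "text/plain"
    query = urllib.parse.urlencode(params or {})
    url = f"{API_URL}?{query}" if query else API_URL
    headers = {
        "x-ti-app-id": app_id,
        "x-ti-secret-code": secret_code,
        "Content-Type": content_type,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The returned object can be passed to `urllib.request.urlopen`, and the response body parsed with `json.load`.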

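Response

200 - application/json

Parse result

Returns Markdown and structured data

code

enum<integer>

Default: 200

Required

Status code

200: Success
40101: x-ti-app-id or x-ti-secret-code is empty
40102: x-ti-app-id or x-ti-secret-code is invalid; authentication failed
40103: Client IP is not on the whitelist
40003: Insufficient balance; please top up before continuing
40004: Parameter error; check the technical documentation and the parameters you passed
40007: Robot does not exist or has not been published
40008: Robot has not been activated; activate it in the marketplace and retry
40301: Unsupported image type
40302: Uploaded file size is out of range; the file must not exceed 500M
40303: Unsupported file type; the API returns the file type it actually detected, e.g. "current file type is .gif"
40304: Image dimensions are out of range; images with an aspect ratio below 2 must have a width and height within 20 to 20000 pixels, all other images within 20 to 10000 pixels
40305: No file was uploaded for recognition
40422: File corrupted (The file is corrupted.)
40423: Wrong PDF password (Password required or incorrect password.)
40424: Page settings exceed the file's page range (Page number out of range.)
40425: Unsupported file format (The input file format is not supported.)
40427: DPI parameter is not in the supported list (Input DPI is not in the allowed DPIs list(72,144,216).)
40428: Word or PPT to PDF conversion failed or timed out (Process office file failed.)
50207: Some pages failed to parse (Partial failed)
40400: Invalid request URL; check that the link is correct
30203: Underlying service failure; please retry later
500: Internal server error

Available options: 200, 40101, 40102, 40103, 40003, 40004, 40007, 40008, 40301, 40302, 40303, 40304, 40305, 40422, 40423, 40424, 40425, 40427, 40428, 50207, 40400, 30203, 500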
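
Since failures are reported through the `code` field of the JSON body, client code typically checks it before touching `result`. The following is a minimal sketch of that check. The class and function names are hypothetical, and marking only 30203 as retryable reflects its "please retry later" description rather than any documented retry policy.

```python
from typing import Any, Dict

RETRYABLE_CODES = {30203}  # described as "please retry later" in the status list


class TextinError(Exception):
    """Raised when the response carries a code other than 200."""

    def __init__(self, code: Any, message: str) -> None:
        super().__init__(f"Textin error {code}: {message}")
        self.code = code
        self.retryable = code in RETRYABLE_CODES


def unwrap_result(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return response["result"] on success, otherwise raise TextinError."""
    code = response.get("code")
    if code != 200:
        raise TextinError(code, response.get("message", ""))
    return response["result"]
```

This sketch treats 50207 (partial failure) as an error as well; a caller that wants to keep the pages that did parse would need to handle that code separately.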

message

string

Required

Error message

Example:

"success"

result

object

Required

version

string

Required

doc_restore engine version

Example:

"2.1.0"

duration

integer

Required

Engine processing time (milliseconds)

Example:

999

metrics

object[]

Required

Per-page information
