Extracting Text from PDF Documents

2025-08-26 03:15:01

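There are two common approaches to parsing PDF documents: reading the text directly from the document structure, or running OCR on the rendered pages. This article covers the first approach. Commonly used third-party libraries for it include MuPDF, PDFium, and Aspose. MuPDF and PDFium are open source and free to use (MuPDF is licensed under the AGPL, with a commercial license available, so check the terms before shipping it in a closed-source product). Aspose is a paid commercial library. The sections below show how to use each one.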

MuPDF: building the library and linking it into a project is not covered here. The focus is on using it to extract text. Sample code:

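// MuPDF reports errors through fz_try/fz_catch; error handling is omitted for brevity.
std::string strPath = "pdf.pdf";
fz_context* ctx = fz_new_context(NULL, NULL, FZ_STORE_UNLIMITED);
fz_register_document_handlers(ctx);
fz_document* doc = fz_open_document(ctx, strPath.c_str());
int nCount = fz_count_pages(ctx, doc);
if (nCount > 0)
{
    // Only the first page is extracted here; loop over the page index to handle the others.
    fz_page* page = fz_load_page(ctx, doc, 0); // load the first page (index 0)
    fz_stext_page* text_page = fz_new_stext_page(ctx, fz_bound_page(ctx, page));
    fz_device* device = fz_new_stext_device(ctx, text_page, NULL);
    fz_run_page(ctx, page, device, fz_identity, NULL);
    fz_close_device(ctx, device);

    SStringW sstrData;
    for (fz_stext_block* block = text_page->first_block; block; block = block->next)
    {
        if (block->type != FZ_STEXT_BLOCK_TEXT)
            continue;
        for (fz_stext_line* line = block->u.t.first_line; line; line = line->next)
        {
            for (fz_stext_char* ch = line->first_char; ch; ch = ch->next)
            {
                // Get the character (ch->c is a Unicode code point)
                SStringW sstrChar;
                sstrChar.Format(L"%c", static_cast<wchar_t>(ch->c));
                sstrData += sstrChar;
                // Get the font name
                std::string strFont = fz_font_name(ctx, ch->font);
                // Other attributes such as color, size, and position can be read the same way;
                // see the fz_stext_char struct for what is available.
                // TODO:
            }
            sstrData += L"\n";
        }
    }

    // Release resources in reverse order of creation
    fz_drop_device(ctx, device);
    fz_drop_stext_page(ctx, text_page);
    fz_drop_page(ctx, page);
}
fz_drop_document(ctx, doc);
fz_drop_context(ctx);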
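In the MuPDF loop above, ch->c holds a Unicode code point, so formatting it with %c into a wide string can lose characters outside the Basic Multilingual Plane on platforms where wchar_t is 16 bits wide. The following is a minimal sketch of one alternative, assuming the goal is a UTF-8 std::string. The helper name AppendUtf8 is hypothetical and not part of MuPDF; MuPDF also provides its own fz_runetochar for the same purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a MuPDF API): append one Unicode code point,
// such as the value of fz_stext_char::c, to a UTF-8 encoded std::string.
void AppendUtf8(std::string& out, std::uint32_t cp)
{
    // Surrogates and out-of-range values are replaced with U+FFFD.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80) {                       // 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Inside the character loop this would be called as AppendUtf8(strData, static_cast<std::uint32_t>(ch->c)), with strData being a std::string that replaces the SStringW buffer.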

PDFium sample code:

std::string strPath = "pdf.pdf";
FPDF_InitLibrary();
FPDF_DOCUMENT document = FPDF_LoadDocument(strPath.c_str(), nullptr);
if (!document)
{
    // error: FPDF_GetLastError() returns the reason
}
else
{
    int nCount = FPDF_GetPageCount(document);
    if (nCount > 0)
    {
        // Only the first page is extracted here; loop over the page index to handle the others.
        FPDF_PAGE page = FPDF_LoadPage(document, 0); // load the first page (index 0)
        if (page)
        {
            std::wstring wstrText;
            FPDF_TEXTPAGE text_page = FPDFText_LoadPage(page);
            if (text_page)
            {
                int char_count = FPDFText_CountChars(text_page);
                for (int i = 0; i < char_count; ++i)
                {
                    // FPDFText_GetUnicode returns the character's Unicode value as unsigned int.
                    unsigned int ch = FPDFText_GetUnicode(text_page, i);
                    // Note: this cast truncates values above 0xFFFF where wchar_t is 16 bits wide.
                    wstrText += static_cast<wchar_t>(ch);
                }
                FPDFText_ClosePage(text_page);
            }
            FPDF_ClosePage(page);
        }
    }
    FPDF_CloseDocument(document);
}
FPDF_DestroyLibrary();
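The PDFium sample stores each character in a single wchar_t. FPDFText_GetUnicode is declared to return unsigned int, so if it yields a value above U+FFFF on a platform with a 16-bit wchar_t (such as Windows), one wchar_t cannot hold it. Below is a hedged sketch of a portable append. AppendCodePoint is a hypothetical helper, not a PDFium function, and whether a given PDFium build actually returns such values for a document is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a PDFium API): append a Unicode code point to a
// std::wstring, emitting a UTF-16 surrogate pair when wchar_t is 16 bits wide.
void AppendCodePoint(std::wstring& out, std::uint32_t cp)
{
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out += static_cast<wchar_t>(0xD800 + (cp >> 10));   // high surrogate
        out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
    } else {
        out += static_cast<wchar_t>(cp);
    }
}
```

In the extraction loop, the line that appends to wstrText would become AppendCodePoint(wstrText, FPDFText_GetUnicode(text_page, i)).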

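In practice, each block that MuPDF returns corresponds to one paragraph of text. PDFium, by contrast, returns all the text on the page as one sequence of characters. To split it into paragraphs, you have to group the characters yourself based on their position information.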
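The position-based grouping described above could look roughly like the sketch below. It is a heuristic under stated assumptions, not PDFium's or the author's method: characters arrive in reading order, each one carries its baseline y coordinate and font size (with PDFium these could be filled from FPDFText_GetCharOrigin and FPDFText_GetFontSize), and a vertical gap of more than about 1.8 font sizes separates paragraphs. The struct PositionedChar, the function SplitParagraphs, and the thresholds are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical record: one extracted character plus the position data
// the heuristic needs. PDF page space is assumed (y grows upwards).
struct PositionedChar {
    char32_t ch;       // Unicode code point
    double baselineY;  // y coordinate of the glyph origin
    double fontSize;   // font size in points
};

// Split characters (given in reading order) into paragraphs.
// A baseline drop larger than paragraphGap * fontSize, or any jump upwards,
// starts a new paragraph. A smaller drop is treated as a line wrap inside
// the same paragraph and joined with a space (drop the space for CJK text).
std::vector<std::u32string> SplitParagraphs(
    const std::vector<PositionedChar>& chars, double paragraphGap = 1.8)
{
    std::vector<std::u32string> paragraphs;
    const PositionedChar* prev = nullptr;
    for (const PositionedChar& c : chars) {
        // Skip line-break characters that the extractor may have generated.
        if (c.ch == U'\r' || c.ch == U'\n')
            continue;
        if (prev == nullptr) {
            paragraphs.emplace_back();
        } else {
            const double drop = prev->baselineY - c.baselineY;
            const bool sameLine = std::fabs(drop) < 0.5 * prev->fontSize;
            if (!sameLine) {
                if (drop < 0 || drop > paragraphGap * prev->fontSize)
                    paragraphs.emplace_back();  // new paragraph
                else
                    paragraphs.back() += U' ';  // wrapped line
            }
        }
        paragraphs.back() += c.ch;
        prev = &c;
    }
    return paragraphs;
}
```

Real documents with multiple columns, superscripts, or tables will need more than a single threshold, so the gap factor should be tuned against representative files.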

Aspose sample code:

if (!System::IO::File::Exists(u"Example1.pdf"))
{
    // The file does not exist
}
auto extractor = MakeObject<Facades::PdfExtractor>();
extractor->BindPdf(u"Example1.pdf");
// Extract the text, then copy it into a memory stream
extractor->ExtractText();
auto memStream = MakeObject<System::IO::MemoryStream>();
extractor->GetText(memStream);
// Decode the stream contents into a string
auto unicode = System::Text::Encoding::get_Unicode();
String allText = unicode->GetString(memStream->ToArray());
