pdf-extract-kit, paddlepaddle, OCR: pdf2markdown.py (poor results)
- 2025-09-04 06:30:01

GitHub - opendatalab/PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction
github.com/opendatalab/PDF-Extract-Kit
Problems encountered when running pdf2markdown.py:
Error:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle_infer::Predictor::Predictor(paddle::AnalysisConfig const&)
1 std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
2 paddle::AnalysisPredictor::Init(std::shared_ptr<paddle::framework::Scope> const&, std::shared_ptr<paddle::framework::ProgramDesc> const&)
3 paddle::AnalysisPredictor::PrepareProgram(std::shared_ptr<paddle::framework::ProgramDesc> const&)
4 paddle::AnalysisPredictor::OptimizeInferenceProgram()
5 paddle::inference::analysis::Analyzer::RunAnalysis(paddle::inference::analysis::Argument*)
6 paddle::inference::analysis::IrAnalysisPass::RunImpl(paddle::inference::analysis::Argument*)
7 paddle::inference::analysis::IRPassManager::Apply(std::unique_ptr<paddle::framework::ir::Graph, std::default_delete<paddle::framework::ir::Graph> >)
8 paddle::framework::ir::Pass::Apply(paddle::framework::ir::Graph*) const
9 paddle::framework::ir::SelfAttentionFusePass::ApplyImpl(paddle::framework::ir::Graph*) const
10 paddle::framework::ir::GraphPatternDetector::operator()(paddle::framework::ir::Graph*, std::function<void (std::map<paddle::framework::ir::PDNode*, paddle::framework::ir::Node*, paddle::framework::ir::GraphPatternDetector::PDNodeCompare, std::allocator<std::pair<paddle::framework::ir::PDNode* const, paddle::framework::ir::Node*> > > const&, paddle::framework::ir::Graph*)>)

----------------------
Error Message Summary:
----------------------
FatalError: `Illegal instruction` is detected by the operating system.
  [TimeInfo: *** Aborted at 1739780413 (unix time) try "date -d @1739780413" if you are using GNU date ***]
  [SignalInfo: *** SIGILL (@0x7f024e84e31a) received by PID 667042 (TID 0x7f0354c40740) from PID 1317331738 ***]

Solution: install paddlepaddle==2.5.2.
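Since the fix above is a version pin, it can help to confirm which paddlepaddle build is actually installed before rerunning the script. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `matches_pin` are made up for illustration and are not part of PDF-Extract-Kit or Paddle.

```python
from importlib import metadata

# Version this post reports as working; adjust if your setup differs.
PINNED_VERSION = "2.5.2"


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(found, pinned=PINNED_VERSION):
    """True only when a version was found and equals the pinned one."""
    return found is not None and found == pinned


if __name__ == "__main__":
    found = installed_version("paddlepaddle")
    if matches_pin(found):
        print(f"paddlepaddle {found} matches the pinned version")
    else:
        print(f"paddlepaddle is {found!r}, expected {PINNED_VERSION}")
```

If the check fails, reinstalling with `pip install paddlepaddle==2.5.2` is the step the post describes.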
Error:
File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/tools/infer/predict_rec.py", line 628, in __call__
    rec_result = self.postprocess_op(preds)
File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 121, in __call__
    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 83, in decode
    char_list = [
File "/usr/local/py310/lib/python3.10/site-packages/paddleocr/ppocr/postprocess/rec_postprocess.py", line 84, in <listcomp>
    self.character[text_id]
IndexError: list index out of range

Solution: in pdf2markdown.yaml, under ocr: model_config:, set lang: to ch instead of en.
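The traceback fails at `self.character[text_id]`, which suggests the recognition model emitted a class index larger than the loaded character dictionary. A plausible reading (an assumption, the post does not confirm it) is that the bundled recognition model was trained with a large Chinese dictionary, while `lang: en` loads a much shorter one. The toy decoder below is not PaddleOCR code; it only reproduces the same kind of failure with made-up dictionaries.

```python
def greedy_decode(indices, characters, blank=0):
    """Toy CTC-style decode: collapse repeats, skip the blank, map ids to chars."""
    decoded = []
    previous = None
    for index in indices:
        if index != blank and index != previous:
            # Raises IndexError when the dictionary is shorter than the
            # range of indices the model can emit.
            decoded.append(characters[index])
        previous = index
    return "".join(decoded)


# Hypothetical dictionaries; real ones hold thousands of entries.
large_dict = ["<blank>", "a", "b", "中", "文"]
small_dict = ["<blank>", "a", "b"]

# Pretend model output, produced against the large dictionary.
model_output = [1, 1, 0, 3, 4]

print(greedy_decode(model_output, large_dict))

try:
    greedy_decode(model_output, small_dict)
except IndexError as exc:
    print("dictionary too small for model output:", exc)
```

Under that assumption, switching `lang` to `ch` makes the dictionary match the model again, which is consistent with the fix reported above.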
It finally produced a result:
[2025/02/17 16:56:20] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process
[2025/02/17 16:56:21] ppocr DEBUG: dt_boxes num : 3, elapsed : 0.10364508628845215
[2025/02/17 16:56:21] ppocr DEBUG: split text box by formula, new dt_boxes num : 7, elapsed : 0.000263214111328125
[2025/02/17 16:56:22] ppocr DEBUG: rec_res num : 7, elapsed : 1.4980812072753906
[2025/02/17 16:56:22] ppocr WARNING: Since the angle classifier is not initialized, it will not be used during the forward process
[2025/02/17 16:56:22] ppocr DEBUG: dt_boxes num : 3, elapsed : 0.10365056991577148
...........
ocr cost: 7.42
Task done, results can be found at outputs/pdf2markdown

Preliminary results show that text recognition itself is acceptable, but the assembled Markdown has problems: the content is not reproduced line by line as in the original, and passages are duplicated and jumbled. Sample output:
4.(3分)下列各句中,没有语病的一句是(4.(3分)下列各句中,没有语病的一句是(一 A C.一所学校能否形成独特、健康的校园文化,学生能否真正接受并融入其中,这对德育C.一所学校能否形成独特、健康的校园文化,学生能否真正接受并融入其中,这对德育活动的有效开展起着至关重要的作用。活动的有效开展起看至关重要的作用。 $\textcircled{2}$我国5岁至19岁青少年尝试吸烟率$20\%$,吸烟率近$7\%$。
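In the sample output, each text line appears twice in a row, sometimes with a one-character OCR difference (起着 vs 起看). One possible workaround is to drop a span when it is nearly identical to the one just before it. This sketch assumes the recognized lines are available as a list of strings before they are joined into Markdown; the post does not show the toolkit's internal data structures, so the function name and the 0.9 threshold are illustrative choices only.

```python
from difflib import SequenceMatcher


def drop_adjacent_near_duplicates(spans, threshold=0.9):
    """Keep a span only if it is not nearly identical to the last kept span."""
    kept = []
    for text in spans:
        if kept and SequenceMatcher(None, kept[-1], text).ratio() >= threshold:
            continue  # near-duplicate of the previous kept span
        kept.append(text)
    return kept


# Spans taken from the sample output, split by hand for illustration.
spans = [
    "4.(3分)下列各句中,没有语病的一句是(",
    "4.(3分)下列各句中,没有语病的一句是(",
    "活动的有效开展起着至关重要的作用。",
    "活动的有效开展起看至关重要的作用。",
]

for line in drop_adjacent_near_duplicates(spans):
    print(line)
```

This always keeps the first copy of a pair, which is not necessarily the more accurate one, and it does nothing about reading order. It only masks the symptom; the duplicated boxes presumably originate earlier in the pipeline.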