Editor's note: Since 2023, RAG has become one of the most widely adopted architectures in LLM-based AI systems. Because key features of many products (such as domain-specific Q&A and knowledge-base construction) depend heavily on RAG, optimizing its performance and improving retrieval efficiency and accuracy has become a pressing core topic of current RAG research. How to efficiently and accurately extract and use information from unstructured data such as PDFs is one of the important problems that urgently needs to be solved. This article compares the strengths and weaknesses of several solutions and focuses on how to tackle this problem.
The article first introduces rule-based parsing methods such as pypdf and points out that they cannot preserve document structure well. The author then evaluates deep-learning-based parsing methods such as Unstructured and Layout-parser, describing their advantages in extracting tables and images and preserving the document's layout structure, as well as their limitations. For PDFs with complex layouts such as double-column pages, the author proposes an improved reordering algorithm. In addition, the author explores the possibility of using multimodal large models to extract information directly from PDF documents.
The article systematically analyzes the challenges of PDF parsing, offers a set of ideas and improved algorithms, contributes valuable insights toward improving the quality of unstructured-data parsing, and points out directions for the future development of PDF parsing.
Author | Florian June
Translated by | 岳扬
For a RAG system, extracting information from documents is unavoidable. Ensuring that content is effectively extracted from the source files is crucial to the quality of the final output.
Do not underestimate the importance of this step. When using a RAG system, poor information extraction during document parsing limits the system's ability to understand and exploit the information contained in PDF files.
The position of the parsing process in a RAG system is shown in Figure 1:
Figure 1: The position of the parsing process in a RAG system. Image by author.
In real-world scenarios, unstructured data is far more abundant than structured data. But if this massive amount of data cannot be parsed, its enormous value cannot be unlocked, and PDF documents are a prime example.
PDF documents account for the majority of unstructured data, and handling them effectively also helps greatly with managing other types of unstructured documents.
This article mainly introduces methods for parsing PDF documents, including algorithms and recommendations for parsing PDFs effectively and extracting as much useful information as possible.
01 Challenges in parsing PDFs
PDF documents are representative of unstructured documents, yet extracting information from them is a challenging process.
Rather than a data format, PDF is more accurately described as a collection of printing instructions. A PDF file consists of a series of instructions that tell a PDF reader or printer where and how to place symbols and text on a screen or on paper. This is fundamentally different from file formats such as HTML and docx, which use tags (for example <div> and <p>) to organize the document's logical structure.
Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
---|---|---|---|
Self-Attention | O(n²·d) | O(1) | O(1) |
Recurrent | O(n·d²) | O(n) | O(n) |
Convolutional | O(k·n·d²) | O(1) | O(log_k(n)) |
Self-Attention (restricted) | O(r·n·d) | O(1) | O(n/r) |
Copy the HTML tags and save them as an HTML file. Then open it with Chrome, as shown in Figure 6:
Figure 6: Content extracted from Table 1 in Figure 3. Image by author.
As can be seen, the unstructured algorithm extracts essentially the entire table accurately.
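For reference, below is a minimal sketch of how such a table can be extracted with the unstructured library; the file name is a placeholder and parameter details may differ across versions, so treat it as an illustration rather than the exact setup used here.
# Minimal sketch: extract tables from a PDF with unstructured.
# "paper.pdf" is a placeholder file name; the hi_res strategy requires the
# layout-detection extras to be installed.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="paper.pdf",
    strategy="hi_res",              # layout-detection-based parsing
    infer_table_structure=True,     # keep the table's HTML representation
)

tables = [el for el in elements if el.category == "Table"]
if tables:
    # This HTML can be saved to a file and opened in a browser, as in Figure 6.
    print(tables[0].metadata.text_as_html)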
Challenge 2: How to rearrange the detected blocks? In particular, how to handle double-column PDFs
For double-column PDFs, we take the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"[11] as an example. The red arrows indicate the reading order:
Figure 7: A double-column page.
After determining the layout, the unstructured framework divides each page into several rectangular blocks, as shown in Figure 8.
Figure 8: Visualization of the layout detection result. Image by author.
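If you want to inspect these blocks programmatically, one possible (version-dependent) way through unstructured's public API is sketched below; the attribute names are my assumption, and the coordinate metadata may be absent depending on the strategy and library version.
# Rough sketch: print the category, bounding-box points, and text of each
# detected block. "paper.pdf" is a placeholder file name.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="paper.pdf", strategy="hi_res")
for el in elements:
    coords = el.metadata.coordinates   # may be None in some configurations
    if coords is not None:
        print(el.category, coords.points, el.text[:60])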
The detailed information for each rectangular block can be obtained in the following format:
[
LayoutElement(bbox=Rectangle(x1=851.1539916992188,y1=181.15073777777613,x2=1467.844970703125,y2=587.8204599999975),text='These approaches have been generalized to coarser granularities,such as sentence embed-dings(Kiros et al.,2015;Logeswaran and Lee,2018)or paragraph embeddings(Le and Mikolov,2014).To train sentence representations,prior work has used objectives to rank candidate next sentences(Jernite et al.,2017;Logeswaran and Lee,2018),left-to-right generation of next sen-tence words given a representation of the previous sentence(Kiros et al.,2015),or denoising auto-encoder derived objectives(Hill et al.,2016).',source=, type='Text',prob=0.9519357085227966,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=196.5296173095703,y1=181.1507377777777,x2=815.468994140625,y2=512.548237777777),text='word based only on its context.Unlike left-to-right language model pre-training,the MLM ob-jective enables the representation to fuse the left and the right context,which allows us to pre-In addi-train a deep bidirectional Transformer.tion to the masked language model,we also use a“next sentence prediction”task that jointly pre-trains text-pair representations.The contributions of our paper are as follows:',source=, type='Text',prob=0.9517233967781067,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=200.22352600097656,y1=539.1451822222216,x2=825.0242919921875,y2=870.542682222221),text='•We demonstrate the importance of bidirectional pre-training for language representations.Un-like Radford et al.(2018),which uses unidirec-tional language models for pre-training,BERT uses masked language models to enable pre-trained deep bidirectional representations.This is also in contrast to Peters et al.(2018a),which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.',source=, type='List-item',prob=0.9414362907409668,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=851.8727416992188,y1=599.8257377777753,x2=1468.0499267578125,y2=1420.4982377777742),text='ELMo and its predecessor(Peters et al.,2017,2018a)generalize traditional word embedding re-search along a different dimension.They extract context-sensitive features from a left-to-right and a right-to-left language model.The contextual rep-resentation of each token is the concatenation of the left-to-right and right-to-left representations.When integrating contextual word embeddings with existing task-specific architectures,ELMo advances the state of the art for several major NLP benchmarks(Peters et al.,2018a)including ques-tion answering(Rajpurkar et al.,2016),sentiment analysis(Socher et al.,2013),and named entity recognition(Tjong Kim Sang and De Meulder,2003).Melamud et al.(2016)proposed learning contextual representations through a task to pre-dict a single word from both left and right context using LSTMs.Similar to ELMo,their model is feature-based and not deeply bidirectional.Fedus et al.(2018)shows that the cloze task can be used to improve the robustness of text generation mod-els.',source=, type='Text',prob=0.938507616519928,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=199.3734130859375,y1=900.5257377777765,x2=824.69873046875,y2=1156.648237777776),text='•We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures.BERT is thefirstfine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks,outper-forming many task-specific architectures.',source=, type='List-item',prob=0.9461237788200378,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=195.5695343017578,y1=1185.526123046875,x2=815.9393920898438,y2=1330.3272705078125),text='•BERT advances the state of the art for eleven NLP tasks.The code and pre-trained mod-els are available at https://github.com/google-research/bert.',source=, type='List-item',prob=0.9213815927505493,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=195.33956909179688,y1=1360.7886962890625,x2=447.47264000000007,y2=1397.038330078125),text='2 Related Work',source=, type='Section-header',prob=0.8663332462310791,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=197.7477264404297,y1=1419.3353271484375,x2=817.3308715820312,y2=1527.54443359375),text='There is a long history of pre-training general lan-guage representations,and we briefly review the most widely-used approaches in this section.',source=, type='Text',prob=0.928022563457489,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=851.0028686523438,y1=1468.341394166663,x2=1420.4693603515625,y2=1498.6444497222187),text='2.2 Unsupervised Fine-tuning Approaches',source=, type='Section-header',prob=0.8346447348594666,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=853.5444444444446,y1=1526.3701822222185,x2=1470.989990234375,y2=1669.5843488888852),text='As with the feature-based approaches,thefirst works in this direction only pre-trained word em-(Col-bedding parameters from unlabeled text lobert and Weston,2008).',source=, type='Text',prob=0.9344717860221863,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=200.00000000000009,y1=1556.2037353515625,x2=799.1743774414062,y2=1588.031982421875),text='2.1 Unsupervised Feature-based Approaches',source=, type='Section-header',prob=0.8317819237709045,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=198.64227294921875,y1=1606.3146266666645,x2=815.2886352539062,y2=2125.895459999998),text='Learning widely applicable representations of words has been an active area of research for decades,including non-neural(Brown et al.,1992;Ando and Zhang,2005;Blitzer et al.,2006)and neural(Mikolov et al.,2013;Pennington et al.,2014)methods.Pre-trained word embeddings are an integral part of modern NLP systems,of-fering significant improvements over embeddings learned from scratch(Turian et al.,2010).To pre-train word embedding vectors,left-to-right lan-guage modeling objectives have been used(Mnih and Hinton,2009),as well as objectives to dis-criminate correct from incorrect words in left and right context(Mikolov et al.,2013).',source=, type='Text',prob=0.9450697302818298,image_path=None,parent=None),
LayoutElement(bbox=Rectangle(x1=853.4905395507812,y1=1681.5868488888855,x2=1467.8729248046875,y2=2125.8954599999965),text='More recently,sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text andfine-tuned for a supervised downstream task(Dai and Le,2015;Howard and Ruder,2018;Radford et al.,2018).The advantage of these approaches is that few parameters need to be learned from scratch.At least partly due to this advantage,OpenAI GPT(Radford et al.,2018)achieved pre-viously state-of-the-art results on many sentence-level tasks from the GLUE benchmark(Wang language model-Left-to-right et al.,2018a).',source=, type='Text',prob=0.9476840496063232,image_path=None,parent=None)
]
Here, (x1, y1) is the coordinate of the top-left vertex, and (x2, y2) is the coordinate of the bottom-right vertex:
(x_1, y_1) ----------
|                   |
|                   |
|                   |
---------- (x_2, y_2)
At this point, you may choose to rearrange the reading order of the page. Unstructured comes with a built-in sorting algorithm, but I found its results unsatisfactory for double-column layouts.
It is therefore necessary to design an algorithm for this case. The simplest approach is to sort first by the x coordinate of the top-left vertex and, when the x coordinates are equal, by the y coordinate. The pseudocode is as follows:
layout.sort(key=lambda z: (z.bbox.x1, z.bbox.y1, z.bbox.x2, z.bbox.y2))
However, we found that the horizontal coordinates of blocks within the same column can still vary. As shown in Figure 9, the block indicated by the purple line has a bbox.x1 that is actually further to the left. When sorting, it would be placed before the block indicated by the green line, which clearly violates the reading order.
Figure 9: The horizontal coordinates of blocks in the same column may vary. Image by author.
In this case, one feasible algorithm works as follows:
- First, take the minimum of the top-left x coordinates (bbox.x1) of all blocks to obtain x1_min.
- Then, take the maximum of the bottom-right x coordinates (bbox.x2) of all blocks to obtain x2_max.
- Next, compute the x coordinate of the page's center line:
x1_min = min([el.bbox.x1 for el in layout])
x2_max = max([el.bbox.x2 for el in layout])
mid_line_x_coordinate = (x2_max + x1_min) / 2
Next, if a block's bbox.x1 is less than mid_line_x_coordinate, the block is assigned to the left column; otherwise, it belongs to the right column.
Once classification is complete, sort the blocks within each column by their y coordinate. Finally, concatenate the right column after the left column.
left_column = []
right_column = []
for el in layout:
    if el.bbox.x1 < mid_line_x_coordinate:
        left_column.append(el)
    else:
        right_column.append(el)
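The per-column sorting and concatenation described above is then a direct continuation of the same snippet (variable names follow the code already shown; this is a sketch of the final step, not unstructured's built-in sorting):
# Sort each column from top to bottom, then append the right column after
# the left column to restore the natural reading order.
left_column.sort(key=lambda z: z.bbox.y1)
right_column.sort(key=lambda z: z.bbox.y1)
sorted_layout = left_column + right_column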
It is worth mentioning that this improved algorithm also works for parsing single-column PDFs.
Challenge 3: How to extract multi-level headings
The purpose of extracting headings, including multi-level headings, is to improve the accuracy of the answers provided by the LLM.
For example, if a user wants to know the gist of Section 2.1 in Figure 9, accurately extracting the title of Section 2.1 and sending it to the LLM together with the related content as context will greatly improve the accuracy of the final response.
The algorithm still relies on the layout blocks shown in Figure 9. We can extract the blocks with type='Section-header' and compute the height difference (bbox.y2 - bbox.y1). The block with the largest height difference corresponds to a level-1 heading, the next largest to a level-2 heading, and the next to a level-3 heading.
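A rough sketch of this idea follows, reusing the layout list from earlier. The attribute names (type, bbox, text) mirror the LayoutElement printout above, and the integer rounding is an arbitrary tolerance for detection noise rather than part of any library API.
# Treat the tallest 'Section-header' blocks as level-1 headings, the next
# tallest as level-2, and so on. Heights are rounded to absorb small
# detection noise; a looser tolerance may be needed in practice.
headers = [el for el in layout if el.type == 'Section-header']
heights = sorted({round(el.bbox.y2 - el.bbox.y1) for el in headers}, reverse=True)
for el in headers:
    level = heights.index(round(el.bbox.y2 - el.bbox.y1)) + 1
    print(f"H{level}: {el.text}")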
2.3 Parsing complex structures in PDFs with multimodal models
With the rapid development and wide adoption of multimodal models, they can also be used to parse tables. There are several options[12]:
- Retrieve the relevant images (PDF pages) and send them to GPT4-V to respond to the user's query.
- Treat each PDF page as an image and let GPT4-V perform image reasoning on every page. Build a Text Vector Store index over the image-reasoning output, then query this Image Reasoning Vector Store for answers.
- Use Table Transformer to crop table information from the retrieved images, then send these cropped images to GPT4-V to respond to the user's query.
- Apply OCR to the cropped table images and send the recognized data to GPT4 / GPT-3.5 to answer the user's question.
After testing, the third method was found to be the most effective.
In addition, we can use multimodal models to extract or summarize key information from images (since PDF files can easily be converted to images), as shown in Figure 10.
Figure 10: Extracting or summarizing key information from images. Source: GPT-4 with Vision: Complete Guide and Evaluation[13]
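To make the page-as-image idea concrete, here is a minimal sketch that renders the first page of a PDF as an image and asks a multimodal model to summarize it. The libraries (pdf2image, the openai client), the model name, and the prompt are illustrative assumptions, not the exact setup used in [12].
# Render the first PDF page to a PNG and send it to a multimodal model.
# "paper.pdf" and the model name are placeholders; pdf2image requires poppler.
import base64
import io
from pdf2image import convert_from_path
from openai import OpenAI

page = convert_from_path("paper.pdf", first_page=1, last_page=1)[0]
buf = io.BytesIO()
page.save(buf, format="PNG")
image_b64 = base64.b64encode(buf.getvalue()).decode()

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key information on this page, including any tables."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)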
03 Conclusion
In general, almost all unstructured documents are highly flexible and require a variety of parsing techniques. However, the industry has not yet reached a consensus on which method to use.
In this situation, it is recommended to choose the method that best fits the needs of your project and to apply tailored processing to different types of PDF files. For example, papers, books, and financial reports may each have distinctive layouts depending on their characteristics.
Nevertheless, if circumstances allow, it is still advisable to choose deep-learning-based or multimodal methods. These methods can effectively segment documents into well-defined, complete units of information, preserving the document's original meaning and structure to the greatest extent possible.
Thanks for reading!
————
Florian June
An artificial intelligence researcher who mainly writes articles about large language models, data structures and algorithms, and NLP.
END
References
[1]https://github.com/py-pdf/pypdf
[2]https://github.com/langchain-ai/langchain/blob/v0.1.1/libs/langchain/langchain/document_loaders/pdf.py
[3]https://github.com/run-llama/llama_index/blob/v0.9.32/llama_index/readers/file/docs_reader.py
[4]https://arxiv.org/pdf/1706.03762.pdf
[5]http://unstructured-io.github.io/unstructured/
[6]https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/document_loaders/pdf.py
[7]http://github.com/Layout-Parser/layout-parser
[8]https://layout-parser.github.io/platform/
[9]https://arxiv.org/pdf/2210.05391.pdf
[10]https://github.com/Unstructured-IO/unstructured
[11]https://arxiv.org/pdf/1810.04805.pdf
[12]https://docs.llamaindex.ai/en/stable/examples/multi_modal/multi_modal_pdf_tables.html
[13]https://blog.roboflow.com/gpt-4-vision/
This article was translated and published by Baihai IDP with the original author's authorization. For permission to reproduce the translation, please contact them.
Original article:
https://pub.towardsai.net/advanced-rag-02-unveiling-pdf-parsing-b84ae866344e
Copyright belongs to the original author.