Using RecursiveUrlLoader in LangChain
- Open-source code
- 2025-09-12 06:57:01

RecursiveUrlLoader: a first example

    from langchain_community.document_loaders import RecursiveUrlLoader

    loader = RecursiveUrlLoader(
        "https://docs.python.org/3.9/",
    )

    # Load synchronously
    docs = loader.load()

    # Inspect the metadata of the first document
    print(docs[0].metadata)

Output:

    d:\soft\anaconda\envs\chat_chain\Lib\site-packages\langchain_community\document_loaders\recursive_url_loader.py:43: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.

    Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

    If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:

        from bs4 import XMLParsedAsHTMLWarning
        import warnings

        warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

      soup = BeautifulSoup(raw_html, "html.parser")
    {'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.21 Documentation', 'language': None}

With the default settings, page_content holds the raw HTML of the page:

    print(docs[0].page_content)

Output:

<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8" /><title>3.9.21 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0"> <link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" /> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script> <script src="_static/jquery.js"></script> <script src="_static/underscore.js"></script> <script src="_static/doctools.js"></script> <script src="_static/language_data.js"></script> <script src="_static/sidebar.js"></script> <link rel="search" type="application/opensearchdescription+xml" title="Search within Python 3.9.21
documentation" href="_static/opensearch.xml"/> <link rel="author" title="About these documents" href="about.html" /> <link rel="index" title="Index" href="genindex.html" /> <link rel="search" title="Search" href="search.html" /> <link rel="copyright" title="Copyright" href="copyright.html" /> <link rel="canonical" href=" docs.python.org/3/index.html" /> <style> @media only screen { table.full-width-table { width: 100%; } } </style> <link rel="shortcut icon" type="image/png" href="_static/py.svg" /> <script type="text/javascript" src="_static/copybutton.js"></script> <script type="text/javascript" src="_static/menu.js"></script> </head> <body> <div class="mobile-nav"> <input type="checkbox" id="menuToggler" class="toggler__input" aria-controls="navigation" aria-pressed="false" aria-expanded="false" role="button" aria-label="Menu" /> <label for="menuToggler" class="toggler__label"> <span></span> </label> <nav class="nav-content" role="navigation"> <a href=" .python.org/" class="nav-logo"> <img src="_static/py.svg" alt="Logo"/> </a> <div class="version_switcher_placeholder"></div> <form role="search" class="search" action="search.html" method="get"> <svg xmlns="http:// .w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" class="search-icon"> <path fill-rule="nonzero" d="M15.5 14h-.79l-.28-.27a6.5 6.5 0 001.48-5.34c-.47-2.78-2.79-5-5.59-5.34a6.505 6.505 0 00-7.27 7.27c.34 2.8 2.56 5.12 5.34 5.59a6.5 6.5 0 005.34-1.48l.27.28v.79l4.25 4.25c.41.41 1.08.41 1.49 0 .41-.41.41-1.08 0-1.49L15.5 14zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z" fill="#444"></path> </svg> <input type="text" name="q" aria-label="Quick search"/> <input type="submit" value="Go"/> </form> </nav> <div class="menu-wrapper"> <nav class="menu" role="navigation" aria-label="main navigation"> <div class="language_switcher_placeholder"></div> <h3>Download</h3> <p><a href="download.html">Download these documents</a></p> <h3>Docs by version</h3> <ul> <li><a href=" 
docs.python.org/3.14/">Python 3.14 (in development)</a></li> <li><a href=" docs.python.org/3.13/">Python 3.13 (stable)</a></li> <li><a href=" docs.python.org/3.12/">Python 3.12 (stable)</a></li> <li><a href=" docs.python.org/3.11/">Python 3.11 (security-fixes)</a></li> <li><a href=" docs.python.org/3.10/">Python 3.10 (security-fixes)</a></li> <li><a href=" docs.python.org/3.9/">Python 3.9 (security-fixes)</a></li> <li><a href=" docs.python.org/3.8/">Python 3.8 (EOL)</a></li> <li><a href=" docs.python.org/3.7/">Python 3.7 (EOL)</a></li> <li><a href=" docs.python.org/3.6/">Python 3.6 (EOL)</a></li> <li><a href=" docs.python.org/3.5/">Python 3.5 (EOL)</a></li> <li><a href=" docs.python.org/3.4/">Python 3.4 (EOL)</a></li> <li><a href=" docs.python.org/3.3/">Python 3.3 (EOL)</a></li> <li><a href=" docs.python.org/3.2/">Python 3.2 (EOL)</a></li> <li><a href=" docs.python.org/3.1/">Python 3.1 (EOL)</a></li> <li><a href=" docs.python.org/3.0/">Python 3.0 (EOL)</a></li> <li><a href=" docs.python.org/2.7/">Python 2.7 (EOL)</a></li> <li><a href=" docs.python.org/2.6/">Python 2.6 (EOL)</a></li> <li><a href=" .python.org/doc/versions/">All versions</a></li> </ul> <h3>Other resources</h3> <ul> <li><a href=" peps.python.org">PEP Index</a></li> <li><a href=" wiki.python.org/moin/BeginnersGuide">Beginner's Guide</a></li> <li><a href=" wiki.python.org/moin/PythonBooks">Book List</a></li> <li><a href=" .python.org/doc/av/">Audio/Visual Talks</a></li> <li><a href=" devguide.python.org/">Python Developer’s Guide</a></li> </ul> </nav> </div> </div> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" accesskey="I">index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li><img src="_static/py.svg" alt="python logo" style="vertical-align: middle; margin-top: -1px"/></li> <li><a href=" 
.python.org/">Python</a> »</li> <li class="switchers"> <div class="language_switcher_placeholder"></div> <div class="version_switcher_placeholder"></div> </li> <li> </li> <li id="cpython-language-and-version"> <a href="#">3.9.21 Documentation</a> » </li> <li class="right"> <div class="inline-search" role="search"> <form class="inline-search" action="search.html" method="get"> <input placeholder="Quick search" aria-label="Quick search" type="text" name="q" /> <input type="submit" value="Go" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> | </li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body" role="main"> <h1>Python 3.9.21 documentation</h1> <p> Welcome! This is the official documentation for Python 3.9.21. </p> <p><strong>Parts of the documentation:</strong></p> <table class="contentstable" align="center"><tr> <td width="50%"> <p class="biglink"><a class="biglink" href="whatsnew/3.9.html">What's new in Python 3.9?</a><br/> <span class="linkdescr"> or <a href="whatsnew/index.html">all "What's new" documents</a> since 2.0</span></p> <p class="biglink"><a class="biglink" href="tutorial/index.html">Tutorial</a><br/> <span class="linkdescr">start here</span></p> <p class="biglink"><a class="biglink" href="library/index.html">Library Reference</a><br/> <span class="linkdescr">keep this under your pillow</span></p> <p class="biglink"><a class="biglink" href="reference/index.html">Language Reference</a><br/> <span class="linkdescr">describes syntax and language elements</span></p> <p class="biglink"><a class="biglink" href="using/index.html">Python Setup and Usage</a><br/> <span class="linkdescr">how to use Python on different platforms</span></p> <p class="biglink"><a class="biglink" href="howto/index.html">Python HOWTOs</a><br/> <span class="linkdescr">in-depth documents on specific topics</span></p> </td><td width="50%"> <p 
class="biglink"><a class="biglink" href="installing/index.html">Installing Python Modules</a><br/> <span class="linkdescr">installing from the Python Package Index & other sources</span></p> <p class="biglink"><a class="biglink" href="distributing/index.html">Distributing Python Modules</a><br/> <span class="linkdescr">publishing modules for installation by others</span></p> <p class="biglink"><a class="biglink" href="extending/index.html">Extending and Embedding</a><br/> <span class="linkdescr">tutorial for C/C++ programmers</span></p> <p class="biglink"><a class="biglink" href="c-api/index.html">Python/C API</a><br/> <span class="linkdescr">reference for C/C++ programmers</span></p> <p class="biglink"><a class="biglink" href="faq/index.html">FAQs</a><br/> <span class="linkdescr">frequently asked questions (with answers!)</span></p> </td></tr> </table> <p><strong>Indices and tables:</strong></p> <table class="contentstable" align="center"><tr> <td width="50%"> <p class="biglink"><a class="biglink" href="py-modindex.html">Global Module Index</a><br/> <span class="linkdescr">quick access to all modules</span></p> <p class="biglink"><a class="biglink" href="genindex.html">General Index</a><br/> <span class="linkdescr">all functions, classes, terms</span></p> <p class="biglink"><a class="biglink" href="glossary.html">Glossary</a><br/> <span class="linkdescr">the most important terms explained</span></p> </td><td width="50%"> <p class="biglink"><a class="biglink" href="search.html">Search page</a><br/> <span class="linkdescr">search this documentation</span></p> <p class="biglink"><a class="biglink" href="contents.html">Complete Table of Contents</a><br/> <span class="linkdescr">lists all sections and subsections</span></p> </td></tr> </table> <p><strong>Meta information:</strong></p> <table class="contentstable" align="center"><tr> <td width="50%"> <p class="biglink"><a class="biglink" href="bugs.html">Reporting bugs</a></p> <p class="biglink"><a class="biglink" 
href=" devguide.python.org/docquality/#helping-with-documentation">Contributing to Docs</a></p> <p class="biglink"><a class="biglink" href="about.html">About the documentation</a></p> </td><td width="50%"> <p class="biglink"><a class="biglink" href="license.html">History and License of Python</a></p> <p class="biglink"><a class="biglink" href="copyright.html">Copyright</a></p> </td></tr> </table> </div> </div> </div> <div class="sphinxsidebar" role="navigation" aria-label="main navigation"> <div class="sphinxsidebarwrapper"> <h3>Download</h3> <p><a href="download.html">Download these documents</a></p> <h3>Docs by version</h3> <ul> <li><a href=" docs.python.org/3.14/">Python 3.14 (in development)</a></li> <li><a href=" docs.python.org/3.13/">Python 3.13 (stable)</a></li> <li><a href=" docs.python.org/3.12/">Python 3.12 (stable)</a></li> <li><a href=" docs.python.org/3.11/">Python 3.11 (security-fixes)</a></li> <li><a href=" docs.python.org/3.10/">Python 3.10 (security-fixes)</a></li> <li><a href=" docs.python.org/3.9/">Python 3.9 (security-fixes)</a></li> <li><a href=" docs.python.org/3.8/">Python 3.8 (EOL)</a></li> <li><a href=" docs.python.org/3.7/">Python 3.7 (EOL)</a></li> <li><a href=" docs.python.org/3.6/">Python 3.6 (EOL)</a></li> <li><a href=" docs.python.org/3.5/">Python 3.5 (EOL)</a></li> <li><a href=" docs.python.org/3.4/">Python 3.4 (EOL)</a></li> <li><a href=" docs.python.org/3.3/">Python 3.3 (EOL)</a></li> <li><a href=" docs.python.org/3.2/">Python 3.2 (EOL)</a></li> <li><a href=" docs.python.org/3.1/">Python 3.1 (EOL)</a></li> <li><a href=" docs.python.org/3.0/">Python 3.0 (EOL)</a></li> <li><a href=" docs.python.org/2.7/">Python 2.7 (EOL)</a></li> <li><a href=" docs.python.org/2.6/">Python 2.6 (EOL)</a></li> <li><a href=" .python.org/doc/versions/">All versions</a></li> </ul> <h3>Other resources</h3> <ul> <li><a href=" peps.python.org">PEP Index</a></li> <li><a href=" wiki.python.org/moin/BeginnersGuide">Beginner's Guide</a></li> <li><a href=" 
wiki.python.org/moin/PythonBooks">Book List</a></li> <li><a href=" .python.org/doc/av/">Audio/Visual Talks</a></li> <li><a href=" devguide.python.org/">Python Developer’s Guide</a></li> </ul> </div> </div> <div class="clearer"></div> </div> <div class="related" role="navigation" aria-label="related navigation"> <h3>Navigation</h3> <ul> <li class="right" style="margin-right: 10px"> <a href="genindex.html" title="General Index" >index</a></li> <li class="right" > <a href="py-modindex.html" title="Python Module Index" >modules</a> |</li> <li><img src="_static/py.svg" alt="python logo" style="vertical-align: middle; margin-top: -1px"/></li> <li><a href=" .python.org/">Python</a> »</li> <li class="switchers"> <div class="language_switcher_placeholder"></div> <div class="version_switcher_placeholder"></div> </li> <li> </li> <li id="cpython-language-and-version"> <a href="#">3.9.21 Documentation</a> » </li> <li class="right"> <div class="inline-search" role="search"> <form class="inline-search" action="search.html" method="get"> <input placeholder="Quick search" aria-label="Quick search" type="text" name="q" /> <input type="submit" value="Go" /> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </div> | </li> </ul> </div> <div class="footer"> © <a href="copyright.html">Copyright</a> 2001-2024, Python Software Foundation. <br /> This page is licensed under the Python Software Foundation License Version 2. <br /> Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License. <br /> See <a href="/license.html">History and License</a> for more information.<br /> <br /> The Python Software Foundation is a non-profit corporation. <a href=" .python.org/psf/donations/">Please donate.</a> <br /> <br /> Last updated on Dec 08, 2024. <a href="/bugs.html">Found a bug</a>? <br /> Created using <a href=" .sphinx-doc.org/">Sphinx</a> 2.4.4. 
</div> <script type="text/javascript" src="_static/switchers.js"></script> </body> </html>

    print(len(docs))

Output:

    24

A custom extractor in RecursiveUrlLoader

    import re

    from bs4 import BeautifulSoup
    from langchain_community.document_loaders import RecursiveUrlLoader


    def bs4_extractor(html: str) -> str:
        # Parse the raw HTML and keep only its text content
        soup = BeautifulSoup(html, "html.parser")
        # Collapse runs of blank lines into a single blank line
        return re.sub(r"\n\n+", "\n\n", soup.text).strip()


    loader = RecursiveUrlLoader(
        "https://docs.python.org/3.9/",
        extractor=bs4_extractor,
    )

    # Load synchronously
    docs = loader.load()
    print(len(docs))

    # Inspect the metadata of the first document
    print(docs[0].metadata)

Output:

    C:\Windows\Temp\ipykernel_12952\1217732938.py:5: XMLParsedAsHTMLWarning: It looks like you're using an HTML parser to parse an XML document.

    Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

    If you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:

        from bs4 import XMLParsedAsHTMLWarning
        import warnings

        warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

      soup = BeautifulSoup(html, "html.parser")
    24
    {'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.21 Documentation', 'language': None}

With the custom extractor, page_content holds plain text instead of HTML:

    print(docs[0].page_content)

Output:

    3.9.21 Documentation

    Download Download these documents

    Docs by version Python 3.14 (in development) Python 3.13 (stable) Python 3.12 (stable) Python 3.11 (security-fixes) Python 3.10 (security-fixes) Python 3.9 (security-fixes) Python 3.8 (EOL) Python 3.7 (EOL) Python 3.6 (EOL) Python 3.5 (EOL) Python 3.4 (EOL) Python 3.3 (EOL) Python 3.2 (EOL) Python 3.1 (EOL) Python 3.0 (EOL) Python 2.7 (EOL) Python 2.6 (EOL) All versions

    Other resources PEP Index Beginner's Guide Book List Audio/Visual Talks Python Developer’s Guide

    Navigation index modules | Python » 3.9.21 Documentation » |

    Python 3.9.21 documentation

    Welcome! This is the official documentation for Python 3.9.21.

    Parts of the documentation: What's new in Python 3.9? or all "What's new" documents since 2.0 Tutorial start here Library Reference keep this under your pillow Language Reference describes syntax and language elements Python Setup and Usage how to use Python on different platforms Python HOWTOs in-depth documents on specific topics Installing Python Modules installing from the Python Package Index & other sources Distributing Python Modules publishing modules for installation by others Extending and Embedding tutorial for C/C++ programmers Python/C API reference for C/C++ programmers FAQs frequently asked questions (with answers!)

    Indices and tables: Global Module Index quick access to all modules General Index all functions, classes, terms Glossary the most important terms explained Search page search this documentation Complete Table of Contents lists all sections and subsections

    Meta information: Reporting bugs Contributing to Docs About the documentation History and License of Python Copyright

    Download Download these documents

    Docs by version Python 3.14 (in development) Python 3.13 (stable) Python 3.12 (stable) Python 3.11 (security-fixes) Python 3.10 (security-fixes) Python 3.9 (security-fixes) Python 3.8 (EOL) Python 3.7 (EOL) Python 3.6 (EOL) Python 3.5 (EOL) Python 3.4 (EOL) Python 3.3 (EOL) Python 3.2 (EOL) Python 3.1 (EOL) Python 3.0 (EOL) Python 2.7 (EOL) Python 2.6 (EOL) All versions

    Other resources PEP Index Beginner's Guide Book List Audio/Visual Talks Python Developer’s Guide

    Navigation index modules | Python » 3.9.21 Documentation » |

    © Copyright 2001-2024, Python Software Foundation. This page is licensed under the Python Software Foundation License Version 2. Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License. See History and License for more information.

    The Python Software Foundation is a non-profit corporation. Please donate.

    Last updated on Dec 08, 2024. Found a bug?

    Created using Sphinx 2.4.4.
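As bs4_extractor shows, the extractor is just a function that takes the page's HTML as a string and returns a string. The sketch below is not from the original article: it shows one possible extractor built only on the standard library's html.parser, so it does not need BeautifulSoup. The names `_TextCollector` and `stdlib_extractor` are hypothetical and chosen for illustration; the behavior (dropping `<script>` and `<style>` contents, collapsing blank lines) is an assumption about what a reader might want, not what LangChain does by default.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def stdlib_extractor(html: str) -> str:
    """Hypothetical extractor: HTML string in, plain text out."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.chunks)
    # Collapse runs of blank (or whitespace-only) lines into one blank line
    return re.sub(r"\n\s*\n+", "\n\n", text).strip()


sample = (
    "<html><body><h1>Title</h1>\n\n\n"
    "<p>Hello &amp; welcome</p>"
    "<script>var x = 1;</script></body></html>"
)
result = stdlib_extractor(sample)
print(result)
```

Such a function could presumably be passed the same way as bs4_extractor, i.e. `RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=stdlib_extractor)`. Unlike the bs4 version, it makes no attempt to repair malformed markup, so the extracted text may differ on messy pages.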