feat: add epub support for knowledge base document upload #7594
Aster-amellus wants to merge 6 commits into AstrBotDevs:master from
Conversation
Hey - I've found 2 issues, and left some high level feedback:
- In `file_read_utils._parse_local_supported_document`, the combined `.epub`/`_looks_like_zip_container` branch effectively makes the later `.docx`-only branch dead code for real DOCX files; consider restructuring the conditions (e.g., check `.epub` by suffix first, then DOCX for ZIP containers) to avoid redundant paths and make the control flow clearer.
- In `EpubParser.parse`, the spine is iterated without considering the EPUB `linear="no"` flag, so non-linear navigation or auxiliary documents may be included; consider checking the spine item's attributes (e.g., via the tuple's second element) and skipping entries marked as non-linear.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `file_read_utils._parse_local_supported_document`, the combined `.epub`/`_looks_like_zip_container` branch effectively makes the later `.docx`-only branch dead code for real DOCX files; consider restructuring the conditions (e.g., check `.epub` by suffix first, then DOCX for ZIP containers) to avoid redundant paths and make the control flow clearer.
- In `EpubParser.parse`, the spine is iterated without considering the EPUB `linear="no"` flag, so non-linear navigation or auxiliary documents may be included; consider checking the spine item’s attributes (e.g., via the tuple’s second element) and skipping entries marked as non-linear.
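The suggested branch reordering can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: `choose_branch`, `_looks_like_zip_container`, and the return labels are hypothetical names standing in for the real parsing paths.

```python
import io
import zipfile


def _is_epub_bytes(data: bytes) -> bool:
    # Bounded read of the mimetype entry; any malformed zip means "not EPUB".
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            with archive.open("mimetype") as fh:
                return fh.read(64).decode("ascii", "replace").strip() == "application/epub+zip"
    except (KeyError, OSError, zipfile.BadZipFile):
        return False


def _looks_like_zip_container(data: bytes) -> bool:
    # DOCX (and EPUB) files are ZIP containers starting with the PK magic.
    return data[:4] == b"PK\x03\x04"


def choose_branch(suffix: str, data: bytes) -> str:
    # Check .epub first (by suffix or content), then treat remaining
    # ZIP containers as DOCX, so the DOCX branch is no longer dead code.
    if suffix.lower() == ".epub" or _is_epub_bytes(data):
        return "epub"
    if _looks_like_zip_container(data):
        return "docx"
    return "plain"
```

With this ordering, a real DOCX file (a ZIP container that is not an EPUB) reaches the DOCX branch instead of being swallowed by the combined EPUB check.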
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/parsers/epub_parser.py" line_range="19-22" />
<code_context>
+ return "\n".join(line for line in lines if line)
+
+
+def _extract_text_from_html(body_content: bytes | str) -> str:
+ from bs4 import BeautifulSoup
+
+ soup = BeautifulSoup(body_content, "html.parser")
+ for tag_name in _DROP_TAGS:
+ for tag in soup.find_all(tag_name):
</code_context>
<issue_to_address>
**issue:** Handle missing BeautifulSoup dependency consistently with EbookLib errors.
Because `BeautifulSoup` is imported inside `_extract_text_from_html`, a missing `beautifulsoup4` raises a raw `ImportError` instead of the clearer `RuntimeError` you use for `EbookLib`. To align error handling, either import `BeautifulSoup` at module level so dependency issues fail fast, or wrap the local import in try/except and raise a descriptive `RuntimeError` instead.
</issue_to_address>
### Comment 2
<location path="astrbot/core/knowledge_base/parsers/util.py" line_range="9" />
<code_context>
from .markitdown_parser import MarkitdownParser
return MarkitdownParser()
+ if ext == ".epub":
+ from .epub_parser import EpubParser
+
</code_context>
<issue_to_address>
**suggestion:** Normalize file extensions before selecting the EPUB parser.
`select_parser` compares `ext` to lowercase literals (e.g. `".epub"`), so upper- or mixed-case extensions (".EPUB", ".Epub") won’t match. Consider normalizing once at the top of the function (e.g. `ext = ext.lower()`) so all extension checks behave consistently regardless of case.
</issue_to_address>
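The normalization suggestion can be sketched like this. The parser names returned here are placeholder strings standing in for the real parser classes, since the full `select_parser` body is not shown in the diff.

```python
def select_parser(ext: str) -> str:
    # Normalize once at the top so ".EPUB" / ".Epub" match the
    # lowercase literals used in every branch below.
    ext = ext.lower()
    if ext == ".epub":
        return "EpubParser"  # placeholder for the real EpubParser instance
    return "MarkitdownParser"  # placeholder for the fallback parser
```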
```python
from .markitdown_parser import MarkitdownParser
    return MarkitdownParser()
if ext == ".epub":
```
**suggestion:** Normalize file extensions before selecting the EPUB parser.
`select_parser` compares `ext` to lowercase literals (e.g. `".epub"`), so upper- or mixed-case extensions (".EPUB", ".Epub") won't match. Consider normalizing once at the top of the function (e.g. `ext = ext.lower()`) so all extension checks behave consistently regardless of case.
Code Review
This pull request introduces EPUB file support for the knowledge base, featuring a new EpubParser and updated file detection logic. The frontend was updated with localized strings and UI icons, and beautifulsoup4 and EbookLib were added as dependencies. Review feedback suggests refining the HTML tag filtering to preserve local navigation, switching to the lxml parser for better performance, and offloading the parsing logic to a separate thread to avoid blocking the asyncio event loop.
```python
from astrbot.core.knowledge_base.parsers.base import BaseParser, ParseResult

_DROP_TAGS = ("script", "style", "nav")
```
The PR description states that navigation sections should be filtered but not all `<nav>` content should be dropped. However, `"nav"` is included in `_DROP_TAGS` here, which will decompose all such tags regardless of their context. Since the main navigation document is already skipped in the parse loop (lines 54-55) based on manifest properties, you should remove `"nav"` from this list to preserve legitimate navigation elements (like local chapter TOCs) within the content documents.
```diff
- _DROP_TAGS = ("script", "style", "nav")
+ _DROP_TAGS = ("script", "style")
```
```python
def _extract_text_from_html(body_content: bytes | str) -> str:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(body_content, "html.parser")
```
Using the "lxml" parser with BeautifulSoup is recommended here as it is significantly faster and more robust for XHTML content (which EPUBs use) compared to the built-in "html.parser". Since lxml is a required dependency of EbookLib, it is guaranteed to be available in the environment. Additionally, consider moving the BeautifulSoup import to the top of the file to avoid repeated import overhead during parsing.
```diff
- soup = BeautifulSoup(body_content, "html.parser")
+ soup = BeautifulSoup(body_content, "lxml")
```
```python
async def parse(self, file_content: bytes, file_name: str) -> ParseResult:
    try:
        import ebooklib
        from ebooklib import epub
    except ImportError as exc:
        raise RuntimeError(
            "EPUB support requires the EbookLib package to be installed."
        ) from exc
```
The parse method performs CPU-intensive tasks (EPUB decompression and HTML parsing) synchronously. In an asyncio environment, this can block the event loop and degrade the responsiveness of the application, especially when processing large books. Consider offloading the heavy lifting to a separate thread using asyncio.to_thread. Note that while synchronous blocks in the event loop are safe from race conditions due to their atomic execution, moving this logic to a thread requires ensuring that no shared state is modified without proper synchronization. Furthermore, since ebooklib is now a mandatory dependency in pyproject.toml, these dynamic imports should be moved to the top of the file for better performance and clarity.
References
- In a single-threaded asyncio event loop, synchronous functions (code blocks without 'await') are executed atomically and will not be interrupted by other coroutines. Therefore, they are safe from race conditions when modifying shared state within that block.
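The offloading pattern the comment describes can be sketched as follows. `_parse_epub_sync` is a stand-in for the real EbookLib + BeautifulSoup work, not the PR's actual function.

```python
import asyncio


def _parse_epub_sync(file_content: bytes) -> str:
    # Stand-in for the CPU-bound EPUB decompression + HTML parsing.
    return file_content.decode("utf-8", "replace").upper()


async def parse(file_content: bytes) -> str:
    # Run the blocking work in a worker thread so the event loop keeps
    # servicing other coroutines while a large book is parsed.
    return await asyncio.to_thread(_parse_epub_sync, file_content)
```

Because the worker function here only touches its own arguments and returns a value, no extra synchronization is needed; shared mutable state would require locking once the work leaves the event loop thread.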
Pull request overview
Adds .epub support end-to-end for Knowledge Base document uploads (backend parsing + dashboard upload/UI), including new parsing logic, file-type detection, dependency updates, and tests.
Changes:
- Introduces `EpubParser` and wires it into parser selection and local file reading (magic/ZIP-based detection).
- Updates dashboard upload components + i18n strings to allow selecting/uploading `.epub` and to display EPUB icons/colors.
- Adds test coverage for EPUB parsing and file-read tool EPUB detection.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.
Summary per file:

| File | Description |
|---|---|
| `astrbot/core/knowledge_base/parsers/epub_parser.py` | New EPUB-to-text parser using EbookLib + BeautifulSoup extraction. |
| `astrbot/core/knowledge_base/parsers/util.py` | Registers `.epub` in `select_parser()`. |
| `astrbot/core/knowledge_base/parsers/__init__.py` | Exports `EpubParser`. |
| `astrbot/core/computer/file_read_utils.py` | Adds EPUB magic detection and local EPUB parsing path. |
| `dashboard/src/views/knowledge-base/components/DocumentsTab.vue` | Accepts `.epub`, resets file input after selection, updates icon/color handling, and tweaks upload request. |
| `dashboard/src/views/knowledge-base/DocumentDetail.vue` | Adds EPUB icon/color mapping. |
| `dashboard/src/views/alkaid/KnowledgeBase.vue` | Resets file input after selection, adds EPUB icon mapping, and tweaks upload request. |
| `dashboard/src/i18n/locales/zh-CN/features/knowledge-base/detail.json` | Updates supported-format copy to include `.epub`. |
| `dashboard/src/i18n/locales/en-US/features/knowledge-base/detail.json` | Updates supported-format copy to include `.epub`. |
| `dashboard/src/i18n/locales/ru-RU/features/knowledge-base/detail.json` | Updates supported-format copy to include `.epub`. |
| `dashboard/src/i18n/locales/zh-CN/features/alkaid/knowledge-base.json` | Updates upload subtitle to mention EPUB. |
| `dashboard/src/i18n/locales/en-US/features/alkaid/knowledge-base.json` | Updates upload subtitle to mention EPUB. |
| `requirements.txt` | Adds beautifulsoup4 + EbookLib. |
| `pyproject.toml` | Adds beautifulsoup4 + EbookLib to dependencies. |
| `tests/test_epub_parser.py` | New tests for parser selection + EPUB text extraction behavior. |
| `tests/test_computer_fs_tools.py` | Adds EPUB bytes fixture + tests for EPUB magic detection and read tool integration. |
```python
def _is_epub_bytes(file_bytes: bytes) -> bool:
    try:
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
            names = set(archive.namelist())
            mimetype = archive.read("mimetype").decode("utf-8").strip()
    except (KeyError, OSError, UnicodeDecodeError, zipfile.BadZipFile):
        return False

    return mimetype == "application/epub+zip" and "META-INF/container.xml" in names
```
_is_epub_bytes() calls archive.read('mimetype'), which will decompress and load the entire entry into memory. Since this function can run on arbitrary ZIPs (via magic detection), a malicious archive could use an oversized/zip-bomb mimetype entry to cause excessive memory use. Prefer archive.open('mimetype') and read a small bounded amount (e.g., first ~64 bytes) before decoding/stripping.
```python
@pytest.mark.asyncio
async def test_epub_parser_reads_spine_order_as_text():
    pytest.importorskip("bs4")
    pytest.importorskip("ebooklib")
```
These tests use pytest.importorskip() for bs4 and ebooklib, but both packages were added to the project's core dependencies in this PR. Keeping the skip means CI could silently skip the EPUB parser assertions if dependency resolution regresses. Consider importing directly (or failing fast with a clear message) so missing required deps fail the test run.
```python
@pytest.mark.asyncio
async def test_epub_parser_preserves_generic_container_text():
    pytest.importorskip("bs4")
    pytest.importorskip("ebooklib")
```
Same as above: pytest.importorskip() here will hide failures if required EPUB dependencies are missing/mispackaged. Since the PR adds these as core deps, consider removing the skip so the test suite fails loudly when the environment is incorrect.
```python
_DROP_TAGS = ("script", "style", "nav")


def _normalize_multiline_text(text: str) -> str:
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def _extract_text_from_html(body_content: bytes | str) -> str:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(body_content, "html.parser")
    for tag_name in _DROP_TAGS:
        for tag in soup.find_all(tag_name):
            tag.decompose()

    root = soup.body or soup
    return _normalize_multiline_text(root.get_text("\n", strip=True))
```
The parser currently decomposes all <nav> elements (_DROP_TAGS includes "nav"). This contradicts the PR description that says only EPUB navigation sections (TOC/landmarks/page-list) should be filtered, and it can also drop meaningful in-chapter navigation content. Consider keeping <nav> by default and only removing <nav> nodes that are explicitly navigation (e.g., epub:type in {toc, landmarks, page-list}) or updating the PR description to match the implemented behavior.
```python
for spine_entry in book.spine:
    item_id = self._resolve_spine_item_id(spine_entry)
    if not item_id:
        continue

    item = book.get_item_with_id(item_id)
    if item is None or item.get_type() != ebooklib.ITEM_DOCUMENT:
        continue
    if "nav" in getattr(item, "properties", []):
        continue

    chapter_text = _extract_text_from_html(item.get_body_content())
    if chapter_text:
        text_parts.append(chapter_text)

return ParseResult(text="\n\n".join(text_parts).strip(), media=[])


@staticmethod
def _resolve_spine_item_id(spine_entry: Any) -> str | None:
    if isinstance(spine_entry, tuple) and spine_entry:
        return str(spine_entry[0])
    if isinstance(spine_entry, str):
```
book.spine entries from EbookLib are commonly tuples like (idref, linear); _resolve_spine_item_id currently discards the linear flag and the main loop never checks it. That means non-linear spine items will still be included, and there is no fallback when the spine is empty/incomplete—both of which are claimed in the PR description. Consider inspecting the second tuple element (or itemref attributes) to skip linear == 'no', and add a fallback to iterate over document items when the spine doesn't yield any content (or update the PR description).
Force-pushed: f936c79 → ea0aaf8
Force-pushed: ee6c054 → 7f9bcbc
Force-pushed: a280edb → de8b43e
Motivation
I wanted to upload some novels and found the format was not supported, so I added EPUB support. / 想上传一些小说,发现格式不支持,遂增加一个 epub 支持
Modifications / 改动点
- Added `EpubParser` and registered it in the parser selection flow
- Updated `astrbot/core/computer/file_read_utils.py`, including ZIP container handling and local EPUB text extraction
- Added `astrbot/core/knowledge_base/parsers/epub_parser.py`, which filters `<nav>` content
- Allowed uploading `.epub` files
- Added the new dependencies to `requirements.txt` and `pyproject.toml`

Core files changed:

- `astrbot/core/knowledge_base/parsers/epub_parser.py`
- `astrbot/core/knowledge_base/parsers/util.py`
- `astrbot/core/knowledge_base/parsers/__init__.py`
- `astrbot/core/computer/file_read_utils.py`
- `dashboard/src/views/alkaid/KnowledgeBase.vue`
- `dashboard/src/views/knowledge-base/DocumentDetail.vue`
- `dashboard/src/views/knowledge-base/components/DocumentsTab.vue`
- `dashboard/src/i18n/locales/...`
- `requirements.txt`
- `pyproject.toml`
- `tests/test_epub_parser.py`
- `tests/test_computer_fs_tools.py`

This is NOT a breaking change. / 这不是一个破坏性变更。
Screenshots or Test Results / 运行截图或测试结果
Screenshots / 运行截图:



Checklist / 检查清单
- If this PR adds new features, they have been discussed with the authors via Issue / email. / 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
- My changes are well tested, and the "verification steps" and "screenshots" are provided above. / 我的更改经过了良好的测试,并已在上方提供了"验证步骤"和"运行截图"。
- I have ensured no new dependencies are introduced, or any new dependencies have been added at the appropriate locations in requirements.txt and pyproject.toml. / 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到 requirements.txt 和 pyproject.toml 文件相应位置。
- My changes do not introduce malicious code. / 我的更改没有引入恶意代码。
Summary by Sourcery
Add EPUB document parsing support to the knowledge base and file reading pipeline, expose EPUB as a supported upload format in the dashboard, and cover the new behavior with tests.
New Features:
Enhancements:
Build:
Tests: