feat: add epub support for knowledge base document upload #7594
Aster-amellus wants to merge 6 commits into AstrBotDevs:master from
Conversation
Hey - I've found 2 issues, and left some high level feedback:
- In `file_read_utils._parse_local_supported_document`, the combined `.epub`/`_looks_like_zip_container` branch effectively makes the later `.docx`-only branch dead code for real DOCX files; consider restructuring the conditions (e.g., check `.epub` by suffix first, then DOCX for ZIP containers) to avoid redundant paths and make the control flow clearer.
- In `EpubParser.parse`, the spine is iterated without considering the EPUB `linear="no"` flag, so non-linear navigation or auxiliary documents may be included; consider checking the spine item's attributes (e.g., via the tuple's second element) and skipping entries marked as non-linear.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `file_read_utils._parse_local_supported_document`, the combined `.epub`/`_looks_like_zip_container` branch effectively makes the later `.docx`-only branch dead code for real DOCX files; consider restructuring the conditions (e.g., check `.epub` by suffix first, then DOCX for ZIP containers) to avoid redundant paths and make the control flow clearer.
- In `EpubParser.parse`, the spine is iterated without considering the EPUB `linear="no"` flag, so non-linear navigation or auxiliary documents may be included; consider checking the spine item’s attributes (e.g., via the tuple’s second element) and skipping entries marked as non-linear.
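The suggested branch reordering can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual code: `choose_branch`, `_looks_like_zip_container`, and the return labels are hypothetical names standing in for the real parsing paths.

```python
import io
import zipfile


def _is_epub_bytes(data: bytes) -> bool:
    # Bounded read of the mimetype entry; any malformed zip means "not EPUB".
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            with archive.open("mimetype") as fh:
                return fh.read(64).decode("ascii", "replace").strip() == "application/epub+zip"
    except (KeyError, OSError, zipfile.BadZipFile):
        return False


def _looks_like_zip_container(data: bytes) -> bool:
    # DOCX (and EPUB) files are ZIP containers starting with the PK magic.
    return data[:4] == b"PK\x03\x04"


def choose_branch(suffix: str, data: bytes) -> str:
    # Check .epub first (by suffix or content), then treat remaining
    # ZIP containers as DOCX, so the DOCX branch is no longer dead code.
    if suffix.lower() == ".epub" or _is_epub_bytes(data):
        return "epub"
    if _looks_like_zip_container(data):
        return "docx"
    return "plain"
```

With this ordering, a real DOCX file (a ZIP container that is not an EPUB) reaches the DOCX branch instead of being swallowed by the combined EPUB check.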
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/parsers/epub_parser.py" line_range="19-22" />
<code_context>
+ return "\n".join(line for line in lines if line)
+
+
+def _extract_text_from_html(body_content: bytes | str) -> str:
+ from bs4 import BeautifulSoup
+
+ soup = BeautifulSoup(body_content, "html.parser")
+ for tag_name in _DROP_TAGS:
+ for tag in soup.find_all(tag_name):
</code_context>
<issue_to_address>
**issue:** Handle missing BeautifulSoup dependency consistently with EbookLib errors.
Because `BeautifulSoup` is imported inside `_extract_text_from_html`, a missing `beautifulsoup4` raises a raw `ImportError` instead of the clearer `RuntimeError` you use for `EbookLib`. To align error handling, either import `BeautifulSoup` at module level so dependency issues fail fast, or wrap the local import in try/except and raise a descriptive `RuntimeError` instead.
</issue_to_address>
### Comment 2
<location path="astrbot/core/knowledge_base/parsers/util.py" line_range="9" />
<code_context>
from .markitdown_parser import MarkitdownParser
return MarkitdownParser()
+ if ext == ".epub":
+ from .epub_parser import EpubParser
+
</code_context>
<issue_to_address>
**suggestion:** Normalize file extensions before selecting the EPUB parser.
`select_parser` compares `ext` to lowercase literals (e.g. `".epub"`), so upper- or mixed-case extensions (".EPUB", ".Epub") won’t match. Consider normalizing once at the top of the function (e.g. `ext = ext.lower()`) so all extension checks behave consistently regardless of case.
</issue_to_address>
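The normalization suggestion can be sketched like this. The parser names returned here are placeholder strings standing in for the real parser classes, since the full `select_parser` body is not shown in the diff.

```python
def select_parser(ext: str) -> str:
    # Normalize once at the top so ".EPUB" / ".Epub" match the
    # lowercase literals used in every branch below.
    ext = ext.lower()
    if ext == ".epub":
        return "EpubParser"  # placeholder for the real EpubParser instance
    return "MarkitdownParser"  # placeholder for the fallback parser
```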
```python
from .markitdown_parser import MarkitdownParser
    return MarkitdownParser()
if ext == ".epub":
```
**suggestion:** Normalize file extensions before selecting the EPUB parser.
`select_parser` compares `ext` to lowercase literals (e.g. `".epub"`), so upper- or mixed-case extensions (".EPUB", ".Epub") won't match. Consider normalizing once at the top of the function (e.g. `ext = ext.lower()`) so all extension checks behave consistently regardless of case.
Code Review
This pull request introduces EPUB file support for the knowledge base, featuring a new EpubParser and updated file detection logic. The frontend was updated with localized strings and UI icons, and beautifulsoup4 and EbookLib were added as dependencies. Review feedback suggests refining the HTML tag filtering to preserve local navigation, switching to the lxml parser for better performance, and offloading the parsing logic to a separate thread to avoid blocking the asyncio event loop.
```python
from astrbot.core.knowledge_base.parsers.base import BaseParser, ParseResult

_DROP_TAGS = ("script", "style", "nav")
```
The PR description states that navigation sections should be filtered but not all `<nav>` content should be dropped. However, `"nav"` is included in `_DROP_TAGS` here, which will decompose all such tags regardless of their context. Since the main navigation document is already skipped in the parse loop (lines 54-55) based on manifest properties, you should remove `"nav"` from this list to preserve legitimate navigation elements (like local chapter TOCs) within the content documents.
```diff
- _DROP_TAGS = ("script", "style", "nav")
+ _DROP_TAGS = ("script", "style")
```
```python
def _extract_text_from_html(body_content: bytes | str) -> str:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(body_content, "html.parser")
```
Using the "lxml" parser with BeautifulSoup is recommended here as it is significantly faster and more robust for XHTML content (which EPUBs use) compared to the built-in "html.parser". Since lxml is a required dependency of EbookLib, it is guaranteed to be available in the environment. Additionally, consider moving the BeautifulSoup import to the top of the file to avoid repeated import overhead during parsing.
```diff
- soup = BeautifulSoup(body_content, "html.parser")
+ soup = BeautifulSoup(body_content, "lxml")
```
```python
async def parse(self, file_content: bytes, file_name: str) -> ParseResult:
    try:
        import ebooklib
        from ebooklib import epub
    except ImportError as exc:
        raise RuntimeError(
            "EPUB support requires the EbookLib package to be installed."
        ) from exc
```
The parse method performs CPU-intensive tasks (EPUB decompression and HTML parsing) synchronously. In an asyncio environment, this can block the event loop and degrade the responsiveness of the application, especially when processing large books. Consider offloading the heavy lifting to a separate thread using asyncio.to_thread. Note that while synchronous blocks in the event loop are safe from race conditions due to their atomic execution, moving this logic to a thread requires ensuring that no shared state is modified without proper synchronization. Furthermore, since ebooklib is now a mandatory dependency in pyproject.toml, these dynamic imports should be moved to the top of the file for better performance and clarity.
References
- In a single-threaded asyncio event loop, synchronous functions (code blocks without 'await') are executed atomically and will not be interrupted by other coroutines. Therefore, they are safe from race conditions when modifying shared state within that block.
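The offloading pattern the comment describes can be sketched as follows. `_parse_epub_sync` is a stand-in for the real EbookLib + BeautifulSoup work, not the PR's actual function.

```python
import asyncio


def _parse_epub_sync(file_content: bytes) -> str:
    # Stand-in for the CPU-bound EPUB decompression + HTML parsing.
    return file_content.decode("utf-8", "replace").upper()


async def parse(file_content: bytes) -> str:
    # Run the blocking work in a worker thread so the event loop keeps
    # servicing other coroutines while a large book is parsed.
    return await asyncio.to_thread(_parse_epub_sync, file_content)
```

Because the worker function here only touches its own arguments and returns a value, no extra synchronization is needed; shared mutable state would require locking once the work leaves the event loop thread.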
Pull request overview
Adds .epub support end-to-end for Knowledge Base document uploads (backend parsing + dashboard upload/UI), including new parsing logic, file-type detection, dependency updates, and tests.
Changes:
- Introduces `EpubParser` and wires it into parser selection and local file reading (magic/ZIP-based detection).
- Updates dashboard upload components + i18n strings to allow selecting/uploading `.epub` and to display EPUB icons/colors.
- Adds test coverage for EPUB parsing and file-read tool EPUB detection.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.
Summary per file:

| File | Description |
|---|---|
| `astrbot/core/knowledge_base/parsers/epub_parser.py` | New EPUB-to-text parser using EbookLib + BeautifulSoup extraction. |
| `astrbot/core/knowledge_base/parsers/util.py` | Registers `.epub` in `select_parser()`. |
| `astrbot/core/knowledge_base/parsers/__init__.py` | Exports `EpubParser`. |
| `astrbot/core/computer/file_read_utils.py` | Adds EPUB magic detection and local EPUB parsing path. |
| `dashboard/src/views/knowledge-base/components/DocumentsTab.vue` | Accepts `.epub`, resets file input after selection, updates icon/color handling, and tweaks upload request. |
| `dashboard/src/views/knowledge-base/DocumentDetail.vue` | Adds EPUB icon/color mapping. |
| `dashboard/src/views/alkaid/KnowledgeBase.vue` | Resets file input after selection, adds EPUB icon mapping, and tweaks upload request. |
| `dashboard/src/i18n/locales/zh-CN/features/knowledge-base/detail.json` | Updates supported-format copy to include `.epub`. |
| `dashboard/src/i18n/locales/en-US/features/knowledge-base/detail.json` | Updates supported-format copy to include `.epub`. |
| `dashboard/src/i18n/locales/ru-RU/features/knowledge-base/detail.json` | Updates supported-format copy to include `.epub`. |
| `dashboard/src/i18n/locales/zh-CN/features/alkaid/knowledge-base.json` | Updates upload subtitle to mention EPUB. |
| `dashboard/src/i18n/locales/en-US/features/alkaid/knowledge-base.json` | Updates upload subtitle to mention EPUB. |
| `requirements.txt` | Adds beautifulsoup4 + EbookLib. |
| `pyproject.toml` | Adds beautifulsoup4 + EbookLib to dependencies. |
| `tests/test_epub_parser.py` | New tests for parser selection + EPUB text extraction behavior. |
| `tests/test_computer_fs_tools.py` | Adds EPUB bytes fixture + tests for EPUB magic detection and read tool integration. |
```python
def _is_epub_bytes(file_bytes: bytes) -> bool:
    try:
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
            names = set(archive.namelist())
            mimetype = archive.read("mimetype").decode("utf-8").strip()
    except (KeyError, OSError, UnicodeDecodeError, zipfile.BadZipFile):
        return False

    return mimetype == "application/epub+zip" and "META-INF/container.xml" in names
```
_is_epub_bytes() calls archive.read('mimetype'), which will decompress and load the entire entry into memory. Since this function can run on arbitrary ZIPs (via magic detection), a malicious archive could use an oversized/zip-bomb mimetype entry to cause excessive memory use. Prefer archive.open('mimetype') and read a small bounded amount (e.g., first ~64 bytes) before decoding/stripping.
```python
@pytest.mark.asyncio
async def test_epub_parser_reads_spine_order_as_text():
    pytest.importorskip("bs4")
    pytest.importorskip("ebooklib")
```
These tests use pytest.importorskip() for bs4 and ebooklib, but both packages were added to the project's core dependencies in this PR. Keeping the skip means CI could silently skip the EPUB parser assertions if dependency resolution regresses. Consider importing directly (or failing fast with a clear message) so missing required deps fail the test run.
```python
@pytest.mark.asyncio
async def test_epub_parser_preserves_generic_container_text():
    pytest.importorskip("bs4")
    pytest.importorskip("ebooklib")
```
Same as above: pytest.importorskip() here will hide failures if required EPUB dependencies are missing/mispackaged. Since the PR adds these as core deps, consider removing the skip so the test suite fails loudly when the environment is incorrect.
```python
_DROP_TAGS = ("script", "style", "nav")


def _normalize_multiline_text(text: str) -> str:
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def _extract_text_from_html(body_content: bytes | str) -> str:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(body_content, "html.parser")
    for tag_name in _DROP_TAGS:
        for tag in soup.find_all(tag_name):
            tag.decompose()

    root = soup.body or soup
    return _normalize_multiline_text(root.get_text("\n", strip=True))
```
The parser currently decomposes all <nav> elements (_DROP_TAGS includes "nav"). This contradicts the PR description that says only EPUB navigation sections (TOC/landmarks/page-list) should be filtered, and it can also drop meaningful in-chapter navigation content. Consider keeping <nav> by default and only removing <nav> nodes that are explicitly navigation (e.g., epub:type in {toc, landmarks, page-list}) or updating the PR description to match the implemented behavior.
```python
for spine_entry in book.spine:
    item_id = self._resolve_spine_item_id(spine_entry)
    if not item_id:
        continue

    item = book.get_item_with_id(item_id)
    if item is None or item.get_type() != ebooklib.ITEM_DOCUMENT:
        continue
    if "nav" in getattr(item, "properties", []):
        continue

    chapter_text = _extract_text_from_html(item.get_body_content())
    if chapter_text:
        text_parts.append(chapter_text)

return ParseResult(text="\n\n".join(text_parts).strip(), media=[])


@staticmethod
def _resolve_spine_item_id(spine_entry: Any) -> str | None:
    if isinstance(spine_entry, tuple) and spine_entry:
        return str(spine_entry[0])
    if isinstance(spine_entry, str):
```
book.spine entries from EbookLib are commonly tuples like (idref, linear); _resolve_spine_item_id currently discards the linear flag and the main loop never checks it. That means non-linear spine items will still be included, and there is no fallback when the spine is empty/incomplete—both of which are claimed in the PR description. Consider inspecting the second tuple element (or itemref attributes) to skip linear == 'no', and add a fallback to iterate over document items when the spine doesn't yield any content (or update the PR description).
Force-pushed: f936c79 → ea0aaf8
Force-pushed: ee6c054 → 7f9bcbc
Force-pushed: a280edb → de8b43e
Motivation
I wanted to upload some novels and found the format was not supported, so I added EPUB support. / 想上传一些小说,发现格式不支持,遂增加一个 epub 支持
Modifications / 改动点
- Added `EpubParser` and registered it in the parser selection flow
- Updated `astrbot/core/computer/file_read_utils.py`, including ZIP container handling and local EPUB text extraction
- Added `astrbot/core/knowledge_base/parsers/epub_parser.py`, which filters `<nav>` content
- Allowed uploading `.epub` files
- Added the new dependencies to `requirements.txt` and `pyproject.toml`

Core files changed:

- `astrbot/core/knowledge_base/parsers/epub_parser.py`
- `astrbot/core/knowledge_base/parsers/util.py`
- `astrbot/core/knowledge_base/parsers/__init__.py`
- `astrbot/core/computer/file_read_utils.py`
- `dashboard/src/views/alkaid/KnowledgeBase.vue`
- `dashboard/src/views/knowledge-base/DocumentDetail.vue`
- `dashboard/src/views/knowledge-base/components/DocumentsTab.vue`
- `dashboard/src/i18n/locales/...`
- `requirements.txt`
- `pyproject.toml`
- `tests/test_epub_parser.py`
- `tests/test_computer_fs_tools.py`

This is NOT a breaking change. / 这不是一个破坏性变更。
Screenshots or Test Results / 运行截图或测试结果
Screenshots / 运行截图:



Checklist / 检查清单
- If this PR adds new features, they have been discussed with the authors via Issue / email. / 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
- My changes are well tested, and the "verification steps" and "screenshots" are provided above. / 我的更改经过了良好的测试,并已在上方提供了"验证步骤"和"运行截图"。
- I have ensured no new dependencies are introduced, or any new dependencies have been added at the appropriate locations in requirements.txt and pyproject.toml. / 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到 requirements.txt 和 pyproject.toml 文件相应位置。
- My changes do not introduce malicious code. / 我的更改没有引入恶意代码。
Summary by Sourcery
Add EPUB document parsing support to the knowledge base and file reading pipeline, expose EPUB as a supported upload format in the dashboard, and cover the new behavior with tests.
New Features:
Enhancements:
Build:
Tests: