
feat: add epub support for knowledge base document upload #7594

Open
Aster-amellus wants to merge 6 commits into AstrBotDevs:master from Aster-amellus:feat/epub-parser

Conversation

@Aster-amellus (Contributor) commented Apr 16, 2026

Motivation

I wanted to upload some novels and found the format unsupported, so this adds EPUB support.

Modifications

  • Added EPUB support to the knowledge base document parsing pipeline
  • Added a new EpubParser and registered it in the parser selection flow
  • Added EPUB detection to astrbot/core/computer/file_read_utils.py, including ZIP container handling and local EPUB text extraction
  • Improved EPUB HTML extraction logic in astrbot/core/knowledge_base/parsers/epub_parser.py
  • Preserved meaningful document structure during extraction with better block-level text normalization
  • Filtered only EPUB navigation sections such as TOC, landmarks, and page-list instead of dropping all <nav> content
  • Respected non-linear spine entries during EPUB parsing and added a fallback path for incomplete spine metadata
  • Updated dashboard upload components to accept .epub files
  • Updated dashboard file icons, file colors, and supported-format copy to include EPUB
  • Reset file input values after selection so the same file can be selected again more reliably
  • Added EPUB-related dependencies to requirements.txt and pyproject.toml
  • Added EPUB parser and file reader test coverage

Core files changed:

  • astrbot/core/knowledge_base/parsers/epub_parser.py

  • astrbot/core/knowledge_base/parsers/util.py

  • astrbot/core/knowledge_base/parsers/__init__.py

  • astrbot/core/computer/file_read_utils.py

  • dashboard/src/views/alkaid/KnowledgeBase.vue

  • dashboard/src/views/knowledge-base/DocumentDetail.vue

  • dashboard/src/views/knowledge-base/components/DocumentsTab.vue

  • dashboard/src/i18n/locales/...

  • requirements.txt

  • pyproject.toml

  • tests/test_epub_parser.py

  • tests/test_computer_fs_tools.py

This is NOT a breaking change.

Screenshots or Test Results

Screenshots:
[3 screenshot images]

Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Add EPUB document parsing support to the knowledge base and file reading pipeline, expose EPUB as a supported upload format in the dashboard, and cover the new behavior with tests.

New Features:

  • Introduce an EPUB parser and register it in the knowledge base parser selection so EPUB files can be ingested as text documents.
  • Enable file read tools to detect and parse local EPUB files from ZIP containers and treat them as supported documents.
  • Allow users to upload .epub files from the dashboard, including appropriate icons, colors, and i18n strings for EPUB documents.

Enhancements:

  • Normalize and filter extracted EPUB HTML content to better preserve meaningful structure while dropping non-content elements.
  • Simplify dashboard upload requests by relying on axios to set multipart form-data headers automatically.

Build:

  • Add EbookLib and BeautifulSoup as dependencies to support EPUB parsing in both requirements.txt and pyproject.toml.

Tests:

  • Add unit tests for the EPUB parser, including parser selection and text extraction behavior.
  • Extend file system tool tests to cover EPUB detection and reading via the EPUB parser.

Copilot AI review requested due to automatic review settings April 16, 2026 08:01
dosubot added labels (Apr 16, 2026): size:L (This PR changes 100-499 lines, ignoring generated files), area:webui (about the webui/dashboard of astrbot), feature:knowledge-base (about the knowledge base).

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • In file_read_utils._parse_local_supported_document, the combined .epub/_looks_like_zip_container branch effectively makes the later .docx-only branch dead code for real DOCX files; consider restructuring the conditions (e.g., check .epub by suffix first, then DOCX for ZIP containers) to avoid redundant paths and make the control flow clearer.
  • In EpubParser.parse, the spine is iterated without considering the EPUB linear="no" flag, so non-linear navigation or auxiliary documents may be included; consider checking the spine item’s attributes (e.g., via the tuple’s second element) and skipping entries marked as non-linear.

## Individual Comments

### Comment 1
<location path="astrbot/core/knowledge_base/parsers/epub_parser.py" line_range="19-22" />
<code_context>
+    return "\n".join(line for line in lines if line)
+
+
+def _extract_text_from_html(body_content: bytes | str) -> str:
+    from bs4 import BeautifulSoup
+
+    soup = BeautifulSoup(body_content, "html.parser")
+    for tag_name in _DROP_TAGS:
+        for tag in soup.find_all(tag_name):
</code_context>
<issue_to_address>
**issue:** Handle missing BeautifulSoup dependency consistently with EbookLib errors.

Because `BeautifulSoup` is imported inside `_extract_text_from_html`, a missing `beautifulsoup4` raises a raw `ImportError` instead of the clearer `RuntimeError` you use for `EbookLib`. To align error handling, either import `BeautifulSoup` at module level so dependency issues fail fast, or wrap the local import in try/except and raise a descriptive `RuntimeError` instead.
</issue_to_address>

### Comment 2
<location path="astrbot/core/knowledge_base/parsers/util.py" line_range="9" />
<code_context>
         from .markitdown_parser import MarkitdownParser

         return MarkitdownParser()
+    if ext == ".epub":
+        from .epub_parser import EpubParser
+
</code_context>
<issue_to_address>
**suggestion:** Normalize file extensions before selecting the EPUB parser.

`select_parser` compares `ext` to lowercase literals (e.g. `".epub"`), so upper- or mixed-case extensions (".EPUB", ".Epub") won’t match. Consider normalizing once at the top of the function (e.g. `ext = ext.lower()`) so all extension checks behave consistently regardless of case.
</issue_to_address>


Comment thread astrbot/core/knowledge_base/parsers/epub_parser.py Outdated
```python
        from .markitdown_parser import MarkitdownParser

        return MarkitdownParser()
    if ext == ".epub":
```

suggestion: Normalize file extensions before selecting the EPUB parser.

select_parser compares ext to lowercase literals (e.g. ".epub"), so upper- or mixed-case extensions (".EPUB", ".Epub") won’t match. Consider normalizing once at the top of the function (e.g. ext = ext.lower()) so all extension checks behave consistently regardless of case.
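A minimal sketch of the suggested normalization. The parser classes below are stand-ins for astrbot's real ones, so treat this as illustrative rather than the merged code:

```python
class EpubParser:
    """Stand-in for astrbot's EPUB parser."""


class MarkitdownParser:
    """Stand-in for the default fallback parser."""


def select_parser(ext: str):
    ext = ext.lower()  # normalize once so ".EPUB" / ".Epub" also match
    if ext == ".epub":
        return EpubParser()
    return MarkitdownParser()
```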


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces EPUB file support for the knowledge base, featuring a new EpubParser and updated file detection logic. The frontend was updated with localized strings and UI icons, and beautifulsoup4 and EbookLib were added as dependencies. Review feedback suggests refining the HTML tag filtering to preserve local navigation, switching to the lxml parser for better performance, and offloading the parsing logic to a separate thread to avoid blocking the asyncio event loop.


```python
from astrbot.core.knowledge_base.parsers.base import BaseParser, ParseResult

_DROP_TAGS = ("script", "style", "nav")
```
Severity: medium

The PR description states that navigation sections should be filtered but not all <nav> content should be dropped. However, "nav" is included in _DROP_TAGS here, which will decompose all such tags regardless of their context. Since the main navigation document is already skipped in the parse loop (lines 54-55) based on manifest properties, you should remove "nav" from this list to preserve legitimate navigation elements (like local chapter TOCs) within the content documents.

Suggested change:

```diff
-_DROP_TAGS = ("script", "style", "nav")
+_DROP_TAGS = ("script", "style")
```

```python
def _extract_text_from_html(body_content: bytes | str) -> str:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(body_content, "html.parser")
```
Severity: medium

Using the "lxml" parser with BeautifulSoup is recommended here as it is significantly faster and more robust for XHTML content (which EPUBs use) compared to the built-in "html.parser". Since lxml is a required dependency of EbookLib, it is guaranteed to be available in the environment. Additionally, consider moving the BeautifulSoup import to the top of the file to avoid repeated import overhead during parsing.

Suggested change:

```diff
-    soup = BeautifulSoup(body_content, "html.parser")
+    soup = BeautifulSoup(body_content, "lxml")
```

Comment on lines +34 to +41
```python
    async def parse(self, file_content: bytes, file_name: str) -> ParseResult:
        try:
            import ebooklib
            from ebooklib import epub
        except ImportError as exc:
            raise RuntimeError(
                "EPUB support requires the EbookLib package to be installed."
            ) from exc
```
Severity: medium

The parse method performs CPU-intensive tasks (EPUB decompression and HTML parsing) synchronously. In an asyncio environment, this can block the event loop and degrade the responsiveness of the application, especially when processing large books. Consider offloading the heavy lifting to a separate thread using asyncio.to_thread. Note that while synchronous blocks in the event loop are safe from race conditions due to their atomic execution, moving this logic to a thread requires ensuring that no shared state is modified without proper synchronization. Furthermore, since ebooklib is now a mandatory dependency in pyproject.toml, these dynamic imports should be moved to the top of the file for better performance and clarity.

References
  1. In a single-threaded asyncio event loop, synchronous functions (code blocks without 'await') are executed atomically and will not be interrupted by other coroutines. Therefore, they are safe from race conditions when modifying shared state within that block.
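A sketch of the offload the reviewer describes. `_parse_epub_sync` here is a hypothetical stand-in for the CPU-bound EbookLib + BeautifulSoup body of `EpubParser.parse`, not the PR's actual extraction code:

```python
import asyncio


def _parse_epub_sync(file_content: bytes) -> str:
    # placeholder for the real decompression + HTML extraction work
    return file_content.decode("utf-8", errors="replace")


async def parse(file_content: bytes) -> str:
    # run the blocking extraction in a worker thread so the event loop
    # stays responsive while large books are processed
    return await asyncio.to_thread(_parse_epub_sync, file_content)
```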


Copilot AI left a comment


Pull request overview

Adds .epub support end-to-end for Knowledge Base document uploads (backend parsing + dashboard upload/UI), including new parsing logic, file-type detection, dependency updates, and tests.

Changes:

  • Introduces EpubParser and wires it into parser selection and local file reading (magic/ZIP-based detection).
  • Updates dashboard upload components + i18n strings to allow selecting/uploading .epub and to display EPUB icons/colors.
  • Adds test coverage for EPUB parsing and file-read tool EPUB detection.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.

Summary per file:

  • astrbot/core/knowledge_base/parsers/epub_parser.py: new EPUB-to-text parser using EbookLib + BeautifulSoup extraction.
  • astrbot/core/knowledge_base/parsers/util.py: registers .epub in select_parser().
  • astrbot/core/knowledge_base/parsers/__init__.py: exports EpubParser.
  • astrbot/core/computer/file_read_utils.py: adds EPUB magic detection and a local EPUB parsing path.
  • dashboard/src/views/knowledge-base/components/DocumentsTab.vue: accepts .epub, resets the file input after selection, updates icon/color handling, and tweaks the upload request.
  • dashboard/src/views/knowledge-base/DocumentDetail.vue: adds EPUB icon/color mapping.
  • dashboard/src/views/alkaid/KnowledgeBase.vue: resets the file input after selection, adds EPUB icon mapping, and tweaks the upload request.
  • dashboard/src/i18n/locales/zh-CN/features/knowledge-base/detail.json: updates supported-format copy to include .epub.
  • dashboard/src/i18n/locales/en-US/features/knowledge-base/detail.json: updates supported-format copy to include .epub.
  • dashboard/src/i18n/locales/ru-RU/features/knowledge-base/detail.json: updates supported-format copy to include .epub.
  • dashboard/src/i18n/locales/zh-CN/features/alkaid/knowledge-base.json: updates the upload subtitle to mention EPUB.
  • dashboard/src/i18n/locales/en-US/features/alkaid/knowledge-base.json: updates the upload subtitle to mention EPUB.
  • requirements.txt: adds beautifulsoup4 + EbookLib.
  • pyproject.toml: adds beautifulsoup4 + EbookLib to dependencies.
  • tests/test_epub_parser.py: new tests for parser selection + EPUB text extraction behavior.
  • tests/test_computer_fs_tools.py: adds an EPUB bytes fixture + tests for EPUB magic detection and read-tool integration.

Comment on lines +374 to +382
```python
def _is_epub_bytes(file_bytes: bytes) -> bool:
    try:
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
            names = set(archive.namelist())
            mimetype = archive.read("mimetype").decode("utf-8").strip()
    except (KeyError, OSError, UnicodeDecodeError, zipfile.BadZipFile):
        return False

    return mimetype == "application/epub+zip" and "META-INF/container.xml" in names
```
Copilot AI Apr 16, 2026


_is_epub_bytes() calls archive.read('mimetype'), which will decompress and load the entire entry into memory. Since this function can run on arbitrary ZIPs (via magic detection), a malicious archive could use an oversized/zip-bomb mimetype entry to cause excessive memory use. Prefer archive.open('mimetype') and read a small bounded amount (e.g., first ~64 bytes) before decoding/stripping.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_epub_parser.py Outdated
Comment on lines +175 to +179
```python
@pytest.mark.asyncio
async def test_epub_parser_reads_spine_order_as_text():
    pytest.importorskip("bs4")
    pytest.importorskip("ebooklib")
```
Copilot AI Apr 16, 2026


These tests use pytest.importorskip() for bs4 and ebooklib, but both packages were added to the project's core dependencies in this PR. Keeping the skip means CI could silently skip the EPUB parser assertions if dependency resolution regresses. Consider importing directly (or failing fast with a clear message) so missing required deps fail the test run.

Comment thread tests/test_epub_parser.py Outdated
Comment on lines +190 to +194
```python
@pytest.mark.asyncio
async def test_epub_parser_preserves_generic_container_text():
    pytest.importorskip("bs4")
    pytest.importorskip("ebooklib")
```
Copilot AI Apr 16, 2026


Same as above: pytest.importorskip() here will hide failures if required EPUB dependencies are missing/mispackaged. Since the PR adds these as core deps, consider removing the skip so the test suite fails loudly when the environment is incorrect.

Comment on lines +11 to +28
```python
_DROP_TAGS = ("script", "style", "nav")


def _normalize_multiline_text(text: str) -> str:
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def _extract_text_from_html(body_content: bytes | str) -> str:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(body_content, "html.parser")
    for tag_name in _DROP_TAGS:
        for tag in soup.find_all(tag_name):
            tag.decompose()

    root = soup.body or soup
    return _normalize_multiline_text(root.get_text("\n", strip=True))
```
Copilot AI Apr 16, 2026


The parser currently decomposes all <nav> elements (_DROP_TAGS includes "nav"). This contradicts the PR description that says only EPUB navigation sections (TOC/landmarks/page-list) should be filtered, and it can also drop meaningful in-chapter navigation content. Consider keeping <nav> by default and only removing <nav> nodes that are explicitly navigation (e.g., epub:type in {toc, landmarks, page-list}) or updating the PR description to match the implemented behavior.
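One way to implement the narrower filter, sketched with stdlib ElementTree since EPUB content documents are XHTML. The PR itself uses BeautifulSoup, so this is an illustrative stand-in, not the merged code:

```python
import xml.etree.ElementTree as ET

# epub:type lives in the EPUB ops namespace on well-formed content documents
_EPUB_TYPE_ATTR = "{http://www.idpf.org/2007/ops}type"
_NAV_TYPES_TO_DROP = {"toc", "landmarks", "page-list"}


def drop_structural_navs(xhtml: str) -> str:
    """Remove only <nav> elements marked as EPUB structural navigation,
    keeping any other <nav> content (e.g. local chapter link lists)."""
    root = ET.fromstring(xhtml)
    for parent in root.iter():
        for child in list(parent):
            tag = child.tag.rsplit("}", 1)[-1]  # strip namespace, if any
            if tag == "nav" and child.get(_EPUB_TYPE_ATTR) in _NAV_TYPES_TO_DROP:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```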

Comment on lines +46 to +67
```python
        for spine_entry in book.spine:
            item_id = self._resolve_spine_item_id(spine_entry)
            if not item_id:
                continue

            item = book.get_item_with_id(item_id)
            if item is None or item.get_type() != ebooklib.ITEM_DOCUMENT:
                continue
            if "nav" in getattr(item, "properties", []):
                continue

            chapter_text = _extract_text_from_html(item.get_body_content())
            if chapter_text:
                text_parts.append(chapter_text)

        return ParseResult(text="\n\n".join(text_parts).strip(), media=[])

    @staticmethod
    def _resolve_spine_item_id(spine_entry: Any) -> str | None:
        if isinstance(spine_entry, tuple) and spine_entry:
            return str(spine_entry[0])
        if isinstance(spine_entry, str):
```
Copilot AI Apr 16, 2026


book.spine entries from EbookLib are commonly tuples like (idref, linear); _resolve_spine_item_id currently discards the linear flag and the main loop never checks it. That means non-linear spine items will still be included, and there is no fallback when the spine is empty/incomplete—both of which are claimed in the PR description. Consider inspecting the second tuple element (or itemref attributes) to skip linear == 'no', and add a fallback to iterate over document items when the spine doesn't yield any content (or update the PR description).
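A sketch of the linear-aware resolution this comment asks for, as a pure-Python stand-in for `_resolve_spine_item_id`. It assumes (matching EbookLib's common behavior, as the comment notes) that spine entries are `(idref, linear)` tuples or bare idref strings:

```python
from typing import Any, Optional


def resolve_spine_item_id(spine_entry: Any) -> Optional[str]:
    """Return the idref for a linear spine entry, or None to skip it."""
    if isinstance(spine_entry, tuple) and spine_entry:
        # second tuple element, when present, carries the linear flag
        linear = spine_entry[1] if len(spine_entry) > 1 else "yes"
        if str(linear).strip().lower() == "no":
            return None  # skip non-linear navigation / auxiliary documents
        return str(spine_entry[0])
    if isinstance(spine_entry, str):
        return spine_entry
    return None
```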

Comment thread astrbot/core/knowledge_base/parsers/epub_parser.py
