@afourney
Member

@afourney afourney commented May 21, 2025

This pull request hardens XML handling in the markitdown package by replacing the standard xml.dom.minidom parser with defusedxml.minidom, which rejects malicious constructs such as entity-expansion payloads. Type annotations now import Document and Element directly from xml.dom.minidom, so the annotations stay consistent across the converters.

Security Improvements:

  • Added defusedxml as a dependency in pyproject.toml so that defusedxml.minidom can replace the standard xml.dom.minidom for safer XML parsing. (packages/markitdown/pyproject.toml, R32)
  • Replaced imports of xml.dom.minidom with defusedxml.minidom in _epub_converter.py and _rss_converter.py for improved security. (packages/markitdown/src/markitdown/converters/_epub_converter.py; packages/markitdown/src/markitdown/converters/_rss_converter.py)

Code Consistency:

  • Updated type annotations in _epub_converter.py to use Document from xml.dom.minidom instead of minidom.Document. (packages/markitdown/src/markitdown/converters/_epub_converter.py, L131-R140)
  • Updated type annotations in _rss_converter.py to use Document and Element directly, ensuring consistent usage across methods. (packages/markitdown/src/markitdown/converters/_rss_converter.py)
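
For reviewers, here is a minimal sketch of the pattern described above: parsing goes through defusedxml.minidom, while Document and Element are imported from xml.dom.minidom purely for annotations. The helper name first_title, the BOMB sample, and the HAVE_DEFUSEDXML flag are hypothetical and not taken from the converters. The fallback import exists only so the sketch runs where defusedxml is not installed; the PR itself makes defusedxml a required dependency.

```python
from typing import Optional
from xml.dom.minidom import Document, Element  # used for type annotations only

try:
    # Hardened drop-in replacement used for the actual parsing.
    from defusedxml import minidom
    HAVE_DEFUSEDXML = True
except ImportError:
    # Illustration-only fallback (an assumption of this sketch, not part of the PR).
    # The stdlib parser is not safe for untrusted input.
    from xml.dom import minidom
    HAVE_DEFUSEDXML = False

# Hypothetical sample: a document that declares an internal entity.
# defusedxml refuses entity declarations by default (forbid_entities=True).
BOMB = '<?xml version="1.0"?><!DOCTYPE r [<!ENTITY a "aaaa">]><r><title>&a;</title></r>'


def first_title(xml_text: str) -> Optional[str]:
    """Return the text of the first <title> element, or None if there is none."""
    doc: Document = minidom.parseString(xml_text)
    titles = doc.getElementsByTagName("title")
    if not titles:
        return None
    title: Element = titles[0]
    if title.firstChild is None:
        return None
    return title.firstChild.nodeValue


if __name__ == "__main__":
    print(first_title("<rss><channel><title>Feed</title></channel></rss>"))
    if HAVE_DEFUSEDXML:
        try:
            first_title(BOMB)
        except ValueError as exc:  # defusedxml exceptions derive from ValueError
            print("rejected:", type(exc).__name__)
```

Because defusedxml.minidom returns ordinary xml.dom.minidom objects, annotating with the standard-library classes keeps type checkers satisfied without changing runtime behavior.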

@afourney afourney requested a review from gagb May 21, 2025 16:43
@gagb gagb merged commit bbcf876 into main May 21, 2025
3 checks passed
@gagb gagb deleted the defusedxml branch May 21, 2025 16:47
azhao25 pushed a commit to azhao25/markitdown that referenced this pull request Oct 16, 2025
3 participants