
Add gaussian/fchk format support #896

Merged
wanghan-iapcm merged 16 commits into deepmodeling:devel from Romarin87:fchk
Sep 17, 2025

Conversation

@Romarin87 (Contributor) commented Sep 16, 2025

Added reading of fchk format files and support for saving hessian.npy

Summary by CodeRabbit

  • New Features

    • Added import support for Gaussian formatted checkpoint (.fchk) with automatic unit conversions for coordinates, energies, forces, and Hessians; Hessian data type is now supported and available; new Gaussian FCHK format handler added; non-periodic systems handled by default.
  • Tests

    • Added FCHK/LOG fixtures and unit tests validating presence, shapes, and cross-file consistency of energies, forces, coordinates, atom info, and Hessians.

coderabbitai bot commented Sep 16, 2025

📝 Walkthrough

Adds a Gaussian FCHK parser that reads coordinates, energy, gradients (as forces), and lower‑triangular Hessian blocks, reconstructs and converts full symmetric Hessians, adds a Hessian unit conversion, registers a gaussian/fchk format and Hessian data type, and adds FCHK/log fixtures and tests.

Changes

  • Gaussian FCHK parser (dpdata/gaussian/fchk.py)
    New module: unit conversion globals (length_convert, energy_convert, force_convert, hessian_convert), create_full_hessian(hessian_raw, natoms) to rebuild a symmetric 3N×3N Hessian from lower‑triangular data, and to_system_data(file_name, has_forces=True, has_hessian=True) to parse ELEMENTS, coordinates (Bohr→Å), total energy (Hartree→eV), gradients→forces (negated, converted to eV/Å), optional Hessian reconstruction and conversion (hartree/bohr^2→eV/Å^2), validate presence of required fields, and assemble the returned system dict (coords, atom info, optional energies/forces/hessian, nopbc, cell/orig).
  • Plugin integration & Hessian data type (dpdata/plugins/gaussian.py)
    Adds register_hessian_data(data) and a GaussianFChkFormat registered as @Format.register("gaussian/fchk") with from_labeled_system(file_name, has_forces=True, has_hessian=True, **kwargs), which calls the FCHK parser, registers a Hessian DataType (shape (Axis.NFRAMES, Axis.NATOMS, 3, Axis.NATOMS, 3)) when present, and returns parsed data with an AssertionError fallback.
  • Hessian unit conversion (dpdata/unit.py)
    Adds HessianConversion(Conversion) to parse composite units like energy_unit/length_unit^2, split the energy and length parts, compute the conversion via EnergyConversion and LengthConversion, and provide a conversion factor for Hessian units.
  • Gaussian fixtures (tests/gaussian/waterfreq.gaussianfchk, tests/gaussian/waterfreq.gaussianlog)
    New test fixtures: a Gaussian formatted checkpoint (FCHK) and the corresponding Gaussian log for a water frequency calculation.
  • FCHK tests (tests/test_gaussian_fchk.py)
    New unittest module with TestGaussianLoadFchk (validates parsed atom info, coords, energies, forces, Hessian, cells, orig, nopbc, frame counts) and TestGaussianFchkVsLog (compares FCHK vs LOG for energies, forces, coordinates, atom info, and system properties).
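The lower‑triangular reconstruction summarized above for create_full_hessian can be sketched as follows. This is a minimal standalone sketch, not the repository's exact code; it assumes the fchk file stores the 3N(3N+1)/2 lower‑triangular elements in row‑major order, which matches the enumeration of numpy's tril_indices.

```python
import numpy as np

def create_full_hessian(hessian_raw, natoms):
    """Rebuild a symmetric 3N x 3N Hessian from lower-triangular data."""
    dim = 3 * natoms
    expected_length = dim * (dim + 1) // 2
    flat = np.asarray(hessian_raw, dtype=float)
    if flat.size != expected_length:
        raise ValueError(
            f"got {flat.size} values, expected {expected_length} for {natoms} atoms"
        )
    full = np.zeros((dim, dim))
    # np.tril_indices enumerates (0,0), (1,0), (1,1), (2,0), ... row by row,
    # matching the fchk storage order of the lower triangle.
    full[np.tril_indices(dim)] = flat
    # Mirror the strict lower triangle into the upper triangle; subtract the
    # diagonal once so it is not doubled.
    return full + full.T - np.diag(np.diag(full))
```

For water (N=3) this expects 45 values, consistent with the fixture's Hessian block size checked in the review below.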

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as Caller
  participant F as GaussianFChkFormat
  participant P as dpdata.gaussian.fchk
  participant R as register_hessian_data
  participant S as LabeledSystem

  U->>F: from_labeled_system(file, has_forces, has_hessian)
  F->>P: to_system_data(file, has_forces, has_hessian)
  P-->>F: data {atoms, coords, energies?, forces?, hessian?}
  alt data includes hessian
    F->>R: register_hessian_data(data)
    R->>S: register DataType(name="hessian", shape=(NFRAMES,NATOMS,3,NATOMS,3))
  end
  F-->>U: data
sequenceDiagram
  autonumber
  participant P as fchk.to_system_data
  participant Uc as UnitConversion
  participant H as create_full_hessian

  P->>Uc: convert coords (bohr → Å), energy (hartree → eV), gradients→forces (→ eV/Å)
  opt Hessian present
    P->>H: create_full_hessian(lower-triangular, natoms)
    H-->>P: symmetric full Hessian (3N×3N)
    P->>Uc: convert Hessian (hartree/bohr^2 → eV/Å^2)
  end
  P-->>P: assemble and return system data dict
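Concretely, the conversions in this second sequence reduce to a few multiplicative factors. A sketch with the CODATA constants written out for illustration; dpdata derives these through its EnergyConversion/LengthConversion classes rather than hard-coding them, and the helper name below is hypothetical:

```python
# Illustrative constants (CODATA values); dpdata computes these via its
# Conversion classes rather than hard-coding them.
BOHR_TO_ANGSTROM = 0.52917721067   # length: bohr -> Å
HARTREE_TO_EV = 27.21138602        # energy: hartree -> eV

length_convert = BOHR_TO_ANGSTROM
energy_convert = HARTREE_TO_EV
# Gradients are negated to obtain forces, then converted to eV/Å.
force_convert = HARTREE_TO_EV / BOHR_TO_ANGSTROM
# Hessians are second derivatives, so the length factor enters squared.
hessian_convert = HARTREE_TO_EV / BOHR_TO_ANGSTROM**2

def gradients_to_forces(gradients_hartree_per_bohr):
    # Force is the negative gradient, in converted units.
    return [-g * force_convert for g in gradients_hartree_per_bohr]
```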

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • wanghan-iapcm

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The PR title "Add gaussian/fchk format support" accurately and concisely describes the main change: adding Gaussian formatted checkpoint (fchk) parsing and Hessian support. The changeset adds dpdata/gaussian/fchk.py, unit conversion helpers, a Gaussian FCHK format registration, and tests, so the title is directly related to the primary work. It is clear and specific enough for a reviewer scanning history.

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 92.96875% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.06%. Comparing base (b25e0e7) to head (fd12f38).
⚠️ Report is 21 commits behind head on devel.

Files with missing lines:
  • dpdata/gaussian/fchk.py: patch 93.00%, 7 lines missing ⚠️
  • dpdata/plugins/gaussian.py: patch 87.50%, 2 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #896      +/-   ##
==========================================
+ Coverage   85.94%   86.06%   +0.11%     
==========================================
  Files          82       83       +1     
  Lines        7758     7886     +128     
==========================================
+ Hits         6668     6787     +119     
- Misses       1090     1099       +9     

☔ View full report in Codecov by Sentry.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (13)
tests/gaussian/waterfreq.gaussianfchk (1)

1-1293: Consider storing large fixture via Git LFS (or compressing) to keep repo lean.

A 1.3K-line numeric fixture can bloat the repo history. Using Git LFS (or gzipping and auto-unzipping in tests) will reduce clone size.

dpdata/unit.py (1)

155-176: Validate “^2” suffix and fail fast on malformed Hessian units.

Currently we assume unit strings end with “^2” (e.g., bohr^2). Add a guard to avoid silent misparsing like “bohr2”.

Apply this diff:

 class HessianConversion(Conversion):
     def __init__(self, unitA, unitB):
@@
-        eunitA, lunitA = self._split_unit(unitA)
-        eunitB, lunitB = self._split_unit(unitB)
+        eunitA, lunitA = self._split_unit(unitA)
+        eunitB, lunitB = self._split_unit(unitB)
         econv = EnergyConversion(eunitA, eunitB).value()
         lconv = LengthConversion(lunitA, lunitB).value()
         self.set_value(econv / lconv**2)
@@
-    def _split_unit(self, unit):
-        eunit = unit.split("/")[0]
-        lunit = unit.split("/")[1][:-2]
+    def _split_unit(self, unit):
+        eunit, lpart = unit.split("/")
+        if not lpart.endswith("^2"):
+            raise RuntimeError(f"Invalid Hessian unit (missing '^2'): {unit}")
+        lunit = lpart[:-2]
         return eunit, lunit
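As a standalone illustration of the validated parsing suggested above (the helper name here is hypothetical; the actual method lives on HessianConversion):

```python
def split_hessian_unit(unit: str):
    """Split a composite unit like 'hartree/bohr^2' into (energy, length) parts.

    Rejects malformed strings such as 'hartree/bohr2' or a bare 'hartree'.
    """
    energy_part, sep, length_part = unit.partition("/")
    if not sep:
        raise ValueError(f"invalid Hessian unit (missing '/'): {unit}")
    if not length_part.endswith("^2"):
        raise ValueError(f"invalid Hessian unit (missing '^2'): {unit}")
    return energy_part, length_part[:-2]
```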
dpdata/plugins/gaussian.py (1)

44-57: Tidy control flow and silence ruff TRY300.

Use else: after try/except to keep side-effects away from the exception path.

Apply this diff:

 @Format.register("gaussian/fchk")
 class GaussianFChkFormat(Format):
     def from_labeled_system(
         self, file_name: FileType, md=False, has_forces=True, has_hessian=True, **kwargs
     ):
-        try:
-            data = dpdata.gaussian.fchk.to_system_data(
-                file_name, md=md, has_forces=has_forces, has_hessian=has_hessian
-            )
-            register_hessian_data(data)
-            return data
-        except AssertionError:
-            return {"energies": [], "forces": [], "hessian": [], "nopbc": True}
+        try:
+            data = dpdata.gaussian.fchk.to_system_data(
+                file_name, md=md, has_forces=has_forces, has_hessian=has_hessian
+            )
+        except AssertionError:
+            return {"energies": [], "forces": [], "hessian": [], "nopbc": True}
+        else:
+            register_hessian_data(data)
+            return data
dpdata/gaussian/fchk.py (3)

74-93: md is unused; silence ruff or document.

If kept for API parity, underscore it locally to satisfy linters.

Apply this diff:

-def to_system_data(file_name: FileType, md=False, has_forces=True, has_hessian=True):
+def to_system_data(file_name: FileType, md=False, has_forces=True, has_hessian=True):
@@
-    data = {}
+    _ = md  # API parity; not used for fchk
+    data = {}

102-155: Rename throwaway loop variable to avoid E741 and improve clarity.

Replace l with line2 (or similar) in the chunk-reading loops.

Apply this diff (pattern repeated in all chunks):

-                while len(atom_numbers) < n:
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    atom_numbers += [int(x) for x in l.split()]
+                while len(atom_numbers) < n:
+                    line2 = next(fp)
+                    if isinstance(line2, bytes):
+                        line2 = line2.decode(errors="ignore")
+                    atom_numbers += [int(x) for x in line2.split()]

Repeat the same replacement for coordinate, gradient, and Hessian reads.


171-173: Cells length should match nframes.

cells is currently shape (1, 3, 3). Prefer per-frame shape to satisfy data-type checks consistently.

Apply this diff:

-    data["cells"] = np.array([np.eye(3) * 100])
+    nframes = len(coords_t)
+    data["cells"] = np.repeat(np.eye(3)[None, ...] * 100.0, nframes, axis=0)
tests/test_gaussian_fchk.py (7)

5-9: Add Path import and TEST_DIR helper for fixture paths.

-import numpy as np
-from context import dpdata
+import numpy as np
+from pathlib import Path
+try:
+    from context import dpdata  # tests/context.py sets sys.path for local import
+except ImportError:
+    import dpdata  # fallback to installed dpdata
+
+TEST_DIR = Path(__file__).resolve().parent

16-18: Generalize hard-coded shapes to NATOMS/NFRAMES.

Avoid baking 3 atoms into shapes; derive from metadata for fixture reusability.

         self.nframes = 1
+        self.natoms = sum(self.atom_numbs)
         self.atom_types = [1, 0, 0]
@@
-        self.assertEqual(self.system.data["forces"].shape, (1, 3, 3))
+        self.assertEqual(self.system.data["forces"].shape, (self.nframes, self.natoms, 3))
@@
-        self.assertEqual(self.system.data["hessian"].shape, (1, 9, 9))
+        self.assertEqual(
+            self.system.data["hessian"].shape,
+            (self.nframes, 3 * self.natoms, 3 * self.natoms),
+        )
@@
-        self.assertEqual(self.system.data["coords"].shape, (1, 3, 3))
+        self.assertEqual(self.system.data["coords"].shape, (self.nframes, self.natoms, 3))
@@
-        self.assertEqual(self.system.data["cells"].shape, (1, 3, 3))
+        self.assertEqual(self.system.data["cells"].shape, (self.nframes, 3, 3))

Also applies to: 47-51, 53-56, 59-63, 65-68


31-35: Prefer vector assert for atom_types.

Cleaner and shows full diff on failure.

-        for ii in range(len(self.atom_types)):
-            self.assertEqual(self.system.data["atom_types"][ii], self.atom_types[ii])
+        np.testing.assert_array_equal(self.system.data["atom_types"], np.array(self.atom_types))

90-95: Remove unused conversion imports/variables.

They’re not used and may trigger linter warnings.

-        from dpdata.unit import EnergyConversion, ForceConversion, LengthConversion
-
-        self.energy_convert = EnergyConversion("hartree", "eV").value()
-        self.force_convert = ForceConversion("hartree/bohr", "eV/angstrom").value()
-        self.length_convert = LengthConversion("bohr", "angstrom").value()

170-177: Make key set comparison less brittle.

Future plugins may add keys; ensure log keys ⊆ fchk keys and only allow extra “hessian” on fchk.

-        # Check that both have the same data keys
-        fchk_keys = set(self.system_fchk.data.keys())
-        log_keys = set(self.system_log.data.keys())
-
-        # fchk has hessian, log doesn't, so we exclude it from comparison
-        fchk_keys.discard("hessian")
-
-        self.assertEqual(fchk_keys, log_keys)
+        # Check that all keys from log are present in fchk; allow extra 'hessian' in fchk
+        fchk_keys = set(self.system_fchk.data.keys())
+        log_keys = set(self.system_log.data.keys())
+        self.assertTrue(log_keys.issubset(fchk_keys))
+        self.assertLessEqual(fchk_keys - log_keys, {"hessian"})

52-56: Optional: Assert Hessian symmetry.

Adds a quick sanity check that reconstructed Hessian is symmetric.

    def test_hessian_is_symmetric(self):
        H = self.system.data["hessian"][0]
        np.testing.assert_allclose(H, H.T, rtol=0.0, atol=1e-8)

52-56: Optional: Cover saving hessian.npy.

PR mentions saving Hessian; add an I/O roundtrip check.

    def test_saving_includes_hessian_npy(self):
        import os, tempfile
        with tempfile.TemporaryDirectory() as tmpdir:
            out = os.path.join(tmpdir, "set.000")
            # Ensure this uses the LabeledSystem that has hessian (from fchk)
            self.system.to("deepmd/npy", out)
            self.assertTrue(os.path.isfile(os.path.join(out, "data", "hessian.npy")))

If the to() signature differs in this repo, I can adjust once you confirm the expected API.

Also applies to: 80-88

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b25e0e7 and 254773d.

📒 Files selected for processing (6)
  • dpdata/gaussian/fchk.py (1 hunks)
  • dpdata/plugins/gaussian.py (3 hunks)
  • dpdata/unit.py (1 hunks)
  • tests/gaussian/waterfreq.gaussianfchk (1 hunks)
  • tests/gaussian/waterfreq.gaussianlog (1 hunks)
  • tests/test_gaussian_fchk.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
dpdata/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use ruff to lint and format all Python code under dpdata/ before committing

Files:

  • dpdata/gaussian/fchk.py
  • dpdata/unit.py
  • dpdata/plugins/gaussian.py
dpdata/*/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

When adding a new format, implement it as a new submodule under dpdata//

Files:

  • dpdata/gaussian/fchk.py
  • dpdata/plugins/gaussian.py
🧬 Code graph analysis (3)
tests/test_gaussian_fchk.py (2)
dpdata/system.py (1)
  • LabeledSystem (1159-1357)
dpdata/unit.py (4)
  • EnergyConversion (106-117)
  • ForceConversion (134-152)
  • LengthConversion (120-131)
  • value (93-94)
dpdata/gaussian/fchk.py (2)
dpdata/utils.py (1)
  • open_file (137-170)
dpdata/unit.py (5)
  • EnergyConversion (106-117)
  • ForceConversion (134-152)
  • HessianConversion (155-180)
  • LengthConversion (120-131)
  • value (93-94)
dpdata/plugins/gaussian.py (3)
dpdata/data_type.py (3)
  • Axis (15-21)
  • DataType (35-131)
  • register_data_type (138-149)
dpdata/system.py (4)
  • System (55-1111)
  • register_data_type (1094-1111)
  • LabeledSystem (1159-1357)
  • has_forces (1269-1270)
dpdata/gaussian/fchk.py (1)
  • to_system_data (74-180)
🪛 Ruff (0.12.2)
dpdata/gaussian/fchk.py

57-60: Avoid specifying long messages outside the exception class

(TRY003)


74-74: Unused function argument: md

(ARG001)


112-112: Ambiguous variable name: l

(E741)


114-114: Ambiguous variable name: l

(E741)


120-120: Ambiguous variable name: l

(E741)


122-122: Ambiguous variable name: l

(E741)


133-133: Ambiguous variable name: l

(E741)


135-135: Ambiguous variable name: l

(E741)


144-144: Ambiguous variable name: l

(E741)


146-146: Ambiguous variable name: l

(E741)

dpdata/plugins/gaussian.py

47-47: Unused method argument: kwargs

(ARG002)


54-54: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (6)
tests/gaussian/waterfreq.gaussianfchk (1)

376-386: Hessian block size matches 3N(3N+1)/2 expectation (N=3 ⇒ 45). LGTM.

tests/gaussian/waterfreq.gaussianlog (1)

1-739: Fixture looks consistent with fchk; includes gradients and force constants. LGTM.

dpdata/gaussian/fchk.py (3)

20-23: Unit factors set correctly for bohr/angstrom, hartree/eV, forces, and Hessians. LGTM.


26-71: Hessian reconstruction logic is sound and validates size. LGTM.


174-180: Conditional packing of energies/forces/Hessian is correct. LGTM.

tests/test_gaussian_fchk.py (1)

6-6: Context import OK — tests/context.py present and adjusts sys.path

tests/context.py inserts the repo root into sys.path (line 6) and imports dpdata (line 7), so from context import dpdata should work in CI — no change required.

Comment on lines 22 to 33
def register_hessian_data(data):
if "hessian" in data:
dt = DataType(
"hessian",
np.ndarray,
(Axis.NFRAMES, -1, -1),
required=False,
deepmd_name="hessian",
)
dpdata.System.register_data_type(dt)
dpdata.LabeledSystem.register_data_type(dt)


⚠️ Potential issue

NameError: missing dpdata import; registration will crash.

register_hessian_data calls dpdata.System.register_data_type but dpdata isn’t imported here.

Apply this diff:

@@
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING
@@
+import dpdata  # required for System/LabeledSystem registration
@@
 def register_hessian_data(data):
     if "hessian" in data:
-        dt = DataType(
+        dt = DataType(
             "hessian",
             np.ndarray,
             (Axis.NFRAMES, -1, -1),
             required=False,
             deepmd_name="hessian",
         )
-        dpdata.System.register_data_type(dt)
-        dpdata.LabeledSystem.register_data_type(dt)
+        # Avoid duplicate registrations if called multiple times
+        if not any(x.name == "hessian" for x in dpdata.System.DTYPES):
+            dpdata.System.register_data_type(dt)
+        if not any(x.name == "hessian" for x in dpdata.LabeledSystem.DTYPES):
+            dpdata.LabeledSystem.register_data_type(dt)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def register_hessian_data(data):
if "hessian" in data:
dt = DataType(
"hessian",
np.ndarray,
(Axis.NFRAMES, -1, -1),
required=False,
deepmd_name="hessian",
)
dpdata.System.register_data_type(dt)
dpdata.LabeledSystem.register_data_type(dt)
from typing import TYPE_CHECKING
import dpdata # required for System/LabeledSystem registration
def register_hessian_data(data):
if "hessian" in data:
dt = DataType(
"hessian",
np.ndarray,
(Axis.NFRAMES, -1, -1),
required=False,
deepmd_name="hessian",
)
# Avoid duplicate registrations if called multiple times
if not any(x.name == "hessian" for x in dpdata.System.DTYPES):
dpdata.System.register_data_type(dt)
if not any(x.name == "hessian" for x in dpdata.LabeledSystem.DTYPES):
dpdata.LabeledSystem.register_data_type(dt)
🤖 Prompt for AI Agents
In dpdata/plugins/gaussian.py around lines 22 to 33, the function
register_hessian_data calls dpdata.System.register_data_type but dpdata is not
imported causing a NameError; add an import for dpdata at the top of the file
(e.g., add "import dpdata" alongside the other imports) so dpdata.System and
dpdata.LabeledSystem resolve correctly when register_hessian_data runs.

Comment on lines +11 to +13
self.system = dpdata.LabeledSystem(
"gaussian/waterfreq.gaussianfchk", fmt="gaussian/fchk"
)

⚠️ Potential issue

Make fixture paths robust to the working directory.

Use paths relative to this test file to avoid CI failures when pytest runs from repo root.

-        self.system = dpdata.LabeledSystem(
-            "gaussian/waterfreq.gaussianfchk", fmt="gaussian/fchk"
-        )
+        self.system = dpdata.LabeledSystem(
+            str(TEST_DIR / "gaussian" / "waterfreq.gaussianfchk"), fmt="gaussian/fchk"
+        )
@@
-        self.system_fchk = dpdata.LabeledSystem(
-            "gaussian/waterfreq.gaussianfchk", fmt="gaussian/fchk"
-        )
-        self.system_log = dpdata.LabeledSystem(
-            "gaussian/waterfreq.gaussianlog", fmt="gaussian/log"
-        )
+        self.system_fchk = dpdata.LabeledSystem(
+            str(TEST_DIR / "gaussian" / "waterfreq.gaussianfchk"), fmt="gaussian/fchk"
+        )
+        self.system_log = dpdata.LabeledSystem(
+            str(TEST_DIR / "gaussian" / "waterfreq.gaussianlog"), fmt="gaussian/log"
+        )

Also applies to: 82-87

🤖 Prompt for AI Agents
In tests/test_gaussian_fchk.py around lines 11-13 (and similarly at lines
82-87), the fixture file paths are hardcoded and can break when pytest is
invoked from the repository root; change them to build paths relative to this
test file using the test file's directory (e.g. use
Path(__file__).parent.joinpath("gaussian","waterfreq.gaussianfchk") and convert
to string where needed), add the necessary pathlib import at the top if missing,
and replace the other occurrences at 82-87 the same way so all fixture
references use a path computed from __file__.

codspeed-hq bot commented Sep 16, 2025

CodSpeed WallTime Performance Report

Merging #896 will not alter performance

Comparing Romarin87:fchk (fd12f38) with devel (b25e0e7)

Summary

✅ 2 untouched

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
dpdata/plugins/gaussian.py (1)

10-13: Explicitly import dpdata to make registration unambiguous.

Relying on import dpdata.gaussian.fchk to bind dpdata is indirect. Add a top‑level import dpdata to avoid surprises and to satisfy tools.

 import numpy as np
+import dpdata
 import dpdata.gaussian.fchk
 import dpdata.gaussian.gjf
 import dpdata.gaussian.log
 from dpdata.data_type import Axis, DataType
🧹 Nitpick comments (1)
dpdata/plugins/gaussian.py (1)

22-32: Guard against duplicate dtype registration (avoid warnings and churn).

register_hessian_data may be called multiple times; current code will trigger duplicate dtype warnings each call.

 def register_hessian_data(data):
     if "hessian" in data:
         dt = DataType(
             "hessian",
             np.ndarray,
             (Axis.NFRAMES, Axis.NATOMS, 3, Axis.NATOMS, 3),
             required=False,
             deepmd_name="hessian",
         )
-        dpdata.LabeledSystem.register_data_type(dt)
+        # Register once
+        if not any(x.name == "hessian" for x in dpdata.LabeledSystem.DTYPES):
+            dpdata.LabeledSystem.register_data_type(dt)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 254773d and 01d39ee.

📒 Files selected for processing (1)
  • dpdata/plugins/gaussian.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
dpdata/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use ruff to lint and format all Python code under dpdata/ before committing

Files:

  • dpdata/plugins/gaussian.py
dpdata/*/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

When adding a new format, implement it as a new submodule under dpdata//

Files:

  • dpdata/plugins/gaussian.py
🧬 Code graph analysis (1)
dpdata/plugins/gaussian.py (3)
dpdata/data_type.py (3)
  • Axis (15-21)
  • DataType (35-131)
  • register_data_type (138-149)
dpdata/system.py (3)
  • LabeledSystem (1159-1357)
  • register_data_type (1094-1111)
  • has_forces (1269-1270)
dpdata/gaussian/fchk.py (1)
  • to_system_data (74-180)
🪛 Ruff (0.12.2)
dpdata/plugins/gaussian.py

46-46: Unused method argument: kwargs

(ARG002)


53-53: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (2)
dpdata/plugins/gaussian.py (2)

8-8: Import looks good.

np is required for DataType and Hessian reshape in this module.


43-56: Fix Hessian shape mismatch vs registered DataType; clean up ruff findings

File: dpdata/plugins/gaussian.py Lines: 43-56

  • Normalize hessian from (nframes, 3N, 3N) -> (nframes, natoms, 3, natoms, 3) to match the registered dtype.
  • Rename **kwargs -> **_kwargs (ARG002) and move the return into the else branch to satisfy TRY300.
 @Format.register("gaussian/fchk")
 class GaussianFChkFormat(Format):
     def from_labeled_system(
-        self, file_name: FileType, md=False, has_forces=True, has_hessian=True, **kwargs
+        self, file_name: FileType, md=False, has_forces=True, has_hessian=True, **_kwargs
     ):
-        try:
-            data = dpdata.gaussian.fchk.to_system_data(
-                file_name, md=md, has_forces=has_forces, has_hessian=has_hessian
-            )
-            register_hessian_data(data)
-            return data
-        except AssertionError:
-            return {"energies": [], "forces": [], "hessian": [], "nopbc": True}
+        try:
+            data = dpdata.gaussian.fchk.to_system_data(
+                file_name, md=md, has_forces=has_forces, has_hessian=has_hessian
+            )
+        except AssertionError:
+            return {"energies": [], "forces": [], "hessian": [], "nopbc": True}
+        else:
+            # Normalize Hessian to (nframes, natoms, 3, natoms, 3) to match DataType.
+            if has_hessian and "hessian" in data and isinstance(data["hessian"], np.ndarray):
+                if data["hessian"].ndim == 3:  # (nframes, 3N, 3N)
+                    nframes, m, n = data["hessian"].shape
+                    natoms = int(data["coords"].shape[1])
+                    if m == n == 3 * natoms:
+                        data["hessian"] = data["hessian"].reshape(nframes, natoms, 3, natoms, 3)
+            register_hessian_data(data)
+            return data

I could not verify the upstream reader in this checkout — please confirm that dpdata.gaussian.fchk.to_system_data returns (nframes, 3N, 3N) (or run the grep below) before merging:

rg -n -C2 "data\['hessian'\]|reshape\(\s*\(3\s*\*.*?,\s*3\s*\*" dpdata/gaussian/fchk.py
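The layout change proposed above is a pure reshape of a C-contiguous array: element (f, 3i+a, 3j+b) of the flat (nframes, 3N, 3N) form maps to (f, i, a, j, b) in the nested form. A small check of that mapping, using hypothetical data rather than the fixture's values:

```python
import numpy as np

nframes, natoms = 1, 3
dim = 3 * natoms

# Hypothetical flat Hessian, shape (nframes, 3N, 3N).
flat = np.arange(nframes * dim * dim, dtype=float).reshape(nframes, dim, dim)

# Reshape to the registered DataType layout (NFRAMES, NATOMS, 3, NATOMS, 3).
nested = flat.reshape(nframes, natoms, 3, natoms, 3)

# For a C-contiguous array, nested[f, i, a, j, b] == flat[f, 3*i + a, 3*j + b],
# and the reshape round-trips without copying or reordering data.
assert nested[0, 1, 2, 0, 1] == flat[0, 1 * 3 + 2, 0 * 3 + 1]
assert np.array_equal(nested.reshape(nframes, dim, dim), flat)
```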

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dpdata/gaussian/fchk.py (1)

1-179: Fix ruff violations in dpdata/gaussian/fchk.py

  • TRY003 (dpdata/gaussian/fchk.py:55–60): shorten the exception message or move message formatting into an exception class; e.g. replace with a concise message like raise ValueError(f"Input length {hessian_block.size} != expected {expected_length}").
  • ARG001 (dpdata/gaussian/fchk.py:74): unused parameter md — rename to _md (preserve API) or remove if safe.
  • E741 (dpdata/gaussian/fchk.py:112–116, 118–124, 131–137, 142–148): rename ambiguous variable l to line or raw_line in all loops.
    Run ruff check/format after these edits.
♻️ Duplicate comments (2)
tests/test_gaussian_fchk.py (2)

11-13: Make fixture path robust to CWD.

Build paths relative to this test file to avoid CI flakiness. (Prev comment already raised this.)

-        self.system = dpdata.LabeledSystem(
-            "gaussian/waterfreq.gaussianfchk", fmt="gaussian/fchk"
-        )
+        self.system = dpdata.LabeledSystem(
+            str(TEST_DIR / "gaussian" / "waterfreq.gaussianfchk"), fmt="gaussian/fchk"
+        )

Additional context to add near the imports:

+from pathlib import Path
+TEST_DIR = Path(__file__).parent

82-87: Apply the same robust path handling here.

Use TEST_DIR for both fchk and log.

-        self.system_fchk = dpdata.LabeledSystem(
-            "gaussian/waterfreq.gaussianfchk", fmt="gaussian/fchk"
-        )
-        self.system_log = dpdata.LabeledSystem(
-            "gaussian/waterfreq.gaussianlog", fmt="gaussian/log"
-        )
+        self.system_fchk = dpdata.LabeledSystem(
+            str(TEST_DIR / "gaussian" / "waterfreq.gaussianfchk"), fmt="gaussian/fchk"
+        )
+        self.system_log = dpdata.LabeledSystem(
+            str(TEST_DIR / "gaussian" / "waterfreq.gaussianlog"), fmt="gaussian/log"
+        )
🧹 Nitpick comments (7)
dpdata/gaussian/fchk.py (5)

57-60: Shorten exception message to satisfy Ruff TRY003.

Use a concise error string.

-        raise ValueError(
-            f"Input data length ({hessian_block.size}) does not match expected length ({expected_length}) "
-            f"for {natoms} atoms (matrix dimension {dim}x{dim})."
-        )
+        raise ValueError(
+            f"Invalid Hessian length: got {hessian_block.size}, expected {expected_length} (natoms={natoms}, dim={dim})."
+        )

74-94: Silence unused md while keeping signature compatibility.

Keep md for the plugin interface but avoid ARG001.

-def to_system_data(file_name: FileType, md=False, has_forces=True, has_hessian=True):
+def to_system_data(
+    file_name: FileType,
+    md: bool = False,
+    has_forces: bool = True,
+    has_hessian: bool = True,
+):
@@
-    data = {}
+    # fchk is effectively single-frame; keep `md` for API compatibility.
+    del md  # unused
+    data = {}

112-116: Avoid ambiguous single-letter l variable (Ruff E741).

Rename to chunk for readability and lint compliance.

-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    atom_numbers += [int(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    atom_numbers += [int(x) for x in chunk.split()]
@@
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    coords_raw += [float(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    coords_raw += [float(x) for x in chunk.split()]
@@
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    forces_raw += [float(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    forces_raw += [float(x) for x in chunk.split()]
@@
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    hessian_raw += [float(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    hessian_raw += [float(x) for x in chunk.split()]

Also applies to: 120-124, 133-137, 144-147


126-128: Make scalar energy parsing robust to fchk variants.

Handle cases where the value isn’t on the same line (e.g., header with trailing tokens).

-            elif "Total Energy" in line:
-                energy = float(line.split()[-1]) * energy_convert
+            elif "Total Energy" in line:
+                tokens = line.split()
+                try:
+                    energy = float(tokens[-1]) * energy_convert
+                except ValueError:
+                    nxt = next(fp)
+                    if isinstance(nxt, bytes):
+                        nxt = nxt.decode(errors="ignore")
+                    energy = float(nxt.split()[0]) * energy_convert
                 energy_t.append(energy)

154-160: Assertions tie behavior to flags; consider clearer error messages.

The asserts are fine, but raising with context (which field is missing) improves UX; optional.

-    if has_forces:
-        assert forces_t, "cannot find forces"
-    if has_hessian:
-        assert hessian_t, "cannot find hessian"
+    if has_forces and not forces_t:
+        raise ValueError("cannot find forces in fchk (disable with has_forces=False)")
+    if has_hessian and not hessian_t:
+        raise ValueError("cannot find hessian in fchk (disable with has_hessian=False)")
tests/test_gaussian_fchk.py (2)

89-95: Remove unused conversion imports/variables.

They’re not used; trim to keep tests minimal.

-        # Get conversion factors from dpdata
-        from dpdata.unit import EnergyConversion, ForceConversion, LengthConversion
-
-        self.energy_convert = EnergyConversion("hartree", "eV").value()
-        self.force_convert = ForceConversion("hartree/bohr", "eV/angstrom").value()
-        self.length_convert = LengthConversion("bohr", "angstrom").value()
+        # (no-op) conversions not needed for this comparison

31-35: Tighter assertion for atom_types.

Use array equality instead of a loop.

-        for ii in range(len(self.atom_types)):
-            self.assertEqual(self.system.data["atom_types"][ii], self.atom_types[ii])
+        np.testing.assert_array_equal(self.system.data["atom_types"], self.atom_types)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01d39ee and 76fb799.

📒 Files selected for processing (2)
  • dpdata/gaussian/fchk.py (1 hunks)
  • tests/test_gaussian_fchk.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
dpdata/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use ruff to lint and format all Python code under dpdata/ before committing

Files:

  • dpdata/gaussian/fchk.py
dpdata/*/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

When adding a new format, implement it as a new submodule under dpdata//

Files:

  • dpdata/gaussian/fchk.py
🧬 Code graph analysis (2)
dpdata/gaussian/fchk.py (2)
dpdata/utils.py (1)
  • open_file (137-170)
dpdata/unit.py (5)
  • EnergyConversion (106-117)
  • ForceConversion (134-152)
  • HessianConversion (155-180)
  • LengthConversion (120-131)
  • value (93-94)
tests/test_gaussian_fchk.py (2)
dpdata/system.py (1)
  • LabeledSystem (1159-1357)
dpdata/unit.py (4)
  • EnergyConversion (106-117)
  • ForceConversion (134-152)
  • LengthConversion (120-131)
  • value (93-94)
🪛 Ruff (0.12.2)
dpdata/gaussian/fchk.py

57-60: Avoid specifying long messages outside the exception class

(TRY003)


74-74: Unused function argument: md

(ARG001)


112-112: Ambiguous variable name: l

(E741)


114-114: Ambiguous variable name: l

(E741)


120-120: Ambiguous variable name: l

(E741)


122-122: Ambiguous variable name: l

(E741)


133-133: Ambiguous variable name: l

(E741)


135-135: Ambiguous variable name: l

(E741)


144-144: Ambiguous variable name: l

(E741)


146-146: Ambiguous variable name: l

(E741)

🔇 Additional comments (3)
dpdata/gaussian/fchk.py (3)

26-71: Hessian reconstruction logic is sound.

Size check, lower-tri fill, and symmetric mirroring are correct and numerically stable. Nice.
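For context, the lower-triangular reconstruction being praised here can be sketched standalone (this is an illustrative re-implementation, not the PR's actual `create_full_hessian`):

```python
import numpy as np

def full_from_lower_tri(flat, natoms):
    # Rebuild a symmetric (3N x 3N) matrix from its row-major
    # lower-triangular elements, the layout Gaussian fchk files use.
    dim = 3 * natoms
    flat = np.asarray(flat, dtype=float)
    if flat.size != dim * (dim + 1) // 2:
        raise ValueError("lower-triangle length mismatch")
    full = np.zeros((dim, dim))
    rows, cols = np.tril_indices(dim)  # row-major lower-tri order
    full[rows, cols] = flat
    # Mirror the strict lower triangle; the diagonal would be doubled
    # by full + full.T, so subtract it once.
    return full + full.T - np.diag(np.diag(full))
```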


140-153: Nice: lower-tri Hessian to full tensor with unit conversion.

The reshape to (natoms,3,natoms,3) matches the registered shape.


20-23: Unit conversion constants defined once at import — good.

This avoids recomputation and keeps the parser fast and consistent.
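As a sanity check on those factors (computed here from approximate CODATA base constants, not via the dpdata.unit API), the force and Hessian conversions compose from the energy and length ones:

```python
# Assumed base constants (CODATA 2018, approximate):
HARTREE_TO_EV = 27.211386245988    # 1 hartree in eV
BOHR_TO_ANGSTROM = 0.529177210903  # 1 bohr in angstrom

# Derived factors, mirroring the module-level constants in spirit:
force_convert = HARTREE_TO_EV / BOHR_TO_ANGSTROM         # hartree/bohr -> eV/angstrom
hessian_convert = HARTREE_TO_EV / BOHR_TO_ANGSTROM ** 2  # hartree/bohr^2 -> eV/angstrom^2
```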

@Romarin87 Romarin87 marked this pull request as draft September 16, 2025 11:22
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (8)
dpdata/gaussian/fchk.py (8)

26-71: Create-full-Hessian logic is sound; tighten the error message to satisfy Ruff TRY003.
Shorten the long exception text.

Apply this diff:

-    if hessian_block.size != expected_length:
-        raise ValueError(
-            f"Input data length ({hessian_block.size}) does not match expected length ({expected_length}) "
-            f"for {natoms} atoms (matrix dimension {dim}x{dim})."
-        )
+    if hessian_block.size != expected_length:
+        raise ValueError(
+            f"Invalid Hessian length: got {hessian_block.size}, expected {expected_length} "
+            f"(natoms={natoms}, dim={dim})."
+        )

74-93: md is unused; make intent explicit to silence Ruff ARG001.
Either consume it or mark intentionally unused.

Apply this minimal change after initializing data:

-    data = {}
+    data = {}
+    _ = md  # fchk is effectively single-frame; md flag not used

Alternatively, raise if someone tries md=True.


111-116: Rename ambiguous loop variable and accept Fortran D exponents (Ruff E741, robustness).

Apply this diff:

-                while len(atom_numbers) < n:
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    atom_numbers += [int(x) for x in l.split()]
+                while len(atom_numbers) < n:
+                    next_line = next(fp)
+                    if isinstance(next_line, bytes):
+                        next_line = next_line.decode(errors="ignore")
+                    tokens = next_line.split()
+                    atom_numbers += [int(x) for x in tokens]

-                while len(coords_raw) < n:
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    coords_raw += [float(x) for x in l.split()]
+                while len(coords_raw) < n:
+                    next_line = next(fp)
+                    if isinstance(next_line, bytes):
+                        next_line = next_line.decode(errors="ignore")
+                    coords_raw += [float(x.replace("D", "E")) for x in next_line.split()]

-                while len(forces_raw) < n:
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    forces_raw += [float(x) for x in l.split()]
+                while len(forces_raw) < n:
+                    next_line = next(fp)
+                    if isinstance(next_line, bytes):
+                        next_line = next_line.decode(errors="ignore")
+                    forces_raw += [float(x.replace("D", "E")) for x in next_line.split()]

-                while len(hessian_raw) < n:
-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    hessian_raw += [float(x) for x in l.split()]
+                while len(hessian_raw) < n:
+                    next_line = next(fp)
+                    if isinstance(next_line, bytes):
+                        next_line = next_line.decode(errors="ignore")
+                    hessian_raw += [float(x.replace("D", "E")) for x in next_line.split()]

Also applies to: 120-124, 133-137, 144-147
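The `replace("D", "E")` trick above matters because Python's `float()` rejects Fortran-style exponents; a minimal helper makes the behavior easy to verify:

```python
def parse_fortran_float(token: str) -> float:
    # Fortran writes double-precision exponents as 1.5D-03;
    # Python's float() only accepts E/e, so normalize first.
    return float(token.replace("D", "E").replace("d", "e"))
```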


140-153: Add sanity checks for natoms and Hessian header length before reconstructing.
Prevents silent mis-shapes when files are malformed.

Apply this diff:

             elif "Cartesian Force Constants" in line and "R" in line:
                 n = int(line.split()[-1])
                 hessian_raw = []
                 while len(hessian_raw) < n:
                     next_line = next(fp)
                     if isinstance(next_line, bytes):
                         next_line = next_line.decode(errors="ignore")
                     hessian_raw += [float(x.replace("D", "E")) for x in next_line.split()]
-                hessian_full = (
-                    create_full_hessian(hessian_raw, natoms) * hessian_convert
-                )
+                if natoms <= 0:
+                    raise ValueError("natoms must be parsed before Hessian block.")
+                expected = (3 * natoms) * (3 * natoms + 1) // 2
+                if n != expected:
+                    raise ValueError(f"Invalid Hessian header length: N={n}, expected {expected} for natoms={natoms}.")
+                hessian_full = create_full_hessian(hessian_raw, natoms) * hessian_convert
                 # store as (natoms, 3, natoms, 3) to align with registered shape
                 hessian_t.append(hessian_full.reshape(natoms, 3, natoms, 3))

154-160: Avoid assert for runtime validation; raise explicit exceptions (asserts can be stripped).

Apply this diff:

-    assert coords_t, "cannot find coords"
-    assert energy_t, "cannot find energy"
-    if has_forces:
-        assert forces_t, "cannot find forces"
-    if has_hessian:
-        assert hessian_t, "cannot find hessian"
+    if not coords_t:
+        raise ValueError("cannot find coords")
+    if not energy_t:
+        raise ValueError("cannot find energy")
+    if has_forces and not forces_t:
+        raise ValueError("cannot find forces")
+    if has_hessian and not hessian_t:
+        raise ValueError("cannot find hessian")

161-168: Validate atom_numbers vs natoms to catch inconsistent files.

Apply this diff:

-    atom_symbols = [ELEMENTS[z - 1] for z in atom_numbers]
+    if natoms != len(atom_numbers):
+        raise ValueError(f"atomic numbers length {len(atom_numbers)} != natoms {natoms}")
+    atom_symbols = [ELEMENTS[z - 1] for z in atom_numbers]

169-171: Cells should match number of frames; broadcast instead of a single cell.
Prevents shape mismatches if fchk ever carries multiple snapshots.

Apply this diff:

-    data["orig"] = np.zeros(3)
-    data["cells"] = np.array([np.eye(3) * 100])
+    nframes = len(coords_t)
+    data["orig"] = np.zeros(3)
+    data["cells"] = np.repeat((np.eye(3) * 100)[None, :, :], nframes, axis=0)
     data["nopbc"] = True
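The per-frame broadcast suggested here can be shown in isolation (`nframes` and the 100 Å box are placeholder values):

```python
import numpy as np

nframes = 4               # placeholder frame count
cell = np.eye(3) * 100.0  # large box for a non-periodic system

# One copy of the cell per frame, shape (nframes, 3, 3):
cells = np.repeat(cell[None, :, :], nframes, axis=0)
```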

20-23: Optional: PEP8 constant naming for module-level conversions.
If you care about style consistency, rename to UPPER_CASE and update usages.

Proposed diff (partial):

-length_convert = LengthConversion("bohr", "angstrom").value()
-energy_convert = EnergyConversion("hartree", "eV").value()
-force_convert = ForceConversion("hartree/bohr", "eV/angstrom").value()
-hessian_convert = HessianConversion("hartree/bohr^2", "eV/angstrom^2").value()
+LENGTH_CONVERT = LengthConversion("bohr", "angstrom").value()
+ENERGY_CONVERT = EnergyConversion("hartree", "eV").value()
+FORCE_CONVERT = ForceConversion("hartree/bohr", "eV/angstrom").value()
+HESSIAN_CONVERT = HessianConversion("hartree/bohr^2", "eV/angstrom^2").value()

-                coords = np.array(coords_raw).reshape(-1, 3) * length_convert
+                coords = np.array(coords_raw).reshape(-1, 3) * LENGTH_CONVERT
-                energy = float(line.split()[-1]) * energy_convert
+                energy = float(line.split()[-1]) * ENERGY_CONVERT
-                forces = -np.array(forces_raw).reshape(-1, 3) * force_convert
+                forces = -np.array(forces_raw).reshape(-1, 3) * FORCE_CONVERT
-                hessian_full = create_full_hessian(hessian_raw, natoms) * hessian_convert
+                hessian_full = create_full_hessian(hessian_raw, natoms) * HESSIAN_CONVERT

Also applies to: 124-125, 127-129, 138-139, 149-153

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 76fb799 and acbdfbb.

📒 Files selected for processing (1)
  • dpdata/gaussian/fchk.py (1 hunks)
🔇 Additional comments (2)
dpdata/gaussian/fchk.py (2)

20-23: Unit conversions look correct (Bohr→Å, Hartree→eV, Hartree/Bohr→eV/Å, Hartree/Bohr²→eV/Ų).


124-139: Approve — conversions & ndarray atom_types are fine

End-to-end conversions/shapes verified; atom_types is used as np.ndarray across readers and System.DTYPES, and System.dump uses monty.serialization.dumpfn — monty's MontyEncoder handles numpy arrays, so keep atom_types as an ndarray. Convert to .tolist() only if you must produce plain JSON without monty's encoder. (docs.deepmodeling.org)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
dpdata/gaussian/fchk.py (5)

26-72: Hessian reconstruction is correct; tighten the error message to satisfy ruff TRY003.

The lower-triangular length check and symmetry fill are correct. To appease TRY003, shorten the exception text.

Apply this diff:

-    if hessian_block.size != expected_length:
-        raise ValueError(
-            f"Input data length ({hessian_block.size}) does not match expected length ({expected_length}) "
-            f"for {natoms} atoms (matrix dimension {dim}x{dim})."
-        )
+    if hessian_block.size != expected_length:
+        raise ValueError(
+            f"lower-tri length mismatch: got {hessian_block.size}, expected {expected_length} (natoms={natoms}, dim={dim})"
+        )

74-88: md is unused (ruff ARG001).

Keep the param for API compatibility but mark it used.

Apply this diff:

 def to_system_data(file_name: FileType, md=False, has_forces=True, has_hessian=True):
@@
-    data = {}
+    data = {}
+    _ = md  # reserved for future multi-frame support

Also applies to: 94-101


112-116: Rename ambiguous variable l (ruff E741).

Use clearer names in the read loops.

Apply this diff:

-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    atom_numbers += [int(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    atom_numbers += [int(x) for x in chunk.split()]

-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    coords_raw += [float(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    coords_raw += [float(x) for x in chunk.split()]

-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    forces_raw += [float(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    forces_raw += [float(x) for x in chunk.split()]

-                    l = next(fp)
-                    if isinstance(l, bytes):
-                        l = l.decode(errors="ignore")
-                    hessian_raw += [float(x) for x in l.split()]
+                    chunk = next(fp)
+                    if isinstance(chunk, bytes):
+                        chunk = chunk.decode(errors="ignore")
+                    hessian_raw += [float(x) for x in chunk.split()]

Also applies to: 120-124, 133-137, 144-147


154-160: Prefer explicit exceptions over assert for required data.

Asserts can be stripped with -O. Raise ValueError to guarantee checks in production.

Apply this diff:

-    assert coords_t, "cannot find coords"
-    assert energy_t, "cannot find energy"
-    if has_forces:
-        assert forces_t, "cannot find forces"
-    if has_hessian:
-        assert hessian_t, "cannot find hessian"
+    if not coords_t:
+        raise ValueError("cannot find coords")
+    if not energy_t:
+        raise ValueError("cannot find energy")
+    if has_forces and not forces_t:
+        raise ValueError("cannot find forces")
+    if has_hessian and not hessian_t:
+        raise ValueError("cannot find hessian")

126-139: Minor robustness: tolerate scalar formatting variance in FCHK lines.

line.split()[-1] usually works; if Gaussian changes spacing, a regex for trailing number would be safer. Optional.

Example:

-            elif "Total Energy" in line:
-                energy = float(line.split()[-1]) * energy_convert
+            elif "Total Energy" in line:
+                import re
+                m = re.search(r'([-+]?\d+(\.\d+)?([Ee][-+]?\d+)?)\s*$', line)
+                if m:
+                    energy = float(m.group(1)) * energy_convert
+                else:
+                    raise ValueError("failed to parse Total Energy")
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between acbdfbb and f5553d3.

📒 Files selected for processing (1)
  • dpdata/gaussian/fchk.py (1 hunks)

🔇 Additional comments (4)
dpdata/gaussian/fchk.py (4)

20-23: Unit conversions look correct and consistent.

The chosen unit pairs match Gaussian FCHK conventions and dpdata’s converters. No changes needed.


12-18: Module placement under dpdata/gaussian/ follows repo guidelines.

Conforms to “new format under dpdata//”. Formatting also looks ruff‑friendly after the above small fixes.


161-168: No change required — ELEMENTS is 0-based.

dpdata/periodic_table.py defines ELEMENTS = ["H", "He", …] (H at index 0), so ELEMENTS[z - 1] correctly maps atomic number z to its symbol.


140-153: Approve: Hessian shape and registration match

Hessian stored as (natoms, 3, natoms, 3) per frame; plugin registers (Axis.NFRAMES, Axis.NATOMS, 3, Axis.NATOMS, 3) and conversion is applied once — no changes required.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
dpdata/gaussian/fchk.py (2)

57-59: Consider extracting the error message to a constant.

The static analysis tool suggests avoiding long messages directly in the exception. While this is a minor style preference, you could extract the message format to improve maintainability.

+# Error message template for Hessian validation
+HESSIAN_SIZE_ERROR_MSG = "Input length {actual} != expected {expected}"
+
 def create_full_hessian(hessian_raw: list | np.ndarray, natoms: int) -> np.ndarray:
     # ... existing code ...
     if hessian_block.size != expected_length:
         raise ValueError(
-            f"Input length {hessian_block.size} != expected {expected_length}"
+            HESSIAN_SIZE_ERROR_MSG.format(
+                actual=hessian_block.size, expected=expected_length
+            )
         )

167-167: Centralize or document the hardcoded default cell size (100 Å)

100 Å is hardcoded in dpdata/gaussian/fchk.py:167, dpdata/gaussian/gjf.py:326, and dpdata/plugins/xyz.py:44 — either make this a shared configurable constant or derive it from molecular dimensions (bounding box + padding); if keeping 100 Å, add a short comment explaining the rationale.
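Deriving the cell from the molecular extent, as suggested, could look roughly like this (the helper name and padding value are illustrative choices, not dpdata API):

```python
import numpy as np

def cell_from_coords(coords, padding=15.0):
    # Box = molecular extent along each axis plus padding on both sides,
    # so a vacuum layer of `padding` angstrom surrounds the molecule.
    coords = np.asarray(coords, dtype=float)
    extent = coords.max(axis=0) - coords.min(axis=0)
    return np.diag(extent + 2.0 * padding)
```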

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f5553d3 and a888f0f.

📒 Files selected for processing (1)
  • dpdata/gaussian/fchk.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
dpdata/gaussian/fchk.py (3)
dpdata/utils.py (1)
  • open_file (137-170)
dpdata/unit.py (5)
  • EnergyConversion (106-117)
  • ForceConversion (134-152)
  • HessianConversion (155-180)
  • LengthConversion (120-131)
  • value (93-94)
dpdata/system.py (1)
  • has_forces (1269-1270)
🪛 Ruff (0.12.2)
dpdata/gaussian/fchk.py

57-59: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (10)
dpdata/gaussian/fchk.py (10)

1-24: LGTM! Clean imports and unit conversion setup.

The imports are well-organized with proper type checking guards, and the unit conversion constants are correctly initialized using the dpdata unit conversion framework.


26-71: Well-implemented Hessian reconstruction with thorough validation.

The function correctly reconstructs a symmetric Hessian matrix from lower triangular elements with proper input validation and clear documentation. The mathematical approach using M_full = M_lower + M_lower.T - diag(M_lower) is correct and efficient.
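The identity cited here is easy to confirm numerically on a random symmetric matrix (a quick check, independent of the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
M = (A + A.T) / 2.0  # arbitrary symmetric matrix

M_lower = np.tril(M)  # keep the lower triangle including the diagonal
M_full = M_lower + M_lower.T - np.diag(np.diag(M_lower))
```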


73-90: Function signature and documentation look good.

The parameters and return type are well-documented. The function signature correctly removes the unused md parameter mentioned in the AI summary.


99-112: Robust file parsing with proper encoding handling.

The byte-to-string decoding with error handling is appropriate for parsing potentially problematic file formats. The atomic number parsing logic correctly handles multi-line data blocks.


113-122: Correct coordinate parsing and unit conversion.

The coordinate parsing correctly handles multi-line blocks and applies the proper unit conversion from Bohr to Angstrom. The reshaping to (-1, 3) format is appropriate for Cartesian coordinates.


123-125: Proper energy extraction and conversion.

Energy parsing and conversion from Hartree to eV is correctly implemented.


126-136: Correct gradient-to-forces conversion.

The conversion from Cartesian gradients to forces by negation (F = -∇E) is physically correct and well-documented. The unit conversion and reshaping are appropriate.


137-149: Excellent Hessian processing implementation.

The Hessian parsing, reconstruction via create_full_hessian, unit conversion, and reshaping to (natoms, 3, natoms, 3) format aligns perfectly with the registered shape mentioned in the AI summary for the Gaussian plugin.
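The (natoms, 3, natoms, 3) layout indexes Cartesian 3x3 blocks by atom pair; a quick shape check with toy values (not fchk data):

```python
import numpy as np

natoms = 5
hessian_full = np.arange((3 * natoms) ** 2, dtype=float).reshape(3 * natoms, 3 * natoms)

# After this reshape, hess[i, :, j, :] is the 3x3 block coupling atoms i and j.
hess = hessian_full.reshape(natoms, 3, natoms, 3)
block_12 = hess[1, :, 2, :]
```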


150-156: Appropriate data validation with conditional checks.

The assertions ensure essential data is present, and the conditional validation for forces and Hessian based on the function parameters is well-implemented.


157-175: Well-structured data assembly and return.

The data dictionary construction follows dpdata conventions with proper handling of atom information, coordinates, boundary conditions, and optional fields (energies, forces, Hessian). The use of np.unique for atom type processing is efficient.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
dpdata/gaussian/fchk.py (3)

165-168: Provide one cell per frame (not a single 1x cell).

If multiple frames are ever present, match cells’ length to coords’ length.

-    data["coords"] = np.array(coords_t).reshape(-1, natoms, 3)
-    data["orig"] = np.zeros(3)
-    data["cells"] = np.array([np.eye(3) * 100])
+    data["coords"] = np.array(coords_t).reshape(-1, natoms, 3)
+    nframes = data["coords"].shape[0]
+    data["orig"] = np.zeros(3)
+    data["cells"] = np.repeat((np.eye(3) * 100)[None, :, :], nframes, axis=0)

57-59: Shorten exception message to satisfy ruff TRY003 hint.

Keep the message concise; detailed counts can be added to a custom exception if needed.

-        raise ValueError(
-            f"Input length {hessian_block.size} != expected {expected_length}"
-        )
+        raise ValueError("Invalid lower-triangular Hessian length")

151-156: Prefer explicit exceptions over asserts for library-facing validation.

Asserts can be disabled with -O. Raise ValueError for reliability.

-    assert coords_t, "cannot find coords"
-    assert energy_t, "cannot find energy"
-    if has_forces:
-        assert forces_t, "cannot find forces"
-    if has_hessian:
-        assert hessian_t, "cannot find hessian"
+    if not coords_t:
+        raise ValueError("coords not found in fchk")
+    if not energy_t:
+        raise ValueError("energy not found in fchk")
+    if has_forces and not forces_t:
+        raise ValueError("forces not found in fchk")
+    if has_hessian and not hessian_t:
+        raise ValueError("hessian not found in fchk")
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f5553d3 and a888f0f.

📒 Files selected for processing (1)
  • dpdata/gaussian/fchk.py (1 hunks)

🔇 Additional comments (2)
dpdata/gaussian/fchk.py (2)

61-69: Hessian reconstruction logic is correct and efficient.

Lower-triangular fill and symmetric mirroring are implemented properly.


158-158: Keep - 1 — ELEMENTS is 0-indexed. dpdata/periodic_table.py defines ELEMENTS starting with "H" at index 0 and Element.from_Z uses ELEMENTS[Z - 1], so atom_symbols = [ELEMENTS[z - 1] for z in atom_numbers] is correct.

Comment on lines +73 to +85
def to_system_data(file_name: FileType, has_forces=True, has_hessian=True):
    """Read Gaussian fchk file.

    Parameters
    ----------
    file_name : str
        file name
    has_forces : bool, default True
        whether to read force
        Note: Cartesian Gradient in fchk file is converted to forces by taking negative sign
    has_hessian : bool, default True
        whether to read hessian


⚠️ Potential issue

Relax defaults for has_forces/has_hessian to avoid unexpected assertion failures.

Most fchk files won’t include gradients/Hessian. Defaulting to True is too strict for general use.

-def to_system_data(file_name: FileType, has_forces=True, has_hessian=True):
+def to_system_data(file_name: FileType, has_forces=False, has_hessian=False):
@@
-    has_forces : bool, default True
+    has_forces : bool, default False
@@
-    has_hessian : bool, default True
+    has_hessian : bool, default False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-def to_system_data(file_name: FileType, has_forces=True, has_hessian=True):
-    """Read Gaussian fchk file.
-
-    Parameters
-    ----------
-    file_name : str
-        file name
-    has_forces : bool, default True
-        whether to read force
-        Note: Cartesian Gradient in fchk file is converted to forces by taking negative sign
-    has_hessian : bool, default True
-        whether to read hessian
+def to_system_data(file_name: FileType, has_forces=False, has_hessian=False):
+    """Read Gaussian fchk file.
+
+    Parameters
+    ----------
+    file_name : str
+        file name
+    has_forces : bool, default False
+        whether to read force
+        Note: Cartesian Gradient in fchk file is converted to forces by taking negative sign
+    has_hessian : bool, default False
+        whether to read hessian
🤖 Prompt for AI Agents
In dpdata/gaussian/fchk.py around lines 73 to 85, the function to_system_data
currently defaults has_forces and has_hessian to True which causes unexpected
assertion failures for fchk files lacking gradients/Hessians; change the
parameter defaults to has_forces=False and has_hessian=False, update the
docstring to reflect the new defaults, and replace any unconditional assertions
that assume presence of gradient/Hessian with conditional checks that only
validate or convert those sections when the data actually exists (e.g., check
keys/flags in the parsed fchk content before accessing or asserting).

Comment on lines +123 to +126
elif "Total Energy" in line:
energy = float(line.split()[-1]) * energy_convert
energy_t.append(energy)
elif "Cartesian Gradient" in line:
⚠️ Potential issue

Fix fragile “Total Energy” parsing to handle fchk scalar on next line (N= 1).

Many fchk files put scalar values on the next line when the header includes “N= 1”. Current code can read “1” instead of the energy. Parse robustly as below.

-            elif "Total Energy" in line:
-                energy = float(line.split()[-1]) * energy_convert
-                energy_t.append(energy)
+            elif "Total Energy" in line:
+                # fchk may have scalar on the same line or next line when "N= 1" appears
+                if "N=" in line:
+                    next_line = next(fp)
+                    if isinstance(next_line, bytes):
+                        next_line = next_line.decode(errors="ignore")
+                    energy_val = float(next_line.split()[0])
+                else:
+                    energy_val = float(line.split()[-1])
+                energy_t.append(energy_val * energy_convert)
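The dispatch in the diff above can be exercised in isolation; this sketch uses a hypothetical helper name, and only the "N=" next-line convention comes from the review:

```python
import io

def read_fchk_scalar(header_line, fp):
    # Some fchk writers put a scalar's value on the line after a header
    # containing "N=    1"; otherwise the value is the last token of the header.
    if "N=" in header_line:
        return float(next(fp).split()[0])
    return float(header_line.split()[-1])

# Value on the same line:
same_line = "Total Energy                               R    -7.634837E+01"
assert read_fchk_scalar(same_line, iter([])) == -76.34837

# Value deferred to the next line:
header = "Total Energy                               R   N=           1"
split_form = io.StringIO("  -7.634837E+01\n")
assert read_fchk_scalar(header, split_form) == -76.34837
```

Taking `line.split()[-1]` on the deferred form would parse the count `1` as the energy, which is exactly the failure mode flagged here.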


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
dpdata/plugins/gaussian.py (1)

8-13: Imports look good; past NameError is now resolved.

import dpdata.gaussian.fchk binds the top-level dpdata module, so later dpdata.LabeledSystem references won’t raise. No further action needed.

🧹 Nitpick comments (2)
dpdata/plugins/gaussian.py (2)

22-33: Make Hessian dtype registration idempotent.

Repeated calls will trigger duplicate-registration warnings. Guard the registration.

Apply:

 def register_hessian_data(data):
     if "hessian" in data:
         dt = DataType(
             "hessian",
             np.ndarray,
             (Axis.NFRAMES, Axis.NATOMS, 3, Axis.NATOMS, 3),
             required=False,
             deepmd_name="hessian",
         )
-        dpdata.LabeledSystem.register_data_type(dt)
+        # avoid duplicate registrations
+        if not any(x.name == "hessian" for x in dpdata.LabeledSystem.DTYPES):
+            dpdata.LabeledSystem.register_data_type(dt)

Optional: explicitly import dpdata at the top to decouple from submodule imports.
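The idempotency guard can be demonstrated with stand-in classes (FakeLabeledSystem and FakeDataType are illustrative substitutes for dpdata's real LabeledSystem and DataType):

```python
class FakeDataType:
    # Stand-in for dpdata.data_type.DataType; only the name matters here.
    def __init__(self, name):
        self.name = name

class FakeLabeledSystem:
    # Minimal stand-in for dpdata.LabeledSystem: registered dtypes live in DTYPES.
    DTYPES = []

    @classmethod
    def register_data_type(cls, dt):
        cls.DTYPES = [*cls.DTYPES, dt]

def register_hessian_data():
    # The guard from the nitpick: skip registration if "hessian" already exists.
    if not any(x.name == "hessian" for x in FakeLabeledSystem.DTYPES):
        FakeLabeledSystem.register_data_type(FakeDataType("hessian"))

register_hessian_data()
register_hessian_data()  # second call is a no-op: no duplicate registration
print(len(FakeLabeledSystem.DTYPES))  # 1
```

Without the `any(...)` check, each `from_labeled_system` call would append another "hessian" entry and trigger the duplicate-registration warning.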


43-56: Handle fallback + lints; register dtype in both paths.

  • ruff ARG002: kwargs is unused.
  • TRY300: return in try can move to else.
  • Ensure register_hessian_data runs even on fallback; respect has_forces/has_hessian.

Apply:

 @Format.register("gaussian/fchk")
 class GaussianFChkFormat(Format):
     def from_labeled_system(
         self, file_name: FileType, has_forces=True, has_hessian=True, **kwargs
     ):
-        try:
-            data = dpdata.gaussian.fchk.to_system_data(
-                file_name, has_forces=has_forces, has_hessian=has_hessian
-            )
-            register_hessian_data(data)
-            return data
-        except AssertionError:
-            return {"energies": [], "forces": [], "hessian": [], "nopbc": True}
+        del kwargs  # noqa: ARG002
+        try:
+            data = dpdata.gaussian.fchk.to_system_data(
+                file_name, has_forces=has_forces, has_hessian=has_hessian
+            )
+        except AssertionError:
+            data = {"energies": [], "nopbc": True}
+            if has_forces:
+                data["forces"] = []
+            if has_hessian:
+                data["hessian"] = []
+        register_hessian_data(data)
+        return data
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a888f0f and f04ecca.

📒 Files selected for processing (1)
  • dpdata/plugins/gaussian.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
dpdata/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use ruff to lint and format all Python code under dpdata/ before committing

Files:

  • dpdata/plugins/gaussian.py
dpdata/*/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

When adding a new format, implement it as a new submodule under dpdata/*/

Files:

  • dpdata/plugins/gaussian.py
🧬 Code graph analysis (1)
dpdata/plugins/gaussian.py (3)
dpdata/data_type.py (3)
  • Axis (15-21)
  • DataType (35-131)
  • register_data_type (138-149)
dpdata/system.py (3)
  • LabeledSystem (1159-1357)
  • register_data_type (1094-1111)
  • has_forces (1269-1270)
dpdata/gaussian/fchk.py (1)
  • to_system_data (73-175)
🪛 Ruff (0.12.2)
dpdata/plugins/gaussian.py

46-46: Unused method argument: kwargs

(ARG002)


53-53: Consider moving this statement to an else block

(TRY300)

@Romarin87 Romarin87 marked this pull request as ready for review September 16, 2025 16:19
@wanghan-iapcm wanghan-iapcm requested a review from njzjz September 17, 2025 01:00
@wanghan-iapcm wanghan-iapcm merged commit a89a7a4 into deepmodeling:devel Sep 17, 2025
11 checks passed