Add docstring linting (D-class rules) to ruff configuration #247

Nidhicodes wants to merge 1 commit into openml:main from
Conversation
Walkthrough

This pull request updates the project's linting configuration and documentation standards. The Ruff linter configuration removes global ignore rules for docstring (D) and timezone-aware datetime (DTZ) checks, replacing them with targeted per-file ignores across source and test modules. The PR systematically adds missing module-level docstrings to multiple files in the core, database, routers, and schemas packages. Additionally, docstring formatting and punctuation are improved for consistency throughout the codebase. Datetime handling is updated to use UTC-aware timestamps (datetime.UTC) in database and schema modules, and a new DatasetSchema enum is introduced to expose schema type options (DCAT_AP, OPENML, MLDCAT_AP).
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Hey - I've found 3 issues, and left some high level feedback:

- The per-file `D100`–`D103` ignores in `pyproject.toml` are very repetitive; consider collapsing them using directory or glob patterns (e.g., `src/database/*.py`, `src/routers/openml/*.py`) to make the configuration shorter and easier to maintain.
- Timezone usage is a bit mixed (`datetime.datetime.now(datetime.UTC)` vs `datetime.now(UTC)`); consider standardizing on one style and import pattern across the codebase for consistency and readability.
- Some updated docstrings, such as the one on `Agent` in `mldcat_ap.py`, now have a summary line that reads as a fragment followed by a new paragraph; consider rephrasing so the first line is a single complete summary sentence per PEP 257.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The per-file `D100`–`D103` ignores in `pyproject.toml` are very repetitive; consider collapsing them using directory or glob patterns (e.g., `src/database/*.py`, `src/routers/openml/*.py`) to make the configuration shorter and easier to maintain.
- Timezone usage is a bit mixed (`datetime.datetime.now(datetime.UTC)` vs `datetime.now(UTC)`); consider standardizing on one style and import pattern across the codebase for consistency and readability.
- Some updated docstrings, such as the one on `Agent` in `mldcat_ap.py`, now have a summary line that reads as a fragment followed by a new paragraph; consider rephrasing so the first line is a single complete summary sentence per PEP 257.
## Individual Comments
### Comment 1
<location> `src/schemas/datasets/mldcat_ap.py:69-74` </location>
<code_context>
class Agent(JsonLDObject):
- """Any entity carrying out actions with respect to the (Core) entities Catalogue,
- Datasets, Data Services and Distributions. If the Agent is an organisation,
- the use of the Organization Ontology is recommended.
+ """Any entity carrying out actions with respect to the (Core) entities.
+
+ Catalogue, Datasets, Data Services and Distributions. If the Agent is an
+ organisation, the use of the Organization Ontology is recommended.
"""
</code_context>
<issue_to_address>
**nitpick (typo):** Agent docstring reads as two separate sentences where the second is a fragment.
Consider reflowing to something like: `Any entity carrying out actions with respect to the (Core) entities Catalogue, Datasets, Data Services and Distributions.` to keep the description grammatically consistent while still incorporating the additional detail.
```suggestion
class Agent(JsonLDObject):
+ """Any entity carrying out actions with respect to the (Core) entities Catalogue,
+ Datasets, Data Services and Distributions. If the Agent is an organisation, the
+ use of the Organization Ontology is recommended.
+ """
```
</issue_to_address>
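To illustrate the PEP 257 point concretely: ruff's docstring checks (D205/D212 style) expect a one-line summary, a blank line, then detail. A docstring shaped like the reviewer's suggestion satisfies that shape; the class below is a hypothetical minimal stand-in, not the real `Agent` model:

```python
import inspect


class Agent:
    """Any entity carrying out actions with respect to the (Core) entities.

    Applies to Catalogues, Datasets, Data Services, and Distributions. If the
    Agent is an organisation, the use of the Organization Ontology is recommended.
    """


doc = inspect.getdoc(Agent)  # dedents and strips the docstring
summary, blank, *body = doc.splitlines()
# PEP 257: first line is a complete sentence ending in a period,
# separated from the rest of the docstring by one blank line.
assert summary.endswith(".") and blank == ""
```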
### Comment 2
<location> `src/core/conversions.py:15-19` </location>
<code_context>
def nested_str_to_num(obj: Any) -> Any:
- """Recursively tries to convert all strings in the object to numbers.
- For dictionaries, only the values will be converted."""
+ """Recursively try to convert all strings in the object to numbers.
+
+ For dictionaries, only the values will be converted.
+ """
if isinstance(obj, str):
</code_context>
<issue_to_address>
**suggestion:** Docstring for nested_str_to_num under-describes behavior for non-dict containers.
Please also document how non-dict containers are handled (e.g., that lists/tuples/sets are fully traversed and all string elements are converted where possible) so the helper’s behavior is clear at call sites.
```suggestion
def nested_str_to_num(obj: Any) -> Any:
"""Recursively try to convert all strings in the object to numbers.
For dictionaries, only the values will be converted.
For non-dict containers (e.g., lists, tuples, sets), all elements are
recursively traversed and any string elements are converted to numbers
where possible, preserving the original container type and structure.
"""
```
</issue_to_address>
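A sketch of what the documented behavior implies. This is an illustrative reimplementation, not the actual code in `src/core/conversions.py`; in particular, the handling of tuples and sets is an assumption taken from the suggested docstring wording:

```python
from typing import Any


def nested_str_to_num(obj: Any) -> Any:
    """Recursively try to convert all strings in the object to numbers.

    For dictionaries, only the values will be converted.
    """
    if isinstance(obj, str):
        for cast in (int, float):
            try:
                return cast(obj)
            except ValueError:
                pass
        return obj  # not numeric; leave the string untouched
    if isinstance(obj, dict):
        # Keys are deliberately left alone; only values are converted.
        return {key: nested_str_to_num(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        # Preserve the container type while traversing its elements.
        return type(obj)(nested_str_to_num(item) for item in obj)
    return obj


assert nested_str_to_num({"a": "1.5", "b": ["2", "x"]}) == {"a": 1.5, "b": [2, "x"]}
```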
### Comment 3
<location> `tests/routers/openml/study_test.py:497-498` </location>
<code_context>
- "%Y-%m-%dT%H:%M:%S",
- )
- assert creation_date.date() == datetime.now().date()
+ creation_date = datetime.fromisoformat(new_study.pop("creation_date"))
+ assert creation_date.date() == datetime.now(UTC).date()
assert new_study == expected
</code_context>
<issue_to_address>
**suggestion (testing):** Add an explicit assertion that `creation_date` is timezone-aware and in UTC to prove the DTZ-related change.
Since this change is about enforcing timezone-aware datetimes, this regression test should also assert that the returned value is explicitly timezone-aware and in UTC, not just that the date matches. For example:
```python
creation_date = datetime.fromisoformat(new_study.pop("creation_date"))
assert creation_date.tzinfo is UTC
assert creation_date.date() == datetime.now(UTC).date()
```
This ensures the API contract (UTC-aware datetime) is verified, rather than only the calendar date.
Suggested implementation:
```python
from datetime import datetime, UTC
import httpx
new_study = study.json()
```
```python
new_study = study.json()
creation_date = datetime.fromisoformat(new_study.pop("creation_date"))
assert creation_date.tzinfo is UTC
assert creation_date.date() == datetime.now(UTC).date()
```
</issue_to_address>
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
54-56: ⚠️ Potential issue | 🟡 Minor: Remove stale comment.
The comment says D and DTZ rules are "heavily violated — fix later" but this PR is the fix. Remove or update it to avoid misleading future maintainers.
🔧 Proposed fix
```diff
 [tool.ruff.lint]
-# The D (doc) and DTZ (datetime zone) lint classes current heavily violated - fix later
 select = ["ALL"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 54 - 56, The inline comment above the [tool.ruff.lint] configuration in pyproject.toml referencing "The D (doc) and DTZ (datetime zone) lint classes current heavily violated - fix later" is now stale; remove or replace that comment so it no longer misleads maintainers. Edit the pyproject.toml, locate the [tool.ruff.lint] block and delete the stale comment line (or update it to reflect that D/DTZ issues were addressed), leaving the select = ["ALL"] setting intact.
🧹 Nitpick comments (4)
src/schemas/datasets/__init__.py (1)
6-9: Consider adding a docstring to the new `DatasetSchema` enum.

`DatasetSchema` is new public API but is silently grandfathered from D101 by the file-level per-file-ignore. Since the PR's stated intent is to enforce docstrings on new code, a short docstring here would be consistent with that goal.

♻️ Suggested addition

```diff
 class DatasetSchema(StrEnum):
+    """Schema format options for dataset representations."""
+
     DCAT_AP = "dcat-ap"
     OPENML = "openml"
     MLDCAT_AP = "mldcat-ap"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/datasets/__init__.py` around lines 6-9: add a short docstring to the public enum class DatasetSchema explaining what the enum represents and how its members (DCAT_AP, OPENML, MLDCAT_AP) are used (e.g., canonical identifiers for supported dataset metadata/schema formats); update the class DatasetSchema so the docstring sits directly under the class declaration to satisfy D101 and document the public API.

src/database/datasets.py (1)

165-165: LGTM — UTC-aware timestamp.

The naive `datetime.datetime.now()` is correctly replaced with `datetime.datetime.now(datetime.UTC)`.

For consistency, `src/database/studies.py` uses `from datetime import UTC, datetime` / `datetime.now(UTC)` while this file uses `datetime.datetime.now(datetime.UTC)`. Consider aligning the import style across the codebase.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/database/datasets.py` at line 165: the timestamp usage is UTC-aware but imports are inconsistent; update src/database/datasets.py to match the style used in src/database/studies.py by importing UTC and datetime directly (e.g., `from datetime import UTC, datetime`) and replace uses of `datetime.datetime.now(datetime.UTC)` with `datetime.now(UTC)` so the module consistently uses the same datetime symbols and call pattern (locate the current usage in the dataset creation code where "date" is set).

src/schemas/datasets/mldcat_ap.py (2)
70-74: `Agent` docstring reformatting produces a dangling fragment.

The break after "entities." leaves "Catalogue, Datasets, Data Services and Distributions." as an incomplete sentence starting the second paragraph; the semantic link (originally a colon) is lost.

📝 Suggested fix

```diff
-    """Any entity carrying out actions with respect to the (Core) entities.
-
-    Catalogue, Datasets, Data Services and Distributions. If the Agent is an
-    organisation, the use of the Organization Ontology is recommended.
-    """
+    """Any entity carrying out actions with respect to the (Core) entities.
+
+    Applies to Catalogues, Datasets, Data Services, and Distributions.
+    If the Agent is an organisation, the use of the Organization Ontology is recommended.
+    """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/datasets/mldcat_ap.py` around lines 70 - 74, The Agent class/module docstring currently breaks a sentence causing a dangling fragment ("Catalogue, Datasets, Data Services and Distributions.") — update the Agent docstring in src/schemas/datasets/mldcat_ap.py (the triple-quoted string associated with Agent) to restore the original semantic link: either merge the two lines into one sentence with the colon (e.g., "...entities: Catalogue, Datasets, Data Services and Distributions.") or rephrase the second paragraph to be a complete sentence that follows the first, ensuring punctuation and sentence boundaries are correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 68: Update the glob used to suppress rules for test files: replace the
non-recursive pattern "tests/*" with the recursive pattern "tests/**" in the
pyproject TOML setting that maps test globs to ["S101","COM812","D"] so nested
test files (e.g., tests/routers/... and tests/database/...) are properly matched
and the D rule is suppressed for all test subdirectories.
In `@src/schemas/datasets/mldcat_ap.py`:
- Line 22: The docstring on JsonLDQualifiedLiteral is incorrect — it was copied
from JsonLDObject; update the docstring for the JsonLDQualifiedLiteral class to
describe that it is a concrete JSON-LD literal model with fields type_ and value
(e.g., "Concrete JSON-LD literal with fields `type_` and `value`"), and ensure
JsonLDObject retains the "Base class for all JSON-LD objects." description so
the two classes have accurate, distinct docstrings.
In `@tests/routers/openml/study_test.py`:
- Around line 497-498: The test reveals a timezone-awareness gap: creation_date
is stored as a naive DATETIME and returned naive via SQLAlchemy, then parsed
with datetime.fromisoformat in the test; fix by preserving UTC context either at
storage or on read—update the SQLAlchemy column definition to use
DateTime(timezone=True) (so retrieval yields aware datetimes) or, if changing
schema is infeasible, ensure the code path that produces/serializes
creation_date (the model/serializer that fills new_study and the DB write path)
normalizes naive datetimes to UTC on read/write (e.g., attach UTC tzinfo before
serialization) so datetime.fromisoformat/new_study contain timezone-aware values
consistently.
---
Outside diff comments:
In `@pyproject.toml`:
- Around line 54-56: The inline comment above the [tool.ruff.lint] configuration
in pyproject.toml referencing "The D (doc) and DTZ (datetime zone) lint classes
current heavily violated - fix later" is now stale; remove or replace that
comment so it no longer misleads maintainers. Edit the pyproject.toml, locate
the [tool.ruff.lint] block and delete the stale comment line (or update it to
reflect that D/DTZ issues were addressed), leaving the select = ["ALL"] setting
intact.
---
Nitpick comments:
In `@src/database/datasets.py`:
- Line 165: The timestamp usage is UTC-aware but imports are inconsistent;
update src/database/datasets.py to match the style used in
src/database/studies.py by importing UTC and datetime directly (e.g., "from
datetime import UTC, datetime") and replace uses of
"datetime.datetime.now(datetime.UTC)" with "datetime.now(UTC)" so the module
consistently uses the same datetime symbols and call pattern (locate the current
usage in the dataset creation code where "date" is set).
In `@src/schemas/datasets/__init__.py`:
- Around line 6-9: Add a short docstring to the public enum class DatasetSchema
explaining what the enum represents and how its members (DCAT_AP, OPENML,
MLDCAT_AP) are used (e.g., canonical identifiers for supported dataset
metadata/schema formats); update the class DatasetSchema so the docstring sits
directly under the class declaration to satisfy D101 and document the public
API.
In `@src/schemas/datasets/mldcat_ap.py`:
- Around line 70-74: The Agent class/module docstring currently breaks a
sentence causing a dangling fragment ("Catalogue, Datasets, Data Services and
Distributions.") — update the Agent docstring in
src/schemas/datasets/mldcat_ap.py (the triple-quoted string associated with
Agent) to restore the original semantic link: either merge the two lines into
one sentence with the colon (e.g., "...entities: Catalogue, Datasets, Data
Services and Distributions.") or rephrase the second paragraph to be a complete
sentence that follows the first, ensuring punctuation and sentence boundaries
are correct.
```toml
[tool.ruff.lint.per-file-ignores]
"tests/*" = ["S101", "COM812", "D"]
"src/core/conversions.py" = ["ANN401"]
```
🧩 Analysis chain

🏁 Script executed: `fd --type f --extension py . tests/` (repository: openml/server-api), listing the Python test files under `tests/`.
Change glob pattern to recursively match nested test files.
The pattern "tests/*" only matches files directly in the tests/ directory. However, the repository contains nested test files at multiple levels (e.g., tests/routers/openml/study_test.py, tests/database/flows_test.py, tests/routers/openml/migration/datasets_migration_test.py). The D (docstring) rules will incorrectly fire on these nested files instead of being suppressed. Use "tests/**" for recursive coverage.
🔧 Proposed fix
```diff
-"tests/*" = ["S101", "COM812", "D"]
+"tests/**" = ["S101", "COM812", "D"]
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 68, Update the glob used to suppress rules for test
files: replace the non-recursive pattern "tests/*" with the recursive pattern
"tests/**" in the pyproject TOML setting that maps test globs to
["S101","COM812","D"] so nested test files (e.g., tests/routers/... and
tests/database/...) are properly matched and the D rule is suppressed for all
test subdirectories.
```diff
 class JsonLDQualifiedLiteral(BaseModel):
-    """Base class for all JSON-LD objects"""
+    """Base class for all JSON-LD objects."""
```
JsonLDQualifiedLiteral has the wrong docstring — copied from JsonLDObject.
JsonLDQualifiedLiteral is a concrete value model (two fields: type_ and value) and is not a base class. The docstring "Base class for all JSON-LD objects." belongs to JsonLDObject (the actual ABC at line 34), and appears to have been copy-pasted here.
📝 Proposed fix

```diff
-    """Base class for all JSON-LD objects."""
+    """A JSON-LD qualified literal pairing an explicit type with a string value."""
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/schemas/datasets/mldcat_ap.py` at line 22, The docstring on
JsonLDQualifiedLiteral is incorrect — it was copied from JsonLDObject; update
the docstring for the JsonLDQualifiedLiteral class to describe that it is a
concrete JSON-LD literal model with fields type_ and value (e.g., "Concrete
JSON-LD literal with fields `type_` and `value`"), and ensure JsonLDObject
retains the "Base class for all JSON-LD objects." description so the two classes
have accurate, distinct docstrings.
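For context, the shape being described is roughly the following. This is a dataclass sketch only; the real model is a pydantic `BaseModel`, and the field names `type_` and `value` are taken from the review comment, not verified against the source:

```python
from dataclasses import dataclass


@dataclass
class JsonLDQualifiedLiteral:
    """A JSON-LD qualified literal pairing an explicit type with a string value."""

    type_: str  # trailing underscore avoids shadowing the builtin-ish name
    value: str


# Hypothetical usage: a typed date literal as it might appear in JSON-LD.
lit = JsonLDQualifiedLiteral(type_="xsd:date", value="2024-01-01")
assert lit.value == "2024-01-01"
```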
```python
creation_date = datetime.fromisoformat(new_study.pop("creation_date"))
assert creation_date.date() == datetime.now(UTC).date()
```
🧩 Analysis chain

Before posting, the reviewer ran a series of verification scripts against the openml/server-api repository:

- Inspected `tests/routers/openml/study_test.py` (lines 1-50 and 450-510) around the `creation_date` assertion and the `.json()` calls.
- Located the `Study` models (`rg -t py "class.*Study"`) and read `src/schemas/study.py`, including the `creation_date` field definition.
- Searched `src/schemas/` and `tests/` for custom serialization (`field_serializer`, `ConfigDict`, `model_config`, `json_encoders`, `ser_json_temporal`) and checked `src/schemas/core.py` for a shared base model.
- Traced how `Study` is instantiated (`return Study(` in `src/routers/openml/study.py`) and how datetimes are handled in `src/database/studies.py`.
- Examined `tests/conftest.py`, the files under `tests/resources`, and searched for stored datetime values (e.g., `rg "2019-02-25"`) to see what the test database contains.

🌐 Web query: Pydantic v2 default datetime serialization with timezone UTC

💡 Result:

In Pydantic v2, datetime values are serialized to JSON as strings, and the default temporal JSON format is ISO 8601 (`ser_json_temporal='iso8601'`). It does not automatically convert datetimes to UTC; whatever tzinfo is on the value is what gets serialized (as an offset like `+02:30`, `+00:00`, etc.). [1][2]

What you get by default (UTC-aware): if your value is already UTC-aware (`tzinfo=timezone.utc`), the default JSON output will be ISO 8601 with an offset (typically `+00:00`). [2]

Enforce UTC (and optionally `Z`) on output: if you want to always output UTC, convert to UTC in a serializer; if you specifically want a trailing `Z`, format it yourself:

```python
from datetime import datetime, timezone

from pydantic import BaseModel, field_serializer


class M(BaseModel):
    dt: datetime

    @field_serializer("dt")
    def ser_dt(self, v: datetime, _info) -> str:
        return v.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
```

Pydantic v2 supports this via `@field_serializer` (and other serializer hooks). [3] An example pattern for `Z` formatting in v2 is also commonly done with `PlainSerializer`/`Annotated`. [4]

Citations:
- [1] Pydantic config: `ser_json_temporal` default and options. https://docs.pydantic.dev/dev/api/config/
- [2] Pydantic standard library types: datetime JSON serialization and example with timezone offset. https://docs.pydantic.dev/latest/api/standard_library_types/
- [3] Pydantic serialization docs: custom serializers (`@field_serializer`, etc.). https://docs.pydantic.dev/2.0/usage/serialization/
- [4] Example approach using v2 `Annotated` + `PlainSerializer` to emit `Z`. https://stackoverflow.com/questions/66548586/how-to-change-date-format-in-pydantic
🏁 Script executed:
# Look for any database migration or initialization scripts
rg -t sql "creation_date" tests/ src/database/Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Let me check the actual database schema for the study table
rg -t sql "CREATE TABLE.*study" src/database/Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Check how the datetime is retrieved from the database
cat -n src/database/studies.py | sed -n '1,100p'Repository: openml/server-api
Length of output: 3834
🏁 Script executed:
# Let's trace the actual data: check what the test fixture shows
cat -n tests/routers/openml/study_test.py | sed -n '10,80p'Repository: openml/server-api
Length of output: 1914
🏁 Script executed:
# Check if there's a custom datetime deserializer or handler in the routers
cat -n src/routers/openml/study.py | head -50Repository: openml/server-api
Length of output: 2186
🏁 Script executed:
# Check the actual type of creation_date coming from the database
# Look at the database schema definition
find src -name "*.py" -type f | xargs grep -l "CREATE TABLE\|alembic\|migration" | head -5Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Check if SQLAlchemy has any type conversion for datetime
rg -t py "DateTime|DATETIME" src/database/ -B 2 -A 2Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Let's look at what the test database actually contains
cat -n tests/resources/sql/ 2>/dev/null || find tests -name "*.sql" -type f | head -10Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Check the database initialization/fixture setup
rg -t py "INSERT INTO study" tests/ src/database/ -B 3 -A 3Repository: openml/server-api
Length of output: 447
🏁 Script executed:
# Critical check: let's see the actual test expectation and compare
# If the test expects "2019-02-25T17:15:01" without offset, but Pydantic defaults to adding it,
# then the first test (line 457) would fail. Let's check if that test actually passes
cat -n tests/routers/openml/study_test.py | sed -n '12,60p'Repository: openml/server-api
Length of output: 1438
🏁 Script executed:
# Check if there's any datetime handling or type conversion happening
# Look at how SQLAlchemy is configured or if there's custom type handling
rg -t py "python_type|bind_processor|result_processor" src/Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Let me check the actual database column definition for creation_date
# Look for schema or migration files
find src -name "*.toml" -o -name "*.yaml" -o -name "*.yml" | head -10Repository: openml/server-api
Length of output: 76
🏁 Script executed:
# Check if there's a type stub or database schema definition
rg -t py "datetime|tzinfo" src/database/studies.py -B 1 -A 1 | head -30Repository: openml/server-api
Length of output: 298
🏁 Script executed:
# Critical: Check what the database actually stores and retrieves
# Look at how SQLAlchemy would interpret the datetime from the database
cat -n src/database/setup.py 2>/dev/null || find src -name "setup.py" -o -name "database.py" | head -5Repository: openml/server-api
Length of output: 1145
🏁 Script executed:
# Check the database connection setup
rg -t py "create_engine\|Engine" src/database/ -B 2 -A 5Repository: openml/server-api
Length of output: 43
🏁 Script executed:
# Let me verify: when creation_date is stored as datetime.now(UTC) in the database,
# what type does SQLAlchemy retrieve it as? Check the conftest to see actual test
cat -n tests/routers/openml/study_test.py | sed -n '495,500p'Repository: openml/server-api
Length of output: 301
🏁 Script executed:
# The key question: does the test database fixture have timezone info?
# Let's check what's in the database by looking for where fixtures are populated
rg -t py "datetime\.now\(UTC\)|timezone" tests/routers/openml/study_test.pyRepository: openml/server-api
Length of output: 120
🏁 Script executed:
# Key question: Is the creation_date coming from DB naive or aware?
# When SQLAlchemy retrieves datetime from MySQL DATETIME column, it's naive
# Let's check if there's any tzinfo conversion happening
rg -t py "tzinfo\|replace\(tzinfo\|astimezone" src/routers/openml/study.py src/database/studies.pyRepository: openml/server-api
Length of output: 43
🏁 Script executed:
# Let's check: when the test calls study.json(), what does Pydantic actually produce?
# We need to understand if the database value is naive or aware when passed to Study
cat -n src/routers/openml/study.py | sed -n '130,155p'Repository: openml/server-api
Length of output: 1164
🏁 Script executed:
# Check: Is there any custom serializer or configuration on the Study model?
rg -t py "model_config\|ConfigDict\|json_encoders" src/schemas/study.pyRepository: openml/server-api
Length of output: 43
🏁 Script executed:
# The critical piece: let's see if datetime.now(UTC) actually produces an aware datetime
# and check if Pydantic would serialize it differently
python3 << 'EOF'
from datetime import UTC, datetime
from pydantic import BaseModel
class TestModel(BaseModel):
dt: datetime
# Test with aware datetime
aware_dt = datetime.now(UTC)
m_aware = TestModel(dt=aware_dt)
print("Aware datetime JSON:", m_aware.model_dump_json())
# Test with naive datetime
naive_dt = datetime.now()
m_naive = TestModel(dt=naive_dt)
print("Naive datetime JSON:", m_naive.model_dump_json())
EOFRepository: openml/server-api
Length of output: 175
The test is safe from flakiness but has a subtle semantic issue with timezone awareness.
Pydantic v2 does serialize timezone-aware datetimes with a timezone indicator (e.g., `Z` for UTC). However, the actual issue here is upstream: `creation_date` is stored in the database as a naive `DATETIME` column. When SQLAlchemy retrieves it, it returns a naive Python `datetime` object, so Pydantic serializes it without any timezone offset (e.g., `"2019-02-25T17:15:01"`).
The test comparison `creation_date.date() == datetime.now(UTC).date()` is safe because `.date()` strips timezone information from both sides, resulting in a comparison of two naive `date` objects. This won't be flaky even at midnight UTC.
However, the underlying code has a semantic gap: `datetime.now(UTC)` (an aware datetime) is written to a naive database column, losing its UTC context. While the test works correctly by comparing dates, consider whether `creation_date` should preserve timezone information throughout the stack, either by using a timezone-aware database column or by explicitly managing UTC context.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/routers/openml/study_test.py` around lines 497 - 498, The test reveals
a timezone-awareness gap: creation_date is stored as a naive DATETIME and
returned naive via SQLAlchemy, then parsed with datetime.fromisoformat in the
test; fix by preserving UTC context either at storage or on read—update the
SQLAlchemy column definition to use DateTime(timezone=True) (so retrieval yields
aware datetimes) or, if changing schema is infeasible, ensure the code path that
produces/serializes creation_date (the model/serializer that fills new_study and
the DB write path) normalizes naive datetimes to UTC on read/write (e.g., attach
UTC tzinfo before serialization) so datetime.fromisoformat/new_study contain
timezone-aware values consistently.
This PR addresses the issue of missing docstring linting in the OpenML server API (#212).
Changes:
- Removed the global ignore for the `DTZ` ruleset in the `ruff` configuration.
- Added module docstrings to `__init__.py` files to satisfy D104.
- Removed the global ignore for the `D` ruleset in the `ruff` configuration.
- Added `per-file-ignores` entries for D100, D101, D102, and D103 for those specific files. This ensures all new files and code must include docstrings, while legacy code can be updated gradually.
Verification:
- Ran `ruff check .` and the checks pass.
- The `UTC` constant was used for datetime awareness.
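The resulting configuration shape looks roughly like this; paths and exact rule lists are illustrative, not copied from the PR:

```toml
[tool.ruff.lint]
# "D" and "DTZ" were removed from the global ignore list.
ignore = []

[tool.ruff.lint.per-file-ignores]
# Hypothetical paths; the real configuration enumerates each legacy module.
"src/database/users.py" = ["D100", "D101", "D102", "D103"]
"tests/*" = ["D100", "D102", "D103"]
```

With this layout, new modules fail `ruff check .` if they lack docstrings, while the per-file entries can be deleted one by one as legacy modules are documented.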