Skip to content

Add POST /datasets/untag endpoint#251

Closed
ishansharma08 wants to merge 3 commits intoopenml:mainfrom
ishansharma08:feature/data-untag-endpoint
Closed

Add POST /datasets/untag endpoint#251
ishansharma08 wants to merge 3 commits intoopenml:mainfrom
ishansharma08:feature/data-untag-endpoint

Conversation

@ishansharma08
Copy link

Implements the POST /datasets/untag endpoint as part of the legacy API migration.

This endpoint removes a tag from a dataset and returns the dataset ID on success. It returns 422 Unprocessable Entity when the tag does not exist for the dataset (replacing the legacy 412 behavior), and 401 when unauthenticated.

The implementation follows the structure of the existing /datasets/tag endpoint for consistency.

Tested locally using Docker Compose with full environment:

Ran docker compose --profile python up --build

Verified database and API container startup

Tested endpoint via Swagger UI at http://localhost:8001/docs

Confirmed correct behavior for both success and failure cases

Please let me know if any adjustments are needed.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 21, 2026

Walkthrough

Adds a new untag_dataset endpoint at path "/untag" that requires authentication, fetches existing tags for a dataset, validates the specified tag exists (case-insensitive), calls database.datasets.untag to remove the tag, and returns the dataset id. Introduces two module-level helpers, tag(data_id, tag, user_id, connection) and untag(data_id, tag, user_id, connection). The existing tag_dataset endpoint remains but now contains an additional return block. Net changes: +60/-0 lines.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add POST /datasets/untag endpoint' accurately and concisely describes the main change - adding a new untag endpoint to the datasets router.
Description check ✅ Passed The description clearly explains the implementation of the /datasets/untag endpoint, its behavior, consistency with existing code, and testing methodology.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • The untag_dataset endpoint is defined twice with identical implementations; remove the duplicate definition to avoid FastAPI route conflicts and maintenance issues.
  • The new tag/untag database functions and from sqlalchemy import text import appear to be added at the bottom of the router module rather than in the database layer (e.g., database.datasets); consider moving them to the appropriate data-access module to keep concerns separated.
  • In the new untag DB function, the user_id parameter is accepted but unused in the SQL; either use it (e.g., for auditing/ownership) or remove it to avoid confusion.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `untag_dataset` endpoint is defined twice with identical implementations; remove the duplicate definition to avoid FastAPI route conflicts and maintenance issues.
- The new `tag`/`untag` database functions and `from sqlalchemy import text` import appear to be added at the bottom of the router module rather than in the database layer (e.g., `database.datasets`); consider moving them to the appropriate data-access module to keep concerns separated.
- In the new `untag` DB function, the `user_id` parameter is accepted but unused in the SQL; either use it (e.g., for auditing/ownership) or remove it to avoid confusion.

## Individual Comments

### Comment 1
<location> `src/routers/openml/datasets.py:509-518` </location>
<code_context>
+from sqlalchemy import text
+
+
+def tag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
+    connection.execute(
+        text(
+            """
+            INSERT INTO dataset_tag (id, tag)
+            VALUES (:data_id, :tag)
+            """
+        ),
+        {"data_id": data_id, "tag": tag},
+    )
+
+
+def untag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
+    connection.execute(
+        text(
</code_context>

<issue_to_address>
**issue:** `user_id` parameter is unused in the new `tag`/`untag` helpers

`user_id` is never used in these helpers. Either incorporate it for auditing/authorization at this layer, or remove it from the function signatures to avoid a misleading, unnecessary parameter.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines 509 to 518
def tag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
connection.execute(
text(
"""
INSERT INTO dataset_tag (id, tag)
VALUES (:data_id, :tag)
"""
),
{"data_id": data_id, "tag": tag},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: user_id parameter is unused in the new tag/untag helpers

user_id is never used in these helpers. Either incorporate it for auditing/authorization at this layer, or remove it from the function signatures to avoid a misleading, unnecessary parameter.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 83-113: The file contains a duplicate FastAPI route definition for
POST /datasets/untag: two identical functions named untag_dataset decorated with
router.post(path="/untag"), which registers conflicting callables; remove the
duplicate definition (the second untag_dataset function and its decorator) so
only a single router.post(path="/untag") / untag_dataset implementation remains,
or if you intended different behavior, rename the function and route to be
unique; ensure database.datasets.untag and the dependencies (fetch_user,
expdb_connection) remain referenced by the single kept function.

@ishansharma08 ishansharma08 force-pushed the feature/data-untag-endpoint branch from b68f6a6 to cd2a796 Compare February 21, 2026 18:46
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 49-51: Remove the unreachable dead-code return that returns
{"data_untag": {"id": str(data_id)}}; locate the function in
src/routers/openml/datasets.py that already returns earlier (the untag
response/handler) and delete the redundant return statement/block containing
"data_untag" so only the intended earlier return remains; ensure no other code
depends on that duplicated response shape.
- Around line 63-73: The code checks tag membership before verifying
authentication, leaking tag existence to unauthenticated callers; move the
authentication guard (the user is None check that raises
HTTPStatus.UNAUTHORIZED) to run before any DB/tag checks (i.e., before calling
get_tags_for and before using the tags list), so the route handler returns 401
for unauthenticated requests and only proceeds to the tag existence logic (the
tag.casefold() check against tags) after user is validated.
- Around line 483-504: Remove the dead module-level functions tag and untag from
this file: they are unused, have an unused user_id parameter, and the tag
implementation omits the required uploader column; delete the entire function
definitions named tag(data_id: int, tag: str, user_id: int, connection:
Connection) and untag(data_id: int, tag: str, user_id: int, connection:
Connection) so callers use the correct database.datasets.tag and
database.datasets.untag implementations.
- Around line 75-80: The call to database.datasets.untag(...) is failing because
the untag function is missing; add an untag(data_id, tag, user_id=None,
connection=None) function to src/database/datasets.py that mirrors the existing
tag(...) signature and behavior but removes the association instead of creating
it: locate the database.datasets module and implement untag to perform the
inverse DB operation used by tag (delete the tag row for the given data_id and
tag using the provided connection), handle user_id/context the same way as tag,
return the same success/failure shape as tag (or raise the same errors), and
ensure any cached/get_tags_for(...) behavior remains consistent after removal.

Comment on lines +49 to +51
return {
"data_untag": {"id": str(data_id)},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove the unreachable return block — dead code artifact.

Lines 49-51 are unreachable; the function already returns on lines 46-48. This is a copy-paste artifact from drafting the untag response shape and should be removed.

🗑️ Proposed fix
     database.datasets.tag(data_id, tag, user_id=user.user_id, connection=expdb_db)
     return {
         "data_tag": {"id": str(data_id), "tag": [*tags, tag]},
     }
-    return {
-        "data_untag": {"id": str(data_id)},
-    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 49 - 51, Remove the unreachable
dead-code return that returns {"data_untag": {"id": str(data_id)}}; locate the
function in src/routers/openml/datasets.py that already returns earlier (the
untag response/handler) and delete the redundant return statement/block
containing "data_untag" so only the intended earlier return remains; ensure no
other code depends on that duplicated response shape.

Comment on lines +63 to +73
if tag.casefold() not in [t.casefold() for t in tags]:
raise HTTPException(
status_code=HTTPStatus.UNPROCESSABLE_ENTITY,
detail={"message": "Tag does not exist for this dataset"},
)

if user is None:
raise HTTPException(
status_code=HTTPStatus.UNAUTHORIZED,
detail="Authentication required",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Move the authentication guard before the tag-existence check.

Currently an unauthenticated caller receives 422 when the tag is absent and 401 when it exists, leaking tag membership to anyone without credentials. Authentication should be validated first — before the get_tags_for DB call — to both prevent information disclosure and avoid the unnecessary round-trip on every unauthenticated request.

🔒️ Proposed fix
 def untag_dataset(
     data_id: Annotated[int, Body()],
     tag: Annotated[str, SystemString64],
     user: Annotated[User | None, Depends(fetch_user)] = None,
     expdb_db: Annotated[Connection, Depends(expdb_connection)] = None,
 ) -> dict[str, dict[str, Any]]:
+    if user is None:
+        raise HTTPException(
+            status_code=HTTPStatus.UNAUTHORIZED,
+            detail="Authentication required",
+        )
+
     tags = database.datasets.get_tags_for(data_id, expdb_db)

     if tag.casefold() not in [t.casefold() for t in tags]:
         raise HTTPException(
             status_code=HTTPStatus.UNPROCESSABLE_ENTITY,
             detail={"message": "Tag does not exist for this dataset"},
         )

-    if user is None:
-        raise HTTPException(
-            status_code=HTTPStatus.UNAUTHORIZED,
-            detail="Authentication required",
-        )
-
     database.datasets.untag(
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if tag.casefold() not in [t.casefold() for t in tags]:
raise HTTPException(
status_code=HTTPStatus.UNPROCESSABLE_ENTITY,
detail={"message": "Tag does not exist for this dataset"},
)
if user is None:
raise HTTPException(
status_code=HTTPStatus.UNAUTHORIZED,
detail="Authentication required",
)
if user is None:
raise HTTPException(
status_code=HTTPStatus.UNAUTHORIZED,
detail="Authentication required",
)
tags = database.datasets.get_tags_for(data_id, expdb_db)
if tag.casefold() not in [t.casefold() for t in tags]:
raise HTTPException(
status_code=HTTPStatus.UNPROCESSABLE_ENTITY,
detail={"message": "Tag does not exist for this dataset"},
)
database.datasets.untag(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 63 - 73, The code checks tag
membership before verifying authentication, leaking tag existence to
unauthenticated callers; move the authentication guard (the user is None check
that raises HTTPStatus.UNAUTHORIZED) to run before any DB/tag checks (i.e.,
before calling get_tags_for and before using the tags list), so the route
handler returns 401 for unauthenticated requests and only proceeds to the tag
existence logic (the tag.casefold() check against tags) after user is validated.

Comment on lines +75 to +80
database.datasets.untag(
data_id,
tag,
user_id=user.user_id,
connection=expdb_db,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for untag method definition in the codebase
rg -n "def untag" --type py

Repository: openml/server-api

Length of output: 229


🏁 Script executed:

#!/bin/bash
# Search for untag specifically in the database module
rg -n "def untag" src/database/ --type py

# Also check the imports in the router file to see how database.datasets is referenced
head -50 src/routers/openml/datasets.py | grep -E "^import|^from"

Repository: openml/server-api

Length of output: 819


🏁 Script executed:

#!/bin/bash
# List all function definitions in src/database/datasets.py
rg -n "^def " src/database/datasets.py --type py

Repository: openml/server-api

Length of output: 818


Add missing untag method to src/database/datasets.py.

The code at line 75 calls database.datasets.untag(...), but this method does not exist in src/database/datasets.py. The module currently defines: get, get_file, get_tags_for, tag, get_description, get_status, get_latest_processing_update, get_features, get_feature_values, update_status, and remove_deactivated_status — but not untag. This will raise AttributeError at runtime when the endpoint executes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 75 - 80, The call to
database.datasets.untag(...) is failing because the untag function is missing;
add an untag(data_id, tag, user_id=None, connection=None) function to
src/database/datasets.py that mirrors the existing tag(...) signature and
behavior but removes the association instead of creating it: locate the
database.datasets module and implement untag to perform the inverse DB operation
used by tag (delete the tag row for the given data_id and tag using the provided
connection), handle user_id/context the same way as tag, return the same
success/failure shape as tag (or raise the same errors), and ensure any
cached/get_tags_for(...) behavior remains consistent after removal.

Comment on lines +483 to +504
def tag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
connection.execute(
text(
"""
INSERT INTO dataset_tag (id, tag)
VALUES (:data_id, :tag)
""",
),
{"data_id": data_id, "tag": tag},
)


def untag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
connection.execute(
text(
"""
DELETE FROM dataset_tag
WHERE id = :data_id AND tag = :tag
""",
),
{"data_id": data_id, "tag": tag},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Remove the dead module-level tag and untag helpers — they are never called and contain defects.

Neither function is invoked anywhere in the router file:

  • tag_dataset calls database.datasets.tag (line 45)
  • untag_dataset calls database.datasets.untag (line 75)

Both are unreachable and confirmed by static analysis (user_id is unused in each — ARG001). Beyond being dead code, the module-level tag omits the uploader column from its INSERT, which differs from database.datasets.tag and would likely violate a NOT NULL constraint if ever called.

🗑️ Proposed fix — remove both functions entirely
-
-
-def tag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
-    connection.execute(
-        text(
-            """
-            INSERT INTO dataset_tag (id, tag)
-            VALUES (:data_id, :tag)
-            """,
-        ),
-        {"data_id": data_id, "tag": tag},
-    )
-
-
-def untag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
-    connection.execute(
-        text(
-            """
-            DELETE FROM dataset_tag
-            WHERE id = :data_id AND tag = :tag
-            """,
-        ),
-        {"data_id": data_id, "tag": tag},
-    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def tag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
connection.execute(
text(
"""
INSERT INTO dataset_tag (id, tag)
VALUES (:data_id, :tag)
""",
),
{"data_id": data_id, "tag": tag},
)
def untag(data_id: int, tag: str, user_id: int, connection: Connection) -> None:
connection.execute(
text(
"""
DELETE FROM dataset_tag
WHERE id = :data_id AND tag = :tag
""",
),
{"data_id": data_id, "tag": tag},
)
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 483-483: Unused function argument: user_id

(ARG001)


[warning] 495-495: Unused function argument: user_id

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/datasets.py` around lines 483 - 504, Remove the dead
module-level functions tag and untag from this file: they are unused, have an
unused user_id parameter, and the tag implementation omits the required uploader
column; delete the entire function definitions named tag(data_id: int, tag: str,
user_id: int, connection: Connection) and untag(data_id: int, tag: str, user_id:
int, connection: Connection) so callers use the correct database.datasets.tag
and database.datasets.untag implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant