Skip to content

[Port to dtq-dev] Port ItemMetadataQAChecker curation task from v5 to v7#1237

Open
kosarko wants to merge 2 commits intodataquest-dev:dtq-devfrom
ufal:backport-1312-to-dtq-dev
Open

[Port to dtq-dev] Port ItemMetadataQAChecker curation task from v5 to v7#1237
kosarko wants to merge 2 commits intodataquest-dev:dtq-devfrom
ufal:backport-1312-to-dtq-dev

Conversation

@kosarko
Copy link

@kosarko kosarko commented Jan 26, 2026

Port of ufal#1312 to dtq-dev.

* Add ItemMetadataQAChecker curation task with tests

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: kosarko <1842385+kosarko@users.noreply.github.com>
(cherry picked from commit d5517be)
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request ports the ItemMetadataQAChecker curation task from DSpace v5 to v7 for the dtq-dev branch. The curation task performs quality assurance checks on item metadata, validating various properties such as dc.type values, language codes, duplicate metadata, and other metadata consistency requirements specific to CLARIN/LINDAT repositories.

Changes:

  • Adds a new curation task ItemMetadataQAChecker that validates item metadata against CLARIN-specific quality requirements
  • Registers the new task in both production and test configuration files with the task name "metadataqa"
  • Includes comprehensive integration tests covering various validation scenarios (invalid types, missing metadata, incorrect language mappings, duplicate fields)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
dspace/config/modules/curate.cfg Registers the ItemMetadataQAChecker curation task as "metadataqa"
dspace-api/src/test/data/dspaceFolder/config/modules/curate.cfg Registers the task in test configuration
dspace-api/src/main/java/org/dspace/ctask/general/ItemMetadataQAChecker.java Main implementation of the metadata QA checker with validation methods for types, languages, relations, and other metadata fields
dspace-api/src/test/java/org/dspace/curate/ItemMetadataQACheckerIT.java Integration tests covering various validation scenarios and edge cases

Comment on lines +120 to +126
.withTitle("Item With Two Available Dates")
.withMetadata("dc", "type", null, "corpus")
.withMetadata("dc", "date", "available", "2020-01-01")
.build();

itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date",
"available", "en_US", "2021-01-01");
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation: this block uses 8 spaces for the builder pattern indentation, while the rest of the test file uses 4 spaces. This should be adjusted to match the consistent 4-space indentation used throughout the rest of the file.

Suggested change
.withTitle("Item With Two Available Dates")
.withMetadata("dc", "type", null, "corpus")
.withMetadata("dc", "date", "available", "2020-01-01")
.build();
itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date",
"available", "en_US", "2021-01-01");
.withTitle("Item With Two Available Dates")
.withMetadata("dc", "type", null, "corpus")
.withMetadata("dc", "date", "available", "2020-01-01")
.build();
itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date",
"available", "en_US", "2021-01-01");

Copilot uses AI. Check for mistakes.
.withMetadata("dc", "date", "available", "2020-01-01")
.build();

itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date",
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after comma in method call. Should be "itemWithTwoAvailableDatesAndLang, "dc"" to follow Java code style conventions.

Suggested change
itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date",
itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang, "dc", "date",

Copilot uses AI. Check for mistakes.
.build();

itemService.addMetadata(context, itemWithTwoAvailableDatesAndLang,"dc", "date",
"available", "en_US", "2021-01-01");
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation: this line uses 8 spaces for indentation while the rest of the file uses 4 spaces. Should be adjusted to match the consistent 4-space indentation pattern.

Suggested change
"available", "en_US", "2021-01-01");
"available", "en_US", "2021-01-01");

Copilot uses AI. Check for mistakes.
Comment on lines +254 to +261
// Validate that each language name corresponds to its ISO code
for (int i = 0; i < dcsLanguageIso.size(); i++) {
String expectedLangName = IsoLangCodes.getLangForCode(dcsLanguageIso.get(i).getValue());
String actualLangName = languageNames.get(i).getValue();
if (!expectedLangName.equals(actualLangName)) {
throw new CurateException(
String.format("local.language.name [%s] does not match expected name [%s] for ISO code [%s]",
actualLangName, expectedLangName, dcsLanguageIso.get(i).getValue()),
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation logic assumes that dc.language.iso and local.language.name metadata values are returned in matching order by index, but this assumption is incorrect. According to DSpace metadata documentation, when retrieving metadata fields, values are sorted by MetadataFieldId first and then by place value. Since dc.language.iso and local.language.name are different fields, there's no guarantee that their ordering will match when accessed by index.

The current implementation will produce false failures when metadata values for these fields have different place orderings. For example, if dc.language.iso values are added with places [0,1,2] and local.language.name values are added with places [2,1,0], the validation will incorrectly report mismatches even if the values actually correspond correctly.

The validation should either: (1) explicitly compare metadata values by their place attribute to ensure proper pairing, or (2) use a different approach that doesn't rely on list index ordering assumptions, such as building a set of expected language names and verifying all actual language names are in that set.

Suggested change
// Validate that each language name corresponds to its ISO code
for (int i = 0; i < dcsLanguageIso.size(); i++) {
String expectedLangName = IsoLangCodes.getLangForCode(dcsLanguageIso.get(i).getValue());
String actualLangName = languageNames.get(i).getValue();
if (!expectedLangName.equals(actualLangName)) {
throw new CurateException(
String.format("local.language.name [%s] does not match expected name [%s] for ISO code [%s]",
actualLangName, expectedLangName, dcsLanguageIso.get(i).getValue()),
// Build maps of expected and actual language names keyed by place
Map<Integer, String> expectedLangNamesByPlace = new HashMap<>();
Map<Integer, String> isoCodesByPlace = new HashMap<>();
for (MetadataValue langCodeDC : dcsLanguageIso) {
Integer place = langCodeDC.getPlace();
String isoCode = langCodeDC.getValue();
String expectedLangName = IsoLangCodes.getLangForCode(isoCode);
expectedLangNamesByPlace.put(place, expectedLangName);
isoCodesByPlace.put(place, isoCode);
}
Map<Integer, String> actualLangNamesByPlace = new HashMap<>();
for (MetadataValue languageName : languageNames) {
Integer place = languageName.getPlace();
String actualLangName = languageName.getValue();
actualLangNamesByPlace.put(place, actualLangName);
}
// Ensure that the sets of places match between ISO codes and language names
Set<Integer> expectedPlaces = expectedLangNamesByPlace.keySet();
Set<Integer> actualPlaces = actualLangNamesByPlace.keySet();
if (!expectedPlaces.equals(actualPlaces)) {
throw new CurateException(
String.format("local.language.name places %s do not match dc.language.iso places %s",
actualPlaces, expectedPlaces),
Curator.CURATE_FAIL);
}
// Validate that each language name corresponds to its ISO code for each place
for (Integer place : expectedPlaces) {
String expectedLangName = expectedLangNamesByPlace.get(place);
String actualLangName = actualLangNamesByPlace.get(place);
if (!expectedLangName.equals(actualLangName)) {
String isoCode = isoCodesByPlace.get(place);
throw new CurateException(
String.format(
"local.language.name [%s] at place [%d] does not match expected name [%s] for ISO code [%s]",
actualLangName, place, expectedLangName, isoCode),

Copilot uses AI. Check for mistakes.
curator.curate(context, itemWithIncorrectLanguageName.getHandle());
int status = curator.getStatus(TASK_NAME);
String result = curator.getResult(TASK_NAME);
System.out.println("Test result: " + result);
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The System.out.println debug statement should be removed or replaced with a proper logging statement. While System.out.println is used in some other test files in the codebase, it's generally not a best practice for test output. Consider using a logger instead, or remove this line entirely as the assertion message already provides diagnostic information when the test fails.

Suggested change
System.out.println("Test result: " + result);

Copilot uses AI. Check for mistakes.
@kosarko
Copy link
Author

kosarko commented Feb 19, 2026

@milanmajchrak updated, but haven't tested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants