@tcnichol
Contributor

With this pull request, when a file is updated, all extractors that were previously run on it are run again.

Right now we do not save the parameters used when an extractor has parameters, so in this pull request the resubmissions won't include parameters. Adding those depends on other issues currently in progress.

For this to work, a listener/extractor needs to be in the db. Here is a sample one (wordcount) that can be used; add it to the 'listeners' db table.

{
  "_id": { "$oid": "63b5cd4aeb1180d52266214e" },
  "author": "Rob Kooper <[email protected]>",
  "name": "ncsa.wordcount",
  "version": "2.0",
  "description": "WordCount extractor. Counts the number of characters, words and lines in the text file that was uploaded.",
  "creator": null,
  "created": { "$date": { "$numberLong": "1672858954451" } },
  "modified": { "$date": { "$numberLong": "1672858954451" } },
  "properties": {
    "author": "Rob Kooper <[email protected]>",
    "process": { "file": [ "text/*", "application/json" ] },
    "maturity": "Development",
    "name": "ncsa.wordcount",
    "contributors": [],
    "contexts": [
      {
        "lines": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#lines",
        "words": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#words",
        "characters": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#characters"
      }
    ],
    "repository": [
      {
        "id": { "$oid": "63b5cd4aeb1180d52266214d" },
        "repository_type": "git",
        "repository_url": ""
      }
    ],
    "external_services": [],
    "libraries": [],
    "bibtex": [],
    "default_labels": [],
    "categories": [],
    "parameters": {
      "schema": {
        "X_MIN_START": { "type": "integer", "title": "X_MIN_START" },
        "X_MIN_END": { "type": "integer", "title": "X_MIN_END" },
        "Y_MIN_START": { "type": "integer", "title": "Y_MIN_START" },
        "Y_MIN_END": { "type": "integer", "title": "Y_MIN_END" },
        "ZONE": { "type": "string", "title": "ZONE" }
      }
    },
    "version": "2.0"
  }
}

@tcnichol tcnichol requested a review from ddey2 January 14, 2023 21:03
@tcnichol tcnichol linked an issue Jan 14, 2023 that may be closed by this pull request
@tcnichol
Contributor Author

In this pull request, the method that updates a file calls a helper method that re-runs all the extractors. I also added a route that does this independently, but went with the helper method as the solution since it requires no change on the front end. If the route would be a better solution, I can change how this works.
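
For illustration, the helper-method approach described above can be sketched as follows. This is a minimal sketch, not the code in this pull request: `EXTRACTOR_RUNS`, `QUEUE`, `resubmit_file_extractors`, and this simplified `update_file` are hypothetical in-memory stand-ins for the database, RabbitMQ, and the real router code.

```python
import asyncio
from typing import Dict, List

# Hypothetical stand-ins: extractors previously run per file, and the
# messages that would be published to the message queue.
EXTRACTOR_RUNS: Dict[str, List[str]] = {"file-1": ["ncsa.wordcount"]}
QUEUE: List[dict] = []


async def resubmit_file_extractors(file_id: str) -> List[str]:
    """Queue every extractor that previously ran on the file."""
    resubmitted = []
    for extractor in EXTRACTOR_RUNS.get(file_id, []):
        QUEUE.append({"file_id": file_id, "extractor": extractor})
        resubmitted.append(extractor)
    return resubmitted


async def update_file(file_id: str, new_bytes: bytes) -> dict:
    """Update the file, then re-run its extractors as part of the same call."""
    # Persisting new_bytes and bumping the version is omitted in this sketch.
    resubmitted = await resubmit_file_extractors(file_id)
    return {"file_id": file_id, "resubmitted": resubmitted}


if __name__ == "__main__":
    print(asyncio.run(update_file("file-1", b"new contents")))
```

Because the resubmission happens inside the update call, the front end needs no changes; a standalone route could call the same helper.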

Member

@lmarini lmarini left a comment

Please add comments to code. Thank you.

@tcnichol
Contributor Author

Comments added. In this pull request the extractors are re-run using a helper method. I also added a route for re-running extractors; it seemed like it might be useful.

Once the Job is implemented, I will be able to reuse the parameters from earlier extractor runs.

}
}
update_record(es, "file", doc, updated_file.id)
await _resubmit_file_extractors(file_id, credentials, db, rabbitmq_client)
Member

It might be worth adding a return message to the _resubmit... function so you can get some status back. If the resubmit fails, will there be a way to return that info to the client?

Contributor Author

I made the method return a list. Each extractor that succeeds goes into a success list, and the ones that fail go into a failure list.
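
A minimal sketch of this kind of status reporting is shown below. The names `resubmit_extractors`, `fake_submit`, and `example.broken`, as well as the dictionary return shape, are assumptions for illustration; the actual helper in the pull request may differ.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical type: a coroutine that submits one extractor job for a file.
Submit = Callable[[str, str], Awaitable[None]]


async def resubmit_extractors(
    file_id: str, extractor_names: List[str], submit: Submit
) -> Dict[str, List[str]]:
    """Resubmit each extractor and record which submissions worked."""
    result: Dict[str, List[str]] = {"succeeded": [], "failed": []}
    for name in extractor_names:
        try:
            await submit(file_id, name)
        except Exception:
            # One failing extractor should not stop the others.
            result["failed"].append(name)
        else:
            result["succeeded"].append(name)
    return result


async def fake_submit(file_id: str, extractor_name: str) -> None:
    """Stand-in for publishing to the queue; fails for one made-up name."""
    if extractor_name == "example.broken":
        raise RuntimeError("queue unavailable")
```

The caller can then pass the returned lists (or their lengths) back to the client, so a failed resubmission is visible rather than silent.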

helper method returns a count of extractors failed/succeeded on resubmit
…-a-file-is-updated

# Conflicts:
#	backend/app/routers/files.py
Contributor

@max-zilla max-zilla left a comment

A couple of small comments; most of it looks good otherwise.

rabbitmq_client: Rabbitmq Client
"""
if (file := await db["files"].find_one({"_id": ObjectId(file_id)})) is not None:
Contributor

Where you call this method, you already have the FileOut; you could pass it in directly and avoid querying for it again here.

Contributor Author

fixed that.
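
The difference between the two call styles can be sketched as below. This is an illustration under stated assumptions, not the project's code: the `FileOut` dataclass is a simplified stand-in for the real model, and `CountingFiles`, `resubmit_by_id`, and `resubmit_with_file` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FileOut:
    """Simplified, hypothetical stand-in for the real FileOut model."""
    id: str
    name: str


class CountingFiles:
    """Fake 'files' collection that counts how often it is queried."""

    def __init__(self, docs: Dict[str, dict]) -> None:
        self.docs = docs
        self.lookups = 0

    async def find_one(self, query: dict) -> Optional[dict]:
        self.lookups += 1
        return self.docs.get(query["_id"])


async def resubmit_by_id(file_id: str, files: CountingFiles) -> Optional[str]:
    """Before: the helper looks the file up again, costing one extra query."""
    doc = await files.find_one({"_id": file_id})
    if doc is None:
        return None
    return FileOut(id=doc["_id"], name=doc["name"]).id


async def resubmit_with_file(file_out: FileOut) -> str:
    """After: the caller passes the FileOut it already holds; no query."""
    return file_out.id
```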

"""
if (file := await db["files"].find_one({"_id": ObjectId(file_id)})) is not None:
file_out = FileOut.from_mongo(file)
query = {"resource.resource_id": ObjectId(file_id)}
Contributor

This will match all metadata regardless of resource.version, so metadata from any previous file version will also be deleted. Is this the intent?

Contributor Author

I noticed that MetadataOut doesn't seem to have the version. I will add it and then check that the metadata belongs to the latest file version.

Contributor Author

Added that, just need to test to make sure it works.
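
To make the version concern concrete, here is a minimal sketch of a query restricted to one file version. It assumes the metadata documents store the version under `resource.version`, as the review comment suggests; `metadata_query` and `matches` are hypothetical helpers, and plain strings stand in for `ObjectId` values so the example has no database dependency.

```python
from typing import Any, Dict, Optional


def metadata_query(file_id: str, version: Optional[int] = None) -> Dict[str, Any]:
    """Build a filter for a file's metadata, optionally pinned to one version."""
    query: Dict[str, Any] = {"resource.resource_id": file_id}
    if version is not None:
        query["resource.version"] = version
    return query


def matches(doc: dict, query: Dict[str, Any]) -> bool:
    """Tiny dotted-path equality matcher, for illustration only."""
    for path, expected in query.items():
        value: Any = doc
        for key in path.split("."):
            if not isinstance(value, dict) or key not in value:
                return False
            value = value[key]
        if value != expected:
            return False
    return True
```

Without the version key the filter matches metadata from every version of the file; with it, only the metadata for the given version is selected.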

adding parameters to update_file method
passing in FileOut object to _resubmit method
…a is equal to file version of file before updating
@tcnichol
Contributor Author

tcnichol commented Feb 3, 2023

I merged 'main' into this branch and am noticing that it no longer seems to work correctly.

Will try to fix tomorrow before the clowder meeting.

does not work
…-a-file-is-updated

# Conflicts:
#	backend/app/routers/files.py
metadata attaches to right file version
@max-zilla
Contributor

@tcnichol please resolve merge conflict

@max-zilla max-zilla requested a review from lmarini February 6, 2023 15:32
@max-zilla max-zilla merged commit a4b6790 into main Feb 6, 2023
@max-zilla max-zilla deleted the 276-automatically-trigger-extractor-whenever-a-file-is-updated branch February 6, 2023 15:35
Development

Successfully merging this pull request may close these issues.

extractor metadata attaches to version 1 of file regardless of file version
Automatically trigger extractor whenever a file is updated

5 participants