276 automatically trigger extractor whenever a file is updated #279
In this pull request, the method that updates a file calls a helper method that runs all the extractors. I also added a route that does this independently, but went with the helper method as the solution since it requires no change on the front end. If the route would be a better solution, I can change how this works.
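The flow described above can be sketched roughly as follows. This is only an illustration: `update_file`, `resubmit_file_extractors`, and the in-memory `submitted` list (standing in for the RabbitMQ client) are hypothetical names, and the real route in `backend/app/routers/files.py` takes more dependencies.

```python
import asyncio

# Hypothetical stand-in for the message queue: records what was submitted.
submitted = []


async def resubmit_file_extractors(file_id, extractor_names):
    """Re-run every extractor previously run on this file (sketch only)."""
    for name in extractor_names:
        submitted.append((file_id, name))


async def update_file(file_id, new_bytes, extractor_names):
    """Store the new version, then trigger the extractors via the helper."""
    # ... saving new_bytes and updating the search index would happen here ...
    await resubmit_file_extractors(file_id, extractor_names)
    return {"id": file_id, "size": len(new_bytes)}


result = asyncio.run(update_file("file-1", b"hello", ["ncsa.wordcount"]))
```

Because the helper is awaited inside the update method, the front end only needs to call the existing update endpoint.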
lmarini
left a comment
Please add comments to code. Thank you.
unused method removed
Comments added. In this pull request, the extractors are re-run using a helper method. I also added a route for re-running extractors, since it seemed like it might be useful. Once the Job is implemented, I will be able to reuse the parameters of former extractor runs.
backend/app/routers/files.py
Outdated
}
}
update_record(es, "file", doc, updated_file.id)
await _resubmit_file_extractors(file_id, credentials, db, rabbitmq_client)
It might be worth adding a return message to the _resubmit... function so you can get some status back. If the resubmit fails, would there be a way to return that info to the client?
I made the method return a list. If any extractor succeeds, it goes into a succeeds list and the ones that fail go into a fail list.
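A rough sketch of that status-returning shape, not the actual implementation: `resubmit_extractors`, `fake_submit`, and the `succeeded`/`failed` keys are hypothetical names chosen for illustration, and `submit` is assumed to be an async callable that raises on failure.

```python
import asyncio


async def resubmit_extractors(file_id, extractor_names, submit):
    """Resubmit each extractor and collect names by outcome (sketch only).

    `submit` is a hypothetical async callable that raises on failure.
    """
    succeeded, failed = [], []
    for name in extractor_names:
        try:
            await submit(file_id, name)
            succeeded.append(name)
        except Exception:
            failed.append(name)
    return {"succeeded": succeeded, "failed": failed}


async def fake_submit(file_id, name):
    # Simulate one extractor that cannot be queued.
    if name == "broken.extractor":
        raise RuntimeError("queue unavailable")


status = asyncio.run(
    resubmit_extractors("file-1", ["ncsa.wordcount", "broken.extractor"], fake_submit)
)
```

The caller can then return `status` (or just the two counts) to the client alongside the updated file.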
helper method returns a count of extractors failed/succeeded on resubmit
…-a-file-is-updated # Conflicts: # backend/app/routers/files.py
max-zilla
left a comment
A couple of small comments; otherwise most of it looks good.
backend/app/routers/files.py
Outdated
rabbitmq_client: Rabbitmq Client
"""
if (file := await db["files"].find_one({"_id": ObjectId(file_id)})) is not None:
Where you call this method, you already have the FileOut, so you could pass it in directly and avoid querying it again here.
fixed that.
backend/app/routers/files.py
Outdated
"""
if (file := await db["files"].find_one({"_id": ObjectId(file_id)})) is not None:
file_out = FileOut.from_mongo(file)
query = {"resource.resource_id": ObjectId(file_id)}
This will match all metadata regardless of resource.version, so metadata from any previous file version will also be deleted. Is this the intent?
I notice that MetadataOut doesn't seem to have the version. I will add that and then check to make sure this is the latest.
Added that, just need to test to make sure it works.
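The version concern in this thread could be handled by a filter along these lines. This is a sketch under assumptions: `metadata_query` is a hypothetical helper, `resource.version` is assumed to be stored on each metadata record, and plain strings are used for the id where the code under review wraps it in `ObjectId`.

```python
def metadata_query(file_id, version=None):
    """Build a MongoDB-style filter for a file's metadata (sketch only).

    Field names follow the snippet under review; `resource.version` is
    assumed to exist on each metadata record.
    """
    query = {"resource.resource_id": file_id}
    if version is not None:
        # Restrict the match to a single file version.
        query["resource.version"] = version
    return query


# Without a version, every version's metadata matches (the behavior questioned above).
all_versions = metadata_query("63b5cd4aeb1180d52266214e")
# With a version, only that version's metadata matches.
only_v2 = metadata_query("63b5cd4aeb1180d52266214e", version=2)
```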
adding parameters to update_file method passing in FileOut object to _resubmit method
…a is equal to file version of file before updating
…-a-file-is-updated
does not work
I merged 'main' into this branch and noticed that it no longer seems to work correctly. I will try to fix it tomorrow before the Clowder meeting.
does not work
…-a-file-is-updated # Conflicts: # backend/app/routers/files.py
metadata attaches to right file version
@tcnichol please resolve merge conflict
With this pull request, if a file is updated, all extractors that were previously run on it will be run again.
Right now we do not save the parameters used when an extractor has parameters, so for this pull request the resubmissions won't include them. Adding those depends on other issues currently in progress.
For this to work, a listener/extractor needs to be in the db. Here is a sample one (wordcount) that can be used; add it to the 'listeners' db table.
{
  "_id": { "$oid": "63b5cd4aeb1180d52266214e" },
  "author": "Rob Kooper <[email protected]>",
  "name": "ncsa.wordcount",
  "version": "2.0",
  "description": "WordCount extractor. Counts the number of characters, words and lines in the text file that was uploaded.",
  "creator": null,
  "created": { "$date": { "$numberLong": "1672858954451" } },
  "modified": { "$date": { "$numberLong": "1672858954451" } },
  "properties": {
    "author": "Rob Kooper <[email protected]>",
    "process": { "file": [ "text/*", "application/json" ] },
    "maturity": "Development",
    "name": "ncsa.wordcount",
    "contributors": [],
    "contexts": [
      {
        "lines": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#lines",
        "words": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#words",
        "characters": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#characters"
      }
    ],
    "repository": [
      { "id": { "$oid": "63b5cd4aeb1180d52266214d" }, "repository_type": "git", "repository_url": "" }
    ],
    "external_services": [],
    "libraries": [],
    "bibtex": [],
    "default_labels": [],
    "categories": [],
    "parameters": {
      "schema": {
        "X_MIN_START": { "type": "integer", "title": "X_MIN_START" },
        "X_MIN_END": { "type": "integer", "title": "X_MIN_END" },
        "Y_MIN_START": { "type": "integer", "title": "Y_MIN_START" },
        "Y_MIN_END": { "type": "integer", "title": "Y_MIN_END" },
        "ZONE": { "type": "string", "title": "ZONE" }
      }
    },
    "version": "2.0"
  }
}