276 automatically trigger extractor whenever a file is updated #279

tcnichol · 2023-01-14T21:03:35Z

With this pull request, if a file is updated then all extractors that were run on this file will be run again.

Right now we do not save the parameters an extractor was run with, so for this pull request the resubmissions won't include parameters. Adding those depends on other issues currently in progress.
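As a rough illustration of the behavior described above, here is a minimal sketch that picks out the extractors previously run on a file and builds one resubmission per extractor with empty parameters. The event shape, the function names, and the `ncsa.other` extractor are assumptions made for the example, not the actual Clowder models.

```python
from typing import Dict, List

# Hypothetical shape of a stored extraction event; the real model may differ.
ExtractionEvent = Dict[str, str]


def extractors_to_rerun(events: List[ExtractionEvent], file_id: str) -> List[str]:
    """Return each extractor that ran on the file once, in first-seen order."""
    seen: List[str] = []
    for event in events:
        if event["file_id"] == file_id and event["extractor"] not in seen:
            seen.append(event["extractor"])
    return seen


def build_resubmissions(events: List[ExtractionEvent], file_id: str) -> List[dict]:
    """Build one resubmission per extractor; parameters are left empty
    because earlier run parameters are not saved yet."""
    return [
        {"file_id": file_id, "extractor": name, "parameters": {}}
        for name in extractors_to_rerun(events, file_id)
    ]


events = [
    {"file_id": "f1", "extractor": "ncsa.wordcount"},
    {"file_id": "f1", "extractor": "ncsa.wordcount"},  # same extractor run twice
    {"file_id": "f2", "extractor": "ncsa.other"},  # different file, ignored
]
resubmissions = build_resubmissions(events, "f1")
print(resubmissions)
```

An extractor that ran several times on the same file is resubmitted only once here; whether that matches the intended behavior is a design choice.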

For this to work, a listener/extractor needs to be in the db. Here is a sample one (wordcount) that can be used; add it to the 'listeners' db table.

{
  "_id": { "$oid": "63b5cd4aeb1180d52266214e" },
  "author": "Rob Kooper <[email protected]>",
  "name": "ncsa.wordcount",
  "version": "2.0",
  "description": "WordCount extractor. Counts the number of characters, words and lines in the text file that was uploaded.",
  "creator": null,
  "created": { "$date": { "$numberLong": "1672858954451" } },
  "modified": { "$date": { "$numberLong": "1672858954451" } },
  "properties": {
    "author": "Rob Kooper <[email protected]>",
    "process": { "file": [ "text/*", "application/json" ] },
    "maturity": "Development",
    "name": "ncsa.wordcount",
    "contributors": [],
    "contexts": [
      {
        "lines": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#lines",
        "words": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#words",
        "characters": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#characters"
      }
    ],
    "repository": [
      {
        "id": { "$oid": "63b5cd4aeb1180d52266214d" },
        "repository_type": "git",
        "repository_url": ""
      }
    ],
    "external_services": [],
    "libraries": [],
    "bibtex": [],
    "default_labels": [],
    "categories": [],
    "parameters": {
      "schema": {
        "X_MIN_START": { "type": "integer", "title": "X_MIN_START" },
        "X_MIN_END": { "type": "integer", "title": "X_MIN_END" },
        "Y_MIN_START": { "type": "integer", "title": "Y_MIN_START" },
        "Y_MIN_END": { "type": "integer", "title": "Y_MIN_END" },
        "ZONE": { "type": "string", "title": "ZONE" }
      }
    },
    "version": "2.0"
  }
}

tcnichol · 2023-01-14T21:04:56Z

In this pull request, the method that updates a file calls a helper method that re-runs all the extractors. I also added a route that does this independently, but went with the helper method as the solution since it requires no change on the front end. If using the route would be a better solution, I can change how this works.
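To make the two entry points concrete, here is a minimal sketch in which the update path and a standalone route-like function share one async helper. It uses plain asyncio functions and an in-memory list in place of the real router and RabbitMQ client, and all names are hypothetical rather than the ones in backend/app/routers/files.py.

```python
import asyncio
from typing import List

# Stand-in for the message queue; a real deployment would publish to RabbitMQ.
submitted: List[str] = []


async def resubmit_file_extractors(file_id: str, extractors: List[str]) -> None:
    """Hypothetical helper: queue every extractor for the given file."""
    for name in extractors:
        await asyncio.sleep(0)  # placeholder for an awaitable publish call
        submitted.append(f"{name}:{file_id}")


async def update_file(file_id: str, extractors: List[str]) -> None:
    """Update path: storing the new file version is omitted here."""
    # Without 'await' the coroutine object is created but never executed.
    await resubmit_file_extractors(file_id, extractors)


async def resubmit_route(file_id: str, extractors: List[str]) -> None:
    """Standalone path: the same helper, triggered on demand."""
    await resubmit_file_extractors(file_id, extractors)


asyncio.run(update_file("f1", ["ncsa.wordcount"]))
asyncio.run(resubmit_route("f1", ["ncsa.wordcount"]))
print(submitted)
```

Because both paths call the same helper, the front end needs no change for the automatic case, and the route remains available for manual re-runs.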

lmarini

Please add comments to code. Thank you.

tcnichol · 2023-01-19T15:51:24Z

Comments added. In this pull request the extractors are re-run using a helper method. I also added a route for re-running extractors, since it seemed like it might be useful.

Once the Job is implemented, I will be able to use parameters of former extractor runs.

longshuicy · 2023-01-20T15:41:52Z

backend/app/routers/files.py

            }
        }
        update_record(es, "file", doc, updated_file.id)
+        await _resubmit_file_extractors(file_id, credentials, db, rabbitmq_client)


It might be worth adding a return message to the _resubmit... function so you can get some status back. If the resubmit fails, will there be a way to return that information to the client?

tcnichol · 2023-01-23

I made the method return a list. Extractors that are resubmitted successfully go into a success list, and the ones that fail go into a fail list.
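A minimal sketch of that idea, assuming a hypothetical `submit` callable that raises when a resubmission fails; the dictionary keys and the `broken.extractor` name are illustrative, not necessarily what the merged code uses.

```python
from typing import Callable, Dict, List


def resubmit_all(
    extractors: List[str], submit: Callable[[str], None]
) -> Dict[str, List[str]]:
    """Try each extractor and sort its name into a success or fail list."""
    result: Dict[str, List[str]] = {"success": [], "fail": []}
    for name in extractors:
        try:
            submit(name)
            result["success"].append(name)
        except Exception:
            # One failing extractor should not stop the remaining ones.
            result["fail"].append(name)
    return result


def fake_submit(name: str) -> None:
    """Hypothetical submit function that fails for one extractor."""
    if name == "broken.extractor":
        raise RuntimeError("queue unavailable")


status = resubmit_all(["ncsa.wordcount", "broken.extractor"], fake_submit)
print(status)
```

Returning the names rather than counts lets the client report exactly which extractors need attention.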

max-zilla

A couple of small comments; otherwise most of it looks good.

max-zilla · 2023-02-02T19:36:47Z

backend/app/routers/files.py

+        rabbitmq_client: Rabbitmq Client
+
+    """
+    if (file := await db["files"].find_one({"_id": ObjectId(file_id)})) is not None:


Where you call this method you already have the FileOut, so you could pass it in directly instead of querying for it again here.

tcnichol · 2023-02-03

Fixed that.

max-zilla · 2023-02-02T19:37:57Z

backend/app/routers/files.py

+    """
+    if (file := await db["files"].find_one({"_id": ObjectId(file_id)})) is not None:
+        file_out = FileOut.from_mongo(file)
+        query = {"resource.resource_id": ObjectId(file_id)}


This will match all metadata regardless of resource.version, so metadata from any previous file version will also be deleted. Is this the intent?

tcnichol · 2023-02-03

I notice that MetadataOut doesn't seem to have the version. I will add that and then check to make sure this is the latest.

tcnichol · 2023-02-03

Added that; I just need to test to make sure it works.
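To illustrate the difference between matching on the file id alone and matching on the file id plus version, here is a small sketch. It uses plain dictionaries, string ids instead of ObjectId, and a tiny dotted-key matcher standing in for MongoDB; the document layout is an assumption based on the resource.resource_id and resource.version fields mentioned above.

```python
from typing import Any, Dict


def metadata_query(file_id: str, version: int) -> Dict[str, Any]:
    """Build a filter that only matches metadata for one file version."""
    return {"resource.resource_id": file_id, "resource.version": version}


def matches(doc: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Minimal stand-in for dotted-key equality matching."""
    for dotted_key, expected in query.items():
        value: Any = doc
        for part in dotted_key.split("."):
            if not isinstance(value, dict) or part not in value:
                return False
            value = value[part]
        if value != expected:
            return False
    return True


# Hypothetical metadata documents for two versions of the same file.
metadata = [
    {"resource": {"resource_id": "f1", "version": 1}, "content": {"words": 10}},
    {"resource": {"resource_id": "f1", "version": 2}, "content": {"words": 12}},
]

# Filtering on the id alone matches every version of the file.
broad = [m for m in metadata if matches(m, {"resource.resource_id": "f1"})]
# Adding the version restricts the match to a single version.
narrow = [m for m in metadata if matches(m, metadata_query("f1", 2))]
print(len(broad), len(narrow))
```

With the version in the filter, metadata attached to earlier file versions is left untouched.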

tcnichol · 2023-02-03T02:33:57Z

I merged 'main' into this branch and am noticing that it no longer seems to work correctly.

I will try to fix it tomorrow before the Clowder meeting.

max-zilla · 2023-02-03T20:48:43Z

@tcnichol please resolve merge conflict

tcnichol added 4 commits January 13, 2023 16:41

adding helper method get extraction events. gets what extractors were run on a file

eb0b7a0

helper method for get extractor events and new method for resubmitting a file

77d3c77

helper method called in update file

7cea1d7

adding 'await' or else won't run

c17f0ed

tcnichol requested a review from ddey2 January 14, 2023 21:03

tcnichol requested review from lmarini and max-zilla as code owners January 14, 2023 21:03

tcnichol linked an issue Jan 14, 2023 that may be closed by this pull request

Automatically trigger extractor whenever a file is updated #276

Closed

pipenv run black app - formatting

15973df

lmarini requested changes Jan 17, 2023

tcnichol added 2 commits January 19, 2023 09:16

adding comments for method

cf39adb

comments to new route

720880e

unused method removed

longshuicy reviewed Jan 20, 2023

tcnichol added 4 commits January 23, 2023 15:42

re-used helper method in new route

ec1efbc

helper method returns a count of extractors failed/succeeded on resubmit

change - return list of success and fail and not count

3e85185

Merge branch 'main' into 276-automatically-trigger-extractor-whenever-a-file-is-updated

4108500

# Conflicts: # backend/app/routers/files.py

formatting

8b18392

max-zilla reviewed Feb 2, 2023

tcnichol added 4 commits February 2, 2023 18:10

missing imports

d6a6eda

adding parameters to update_file method passing in FileOut object to _resubmit method

adding file_version to metadata out, checking file version of metadata is equal to file version of file before updating

c341ca3

Merge branch 'main' into 276-automatically-trigger-extractor-whenever-a-file-is-updated

f540259

comparing version numbers

aa27e13

does not work

tcnichol added 3 commits February 2, 2023 20:40

need to fix

51c8af0

does not work

Merge branch 'main' into 276-automatically-trigger-extractor-whenever-a-file-is-updated

c5a26e0

# Conflicts: # backend/app/routers/files.py

does not rerun extractors

d620c00

metadata attaches to right file version

tcnichol linked an issue Feb 3, 2023 that may be closed by this pull request

extractor metadata attaches to version 1 of file regardless of file version #309

Closed

formatting

bd95ff2

tcnichol added 5 commits February 3, 2023 16:36

changing order of imports

d2279e9

Merge branch 'main' into 276-automatically-trigger-extractor-whenever-a-file-is-updated

d9c2656

removing unused import

84f2a59

using new method

5cf85c5

formatting

85443f1

max-zilla requested a review from lmarini February 6, 2023 15:32

max-zilla approved these changes Feb 6, 2023

max-zilla merged commit a4b6790 into main Feb 6, 2023

max-zilla deleted the 276-automatically-trigger-extractor-whenever-a-file-is-updated branch February 6, 2023 15:35

5 participants