
Conversation


@tcnichol tcnichol commented Aug 7, 2022

I'm going to go ahead and create this as a draft to get feedback.

Extractors are registered using extractor_info.json. Files and datasets can be submitted to a running extractor. The message sent is kept as close as possible to what was sent before, so that minimal changes are needed for pyclowder2.

Right now extractors don't run, since they rely on API endpoints that aren't in Clowder 2.0 yet. Users also don't have extractor keys. I will create other issues for these and possibly link them here as well.
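For illustration, here is a rough sketch of the kind of message published when submitting a file to an extractor queue. Field names mirror the snippets discussed in the review below; the queue name and the id field are illustrative, not necessarily what this PR uses.

import json
import pika

# Illustrative payload; field names mirror the ones used elsewhere in this PR.
body = {
    "host": "http://127.0.0.1:8000",
    "token": "<user token>",
    "id": "<file id>",        # hypothetical field for the resource being processed
    "resource_type": "file",
    "flags": "",
}

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_publish(
    exchange="",
    routing_key="ncsa.wordcount",  # the extractor's queue name, e.g. the wordcount extractor
    body=json.dumps(body),
)
connection.close()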

@tcnichol tcnichol requested review from lmarini and max-zilla August 7, 2022 18:00
This was linked to issues Aug 7, 2022
@tcnichol tcnichol linked an issue Aug 9, 2022 that may be closed by this pull request
external_services: List[str]
libraries: List[str] = []
bibtex: List[str]
maturity: str = "Development"
Member

what does maturity mean?

Contributor Author

@tcnichol tcnichol Aug 19, 2022

This was in the Extractor Model before I started working.

I'm not sure what it's used for and there isn't a corresponding field in extractor_info.json, but since it was already there I left it in.

Member

It's to identify extractors that are still under development versus those that are ready for production.

We had a discussion about this and decided to keep the fields in and review all of them together.

@@ -0,0 +1,92 @@
import asyncio
import json
from aio_pika import ExchangeType, connect
Member

if you haven't done so, you might want to add aio_pika to the Pipfile

Contributor Author

Added.


async def main() -> None:
    # Perform connection
    connection = await connect("amqp://guest:guest@localhost/")
Member

can we put this amqp://guest:guest@localhost/ into config.py?

Contributor Author

done
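For illustration, a minimal sketch of that change, with the URL moved into a config setting. The setting name RABBITMQ_URL and the import path are assumptions, not necessarily what the PR uses.

# config.py (hypothetical setting name)
RABBITMQ_URL = "amqp://guest:guest@localhost/"

# heartbeat listener
from aio_pika import connect
from app.config import RABBITMQ_URL  # import path is an assumption

async def main() -> None:
    # Perform connection using the configured URL instead of a hard-coded string
    connection = await connect(RABBITMQ_URL)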


async def main() -> None:
    # Perform connection
    connection = await connect("amqp://guest:guest@localhost/")
Member

is there any need to add credentials? I saw that our old code does that:

credentials = pika.PlainCredentials("guest", "guest")
parameters = pika.ConnectionParameters("localhost", credentials=credentials)
connection = pika.BlockingConnection(parameters)

Contributor Author

I do not think it is necessary with this library. The URL already contains the username and password, and this seems to work fine.

Contributor Author

I use the parameters now in both the (renamed) sync and async heartbeat listeners
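For illustration, this is roughly what the shared parameters could look like in the sync listener, assuming the values come from config (the setting names and import path are hypothetical):

import pika
from app.config import RABBITMQ_HOST, RABBITMQ_USER, RABBITMQ_PASS  # hypothetical setting names

# Build credentials and connection parameters from config instead of hard-coding them
credentials = pika.PlainCredentials(RABBITMQ_USER, RABBITMQ_PASS)
parameters = pika.ConnectionParameters(RABBITMQ_HOST, credentials=credentials)
connection = pika.BlockingConnection(parameters)
channel = connection.channel()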

@@ -0,0 +1,59 @@
import pika
Member

sorry, I might've missed 2 weeks of discussion. Is this meant for Clowder 2?
Any reason we are using aio_pika vs pika in different places?

Contributor Author

This is something I can change. Part of the issue I had was that I was trying to run this as a background process, and I think running it with plain pika caused issues.

If it's better, I can try to use pika again, especially since this will be running as a separate script outside of main.
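For comparison, a minimal sketch of what the heartbeat listener could look like with plain (blocking) pika, run as a standalone script. The exchange and queue names are illustrative, not necessarily the ones this PR uses.

import pika

def callback(ch, method, properties, body):
    # A real listener would parse the heartbeat and register or update the extractor.
    print(f"received heartbeat: {body!r}")

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="extractors", exchange_type="fanout")  # exchange name is an assumption
result = channel.queue_declare(queue="", exclusive=True)
channel.queue_bind(exchange="extractors", queue=result.method.queue)
channel.basic_consume(queue=result.method.queue, on_message_callback=callback, auto_ack=True)
channel.start_consuming()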

body['host'] = 'http://127.0.0.1:8000'
body['secretKey'] = 'secretKey'
body['resource_type'] = 'dataset'
body['flags'] = ""
Member

I'm not sure if we should spell out the secretKey and host here...

Contributor Author

I took that out and used token, to match what happens with files.
So far I haven't actually tested dataset extraction, so this may change again later.

some extractors use a.b.c etc to denote version
version parse replaced the float comparison
better logging - log if new extractor registered or updated
@tcnichol
Contributor Author

Made a few changes.

Extractor versions are now strings instead of floats, since many extractors use versions like 1.0.0. Versions are still compared when registering.
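For illustration, a minimal sketch of the string comparison using packaging's version parser (assuming that is the parser being used; the actual code in the PR may differ):

from packaging.version import parse

existing_version = "1.0.0"
incoming_version = "1.2.0"

# parse() returns Version objects that compare correctly,
# unlike the old float comparison, which cannot handle "1.2.0".
if parse(incoming_version) > parse(existing_version):
    print("registering newer version of extractor")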

There are now both a sync and an async listener. Both work, but I would consider the sync one the default and adequate for what we are doing. I wrote the async one because I was trying to get the heartbeat listener to run as a background thread in main, but if it is run as a separate service it should not matter.

I have not added docker deployment for this. That will be a later issue and pull request.

@tcnichol tcnichol marked this pull request as ready for review August 23, 2022 22:42
@tcnichol
Contributor Author

Also, for this to work, use the 50-clowder20-submit-file-to-extractor branch of pyclowder, and test using the wordcount extractor.


Contributor

@max-zilla max-zilla left a comment

I think this is mostly good for initial implementation, just a few changes.

author: str
contributors: List[str] = []
contexts: List[dict] = []
repository: Union[Repository, None] = None
Contributor

elsewhere we use Optional[Repository]
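For reference, the two forms are equivalent; a tiny sketch (Repository here is a stand-in for the model defined in this file):

from typing import Optional
from pydantic import BaseModel

class Repository(BaseModel):
    # stand-in for the Repository model defined in this file
    repository_url: str = ""

class ExtractorExample(BaseModel):
    # Optional[Repository] is shorthand for Union[Repository, None], so the two are equivalent
    repository: Optional[Repository] = None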

external_services: List[str]
libraries: List[str] = []
bibtex: List[str]
maturity: str = "Development"
Contributor

We should discuss in the meeting; we can probably remove some of these old fields.

{"_id": existing_extractor["_id"]}
)
extractor_out = ExtractorOut.from_mongo(found)
print(
Contributor

let's either remove these print() statements or use logger
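For illustration, the logger variant could look something like this (the message text is illustrative):

import logging

logger = logging.getLogger(__name__)

# illustrative replacement for a print() call when an existing extractor is updated
logger.info("Updated existing extractor: %s", "wordcount")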

user=Depends(keycloak_auth.get_current_user),
db: MongoClient = Depends(dependencies.get_db),
):
result = extractor_in.dict()
Contributor

unused

@max-zilla max-zilla self-requested a review September 26, 2022 20:08
@max-zilla max-zilla merged commit 1cb545e into main Sep 26, 2022
@max-zilla max-zilla deleted the register-extractor-submit-file branch September 26, 2022 20:35
Development

Successfully merging this pull request may close these issues.

submit file and dataset to extractor
manually register extractor
implement extractor heartbeat
