johannaSommer left a comment:
That's really amazing work you did Begüm! Let's iterate one more time; I have a few questions, but I don't think it will take long, and then we finally have this refactooooor 🚀
Can you elaborate a bit on the extent to which users can still use the previous metrics / naming? I am a bit worried that we are changing too much and will break existing setups without prior deprecation warnings.
Okay, after looking further, I definitely see that you already took this into account. It's a lot, which makes it difficult to understand; can you briefly summarize again for me what will still be possible and what we simply cannot support anymore?
Thanks for the great question! I summarized the metric refactor in the table below. In the first column, you can see the old metric class and string identifier. These are now replaced by new internal parent classes (second column), which are primarily used for inference-time grouping and shared computation. The third and fourth columns show the new atomic metric classes and their string identifiers, which are what users are expected to use going forward.
If someone tries to instantiate an old metric class, they'll now get an instance of the corresponding new internal parent class instead. This won't break anything: the new internal parents still expose the .compute() method and return a dictionary, just like the originals.
We also still support initializing a task with the old class names or string identifiers, but issue a deprecation warning. Internally, the Evaluation Agent maps these to their respective new atomic metrics and groups them in the new internal parent classes purely for inference efficiency.
| Old Metric Class & Identifier | New Internal Parent Class | New Metric Classes | New String Identifiers |
|---|---|---|---|
| GPUMemoryMetric "gpu_memory" | GPUMemoryStats | DiskMemoryMetric | "disk_memory" |
| | | InferenceMemoryMetric | "inference_memory" |
| | | TrainingMemoryMetric | "training_memory" |
| ElapsedTimeMetric "elapsed_time" | InferenceTimeStats | TotalTimeMetric | "total_time" |
| | | LatencyMetric | "latency" |
| | | ThroughputMetric | "throughput" |
| ModelArchitectureMetric "model_architecture" | ModelArchitectureStats | TotalParamsMetric | "total_params" |
| | | TotalMacsMetric | "total_macs" |
| EnergyMetric "energy" | EnvironmentalImpactStats | CO2ConsumedMetric | "co2_emissions" |
| | | EnergyConsumedMetric | "energy_consumed" |
Why was I even asking, of course you took care of everything perfectly duh
I definitely see what you are trying to do here with the structure, but wouldn't this lead to double computations if I am requesting both throughput and latency, for example?
That was also my nightmare scenario!! I tried to handle this by sharing inference across the new metrics that internally share computation logic. If you check out https://github.com/PrunaAI/pruna/blob/81fcd3f4059accd3896f561696a0f4a12d6405ea/src/pruna/evaluation/evaluation_agent.py#L215, you can see that I share inference for stateless metrics whenever possible 🥹
So if you're using the EvaluationAgent, all your computations are done efficiently.
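The sharing described here can be sketched roughly as follows; the class and function names are made up for illustration and do not match the actual EvaluationAgent code:

```python
from collections import defaultdict


class StatelessMetric:
    """Toy stand-in for a stateless metric that derives its value from shared outputs."""

    def __init__(self, name, call_type):
        self.name = name
        self.call_type = call_type  # e.g. "single" or "pairwise"

    def compute_from_outputs(self, outputs):
        # Each metric computes its own value from the shared inference outputs.
        return len(outputs)


def evaluate_stateless(metrics, run_inference):
    """Group stateless metrics by the inference pass they need and run it once per group."""
    groups = defaultdict(list)
    for metric in metrics:
        groups[metric.call_type].append(metric)

    results = {}
    for call_type, group in groups.items():
        outputs = run_inference(call_type)  # one pass per group, not one per metric
        for metric in group:
            results[metric.name] = metric.compute_from_outputs(outputs)
    return results
```

With this grouping, requesting both throughput and latency triggers a single inference pass that both metrics consume.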
For code understanding, it would be good to add this discussion as a comment (that the execution of the metrics via the agent will avoid redundancy)
Added to the Evaluation Agent under the computation of stateless metrics
I think it might be good/better to also add it to the metric, since one stumbles upon this when reading this file.
makes sense, thanks Begüm!
out of curiosity, do we expect more functions to be added here?
I am overall wondering whether this metric result might be very similar to the individual metric types that you introduced. What is the conceptual difference between a metric result and a metric? I am asking because I think attributes like higher_is_better and benchmark_metric could also be used here.
To be clear, I am not disagreeing with the structure, I am just wondering whether this needs a separate data class.
That’s a really good point, and honestly, I wasn’t 100% sure myself at first whether this warranted a separate data class. I talked a bit with Gabriel about it, especially since these results will eventually be logged, and we landed on dataclass as a middle ground between using a plain dictionary and bringing in something heavier like pydantic.
The main reason I kept MetricResult as a separate class is that even with its simplicity now (since the metrics themselves became more atomic, the MetricResult evolved into something simpler and simpler as this refactoring progressed), it’s still useful to have a structured place for validation. The small assertions I added here already helped catch a couple of subtle bugs during development, and I imagine we’ll want to add more checks or logging-related logic in the future.
That said, you're right, it does overlap conceptually with some of the attributes in the metrics themselves (like higher_is_better), we can rethink the separation if you have a suggestion!!
I personally think that a separate class is a good solution here. I could actually see additionally adding higher_is_better here (it adds a bit of redundancy, but it would make it much simpler in downstream tasks to link the result to the property of the metric). As an alternative to duplicating some attributes, one could add a pointer to the metric class the result is based on (and then the user can get higher_is_better via the class).
One should potentially also discuss whether the pointer is nicer in general (in contrast to the string you are currently using in the metric result to identify which metric was used).
I thought about it again and I agree with you both now, let's keep them as separate classes... In the future we might want different normalizations or scalings of the same metric, where this class would come in handy, e.g. for storing units (latency in ms or s) or for normalizing some score to 0-100.
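A hedged sketch of what such a result dataclass could look like, including the small validation assertions discussed above and a unit field for the future use cases mentioned; the field names and checks are illustrative, not the actual pruna MetricResult definition:

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Structured container for a single atomic metric's result (illustrative)."""

    name: str                      # string identifier of the metric, e.g. "latency"
    value: float
    unit: str = ""                 # room for future unit handling, e.g. "ms" vs "s"
    higher_is_better: bool = False # duplicated from the metric for downstream convenience

    def __post_init__(self):
        # Small validation assertions like these can catch subtle bugs early.
        assert isinstance(self.name, str) and self.name, "metric name must be a non-empty string"
        assert isinstance(self.value, (int, float)), "metric value must be numeric"
```

A dataclass keeps this lighter than pydantic while still giving a structured place for validation and, later, logging-related logic.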
guennemann left a comment:
Thanks for the nice refactoring. You can find detailed comments inline.
I am now wondering why we return a dict here and not also a list of metric results, for consistency. The downside would be that the children would not be able to access "their own" result easily/efficiently (since we lose the hash map). So we should probably keep it in the current form.
Yes, exactly! Also, for the people who are using the metric classes directly and not via the Agent, I think it's nice that they still get their results in dictionary form. We can deprecate it in the future, but I didn't want to change how we return results in the "legacy" class too much.
I just wanted to show an example of what a MetricResult looks like, didn't want to give the impression that users need to import it to use it when metrics return it :)
should we add something like MS_NUM_ITER?
I'm only using this unit in this specific metric, so I wasn't sure whether it still makes sense to define a global constant for it. I went through all the metrics, and for the ones where units are reused across multiple places, I did create global strings. What do you think? Would you prefer we keep all units as globals for consistency, or only when they're reused?
When I deprecate, I normally add the specific version in which the feature is going to be removed, and I add a test that uses the deprecated class but will be removed in that version. @johannaSommer I also noticed you recently deprecated something, so perhaps it would be good to start doing this to make it easier to maintain.
thanks a lot for this feedback, yes, let's definitely do this! I will also take this into account
Sounds great! Added the version in which we stop supporting it to the warning message, and added tests for the deprecated classes.
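The deprecation pattern described here might look like the following; the metric name, message, and removal version are placeholders rather than real pruna values:

```python
import warnings

# Hypothetical version at which the deprecated path is scheduled for removal.
REMOVAL_VERSION = "0.3.0"


def deprecated_elapsed_time_metric():
    """Deprecated entry point whose warning names the exact removal version."""
    warnings.warn(
        f"'elapsed_time' is deprecated and will be removed in v{REMOVAL_VERSION}; "
        "use 'total_time', 'latency' or 'throughput' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
```

A dedicated test that exercises this deprecated path (and is itself scheduled for deletion at REMOVAL_VERSION) makes it hard to forget the cleanup.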
Description
General evaluation refactoring for better user experience:
- Metrics now expose higher_is_better and metric_units attributes.
- Metrics are declared as single or pairwise, and cannot be called with non-supported call types.