johannaSommer left a comment:
That's really amazing work you did Begüm! Let's iterate one more time; I have a few questions, but I don't think it will take long, and then we finally have this refactooooor 🚀
Can you elaborate a bit on the extent to which users can still use the previous metrics / naming? I am a bit worried that we are changing too much and will break existing setups without prior deprecation warnings.
Okay, after looking further, I definitely see that you already took this into account. It's a lot, which makes it difficult to understand; can you briefly summarize again for me what will still be possible and what we simply cannot support anymore?
Thanks for the great question! I summarized the metric refactor in the table below. In the first column, you can see the old metric class and string identifier. These are now replaced by new internal parent classes (second column), which are primarily used for inference-time grouping and shared computation. The third and fourth columns show the new atomic metric classes and their string identifiers, which are what users are expected to use going forward.
If someone tries to instantiate an old metric class, they'll now get an instance of the corresponding new internal parent class instead. This won't break anything: the new internal parents still expose the .compute() method and return a dictionary, just like the originals.
We also still support initializing a task with the old class names or string identifiers, but issue a deprecation warning. Internally, the Evaluation Agent maps these to their respective new atomic metrics and groups them in the new internal parent classes purely for inference efficiency.
| Old Metric Class & Identifier | New Internal Parent Class | New Metric Classes | New String Identifiers |
|---|---|---|---|
| GPUMemoryMetric "gpu_memory" | GPUMemoryStats | DiskMemoryMetric | "disk_memory" |
| | | InferenceMemoryMetric | "inference_memory" |
| | | TrainingMemoryMetric | "training_memory" |
| ElapsedTimeMetric "elapsed_time" | InferenceTimeStats | TotalTimeMetric | "total_time" |
| | | LatencyMetric | "latency" |
| | | ThroughputMetric | "throughput" |
| ModelArchitectureMetric "model_architecture" | ModelArchitectureStats | TotalParamsMetric | "total_params" |
| | | TotalMacsMetric | "total_macs" |
| EnergyMetric "energy" | EnvironmentalImpactStats | CO2ConsumedMetric | "co2_emissions" |
| | | EnergyConsumedMetric | "energy_consumed" |
Why was I even asking, of course you took care of everything perfectly duh
I definitely see what you are trying to do here with the structure, but wouldn't this lead to double computations if I am requesting both throughput and latency, for example?
That was also my nightmare scenario!! I tried to handle this by sharing inference across the new metrics that internally share computation logic. If you check out https://github.com/PrunaAI/pruna/blob/81fcd3f4059accd3896f561696a0f4a12d6405ea/src/pruna/evaluation/evaluation_agent.py#L215, you can see that I share inference for stateless metrics whenever possible 🥹
So if you're using the EvaluationAgent, all your computations are done efficiently.
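The sharing described here can be sketched roughly as follows; the class and function names are made up for illustration and do not match the actual EvaluationAgent code:

```python
from collections import defaultdict


class StatelessMetric:
    """Toy stand-in for a stateless metric that derives its value from shared outputs."""

    def __init__(self, name, call_type):
        self.name = name
        self.call_type = call_type  # e.g. "single" or "pairwise"

    def compute_from_outputs(self, outputs):
        # Each metric computes its own value from the shared inference outputs.
        return len(outputs)


def evaluate_stateless(metrics, run_inference):
    """Group stateless metrics by the inference pass they need and run it once per group."""
    groups = defaultdict(list)
    for metric in metrics:
        groups[metric.call_type].append(metric)

    results = {}
    for call_type, group in groups.items():
        outputs = run_inference(call_type)  # one pass per group, not one per metric
        for metric in group:
            results[metric.name] = metric.compute_from_outputs(outputs)
    return results
```

With this grouping, requesting both throughput and latency triggers a single inference pass that both metrics consume.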
For code understanding, it would be good to add this discussion as a comment (that the execution of the metrics via the agent will avoid redundancy)
Added to the Evaluation Agent under the computation of stateless metrics
I think it might be good/better to also add it to the metric, since one stumbles upon this when reading this file.
makes sense, thanks Begüm!
out of curiosity, do we expect more functions to be added here?
I am overall wondering whether this metric result might be very similar to the individual metric types that you introduced. What is the conceptual difference between a metric result and a metric? I am asking because I think attributes like higher_is_better and benchmark_metric could also be used here.
To be clear, I am not disagreeing with the structure, I am just wondering whether this needs a separate data class.
That’s a really good point, and honestly, I wasn’t 100% sure myself at first whether this warranted a separate data class. I talked a bit with Gabriel about it, especially since these results will eventually be logged, and we landed on dataclass as a middle ground between using a plain dictionary and bringing in something heavier like pydantic.
The main reason I kept MetricResult as a separate class is that even with its simplicity now (since the metrics themselves became more atomic, the MetricResult evolved into something simpler and simpler as this refactoring progressed), it’s still useful to have a structured place for validation. The small assertions I added here already helped catch a couple of subtle bugs during development, and I imagine we’ll want to add more checks or logging-related logic in the future.
That said, you're right, it does overlap conceptually with some of the attributes in the metrics themselves (like higher_is_better), we can rethink the separation if you have a suggestion!!
I personally think that a separate class is a good solution here. I could actually see additionally adding higher_is_better here (it adds a bit of redundancy, but it would make it much simpler in downstream tasks to link the result to the property of the metric). As an alternative to duplicating some attributes, one could add a pointer to the metric class the result is based on (and then the user can get higher_is_better via the class).
One should potentially also discuss whether the pointer is nicer in general (in contrast to the string you are currently using in the metric result to identify which metric was used).
I thought about it again and I agree with you both now, let's keep them as separate classes... In the future we might want different normalizations or scalings of the same metric, where this class would come in handy, e.g. for storing units (latency in ms or s) or for normalizing some score to 0-100.
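A hedged sketch of what such a result dataclass could look like, including the small validation assertions discussed above and a unit field for the future use cases mentioned; the field names and checks are illustrative, not the actual pruna MetricResult definition:

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Structured container for a single atomic metric's result (illustrative)."""

    name: str                      # string identifier of the metric, e.g. "latency"
    value: float
    unit: str = ""                 # room for future unit handling, e.g. "ms" vs "s"
    higher_is_better: bool = False # duplicated from the metric for downstream convenience

    def __post_init__(self):
        # Small validation assertions like these can catch subtle bugs early.
        assert isinstance(self.name, str) and self.name, "metric name must be a non-empty string"
        assert isinstance(self.value, (int, float)), "metric value must be numeric"
```

A dataclass keeps this lighter than pydantic while still giving a structured place for validation and, later, logging-related logic.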
guennemann left a comment:
Thanks for the nice refactoring. You can find detailed comments inline.
I am now wondering why we return a dict here and not also a list of metric results, for consistency. The downside would be that the children would not be able to access "their own" result easily/efficiently (since we lose the hash map). So we should probably keep it in the current form.
Yes, exactly! Also, for the people who are using the metric classes directly and not via the Agent, I think it's nice that they still get their results in dictionary form. We can deprecate it in the future, but I didn't want to change how we return results in the "legacy" class too much.
I just wanted to show an example of what a MetricResult looks like, didn't want to give the impression that users need to import it to use it when metrics return it :)
should we add something like MS_NUM_ITER?
I'm only using this unit in this specific metric, so I wasn't sure whether it still makes sense to define a global constant for it. I went through all the metrics, and for the ones where units are reused across multiple places, I did create global strings. What do you think? Would you prefer we keep all units as globals for consistency, or only when they're reused?
When I deprecate, I normally add the specific version in which the feature is going to be removed, and I add a test that uses the deprecated class but will be removed in that version. @johannaSommer I also noticed you recently deprecated something, so perhaps it would be good to start doing this to make it easier to maintain.
thanks a lot for this feedback, yes, let's definitely do this! I will also take this into account
Sounds great! Added the version in which we stop supporting it to the warning message, and added tests for the deprecated classes.
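The deprecation pattern described here might look like the following; the metric name, message, and removal version are placeholders rather than real pruna values:

```python
import warnings

# Hypothetical version at which the deprecated path is scheduled for removal.
REMOVAL_VERSION = "0.3.0"


def deprecated_elapsed_time_metric():
    """Deprecated entry point whose warning names the exact removal version."""
    warnings.warn(
        f"'elapsed_time' is deprecated and will be removed in v{REMOVAL_VERSION}; "
        "use 'total_time', 'latency' or 'throughput' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
```

A dedicated test that exercises this deprecated path (and is itself scheduled for deletion at REMOVAL_VERSION) makes it hard to forget the cleanup.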
Description
General evaluation refactoring for better user experience:
- Metrics now expose higher_is_better and metric_units attributes.
- Metrics are declared as single or pairwise, and cannot be called with non-supported call types.