
refactor: metric attributes#81

Merged
begumcig merged 6 commits into main from refactor/metric-attributes
May 22, 2025
Conversation

Member

@begumcig begumcig commented Apr 28, 2025

Description

General evaluation refactoring for better user experience:

  • Atomic metrics: each metric returns a single float value as its result.
  • MetricResult: results are returned as MetricResult objects, which carry the metric name, metric metadata, and the result value.
  • Added class attributes such as higher_is_better and metric_units to metrics for clarity.
  • Refactored call types: any metric can now be called in single or pairwise mode, and unsupported call types are rejected.
  • Metric names in the metric class and in metric results now match.
  • Adapted the EvaluationAgent to this paradigm so that even though the metrics are atomic, inference computations are shared when possible.
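As a rough sketch of the atomic-metric shape described above (the class and attribute names follow this PR's description, but the compute signature and field names are illustrative, not the exact pruna API):

```python
from dataclasses import dataclass, field


@dataclass
class MetricResult:
    """Carries the metric name, its metadata, and the single float result."""
    name: str
    params: dict = field(default_factory=dict)
    result: float = 0.0


class TotalTimeMetric:
    # Class attributes of the kind this PR adds for clarity.
    metric_name = "total_time"
    higher_is_better = False
    metric_units = "ms"

    def compute(self, elapsed_times):
        # Atomic: exactly one float per metric.
        value = float(sum(elapsed_times))
        return MetricResult(self.metric_name, {"units": self.metric_units}, value)


res = TotalTimeMetric().compute([12.5, 11.25, 12.25])
print(res.name, res.result)  # total_time 36.0
```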

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

@begumcig begumcig changed the title refactor: metric-attributes refactor: metric attributes Apr 28, 2025
@begumcig begumcig force-pushed the refactor/metric-attributes branch 13 times, most recently from 6ddb360 to 81fcd3f Compare May 2, 2025 15:07
@begumcig begumcig marked this pull request as ready for review May 2, 2025 15:19
Member

@johannaSommer johannaSommer left a comment

That's really amazing work you did Begüm! Let's iterate one more time, I have a few questions, but I don't think it will take long then we finally have this refactooooor 🚀

Comment thread docs/user_manual/evaluation.rst Outdated
Member

Can you elaborate a bit on the extent to which users can still use the previous metrics / naming? I'm a bit worried that we're changing too much and will break existing setups without prior deprecation warnings.

Member

Okay, after looking further, I definitely see that you already took this into account. It's a lot, which makes it difficult to follow; can you briefly summarize again what will still be possible and what we simply can't support anymore?

Member Author

Thanks for the great question! I summarized the metric refactor in the table below. In the first column, you can see the old metric class and string identifier. These are now replaced by new internal parent classes (second column), which are primarily used for inference-time grouping and shared computation. The third and fourth columns show the new atomic metric classes and their string identifiers, which are what users are expected to use going forward.

If someone tries to instantiate an old metric class, they'll now get an instance of the corresponding new internal parent class instead. This won't break anything! The new internal parents still expose the .compute() method and return a dictionary, just like the originals.

When a task is initialized using old class names or string identifiers, we also still support this, but issue a deprecation warning. Internally, the Evaluation Agent maps these to their respective new atomic metrics and groups them in the new internal parent classes only for inference efficiency.

| Old Metric Class & Identifier | New Internal Parent Class | New Metric Classes | New String Identifiers |
| --- | --- | --- | --- |
| GPUMemoryMetric ("gpu_memory") | GPUMemoryStats | DiskMemoryMetric, InferenceMemoryMetric, TrainingMemoryMetric | "disk_memory", "inference_memory", "training_memory" |
| ElapsedTimeMetric ("elapsed_time") | InferenceTimeStats | TotalTimeMetric, LatencyMetric, ThroughputMetric | "total_time", "latency", "throughput" |
| ModelArchitectureMetric ("model_architecture") | ModelArchitectureStats | TotalParamsMetric, TotalMacsMetric | "total_params", "total_macs" |
| EnergyMetric ("energy") | EnvironmentalImpactStats | CO2ConsumedMetric, EnergyConsumedMetric | "co2_emissions", "energy_consumed" |
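The backward-compatibility path described here could be sketched roughly like this (the mapping mirrors the table in this comment, but the dict, function name, and warning text are illustrative stand-ins, not the actual pruna internals):

```python
import warnings

# Illustrative mapping from deprecated string identifiers to the new
# atomic metric identifiers they expand into.
DEPRECATED_IDENTIFIERS = {
    "gpu_memory": ["disk_memory", "inference_memory", "training_memory"],
    "elapsed_time": ["total_time", "latency", "throughput"],
    "model_architecture": ["total_params", "total_macs"],
    "energy": ["co2_emissions", "energy_consumed"],
}


def resolve_metric_names(requested):
    """Expand deprecated identifiers into their atomic replacements."""
    resolved = []
    for name in requested:
        if name in DEPRECATED_IDENTIFIERS:
            warnings.warn(
                f"'{name}' is deprecated; use {DEPRECATED_IDENTIFIERS[name]} instead.",
                DeprecationWarning,
            )
            resolved.extend(DEPRECATED_IDENTIFIERS[name])
        else:
            resolved.append(name)
    return resolved


print(resolve_metric_names(["elapsed_time", "total_params"]))
# ['total_time', 'latency', 'throughput', 'total_params']
```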

Member

why was i even asking, of course you took care of everything perfectly duh

Comment thread src/pruna/evaluation/metrics/metric_cmmd.py Outdated
Member

I definitely see what you are trying to do here with the structure, but wouldn't this lead to double computations if I am requesting both throughput and latency, for example?

Member Author

@begumcig begumcig May 6, 2025

That was also my nightmare scenario!! I tried to handle this by sharing inference for the new metrics that internally share computation logic. If you check out https://github.com/PrunaAI/pruna/blob/81fcd3f4059accd3896f561696a0f4a12d6405ea/src/pruna/evaluation/evaluation_agent.py#L215

I tried to share inference for stateless metrics whenever possible 🥹
So if you're using the EvaluationAgent, you're doing all your computations efficiently.
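A minimal sketch of the grouping idea behind the linked evaluation_agent.py code: bucket metrics by a shared parent class so each inference pass runs only once. The parent class names mirror this PR's table, but the `runtime_group` attribute and the helper are hypothetical illustrations, not the real agent logic:

```python
from collections import defaultdict


# Toy stand-ins for the internal parent classes that group metrics
# sharing the same inference pass.
class InferenceTimeStats: ...
class GPUMemoryStats: ...


class LatencyMetric:
    runtime_group = InferenceTimeStats

class ThroughputMetric:
    runtime_group = InferenceTimeStats

class InferenceMemoryMetric:
    runtime_group = GPUMemoryStats


def group_by_shared_inference(metrics):
    """Bucket metrics so each shared inference pass executes only once."""
    groups = defaultdict(list)
    for metric in metrics:
        groups[metric.runtime_group].append(metric)
    return groups


groups = group_by_shared_inference(
    [LatencyMetric(), ThroughputMetric(), InferenceMemoryMetric()]
)
# Latency and throughput land in the same bucket, so the timed
# inference loop would run once for both.
print(len(groups[InferenceTimeStats]))  # 2
```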


For code understanding, it would be good to add this discussion as a comment (that the execution of the metrics via the agent will avoid redundancy)

Member Author

Added to the Evaluation Agent under the computation of stateless metrics


I think it might be good/better to also add it to the metric, since one stumbles upon this when reading this file.

Member

makes sense, thanks Begüm!

Comment thread src/pruna/evaluation/metrics/metric_energy.py Outdated
Comment thread src/pruna/evaluation/metrics/result.py Outdated
Member

out of curiosity, do we expect more functions to be added here?

Member

I am overall wondering if this metric result might be very similar to the individual metric types that you introduced. What is the conceptual difference between a metric result and a metric? I am asking because I think attributes like higher-is-better and benchmark-metric could also be used here.
To be clear, I am not disagreeing with the structure, I am just wondering if this needs a separate data class?

Member Author

That’s a really good point, and honestly, I wasn’t 100% sure myself at first whether this warranted a separate data class. I talked a bit with Gabriel about it, especially since these results will eventually be logged, and we landed on dataclass as a middle ground between using a plain dictionary and bringing in something heavier like pydantic.

The main reason I kept MetricResult as a separate class is that even in its current simplicity (since the metrics themselves became more atomic, MetricResult grew progressively simpler as this refactoring progressed), it's still useful to have a structured place for validation. The small assertions I added here already helped catch a couple of subtle bugs during development, and I imagine we'll want to add more checks or logging-related logic in the future.

That said, you're right that it overlaps conceptually with some of the attributes in the metrics themselves (like higher_is_better); we can rethink the separation if you have a suggestion!!
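The validation role described here can be as small as a `__post_init__` check on a dataclass. A minimal sketch, assuming illustrative field names and checks (the real MetricResult fields may differ):

```python
from dataclasses import dataclass, field


@dataclass
class MetricResult:
    name: str
    params: dict = field(default_factory=dict)
    result: float = 0.0

    def __post_init__(self):
        # Cheap structural checks like these are the kind of assertions
        # that can catch subtle bugs early in development.
        if not self.name:
            raise ValueError("metric name must be non-empty")
        if isinstance(self.result, bool) or not isinstance(self.result, (int, float)):
            raise TypeError(f"{self.name}: result must be a single number")


ok = MetricResult("latency", {"units": "ms"}, 4.2)
print(ok.result)  # 4.2
```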


I personally think that a separate class is a good solution here. I could actually see additionally adding higher_is_better here (it adds a bit of redundancy but would make it much simpler in downstream tasks to link the result to the property of the metric). As an alternative to duplicating some attributes, one could add a pointer to the metric class the result is based on (and then the user can get higher_is_better via that class).

One should potentially also discuss whether the pointer is nicer in general (in contrast to the string you currently use in the metric result to identify which metric was used).

Member

I thought about it again and I agree with you both now, let's keep them as separate classes... In the future we might want different normalizations or scalings of the same metric, where this class would come in handy, e.g. for storing units (latency in ms or s) or normalizing some score to 0–100.
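The units/scaling idea mentioned above could later be as simple as a lookup-based converter on the result side. A purely illustrative sketch, not part of this PR:

```python
# Illustrative scale table between hypothetical unit pairs.
UNIT_SCALE = {("ms", "s"): 1e-3, ("s", "ms"): 1e3}


def convert(value, from_unit, to_unit):
    """Rescale a latency-style result between units, e.g. ms vs s."""
    if from_unit == to_unit:
        return value
    return value * UNIT_SCALE[(from_unit, to_unit)]


print(convert(250.0, "ms", "s"))  # 0.25
```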

@begumcig begumcig requested a review from johannaSommer May 7, 2025 15:00
@guennemann guennemann left a comment

Thanks for the nice refactoring. You can find detailed comments inline.

Comment thread src/pruna/evaluation/metrics/result.py Outdated

Comment thread src/pruna/evaluation/evaluation_agent.py Outdated
Comment thread src/pruna/evaluation/metrics/utils.py Outdated
Comment thread src/pruna/evaluation/evaluation_agent.py Outdated
Comment thread src/pruna/evaluation/metrics/metric_base.py Outdated
Comment thread src/pruna/evaluation/metrics/metric_elapsed_time.py Outdated

Comment thread src/pruna/evaluation/metrics/metric_energy.py Outdated

I am now wondering why we return a dict here and not a list of metric results, for consistency. The downside would be that the children would no longer be able to access "their own" result easily/efficiently (since we lose the hash map). So probably we keep it in the current form.

Member Author

@begumcig begumcig May 14, 2025

Yes exactly! Also, for people who use the metric classes directly rather than via the Agent, I think it's nice that they still get their results in dictionary form. We can deprecate it in the future, but I didn't want to change how the "legacy" class returns results too much.
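A toy version of why the dict return is convenient for the legacy parent: each child metric's value stays addressable by name. The class name mirrors this PR's table, but the method signature and numbers are illustrative:

```python
class InferenceTimeStats:
    """Legacy-style parent that computes all time metrics in one pass."""

    child_names = ["total_time", "latency", "throughput"]

    def compute(self, elapsed_ms, n_batches, batch_size):
        total = float(sum(elapsed_ms))
        latency = total / n_batches
        throughput = (n_batches * batch_size) / total
        # Dict form preserved so each child (or a direct caller) can
        # look up "its own" value without scanning a list.
        return {"total_time": total, "latency": latency, "throughput": throughput}


results = InferenceTimeStats().compute([100.0, 100.0], n_batches=2, batch_size=8)
print(results["latency"])  # 100.0
```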

Comment thread src/pruna/evaluation/task.py Outdated
@begumcig begumcig force-pushed the refactor/metric-attributes branch 4 times, most recently from c675bb7 to b8c5ebc Compare May 14, 2025 14:44
@begumcig begumcig requested a review from guennemann May 14, 2025 14:50
@begumcig begumcig force-pushed the refactor/metric-attributes branch 3 times, most recently from dbe85a9 to e41c2a0 Compare May 14, 2025 16:33
Comment thread docs/user_manual/evaluate.rst Outdated
Comment thread docs/user_manual/evaluate.rst Outdated
Member

why hidden code?

Member Author

I just wanted to show an example of what a MetricResult looks like, didn't want to give the impression that users need to import it to use it when metrics return it :)

Comment thread docs/user_manual/evaluate.rst Outdated
Comment thread docs/user_manual/evaluate.rst Outdated
Comment thread docs/user_manual/evaluate.rst Outdated
Member

should we add something like MS_NUM_ITER?

Member Author

I'm only using this unit in this specific metric, so I wasn't sure it still makes sense to define a global constant for it. I went through all the metrics, and for the ones where I reuse the units across multiple places, I did create global strings. What do you think? Would you prefer we keep all units as globals for consistency, or only when they're reused?

Member

same usage here

Comment on lines 313 to 329
Member

When I deprecate something, I normally add the specific version in which it will be removed, plus a test that uses the deprecated class and will itself be removed in that version. @johannaSommer I also noticed you recently deprecated something, so perhaps it would be good to start doing this to make it easier to maintain.

Member

thanks a lot for this feedback, yes, let's definitely do this! I will also take this into account

Member Author

Sounds great! I added the version at which we stop supporting them to the warning message, and added tests for the deprecated classes.
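The suggested pattern can be sketched as a deprecation warning that names its removal version, paired with a test that is itself scheduled for deletion at that version. The version number and the alias usage below are hypothetical:

```python
import warnings

REMOVAL_VERSION = "0.3.0"  # hypothetical version at which support ends


class ElapsedTimeMetric:
    """Deprecated alias; warns on construction with the removal version."""

    def __init__(self):
        warnings.warn(
            f"ElapsedTimeMetric is deprecated and will be removed in "
            f"version {REMOVAL_VERSION}; use the atomic time metrics instead.",
            DeprecationWarning,
            stacklevel=2,
        )


# Regression test pinned to the removal version, in the spirit of the
# suggestion above; delete this test when the alias itself is removed.
def test_elapsed_time_metric_warns():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        ElapsedTimeMetric()
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
    assert any(REMOVAL_VERSION in str(w.message) for w in caught)


test_elapsed_time_metric_warns()
```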

Comment thread src/pruna/evaluation/metrics/metric_memory.py Outdated
Comment thread src/pruna/evaluation/metrics/metric_torch.py Outdated
Member

@johannaSommer johannaSommer left a comment

you're a star 🤩

@begumcig begumcig force-pushed the refactor/metric-attributes branch from e41c2a0 to b9564e7 Compare May 15, 2025 14:23
@begumcig begumcig force-pushed the refactor/metric-attributes branch 3 times, most recently from 3d01c3b to b60092f Compare May 16, 2025 09:35
@guennemann guennemann left a comment

Very nice work! See comments for some very minor points.

Comment thread src/pruna/evaluation/metrics/utils.py Outdated
Comment thread src/pruna/evaluation/task.py Outdated
Comment thread src/pruna/evaluation/task.py Outdated
@begumcig begumcig force-pushed the refactor/metric-attributes branch from b60092f to a0ec779 Compare May 22, 2025 16:02
@begumcig begumcig merged commit d79f4b7 into main May 22, 2025
6 checks passed