
Enable scheduler and higher starting LR to avoid plateau#53

Draft
michaelmckinsey1 wants to merge 4 commits into LBANN:main from michaelmckinsey1:new-scheduler-and-lr

Conversation


@michaelmckinsey1 michaelmckinsey1 commented Apr 17, 2026

Use the ExponentialLR scheduler with a higher starting LR to avoid getting stuck at val_dice_score=0.5. This reduces the number of epochs required by roughly 2x. The main branch currently uses a default config with a constant LR.

  • Added starting_learning_rate, along with gamma and min_learning_rate.
  • Exclude the background class (class 0) from the Dice calculation, so val_dice_score is not inflated by the background class.
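The scheduler change described above can be sketched as follows. The config keys (starting_learning_rate, gamma, min_learning_rate) come from this PR, but the numeric values and the way the floor is enforced here are illustrative assumptions, not the PR's actual wiring:

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

# Illustrative values; the real ones live in the PR's config.
starting_learning_rate = 1e-3
gamma = 0.99
min_learning_rate = 1e-5

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=starting_learning_rate)
scheduler = ExponentialLR(optimizer, gamma=gamma)

for epoch in range(3):
    # ... forward/backward pass would go here ...
    optimizer.step()
    # Decay the LR each epoch, but never below the configured floor.
    if optimizer.param_groups[0]["lr"] > min_learning_rate:
        scheduler.step()
```

Starting higher and decaying geometrically lets early epochs escape the 0.5 plateau while later epochs still converge at a small LR.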

main vs PR#53:

1 node:

  • scale6 - 329 epochs vs 106 epochs (3.1x)
  • scale7 - 343 epochs vs 208 epochs (1.6x)
  • scale8 - 406 epochs vs 296 epochs (1.4x)

2 nodes:

  • scale6 - 616 epochs vs 240 epochs (2.6x)

@michaelmckinsey1 michaelmckinsey1 self-assigned this Apr 17, 2026
dice_score_probs = compute_sharded_dice(
    mask_pred_probs, mask_true_onehot, spatial_mesh
)
dice_loss_curr = 1.0 - dice_score_probs.mean()
This was background + foreground; it is now foreground only.
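The foreground-only Dice change can be sketched without the sharding. compute_sharded_dice in the PR distributes the reduction over a spatial mesh; this simplified, single-device version (a hypothetical helper, not the PR's code) only shows the class-0 exclusion:

```python
import torch

def dice_score_foreground(probs: torch.Tensor, onehot: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Per-class Dice over (N, C, ...) tensors, excluding background class 0."""
    # Drop channel 0 so the score reflects foreground classes only.
    probs = probs[:, 1:]
    onehot = onehot[:, 1:]
    dims = (0, *range(2, probs.dim()))  # reduce over batch + spatial dims
    intersection = (probs * onehot).sum(dim=dims)
    union = probs.sum(dim=dims) + onehot.sum(dim=dims)
    return (2.0 * intersection + eps) / (union + eps)

# Example: 2 classes (background + 1 foreground), perfect prediction.
onehot = torch.zeros(1, 2, 4, 4)
onehot[:, 1, :2] = 1.0            # top half is foreground
onehot[:, 0] = 1.0 - onehot[:, 1]
score = dice_score_foreground(onehot.clone(), onehot)
dice_loss = 1.0 - score.mean()    # mean is over foreground classes only
```

Without the exclusion, a model that predicts all-background already scores well on class 0, which is what inflated val_dice_score toward 0.5.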

@michaelmckinsey1 michaelmckinsey1 changed the title Enable scheduler and higher starting LR to avoid plateau at 0.5 val_d… Enable scheduler and higher starting LR to avoid plateau Apr 17, 2026
    self.scale_reference_starting_learning_rate
    * self.scale_learning_rate_factor
    ** (self.problem_scale - self.scale_reference)
)
Noticed instability at higher problem scales that is fixed by lowering the LR
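The scaling rule implied by the quoted snippet can be sketched as below. The attribute names mirror the diff; the wrapping class and the example numbers are assumptions for illustration:

```python
class LRScalePolicy:
    """Sketch of a scale-dependent starting LR (illustrative, not the PR's class)."""

    def __init__(self, reference_lr: float, factor: float, scale_reference: int):
        self.scale_reference_starting_learning_rate = reference_lr
        self.scale_learning_rate_factor = factor
        self.scale_reference = scale_reference

    def starting_lr(self, problem_scale: int) -> float:
        # With factor < 1, each step up in problem scale shrinks the
        # starting LR geometrically, taming the instability observed
        # at higher scales.
        return (
            self.scale_reference_starting_learning_rate
            * self.scale_learning_rate_factor
            ** (problem_scale - self.scale_reference)
        )

policy = LRScalePolicy(reference_lr=1e-3, factor=0.5, scale_reference=6)
lr_scale8 = policy.starting_lr(8)  # two scales above reference: LR quartered
```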

Noticed this is correlated with turning on AMP

no AMP [image] vs AMP [image]


We should probably omit this logic and avoid using AMP entirely.
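Gating AMP behind a single flag makes "avoid using AMP" a one-line change. This is a minimal sketch under assumed names; the flag and the loop are illustrative, not the PR's code:

```python
import torch

# The comments above suggest the instability tracks AMP, so the simplest
# fix is to keep this flag off.
use_amp = False

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x = torch.randn(4, 8)
target = torch.randn(4, 1)

# autocast with enabled=False is a no-op, so one loop serves both paths.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_amp):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
```

With the flag off, the forward pass stays in full float32, removing AMP as a variable when debugging the instability.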

@michaelmckinsey1 michaelmckinsey1 marked this pull request as draft April 17, 2026 22:56
