Logic to detect hardware GPU count and aggregate GPU memory size in MiB by makungaj1 · Pull Request #4389 · aws/sagemaker-python-sdk

makungaj1 · 2024-01-23T23:50:48Z

Issue #, if available:

Description of changes:
Add logic to detect hardware GPU count and aggregate GPU memory size in MiB

Testing done:
Tested changes by running unit tests and integration tests locally.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

[x] I have read the CONTRIBUTING doc
[x] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
[x] I used the commit message format described in CONTRIBUTING
[x] I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
[x] I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

[x] I have added tests that prove my fix is effective or that my feature works (if appropriate)
[x] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
[x] I have checked that my tests are not configured for a specific region or account (if appropriate)
[x] I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…in MiB

src/sagemaker/serve/utils/hardware_detector.py

samruds · 2024-01-24T16:28:13Z

src/sagemaker/serve/utils/hardware_detector.py

+
+def _get_gpu_info(instance_type: str, session: Session) -> int:
+    """Get GPU info for the provided instance"""
+    global instance


What's the role of this instance variable?

This was left in by mistake. I'll remove it in the next PR update.

src/sagemaker/serve/utils/hardware_detector.py

samruds · 2024-01-24T16:41:07Z

tests/unit/sagemaker/serve/utils/test_hardware_detector.py

+REGION = "us-west-2"
+VALID_INSTANCE_TYPE = "ml.p5.48xlarge"
+INVALID_INSTANCE_TYPE = "fl.c5.57xxlarge"
+DESCRIBE_INSTANCE_TYPE_RESULT = {


Doing this might make the tests brittle, since responses change with time/release. Can we use moto or look into using the EC2 stub...

So can we use a mock EC2 service for the interaction?

We'll probably need to install a third-party lib for an EC2 stub. We can keep the tests simple, like:

sagemaker-python-sdk/tests/unit/test_collection.py

Line 32 in 08bcc2a

CREATE_COLLECTION_RESPONSE = {

I'll filter the return value down to only the relevant fields.

Is there guidance on adding third-party dependencies to the SDK? Is it okay to add one, or do we need approval from the Python SDK team?
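For reference, a minimal sketch of what a trimmed-down fixture plus a plain `unittest.mock` client could look like, with no third-party stub library. The helper `read_gpu_info`, the shape of the client call, and the fixture values are assumptions made for illustration; they are not the code in this PR.

```python
from unittest.mock import MagicMock

# Trimmed-down stand-in for a describe_instance_types response: only the
# fields the code under test reads. Values are illustrative.
DESCRIBE_INSTANCE_TYPE_RESULT = {
    "InstanceTypes": [
        {
            "InstanceType": "p5.48xlarge",
            "GpuInfo": {
                "Gpus": [{"Count": 8}],
                "TotalGpuMemoryInMiB": 655360,
            },
        }
    ]
}


def read_gpu_info(ec2_client, ec2_instance_type):
    """Hypothetical helper under test: returns (gpu_count, total_memory_mib)."""
    response = ec2_client.describe_instance_types(InstanceTypes=[ec2_instance_type])
    gpu_info = response["InstanceTypes"][0]["GpuInfo"]
    return gpu_info["Gpus"][0]["Count"], gpu_info["TotalGpuMemoryInMiB"]


def test_read_gpu_info_success():
    # The mock replaces the EC2 client, so no network call is made.
    ec2_client = MagicMock()
    ec2_client.describe_instance_types.return_value = DESCRIBE_INSTANCE_TYPE_RESULT

    assert read_gpu_info(ec2_client, "p5.48xlarge") == (8, 655360)
    ec2_client.describe_instance_types.assert_called_once_with(
        InstanceTypes=["p5.48xlarge"]
    )
```

Because the fixture only carries the fields the helper reads, unrelated changes to the full API response should not break the test.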

jiapinw · 2024-01-24T18:24:31Z

src/sagemaker/serve/utils/hardware_detector.py



-def _get_gpu_info(instance_type: str, session: Session) -> int:
+def _get_gpu_info(instance_type: str, session: Session) -> tuple:


The type hint should be more explicit, i.e. Tuple[int, int].

Also, if we rely on the order of elements in the tuple, please make sure it's well documented.

Great, I'll make it clear in the docstring, as below:

    """Get GPU info for the provided instance

    @return: Tuple containing: [0] number of GPUs available and [1] aggregate memory size in MiB
    """
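A small sketch of the two options discussed here: an explicit `Tuple[int, int]` with the order documented, and, as an alternative that was not proposed in this thread, a `NamedTuple` so callers do not depend on position. The function and class names are hypothetical.

```python
from typing import NamedTuple, Tuple


def gpu_info_as_tuple(count: int, total_memory_mib: int) -> Tuple[int, int]:
    """Return GPU info as a positional tuple.

    Returns:
        Tuple[int, int]: [0] number of GPUs available,
        [1] aggregate GPU memory size in MiB.
    """
    return count, total_memory_mib


class GpuInfo(NamedTuple):
    """Alternative shape: named fields, still unpackable like a tuple."""

    count: int
    total_memory_mib: int


# Both forms unpack the same way; the NamedTuple also allows attribute access.
count, total_memory_mib = gpu_info_as_tuple(8, 196608)
info = GpuInfo(count=8, total_memory_mib=196608)
```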

src/sagemaker/serve/utils/hardware_detector.py

jiapinw · 2024-01-24T18:28:32Z

src/sagemaker/serve/utils/hardware_detector.py

-
-    ec2_instance = ".".join(split_instance)
+    if instance_info is not None:
+        gpus_info = instance_info.get("GpuInfo")


Do we only care about GpuInfo? IIRC, inf2 and trn instances store this info under InferenceAcceleratorInfo.

The scope of this milestone is GPU only; Inferentia support is out of scope due to some blockers.

jiapinw · 2024-01-24T18:30:16Z

src/sagemaker/serve/utils/hardware_detector.py

+        gpus_info = instance_info.get("GpuInfo")
+        if gpus_info is not None:
+            instance_gpu_info = (
+                gpus_info.get("Gpus")[0].get("Count"),


gpus_info.get("Gpus")[0].get("Count") and gpus_info.get("TotalGpuMemoryInMiB") could be None in some cases (I'm actually not entirely sure; it could be helpful to run this against a comprehensive list). Do we need to handle them?

It should be less of an issue, though.

This should not cause an issue, as these are expected values.
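If the missing-key case ever needs handling, a defensive version could look roughly like the sketch below. It assumes the response shape shown in the diff above (`GpuInfo`, `Gpus`, `Count`, `TotalGpuMemoryInMiB`); the function name `parse_gpu_info` and the choice to return None instead of raising are assumptions, not what the PR does.

```python
from typing import Optional, Tuple


def parse_gpu_info(instance_info: Optional[dict]) -> Optional[Tuple[int, int]]:
    """Return (gpu_count, total_memory_mib), or None if any piece is missing."""
    if not instance_info:
        return None

    gpus_info = instance_info.get("GpuInfo") or {}
    gpus = gpus_info.get("Gpus") or []

    # Guard the [0] index: an empty or absent "Gpus" list would otherwise raise.
    count = gpus[0].get("Count") if gpus else None
    total_memory_mib = gpus_info.get("TotalGpuMemoryInMiB")

    if count is None or total_memory_mib is None:
        return None
    return count, total_memory_mib
```

Returning None lets the caller decide whether to fall back to another source or raise.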

src/sagemaker/serve/utils/hardware_detector.py

jiapinw · 2024-01-24T18:35:48Z

src/sagemaker/serve/utils/hardware_detector.py

+    "ml.p3dn.24xlarge": {"Count": 8, "TotalGpuMemoryInMiB": 262144},
+    "ml.p2.xlarge": {"Count": 1, "TotalGpuMemoryInMiB": 12288},
+    "ml.p2.8xlarge": {"Count": 8, "TotalGpuMemoryInMiB": 98304},
+    "ml.p2.16xlarge": {"Count": 16, "TotalGpuMemoryInMiB": 196608},


Out of curiosity, what's the source of truth for this? I didn't see GPU memory info here: https://aws.amazon.com/ec2/instance-types/p2/

To get the memory, run a command like:
aws ec2 describe-instance-types --instance-types g5.48xlarge

Gotcha. If this can be retrieved from describe-instance-types, do we still need to store these values as constants? Or am I misunderstanding the purpose of this fallback info? IMO, this is what we rely on when certain instance types are not retrievable from EC2.

See here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/serve/utils/local_hardware.py#L71

We can still store them. For instances that are not retrievable from the EC2 API, the GPU memory is listed here: https://aws.amazon.com/sagemaker/pricing/
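As a rough illustration of the fallback lookup being discussed, here is a sketch using a subset of the entries quoted in this thread. The helper names are hypothetical, and the prefix handling is an assumption based on the table keys using `ml.` names while the CLI example uses the bare EC2 name.

```python
from typing import Tuple

# Subset of the fallback entries quoted in this review thread.
FALLBACK_GPU_INFO = {
    "ml.p2.xlarge": {"Count": 1, "TotalGpuMemoryInMiB": 12288},
    "ml.p2.8xlarge": {"Count": 8, "TotalGpuMemoryInMiB": 98304},
    "ml.g5.12xlarge": {"Count": 4, "TotalGpuMemoryInMiB": 98304},
    "ml.g5.48xlarge": {"Count": 8, "TotalGpuMemoryInMiB": 196608},
}


def to_ec2_instance_type(instance_type: str) -> str:
    """Drop a leading "ml." so the name matches the EC2 instance type name."""
    parts = instance_type.split(".")
    if parts[0] == "ml":
        parts = parts[1:]
    return ".".join(parts)


def get_fallback_gpu_info(instance_type: str) -> Tuple[int, int]:
    """Look up (gpu_count, total_memory_mib) in the static table."""
    entry = FALLBACK_GPU_INFO.get(instance_type)
    if entry is None:
        raise ValueError(f"No fallback GPU info for instance type: {instance_type}")
    return entry["Count"], entry["TotalGpuMemoryInMiB"]
```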

samruds

Overall LGTM, two minor comments.

src/sagemaker/serve/utils/hardware_detector.py

knikure

/bot run all

mohanasudhan · 2024-01-24T23:42:55Z

src/sagemaker/serve/utils/hardware_detector.py

+    "ml.g5.16xlarge": {"Count": 1, "TotalGpuMemoryInMiB": 24576},
+    "ml.g5.12xlarge": {"Count": 4, "TotalGpuMemoryInMiB": 98304},
+    "ml.g5.24xlarge": {"Count": 4, "TotalGpuMemoryInMiB": 98304},
+    "ml.g5.48xlarge": {"Count": 8, "TotalGpuMemoryInMiB": 196608},


Can you detail the process we have in place for when new accelerator hardware is released in SageMaker? How do we automate the logic so that the end-to-end release of an instance is taken care of in PySDK as well?

I understand it is fallback logic for the access-denied case, but I am curious how we plan to keep it fresh.

You are right that the static list can go out of date. The long-term plan is to have a table (DynamoDB or otherwise) with a Lambda/workflow that fetches additional fallback types in an instance family, or at minimum the ability to append new types.

Proposed process we are going for:

1. The customer provides new accelerator hardware; we query the EC2 API to get details.

2. If we are unable to query the API, fall back to this static list.

3. Try HuggingFace Hub with additional parameters.

4. Fail and propagate.
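The proposed process can be sketched as an ordered chain of lookups, where each source either returns a result, returns None, or raises (for example on access denied), and the last step is to fail. This is only an illustration: `resolve_gpu_info`, the `Source` type, and the stand-in sources are hypothetical, and what the HuggingFace Hub step would actually query is not specified in this thread.

```python
from typing import Callable, Optional, Sequence, Tuple

GpuInfo = Tuple[int, int]  # (gpu_count, total_memory_mib)
Source = Callable[[str], Optional[GpuInfo]]


def resolve_gpu_info(instance_type: str, sources: Sequence[Source]) -> GpuInfo:
    """Try each source in order; raise if none of them yields a result."""
    for source in sources:
        try:
            result = source(instance_type)
        except Exception:
            # Assumption for this sketch: any failure (e.g. access denied)
            # simply moves on to the next source.
            continue
        if result is not None:
            return result
    raise ValueError(f"Unable to determine GPU info for: {instance_type}")


# Stand-ins for the real steps: an EC2 query that is denied, then a static list.
def ec2_lookup_denied(instance_type: str) -> Optional[GpuInfo]:
    raise PermissionError("access denied")


STATIC_LIST = {"ml.g5.12xlarge": (4, 98304)}

resolved = resolve_gpu_info("ml.g5.12xlarge", [ec2_lookup_denied, STATIC_LIST.get])
```

Keeping the sources as plain callables means a new step can be added by appending to the list, without touching the resolver.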

mohanasudhan · 2024-01-24T23:45:37Z

tests/unit/sagemaker/serve/utils/test_hardware_detector.py

+
+
+@patch("sagemaker.session.Session")
+def test_get_gpu_info_success(session):


Thanks for adding unit tests. Can you also include an integration test?

The integration test is a separate task: https://issues.amazon.com/issues/SMIE-540

mufaddal-rohawala · 2024-01-25T00:39:11Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: 9521f87
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

mufaddal-rohawala · 2024-01-25T00:41:56Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: 9521f87
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-25T01:21:16Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: 9521f87
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-25T01:37:23Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: 9521f87
Result: FAILED

mufaddal-rohawala · 2024-01-25T01:41:17Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 9521f87
Result: FAILED

codecov-commenter · 2024-01-25T08:11:11Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (fc11ace) 86.96% compared to head (8c75b1e) 86.85%.

Files	Patch %	Lines
src/sagemaker/instance_types_gpu_info.py	84.61%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4389      +/-   ##
==========================================
- Coverage   86.96%   86.85%   -0.11%     
==========================================
  Files        1197      382     -815     
  Lines      106962    35307   -71655     
==========================================
- Hits        93019    30666   -62353     
+ Misses      13943     4641    -9302


mufaddal-rohawala · 2024-01-27T04:03:08Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-27T04:05:38Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-27T04:08:39Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-27T05:10:09Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-27T05:12:11Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: 625ba7c
Result: SUCCEEDED

knikure

/bot run all

mufaddal-rohawala · 2024-01-29T22:52:25Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-29T22:56:03Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: 625ba7c
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-29T22:56:08Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-29T23:55:26Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: 625ba7c
Result: FAILED

mufaddal-rohawala · 2024-01-30T00:16:55Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: 625ba7c
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-30T03:04:27Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-local-mode-tests
Commit ID: 8c75b1e
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-30T03:07:31Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 8c75b1e
Result: FAILED

mufaddal-rohawala · 2024-01-30T03:08:28Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-notebook-tests
Commit ID: 8c75b1e
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-30T04:06:07Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-unit-tests
Commit ID: 8c75b1e
Result: SUCCEEDED

mufaddal-rohawala · 2024-01-30T04:20:32Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-slow-tests
Commit ID: 8c75b1e
Result: SUCCEEDED

knikure

/bot run pr

mufaddal-rohawala · 2024-01-30T17:43:09Z

AWS CodeBuild CI Report

CodeBuild project: sagemaker-python-sdk-pr
Commit ID: 8c75b1e
Result: SUCCEEDED

…e in MiB (aws#4389)

* Add logic to detect hardware GPU count and aggregate GPU memory size in MiB
* Fix all formatting
* Addressed PR review comments
* Addressed PR Review messages
* Addressed PR Review Messages
* Addressed PR Review comments
* Addressed PR Review Comments
* Add integration tests
* Add config
* Fix integration tests
* Include Instance Types GPU infor Config files
* Addressed PR review comments
* Fix unit tests
* Fix unit test: 'Mock' object is not subscriptable

---------

Co-authored-by: Jonathan Makunga <makung@amazon.com>

…ngestion. (#4413) * change: update image_uri_configs 12-13-2023 12:23:06 PST * change: update image_uri_configs 12-13-2023 14:04:54 PST * prepare release v2.200.1 * update development version to v2.200.2.dev0 * fix: Move func and args serialization of function step to step level (#4312) * fix: Add write permission to job output dirs for remote and step decorator running on non-root job user (#4325) * feat: Added update for model package (#4309) Co-authored-by: Keshav Chandak <chakesh@amazon.com> * documentation: fix ModelBuilder sample notebook links (#4319) * feat: Use specific images for SMP v2 jobs (#4333) * Add check for smp lib * update syntax * Remove unused images * Update repo name and regions * Update account number * Update framework name and check for None distribution * Add unit tests for smp v2 uri * Check enabled * Remove logging * Add cuda version in uri * Update cu121 * Update syntax * Fix black check * Fix black --------- Co-authored-by: huilgolr <yoda@ip-10-0-12-252.us-west-2.compute.internal> * Fix: Updated js mb compression logic - ModelBuilder (#4294) Co-authored-by: EC2 Default User <ec2-user@ip-172-16-54-104.us-west-2.compute.internal> * documentation: SMP v2 doc updates (#1423) (#4336) * doc update for estimator distribution art * add note to the SMP doc and minor fixes * remove subnodes * rm all v1 content as documenting everything in aws docs * fix build errors * fix white spaces * rm smdistributed from TF estimator distribution * rm white spaces * add notes to TF estimator distribution * fix links * incorporate feedback * update example values * fix version numbers in the notes Co-authored-by: Miyoung <cmiyoung@amazon.com> * prepare release v2.201.0 * update development version to v2.201.1.dev0 * Fix: Add additional model builder telemetry (#4334) * move telemetry code to public * add additional test --------- Co-authored-by: EC2 Default User <ec2-user@ip-172-16-54-104.us-west-2.compute.internal> * feature: support remote debug for 
sagemaker training job (#4315) * feature: support remote debug for sagemaker training job * change: Replace update_remote_config with 2 helper methods for enable and disable respectively * change: add new argument enable_remote_debug to skip set of test_jumpstart_estimator_kwargs_match_parent_class * chore: add jumpstart support for remote debug --------- Co-authored-by: Xinyu Xie <xixinyu@amazon.com> Co-authored-by: Evan Kravitz <evakravi@amazon.com> * Update tblib constraint (#4317) * Fix: Fix job_objective type (#4303) * change: update image_uri_configs 12-21-2023 08:32:41 PST * prepare release v2.202.0 * update development version to v2.202.1.dev0 * Using logging instead of prints (#4133) * documentation: update issue template. (#4337) * change: update model path in local mode (#4296) * Update model path in local mode * Add test * change: update image_uri_configs 12-22-2023 06:17:35 PST * prepare release v2.202.1 * update development version to v2.202.2.dev0 * change: create role if needed in `get_execution_role` (#4323) * Create role if needed in get_execution_role * Add tests * Change: More pythonic tags (#4327) * Change: More pythonic tags * Fix broken tags * More tags formatting and add a test * Fix tests * Raise Exception for debug (#4344) Co-authored-by: Ruilian Gao <ruiliann@amazon.com> * Change: Allow extra_args to be passed to uploader (#4338) * Change: Allow extra_args to be passed to uploader * Fix tests * Black * Fix test * Change: Drop py2 tag from the wheel as we don't support Python 2 (#4343) * Disable failed test in IR (#4345) * Disable failed test in IR * Fix format --------- Co-authored-by: Ruilian Gao <ruiliann@amazon.com> * change: update image_uri_configs 12-25-2023 06:17:33 PST * feat: Supporting tbac in load_run (#4039) * feature: support local mode in SageMaker Studio (#1300) (#4347) * feature: support local mode in SageMaker Studio * chore: fix typo * chore: fix formatting * chore: revert changes for docker compose logs * chore: 
black-format * change: Use predtermined dns-allow-listed-hostname for Studio Local Support * add support for CodeEditor and JupyterLabs --------- Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> Co-authored-by: Mufaddal Rohawala <mufi@amazon.com> * prepare release v2.203.0 * update development version to v2.203.1.dev0 * change: update image_uri_configs 12-29-2023 06:17:34 PST * query hf api for model md (#4346) Co-authored-by: EC2 Default User <ec2-user@ip-172-16-54-104.us-west-2.compute.internal> * fix: skip failing integs (#4348) Co-authored-by: Mufaddal Rohawala <mufi@amazon.com> * change: TGI 1.3.3 (#4335) * prepare release v2.203.1 * update development version to v2.203.2.dev0 * feat: parallelize notebook search utils, add new operators (#4342) * feat: parallelize notebook search utils * chore: raise exception in notebook utils if thread has error * chore: improve variable name * fix: not passing region to get jumpstart bucket * chore: add sagemaker session to notebook utils * chore: address PR comments * feat: add support for includes, begins with, ends with * fix: pylint * feat: private util for model eula key * fix: unit tests, use verify_model_region_and_return_specs in notebook utils * Revert "feat: private util for model eula key" This reverts commit e2daefc. 
* chore: add search keywords to header * fix: change ConditionNot incorrect property Expression to Condition (#4351) * fix: Huggingface glue failing tests (#4367) * fix: Huggingface glue failing tests * fix: Sphinx doc build failure * fix: Huggingface glue failing tests * fix: failing sphinx tests * fix: failing sphinx tests * fix: failing black check * fix: sphinx doc errors * fix: sphinx doc errors * sphinx * black-format * sphinx * sphinx * sphinx --------- Co-authored-by: Mufaddal Rohawala <mufi@amazon.com> Co-authored-by: Erick Benitez-Ramos <benieric@amazon.com> * fix: Add PyTorch 2.1.0 SM Training DLC to UNSUPPORTED_DLC_IMAGE_FOR_SM_PARALLELISM list (#4356) * add 2.1 unsupported smddp * formatting * feat: Support custom repack model settings (#4328) * change: update sphinx version (#4377) * change: update sphinx version * Update sphinx * change: Updates for DJL 0.26.0 release (#4366) * change: TGI NeuronX (#4375) * TGI NeuronX * Update * Update * fix: add warning message for job-prefixed pipeline steps when no job name is provided (#4371) Co-authored-by: svia3 <svia@amazon.com> * change: JumpStart - TLV region launch (#4379) * feat: add throughput management support for feature group (#4359) * feat: add throughput management support for feature group * documentation: add doc for feature group throughput config --------- Co-authored-by: Nilesh PS <psnilesh@amazon.com> * change: Enable galactus integ tests (#4376) * feat: Enable galactus integ tests * fix flake8 * fix doc8 * trying to see if it works with slow tests * small fixes in import error * fix missing import * try to remove some dependencies from requirement to see if pr test can be fixed * fix flake8 * Enable more tests * Add rerun annotation and further remove dependencies * comment out 2 integ tests * Remove local mode test for now * fix flake8 * prepare release v2.204.0 * update development version to v2.204.1.dev0 * fix: Add validation for empty ParameterString value in start local pipeline 
(#4354) * feat: Support selective pipeline execution for function step (#4372) * change: update image_uri_configs 01-24-2024 06:17:33 PST * fix: update get_execution_role_arn from metadata file if present (#4388) * fix: Support using PipelineDefinitionConfig in local mode (#4352) * fix: remove fastapi and uvicorn dependencies (#4365) They are not used in the codebase. Closes #4361 #4295 * prepare release v2.205.0 * update development version to v2.205.1.dev0 * change: TGI NeuronX 0.0.17 (#4390) * fix: Support PipelineVariable for ModelQualityCheckConfig attributes (#4353) * feat: Logic to detect hardware GPU count and aggregate GPU memory size in MiB (#4389) * Add logic to detect hardware GPU count and aggregate GPU memory size in MiB * Fix all formatting * Addressed PR review comments * Addressed PR Review messages * Addressed PR Review Messages * Addressed PR Review comments * Addressed PR Review Comments * Add integration tests * Add config * Fix integration tests * Include Instance Types GPU infor Config files * Addressed PR review comments * Fix unit tests * Fix unit test: 'Mock' object is not subscriptable --------- Co-authored-by: Jonathan Makunga <makung@amazon.com> * fix: fixed create monitoring schedule failing after validation error (#4385) Co-authored-by: Keshav Chandak <chakesh@amazon.com> * Add collection type support for Feaure Group Ingestion. Add TargetStores support for PutRecord and Ingestion. * Remove merge conflicts. * Update the feature definition type * Black formatting * Fix Flake8 formatting * Fix Pylint * Fix Formatting. 
--------- Co-authored-by: sagemaker-bot <sagemaker-bot@amazon.com> Co-authored-by: ci <ci> Co-authored-by: qidewenwhen <32910701+qidewenwhen@users.noreply.github.com> Co-authored-by: Keshav Chandak <keshav.chandak1995@gmail.com> Co-authored-by: Keshav Chandak <chakesh@amazon.com> Co-authored-by: stacicho <stacicho@amazon.com> Co-authored-by: Teng-xu <67929972+Teng-xu@users.noreply.github.com> Co-authored-by: huilgolr <yoda@ip-10-0-12-252.us-west-2.compute.internal> Co-authored-by: Gary Wang <38331932+gwang111@users.noreply.github.com> Co-authored-by: EC2 Default User <ec2-user@ip-172-16-54-104.us-west-2.compute.internal> Co-authored-by: akrishna1995 <38850354+akrishna1995@users.noreply.github.com> Co-authored-by: Miyoung <cmiyoung@amazon.com> Co-authored-by: Xinyu Xie <xiexinyucrab@126.com> Co-authored-by: Xinyu Xie <xixinyu@amazon.com> Co-authored-by: Evan Kravitz <evakravi@amazon.com> Co-authored-by: martinRenou <martin.renou@gmail.com> Co-authored-by: Duc Trung Le <leductrungxf@gmail.com> Co-authored-by: ruiliann666 <141953824+ruiliann666@users.noreply.github.com> Co-authored-by: Ruilian Gao <ruiliann@amazon.com> Co-authored-by: ananth102 <ananthbashyam1@gmail.com> Co-authored-by: Mufaddal Rohawala <89424143+mufaddal-rohawala@users.noreply.github.com> Co-authored-by: Erick Benitez-Ramos <141277478+benieric@users.noreply.github.com> Co-authored-by: Mufaddal Rohawala <mufi@amazon.com> Co-authored-by: amzn-choeric <105388439+amzn-choeric@users.noreply.github.com> Co-authored-by: evakravi <69981223+evakravi@users.noreply.github.com> Co-authored-by: Erick Benitez-Ramos <benieric@amazon.com> Co-authored-by: Sirut Buasai <73297481+sirutBuasai@users.noreply.github.com> Co-authored-by: Sindhu Somasundaram <56774226+sindhuvahinis@users.noreply.github.com> Co-authored-by: Stephen Via <51342648+svia3@users.noreply.github.com> Co-authored-by: svia3 <svia@amazon.com> Co-authored-by: Haixin Wang <98612668+haixiw@users.noreply.github.com> Co-authored-by: Nilesh PS 
<nps17thatsme@gmail.com> Co-authored-by: Nilesh PS <psnilesh@amazon.com> Co-authored-by: jiapinw <95885824+jiapinw@users.noreply.github.com> Co-authored-by: Jay Goyani <135654128+jgoyani1@users.noreply.github.com> Co-authored-by: Justin <justinm088@hotmail.com> Co-authored-by: Jonathan Makunga <54963715+makungaj1@users.noreply.github.com> Co-authored-by: Jonathan Makunga <makung@amazon.com>

<nps17thatsme@gmail.com> Co-authored-by: Nilesh PS <psnilesh@amazon.com> Co-authored-by: jiapinw <95885824+jiapinw@users.noreply.github.com> Co-authored-by: Jay Goyani <135654128+jgoyani1@users.noreply.github.com> Co-authored-by: Justin <justinm088@hotmail.com> Co-authored-by: Jonathan Makunga <54963715+makungaj1@users.noreply.github.com> Co-authored-by: Jonathan Makunga <makung@amazon.com>

…e in MiB (aws#4389) * Add logic to detect hardware GPU count and aggregate GPU memory size in MiB * Fix all formatting * Addressed PR review comments * Addressed PR Review messages * Addressed PR Review Messages * Addressed PR Review comments * Addressed PR Review Comments * Add integration tests * Add config * Fix integration tests * Include Instance Types GPU infor Config files * Addressed PR review comments * Fix unit tests * Fix unit test: 'Mock' object is not subscriptable --------- Co-authored-by: Jonathan Makunga <makung@amazon.com>

Jonathan Makunga added 2 commits January 23, 2024 15:08

Add logic to detect hardware GPU count and aggregate GPU memory size in MiB

ecbe66f

Fix all formatting

27a620a

makungaj1 requested a review from a team as a code owner January 23, 2024 23:50

makungaj1 requested review from jgoyani1 and removed request for a team January 23, 2024 23:50

Merge branch 'master' into master

e5d7c16

gwang111 reviewed Jan 24, 2024

src/sagemaker/serve/utils/hardware_detector.py (outdated, resolved)

jiapinw suggested changes Jan 24, 2024

samruds reviewed Jan 24, 2024

Jonathan Makunga and others added 2 commits January 24, 2024 10:06

Addressed PR review comments

cf49ca8

Merge branch 'master' into master

3b63301

jiapinw suggested changes Jan 24, 2024

Addressed PR Review messages

477d6c2

makungaj1 force-pushed the master branch from dc26af4 to 477d6c2 Compare January 24, 2024 21:38

Merge branch 'master' into master

6adfa69

samruds reviewed Jan 24, 2024

src/sagemaker/serve/utils/hardware_detector.py (outdated, resolved)

src/sagemaker/serve/utils/hardware_detector.py (outdated, resolved)

Addressed PR Review Messages

27abb4c

samruds approved these changes Jan 24, 2024

Jonathan Makunga added 2 commits January 24, 2024 15:18

Addressed PR Review comments

51c8649

Addressed PR Review Comments

9521f87

jiapinw approved these changes Jan 24, 2024

knikure reviewed Jan 24, 2024

mohanasudhan reviewed Jan 24, 2024

Add integration tests

7592fac

samruds approved these changes Jan 29, 2024

knikure reviewed Jan 29, 2024

Fix unit test: 'Mock' object is not subscriptable

8c75b1e

knikure self-assigned this Jan 30, 2024

knikure reviewed Jan 30, 2024

samruds approved these changes Jan 30, 2024

mohanasudhan approved these changes Jan 30, 2024

knikure approved these changes Jan 30, 2024

knikure merged commit 427dec6 into aws:master Jan 30, 2024



-def _get_gpu_info(instance_type: str, session: Session) -> int:
+def _get_gpu_info(instance_type: str, session: Session) -> tuple:
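The change of return annotation from `int` to `tuple` fits the PR's stated goal of reporting both the GPU count and the aggregate GPU memory in MiB. The following is only a minimal sketch of how such a pair could be derived, assuming the data comes from a response shaped like the EC2 `DescribeInstanceTypes` API (`GpuInfo.Gpus[].Count` and `GpuInfo.TotalGpuMemoryInMiB`). The helper name `gpu_info_from_response` and the sample values are hypothetical and are not taken from the PR.

```python
from typing import Tuple


def gpu_info_from_response(response: dict) -> Tuple[int, int]:
    """Return (gpu_count, total_gpu_memory_mib) from a DescribeInstanceTypes-style dict.

    Hypothetical helper for illustration; not the function added by the PR.
    """
    instance_types = response.get("InstanceTypes") or []
    if not instance_types:
        raise ValueError("No instance type information in response")

    gpu_info = instance_types[0].get("GpuInfo")
    if not gpu_info:
        raise ValueError("Instance type reports no GPU information")

    # An instance may list several GPU entries, so sum the per-entry counts.
    gpu_count = sum(gpu.get("Count", 0) for gpu in gpu_info.get("Gpus", []))
    total_memory_mib = gpu_info.get("TotalGpuMemoryInMiB", 0)
    return gpu_count, total_memory_mib


# Illustrative payload only; the numbers are made up for the example.
sample = {
    "InstanceTypes": [
        {
            "InstanceType": "example.gpu.large",
            "GpuInfo": {
                "Gpus": [{"Count": 4, "MemoryInfo": {"SizeInMiB": 24576}}],
                "TotalGpuMemoryInMiB": 98304,
            },
        }
    ]
}

count, memory_mib = gpu_info_from_response(sample)
print(count, memory_mib)
```

Returning a tuple lets callers unpack both values in one call, which is presumably why the annotation was widened from `int`.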



		@patch("sagemaker.session.Session")
		def test_get_gpu_info_success(session):
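One of the commits in this PR is titled "Fix unit test: 'Mock' object is not subscriptable". The sketch below reproduces that class of error and one common way around it; it does not claim to show the fix the author actually applied. The helper `first_gpu_count` and the response shape are assumptions made for illustration.

```python
from unittest.mock import Mock


def first_gpu_count(session) -> int:
    """Hypothetical code under test: indexes into a client response with []."""
    client = session.boto_session.client("ec2")
    response = client.describe_instance_types(InstanceTypes=["example.gpu.large"])
    return response["InstanceTypes"][0]["GpuInfo"]["Gpus"][0]["Count"]


def run_with_plain_mock() -> str:
    """With an unconfigured Mock, the response is itself a Mock, and [] fails."""
    session = Mock()
    try:
        first_gpu_count(session)
    except TypeError as err:
        return str(err)
    return "no error"


def run_with_configured_mock() -> int:
    """Giving the mocked call a real dict as return_value makes [] work."""
    session = Mock()
    client = session.boto_session.client.return_value
    client.describe_instance_types.return_value = {
        "InstanceTypes": [{"GpuInfo": {"Gpus": [{"Count": 4}]}}]
    }
    return first_gpu_count(session)


print(run_with_plain_mock())
print(run_with_configured_mock())
```

`unittest.mock.MagicMock` also supports subscripting, but setting an explicit `return_value` keeps the test's expected data visible in the test body.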

Conversation

makungaj1 commented Jan 23, 2024 • edited

samruds left a comment

knikure left a comment

mufaddal-rohawala commented Jan 25, 2024

AWS CodeBuild CI Report

mufaddal-rohawala commented Jan 25, 2024

AWS CodeBuild CI Report

mufaddal-rohawala commented Jan 25, 2024

AWS CodeBuild CI Report

mufaddal-rohawala commented Jan 25, 2024

AWS CodeBuild CI Report

mufaddal-rohawala commented Jan 25, 2024

AWS CodeBuild CI Report

codecov-commenter commented Jan 25, 2024 • edited