3.1.36 - CVE fixes #1631

Open
suyadav1 wants to merge 8 commits into ci_prod from suyadav/3.1.36-cves

suyadav1 (Contributor) commented Apr 2, 2026

  • Added two new skills to improve productivity: backdoor-testing, for testing changes on a branch, and upgrade-telegraf, for raising a PR for the dalec telegraf upgrade.

  • Fixed CVEs caused by the old Go and telegraf versions.

  • Updated the ama-logs Helm template to inject the pod's metadata.uid as the AMCS_CLIENT_INSTALL_ID_OVERRIDE environment variable for tracking AMCS calls: https://dev.azure.com/msazure/InfrastructureInsights/_workitems/edit/36350426

Test results with the skill:

[image: screenshot of test results]

@suyadav1 suyadav1 requested a review from a team as a code owner April 2, 2026 21:59
value: "{{ $.Values.OmsAgent.isTelegrafLivenessprobeEnabled | default false }}"
- name: CLUSTER_CLOUD_ENVIRONMENT
value: "{{ $.Values.global.commonGlobals.CloudEnvironment | lower }}"
- name: AMCS_CLIENT_INSTALL_ID_OVERRIDE
Contributor

This YAML is only used for ama-logs' own two model clusters, and will be used for the test clusters in the build pipeline.

This change will also need to be made in aks-rp and in the merged chart that Long is working on.

Contributor Author

@wanlonghenry let me know when the combined chart is ready and I can make the changes in it. If you are doing this in a private branch, please take this change or share the branch name.

Contributor

yes please check this branch: longw/addon-to-extension-merge-charts

gem install racc --no-document

# update zlib gem to fix CVE-2026-27820
gem uninstall zlib --force
Contributor

Does the zlib gem that is uninstalled here come with the Ruby installation? If so, should we update Ruby instead of upgrading a particular gem?

Contributor Author

We are using the latest Ruby version for arm64: https://packages.microsoft.com/azurelinux/3.0/prod/base/aarch64/Packages/r/ but it has the zlib CVE. Reinstalling the gem fixes the CVE, and I don't see any compatibility issues in testing. Do you see any issues with it?

Contributor

just curious, how is reinstalling this fixing it?

@@ -1,2 +1,3 @@
# to merge trivy scan PR, temporarily ignore CVE-2026-24051 until a fix is available
CVE-2026-24051
CVE-2026-34040
Contributor

what are these two cves?

Contributor Author

these are 2 new cves in telegraf and will need to be fixed in the next upgrade. We already picked the latest telegraf version.

Contributor

got it. these two cves will be merged into ci_prod.
nit: add a comment that the two cves are for telegraf, and the fix is not available yet.

if ([string]::IsNullOrEmpty($windowsazuremonitoragent)) {
Write-Host ('Environment variable WINDOWS_AMA_URL is not set. Using default value')
$windowsazuremonitoragent = "https://github.com/microsoft/Docker-Provider/releases/download/windows-ama-bits/genevamonitoringagent.46.31.3.zip"
$windowsazuremonitoragent = "https://github.com/microsoft/Docker-Provider/releases/download/windows-ama-bits/AzureMonitorAgentExtension-1.41.zip"
Contributor

what is this change about?

Contributor Author

Picking up the latest MA for Windows, which has AMCS PodUid telemetry support.

Contributor

We have been using genevamonitoragent all along, and now we are switching to a different agent? What are the differences between genevamonitoragent and azuremonitoragentextension? This is a large dependency change if they are two different pieces of software.

@@ -0,0 +1,248 @@
---
name: backdoor-deployment
Contributor

nit: how about using "helm-install-ama-logs" instead of "backdoor"? I once asked an agent to do a backdoor deployment, and it refused when it saw "backdoor".

Contributor Author

We can invoke this skill directly with its slash (/) command. backdoor-deployment is easy to remember.

@@ -0,0 +1,248 @@
---
Contributor

I am not sure whether what we are letting the agent do here should instead be done from a pipeline, or by a script we write, which is likely faster. The advantage of doing it through code is that it is faster, I believe, and more deterministic (e.g. the test query, duration, etc.).

we could try skill for a while and see how reliable and efficient the skill is.
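
For comparison, a scripted version of the data volume check could look roughly like the sketch below. This is only an illustration of the deterministic approach discussed above, not code from this PR: the function name, the 5% tolerance, and the input shape (row counts per table, already queried for each deployment) are all assumptions.

```python
from typing import Dict, List


def tables_with_volume_difference(
    production: Dict[str, int],
    test: Dict[str, int],
    tolerance: float = 0.05,  # hypothetical: allow a 5% relative difference
) -> List[str]:
    """Return the tables whose row counts differ by more than `tolerance`.

    `production` and `test` map table name -> row count for the same time
    window. A table missing from one side is treated as having 0 rows.
    """
    flagged = []
    for table in sorted(set(production) | set(test)):
        prod_rows = production.get(table, 0)
        test_rows = test.get(table, 0)
        largest = max(prod_rows, test_rows)
        if largest == 0:
            # No data on either side, so there is nothing to compare.
            continue
        if abs(prod_rows - test_rows) / largest > tolerance:
            flagged.append(table)
    return flagged
```

A script like this only flags the tables; deciding whether a flagged difference is a real regression (for example, a count change caused by a different container) would still need the kind of investigation the skill performs.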

Contributor Author

I have pasted a screenshot of the results from this skill.
It was able to identify intermittent failures, such as pod restarts caused by addon-token-adapter and a ContainerLogV2 count change due to a different container.
It would be very difficult to write test cases for such scenarios.
I am checking in this skill because it is available now and can be used right away. We can close on the test pipeline vs. skill question separately.


13. **Compare data volume** between production and test for all tables (see "Compare Data Volume"). If any table shows a difference, **investigate** before reporting (see "Investigate Data Volume Regression").
14. **Get PodUid** for all pods in both deployments (see "Get PodUid").
15. **Compare resource consumption** for `memoryWorkingSetBytes` and `cpuUsageNanoCores` (see "Compare Resource Consumption"). If any metric shows a sustained increase, **investigate** before reporting (see "Investigate Resource Consumption Regression").
Contributor

this is a bit vague, should we define what falls under the category of sustained increase?
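
One possible way to pin this down, offered only as a sketch: treat an increase as sustained when the test deployment exceeds production by more than a relative threshold for several consecutive, time-aligned samples. The 10% threshold and the run length of 5 are hypothetical placeholders, not values taken from the skill.

```python
from typing import Sequence


def is_sustained_increase(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_threshold: float = 0.10,  # hypothetical: more than 10% above baseline
    min_consecutive: int = 5,     # hypothetical: 5 samples in a row
) -> bool:
    """Return True if `candidate` exceeds `baseline` by more than
    `rel_threshold` for at least `min_consecutive` consecutive samples.

    Both sequences are assumed to be sampled at the same timestamps,
    e.g. memoryWorkingSetBytes for production and test pods.
    """
    run = 0
    for base, cand in zip(baseline, candidate):
        if base > 0 and (cand - base) / base > rel_threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            # A single sample back within the threshold resets the run,
            # so short spikes are not reported as sustained.
            run = 0
    return False
```

With a rule like this written into the skill, a brief spike during pod startup would be ignored, while a steady 20% rise across the whole window would be flagged for investigation.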

Contributor

@zanejohnson-azure zanejohnson-azure left a comment

left comments

|-------|-------------|---------|
| **Branch name** | Git branch to build | `suyadav/aiautomation` |
| **Current production image** | Production image tag (e.g. `ciprod:X.Y.Z`) | `ciprod:3.1.35` |
| **YAML file path** | Helm values file for backdoor deployment | `./../azuremonitor-containerinsights-for-prod-clusters/values.yaml` |
Contributor

Shouldn't these be the default values? And can't the current production image be derived from the release?
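
As a rough sketch of what deriving the production image could mean: given a list of released tags (how that list is obtained is left open here and is an assumption), pick the highest `ciprod:X.Y.Z` version. The function name and the input list are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches tags of the form ciprod:X.Y.Z, as in the inputs table above.
_CIPROD_TAG = re.compile(r"^ciprod:(\d+)\.(\d+)\.(\d+)$")


def latest_ciprod_tag(tags: Iterable[str]) -> Optional[str]:
    """Return the ciprod tag with the highest X.Y.Z version, or None."""
    best_tag = None
    best_version = None
    for tag in tags:
        match = _CIPROD_TAG.match(tag)
        if match is None:
            # Skip anything that is not a plain ciprod:X.Y.Z tag.
            continue
        # Compare numerically so that 3.1.35 sorts above 3.1.9.
        version = tuple(int(part) for part in match.groups())
        if best_version is None or version > best_version:
            best_tag, best_version = tag, version
    return best_tag
```

The skill could then use the derived tag as the default and still let the user override it.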

@@ -0,0 +1,272 @@
---
Contributor

Thanks Sunil. This is a good first step. Should we also add a step for updating telegraf in our repo for CVE fixes that do not require creating a new version? For base image CVEs, for which the dalec repo has a bot, we would instead read from the dalec repo and pick up the patches for the base image version.
