charts/azuremonitor-containerinsights-for-prod-clusters/templates/ama-logs.yaml
| value: "{{ $.Values.OmsAgent.isTelegrafLivenessprobeEnabled | default false }}" | ||
| - name: CLUSTER_CLOUD_ENVIRONMENT | ||
| value: "{{ $.Values.global.commonGlobals.CloudEnvironment | lower }}" | ||
| - name: AMCS_CLIENT_INSTALL_ID_OVERRIDE |
There was a problem hiding this comment.
This YAML is only used for ama-logs' own two model clusters, and will be used for the test clusters in the build pipeline.
This change will also need to be made in aks-rp and in the merged chart that Long is working on.
There was a problem hiding this comment.
@wanlonghenry let me know when the combined chart is ready and I can make the changes in it. If you are doing this in a private branch, please take this change or share the branch name.
There was a problem hiding this comment.
Yes, please check this branch: longw/addon-to-extension-merge-charts
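For context on the truncated hunk above: a pod's UID is normally exposed to a container through the Kubernetes downward API. Below is a minimal sketch of how the `AMCS_CLIENT_INSTALL_ID_OVERRIDE` entry could be populated, assuming the standard `fieldRef` mechanism. The surrounding `env:` key and indentation are illustrative, and the actual lines added to `ama-logs.yaml` are not shown in this thread.

```yaml
# Sketch only: not copied from the PR; layout and placement are assumptions.
env:
  - name: AMCS_CLIENT_INSTALL_ID_OVERRIDE
    valueFrom:
      fieldRef:
        # Downward API field that holds the pod's unique ID
        fieldPath: metadata.uid
```

With this form the value is resolved per pod by Kubernetes at container start, so it would not need a Helm value of its own.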
| gem install racc --no-document | ||
| # update zlib gem to fix CVE-2026-27820 | ||
| gem uninstall zlib --force |
There was a problem hiding this comment.
Does the zlib gem being uninstalled here come with the Ruby installation? If so, should we update Ruby instead of upgrading a particular gem?
There was a problem hiding this comment.
We are using the latest Ruby version for arm64 (https://packages.microsoft.com/azurelinux/3.0/prod/base/aarch64/Packages/r/), but it has the zlib CVE. Reinstalling the gem fixes the CVE, and I don't see any compatibility issues in testing. Do you see any issues with it?
There was a problem hiding this comment.
Just curious, how does reinstalling it fix the CVE?
| @@ -1,2 +1,3 @@ | |||
| # to merge trivy scan PR, temporarily ignore CVE-2026-24051 until a fix is available | |||
| CVE-2026-24051 No newline at end of file | |||
| CVE-2026-34040 | |||
There was a problem hiding this comment.
What are these two CVEs?
There was a problem hiding this comment.
These are two new CVEs in telegraf and will need to be fixed in the next upgrade. We have already picked the latest telegraf version.
There was a problem hiding this comment.
Got it. These two CVEs will be merged into ci_prod.
nit: add a comment that the two CVEs are for telegraf and that the fix is not available yet.
| if ([string]::IsNullOrEmpty($windowsazuremonitoragent)) { | ||
| Write-Host ('Environment variable WINDOWS_AMA_URL is not set. Using default value') | ||
| $windowsazuremonitoragent = "https://github.com/microsoft/Docker-Provider/releases/download/windows-ama-bits/genevamonitoringagent.46.31.3.zip" | ||
| $windowsazuremonitoragent = "https://github.com/microsoft/Docker-Provider/releases/download/windows-ama-bits/AzureMonitorAgentExtension-1.41.zip" |
There was a problem hiding this comment.
What is this change about?
There was a problem hiding this comment.
Picking up the latest MA for Windows, which has AMCS PodUid telemetry support.
There was a problem hiding this comment.
We have been using genevamonitoringagent all along, and now we are switching to a different agent? What are the differences between genevamonitoringagent and AzureMonitorAgentExtension? This is a large dependency change if they are two different pieces of software.
| @@ -0,0 +1,248 @@ | |||
| --- | |||
| name: backdoor-deployment | |||
There was a problem hiding this comment.
nit: how about using "helm-install-ama-logs" instead of "backdoor"? I once asked an agent to do a backdoor deployment, and it refused when it saw the word "backdoor".
There was a problem hiding this comment.
We can invoke this skill directly with its slash (/) command. "backdoor-deployment" is easy to remember.
| @@ -0,0 +1,248 @@ | |||
| --- | |||
There was a problem hiding this comment.
I am not sure whether what we are letting the agent do here should instead be done from a pipeline, or by a script that we write. I believe doing it through code is faster and more deterministic (e.g. the test query, duration, etc.).
We could try the skill for a while and see how reliable and efficient it is.
There was a problem hiding this comment.
I have pasted a screenshot of the results from this skill.
It was able to identify intermittent failures, such as pod restarts caused by addon-token-adapter and a ContainerLogV2 count change caused by a different container.
It would be very difficult to write test cases for such scenarios.
Checking in this skill as it's available now and can be used right away. We can close on the test pipeline vs. skill question separately.
| 13. **Compare data volume** between production and test for all tables (see "Compare Data Volume"). If any table shows a difference, **investigate** before reporting (see "Investigate Data Volume Regression"). | ||
| 14. **Get PodUid** for all pods in both deployments (see "Get PodUid"). | ||
| 15. **Compare resource consumption** for `memoryWorkingSetBytes` and `cpuUsageNanoCores` (see "Compare Resource Consumption"). If any metric shows a sustained increase, **investigate** before reporting (see "Investigate Resource Consumption Regression"). |
There was a problem hiding this comment.
This is a bit vague. Should we define what counts as a sustained increase?
| |-------|-------------|---------| | ||
| | **Branch name** | Git branch to build | `suyadav/aiautomation` | | ||
| | **Current production image** | Production image tag (e.g. `ciprod:X.Y.Z`) | `ciprod:3.1.35` | | ||
| | **YAML file path** | Helm values file for backdoor deployment | `./../azuremonitor-containerinsights-for-prod-clusters/values.yaml` | |
There was a problem hiding this comment.
Shouldn't these fields default to these values? And can't the current production image be derived from the release?
| @@ -0,0 +1,272 @@ | |||
| --- | |||
There was a problem hiding this comment.
Thanks Sunil. This is a good first step. Should we also add a skill for updating telegraf in our repo for CVE fixes, so that we don't have to create a new version and can instead read from the dalec repo and pick up patches for the base image version, for the base image CVEs that they have a bot for?
Added 2 new skills for enhancing productivity: backdoor-testing for testing changes on a branch, upgrade-telegraf for raising PR for dalec telegraf upgrade.
Fixed CVEs showing up due to the old Go version and telegraf.
Updated the `ama-logs` Helm template to inject the pod's `metadata.uid` as the `AMCS_CLIENT_INSTALL_ID_OVERRIDE` environment variable for tracking AMCS calls: https://dev.azure.com/msazure/InfrastructureInsights/_workitems/edit/36350426
Test results with the skill: