Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI broken by perma-diff in Cloud Functions #1306

Closed
eeaton opened this issue Jul 30, 2024 · 9 comments · Fixed by #1328
Closed

CI broken by perma-diff in Cloud Functions #1306

eeaton opened this issue Jul 30, 2024 · 9 comments · Fixed by #1328
Labels
bug Something isn't working

Comments

@eeaton
Copy link
Collaborator

eeaton commented Jul 30, 2024

TL;DR

CI tests have 100% failure rate since July 13th.

Tests are consistently broken by a perma-diff in Cloud Funcitons, with an error like the following:

Step #7 - "converge-org": TestOrg 2024-07-29T17:51:04Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; 
Step #7 - "converge-org": Error: Provider produced inconsistent final plan
Step #7 - "converge-org": 
Step #7 - "converge-org": When expanding the plan for
Step #7 - "converge-org": module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function
Step #7 - "converge-org": to include new values learned so far during apply, provider
Step #7 - "converge-org": "registry.terraform.io/hashicorp/google" produced an invalid new value for
Step #7 - "converge-org": .service_config[0].environment_variables: was null, but now
Step #7 - "converge-org": cty.MapVal(map[string]cty.Value{"LOG_EXECUTION_ID":cty.StringVal("true"),
Step #7 - "converge-org": "ROLES":cty.StringVal("roles/owner,roles/editor,roles/resourcemanager.organizationAdmin,roles/compute.networkAdmin,roles/compute.orgFirewallPolicyAdmin"),
Step #7 - "converge-org": "SOURCE_ID":cty.StringVal("organizations/943740911108/sources/17068193223143848275")}).
Step #7 - "converge-org": 
Step #7 - "converge-org": This is a bug in the provider, which should be reported in the provider's own
Step #7 - "converge-org": issue tracker.}
Step #7 - "converge-org":     apply.go:34: 
Step #7 - "converge-org":         	Error Trace:	/builder/home/go/pkg/mod/github.com/gruntwork-io/[email protected]/modules/terraform/apply.go:34
Step #7 - "converge-org":         	            				/builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/[email protected]/pkg/tft/terraform.go:571
Step #7 - "converge-org":         	            				/workspace/test/integration/org/org_test.go:101
Step #7 - "converge-org":         	            				/builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/[email protected]/pkg/tft/terraform.go:630
Step #7 - "converge-org":         	            				/builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/[email protected]/pkg/tft/terraform.go:669
Step #7 - "converge-org":         	            				/builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/[email protected]/pkg/utils/stages.go:31
Step #7 - "converge-org":         	            				/builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/[email protected]/pkg/tft/terraform.go:669

Expected behavior

No response

Observed behavior

No response

Terraform Configuration

the automated Cloud Build suite for this repo

Terraform Version

the automated Cloud Build suite for this repo

Additional information

I'm having a hard time identifying what we need to update or modify to resolve this. It looks like the root issue was in the provider, here's what I've found so far:

There was a provider fix 27 days ago to fix a permadiff:

There is also a PR in terraform-google-cloud-function (which we reference) to use latest source has not been approved, from 27 days ago: GoogleCloudPlatform/terraform-google-cloud-functions#135

From the related dates, I expect that the PR 135 needs to be merged into the CFT module, then we can update with this repo with the CFT module. However, I'm not certain because I can't find where the provider fix is pulled into the CFT module.

@eeaton eeaton added the bug Something isn't working label Jul 30, 2024
@eeaton
Copy link
Collaborator Author

eeaton commented Jul 30, 2024

@apeabody @daniel-cit can you take a look at this and check if my understanding is correct? (And if I'm off-base, any idea how else we might fix this?)

@apeabody
Copy link
Contributor

Hi @eeaton!

Terraform automatically picks the latest provider versions allowed by the version constraints during the init stage. So any post 5.38.0 release CI runs should use it, unless there is a maximum version constraint present (typically we only use major version for these, such as the <6.0.0 below).

For example:

TestBootstrap 2024-07-29T17:13:11Z command.go:185: Initializing provider plugins...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/time versions matching ">= 0.5.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/google-beta versions matching ">= 3.43.0, >= 3.50.0, >= 3.67.0, >= 3.77.0, >= 4.11.0, >= 4.17.0, != 4.31.0, >= 4.64.0, >= 5.7.0, >= 5.22.0, < 6.0.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/random versions matching ">= 2.1.0, >= 2.2.0, >= 3.1.0, ~> 3.4"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/null versions matching ">= 2.1.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/external versions matching ">= 2.2.2"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Finding hashicorp/google versions matching ">= 3.33.0, >= 3.43.0, >= 3.50.0, >= 3.53.0, >= 3.67.0, >= 3.77.0, >= 3.83.0, >= 4.17.0, >= 4.25.0, >= 4.28.0, != 4.31.0, >= 4.64.0, >= 5.7.0, >= 5.22.0, < 6.0.0"...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installing hashicorp/null v3.2.2...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installed hashicorp/null v3.2.2 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installing hashicorp/external v2.3.3...
TestBootstrap 2024-07-29T17:13:11Z command.go:185: - Installed hashicorp/external v2.3.3 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:12Z command.go:185: - Installing hashicorp/google v5.38.0...
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installed hashicorp/google v5.38.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installing hashicorp/time v0.12.0...
TestBootstrap 2024-07-29T17:13:13Z command.go:185: - Installed hashicorp/time v0.12.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:14Z command.go:185: - Installing hashicorp/google-beta v5.38.0...
TestBootstrap 2024-07-29T17:13:15Z command.go:185: - Installed hashicorp/google-beta v5.38.0 (signed by HashiCorp)
TestBootstrap 2024-07-29T17:13:15Z command.go:185: - Installing hashicorp/random v3.6.2...
TestBootstrap 2024-07-29T17:13:16Z command.go:185: - Installed hashicorp/random v3.6.2 (signed by HashiCorp)

GoogleCloudPlatform/terraform-google-cloud-functions#135 only updates examples, so I doubt that it will make an impact.

If you have a recent CI test failure, we can determine the TPG version used for the failing stage.

@apeabody
Copy link
Contributor

Looks like this error occurs with TPG v5.39: #1293 (comment)

@daniel-cit
Copy link
Contributor

daniel-cit commented Jul 30, 2024

Lets try to remove the LOG_EXECUTION_ID that was added to see if it can pass the permadiff #1307

the version of the provider used in the build is one that includes fix that added in 27 days ago
https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/cloudfunctions2/resource_cloudfunctions2_function.go#L36C2-L48C14

it is in both 5.38 and 5,39

@daniel-cit
Copy link
Contributor

Lets try to remove the LOG_EXECUTION_ID that was added to see if it can pass the permadiff #1307

the version of the provider used in the build is one that includes fix that added in 27 days ago https://github.com/hashicorp/terraform-provider-google/blob/main/google/services/cloudfunctions2/resource_cloudfunctions2_function.go#L36C2-L48C14

it is in both 5.38 and 5,39

Failed even after removing the LOG_EXECUTION_ID

Error:      	Received unexpected error:
            	FatalError{Underlying: error while running command: exit status 1; 
            	Error: Provider produced inconsistent final plan
            	
            	When expanding the plan for
            	module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function
            	to include new values learned so far during apply, provider
            	"registry.terraform.io/hashicorp/google" produced an invalid new value for
            	.service_config[0].environment_variables: was null, but now
            	cty.MapVal(map[string]cty.Value{
                   "ROLES":cty.StringVal("roles/owner,roles/editor,roles/resourcemanager.organizationAdmin,roles/compute.networkAdmin,roles/compute.orgFirewallPolicyAdmin"),
            	   "SOURCE_ID":cty.StringVal("organizations/REDACTED/sources/2027208248522035993")
                }).
            	
            	This is a bug in the provider, which should be reported in the provider's own
            	issue tracker.}

I will take a look at the provider and try to see what is the issue with the DiffSuppressFunc

@daniel-cit
Copy link
Contributor

@eeaton @apeabody

Foundation code originally had two entries in the runtime env variables, ROLES and SOURCE_ID

  service_config = {
    service_account_email = google_service_account.cloudfunction.email
    runtime_env_variables = {
      ROLES     = join(",", var.roles_to_monitor)
      SOURCE_ID = google_scc_source.cai_monitoring.id
    }
  }

Using the code from the latest version of the provider with the permadiff fix we have

func environmentVariablesDiffSuppress(k, old, new string, d *schema.ResourceData) bool {
	if k == "service_config.0.environment_variables.LOG_EXECUTION_ID" && new == "" {
		return true
	}

	// Let diff be determined by environment_variables (above)
	if strings.HasPrefix(k, "service_config.0.environment_variables.%") {
		return true
	}

	// For other keys, don't suppress diff.
	return false
}

This part if k == "service_config.0.environment_variables.LOG_EXECUTION_ID" && new == "" is skipping comparison for the key LOG_EXECUTION_ID on the map

This part if strings.HasPrefix(k, "service_config.0.environment_variables.%") is ignoring diferences in the size of the old and new environment variables map

when we run Terraform plan we get

# module.cai_monitoring.module.cloud_function.google_cloudfunctions2_function.function will be created
  + resource "google_cloudfunctions2_function" "function" {
...

      + service_config {
          + all_traffic_on_latest_revision   = true
          + available_cpu                    = "1"
          + available_memory                 = "256M"
          + gcf_uri                          = (known after apply)
          + ingress_settings                 = "ALLOW_ALL"
          + max_instance_count               = 100
          + max_instance_request_concurrency = (known after apply)
          + min_instance_count               = 1
          + service                          = (known after apply)
          + service_account_email            = (known after apply)
          + timeout_seconds                  = 60
          + uri                              = (known after apply)
        }
    }

and the service_config map is missing the evironment_variablesmap

if we remove the fix we get:

  + resource "google_cloudfunctions2_function" "function" {
...

      + service_config {
          + all_traffic_on_latest_revision   = true
          + available_cpu                    = "1"
          + available_memory                 = "256M"
          + environment_variables            = (known after apply) <=============
          + gcf_uri                          = (known after apply)
          + ingress_settings                 = "ALLOW_ALL"
          + max_instance_count               = 100
          + max_instance_request_concurrency = (known after apply)
          + min_instance_count               = 1
          + service                          = (known after apply)
          + service_account_email            = (known after apply)
          + timeout_seconds                  = 60
          + uri                              = (known after apply)
        }
    }

And the environment_variables is back at the plan for the configuration. It has (known after apply) because the values of the map depend on resources that have not been created yet

Adding a log output in the code we can see the state of the variables

[DEBUG] provider.terraform-provider-google: 2024/07/30 23:56:19 [DEBUG] DiffSuppress  k:"service_config.0.environment_variables.%" old:"" new:""

service_config.0.environment_variables.% has the length of the Map and old and new values are empty because they do not exist yet.

somehow this code is saying that on the plan case there is no diff between not having a map and the new map.

changing the condition

from

if strings.HasPrefix(k, "service_config.0.environment_variables.%")

to

if strings.HasPrefix(k, "service_config.0.environment_variables.%") && new != ""

seems to fix the issue, and plan, apply, apply with changes, and destroy work as expected.

The same strategy of ignoring the size is used in other places to deal with labels that have key value pair added by the API when returning results.

I would like to pinpoint why this code snippet is failing in this case but seems to work in other cases.

@daniel-cit
Copy link
Contributor

there is an open issue in the provider for this hashicorp/terraform-provider-google#18747

@eeaton
Copy link
Collaborator Author

eeaton commented Aug 1, 2024

Thanks for the investigation and detailed writeup on the provider issue, Daniel.
I've also escalated this on the internal bug with your assessment.

As a quick fix to unblock this repo, I'm trying PR #1311 to pin the provider version used in the cai_monitoring module to the last good version

@daniel-cit
Copy link
Contributor

PR opened in magic-modules GoogleCloudPlatform/magic-modules#11333

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants