Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Main review mods 2023 12 06 #55

Open
wants to merge 151 commits into
base: main-review-2023-11-17
Choose a base branch
from

Conversation

tlhumphrey2
Copy link

No description provided.

Ubuntu and others added 30 commits September 21, 2023 13:15
… folder structure by replacing the subfolders of modules, i.e. aks, logging, storage, and vnet, with the folders in the root by the same names, i.e. aks, logging, storage, and vnet. Also, placed the contents of the folder, 'hpcc' in the root and deleted the folder.
…d branch 'tlh-bug-fixes-and-personalize-easy-deploy-v0'
…nch: tlh-bug-fixes-and-personalize-easy-deploy-v0.

Also, in hpcc/hpcc.tf module 'hpcc' now sources Godji's opinionated hpcc instead my my local clone of it. As of this
date, 2023/09/22, Godji has not merged a change I made to my local opinionated. So, hpcc will not deploy until he merges
or you switch to sourcing a local clone of my opnionated hpcc.
…hing

branch:add-terraform-to-deploy-everything
…hing

branch:add-terraform-to-deploy-everything
…ariables

branch:aks-is-now-using-easy-deploy-variables
…ariables

branch:aks-is-now-using-easy-deploy-variables
… so lite.auto.tfvars is correctly copied to hpcc and/or aks directories.
…ariables

branch:aks-is-now-using-easy-deploy-variables. Changed scripts/deploy…
…ariables

branch:aks-is-now-using-easy-deploy-variables
…ariables

branch:aks-is-now-using-easy-deploy-variables
Copy link

@dcamper dcamper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few new comments, but be aware that there are still older outstanding conversations as well.

aks/locals.tf Outdated
node_type_version = "v2" # v1, v2
node_size = var.aks_serv_node_size
single_group = false
min_capacity = 4
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify this min_capacity value. Is it really needed? I am somewhat suspicious of it, if we have trimmed the HPCC cluster down at all then we may not need 4 nodes at minimum.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the nodepool called 'hpccpool'. This is the only nodepool used by the hpcc cluster when 'aks_4nodepools=false'. I set 'min_capacity=4' because the terraform was always using 4 nodepools before I added this new option. I assume you want to know if the hpcc cluster can use less than 4 nodes. I have verified that it can.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify the previous comment.

variable "aks_azure_region" {
type = string
description = "REQUIRED. The Azure region abbreviation in which to create these resources.\nMust be one of [\"eastus\", \"eastus2\", \"centralus\"].\nExample entry: eastus2"
validation {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is noted elsewhere, but: We should remove the region constraints. What is there now, which was brought over from old code, pertains to our group and not to OSS. Users should be free to place their clusters anywhere in the world.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

}

variable "aks_enable_roxie" {
description = "REQUIRED. Enable ROXIE?\nThis will also expose port 8002 on the cluster.\nExample entry: false"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this port number correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Should have been 18002.

#------------------------------------------------------------------------------

# Enable ROXIE?
# This will also expose port 8002 on the cluster.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 8002 correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Changed to 18002.

README.md Show resolved Hide resolved
@@ -1,6 +1,7 @@
module "hpcc" {
#source = "[email protected]:hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"
source = "github.com/hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"
#source = "/home/azureuser/tlhumphrey2/RBA-terraform-azurerm-hpcc"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented-out code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some commented-out code I don't want to remove until we are fairly sure most everything works properly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In aks/locals.tf, I have changed to min_capacity of hpccpool and servpool.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this conversation re: commented-out code as unresolved until it is completed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have spoke with Godji and Milind. Godji said that he has the role of Owner and has never tried a role with fewer permissions than Owner. Milind says he has always used TFE to deploy the RBA hpcc terraform, but, believes if you use the terraform outside of TFE you would need elevated permissions.

@@ -30,11 +30,11 @@ module "metadata" {
project = local.metadata.project
}

resource "null_resource" "delete_ephemeral_storage_accounts" {
/*resource "null_resource" "delete_ephemeral_storage_accounts" {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove commented-out code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some commented-out code I don't want to delete until we are sure everything works.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving unresolved until the commented-out code is removed.

}

variable "aks_enable_roxie" {
description = "REQUIRED. Enable ROXIE?\nThis will also expose port 8002 on the cluster.\nExample entry: false"
description = "REQUIRED if you want roxie. Enable ROXIE?\nThis will also expose port 18002 on the cluster.\nExample entry: false"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"if you want roxie" is incorrect, strictly speaking. If someone omitted the value entirely, because they don't want a Roxie, would the plan fail?

Also, note the capitalization of "Roxie"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. lite-variables.tf no longer says roxie is required.

}

variable "aks_enable_roxie" {
description = "REQUIRED. Enable ROXIE?\nThis will also expose port 8002 on the cluster.\nExample entry: false"
description = "REQUIRED if you want roxie. Enable ROXIE?\nThis will also expose port 18002 on the cluster.\nExample entry: false"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: ECL Watch now uses port 443 instead of 18010, correct? This same thing needs to happen with Roxie: It should use port 443 instead of 18002. 443 is the default https port.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't changed to port 443, because when I did, the terraform didn't work properly. So, I'm waiting until we have most issues solved before trying again.

aks/locals.tf Outdated
node_size = var.aks_node_sizes.roxie
single_group = false
min_capacity = 1
max_capacity = 3
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tim, please address the previous questions.

aks/locals.tf Outdated
node_type_version = "v2" # v1, v2
node_size = var.aks_serv_node_size
single_group = false
min_capacity = 4
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify the previous comment.

@@ -0,0 +1,16 @@
#!/bin/bash
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has the previous issue been resolved? Are any storage accounts created and then destroyed if the user only wants ephemeral storage?

lite-locals.tf Show resolved Hide resolved
lite-locals.tf Outdated
}
workerMemory = {
query = "3G"
thirdParty = "500M"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query value is really workerResources.memory - thirdParty, rounded down. Please change the hardcoded 3G to that computation.

The goal of this block is to divide workResources.memory into "memory for Thor" and "memory for 3rd party tools".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed workerMemory.query to the same value as workerResources.memory. But, I don't know what you want thirdParty to be?????

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your commit 81eca56 you changed the query value to exactly the same value as workerResources.memory, which is not what was requested. Your change runs the risk of over-allocation of RAM, leading to an Out Of Memory exception and a killed pod.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit "ed598d8".

README.md Show resolved Hide resolved
…ols are set to this. Also, fixed workerMemory.query and workerMemory.thirdParty.
lite-locals.tf Outdated
query = format("%dG", local.thor_worker_ram)
thirdParty = "500M"
query = format(%dG", local.workerMemory_query)
thirdParty = format(%dG", local.workerMemory_thirdParty)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the same as the original. The hardcoded value was "500M" but your computed value is "500G" -- definitely an error or OOM condition if a Thor job starts.

Ubuntu added 2 commits January 5, 2024 15:38
…capacity changed for hpccpool & servpool. In lite-locals.tf, Corrected workerMemory values.
@tlhumphrey2
Copy link
Author

tlhumphrey2 commented Jan 5, 2024 via email

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
@@ -1,6 +1,7 @@
module "hpcc" {
#source = "[email protected]:hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"
source = "github.com/hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"
#source = "/home/azureuser/tlhumphrey2/RBA-terraform-azurerm-hpcc"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this conversation re: commented-out code as unresolved until it is completed.

@@ -30,11 +30,11 @@ module "metadata" {
project = local.metadata.project
}

resource "null_resource" "delete_ephemeral_storage_accounts" {
/*resource "null_resource" "delete_ephemeral_storage_accounts" {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving unresolved until the commented-out code is removed.

@@ -0,0 +1,16 @@
#!/bin/bash
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please continue to work on this issue; leaving unresolved.

…nd 2) the need for 28 vCPUs before deploying HPCC.
Copy link

@dcamper dcamper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new comments, but there are still unresolved conversations. In addition, the issue of not being able to deploy into the eastus2 region is still unresolved.

@tlhumphrey2
Copy link
Author

tlhumphrey2 commented Jan 8, 2024 via email

Ubuntu added 3 commits January 10, 2024 18:40
…ged availability_zone from 1 to 2. Why? In eastus2, getting error message saying hpccpool could not be created because constrints too restrictive.
…y? 300 was not enough time for HPCC to fully deploy on my azure vm Standard_B2s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants