Pull request #55: Main review mods 2023 12 06
Base branch: main-review-2023-11-17
@@ -1,6 +1,7 @@
 module "aks" {
   depends_on = [random_string.string]
-  source = "git@github.com:hpccsystems-solutions-lab/tlh-oss-terraform-azurerm-aks.git"
+  #source = "git@github.com:hpccsystems-solutions-lab/tlh-oss-terraform-azurerm-aks.git"
+  source = "https://github.com/hpccsystems-solutions-lab/tlh-oss-terraform-azurerm-aks.git"
   #source = "/home/azureuser/temp/OSS/terraform-azurerm-aks"

   providers = {
@@ -13,8 +13,8 @@ locals {
   node_type_version = "v2"
   node_size = var.aks_node_sizes.roxie
   single_group = false
-  min_capacity = var.aks_capacity.roxie_min
-  max_capacity = var.aks_capacity.roxie_max
+  min_capacity = 1
+  max_capacity = 3
[Review thread]
dcamper: (2) Shouldn't this be determined dynamically, based on the size of the cluster the user wants?
tlh: Added new variable, 'aks_capacity', which defines the minimum and maximum number of nodes in each node pool.
dcamper: This size of a Roxie pool is still hardwired. Is that correct?
tlh: Fixed.
dcamper: This is not fixed, as of this writing.
tlh (edited into the comment above): I don't have any roxie variables I can use to calculate the roxiepool max_capacity.
dcamper: Please do not edit anyone else's comments; reply instead. Your comment is a good reply to my original question ("This size of a Roxie pool is still hardwired. Is that correct?"). Is a maximum capacity of 3 appropriate, though? Assuming that Roxie's VM requirements are hardcoded from the helm chart, do you need more than 3 nodes in the pool? Elsewhere you correctly asserted that the maximum capacity costs no extra money, but it does impose a limit on the resources that can be allocated. If Roxie's helm requirements cause more than 3 nodes to be created, then everything will fail. This is worth double-checking. Related: if you can determine that you will never need 3 nodes, why not reduce this 3 to whatever is really needed?
tlh: Understood.
dcamper: Tim, please address the previous questions.
tlh: The max_capacity of all hpcc nodepools is var.aks_nodepools_max_capacity (see commit ed598d8); by default the value is 400. My reasoning is that a large max_capacity doesn't cost more; the cost increases only when you use nodes. For example, say your min_capacity = 1 but, during the execution of hpcc, it needs 10 more nodes: the capacity is increased to 11, and you are charged for 11 nodes.
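For reference, a minimal sketch of the variable tlh describes above; the name and default come from the thread, but the exact declaration in commit ed598d8 is an assumption, not a quote:

    # Hypothetical declaration; the real one lives in commit ed598d8.
    variable "aks_nodepools_max_capacity" {
      description = "Upper bound on nodes per HPCC node pool; only provisioned nodes are billed."
      type        = number
      default     = 400
    }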
   labels = {
     "lnrs.io/tier" = "standard"
     "workload" = "roxiepool"
@@ -31,8 +31,8 @@ locals {
   node_type_version = "v2" # v1, v2
   node_size = var.aks_node_sizes.thor
   single_group = false
-  min_capacity = var.aks_capacity.thor_min
-  max_capacity = var.aks_capacity.thor_max
+  min_capacity = 1
+  max_capacity = var.aks_thorpool_max_capacity
   labels = {
     "lnrs.io/tier" = "standard"
     "workload" = "thorpool"
@@ -48,8 +48,8 @@ locals {
   node_type_version = "v1"
   node_size = var.aks_node_sizes.serv
   single_group = false
-  min_capacity = var.aks_capacity.serv_min
-  max_capacity = var.aks_capacity.serv_max
+  min_capacity = 1
+  max_capacity = 3
   labels = {
     "lnrs.io/tier" = "standard"
     "workload" = "servpool"
@@ -65,8 +65,8 @@ locals {
   node_type_version = "v1"
   node_size = var.aks_node_sizes.spray
   single_group = false
-  min_capacity = var.aks_capacity.spray_min
-  max_capacity = var.aks_capacity.spray_max
+  min_capacity = 0
+  max_capacity = 6

[dcamper marked this conversation as resolved.]
   labels = {
     "lnrs.io/tier" = "standard"
     "workload" = "spraypool"
@@ -1,5 +1,6 @@
 module "hpcc" {
-  source = "git@github.com:hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"
+  #source = "git@github.com:hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"
+  source = "https://github.com/hpccsystems-solutions-lab/tlh-opinionated-terraform-azurerm-hpcc.git"

   environment = local.metadata.environment
   productname = local.metadata.product_name
@@ -1,5 +1,63 @@
+output "thor_max_jobs" {
[Review thread]
dcamper: Is this output (and the next several lines) for debugging purposes, or did you intend to have them remain for the final product? If the latter, please update the documentation.
tlh: These outputs are for debugging. Regarding your comments about the min_capacity and max_capacity in aks/locals.tf: all min_capacitys are 1, and only the max_capacity for thorpool is variable, because it is the only one that I can calculate.
dcamper: OK. I will keep this conversation as unresolved until the debugging statements are removed.
+  value = var.thor_max_jobs
+}
+output "thor_num_workers" {
+  value = var.thor_num_workers
+}
+output "thor_node_size" {
+  value = var.aks_node_sizes.thor
+}
+output "thor_ns_spec" {
+  value = local.ns_spec[var.aks_node_sizes.thor]
+}
+output "thor_worker_cpus" {
+  value = var.thor_worker_cpus
+}
+output "thorWorkersPerNode" {
+  value = "local.ns_spec[${var.aks_node_sizes.thor}].cpu / var.thor_worker_cpus = ${local.thorWorkersPerNode}"
+}
+output "thor_worker_ram" {
+  value = "local.ns_spec[${var.aks_node_sizes.thor}].ram / local.thorWorkersPerNode = ${local.thor_worker_ram}"
+}
+output "nodesPer1Job" {
+  value = "var.thor_num_workers / local.thorWorkersPerNode = ${local.nodesPer1Job}"
+}
+output "thorpool_max_capacity" {
+  value = "local.nodesPer1Job * var.thor_max_jobs = ${local.thorpool_max_capacity}"
+}
 locals {
+  helm_chart_timeout=600
+  ns_spec = {

[dcamper marked this conversation as resolved.]
+    "large" = {
+      cpu = 2
+      ram = 8
+    }
+    "xlarge" = {
+      cpu = 4
+      ram = 16
+    }
+    "2xlarge" = {
+      cpu = 8
+      ram = 32
+    }
+    "4xlarge" = {
+      cpu = 16
+      ram = 64
+    }
+  }
+  twpn = "${ local.ns_spec[var.aks_node_sizes.thor].cpu / var.thor_worker_cpus }"
+  thorWorkersPerNode = ceil(local.twpn) == local.twpn ? local.twpn : "local.thorWorkersPerNode, ${local.twpn}, is not an integer because local.ns_spec[${var.aks_node_sizes.thor}].cpu, ${local.ns_spec[var.aks_node_sizes.thor].cpu}, is not a multiple of var.thor_worker_cpus, ${var.thor_worker_cpus}."

+  twr = "${ local.ns_spec[var.aks_node_sizes.thor].ram / local.thorWorkersPerNode }"
+  thor_worker_ram = ceil(local.twr) == local.twr ? local.twr : "local.thor_worker_ram, ${local.twr}, is not an integer because local.ns_spec[${var.aks_node_sizes.thor}].ram, ${local.ns_spec[var.aks_node_sizes.thor].ram}, is not a multiple of local.thorWorkersPerNode, ${local.thorWorkersPerNode}."

+  np1j = "${ var.thor_num_workers / local.thorWorkersPerNode }"
+  nodesPer1Job = ceil(local.np1j) == local.np1j ? local.np1j : "local.nodesPer1Job, ${local.np1j}, is not an integer because var.thor_num_workers, ${var.thor_num_workers}, is not a multiple of local.thorWorkersPerNode, ${local.thorWorkersPerNode}."

+  thorpool_max_capacity = ceil("${ local.nodesPer1Job * var.thor_max_jobs }")
[Review thread]
dcamper: This does not work in all cases (one Thor worker, one max job); nodesPer1Job becomes 0.5 and fails validation.
tlh: Fixed.
tlh: I'm responding to your comment, "It is not fixed. See line 94, below, as a simple example." I'm not sure what you're talking about, but I believe it might be code that was commented out (you want me to delete it, and I have).
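To make the 0.5 case above concrete, here is the arithmetic under assumed inputs (node size "xlarge", thor_worker_cpus = 2, thor_num_workers = 1, thor_max_jobs = 1; all values hypothetical):

    # ns_spec["xlarge"] = { cpu = 4, ram = 16 }   (from the map above)
    # thorWorkersPerNode = 4 / 2  = 2             integer, so the guard passes
    # thor_worker_ram    = 16 / 2 = 8             integer, so the guard passes
    # nodesPer1Job       = 1 / 2  = 0.5           not an integer: the guard replaces
    #                                             the value with its error string
    # thorpool_max_capacity = ceil(0.5 * 1) = 1   reached only if nodesPer1Job stays
    #                                             numeric; one hedged fix would be
    #                                             nodesPer1Job = ceil(local.np1j)

Note that the outer ceil() alone does not cover the fractional case, because nodesPer1Job has already been replaced by the error string by the time it is multiplied.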
-  helm_chart_timeout=300
-  #hpcc_version = "8.6.20"

   owner = {
     name = var.admin_username
@@ -8,17 +66,80 @@ locals {

  owner_name_initials = lower(join("",[for x in split(" ",local.owner.name): substr(x,0,1)]))

  tags = var.extra_tags
  /*metadata = {

[Review thread]
dcamper: Please delete commented-out code.
tlh: Fixed.
dcamper: It is not fixed. See line 94, below, as a simple example.
    project = format("%shpccplatform", local.owner_name_initials)
    product_name = format("%shpccplatform", local.owner_name_initials)
    business_unit = "commercial"
    environment = "sandbox"
    market = "us"
    product_group = format("%shpcc", local.owner_name_initials)
    resource_group_type = "app"
    sre_team = format("%shpccplatform", local.owner_name_initials)
    subscription_type = "dev"
    additional_tags = { "justification" = "testing" }
    location = var.aks_azure_region # Acceptable values: eastus, centralus
  }

  tags = merge(local.metadata.additional_tags, var.extra_tags)
  */
  # # disable_naming_conventions - Disable naming conventions
  # # disable_naming_conventions = true
  disable_naming_conventions = false

  # # auto_launch_eclwatch - Automatically launch ECLWatch web interface.
  #auto_launch_eclwatch = true
  auto_launch_svc = {
    eclwatch = false
  }

  # azure_auth = {
  #   # AAD_CLIENT_ID = ""
  #   # AAD_CLIENT_SECRET = ""
  #   # AAD_TENANT_ID = ""
  #   # AAD_PRINCIPAL_ID = ""
  #   SUBSCRIPTION_ID = ""
  # }

  # hpcc_container = {
  #   version = "9.2.0"
  #   image_name = "platform-core-ln"
  #   image_root = "jfrog.com/glb-docker-virtual"
  #   # custom_chart_version = "9.2.0-rc1"
  #   # custom_image_version = "9.2.0-demo"
  # }

  # hpcc_container_registry_auth = {
  #   username = "value"
  #   password = "value"
  # }

  internal_domain = var.aks_dns_zone_name // Example: hpcczone.us-hpccsystems-dev.azure.lnrsg.io

  external = {}
  # external = {
  #   blob_nfs = [{
  #     container_id = ""
  #     container_name = ""
  #     id = ""
  #     resource_group_name = var.storage_account_resource_group_name
  #     storage_account_id = ""
  #     storage_account_name = var.storage_account_name
  #   }]
  #   # hpc_cache = [{
  #   #   id = ""
  #   #   path = ""
  #   #   server = ""
  #   # }]
  #   hpcc = [{
  #     name = ""
  #     planes = list(object({
  #       local = ""
  #       remote = ""
  #     }))
  #     service = ""
  #   }]
  # }

  admin_services_storage_account_settings = {
    replication_type = "ZRS" #LRS only if using HPC Cache
@@ -43,6 +164,12 @@ locals {
       delete_protection = false
     }
   }
+  # hpc_cache = {
+  #   enabled = false
+  #   size = "small"
+  #   cache_update_frequency = "3h"
+  #   storage_account_data_planes = null
+  # }
 }
 external = null
 }
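The "ZRS #LRS only if using HPC Cache" comment above encodes a constraint in prose. A hedged alternative, assuming the commented-out hpc_cache block is ever enabled, would derive the replication type from that toggle directly:

    # Hypothetical sketch: tie replication_type to the HPC Cache toggle
    # instead of a comment (local.hpc_cache does not exist yet in this PR).
    replication_type = try(local.hpc_cache.enabled, false) ? "LRS" : "ZRS"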
@@ -63,6 +190,36 @@ locals {
   replicas = 6

[dcamper marked this conversation as resolved.]

   nodeSelector = "spraypool"
 }
+  # ldap = {
+  #   ldap_server = "" //Server IP
+  #   dali = {
+  #     hpcc_admin_password = ""
+  #     hpcc_admin_username = ""
+  #     ldap_admin_password = ""
+  #     ldap_admin_username = ""
+  #     adminGroupName = "HPCC-Admins"
+  #     filesBasedn = "ou=files,ou=eclHPCCSysUser,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     groupsBasedn = "OU=AADDC Users,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     resourcesBasedn = "ou=smc,ou=espservices,ou=eclHPCCSysUser,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     systemBasedn = "OU=AADDC Users,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     usersBasedn = "OU=AADDC Users,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     workunitsBasedn = "ou=workunits,ou=eclHPCCSysUser,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #   }
+  #   esp = {
+  #     hpcc_admin_password = ""
+  #     hpcc_admin_username = ""
+  #     ldap_admin_password = ""
+  #     ldap_admin_username = ""
+  #     adminGroupName = "HPCC-Admins"
+  #     filesBasedn = "ou=files,ou=eclHPCCSysUser,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     groupsBasedn = "OU=AADDC Users,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     resourcesBasedn = "ou=smc,ou=espservices,ou=eclHPCCSysUser,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     systemBasedn = "OU=AADDC Users,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     usersBasedn = "OU=AADDC Users,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #     workunitsBasedn = "ou=workunits,ou=eclHPCCSysUser,dc=z0lpf,dc=onmicrosoft,dc=com"
+  #   }
+  # }
   roxie_internal_service = {
     name = "iroxie"
@@ -100,6 +257,7 @@ locals {
   disabled = (var.aks_enable_roxie == true)? false : true
   name = "roxie"
   nodeSelector = { workload = "roxiepool" }
+  # tlh 20231109 numChannels = 2
   numChannels = 1
   prefix = "roxie"
   replicas = 2
@@ -249,7 +407,7 @@ locals {
   type = "hthor"
   spillPlane = "spill"
   resources = {
-    cpu = "2"
+    cpu = "1"
     memory = "4G"
   }
   nodeSelector = { workload = "servpool" }
@@ -323,6 +481,7 @@ locals {
   throttle = 0
   retryinterval = 6
   keepResultFiles = false
+  # egress = "engineEgress"
 }

 dfuwu-archiver = {
@@ -336,6 +495,7 @@ locals {
   cutoff = 14
   at = "* * * * *"
   throttle = 0
+  # egress = "engineEgress"
 }

 dfurecovery-archiver = {
@@ -344,6 +504,7 @@ locals {
   limit = 20
   cutoff = 4
   at = "* * * * *"
+  # egress = "engineEgress"
 }

 file-expiry = {
@@ -353,6 +514,7 @@ locals {
   persistExpiryDefault = 7
   expiryDefault = 4
   user = "sasha"
+  # egress = "engineEgress"
 }
 }
@@ -385,15 +547,16 @@ locals {
   maxGraphs = 2
   maxGraphStartupTime = 172800
   numWorkersPerPod = 1
   #nodeSelector = {}
   nodeSelector = { workload = "thorpool" }
   egress = "engineEgress"
   tolerations_value = "thorpool"
   managerResources = {
-    cpu = 2
-    memory = "4G"
+    cpu = 1
+    memory = "2G"
   }
   workerResources = {
-    cpu = 4
+    cpu = 3
[Review thread]
dcamper: Is 3 the best number here? (I thought it was 2.)
tlh: Now, both cpu and memory for 'workerResources' are variable.
     memory = "4G"
   }
   workerMemory = {
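A minimal sketch of what "both cpu and memory for 'workerResources' are variable" might look like, reusing names that already appear in this PR (var.thor_worker_cpus, local.thor_worker_ram); the exact wiring is an assumption:

    # Hypothetical: derive worker resources from the thorpool capacity inputs
    # instead of hardcoding cpu = 3 and memory = "4G".
    workerResources = {
      cpu    = var.thor_worker_cpus                  # e.g. 2
      memory = format("%dG", local.thor_worker_ram)  # e.g. "8G" on xlarge nodes
    }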
[Review thread]
dcamper: This should be zero -- or the entire node pool not built at all -- if Roxie is disabled.
tlh: All min_capacitys are 1 now because, when I set them to zero, they still show up on the portal as one.
dcamper: If it is appropriate for a node pool to be spun down to zero, we should set it that way, even if it winds up as a '1' in the portal afterwards. That could be a bug, or it could be intentional behavior by an upstream module that needs to be corrected. We can at least do it correctly in our code.
tlh: Fixed.
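One hedged way to express dcamper's suggestion, using the var.aks_enable_roxie flag that appears earlier in this diff (whether the upstream AKS module accepts a zero min_capacity, and the exact shape of the pool map, are assumptions):

    # Hypothetical sketch: let the roxie pool scale to zero, or drop it
    # entirely, when Roxie is disabled.
    roxiepool = {
      node_size    = var.aks_node_sizes.roxie
      min_capacity = var.aks_enable_roxie ? 1 : 0
      max_capacity = var.aks_enable_roxie ? 3 : 0
    }

    # Or omit the pool from the node-group map altogether:
    # node_groups = { for k, v in local.pools : k => v if k != "roxiepool" || var.aks_enable_roxie }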