I’m currently working a lot with Azure Kubernetes Service (AKS) again, and since I’m trying to determine the best setup for nodes and their settings, I started testing the “Cluster Autoscaler” a bit more, along with the rather new “Multiple Node Pools” feature.
The cluster autoscaler has been around for a while now, so it’s basically working fine, but it’s still in preview. Multiple node pools are pretty new and also in preview.
If you are not familiar with those two features:
https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler
https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools
So let me share a few issues I ran into and how I solved them…or not.
Cluster Autoscaler
Max Pod Size
The autoscaler itself isn’t giving me many issues right now, and the main reason I’m using it is that I don’t want to keep scaling my nodes manually. The reason I have to scale so often is more or less the default maximum of 30 pods per node that AKS comes with.
You cannot change the max pod limit of 30 in the portal. So if you want to increase it, you have to create your AKS cluster via the Azure CLI. You could also use an ARM template for it.
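For example, creating the cluster with a higher pod limit via the CLI could look like this; the resource group and cluster name are placeholders:

# Sketch: create a cluster with a higher per-node pod limit (default is 30)
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 1 \
  --max-pods 50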
Setup
In order to use the autoscaler you have to add the aks-preview CLI extension and also register the virtual machine scale set (VMSS) preview feature. You will need VMSS for the multiple node pools too.
After that you have to re-register the Microsoft.ContainerService provider so the feature registration propagates. You can see all of that in the two links I posted above.
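For reference, the preview setup looked roughly like this for me. The feature names (VMSSPreview, MultiAgentpoolPreview) are the ones from the preview docs at the time of writing, so check the links above in case they changed:

# Add the preview extension for the new AKS CLI commands
az extension add --name aks-preview

# Register the preview features (names as of the current preview docs)
az feature register --name VMSSPreview --namespace Microsoft.ContainerService
az feature register --name MultiAgentpoolPreview --namespace Microsoft.ContainerService

# Re-register the provider so the feature registrations propagate
az provider register --namespace Microsoft.ContainerService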
Minimum number of nodes
Before you enable the autoscaler, think about the minimum number of nodes you want, because you can’t raise the minimum above the value it is currently set to. So if you start with 1 like I did, you can’t change it to 2 or more later. It will scale down to 1 on low load even if you know that 2 would be a better fit due to load spikes. The good news is that this is only the case during the preview.
Also be aware that scaling up takes a few minutes, since a new VM has to be spun up and registered as a node. For me, getting a new node took up to 5 minutes.
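So if you already know a higher minimum fits your load better, set it at creation time. A minimal sketch, assuming placeholder names and the preview-era --enable-vmss flag:

# Sketch: enable the autoscaler at creation with a minimum of 2 nodes,
# since the minimum can’t be raised later during the preview
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-vmss \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 5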
Vnet Subnet Permission
Another issue I ran into: the first time the autoscaler tried to scale, nothing happened. So I checked the autoscaler status with:
kubectl -n kube-system describe configmap cluster-autoscaler-status
It showed this error about missing permissions to access the vnet subnet:
It does not have permission to perform action ‘Microsoft.Network/virtualNetworks/subnets/join/action’ on the linked scope…
So I went to the vnet I selected when creating the AKS cluster, opened the subnet I chose for the nodes, and added the service principal with the role “Network Contributor”.
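You can do the same assignment via the CLI; a sketch with placeholder names, where the assignee is the app ID of the service principal the cluster runs under:

# Sketch: grant the cluster’s service principal access to the node subnet
SUBNET_ID=$(az network vnet subnet show \
  --resource-group myResourceGroup \
  --vnet-name myVnet \
  --name myNodeSubnet \
  --query id -o tsv)

az role assignment create \
  --assignee <sp-app-id> \
  --role "Network Contributor" \
  --scope $SUBNET_ID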
Similar to this issue: https://github.com/Azure/AKS/issues/357
After a few minutes or a few more, it actually started scaling and creating a new node.
This seems to be a new issue, since I didn’t have that problem a month ago. Maybe it’s related to the new node pools.
Things to remember
- Think about the minimum number of nodes
- Be aware of the time a new node needs
- Check service principal permissions on the vnet subnet
- Autoscaler status with
kubectl -n kube-system describe configmap cluster-autoscaler-status
- Max pod size defaults to 30 and needs to be changed at creation via CLI or ARM template
If you are also running the new multiple node pools:
- You currently cannot change the maximum node count: you get an error telling you that changing settings on a managed cluster isn’t allowed and that you should use the node pool instead. But there is no option for the max count there, so you can only scale the node pool manually (sketched below).
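Manually scaling a pool looks roughly like this; group, cluster, and pool names are placeholders:

# Sketch: scale a node pool by hand since the max count can’t be changed right now
az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-count 3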
Multiple Node Pools
With this new feature you might run into a few more issues, and I would strongly advise thinking twice about whether you really want to test it right now.
The main reason is that it’s rather tricky to disable the feature again: there is currently no unregister CLI command, and as of right now I haven’t figured out how to unregister it. https://github.com/Azure/azure-cli/issues/8941
So if you decide to use that feature, here are my issues so far with it:
- Enabling Azure Monitor on the cluster fails in the portal and with the CLI command: “Node pool update via managed cluster not allowed. Use per nodepool operations.” (see the per-node-pool commands sketched after this list)
- Increasing the max node count when the autoscaler is enabled results in the same error as above
- The current node count in the portal doesn’t update correctly when the autoscaler scales the cluster. I was running 5 nodes but the portal was still showing 1.
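For what it’s worth, these are the kind of per-node-pool commands the error message points to; a sketch with placeholder names:

# List the node pools of a cluster
az aks nodepool list \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  -o table

# Add a second pool to the cluster
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name secondpool \
  --node-count 1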
That’s it so far, but I will update this post as I test it more and probably run into more issues.