Resource requirements
Core EM (plus HPO)¶
The following are the recommended resource requirements for deploying the core Comet product (Experiment Management + Hyper-Parameter Optimization).
Single Server Installation¶
- 16 vCPUs
- 32GB RAM (64GB Recommended)
- 1TB Root Disk space
The disk space allocation may be adjusted downward if you plan on storing experiment data on another partition or externally on S3.
If you're using a public cloud provider, the following instance types are recommended for use:
- AWS instance type: m6i.4xlarge
- Azure instance type: D16_v5
- GCP instance type: n2-standard-16
Kubernetes Deployment¶
For a Kubernetes deployment, we recommend at least two dedicated nodes provisioned with:
- 16 vCPUs each
- When using bare metal, vCPUs are equivalent to CPU threads; thus a hyper-threaded 8-core CPU counts as 16 vCPUs.
- 64GB Memory each
If you're using a public cloud provider, we recommend using one of the node types listed in the Single Server Installation section above.
Why Two Nodes?¶
Kubernetes operates on a principle of high availability, and to get the most out of it we recommend running at least two replicas of each Comet component at all times. This ensures that if one node goes down, the other can take over without any downtime. Although it is possible to run Comet on a single node with only a single replica of each component, if that is what you want you should likely consider the Single Server Installation instead.
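As a minimal sketch of how that availability is commonly enforced on the Kubernetes side, a standard PodDisruptionBudget keeps at least one replica of a component running during voluntary disruptions such as node drains; the object name and label selector below are hypothetical and should be matched to your Comet component's pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: comet-frontend-pdb        # hypothetical name
spec:
  minAvailable: 1                 # never voluntarily evict below one running replica
  selector:
    matchLabels:
      app: comet-frontend         # hypothetical label; adjust to the component's pods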
Why Dedicated Nodes?¶
Comet is a resource-intensive application, and as such, it is recommended to have dedicated nodes for Comet. This is to ensure that Comet has the resources it needs to run smoothly, without being impacted by other applications running on the same node (AKA: The Noisy Neighbor Problem).
It is possible to run Comet on a shared node, but in that case we strongly recommend giving Comet resource reservations (requests) equivalent to the resources it would have with dedicated nodes. Even then, you may still experience difficulties when rolling out new updates of Comet or restarting the Comet deployments.
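As an illustration of the dedicated-node pattern (not Comet-specific chart values), the pod spec fragment below assumes the Comet nodes have been labeled dedicated=comet and tainted with a matching NoSchedule taint; the label, taint key, and value are hypothetical, and the keys your Comet Helm chart exposes for this may differ:

spec:
  nodeSelector:
    dedicated: comet            # only schedule onto nodes labeled dedicated=comet
  tolerations:
    - key: dedicated            # tolerate the NoSchedule taint applied to those nodes
      operator: Equal
      value: comet
      effect: NoSchedule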
Opik¶
When enabled as a feature component of Comet, Opik requires either a Kubernetes cluster or a Kubernetes-style single-server installation.
Opik can share the same pool of resources as Comet EM & HPO, but it increases the overall amount of resources needed.
Our median resource recommendation for Opik (alone) is:
- 4 vCPUs
- 8 GiB Memory
When using a single node, it is essential to increase the size of the node to accommodate Opik comfortably. When using a multi-node cluster, there is a greater degree of flexibility in how these requirements can be met.
Comet Compute Panels¶
Comet Compute Panels requires either a Kubernetes cluster or a Kubernetes-style single-server installation, since it depends on the Kubernetes API.
From a resource perspective, Compute Panels is very lightweight when idle or when running a few simple panels concurrently, so no additional resources are required beyond those already needed for EM. However, significant usage or resource-intensive panels can quickly consume or exceed those resources and may require additional capacity to be provisioned.
We strongly recommend isolating these workloads via either Node Isolation or (when using namespace quotas) Namespace Isolation, so that greedy panels cannot jeopardize the stability of your overall Comet installation.
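For the Namespace Isolation option, a standard Kubernetes ResourceQuota in the namespace that runs the panels caps how much they can consume; the namespace name and the quota figures below are illustrative assumptions to size for your own usage:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-panels-quota
  namespace: comet-compute-panels   # hypothetical namespace dedicated to panel workloads
spec:
  hard:
    requests.cpu: "8"               # total CPU the panels may request
    requests.memory: 16Gi           # total memory the panels may request
    limits.cpu: "16"                # hard ceilings on what the panels may use
    limits.memory: 32Gi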
Model Production Monitoring (MPM) ¶
MPM can only be deployed on a Kubernetes Cluster, as its architecture is not optimized for single node deployments.
Workloads¶
Our general recommendation is to have 3 MPM pods, with access to 16 vCPUs/Cores and 32GiB of Memory/RAM each. This should support an average utilization of about 4500 predictions per second.
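These figures would typically be expressed in your Helm values; the sketch below assumes an mpm values section analogous to the druid one shown later on this page, so treat the key names as assumptions and check your chart's values file for the exact structure:

# Sketch only: the mpm key and layout are assumed, mirroring the Druid example below.
mpm:
  replicaCount: 3        # 3 MPM pods
  resources:
    requests:
      cpu: 16            # per-pod CPU, matching the 16 vCPU recommendation
      memory: 32Gi       # per-pod memory, matching the 32 GiB recommendation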
Nodes¶
As usual, we recommend isolating your MPM pods from other workloads on your Kubernetes cluster, or setting aggressive resource reservations/requests to ensure optimal performance.
Ideally, these should also run in a separate node pool from the one used for the core Comet application/Experiment Management. This is not necessary if you are not using the EM product (or use it only minimally), or if you set aggressive resource reservations/requests for MPM.
We recommend a dedicated pool of three 16 vCPU/32 GiB nodes (or whichever number matches your MPM replicaCount). See the Druid database section on Nodes for further details about node pools.
While it is possible to run the MPM application workloads in the same node pool as the other Comet EM workloads without aggressive minimums, this is not the case with the MPM data layers. Be sure to create dedicated node pools for both Druid and Airflow:
Druid Nodes¶
We need four VMs to run the Druid stack. The specifications for these VMs are as follows:
- Number of Instances: 4
- Specifications per Instance:
- vCPUs: 8
- Memory: 32 GiB
- Storage: 1000Gi
Airflow Nodes¶
We need three VMs to run the Airflow stack. The specifications for these VMs are as follows:
- Number of Instances: 3
- Specifications per Instance:
- vCPUs: 2
- Memory: 4 GiB
Storage Resources¶
- PV Storage: 1400 GiB (for Druid PVCs)
- Node Storage: 100 GiB per node (4 Druid + 3 Airflow nodes = 700 GiB)
- Total Storage: 2100 GiB
These specifications ensure that our Druid and Airflow stacks run efficiently with the required resources for processing and querying data.
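As one example of provisioning these pools, if you run on EKS and manage node groups with eksctl, a configuration along the following lines would match the Druid and Airflow specifications above; the cluster name, region, instance types, and taint keys are assumptions to adapt to your environment, and other providers offer equivalent node-pool mechanisms:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: comet-mpm                 # hypothetical cluster name
  region: us-east-1               # hypothetical region
managedNodeGroups:
  - name: druid                   # 4 x 8 vCPU / 32 GiB nodes for the Druid stack
    instanceType: m6i.2xlarge
    desiredCapacity: 4
    volumeSize: 100               # node storage in GiB
    labels: { workload: druid }
    taints:
      - key: workload
        value: druid
        effect: NoSchedule        # keep other workloads off the Druid nodes
  - name: airflow                 # 3 x 2 vCPU / 4 GiB nodes for the Airflow stack
    instanceType: t3.medium
    desiredCapacity: 3
    volumeSize: 100
    labels: { workload: airflow }
    taints:
      - key: workload
        value: airflow
        effect: NoSchedule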
Druid Workloads¶
In cases of excessively high (or low) utilization, you may want to adjust the resource requests/limits for the Druid pods. This can be done from the Helm values, in the Druid section alongside the other dependency configuration. This section also permits setting pod replica counts. Example:
# ...
druid:
  # ...
  broker:
    replicaCount: 2
    resources:
      requests:
        cpu: 3            # Adjust CPU request to 3 cores
        memory: 12Gi      # Adjust memory request to 12Gi
      limits:
        cpu: 4            # Optionally set a CPU limit
        memory: 14Gi      # Adjust memory limit to 14Gi
# ...
WARNING: When changing replicas or resource requests/limits, you will need to adjust your node counts and/or sizes to accommodate them. When setting aggressive resource reservations, you must have either spare nodes or much larger nodes if you wish to maintain availability while updating the pods; otherwise you will not have enough capacity to run more than your configured pod count, and will need to scale down and back up to replace pods.