How StarOps hosts models

Learn about how StarOps hosts models

How does StarOps deploy and host models?

StarOps connects to your AWS account to provision resources and deploy services in your designated EKS cluster, supporting large language model serving inside your VPC. StarOps agents assess your cloud environment and model deployment needs, then plan and execute the deployment of your selected model, the KServe model serving framework, a vLLM inference service, and all required dependencies. The StarOps agent also provisions cost-effective GPU instances for inference based on your model's requirements.

StarOps deploys and configures the following model serving dependencies as required:

  • KServe - provides performant, serverless inference on Kubernetes for LLMs and traditional ML models.
  • Knative - required for KServe's serverless inference mode. Knative also autoscales the inference service based on request volume and supports scaling to and from zero.
  • Istio - the recommended ingress and service mesh layer for KServe and Knative.
  • NVIDIA GPU Operator - manages GPU workloads in your Kubernetes cluster. The operator is installed when NVIDIA GPU instances are provisioned for inference.

Once StarOps has deployed and configured KServe, Knative, and the other required dependencies for an environment's initial deployment, subsequent model deployments are streamlined: only the inference service and model-specific configuration need to be deployed.
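As an illustration, a subsequent model deployment can amount to a single KServe InferenceService manifest like the sketch below. The service name, replica bounds, model format, and bucket path are hypothetical examples, not values StarOps generates; the field layout follows KServe's v1beta1 InferenceService schema.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-llm                 # hypothetical service name
spec:
  predictor:
    minReplicas: 0                  # Knative allows scale to zero when idle
    maxReplicas: 4                  # illustrative cap on replicas under load
    model:
      modelFormat:
        name: huggingface           # served here by a vLLM-backed runtime (illustrative)
      storageUri: s3://example-model-bucket/example-model/  # model weights in your S3 bucket
```

Because KServe, Knative, and Istio are already in place from the initial deployment, applying a manifest like this is all a new model rollout requires.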

Inference service

StarOps uses vLLM for serving large language models (LLMs). vLLM is designed to provide high-throughput, cost-efficient LLM inference. Using the PagedAttention algorithm, vLLM manages key-value (KV) attention memory efficiently. vLLM also supports continuous batching and several quantization methods.
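The memory saving behind PagedAttention can be shown with a back-of-the-envelope calculation. This is a simplified sketch, not vLLM internals: the block size, maximum length, and sequence lengths below are illustrative numbers.

```python
def kv_blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Number of fixed-size KV-cache blocks for one sequence.

    PagedAttention stores the KV cache in fixed-size blocks, so each
    sequence wastes at most (block_size - 1) token slots, rather than
    reserving contiguous memory for the maximum possible length.
    """
    return -(-seq_len // block_size)  # ceiling division

# Contiguous pre-allocation reserves max_len slots for every sequence;
# paged allocation reserves only the blocks each sequence actually uses.
max_len = 2048
seq_lens = [100, 500, 1500]          # illustrative in-flight sequences
contiguous_slots = len(seq_lens) * max_len
paged_slots = sum(kv_blocks_needed(s) * 16 for s in seq_lens)
```

With these numbers, contiguous allocation reserves 6144 token slots while paged allocation uses 2128, which is the kind of saving that lets vLLM batch more concurrent requests per GPU.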

You can have multiple models (inference services) deployed at the same time.

Model hosting

StarOps supports several open LLMs. When you deploy a model, StarOps stores the selected model in a dedicated S3 bucket in your AWS account.
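The S3 location of the stored model is what the inference service's storage URI points at. A small sketch of how an s3:// storage URI decomposes into a bucket and key prefix, assuming a hypothetical bucket and model path (not names StarOps creates):

```python
from urllib.parse import urlparse

def split_storage_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// storage URI into (bucket, key prefix).

    The model server downloads model artifacts from the bucket and
    prefix named by the inference service's storage URI.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical bucket and model path for illustration only.
bucket, prefix = split_storage_uri("s3://example-model-bucket/example-model/")
```

Keeping the weights in your own bucket means the model artifacts never leave your AWS account.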