optimization

4 guides tagged “optimization”.

Quantizing ONNX Models: When It's Actually Worth It

A practical guide to INT8 and FP16 quantization for ONNX models — how much you save, what you risk, and a real decision: should an unquantized production model be quantized?

May 23, 2026 · 4 min read

Choosing an ONNX Runtime Execution Provider: CPU, CUDA, TensorRT, CoreML

ONNX Runtime can dispatch the same model to very different hardware backends. A practical guide to execution providers — what each is for, how the fallback chain works, and how to choose.

May 21, 2026 · 4 min read

onnx inference optimization deployment

Batching Inference Requests: Throughput vs Latency

Processing requests one at a time wastes hardware; batching them trades a little latency for a lot of throughput. How dynamic batching works, when it helps, and when a single shared session is enough.

May 17, 2026 · 4 min read

inference optimization mlops production

Reducing Cold Starts in Containerized ML Services

When your service loads a model from object storage at boot, cold starts get slow. Why ML services start slowly, and the practical levers for fixing it: image size, lazy loading, warm instances, and model size.

May 14, 2026 · 4 min read

mlops production deployment optimization