Quantizing ONNX Models: When It's Actually Worth It
4 guides tagged “optimization”.
A practical guide to INT8 and FP16 quantization for ONNX models — how much you save, what you risk, and a real decision: should an unquantized production model be quantized?
ONNX Runtime can dispatch the same model to very different hardware backends. A practical guide to execution providers — what each is for, how the fallback chain works, and how to choose.
Processing requests one at a time wastes hardware; batching them trades a little latency for a lot of throughput. How dynamic batching works, when it helps, and when a single shared session is enough.
When your service loads a model from object storage at boot, cold starts get slow. Why ML services start slowly, and the practical levers — image size, lazy loading, warm instances, and model size — to fix it.