KV Cache and Inference Scheduling: Energy Modeling for High-QPS Services

Authors

  • Wenwen Liu, ByteDance

DOI:

https://doi.org/10.70393/6a69656173.333930

ARK:

https://n2t.net/ark:/40704/JIEAS.v4n1a05

Disciplines:

Computer Science

Subjects:

Artificial Intelligence

References:

11

Keywords:

KV Cache, Inference Scheduling, Energy Modeling, High-QPS Services, AI System Optimization, Energy Efficiency

Abstract

High-QPS (queries per second) services, such as large language model (LLM) inference and real-time recommendation systems, are increasingly pervasive in AI-driven applications, but their energy consumption has become a critical challenge, accounting for up to 40% of data center operational costs. Existing optimization efforts focus primarily on latency reduction and throughput improvement, overlooking the intricate interplay between KV cache management (a core component of transformer-based model inference) and inference scheduling in determining energy efficiency. Traditional energy modeling methods (e.g., linear regression, hardware-centric power meters) fail to capture the dynamic dependencies among cache behavior, scheduling policies, and workload volatility, leading to inaccurate energy predictions and suboptimal resource allocation. To address these gaps, this study proposes a hybrid energy modeling framework tailored to high-QPS services that integrates KV cache characteristics and inference scheduling dynamics. First, we construct a multi-dimensional energy factor system encompassing four core dimensions: KV Cache Configuration (e.g., cache size, eviction policy, hit ratio), Inference Scheduling Strategy (e.g., batch size, task prioritization, resource partitioning), System Environment (e.g., CPU/GPU utilization, memory bandwidth, power capping), and Workload Traits (e.g., QPS volatility, request complexity, sequence length distribution). Second, we design a two-stage modeling approach: a data-driven component (gradient-boosted trees, GBT) that captures non-linear relationships between factor interactions and energy consumption, and an analytical component (a queueing-theory-based latency-energy tradeoff model) that ensures QPS and latency constraints are satisfied. Third, we validate the framework on a real-world dataset from an LLM inference service (2022–2024) with QPS ranging from 5k to 30k, comparing it against three baseline methods. Experimental results show that the proposed framework outperforms traditional models: it achieves an energy prediction accuracy of 92.7% (vs. 78.3% for linear regression and 83.5% for hardware-centric modeling), reduces energy consumption by 18.9%–25.3% while maintaining target QPS and latency SLAs (service-level agreements), and identifies key optimization levers (e.g., adaptive KV cache resizing based on QPS fluctuations reduces energy by 14.2%). This study provides a practical tool for system administrators to balance performance and energy efficiency, supporting the development of sustainable high-QPS AI services.
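
Illustrative sketch of the two-stage approach described above, assuming an XGBoost regressor for the data-driven stage and an M/M/c (Erlang-C) approximation for the analytical latency screen; the feature names, hyperparameters, and queueing assumptions below are placeholders for illustration, not the paper's exact formulation:

    # Sketch: (1) a gradient-boosted tree maps the four factor dimensions to
    # energy per request; (2) an M/M/c queueing approximation screens candidate
    # configurations against the latency SLA before ranking them by energy.
    import math
    import numpy as np
    import xgboost as xgb

    FEATURES = [
        # KV Cache Configuration
        "kv_cache_gb", "kv_hit_ratio", "eviction_policy_id",
        # Inference Scheduling Strategy
        "batch_size", "priority_levels", "gpu_partition_frac",
        # System Environment
        "gpu_util", "mem_bw_gbps", "power_cap_w",
        # Workload Traits
        "qps_mean", "qps_std", "mean_seq_len",
    ]

    def train_energy_model(X: np.ndarray, y_joules_per_req: np.ndarray) -> xgb.XGBRegressor:
        """Stage 1: data-driven GBT mapping factor vectors to energy per request."""
        model = xgb.XGBRegressor(n_estimators=400, max_depth=6, learning_rate=0.05)
        model.fit(X, y_joules_per_req)
        return model

    def mmc_latency_s(qps: float, service_rate: float, servers: int) -> float:
        """Stage 2 helper: mean sojourn time of an M/M/c queue (Erlang C).
        Returns inf when the offered load makes the system unstable."""
        lam, mu, c = qps, service_rate, servers
        rho = lam / (c * mu)
        if rho >= 1.0:
            return math.inf
        a = lam / mu
        head = sum(a**k / math.factorial(k) for k in range(c))
        tail = a**c / (math.factorial(c) * (1.0 - rho))
        p_wait = tail / (head + tail)              # Erlang-C waiting probability
        return p_wait / (c * mu - lam) + 1.0 / mu  # queueing delay + service time

    def rank_feasible_configs(model, candidates, qps, latency_sla_s):
        """Keep configurations whose analytical latency estimate meets the SLA,
        then rank them by GBT-predicted energy per request."""
        kept = []
        for cfg in candidates:
            # 'service_rate' (req/s per replica) and 'servers' are assumed to be
            # derived from the scheduling strategy (batch size, GPU partitioning).
            lat = mmc_latency_s(qps, cfg["service_rate"], cfg["servers"])
            if lat > latency_sla_s:
                continue
            x = np.array([[cfg[f] for f in FEATURES]], dtype=float)
            kept.append((cfg, float(model.predict(x)[0]), lat))
        return sorted(kept, key=lambda item: item[1])

In such a sketch, per-request energy labels for the GBT stage would come from power telemetry (e.g., NVML [5]), and the ranked output would give administrators candidate KV cache and scheduling configurations that satisfy the latency SLA at the lowest predicted energy.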

Author Biography

Wenwen Liu, ByteDance

ByteDance, CN, liuwenwen.jessica@bytedance.com.

References

[1] Chen, Y., Zhang, S., & Li, J. (2023). Dynamic batching for throughput optimization in LLM inference. IEEE Transactions on Parallel and Distributed Systems, 34(7), 2015–2028.

[2] Lee, H., Kim, S., & Park, J. (2022). Energy-aware cache eviction for edge AI inference. In Proceedings of the 2022 ACM SIGPLAN International Symposium on Memory Management (pp. 123–136). ACM.

[3] Li, M., Wang, H., & Zhang, L. (2024). Energy modeling for large language model training: A data-driven approach. Journal of Parallel and Distributed Computing, 201, 56–70.

[4] Liu, C., Yu, T., & Chen, W. (2022). Multi-tenant scheduling for GPU inference in cloud environments. IEEE Cloud Computing, 9(4), 89–97.

[5] NVIDIA Corporation. (2023). NVIDIA Management Library (NVML) API reference documentation. https://docs.nvidia.com/deploy/nvml-api/index.html

[6] OpenAI. (2024). GPT-4 API documentation. https://platform.openai.com/docs/models/gpt-4

[7] Prometheus. (2024). Prometheus monitoring system documentation. https://prometheus.io/docs/

[8] Wang, C., Zhao, Y., & Li, S. (2024). Lossless KV cache compression for LLM inference. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence and Engineering Applications (pp. 345–352). IEEE.

[9] Zhang, H., Liu, Y., & Wang, Z. (2023). Dynamic KV cache sizing for low-latency LLM inference. ACM Transactions on Intelligent Systems and Technology, 14(3), Article 1.

[10] PyTorch. (2023). PyTorch 2.1 documentation. https://pytorch.org/docs/stable/index.html

[11] XGBoost Developers. (2023). XGBoost 2.0 documentation. https://xgboost.readthedocs.io/en/stable/


Published

2026-02-05

How to Cite

[1] W. Liu, “KV Cache and Inference Scheduling: Energy Modeling for High-QPS Services”, Journal of Industrial Engineering & Applied Science, vol. 4, no. 1, pp. 34–41, Feb. 2026.

Issue

Vol. 4 No. 1 (2026)

Section

Articles
