Benchmarking Learned Cardinality Estimation Techniques for Analytical Query Processing in Data Warehouses

Jiacheng Hu; Xu Wang; Jiawen Lai

doi:10.70393/6a6374616d.343134

Authors

Jiacheng Hu University of New South Wales
Xu Wang Beijing University of Posts and Telecommunications
Jiawen Lai University of California

DOI:

https://doi.org/10.70393/6a6374616d.343134

ARK:

https://n2t.net/ark:/40704/JCTAM.v3n3a01

Disciplines:

Software Systems

Subjects:

Other

References:

21

Keywords:

Learned Cardinality Estimation, Data Warehouse, Query Optimization, Benchmark Evaluation

Abstract

Cardinality estimation remains one of the most critical yet error-prone components of query optimization in modern data warehouses. Recent advances in machine learning have produced a diverse family of learned cardinality estimators that demonstrate substantial accuracy improvements on standard benchmarks. Yet existing evaluations predominantly rely on third-normal-form schemas, leaving the effectiveness of these estimators on star and snowflake schemas, the backbone of analytical data warehousing, largely unexplored. This paper presents a systematic empirical evaluation of seven representative learned cardinality estimation methods spanning three categories: query-driven, data-driven, and hybrid approaches. All methods are benchmarked against the PostgreSQL histogram-based estimator on three complementary datasets: TPC-DS with its native snowflake schema, STATS-CEB with real-world relational data, and IMDB/JOB as the established cross-study reference. The evaluation covers estimation accuracy measured by Q-Error and P-Error, inference latency, training cost, model compactness, end-to-end query execution time, and robustness under simulated ETL batch insertions. Results indicate that hybrid methods, particularly FactorJoin, achieve the strongest accuracy on data warehouse workloads, with a median Q-Error of 1.74 on TPC-DS, while data-driven methods such as FLAT and BayesCard offer a favorable balance between accuracy and inference speed. BayesCard and FactorJoin exhibit the highest resilience to data updates, with median Q-Error increasing by less than 1.5 after a 50% data insertion. These findings provide actionable guidance for practitioners seeking to deploy learned cardinality estimation in production data warehouse environments.
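For readers unfamiliar with the accuracy metric named in the abstract, the following is a minimal sketch of how Q-Error and its median are commonly computed. It assumes the standard definition, the larger of estimate/actual and actual/estimate, and a floor of 1 row to avoid division by zero; the exact variant used in the paper is not stated on this page. The function names and the sample cardinalities are hypothetical and are not results from the paper.

```python
from statistics import median


def q_error(estimated: float, actual: float, floor: float = 1.0) -> float:
    """Q-Error under the common definition: max(est/act, act/est).

    Both values are clamped to `floor` (an assumed convention) so that
    empty results or zero estimates do not cause division by zero.
    """
    est = max(estimated, floor)
    act = max(actual, floor)
    return max(est / act, act / est)


def median_q_error(pairs):
    """Median Q-Error over an iterable of (estimated, actual) pairs."""
    return median(q_error(est, act) for est, act in pairs)


# Hypothetical (estimated, actual) cardinalities for five sub-plans.
sample_pairs = [(120, 100), (50, 200), (1000, 1000), (0, 10), (30, 10)]

if __name__ == "__main__":
    for est, act in sample_pairs:
        print(f"est={est:>5} act={act:>5} q_error={q_error(est, act):.2f}")
    print("median Q-Error:", median_q_error(sample_pairs))
```

A Q-Error of 1 is a perfect estimate, and the metric penalizes over- and underestimation symmetrically, which is why a median near 1.74 indicates that half of the estimates fall within a factor of 1.74 of the true cardinality.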

Author Biographies

Jiacheng Hu, University of New South Wales

Master’s Degree in Information Technology

Xu Wang, Beijing University of Posts and Telecommunications

Computer Science

Jiawen Lai, University of California

Computer Engineering

References

[1] Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How good are query optimizers, really? Proceedings of the VLDB Endowment, 9(3), 204–215.

[2] Zhou, X., Chai, C., Li, G., & Sun, J. (2022). Database meets artificial intelligence: A survey. IEEE Transactions on Knowledge and Data Engineering, 34(3), 1096–1116.

[3] Han, Y., Wang, H., Chen, L., Dong, Y., Chen, X., Yu, B., Yang, C., & Qian, W. (2024). ByteCard: Enhancing ByteDance's data warehouse with learned cardinality estimation. In Proceedings of the 2024 ACM SIGMOD International Conference on Management of Data.

[4] Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., & Kemper, A. (2019). Learned cardinalities: Estimating correlated joins with deep learning. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR).

[5] Negi, P., Marcus, R., Kipf, A., Mao, H., Tatbul, N., Kraska, T., & Alizadeh, M. (2021). Flow-Loss: Learning cardinality estimates that matter. Proceedings of the VLDB Endowment, 14(11), 2019–2032.

[6] Yang, Z., Liang, E., Kamsetty, A., Wu, C., Duan, Y., Chen, X., Abbeel, P., Hellerstein, J. M., Krishnan, S., & Stoica, I. (2019). Deep unsupervised cardinality estimation. Proceedings of the VLDB Endowment, 13(3), 279–292.

[7] Yang, Z., Kamsetty, A., Luan, S., Liang, E., Duan, Y., Chen, X., & Stoica, I. (2020). NeuroCard: One cardinality estimator for all tables. Proceedings of the VLDB Endowment, 14(1), 61–73.

[8] Hilprecht, B., Schmidt, A., Kulessa, M., Molina, A., Kersting, K., & Binnig, C. (2020). DeepDB: Learn from data, not from queries! Proceedings of the VLDB Endowment, 13(7), 992–1005.

[9] Zhu, R., Wu, Z., Han, Y., Zeng, K., Pfadler, A., Qian, Z., Zhou, J., & Cui, B. (2021). FLAT: Fast, lightweight and accurate method for cardinality estimation. Proceedings of the VLDB Endowment, 14(9), 1489–1502.

[10] Wu, P., & Cong, G. (2021). A unified deep model of learning from both data and queries for cardinality estimation. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data (pp. 2009–2022).

[11] Wu, Z., Negi, P., Alizadeh, M., Kraska, T., & Madden, S. (2023). FactorJoin: A new cardinality estimation framework for join queries. Proceedings of the ACM on Management of Data, 1(1).

[12] Wang, X., Qu, C., Wu, W., Wang, J., & Zhou, Q. (2021). Are we ready for learned cardinality estimation? Proceedings of the VLDB Endowment, 14(9), 1640–1654.

[13] Han, Y., Wu, Z., Wu, P., Zhu, R., Yang, J., Tan, L. W., Zeng, K., Cong, G., Qin, Y., Pfadler, A., Qian, Z., Zhou, J., Li, J., & Cui, B. (2022). Cardinality estimation in DBMS: A comprehensive benchmark evaluation. Proceedings of the VLDB Endowment, 15(4), 752–765.

[14] Kim, K., Jung, J., Seo, I., Han, W.-S., Choi, K., & Chong, J. (2022). Learned cardinality estimation: An in-depth study. In Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data (pp. 1214–1227).

[15] Zhang, J., Zhang, C., Li, G., & Chai, C. (2021). Learned cardinality estimation: A design space exploration and a comparative evaluation. Proceedings of the VLDB Endowment, 15(1), 85–97.

[16] Wu, Z., Shaikhha, A., Zhu, R., Zeng, K., Han, Y., & Zhou, J. (2020). BayesCard: Revitalizing Bayesian frameworks for cardinality estimation. arXiv preprint arXiv:2012.14743.

[17] Li, P., Wei, W., Zhu, R., Ding, B., Zhou, J., & Lu, H. (2023). ALECE: An attention-based learned cardinality estimator for SPJ queries on dynamic workloads. Proceedings of the VLDB Endowment, 17(2), 197–210.

[18] Wang, J., Chai, C., Liu, J., & Li, G. (2021). FACE: A normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 15(1), 72–84.

[19] Marcus, R., Negi, P., Mao, H., Tatbul, N., Alizadeh, M., & Kraska, T. (2021). Bao: Making learned query optimization practical. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data (pp. 1275–1288).

[20] Sun, J., & Li, G. (2019). An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment, 13(3), 307–319.

[21] Negi, P., Marcus, R., Kipf, A., Mao, H., Tatbul, N., Kraska, T., & Alizadeh, M. (2023). Robust query driven cardinality estimation under changing workloads. Proceedings of the VLDB Endowment, 16(7), 1520–1533.
