Research on Text Classification Methods Based on Decision Trees: A Case Study on the Recognition of the Entity Category 'Position'

Authors

  • Qiming Xing, Belarusian State University
  • Yankuan Wang, Belarusian State University

DOI:

https://doi.org/10.70393/616a6e73.323835

ARK:

https://n2t.net/ark:/40704/AJNS.v2n2a02

Disciplines:

Computer Science

Subjects:

Data Science

References:

14

Keywords:

Decision Tree, Named Entity Recognition, CLUENER2020, Word Classification

Abstract

This paper investigates a decision tree model for binary classification of the 'Position' category on the CLUENER2020 dataset, aiming to provide a lightweight and efficient method for named entity recognition. CLUENER2020 contains multiple label categories, and accurate identification of the 'Position' category matters for information extraction and text processing. The study evaluates the decision tree model on this task through data preprocessing, feature extraction, model training, and testing. The model achieves an overall accuracy of 98%. For the 'Non-Position' category, precision is 98%, recall is 100%, and the F1 score is 99%; for the 'Position' category, precision is 100%, recall is 85%, and the F1 score is 92%. Although the model performs very well on the 'Non-Position' category, the lower recall for the 'Position' category means that about 15% of position entities are missed, which the authors attribute primarily to class imbalance in the dataset and the complexity of position-related text features. The contribution of this paper lies in validating the applicability of traditional machine learning models to specific named entity recognition tasks; in resource-constrained scenarios in particular, the decision tree model offers a feasible solution. Future research could improve the accuracy and robustness of the model through data augmentation, more complex model architectures, and more thorough feature engineering and hyperparameter optimization.
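
The reported metrics can be related to one another with a short calculation. The following Python sketch is illustrative only: the confusion-matrix counts (85, 15, 0, 735) are hypothetical values chosen to be consistent with the percentages in the abstract, not figures taken from the paper, and the helper name `prf` is an assumption introduced here. It shows how a class imbalance of this kind can yield 98% accuracy even though 15% of 'Position' items are missed.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (NOT from the paper), with 'Position' as the positive class:
# TP = Position predicted as Position, FN = Position predicted as Non-Position,
# FP = Non-Position predicted as Position, TN = Non-Position predicted as Non-Position.
TP, FN, FP, TN = 85, 15, 0, 735

position = prf(TP, FP, FN)
# For the 'Non-Position' class the roles swap: its true positives are TN,
# its false positives are the missed Position items (FN), and its false negatives are FP.
non_position = prf(TN, FN, FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print("Position      P=%.2f R=%.2f F1=%.2f" % position)
print("Non-Position  P=%.2f R=%.2f F1=%.2f" % non_position)
print("Accuracy      %.2f" % accuracy)
```

With these assumed counts, the 15 missed 'Position' items are the only errors, so they lower 'Position' recall to 85% while reducing overall accuracy by less than two percentage points, because the 'Non-Position' class is far larger.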

Author Biographies

Qiming Xing, Belarusian State University

Faculty of Applied Mathematics and Computer Science, Belarusian State University, Belarus.

Yankuan Wang, Belarusian State University

Faculty of Applied Mathematics and Computer Science, Belarusian State University, Belarus.

Published

2025-04-14

How to Cite

Xing, Q., & Wang, Y. (2025). Research on Text Classification Methods Based on Decision Trees: A Case Study on the Recognition of the Entity Category 'Position'. Academic Journal of Natural Science, 2(2), 10–15. https://doi.org/10.70393/616a6e73.323835

Section

Articles