Analysis of the Influence of Text Input Characteristics on the Performance of MiniLMv2-L6-H384 and BERT-Base-Uncased on Quora Question Pairs

Authors

  • Ken Ratri Wardani, Institut Teknologi Harapan Bangsa
  • Inge Martina, Institut Teknologi Harapan Bangsa
  • Jimmy Fong Xin Wern, Institut Teknologi Harapan Bangsa

DOI:

https://doi.org/10.61769/telematika.v20i2.775

Keywords:

knowledge distillation, MiniLM, BERT, semantic equivalence, Quora question pairs, sequence length, token rarity

Abstract

Knowledge distillation is a technique for compressing large language models into smaller models while largely preserving their accuracy. Bidirectional encoder representations from transformers (BERT) offers strong performance but requires significant computational resources, whereas the distilled MiniLM is five times smaller. This study compares the performance of the two models on the Quora Question Pairs dataset, focusing on the effects of sequence length and token rarity on classification accuracy. Both models were fine-tuned with identical training parameters. Test results show that BERT achieves 91.22% accuracy and an 88.17% F1-score, slightly outperforming MiniLM, which achieves 90.12% accuracy and an 86.73% F1-score; MiniLM, however, delivers 5.3 times faster inference. These findings provide empirical guidance for model optimisation in environments with limited computational resources or real-time response requirements, where MiniLM's slight decrease in accuracy is an acceptable trade-off for its efficiency. Future research is recommended to explore hybrid systems that delegate complex tasks to large models and general tasks to smaller models.
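
As a concrete illustration of the comparison described in the abstract, the sketch below fine-tunes both encoders on Quora Question Pairs (the GLUE "qqp" task) with identical hyperparameters and reports accuracy and F1. It is a minimal sketch, not the authors' exact pipeline: the MiniLMv2 checkpoint name, the hyperparameter values, the 128-token truncation length, and the use of the Hugging Face Trainer are illustrative assumptions only.

    # Minimal sketch: compare two encoders on Quora Question Pairs (GLUE "qqp")
    # under identical training settings. Checkpoint names and hyperparameters
    # are assumptions, not values taken from the paper.
    import numpy as np
    from datasets import load_dataset
    from sklearn.metrics import accuracy_score, f1_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODELS = {
        "bert-base": "bert-base-uncased",
        # Assumed public MiniLMv2-L6-H384 checkpoint; substitute the one actually used.
        "minilmv2-l6-h384": "nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large",
    }

    raw = load_dataset("glue", "qqp")  # fields: question1, question2, label (1 = duplicate)

    def run(checkpoint: str) -> dict:
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

        def tokenize(batch):
            # Encode the question pair jointly; truncation caps the sequence length.
            return tokenizer(batch["question1"], batch["question2"],
                             truncation=True, max_length=128)

        data = raw.map(tokenize, batched=True)

        def metrics(eval_pred):
            logits, labels = eval_pred
            preds = np.argmax(logits, axis=-1)
            return {"accuracy": accuracy_score(labels, preds),
                    "f1": f1_score(labels, preds)}

        args = TrainingArguments(
            output_dir=f"out-{checkpoint.split('/')[-1]}",
            num_train_epochs=3,              # identical settings for both models
            learning_rate=2e-5,
            per_device_train_batch_size=32,
            per_device_eval_batch_size=64,
        )
        trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                          train_dataset=data["train"],
                          eval_dataset=data["validation"],
                          compute_metrics=metrics)
        trainer.train()
        return trainer.evaluate()

    for name, ckpt in MODELS.items():
        print(name, run(ckpt))

Timing the evaluation loop for each model on the same hardware (for example, wrapping trainer.evaluate() with time.perf_counter()) gives the kind of relative inference-speed figure reported in the abstract, though absolute numbers depend on batch size and device.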

Author Biographies

Ken Ratri Wardani, Institut Teknologi Harapan Bangsa

Informatics Study Program

Inge Martina, Institut Teknologi Harapan Bangsa

Informatics Study Program

Jimmy Fong Xin Wern, Institut Teknologi Harapan Bangsa

Informatics Study Program

References

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186, doi: 10.18653/v1/N19-1423.

Google Research, “BERT: Bidirectional Encoder Representations from Transformers,” GitHub repository. [Online]. Available: https://github.com/google-research/bert

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint, arXiv:1503.02531, 2015.

W. Wang et al., “MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 2140–2151, doi: 10.18653/v1/2021.findings-acl.188.

Microsoft, “UniLM: MiniLM—State-of-the-art natural language processing,” GitHub repository. [Online]. Available: https://github.com/microsoft/unilm/tree/master/minilm

S. Mukherjee et al., “Orca: Progressive learning from complex explanation traces of GPT-4,” arXiv preprint, arXiv:2306.02707, 2023.

P. Zhang, G. Zeng, T. Wang, and W. Lu, “TinyLlama: An open-source small language model,” arXiv preprint, arXiv:2401.02385, 2024.

V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” in Proc. 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing (NeurIPS), 2019, arXiv preprint, arXiv:1910.01108.

Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, “MobileBERT: A compact task-agnostic BERT for resource-limited devices,” in Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 2158–2170, doi: 10.18653/v1/2020.acl-main.195.

J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, no. 6, pp. 1789–1819, 2021, doi: 10.1007/s11263-021-01453-z.

A. Wang et al., “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proc. NAACL-HLT 2018, 2018, pp. 353–355, doi: 10.18653/v1/N18-2017.

J.-T. Baillargeon and L. Lamontagne, “Assessing the impact of sequence length learning on classification tasks for transformer encoder models,” arXiv preprint, arXiv:2212.08399, 2024.

W. Yu et al., “Dict-BERT: Enhancing language model pre-training with dictionary,” in Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 1907–1918, doi: 10.18653/v1/2022.findings-acl.150.

Published

2026-01-24

Issue

Section

Articles