A study on the evaluation of tokenizer performance in natural language processing

Choo, Sanghyun and Kim, Wonjoon (2023) A study on the evaluation of tokenizer performance in natural language processing. Applied Artificial Intelligence, 37 (1). ISSN 0883-9514

Official URL: https://doi.org/10.1080/08839514.2023.2175112

Abstract

This study compares the performance of two tokenizers, Mecab-Ko and SentencePiece, in natural language processing for sentiment analysis. Five algorithms were used to evaluate each tokenizer: Naive Bayes (NB), k-Nearest Neighbor (kNN), Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN). Performance was assessed on four widely used metrics: accuracy, precision, recall, and F1-score. The results indicated that SentencePiece performed better than Mecab-Ko. To check the validity of the results, paired t-tests were conducted on the evaluation outcomes. The study concludes that SentencePiece demonstrated superior classification performance, especially with ANN and LSTM-RNN, when used to interpret customer sentiment in Korean online reviews. Furthermore, SentencePiece can assign specific meanings to short words or jargon that are commonly used in product evaluations but not defined beforehand.
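
The evaluation procedure described in the abstract (four classification metrics, followed by a paired t-test on matched scores) can be illustrated with a minimal sketch. This is not the authors' code: the function names `binary_metrics` and `paired_t_statistic`, the binary-label assumption, and all input values are hypothetical placeholders chosen only to show the arithmetic, and the sketch returns the t statistic and degrees of freedom without a p-value.

```python
import math
import statistics


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical toy labels and scores, not data from the study.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(toy_true, toy_pred))

# Matched scores for two hypothetical tokenizer pipelines on the same folds.
t_stat, dof = paired_t_statistic([0.6, 0.7, 0.8], [0.5, 0.5, 0.5])
print(t_stat, dof)
```

The pairing matters here: both tokenizers are scored on the same classifiers and data, so the test is applied to the per-pair differences rather than to two independent samples.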

Item Type:	Article
Subjects:	Archive Science > Computer Science
Depositing User:	Managing Editor
Date Deposited:	12 Jun 2023 06:56
Last Modified:	17 May 2024 11:05
URI:	http://editor.pacificarchive.com/id/eprint/1150
