NEWS CATEGORY CLASSIFICATION WITH MACHINE LEARNING METHOD

Files

gs621130228.pdf (3.96 MB)

Date

16/8/2021

Publisher

Srinakharinwirot University

Abstract

The purpose of this research is to study methods of categorizing news with text classification and machine learning techniques, using a news dataset. The dataset consists of 202,372 headlines in 41 news categories, published from 2012 to 2018 and obtained from the news website HuffPost. Two feature representations were explored, bag-of-words and Term Frequency-Inverse Document Frequency (TF-IDF), along with five machine learning methods: Multinomial Naive Bayes, Complement Naive Bayes, Logistic Regression, LinearSVC, and Random Forest, all applied to imbalanced classes. The class imbalance was addressed with three sampling algorithms: undersampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling. Logistic regression with bag-of-words features and SMOTE performed best, with accuracy, recall, precision, and F1 scores of 80.69, 77.63, 77.04, and 77.31, respectively. The confusion matrix showed that Healthy Living news was classified most accurately, at 89%, whereas performance on Sports news was quite low.
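
The two feature representations and the simplest of the three sampling strategies named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the thesis code: the toy headlines, the function names, the whitespace tokenizer, and the smoothed IDF variant are all assumptions made for illustration. SMOTE and adaptive synthetic sampling, which create synthetic minority-class samples by interpolating between neighbouring samples, are not shown.

```python
import math
import random
from collections import Counter, defaultdict


def bag_of_words(docs):
    """Return one token-count Counter per document (lowercased whitespace tokens)."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(counts):
    """Reweight raw counts by a smoothed inverse document frequency (assumed variant)."""
    n_docs = len(counts)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def random_undersample(samples, labels, seed=0):
    """Randomly downsample every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    n_min = min(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label in sorted(by_label):
        for sample in rng.sample(by_label[label], n_min):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy headlines, not taken from the HuffPost dataset.
headlines = [
    "team wins the final match",
    "senate passes the budget bill",
    "president signs the new bill",
    "governor vetoes the tax bill",
]
labels = ["SPORTS", "POLITICS", "POLITICS", "POLITICS"]

counts = bag_of_words(headlines)       # raw term counts per headline
weights = tf_idf(counts)               # rare terms get larger weights
balanced_x, balanced_y = random_undersample(weights, labels)
```

In this toy example the word "the" occurs in every headline and keeps the minimum weight, while a word such as "team" that occurs in only one headline is weighted more heavily. After undersampling, each class contributes the same number of samples, at the cost of discarding majority-class data.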

Description

MASTER OF SCIENCE (M.Sc.)

Keywords

News category classification, Text classification, Machine learning, Sampling

URI

https://ir-ithesis.swu.ac.th/handle/123456789/1242

Collections

Faculty of Science
