NEWS CATEGORY CLASSIFICATION WITH MACHINE LEARNING METHOD

Please use this identifier to cite or link to this item: http://ir-ithesis.swu.ac.th/dspace/handle/123456789/1242

Full metadata record

DC Field	Value	Language
dc.contributor	KITTISAK KITTITHANUSORN	en
dc.contributor	กิตติศักดิ์ กิตติธานุสรณ์	th
dc.contributor.advisor	Vera Sa-ing	en
dc.contributor.advisor	วีระ สอิ้ง	th
dc.contributor.other	Srinakharinwirot University. Faculty of Science	en
dc.date.accessioned	2021-09-08T11:43:24Z	-
dc.date.available	2021-09-08T11:43:24Z	-
dc.date.issued	2021-08-16	-
dc.identifier.uri	http://ir-ithesis.swu.ac.th/dspace/handle/123456789/1242	-
dc.description	MASTER OF SCIENCE (M.Sc.)	en
dc.description	วิทยาศาสตรมหาบัณฑิต (วท.ม.)	th
dc.description.abstract	This research studies methods for categorizing news with machine learning techniques. The dataset, provided by the news website HuffPost, consists of 41 news categories and 202,372 headlines from 2012 to 2018. Two text representations, bag-of-words and term frequency-inverse document frequency (TF-IDF), were explored with five machine learning methods: Multinomial Naive Bayes, Complement Naive Bayes, logistic regression, LinearSVC, and Random Forest. Because the classes are imbalanced, three sampling algorithms were applied: undersampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling. Logistic regression with bag-of-words features and SMOTE performed best, with accuracy, recall, precision, and F1 scores of 80.69, 77.63, 77.04, and 77.31, respectively. The confusion matrix showed that the most accurately classified category was healthy living news, at 89%, whereas performance on sports news was quite low.	en
dc.description.abstract	งานวิจัยนี้มีวัตถุประสงค์เพื่อศึกษาวิธีการจำแนกประเภทของข่าว โดยใช้เทคนิคการเรียนรู้ของเครื่อง โดยใช้ชุดข้อมูลประเภทข่าว ชุดข้อมูลนี้ประเภทข่าวอยู่ 41 ประเภท และหัวข้อข่าว 202,372 หัวข้อตั้งแต่ปี 2555 ถึงปี 2561 ที่ได้รับจากเว็บไซต์ข่าว HuffPost งานวิจัยนี้ใช้อัลกอริทึม การจัดแบ่งประเภทของเอกสาร และการเรียนรู้ของเครื่อง เพื่อจำแนกประเภทข่าว กระบวนการการจำแนกประเภทจะสำรวจเทคนิค bag-of-word และ Term Frequency Inverse Document Frequency (TFIDF) ด้วย 5 การเรียนรู้ คือ Multinomial Naive Bayes, Complement Naive Bayes, Logistic regression, LinearSVC และ Random Forest บนคลาสที่ไม่สมดุล ปัญหาที่ท้าทายนี้จัดการโดยใช้อัลกอริทึมการสุ่มตัวอย่าง 3 วิธี คือ undersampling, synthetic minority oversampling technique (SMOTE) และ adaptive synthetic sampling ผลลัพธ์จากการทดลองพบว่า Logistic regression ที่ใช้เทคนิค bag-of-word และ SMOTE มีประสิทธิภาพสูงที่สุดในการจำแนกประเภทข่าว แสดงค่า Accuracy, Recall, Precision และ F1 score เป็น 80.69, 77.63, 77.04 และ 77.31 ตามลำดับ และจาก confusion matrix แสดงให้เห็นว่ามีความแม่นยำในการตรวจจับข่าวประเภท Healthy Living มากที่สุดคือ 89% แต่มีประสิทธิภาพการตรวจจับข่าวประเภท Sports ค่อนข้างต่ำ	th
dc.language.iso	th
dc.publisher	Srinakharinwirot University
dc.rights	Srinakharinwirot University
dc.subject	การจำแนกประเภทของข่าว	th
dc.subject	การจัดแบ่งประเภทของเอกสาร	th
dc.subject	การเรียนรู้ของเครื่อง	th
dc.subject	การสุ่มตัวอย่าง	th
dc.subject	News category classification	en
dc.subject	Text classification	en
dc.subject	Machine learning	en
dc.subject	Sampling	en
dc.subject.classification	Computer Science	en
dc.title	NEWS CATEGORY CLASSIFICATION WITH MACHINE LEARNING METHOD	en
dc.title	การจำแนกประเภทข่าวด้วยวิธีการเรียนรู้ด้วยเครื่อง	th
dc.type	Master’s Project	en
dc.type	สารนิพนธ์	th
Appears in Collections:	Faculty of Science

Files in This Item:

File	Description	Size	Format
gs621130228.pdf		4.05 MB	Adobe PDF
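
Illustrative note on the abstract: the sketch below is not taken from the thesis, whose code is not part of this record. It is a minimal pure-Python illustration, under stated assumptions, of three ideas the abstract names: bag-of-words counts, TF-IDF weighting, and SMOTE-style oversampling by interpolation between minority-class neighbours. The function names (`bag_of_words`, `tf_idf`, `smote_like`), the toy headlines, the whitespace tokenizer, and the `ln(N / df)` weighting are all assumptions chosen for the example; library implementations typically use smoothed variants.

```python
import math
import random
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for lowercased, whitespace-split docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by idf = ln(N / df), one common textbook variant."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    return [[v[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for v in vectors]


def smote_like(minority, n_new, k=2, seed=0):
    """Make n_new synthetic points, each on the segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical headlines, not from the HuffPost dataset).
headlines = ["team wins the final", "team loses the final", "new diet tips"]
vocab, counts = bag_of_words(headlines)
weights = tf_idf(counts)
# Treat the first two headlines as a minority class and oversample it.
extra = smote_like(counts[:2], n_new=3, k=1)
```

In practice, oversampling would be applied to the training split only, so that synthetic points never leak into the evaluation data.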
