Detecting Fake News with Machine Learning

Abstract

In this project, we used a Kaggle dataset of fake news and downloaded real news from The Guardian website. Most existing work uses supervised learning but gives importance to the specific words used in the dataset. That approach may work well when the dataset is huge and covers a wide domain. Additionally, such algorithms are trained after the news has already been disseminated. In contrast, this research focuses on content-based prediction using statistical language features: a pattern in the language features can predict whether a news item is fake or not. We extracted 43 features, including part-of-speech and sentiment analysis features. We implemented six classifiers (AdaBoost, DecisionTreeClassifier, GaussianNB, KNeighborsClassifier, SGDClassifier, and SVC) to predict whether a particular piece of news is fake or real.

Results show that the AdaBoost classifier with a decision tree of maximum depth $3$ as its base estimator and $175$ estimators performs best, with accuracy close to 1. The features NN (noun, common, singular or mass); CD (numeral, cardinal); VBP (verb, present tense, not 3rd person singular); VBG (verb, present participle or gerund); positive (positive sentiment); NNP (noun, proper, singular); JJ (adjective or numeral, ordinal); IN (preposition or conjunction, subordinating); VBN (verb, past participle); and unique (unique words) were found to be the top predictive features, which provided an accuracy of 0.85 and an F-score of 0.87. In future work, we will apply this algorithm to other datasets.

Introduction

  • What is Fake News?
    • A piece of news that is stylistically written as real news but is entirely or partially false
    • It has long existed as propaganda, but is now widespread due to social media and private news websites/content providers
  • Why?
    • Well documented, unverified content - written to impact psychological beliefs
  • Reason
    • Information Separation - due to algorithms and preferences

Problem Statement

  • Let $N = \{n_1, n_2, n_3, \dots, n_m\}$ be a collection of $m$ news items and $L = \{l_1, l_2, l_3, \dots, l_m\}$ be their corresponding labels, such that label $l_i$ is either $1$ or $0$ depending on whether news item $n_i$ is fake or real.
    • What is the label for $n_z \notin N$?
  • Previous approaches were dictionary-based

  • Can we use language features to detect fake news?
    • Fake news is written with clear intent, and its content is crafted around human psychology to influence the social belief system

Features Extracted - 43

  • 39 Parts of Speech Features
| Feature # | POS Tag | Description | Feature # | POS Tag | Description |
|---|---|---|---|---|---|
| 1 | $ | dollar | 2 | ”” | quotes |
| 3 | . | dot | 4 | : | colon |
| 5 | CC | conjunction, coordinating | 6 | CD | numeral, cardinal |
| 7 | DT | determiner | 8 | EX | existential there |
| 9 | FW | foreign word | 10 | IN | preposition or conjunction, subordinating |
| 11 | JJ | adjective or numeral, ordinal | 12 | JJR | adjective, comparative |
| 13 | JJS | adjective, superlative | 14 | MD | modal auxiliary |
| 15 | NN | noun, common, singular or mass | 16 | NNP | noun, proper, singular |
| 17 | NNPS | noun, proper, plural | 18 | NNS | noun, common, plural |
| 19 | PDT | pre-determiner | 20 | POS | genitive marker |
| 21 | PRP | pronoun, personal | 22 | PRP$ | pronoun, possessive |
| 23 | RB | adverb | 24 | RBR | adverb, comparative |
| 25 | RBS | adverb, superlative | 26 | RP | particle |
| 27 | SYM | symbol | 28 | TO | “to” as preposition or infinitive marker |
| 29 | UH | interjection | 30 | VB | verb, base form |
| 31 | VBD | verb, past tense | 32 | VBG | verb, present participle or gerund |
| 33 | VBN | verb, past participle | 34 | VBP | verb, present tense, not 3rd person singular |
| 35 | VBZ | verb, present tense, 3rd person singular | 36 | WDT | WH-determiner |
| 37 | WP | WH-pronoun | 38 | WP$ | WH-pronoun, possessive |
| 39 | WRB | WH-adverb | | | |
  • 3 Sentiment Analysis Features
    • Positive Words, Negative Words, Neutral Words
  • Unique Words
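
The 43 features above (39 part-of-speech counts, 3 sentiment counts, and 1 unique-word count) could be assembled roughly as in the sketch below. This is a minimal illustration, not the project's actual code. It makes three assumptions: the features are raw counts, the input is already POS-tagged (for example by a tagger such as `nltk.pos_tag`), and the sentiment lexicons are tiny hypothetical word sets, since the source does not say which lexicon was used. The function name `extract_features` is also hypothetical.

```python
from collections import Counter

# The 39 Penn Treebank style tags listed in the table above.
# The spelling of the quotes tag ("''") is an assumption.
POS_TAGS = [
    "$", "''", ".", ":", "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS",
    "MD", "NN", "NNP", "NNPS", "NNS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
    "WDT", "WP", "WP$", "WRB",
]

# Hypothetical toy lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "win"}
NEGATIVE_WORDS = {"bad", "fraud", "lie"}


def extract_features(tagged_tokens):
    """Map a list of (word, POS tag) pairs to a 43-entry feature dict."""
    # 39 POS features: how often each tag occurs (assumed to be raw counts).
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    features = {tag: tag_counts.get(tag, 0) for tag in POS_TAGS}

    # Keep only tokens containing a letter or digit, so punctuation is skipped.
    words = [w.lower() for w, _ in tagged_tokens if any(ch.isalnum() for ch in w)]

    # 3 sentiment features: positive, negative, and everything else as neutral.
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    features["positive"] = positive
    features["negative"] = negative
    features["neutral"] = len(words) - positive - negative

    # 1 feature: number of distinct words.
    features["unique"] = len(set(words))
    return features


example = [("The", "DT"), ("senator", "NN"), ("told", "VBD"), ("a", "DT"),
           ("bad", "JJ"), ("lie", "NN"), (".", ".")]
features = extract_features(example)
print(len(features), features["DT"], features["negative"], features["unique"])
```

Each news item then becomes one fixed-length vector, independent of the specific vocabulary it uses, which is the point of a language-feature approach as opposed to a dictionary-based one.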

Algorithms

  • AdaBoost Classifier
  • Decision Tree Classifier
  • Gaussian Naive Bayes (GaussianNB)
  • K-Nearest Neighbors (KNeighbors)
  • Stochastic Gradient Descent Classifier (SGDC)
  • Support Vector Machine (SVC)
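
As a rough sketch of how these six classifiers could be compared with scikit-learn, the example below fits each one with default parameters, plus the AdaBoost configuration named in the abstract (decision tree of maximum depth 3, 175 estimators). The data is a synthetic stand-in generated with `make_classification`, not the Kaggle/Guardian data, so the scores it prints say nothing about the project's results. The split ratio and random seeds are likewise assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 43-feature matrix (assumption, not the real dataset).
X, y = make_classification(n_samples=400, n_features=43, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifiers = {
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SGDClassifier": SGDClassifier(random_state=0),
    "SVC": SVC(random_state=0),
    # Configuration reported as best in the abstract. The base estimator is
    # passed positionally because its keyword name differs across
    # scikit-learn versions.
    "AdaBoost (depth 3, 175 estimators)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=3), n_estimators=175, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the test split

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```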

Results

  • Initial Performance with 100% of training data using default parameters
| Algorithm | Test Accuracy | Test F-Score |
|---|---|---|
| AdaBoostClassifier | 0.9989 | 0.9993 |
| DecisionTreeClassifier | 0.9980 | 0.9983 |
| GaussianNB | 0.9902 | 0.9867 |
| KNeighborsClassifier | 0.9747 | 0.9822 |
| SGDClassifier | 0.9859 | 0.9851 |
| SVC | 0.9966 | 0.9968 |
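
For readers unfamiliar with the two metrics in the table, the sketch below shows how accuracy and an F-score can be computed from confusion-matrix counts. It assumes the F-score is the balanced F1 (the source does not state which variant was used) and that "fake" is the positive class. The counts are hypothetical and unrelated to the reported results.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives, and
    true negatives, with "fake" assumed to be the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # F1 is the harmonic mean of precision and recall; written in terms of
    # counts it is 2*tp / (2*tp + fp + fn).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical counts for a 200-item test set (not the project's data).
accuracy, f1 = accuracy_and_f1(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")  # accuracy=0.925 f1=0.923
```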


Figure: Important language features to predict fake news

Figure: Confusion matrix for test data

Citation

  • Aneja, N. and Aneja, S. (2019). "Detecting Fake News with Machine Learning." International Conference on Deep Learning, Artificial Intelligence and Robotics (ICDLAIR), 2019.