Bias Detection in News by Multi-Task Learning

Team 3 - Harsh, Abhinav Anand, Sourav Kumar, Aditya Arora

Abstract

Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If this meta-information can be computed reliably, news articles can be tagged automatically, which in turn can encourage or discourage readers from consuming them. This is an important use case, as much of the news and social-media content published today is biased to some degree.

Hyperpartisan articles mimic the form of regular news articles, but are one-sided in the sense that opposing views are either ignored or fiercely attacked.

Problem Statement - Given the text and markup of an online news article, decide whether the article is hyperpartisan or not. The challenge of this task is to unveil the mimicking and to detect hyperpartisan language, which may be distinguishable from regular news at the levels of style, syntax, semantics, and pragmatics. If an article is biased, we also try to predict the type of bias. The types of bias we consider are Right, Right-Center, Least, Left-Center, and Left.


Dataset

Link to dataset: SemEval 2019 Task 4: Hyperpartisan News Detection
The data is split into multiple files.

  • Files with names starting with "articles-" contain the articles and validate against the XML schema article.xsd.
  • Files with names starting with "ground-truth-" contain the ground-truth information and validate against the XML schema ground-truth.xsd.
Data is divided into two parts:
  1. First part of the data
    • File names contain "bypublisher"; articles are labeled by the overall bias of their publisher.
    • Contains a total of 750,000 articles.
    • Half of the articles (375,000) are hyperpartisan and half are not.
    • Half of the hyperpartisan articles (187,500) are on the left side of the political spectrum and half are on the right.
    • The data is split into a training set (80%, 600,000 articles) and a validation set (20%, 150,000 articles).
  2. Second part of the data
    • File names contain "byarticle"; articles are labeled individually through crowdsourcing.
    • Only articles for which the crowdsourcing workers reached a consensus are included.
    • Contains a total of 645 articles.
    • 37% (238) are hyperpartisan and 63% (407) are not.

Baseline

  1. Logistic Regression
    We first used a tf-idf vectorizer for feature extraction. TfidfVectorizer converts the text into a matrix of token counts and then normalizes that count matrix into a tf-idf matrix. The goal of using tf-idf instead of the raw frequency of a token in a given document is to scale down the impact of tokens that occur very frequently in the corpus and are therefore empirically less informative than tokens that occur in only a small fraction of the training documents. We then applied logistic regression for classification (a minimal sketch of both baseline pipelines follows the results below). The results of this model are shown below.

    Results

    1. Hyperpartisanship
      • Precision, Recall and F1 Score
        Class      Precision  Recall  F1-Score
        Biased     0.77       0.79    0.78
        Unbiased   0.78       0.76    0.77
      • Accuracy
        Accuracy on validation set = 78%
      • Confusion Matrix
        [Confusion matrix figure]
    2. Kind of bias
      • Precision, Recall and F1 Score
        Class         Precision  Recall  F1-Score
        Right         0.74       0.80    0.77
        Right-Center  0.58       0.76    0.66
        Least         0.51       0.38    0.44
        Left-Center   0.75       0.65    0.70
        Left          0.45       0.21    0.29
      • Accuracy
        Accuracy on validation set = 66%
      • Confusion Matrix
        [Confusion matrix figure]

  2. Naive Bayes
    For Naive Bayes we again used the tf-idf vectorizer for feature extraction, followed by a Naive Bayes classifier for multinomial models. The Naive Bayes method is suitable for classification with discrete features (such as tf-idf values or word counts, as in this case). The results of this model are shown below; the sketch after this list covers this baseline as well.

    Results

    1. Hyperpartisanship
      • Precision, Recall and F1 Score
        Class      Precision  Recall  F1-Score
        Biased     0.76       0.80    0.78
        Unbiased   0.79       0.75    0.77
      • Accuracy
        Accuracy on validation set = 78%
      • Confusion Matrix
        [Confusion matrix figure]
    2. Kind of bias
      • Precision, Recall and F1 Score
        Class         Precision  Recall  F1-Score
        Right         0.75       0.49    0.59
        Right-Center  0.94       0.01    0.02
        Least         0.55       0.84    0.66
        Left-Center   0.69       0.01    0.03
        Left          0.51       0.73    0.60
      • Accuracy
        Accuracy on validation set = 57%
      • Confusion Matrix
        [Confusion matrix figure]
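
Below is a minimal sketch of the two baseline pipelines described above, using scikit-learn. It assumes the article texts and labels have already been extracted from the XML files into Python lists (train_texts, train_labels, val_texts, val_labels are hypothetical variable names); it illustrates the approach rather than reproducing the exact code used.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import classification_report

    # Both baselines share the tf-idf feature extraction and differ only in the classifier.
    baselines = {
        "Logistic Regression": Pipeline([
            ("tfidf", TfidfVectorizer()),               # token counts -> normalized tf-idf matrix
            ("clf", LogisticRegression(max_iter=1000)),
        ]),
        "Naive Bayes": Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", MultinomialNB()),                   # multinomial Naive Bayes over tf-idf features
        ]),
    }

    # Labels are "biased"/"unbiased" for the first task, or one of the five bias
    # classes for the second task; the same pipeline is trained once per task.
    for name, model in baselines.items():
        model.fit(train_texts, train_labels)
        print(name)
        print(classification_report(val_labels, model.predict(val_texts)))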

Final Architecture

  1. Neural Network (without multitask learning)
    We used word2vec word embeddings to represent the input text and trained LSTM-based classifiers for the two tasks. We moved from the traditional Naive Bayes baseline to a neural network in order to handle the correlation/dependence between input features (see the model sketch after this list; the single-task variant simply drops the second output head).

    Results

    1. Hyperpartisanship
      • Precision, Recall and F1 Score
        Class      Precision  Recall  F1-Score
        Biased     0.91       0.95    0.93
        Unbiased   0.94       0.91    0.93
      • Accuracy
        Accuracy on validation set = 93%
    2. Kind of bias
      • Precision, Recall and F1 Score
        Class         Precision  Recall  F1-Score
        Right         0.98       0.96    0.97
        Right-Center  0.84       0.86    0.85
        Least         0.81       0.82    0.81
        Left-Center   0.89       0.91    0.90
        Left          0.89       0.76    0.82
      • Accuracy
        Accuracy on validation set = 89%

  2. Neural Network with Multitask Learning
    For the final evaluation we implemented multitask learning. Multitask learning is a subfield of machine learning that aims to solve multiple tasks at the same time by exploiting the similarities between them. It can improve learning efficiency and also act as a regularizer, as discussed below. In this project we have two tasks: determining whether a news article is biased, and determining the type of bias. These two tasks are not entirely disjoint, so one can help the learning process of the other.
    We begin by converting every text sample in our dataset into a sequence of word indices, where a word index is simply an integer identifier for that word. We keep only the 50,000 most frequently occurring words in the dataset as our vocabulary, and we limit each sequence to a maximum of 1,500 words.
    We used pre-trained GloVe word embeddings, trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words. GloVe embeddings are available in 50, 100, 200 and 300 dimensions; we chose the 300-dimensional version. We kept the embedding weights frozen (trainable = False) to see how the model performs with fixed word embeddings. Next, we create a weight matrix that holds the GloVe vector for every word in our word index and load it into an Embedding layer, which maps the integer inputs to the vectors at the corresponding rows of the weight matrix. Finally, we build our multitask LSTM model to solve the classification problem (a sketch of this preprocessing and of the model follows the results below).
    [Figure: multitask LSTM model architecture]
    The diagram above shows the model we use. An input layer feeds the data into the model. The dataset consists of XML files, which are cleaned up and prepared so that each word is encoded as an integer index; the embedding layer then maps each word into a 300-dimensional vector space. The next layer is an LSTM-based hidden layer, and the output layer is different for the two tasks. Initially the model had no regularization and as a result overfit the training data, which gave very low accuracy on the validation set, so we added a dropout of 0.5 in each layer. Finally, the two output heads give us the weight matrices with which we classify a news article as biased or unbiased and determine the type of bias in the article.

    Results

    1. Hyperpartisanship
      • Precision, Recall and F1 Score
        Class      Precision  Recall  F1-Score
        Biased     0.91       0.95    0.93
        Unbiased   0.94       0.91    0.93
      • Accuracy
        Accuracy on validation set = 93%
    2. Kind of bias
      • Precision, Recall and F1 Score
        Class         Precision  Recall  F1-Score
        Right         0.98       0.96    0.97
        Right-Center  0.84       0.86    0.85
        Least         0.81       0.82    0.81
        Left-Center   0.89       0.91    0.90
        Left          0.89       0.76    0.82
      • Accuracy
        Accuracy on validation set = 89%
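
Below is a minimal sketch of the preprocessing described above, using Keras. The GloVe file name (glove.6B.300d.txt) and the variable texts (a list of article strings extracted from the XML files) are illustrative assumptions; the vocabulary size, sequence length and embedding dimension follow the report.

    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    MAX_WORDS = 50000   # keep only the 50,000 most frequent words
    MAX_LEN = 1500      # truncate/pad every article to 1,500 word indices
    EMBED_DIM = 300     # 300-dimensional GloVe vectors

    # texts is assumed to be a list of article strings extracted from the XML files.
    tokenizer = Tokenizer(num_words=MAX_WORDS)
    tokenizer.fit_on_texts(texts)                    # build the word index
    sequences = tokenizer.texts_to_sequences(texts)  # words -> integer indices
    data = pad_sequences(sequences, maxlen=MAX_LEN)

    # Read the pre-trained GloVe vectors (file name assumed) into a dictionary.
    embeddings_index = {}
    with open("glove.6B.300d.txt", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

    # Weight matrix: row i holds the GloVe vector of the word with index i (zeros if unknown).
    embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM))
    for word, i in tokenizer.word_index.items():
        if i < MAX_WORDS and word in embeddings_index:
            embedding_matrix[i] = embeddings_index[word]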
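
And a minimal sketch of the multitask model itself, reusing embedding_matrix and the constants from the sketch above. The LSTM size is an assumption for illustration; the frozen GloVe embedding layer, the dropout of 0.5 and the two task-specific output heads follow the description above. Dropping the bias_type head leaves a single-task LSTM classifier like the one in the first item of this section (which used word2vec rather than GloVe embeddings).

    from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense
    from tensorflow.keras.models import Model

    inputs = Input(shape=(MAX_LEN,))
    # Embedding layer initialized with the GloVe weight matrix and kept frozen (trainable=False).
    x = Embedding(MAX_WORDS, EMBED_DIM, weights=[embedding_matrix], trainable=False)(inputs)
    x = LSTM(128)(x)      # shared LSTM hidden layer (size assumed)
    x = Dropout(0.5)(x)   # regularization added to curb overfitting

    # Two task-specific heads share the same LSTM representation.
    hyperpartisan = Dense(1, activation="sigmoid", name="hyperpartisan")(x)
    bias_type = Dense(5, activation="softmax", name="bias_type")(x)

    model = Model(inputs=inputs, outputs=[hyperpartisan, bias_type])
    model.compile(optimizer="adam",
                  loss={"hyperpartisan": "binary_crossentropy",
                        "bias_type": "sparse_categorical_crossentropy"},
                  metrics=["accuracy"])
    # model.fit(data, {"hyperpartisan": y_binary, "bias_type": y_bias}, validation_split=0.2)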

Demonstration