How it works

In this modern world, information travels faster than the speed of reason.

We go to great lengths to make our analyses as unambiguous and unbiased as possible. We want you to feel confident that you’re seeing the full story.

Analyzing the news takes a lot of work from a number of different pieces of software. Here’s a high-level overview of what happens.

How do we choose articles?

Firstly, before we can do any kind of analysis, we need to know that there is even a story. We maintain a database of over 13,000 news sources, which is updated weekly. Our crawler operates on a schedule, periodically waking up and checking those sources for any new articles they have published. When it encounters a new article it puts it in a scratchpad alongside all the other new articles found in that run. At the end of a crawling run we gather articles with similar content and group them together, calling this grouping a ‘story’.
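For illustration, here is a minimal Python sketch of that grouping step. The `Article` and `Story` types, the title-based similarity measure and the 0.6 threshold are all assumptions made for this example; the real pipeline compares article content with more sophisticated methods.

```python
# Illustrative sketch: new articles from a crawl run are compared pairwise and
# bucketed into 'stories' when their content looks similar. The similarity
# measure and threshold below are placeholders, not the production ones.
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Article:
    source: str
    title: str
    text: str


@dataclass
class Story:
    articles: list = field(default_factory=list)


def similar(a: Article, b: Article, threshold: float = 0.6) -> bool:
    # Crude lexical similarity on titles; the real system compares full content.
    return SequenceMatcher(None, a.title.lower(), b.title.lower()).ratio() >= threshold


def group_into_stories(scratchpad: list[Article]) -> list[Story]:
    stories: list[Story] = []
    for article in scratchpad:
        for story in stories:
            if any(similar(article, existing) for existing in story.articles):
                story.articles.append(article)
                break
        else:
            # No existing story matched, so this article starts a new one.
            stories.append(Story(articles=[article]))
    return stories
```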

We don’t filter or discriminate against sources in any way.

Reading the articles for details

Once we have a story set, we analyze each article. Quite a few things happen during analysis, starting with extracting metadata such as the title, author(s), publisher, time of publication, and whether or not it’s an opinion piece.
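As a rough sketch, that extracted metadata can be thought of as a small record like the one below. The field names are illustrative, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ArticleMetadata:
    # Hypothetical shape of the metadata pulled from each article.
    title: str
    authors: list[str]
    publisher: str
    published_at: Optional[datetime]  # may be missing for some sources
    is_opinion: bool
```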

We then extract the article’s text content and split it into sentences. Why sentences? To work out what makes up a ‘detail’ in a news story, we scoured thousands of articles to see how journalists present information. A detail is an event or something that was said, ideally with context: who said it, to whom, where, and why. The typical vehicle for a complete detail like this is a single sentence.
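A naive version of the sentence-splitting step might look like the sketch below. The regex is purely illustrative; a production pipeline would use a proper sentence tokenizer.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    # Naive splitter: break on sentence-ending punctuation followed by
    # whitespace and a capital letter or opening quote.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z"“])', text.strip())
    return [p.strip() for p in parts if p.strip()]
```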

Next we cluster details across the sources in order to find consensus on their semantic meaning. In a news story the different sources are all reporting on the same thing; some might have fewer or more details than others, but there will be a lot of commonality. We want to find all the details that have several sources covering them (that’s the clustering and consensus part), and we want to find these clusters regardless of the exact wording each source chose for its sentence.
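Conceptually, this works by embedding each sentence into a vector space where paraphrases land close together, then grouping vectors that are sufficiently similar. The sketch below shows one simple greedy way to do that; the `embed` function and the 0.8 similarity threshold are placeholders, not our actual model or settings.

```python
import numpy as np


def cluster_details(sentences: list[str], embed, threshold: float = 0.8) -> list[list[int]]:
    """Greedily cluster sentences by cosine similarity of their embeddings.

    `embed` is assumed to map a list of sentences to a 2-D array of vectors
    (e.g. from a sentence-embedding model); it is a placeholder here.
    Returns clusters as lists of sentence indices.
    """
    vectors = np.asarray(embed(sentences), dtype=float)
    # Normalise so a dot product equals cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    clusters: list[list[int]] = []
    centroids: list[np.ndarray] = []
    for i, v in enumerate(vectors):
        if centroids:
            sims = np.array([float(v @ c) for c in centroids])
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                clusters[best].append(i)
                # Update the cluster centroid and re-normalise it.
                c = np.mean(vectors[clusters[best]], axis=0)
                centroids[best] = c / np.linalg.norm(c)
                continue
        # No cluster is close enough: start a new one.
        clusters.append([i])
        centroids.append(v)
    return clusters
```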

Terms

  • Publisher – an organization or individual that shares text content online.
  • Source – a publisher of news content.
  • Article – a piece of text about something going on, from a single source.
  • Story – a collection of articles on the same event or news item, from multiple sources.
  • Topic – a category of stories. This could be a region or a theme. Some example topics are: United States, Cars, Politics.
  • Detail – a common narrative element in a story across multiple sources.
  • Trust index – a numerical score assigned to each article based on its comparison to other articles within a story.

How do we find misleading text?

We look for potentially misleading pieces of text in each article. We do not highlight hyperboles, false dichotomies or straw man arguments. Instead, we highlight things like missing data references (“a recent study shows” – without a reference to the study), missing sources (“according to an anonymous source”) and scare quotes. Each of these can be verified by the reader just by looking at the text we highlighted. Either the data was referenced or it was not. Either the source was named or it was not.
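To make this concrete, the sketch below shows how such surface-level checks could be expressed as patterns. The patterns and labels are illustrative examples only, not the detectors we actually run; a real detector would also check whether a reference or named source appears nearby before highlighting anything.

```python
import re

# Hypothetical patterns for reader-verifiable red flags.
MISLEADING_PATTERNS = {
    "missing data reference": re.compile(
        r"\ba recent (study|survey|report) (shows|found|suggests)\b", re.I
    ),
    "unnamed source": re.compile(r"\baccording to (an )?anonymous sources?\b", re.I),
    # A short quoted word or two, as in scare quotes.
    "scare quotes": re.compile(r"\s[\"“'‘]\w+(\s\w+)?[\"”'’](?=[\s,.])"),
}


def find_misleading_spans(sentence: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs the reader can verify by inspection."""
    hits = []
    for label, pattern in MISLEADING_PATTERNS.items():
        for match in pattern.finditer(sentence):
            hits.append((label, match.group(0)))
    return hits
```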

How does this lead to a score?

Article scores are made up of three components:

  1. The coverage score – this is the percentage of all details found in the story set that a particular source covered. More is better.
  2. The misleading score – this is a percentage derived from the number of potentially misleading pieces of text we found in an article. More is worse.
  3. The trust index – this is a simple weighted combination of the above two scores.

We compute the coverage score by creating a ‘weight’ for each detail found. The weight is simply the number of unique sources that cover that detail. We then add up all the weights to get the maximum possible coverage score, and compute each source’s coverage score by dividing the total weight of the details it contained by that maximum.
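For example, if a story has three details covered by 3, 2 and 1 unique sources respectively, the maximum possible score is 3 + 2 + 1 = 6; a source covering only the first two details gets (3 + 2) / 6 ≈ 83%. A minimal sketch of this calculation, assuming a mapping from each detail to the set of sources covering it:

```python
def coverage_score(story_details: dict[int, set[str]], source: str) -> float:
    """story_details maps a detail id to the set of sources covering it.

    Each detail is weighted by how many unique sources cover it; a source's
    coverage is the weight it captured divided by the total available weight.
    """
    max_score = sum(len(sources) for sources in story_details.values())
    captured = sum(
        len(sources) for sources in story_details.values() if source in sources
    )
    return captured / max_score if max_score else 0.0
```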

The misleading score is much simpler; each highlighted region of potentially misleading text adds 20% to the misleading score with a maximum penalty of 100%. This means that five highlights gives that source the worst possible misleading score.
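In code form, the penalty is just a capped multiple of the highlight count (a sketch, using the 20% step described above):

```python
def misleading_score(highlight_count: int, penalty_per_highlight: float = 0.20) -> float:
    # Each highlighted region adds 20%, capped at 100% (five or more highlights).
    return min(1.0, highlight_count * penalty_per_highlight)
```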

Now we come to the trust index. Choosing how to combine the two previous scores is an ongoing discussion and has seen several iterations so far. One question tends to drive it, however:

What’s better, an article that covers the whole story but is a little misleading or an article that is pristine but misses a few details?

Over time we’ve decided to favor articles with more coverage, since more coverage tends to lead to a more balanced view of the story set. Based on this, the computation is very simple: the coverage score makes up 80% of the trust index and the misleading score determines the remaining 20%. If an article covers every detail and has no misleading text it gets a perfect trust index. If it has five or more misleading pieces of text but perfect detail coverage it gets 80% (since 20% is lost due to misleading text).
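Putting the pieces together, the trust index can be sketched as a weighted sum, with the misleading penalty applied to the 20% share. As a sanity check, perfect coverage with five or more highlights gives 0.8 × 1 + 0.2 × 0 = 80%, matching the example above.

```python
def trust_index(coverage: float, misleading: float) -> float:
    """Coverage contributes 80%; the remaining 20% is whatever survives the
    misleading penalty. Both inputs are fractions in [0, 1]."""
    return 0.8 * coverage + 0.2 * (1.0 - misleading)
```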