How it works

July 14, 2022

In this modern world information travels faster than the speed of reason, so at The Daily Edit we go to great lengths to make our analyses as unambiguous and unbiased as possible. We want you to feel confident that you’re seeing the full story. This ethic permeates every part of our operation, from how we train machine learning models to whom we hire. So, given that we tell you how trustworthy an article is compared to its peers, why should you trust us?

This post explains how our whole pipeline works, from selecting articles to be crawled, to finding the story’s details, to scoring each article. We’ll cover the parts that are completely objective, and the parts that have some subjective elements to them with an explanation of our rationale. We’ll even show you where we don’t perform so well. We’ll do our best here to explain it all in layman’s terms and will follow up with several other blog posts going into the raw, unadulterated technical detail.

Overview

Analyzing the news takes a lot of work from a number of different pieces of software. A high-level overview of what happens can be seen below.

Before we dive in, it’s best to list a few terms that will come up frequently and how we interpret them:

  • Publisher – an organization or individual that shares text content online.
  • Source – a publisher of news content.
  • Article – a piece of text about something going on, from a single source.
  • Story – a collection of articles on the same event or news item, from multiple sources.
  • Topic – a category of stories. This could be a region or a theme. Some example topics are: United States, Cars, Politics.
  • Detail – a common narrative element in a story across multiple sources.
  • Trust index – a numerical score assigned to each article based on its comparison to other articles within a story.

How do we choose articles?

Firstly, before we can do any kind of analysis we need to know that there is even a story. To do this we maintain a database of over 13,000 news sources, which is updated weekly with new sources as we encounter them. Our crawler operates on a schedule: it periodically wakes up and starts looking at the sources to find any new articles they have published. When it encounters a new article it puts it in a scratchpad with all the other new articles found during that run. At the end of a crawling run we gather articles with similar content and group them together, calling this grouping a ‘story’.

Stories evolve over time. More sources appear, existing sources edit their articles, and some even remove their article altogether. To cover all these cases we have logic around when we reprocess stories. For starters, we refresh each story at most once every six hours. We feel this is frequent enough to provide real value with our analyses without overburdening our servers with redundant work. During one of these refreshes, if we encounter an article we already have, we’ll only re-fetch it if at least 12 hours have passed since we last did. This means we could miss some frequent edits on a breaking story, but by the time things have settled down we’ll have covered the changes. Keeping this relatively infrequent also reduces the burden we place on our sources’ websites.
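
To make the timing rules concrete, here is a minimal sketch of that refresh policy in Python. The six and twelve hour windows are the real numbers from above; the function and constant names are made up for this illustration.

```python
from datetime import datetime, timedelta

STORY_REFRESH_INTERVAL = timedelta(hours=6)     # a story is refreshed at most once every 6 hours
ARTICLE_REFETCH_INTERVAL = timedelta(hours=12)  # a known article is re-fetched at most once every 12 hours

def should_refresh_story(last_refreshed: datetime, now: datetime) -> bool:
    """True when enough time has passed to reprocess the whole story."""
    return now - last_refreshed >= STORY_REFRESH_INTERVAL

def should_refetch_article(last_fetched: datetime, now: datetime) -> bool:
    """True when an article we already have is due for another look."""
    return now - last_fetched >= ARTICLE_REFETCH_INTERVAL
```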

While we don’t filter or discriminate against sources in any way, we do have one technical limitation that causes us to remove some: poorly structured HTML, or websites that load articles through infinite scrolling. When we crawl a news article we’re crawling the HTML their website serves us. There are recommendations and some poorly followed standards, but for the most part HTML is the Wild West; the number of ways it can be organized is practically infinite. Most of the time we encounter reasonably well-structured HTML and can extract the text content with ease; sometimes it’s a little more difficult and requires a sophisticated model to parse; other times it’s just plain diabolical. When we encounter a pathological source our application can’t work with, we remove it from our database. This means that we might miss a detail or two, particularly if that source had the scoop, but trying to analyze text that might not be the article content would pollute all the other articles we cover with things like advertising text or image captions.

Reading the articles

At the end of this crawling process we have a collection of articles grouped into a ‘story’, ready for analysis. Quite a few things happen during analysis, starting with extracting metadata. An article’s metadata includes items like its title, author(s), publisher, the time it was published and whether or not it’s an opinion piece.
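
Conceptually, the metadata we keep for each article looks something like the sketch below. This is a simplified illustration, not our actual schema; the field names are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ArticleMetadata:
    title: str
    authors: list[str]
    publisher: str
    published_at: Optional[datetime]  # not every source exposes a reliable timestamp
    is_opinion: bool
```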

We then extract the article’s text content. We’re not interested in menus, advertising or images. What we want is the raw text content that makes up the piece. This process is rather technical and has several components itself, so we’ll cover it in a post of its own. Worth mentioning, however, is that from time to time our model might let some text that wasn’t part of the article content leak into the analysis. Most often these leaks are image captions from the article. The ultimate effect of this is that we sometimes show a ‘more detail’ item which isn’t really relevant. We’re always working to improve this and are regularly reducing the occurrence rate.

How do we find details?

Once we have the article’s raw text content we split it up into sentences. At first this might seem really simple: just split on the period, right? However, it’s one of those things that sounds easy but has labyrinthine complexity when you dig a little deeper. For example, what about a prefix like ‘Ms.’? Or an acronym like U.S.A.? Or what about an acronym that someone just made up and placed right at the end of a sentence? Despite the challenges, we do eventually get nicely split sentences out of the article.
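
To give a flavor of why naive splitting fails, here is a toy comparison between splitting on the period and a rule that respects a few common abbreviations. This is only an illustration; our production splitter handles far more cases than this.

```python
import re

# Naive approach: split on '. ' -- this breaks sentences at 'Ms.' and similar abbreviations.
def naive_split(text: str) -> list[str]:
    return [part.strip() for part in text.split(". ") if part.strip()]

# Slightly smarter: skip split points that immediately follow a known abbreviation.
ABBREVIATIONS = {"ms", "mr", "mrs", "dr", "u.s.a", "etc"}

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start() + 1].rsplit(None, 1)
        prev_word = words[-1].rstrip(".").lower() if words else ""
        if prev_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(naive_split("Ms. Smith met the president. She flew to the U.S.A. last week."))
print(split_sentences("Ms. Smith met the president. She flew to the U.S.A. last week."))
```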

Why sentences though? When considering what makes up a ‘detail’ in any news story, we scoured thousands of articles to see how journalists present information. A detail is some event or something that was said. Ideally it would contain context like who said the thing, to whom it was said, where, and why. The typical presentation for an entire detail like this is a sentence. Sometimes the context is added in adjacent sentences, forming a paragraph. We were faced with a choice: should sentences or paragraphs be the ‘atom’ when considering details? We went with sentences for a simple reason: the majority of paragraphs we researched contained more than one detail across their sentences. If we tried to analyze details at the paragraph level we’d end up with all kinds of strange behavior, since the semantic meaning of each detail would be mixed.

So, sentences it is! We now have a collection of them for every article in the story. Next we cluster them together across the sources in order to find consensus on their semantic meaning. That’s a mouthful, so what does it mean? 

In a news story the different sources are all reporting on the same thing; some might have fewer or more details than others, but there will be a lot of commonality. We want to find all the details that have several sources covering them; that’s the clustering and consensus part. Additionally, we want to find these clusters regardless of the exact wording each source chose for its sentence. For example, let’s pretend there’s a story covering a new scientific paper on the effect of a Nutella-only diet. One detail may be that participants reported a marked increase in happiness in their daily lives. One source may write “survey respondents consistently showed an improvement in happiness” while another source may write “participants demonstrated a 10-20% increase in happiness when surveyed”. Despite the difference in words these are the same thing, and we want to capture that. That’s the semantic part.
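
To make the idea concrete, here is a heavily simplified sketch of semantic clustering built on an off-the-shelf sentence-embedding model. The sentence-transformers library, the model name and the 0.75 similarity threshold are stand-ins chosen for this example, not a description of our production system.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any sentence-embedding model would do

model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_sentences(sentences: list[str], threshold: float = 0.75) -> list[list[int]]:
    """Greedy clustering: attach each sentence to the closest existing cluster if it is
    similar enough, otherwise start a new cluster. Returns lists of sentence indices."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    clusters: list[list[int]] = []
    centroids: list[np.ndarray] = []
    for i, emb in enumerate(embeddings):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            # emb is unit-length, so this is cosine similarity to the cluster centroid
            sim = float(np.dot(emb, centroid) / np.linalg.norm(centroid))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(emb.copy())
        else:
            clusters[best].append(i)
            centroids[best] += emb  # running sum, renormalized at comparison time
    return clusters

groups = cluster_sentences([
    "Survey respondents consistently showed an improvement in happiness.",
    "Participants demonstrated a 10-20% increase in happiness when surveyed.",
    "The study was funded by a hazelnut growers' association.",
])
print(groups)  # the two happiness sentences should end up in the same cluster
```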

How we actually do this is horrendously technical and will be saved for our next post covering all those nitty-gritty details (pun intended). The level of consensus we need in order to call a cluster of sentences a detail depends on how many articles we have. Not every story is as earth-shattering as the Nutella diet one; some only get covered by a handful of sources. When we have fewer than 10 articles we only need 2 of them to contain matching details to form consensus. Up to 50 articles, that threshold is increased to 7 articles containing a shared detail. Beyond 50, we require at least 15 articles to present a detail for that detail to be considered.
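
In code, those consensus levels boil down to a tiny step function, a direct translation of the numbers above:

```python
def consensus_threshold(num_articles: int) -> int:
    """Minimum number of articles that must share a detail for it to count."""
    if num_articles < 10:
        return 2
    if num_articles <= 50:
        return 7
    return 15
```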

There’s no side-stepping that our choice of consensus levels is subjective. Every month we revisit these numbers and try to do better; what we have so far was chosen through trial and error on typical news stories.

You might be asking: but what about that one source which has something special the others didn’t cover? Unfortunately that will be left out of our analysis. There is no way for us to verify whether that detail is at all valid or relevant to the story. During a breaking story this might cause us to miss things; however, after just one hour of the story’s life we have enough to form consensus, since sources tend to copy each other.

There’s more to how we form consensus though. Here’s something fun a clever news conglomerate could do. Let’s say our conglomerate (we’ll call it Shoes Corp) has several dozen publishers in their organization. Shoes Corp could instruct each of these publishers to write the same superfluous details in order to trick our analysis software into thinking that they’ve covered some special detail. This would lead to these organizations receiving a higher score than others (more to come on that) and would unfairly favor Shoes Corp. To combat this twisted gamification, we adjust the scoring weight of each detail based on the number of unique sources that covered it. We maintain a database of correlated sources to do this.
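
A sketch of that weighting follows. The publisher names and the mapping are purely hypothetical; in practice the parent organizations come from our database of correlated sources.

```python
# Hypothetical mapping from publisher to parent organization.
PARENT_ORG = {
    "shoes-daily.example": "Shoes Corp",
    "shoes-tribune.example": "Shoes Corp",
    "independent-gazette.example": "Independent Gazette",
}

def detail_weight(covering_publishers: set[str]) -> int:
    """A detail's weight is the number of unique organizations covering it,
    so a conglomerate's many publishers only count once."""
    return len({PARENT_ORG.get(p, p) for p in covering_publishers})

print(detail_weight({"shoes-daily.example", "shoes-tribune.example"}))        # 1
print(detail_weight({"shoes-daily.example", "independent-gazette.example"}))  # 2
```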

At the end of this process we have the text content from every article and all of the details we found in the whole story. Now we go through each article and look for which details it did not contain. For each of these we then try to find a sentence within the article that is somewhat related to that missing detail. With that sentence we can place a highlight in the app and give the reader a place to find the missing information with the right context. This is fun since we’re trying to connect the missing piece to something that might not have anything close to it in the article at all. Despite this, we get it right most of the time, but we’re always working to improve this feature in particular.
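
As a toy stand-in for that matching step, the sketch below scores every sentence in an article against a missing detail using simple word overlap and picks the best one. Our real matcher is more involved, but the shape of the problem is the same.

```python
def best_anchor_sentence(missing_detail: str, article_sentences: list[str]) -> str:
    """Pick the article sentence most related to the missing detail, giving the app
    a sensible place to attach the 'more detail' highlight."""
    detail_words = set(missing_detail.lower().split())

    def overlap(sentence: str) -> float:
        words = set(sentence.lower().split())
        return len(words & detail_words) / max(len(words | detail_words), 1)

    return max(article_sentences, key=overlap)

print(best_anchor_sentence(
    "participants reported a marked increase in happiness",
    ["The study followed 200 participants for a year.",
     "Critics questioned the funding behind the research."],
))
```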

How do we find misleading text?

Next we look for potentially misleading pieces of text in each article. This can be a slippery slope, the bottom of which terminates in a sheer cliff. One person’s idea of misleading text might not be the same as another’s. Much discussion at The Daily Edit centers on this point, but ultimately our plan of attack is to never consider anything misleading unless it can be shown objectively by the actual text we highlight.

This means that we do not highlight hyperbole, false dichotomies or straw man arguments. Instead, we highlight things like missing data references (“a recent study shows” – without a reference to the study), missing sources (“according to an anonymous source”) and scare quotes. Each of these can be verified by the reader just by looking at the text we highlighted. Either the data was referenced or it was not. Either the source was named or it was not (and it’s OK not to name a source; we just want to increase awareness).

We achieve this by matching grammatical patterns against each sentence. Each time we find a match we add it to a scratchpad for further review. Later in the article we might find another piece of text that does in fact clear a previous match. For example, an article might cite some data in its first paragraph but only mention where it came from in the last paragraph. In this case the data is properly referenced and we shouldn’t highlight it.
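
Here is a deliberately simplified sketch of that flow for one category of highlight. The regular expressions are toy patterns invented for this example; the real system uses our own pattern language rather than raw regexes, and its clearing logic is finer-grained.

```python
import re

# Toy patterns for "data mentioned" and "data properly referenced".
UNREFERENCED_DATA = re.compile(r"\b(a recent study|research shows|a new report)\b", re.IGNORECASE)
DATA_REFERENCE = re.compile(r"\b(published in|according to the journal)\b", re.IGNORECASE)

def find_unreferenced_data(sentences: list[str]) -> list[int]:
    """Return indices of sentences that cite data without a reference. A reference
    appearing anywhere later in the article clears the earlier matches."""
    scratchpad: list[int] = []
    cleared = False
    for i, sentence in enumerate(sentences):
        if UNREFERENCED_DATA.search(sentence):
            scratchpad.append(i)
        if DATA_REFERENCE.search(sentence):
            cleared = True
    return [] if cleared else scratchpad

article = [
    "A recent study shows chocolate improves mood.",
    "Experts remain divided on the findings.",
    "The work was published in the Journal of Questionable Diets.",
]
print(find_unreferenced_data(article))  # [] -- the reference in the last sentence clears the flag
```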

Making this happen led us to create a small programming language which allows us to describe grammatical patterns of great complexity in a very concise way. It supports 60 languages so far! Since it’s a complex tool in its own right, we’ll leave its description for a post of its own.

Alas, we do not always get this perfect. Languages are tricky and there are myriad ways to construct a sentence, so from time to time we’ll highlight something erroneously. If that happens then please report it in the app so we can improve things further.

How does this lead to a score?

OK, so now we have all the articles, details, missing details and misleading text. The poor computer is exhausted and just wants to go home and sleep. However, it has just one more thing to do before it can clock off. It has to produce a score for the reader to compare sources. 

Heads-up: this is the most subjective part of what we do, so please send us any and all feedback you may have so that we can make something that works for everybody.

Article scores are made up of three components:

  1. The coverage score – this is the percentage of all details found in the story set that a particular source covered. More is better.
  2. The misleading score – this is a percentage derived from the number of potentially misleading pieces of text we found in an article. More is worse.
  3. The trust index – this is just a simple arithmetic combination of the above two scores.

We compute the coverage score by creating a ‘weight’ for each detail we found. The weight is just the number of unique sources that cover that detail. As per the Shoes Corp example above, their publishers would all count as a single source when calculating the weight. Adding all these weights up gives the maximum possible coverage score. We then compute each source’s coverage score by dividing the summed weights of the details it contained by that maximum. So if you see a source with a coverage score of 100% then it did a bang-up job of covering the story; give that journalist a Pulitzer.
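
Putting that arithmetic into code, a sketch of the coverage score looks like this. The detail names and weights are made up for the example; the weights would come from the unique-source counting described above.

```python
def coverage_score(covered_details: set[str], detail_weights: dict[str, int]) -> float:
    """Sum the weights of the details this article covered and divide by the sum of
    the weights of every detail in the story. Returned as a percentage."""
    max_score = sum(detail_weights.values())
    covered = sum(weight for detail, weight in detail_weights.items() if detail in covered_details)
    return 100.0 * covered / max_score if max_score else 0.0

story_weights = {"happiness-increase": 5, "study-funding": 2, "sample-size": 3}
print(coverage_score({"happiness-increase", "sample-size"}, story_weights))  # 80.0
```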

The misleading score is much simpler. Each highlighted region of potentially misleading text adds 20% to the misleading score, with a maximum penalty of 100%. This means that five highlights gives that source the worst possible misleading score. This sounds bad, but most journalists are pretty good, so it’s rare to see more than 40%.

Now we come to the trust index. Choosing how to combine the two previous scores to form this is an ongoing discussion and has seen several iterations so far. One question tends to drive it, however:

What’s better, an article that covers the whole story but is a little misleading or an article that is pristine but misses a few details?

Over time we’ve settled on favoring articles with more coverage, since more coverage tends to lead to a more balanced view of the story. If they include a couple of scare quotes then so be it; the reader is still better off than only seeing half the story, plus we highlight those scare quotes in the article so they can make their own informed judgment. Based on this, the computation is very simple: the coverage score makes up 80% of the trust index and the misleading score determines the remaining 20%. If an article covers every detail and has no misleading text it gets a perfect trust index. If it has five or more misleading pieces of text but perfect detail coverage it gets 80% (since 20% is lost due to misleading text). And so on.
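
In code, the scoring described above comes out to something like the sketch below. The 20% per highlight and the 80/20 split are the real numbers from this post; the exact way we combine them continues to evolve, so treat this as an illustration of the current scheme rather than a frozen formula.

```python
def misleading_score(num_highlights: int) -> float:
    """Each potentially misleading highlight costs 20%, capped at 100%."""
    return min(100.0, 20.0 * num_highlights)

def trust_index(coverage: float, misleading: float) -> float:
    """Coverage contributes 80% of the trust index; the misleading score determines
    the remaining 20%. All values are percentages."""
    return 0.8 * coverage + 0.2 * (100.0 - misleading)

print(trust_index(coverage=100.0, misleading=misleading_score(0)))  # 100.0 -- perfect article
print(trust_index(coverage=100.0, misleading=misleading_score(5)))  # 80.0 -- full coverage, five highlights
```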

What about machine learning bias?

Much of what we’ve covered so far depends on the output of machine learning models, and no discussion of these is complete without covering bias. We’ve all read about machine learning bias ruining models (Google did it, so did Facebook), so how does this apply here?

Our models are trained on very large corpora of text. These are selected to give the broadest possible coverage of news stories in the wild. Despite this, the inherent structure of news stories can lead to bias seeping through the model. 

For example, what about an entirely new technology covered in an article with a truly eccentric style of writing? This would never have been encountered during training. In our application the worst this leads to is a source’s sentence not making it into a cluster. This will cause us to show that detail as ‘more detail’ in the app on that source’s article, despite the detail already being there in some funky wording.

This has two effects. First, it means that we waste a few seconds of the reader’s time by showing them a detail that they can already see. Second, it means that we penalize the article and give it a lower score than it would otherwise have had. This isn’t optimal, and it’s not easy for us to know when it happens, so if you encounter this case in the wild please let us know so that we can improve our model’s training data in the future.

Compared to the two examples in the links above, however, the effects of bias in our machine learning models don’t pose a serious risk; they’re a minor annoyance more than anything else.

Conclusion

Whew! Almost 3000 words and here we are, finally, at the end. In an industry as thorny as news there is no trust without transparency, so we hope that this post has helped show you at least some of the lengths we go to at The Daily Edit to give you a better news reading experience and more media insight.

This will forever be a work in progress as the news itself changes, so please send any feedback and questions you might have. We’re always open to discussion and debate on any topic.

Over the coming weeks we’ll publish more posts explaining each of these components with all the technical detail.

Download The Daily Edit app for more insights, to learn about our mission, and how our technology is changing the way the world engages with information.