Showing posts with label selection bias. Show all posts

Saturday, October 17, 2015

17/10/15: Let’s talk about the Law of Small Numbers


Wonkishly awesome, folks…

Let’s start with a set up

You decide to flip a coin 4 times in a row and record the outcome of each flip. After you are done flipping, you look at every flip that “immediately followed an outcome of heads, and compute the relative frequency of heads on those flips”.

“Because the coin is fair, [you] of course expect this empirical probability of heads to be equal to the true probability of flipping a heads: 0.5.”

You will be wrong. If you “were to sample one million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4.”
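The "approximately 0.4" figure can be checked without sampling a million coins, because there are only 16 possible 4-flip sequences and each is equally likely for a fair coin. Below is a minimal sketch of that check. The function name is my own, and I am assuming that sequences in which no flip follows a heads (TTTT and TTTH) are simply dropped, since the relative frequency is undefined for them.

```python
from fractions import Fraction
from itertools import product


def freq_heads_after_heads(seq):
    """Relative frequency of heads on flips that immediately follow a heads.

    Returns None when no flip in the sequence follows a heads
    (assumption: such sequences are excluded from the average).
    """
    followers = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == "H"]
    if not followers:
        return None
    return Fraction(followers.count("H"), len(followers))


# Every 4-flip sequence of a fair coin is equally likely, so a plain
# average over the sequences gives the exact expectation.
freqs = [freq_heads_after_heads(s) for s in product("HT", repeat=4)]
defined = [f for f in freqs if f is not None]

expected = sum(defined) / len(defined)
print(len(defined), expected, float(expected))
```

Under these assumptions, 14 of the 16 sequences have a defined relative frequency, and their average works out to exactly 17/42, or roughly 0.405.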

Two researchers, Joshua Miller and Adam Sanjurjo “demonstrate that in a finite sequence generated by i.i.d. [independent, identically distributed] Bernoulli trials with probability of success p, the relative frequency of success on those trials that immediately follow a streak of one, or more, consecutive successes is expected to be strictly less than p, i.e. the empirical probability of success on such trials is a biased estimator of the true conditional probability of success.”

Which implies

So far, pretty innocuous from the average punter's perspective. But wait. “While, in general, the bias does decrease as the sequence gets longer, for a range of sequence (and streak) lengths often used in empirical work it remains substantial, and increases in streak length.” In other words, while the empirical probability does get closer and closer to the true conditional probability, it does so only in trials so large (so many coin flips) that such convergence does not make much of a difference in our, human, decision making.
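To get a feel for how the bias moves with sequence length and streak length, here is a rough sketch that computes the expected relative frequency exactly for a fair coin by brute-force enumeration. The function name and the particular lengths (6, 9, 12) and streaks (1, 2, 3) are my own choices for illustration, not values from the paper, and enumeration is only practical for short sequences.

```python
from fractions import Fraction
from itertools import product


def expected_freq_after_streak(n, k):
    """Exact expected relative frequency of heads on flips that immediately
    follow k consecutive heads, averaged over all fair-coin sequences of
    length n in which at least one such flip exists."""
    total = Fraction(0)
    count = 0
    for seq in product((0, 1), repeat=n):  # 1 = heads, 0 = tails
        hits = 0
        trials = 0
        for i in range(k, n):
            if all(seq[i - k:i]):  # previous k flips were all heads
                trials += 1
                hits += seq[i]
        if trials:
            total += Fraction(hits, trials)
            count += 1
    return total / count


# Illustrative grid only; the true probability is 0.5 in every cell.
for n in (6, 9, 12):
    row = [float(expected_freq_after_streak(n, k)) for k in (1, 2, 3)]
    print(n, [round(x, 3) for x in row])
```

Each printed row is a sequence length, and each column a streak length, so reading across and down the grid shows how far the expected empirical frequency sits below 0.5.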

And that is pretty pesky for the way we look at probabilistic outcomes and make decisions based on our expectations, whenever our decisions are sequential.

Impact on decision making

“This result has considerable implications for the study of decision making in any environment that involves sequential data”. These implications are:

  1. This provides “a structural explanation for the persistence of one of the most well-documented, and robust, systematic errors in beliefs regarding sequential data—that people have an alternation bias (also known as negative recency bias)… — by which they believe, for example, that when observing multiple flips of a fair coin, an outcome of heads is more likely to be followed by a tails than by another heads”;
  2. It also helps resolve “…the closely related gambler’s fallacy…, in which this alternation bias increases with the length of the streak of heads.”
  3. “Further, the result shows that data in the hot hand fallacy literature …has been systematically misinterpreted by researchers; for those trials that immediately follow a streak of successes, observing that the relative frequency of success is equal to the overall base rate of success, is in fact evidence in favor of the hot hand, rather than evidence against it.”

And tangible applications are

So the realisation that “the empirical probability of success on such trials is a biased estimator of the true conditional probability of success” helps explain why “…the inability of the gambler to detect the fallacy of his belief in alternation has an exact parallel with the researcher’s inability to detect his mistake when concluding that experts’ belief in the hot hand is a fallacy.”

But there is more. Per authors, “the result may have implications for evaluation and compensation systems. That a coin is expected to exhibit an alternation “bias” in finite sequences implies that the outcome of a flip can be successfully “predicted” in finite sequences at a rate better than that of chance (if one is free to choose when to predict).”

They offer the following example of this: “suppose that each day a stock index goes either up or down, according to a random walk in which the probability of going up is, say, 0.6. A financial analyst who can predict the next day’s performance on the days she chooses to, and whose predictions are evaluated in terms of how her success rate on predictions in a given month compares to that of chance, can expect to outperform this benchmark… For instance, she can simply predict “up” immediately following down days, or increase her expected relative performance even further by predicting “up” only immediately following longer streaks of consecutive down days.”
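The analyst example can be sketched numerically. The code below is my own illustration, not the authors' calculation: it assumes a short "month" of 10 trading days (so that all 1,024 up/down paths can be enumerated exactly), independent days with probability 0.6 of going up, and an average taken only over months in which the analyst makes at least one prediction. The function and parameter names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_success_rate(n_days, p_up, streak=1):
    """Expected monthly success rate of an analyst who predicts "up" on every
    day that immediately follows `streak` consecutive down days.

    The expectation is taken over months in which she makes at least one
    prediction (assumption: other months are not evaluated).
    """
    weighted_rate = Fraction(0)
    weight = Fraction(0)
    for days in product((1, 0), repeat=n_days):  # 1 = up, 0 = down
        calls = [
            days[i]
            for i in range(streak, n_days)
            if not any(days[i - streak:i])  # previous `streak` days all down
        ]
        if not calls:
            continue
        ups = sum(days)
        prob = p_up ** ups * (1 - p_up) ** (n_days - ups)
        weighted_rate += prob * Fraction(sum(calls), len(calls))
        weight += prob
    return weighted_rate / weight


p = Fraction(3, 5)  # probability of an up day, as in the quoted example
for k in (1, 2):
    print(k, float(expected_success_rate(10, p, streak=k)))
```

The benchmark of chance here is 0.6, so any value above that is the "outperformance" the authors describe, even though the analyst has no information whatsoever about the next day.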

Going back to the first example with coin flipping, one might expect the law of large numbers to imply that, as your sample size (number of coins) rises, “…the average empirical probability of heads would approach the true probability. The key to why this is not the case, and to why the bias remains, is that it is not the flip that is treated as the unit of analysis, but rather the sequence of flips from each coin. In particular, if [you] were willing to assume that each sequence had been generated by the same coin, and [you] were to compute the empirical probability by instead pooling together all of those flips that immediately follow a heads, regardless of which coin produced them, then the bias would converge to zero as the number of coins approaches infinity.”

What this means is that “…in treating the sequence as the unit of analysis, the average empirical probability across coins amounts to an unweighted average that does not account for the number of flips that immediately follow a heads in each sequence, and thus leads the data to appear consistent with the gambler’s fallacy.”
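The difference between the two ways of averaging is easy to see in the 4-flip case. In the sketch below (my own illustration), the 16 equally likely sequences stand in for a large population of coins: the unweighted figure averages one relative frequency per coin, while the pooled figure throws all flips that follow a heads into one pot. As before, I am assuming that sequences with no flip following a heads are left out.

```python
from fractions import Fraction
from itertools import product

per_coin = []     # one relative frequency per coin (sequence)
pooled_heads = 0  # heads among all flips that follow a heads
pooled_flips = 0  # all flips that follow a heads, across coins

for seq in product("HT", repeat=4):
    followers = [seq[i + 1] for i in range(3) if seq[i] == "H"]
    if not followers:
        continue  # no flip follows a heads: frequency undefined
    heads = followers.count("H")
    per_coin.append(Fraction(heads, len(followers)))
    pooled_heads += heads
    pooled_flips += len(followers)

unweighted = sum(per_coin) / len(per_coin)
pooled = Fraction(pooled_heads, pooled_flips)
print(unweighted, pooled)
```

The unweighted average across coins comes out at 17/42, while pooling gives 12 heads out of 24 flips, i.e. exactly 1/2. The gap arises because a sequence like HHHH contributes three flips to the pool but only one number to the unweighted average.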

Per authors, “the implications for learning are stark: to the extent that decision makers update their beliefs regarding sequential dependence with the (unweighted) empirical probabilities that they observe in finite length sequences, they can never unlearn a belief in the gambler’s fallacy…”

Overall, we have

To sum this up, the authors found “a subtle but substantial bias in a standard measure of the conditional dependence of present outcomes on streaks of past outcomes… The mechanism is a form of selection bias, which leads the empirical probability …to underestimate the true probability of a given outcome, when conditioning on prior outcomes of the same kind. The biased measure has been used prominently in the literature that investigates incorrect beliefs in sequential decision making --- most notably the Gambler's Fallacy and the Hot Hand Fallacy.”

The two fallacies are defined as follows:

  • “…People believe outcomes alternate more than they actually do, e.g. for a fair coin, after observing a flip of a tails, people believe that the next flip is more likely to produce a heads than a tails. Further, as a streak of identical outcomes increases in length, people also tend to think that the alternation rate on the outcome that follows becomes even larger, which is known as the gambler’s fallacy”.
  • “The hot hand fallacy typically refers to the mistaken belief that success tends to follow success (hot hand), when in fact observed successes are consistent with the typical fluctuations of a chance process.”

After correcting for the bias, the authors show that “the conclusions of some prominent studies in the literature are reversed.” Awesomely wonkish...


Full paper: Miller, Joshua Benjamin and Sanjurjo, Adam, Surprised by the Gambler's and Hot Hand Fallacies? A Truth in the Law of Small Numbers (September 15, 2015). IGIER Working Paper #552. http://ssrn.com/abstract=2627354

Saturday, May 31, 2014

31/5/2014: Twitter: Promoting Isolation, Ideological Segregation and All Things Good to Your Political Engagement


A very interesting study comparing media and news use via Twitter (social media) and traditional media (print, radio and TV). The paper, titled "Are Social Media more Social than Media? Measuring Ideological Homophily and Segregation on Twitter" (May 2014) by Yosh Halberstam and Brian Knight, is available here: http://bfi.uchicago.edu/sites/default/files/research/Twitter_may232014.pdf

Some highlights:

Per authors, "Social media represent a rapidly growing source of information for citizens around the world. In this paper, we measure the degree of ideological homophily and segregation on social media."

The reason this is salient is that there has been a "tremendous rise in social media during the past decade, with 60 percent of American adults and over 20 percent of worldwide population currently using social networking sites (Rainie et al., 2012)…. Indeed, this phenomenal growth in social media engagement in the U.S. and around the world has transformed the nature of political discourse. Two thirds of American social media users—or 39 percent of all American adults—have engaged in some form of civic or political activity using social media, and 22 percent of registered U.S. voters used social media to let others know how they voted in the 2012 elections."

Per authors, "Three key features of social media distinguish it from other forms of media and social interactions." These are:

  • "…social media allow users to not only consume information but also to produce information." It is worth noting that social media can also reproduce information produced on social media, as well as that produced by traditional media.
  • "…the information to which users are exposed depends upon self-chosen links among users." In other words, information produced and distributed on social media can be subject to self-selection bias. The extent of this selection is more limited in the case of traditional media, where individual biases of consumers can be reinforced by selecting specific programmes/channels/publications, but beyond that, the content received by consumers is selected for them by someone else: journalists and editors.
  • "…information on social media travels more rapidly and broadly than in other forms of social interactions. …[social media network model] leads to a substantially broader reach and more rapid spread of information than other forms of social interactions."

As authors put it: "Given these three distinguishing features, the rapid growth of social media has the potential to effect a structural change in the way individuals engage with one another and the degree to which such communications are segregated along ideological lines."


To examine this possibility, the authors construct "a network of links between politically-engaged Twitter users. For this purpose, we selected Twitter users who followed at least one Twitter account associated with a candidate for the U.S. House during the 2012 election period. Among this population of over 2.2 million users, we identify roughly 90 million links, which form the network." Based on political party followed, users were assigned ideological identifiers.

Two key findings of the paper are as follows:

  1. "…we find that the network we constructed shares important features with face-to-face interactions. Most importantly, both settings tend to exhibit a significant degree of homophily, with links more likely to develop between individuals with similar ideological preferences." In other words, we do show strong selection biases in networks we form. Doh!..
  2. "…when computing the degree of ideological segregation and comparing it to ideological segregation in other settings, we find that Twitter is much more segregated than traditional media, such as television and radio, and is more in-line with ideological segregation in face-to-face interactions, such as among friends and co-workers." Worse: we not only form biased networks, we also create selection-biased interactions and generate selection-biased chains and flows of content. Doh! Redux...

Conclusion: "Taken together, our results suggest that social media may be a force for increasing isolation and ideological segregation in society."

Wait… so we act on the social media base to create networks that are closer to friends networks… and this leads to… isolation?.. Well, my eye, I would have thought this would be the opposite…

But the top conclusion makes sense: "The issue of ideological segregation is important when providing such information. Exposure to diverse viewpoints in a society helps to ensure that information is disseminated with little friction across a large number of people. When a community is polarized and is divided into factions, by contrast, information may spread unevenly and may miss intended targets. Our results suggest that social media are highly segregated along ideological lines and thus emphasize these potential problems associated with the flow of information in segregated networks."

The problem, of course, is: Can the selection bias be ameliorated? Can people be 'incentivised' to engage with ideological opposites? In my view - yes. This can most likely be achieved by educating people about systems of thought, logic, structures of knowledge, information. The thing is: in social networks, such education is both more feasible (volume of information delivered and speed are both higher) and probably more productive (because there is inherent trust in one's own network that is stronger than in detached media networks). Peers generate stronger bonds than preachers...

The paper has some fascinating data illustrations of media biases, though - worth looking at in the appendix.