I always found entertaining doing polls on Twitter to check the percentage of peoples who thinks like me on certain topics, like how many hours you spend on Instagram or what film you prefer. Doing random polls on internet is fun and all, but we should always be wary of conducting an online poll and reporting the results on an official work, especially when the writer has no basics of statistic whatsoever.
I saw many contemporaries from different specializations (like psicology or marketing) conducting polls on social network and reporting the results on their thesis, taking what they obtained for absolute truth. But in reality, polls are hard. Statisticians spend a lot of their time designing efficient polls, which must be structured in an efficient way to minimize the intrinsic bias in the questions and ensure we select a good sample of the population.
How can we pick a good sample?
We can imagine the population as the entire set of people who are eligible for answering the poll. Since it is generally impossible to study the entire population, we can consider samples which we’ll then use to draw our conclusion over the entire population. Imagine a sample of the population as a subset of the entire population with a certain size n. The larger n, the more information we will be able to gather. However, keep in mind that a large size don’t necessarily means that the information we gather is going to be any better (however if the sampling is done effectively, theoretically for very large n we’ll obtain similar distribution due ot the central limit theorem).
The biggest problem with online polls is without a doubt the fact that we cannot ensure that our sample population will be representative. A general and simple way to ensure that the sample is more representative is by doing the sampling randomly from the entire population. This way we ensure that our data won’t be skewed and we’ll obtain a better accuracy. If the random sampling is doing correctly, even small samples can show low sampling error. To achieve their precise results, statistician rely on the central limit theorem. To put it simply, given a sufficient number of sample of the same size, we’ll obtain a similar distribution to the one of the entire population.
The weak points of online polls
After explaining this, it should be more clear why using polls conducted online is a bad idea. Let’s try to summarize the main weak points of this procedure:
-
The sampling is not randomized. Since our possibilities are limited, we’ll rely on easier and cheaper methods of obtaining our informations, like using EasyPolls or asking relatives to answer our question. This lead to extremely skewed data, which are not able to capture the entire population at all. For example, if you share your polls on social network, you’ll reach only a specific demographic which usually shares the same ideas or location as you. This will be nowhere near the entire population distribution we are trying to approximate. Internet as a whole might not be able to capture the population, since there are many people who don’t use it at all. Being able to capture every demographic in our poll is important to ensure the best representation.
-
Since online polls are self-selected, there exists the risk of having only people interested in the topic to fill our poll. This implies more skewiness in our data.
-
The fact that it’s easy to access and modify the data, people (even you) might use this as a way of manipulate information towards certains results.
Since we are considering small scales polls, I will leave out the possibility of bots manipulation.
Proving the significance of your results
Now, suppose you are able to do the sampling correctly. To claim a (statistically) significat result we must also prove that the results in our data are not attained by chance alone. To do so in a scientific, credible way, you’ll need to provide values like p-value and confidence interval, which helps you prove statistical significance.
P-value is, in short, the probability that the null hypothesis is true. You can think of the null hypothesis as a ‘ground truth’ you are trying to prove false. For example, say that your null hypothesis is something like ‘all birds fly’. In order to accept the test result we are trying to prove (‘not all birds can fly’), we want the p-value to be as small as possible. In statistic, there are certain value called alpha (usually 0.05 and 0.01) used to determine how ‘good’ a p-value is. With a good p-value, we can then reject our null hypothesis and support what we are trying to prove.
On the other hand, a confidence interval is a range which is likely to contain to contain our value. Thus, to support a result as significant, we want our null hypothesis to reside outside the confidence interval.
I won’t dive more into the details, but this is to remark that even when you might think you found some kind of correlation, it might just be for chance. The only way to prove your results is by supporting it with appropriate metrics.
Conclusion
If you want to publish meaningful results in your thesis, your paper, or your article, keep away from online polls. The fact that so many people rely on such methods to draw scientific conclusions shows that we should stop treating statistic like something only engineer and mathematicians rely on, and instead try to grasp why and how statistic can help us on our everyday life. Having journalist reporting results or psychologist conducting test without the proper knowledge is disheartening at best, and disinformative for everyone reading their works. If you want to dive deeper into statistic in an more ‘relaxed’ way, ‘The Art of Statistics: Learning from Data’ from David Spiegelhalter is a wonderfoul book to approach the field, in my opinion.
Thanks for reading all the way! I hope you found this post informative. I don’t claim the absolute truth of what I report above, and I always hope for feedbacks from people to hemp me improve if I got something wrong. Moreover, if you are interested in the argument I always suggest you to read from different sources and never take everything for granted.
sources
“The Elements of Statistical Learning”, by Jerome H. Friedman, Robert Tibshirani and Trevor Hastie