|
From DataScience Bookcamp by Leonard Apeltsin This article covers: • Analysis of Binomials using the SciPy library.
|
Take 40% off Data Science Bookcamp by entering fccapeltsin into the discount code box at checkout at manning.com.
Statistics is a branch of mathematics dealing with the collection and interpretation of numeric data. It’s the precursor of all modern data science. The term Statistic originally signified “the science of the state”, because statistical methods were first developed to analyze the data of state governments. Because since ancient times, government agencies have gathered data pertaining to their populace. That data can be used to levy taxes, organize large military campaigns, build public works, etc. Hence, critical state decisions have long depended on the quality of data. Poor record-keeping could lead to potentially disastrous results. This is why state bureaucrats are concerned by any random fluctuations in their records. Probability theory eventually tamed these fluctuations, making the randomness interpretable. From then on, statistics and probability theory have been closely intertwined.
Statistics and probability theory are closely related, but in some ways they’re different. Probability theory studies random processes over a potentially infinite number of measurements. It isn’t bound by real-world limitations. This allows us to model the behavior of a coin by imagining millions of coin-flips. In real life, flipping a coin millions of times is a pointlessly time-consuming endeavor. Surely we can sacrifice some data instead of flipping coins all day and night. Statisticians acknowledge these constraints placed upon us by the data-gathering process. Real-word data collection is costly and time-consuming. Every data-point carries a price. We can’t survey a country’s population without employing government officials. We can’t test our online ads without paying for every ad which is clicked. The size of our final dataset usually depends on the size of our initial budget. If the budget is constrained, then the data is constrained. This trade-off between data and resources lies at the heart of modern statistics. Statistics helps us understand exactly how much data is sufficient to draw insights and make impactful decisions. The purpose of statistics is to find meaning in data even when that data are limited in size.
Statistics is highly mathematical, and it’s usually taught using math equations. Nevertheless, direct exposure to equations isn’t a prerequisite for statistical understanding. In fact, many data scientists don’t write formulas when running statistical analyses. Instead, they use Python libraries such as SciPy, which handle all the complex math calculations, but proper library usage still requires an intuitive understanding of statistical procedures. In this article, we cultivate our understating of statistics by applying probability theory to real-world problems.
Exploring the Relationships between Data and Probability Using SciPy
SciPy, which is shorthand for Scientific Python, provides many useful methods for scientific analysis. The SciPy library includes an entire module for addressing problems in probability and statistics; scipy.stats
. Let’s install the library. Afterwards, we’ll import the stats module.
Call “pip install scipy” from the command-line terminal in order to install the SciPy library.
Listing 1. Importing the stats
module from SciPy
from scipy import stats
The stats
module greatly aid our efforts in assessing the randomness of data. Imagine we computed the probability of a fair coin producing at-least sixteen heads after twenty flips. Our calculations required us to examine all possible combinations of twenty flipped coins. Afterwards, we computed the probability of observing sixteen or more heads or sixteen or more tails, in order to measure the randomness of our observations. SciPy allows us to measure this probability directly using the stats.binomial_test
method. The method is named after the Binomial distribution, which governs how a flipped coin might fall. The method requires three parameters: the number of heads, the total number of coin flips, and the probability of a coin landing on heads. Let’s apply the Binomial test to sixteen heads observed from twenty coin-flips. Our output should equal the previously computed value of approximately 0.011.
SciPy and standard Python handle low-value decimal points differently. We’ll round our SciPy output to 17 digits.
Listing 2. Analyzing extreme head-counts using SciPy
num_heads = 16 num_flips = 20 prob_head = 0.5 prob = stats.binom_test(num_heads, num_flips, prob_head) print(f"Probability of observing more than 15 heads or 15 tails is {prob:.17f}") Probability of observing more than 15 heads or 15 tails is 0.01181793212890625
It’s worth emphasizing that stats.binomial_test
doesn’t compute the probability of observing sixteen heads. Rather, it returned the probability of seeing a coin-flip sequence where sixteen or more coins fell on the same face. If we want the probability seeing exactly sixteen heads, then we must use the stats.binom.pmf
method. That method represents the probability mass function of the Binomial distribution. A probability mass function maps inputted integer values to their probability of occurrence. Calling stats.binom.pmf(num_heads, num_flips, prob_heads)
returns the likelihood of a coin yielding num_heads
number of heads. Under current settings, this equals the probability of a fair coin falling on heads sixteen out of twenty times.
Listing 3. Computing an exact probability using stats.binom
prob_16_heads = stats.binom.pmf(num_heads, num_flips, prob_head) print(f"The probability of seeing {num_heads} of {num_flips} heads is {prob_16_heads}") The probability of seeing 16 of 20 heads is 0.004620552062988271
We’ve used stats.binom.pmf
to find the probability of seeing exactly sixteen heads, but that method is also able to compute multiple probabilities simultaneously.
Multiple head-count probabilities can be processed by passing in a list of head-count values. For instance, passing [4, 16]
returns a two-element NumPy array containing the probabilities of seeing four heads and sixteen heads, respectively. Conceptually, the probability of seeing four heads and sixteen tails equals the probability of seeing four tails and sixteen tails. Executing stats.binom.pmf([4, 16], num_flips, prob_head)
should return a two-element array with elements that are equal. Let’s confirm.
Listing 4. Computing an array of probabilities using< stats.binom
probabilities = stats.binom.pmf([4, 16], num_flips, prob_head) assert probabilities.tolist() == [prob_16_heads] * 2
List-passing allows us to easily compute probabilities across intervals. For example, if we pass range(21)
into stats.binom.pmf
, then the outputted array contains all probabilities across the interval of every possible head-count. The sum of these probabilities should equal 1.0.
NOTE: Summing low-value decimals is computationally tricky. Over the course of the summation, tiny errors accumulate. Due to these errors, our final summed probability marginally diverges from 1.0, unless we round it to fourteen significant digits. We do this rounding in the code below.
Listing 5. Computing an interval probability using stats.binom
interval_all_counts = range(21) probabilities = stats.binom.pmf(interval_all_counts, num_flips, prob_head) total_prob = probabilities.sum() print(f"Total sum of probabilities equals {total_prob:.14f}") Total sum of probabilities equals 1.00000000000000
Also, plotting interval_all_counts
versus probabilities
reveals the shape of our twenty coin-flip distribution. We can generate the distribution plot without having to iterate through possible coin-flip combinations.
Listing 6. Plotting a twenty coin-flip Binomial Distribution
import matplotlib.pyplot as plt plt.plot(interval_all_counts, probabilities) plt.xlabel('Head-count') plt.ylabel('Probability') plt.show()
Figure 1. The probability distribution for twenty coin-flips, generated using SciPy.
Our ability to visualize the Binomial was limited by the total number of coin-flip combinations that we needed to compute. Now, this is no longer the case. The stats.binom.pmf
methods allows to display any distribution associated with an arbitrary coin-flip count. Let’s use our new-found freedom to simultaneously plot the distributions for 20, 80, 140, and 200 coin-flips.
Listing 7. Plotting five different Binomial distributions
flip_counts = [20, 80, 140, 200] linestyles = ['-', '--', '-.', ':'] colors = ['b', 'g', 'r', 'k'] for num_flips, linestyle, color in zip(flip_counts, linestyles, colors): x_values = range(num_flips + 1) y_values = stats.binom.pmf(x_values, num_flips, 0.5) plt.plot(x_values, y_values, linestyle=linestyle, color=color, label=f'{num_flips} coin-flips') plt.legend() plt.xlabel('Head-count') plt.ylabel('Probability') plt.show()
Figure 2. Multiple Binomial probability distributions across 20, 80, 140, and 20 coin-flips. The distribution centers shift right as the coin-flip count goes up. Also, every distribution becomes more dispersed around its center as the coin-flip increases.
Within the plot, the central peak of each Binomial appears to shift right-ward as the coin-flip count goes up. Also, the twenty coin-flip distribution is noticeably thinner than the two hundred coin-flip distribution. The plotted distributions grow more dispersed around their central positions as these central positions relocate right.
Such shifts in centrality and dispersion are commonly encountered in data analysis. Using randomly sampled data to visualize several histogram distributions, we subsequently observed that the plotted histogram thickness was dependent on our sample size. At the time, our observations were purely qualitative. We lacked a rigorous metric for comparing the thickness of two plots, but noting that one plot appears “thicker” than another is insufficient. Likewise, stating that one plot is more “right-ward” than another is also insufficient. We need to quantitate our distribution differences. We must assign specific numbers to centrality and dispersion in order to discern how these numbers change from plot to plot. Doing this requires that we familiarize ourselves with the concepts of variance and mean.
If you want to learn more, check out the book on Manning’s liveBook platform here.