How would you select a representative sample from 5 million search queries?
SAMPLE SIZE FORMULA
Selecting a sample size is an important step in designing a study or survey. The sample should be large enough to give reliable and accurate results, but not so large that collecting the data becomes impractical or expensive. There are several ways to determine an appropriate sample size; one of the most commonly used is the formula for estimating a population proportion, often called Cochran's formula.
The formula to calculate the sample size is:
n = (Z² * p * q) / E²
where:
n = the sample size needed
Z = the z-score corresponding to the desired level of confidence, i.e. the number of standard deviations from the mean of a standard normal distribution (e.g. 1.96 for a 95% confidence level)
p = the proportion of the population that has a certain characteristic or outcome (e.g. the proportion of people who prefer a certain brand)
q = 1 - p (the proportion of the population that does not have the characteristic or outcome)
E = the margin of error, expressed as a proportion (the maximum error that can be tolerated in the sample estimate, e.g. 0.05 for 5 percentage points)
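The formula assumes that the population is large relative to the sample (a common rule of thumb is at least 10 times the sample size) and that the sample is drawn at random. With 5 million queries, the first condition holds for any sample of up to 500,000 queries.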
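The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name sample_size and its parameter names are made up for this example, and the z-score is computed from the confidence level with the standard library's statistics.NormalDist instead of being looked up in a table.

```python
import math
from statistics import NormalDist


def sample_size(confidence, p, margin_of_error):
    """Return n = Z^2 * p * q / E^2, rounded up to a whole number."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    q = 1 - p
    n = (z ** 2) * p * q / margin_of_error ** 2
    # Round up, because a fractional observation is not possible.
    return math.ceil(n)


# Illustrative inputs: 95% confidence, p = 0.5, margin of error of 3 points.
print(sample_size(0.95, 0.5, 0.03))  # prints 1068
```

Tightening the margin of error from 5 to 3 percentage points roughly triples the required sample, because E is squared in the denominator.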
To understand how the formula works, consider the following example:
A researcher wants to estimate the proportion of people in a city who support a new public policy proposal. The researcher decides to conduct a survey and wants to be 95% confident that the sample proportion is within 5 percentage points of the true proportion in the population. Using the formula, the sample size can be calculated as follows:
n = (1.96² * 0.5 * 0.5) / (0.05²) = 384.16
Rounding 384.16 up, the researcher needs a sample of at least 385 people to achieve the desired level of confidence and margin of error. The proportion 0.5 is used as a conservative estimate because the true proportion is unknown: the product p * q is largest at p = 0.5, so this choice yields the largest sample size the formula can require.
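The same calculation carries over to the question in the title. As a hedged sketch, assume the 5 million queries sit in a log where each query can be addressed by its position, and that one proportion is being estimated (for example, the share of queries with some property). A simple random sample of 385 positions can then be drawn as follows; the function name draw_sample_indices and the seed value are illustrative, not part of any existing tool.

```python
import random


def draw_sample_indices(population_size, n, seed=None):
    """Pick n distinct positions uniformly at random from 0..population_size-1."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # Random.sample draws without replacement, so no position repeats.
    return sorted(rng.sample(range(population_size), n))


# Assumed figures: 5 million logged queries, n = 385 from the worked example.
indices = draw_sample_indices(5_000_000, 385, seed=42)
print(len(indices))  # prints 385
```

Sampling positions instead of the queries themselves keeps memory use small, since range(5_000_000) is never expanded into a list. The queries at the chosen positions can then be read from wherever the log is stored.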