In the world of data analytics, sampling is a common technique that reduces big data to a manageable size. If done correctly, you don’t have to work with the entire dataset, also known as the population.
Sampling is done mainly to reduce the time it takes to analyse the population. In other cases, an analyst may not have access to the entire dataset, so sampling is a way of analysing data that is not completely available. Sampling is also useful when there is a cost involved in obtaining data: it reduces acquisition cost while providing comparable data quality.
And because sampling deals with a subset of a whole, we hope that any results we derive from the sample closely resemble those of the population. In short, to get correct results from a sample, the sample should be a “good enough” representation of the population. If not, any conclusions we draw from the sample may differ quite a bit from those we would have reached using the entire population.
Random sampling is a technique that chooses a subset from a population, with each member selected at random. The technique assumes that by choosing each member randomly, the likelihood of the population being represented correctly increases. A more sophisticated version of this method is stratified random sampling, where segments (strata) of the population are proportionally represented.
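The two techniques above can be sketched in a few lines of Python. This is a minimal illustration using a made-up population of customers labelled by region (the regions and sample size of 100 are assumptions for the example, not from the article):

```python
import random
from collections import Counter

# Hypothetical population: 1,000 customers, each belonging to one region.
random.seed(42)
population = [{"id": i, "region": random.choice(["north", "south", "east"])}
              for i in range(1000)]

# Simple random sample: every member has an equal chance of selection.
simple_sample = random.sample(population, 100)

# Stratified random sample: group the population into strata (here, regions)
# and sample each stratum in proportion to its share of the population.
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)

stratified_sample = []
for region, members in strata.items():
    n = round(len(members) / len(population) * 100)
    stratified_sample.extend(random.sample(members, n))

# The stratified sample's regional mix mirrors the population's by design;
# the simple random sample's mix only does so on average.
print(Counter(p["region"] for p in stratified_sample))
```

Note that stratification guarantees proportional representation in every draw, whereas simple random sampling guarantees it only in expectation.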
Once you go down the sampling road, there is no turning back. The next question that often gets asked is: how small a sample size can I work with and still get acceptable results? This is a rather tricky question whose answer depends on many things, such as the number of variables you are analysing and, more importantly, the confidence level you are trying to achieve. The narrower the margin of error you want, e.g., ±2.5% rather than ±5%, the more observations are required to get acceptable results.
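One standard way to put a number on this trade-off, for the common case of estimating a proportion, is Cochran's sample-size formula. The function below is a sketch under those assumptions (a single proportion, a 95% confidence level by default, and the conservative choice p = 0.5); the article itself does not prescribe a formula:

```python
import math

def required_sample_size(margin_of_error, confidence_z=1.96, p=0.5):
    """Cochran's formula: n = z^2 * p * (1 - p) / e^2.

    margin_of_error -- desired half-width of the interval (e.g. 0.025 for ±2.5%)
    confidence_z    -- z-score for the confidence level (1.96 for ~95%)
    p               -- assumed population proportion (0.5 is most conservative)
    """
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

print(required_sample_size(0.05))   # ±5% margin  → 385
print(required_sample_size(0.025))  # ±2.5% margin → 1537
```

Halving the margin of error roughly quadruples the required sample size, which is why tightening your precision target is so expensive.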
An unbiased selection of individuals in a sample means that if you repeat your sampling exercise over and over again, the average sample parameters (mean and variance) closely represent the population parameters. Repeated sampling is usually done to test for unbiasedness.
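The repeated-sampling check can be demonstrated with a short simulation. This sketch uses a synthetic population (normally distributed values are an assumption for illustration) and draws many independent simple random samples; under unbiased selection, the average of the sample means should land close to the population mean:

```python
import random
import statistics

random.seed(0)

# Synthetic population of 10,000 values (mean ~50, standard deviation ~10).
population = [random.gauss(50, 10) for _ in range(10000)]
pop_mean = statistics.mean(population)

# Repeat the sampling exercise 500 times with a sample size of 50 each time.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(500)]

# The average of the sample means should closely track the population mean.
print(round(pop_mean, 2), round(statistics.mean(sample_means), 2))
```

Any single sample mean may stray from the population mean, but their average across many repetitions should not; a persistent gap would suggest a biased selection procedure.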
Finally, don’t scrimp on the size of your sample. More data tend to provide better information than less data. Do your analysis and run some tests. If the outcome of your analysis corroborates what you observe, then you are on your way to success with samples.