Concept of Population
Sampling is concerned with the selection of a subset of individuals from within a population to estimate characteristics of the whole population. Researchers rarely survey the entire population because the cost of a census is too high. The three main advantages of sampling are that the cost is lower, data collection is faster, and since the data set is smaller it is possible to ensure homogeneity and to improve the accuracy and quality of the data.
Techniques of Sampling
There are two broad techniques of sampling: Probability Sampling (Random Sampling) and Non-probability Sampling, of which only random sampling can be used for statistical inference.
Probability Sampling or Random Sampling
Probability sampling, or random sampling, is a sampling technique in which the probability of getting any particular sample may be calculated. Examples of random sampling include:
Simple Random Sampling
1. Without Replacement: One deliberately avoids choosing any member of the population more than once.
2. With Replacement: One member can be chosen more than once.
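As an illustration, here is a minimal Python sketch of the two variants using the standard library's random module; the population and sample size are made up for the example:

```python
import random

population = list(range(1, 101))  # a hypothetical population of 100 numbered members

# Without replacement: no member can appear in the sample more than once.
sample_without = random.sample(population, k=10)

# With replacement: the same member may be drawn more than once.
sample_with = random.choices(population, k=10)

print(sample_without)
print(sample_with)
```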
Systematic Sampling
Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. For example, suppose you take data from every 10th person entering a mall.
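A minimal Python sketch of the idea, with a hypothetical ordered population and a made-up interval k = 10:

```python
import random

# Systematic sampling: order the population, pick a random starting
# point within the first interval, then take every k-th element.
population = list(range(1, 201))   # hypothetical ordered list of 200 visitors
k = 10                             # sampling interval ("every 10th person")

start = random.randrange(k)        # random start in the first interval
sample = population[start::k]
print(sample)
```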
Stratified Sampling
Where the population embraces a number of distinct categories or "strata", each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected.
For example, suppose that a company's staff is made up as follows:
male, full-time: 90
male, part-time: 18
female, full-time: 9
female, part-time: 63
Total: 180
and we are asked to take a sample of 40 staff, stratified according to the above categories.
The first step is to find the total number of staff (180) and calculate the percentage in each group.
% male, full-time = 90 / 180 = 50%
% male, part-time = 18 / 180 = 10%
% female, full-time = 9 / 180 = 5%
% female, part-time = 63 / 180 = 35%
This tells us that of our sample of 40,
50% should be male, full-time.
10% should be male, part-time.
5% should be female, full-time.
35% should be female, part-time.
50% of 40 is 20.
10% of 40 is 4.
5% of 40 is 2.
35% of 40 is 14.
Another easy way without having to calculate the percentage is to multiply each group size by the sample size and divide by the total population size (size of entire staff):
male, full-time = 90 x (40 / 180) = 20
male, part-time = 18 x (40 / 180) = 4
female, full-time = 9 x (40 / 180) = 2
female, part-time = 63 x (40 / 180) = 14
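The same allocation can be computed programmatically. A short Python sketch of the shortcut formula, using the staff counts from the example above:

```python
# Proportional allocation: each stratum gets
# (stratum size / population size) x sample size.
strata = {
    "male, full-time": 90,
    "male, part-time": 18,
    "female, full-time": 9,
    "female, part-time": 63,
}
population_size = sum(strata.values())   # 180
sample_size = 40

for name, size in strata.items():
    allocation = size * sample_size / population_size
    print(f"{name}: {allocation:.0f}")
# male, full-time: 20, male, part-time: 4,
# female, full-time: 2, female, part-time: 14
```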
Non-Probability Sampling
In non-probability sampling, we cannot assign any probability to the selected sample. Non-probability sampling techniques cannot be used to infer from the sample to the general population.
Examples of non-probability sampling include:
Convenience, Haphazard or Accidental sampling — members of the population are chosen based on their relative ease of access. Sampling friends, co-workers, or shoppers at a single mall are all examples of convenience sampling.
Judgmental sampling or Purposive sampling — The researcher chooses the sample based on who they think would be appropriate for the study. This is used primarily when there is a limited number of people that have expertise in the area being researched.
Sampling Bias
In statistics, sampling bias occurs when a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample: a non-random sample of a population in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.
Sampling Distribution
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given size.
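One way to make this concrete is simulation: draw many samples of the same size from one population and look at the distribution of a statistic, here the sample mean. A Python sketch with a made-up population (all numbers are illustrative only):

```python
import random
import statistics

random.seed(1)
# Hypothetical population with mean ~50 and standard deviation ~10.
population = [random.gauss(50, 10) for _ in range(100_000)]

n = 30                 # sample size
# Record the mean of each of 5,000 random samples of size n.
sample_means = [
    statistics.mean(random.sample(population, n))
    for _ in range(5_000)
]

# The mean of the sampling distribution is close to the population mean,
# and its spread (the standard error) is close to sigma / sqrt(n).
print(statistics.mean(sample_means))   # ~50
print(statistics.stdev(sample_means))  # ~10 / 30**0.5 ≈ 1.83
```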
Population Parameters and The Estimation Theory
A statistical parameter is a parameter that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model. For example, the family of normal distributions has two parameters, the mean μ and the variance σ²: if these are specified, the distribution is known exactly. The family of Poisson distributions, on the other hand, has only one parameter, the mean λ.
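To make this concrete, fixing the parameter values picks out exactly one member of each family. A small sketch using scipy.stats (assumed available; the parameter values are made up):

```python
from scipy.stats import norm, poisson

# Specifying the parameters identifies one distribution in the family.
normal_dist = norm(loc=23.4, scale=2.0)   # mean mu = 23.4, sd sigma = 2.0
poisson_dist = poisson(mu=3.5)            # mean lambda = 3.5 (scipy names it mu)

print(normal_dist.mean(), normal_dist.var())    # 23.4, 4.0 (variance = sigma squared)
print(poisson_dist.mean(), poisson_dist.var())  # 3.5, 3.5
```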
In statistics, our purpose is to learn about the population by studying samples. Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample. Statisticians use sample statistics to estimate population parameters. For example, sample means are used to estimate population means; here the sample mean is the estimator, and its computed value is the estimate. So, as a parameter is to the population, a statistic is to a sample.
Types of Estimator
There are two types of estimator: Point Estimator and Interval Estimator.
Point estimators yield single-valued results, whereas an interval estimator results in a range of plausible values.
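For instance, for a population mean the sample mean is a point estimate, while an approximate 95% interval around it is an interval estimate. A Python sketch with made-up data, using the normal approximation z ≈ 1.96 for simplicity:

```python
import statistics

sample = [21, 24, 23, 25, 22, 24, 23, 26, 22, 25]  # hypothetical observations

point_estimate = statistics.mean(sample)           # a single value
se = statistics.stdev(sample) / len(sample) ** 0.5 # estimated standard error

# Approximate 95% interval estimate (normal approximation, z ≈ 1.96).
interval_estimate = (point_estimate - 1.96 * se, point_estimate + 1.96 * se)

print(point_estimate)      # 23.5
print(interval_estimate)   # a range of plausible values around 23.5
```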
Properties of Estimator
Unbiased: An estimator is an unbiased estimator of a parameter θ if and only if the expectation of the estimator is equal to that parameter, i.e. E(θ̂) = θ (the simulation sketch after this list demonstrates the idea for the sample variance).
Consistency: An estimator is called consistent if increasing the sample size increases the probability of the estimator being close to the population parameter.
Efficiency: Among unbiased estimators, there often exists one with the lowest variance, called the minimum variance unbiased estimator (MVUE) or an efficient estimator.
Sufficiency: An estimator is called sufficient if no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter.
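The unbiasedness property can be checked by simulation. A Python sketch comparing the sample variance with divisor n (biased) against divisor n − 1 (unbiased); the population values are made up:

```python
import random

random.seed(2)
population_variance = 10.0 ** 2   # true variance of the simulated population
n = 5                             # small sample size makes the bias visible

biased, unbiased = [], []
for _ in range(20_000):
    sample = [random.gauss(0, 10) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    biased.append(ss / n)          # divisor n: biased estimator
    unbiased.append(ss / (n - 1))  # divisor n - 1: unbiased estimator

print(sum(biased) / len(biased))      # noticeably below 100 on average
print(sum(unbiased) / len(unbiased))  # close to the true value 100
```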
Testing of Statistical Hypothesis
Statistical hypotheses are statements about real relationships; and like all hypotheses, statistical hypotheses may match the reality, or they may fail to do so. Statistical hypotheses have the special characteristic that one ordinarily attempts to test them (i.e., to reach a decision about whether or not one believes the statement is correct, in the sense of corresponding to the reality) by observing facts relevant to the hypothesis in a sample. This procedure, of course, introduces the difficulty that the sample may or may not represent well the population from which it was drawn.
Types of Hypotheses
Null Hypothesis (H0): Hypothesis testing works by collecting data and measuring how likely the particular set of data is, assuming the null hypothesis is true. If the data set is very unlikely, defined as being part of a class of sets of data that only rarely will be observed, the experimenter rejects the null hypothesis, concluding that it is (probably) false. The null hypothesis can never be proven; the only thing we can do is reject it or fail to reject it.
Alternative Hypothesis (H1 or HA): The alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test. An example might be where water quality in a stream has been observed over many years and a test is made of the null hypothesis that there is no change in quality between the first and second halves of the data against the alternative hypothesis that the quality is poorer in the second half of the record.
Examples of Statistical Hypotheses
1. The mean age of all Calcutta University students is 23.4 years.
2. The proportion of Calcutta University students who are women is 50 percent.
3. The heights of all the male students of Calcutta University are normally distributed.
Types of Errors in Testing of Hypothesis
There are two types of error as follows:
Type I Error: A type I error, also known as an error of the first kind, occurs when the null hypothesis (H0) is true, but is rejected. It is asserting something that is absent, a false hit. In terms of folk tales, an investigator may be “crying wolf” without a wolf in sight (raising a false alarm) (H0: no wolf).
Type II Error: A type II error, also known as an error of the second kind, occurs when the null hypothesis is false, but it is erroneously accepted as true. It is failing to see what is present, a miss. A type II error may be compared with a so-called false negative (where an actual ‘hit’ was disregarded by the test and seen as a ‘miss’) in a test checking for a single condition with a definitive result of true or false. In short, a type II error is committed when we fail to believe a truth.
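A simulation can make the type I error rate concrete: when H0 is actually true, a test at a chosen significance level rejects it in roughly that fraction of repeated experiments, and this rate is the probability of a type I error. A sketch assuming numpy and scipy are available; all numbers are made up:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
alpha = 0.05
true_mean = 50.0          # H0 is true: the population mean really is 50

rejections = 0
trials = 5_000
for _ in range(trials):
    sample = rng.normal(true_mean, 10.0, size=30)
    _, p_value = ttest_1samp(sample, popmean=true_mean)
    rejections += p_value < alpha   # a rejection here is a type I error

print(rejections / trials)  # close to 0.05: the type I error rate
```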
Consequences of Type I and Type II Errors
Both types of errors are problems for individuals, corporations, and data analysis. Based on the real-life consequences of an error, one type may be more serious than the other. For example, NASA engineers would prefer to throw out an electronic circuit that is really fine (null hypothesis H0: not broken; reality: not broken; action: thrown out; error: type I, false positive) than to use one on a spacecraft that is actually broken (null hypothesis H0: not broken; reality: broken; action: use it; error: type II, false negative). In that situation a type I error raises the budget, but a type II error would risk the entire mission.
Level of Significance
Statistical significance is a statistical assessment of whether observations reflect a pattern rather than just chance, the fundamental challenge being that any partial picture is subject to observational error. In statistical testing, a result is deemed statistically significant if it is unlikely to have occurred by chance, and hence provides enough evidence to reject the hypothesis of ‘no effect’. As used in statistics, significant does not mean important or meaningful, as it does in everyday speech.
The significance level is usually denoted by the Greek symbol α. Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than the significance level α, the null hypothesis is rejected.
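A sketch of this decision rule in Python, testing the hypothesis from example 1 above (mean age 23.4 years) against made-up data; scipy.stats is assumed to be available:

```python
from scipy.stats import ttest_1samp

ages = [22, 25, 24, 23, 26, 21, 24, 25, 23, 27]  # hypothetical student ages
alpha = 0.05                                     # chosen significance level

# H0: the mean age is 23.4 years.
stat, p_value = ttest_1samp(ages, popmean=23.4)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```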
Confidence Interval
In statistics, a confidence interval (CI) is a kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. Confidence intervals consist of a range of values (interval) that act as good estimates of the unknown population parameter. However, in rare cases, none of these values may cover the value of the parameter. The level of confidence of the confidence interval would indicate the probability that the confidence range captures this true population parameter given a distribution of samples.
If a corresponding hypothesis test is performed, the confidence level corresponds with the level of significance: a 95% confidence interval reflects a significance level of 0.05, and the confidence interval contains the parameter values that, when tested, should not be rejected with the same sample.
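A Python sketch of computing a 95% confidence interval for a population mean with the t distribution; the data are made up and scipy.stats is assumed to be available:

```python
import statistics
from scipy.stats import t

sample = [22, 25, 24, 23, 26, 21, 24, 25, 23, 27]  # hypothetical observations
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / n ** 0.5           # estimated standard error

t_crit = t.ppf(0.975, df=n - 1)  # two-sided 95% critical value, n - 1 df
ci = (mean - t_crit * se, mean + t_crit * se)
print(ci)                        # the 95% confidence interval for the mean
```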