Types of samples. Small sample

The small-sample method has several advantages over the large-sample method. Its main advantages are, first, a reduction in the amount of computation and, second, the ability to monitor how process accuracy changes over time, which the large-sample method cannot do. The large-sample method can only characterize the accuracy and stability of the process during the sampling period; this characterization remains valid in the future only if the conditions of the process do not change after the sample is taken. In reality, such constancy of production conditions cannot be guaranteed in advance. For example, when working on a bar machine during a shift, the material is replaced several times (bar changes), the tool is changed because of wear, the machine is readjusted, and so on, all of which can significantly alter the previously obtained distribution parameters. The small-sample method, if samples are taken regularly throughout the shift at fixed intervals, makes it possible to obtain a complete picture of the state of the process during the period under study, to determine the degree of its stability, and to identify the causes of insufficient process stability over time, if any.

Statistical analysis with small samples is carried out as follows. Samples of n = 5-10 pieces are taken at fixed intervals (for example, every 15-30 minutes). The sampling interval is established empirically and depends on the productivity of the machine, the sample size, and the degree of stability of the technological process. For each sample, the mean x̄ and the standard deviation S are calculated. Next, for every two adjacent samples, the hypothesis of homogeneity of the sample variances is tested using Fisher's F-test.

If the hypothesis is confirmed, this indicates that the dispersion is stable, i.e. that the samples being compared are drawn from the same population. Once the hypothesis of homogeneity of the variances of the two samples is confirmed, the hypothesis of homogeneity of the two sample means should be tested using Student's t-test.

Confirmation of the hypothesis of equality of two adjacent sample means indicates that the tuning center of the equipment has not shifted by the time this sample was taken and remains where it was when the previous sample was taken, i.e. the process is in a stable state. If the hypothesis of equality of the two sample means is not confirmed, this indicates a shift of the machine's tuning center by the time the sample was taken. Since samples are taken at fixed intervals, when a shift of the tuning center or a change in the dispersion zone is detected, it is possible to determine the period of time within which the violation of process stability occurred.

Having discovered a violation of process stability, it is possible to narrow down the area in which its cause should be sought. Heterogeneity of the sample variances, indicating instability of dispersion, suggests that the cause lies in the machine or in the mechanical properties of the material being processed. Heterogeneity of the sample means indicates a shift of the tuning center, so the cause should be sought in the tool.

Thus, by taking small samples from the current output of the machine during a shift at fixed intervals, calculating the sample means and variances, and comparing and assessing their discrepancies using the F- and t-tests, it is possible to establish the moments when the process went out of order, and even the sources of these disturbances.
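The monitoring scheme described above can be sketched in Python. This is a minimal illustration with made-up shaft-diameter readings; in practice the computed F and t values would be compared against table critical values for the chosen significance level.

```python
import math
from statistics import mean, variance

def f_statistic(a, b):
    """Ratio of the two sample variances, larger over smaller (Fisher's F)."""
    va, vb = variance(a), variance(b)   # statistics.variance divides by n - 1
    return max(va, vb) / min(va, vb)

def t_statistic(a, b):
    """Two-sample Student's t with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# two adjacent samples of n = 5 shaft diameters, mm (hypothetical readings)
sample1 = [20.02, 20.01, 20.03, 20.00, 20.02]
sample2 = [20.05, 20.06, 20.04, 20.07, 20.05]

F = f_statistic(sample1, sample2)   # close to 1: dispersion is stable
t = t_statistic(sample1, sample2)   # large |t|: the tuning center has shifted
```

Here F is near 1 (the dispersion has not changed between the samples), while |t| is large, which in the scheme above points to a shift of the tuning center between the two sampling moments.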

When controlling the quality of goods in economic research, an experiment can be conducted on the basis of a small sample. A small sample is a non-continuous statistical survey in which the sample population is formed from a relatively small number of units of the general population. The volume of a small sample usually does not exceed 30 units and may be as small as 4-5 units. The average error of a small sample is calculated by the formula μ = √(σ²/n), where σ² is the variance of the small sample. When determining the variance, the number of degrees of freedom is n − 1: σ² = Σ(x − x̄)²/(n − 1). The marginal error of a small sample is determined by the formula Δ = t·μ. In this case, the value of the confidence coefficient t depends not only on the given confidence probability, but also on the number of sample units n. For individual values of t and n, the confidence probability of a small sample is determined from special Student tables (Table 9.1), which give the distribution of standardized deviations. Since in practice a confidence probability of 0.95 or 0.99 is accepted when conducting a small sample, these values of the Student distribution are used to determine the marginal error of the small sample.
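The average and marginal errors of a small sample can be computed with the standard library alone. The data are hypothetical; the value 2.365 is the two-sided Student table value for P = 0.95 with n − 1 = 7 degrees of freedom.

```python
import math
from statistics import mean, variance

sample = [12, 14, 13, 15, 11, 14, 13, 12]   # n = 8 hypothetical measurements
n = len(sample)
s2 = variance(sample)          # small-sample variance, divisor n - 1
mu = math.sqrt(s2 / n)         # average error of the small sample
t = 2.365                      # Student's t for P = 0.95, df = n - 1 = 7 (table value)
delta = t * mu                 # marginal error
ci = (mean(sample) - delta, mean(sample) + delta)
```

With these numbers the mean is 13, the average error is about 0.463, and the marginal error about 1.095, giving a confidence interval of roughly (11.9, 14.1).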

Ways to generalize sample characteristics to the population. The sampling method is most often used to obtain characteristics of the general population from the corresponding sample indicators. Depending on the purposes of the research, this is done either by direct recalculation of the sample indicators for the general population, or by calculating correction factors.

The direct recalculation method consists in extending the sample share or the sample mean to the general population, taking into account the sampling error. In trade, for example, the number of non-standard products received in a consignment is determined this way: the share of non-standard products in the sample (taking into account the accepted degree of probability) is multiplied by the number of products in the entire batch of goods.

The method of correction factors is used when the purpose of the sampling is to refine the results of a complete census. In statistical practice, this method is used to refine data from annual censuses of livestock owned by the population. For this purpose, after the data from the complete census have been summarized, a 10% sample survey is used to determine the so-called "percentage of undercounting".

Methods for selecting units from the general population. In statistics, various methods of forming sample populations are used; the choice is determined by the objectives of the study and depends on the specifics of the object of study. The main condition for conducting a sample survey is the prevention of systematic errors, which arise when the principle of equal opportunity for each unit of the general population to be included in the sample is violated. Systematic errors are prevented by using scientifically based methods of forming the sample population.
The following methods of selecting units from the general population exist: 1) individual selection, in which individual units are selected for the sample; 2) group selection, in which qualitatively homogeneous groups or series of the studied units are included in the sample; 3) combined selection, a combination of individual and group selection. Selection methods are determined by the rules for forming the sample population. A sample can be: purely random; mechanical; typical; serial; combined.

Purely random sampling means that the sample population is formed by random (unintentional) selection of individual units from the general population. The number of units selected into the sample is usually determined from the accepted sample share. The sample share is the ratio of the number of units in the sample population n to the number of units in the general population N, i.e. n/N. Thus, with a 5% sample from a batch of 2,000 units, the sample size n is 100 units (2,000 × 5/100), and with a 20% sample it is 400 units (2,000 × 20/100), and so on.

Mechanical sampling means that units are selected into the sample from a general population divided into equal intervals (groups). The size of the interval in the general population equals the reciprocal of the sample share. Thus, with a 2% sample every 50th unit is selected (1 : 0.02), with a 5% sample every 20th unit (1 : 0.05), and so on. In accordance with the accepted proportion of selection, the general population is thus, as it were, mechanically divided into equal groups, and only one unit is selected from each group for the sample. An important feature of mechanical sampling is that the sample population can be formed without compiling lists; in practice, the order in which the units of the population are actually arranged is often used.
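The sample-share and mechanical-interval arithmetic above can be sketched as follows (hypothetical helper names):

```python
def sample_size(N, share_pct):
    """Number of units drawn with a given sample share (% of population N)."""
    return N * share_pct // 100

def mechanical_interval(share):
    """Selection interval: the reciprocal of the sample share."""
    return round(1 / share)

n5 = sample_size(2000, 5)       # 100 units with a 5% sample
n20 = sample_size(2000, 20)     # 400 units with a 20% sample
k2 = mechanical_interval(0.02)  # every 50th unit with a 2% sample
k5 = mechanical_interval(0.05)  # every 20th unit with a 5% sample
```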
For example, the sequence in which finished products leave a conveyor or production line, or the order in which units of a batch of goods are placed during storage, transportation, or sale.

Typical sample. In typical sampling, the general population is first divided into homogeneous typical groups. Then, from each typical group, units are individually selected into the sample population by purely random or mechanical sampling. Typical sampling is usually used when studying complex statistical populations, for example, in a sample survey of the labor productivity of trade workers consisting of separate groups by qualification. An important feature of the typical sample is that it gives more accurate results than other methods of selecting units into the sample population. The average error of a typical sample is determined by the formulas: for repeated selection, μ = √(σ̄²/n); for non-repeated selection, μ = √((σ̄²/n)(1 − n/N)), where σ̄² is the mean of the within-group variances.

In single-stage sampling, each selected unit is immediately studied according to the given characteristic; this is the case with purely random and serial sampling. In multi-stage sampling, groups are first selected from the general population, and individual units are then selected from the groups; this is how a typical sample is made with mechanical selection of units into the sample population. Combined sampling can be two-stage: the general population is first divided into groups, then groups are selected, and within the latter individual units are selected.
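The typical-sample error formulas can be illustrated numerically. The strata below are hypothetical; σ̄² is taken as the weighted mean of the within-group variances.

```python
import math

# hypothetical strata: (within-group variance, units taken from the group)
strata = [(4.0, 30), (6.0, 50), (5.0, 20)]
n = sum(ni for _, ni in strata)
mean_within_var = sum(vi * ni for vi, ni in strata) / n  # sigma-bar squared

mu_repeated = math.sqrt(mean_within_var / n)             # repeated selection
N = 1000                                                 # population size (hypothetical)
mu_nonrepeated = math.sqrt(mean_within_var / n * (1 - n / N))
```

The non-repeated error is smaller by the finite-population factor √(1 − n/N), here √0.9.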

When studying variability, quantitative and qualitative characteristics are distinguished; their study is carried out by variation statistics, which is based on probability theory. Probability indicates the possible frequency with which an individual exhibits a particular value of a trait: P = m/n, where m is the number of individuals with a given value of the trait and n is the number of all individuals in the group. Probability ranges from 0 to 1 (for example, a probability of 0.02 for the appearance of twins in a herd means that two twin births will occur per 100 calvings). Thus, the object of study of biometrics is a varying trait, studied on a certain group of objects, i.e. a population. General and sample populations are distinguished. The general population is the large group of individuals that interests us with respect to the trait being studied. A general population may comprise a species of animal or a breed within that species; a general population (a breed) may include several million animals. At the same time, a breed is divided into many groups, i.e. the herds of individual farms. Since the general population consists of a very large number of individuals, it is technically difficult to study it in full. Therefore, not the entire population is studied, but only a part of it, which is called the sample population.

Based on the sample population, a judgment is made about the entire population as a whole. Sampling must be carried out according to the rules: the sample must include individuals with all values of the varying trait. Individuals are selected from the general population by the principle of chance or by drawing lots. In biometrics, two types of random samples are distinguished: large and small. A large sample is one that includes more than 30 individuals or observations; a small sample includes 30 or fewer. Different data-processing methods exist for large and small sample populations. A source of statistical information can be zootechnical and veterinary records, which provide information about each animal from birth to disposal. Another source is data from scientific and production experiments conducted on a limited number of animals. Once the sample has been obtained, its processing begins. This makes it possible to obtain, in the form of mathematical quantities, a number of statistical values or coefficients that characterize the groups of animals of interest.

The following statistical parameters or indicators are obtained using the biometric method:

1. Average values of a varying characteristic (arithmetic mean, mode, median, geometric mean).

2. Coefficients that measure the amount of variation, i.e. variability, of the studied characteristic (standard deviation, coefficient of variation).

3. Coefficients that measure the magnitude of the relationship between characteristics (correlation coefficient, regression coefficient and correlation ratio).

4. Statistical errors and reliability of the obtained statistical data.

5. The share of variation arising under the influence of various factors and other indicators that are associated with the study of genetic and selection problems.

When statistically processing a sample, the members of the population are organized into a variation series. A variation series is a grouping of individuals into classes according to the value of the trait being studied. The variation series consists of two elements: the classes and the series of frequencies. A variation series can be discrete (discontinuous) or continuous. Traits that can take only integer values are called discrete (number of head, number of eggs, number of piglets, and others). Traits that can be expressed in fractional numbers are called continuous (height in cm, milk yield in kg, % fat, live weight, and others).

When constructing a variation series, the following principles or rules are adhered to:

1. Determine (count) the number of individuals n for which the variation series will be constructed.

2. Find the maximum and minimum values of the trait being studied.

3. Determine the class interval K = (max − min) / number of classes; the number of classes is chosen arbitrarily.

4. Construct the classes and determine the boundaries of each class: min, min + K, min + 2K, and so on.

5. Distribute the members of the population into the classes.
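The five steps above can be sketched as a small function. The litter-size data are hypothetical; the topmost value is placed in the last class so that every individual is counted.

```python
def variation_series(values, num_classes):
    """Group values into classes of equal interval K = (max - min) / classes."""
    lo, hi = min(values), max(values)
    K = (hi - lo) / num_classes
    counts = [0] * num_classes
    for v in values:
        idx = min(int((v - lo) / K), num_classes - 1)  # max value joins last class
        counts[idx] += 1
    bounds = [(lo + i * K, lo + (i + 1) * K) for i in range(num_classes)]
    return bounds, counts

# litter sizes (piglets) for n = 8 sows, hypothetical
litters = [8, 9, 10, 10, 11, 12, 12, 14]
bounds, freqs = variation_series(litters, 3)   # K = 2: classes 8-10, 10-12, 12-14
```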

After the classes have been constructed and the individuals distributed among them, the main indicators of the variation series (X̄, σ, Cv, Mx̄, Mσ, MCv) are calculated. The mean value of the trait is the most widely used characteristic of a population. When solving zootechnical, veterinary, medical, economic and other problems, the mean value of a trait is always determined (average milk yield for the herd, % fat, fertility in pig breeding, egg production in chickens, and other traits). The parameters characterizing the mean value of a trait include the following:

1. Arithmetic mean.

2. Weighted arithmetic average.

3. Geometric mean.

4. Mode (Mo).

5. Median (Me) and other parameters.

The arithmetic mean shows what value of the trait the individuals of a given group would have if it were the same for all of them. For a variation series it is determined by the coded formula X̄ = A + b × K, where A is the midpoint of the zero class, b is the mean coded deviation, and K is the class interval.

The main property of the arithmetic mean is that it abstracts from the variation of the trait and characterizes the entire population by a single value. At the same time, it should be noted that the arithmetic mean can take an abstract value: calculating it may yield fractional figures that cannot occur in reality. For example: a yield of 85.3 calves per 100 cows, a sow fertility of 11.8 piglets, a hen egg production of 252.4 eggs, and so on.
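The coded (assumed-mean) calculation X̄ = A + b × K can be verified against the direct weighted mean. The class midpoints and frequencies below are hypothetical.

```python
# coded ("assumed mean") calculation of the arithmetic mean: X = A + b*K
mids  = [10, 12, 14, 16, 18]   # class midpoints, hypothetical trait values
freqs = [2, 5, 9, 4, 1]        # class frequencies p
K = 2                          # class interval
A = 14                         # midpoint of the modal ("zero") class
n = sum(freqs)
coded = [(m - A) // K for m in mids]                 # a = -2, -1, 0, 1, 2
b = sum(p * a for p, a in zip(freqs, coded)) / n     # mean coded deviation
X = A + b * K

# direct check: the frequency-weighted mean of the midpoints gives the same value
X_direct = sum(p * m for p, m in zip(freqs, mids)) / n
```

Both routes give X̄ ≈ 13.71; the coded route only trades the large midpoints for small integer deviations, which is why it was convenient for hand calculation.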

The arithmetic mean is of great value in livestock farming practice and in characterizing populations. In animal husbandry, in particular cattle breeding, the weighted arithmetic mean is used to determine the average fat content of milk over a lactation.

The geometric mean is calculated when it is necessary to characterize growth rates or the rate of increase of a population, where the arithmetic mean would distort the data.

The mode is the most frequently encountered value of a varying trait, whether quantitative or qualitative. For example, the modal teat number for a cow is 4, although cows with five or six teats occur. In a variation series, the modal class is the class with the largest frequency, and it is taken as the zero class.

The median is the variant that divides all members of the population into two equal parts: half of the members have a trait value less than the median, and the other half greater (for example, a breed standard). The median is often used to characterize qualitative traits, for example udder shape: cup-shaped, round, goat-like. In a correct, symmetrical sample all three indicators (X̄, Mo, Me) should coincide. Thus, the first characteristic of a population is its average values, but they alone are not sufficient to judge the population.

The second important indicator of any population is the variability of the trait. The variability of a trait is determined by many environmental factors and by internal, i.e. hereditary, factors.

Determining the variability of a trait is of great importance both in biology and in animal husbandry practice. Using statistical parameters that measure the degree of variability of a trait, it is possible to establish breed differences in the variability of various economically useful traits, and to predict the level and effectiveness of selection in different groups of animals.

The current state of statistical analysis makes it possible not only to establish the degree of phenotypic variability, but also to decompose phenotypic variability into its component types, namely genotypic and paratypic variability. This decomposition is performed using analysis of variance.

The main indicators of variability are the following statistical values:

1. Limits;

2. Standard deviation (σ);

3. Coefficient of variability or variation (Cv).

The simplest way to present the amount of variability of a trait is through the limits, i.e. the maximum and minimum values of the trait: the greater the difference between them, the greater the variability of the trait. The main parameter for measuring the variability of a trait is the standard deviation σ, determined for a variation series by the formula:

σ = K · √(Σpa²/n − b²), where p are the class frequencies, a the coded class deviations, and b the mean coded deviation.

The main properties of the standard deviation (σ) are as follows:

1. Sigma is always a named quantity, expressed in the units of the trait (kg, g, m, cm, pcs.).

2. Sigma is always a positive value.

3. The greater the value of σ, the greater the variability of the trait.

4. In a variation series, practically all variants lie within X̄ ± 3σ.

Using the standard deviation, one can determine whether a given individual belongs to the variation series in question. The methods of measuring variability by limits and by standard deviation have a drawback: different traits cannot be compared by the magnitude of their variability. Yet it is often necessary to compare the variability of different traits in the same animal or the same group of animals, for example the variability of milk yield, fat content of milk, live weight, and amount of milk fat. Therefore, to compare different traits and determine the degree of their variability, the coefficient of variability is calculated by the formula Cv = (σ / X̄) · 100%.

Thus, the main methods for assessing the variability of traits among members of a population are: the limits, the standard deviation (σ), and the coefficient of variation (variability) Cv.
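As an illustration, the coefficient of variation can be computed for two traits measured in different units (hypothetical herd data; Python's statistics.stdev uses the n − 1 divisor):

```python
from statistics import mean, stdev

milk_yield = [4200, 4500, 3900, 4800, 4100]   # kg per lactation, hypothetical
fat_content = [3.6, 3.8, 3.7, 3.9, 3.5]       # % fat, hypothetical

def cv(values):
    """Coefficient of variation, %: Cv = sigma / mean * 100."""
    return stdev(values) / mean(values) * 100

cv_milk = cv(milk_yield)    # about 8.2 %
cv_fat = cv(fat_content)    # about 4.3 %
```

The sigma for milk yield (about 354 kg) and for fat content (about 0.16 %) cannot be compared directly, but the dimensionless Cv shows that milk yield is roughly twice as variable as fat content.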

In animal husbandry practice and experimental research one often has to deal with small samples. A small sample is one containing no more than 30 individuals or animals. Patterns established on a small sample are extended to the entire population. For a small sample, the same statistical parameters are determined as for a large one (X̄, σ, Cv, Mx̄), but their formulas and calculations differ from those of a large sample (i.e. from the formulas and calculations of a variation series).

1. Arithmetic mean: X̄ = ΣV / n, where

V is the absolute value of the variant (trait);

n is the number of variants, i.e. the number of individuals.

2. Standard deviation: σ = √(Σα² / (n − 1)), where

α = V − X̄ is the difference between the value of the variant and the arithmetic mean; each difference α is squared and the squares are summed. Here n − 1 is the number of degrees of freedom, i.e. the number of variants (individuals) reduced by one.
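The two small-sample formulas above can be sketched together (hypothetical data):

```python
import math

def small_sample_stats(v):
    """Mean and standard deviation of a small sample (n - 1 degrees of freedom)."""
    n = len(v)
    x_bar = sum(v) / n                      # X = sum(V) / n
    alphas = [x - x_bar for x in v]         # alpha = V - X
    sigma = math.sqrt(sum(a * a for a in alphas) / (n - 1))
    return x_bar, sigma

x_bar, sigma = small_sample_stats([10, 12, 11, 13, 14])   # n = 5, hypothetical
```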

Control questions:

1. What is biometrics?

2. What statistical parameters characterize a population?

3. What indicators characterize variability?

4. What is a small sample?

5. What are the mode and the median?

Lecture No. 12

Biotechnology and embryo transplantation

1. The concept of biotechnology.

2. Selection of donor and recipient cows, embryo transplantation.

3. The importance of transplantation in animal husbandry.

The extension of sample characteristics to the general population, based on the law of large numbers, requires a sufficiently large sample size. However, in practical statistical research one often encounters the impossibility, for one reason or another, of increasing the number of sampled units when the population itself is small. This applies to studies of enterprises, educational institutions, commercial banks, etc., whose number in a region is, as a rule, small, sometimes only 5-10 units.

When the sample population consists of a small number of units, fewer than 30, the sample is called small. In this case, Lyapunov's theorem cannot be used to calculate the sampling error, since the sample mean is significantly influenced by the value of each randomly selected unit, and its distribution may differ significantly from normal.

In 1908, W. S. Gosset proved that the estimate of the discrepancy between the mean of a small sample and the general mean has a special distribution law (see Chapter 4). Dealing with the problem of probabilistic estimation of a sample mean from a small number of observations, he showed that in this case one must consider the distribution not of the sample means themselves, but of their deviations from the mean of the original population. In this case the conclusions can be quite reliable.

Student's discovery is called the theory of small samples.

When assessing the results of a small sample, the general variance is not used in the calculations. In small samples, the "corrected" sample variance is used to calculate the average sampling error: μ = √(s²/n), where s² = Σ(x − x̄)²/(n − 1),

i.e., in contrast to large samples, the denominator contains (n − 1) instead of n. The calculation of the average sampling error for a small sample is given in Table 5.7.

Table 5.7

Calculation of the average error of a small sample

The marginal error of a small sample is Δ = t·μ, where t is the confidence coefficient.

The value of t relates to the probability estimate differently than in a large sample. In accordance with the Student distribution, the probability estimate depends both on the value of t and on the sample size n, for the case where the marginal error does not exceed t times the average error in small samples. It also depends substantially on the number of units selected.

W. S. Gosset compiled a table of probability distributions in small samples corresponding to given values of the confidence coefficient t and different small-sample sizes n; an excerpt from it is given in Table 5.8.

Table 5.8

Fragment of Student's probability table (probabilities multiplied by 1000)

The data in Table 5.8 show that as the sample size increases without limit (n → ∞), the Student distribution tends to the normal law, and at n = 20 it already differs little from it.

The Student distribution table is often given in a different form, more convenient for practical use (Table 5.9).

Table 5.9

Some values of Student's t-distribution

(table columns: number of degrees of freedom; t for a one-sided interval; t for a two-sided interval; P = 0.95, P = 0.99)

Let us look at how to use the distribution table. For each fixed value of P, the number of degrees of freedom k = n − 1 is calculated. For each value of the degrees of freedom, the table gives the limiting value t_P (t_0.95 or t_0.99) which, with the given probability P, will not be exceeded due to random fluctuations of the sampling results. The value t_P is then used to determine the boundaries of the confidence interval x̄ ± t_P·μ.

As a rule, the confidence level used in two-sided testing is P = 0.95 or P = 0.99, which does not exclude the choice of other probability values. The probability value is selected according to the specific requirements of the problems for which the small sample is used.

The probability that the general mean lies outside the confidence interval is q = 1 − P. This value is very small: for the probabilities P considered, it is 0.05 and 0.01, respectively.
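The whole procedure, from the corrected variance to the confidence interval, can be sketched as follows. The data are hypothetical; the t value is the two-sided table entry for P = 0.95 at the corresponding degrees of freedom.

```python
import math
from statistics import mean, variance

# two-sided Student table values t_P for P = 0.95, keyed by degrees of freedom
T_TABLE_095 = {7: 2.365, 9: 2.262, 19: 2.093, 29: 2.045}

sample = [101, 98, 103, 100, 99, 102, 97, 104, 100, 96]   # n = 10, hypothetical
n = len(sample)
x_bar = mean(sample)
mu = math.sqrt(variance(sample) / n)   # corrected variance: divisor n - 1
t_p = T_TABLE_095[n - 1]               # k = n - 1 = 9 degrees of freedom
lower, upper = x_bar - t_p * mu, x_bar + t_p * mu
```

With probability P = 0.95 the general mean lies within (lower, upper), here roughly (98.15, 101.85); the risk of it lying outside is q = 1 − P = 0.05.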

Small samples are widespread in the technical sciences and in biology, but they must be used in statistical research with great caution and only after appropriate theoretical and practical examination. A small sample may be used only if the distribution of the characteristic in the population is normal or close to normal, and the mean is calculated from sample data obtained by independent observations. In addition, bear in mind that the accuracy of results from a small sample is lower than from a large one.

Small-sample statistics

It is generally accepted that small-sample statistics, or "small-n" statistics as it is often called, began in the first decade of the 20th century with the publication of a paper by W. Gosset, in which he postulated the t-distribution that gained worldwide fame a little later. At the time, Gosset worked as a statistician at the Guinness breweries. One of his duties was to analyze successive batches of barrels of freshly brewed porter. For a reason he never really explained, Gosset experimented with the idea of sharply reducing the number of samples taken from the very large number of barrels in the brewery's warehouses for random quality control of the porter. This led him to postulate the t-distribution. Because the bylaws of the Guinness breweries prohibited employees from publishing research results, Gosset published the results of his experiment, which compared quality-control sampling using the t-distribution for small samples with the traditional z-distribution (normal distribution), anonymously under the pseudonym "Student", hence the name Student's t-distribution.

The t-distribution. The theory of the t-distribution, like that of the z-distribution, is used to test the null hypothesis that two samples are simply random samples from the same population, and that the calculated statistics (e.g., the mean and standard deviation) are therefore unbiased estimates of the population parameters. However, unlike normal-distribution theory, the theory of the t-distribution for small samples does not require a priori knowledge or precise estimates of the population mean and variance. Moreover, although testing the difference between the means of two large samples for statistical significance requires the fundamental assumption that the characteristic is normally distributed in the population, the theory of the t-distribution does not require assumptions about the parameters.

It is well known that normally distributed characteristics are described by a single curve, the Gaussian curve, which satisfies the equation: f(x) = (1/(σ√(2π))) · e^(−(x − μ)²/(2σ²)).

The t-distribution, by contrast, is a whole family of curves given by the formula: f(t) = [Γ((n + 1)/2) / (√(nπ) · Γ(n/2))] · (1 + t²/n)^(−(n + 1)/2).

This is why the equation for t includes the gamma function: it means that as n changes, a different curve satisfies the equation.

Degrees of freedom

In the equation for t, the letter n denotes the number of degrees of freedom (df) associated with the estimate of the population variance (S²), which is the second moment of a moment-generating function, such as that of the t-distribution. In statistics, the number of degrees of freedom indicates how many values remain free to vary after some have been used in a particular analysis. In a t-distribution, one of the deviations from the sample mean is always fixed, since the sum of all such deviations must equal zero. This affects the sum of squares when the sample variance is calculated as an unbiased estimate of S², and leads to df equal to the number of measurements minus one for each sample. Hence, in the formulas and procedures for calculating the t-statistic for testing the null hypothesis on two samples, df = n − 2, where n is the total number of measurements in both samples.

The F-distribution. The null hypothesis tested by a t-test is that the two samples were randomly drawn from the same population, or from two different populations with the same variance. But what if more groups need to be analyzed? The answer to this question was sought for twenty years after Gosset discovered the t-distribution, and two of the most eminent statisticians of the 20th century were directly involved in providing it. One was the great English statistician R. A. Fisher, who proposed the first theoretical formulations whose development led to the F-distribution; his work on small-sample theory, developing Gosset's ideas, was published in the mid-1920s (Fisher, 1925). The other was George Snedecor, one of the galaxy of early American statisticians, who developed a way to compare two independent samples of any size by calculating the ratio of two variance estimates. He called this ratio the F-ratio, after Fisher. Snedecor's results led to the F-distribution being specified as the distribution of the ratio of two χ² statistics, each divided by its own number of degrees of freedom: F = (χ²₁/ν₁) / (χ²₂/ν₂).

From this came Fisher's classic work on the analysis of variance, a statistical method explicitly oriented toward the analysis of small samples.

The sampling distribution of F (with ν₁ and ν₂ degrees of freedom) is represented by the equation: f(F) = [Γ((ν₁ + ν₂)/2) / (Γ(ν₁/2)·Γ(ν₂/2))] · (ν₁/ν₂)^(ν₁/2) · F^(ν₁/2 − 1) · (1 + ν₁F/ν₂)^(−(ν₁ + ν₂)/2).

As with the t-distribution, the gamma function indicates that a whole family of distributions satisfies the equation for F. In this case, however, the analysis involves two df quantities: the number of degrees of freedom of the numerator and of the denominator of the F-ratio.
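Snedecor's F-ratio for two independent samples is simply the ratio of their variance estimates, each with its own n − 1 degrees of freedom (hypothetical data):

```python
from statistics import variance

# two hypothetical independent samples
group_a = [10, 14, 8, 16, 12]   # sample variance 10.0, df = 4
group_b = [5, 7, 6, 8, 9]       # sample variance 2.5,  df = 4
s2a, s2b = variance(group_a), variance(group_b)   # each divides by n - 1
F = max(s2a, s2b) / min(s2a, s2b)                 # F-ratio, larger over smaller
```

The observed F would then be compared with the tabulated F-distribution for (4, 4) degrees of freedom to judge whether the two variance estimates differ significantly.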

Tables for evaluating the t- and F-statistics. When testing the null hypothesis with statistics based on large-sample theory, usually only one lookup table is required, the table of normal deviates (z), which gives the area under the normal curve between any two z values on the x-axis. The tables for the t- and F-distributions, however, necessarily come as sets of tables, because they are based on a multitude of distributions obtained by varying the number of degrees of freedom. Although the t- and F-distributions are probability density distributions, like the normal distribution for large samples, they differ from it in the four characteristics used to describe them. The t-distribution, for example, is symmetric (note the t² in its equation) for all df, but becomes increasingly peaked as the sample size decreases. Peaked curves (those with kurtosis greater than normal) tend to be less asymptotic (i.e., less close to the x-axis at the tails of the distribution) than curves with normal kurtosis, such as the Gaussian curve. This difference produces noticeable discrepancies between the points on the x-axis corresponding to the t and z values. With df = 5 and a two-tailed α level of 0.05, t = 2.57, whereas the corresponding z = 1.96. Hence t = 2.57 indicates statistical significance at the 5% level, whereas in the case of the normal curve z = 2.57 (more precisely, 2.58) would already indicate the 1% level of statistical significance. Similar comparisons can be made with the F-distribution, since t² is equal to F when the number of samples is two.
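The t-versus-z comparison can be checked numerically from the t density itself. The sketch below integrates the two tails of the t-distribution beyond |t| = 2.57 for df = 5 by the trapezoidal rule (truncating the integral at a large cutoff, an assumption that introduces only negligible error), and recovers the 5% two-tailed level quoted above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=100000, upper=60.0):
    """Two-tailed p-value: trapezoidal integral of both tails beyond |t|."""
    h = (upper - t) / steps
    area = (t_pdf(t, df) + t_pdf(upper, df)) / 2
    area += sum(t_pdf(t + i * h, df) for i in range(1, steps))
    return 2 * area * h

p_t = two_tailed_p(2.57, 5)   # close to 0.05, as in the table comparison above
```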

What constitutes a “small” sample?

At one time, the question was raised of how small a sample must be to be considered small. There is no definite answer to this question. However, the conventional boundary between a small and a large sample is considered to be df = 30. The basis for this somewhat arbitrary decision is the result of comparing the t-distribution with the normal distribution. As noted above, the discrepancy between t and z values tends to increase as df decreases and to decrease as df increases. In fact, t begins to approach z closely long before the limiting case t = z at df = ∞. Simple visual inspection of the tabulated t values shows that this approximation becomes quite good from df = 30 onward. The comparative values of t (at df = 30) and z are, respectively: 2.04 and 1.96 for p = 0.05; 2.75 and 2.58 for p = 0.01; 3.65 and 3.29 for p = 0.001.

Other statistics for “small” samples

Although statistics such as t and F were designed specifically for small samples, they are equally applicable to large ones. There are, however, many other statistical methods designed for analyzing small samples, and they are often used for this purpose: the so-called nonparametric, or distribution-free, methods. In the main, these methods are intended for measurements obtained with scales that do not satisfy the definition of interval or ratio scales, most often ordinal (rank) or nominal measurements. Nonparametric tests require no assumptions about distribution parameters, in particular about variance estimates, because ordinal and nominal scales exclude the very concept of variance. For this reason, nonparametric methods are also used with interval- and ratio-scale measurements when small samples are analyzed and the basic assumptions required by parametric methods are likely to be violated. The tests that can reasonably be applied to small samples include: Fisher's exact probability test, Friedman's two-way nonparametric (rank) analysis of variance, Kendall's rank correlation coefficient τ, Kendall's coefficient of concordance W, the Kruskal-Wallis H test for nonparametric (rank) one-way analysis of variance, the Mann-Whitney U test, the median test, the sign test, Spearman's rank correlation coefficient r_s, and the Wilcoxon T test.
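Of the tests listed, the sign test is the simplest to show in full: it reduces a set of paired differences to their signs and uses the exact binomial distribution with p = 1/2, so no distributional assumptions about the measurements are needed (hypothetical counts below):

```python
from math import comb

def sign_test_p(pluses, minuses):
    """Exact two-sided sign test: binomial tail with p = 1/2 (ties discarded)."""
    n = pluses + minuses
    k = min(pluses, minuses)
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

# 10 paired observations: 9 positive differences, 1 negative (hypothetical)
p_sign = sign_test_p(9, 1)   # 2 * (C(10,0) + C(10,1)) / 2**10
```

With 9 pluses out of 10 pairs the two-sided p-value is about 0.021, so even this crude small-sample test detects the asymmetry at the 5% level.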