Probabilistic and statistical methods for modeling economic systems. Probabilistic and statistical methods of decision making. Theoretical frequencies of preferences

3. The essence of probabilistic-statistical methods

How are the approaches, ideas and results of probability theory and mathematical statistics used in processing data (the results of observations, measurements, tests, analyses and experiments) in order to make practically important decisions?

The basis is a probabilistic model of a real phenomenon or process, i.e. a mathematical model in which objective relationships are expressed in terms of probability theory. Probabilities are used primarily to describe the uncertainties that must be taken into account when making decisions. This refers both to undesirable possibilities (risks) and to attractive ones ("lucky chances"). Sometimes randomness is deliberately introduced into a situation, for example when drawing lots, randomly selecting units for control, conducting lotteries or carrying out consumer surveys.

Probability theory allows some probabilities to be used to calculate others of interest to the researcher. For example, from the probability of getting heads one can calculate the probability of getting at least 3 heads in 10 coin tosses. Such a calculation is based on a probabilistic model in which the coin tosses are described by a scheme of independent trials; in addition, heads and tails are equally likely, so the probability of each of these events is 1/2. A more complex model considers checking the quality of a unit of production instead of tossing a coin. The corresponding probabilistic model is based on the assumption that the quality control of different units of production is described by a scheme of independent trials. Unlike the coin-tossing model, a new parameter must be introduced: the probability p that a unit of production is defective. The model is fully described if we assume that all units of production have the same probability of being defective. If this last assumption is incorrect, the number of model parameters increases; for example, one can assume that each unit of production has its own probability of being defective.
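The calculation mentioned above reduces to summing binomial probabilities. A minimal sketch in Python, assuming a fair coin (p = 1/2) and the independent-trials model described in the text:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(3, 10, 0.5))  # ~0.9453: at least 3 heads in 10 tosses of a fair coin
```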

Let us discuss a quality control model with a probability of defectiveness p common to all units of production. In order to "get to a number" when analyzing the model, p must be replaced with some specific value. To do this, one has to go beyond the probabilistic model and turn to data obtained during quality control. Mathematical statistics solves the problem that is inverse to that of probability theory. Its goal is to draw conclusions about the probabilities underlying the probabilistic model on the basis of observation results (measurements, analyses, tests, experiments). For example, based on the frequency of occurrence of defective products during inspection, conclusions can be drawn about the probability of defectiveness (see the discussion above using Bernoulli's theorem). Based on Chebyshev's inequality, conclusions can be drawn about whether the observed frequency of defective products is consistent with the hypothesis that the probability of defectiveness takes a certain value.
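As a sketch of this inverse problem, suppose (hypothetically) that 30 defective units were found in a sample of 1000. The defect probability can then be estimated by the observed frequency, and Chebyshev's inequality gives a crude bound on how likely a large deviation of the frequency from a hypothesized p would be:

```python
n, defects = 1000, 30          # hypothetical inspection results
p_hat = defects / n            # frequency as an estimate of the defect probability p
print(f"estimated p = {p_hat:.3f}")

# Chebyshev's inequality: P(|freq - p| >= eps) <= p*(1-p) / (n * eps**2)
p0, eps = 0.02, 0.01           # hypothesized p and allowed deviation
bound = p0 * (1 - p0) / (n * eps**2)
print(f"P(|freq - {p0}| >= {eps}) <= {bound:.3f}")
```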

Thus, the application of mathematical statistics is based on a probabilistic model of a phenomenon or process. Two parallel series of concepts are used: those related to theory (the probabilistic model) and those related to practice (the sample of observation results). For example, the theoretical probability corresponds to the frequency found from the sample, and the mathematical expectation (theoretical series) corresponds to the sample arithmetic mean (practical series). As a rule, sample characteristics are estimates of the theoretical ones. At the same time, the quantities of the theoretical series "exist in the heads of researchers", belong to the world of ideas (in the sense of the ancient Greek philosopher Plato) and are not available for direct measurement. Researchers have only sample data, from which they try to establish the properties of the theoretical probabilistic model that interest them.

Why do we need a probabilistic model? Because only with its help can the properties established from the analysis of a specific sample be transferred to other samples, as well as to the entire so-called general population. The term "general population" is used for a large but finite collection of units under study, for example, the totality of all residents of Russia or the totality of all consumers of instant coffee in Moscow. The goal of marketing or sociological surveys is to transfer statements obtained from a sample of hundreds or thousands of people to populations of several million. In quality control, the general population is a batch of products.

To transfer conclusions from a sample to a larger population requires some assumptions about the relationship of the sample characteristics with the characteristics of this larger population. These assumptions are based on an appropriate probabilistic model.

Of course, it is possible to process sample data without using one or another probabilistic model. For example, one can calculate the sample arithmetic mean, count the frequency of fulfillment of certain conditions, and so on. However, the calculation results will relate only to the specific sample; transferring the conclusions obtained with their help to any other population is incorrect. This activity is sometimes called "data analysis". Compared to probabilistic-statistical methods, data analysis has limited cognitive value.

So, the use of probabilistic models based on estimation and testing of hypotheses using sample characteristics is the essence of probabilistic-statistical methods of decision making.

We emphasize that the logic of using sample characteristics for making decisions on the basis of theoretical models involves the simultaneous use of two parallel series of concepts, one of which corresponds to probabilistic models and the other to sample data. Unfortunately, in a number of sources, usually outdated or written in a cookbook spirit, no distinction is made between sample and theoretical characteristics, which leads readers to confusion and to errors in the practical use of statistical methods.


In many cases in mining science it is necessary to study not only deterministic, but also random processes. All geomechanical processes occur under continuously changing conditions, when certain events may or may not occur. In this case, it becomes necessary to analyze random connections.

Despite the random nature of events, they are subject to certain patterns, which are studied in probability theory, the discipline concerned with the theoretical distributions of random variables and their characteristics. Another science, mathematical statistics, deals with methods of processing and analyzing random empirical events. These two related sciences constitute a unified mathematical theory of mass random processes, widely used in scientific research.

Elements of probability theory and mathematical statistics. A population is understood as a set of homogeneous events of a random variable X, which constitutes the primary statistical material. A population may be general (a large sample N), containing a wide variety of variants of a mass phenomenon, or a sample (a small sample N1), which represents only a part of the general population.

The probability P(X) of an event X is the ratio of the number of cases N(X) that lead to the occurrence of the event X to the total number of possible cases N:

P(X) = N(X) / N.

In mathematical statistics, the analogue of probability is the frequency of an event, which is the ratio of the number of cases in which the event occurred to the total number of events:

p*(X) = n(X) / n.

With an unlimited increase in the number of events, the frequency tends to the probability P(X).



Suppose some statistical data are presented in the form of a distribution series (histogram), as in Fig. 4.11. Then the frequency characterizes the probability of the random variable appearing in the i-th interval, and the smooth curve is called the distribution function.

The probability of a random variable is a quantitative assessment of the possibility of its occurrence. A certain event has P = 1, an impossible event has P = 0. Therefore, for a random event 0 ≤ P(X) ≤ 1, and the sum of the probabilities of all its possible values is equal to 1.

In research it is not enough to have the distribution curve; its characteristics must also be known:

a) the arithmetic mean: x̄ = (1/n) Σ xᵢ; (4.53)

b) the range: R = x_max - x_min, which can be used for a rough estimate of the variation of the events, where x_max and x_min are the extreme values of the measured quantity;

c) the mathematical expectation: m(x) = Σ xᵢ pᵢ. (4.54)

For continuous random variables, the mathematical expectation is written in the form

m(x) = ∫ x f(x) dx, (4.55)

i.e. it is equal to the actual value of the observed events X; the abscissa corresponding to the expectation is called the center of the distribution.

d) the dispersion (variance):

D(x) = Σ (xᵢ - m(x))² pᵢ, (4.56)

which characterizes the scatter of the random variable with respect to the mathematical expectation. The variance of a random variable is also called the second-order central moment.

For a continuous random variable, the variance is equal to

D(x) = ∫ (x - m(x))² f(x) dx; (4.57)

e) the standard deviation (standard): σ = √D(x); (4.58)

f) the coefficient of variation (relative scatter):

V = σ / m(x) · 100%, (4.59)

which characterizes the intensity of scatter in different populations and is used to compare them.

The area under the distribution curve is equal to unity, which means that the curve covers all values of the random variable. However, a large number of such curves with unit area can be constructed, i.e. they may have different scatter. The measure of scatter is the variance or the standard deviation (Fig. 4.12).


Above we examined the main characteristics of the theoretical distribution curve, which are analyzed by probability theory. In statistics, they operate with empirical distributions, and the main task of statistics is the selection of theoretical curves according to the existing empirical distribution law.

Let a variational series x₁, x₂, x₃, …, xₙ be obtained as a result of n measurements of a random variable X. Processing such a series reduces to the following operations:

– group the xᵢ into intervals and determine the absolute and relative frequencies for each interval;

– construct a step histogram from these values (Fig. 4.11);

– calculate the characteristics of the empirical distribution curve: the arithmetic mean x̄, the variance D and the standard deviation s.

The values x̄, D and s of the empirical distribution correspond to the values m(x), D(x) and σ(x) of the theoretical distribution.
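A minimal sketch of these operations on hypothetical measurement data (numpy's var uses the divisor n by default; pass ddof=1 if the corrected sample variance is wanted):

```python
import numpy as np

x = np.array([12.1, 12.4, 11.9, 12.6, 12.3, 12.0, 12.8, 12.2, 12.5, 12.1])  # hypothetical measurements

counts, bin_edges = np.histogram(x, bins=4)   # grouping into intervals (step histogram)
rel_freq = counts / counts.sum()              # relative frequencies

x_mean = x.mean()                             # arithmetic mean
D = x.var()                                   # empirical variance (divisor n)
s = np.sqrt(D)                                # standard deviation

print(rel_freq, x_mean, D, s)
```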



Let's look at the basic theoretical distribution curves. Most often in research, the law of normal distribution is used (Fig. 4.13), the equation of which has the form:

f(x) = (1/(σ√(2π))) exp(-(x - m)² / (2σ²)). (4.60)

If the origin of coordinates is placed at the point m, i.e. we set m(x) = 0 and take σ(x) = 1, the normal distribution law is described by a simpler equation: f(x) = (1/√(2π)) exp(-x²/2).

The quantity σ is usually used to estimate scattering. The smaller σ, the smaller the scattering, i.e. the observations differ little from one another. As σ increases, the scattering grows, the probability of errors increases, and the maximum of the curve (its ordinate), equal to 1/(σ√(2π)), decreases. The quantity h = 1/(σ√2) is therefore called the measure of precision. The standard deviations correspond to the inflection points of the distribution curve (the shaded area in Fig. 4.12).

When analyzing many random discrete processes, the Poisson distribution is used (short-duration events occurring per unit time). The probability of occurrence of the number of rare events X = 1, 2, … in a given period of time is expressed by Poisson's law (see Fig. 4.14):

P(X) = a^X e^(-a) / X!, (4.62)

where X is the number of events in the given period of time t;

λ is the density, i.e. the average number of events per unit time;

a = λt is the average number of events in the time t.

For Poisson's law, the variance is equal to the mathematical expectation of the number of occurrences of events in the time t, i.e. D(X) = m(X) = λt.

To study the quantitative characteristics of some processes (the time to machine failure, etc.), the exponential distribution law is used (Fig. 4.15), whose distribution density is expressed by the dependence

f(x) = λ e^(-λx), (4.63)

where λ is the intensity (average number) of events per unit time.

In the exponential distribution, the intensity λ is the reciprocal of the mathematical expectation: λ = 1/m(x). In addition, the relation σ(x) = m(x) = 1/λ holds.

The Weibull distribution law is widely used in various fields of research (Fig. 4.16):

f(x) = n μ x^(n-1) e^(-μ x^n), (4.64)

where n and μ are the parameters of the law; x is the argument, most often time.

When studying processes associated with a gradual decrease of parameters (the decrease of rock strength over time, etc.), the gamma distribution law is applied (Fig. 4.17):

f(x) = λ^a x^(a-1) e^(-λx) / Γ(a), (4.65)

where λ and a are parameters. If a = 1, the gamma distribution turns into the exponential law.

In addition to the above laws, other types of distributions are also used: Pearson, Rayleigh, beta distribution, etc.
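The theoretical curves listed above can be evaluated numerically, for example with scipy.stats; the sketch below uses hypothetical parameter values, and the scale arguments translate the parametrizations written above into scipy's conventions:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.01, 10, 5)        # a few evaluation points
m, sigma = 5.0, 1.5                 # hypothetical parameters
lam, t = 2.0, 1.0
n_w, mu_w = 1.8, 0.5
a_g = 3.0

print(stats.norm(loc=m, scale=sigma).pdf(x))                      # normal law (4.60)
print(stats.poisson(mu=lam * t).pmf(np.arange(5)))                # Poisson law (4.62), a = lambda*t
print(stats.expon(scale=1 / lam).pdf(x))                          # exponential law (4.63)
print(stats.weibull_min(c=n_w, scale=mu_w ** (-1 / n_w)).pdf(x))  # Weibull law (4.64)
print(stats.gamma(a=a_g, scale=1 / lam).pdf(x))                   # gamma law (4.65)
```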

Analysis of variance. In research the question often arises: to what extent does this or that random factor influence the process under study? Methods for identifying the main factors and their influence on the process under study are considered in a special section of probability theory and mathematical statistics, analysis of variance. A distinction is made between one-way and multifactor analysis. Analysis of variance is based on the use of the normal distribution law and on the hypothesis that the centers of the normal distributions of the random variables are equal, so that all measurements can be considered as a sample from the same normal population.
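A sketch of a one-way analysis of variance on hypothetical data; scipy.stats.f_oneway tests whether the group means can be considered equal:

```python
from scipy import stats

# Hypothetical measurements of a quality indicator under three factor levels
group_a = [12.1, 12.4, 11.9, 12.6, 12.3]
group_b = [12.8, 13.1, 12.7, 13.0, 12.9]
group_c = [12.2, 12.0, 12.5, 12.3, 12.4]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the factor does influence the process,
# i.e. the group means cannot all be considered equal.
```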

Reliability theory. Methods of probability theory and mathematical statistics are often used in reliability theory, which is widely used in various branches of science and technology. Reliability is understood as the property of an object to perform specified functions (maintain established performance indicators) for the required period of time. In reliability theory, failures are considered random events. For a quantitative description of failures, mathematical models are used - distribution functions of time intervals (normal and exponential distribution, Weibull, gamma distributions). The task is to find the probabilities of various indicators.

Monte Carlo method. To study complex processes of a probabilistic nature, the Monte Carlo method is used. Using this method, problems of finding the best solution from a variety of options under consideration are solved.

The Monte Carlo method is also called the method of statistical modeling. It is a numerical method based on the use of random numbers that simulate probabilistic processes. The mathematical basis of the method is the law of large numbers, which is formulated as follows: with a large number of statistical trials, the probability that the arithmetic mean of a random variable deviates from its mathematical expectation by no more than an arbitrarily small amount tends to one:

P(|x̄ - m(x)| < ε) → 1 as the number of trials n → ∞, (4.64)

where ε is an arbitrarily small positive number.

Sequence of solving problems using the Monte Carlo method:

– collection, processing and analysis of statistical observations;

– selection of main and discarding secondary factors and drawing up a mathematical model;

– drawing up algorithms and solving problems on a computer.

To solve problems using the Monte Carlo method, you need to have a statistical series, know the law of its distribution, the mean value, the mathematical expectation and the standard deviation. The solution is effective only with the use of a computer.
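A small illustration of the idea of statistical modeling: simulating an exponential random variable with a hypothetical intensity and watching the arithmetic mean approach the mathematical expectation as the number of trials grows:

```python
import random

random.seed(1)

lam = 2.0                      # hypothetical intensity of an exponential process
m_true = 1 / lam               # theoretical mathematical expectation

for n in (100, 10_000, 1_000_000):
    sample = [random.expovariate(lam) for _ in range(n)]
    mean = sum(sample) / n
    print(f"n = {n:>9}: sample mean = {mean:.4f} (expectation = {m_true:.4f})")
# As n grows, the arithmetic mean approaches the mathematical expectation,
# illustrating the law of large numbers on which the method rests.
```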

How are probability theory and mathematical statistics used? These disciplines are the basis of probabilistic and statistical methods of decision making. To use their mathematical apparatus, it is necessary to express decision-making problems in terms of probabilistic-statistical models. The application of a specific probabilistic-statistical decision-making method consists of three stages:

1) the transition from economic, managerial or technological reality to an abstract mathematical-statistical scheme, i.e. construction of a probabilistic model of the control system, the technological process, the decision-making procedure (in particular, based on the results of statistical control), etc.;

2) carrying out calculations and drawing conclusions by purely mathematical means within the framework of the probabilistic model;

3) interpretation of the mathematical-statistical conclusions in relation to the real situation and making an appropriate decision (for example, on the conformity or non-conformity of product quality with the established requirements, on the need to adjust the technological process, etc.), in particular conclusions on the proportion of defective units in a batch, on the specific form of the distribution laws of the controlled parameters of the technological process, etc.

Mathematical statistics uses the concepts, methods and results of probability theory. Let's consider the main issues of constructing probabilistic models of decision-making in economic, managerial, technological and other situations. For the active and correct use of regulatory, technical and instructional documents on probabilistic and statistical methods of decision-making, preliminary knowledge is required. Thus, it is necessary to know under what conditions a particular document should be used, what initial information is necessary to have for its selection and application, what decisions should be made based on the results of data processing, etc.

Examples of applying probability theory and mathematical statistics. Let us consider several examples where probabilistic-statistical models are a good tool for solving managerial, production, economic and national-economic problems. Thus, for example, in A.N. Tolstoy's novel "Walking through Torment" (vol. 1) it is said: "the workshop produces twenty-three percent of rejects, and you stick to this figure," Strukov told Ivan Ilyich.

The question arises how to understand these words in a conversation between factory managers, since a single unit of production cannot be 23% defective: it is either good or defective. Strukov probably meant that a large batch contains approximately 23% defective units. The question then arises, what does "approximately" mean? Suppose 30 out of 100 tested units turn out to be defective, or 300 out of 1000, or 30,000 out of 100,000, etc. Should Strukov be accused of lying?

Or another example. The coin used for drawing lots must be "symmetric", i.e. when it is tossed, heads should come up on average in half the cases and tails in half the cases. But what does "on average" mean? If you conduct many series of 10 tosses each, you will often encounter series in which the coin lands heads 4 times. For a symmetric coin this will happen in 20.5% of such series. And if after 100,000 tosses there are 40,000 heads, can the coin be considered symmetric? The decision procedure is built on probability theory and mathematical statistics.
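Both figures can be checked directly: the 20.5% comes from the binomial formula, and the 40,000-heads case can be assessed with a normal approximation to the binomial (a sketch, not tied to any particular textbook procedure):

```python
from math import comb, sqrt

# Probability of exactly 4 heads in a series of 10 tosses of a symmetric coin
p4 = comb(10, 4) / 2**10
print(f"P(4 heads in 10 tosses) = {p4:.3f}")   # ~0.205, i.e. about 20.5% of series

# Is 40,000 heads in 100,000 tosses consistent with symmetry?
n, k, p = 100_000, 40_000, 0.5
z = (k - n * p) / sqrt(n * p * (1 - p))        # normal approximation to the binomial
print(f"z = {z:.1f}")                          # about -63: such a deviation is practically
                                               # impossible for a symmetric coin
```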

The example in question may not seem serious enough. However, it is. Drawing lots is widely used in organizing industrial technical-economic experiments, for example when processing the results of measuring a quality indicator (friction torque) of bearings depending on various technological factors (the influence of the preservation environment, of the methods of preparing the bearings before measurement, of the bearing load during measurement, etc.). Suppose it is necessary to compare the quality of bearings depending on the results of storing them in different preservation oils, i.e. in oils of composition A and composition B. When planning such an experiment the question arises which bearings should be placed in the oil of composition A and which in the oil of composition B, in such a way as to avoid subjectivity and ensure the objectivity of the decision.

The answer to this question can be obtained by drawing lots. A similar example can be given with quality control of any product. To decide whether a controlled batch of products meets the established requirements, a sample is selected from it, and a conclusion about the entire batch is made from the results of the sample inspection. In this case it is very important to avoid subjectivity when forming the sample, i.e. each unit of product in the controlled batch must have the same probability of being selected for the sample. In production conditions, the selection of product units for the sample is usually carried out not by lot but by special tables of random numbers or with computer random number generators.

Similar problems of ensuring objectivity of comparison arise when comparing various schemes of organizing production or remuneration, in tenders and competitions, in selecting candidates for vacant positions, etc. Everywhere a draw or a similar procedure is needed. Let us explain with the example of identifying the strongest and second strongest teams when organizing a tournament according to the Olympic system (the loser is eliminated). Suppose the stronger team always defeats the weaker one. It is clear that the strongest team will definitely become the champion. The second strongest team will reach the final if and only if it has no games with the future champion before the final; if such a game is scheduled, the second strongest team will not make it to the final. Whoever plans the tournament can either "knock out" the second strongest team ahead of schedule, pitting it against the leader in the first round, or secure second place for it by arranging meetings with weaker teams right up to the final. To avoid subjectivity, a draw is carried out. For an 8-team tournament, the probability that the two strongest teams will meet in the final is 4/7; accordingly, with probability 3/7 the second strongest team will leave the tournament early.
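The 4/7 figure is easy to verify by statistical modeling of the draw; the sketch below assumes an 8-team single-elimination bracket in which the stronger team always wins:

```python
import random

random.seed(0)

def second_best_reaches_final(n_teams: int = 8) -> bool:
    """Random draw for a single-elimination bracket; the stronger team always wins.
    Teams are ranked 0 (strongest), 1 (second strongest), ..."""
    draw = list(range(n_teams))
    random.shuffle(draw)
    half = n_teams // 2
    # Team 1 reaches the final iff it is not in the same half of the bracket as team 0
    return (draw.index(0) < half) != (draw.index(1) < half)

trials = 100_000
freq = sum(second_best_reaches_final() for _ in range(trials)) / trials
print(f"frequency = {freq:.3f}, theoretical value 4/7 = {4/7:.3f}")
```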

Any measurement of product units (using a caliper, micrometer, ammeter, etc.) contains errors. To find out whether there are systematic errors, it is necessary to make repeated measurements of a unit of product whose characteristics are known (for example, a standard sample). It should be remembered that in addition to systematic error, there is also random error.

Therefore, the question arises how to find out from the measurement results whether there is a systematic error. If we only record whether the error obtained in the next measurement is positive or negative, the task can be reduced to the previous one. Indeed, let us compare a measurement with a coin toss: a positive error with heads, a negative error with tails (a zero error, given a sufficient number of scale divisions, almost never occurs). Then checking for the absence of a systematic error is equivalent to checking the symmetry of a coin.

The purpose of these considerations is to reduce the problem of checking the absence of a systematic error to the problem of checking the symmetry of a coin. The above reasoning leads to the so-called “sign criterion” in mathematical statistics.
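A minimal sketch of this sign criterion on hypothetical error signs; scipy's exact binomial test plays the role of checking the symmetry of the coin:

```python
from scipy.stats import binomtest

# Hypothetical signs of the errors in 20 repeated measurements of a standard sample
signs = [+1, +1, -1, +1, +1, +1, -1, +1, +1, +1,
         +1, -1, +1, +1, +1, +1, -1, +1, +1, +1]

n_plus = sum(s > 0 for s in signs)
result = binomtest(n_plus, n=len(signs), p=0.5, alternative="two-sided")
print(f"{n_plus} positive errors out of {len(signs)}, p-value = {result.pvalue:.4f}")
# A small p-value indicates that the "coin" is not symmetric,
# i.e. the measurements contain a systematic error.
```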

In the statistical regulation of technological processes, methods of mathematical statistics are used to develop rules and plans for statistical process control, aimed at the timely detection of disturbances in technological processes, at taking measures to adjust them and at preventing the release of products that do not meet the established requirements. These measures are aimed at reducing production costs and losses from the supply of substandard units. In statistical acceptance control, methods of mathematical statistics are used to develop quality control plans based on the analysis of samples from product batches. The difficulty lies in being able to build correctly the probabilistic-statistical decision-making models on the basis of which the questions posed above can be answered. In mathematical statistics, probabilistic models and methods of testing hypotheses have been developed for this purpose, in particular hypotheses that the proportion of defective units of production is equal to a certain number p0, for example p0 = 0.23 (remember Strukov's words from A.N. Tolstoy's novel).

Estimation problems. In a number of managerial, production, economic and national-economic situations, problems of a different type arise: problems of estimating the characteristics and parameters of probability distributions.

Let's look at an example. Suppose a batch consists of N electric lamps, and a sample of n lamps is selected from it and tested. A number of natural questions arise. How can the average service life of the lamps be determined from the test results of the sample elements, and with what accuracy can this characteristic be estimated? How will the accuracy change if a larger sample is taken? For what number of hours T can it be guaranteed that at least 90% of the lamps will last T hours or more?

Let us assume that when testing a sample of n electric lamps, X of them turned out to be defective. Then the following questions arise. What bounds can be given for the number D of defective lamps in the whole batch, for the defect level D/N, and so on?

Or suppose that when statistically analyzing the accuracy and stability of technological processes it is necessary to evaluate such quality indicators as the average value of the controlled parameter and the degree of its scatter in the process under consideration. According to probability theory, it is advisable to use the mathematical expectation as the average value of a random variable, and the variance, standard deviation or coefficient of variation as a statistical characteristic of the scatter. This raises the question: how are these statistical characteristics to be estimated from sample data, and with what accuracy can this be done? Many similar examples can be given. Here it was important to show how probability theory and mathematical statistics can be used in production management when making decisions in the field of statistical management of product quality.
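A sketch of how such estimates might be computed (all numbers are hypothetical): a t-based confidence interval for the mean service life and a normal-approximation interval for the defect proportion:

```python
import numpy as np
from scipy import stats

# Hypothetical service lives (hours) of n tested lamps from a batch
lives = np.array([980, 1120, 1050, 990, 1210, 1075, 1010, 1150, 1095, 1030])
n = len(lives)

mean = lives.mean()
sem = lives.std(ddof=1) / np.sqrt(n)                        # standard error of the mean
ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)  # 95% confidence interval
print(f"average service life ~ {mean:.0f} h, 95% CI = ({ci[0]:.0f}, {ci[1]:.0f})")

# Hypothetical defect count: X defective lamps in a sample of n2
X, n2 = 3, 50
p_hat = X / n2
half = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n2)
print(f"defect level estimate {p_hat:.2f}, approximate 95% CI = ({p_hat - half:.3f}, {p_hat + half:.3f})")
```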

What is "mathematical statistics"? Mathematical statistics is understood as “a branch of mathematics devoted to mathematical methods of collecting, systematizing, processing and interpreting statistical data, as well as using them for scientific or practical conclusions. The rules and procedures of mathematical statistics are based on probability theory, which allows us to evaluate the accuracy and reliability of the conclusions obtained in each problem based on the available statistical material.” In this case, statistical data refers to information about the number of objects in any more or less extensive collection that have certain characteristics.

Based on the type of problems being solved, mathematical statistics is usually divided into three sections: data description, estimation, and hypothesis testing.

Based on the type of statistical data processed, mathematical statistics is divided into four areas:

Univariate statistics (statistics of random variables), in which the result of an observation is described by a real number;

Multivariate statistical analysis, where the result of observing an object is described by several numbers (vector);

Statistics of random processes and time series, where the result of observation is a function;

Statistics of objects of a non-numerical nature, in which the result of an observation is of a non-numerical nature, for example, it is a set (a geometric figure), an ordering, or obtained as a result of a measurement based on a qualitative criterion.

Historically, some areas of statistics of objects of a non-numerical nature (in particular, problems of estimating the proportion of defects and testing hypotheses about it) and one-dimensional statistics were the first to appear. The mathematical apparatus is simpler for them, so their example is usually used to demonstrate the basic ideas of mathematical statistics.

Only those data processing methods are evidence-based, i.e. belong to mathematical statistics, that rest on probabilistic models of the relevant real phenomena and processes. We are talking about models of consumer behavior, of the occurrence of risks, of the functioning of technological equipment, of obtaining experimental results, of the course of a disease, etc. A probabilistic model of a real phenomenon should be considered constructed if the quantities under consideration and the connections between them are expressed in terms of probability theory. The correspondence of the probabilistic model to reality, i.e. its adequacy, is substantiated, in particular, with the help of statistical methods for testing hypotheses.

Non-probabilistic methods of data processing are exploratory; they can only be used in preliminary data analysis, since they do not make it possible to assess the accuracy and reliability of conclusions obtained on the basis of limited statistical material.

Probabilistic and statistical methods are applicable wherever it is possible to construct and justify a probabilistic model of a phenomenon or process. Their use is mandatory when conclusions drawn from sample data are transferred to the entire population (for example, from a sample to an entire batch of products).

In specific areas of application, both probabilistic and statistical methods of general application and specific ones are used. For example, in the section of production management devoted to statistical methods of product quality management, applied mathematical statistics (including design of experiments) are used. Using its methods, statistical analysis of the accuracy and stability of technological processes and statistical quality assessment are carried out. Specific methods include methods of statistical acceptance control of product quality, statistical regulation of technological processes, reliability assessment and control, etc.

Applied probabilistic and statistical disciplines such as reliability theory and queuing theory are widely used. The content of the first of them is clear from the name, the second deals with the study of systems such as a telephone exchange, which receives calls at random times - the requirements of subscribers dialing numbers on their telephone sets. The duration of servicing these requirements, i.e. the duration of conversations is also modeled by random variables. A great contribution to the development of these disciplines was made by Corresponding Member of the USSR Academy of Sciences A.Ya. Khinchin (1894-1959), Academician of the Academy of Sciences of the Ukrainian SSR B.V. Gnedenko (1912-1995) and other domestic scientists.

Briefly about the history of mathematical statistics. Mathematical statistics as a science begins with the works of the famous German mathematician Carl Friedrich Gauss (1777-1855), who, based on probability theory, investigated and justified the least squares method, created by him in 1795 and used for processing astronomical data (in order to clarify the orbit of a small planet Ceres). One of the most popular probability distributions, the normal one, is often named after him, and in the theory of random processes the main object of study is Gaussian processes.

At the end of the 19th century. - early 20th century Major contributions to mathematical statistics were made by English researchers, primarily K. Pearson (1857-1936) and R. A. Fisher (1890-1962). In particular, Pearson developed the chi-square test for testing statistical hypotheses, and Fisher developed analysis of variance, the theory of experimental design, and the maximum likelihood method for estimating parameters.

In the 1930s the Pole Jerzy Neyman (1894-1981) and the Englishman E. Pearson developed the general theory of testing statistical hypotheses, and the Soviet mathematicians Academician A.N. Kolmogorov (1903-1987) and Corresponding Member of the USSR Academy of Sciences N.V. Smirnov (1900-1966) laid the foundations of nonparametric statistics. In the 1940s the Romanian A. Wald (1902-1950) built the theory of sequential statistical analysis.

Mathematical statistics is developing rapidly at the present time. Thus, over the past 40 years, four fundamentally new areas of research can be distinguished:

Development and implementation of mathematical methods for planning experiments;

Development of statistics of objects of non-numerical nature as an independent direction in applied mathematical statistics;

Development of statistical methods that are resistant to small deviations from the probabilistic model used;

Widespread development of work on the creation of computer software packages designed for statistical data analysis.

Probabilistic-statistical methods and optimization. The idea of ​​optimization permeates modern applied mathematical statistics and other statistical methods. Namely, methods of planning experiments, statistical acceptance control, statistical regulation of technological processes, etc. On the other hand, optimization formulations in decision-making theory, for example, the applied theory of optimization of product quality and standard requirements, provide for the widespread use of probabilistic statistical methods, primarily applied mathematical statistics.

In production management, in particular, when optimizing product quality and standard requirements, it is especially important to apply statistical methods at the initial stage of the product life cycle, i.e. at the stage of research preparation of experimental design developments (development of promising product requirements, preliminary design, technical specifications for experimental design development). This is due to the limited information available at the initial stage of the product life cycle and the need to predict the technical capabilities and economic situation for the future. Statistical methods should be used at all stages of solving an optimization problem - when scaling variables, developing mathematical models of the functioning of products and systems, conducting technical and economic experiments, etc.

In optimization problems, including optimization of product quality and standard requirements, all areas of statistics are used. Namely, statistics of random variables, multivariate statistical analysis, statistics of random processes and time series, statistics of objects of non-numerical nature. It is advisable to select a statistical method for analyzing specific data in accordance with the recommendations.

The group of methods under consideration is the most important in sociological research; these methods are used in almost every sociological study that can be considered truly scientific. They are aimed mainly at identifying statistical patterns in empirical information, i.e. patterns that are fulfilled “on average”. Actually, sociology deals with the study of the “average person”. In addition, another important purpose of using probabilistic and statistical methods in sociology is to assess the reliability of the sample. How much confidence is there that the sample gives more or less accurate results and what is the error of statistical conclusions?

The main object of study when applying probabilistic and statistical methods is the random variable. A random variable taking a particular value is a random event: an event that, under the given conditions, may or may not occur. For example, if a sociologist conducts surveys of political preferences on a city street, then the event "the next respondent turns out to be a supporter of the party in power" is random if nothing about the respondent has revealed his political preferences in advance. If the sociologist interviewed the respondent near the building of the Regional Duma, the event is no longer random. A random event is characterized by the probability of its occurrence. Unlike the classic problems with dice and card combinations studied in probability courses, in sociological research calculating a probability is not so simple.

The most important basis for the empirical assessment of probability is the tendency of frequency toward probability, where frequency is understood as the ratio of the number of times an event occurred to the number of times it could in principle have occurred. For example, if among 500 respondents randomly selected on the streets of a city, 220 turned out to be supporters of the party in power, then the frequency of occurrence of such respondents is 0.44. With a representative sample of sufficiently large size we obtain the approximate probability of the event, or the approximate proportion of people possessing the given trait. In our example, with a well-chosen sample, we conclude that approximately 44% of citizens are supporters of the party in power. Of course, since not all citizens were surveyed and some may have lied during the survey, there is some error.
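For this example the margin of error can be sketched with the usual normal approximation for a proportion (the 1.96 factor corresponds to a 95% confidence level):

```python
from math import sqrt

n, k = 500, 220                                  # respondents surveyed and supporters found
p_hat = k / n                                    # frequency as an estimate of the proportion
margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)    # approximate 95% margin of error
print(f"estimate = {p_hat:.2f} +/- {margin:.3f}")
# roughly 0.44 +/- 0.04: about 44% of citizens, give or take some 4 percentage points
```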

Let's consider some problems that arise in the statistical analysis of empirical data.

Estimation of the magnitude distribution

If a certain characteristic can be expressed quantitatively (for example, the political activity of a citizen as the number of times over the past five years he has participated in elections at various levels), then the task can be set of estimating the distribution law of this characteristic as a random variable. In other words, the distribution law shows which values the quantity takes more often and which less often, and how much more or less often. Most often encountered, both in technology and nature and in society, is the normal distribution law. Its formula and properties are set out in any statistics textbook, and Fig. 10.1 shows the appearance of its graph: a "bell-shaped" curve, which can be more "stretched" upward or more "smeared" along the axis of values of the random variable. The essence of the normal law is that the random variable most often takes values near some "central" value, called the mathematical expectation, and the further from it, the less often values occur there.

There are many examples of distributions that can be accepted as normal with a small error. Back in the 19th century. The Belgian scientist A. Quetelet and the Englishman F. Galton proved that the frequency distribution of any demographic or anthropometric indicator (life expectancy, height, age at marriage, etc.) is characterized by a “bell-shaped” distribution. The same F. Galton and his followers proved that psychological characteristics, for example, abilities, obey the normal law.

Fig. 10.1.

Example

The most striking example of a normal distribution in sociology concerns the social activity of people. According to the law of normal distribution, it turns out that socially active people in society are usually about 5–7%. All these socially active people go to rallies, conferences, seminars, etc. Approximately the same number are excluded from participation in social life altogether. The majority of people (80–90%) seem to be indifferent to politics and public life, but they follow the processes that interest them, although in general they have a detached attitude towards politics and society and do not show significant activity. Such people miss most political events, but occasionally watch news on television or the Internet. They also go to vote in the most important elections, especially if they are “threatened with a stick” or “encouraged with a carrot.” Members of these 80–90% are almost useless individually from a socio-political point of view, but sociological research centers are quite interested in these people, since there are a lot of them, and their preferences cannot be ignored. The same applies to pseudo-scientific organizations that carry out research on orders from politicians or trade corporations. And the opinion of the “gray masses” on key issues related to predicting the behavior of many thousands and millions of people in elections, as well as during acute political events, during a split in society and conflicts between different political forces, is not indifferent to these centers.

Of course, not all quantities are distributed according to the normal law. Besides it, the most important distributions in mathematical statistics are the binomial and exponential distributions and the Fisher-Snedecor, chi-square and Student distributions.

Evaluation of the relationship of features

The simplest case is when you simply need to establish the presence or absence of a connection. The most popular method here is the chi-square test. It is designed for working with categorical data: for example, gender and marital status are clearly categorical. Some data appear numerical at first glance but can be "turned into" categorical data by dividing the range of values into several small intervals. For example, factory work experience can be categorized as less than one year, one to three years, three to six years, and more than six years.

Let the parameter X have n possible values x₁, …, xₙ, and the parameter Y have m possible values y₁, …, yₘ, and let q_ij be the observed frequency of occurrence of the pair (xᵢ, yⱼ), i.e. the number of detected occurrences of such a pair. We calculate the theoretical frequencies, i.e. how many times each pair of values should appear if the quantities were completely unrelated:

q̃_ij = (sum of the i-th row of the table) · (sum of the j-th column of the table) / (total number of observations).

Based on the observed and theoretical frequencies, we calculate the value

χ² = Σ (q_ij - q̃_ij)² / q̃_ij, the sum taken over all cells of the table.

You also need to calculate the number of degrees of freedom according to the formula

df = (m - 1)(n - 1),

where m and n are the numbers of categories in the table. In addition, we choose a significance level. The higher the reliability we want to obtain, the lower the significance level should be taken; typically a value of 0.05 is chosen, which means we can trust the results with probability 0.95. Then, in reference tables, we find the critical value χ²_cr for this number of degrees of freedom and significance level. If χ² < χ²_cr, the parameters X and Y are considered independent; if χ² > χ²_cr, the parameters X and Y are considered dependent; if χ² is close to χ²_cr, it is risky to draw conclusions about dependence or independence, and in that case it is advisable to conduct additional research.

Note also that the chi-square test can be used with high confidence only when all the theoretical frequencies are not below a certain threshold, which is usually taken to be 5. Let v be the minimum theoretical frequency. For v > 5 the chi-square test can be used with confidence; for v < 5 its use becomes undesirable; in the borderline case v = 5 the question remains open, and additional investigation of the applicability of the chi-square test is required.

Let us give an example of using the chi-square method. Suppose, for example, that in a certain city a survey was conducted among young fans of the local football teams and the following results were obtained (Table 10.1).

Let us put forward the hypothesis that the football preferences of the youth of city N do not depend on the gender of the respondent, at the standard significance level of 0.05. We calculate the theoretical frequencies (Table 10.2).

Table 10.1

Fan survey results

Table 10.2

Theoretical preference frequencies

For example, the theoretical frequency for young male fans of Zvezda is obtained as the product of the corresponding row and column totals divided by the total number of respondents; the other theoretical frequencies are computed similarly. From the observed and theoretical frequencies we then calculate the chi-square value.

We determine the number of degrees of freedom and, for this number of degrees of freedom and the significance level of 0.05, look up the critical value:

Since the computed value exceeds the critical one, and the excess is considerable, we can say almost with certainty that the football preferences of the boys and girls of city N differ greatly, unless the sample is unrepresentative, for example if the researcher did not draw it from different areas of the city but limited himself to interviewing respondents in his own block.
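Since the original tables are not reproduced here, the sketch below uses a hypothetical contingency table (team names other than Zvezda are invented) to show how the same test is usually run in practice; scipy computes the theoretical frequencies, the chi-square statistic and the p-value in one call:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical analogue of Table 10.1: rows = gender, columns = favourite team
observed = np.array([
    [45, 30, 25],   # young men:  "Zvezda", "Dynamo", "Torpedo"
    [20, 40, 40],   # young women
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
print("theoretical frequencies:\n", expected.round(1))
# p < 0.05 would mean the independence hypothesis (preferences vs. gender) is rejected.
```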

A more difficult situation arises when it is necessary to quantify the strength of the connection. In this case, methods of correlation analysis are often used; these methods are usually discussed in advanced courses in mathematical statistics.

Approximation of dependencies using point data

Let there be a set of points, i.e. empirical data (xᵢ, yᵢ), i = 1, …, n. It is required to approximate the real dependence of the parameter y on the parameter x and also to develop a rule for calculating values of y when x lies between two "nodes" xᵢ.

There are two fundamentally different approaches to the problem. The first is to select, from among the functions of a given family (for example, polynomials), a function whose graph passes through the existing points. The second approach does not "force" the graph of the function to pass through the points. The most popular method in sociology and a number of other sciences, the least squares method, belongs to the second group.

The essence of the least squares method is as follows. Given is a family of functions y(x, a₁, …, aₘ) with m undetermined coefficients. The coefficients are chosen by solving the optimization problem

d = Σ (y(xᵢ, a₁, …, aₘ) - yᵢ)² → min.

The minimum value of d can serve as a measure of the approximation accuracy. If this value is too large, a different class of functions y should be chosen or the class used should be extended. For example, if the class "polynomials of degree at most 3" did not provide acceptable accuracy, we take the class "polynomials of degree at most 4" or even "polynomials of degree at most 5".

Most often the method is used for the family of "polynomials of degree at most N":

y(x) = a₀ + a₁x + a₂x² + … + a_N x^N.

For example, N = 1 gives the family of linear functions, N = 2 the family of linear and quadratic functions, N = 3 the family of linear, quadratic and cubic functions. Let

M₁ = Σ xᵢ, M₂ = Σ xᵢ², M′ = Σ yᵢ, M* = Σ xᵢyᵢ (the sums are taken over all n data points).

Then the coefficients of the linear function y = a₀x + a₁ (N = 1) are sought as the solution of the system of linear equations

a₀M₂ + a₁M₁ = M*,
a₀M₁ + a₁n = M′.

The coefficients of a function of the form a₀ + a₁x + a₂x² (N = 2) are sought as the solution of the system

a₀n + a₁Σxᵢ + a₂Σxᵢ² = Σyᵢ,
a₀Σxᵢ + a₁Σxᵢ² + a₂Σxᵢ³ = Σxᵢyᵢ,
a₀Σxᵢ² + a₁Σxᵢ³ + a₂Σxᵢ⁴ = Σxᵢ²yᵢ.

Those wishing to apply the method for an arbitrary N can do so by noting the pattern according to which these systems of equations are composed.
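A sketch of a linear least-squares fit on hypothetical data, first by solving the normal equations written above and then with numpy's built-in polynomial fit:

```python
import numpy as np

# Hypothetical empirical points (x_i, y_i)
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([11.0, 14.2, 17.1, 20.3, 23.0, 26.4])

# Linear fit y = a0*x + a1 via the normal equations given above
n = len(x)
M1, M2, Mprime, Mstar = x.sum(), (x**2).sum(), y.sum(), (x * y).sum()
a0, a1 = np.linalg.solve([[M2, M1], [M1, n]], [Mstar, Mprime])
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")

# The same result with numpy's built-in least-squares polynomial fit
print(np.polyfit(x, y, deg=1))   # [slope, intercept]
```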

Let us give an example of using the least squares method. Suppose the membership of a certain political party changed over the years as follows:

It can be noted that the changes in party membership over the different years are not very large, which allows us to approximate the dependence by a linear function. To make the calculation easier, instead of the variable x (the year) we introduce the variable t = x - 2010, i.e. we take the first year of the count as "zero". We calculate M₁ and M₂:

Now we calculate M′ and M*:

The coefficients a₀ and a₁ of the function y = a₀t + a₁ are calculated as the solution of the system of equations

Solving this system, for example by Cramer's rule or by the substitution method, we obtain a₀ = 11.12, a₁ = 3.03. Thus we obtain the approximation y ≈ 11.12·t + 3.03,

which allows us not only to operate with a single function instead of a set of empirical points, but also to calculate values of the function beyond the boundaries of the initial data, i.e. "to predict the future".

Note also that the least squares method can be used not only for polynomials but also for other families of functions, for example logarithmic and exponential ones.

The degree of confidence in a model constructed by the least squares method can be determined on the basis of the R-squared measure, or coefficient of determination. It is calculated as

R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²,

where ŷᵢ are the values predicted by the model and ȳ is the mean of the yᵢ. The closer R² is to 1, the more adequate the model.
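Continuing the hypothetical fit sketched earlier, R² can be computed directly from this definition:

```python
import numpy as np

# Hypothetical data and linear fit (same as in the sketch above)
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([11.0, 14.2, 17.1, 20.3, 23.0, 26.4])
a0, a1 = np.polyfit(x, y, deg=1)       # slope and intercept of the least-squares line

y_hat = a0 * x + a1
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"R^2 = {r2:.4f}")               # close to 1 indicates an adequate model
```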

Outlier detection

An outlier in a data series is an anomalous value that stands out sharply in the overall sample or series. For example, suppose the percentage of a country's citizens with a positive attitude toward a certain politician was, in 2008-2013, respectively 15, 16, 12, 30, 14 and 12%. It is easy to see that one of the values differs sharply from all the others: in 2011 the politician's rating for some reason sharply exceeded the usual values, which lay in the range 12-16%. The presence of outliers can be due to various reasons:

  • 1) measurement errors;
  • 2) the unusual nature of the input data (for example, when the average percentage of votes received by a politician is analyzed, this value at a polling station in a military unit may differ significantly from the average value for the city);
  • 3) a consequence of the distribution law (values that differ sharply from the rest can be determined by a mathematical law; for example, in the case of a normal distribution an object with a value sharply different from the average may well be included in the sample);
  • 4) disasters (for example, during a period of short but acute political confrontation the level of political activity of the population can change dramatically, as happened during the "color revolutions" of 2000-2005 and the "Arab Spring" of 2011);
  • 5) control actions (for example, if in the year before the study a politician made a very popular decision, his rating in that year may be significantly higher than in other years).

Many data analysis methods are not robust to outliers, so to apply them effectively the data must be cleared of outliers. A striking example of a non-robust method is the least squares method mentioned above. The simplest method of searching for outliers is based on the so-called interquartile range. One determines the range (the 1.5 multiplier is the usual convention)

[Q₁ - 1.5(Q₃ - Q₁); Q₃ + 1.5(Q₃ - Q₁)],

where Q_m is the m-th quartile. If some member of the series does not fall within this range, it is regarded as an outlier.

Let us explain with an example. The meaning of the quartiles is that they divide the series into four equal or approximately equal groups: the first quartile "separates" the left quarter of the series sorted in ascending order, the third quartile separates the right quarter, and the second quartile runs down the middle. Let us explain how to find Q₁ and Q₃. Suppose the series sorted in ascending order has n values. If n + 1 is divisible by 4 without remainder, then Q_k is the k(n + 1)/4-th member of the series. For example, for the series 1, 2, 5, 6, 7, 8, 10, 11, 13, 15, 20 the number of members is n = 11. Then (n + 1)/4 = 3, i.e. the first quartile Q₁ = 5 is the third member of the series; 3(n + 1)/4 = 9, i.e. the third quartile Q₃ = 13 is the ninth member of the series.

The case is a little more complicated when n + 1 is not a multiple of 4. For example, take the series 2, 3, 5, 6, 7, 8, 9, 30, 32, 100, where the number of members is n = 10. Then (n + 1)/4 = 2.75, a position between the second member of the series (v₂ = 3) and the third member (v₃ = 5). We then take the value 0.75·v₂ + 0.25·v₃ = 0.75·3 + 0.25·5 = 3.5; this will be Q₁. Similarly, 3(n + 1)/4 = 8.25 is a position between the eighth member (v₈ = 30) and the ninth member (v₉ = 32); we take the value 0.25·v₈ + 0.75·v₉ = 0.25·30 + 0.75·32 = 31.5, and this will be Q₃. There are other ways of calculating Q₁ and Q₃, but it is recommended to use the variant presented here.
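A sketch implementing this outlier rule with the quartile convention described above (the 1.5·IQR multiplier is the usual convention); applied to the second example series it flags 100 as an outlier:

```python
def quartile(sorted_vals, k):
    """Q_k by the convention described above: position k*(n+1)/4,
    fractional positions weighted toward the lower neighbour."""
    n = len(sorted_vals)
    pos = k * (n + 1) / 4
    j = int(pos)                      # 1-based index of the lower neighbour
    frac = pos - j
    if frac == 0:
        return sorted_vals[j - 1]
    return frac * sorted_vals[j - 1] + (1 - frac) * sorted_vals[j]

def find_outliers(series):
    s = sorted(series)
    q1, q3 = quartile(s, 1), quartile(s, 3)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in series if v < low or v > high]

data = [2, 3, 5, 6, 7, 8, 9, 30, 32, 100]    # the series from the example above
print(find_outliers(data))                   # [100]
```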

  • Strictly speaking, in practice, an “approximately” normal law is usually encountered - since the normal law is defined for a continuous quantity along the entire real axis, many real quantities cannot strictly satisfy the properties of normally distributed quantities.
  • Nasledov A. D. Mathematical Methods of Psychological Research: Analysis and Interpretation of Data: a textbook. St. Petersburg: Rech, 2004. Pp. 49-51.
  • For the most important distributions of random variables see, for example: Orlov A. I. Mathematics of Chance: Probability and Statistics, Basic Facts: a textbook. Moscow: MZ-Press, 2004.

Of particular interest is the quantitative assessment of business risk using mathematical statistics methods. The main tools of this assessment method are:

§ probability of occurrence of a random variable,

§ mathematical expectation or average value of the random variable under study,

§ dispersion,

§ standard (mean square) deviation,

§ the coefficient of variation ,

§ probability distribution of the random variable under study.

To make a decision, you need to know the magnitude (degree) of risk, which is measured by two criteria:

1) average expected value (mathematical expectation),

2) fluctuations (variability) of the possible result.

The average expected value is the weighted average of a random variable associated with the uncertainty of the situation:

x̄ = Σ xᵢ pᵢ,

where xᵢ is a value of the random variable and pᵢ is the probability of its occurrence.

Average expected value measures the outcome we expect on average.

The average value is a generalized quantitative characteristic and does not by itself allow a decision to be made in favor of any particular value of the random variable.

To make a decision, it is necessary to measure fluctuations in indicators, that is, to determine the measure of variability of a possible result.

Variation in a possible outcome is the degree to which the expected value deviates from the average value.

For this purpose, in practice, two closely related criteria are usually used: “dispersion” and “standard deviation”.

Dispersion (variance) is the weighted average of the squared deviations of the actual results from the expected average:

σ² = Σ (xᵢ - x̄)² pᵢ.

The standard deviation is the square root of the variance. It is a dimensional quantity and is measured in the same units as the random variable under study:

σ = √( Σ (xᵢ - x̄)² pᵢ ).

Variance and standard deviation provide a measure of absolute variation. The coefficient of variation is usually used for analysis.

The coefficient of variation is the ratio of the standard deviation to the average expected value, multiplied by 100%:

V = (σ / x̄) · 100%.

The coefficient of variation is not affected by the absolute values ​​of the studied indicator.

Using the coefficient of variation, you can even compare fluctuations in characteristics expressed in different units of measurement. The coefficient of variation can vary from 0 to 100%. The higher the coefficient, the greater the fluctuations.


In economic statistics the following assessment of different values of the coefficient of variation has been established: up to 10%, weak variability; 10-25%, moderate; over 25%, high.

Accordingly, the higher the fluctuations, the greater the risk.

Example. The owner of a small store purchases a certain perishable product for sale at the beginning of each day. A unit of this product costs 200 UAH; the selling price is 300 UAH per unit. From observations it is known that daily demand for this product can be 4, 5, 6 or 7 units with corresponding probabilities 0.1, 0.3, 0.5 and 0.1. If the product is not sold during the day, it can always be sold off at the end of the day at a price of 150 UAH per unit. How many units of this product should the store owner purchase at the beginning of the day?

Solution. Let us construct the profit matrix for the store owner. Let us calculate the profit the owner will receive if, for example, he purchases 7 units of the product and sells 6 units during the day and 1 at the end of the day. Each unit sold during the day gives a profit of 100 UAH, and each unit sold off at the end of the day gives a loss of 200 - 150 = 50 UAH. Thus the profit in this case will be 6 · 100 - 1 · 50 = 550 UAH.

Calculations are carried out similarly for other combinations of supply and demand.

The expected profit is calculated as the mathematical expectation of the possible profit values in each row of the constructed matrix, taking the corresponding probabilities into account. As can be seen, the largest of the expected profits is 525 UAH; it corresponds to purchasing 6 units of the product.

To justify the final recommendation to purchase the required number of units of the product, we calculate the variance, standard deviation and coefficient of variation for each possible combination of supply and demand for the product (each row of the profit matrix):

Calculations for each row of the profit matrix (columns: profit xᵢ, probability pᵢ, xᵢpᵢ, xᵢ²pᵢ):

Purchase of 4 units:
400     0.1    40     16000
400     0.3    120    48000
400     0.5    200    80000
400     0.1    40     16000
total   1.0    400    160000

Purchase of 5 units:
350     0.1    35     12250
500     0.3    150    75000
500     0.5    250    125000
500     0.1    50     25000
total   1.0    485    237250

Purchase of 6 units:
300     0.1    30     9000
450     0.3    135    60750
600     0.5    300    180000
600     0.1    60     36000
total   1.0    525    285750
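The table and the risk figures quoted below can be reproduced with a short calculation using the prices and probabilities given in the example (the purchase of 7 units is included for completeness):

```python
from math import sqrt

price_buy, price_day, price_evening = 200, 300, 150
demand_probs = {4: 0.1, 5: 0.3, 6: 0.5, 7: 0.1}

def profit(purchased: int, demand: int) -> float:
    sold_day = min(purchased, demand)
    leftover = purchased - sold_day
    return sold_day * (price_day - price_buy) + leftover * (price_evening - price_buy)

for purchased in (4, 5, 6, 7):
    mean = sum(p * profit(purchased, d) for d, p in demand_probs.items())
    var = sum(p * (profit(purchased, d) - mean) ** 2 for d, p in demand_probs.items())
    sd = sqrt(var)
    cv = sd / mean * 100
    print(f"buy {purchased}: expected profit {mean:6.1f} UAH, sigma {sd:6.2f}, V = {cv:4.1f}%")
```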

Whether the store owner should purchase 6 units rather than 5 or 4 is not obvious, since the risk when purchasing 6 units (19.2%) is greater than when purchasing 5 units (9.3%) and, all the more so, than when purchasing 4 units (0%).

Thus, we have all the information about the expected profits and risks, and the store owner decides how many units of the product to purchase each morning, taking into account his experience and appetite for risk.

In our opinion, the store owner should be advised to purchase 5 units of the product every morning; his average expected profit will then be 485 UAH. Compared with purchasing 6 units, for which the average expected profit is 525 UAH, i.e. 40 UAH more, the risk in that case would be 2.06 times greater.