2006


631.6

Using Johnson Distribution in Statistical Analysis

 

Dimitrios Papadopoulos

Moscow State University of Environmental Engineering

Problem Research Laboratory, Sydney, Australia

 

Introduction

Many parameters of various engineering problems are determined at several points of the investigated area. The value of a parameter to use in the calculations is obtained either by averaging the measurements or by other considerations specific to each case. As a working example we will use the determination of parameters in saline soils.

The basic equation of the movement of salt and water contains seven kinds of parameters: the space coordinates x, y, z; time t; soil porosity μ; convection diffusion coefficient D; filtration velocity; coefficient of salt exchange; concentration (or content) of salt c(x, y, z, t); and concentration of saturation s.

The coordinates x, y, z, time t and the concentration of saturation s have more or less exact values, while the others are random variables, since in practice they are usually determined at several points of a region with an inevitable error of measurement.

Parameter μ, as a rule, varies insignificantly, ranging for loamy soils from 0.4 to 0.5, and it does not play a major role in the determination of the leaching rates. Knowing approximately the smallest value μ = μmin and the largest μ = μmax, one can estimate the error in the calculation of the leaching rate N.

In a similar way, it is possible to determine which parameters have to be considered stochastic and which, at least in a first approximation, can be assumed to be non-random.

Correct determination of the concentration of salt in soil cp is very important. Clearly, calculating the leaching rates from the average alone is very risky, because up to 50 % of the area might remain insufficiently leached.

Since the salt content of a region is determined by conducting a detailed salt sampling, i.e. by analyzing a salt sample, the use of Mathematical Statistics might help to answer some important questions and determine the necessary parameters, such as the value of the salt content to be used for the calculation of the leaching rates.

We have at our disposal 90 samples of salt content in various layers in a few regions of Central Asia. Our research shows that the overwhelming majority of the samples follow the SB Johnson distribution (see also [2]). For this reason we stress here the material relating to the family of Johnson distributions [6, 7]. A program in VB.NET has been developed to help with the calculations. The program reads a sample or a frequency table and calculates all the necessary parameters.

The special feature of our approach is that the program optimizes some parameters to achieve the best possible fit to the empirical distribution.

The software can be requested from the Problem Research Laboratory of MSUEE by contacting Prof. V. V. Shabanov or Dr. N. P. Bunina.

Distribution law and calculation of the parameters

Statistical analysis of the source data is very well described in the literature [1, 3, 4, 5, 6 etc.]. We will try to provide only the background information necessary to present our ideas.

Distribution Functions

For the calculation of the value cp of salt corresponding to the probability p using the sample data, it is necessary to know its distribution law. In general, the distribution law of a random variable X represented by a sample {x1, x2, ..., xn} is described by two functions f(x) and F(x), which as a rule depend on certain parameters. The first function f(x) is called the differential distribution function, or distribution density; the second, F(x), is the integral distribution function.

The purpose of the statistical analysis during the processing of the initial salinization data is: given the sample X = {x1, x2, ..., xn}, determine the functions f(x) and F(x) and their parameters.

One of the most common distribution laws is the so called normal distribution. It is described by the functions

f(x) = (1/(σ√(2π))) exp[-(x - μ)²/(2σ²)],  F(x) = ∫_{-∞}^{x} f(t) dt, (1)

 

where the parameters MX = μ and DX = σ² (the mathematical expectation and the variance; the square root of the variance is the standard deviation) have a well defined physical meaning and can be estimated using the sample data x1, x2, ..., xn.

The estimated values of the mathematical expectation MX = μ and variance DX = σ² (the sample mean m and sample variance s²) are calculated using the formulae

m = (1/n) Σ xi,  s² = (1/n) Σ (xi - m)². (2)

 

Roughly speaking, the distribution function F(x) gives the probability that the random variable X will take values less than or equal to x, while the distribution density f(x) shows how probable values near x are.

The case when the mathematical expectation is 0 and the variance is 1 is of special interest. Such a distribution is called the Standard Normal Distribution, or N(0, 1).

Tables of f(x) and F(x) for N(0, 1) are provided in all books on Mathematical Statistics (see for example [1, 4, 5, 6]). The tables can be used if the random variable X is normalized, i.e. replaced by the variable z:

z = (x - m)/s. (3)

 

The normal distribution is a well defined and well documented law. The studied random variables are often transformed in such a way that the new transformed variable follows the normal distribution law.

In Mathematical Statistics there is no method by which the distribution law can be unambiguously determined using sample data. First the so-called null hypothesis H0 is stated (according to H0 the random variable X follows a certain distribution law). Then the validity of H0 is tested. The null hypothesis is usually stated on the basis of the form of the differential distribution function f*(x) and tested on the basis of the integral function F*(x).

Frequency Tables

An initial impression of the distribution law can be obtained by building a frequency table of the random variable.

Assume that a sample X = {x1, x2, ..., xn}, representing for example the salt content in the layer [hmin, hmax] of soil, is given. Let the smallest value of X be xmin and the largest value xmax, i.e. any xi from X belongs to the interval (xmin, xmax).

To construct the empirical differential and integral distributions, the following steps are taken:

The segment R = xmax - xmin is broken into several intervals. The number of intervals is usually 8 to 12, sometimes more than 12. This number can also be estimated by the formula

k = [log2n] + 1, (4)

where [log2n] is the largest integer not greater than log2n.

The approximate width of each interval is calculated

Δx = R / k, (5)

With some practice, the number of intervals and their width are chosen without formulae (4) and (5). One should only remember that too large a width makes the picture less graphic, hiding some important details, while too small a width might reveal uncharacteristic and/or unimportant details of the studied population.

All values of X falling into the i-th interval are equated to its middle value.

The frequencies ni, the number of elements of X falling into each interval, are calculated. After that the operations are performed on k numbers instead of n.

The cumulative frequencies Ni representing the sum of the frequencies up to and including the current interval are calculated.

Using formulae (2), the sample mean m and variance s² are calculated.

The existence of serious blunders, i.e. erroneous measurements, is tested. Under the normal distribution 99,7 % of the possible values of a random variable lie within the interval m ± 3s. For greater security, especially when the distribution law is asymmetric, the 4-sigma interval can be used. Values of X not lying within the interval (m - ks, m + ks) can be discarded as serious measurement errors. This is justified by the Chebyshev inequality [5 and others], according to which for any distribution the probability of a particular x falling beyond the limits of the interval (m - ks, m + ks) is not greater than 1/k². For k = 4 this probability is no more than 0,06. In other words, for any distribution not less than 94 % of the possible values of X are located within the interval m ± 4s.

The new mean and variance are calculated. The process is repeated until all the values fall into the 4-sigma interval.
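The screening loop just described can be sketched as follows (a hypothetical Python helper, not the Laboratory's VB.NET program):

```python
# Iterative blunder screening: discard values outside m - k*s .. m + k*s
# and recompute the statistics until no value is discarded (k = 4 by default).
from statistics import mean, pstdev

def screen_blunders(sample, k=4.0):
    data = list(sample)
    while True:
        m, s = mean(data), pstdev(data)
        kept = [x for x in data if m - k * s <= x <= m + k * s]
        if len(kept) == len(data):   # nothing discarded: sample is stable
            return kept, m, s
        data = kept
```

For example, a single grossly erroneous reading in an otherwise uniform sample is removed on the first pass.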

The empirical relative frequencies are calculated. They give us an estimate of the values of the differential distribution function f*(x).

The empirical cumulative relative frequencies are calculated. They give us an estimate of the values of the integral distribution function F*(x).

This information is sufficient for the construction of the graphs of the distribution functions. Axis x will show the values of the random variable, axis y the values of f* and F* (i.e. we will have a primary and a secondary axis y). The form of f*(x) gives the first impression of the distribution law that the variable X follows. The graph of F* is useful for testing the hypothesis concerning the distribution law.
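The steps above can be sketched in Python (a minimal illustration; the interval handling in the actual program may differ):

```python
# Build a frequency table: interval count by formula (4), width by (5),
# then relative frequencies f* and cumulative relative frequencies F*.
import math

def frequency_table(sample):
    n = len(sample)
    k = math.floor(math.log2(n)) + 1              # formula (4)
    xmin, xmax = min(sample), max(sample)
    dx = (xmax - xmin) / k                        # formula (5)
    counts = [0] * k
    for x in sample:
        i = min(int((x - xmin) / dx), k - 1)      # put xmax into the last interval
        counts[i] += 1
    mids = [xmin + (i + 0.5) * dx for i in range(k)]
    f_star = [c / n for c in counts]
    F_star, cum = [], 0.0
    for p in f_star:
        cum += p
        F_star.append(cum)
    return mids, counts, f_star, F_star
```

For a sample of size n = 100, formula (4) gives k = 7 intervals, as in the example below.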

Logarithmically Normal Distribution

The logarithmically normal distribution is characterized by the fact that if the random variable X is represented by a sample of size n (i.e. X = {x1, x2, ..., xn}), then the random variable Y, obtained by taking the logarithms of X, i.e. Y = {y1, y2, ..., yn} = {ln x1, ln x2, ..., ln xn}, is distributed normally.

If my and sy² are the estimates of the mathematical expectation and variance of Y, then the sample mean and variance of X are found using the formulae:

mx = exp(my + sy²/2); (6)

sx² = mx² [exp(sy²) - 1]. (7)

Since in practice the parameters of X are found first, the following formulae can be used to calculate my and sy from mx and sx:

my = ln mx - sy²/2; (8)

sy² = ln(1 + kvx²); (9)

kvx = sx / mx. (10)
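These relations form a consistent pair, which can be checked numerically (a sketch; the starting values mx and sx here are those of the worked example later in the paper):

```python
# Round-trip check: (mx, sx) -> (my, sy) by formulae (8)-(10), then back
# by formulae (6)-(7); the starting values must be recovered.
import math

mx, sx = 0.464, 0.165
kvx = sx / mx                                       # formula (10)
sy2 = math.log(1.0 + kvx ** 2)                      # formula (9)
my = math.log(mx) - sy2 / 2.0                       # formula (8)

mx_back = math.exp(my + sy2 / 2.0)                  # formula (6)
sx_back = mx_back * math.sqrt(math.exp(sy2) - 1.0)  # formula (7)
```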

Central moments and the (b1, b2) plane

Any distribution can in principle be precisely defined by its moments [8]. We will continue to use the sample estimates of the theoretical values.

The sample moment of the first degree is the mean of the random variable

m1 = (1/n) Σ xi. (11)

 

Central sample moments of the degree r (r > 1) are defined as

mr = (1/n) Σ (xi - m1)^r. (12)

 

In most cases the first four moments adequately describe a random variable X. The first moment m1 geometrically defines the location; the second moment m2 is equal to the variance s² and determines the spread of the values of X around the mean; the third moment m3 defines the asymmetry (the term skewness is also used); and the fourth, m4, the steepness (or kurtosis). Since the first and second moments contain information on the location and spread but not the shape, the classification system can be limited to the third and fourth moments [6, 8].

We introduce now the so-called Pearson constants b1 and b2

b1 = m3²/m2³,  b2 = m4/m2². (13)

 

A convenient way for presentation of the distributions is the (b1, b2) plane [6]. As we can see in Figure 1 the plane (b1, b2) is divided into several parts representing various regions with various distributions. The top of the plane is the impossible area: there are no distributions with b2<1+b1. The Normal distribution has b1 = 0, b2 = 3, i.e. it is represented by a single point, as is the exponential distribution. The Lognormal and Gamma distributions are represented by a line, and others by a region.
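The sample estimates b1 and b2 from formulae (11)-(13) can be computed directly (a short sketch):

```python
# Pearson constants from raw data: central moments by formula (12),
# then b1 = m3^2 / m2^3 and b2 = m4 / m2^2 (formula (13)).
def pearson_constants(sample):
    n = len(sample)
    m1 = sum(sample) / n                                       # formula (11)
    moment = lambda r: sum((x - m1) ** r for x in sample) / n  # formula (12)
    m2, m3, m4 = moment(2), moment(3), moment(4)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2                     # formula (13)
```

A symmetric sample gives b1 = 0; the position of the point (b1, b2) then suggests a candidate distribution on the plane of Figure 1.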

Johnson Distributions

The family of empirical distributions of N.L. Johnson [6, 7, 8] is convenient in that it is obtained by the transformation of the normal distribution, it covers the whole plane (b1, b2) and it could be used in place of any other distribution.

In [6, 7] these distributions are described in detail. We will look at the part of the material that in our opinion is adequate for the analysis of one sample that describes the state of some physical phenomenon (content of salt in soil, humidity, hydraulic conductivity etc).

 

Figure 1. Various distributions on the plane (b1, b2)

 

In general the transformation of the normal distribution has the form

z = γ + ητ(x; ε, ζ)  (η > 0, ζ > 0; γ, ε ∈ (-∞, +∞)). (14)

Here τ is an arbitrary function; γ, η, ε, ζ are parameters; z is a random variable following the standard normal distribution.

Johnson N. L. [6, 7] proposed the following three functions:

τ1(x; ε, ζ) = ln[(x - ε)/ζ]; (15)

τ2(x; ε, ζ) = ln[(x - ε)/(ζ - x)]; (16)

τ3(x; ε, ζ) = ln[(x - ε)/ζ + √(((x - ε)/ζ)² + 1)]. (17)

 

When τ = τ1 we get the lognormal distribution (or SL Johnson distribution); when τ = τ2, the SB Johnson distribution; and when τ = τ3, the SU distribution, which is often symmetric or has left asymmetry.

Calculation of the parameters for the SL (Lognormal) Distribution

As shown in [6], for the SL distribution we can write

z = γ* + η ln(x - ε), (18)

where γ* = γ - η ln ζ. (19)

In this case parameter ε is the smallest value of X and we set it at zero.

The estimates for η and γ* are found from the formulae:

η = 1/sy,  γ* = -my/sy, (20)

 

where my and sy are the mean and standard deviation calculated from the logarithms yi = ln xi.

We can calculate the values of the theoretical distribution function F(z) and the values of x corresponding to a given probability p (for example, the value of salt content c0.9 that guarantees with probability p = 0,9 that the salt content in the region will not exceed c0.9).

In [6] the formulae for an unknown ε > 0 are given:

ε = (x0.5² - xα xβ)/(2x0.5 - xα - xβ); (21)

η = zβ / ln[(xβ - ε)/(x0.5 - ε)]; (22)

γ* = -η ln(x0.5 - ε), (23)

where β = 1 - α, zβ = z1-α = -zα are the quantiles of the standard normal distribution corresponding to the probabilities α and β = 1 - α; x0.5, xα, xβ are the values of X corresponding to the probabilities 0,5, α and β; they are found from the sorted sample as the values with numbers 0,5(n+1), α(n+1), β(n+1). If these numbers happen to be fractional, the corresponding values are found by interpolation of the closest x-values.

From the frequency tables, x0.5, xα, xβ are also found by interpolation. We suggest the following formula

xp = xL + Δx (np - nL)/(nR - nL), (24)

 

where np = (n+1)p (p = 0,5, α or β); nR is the cumulative frequency of the interval containing np; nL is the cumulative frequency of the previous interval; Δx is the width of the intervals; xL is the left endpoint of the interval containing np.

The value of X corresponding to the probability p can be found as

xp = ε + exp[(zp - γ*)/η]. (25)
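For ε = 0 the SL fit reduces to formulae (20) and (25); a minimal Python sketch (an illustration, not the Laboratory's program):

```python
# SL (lognormal) fit with eps = 0: eta = 1/sy, gamma* = -my/sy (formula (20));
# the value of x for a given probability follows from formula (25).
import math

def fit_sl(sample):
    ys = [math.log(x) for x in sample]
    n = len(ys)
    my = sum(ys) / n
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return 1.0 / sy, -my / sy                # eta, gamma*

def x_sl(z_p, eta, gamma_star):              # formula (25) with eps = 0
    return math.exp((z_p - gamma_star) / eta)
```

With η = 2,895 and γ* = 2,396, as in the worked example below, x_sl(1.282, 2.895, 2.396) ≈ 0,68.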

 

Calculation of parameters for the SB distribution

For the SB distribution, the parameters ε and ζ are the minimum and maximum possible values of the random variable X. We assume ε = 0 and take ζ slightly greater than the biggest sample element (we will hence write xmax instead of ζ).

Formula (14), with (16) taken into account, gives

z = γ + η ln[x/(xmax - x)]. (26)

 

Parameters η and γ are found using the formulae:

η = (zβ - zα) / {ln[xβ/(xmax - xβ)] - ln[xα/(xmax - xα)]}; (27)

γ = zα - η ln[xα/(xmax - xα)], (28)

 

Here zα, zβ are the quantiles of the standard normal distribution corresponding to the probabilities α and β; xα, xβ are the corresponding empirical quantiles, found as the α(n+1)-st and β(n+1)-st values of the ordered sample. If these numbers are fractional, interpolation of the two closest values is performed.

Knowing γ, η, ζ, we can construct the distribution functions of the random variable. Setting x equal to the right endpoints of the intervals and substituting into formula (26), we obtain the corresponding values of z. From the tables of the normal distribution we find the integral probability F(z). Having calculated the empirical values of the integral probability, the cumulative relative frequencies F*(x), we can estimate the probability of the null hypothesis. We can find the best possible representation of the distribution of the random variable by varying the parameters α and β.

Formula (26) can be solved for x:

x = xmax / (1 + exp[(γ - z)/η]). (29)

 

Using (29), we can find the values of x corresponding to the required probability.

For example, if we are dealing with the salt content corresponding to the probability p = 0,9, then, knowing γ, η, xmax and setting z = 1,282, from (29) we can find the required value of x.
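The SB quantile fit can be sketched as follows (a hedged illustration of formulae (27)-(29); the numeric check uses the quantile values of the worked example later in the paper):

```python
# SB fit from two empirical quantiles, assuming eps = 0 and zeta = xmax:
# eta and gamma from formulae (27)-(28), x for a given z from formula (29).
import math

def fit_sb(x_a, x_b, z_a, z_b, xmax):
    t_a = math.log(x_a / (xmax - x_a))
    t_b = math.log(x_b / (xmax - x_b))
    eta = (z_b - z_a) / (t_b - t_a)          # formula (27)
    gamma = z_a - eta * t_a                  # formula (28)
    return eta, gamma

def x_sb(z, eta, gamma, xmax):               # formula (29)
    return xmax / (1.0 + math.exp((gamma - z) / eta))
```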

Calculation of parameters for the SU distribution

As we mentioned earlier, the SU distribution is either symmetric or has left asymmetry. If (b1, b2) falls into SU zone, the parameters of the distribution are calculated as follows:

From the tables of the SU distribution [6], the estimates of η and γ are found.

The values of ζ and ε are calculated using the formulae:

ω = exp(1/η²); (30)

ζ = s / √[(ω - 1)(ω cosh 2Ω + 1)/2],  Ω = γ/η; (31)

ε = m + ζ √ω sinh Ω. (32)

Then from

z = γ + η ln[(x - ε)/ζ + √(((x - ε)/ζ)² + 1)], (33)

 

the values of z are found and from the tables of normal distribution the corresponding F(z), which can be used to test the distribution law hypothesis.

Eventually, the formula

x = ε + ζ sinh[(z - γ)/η], (34)

 

allows the calculation of the x corresponding to the probability p.
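Formulae (33) and (34) form an inverse pair, which can be sketched directly (the parameter values in the check are illustrative, of the order of those found in the example below):

```python
# SU transform (formula (33), written via asinh) and its inverse (formula (34)).
import math

def z_su(x, gamma, eta, zeta, eps):
    return gamma + eta * math.asinh((x - eps) / zeta)

def x_su(z, gamma, eta, zeta, eps):
    return eps + zeta * math.sinh((z - gamma) / eta)
```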

Testing the distribution law hypothesis

Correspondence of the empirical distribution law to the theoretical one is examined using various goodness-of-fit criteria. The (b1, b2) plane is not used for that purpose, although it gives a fairly good idea of the distribution; one reason is that the variance of the moments above the second is fairly large. The hypothesis regarding the distribution law is usually examined using other criteria, such as the Kolmogorov-Smirnov criterion, the Pearson (χ2) criterion, etc.

We will be using the Kolmogorov-Smirnov criterion.

Kolmogorov-Smirnov goodness of fit criterion

The Kolmogorov-Smirnov criterion [4] is one of the simplest and most popular criteria.

It is used in the following way:

The largest difference between the values of the empirical F*(x) and theoretical F(x) functions is found

D = max |F(xi) - F*(xi)|  (i = 1, 2, ..., k). (35)

The value of λ = D√n is found.

From Table 1, the probability p(λ) that the difference D between F(x) and F*(x) occurred by chance is found.

If this probability is large, there is no reason to conclude that the hypothesis is invalid; if p(λ) is small (usually less than 0,1), the null hypothesis needs complementary testing, since there are reasons to believe that the distribution of the random variable is not properly described by the chosen distribution law.
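The steps above can be sketched as follows (a hypothetical helper; p(λ) is computed from Kolmogorov's alternating series rather than read from Table 1):

```python
# Kolmogorov-Smirnov check: D = max |F - F*| (formula (35)), lambda = D*sqrt(n),
# and p(lambda) from Kolmogorov's series (the quantity tabulated in Table 1).
import math

def ks_test(F_theor, F_emp, n, terms=100):
    D = max(abs(a - b) for a, b in zip(F_theor, F_emp))
    lam = D * math.sqrt(n)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, terms + 1))
    return D, lam, min(max(p, 0.0), 1.0)
```

With D = 0,059 and n = 100, as in the SL example below, this gives λ = 0,59 and p(λ) ≈ 0,88, in agreement with Table 1.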

Table 1

Values of λ and p(λ) for the Kolmogorov-Smirnov Criterion

λ      p(λ)      λ      p(λ)      λ      p(λ)      λ      p(λ)
0,0    1,00      0,6    0,86      1,1    0,18      1,6    0,01
0,1    1,00      0,7    0,71      1,2    0,11      1,7    0,01
0,2    1,00      0,8    0,54      1,3    0,07      1,8    0,00
0,3    1,00      0,9    0,39      1,4    0,04      1,9    0,00
0,4    1,00      1,0    0,27      1,5    0,02      2,0    0,00
0,5    0,96

Example

A hypothetical sample with values of the dry residue of salt in soil is given.

We have:

n = 100, xmin = 0,05, xmax = 0,98;

R = xmax - xmin = 0,93;

k = [log2n] + 1 = 7;

Δx = R/k = 0,93/7 = 0,133.

 

The width of the intervals given by the formula is not very convenient, so we change the number of intervals k to 10 and set the width Δx = 0,1.

Table 3 shows the obtained frequency table and some of the calculations. Using the provided formulae we find:

m = 46,4/100 = 0,464;  s² = 2,70/100 = 0,027;  s = 0,165;
m3 = 0,200/100 = 0,0020;  m4 = 0,272/100 = 0,00272;
b1 = 0,2038;  b2 = 3,7301. (36)

Table 2

Example of a sorted sample

 

No    x      No    x      No    x      No    x      No    x
1     0,05   21    0,34   41    0,43   61    0,49   81    0,59
2     0,12   22    0,34   42    0,43   62    0,50   82    0,60
3     0,15   23    0,35   43    0,43   63    0,50   83    0,60
4     0,18   24    0,35   44    0,44   64    0,51   84    0,61
5     0,21   25    0,36   45    0,44   65    0,51   85    0,62
6     0,22   26    0,36   46    0,44   66    0,52   86    0,63
7     0,23   27    0,37   47    0,45   67    0,52   87    0,64
8     0,24   28    0,37   48    0,45   68    0,53   88    0,65
9     0,25   29    0,38   49    0,45   69    0,53   89    0,66
10    0,25   30    0,38   50    0,46   70    0,54   90    0,67
11    0,26   31    0,39   51    0,46   71    0,54   91    0,68
12    0,27   32    0,39   52    0,46   72    0,55   92    0,69
13    0,28   33    0,40   53    0,47   73    0,55   93    0,70
14    0,28   34    0,40   54    0,47   74    0,56   94    0,73
15    0,31   35    0,41   55    0,47   75    0,56   95    0,75
16    0,31   36    0,41   56    0,48   76    0,57   96    0,78
17    0,32   37    0,41   57    0,48   77    0,57   97    0,84
18    0,32   38    0,42   58    0,48   78    0,58   98    0,86
19    0,33   39    0,42   59    0,49   79    0,58   99    0,92
20    0,33   40    0,42   60    0,49   80    0,59   100   0,98

 

 

Table 3

Frequency table for the sample in table 2

 

Interval    xi     ni    ni·xi   ni(xi-m)^2   ni(xi-m)^3   ni(xi-m)^4   Ni    f*      F*
0,0-0,1     0,05   1     0,1     0,17         -0,071       0,0294       1     0,010   0,010
0,1-0,2     0,15   3     0,5     0,30         -0,093       0,0292       4     0,030   0,040
0,2-0,3     0,25   10    2,5     0,46         -0,098       0,0210       14    0,100   0,140
0,3-0,4     0,35   20    7,0     0,26         -0,030       0,0034       34    0,200   0,340
0,4-0,5     0,45   29    13,1    0,01         0,000        0,0000       63    0,290   0,630
0,5-0,6     0,55   20    11,0    0,15         0,013        0,0011       83    0,200   0,830
0,6-0,7     0,65   10    6,5     0,35         0,064        0,0120       93    0,100   0,930
0,7-0,8     0,75   3     2,3     0,25         0,070        0,0201       96    0,030   0,960
0,8-0,9     0,85   2     1,7     0,30         0,115        0,0444       98    0,020   0,980
0,9-1,0     0,95   2     1,9     0,47         0,230        0,1116       100   0,020   1,000
Total              100   46,4    2,70         0,200        0,2720

 

 

 

 

In Figure 2 all the results of the calculations using the above-mentioned software are shown. We can see that all three Johnson distributions can be used for the sample in Table 2, but the SU distribution has the smallest λ, therefore the highest p(λ), and should be the preferred distribution.

We will show now the manual use of the described formulae.

Calculations for the SL distribution

Using formulae (2) and (20) we have:

my = -0,8275;

sy = 0,3454;

η = 1/0,3454 = 2,895;

γ* = 0,8275/0,3454 = 2,396.

The values of z are calculated using formula (18):

z1 = γ* + η ln x1 = 2,396 + 2,895 × ln(0,1) = -4,290, etc.

 

 

 

Figure 3. Results of the calculations after the optimization of parameters α and β

 

Note that x1 is equal to 0,1, i.e. the right end of the first interval. F(z) is found from the tables. Another way to find F(z) is to use the function NORMSDIST in Microsoft Excel.

The largest difference D between F*(x) and F(z) is achieved at x = 0,4 and is equal to 0,059.

This gives λ = D√n = 0,059 × 10 = 0,59.

Corresponding probability p(λ) = 0,88.

And the value of x for p = 0,9:

x0,9 = exp[(1,282 - 2,396)/2,895] = 0,681.

Our program varies α from 0,01 to 0,3 with step 0,01, and for each α the value of p(λ) is calculated. If a larger p(λ) is found than the one calculated initially, it is taken and shown on the screen in the columns with the calculations for the SL distribution.

Calculations for the SB distribution

The program varies α from 0,01 to 0,30 with step 0,01 and β from 0,99 to 0,70 with step -0,01. Out of the 900 or so results, the one with the smallest λ is chosen. Figure 4 shows that the smallest λ = 0,295 is achieved when α = 0,10 and β = 0,85.

Indeed: we have Δx = 0,1; n = 100;

Let α = 0,10; β = 0,85; xmax = 1,001.

Using formula (24) we find:

 

nα = (n + 1)α = 10,10; xL = 0,2; nR = 14; nL = 4;
xα = 0,2 + 0,1 × (10,10 - 4)/(14 - 4) = 0,261;

nβ = (n + 1)β = 85,85; xL = 0,6; nR = 93; nL = 83;
xβ = 0,6 + 0,1 × (85,85 - 83)/(93 - 83) = 0,629.

 

ln[xα/(xmax - xα)] = ln(0,261/0,740) = -1,042;

ln[xβ/(xmax - xβ)] = ln(0,629/0,372) = 0,525;

η = (1,036 + 1,282)/(0,525 + 1,042) = 1,48;  γ = -1,282 + 1,48 × 1,042 = 0,26;

x0,9 = 1,001/(1 + exp[(0,26 - 1,282)/1,48]) = 0,67.

 

For the value of z1:

z1 = γ + η ln[x1/(xmax - x1)] = 0,26 + 1,48 × ln[0,1/(1,001 - 0,1)] = -2,994.

In the same way all other zi are calculated.

The values of F(z) are found either from the tables of the normal distribution or using Microsoft Excel. The largest difference D between F*(x) and F(z) is achieved when x = 0,8 and is equal to 0,029.

Then λ = D√n ≈ 0,295 and p(λ) = 1,00.

The value of x corresponding to p = 0,9:

x0,9 = 1,001/(1 + exp[(0,26 - 1,28)/1,48]) = 0,67.

Calculations for the SU distribution

We have already calculated m = 0,464; s = 0,165; b1 = 0,2038 and b2 = 3,7301. From the tables for -γ and η in [6], using linear interpolation, we find

γ = -1,527; η = 3,214.

Using formulae (30)-(32):

ω = exp(1/3,214²) = 1,102;

Ω = γ/η = -1,527/3,214 = -0,475;

ζ = s/√[(ω - 1)(ω cosh 2Ω + 1)/2] = 0,45;

ε = m + ζ√ω sinh Ω = 0,23.

 

In the same way the other zi are calculated from formula (33). The values of F(z) are found either from the tables of the normal distribution or using Microsoft Excel. The largest difference D between F*(x) and F(z) is achieved when x = 0,3 and is equal to 0,025. Then λ = D√n = 0,25 and p(λ) = 1,00.

The value of x corresponding to p = 0,9:

x0,9 = ε + ζ sinh[(1,282 - γ)/η] = 0,23 + 0,45 × sinh(0,874) = 0,68.

Conclusion

Johnson distributions cover the whole (b1, b2) plane and can be used in place of other distributions, provided there is a convenient way to calculate the necessary parameters. Software designed and developed in the Problem Research Laboratory provides such an opportunity. The program calculates the parameters for all three distributions and allows the user to choose the one best suited for the studied case.

References

 

1.      Ventsel E.S. Teoria veroyatnostei (Probability Theory). Moskva: Nauka, 1969.

2.      Golovanov A.I., Papadopoulos D.G. Statisticheskaya kharakteristika. Vestnik s/h nauki, 1972, No. 2.

3.      Mitropolsky A.K. Tehnika statisticheskih vychislenii. Moskva: Nauka, 1971.

4.      Pustylnik E.I. Statisticheskie metody analiza i obrabotki nabliudenii. Moskva: Nauka, 1968.

5.      Smirnov N.V., Dunin-Barkovsky I.V. Kurs teorii veroyatnostei i matematicheskoi statistiki. Moskva: Nauka, 1969.

6.      Hahn G.J., Shapiro S.S. Statistical Models in Engineering. New York: Wiley.

7.      Johnson N.L. Systems of Frequency Curves Generated by Methods of Translation. Biometrika, 36, 149, 1949.

8.      Shapiro S.S., Gross A.G. Statistical Modelling Techniques. New York: Marcel Dekker.

 

 


 