UDC 631.6
Using Johnson Distribution in Statistical Analysis
Dimitrios Papadopoulos
Problem Research Laboratory
Many parameters of various engineering problems are determined at several points of the investigated area. The value of a parameter to use for the calculations is obtained either by averaging the measurements or by using other considerations specific to each case. We will use the determination of parameters in saline soils as a working example.
The basic equation of the movement of salt and water contains seven kinds of parameters: the space coordinates x, y, z, time t, soil porosity μ, convection diffusion coefficient D, filtration velocity, coefficient of salt exchange, concentration (or content) of salt c(x, y, z, t) and concentration of saturation c_s.
Coordinates x, y, z, t and the concentration of saturation c_s have more or less exact values, while the others are random variables, since in practice they are usually determined at several points of a region with an inevitable error of measurement.
Parameter μ, as a rule, varies insignificantly, ranging for loamy soils from 0.4 to 0.5, and it does not play a major role in the determination of the leaching rates. Knowing approximately the smallest value μ = μ_min and the largest μ = μ_max, one can estimate the error in the calculation of the leaching rate N.
In a similar way, it is possible to determine which parameters have to be considered stochastic and which, at least in the first approximation, can be assumed to be nonrandom.
Correct determination of the concentration of salt in soil, c_p, is very important. Clearly, calculation of the leaching rates based on the average is very risky, because up to 50% of the area might remain insufficiently leached.
Since the salt content of a region is determined by conducting detailed salt sampling, i.e. by analyzing salt samples, the use of mathematical statistics might help to answer some important questions and determine the necessary parameters, such as the value of the salt content to be used for the calculation of the leaching rates.
We have at our disposal 90 samples of salt content taken in various layers in a few regions of
The special feature of our approach is that the program optimizes some parameters to achieve the best possible fit of the empirical distribution. The software can be requested from the Problem Research Laboratory of MSUEE, by contacting Prof. V. V. Shabanov or Dr. N. P. Bunina.
Statistical analysis of the source
data is very well described in the literature [1, 3, 4, 5, 6 etc.]. We will try
to provide only the background information necessary to present our ideas.
For the calculation of the value c_p of salt corresponding to the probability p using the sample data, it is necessary to know its distribution law. In general, the distribution law of a random variable X represented by some sample {x_1, x_2, ..., x_n} is described by two functions, f(x) and F(x), which as a rule depend on certain parameters. The first function, f(x), is called the differential distribution function or distribution density; the second, F(x), is the integral distribution function.
The purpose of the statistical analysis during the processing of the initial salinization data is: given the sample X = {x_1, x_2, ..., x_n}, determine the functions f(x) and F(x) and their parameters.
One of the most common distribution laws is the so-called normal distribution. It is described by the functions

f(x) = (1/(σ√(2π))) exp(−(x − μ)^2/(2σ^2)),   F(x) = ∫_{−∞}^{x} f(t) dt,   (1)

where the parameters MX = μ and DX = σ^2 (mathematical expectation and variance; the square root of the variance is the standard deviation) have a well defined physical meaning and can be estimated using the sample data x_1, x_2, ..., x_n.
The estimated values of the mathematical expectation MX = μ and variance DX = σ^2 (sample mean m and sample variance s^2) are calculated using the formulae

m = (1/n) Σ x_i,   s^2 = (1/n) Σ (x_i − m)^2.   (2)
Roughly speaking, the distribution function gives the probability that the random variable X will have values less than or equal to x, while the distribution density characterizes the relative likelihood that X will take a value near x.
The case when the mathematical expectation m = 0 and the variance s^2 = 1 is of special interest. Such a distribution is called the «Standard Normal Distribution», or N(0, 1).
Tables of f(x) and F(x) for N(0, 1) are provided in all books on mathematical statistics (see for example [1, 4, 5, 6]). The tables can be used if the random variable X is normalized by being replaced by the variable z:

z = (x − m)/s.   (3)
The normal distribution
is a well defined and well documented law. The studied random variables are
often transformed in such a way that the new transformed variable follows the
normal distribution law.
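The estimation (2) and the normalization (3) can be sketched in a few lines of Python (the function names are ours; the variance uses the n-divisor, as in (2), and the integral function of N(0, 1) is evaluated through the error function):

```python
import math

def normal_cdf(z):
    # Integral distribution function F(z) of N(0, 1), via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_mean_variance(x):
    # Formulae (2): m = (1/n) * sum(x_i), s^2 = (1/n) * sum((x_i - m)^2)
    n = len(x)
    m = sum(x) / n
    s2 = sum((xi - m) ** 2 for xi in x) / n
    return m, s2

def standardize(x, m, s):
    # Formula (3): z = (x - m) / s
    return [(xi - m) / s for xi in x]
```

With these helpers, F(x) for a normally distributed X is simply `normal_cdf((x - m) / s)`.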
In mathematical statistics there is no method by which the distribution law can be unambiguously determined using sample data. First the so-called null hypothesis H_0 is stated (according to H_0, the random variable X follows a certain distribution law). Then the validity of H_0 is tested. The null hypothesis is usually stated on the basis of the form of the differential distribution function f*(x) and tested on the basis of the integral function F*(x).
An initial impression of the
distribution law can be obtained by building a frequency table of the random
variable.
Assume that a sample X = {x_1, x_2, ..., x_n} representing, for example, the salt content in the layer [h_min, h_max] of soil is given. Let the smallest value of X be x_min and the largest value x_max, i.e. any x_i from X belongs to the interval [x_min, x_max].
To construct the empirical differential and integral distributions, the following steps are taken:
The segment R = x_max − x_min is broken into several intervals. The number of intervals is usually 8 to 12, sometimes more than 12. This number can also be estimated by the formula

k = [log_2 n] + 1,   (4)

where [log_2 n] is the largest integer not greater than log_2 n.
The approximate width of each interval is calculated as

Δx = R / k.   (5)

With some practice, the problem of the number of the intervals and their width is handled without formulae (4) and (5). One should only remember that too large a width makes the picture less graphic, hiding some important details, while too small a width might reveal some uncharacteristic and/or unimportant details of the studied population.
All values of X falling into the i-th interval are equated to its middle value. The frequencies n_i, the numbers of elements x ∈ X falling into each interval, are calculated. After that, the operations are performed on k numbers instead of n.
The cumulative frequencies N_i, representing the sum of the frequencies up to and including the current interval, are calculated.
Using formulae (2), the sample mean m and variance s^2 are calculated.
The existence of «serious blunders», i.e. erroneous measurements, is tested. Under the normal distribution, 99.7% of the possible values of a random variable should lie within the interval m ± 3s. For greater security, especially when the distribution law is asymmetric, the 4-sigma interval can be used. Values of X not lying within the interval (m − ks, m + ks) can be discarded as serious measurement errors. This is justified by the Chebyshev inequality [5 and others], according to which for any distribution the probability of a particular x falling beyond the limits of the interval (m − ks, m + ks) is not greater than 1/k^2. For k = 4 this probability is no more than 0.06. In other words, for any distribution not less than 94% of the possible values of X are located within the interval m ± 4s.
The new mean and variance are calculated. The process is repeated until all the values fall into the 4-sigma interval.
The empirical relative frequencies f*_i = n_i / n are calculated. They give us an estimate of the values of the differential distribution function of X.
The empirical cumulative relative frequencies F*_i = N_i / n are calculated. They give us an estimate of the values of the integral distribution function of X.
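The steps above can be sketched as follows (a minimal Python version; the function name, the data layout, and the decision to put x_max into the last interval are our choices):

```python
import math

def frequency_table(x):
    # Choose the number of intervals by formula (4) and the width by (5),
    # then count frequencies, cumulative frequencies, f* and F*
    n = len(x)
    xmin, xmax = min(x), max(x)
    k = int(math.log2(n)) + 1             # (4): [log2 n] + 1
    dx = (xmax - xmin) / k                # (5): R / k
    n_i = [0] * k
    for xi in x:
        j = min(int((xi - xmin) / dx), k - 1)   # put x_max into the last interval
        n_i[j] += 1
    mids = [xmin + (j + 0.5) * dx for j in range(k)]  # interval middle values
    N_i, total = [], 0
    for c in n_i:
        total += c
        N_i.append(total)                 # cumulative frequencies N_i
    f_star = [c / n for c in n_i]         # empirical relative frequencies
    F_star = [c / n for c in N_i]         # empirical cumulative relative frequencies
    return mids, n_i, N_i, f_star, F_star
```

The blunder test and the recalculation of m and s^2 would be applied to the raw sample before this table is built.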
This information is sufficient for the construction of the graphs of the distribution functions. Axis x will show the values of the random variable, axis y the values of f* and F* (i.e. we will have primary and secondary y axes). The form of f*(x) gives the first impression of the distribution law that the variable X follows. The graph of F* is useful for testing the hypothesis concerning the distribution law.
The logarithmically normal (lognormal) distribution is characterized by the fact that if the random variable X is represented by a sample of size n (i.e. X = {x_1, x_2, ..., x_n}), then the random variable Y obtained by taking the logarithms of X, i.e. Y = {y_1, y_2, ..., y_n} = {ln x_1, ln x_2, ..., ln x_n}, is distributed normally.
If m_y and s_y^2 are the estimates of the mathematical expectation and variance of Y, then the sample mean and variance of X are found using the formulae:

m_x = exp(m_y + s_y^2/2);   (6)

s_x^2 = m_x^2 (exp(s_y^2) − 1).   (7)
Since in practice the parameters of X are found first, the following formulae can be used to calculate m_y and s_y from m_x and s_x:

m_y = ln m_x − s_y^2/2;   (8)

s_y^2 = ln(1 + k_vx^2);   (9)

k_vx = s_x / m_x.   (10)
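Formulae (6)-(10) translate directly into code; a round trip m_x, s_x → m_y, s_y → m_x, s_x must reproduce the input. With the sample values used later in the paper, m_x = 0.464 and s_x = 0.165, this conversion gives m_y ≈ −0.827 and s_y ≈ 0.345, close to the values used in the S_L fit below:

```python
import math

def lognormal_params_from_x(m_x, s_x):
    # Formulae (8)-(10): parameters of Y = ln X from the mean and st. dev. of X
    k_vx = s_x / m_x                                # (10) coefficient of variation
    s_y2 = math.log(1.0 + k_vx ** 2)                # (9)
    m_y = math.log(m_x) - s_y2 / 2.0                # (8)
    return m_y, math.sqrt(s_y2)

def x_params_from_lognormal(m_y, s_y):
    # Formulae (6)-(7): mean and variance of X from the parameters of Y = ln X
    m_x = math.exp(m_y + s_y ** 2 / 2.0)            # (6)
    s_x2 = m_x ** 2 * (math.exp(s_y ** 2) - 1.0)    # (7)
    return m_x, s_x2
```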
Any distribution can in principle be precisely defined by its moments [8]. We will continue to use the sample estimates of the theoretical values. The sample moment of the first degree is the mean of the random variable X

m_1 = (1/n) Σ x_i.   (11)

Central sample moments of degree r (r > 1) are defined as

m_r = (1/n) Σ (x_i − m_1)^r.   (12)
In most cases the first four moments adequately describe a random variable X. The first moment m_1 geometrically defines the location, the second moment m_2 is equal to the variance s^2 and determines the spread of the values of X around the mean, the third moment m_3 defines the asymmetry (the term skewness is also used) and the fourth, m_4, the steepness (or kurtosis). Since the first and second moments contain information on the location and spread but not the shape, the classification system can be limited to the third and fourth moments [6, 8].
We now introduce the so-called Pearson constants b_1 and b_2:

b_1 = m_3^2 / m_2^3,   b_2 = m_4 / m_2^2.   (13)
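Moments (11)-(12) and the Pearson constants (13) are a direct computation (the helper name is ours):

```python
def pearson_constants(x):
    # Sample moments (11)-(12) and Pearson constants (13)
    n = len(x)
    m1 = sum(x) / n                               # (11) sample mean
    m2 = sum((xi - m1) ** 2 for xi in x) / n      # (12), r = 2: variance
    m3 = sum((xi - m1) ** 3 for xi in x) / n      # (12), r = 3: asymmetry
    m4 = sum((xi - m1) ** 4 for xi in x) / n      # (12), r = 4: steepness
    b1 = m3 ** 2 / m2 ** 3                        # (13)
    b2 = m4 / m2 ** 2                             # (13)
    return m1, m2, b1, b2
```

For any symmetric sample m_3 = 0 and therefore b_1 = 0, which is why the normal distribution sits on the b_1 = 0 axis of the (b_1, b_2) plane.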
A convenient way of presenting the distributions is the (b_1, b_2) plane [6]. As we can see in Figure 1, the plane (b_1, b_2) is divided into several parts representing various regions with various distributions. The top of the plane is the «impossible area»: there are no distributions with b_2 < 1 + b_1. The normal distribution has b_1 = 0, b_2 = 3, i.e. it is represented by a single point, as is the exponential distribution. The lognormal and gamma distributions are represented by lines, and others by regions.
The family of empirical distributions of N. L. Johnson [6, 7, 8] is convenient in that it is obtained by transformation of the normal distribution, it covers the whole plane (b_1, b_2), and it can be used in place of any other distribution.
In [6, 7] these distributions are described in detail. We will look at the part of the material that in our opinion is adequate for the analysis of one sample that describes the state of some physical phenomenon (content of salt in soil, humidity, hydraulic conductivity, etc.).
Figure 1. Various distributions on the (b_1, b_2) plane
In general, the transformation of the normal distribution has the form

z = γ + η τ(x; ε, ζ)   (η > 0, ζ > 0; γ, ε ∈ (−∞, +∞)),   (14)

where τ is an arbitrary function; γ, η, ε, ζ are parameters; z is a random variable following the standard normal distribution.
Johnson N. L. [6, 7] proposed the following three functions:

τ_1(x; ε, ζ) = ln((x − ε)/ζ);   (15)

τ_2(x; ε, ζ) = ln((x − ε)/(ζ − x));   (16)

τ_3(x; ε, ζ) = ln[(x − ε)/ζ + (1 + ((x − ε)/ζ)^2)^{1/2}] = sinh^{−1}((x − ε)/ζ).   (17)

When τ = τ_1 we get the lognormal distribution (or S_L Johnson distribution), when τ = τ_2 the S_B Johnson distribution, and when τ = τ_3 the S_U distribution, which is often symmetric or has left asymmetry.
As shown in [6], for the S_L distribution we can write

z = γ* + η ln(x − ε),   (18)

where

γ* = γ − η ln ζ.   (19)

In this case the parameter ε is the smallest value of X, and we set it at zero.
The estimates of η and γ* are found from the formulae:

η = 1/s_y,   γ* = −m_y/s_y,   (20)

where m_y, s_y are the mean and standard deviation calculated from y_i = ln(x_i − ε).
We can then calculate the values of the theoretical distribution function F(z) and the values of X corresponding to a given probability p (for example, the value of salt content c_0.9 that guarantees with probability p = 0.9 that the salt content in the region will not exceed c_0.9).
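For ε = 0, the S_L fit (20) and the quantile calculation may be sketched as follows (function names are ours; with the paper's later values γ* = 2.396, η = 2.895 and z_0.9 = 1.282 this reproduces c_0.9 ≈ 0.681):

```python
import math

def fit_sl(x, eps=0.0):
    # S_L fit with known smallest value eps, formulae (18)-(20):
    # z = gamma* + eta * ln(x - eps), with eta = 1/s_y, gamma* = -m_y/s_y
    y = [math.log(xi - eps) for xi in x]
    n = len(y)
    m_y = sum(y) / n
    s_y = math.sqrt(sum((yi - m_y) ** 2 for yi in y) / n)
    return -m_y / s_y, 1.0 / s_y          # gamma*, eta

def sl_quantile(z_p, gamma_star, eta, eps=0.0):
    # Inverse of (18): x_p = eps + exp((z_p - gamma*) / eta)
    return eps + math.exp((z_p - gamma_star) / eta)
```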
In [6] the formulae for an unknown ε > 0 are given:

ε = (x_α x_β − x_0.5^2)/(x_α + x_β − 2x_0.5);   (21)

η = 2z_β / ln[(x_β − ε)/(x_α − ε)];   (22)

γ* = −η ln(x_0.5 − ε),   (23)

where β = 1 − α; z_β = z_{1−α} = −z_α are the quantiles of the standard normal distribution corresponding to the probabilities α and β = 1 − α; x_0.5, x_α, x_β are the values of X corresponding to the probabilities 0.5, α and β; they are found from the sorted sample as the values with numbers 0.5(n+1), α(n+1), β(n+1). If the numbers happen to be fractional, the corresponding values are found by interpolation of the closest x's.
From the frequency tables, x_0.5, x_α, x_β are also found by interpolation. We suggest the following formula

x_p = x_L + Δx (n_p − n_L)/(n_R − n_L),   (24)

where n_p = (n + 1)p (and p = 0.5, α or β); n_R is the cumulative frequency of the interval containing n_p; n_L is the cumulative frequency of the previous interval; Δx is the width of the intervals; x_L is the left value of the interval containing n_p. The value of X corresponding to the probability p can be found as

x_p = ε + exp((z_p − γ*)/η).   (25)
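Formula (24) can be implemented as a small interpolation routine over the frequency-table rows (the data layout, a list of (x_L, Δx, N_cumulative) tuples, is our choice; the check below uses the cumulative frequencies of the worked example's frequency table, for which α = 0.10 gives x_α = 0.261):

```python
def percentile_from_table(p, n, intervals):
    # Formula (24): x_p = x_L + dx * (n_p - n_L) / (n_R - n_L)
    # intervals: list of (x_left, dx, N_cumulative) rows of the frequency table
    n_p = (n + 1) * p
    n_prev = 0                       # cumulative frequency of the previous interval
    for x_left, dx, n_cum in intervals:
        if n_p <= n_cum:             # n_p falls into this interval
            return x_left + dx * (n_p - n_prev) / (n_cum - n_prev)
        n_prev = n_cum
    raise ValueError("p falls outside the range of the table")
```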
For the S_B distribution, the parameters ε and ζ are the minimum and maximum possible values of the random variable X. We assume ε = 0 and ζ slightly greater than the biggest sample element (we will hence write x_max instead of ζ). Formula (14) with τ = τ_2 taken into account gives

z = γ + η ln(x/(x_max − x)).   (26)
Parameters η and γ are found using the formulae:

η = (z_β − z_α) / [ln(x_β/(x_max − x_β)) − ln(x_α/(x_max − x_α))];   (27)

γ = z_α − η ln(x_α/(x_max − x_α)),   (28)

where z_α, z_β are the quantiles of the standard normal distribution corresponding to the probabilities α and β; x_α, x_β are the corresponding empirical quantiles; x_α, x_β are found as the α(n+1)-st and β(n+1)-st values of the ordered sample. If these values are fractional, interpolation of the two closest values is performed.
Knowing γ, η, ζ, we can construct the distribution functions of the random variable. Setting x equal to the right ends of the intervals and substituting them into formula (26), we obtain the corresponding values of z. From the tables of the normal distribution we find the integral probability F(z). Having calculated the empirical values of the integral probability, the cumulative relative frequencies F*(x), we can estimate the probability of the null hypothesis. We can find the best possible representation of the distribution of the random variable by varying the parameters α and β.
Formula (26) can be solved for x:

x = x_max / (1 + exp((γ − z)/η)).   (29)

Using (29) we can find the values of x corresponding to the required probability. For example, if we are dealing with the salt content corresponding to the probability p = 0.9, then, knowing γ, η, x_max and setting z = 1.282 in (29), we can find the required value of x.
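A sketch of (27)-(29) in Python (ε = 0, ζ = x_max; the function names are ours; with the worked example's values α = 0.10, β = 0.85, x_α = 0.261, x_β = 0.629, x_max = 1.001 this gives η ≈ 1.48, γ ≈ 0.26, c_0.9 ≈ 0.67):

```python
import math

def fit_sb(x_alpha, x_beta, z_alpha, z_beta, x_max):
    # Formulae (27)-(28) with eps = 0 and zeta written as x_max
    t_a = math.log(x_alpha / (x_max - x_alpha))
    t_b = math.log(x_beta / (x_max - x_beta))
    eta = (z_beta - z_alpha) / (t_b - t_a)   # (27)
    gamma = z_alpha - eta * t_a              # (28)
    return gamma, eta

def sb_quantile(z, gamma, eta, x_max):
    # Formula (29): x = x_max / (1 + exp((gamma - z) / eta))
    return x_max / (1.0 + math.exp((gamma - z) / eta))
```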
As we mentioned earlier, the S_U distribution is either symmetric or has left asymmetry. If (b_1, b_2) falls into the S_U zone, the parameters of the distribution are calculated as follows. From the tables of the S_U distribution [6], the estimates of η and γ are found. Values of ζ and ε are calculated using the formulae:

ω = exp(1/η^2);   (30)

ζ = s [2/((ω − 1)(ω cosh(2γ/η) + 1))]^{1/2};   (31)

ε = m + ζ √ω sinh(γ/η).   (32)
Then from

z = γ + η ln[(x − ε)/ζ + (1 + ((x − ε)/ζ)^2)^{1/2}],   (33)

the values of z are found, and from the tables of the normal distribution the corresponding F(z), which can be used to test the distribution law hypothesis.
Eventually, the formula

x = ε + ζ sinh((z − γ)/η)   (34)

allows the calculation of the x corresponding to the probability p.
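Formulae (30)-(32) and the quantile (34) in Python (γ and η are assumed to come from the tables in [6]; the numeric check uses the paper's later example m = 0.464, s = 0.165, γ = 1.527, η = 3.214):

```python
import math

def su_params(m, s, gamma, eta):
    # Formulae (30)-(32): omega, zeta, eps from the table values of gamma and eta
    omega = math.exp(1.0 / eta ** 2)                              # (30)
    zeta = s * math.sqrt(2.0 / ((omega - 1.0) *
                                (omega * math.cosh(2.0 * gamma / eta) + 1.0)))  # (31)
    eps = m + zeta * math.sqrt(omega) * math.sinh(gamma / eta)    # (32)
    return omega, zeta, eps

def su_quantile(z, gamma, eta, eps, zeta):
    # Formula (34): x = eps + zeta * sinh((z - gamma) / eta)
    return eps + zeta * math.sinh((z - gamma) / eta)
```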
Correspondence of the empirical distribution law to the theoretical one is examined using various goodness-of-fit criteria. The (b_1, b_2) plane is not used for that purpose, although it gives a fairly good idea about the distribution; one of the reasons is that the variance of the moments above the second is fairly large. The hypothesis regarding the distribution law is usually examined using other criteria, such as the Kolmogorov-Smirnov criterion, the Pearson (χ^2) criterion, etc.
We will be using the Kolmogorov-Smirnov criterion. The Kolmogorov-Smirnov criterion [4] is one of the simplest and most popular criteria. It is used in the following way. The largest difference between the values of the empirical F*(x) and theoretical F(x) functions is found:

D = max|F(x_i) − F*(x_i)|   (i = 1, 2, ..., k).   (35)

The value of λ = D√n is found. From Table 1, the probability p(λ) of the difference D between F(x) and F*(x) having occurred by chance is found. If this probability is large, then there is no reason to conclude that the hypothesis is not valid; if p(λ) is small (usually less than 0.1), the null hypothesis needs complementary testing, since there are reasons to believe that the distribution of the random variable is not properly described by the chosen distribution law.
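The criterion is easy to program; instead of Table 1, p(λ) can be computed from the Kolmogorov series p(λ) = 2 Σ_{k≥1} (−1)^{k−1} exp(−2k²λ²), which reproduces the tabulated values (the function names are ours):

```python
import math

def kolmogorov_p(lam):
    # p(lambda) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2)
    if lam <= 0.0:
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * s))

def ks_test(F_theor, F_emp, n):
    # Formula (35) for D, then lambda = D * sqrt(n)
    D = max(abs(ft - fe) for ft, fe in zip(F_theor, F_emp))
    lam = D * math.sqrt(n)
    return D, lam, kolmogorov_p(lam)
```

For the worked example below, D = 0.059 with n = 100 gives λ = 0.59 and p(λ) ≈ 0.88.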
Table 1
Values of λ and p(λ) for the Kolmogorov-Smirnov criterion

λ     p(λ)      λ     p(λ)      λ     p(λ)      λ     p(λ)
0.0   1.00      0.6   0.86      1.1   0.18      1.6   0.01
0.1   1.00      0.7   0.71      1.2   0.11      1.7   0.01
0.2   1.00      0.8   0.54      1.3   0.07      1.8   0.00
0.3   1.00      0.9   0.39      1.4   0.04      1.9   0.00
0.4   1.00      1.0   0.27      1.5   0.02      2.0   0.00
0.5   0.96
A hypothetical sample with the values of the hard residue of salt in soil is given (Table 2). We have:

n = 100, x_min = 0.05, x_max = 0.98;
R = x_max − x_min = 0.93;
k = [log_2 n] + 1 = 7;
Δx = R/k = 0.93/7 = 0.133.

The width of the intervals given by formula (5) is not very convenient, so we change the number of the intervals k to 10 and set the width Δx = 0.1. Table 3 shows the obtained frequency table and some of the calculations. Using the provided formulae we find:

m = 0.464;   s = 0.165;   b_1 = 0.2038;   b_2 = 3.7301.   (36)
Table 2
Example of a sorted sample

No   x      No   x      No   x      No   x      No   x
1    0.05   21   0.34   41   0.43   61   0.49   81   0.59
2    0.12   22   0.34   42   0.43   62   0.50   82   0.60
3    0.15   23   0.35   43   0.43   63   0.50   83   0.60
4    0.18   24   0.35   44   0.44   64   0.51   84   0.61
5    0.21   25   0.36   45   0.44   65   0.51   85   0.62
6    0.22   26   0.36   46   0.44   66   0.52   86   0.63
7    0.23   27   0.37   47   0.45   67   0.52   87   0.64
8    0.24   28   0.37   48   0.45   68   0.53   88   0.65
9    0.25   29   0.38   49   0.45   69   0.53   89   0.66
10   0.25   30   0.38   50   0.46   70   0.54   90   0.67
11   0.26   31   0.39   51   0.46   71   0.54   91   0.68
12   0.27   32   0.39   52   0.46   72   0.55   92   0.69
13   0.28   33   0.40   53   0.47   73   0.55   93   0.70
14   0.28   34   0.40   54   0.47   74   0.56   94   0.73
15   0.31   35   0.41   55   0.47   75   0.56   95   0.75
16   0.31   36   0.41   56   0.48   76   0.57   96   0.78
17   0.32   37   0.41   57   0.48   77   0.57   97   0.84
18   0.32   38   0.42   58   0.48   78   0.58   98   0.86
19   0.33   39   0.42   59   0.49   79   0.58   99   0.92
20   0.33   40   0.42   60   0.49   80   0.59   100  0.98
Table 3
Frequency table for the sample in Table 2

x_1   x_2   x_i    n_i   n·x    n(x−m)^2   n(x−m)^3   n(x−m)^4   N_i   f*      F*
0.0   0.1   0.05   1     0.1    0.17       −0.071     0.0294     1     0.010   0.010
0.1   0.2   0.15   3     0.5    0.30       −0.093     0.0292     4     0.030   0.040
0.2   0.3   0.25   10    2.5    0.46       −0.098     0.0210     14    0.100   0.140
0.3   0.4   0.35   20    7.0    0.26       −0.030     0.0034     34    0.200   0.340
0.4   0.5   0.45   29    13.1   0.01       −0.000     0.0000     63    0.290   0.630
0.5   0.6   0.55   20    11.0   0.15       0.013      0.0011     83    0.200   0.830
0.6   0.7   0.65   10    6.5    0.35       0.064      0.0120     93    0.100   0.930
0.7   0.8   0.75   3     2.3    0.25       0.070      0.0201     96    0.030   0.960
0.8   0.9   0.85   2     1.7    0.30       0.115      0.0444     98    0.020   0.980
0.9   1.0   0.95   2     1.9    0.47       0.230      0.1116     100   0.020   1.000
Sums               100   46.4   2.70       0.200      0.2720
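As a check, the summary statistics (36) can be recomputed directly from the frequency table above (a small script; b_1 and b_2 match (36), while s comes out 0.164, the text's 0.165 reflecting intermediate rounding):

```python
# Recomputing formula (36) from the frequency table of Table 3
mids = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
freq = [1, 3, 10, 20, 29, 20, 10, 3, 2, 2]
n = sum(freq)
m = sum(f * x for f, x in zip(freq, mids)) / n          # (11)
m2 = sum(f * (x - m) ** 2 for f, x in zip(freq, mids)) / n
m3 = sum(f * (x - m) ** 3 for f, x in zip(freq, mids)) / n
m4 = sum(f * (x - m) ** 4 for f, x in zip(freq, mids)) / n
s = m2 ** 0.5
b1 = m3 ** 2 / m2 ** 3                                  # (13)
b2 = m4 / m2 ** 2                                       # (13)
```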
Figure 2 shows all the results of the calculations using the above-mentioned software. We can see that all three Johnson distributions can be used for the sample in Table 2, but the S_U distribution has the smallest λ, therefore the highest p(λ), and should be the preferred distribution.
We will now show the manual use of the described formulae. Using formulae (2) and (20) we have:

m_y = −0.8275;   s_y = 0.3454;
η = 1/s_y = 1/0.3454 = 2.895;
γ* = −m_y/s_y = 0.8275/0.3454 = 2.396.

The values of z are calculated using formula (18):

z_1 = γ* + η ln x_1 = 2.396 + 2.895 × ln(0.1) = −4.27, etc.
Figure 3. Results of the calculations after the optimization of parameters α and β
Note that x_1 is equal to 0.1, i.e. the right end of the first interval. F(z) is found from the tables. One other way to find F(z) is using the function NORMSDIST in Microsoft Excel™. The largest difference D between F*(x) and F(z) is achieved at x = 0.4 and is equal to 0.059. This gives λ = D√n = 0.059 × 10 = 0.59. The corresponding probability is p(λ) = 0.88. And the value of x for p = 0.9:

x_0.9 = exp((1.282 − 2.396)/2.895) = 0.681.
Our program varies α from 0.01 to 0.3 with step 0.01, and for each α the value of p(λ) is calculated. If a value of λ smaller than the initial one (i.e. a larger p(λ)) is found, then this fit is taken as the one to be used and is shown on the screen in the columns with the calculations for the S_L distribution.
The program varies α from 0.01 to 0.30 with step 0.01 and β from 0.99 to 0.7 with step 0.01. Out of the 900 or so results, the one with the smallest λ is chosen. Figure 4 shows that the smallest λ = 0.295 is achieved when α = 0.10 and β = 0.85.
Indeed: we have Δx = 0.1; n = 100; let α = 0.10; β = 0.85; x_max = 1.001. Using formula (24) we find:

n_α = 10.10;   x_Lα = 0.2;   n_Rα = 14;   n_Lα = 4;
x_α = 0.2 + 0.1 × (10.10 − 4)/(14 − 4) = 0.261;
n_β = 85.85;   x_Lβ = 0.6;   n_Rβ = 93;   n_Lβ = 83;
x_β = 0.6 + 0.1 × (85.85 − 83)/(93 − 83) = 0.629.

Using formulae (27)-(29):

η = (z_β − z_α)/[ln(x_β/(x_max − x_β)) − ln(x_α/(x_max − x_α))] = (1.036 + 1.282)/(0.525 + 1.042) = 1.48;
γ = z_α − η ln(x_α/(x_max − x_α)) = −1.282 + 1.48 × 1.042 = 0.26;
x_0.9 = 1.001 / (1 + exp[(0.26 − 1.282)/1.48]) = 0.67.
For the value of z_1:

z_1 = γ + η ln(x_1/(x_max − x_1)) = 0.26 + 1.48 × ln(0.1/(1.001 − 0.1)) = −2.99.

In the same way all the other z_i are calculated. The values of F(z) are found either from the tables of the normal distribution or using Microsoft Excel. The largest difference D between F*(x) and F(z) is achieved when x = 0.8 and is equal to 0.029. Then λ = 0.029 × 10 = 0.29 and p(λ) = 1.00. The value of x corresponding to p = 0.9:

x_0.9 = 1.001/(1 + exp((0.26 − 1.28)/1.48)) = 0.67.
We have already calculated m = 0.464; s = 0.165; b_1 = 0.2038 and b_2 = 3.7301. We also need √b_1 = 0.451. From the tables for γ and η in [6], using linear interpolation, we find

γ = 1.527;   η = 3.214.

Using formulae (30)-(32):

ω = exp(1/3.214^2) = 1.102;
ζ = 0.165 × [2/((1.102 − 1)(1.102 × cosh(2 × 1.527/3.214) + 1))]^{1/2} = 0.45;
ε = 0.464 + 0.45 × √1.102 × sinh(1.527/3.214) = 0.70;
z_1 = 1.527 + 3.214 × ln[(0.1 − 0.70)/0.45 + (1 + ((0.1 − 0.70)/0.45)^2)^{1/2}] = −2.00.

In the same way the other z_i are calculated. The values of F(z) are found either from the tables of the normal distribution or using Microsoft Excel. The largest difference D between F*(x) and F(z) is achieved when x = 0.3 and is equal to 0.025. Then λ = 0.25 and p(λ) = 1.00. The value of x corresponding to p = 0.9:

x_0.9 = 0.70 + 0.45 × sinh((1.282 − 1.527)/3.214) = 0.67.
Johnson distributions cover all possible distributions and can be used if there is a convenient way to calculate the necessary parameters. The software designed and developed in the Problem Research Laboratory provides such an opportunity. The program calculates the parameters for all three distributions and allows the user to choose the one best suited for the studied case.
1. Ventsel E.S. Teoria veroyatnostei (Probability Theory), Moskva, Nauka, 1969.
2. Golovanov A.I., Papadopoulos D.G. Statisticheskaya kharakteristika…, Vestnik s/h nauki, 1972, No. 2.
3. Mitropolsky A.K. Tehnika statisticheskih vychislenii (Technique of Statistical Computations), Moskva, Nauka, 1971.
4. Pustylnik E.I. Statisticheskie metody analiza i obrabotki nabliudenii (Statistical Methods of Analysis and Processing of Observations), Moskva, Nauka, 1968.
5. Smirnov N.V., Dunin-Barkovsky I.V. Kurs teorii veroyatnostei i matematicheskoi statistiki (A Course in Probability Theory and Mathematical Statistics), Moskva, Nauka, 1969.
6. Hahn G.J., Shapiro S.S. Statistical Models in Engineering, Wiley, New York, 1967.
7. Johnson N.L. Systems of Frequency Curves Generated by Methods of Translation, Biometrika, 36, 149-176, 1949.
8. Shapiro S.S., Gross A.J. Statistical Modeling Techniques, Marcel Dekker, New York, 1981.