Statistical Treatment

R contains a very comprehensive library with statistical functions, including the most common probability distributions:

Associated Functions

There are several functions associated to every probability distribution, and they can be accessed adding a prefix to the distribution name:

_______       _______________________________________________________________________

Prefix                                    Meaning
_______       _______________________________________________________________________

  d           density function
  p           distribution function (cumulative function)
  q           inverse of the distribution function (quantile function)
  r           random generation of numbers following the probability distribution
_______       _______________________________________________________________________

The arguments are obviously different for each associated function. For the Normal Distribution:

> ?Normal

Normal                  package:stats                  R Documentation

The Normal Distribution

Description:

   Density, distribution function, quantile function and random
   generation for the normal distribution with mean equal to ‘mean’
   and standard deviation equal to ‘sd’.

Usage:


  dnorm(x, mean = 0, sd = 1, log = FALSE)
  pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
  qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
  rnorm(n, mean = 0, sd = 1)

...

dnorm(x, mean = 0, sd = 1, log = FALSE)

It evaluates the density of the normal distribution with mean mean and standard deviation sd in x abscissa. The normal distribution has density

\(f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}\)

where \(\mu\) is the mean of the distribution and \(\sigma\) the standard deviation.

> x <- seq(-10,10,by=.5)                      # sequence of numbers
> x
 [1] -10.0  -9.5  -9.0  -8.5  -8.0  -7.5  -7.0  -6.5  -6.0  -5.5  -5.0  -4.5
[13]  -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5
[25]   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   6.5   7.0   7.5
[37]   8.0   8.5   9.0   9.5  10.0

> y <- dnorm(x, mean=3, sd=2)                 # Normal distribution with mean=3 and sd=2

> plot(x,y,main="Normal Distribution Example")        # Plot the result

rnorm(n, mean = 0, sd = 1)

Random sequence of n numbers following a normal distribution with mean mean and standard deviation sd.

> x <- rnorm(1000,mean=3,sd=2)                # 1000 random numbers with mean=3 and sd=2
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.783   1.694   3.003   3.047   4.436   9.864

> hist(x,main="Normal Distribution Simulation", ylab="Frequency", plot=TRUE)

To ensure reproducibility, it is important to set the random number seed when performing simulations:

> set.seed(1000)
> rnorm(10)
  [1] -0.44577826 -1.20585657  0.04112631  0.63938841 -0.78655436 -0.38548930
  [7] -0.47586788  0.71975069 -0.01850562 -1.37311776
> rnorm(10)
  [1] -0.98242783 -0.55448870  0.12138119 -0.12087232 -1.33604105  0.17005748
  [7]  0.15507872  0.02493187 -2.04658541  0.21315411
> set.seed(1000)
> rnorm(10)
  [1] -0.44577826 -1.20585657  0.04112631  0.63938841 -0.78655436 -0.38548930
  [7] -0.47586788  0.71975069 -0.01850562 -1.37311776

pnorm(q, mean = 0, sd = 1, lower.tail = FALSE, log.p = FALSE)

It evaluates the distribution function (area below the probability distribution) for a normal distribution with mean mean and standard deviation sd. By default, lower.tail = TRUE returns the area in the left wing of the distribution (\(P[X \le x]\)) and lower.tail = FALSE returns the right wing (\(P[X > x]\)).

> pnorm(1.5,mean=3,sd=2)                              # left wing (default)
[1] 0.2266274

> pnorm(1.5,mean=3,sd=2,lower.tail=FALSE)             # right wing
[1] 0.7733726

The R object ecdf(x) lets us calculate and plot the Empirical Cumulative Distribution Function (useful when the cumulative distribution is not known). Let’s see with an example how to plot the cumulative function in the case of a normal distribution:

> par(mfrow = c(1, 2))                # define 1 row and 2 columns to plot
> x <- rnorm(50, 2, 4)                # random numbers following normal distribution

> plot(ecdf(x),verticals = TRUE, col.points = "blue",
  + col.hor = "red", col.vert = "bisque")     # plot Empirical Cumulative Distribution Function

which is equivalent to:

> y <- pnorm(x, 2, 4)
> plot(x,y, main="CDF using pnorm",
+      col="darkolivegreen",pch=20) # plot Cumulative Distribution Function using 'pnorm'

qnorm(p, mean = 0, sd = 1, lower.tail = FALSE, log.p = FALSE)

It evaluates the inverse of the distribution function (the abscissa for an area p under the probability distribution) for a normal distribution with mean mean and standard deviation sd. By default, lower.tail = TRUE assumes that the area is that of the left wing of the distribution and lower.tail = FALSE assumes that is the right wing area.

> qnorm(0.2266274,mean=3,sd=2)                        # left wing (default)
[1] 1.5

> qnorm(0.7733726,mean=3,sd=2,lower.tail=FALSE)       # right wing
[1] 1.5

Common probability distributions

______________________________________________________________________________

Distribution                  Associated Function
______________________________________________________________________________

Uniform                       dunif, punif, qunif, runif
Binomial                      dbinom, pbinom, qbinom, rbinom
Poisson                       dpois, ppois, qpois, rpois
...                           d..., p..., q..., r...
______________________________________________________________________________

Normal                        dnorm, pnorm, qnorm, rnorm
t de Student                  dt, pt, qt, rt
chi                           dchisq, pchisq, qchisq, rchisq
F de Fisher                   df, pf, qf, rf
...                           d..., p..., q..., r...
______________________________________________________________________________

Example script

Purpose: Estimation of the value of \(\pi\) using random points generated inside a square.

Procedure: Calculate the ratio between the inner and outer points in a circle with radius equal to 1, inscribed in a square of side equal to 2 (i.e., the circle’s diameter is equal to the square’s side).

We save the script in a file called pirandom.R:

# estimate PI by using random numbers
#    A.squ = n = (2*r)²
#    A.cir = n.inside = pi * r²
#
#    pi = n.inside/ r² = 4*n.inside/n
#
pirandom <- function(n)               # define function
  {
    x <- runif(n,-1,1)                # random numbers in [-1,1]
    y <- runif(n,-1,1)                # random numbers in [-1,1]
    plot(x,y) # plot
    r <- sqrt(x*x+y*y)                # distance to centre
    rinside <- r[r<1]                 # inside circle with r=1?
    n.inside <- length(rinside)
    print(4*n.inside/n)               # print pi estimation
}

The code is executed in R as follows:

> source("pirandom.R")                # load the code (function) in the script
> pirandom(1000)                      # run the function for 1000 points
[1] 3.184                             # 'pi' value estimation