Week 2: Frequency distributions

The space environment is forever changing on all spatial and temporal scales. Energy releases are observed in numerous dynamic phenomena (e.g. solar flares, coronal mass ejections, solar energetic particle events) where measurements provide signatures of the dynamics. Parameters describing these phenomena (e.g. peak count rate, total energy released) are found to have frequency-size distributions that follow power-law behavior. Natural phenomena on Earth, such as earthquakes and landslides, display similar power-law behavior. This suggests an underlying universality in nature and poses the question of whether the distribution of energy is the same for all these phenomena. Frequency distributions provide constraints for models that aim to simulate the physics and statistics observed in the individual phenomena. The concept of self-organized criticality (SOC), also known as the "avalanche concept", was introduced by Bak et al. (1987, 1988) to characterize the behavior of dissipative systems that contain a large number of elements interacting over a short range. Such systems evolve to a critical state in which a minor event starts a chain reaction that can affect any number of elements in the system. The frequency distributions of the output parameters from the chain reaction, taken over a period of time, can be represented by power laws. During the last decades SOC has been debated from all angles. New SOC models, as well as non-SOC models, have been proposed to explain the observed power-law behavior. Furthermore, since Bak's pioneering work in 1987, people have searched for signatures of SOC everywhere. This paper will review how SOC behavior has become one way of interpreting the power-law behavior observed in naturally occurring phenomena from the Sun down to the Earth.


Frequency distributions
When data are purely qualitative, the simplest way to deal with them is to count the number of cases in each category. For example, in the analysis of the census of a psychiatric hospital population, one of the variables of interest was the patient's principal diagnosis. To summarise these data, we count the number of patients having each diagnosis. The results are shown in Table 1. The count of individuals having a particular quality is called the frequency of that quality. For example, the frequency of schizophrenia is 474. The proportion of individuals having the quality is called the relative frequency or proportional frequency. The relative frequency of schizophrenia is 474/1467 = 0.32 or 32%. The set of frequencies of all the possible categories is called the frequency distribution of the variable. Although the categories of likelihood of discharge (Table 2) are ordered, these are not quantitative data. There is no sense in which the difference between 'likely' and 'possibly' is the same as the difference between 'possibly' and 'unlikely'.
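The counting described above can be sketched in a few lines of Python. This is only an illustration: the list of diagnoses is hypothetical (it is not the data in Table 1), and the function name frequency_distribution is our own. The last line checks the relative frequency of schizophrenia quoted in the text.

```python
from collections import Counter

# Hypothetical diagnoses for ten patients (illustrative only, NOT Table 1).
diagnoses = (
    ["schizophrenia"] * 4
    + ["dementia"] * 3
    + ["affective disorder"] * 2
    + ["alcoholism"] * 1
)

def frequency_distribution(values):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, n / total) for category, n in counts.items()}

dist = frequency_distribution(diagnoses)
for category, (freq, rel) in dist.items():
    print(f"{category:20s} {freq:3d} {rel:.2f}")

# The figures quoted in the text: 474 patients out of 1467.
schizophrenia_relative_frequency = round(474 / 1467, 2)
```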
Table 3 shows the frequency distribution of a quantitative variable, parity. This shows the number of previous pregnancies for a sample of women booking for delivery at St. George's Hospital. Only certain values are possible, as the number of pregnancies must be an integer, so this variable is discrete. The frequency of each separate value is given.
Table 4 shows a continuous variable, forced expiratory volume in one second (FEV1) in a sample of male medical students. As most of the values occur only once, to get a useful frequency distribution we need to divide the FEV1 scale into class intervals, e.g. from 3.0 to 3.5, from 3.5 to 4.0, and so on, and count the number of individuals with FEV1s in each class interval. The class intervals should not overlap, so we must decide which interval contains the boundary point to avoid it being counted twice. It is usual to put the lower boundary of an interval into that interval and the higher boundary into the next interval. Thus the interval starting at 3.0 and ending at 3.5 contains 3.0 but not 3.5. We can write this as '3.0-' or '3.0-3.5-' or '3.0-3.499'. If we take a starting point of 2.5 and an interval of 0.5 we get the frequency distribution shown in Table 5. Note that this is not unique. If we take a starting point of 2.4 and an interval of 0.2 we get a different set of frequencies.
We then count up the number in each interval. In practice this is very difficult to do accurately, and it needs to be checked and double-checked.
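Because counting by hand is error-prone, a short program can serve as a check. The sketch below is one possible way to apply the rule that the lower boundary belongs to the interval and the upper boundary to the next one. The FEV1 values are hypothetical (not those of Table 4), the function name is our own, and we assume measurements recorded to two decimal places so that the arithmetic can be done in whole hundredths, which avoids floating-point trouble at the boundaries.

```python
from collections import Counter

# Hypothetical FEV1 values in litres (illustrative only, NOT Table 4).
fev1 = [2.85, 3.0, 3.42, 3.5, 3.5, 3.96, 4.1, 4.47, 4.5, 5.2]

def class_interval_counts(values, start, width):
    """Count values in intervals [start, start + width), [start + width, ...).

    Assumes values, start and width are given to two decimal places;
    everything is converted to integer hundredths before dividing.
    """
    start_c = round(start * 100)
    width_c = round(width * 100)
    counts = Counter()
    for v in values:
        index = (round(v * 100) - start_c) // width_c  # lower bound inclusive
        counts[index] += 1
    return {(start_c + i * width_c) / 100: counts[i] for i in sorted(counts)}

table_a = class_interval_counts(fev1, start=2.5, width=0.5)
table_b = class_interval_counts(fev1, start=2.4, width=0.2)
print(table_a)
print(table_b)
```

The two calls use the two choices of starting point and interval mentioned in the text. They give different frequency distributions for the same observations, which is the sense in which the distribution is not unique.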

Histograms and other frequency graphs
Graphical methods are very useful for examining frequency distributions. The most common way of depicting a frequency distribution is by a histogram. This is a diagram where the class intervals are on an axis and rectangles with heights or areas proportional to the frequencies are erected on them. Figure 1 shows the histogram for the FEV1 distribution in Table 4. The vertical scale shows frequency, the number of observations in each interval.
Figure 2 shows a histogram for the same distribution, with frequency per unit FEV1 (or frequency density) shown on the vertical axis. The distributions appear identical and we may well wonder whether it matters which method we choose. We see that it does matter when we consider a frequency distribution with unequal intervals, as in Table 6. If we plot the histogram using the heights of the rectangles to represent relative frequency in the interval we get Figure 3, whereas if we use the relative frequency per year we get Figure 4. These histograms tell different stories. Figure 3 suggests that the most common age for accident victims is between 15 and 44 years, whereas Figure 4 suggests it is between 0 and 4. Figure 4 is correct, Figure 3 being distorted by the unequal class intervals. It is therefore preferable in general to use the frequency per unit rather than per class interval when plotting a histogram. The frequency for a particular interval is then represented by the area of the rectangle on that interval. Only when the class intervals are all equal can the frequency for the class interval be represented by the height of the rectangle.
Histograms are not the only graphical method to show a frequency distribution. Another which you may come across quite often is the frequency polygon. A frequency polygon joins up the mid-points of the tops of the bars. The bars are then removed to leave a graph like Figure 6. Frequency polygons are useful for showing more than one frequency distribution together. For example, Figure 7 shows the distribution of PEF in female and male students, enabling us to compare the two distributions easily.
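The effect of unequal class intervals on a histogram can be seen with a small calculation. The age groups and counts below are invented for illustration (they are not the figures in Table 6), and the function name is our own. Each group is given as lower bound (inclusive), upper bound (exclusive) and frequency, so the group written '0-4 years' is stored as 0 to 5.

```python
# Hypothetical age groups with made-up counts (NOT the data of Table 6).
# Each entry is (lower bound inclusive, upper bound exclusive, frequency).
age_groups = [
    (0, 5, 30),
    (5, 15, 20),
    (15, 45, 60),
    (45, 65, 20),
    (65, 85, 20),
]

def frequency_density(groups):
    """Return (lower, upper, frequency per year) for each class interval."""
    return [(lo, hi, count / (hi - lo)) for lo, hi, count in groups]

densities = frequency_density(age_groups)

# The interval with the largest count is not the one with the largest density.
most_frequent = max(age_groups, key=lambda g: g[2])
densest = max(densities, key=lambda g: g[2])
print("largest frequency:", most_frequent)
print("largest frequency per year:", densest)
```

With these invented numbers the widest group has the most observations simply because it covers thirty years, while the frequency per year is greatest in the youngest group. Plotting frequency per year makes the area of each rectangle, not its height, proportional to the frequency.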
The box and whisker plot is another frequency graph which is quite widely used, but we shall discuss it later.

Shapes of frequency distribution
Figure 1 shows a frequency distribution of a shape often seen in health data. The distribution is roughly symmetrical about its central value and has frequency concentrated about one central point. The most common value is called the mode of the distribution and Figure 1 has one such point, as do Figure 4 and Figure 5. They are unimodal. Figure 8 shows a very different shape. Here there are two distinct modes, one near 5 and the other near 8.5. This distribution is bimodal. We must be careful to distinguish between the unevenness in the histogram which results from using a small sample to represent a large population and that which results from genuine bimodality in the data. The trough between 6 and 7 in Figure 8 is very marked and might represent a genuine bimodality. In this case we have children some of whom may have a condition which raises the cholesterol level and some of whom do not. We actually have two separate populations represented with some overlap between them. However, almost all distributions encountered in medical statistics are unimodal. If the tails are equal the distribution is symmetrical, as in Figure 1. Most distributions encountered in health work are symmetrical or skew to the right, for reasons we shall discuss later.
If the tail on the left is longer than the tail on the right, the distribution is skew to the left or negatively skew. This is much more unusual in health data. Figure 10 shows an example, gestational age at birth. This is a rather artificial negatively skew distribution, because some babies are delivered early because of obstetric intervention and none are allowed to be born later than 44 weeks for the same reason.

Medians and quantiles
We often want to summarise a frequency distribution in a few numbers, for ease of reporting or comparison. The most direct method is to use quantiles. The quantiles are values which divide the distribution such that there is a given proportion of observations below the quantile. For example, the median is a quantile. The median is the central value of the distribution, such that half the points are less than or equal to it and half are greater than or equal to it. For the FEV1 data the median is 4.1, the 29th value in Table 4. If we have an even number of points, we choose a value midway between the two central values.
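The rule for the median can be written out directly. The sketch below is a minimal illustration with made-up numbers, and the function name is our own. For an odd number n of ordered observations the central one is observation (n + 1)/2, so the 29th value being central corresponds to 57 observations.

```python
def median(values):
    """Central value of the data; midway between the two central values
    when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of no observations is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the (n + 1)/2-th value, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([3, 1, 2])
even_example = median([4, 1, 3, 2])

# Position of the central value for 57 ordered observations.
central_position = (57 + 1) // 2
```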
Other quantiles which are particularly useful are the quartiles of the distribution. The quartiles divide the distribution into four equal parts. The second quartile is the median. Figure 11 shows the three quartiles for the serum triglyceride data.
We often divide the distribution into 100 parts at 99 centiles or percentiles. The median is thus the 50th centile.
We use the quartiles in another graph to show a frequency distribution, the box and whisker plot. Figure 12 shows two examples. We draw a box whose height is the distance between the two quartiles and draw a line across at the median. We then draw lines stretching beyond the box to the minimum and maximum. This can be used to show several distributions together, as in Figure 13. This shows the distribution of mannitol absorption for four groups of subjects, classified by their HIV status and symptomatology. Points more than 1.5 box heights from the top or bottom of the box are often shown separately, as outlying points. This is the case for ARC. The numbers in the groups are small in this example, particularly for ARC, which makes them rather uneven.
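The quantities needed for a box and whisker plot can be computed as follows. This is a sketch under stated assumptions: the data are invented, the function name is our own, and the quartiles come from Python's statistics.quantiles with its default 'exclusive' method. Several conventions for interpolating quartiles exist and the text does not say which one is used, so other software may give slightly different values.

```python
import statistics

# Hypothetical observations (illustrative only).
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 30]

def box_plot_summary(values):
    """Quartiles, box height and outlying points for a box and whisker plot."""
    q1, med, q3 = statistics.quantiles(values, n=4)  # default method: exclusive
    box_height = q3 - q1
    lower_limit = q1 - 1.5 * box_height
    upper_limit = q3 + 1.5 * box_height
    # Points more than 1.5 box heights beyond the box are shown separately.
    outlying = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "box_height": box_height,
        "outlying": outlying,
    }

summary = box_plot_summary(data)
print(summary)
```

For these eleven values the quartiles fall exactly on the 3rd, 6th and 9th ordered observations, and the single value 30 lies more than 1.5 box heights above the upper quartile.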

Figure 1. Histogram of FEV1: frequency scale

Figure 2. Histogram of FEV1: frequency per unit FEV1 or frequency density scale

Figure 4. Age distribution of home accident victims: relative frequency density

Figure 7. Frequency polygons for PEF comparing males and females

Figure 8. Serum cholesterol in children from kinships with familial hypercholesterolemia

Figure 12. Two examples of box and whisker plots

Table 1. Principal diagnosis of patients in Tooting Bec Hospital

Table 2. Likelihood of discharge of patients in Tooting Bec Hospital

Table 3. Parity of 125 women attending antenatal clinics at St. George's Hospital

In this census we assessed whether patients were 'likely to be discharged', 'possibly to be discharged' or 'unlikely to be discharged'. The frequencies of these categories are shown in Table 2. Likelihood of discharge is a qualitative variable, like diagnosis, but the categories are ordered. This enables us to use another set of summary statistics, the cumulative frequencies. The cumulative frequency for a value of a variable is the number of individuals with values less than or equal to that value.

Table 7. Distribution of age in people suffering accidents in the home