Average

From SubSurfWiki

When we compute an average we are measuring the central tendency: a single quantity to represent the dataset. The trouble is, our data can have different shapes, different dimensionality, and different type (to use a computer science term). For example, we may be dealing with lognormal distributions, or rates, or classes. So, to cope with this variability, we have different averages.

Summary table

Here's a summary of the different means, with Python code for finding them for some array-like a:

Types of average

| Average | Remarks | NumPy function | SciPy function | Excel function |
| --- | --- | --- | --- | --- |
| Arithmetic mean | The sum divided by the population size, n. Used when the sum is of interest. | numpy.mean(a)[1] | | AVERAGE(a) |
| Trimmed mean | The mean, not including the extremes. For very noisy data. | | scipy.stats.mstats.tmean(a, limits=(0,0.4))[2] | TRIMMEAN(a, alpha) |
| Winsorized mean | The mean, with clipped extremes. For very noisy data. | | scipy.stats.mstats.winsorize(a, limits).mean()[3] | |
| Harmonic mean | n divided by the sum of the reciprocals. Used for rates and ratios. | | scipy.stats.mstats.hmean(a)[4] | HARMEAN(a) |
| Geometric mean | The nth root of the product. Used when the product is of interest. | | scipy.stats.mstats.gmean(a)[5] | GEOMEAN(a) |
| Median | The central value, the P50 value. | numpy.median(a)[6] | | MEDIAN(a) |
| Mode | The most frequent, or most likely, value; use for discrete class data. | | scipy.stats.mstats.mode(a)[7] | MODE(a) |
| Quadratic mean or RMS | The square root of the arithmetic mean of the squares. Used for magnitudes. | np.sqrt(np.mean(a**2)) | | SQRT(SUMSQ(a)/COUNT(a)) |
| Midrange | The mean of the min and max; rarely useful. | np.mean([np.amax(a), np.amin(a)]) | | AVERAGE(MIN(a),MAX(a)) |
| Weighted mean | For combining populations. | numpy.average(a, weights=weights)[8] | | |
| Swanson's mean | An estimate of the mean for modestly skewed distributions. Just use the mean! | | | |

Arithmetic mean

Everyone's friend, the plain old mean, the average. The only trouble is that it is, statistically speaking, not robust. This means that it's an estimator that is unduly affected by outliers. What are outliers? Anything that departs from some assumption of smoothness or uniformity in your data, from whatever model you have of what your data 'should' look like. Notwithstanding that you might just be wrong!
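To see how little it takes to pull the mean around, here is a minimal sketch using only Python's standard library. The two lists are datasets A and B from the Examples section, which differ in just two values; the variable names are only illustrative.

```python
from statistics import fmean

# Datasets A and B from the Examples section. B is A with two values
# swapped for the outliers 436 and 2.
a = [46, 34, 56, 45, 34, 23, 44, 56, 40, 45, 45, 34, 56, 54, 67]
b = [46, 436, 56, 45, 34, 23, 44, 56, 40, 45, 2, 34, 56, 54, 67]

mean_a = fmean(a)  # sum(a) / len(a)
mean_b = fmean(b)

print(f"A: {mean_a:.3f}")
print(f"B: {mean_b:.3f}")
```

Thirteen of the fifteen values are identical, yet the mean moves from about 45.3 to 69.2.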

Trimmed mean

One way to cope with the outliers in the arithmetic mean is to remove them from the data. To do this, sort the data and remove the lowest and highest values. The mean of what's left is usually called the trimmed, alpha-trimmed, or truncated mean, with a parameter, alpha, describing the proportion of points deleted (e.g. α = 0.2 for 2 points of 10). This type of mean is often the basis of subjectively judged scoring systems, as used in figure skating for example, to ameliorate bias or other anomalies (in figure skating they also eliminate some scores randomly).
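The recipe above (sort, drop the ends, average the rest) can be sketched in a few lines. This is an illustrative helper, not a library function: the name trimmed_mean is made up here, and it assumes alpha is the total fraction removed, split evenly between the two ends, with alpha less than 1.

```python
from statistics import fmean

def trimmed_mean(values, alpha):
    """Mean after dropping a total fraction alpha of the points,
    half from the low end and half from the high end."""
    data = sorted(values)
    k = int(alpha * len(data) / 2)  # points removed from each end
    kept = data[k:len(data) - k]
    return fmean(kept)

# Datasets A and B from the Examples section.
a = [46, 34, 56, 45, 34, 23, 44, 56, 40, 45, 45, 34, 56, 54, 67]
b = [46, 436, 56, 45, 34, 23, 44, 56, 40, 45, 2, 34, 56, 54, 67]

trim_a = trimmed_mean(a, 0.3)  # drops 2 points from each end of 15
trim_b = trimmed_mean(b, 0.3)

print(f"A: {trim_a:.3f}")
print(f"B: {trim_b:.3f}")
```

With two points cut from each end, this gives the trimmed means listed for A and B in the Examples table. Note that scipy.stats.trim_mean uses a different convention: its proportiontocut argument is the fraction cut from each end, not the total.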

Winsorized mean

Similar to the trimmed mean, but the winsorized mean replaces the extremes rather than removing them. Each tail is replaced with the nearest unclipped value (the lowest retained value for the low tail, the highest for the high tail), so the extreme points are kept but effectively down-weighted.
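Here is a rough sketch of winsorizing by hand, using the same alpha convention as the trimmed mean (total fraction affected, split between both ends, alpha less than 1). The helper name winsorized_mean is hypothetical; the data are dataset B from the Examples section.

```python
from statistics import fmean

def winsorized_mean(values, alpha):
    """Mean after clipping a total fraction alpha of the points
    to the nearest value that is not clipped."""
    data = sorted(values)
    n = len(data)
    k = int(alpha * n / 2)  # points clipped at each end
    if k > 0:
        data[:k] = [data[k]] * k              # low tail -> lowest retained value
        data[n - k:] = [data[n - k - 1]] * k  # high tail -> highest retained value
    return fmean(data)

b = [46, 436, 56, 45, 34, 23, 44, 56, 40, 45, 2, 34, 56, 54, 67]

plain_b = fmean(b)
wins_b = winsorized_mean(b, 0.3)  # clips 2 points at each end of 15

print(f"plain: {plain_b:.3f}, winsorized: {wins_b:.3f}")
```

The outliers 2 and 436 are still counted, but as 34 and 56, so they no longer dominate the result.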

Harmonic mean

The third and final Pythagorean mean, always equal to or smaller than the geometric mean. It's sometimes (by 'sometimes' I mean 'never') called the subcontrary mean. It tends towards the smaller values in a dataset; if those small numbers are outliers, this is a bug not a feature. Use it for rates: if you drive 10 km at 60 km/hr (10 minutes), then 10 km at 120 km/hr (5 minutes), then your average speed over the 20 km is 80 km/hr, not the 90 km/hr the arithmetic mean might have led you to believe.

But if you drive for 10 minutes at 60 km/hr and 10 minutes at 120 km/hr, then your average speed is 90 km/hr. Be careful out there!
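The two driving cases can be checked directly. This sketch uses the standard library's statistics module; the leg length of 10 km comes from the example above, and the variable names are just for illustration.

```python
from statistics import fmean, harmonic_mean

speeds = [60, 120]  # km/hr

# Case 1: equal distances at each speed. Average speed is total
# distance over total time, which is the harmonic mean of the speeds.
distance = 10  # km per leg
total_time = sum(distance / v for v in speeds)  # hours
avg_equal_distance = len(speeds) * distance / total_time
hmean_speed = harmonic_mean(speeds)

# Case 2: equal times at each speed. Here the plain arithmetic mean
# is the right answer.
avg_equal_time = fmean(speeds)

print(avg_equal_distance, hmean_speed, avg_equal_time)
```

The rule of thumb: when each rate applies over the same amount of the numerator (distance), use the harmonic mean; when each applies over the same amount of the denominator (time), use the arithmetic mean.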

Geometric mean

Like the arithmetic mean, this is one of the classical Pythagorean means. It is always equal to or smaller than the arithmetic mean. It has a simple geometric visualization: the geometric mean of a and b is the side of a square having the same area as the rectangle with sides a and b. Clearly, it is only meaningfully defined for positive numbers. When might you use it? For quantities that span orders of magnitude and tend to follow lognormal distributions: permeability, say. And this is the only mean to use for data that have been normalized to some reference value.
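A short sketch of the points above, using Python's statistics module (geometric_mean needs Python 3.8 or later). It checks the square-and-rectangle picture, shows that the geometric mean is the exponential of the mean of the logs, and confirms the ordering of the three Pythagorean means on dataset A from the Examples section.

```python
from math import exp, log
from statistics import fmean, geometric_mean, harmonic_mean

# Side of the square with the same area as a 2 x 8 rectangle.
side = geometric_mean([2, 8])  # sqrt(2 * 8)

# Dataset A from the Examples section (all values positive).
a = [46, 34, 56, 45, 34, 23, 44, 56, 40, 45, 45, 34, 56, 54, 67]

gm = geometric_mean(a)
gm_from_logs = exp(fmean([log(x) for x in a]))  # same thing, via logs
hm = harmonic_mean(a)
am = fmean(a)

print(f"side = {side:.3f}")
print(f"harmonic {hm:.3f} <= geometric {gm:.3f} <= arithmetic {am:.3f}")
```

The log form is why this mean suits lognormal-looking data: it is just the arithmetic mean taken in log space and transformed back.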

Median average

The median is the central value in the sorted data. In some ways, it's the archetypal average: the middle, with 50% of values being greater and 50% being smaller. If there is an even number of data points, then it's the arithmetic mean of the middle two. In a probability distribution, the median is often called the P50. In a positively skewed distribution (the most common one in petroleum geoscience, like the figure here), it is larger than the mode, or most likely, and smaller than the mean.

Remember the trimmed mean? The median is equivalent to the trimmed mean with alpha = 1 (well, 0.99...). You can think of the trimmed mean as a hybrid of the mean and the median.

When to use the median? When we think about smoothing a surface with a moving average, it often makes sense to use the median, because it removes noisy spikes, but retains persistent edges in the data. Because of this, it's my default filter for smoothing horizons.
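To illustrate the spike-versus-edge behaviour, here is a minimal one-dimensional sketch. The signal is made up (a flat stretch with one spike, then a step), the function names are hypothetical, and the ends are handled by repeating the edge values, which is only one of several reasonable choices.

```python
from statistics import fmean, median

def running_filter(values, func, width=3):
    """Apply func over a sliding window of odd width.
    The ends are padded by repeating the first and last values."""
    half = width // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [func(padded[i:i + width]) for i in range(len(values))]

# Hypothetical horizon: a noisy spike (9) followed by a genuine step (1 -> 5).
signal = [1, 1, 1, 9, 1, 1, 5, 5, 5, 5]

smoothed_median = running_filter(signal, median)
smoothed_mean = running_filter(signal, fmean)

print(smoothed_median)
print([round(v, 2) for v in smoothed_mean])
```

The running median removes the spike and leaves the step exactly where it was; the running mean smears the spike into its neighbours and softens the step.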

Mode average

The mode is the most frequent result in the data. We often use it for what are called nominal data: classes or names, rather than the cardinal numbers we've been discussing up to now. For example, the name Smith is not the 'average' name in the US, as such, since most people are called something else[9]. But it is the central tendency of names. One of the commonest applications is in a simple voting system: the person with the most votes wins. If you are averaging data like facies or waveform classes, say, then the mode is the only average that makes sense.
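For class data, finding the mode is just counting votes. A small sketch with made-up facies labels, using collections.Counter from the standard library:

```python
from collections import Counter

# Hypothetical facies classes, e.g. from neighbouring samples.
facies = ["sand", "shale", "sand", "silt", "shale", "sand", "sand"]

counts = Counter(facies)
mode_class, votes = counts.most_common(1)[0]  # most frequent label and its count

print(f"{mode_class} wins with {votes} of {len(facies)} votes")
```

If two classes tie, most_common returns whichever it encountered first, so it is worth checking for ties when the result matters.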

Root mean square

I already mentioned the trimmed mean, for dealing with outliers. But there's an extensive menu of central tendency representatives, and we've barely scratched the surface. For example, most geophysicists know about the root mean square, or quadratic mean, because it's a measure of magnitude independent of sign, so works on sinusoids varying around zero, for example.
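A quick sketch of why the RMS is useful for signals that oscillate about zero. It samples one full cycle of a unit-amplitude sine wave; the sample count and the helper name rms are arbitrary choices for illustration.

```python
from math import pi, sin, sqrt
from statistics import fmean

def rms(values):
    """Square root of the arithmetic mean of the squares."""
    return sqrt(fmean([v * v for v in values]))

# One full cycle of a sine wave with amplitude 1.
n = 1000
wave = [sin(2 * pi * i / n) for i in range(n)]

mean_wave = fmean(wave)  # positive and negative halves cancel
rms_wave = rms(wave)     # amplitude / sqrt(2) for a sinusoid

print(f"mean = {mean_wave:.6f}, RMS = {rms_wave:.6f}")
```

The arithmetic mean is essentially zero and tells us nothing about the size of the signal, while the RMS comes out at about 0.707, the amplitude divided by the square root of 2.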

Weighted mean

Finally, the weighted mean is worth a mention. Sometimes this one seems intuitive: if you want to average two datasets, but they have different populations, for example. If you have a mean porosity of 19% from a set of 90 samples, and another mean of 11% from a set of 10 similar samples, then it's clear you can't simply take their arithmetic average (15%); you have to weight them first: (0.9 × 0.19) + (0.1 × 0.11) = 0.182, or about 18%. But sometimes, it's not so obvious you need the weighted sum, for example, if you care about the perception of the subjects you are averaging.[10]
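The porosity case can be worked through in a few lines, taking the figures quoted in the text (19% from 90 samples, 11% from 10 samples) and weighting each mean by its sample count. The function name weighted_mean is just for this sketch.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

porosity = [0.19, 0.11]  # mean porosity of each set
samples = [90, 10]       # number of samples in each set

combined = weighted_mean(porosity, samples)
naive = sum(porosity) / len(porosity)  # ignores the sample counts

print(f"weighted: {combined:.3f}, unweighted: {naive:.3f}")
```

The weighted result is about 0.18, close to the larger set's mean as you would expect, whereas the unweighted average of the two means is 0.15.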

Midrange

Simply the arithmetic mean of the minimum and maximum values of the dataset. An inexpensive, simple L-estimator, but sensitive to outliers — it may be sensible to trim the data before computing the midrange. Only really useful for uniform distributions.
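A small sketch of the midrange and of the trim-first suggestion, using dataset B from the Examples section. Dropping two points from each end is an arbitrary choice for illustration.

```python
def midrange(values):
    """Mean of the smallest and largest values."""
    return (min(values) + max(values)) / 2

# Dataset B from the Examples section, with outliers 2 and 436.
b = [46, 436, 56, 45, 34, 23, 44, 56, 40, 45, 2, 34, 56, 54, 67]

raw = midrange(b)

# Trim two points from each end before taking the midrange.
trimmed = sorted(b)[2:-2]
robust = midrange(trimmed)

print(raw, robust)
```

Because the midrange depends only on the two most extreme points, a single outlier is enough to drag it from 45 to 219 here.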

Swanson's mean

A handy back-of-the-envelope estimator of the mean for a moderately skewed (usually lognormal) distribution, given P90, P50 and P10 values. It's easy, even for petroleum geologists:

Swanson's mean = 0.3 × P90 + 0.4 × P50 + 0.3 × P10

It was published by retired Exxon geologist Roy Swanson[11], with some friends from the University of Aberdeen, and popularized by Peter Rose's courses and excellent book[12].
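As a sketch, Swanson's mean is a one-liner. This assumes the commonly quoted 30-40-30 weighting of P90, P50 and P10, and the petroleum convention that P90 is the low case (90% chance of exceeding it). The input values below are invented purely for illustration.

```python
def swanson_mean(p90, p50, p10):
    """Swanson's 30-40-30 estimate of the mean from three percentiles."""
    return 0.3 * p90 + 0.4 * p50 + 0.3 * p10

# Hypothetical low, mid and high cases for a skewed quantity.
estimate = swanson_mean(p90=10, p50=30, p10=80)

print(f"Swanson's mean: {estimate:.1f}")
```

For these made-up values the estimate is 39, sitting above the P50 of 30, as the mean should for a positively skewed distribution.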

Examples

We'll look at the averages for 4 datasets:

A: 46, 34, 56, 45, 34, 23, 44, 56, 40, 45, 45, 34, 56, 54, 67
B: 46, 436, 56, 45, 34, 23, 44, 56, 40, 45, 2, 34, 56, 54, 67
C: 0.002, 0.04, 1, 1, 1, 1, 1, 2, 2, 3, 4, 17, 34, 56, 167
D: -120, -8, -2.3, -1, -1, -1, 0, 0, 2, 3, 4, 8, 16, 40, 140
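
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| Arithmetic mean | 45.267 | 69.2 | 19.336 | 5.313 |
| Trimmed mean | 45.364 | 46.364 | 6.091 | 2.518 |
| Harmonic mean | 42.270 | 18.621 | 0.028 | |
| Geometric mean | 43.834 | 42.221 | 1.940 | |
| Median average | 45 | 45 | 2 | 0 |
| Mode average | 34 | 56 | 1 | -1 |
| Quadratic mean | 46.574 | 120.956 | 46.553 | 49.004 |
| Midrange | 45 | 219 | 83.501 | 10 |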
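Dataset D has no harmonic or geometric mean in the table because it contains zeros and negative numbers. The sketch below shows one way to handle that, using Python's statistics module, which raises StatisticsError when these means are asked for on negative data. The helper name safe is hypothetical.

```python
from statistics import StatisticsError, geometric_mean, harmonic_mean, median

# Datasets C and D from the list above.
c = [0.002, 0.04, 1, 1, 1, 1, 1, 2, 2, 3, 4, 17, 34, 56, 167]
d = [-120, -8, -2.3, -1, -1, -1, 0, 0, 2, 3, 4, 8, 16, 40, 140]

def safe(func, values):
    """Return func(values), or None where the statistic is undefined."""
    try:
        return func(values)
    except StatisticsError:
        return None

for name, data in (("C", c), ("D", d)):
    print(name, median(data), safe(harmonic_mean, data), safe(geometric_mean, data))
```

The median works for both datasets, while the harmonic and geometric means only come back for the all-positive dataset C.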

See also

External links

References