Fast Gini – calculating Gini coefficients in Excel efficiently

“The Spirit Level” is a very interesting book about the effects of inequality on health and wellbeing. It introduced me to the Gini coefficient, which is a number ranging from 0 to 1 that measures the degree of inequality in a set of data.

Of course as an Excel Geek I immediately wanted to know how to calculate Gini Coefficients in Excel …

And then I wanted to see how an array formula compared to a VBA UDF and a C++ XLL UDF …

What are Gini Coefficients?
Gini Coefficients are a frequently used measure of inequality, such as the distribution of income across a population.
They can be calculated as half of the “relative mean difference”: take the mean of the absolute difference between every possible pair of datapoints, divide it by the mean of all the datapoints, and halve the result. A Gini Coefficient ranges from 0 (everyone has the same income) to 1 (one person has all the income).

Some Gini Income coefficients are:

  • Sweden      0.23
  • France      0.28
  • UK          0.34
  • USA         0.45
  • Brazil      0.57
  • Zimbabwe    0.57

(Source: http://en.wikipedia.org/wiki/List_of_countries_by_income_equality)

The Gini formula is often written as:
G = SUM(i=1 to n) SUM(j=1 to n) ABS(Datapoints(i)-Datapoints(j)) / (2*n*n*AVERAGE(Datapoints))

where Datapoints is the range of data and n is the number of points in Datapoints.

A Bias Correction factor of n/(n-1) is usually applied.
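
As a quick worked example, using three made-up datapoints 1, 2 and 3: summing ABS(Datapoints(i)-Datapoints(j)) over all i and j gives 8, n is 3 and the average is 2, so G = 8/(2*3*3*2) = 0.222, and applying the bias correction factor of 3/2 gives 0.333.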

Calculating Gini Coefficients using Excel.
The Gini formula can be written as an Excel array formula (Array1, credited to Harlan Grove), ignoring the bias correction factor:
{=AVERAGE(ABS(Datapoints-TRANSPOSE(Datapoints)))/AVERAGE(Datapoints)/2}

You need to enter this formula without the { } using Control-Shift-Enter to make Excel handle it as an array formula.

This array formula works well for small amounts of data, but because it creates and calculates a square array of n rows by n columns it is slow (64000 datapoints would need an intermediate array of over 4 billion cells), and it does not work for n greater than approximately 4000.

The 4000 limit can be bypassed by programming a VBA UDF (DaGini, credited to Dana DeLouis), and this UDF calculates very slightly faster than the array formula (in Excel 2003).

Function DaGini(v)
    Dim r As Long
    Dim c As Long
    Dim n As Long
    Dim t, M
    With WorksheetFunction
        M = .Transpose(v.Cells.Value)
        n = v.Cells.Count
        ' sum the absolute difference of every unique pair of datapoints
        For r = 2 To n
            For c = 1 To (r - 1)
                t = t + Abs(M(r) - M(c))
            Next c
        Next r
        ' each pair is counted once (half the full double sum), so divide by n * n * mean
        DaGini = t / (CDbl(n) * .Average(M) * CDbl(n))
    End With
End Function

For large data volumes the performance of this UDF is horrible (317 seconds for 64000 datapoints)

A more efficient formula was developed by Angus Deaton (Princeton):
G = (n+1)/(n-1) - 2/(n*(n-1)*AVERAGE(Datapoints)) * SUM(i=1 to n) Datapoints(i)*Rank(i)
where Rank(i) is 1 for the largest value and n for the smallest value.
This formula has the bias correction factor built in.
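
As a check, with the same three made-up datapoints (1, 2 and 3): the values 3, 2 and 1 get ranks 1, 2 and 3, so SUM(Datapoints(i)*Rank(i)) = 3*1 + 2*2 + 1*3 = 10, and G = (3+1)/(3-1) - 2/(3*2*2)*10 = 2 - 1.667 = 0.333, which matches the bias-corrected result above.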

This can be translated into an array formula (Array2):
{=(ROWS(Datapoints)+1)/(ROWS(Datapoints)-1)-2/(ROWS(Datapoints)*(ROWS(Datapoints)-1)*AVERAGE(Datapoints))*SUM(RANK(Datapoints,Datapoints)*Datapoints)}

Don’t forget to enter this formula without the { } using Control-Shift-Enter to make Excel handle it as an array formula.
This array formula is considerably more efficient than the previous array formula, with most of the calculation time being used by the RANK function, but it’s still quite slow with large amounts of data (72 seconds for 64000 datapoints).

Rewriting this formula as a VBA UDF (GiniCoefA), using an internal Quicksort to sort the data once instead of using the RANK function, results in an extremely efficient function that is usable with very large numbers of datapoints (0.4 seconds to calculate 64000 datapoints).

The GiniCoefA function has options for the data to be already sorted and to include/exclude the Bias correction factor.

Public Function GINICOEFA(InputData As Variant, _
    Optional Sorted As Long = 0, Optional BiasCorrect As Boolean = True) As Variant
    Dim InputValues As Variant
    Dim j As Long
    Dim jRank As Long
    Dim dObs As Double
    Dim dGini As Double
    Dim dSum As Double
    Dim oRng1 As Range
    Dim oRng2 As Range
    On Error GoTo Fail
    If IsObject(InputData) Then
        ' Range input: restrict large ranges to the used range before getting the values
        If InputData.Rows.Count > 10000 Then
            Set oRng1 = InputData.Parent.UsedRange
            Set oRng2 = Application.Intersect(oRng1, InputData)
        Else
            Set oRng2 = InputData
        End If
        If Application.WorksheetFunction.CountA(oRng2) = 0 Then Exit Function
        InputValues = oRng2.Value2
    ElseIf VarType(InputData) >= vbArray Then
        InputValues = InputData
    ElseIf IsEmpty(InputData) Or IsError(InputData) Then
        GoTo Finish
    ElseIf Len(CStr(InputData)) = 0 Then
        GoTo Finish
    Else
        ' a single value has a Gini coefficient of 1
        GINICOEFA = 1
        GoTo Finish
    End If
    dObs = UBound(InputValues)
    ' sort the data once (ascending) unless the caller says it is already sorted
    If Sorted = 0 Then QSortVar InputValues, 1, CLng(dObs)
    ' accumulate value * rank (rank 1 = largest) and the sum of the values
    For j = 1 To dObs
        If Sorted = 1 Then
            jRank = j
        Else
            jRank = dObs - j + 1
        End If
        dGini = dGini + InputValues(j, 1) * jRank
        dSum = dSum + InputValues(j, 1)
    Next j
    ' Deaton's formula, with the bias correction factor built in
    GINICOEFA = (dObs + 1) / (dObs - 1) - dGini * 2# / (dObs * (dObs - 1) * _
        (dSum / dObs))
    If Not BiasCorrect Then GINICOEFA = GINICOEFA * (dObs - 1) / dObs
Finish:
    Set oRng1 = Nothing
    Set oRng2 = Nothing
    Exit Function
Fail:
    GINICOEFA = CVErr(xlErrValue)
    GoTo Finish
End Function
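
Once GINICOEFA and the QSortVar Sub below are in a standard VBA module it can be called from the worksheet like any built-in function. A minimal usage sketch (the worksheet range is just an example):

=GINICOEFA(A1:A64000)
=GINICOEFA(A1:A64000, 0, FALSE)

The first form sorts the data internally and applies the bias correction factor; the second form leaves the bias correction factor out.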

Update: added the Quicksort VBA Sub

Sub QSortVar(InputValues As Variant, jStart As Long, jEnd As Long)
    Dim jStart2 As Long
    Dim jEnd2 As Long
    Dim v1 As Variant
    Dim v2 As Variant
    jStart2 = jStart
    jEnd2 = jEnd
    '
    ' choose random pivot
    '
    v1 = InputValues((jStart + (jEnd - jStart) * Rnd()), 1)
    While jStart2 < jEnd2
        While InputValues(jStart2, 1) < v1 And jStart2 < jEnd
            jStart2 = jStart2 + 1
        Wend
        While InputValues(jEnd2, 1) > v1 And jEnd2 > jStart
            jEnd2 = jEnd2 - 1
        Wend
        If jStart2 < jEnd2 Then
            v2 = InputValues(jStart2, 1)
            InputValues(jStart2, 1) = InputValues(jEnd2, 1)
            InputValues(jEnd2, 1) = v2
        End If
        If jStart2 <= jEnd2 Then
            jStart2 = jStart2 + 1
            jEnd2 = jEnd2 - 1
        End If
    Wend
    If jStart2 > jStart Then QSortVar InputValues, jStart, jEnd2
    If jStart2 < jEnd Then QSortVar InputValues, jStart2, jEnd
End Sub

Rewriting the VBA UDF in C++ using STL Vector and XLL Plus Version 7 gives further significant speed improvements (0.01 secs to calculate 64000 datapoints).

CXlOper* GINICOEFF_Impl(CXlOper& xloResult, const CXlOper* InputValues_op,
    const CXlOper* Sorted_op, const CXlOper* BiasCorrect_op)
{
    // Input buffers
    std::vector<double> InputValues;
    long Sorted;
    bool BiasCorrect;
    // Validate and translate inputs
    XlReadVector(*InputValues_op, InputValues, L"InputValues",
        XLA_TRUNC_ONEMPTY|XLA_TRUNC_ONBLANK|XLA_FLAG_REJECT_NULL_ARRAY);
    static CScalarConvertParams<long> Sorted__params(L"Sorted",
        XLA_DEFAULT_ZERO|XLA_DEFAULT_EMPTY|XLA_DEFAULT_NONNUMERIC|XLA_DEFAULT_BLANK,
        0, -1, (long)0);
    XlReadScalar(*Sorted_op, Sorted, Sorted__params);
    static CScalarConvertParams<bool> BiasCorrect__params(L"BiasCorrect",
        XLA_DEFAULT_EMPTY|XLA_DEFAULT_BLANK, 0, -1, true);
    XlReadScalar(*BiasCorrect_op, BiasCorrect, BiasCorrect__params);
    // End of generated code
    long nObs = 0;
    long jRank = 0;
    double dGini = 0.0, dSum = 0.0, dObs = 0.0, dVal = 0.0;
    nObs = (long)InputValues.size();
    dObs = nObs;
    if (nObs == 0) throw CXlErrorException(xlerrValue);
    // single value result is 1
    if (nObs == 1)
        { xloResult = 1.0; return xloResult.Ret(); }
    // sort data (ascending) if required
    if (Sorted == 0) std::sort(InputValues.begin(), InputValues.end());
    // calc values * rank (rank 1 = largest), and sum the values
    for (long j = 0; j < nObs; j++)
    {
        if (Sorted == 1)
            jRank = j + 1;
        else
            jRank = nObs - j;
        dVal = InputValues[j];
        // values not allowed to be negative
        if (dVal < 0.0) throw CXlErrorException(xlerrNum);
        dGini = dGini + dVal * jRank;
        dSum = dSum + dVal;
    }
    // Deaton's formula, with the bias correction factor built in
    dGini = (dObs + 1.0) / (dObs - 1.0) - dGini * 2.0 / (dObs * (dObs - 1.0) * (dSum / dObs));
    if (!BiasCorrect) dGini = dGini * (dObs - 1.0) / dObs;
    xloResult = dGini;
    return xloResult.Ret();
}

Comparison Timings in Milliseconds using Excel 2003 on an Intel i7 870 2.9GHz

500 Data Rows – Milliseconds

  • Array 1              22
  • DaGini VBA UDF       20
  • Array 2               5
  • GiniCoefA VBA UDF     1
  • GiniCoeff XLL         0.2

64000 Data Rows – Milliseconds

  • DaGini VBA UDF   318000
  • Array 2           72000
  • GiniCoefA VBA UDF   394
  • GiniCoeff XLL        12

Conclusions:

  • For small amounts of data there is no point in optimising the calculation: any method is fast enough.
  • Finding the best algorithm is often more important than using faster technology.
  • An efficient UDF is much faster than Array formulae for this calculation.
  • C++ XLL can be a lot faster than VBA.

More:

If you want an efficient ready-made Excel Gini coefficient solution, try SpeedTools Extras, a library of Excel functions that includes GINICOEFF:

http://www.decisionmodels.com/FastExcelV3SpeedTools_buy.htm

For more information on Gini Coefficients see:
http://en.wikipedia.org/wiki/Gini_coefficient
http://mathworld.wolfram.com/GiniCoefficient.html

21 Responses to Fast Gini – calculating Gini coefficients in Excel efficiently

  1. Matias says:

    Hello,
    Nice function, but when rewriting GiniCoefA as a VBA UDF I have a problem with the QSortVar subfunction, which appears as undefined.

    Thanks in advance!

    • fastexcel says:

      Matias,
      I have added the Quicksort code.

      • Matias says:

        Wow, that was fast!

        But I’ve got compilation errors in the Quicksort VBA Sub, on lines:

        While InputValues(jStart2, 1) < v1 And jStart2 v1 And jEnd2 > jStart
        (end of instruction problem?)

        If jStart2 jStart Then QSortVar InputValues, jStart, jEnd2
        (missing operator between jStart2 and jStart ?)

        Thanks again!

      • fastexcel says:

        OK, I nearly always goof when copy-pasting VBA code into HTML. I have reworked it – hopefully it’s OK now.

      • Matias says:

        Nice and clean 🙂

        Thanks a lot, U saved my week!

  2. scott says:

    What if I wanted to do the opposite? I’ve got the gini coefficient and average income for a country, but I want to know % of population below X income level.

    • Bart says:

      Scott
      What you ask for is not possible: there are multiple income distributions that can lead to the same level of inequality (gini-coëf.).

  3. ghassan chammas says:

    Can you work out a UDF calculating the Gini between two columns of data instead of 1 column? This would become the GiniBetween. It is the double sum of the absolute difference between Datapoints(i) and Datapoints(j), divided by [Average(i)+Average(j)] x n^2.
    This will calculate the Gini between two sets of different data having the same number of entries, of course.

    thank you

    ghassan chammas

  4. How to calculate this gini coefficient?

  5. ambatnisha says:

    Can I use Deaton’s formula to get a state-level Gini coefficient with state-level data?

    • fastexcel says:

      I don’t see why not …

      • ambatnisha says:

        I have data on an infrastructure variable, like state-wise road density in India for 20 years (17 states * 20 years of data points). When I use the fastgini command in Stata, it returns a value for the whole nation over the years. Can you explain how I can find a state-level Gini from state-level data?

  6. There’s a simpler way. List all the numbers in ascending order in column A. Then list all their contributions to the gini coefficient in another column. Then average that other column. Then divide that average by the average of column A. If you want, you can hide the other column.

    I implemented it as a series of columns to work out the logic. Here are the calculations lifted from row 10:

    B10 (Sum Above) =SUM(A$2:A9)
    C10 (Count Above) =COUNT(A$2:A9)
    D10 (Sum Below) =SUM(A11:A$101)
    E10 (Count Below) =COUNT(A11:A$101)
    F10 (Diff Above) =A10*C10-B10
    G10 (Diff Below) =D10-A10*E10
    H10 (Denominator) =C10+E10
    I10 (Gini) =(F10+G10)/H10

    The gini coefficient is the average of I divided by the average of A.

    If you want the entire calculation in one column, the above eight formulae can be combined into one. I’ll leave that up to you.

    Testing: The gini coefficient of the 100 integers from 1 to 100 was found to be 0.67. The gini coefficient of the 100 integers from 101 to 200 was found to be 0.22. 0.67*50.5/150.5=0.22, so it seems reasonable. I have nothing to compare these numbers against, so please let me know if they are wrong – Warren Gaebel @ qatBusinessServices.com (without the spaces).

    • fastexcel says:

      Yes you probably can do it with a lot of formulas if your data is sorted, but a UDF is IMHO much simpler to use as soon as you need to do it more than once or it’s not convenient to sort your data.
      If you want to check your formulas then just generate a few thousand random numbers, download the trial version of FastExcel from http://www.decisionmodels.com/fxldown.htm and crosscheck your result against the GINICOEFF FastExcel implementation.

  7. Randy says:

    Your array formula for calculating Gini Coefficients and your GINICOEFA() user-defined function are great and I am using them in my work. But I would like to calculate a wealth Gini Coefficient for a set of 500 people in which some of them have negative wealth (they are in debt). In this situation, the normal calculations for Gini result in a number greater than 1. I found a very interesting paper “The Gini Coefficient and Negative Income” http://dad.ecn.ulaval.ca/technical_notes/note12/Ref/CTR_1982.pdf from 1982 that proposes a normalized Gini Coefficient that seems like it does a good job of addressing this problem. However, I have no idea how to translate the mathematical equations in this paper into an Excel array function or into a UDF. I thought you might find this an interesting problem to tackle.

    • fastexcel says:

      Good question: I would have thought that just adding the absolute value of the most negative number to each item in the series before calculating the Gini would work, but I am not a sufficiently good math/stats person to be sure.

      • Randy says:

        Adding a fixed amount doesn’t work. Chen (in the link I gave above) quotes Kendall and Stuart (1963) as saying that the Gini coefficient suffers from “the disadvantage of being affected very much by the value of the mean measured from some arbitrary origin, and are not usually employed unless there is a natural origin of measurement or comparisons are being made between distributions with similar origins.” [But multiplying by any number gives the same Gini coefficient since the overall distribution stays the same among households.]
        You can see this by using the 10 numbers in their example: -500, -300, -300, -100, 200, 300, 300, 400, 500, 500. If you just calculate the regular Gini coefficient, you get 1.940 — a ludicrous result. If you add 500 to each number, G=0.323; if you add 600 to each number G=0.277.
        Using their normalized version produces G*=0.9065. This intuitively seems like a more appropriate value for this dispersion. I’m not a mathematician and haven’t studied this enough to know if their formulation is a good one or if it has any acceptance by other scholars.
        And I also don’t know how to translate their formulas into Excel.
        Perhaps Angus Deaton knows if normalized G* is accepted by the math community and if he has a way to translate that into code.
        [By the way, this UN FAO document http://www.fao.org/docs/up/easypol/329/gini_index_040EN.pdf has an understandable analysis of Gini which I can follow (if I study it carefully).]

  8. fastexcel says:

    I don’t know: 0.9065 looks like a very strange Gini coeff to me: that’s very close to 1, which should mean that the distribution is extremely skewed towards the top, which is not what I “see” from the numbers. I think that you need to choose a Gini method of handling negative numbers that makes sense for whatever your data is. And then as Kendall & Stuart say – you can’t use Gini to compare distributions that do not have a consistent origin.

    It would be interesting to see what the Wikipedia suggestion of half the relative mean absolute difference gives.

  9. Randy says:

    I did some more searching and found that these folks corrected the formula in the 1982 paper for a normalized Gini Coefficient that can address negative numbers:
    Z.M. Berrebi and Jacques Silber (1985) “The Gini coefficient and negative income: a comment.” Oxford Economic Papers. 37:525–526. https://academic.oup.com/oep/article-abstract/37/3/525/2361128

    And these Italian folks then expanded the formula to include situations in which most of the wealth/income values are negative, and if I am reading their paper right, even when the sum of wealth/income is zero or negative:
    Emanuela Raffinetti, Elena Siletti, and Achille Vernizzi, (RSV)
    “On the Gini coefficient normalization when attributes with negative values are considered”
    Statistical Methods and Applications, (2015) 24:507–521
    https://link.springer.com/article/10.1007/s10260-014-0293-4
    An earlier version of the paper is available here: https://www.researchgate.net/profile/Elena_Siletti/publication/299566579_Inequality_measures_and_the_issue_of_negative_incomes/links/59d342ae4585150177f92cad/Inequality-measures-and-the-issue-of-negative-incomes.pdf

    They provide a module for the R statistical package as described here: https://cran.r-project.org/web/packages/GiniWegNeg/GiniWegNeg.pdf

    Here is an implementation of this Gini-RSV in Python using some commands from the numpy library code: https://github.com/agcopenhaver/spatial_nGINI
    The value of T is the total sum for all households, T+ is the sum of all positive amounts, and T- is the sum of all negative amounts.

    It may be years (or decades) before mathematicians decide that this is the best Gini formula for addressing negative values, but right now this looks like a good formulation.

  10. Randy says:

    After a great deal of effort (much of it trial-and-error guesses since I am not a very good VBA programmer), I was able to create a User-Defined Function (UDF) to calculate the Gini Coefficient based on the formulas of Emanuela Raffinetti, Elena Siletti, and Achille Vernizzi (RSV). Below is the Excel macro code I wrote along with your QuickSort (from https://excelmacromastery.com/excel-vba-array/ which is needed for GiniRSV). Once this is installed, you can calculate a Gini coefficient =GiniRSV(selectedRange) in the same way you might calculate a sum =SUM(selectedRange). Note that this code includes the Bias Correction Factor N/(N-1) since in Excel, we are likely calculating Gini for a relatively small data set (in the R code, it is missing).

    This UDF uses the variable names in the R statistical package version. The code could be tightened up a bit by removing any references to weighting and possibly by using arrays in a better way in the VBA code (I have little experience using arrays and I find Excel’s terminology confusing, poorly documented, and, perhaps, inconsistent).

    Public Function GiniRSV(inputRange As Range) As Variant
    ' This macro is a conversion of the R statistical package code listed here:
    ' https://rdrr.io/cran/GiniWegNeg/src/R/Gini_RSV.R
    ' But all the weights (w) are assumed to be 1 (so there is no w array).
    ' It uses (roughly) the same variable names.

    ' There is also equivalent Python code listed here:
    ' https://github.com/agcopenhaver/spatial_nGINI

    ' But note that I multiplied by N/(N-1) at the end since N is likely to be relatively small in a spreadsheet,
    ' so this term is needed.

    Dim arrayInput() As Variant
    Dim arrayY As Variant
    Dim C_i As Variant
    Dim icount As Long
    Dim i As Long
    Dim N As Long
    Dim num_1 As Double
    Dim num_2 As Double
    Dim num_3 As Double
    Dim G_num As Double
    Dim T_neg As Double
    Dim T_pos As Double
    Dim n_RSV As Double
    Dim mean_RSV As Double

    On Error GoTo FuncFail

    arrayInput = inputRange.Value ' Take the values of the range and assign to an array. This array is
    ' two-dimensional since the spreadsheet range has two dimensions.
    icount = inputRange.Count
    ReDim arrayY(icount - 1) ' Dimension a one-dimensional array of the same size
    ' as the Selection (0 to icount - 1)

    i = 0
    For i = 1 To icount
    arrayY(i - 1) = arrayInput(i, 1) ' Convert the two-dimensional Selection (1xN)
    Next i ' into a one-dimensional array of length N called arrayY

    ' Sort the inputted values in ascending order (in arrayY)
    QuickSort arrayY, LBound(arrayY), UBound(arrayY)

    N = 0
    num_1 = 0
    num_2 = 0
    num_3 = 0
    T_neg = 0
    T_pos = 0
    C_i = arrayY ' To create an array the same size as arrayY, we just set it equal to arrayY
    ' The actual values of this array are assigned below

    For i = LBound(arrayY) To UBound(arrayY)
    N = N + 1 ' Since the weights are all 1, this is just N + 1 so it will end up being N (=icount)
    ' -- if the weights were different, this would be = N + w(i) so it would calculate the sum of w
    C_i(i) = i + 1 ' Since the weights are all 1, this is simple:
    ' C_i looks like [1, 2, 3, 4, 5, ... N]
    ' (This is the cumulative line for perfect equality)
    ' -- if the weights were different, this would be = C_i(i-1) + w(i) with C(0) previously set to 0

    num_1 = num_1 + (arrayY(i) * C_i(i)) ' This is equivalent to sum(arrayY*C_i)
    num_2 = num_2 + arrayY(i) ' This is equivalent to sum(arrayY)
    num_3 = num_3 + arrayY(i)
    ' -- if the weights were different, this would be = num_3 + arrayY(i) * w(i)

    If arrayY(i) < 0 Then ' If values are negative, then add them to T_neg
    T_neg = T_neg + arrayY(i)
    ' -- if the weights were different, this would be = T_neg + arrayY(i) * w(i)

    Else ' If they are positive, then add them to T_pos
    T_pos = T_pos + arrayY(i)
    ' -- if the weights were different, this would be = T_pos + arrayY(i) * w(i)
    End If
    Next i

    G_num = (2 / N ^ 2) * num_1 - (1 / N) * num_2 - (1 / N ^ 2) * num_3
    n_RSV = (2 * (T_pos + (Abs(T_neg))) / N)
    mean_RSV = n_RSV / 2
    GiniRSV = (1 / mean_RSV) * G_num

    GiniRSV = GiniRSV * N / (N - 1) ' If N is very large, then N/(N-1) approaches 1 and this line can be eliminated
    Exit Function

    FuncFail:
    GiniRSV = CVErr(xlErrValue)
    End Function

    Sub QuickSort(arr As Variant, first As Long, last As Long)
    ' from https://excelmacromastery.com/excel-vba-array/

    Dim vCentreVal As Variant, vTemp As Variant
    Dim iTempLow As Long
    Dim iTempHi As Long
    iTempLow = first
    iTempHi = last

    vCentreVal = arr((first + last) \ 2)
    Do While iTempLow <= iTempHi

    Do While arr(iTempLow) < vCentreVal And iTempLow < last
    iTempLow = iTempLow + 1
    Loop

    Do While vCentreVal < arr(iTempHi) And iTempHi > first
    iTempHi = iTempHi - 1
    Loop

    If iTempLow <= iTempHi Then

    ' Swap values
    vTemp = arr(iTempLow)

    arr(iTempLow) = arr(iTempHi)
    arr(iTempHi) = vTemp

    ' Move to next positions
    iTempLow = iTempLow + 1
    iTempHi = iTempHi - 1

    End If

    Loop

    If first < iTempHi Then QuickSort arr, first, iTempHi
    If iTempLow < last Then QuickSort arr, iTempLow, last
    End Sub

    The Hoover Index (or Robin Hood Index), defined as the portion of the total that must be redistributed to create equality, provides another easy-to-calculate way of measuring inequality that focuses more on riches at the top. Here is an array formula to calculate it:
    {=0.5 * SUM( ABS(selectedCells - AVERAGE(selectedCells) ) ) / SUM(selectedCells)}

    This measure normally ranges from 0 (full equality) to 1 (all income/wealth concentrated in one set of hands), but like the Gini coefficient, it can be greater than 1 when negative wealth (debt) or negative income greatly reduces the total. For this measure, I'm not sure if that is a problem since it is true that more than the total must be redistributed to create an equal society.

  11. Randy Schutt says:

    After further reflection, I’ve decided that the attempts at fixes to the Gini coefficient are misguided. When a society has a lot of residents with a large amount of debt, common sense says that this society has more wealth stratification than one in which those same residents had no debt. So the Gini coefficient should be higher, even if this means that its value goes above 1.

    So I am inclined to go back to using your array formulas to calculate Gini coefficients.
