## Fast Gini – calculating Gini coefficients in Excel efficiently

“The Spirit Level” is a very interesting book about the effects of inequality on health and wellbeing. It introduced me to the Gini coefficient, which is a number ranging from 0 to 1 that measures the degree of inequality in a set of data.

Of course as an Excel Geek I immediately wanted to know how to calculate Gini Coefficients in Excel …

And then I wanted to see how an array formula compared to  a VBA UDF and a C++ XLL UDF …

What are Gini Coefficients?
Gini Coefficients are a frequently-used method of measuring inequalities such as income distribution in a population.
They can be calculated as “the relative mean difference” – the mean of the difference between every possible pair of datapoints, divided by the mean of all the datapoints. A Gini Coefficient ranges from 0 (everyone has the same income) to 1 (one person has all the income).

Some Gini Income coefficients are:

• Sweden     0.23
• France      0.28
• UK            0.34
• USA          0.45
• Brazil        0.57
• Zimbabwe  0.57

The Gini formula is often written as:
G=SUM(i=1 to n) SUM(j=1 to n) ABS(Datapoints(i)-Datapoints(j)) / (n*n*Average(Datapoints))

where Datapoints is the range of data and n is the number of points in Datapoints.

A Bias Correction factor of n/(n-1) is usually applied.

Calculating Gini Coefficients using Excel.
The Gini Formula can be written as an Excel array formula (Array1, credited to Harlan Grove) (ignoring the Bias correction Factor):
{=AVERAGE(ABS(DataPoints-TRANSPOSE(Datapoints)))/AVERAGE(Datapoints)/2}

You need to enter this formula without the { } using Control-Shift-Enter to make Excel handle it as an array formula.

This array formula works well for small amounts of data, but because it creates and calculates a square array of n rows by n columns it is slow, and does not work for n> approx 4000.

The 4000 limit can be bypassed by programming a VBA UDF (DaGini, credited to Dana DeLouis), and this UDF calculates very slightly faster than the array formula (in Excel 2003).

```Function DaGini(v)
Dim r As Long
Dim c As Long
Dim n As Long
Dim t, M
With WorksheetFunction
M = .Transpose(v.Cells.Value)
n = v.Cells.Count
For r = 2 To n
For c = 1 To (r - 1)
t = t + Abs(M(r) - M(c))
Next c
Next r
DaGini = t / (CDbl(n) * .Average(M) * CDbl(n))
End With
End Function
```

For large data volumes the performance of this UDF is horrible (317 seconds for 64000 datapoints)

A more efficient formula was developed by Angus Deaton (Princeton):
G=(n+1)/(n-1)-2 /(n*(n-1)* Average(Datapoints))*SUM(i=1 to n) Datapoints(i)*Rank(i)
Where Rank is 1 for the largest value and n for the smallest value.
This formula has the bias correction factor built-in.

This can be translated into an Array formula (Array2):
{=(ROWS(Datapoints)+1)/(ROWS(datapoints)-1)-2/(ROWS(Datapoints)*(ROWS(Datapoints)-1)*AVERAGE(Datapoints)) *SUM(RANK(Datapoints,Datapoints)*Datapoints)}

Don’t forget to enter this formula without the { } using Control-Shift-Enter to make Excel handle it as an array formula.
This array formula is considerably more efficient than the previous array formula, with most of the calculation time being used in the RANK function, but its still quite slow with large amounts of data (72 seconds for 64000 datapoints)

Rewriting this formula as a VBA UDF (GiniCoefA) using an internal Quicksort to sort the data once instead of using the RANK function results in an extremely efficient function that is useable with very large numbers of datapoints: (0.4 secs to calculate 64000 datapoints).

The GiniCoefA function has options for the data to be already sorted and to include/exclude the Bias correction factor.

```Public Function GINICOEFA(InputData As Variant, _
Optional Sorted As Long = 0, Optional BiasCorrect As Boolean = True) As Variant
Dim InputValues As Variant
Dim j As Long
Dim jRank As Long
Dim dObs As Double
Dim dGini As Double
Dim dSum As Double
Dim oRng1 As Range
Dim oRng2 As Range
On Error GoTo Fail
If IsObject(InputData) Then
If InputData.Rows.Count &gt; 10000 Then
Set oRng1 = InputData.Parent.UsedRange
Set oRng2 = Application.Intersect(oRng1, InputData)
Else
Set oRng2 = InputData
End If
If Application.WorksheetFunction.CountA(oRng2) = 0 Then Exit Function
InputValues = oRng2.Value2
ElseIf VarType(InputData) &gt;= vbArray Then
InputValues = InputData
ElseIf IsEmpty(InputData) Or IsError(InputData) Then
GoTo Finish
ElseIf Len(CStr(InputData)) = 0 Then
GoTo Finish
Else
GINICOEFA = 1
GoTo Finish
End If
dObs = UBound(InputValues)
If Sorted = 0 Then QSortVar InputValues, 1, CLng(dObs)
For j = 1 To dObs
If Sorted = 1 Then
jRank = j
Else
jRank = dObs - j + 1
End If
dGini = dGini + InputValues(j, 1) * jRank
dSum = dSum + InputValues(j, 1)
Next j
GINICOEFA = (dObs + 1) / (dObs - 1) - dGini * 2# / (dObs * (dObs - 1) * _
(dSum / dObs))
If Not BiasCorrect Then GINICOEFA = GINICOEFA * (dObs - 1) / dObs
Finish:
Set oRng1 = Nothing
Set oRng2 = Nothing
Exit Function
Fail:
GINICOEFA = CVErr(xlErrValue)
GoTo Finish
End Function
```

Update: added the Quicksort VBA Sub

```Sub QSortVar(InputValues As Variant, jStart As Long, jEnd As Long)
Dim jStart2 As Long
Dim jEnd2 As Long
Dim v1 As Variant
Dim v2 As Variant
jStart2 = jStart
jEnd2 = jEnd
'
' choose random pivot
'
v1 = InputValues((jStart + (jEnd - jStart) * Rnd()), 1)
While jStart2 &lt; jEnd2
While InputValues(jStart2, 1) &lt; v1 And jStart2 &lt; jEnd
jStart2 = jStart2 + 1
Wend
While InputValues(jEnd2, 1) &gt; v1 And jEnd2 &gt; jStart
jEnd2 = jEnd2 - 1
Wend
If jStart2 &lt; jEnd2 Then
v2 = InputValues(jStart2, 1)
InputValues(jStart2, 1) = InputValues(jEnd2, 1)
InputValues(jEnd2, 1) = v2
End If
If jStart2 &lt;= jEnd2 Then
jStart2 = jStart2 + 1
jEnd2 = jEnd2 - 1
End If
Wend
If jStart2 &gt; jStart Then QSortVar InputValues, jStart, jEnd2
If jStart2 &lt; jEnd Then QSortVar InputValues, jStart2, jEnd
End Sub
```

Rewriting the VBA UDF in C++ using STL Vector and XLL Plus Version 7 gives further significant speed improvements (0.01 secs to calculate 64000 datapoints).

```CXlOper* GINICOEFF_Impl(CXlOper&amp; xloResult, const CXlOper* InputValues_op, const
CXlOper* Sorted_op, const CXlOper* BiasCorrect_op)
{
// Input buffers
std::vector InputValues;
long Sorted;
bool BiasCorrect;
// Validate and translate inputs
XlReadVector(*InputValues_op, InputValues, L"InputValues", XLA_TRUNC_ONEMPTY
|XLA_TRUNC_ONBLANK|XLA_FLAG_REJECT_NULL_ARRAY);
static CScalarConvertParams Sorted__params(L"Sorted", XLA_DEFAULT_ZERO
|XLA_DEFAULT_EMPTY|XLA_DEFAULT_NONNUMERIC|XLA_DEFAULT_BLANK, 0, -1, (
long)0);
XlReadScalar(*Sorted_op, Sorted, Sorted__params);
static CScalarConvertParams BiasCorrect__params(L"BiasCorrect",
XLA_DEFAULT_EMPTY|XLA_DEFAULT_BLANK, 0, -1, true);
XlReadScalar(*BiasCorrect_op, BiasCorrect, BiasCorrect__params);
// End of generated code
long nObs=0;
long jRank=0;
double dGini=0.0,dSum=0.0,dObs=0.0,dVal=0.0;
nObs=InputValues.size();
dObs=nObs;
if (nObs==0) throw CXlErrorException(xlerrValue);
// single value result is 1
if (nObs==1)
{xloResult=1.0; return xloResult.Ret();}
// sort data if required
if (Sorted==0) sort(InputValues.begin(),InputValues.end());
// calc values * rank, and sum
for (long j=0;j&lt;nObs;j++)
{
if (Sorted==1)
{jRank=j + 1;}
else
{jRank=nObs - j ;}
;
dVal=InputValues[j];
// values not allowed to be negative
if (dVal&lt;0.0) throw CXlErrorException(xlerrNum);
{dGini=(dObs + 1.0) / (dObs - 1.0) - dGini * 2.0 / (dObs * (dObs - 1.0) * (dSum / dObs));}
if (!BiasCorrect) dGini = dGini * (dObs-1.0) / dObs;
xloResult=dGini;
return xloResult.Ret();
}
```

Comparison Timings in Milliseconds using Excel 2003 on an Intel i7 870 2.9GHz

500 Data Rows – Milliseconds

 Array 1 22 DaGini VBA UDF 20 Array 2 5 GiniCoefA VBA UDF 1 GiniCoeff XLL 0.2

64000 Data Rows – Milliseconds

 DaGini VBA UDF 318000 Array 2 72000 GiniCoefA VBA UDF 394 GiniCoeff XLL 12

Conclusions:

• For small amounts of data there is no point in optimising the calculation: any method is fast enough.
• Finding the best algorithm is often more important that using faster technology.
• An efficient UDF is much faster than Array formulae for this calculation.
• C++ XLL can be a lot faster than VBA.

More:

If you want an efficient ready-made Excel Gini Coefficient solution try SpeedTools Extras, a library of Excel functions including GINICOEFF

http://www.decisionmodels.com/FastExcelV3SpeedTools_buy.htm

For more information on Gini Coefficients see:

http://en.wikipedia.org/wiki/Gini_coefficient

http://mathworld.wolfram.com/GiniCoefficient.html

About these ads
This entry was posted in UDF, VBA, XLL. Bookmark the permalink.

### 7 Responses to Fast Gini – calculating Gini coefficients in Excel efficiently

1. Matias says:

Hello,
Nice function, but when rewriting GiniCoefA as a VBA UDF I have a problem with the QSortVar subfunction, which appears as undefined.

Thanks in advance!

• fastexcel says:

Matias,
I have added the Quicksort code.

• Matias says:

Wow, that was fast!

But I’ve got compilation errors in the Quicksort VBA Sub, on lines:

While InputValues(jStart2, 1) < v1 And jStart2 v1 And jEnd2 > jStart
(end of instruction problem?)

If jStart2 jStart Then QSortVar InputValues, jStart, jEnd2
(missing operator between jStart2 and jStart ?)

Thanks again!

• fastexcel says:

OK, I nearly always goof when copy-pasting VBA code into HTML. I have reworked it – hopefully its OK now.

• Matias says:

Nice and clean :)

Thanks a lot, U saved my week!

2. scott says:

What if I wanted to do the opposite? I’ve got the gini coefficient and average income for a country, but I want to know % of population below X income level.

• Bart says:

Scott
What you ask for is not possible: there are multiple income distributions that can lead to the same level of inequality (gini-coëf.).