As part of extending the performance profiling abilities of FastExcel, I wanted to develop a Function Profiler Map. A key component of this is to extract the names of the functions embedded in Excel formulas.
So I experimented with some different approaches:
- using Rob Van Gelder’s AudXL formula parser
- using the MID function to scan through character by character
- using byte arrays
The Test Data
For performance testing I am using a large worksheet with 638K used cells, 103K constants and 413K formulas. There are 3 functions on this sheet (710K INDEX, 5 SUM and 51K IF).
For validity testing I am using a small worksheet with some tricky formulas:
Rob Van Gelder’s AudXL formula parser
Rob converted a javascript-based Excel formula parser written by Eric Bachtal to VBA. You can download AudXL.xla from here. It is a useful tool for breaking formulas apart so that they are more readable.
So I started with that as a performance baseline: it takes 130 seconds to parse the large test worksheet.
But of course a general purpose formula parser should be slower than a purpose-built parser that only finds the functions.
Rules for Finding Functions in Formula strings
- Function names are always followed by open bracket “(“
- Function Names are always preceded by an operator
+-,/*(=><& :!^
- Formula text enclosed in single or double quotes cannot contain a function
Assumptions
- Formulas to be parsed will be obtained using Range.FormulaR1C1.
- This means that the formula string will will use American English settings and native function names.
- And I only have to parse formulas that have been validated by Excel.
- The full range of Unicode characters are allowed, so I can’t just use ASCII codes.
Download the VBA Code
You can download the routines containing the VBA from here
The First attempt: using Byte arrays
In a previous post I discussed using Byte arrays as an efficient method for processing all the characters in a string.
This is a good method for handling Unicode strings – each character gets converted into 2 separate bytes (0 to 255) which uniquely define the character and cover all the possible National Language characters in the world.
So “(” for instance is always 40 and 0 and
Dim abFormula() as Byte
Dim str1 as string
str1="=NA()"
abFormula=str1
produces:
Note that the Byte arrays produced by assigning a string to a byte array are always zero-based.
You can also do the reverse: assigning a byte array to a string converts back.
The First Algorithm
So here is an algorithm that follows the rules:
Scan each character in the byte array from left to right:
If the character is a single or double quote ignore all subsequent characters until there is another single or double quote.
if the character is in the list of operators (abStartChars) then set a flag
if the character is NOT in the list of operators check if its a “(” and
if it IS a “(” and there are one or more non-operator characters preceding it we have found a function name.
The code looks like this:
Function GetFunc1(abFormula() As Byte, abStartChars() As Byte, abEndChar() As Byte, _
abQuotes() As Byte, abSQ() As Byte, jStart As Long, jEnd As Long) As String
'
' search for a function name within a byte array starting at the jStart byte position
' byte array is zero based pairs of bytes
' abFormula is a byte array version of the formula
' abStartChars is a byte array containing characters that can precede the function name
' abEndChar is a byte array containing "("
' abQuotes is a byte array contianing a double quote
' abSQ is a byte array containing a single quote
'
' returns name of function as a string and jEnd as the byte position of "("
'
Dim j As Long
Dim k As Long
Dim jStartChar As Long
Dim jFirst As Long
Dim blStart As Boolean
Dim blString As Boolean
Dim abFunc() As Byte
'
jFirst = jStart + 2
blString = False
For j = jStart + 2 To (UBound(abFormula) - 2) Step 2
'
' skip text strings
'
If (abFormula(j) = abQuotes(0) And abFormula(j + 1) = abQuotes(1)) _
Or (abFormula(j) = abSQ(0) And abFormula(j + 1) = abSQ(1)) Then
blString = Not blString
End If
If Not blString Then
'
' look for non startchar
'
blStart = False
For jStartChar = 0 To UBound(abStartChars) Step 2
If abFormula(j) = abStartChars(jStartChar) _
And abFormula(j + 1) = abStartChars(jStartChar + 1) Then
blStart = True
jFirst = j + 2
Exit For
End If
Next jStartChar
If Not blStart Then
If abFormula(j) = abEndChar(0) And abFormula(j + 1) = abEndChar(1) Then
'
' we have a (
'
If j > jFirst Then
'
' we have a function
'
jEnd = j
'
' jend points to first byte of the ( character
' jfirst points to the first byte of the function name
' convert slice of formula to function name string
'
ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte
For k = 0 To UBound(abFunc)
abFunc(k) = abFormula(jFirst + k)
Next k
GetFunc1 = abFunc
Exit Function
'
ElseIf abFormula(jFirst) = abEndChar(0) _
And abFormula(jFirst + 1) = abEndChar(1) Then
jFirst = jFirst + 2
End If
End If
End If
End If
Next j
End Function
And the driver routine just loops on all the formulas in the worksheet, calling the function parsing routine and storing/counting the functions found in a dictionary.
Sub testing1()
Dim strFormula As String
Dim strFunc As String
Dim abFormula() As Byte
Dim abStartChars() As Byte
Dim abEndChar() As Byte
Dim abQuotes() As Byte
Dim abSQ() As Byte
Dim jStart As Long
Dim jEnd As Long
Dim oFormulas As Range
Dim oCell As Range
Dim dTime As Double
Dim dicFuncs As New Dictionary
'
''' characters that can come before the start of a Function name
Const strStartChars1 As String = "+-,/*=><& :!^" & vbLf
'
dTime = MicroTimer
'
abStartChars = strStartChars1
abEndChar = "("
abQuotes = Chr(34)
abSQ = Chr(39)
'
dicFuncs.CompareMode = TextCompare
'
Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
'
For Each oCell In oFormulas
strFormula = oCell.FormulaR1C1
abFormula = strFormula
jStart = 0
jEnd = 0
Do
strFunc = UCase(GetFunc1(abFormula, abStartChars, abEndChar, abQuotes, abSQ, jStart, jEnd))
If LenB(strFunc) = 0 Then Exit Do
jStart = jEnd
'
' add Function to dictionary and count occurrences
'
If dicFuncs.Exists(strFunc) Then
dicFuncs(strFunc) = dicFuncs(strFunc) + 1
Else
dicFuncs.Add strFunc, 1
End If
Loop
Next oCell
'
MsgBox MicroTimer - dTime
End Sub
This version takes 48 seconds: better but with room for improvement.
The (nearly) Final Algorithm
- I don’t need to worry about upper-lower case Function names, since Excel already takes care of that, so I can use the default Binary Compare for the dictionary
- All the special characters for operators etc always have byte 2 =0 so I only need to test that once per character rather on each comparison.
- If I first search left to right for the “(” character and then work backwards looking for an operator character the function does many fewer comparison operations, because there is only 1 end character but many start characters.
The resulting VBA code looks like this:
Function GetFunc3(abFormula() As Byte, abStartChars() As Byte, jStart As Long, jEnd As Long) As String
'
' search for a function name within a byte array bstarting at the jStart byte position
' byte array is zero based pairs of bytes
' abFormula is a byte array version of the formula
' abstartchars is a byte array containing characters that can precede the function name
'
' returns a string name of function and jEnd as the position of "("
'
Dim j As Long
Dim k As Long
Dim jj As Long
Dim jStartChar As Long
Dim jFirst As Long
Dim blStart As Boolean
Dim blDoubleQ As Boolean
Dim blSingleQ As Boolean
Dim abFunc() As Byte
'
jFirst = jStart + 2
blDoubleQ = False
For j = jStart + 2 To (UBound(abFormula) - 2) Step 2
'
' start and end characters always have byte 2 =0
'
If abFormula(j + 1) = 0 Then
'
' skip text strings
'
If abFormula(j) = 39 Then blSingleQ = Not blSingleQ
If Not blSingleQ Then
If abFormula(j) = 34 Then blDoubleQ = Not blDoubleQ
If Not blDoubleQ Then
'
' look for (
'
If abFormula(j) = 40 Then
'
' we have a (
' look backwards for a startchar
'
blStart = False
For jj = j - 2 To jStart Step -2
For jStartChar = 0 To UBound(abStartChars) Step 2
If abFormula(jj) = abStartChars(jStartChar) Then
blStart = True
jFirst = jj + 2
Exit For
End If
Next jStartChar
If blStart Then Exit For
Next jj
If blStart Then
If j > jFirst Then
'
' we have a function
'
jEnd = j
Exit For
ElseIf abFormula(jFirst) = 40 Then
jFirst = jFirst + 2
End If
End If
End If
End If
End If
End If
Next j
If blStart And jEnd > jFirst Then
'
' convert slice of formula to function name string
' jend points to first byte of the ( character
' jfirst points to the first byte of the function name
'
ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte
For k = 0 To UBound(abFunc)
abFunc(k) = abFormula(jFirst + k)
Next k
GetFunc3 = abFunc
End If
End Function
And the corresponding driver routine looks like this:
Sub testing3()
Dim strFunc As String
Dim abFormula() As Byte
Dim abStartChars() As Byte
Dim jStart As Long
Dim jEnd As Long
Dim oFormulas As Range
Dim oCell As Range
Dim dTime As Double
Dim dicFuncs As New Dictionary
'
''' characters that can come before the start of a Function name
Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf
'
dTime = MicroTimer
'
abStartChars = strStartChars2
Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
'
For Each oCell In oFormulas
abFormula = oCell.FormulaR1C1
jStart = 0
jEnd = 0
Do
strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd)
If LenB(strFunc) = 0 Then Exit Do
jStart = jEnd
If dicFuncs.Exists(strFunc) Then
dicFuncs(strFunc) = dicFuncs(strFunc) + 1
Else
dicFuncs.Add strFunc, 1
End If
Loop
Next oCell
MsgBox MicroTimer - dTime
End Sub
This version takes 11.6 seconds: but maybe I can make it faster?
Using MID$() instead of a Byte array
What happens if I forget about all this Byte array stuff and just implement the same algorithm using MID$() to extract each character from the formula and INSTR to check against the list of start operators (start characters)?
The GetFunc VBA code looks like this:
Function GetFunc4(strFormula As String, strStartChars As String, strEndChar As String, strQuotes As String, strSQ As String, jStart As Long, jEnd As Long) As String
'
' search for a function name within a formula string starting at the jStart character position
'
' strStartChars is a string containing characters that can precede the function name
'
' returns a string name of function and jEnd as the position of "("
'
Dim j As Long
Dim k As Long
Dim jj As Long
Dim jStartChar As Long
Dim jFirst As Long
Dim blStart As Boolean
Dim blDoubleQ As Boolean
Dim blSingleQ As Boolean
Dim abFunc() As Byte
Dim strChar As String
Dim iStartChar As Long
'
jFirst = jStart + 1
blDoubleQ = False
For j = jStart + 1 To (LenB(strFormula) - 1)
strChar = Mid$(strFormula, j, 1)
'
' skip text strings
'
If strChar = strSQ Then blSingleQ = Not blSingleQ
If Not blSingleQ Then
If strChar = strQuotes Then blDoubleQ = Not blDoubleQ
If Not blDoubleQ Then
'
' look for (
'
If strChar = strEndChar Then
'
' we have a (
' look backwards for a startchar
'
blStart = False
For jj = j - 1 To jStart Step -1
strChar = Mid$(strFormula, jj, 1)
iStartChar = InStrB(strStartChars, strChar)
If iStartChar > 0 Then
blStart = True
jFirst = jj + 1
Exit For
End If
Next jj
If blStart Then
If j > jFirst Then
'
' we have a function
'
jEnd = j
'
' convert slice of formula to function name string
' jend points to first byte of the ( character
' jfirst points to the first byte of the function name
GetFunc4 = Mid$(strFormula, jFirst, jEnd - jFirst)
Exit Function
ElseIf Mid$(strFormula, jFirst, 1) = strEndChar Then
jFirst = jFirst + 1
End If
End If
End If
End If
End If
Next j
End Function
Using MID$ and INSTR is slower: it takes 21.8 seconds.
Optimising the Driver Routine.
OK so I am reasonably happy with the parsing routine: I have gone from 130 seconds for a generic parsing routine to 11.6 seconds for the specialised Byte array routine.
But the driver routine looks at every single formula on the worksheet, and we know that on most large worksheets a large percentage of the formulas are copied.
So I can use the dictionary approach to find the distinct formulas (use R1C1 rather than A1 mode because copied formulas are identical in R1C1) and just parse those:
Testing3C also makes some more speed improvements by looping on each area and getting the formulas into a variant array rather than looping directly on the cells.
Sub testing3c()
Dim strFunc As String
Dim abFormula() As Byte
Dim abStartChars() As Byte
Dim jStart As Long
Dim jEnd As Long
Dim oFormulas As Range
Dim oCell As Range
Dim oArea As Range
Dim vArr As Variant
Dim vF As Variant
Dim dTime As Double
Dim dtotTime As Double
Dim dicFuncs As New Dictionary
Dim dicFormulas As New Dictionary
Dim j As Long
Dim k As Long
'
Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf ''' characters that can come before the start of a Function name
'
dTime = MicroTimer
'
abStartChars = strStartChars2
Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
'
' find distinct formulas in areas and count occurrences
'
For Each oArea In oFormulas.Areas
'
' get a block of formulas
'
vArr = oArea.FormulaR1C1
If IsArray(vArr) Then
For k = 1 To UBound(vArr, 2)
For j = 1 To UBound(vArr)
If dicFormulas.Exists(vArr(j, k)) Then
dicFormulas(vArr(j, k)) = dicFormulas(vArr(j, k)) + 1
Else
dicFormulas.Add vArr(j, k), 1
End If
Next j
Next k
Else
If dicFormulas.Exists(vArr) Then
dicFormulas(vArr) = dicFormulas(vArr) + 1
Else
dicFormulas.Add vArr, 1
End If
End If
Next oArea
'
' parse only the distinct formulas
'
For Each vF In dicFormulas
abFormula = vF
jStart = 0
jEnd = 0
Do
strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd)
If LenB(strFunc) = 0 Then Exit Do
jStart = jEnd
If dicFuncs.Exists(strFunc) Then
dicFuncs(strFunc) = dicFuncs(strFunc) + dicFormulas(vF)
Else
dicFuncs.Add strFunc, dicFormulas(vF)
End If
Loop
Next vF
MsgBox MicroTimer - dTime
End Sub
Now this takes only 3.5 seconds: adding a formula string to a dictionary is much faster than parsing it!
Conclusion
- Byte Arrays can be significantly faster for character by character operations rather than using MID and INSTR.
- Looking at how the first attempt at an algorithm works can give you clues about how to improve it.
- Even using Byte arrays string parsing operations are slow in VBA.
- Using a dictionary its really fast to find the distinct formulas even on a large worksheet.
Challenge
OK guys: who can write a faster VBA function parsing routine?
Download the routines from here