Parsing Functions from Excel Formulas using VBA: Is MID or a Byte array the best method?

As part of extending the performance profiling abilities of FastExcel, I wanted to develop a Function Profiler Map. A key component of this is to extract the names of the functions embedded in Excel formulas.

So I experimented with some different approaches:

  • using Rob Van Gelder’s AudXL formula parser
  • using the MID function to scan through character by character
  • using byte arrays

The Test Data

For performance testing I am using a large worksheet with 638K used cells, 103K constants and 413K formulas. There are 3 functions on this sheet (710K INDEX, 5 SUM and 51K IF).

For validity testing I am using a small worksheet with some tricky formulas:

ParseFormulas1

Rob Van Gelder’s AudXL formula parser

Rob converted a javascript-based Excel formula parser  written by Eric Bachtal to VBA. You can download AudXL.xla from here. It is a useful tool for breaking formulas apart so that they are more readable.

So I started with that as a performance baseline: it takes 130 seconds to parse the large test worksheet.

But of course a general purpose formula parser should be slower than a purpose-built parser that only finds the functions.

Rules for Finding Functions in Formula strings

  • Function names are always followed by open bracket “(“
  • Function Names are always preceded by an operator
    +-,/*(=><& :!^
  • Formula text enclosed in single or double quotes cannot contain a function

Assumptions

  • Formulas to be parsed will be obtained using Range.FormulaR1C1.
  • This means that the formula string will will use American English settings and native function names.
  • And I only have to parse formulas that have been validated by Excel.
  • The full range of Unicode characters are allowed, so I can’t just use ASCII codes.

Download the VBA Code

You can download the routines containing the VBA from here

The First attempt: using Byte arrays

In a previous post I discussed using Byte arrays as an efficient method for processing all the characters in a string.
This is a good method for handling Unicode strings – each character gets converted into 2 separate bytes (0 to 255) which uniquely define the character and cover all the possible National Language characters in the world.

So “(” for instance is always 40 and 0 and


Dim abFormula() as Byte
Dim str1 as string
str1="=NA()"
abFormula=str1


produces:

ParseFormulas2

Note that the Byte arrays produced by assigning a string to a byte array are always zero-based.

You can also do the reverse: assigning a byte array to a string converts back.

The First Algorithm

So here is an algorithm that follows the rules:

Scan each character in the byte array from left to right:
If the character is a single or double quote ignore all subsequent characters until there is another single or double quote.
if the character is in the list of operators (abStartChars) then set a flag
if the character is NOT in the list of operators check if its a “(” and
if it IS a “(” and there are one or more non-operator characters preceding it we have found a function name.

The code looks like this:


Function GetFunc1(abFormula() As Byte, abStartChars() As Byte, abEndChar() As Byte, _
 abQuotes() As Byte, abSQ() As Byte, jStart As Long, jEnd As Long) As String
 '
 ' search for a function name within a byte array starting at the jStart byte position
 ' byte array is zero based pairs of bytes
 ' abFormula is a byte array version of the formula
 ' abStartChars is a byte array containing characters that can precede the function name
 ' abEndChar is a byte array containing "("
 ' abQuotes is a byte array contianing a double quote
 ' abSQ is a byte array containing a single quote
 '
 ' returns name of function as a string and jEnd as the byte position of "("
 '
 Dim j As Long
 Dim k As Long
 Dim jStartChar As Long
 Dim jFirst As Long
 Dim blStart As Boolean
 Dim blString As Boolean
 Dim abFunc() As Byte
 '
 jFirst = jStart + 2
 blString = False
 For j = jStart + 2 To (UBound(abFormula) - 2) Step 2
 '
 ' skip text strings
 '
 If (abFormula(j) = abQuotes(0) And abFormula(j + 1) = abQuotes(1)) _
 Or (abFormula(j) = abSQ(0) And abFormula(j + 1) = abSQ(1)) Then
 blString = Not blString
 End If
 If Not blString Then
 '
 ' look for non startchar
 '
 blStart = False
 For jStartChar = 0 To UBound(abStartChars) Step 2
 If abFormula(j) = abStartChars(jStartChar) _
 And abFormula(j + 1) = abStartChars(jStartChar + 1) Then
 blStart = True
 jFirst = j + 2
 Exit For
 End If
 Next jStartChar
 If Not blStart Then
 If abFormula(j) = abEndChar(0) And abFormula(j + 1) = abEndChar(1) Then
 '
 ' we have a (
 '
 If j > jFirst Then
 '
 ' we have a function
 '
 jEnd = j
 '
 ' jend points to first byte of the ( character
 ' jfirst points to the first byte of the function name
 ' convert slice of formula to function name string
 '
 ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte
 For k = 0 To UBound(abFunc)
 abFunc(k) = abFormula(jFirst + k)
 Next k
 GetFunc1 = abFunc
 Exit Function
 '
 ElseIf abFormula(jFirst) = abEndChar(0) _
 And abFormula(jFirst + 1) = abEndChar(1) Then
 jFirst = jFirst + 2
 End If
 End If
 End If
 End If
 Next j
 End Function

And the driver routine just loops on all the formulas in the worksheet, calling the function parsing routine and storing/counting the functions found in a dictionary.


Sub testing1()
 Dim strFormula As String
 Dim strFunc As String
 Dim abFormula() As Byte
 Dim abStartChars() As Byte
 Dim abEndChar() As Byte
 Dim abQuotes() As Byte
 Dim abSQ() As Byte
 Dim jStart As Long
 Dim jEnd As Long
 Dim oFormulas As Range
 Dim oCell As Range
 Dim dTime As Double
 Dim dicFuncs As New Dictionary
 '
 ''' characters that can come before the start of a Function name
 Const strStartChars1 As String = "+-,/*=><& :!^" & vbLf
 '
 dTime = MicroTimer
 '
 abStartChars = strStartChars1
 abEndChar = "("
 abQuotes = Chr(34)
 abSQ = Chr(39)
 '
 dicFuncs.CompareMode = TextCompare
 '
 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
 '
 For Each oCell In oFormulas
 strFormula = oCell.FormulaR1C1
 abFormula = strFormula
 jStart = 0
 jEnd = 0
 Do
 strFunc = UCase(GetFunc1(abFormula, abStartChars, abEndChar, abQuotes, abSQ, jStart, jEnd))
 If LenB(strFunc) = 0 Then Exit Do
 jStart = jEnd
 '
 ' add Function to dictionary and count occurrences
 '
 If dicFuncs.Exists(strFunc) Then
 dicFuncs(strFunc) = dicFuncs(strFunc) + 1
 Else
 dicFuncs.Add strFunc, 1
 End If
 Loop
 Next oCell
 '
 MsgBox MicroTimer - dTime
 End Sub

This version takes 48 seconds: better but with room for improvement.

The (nearly) Final Algorithm

  • I don’t need to worry about upper-lower case Function names, since Excel already takes care of that, so I can use the default Binary Compare for the dictionary
  • All the special characters for operators etc always have byte 2 =0 so I only need to test that once per character rather on each comparison.
  • If I first search left to right for the “(” character and then work backwards looking for an operator character the function does many fewer comparison operations, because there is only 1 end character but many start characters.

The resulting VBA code looks like this:


Function GetFunc3(abFormula() As Byte, abStartChars() As Byte, jStart As Long, jEnd As Long) As String
 '
 ' search for a function name within a byte array bstarting at the jStart byte position
 ' byte array is zero based pairs of bytes
 ' abFormula is a byte array version of the formula
 ' abstartchars is a byte array containing characters that can precede the function name
 '
 ' returns a string name of function and jEnd as the position of "("
 '
 Dim j As Long
 Dim k As Long
 Dim jj As Long
 Dim jStartChar As Long
 Dim jFirst As Long
 Dim blStart As Boolean
 Dim blDoubleQ As Boolean
 Dim blSingleQ As Boolean
 Dim abFunc() As Byte
 '
 jFirst = jStart + 2
 blDoubleQ = False
 For j = jStart + 2 To (UBound(abFormula) - 2) Step 2
 '
 ' start and end characters always have byte 2 =0
 '
 If abFormula(j + 1) = 0 Then
 '
 ' skip text strings
 '
 If abFormula(j) = 39 Then blSingleQ = Not blSingleQ
 If Not blSingleQ Then
 If abFormula(j) = 34 Then blDoubleQ = Not blDoubleQ
 If Not blDoubleQ Then
 '
 ' look for (
 '
 If abFormula(j) = 40 Then
 '
 ' we have a (
 ' look backwards for a startchar
 '
 blStart = False
 For jj = j - 2 To jStart Step -2
 For jStartChar = 0 To UBound(abStartChars) Step 2
 If abFormula(jj) = abStartChars(jStartChar) Then
 blStart = True
 jFirst = jj + 2
 Exit For
 End If
 Next jStartChar
 If blStart Then Exit For
 Next jj
 If blStart Then
 If j > jFirst Then
 '
 ' we have a function
 '
 jEnd = j
 Exit For
 ElseIf abFormula(jFirst) = 40 Then
 jFirst = jFirst + 2
 End If
 End If
 End If
 End If
 End If
 End If
 Next j
 If blStart And jEnd > jFirst Then
 '
 ' convert slice of formula to function name string
 ' jend points to first byte of the ( character
 ' jfirst points to the first byte of the function name
 '
 ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte
 For k = 0 To UBound(abFunc)
 abFunc(k) = abFormula(jFirst + k)
 Next k
 GetFunc3 = abFunc
 End If
 End Function

And the corresponding driver routine looks like this:


Sub testing3()
 Dim strFunc As String
 Dim abFormula() As Byte
 Dim abStartChars() As Byte
 Dim jStart As Long
 Dim jEnd As Long
 Dim oFormulas As Range
 Dim oCell As Range
 Dim dTime As Double
 Dim dicFuncs As New Dictionary
 '
 ''' characters that can come before the start of a Function name
 Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf
 '
 dTime = MicroTimer
 '
 abStartChars = strStartChars2
 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
 '
 For Each oCell In oFormulas
 abFormula = oCell.FormulaR1C1
 jStart = 0
 jEnd = 0
 Do
 strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd)
 If LenB(strFunc) = 0 Then Exit Do
 jStart = jEnd
 If dicFuncs.Exists(strFunc) Then
 dicFuncs(strFunc) = dicFuncs(strFunc) + 1
 Else
 dicFuncs.Add strFunc, 1
 End If
 Loop
 Next oCell
 MsgBox MicroTimer - dTime
 End Sub

This version takes 11.6 seconds: but maybe I can make it faster?

Using MID$() instead of a Byte array

What happens if I forget about all this Byte array stuff and just implement the same algorithm using MID$() to extract each character from the formula and INSTR to check against the list of start operators (start characters)?

The GetFunc VBA code looks like this:


Function GetFunc4(strFormula As String, strStartChars As String, strEndChar As String, strQuotes As String, strSQ As String, jStart As Long, jEnd As Long) As String
 '
 ' search for a function name within a formula string starting at the jStart character position
 '
 ' strStartChars is a string containing characters that can precede the function name
 '
 ' returns a string name of function and jEnd as the position of "("
 '
 Dim j As Long
 Dim k As Long
 Dim jj As Long
 Dim jStartChar As Long
 Dim jFirst As Long
 Dim blStart As Boolean
 Dim blDoubleQ As Boolean
 Dim blSingleQ As Boolean
 Dim abFunc() As Byte
 Dim strChar As String
 Dim iStartChar As Long
 '
 jFirst = jStart + 1
 blDoubleQ = False
 For j = jStart + 1 To (LenB(strFormula) - 1)
 strChar = Mid$(strFormula, j, 1)
 '
 ' skip text strings
 '
 If strChar = strSQ Then blSingleQ = Not blSingleQ
 If Not blSingleQ Then
 If strChar = strQuotes Then blDoubleQ = Not blDoubleQ
 If Not blDoubleQ Then
 '
 ' look for (
 '
 If strChar = strEndChar Then
 '
 ' we have a (
 ' look backwards for a startchar
 '
 blStart = False
 For jj = j - 1 To jStart Step -1
 strChar = Mid$(strFormula, jj, 1)
 iStartChar = InStrB(strStartChars, strChar)
 If iStartChar > 0 Then
 blStart = True
 jFirst = jj + 1
 Exit For
 End If
 Next jj
 If blStart Then
 If j > jFirst Then
 '
 ' we have a function
 '
 jEnd = j
 '
 ' convert slice of formula to function name string
 ' jend points to first byte of the ( character
 ' jfirst points to the first byte of the function name
 GetFunc4 = Mid$(strFormula, jFirst, jEnd - jFirst)
 Exit Function
 ElseIf Mid$(strFormula, jFirst, 1) = strEndChar Then
 jFirst = jFirst + 1
 End If
 End If
 End If
 End If
 End If
 Next j
 End Function

Using MID$ and INSTR is slower: it takes 21.8 seconds.

Optimising the Driver Routine.

OK so I am reasonably happy with the parsing routine: I have gone from 130 seconds for a generic parsing routine to 11.6 seconds for the specialised Byte array routine.

But the driver routine looks at every single formula on the worksheet, and we know that on most large worksheets a large percentage of the formulas are copied.

So I can use the dictionary approach to find the distinct formulas (use R1C1 rather than A1 mode because copied formulas are identical in R1C1) and just parse those:

Testing3C also makes some more speed improvements by looping on each area and getting the formulas into a variant array rather than looping directly on the cells.


Sub testing3c()
 Dim strFunc As String
 Dim abFormula() As Byte
 Dim abStartChars() As Byte
 Dim jStart As Long
 Dim jEnd As Long
 Dim oFormulas As Range
 Dim oCell As Range
 Dim oArea As Range
 Dim vArr As Variant
 Dim vF As Variant
 Dim dTime As Double
 Dim dtotTime As Double
 Dim dicFuncs As New Dictionary
 Dim dicFormulas As New Dictionary
 Dim j As Long
 Dim k As Long
 '
 Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf ''' characters that can come before the start of a Function name
 '
 dTime = MicroTimer
 '
 abStartChars = strStartChars2
 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23)
 '
 ' find distinct formulas in areas and count occurrences
 '
 For Each oArea In oFormulas.Areas
 '
 ' get a block of formulas
 '
 vArr = oArea.FormulaR1C1
 If IsArray(vArr) Then
 For k = 1 To UBound(vArr, 2)
 For j = 1 To UBound(vArr)
 If dicFormulas.Exists(vArr(j, k)) Then
 dicFormulas(vArr(j, k)) = dicFormulas(vArr(j, k)) + 1
 Else
 dicFormulas.Add vArr(j, k), 1
 End If
 Next j
 Next k
 Else
 If dicFormulas.Exists(vArr) Then
 dicFormulas(vArr) = dicFormulas(vArr) + 1
 Else
 dicFormulas.Add vArr, 1
 End If
 End If
 Next oArea
 '
 ' parse only the distinct formulas
 '
 For Each vF In dicFormulas
 abFormula = vF
 jStart = 0
 jEnd = 0
 Do
 strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd)
 If LenB(strFunc) = 0 Then Exit Do
 jStart = jEnd
 If dicFuncs.Exists(strFunc) Then
 dicFuncs(strFunc) = dicFuncs(strFunc) + dicFormulas(vF)
 Else
 dicFuncs.Add strFunc, dicFormulas(vF)
 End If
 Loop
 Next vF
 MsgBox MicroTimer - dTime
 End Sub

Now this takes only 3.5 seconds: adding a formula string to a dictionary is much faster than parsing it!

Conclusion

  • Byte Arrays can be significantly faster for character by character operations rather than using MID and INSTR.
  • Looking at how the first attempt at an algorithm works can give you clues about how to improve it.
  • Even using Byte arrays string parsing operations are slow in VBA.
  • Using a dictionary its really fast to find the distinct formulas even on a large worksheet.

Challenge

OK guys: who can write a faster VBA function parsing routine?

Download the routines from here

Advertisements
This entry was posted in Calculation, VBA. Bookmark the permalink.

2 Responses to Parsing Functions from Excel Formulas using VBA: Is MID or a Byte array the best method?

  1. Kevin Lee says:

    hi, audXL.xla is no longer available? could u post it on your website? ideally with the sourcecode

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s