trying to work out some regexp issues in lasso9

Tim Taplin
So, I've always steered clear of regex, but I know there's a place for it. I'm working on updating the very useful csv tag on tagswap to be Lasso 9 friendly and am running into some regex-specific issues.

The original tag uses a regex to parse each line of the csv file on loading. The regex has a glitch: it creates an array entry for each comma, and that entry contains the comma itself. The next processing step handles this nicely in Lasso 8, but in Lasso 9 the same regex creates an array entry that is null instead.

Since a csv file can contain both empty and null values, I can't just look for empty array values and delete them. I've been trying to figure out why the regex creates this extra entry, and whether there is some reason for the change in behavior. I've also tried to resolve the issue at the regex level. All with no success and much hair pulling.

Here is the basic code and the regex as it stands, lightly ported to Lasso 9 syntax, where the local #line contains the line string.

                local(field = string() )
                local(i = null)
                local(row = array())
                local(linesplit = string_findregexp( #line, -find = '"(?:[^"]|"")*"|[^,]*|,'))
               
                iterate( #linesplit, #i )
                        if(#i == ',')
                                #row->insertlast(#field)
                                #field = ''
                               
                        else(#i->beginswith('"') && #i->endswith('"'))
                                #field += #i->substring(2, #i->size - 2)->replace('""', '"')&;
                               
                        else
                                #field += #i
                        /if
                /iterate

I think that the issue may relate in some way to the note in the documentation regarding grouping:
If groups are defined in the -Find expression then the output contains the entire search result followed by each of the sub-groups. If there were 2 matches of the expressions and 2 sub-groups then the array contains a total of 6 items.

As I read the regex, it is searching for
        (a leading doublequote followed by zero or more repetitions of either a non-doublequote character or two doublequotes, up to the closing doublequote)
        or
        (any number of non-comma characters)
        or
        (a comma)
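
One way to check this reading is to apply the pattern at a single position and see which alternative wins. The sketch below is an illustration only: it uses Python's `re` module as a stand-in for Lasso's regex engine (both try alternatives left to right), and `match_at` is a hypothetical helper, not part of the csv tag.

```python
import re

# The pattern from the post, unchanged. Python's re module is only a stand-in
# for Lasso's regex engine; both try the alternatives from left to right.
PATTERN = re.compile(r'"(?:[^"]|"")*"|[^,]*|,')

def match_at(text, pos=0):
    """Return the text matched when the pattern is applied exactly at pos."""
    # [^,]* can match zero characters, so a match always exists.
    return PATTERN.match(text, pos).group(0)

print(repr(match_at('"a ""b"" c",d')))  # quoted field, doubled quotes included
print(repr(match_at('abc,def')))        # unquoted run up to the comma
print(repr(match_at('abc,def', 3)))     # applied at the comma itself
```

Because `[^,]*` is allowed to match nothing, it succeeds with an empty string at the comma, so the final `,` alternative is never reached from that position.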

I think that my issue is that I'm getting the comma match and the grouped match for the quotes, but I can't figure out a way to remove that element without breaking the parsing of unquoted numeric, null, or empty values.

I thought I remembered that there were some differences in regex behavior in Lasso 9, but I can't find those discussions or any documentation referring to the differences.

Any help would be appreciated.

Tim Taplin

#############################################################
This message is sent to you because you are subscribed to
  the mailing list Lasso
[hidden email]
To unsubscribe, E-mail to: <[hidden email]>
Send administrative queries to  <[hidden email]>
Re: trying to work out some regexp issues in lasso9

Brad Lindsay-2
On Jan 15, 2013, at 7:31 PM, Tim Taplin <[hidden email]> wrote:

> Here is the basic code and the regex as it stands mildly ported to lasso9 syntax, where the local #line contains the line string.
>
> local(field = string() )
> local(i = null)
> local(row = array())
> local(linesplit = string_findregexp( #line, -find = '"(?:[^"]|"")*"|[^,]*|,'))
>
> iterate( #linesplit, #i )
> if(#i == ',')
> #row->insertlast(#field)
> #field = ''
>
> else(#i->beginswith('"') && #i->endswith('"'))
> #field += #i->substring(2, #i->size - 2)->replace('""', '"')&;
>
> else
> #field += #i
> /if
> /iterate
>
> I think that the issue may relate in some way to the note in the documentation regarding grouping:
> If groups are defined in the -Find expression then the output contains the entire search result followed by each of the sub-groups. If there were 2 matches of the expressions and 2 sub-groups then the array contains a total of 6 items.
>
> As I read the regex it is searching for
> (a leading doublequote followed by zero or more repetitions of either a non-doublequote character or two doublequotes, up to the closing doublequote)
> or
> (any number of non-comma characters)
> or
> (a comma)
>
> I think that my issue is that I'm getting the comma match and the grouped match for the quotes, but I can't figure out a way to remove that element without breaking the parsing of unquoted numeric, null, or empty values.
>
> I thought I remembered that there were some differences in regex behavior in Lasso 9, but I can't find those discussions or any documentation referring to the differences.

I don't think there's a difference in the way regex matching works, though I have encountered some differences in how to use the regex type / methods. I got your example to work by replacing the [if(#i == ',')] with [if(#i == '')], and it seems to handle null values just fine. I then went in and made some other Lasso 9 updates:

local(line) = `"Round Starts",,"Session ID","Round ID","Subject ""ID""","Subject #",`

local(field = '')
local(row = array)
local(linesplit = string_findregexp( #line, -find = `"(?:[^"]|"")*"|[^,]*|,`))

with i in #linesplit do {
        if(#i == '') => {
                #row->insertlast(#field)
                #field = ''
        else(#i->beginswith('"') && #i->endswith('"'))
                #field->append(#i->substring(2, #i->size - 2)->replace('""', '"')&)
        else
                #field->append(#i)
        }
}
#row
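
For anyone who wants to see why testing for '' works, here is a rough Python sketch of the same loop. It rests on an assumption that is not confirmed anywhere in this thread: that Lasso 9's matcher steps one character past every zero-length match, the way ICU/Java-style matchers do. Python's own `finditer` iterates differently, so the stepping is done by hand. `find_all_stepping` and `parse_row` are hypothetical names used only for illustration.

```python
import re

PATTERN = re.compile(r'"(?:[^"]|"")*"|[^,]*|,')

def find_all_stepping(line):
    """Collect matches, stepping one character past every empty match.

    Assumption: this mirrors an ICU/Java-style matcher, which is what
    Lasso 9 appears to use; it is not taken from Lasso's source.
    """
    matches, pos = [], 0
    while pos <= len(line):
        m = PATTERN.search(line, pos)  # always matches at pos ([^,]* may be empty)
        matches.append(m.group(0))
        # A zero-length match would repeat forever, so skip one character.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return matches

def parse_row(line):
    """Build the row the way the loop above does: '' marks the end of a field."""
    row, field = [], ''
    for item in find_all_stepping(line):
        if item == '':
            row.append(field)
            field = ''
        elif len(item) >= 2 and item.startswith('"') and item.endswith('"'):
            # Strip the outer quotes and unescape doubled quotes.
            field += item[1:-1].replace('""', '"')
        else:
            field += item
    return row

line = '"Round Starts",,"Session ID","Round ID","Subject ""ID""","Subject #",'
print(parse_row(line))
```

Under that assumption, each comma shows up as an empty match (the empty `[^,]*` wins and the matcher then steps over the comma), and one more empty match appears at the end of the line, so every field is closed by a '' entry. An empty field between two commas simply produces two '' entries in a row.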


This seems to work for me. I also have a method that extends the [file] type that will parse CSV formatted files that I've passed on to Tim and would be willing to pass on to anyone else interested.

Brad