Help with RegEx?

Discussion in 'Programmer Misc' started by SparkyGuy, Jan 6, 2008.

  1. SparkyGuy

    SparkyGuy Guest

    I want to build a regular expression that will find certain characters in a
    field. For example:

    i,n,t,u,o,n

    all need to be present (at least once) for the RegEx interpreter to label
    this search True. The order is not important, and case should be ignored.

    I tried

    [Ii][Nn][Tt][Uu][Oo][Nn]

    and

    [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn]

    No joy.

    Also tried use of ^ and $ but I'm not sure how to implement them, and whether
    or not they are required.

    So basically I'm stumbling around in the dark. But I want to learn. I've
    viewed several tutorials on-line, but this subject is so obtuse to me that
    it's difficult even getting started.

    Any suggestions would be greatly appreciated.

    Thanks!
     
    SparkyGuy, Jan 6, 2008
    #1

  2. SparkyGuy

    SparkyGuy Guest

    > [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn]

    It turns out that this does work to find terms with all these letters in this
    order even if there are other characters interspersed between them. Such as:

    intuition
    in3t5u7ition
    in tuitio9n
    inBBtuCCon

    Now I want to find terms that have all these characters in any order. Such
    as:

    noiiitunt
    tn8no9uitii
    otAAnuiBiit
    unt ioitni

    I guess this has something to do with the ^ and $ parsing metasymbols but I'm
    not knowledgeable enough on this topic to know how, exactly.
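    For reference, the in-order behaviour described above can be reproduced
    with a minimal Python sketch. Python's re module and the helper name
    has_letters_in_order are assumptions for illustration only; the thread
    does not say which engine the application uses. re.IGNORECASE stands in
    for the [Ii]-style pairs.

```python
import re

# The in-order pattern from the post; re.IGNORECASE replaces the [Ii] pairs.
IN_ORDER = re.compile(r"i.*n.*t.*u.*o.*n", re.IGNORECASE)


def has_letters_in_order(text: str) -> bool:
    """True if i, n, t, u, o, n occur in that order, with anything in between."""
    return IN_ORDER.search(text) is not None


in_order = ["intuition", "in3t5u7ition", "in tuitio9n", "inBBtuCCon"]
any_order = ["noiiitunt", "tn8no9uitii", "otAAnuiBiit", "unt ioitni"]

for word in in_order + any_order:
    print(word, has_letters_in_order(word))
```

    The first four strings match and the last four do not, which is why a
    single ordered pattern cannot cover the any-order case.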

    Any help would be greatly appreciated.

    Thanks!
     
    SparkyGuy, Jan 6, 2008
    #2

  3. SparkyGuy

    SM Ryan Guest

    SparkyGuy <> wrote:
    # I want to build a regular expression that will find certain characters in a
    # field. For example:
    #
    # i,n,t,u,o,n
    #
    # all need to be present (at least once) for the RegEx interpreter to label
    # this search True. The order is not important, and case should be ignored.
    #
    # I tried
    #
    # [Ii][Nn][Tt][Uu][Oo][Nn]
    #
    # and
    #
    # [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn]
    #
    # No joy.

    The software really isn't intended to be used this way. It would
    be simpler to conjoin a number of searches. Assuming an interface
    like
    numberofmatches = regexp(pattern,string)
    you can do something like
    regexp("[Ii]",string)>=1
    && regexp("[Nn]",string)>=2
    && regexp("[Tt]",string)>=1
    && regexp("[Uu]",string)>=1
    && regexp("[Oo]",string)>=1

    Some interfaces also allow a flag to ignore character case.
    regexpnocase("i",string)>=1
    && regexpnocase("n",string)>=2
    && regexpnocase("t",string)>=1
    && regexpnocase("u",string)>=1
    && regexpnocase("o",string)>=1
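    A minimal Python sketch of the conjoined-search idea, assuming "at least"
    semantics for the counts. The names REQUIRED, count_matches and
    has_all_letters are made up for illustration; count_matches plays the
    role of the hypothetical regexpnocase interface above.

```python
import re

# Minimum number of occurrences required for each letter (n is needed twice).
REQUIRED = {"i": 1, "n": 2, "t": 1, "u": 1, "o": 1}


def count_matches(pattern: str, text: str) -> int:
    """Number of non-overlapping, case-insensitive matches of pattern in text."""
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def has_all_letters(text: str) -> bool:
    """True if every required letter occurs at least its minimum number of times."""
    return all(
        count_matches(letter, text) >= need for letter, need in REQUIRED.items()
    )


for word in ["noiiitunt", "unt ioitni", "otAAnuiBiit", "intuit"]:
    print(word, has_all_letters(word))
```

    Note that a string with only one "n" fails this check, because the
    letter list contains "n" twice.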

    To do this in a single RE, you have to use all 360 permutations,
    [Ii].*[Nn].*[Tt]...
    |[Ii].*[Uu].*[Nn]...
    |...

    --
    SM Ryan http://www.rawbw.com/~wyrmwif/
    GERBILS
    GERBILS
    GERBILS
     
    SM Ryan, Jan 6, 2008
    #3
  4. SparkyGuy

    David Empson Guest

    SparkyGuy <> wrote:

    > > [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn]

    >
    > It turns out that this does work to find terms with all these letters in this
    > order even if there are other characters interspersed between them. Such as:
    >
    > intuition
    > in3t5u7ition
    > in tuitio9n
    > inBBtuCCon
    >
    > Now I want to find terms that have all these characters in any order. Such
    > as:
    >
    > noiiitunt
    > tn8no9uitii
    > otAAnuiBiit
    > unt ioitni


    To clarify: do you want at least one each of "I", "T", "U" and "O", and
    at least two "N"s, in any order and mixed with any other characters (or
    more of the same ones), ignoring case?

    Tricky. You can't do that compactly with a simple regular expression.
    An engine with lookahead assertions, such as a Perl-compatible one,
    can combine the tests, but a basic engine cannot.

    It would be best done in parallel, testing the string against five
    different regular expressions:

    [Ii]
    [Nn].*[Nn]
    [Tt]
    [Uu]
    [Oo]

    The string has to match all of these to pass.

    If you really don't need two "N"s then you can simplify the second test
    to be like the others. If you want a certain number of each letter, but
    in any position, then use the same general syntax as the second line.
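    The parallel test can be sketched in Python as follows. This is only an
    illustration: Python's re module and the helper names at_least and
    passes_all are assumptions, not something the application is known to
    offer.

```python
import re


def at_least(letter: str, count: int) -> str:
    """Pattern requiring `count` occurrences of `letter`, e.g. n.*n for count 2."""
    return ".*".join([letter] * count)


# One pattern per requirement; n is needed twice, the others once.
PATTERNS = [
    re.compile(at_least(letter, count), re.IGNORECASE)
    for letter, count in (("i", 1), ("n", 2), ("t", 1), ("u", 1), ("o", 1))
]


def passes_all(text: str) -> bool:
    """The string passes only if every pattern finds a match somewhere in it."""
    return all(p.search(text) for p in PATTERNS)


print(passes_all("unt ioitni"))
print(passes_all("otAAnuiBiit"))
```

    Changing a count in the list is all it takes to require a different
    number of some letter.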

    > I guess this has something to do with the ^ and $ parsing metasymbols but I'm
    > not knowledgeable enough on this topic to know how, exactly.


    Those just mean "start of string" and "end of string" respectively. For
    example, if you want to only match a string which starts with "I" or "i"
    then your regular expression is "^[Ii]".
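    A short Python sketch of the two anchors, for illustration only (the
    helper names are made up, and Python's re module is an assumption about
    the engine):

```python
import re

STARTS_WITH_I = re.compile(r"^[Ii]")  # ^ ties the match to the start of the string
ENDS_WITH_N = re.compile(r"[Nn]$")    # $ ties the match to the end of the string


def starts_with_i(text: str) -> bool:
    """True only when the very first character is I or i."""
    return STARTS_WITH_I.search(text) is not None


def ends_with_n(text: str) -> bool:
    """True only when the very last character is N or n."""
    return ENDS_WITH_N.search(text) is not None


print(starts_with_i("intuition"), starts_with_i("tuition"))
print(ends_with_n("intuition"), ends_with_n("intuit"))
```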

    --
    David Empson
     
    David Empson, Jan 6, 2008
    #4
  5. In article <>,
    SparkyGuy <> wrote:

    > I want to build a regular expression that will find certain characters in a
    > field.
    >
    > For example:
    >
    > i,n,t,u,o,n
    >
    > all need to be present (at least once) for the RegEx interpreter to label
    > this search True. The order is not important, and case should be ignored.


    The 'order is not important' part makes a regular expression a 'less
    than the ideal' solution for your problem.

    > I tried
    >
    > [Ii][Nn][Tt][Uu][Oo][Nn]
    >
    > and
    >
    > [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn]
    >
    > No joy.
    >
    > Also tried use of ^ and $ but I'm not sure how to implement them, and whether
    > or not they are required.
    >
    > So basically I'm stumbling around in the dark.


    The 'I tried', and 'I am not sure' parts already carried that message,
    but it is good to hear that you know that you do not really know what
    you are doing.

    > But I want to learn. I've viewed several tutorials on-line, but this
    > subject is so obtuse to me that it's difficult even getting started.


    I guess that you are at the stage where 'every thing looks like a nail,
    even your thumb'. Regular expressions are powerful, but not suited for
    every job. This is one of those jobs. You can create a regular
    expression of this, but it would have to sum up all 360 (that would be
    720 if the six letters were different) permutations of the six letters
    used, for a regular expression of length around 34 * 360 + 2 * 359.
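    To make the size of that alternation concrete, here is a rough Python
    sketch that generates it. The names LETTERS, ordered_pattern and
    ANY_ORDER are made up for illustration, and ".*" is assumed as the filler
    between letters.

```python
import re
from itertools import permutations

LETTERS = "intuon"  # six letters, with n appearing twice

# set() drops the duplicate orderings caused by the repeated n: 6!/2! = 360.
orderings = sorted(set(permutations(LETTERS)))


def ordered_pattern(letters) -> str:
    """Build e.g. [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn] for one ordering."""
    return ".*".join(f"[{c.upper()}{c}]" for c in letters)


alternatives = [ordered_pattern(order) for order in orderings]
ANY_ORDER = re.compile("|".join(alternatives))

print(len(alternatives), "alternatives of", len(alternatives[0]), "characters each")
print(bool(ANY_ORDER.search("noiiitunt")))
```

    Generating the pattern is easy in a program, but pasting the result into
    a single filter field is another matter.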

    > Any suggestions would be greatly appreciated.


    For me, the #1 rule when building regular expressions is: when your
    regular expression does not do what you think it should do, shorten it,
    and check (in a simple test program) that the shorter one does what you
    think it should do.

    In your case, you might want to start with "[Ii].*[Nn]" and work from
    there.

    <http://www.regular-expressions.info/> might help you.

    If you have access to a Windows machine: have you seen
    <http://www.regexbuddy.com/test.html>? I have not used it myself, but
    heard positive comments about it.

    > Thanks!


    In article <>,
    SparkyGuy <> wrote:

    > > [Ii].*[Nn].*[Tt].*[Uu].*[Oo].*[Nn]

    >
    > It turns out that this does work to find terms with all these letters in this
    > order even if there are other characters interspersed between them. Such as:
    >
    > intuition
    > in3t5u7ition
    > in tuitio9n
    > inBBtuCCon


    That should work with most, if not all, regular expression libraries.
    See for example <http://www.regextester.com/>. Which one are you using?

    > I guess this has something to do with the ^ and $ parsing metasymbols


    What makes you think that?

    Reinder
     
    Reinder Verlinde, Jan 6, 2008
    #5
  6. SparkyGuy

    Paul Floyd Guest

    On Sat, 05 Jan 2008 23:07:58 -0500, SparkyGuy <> wrote:
    > I want to build a regular expression that will find certain characters in a
    > field. For example:
    >
    > i,n,t,u,o,n


    Exactly which language/library/regex engine are you using?

    In any case, it isn't easy in a single regexp.

    A bientot
    Paul
    --
    Paul Floyd http://paulf.free.fr
     
    Paul Floyd, Jan 6, 2008
    #6
  7. SparkyGuy

    Ben Artin Guest

    In article <>,
    SparkyGuy <> wrote:

    > I want to build a regular expression that will find certain characters in a
    > field. For example:
    >
    > i,n,t,u,o,n
    >
    > all need to be present (at least once) for the RegEx interpreter to label
    > this search True. The order is not important, and case should be ignored.


    REs are not the right tool for this job; while it is possible to build such an
    RE, it's annoying. For example, this is an RE that would match every string
    containing all of 'a', 'b', and 'c' in any order, case-insensitively:

    ([^aA]*[aA][^bB]*[bB][^cC]*[cC])|([^aA]*[aA][^cC]*[cC][^bB]*[bB])|([^bB]*[bB][^aA]*[aA][^cC]*[cC])|([^bB]*[bB][^cC]*[cC][^aA]*[aA])|([^cC]*[cC][^aA]*[aA][^bB]*[bB])|([^cC]*[cC][^bB]*[bB][^aA]*[aA])

    If you are using an RE engine that allows RE options, then you can use /i to
    indicate you want a case-insensitive match, and then you can simplify this to

    /([^a]*a[^b]*b[^c]*c)|([^a]*a[^c]*c[^b]*b)|([^b]*b[^c]*c[^a]*a)|([^b]*b[^a]*a[^c]*c)|([^c]*c[^a]*a[^b]*b)|([^c]*c[^b]*b[^a]*a)/i

    and you can further simplify this to

    /([^a]*a(([^b]*b[^c]*c)|([^c]*c[^b]*b)))|([^b]*b(([^a]*a[^c]*c)|([^c]*c[^a]*a)))|([^c]*c(([^a]*a[^b]*b)|([^b]*b[^a]*a)))/i

    but as you can see, this is pretty painful and it gets exponentially more
    painful the more characters you want to match.
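    As a rough check on this kind of alternation, the Python sketch below
    generates one for 'a', 'b', 'c' and compares it with a plain set test.
    The function names are hypothetical, and the "[^x]*x" building block
    (skip ahead to the first x) is an assumption about how each alternative
    is spelled.

```python
import re
from itertools import permutations


def any_order_pattern(letters: str) -> str:
    """Alternation over every ordering; [^x]*x skips ahead to the first x."""
    alternatives = (
        "".join(f"[^{c}]*{c}" for c in order) for order in permutations(letters)
    )
    return "|".join(f"({alt})" for alt in alternatives)


ABC = re.compile(any_order_pattern("abc"), re.IGNORECASE)


def contains_all(text: str, letters: str = "abc") -> bool:
    """Reference check without a regex: compare sets of lowercase characters."""
    return set(letters) <= set(text.lower())


for sample in ["cab", "xBxAxCx", "abab", "c-a"]:
    print(sample, bool(ABC.search(sample)), contains_all(sample))
```

    The set test is a single line, which illustrates why a regex is the
    awkward tool here when a programming language is available.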

    Ben

    --
    If this message helped you, consider buying an item
    from my wish list: <http://artins.org/ben/wishlist>

    I changed my name: <http://periodic-kingdom.org/People/NameChange.php>
     
    Ben Artin, Jan 9, 2008
    #7
  8. SparkyGuy

    SparkyGuy Guest

    I have reduced my requirements to just matching input that contains the
    letters "i n t u o" in that order even if interspersed with other characters.


    The RE I worked out is:

    [Ii].*[Nn].*[Tt].*[Uu].*[Oo]

    Thanks to all who helped me come to this result.

    I also need to match input that exceeds 50 characters. I tried several REs,
    paring it down, eventually, to:

    .{50}

    It doesn't work.

    This seems like it should be pretty simple, but it's not working. It works in
    regextester.com's tester, but it isn't working for me. Is there a simpler (or
    different) way to check for input length? (I presume that my "flavor" of
    regex interpreter isn't accepting this form...)

    Thanks for your help.
     
    SparkyGuy, Jan 9, 2008
    #8
  9. SparkyGuy

    David Empson Guest

    SparkyGuy <> wrote:

    > I have reduced my requirements to just matching input that contains the
    > letters "i n t u o" in that order even if interspersed with other characters.
    >
    >
    > The RE I worked out is:
    >
    > [Ii].*[Nn].*[Tt].*[Uu].*[Oo]
    >
    > Thanks to all who helped me come to this result.
    >
    > I also need to match input that exceeds 50 characters. I tried several REs,
    > paring it down, eventually, to:
    >
    > .{50}
    >
    > It doesn't work.


    That regular expression matches exactly 50 of any combination of
    characters. Assuming your input is being processed on a line at a time
    basis, it should match any line which contains a minimum of 50
    characters.

    > This seems like it should be pretty simple, but it's not working. It works in
    > regextester.com's tester, but it isn't working for me. Is there a simpler (or
    > different) way to check for input length? (I presume that my "flavor" of
    > regex interpreter isn't accepting this form...)


    The unescaped {} syntax is only available if your regular expression
    engine supports extended or Perl-compatible regular expressions ("PCRE");
    POSIX basic regular expressions spell it \{ \} instead.

    A simpler regular expression engine has no way to represent quantity,
    other than "zero or more" (*) or "one or more" (+).

    To match at least 50 characters in a simple regular expression you would
    need to enter 50 periods:

    ..................................................
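    The equivalence can be checked with a small Python sketch. Python's re
    module is used only as an illustration and the function names are made
    up. Because the pattern is unanchored, it succeeds on any line of 50 or
    more characters, and "exceeds 50" would need 51.

```python
import re


def at_least_50(line: str) -> bool:
    """Unanchored .{50}: succeeds on any line with 50 or more characters."""
    return re.search(r".{50}", line) is not None


def at_least_50_dots(line: str) -> bool:
    """The same test spelled as fifty literal periods."""
    return re.search("." * 50, line) is not None


def more_than_50(line: str) -> bool:
    """Strictly more than 50 characters needs 51 in the pattern."""
    return re.search(r".{51}", line) is not None


for length in (49, 50, 51):
    line = "x" * length
    print(length, at_least_50(line), at_least_50_dots(line), more_than_50(line))
```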

    --
    David Empson
     
    David Empson, Jan 9, 2008
    #9
  10. SparkyGuy

    SparkyGuy Guest

    > The {} syntax is only available if your regular expression engine
    > supports Perl-compatible regular expressions ("PCRE").
    >
    > A simple regular expression has no way to represent quantity, other than
    > "zero or more" (*) or "one or more" (+).
    >
    > To match at least 50 characters in a simple regular expression you would
    > need to enter 50 periods:
    >
    > ..................................................


    The author of the application within which I'm using RegEx to develop filters
    tells me that the app uses this RegEx library:

    <http://arglist.com/regex/>

    This is what I found through testing:

    Using "." (a single period) matches all input, regardless of length.
    Apparently it is being interpreted equivalent to ".*"

    Multiple uses of "." (ie, "......") are redundant.

    This is interesting:

    "Matches RE '.{1,50}' " matches nothing
    "Does not match RE '.{1,50}' " matches all input

    More tests:

    [Ee] matches all input with at least 1 E or e.

    [Ee]{1,3} matches nothing, although there's plenty of valid input.

    [Ee][Ee][Ee] also fails to match anything.

    Is my "flavor" of RE interpreter broken? I'd think that at least some of the
    basic forms should be supported...

    Ideas?

    Thanks,
     
    SparkyGuy, Jan 9, 2008
    #10
  11. SparkyGuy

    SparkyGuy Guest

    > The author of the application within which I'm using RegEx to develop filters

    > tells me that the app uses this RegEx library:


    He further says that it is the regex(3) library that he has implemented.
    "Basic" expressions.

    It seems that the basic (referred to as "obsolete") REs are a subset of
    "extended" REs.

    Of all places, I found a list of basic expressions on Wikipedia:

    <http://en.wikipedia.org/wiki/Regular_expression>

    (scroll down to the heading "POSIX".)

    My question is about the metacharacter ".". Using a single ".", shouldn't it
    match input that consists of a single character, and not match anything with
    more than one character?

    When I use this metacharacter I'm getting matches for all input, regardless
    of length.

    Thanks.
     
    SparkyGuy, Jan 10, 2008
    #11
  12. SparkyGuy

    John Whorfin Guest

    SparkyGuy wrote:
    > My question is about the metacharacter ".". Using a single ".", shouldn't it
    > match input that consists of a single character, and not match anything with
    > more than one character?


    Only if you anchor it, i.e. "^.$" will match lines containing only a
    single character. Read this as "At the start of the input, match a
    single character, which must be followed by the end of the line"
    (newline or end-of-string). A "." alone will match anything, other than
    an empty line. Some regex matchers have options to implicitly anchor
    the regex but others don't.
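    A small Python sketch of the difference, for illustration only (Python's
    re module is an assumption about the engine, and the helper names are
    made up):

```python
import re


def matches_unanchored(text: str) -> bool:
    """A bare "." succeeds as soon as the text has at least one character."""
    return re.search(r".", text) is not None


def matches_anchored(text: str) -> bool:
    """"^.$" succeeds only when the whole line is a single character."""
    return re.search(r"^.$", text) is not None


for text in ["", "a", "hello"]:
    print(repr(text), matches_unanchored(text), matches_anchored(text))
```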

    Don't know if anyone has suggested this, but one way to achieve the
    match you want (in the original post) is to sort the characters of the
    field prior to matching. Then match against a simpler regex. The
    sorting eliminates the complications of specifying a regex that can cope
    with the arbitrary ordering of the input characters. Of course it may
    not be the most efficient thing to do depending upon the nature of the
    input (quantity, likelihood of match etc...) and a complex regex may be
    better.
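    A minimal Python sketch of the sort-then-match idea. It assumes the
    field can be preprocessed in a programming language and that two "n"s
    are required; the names SORTED_PATTERN and has_all_letters_sorted are
    hypothetical.

```python
import re

# After sorting, equal letters are adjacent and the letters appear in
# alphabetical order: i < n < o < t < u. So one ordered pattern is enough.
SORTED_PATTERN = re.compile(r"i.*nn.*o.*t.*u")


def has_all_letters_sorted(field: str) -> bool:
    """Lowercase and sort the characters, then apply the single ordered pattern."""
    ordered = "".join(sorted(field.lower()))
    return SORTED_PATTERN.search(ordered) is not None


for word in ["noiiitunt", "unt ioitni", "otAAnuiBiit"]:
    print(word, "".join(sorted(word.lower())), has_all_letters_sorted(word))
```

    The sorting step is the part a regex cannot do by itself, so this only
    helps where the input can be transformed before the filter runs.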
     
    John Whorfin, Jan 10, 2008
    #12
  13. SparkyGuy

    SparkyGuy Guest

    > Only if you anchor it, i.e. "^.$" will match lines containing only a
    > single character. Read this as "At the start of the input, match a
    > single character, which must be followed by the end of the line"
    > (newline or end-of-string). A "." alone will match anything, other than
    > an empty line. Some regex matchers have options to implicitly anchor
    > the regex but others don't.


    Ah. Thanks! It now works to identify specific numbers of characters, such as:
    ^.....$ five characters
    ^..........$ ten characters, etc.

    My goal is to select a range of numbers of characters, the general form of
    which would be:

    .{5,10}

    In this limited set of supported expressions, however, the range
    metacharacters must be escaped:

    \{5,10\}

    So how do I incorporate this with "."? I tried

    ^.\{5,10\}$

    to no avail. Other permutations I can think of don't work either.

    Ideas?
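    One possible workaround, sketched in Python: a length range can be
    expressed without any range syntax by pairing a "matches" test of 5
    periods with a "does not match" test of 11 periods. Whether the
    application can combine two such rules is an assumption, as are the
    names used here; the interval form is included only for comparison.

```python
import re

AT_LEAST_5 = re.compile("." * 5)     # any 5 consecutive characters
AT_LEAST_11 = re.compile("." * 11)   # any 11 consecutive characters
INTERVAL = re.compile(r"^.{5,10}$")  # the same range in interval syntax


def length_5_to_10(text: str) -> bool:
    """Between 5 and 10 characters: matches 5 periods, does not match 11."""
    return AT_LEAST_5.search(text) is not None and AT_LEAST_11.search(text) is None


for length in (4, 5, 10, 11):
    text = "x" * length
    print(length, length_5_to_10(text), INTERVAL.search(text) is not None)
```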

    > Don't know if anyone has suggested this, but one way to achieve the
    > match you want (in the original post) is to sort the characters of the
    > field prior to matching. Then match against a simpler regex. The
    > sorting eliminates the complications of specifying a regex that can cope
    > with the arbitrary ordering of the input characters. Of course it may
    > not be the most efficient thing to do depending upon the nature of the
    > input (quantity, likelihood of match etc...) and a complex regex may be
    > better.


    I may have stated earlier (?) that I'm not working in a programming language,
    but simply using RegEx to set up filters in an application that supports the
    basic set of RegEx expressions. The application presents a single field
    within which a single RegEx can be entered. In a single RegEx, can I sort and
    match?

    Thanks for your help. It is very much appreciated.
     
    SparkyGuy, Jan 10, 2008
    #13
  14. SparkyGuy

    SparkyGuy Guest

    It turns out that simple range expressions are not supported in this (very)
    limited set of RegEx:

    "\{m,n\} Matches the preceding element at least m and not more than n times.
    For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". ***This is not
    found in a few, older instances of regular expressions.***"
    (Emphasis mine.)

    <http://en.wikipedia.org/wiki/Regular_expression#POSIX>
     
    SparkyGuy, Jan 10, 2008
    #14
  15. SparkyGuy

    David Empson Guest

    SparkyGuy <> wrote:

    > It turns out that simple range expressions are not supported in this (very)
    > limited set of RegEx:
    >
    > "\{m,n\} Matches the preceding element at least m and not more than n times.
    > For example, a\{3,5\} matches only "aaa", "aaaa", and "aaaaa". ***This is not
    > found in a few, older instances of regular expressions.***"
    > (Emphasis mine.)


    That notation is wrong for extended regular expressions (specifically
    "Perl Compatible Regular Expressions"): there the curly braces should NOT
    be preceded by a backslash. The backslash means "ignore the special meaning
    of the next character and treat it as a normal character" (or treat a
    normal character as a special character, such as \s for space).

    The correct syntax for "match at least m but no more than n of any
    character" is

    .{m,n}

    If you used this:

    .\{m,n\}

    it would mean "match any character, then a "{", then m, then a comma,
    then n, then a "}".

    If your regular expression engine only supports basic regular
    expressions then unescaped "{" and "}" have no special meaning and are
    treated as normal characters. In POSIX basic syntax the interval is
    instead written with the backslashes, \{m,n\}, although some older
    engines lack it.
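    The escaped and unescaped spellings can be compared in a Perl-style
    engine with a short Python sketch. Python's re module is used here only
    as an illustration; which spelling applies in the application depends on
    its regex flavor.

```python
import re

QUANTIFIER = re.compile(r"^a{3,5}$")  # unescaped braces: three to five a's
LITERAL = re.compile(r"^a\{3,5\}$")   # escaped braces: the literal text a{3,5}


def is_three_to_five_a(text: str) -> bool:
    """True for aaa, aaaa and aaaaa only."""
    return QUANTIFIER.search(text) is not None


def is_literal_brace_text(text: str) -> bool:
    """True only for the seven literal characters a{3,5}."""
    return LITERAL.search(text) is not None


for text in ["aa", "aaa", "aaaaa", "aaaaaa", "a{3,5}"]:
    print(text, is_three_to_five_a(text), is_literal_brace_text(text))
```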

    --
    David Empson
     
    David Empson, Jan 10, 2008
    #15
