Home Libraries Author Links

Match: Pattern Matching with Regular Expressions
[SysToMath Base C Library]

Collaboration diagram for Match: Pattern Matching with Regular Expressions:

Detailed Description

Pattern matching and match result processing in character strings by means of regular expressions with embedded action functions.

The SysToMath Base C Library module Match provides abstract data types to perform pattern matching in character strings using regular expressions.


Files

file  match.h
 Specification of data types and functions for pattern matching in strings by means of regular expressions.

Modules

 Match Test
 Demonstration of the usage and tests of the implementation of pattern matching in strings by means of regular expressions.
 Abstract Data Type StmRe
 StmRe is an abstract data type used to represent regular expressions.

Regular Expressions

Regular expressions provide a mechanism to select specific strings from a set of character strings.

Regular Expression Syntax

A regular expression reStr serves the purpose of pattern matching in a character string str and is a string with the following syntax in yacc format (deviating from yacc optional parts are included in brackets, informal descriptions of nonterminals are quoted with '"' characters and comments extending till the end of the line are preceded by two slash characters '//'):
     reStr                             // The regular expression reStr
                                       // matches, if
         : alt [ '||' reStr ]          // alt matches, or else its regular
                                       // subexpression reStr matches.
         ;

     alt                               // The alternative alt matches, if
         : term [ '|' alt ]            // term matches with at least as many
                                       // characters as alt or else alt
                                       // matches.
         ;

     term                              // The term term matches, if
         : qualifRe [ term ]           // qualifRe and, if applicable,its
                                       // subterm term matches.
         ;

     qualifRe                          // The qualified regular expression
                                       // qualifRe matches, if there is a
         : optInverter                 // positive or inverse (see below)
           simpleRe                    // match of simpleRe
           optQualifier                // q (see below) times.
           optActions                  // If so, perform action functions
           optGoback                   // and optionally go back just before
                                       // the part of str matched by qualifRe.
         ;

     optInverter
         :                             // Positive match: the corresponding
                                       // qualifRe matches, if the following
                                       // simpleRe matches q times.
         | '!'                         // Inverse match: the corresponding
                                       // qualifRe matches, if the following
                                       // simpleRe matches q times not (for
                                       // each of the q non matches one
                                       // character of str is consumed).
         ;

     simpleRe                          // The simple regular expression
                                       // simpleRe matches, if the regular
         : '(' reStr ')'               // subexpression reStr matches,
         | '&' reName '&'              // reStr reName matches (see below),
         | '[' group ']'               // group matches (see below),
         | '%' escSeq                  // character escSeq matches,
         | '.'                         // arbitrary character matches, or
         | reChar                      // character reChar matches.
         ;

     optQualifier                      // Conditions for the effective
                                       // repetition factor q (see above):
         :                             // q = 1,
         | '?'                         // q = 0 or q = 1,
         | '+'                         // q >= 1,
         | '*'                         // q >= 0, or
         | '{' range '}'               // q in range (see below).
         ;

     range                             // Effective repetition factor q:
         : minmax                      // q = minmax and q > 0,
         | min ','                     // q >= min, or
         | min ',' max                 // min <= q and q <= max.
         ;

     optActions
         :
         | ':' actionNames ':'         // If the corresponding qualifRe
                                       // matches, the action functions in
                                       // actionNames are executed.
         ;

     actionNames
         : actionName
         | actionNames ',' actionName
         ;

     optGoback
         :
         | '''                         // If the corresponding qualifRe
                                       // matches, go back just before the
                                       // part of str matched by qualifRe.
                                       // This way for a part of str an and
                                       // connection (conjunction) between a
                                       // qualifRe and any following qualifRe
                                       // may be achieved.
         ;

     minmax
     min
     max
         : "unsigned non-negative decimal number not bigger than INT_MAX"
         ;

     reName
         : "one or more C identifiers separated by single '-' characters"
                                       // Name of a named regular expression
                                       // defined by function stmReCreate().
         ;

     actionName
         : "one or more C identifiers separated by single '-' characters"
                                       // Name of a qualified regular
                                       // expression action defined by
                                       // function stmQreActDef().
         ;

     group                             // group matches, if
         : [ '^' ]                     // optionally no character, else
           [ ('[' | '-') ]             // one exactly from '[' or '-' or
           groupEnd                    // groupEnd matches.
         ;

     groupEnd
         :
         | groupElement groupEnd
         ;

     groupElement
         : groupChar [ '-' groupChar ] // matches, if groupChar matches or if
                                       // any character from the character
                                       // range groupChar '-' groupChar
                                       // matches.
         ;

     groupChar
         : '%' escSeq
         | greChar
         ;

     escSeq
         : 'n'                         // matches newline.
         | 't'                         // matches horizontal tab.
         | 'v'                         // matches vertical tab.
         | 'b'                         // matches backspace.
         | 'r'                         // matches carriage return.
         | 'f'                         // matches formfeed.
         | 'a'                         // matches audible alert.
         | 'x' hexDigit [ hexDigit ]   // matches a hexadecimal number.
         | octDigit [ octDigit ] [ octDigit ]  // matches an octal number.
         | "any character (except \0 for stmReMatch()) not mentioned above
            matches itself"
         ;

     hexDigit
         : "any hexadecimal digit fron the character ranges 0-9, a-f or A-F"
         ;

     octDigit
         : "any octal digit from the character range 0-7"
         ;

     greChar
         : "any reChar except [ and -"
         ;

     reChar
         : "any character except +?*|()[{&%.!:' (and \0 for stmReMatch())"
         ;

Only when using function stmReMatchSub() '\0' characters may be contained in the string str to be matched, since using function stmReMatch() a '\0' character would be interpreted as end of string condition.

Generally a character or a character string from a regular expression is said to match, if it coincides with the next character(s) of the string str to be matched after these have been transformed (see stmReCtrlCreate ()), if applicable.

The elements of the sequence of terms of a reStr separated by '|' or '||' are numbered beginning with 0. The uniquely determined term of this sequence causing reStr's match is called its matching term and the number of this term its matching term number.

The effective repetition factor q of a qualifRe specified by optQualifier always is chosen maximal so that a possibly following term with which qualifRe forms another term, matches.

The match length of the first qualifRe of a term always is chosen maximal so that the possibly following subterm of term matches.

If the minimal value of the effective repetion factor q of a qualified regular expression qualifRe is 0 (which is the case for optQualifier '?', '*', or '{0,max}'), qualifRe is called potentially optional. If a term term consists of all potentially optional qualified regular expressions also term is called potentially optional. If an alternative alt contains at least one potentially optional term also alt is called potentially optional. Finally if a regular expression reStr contains at least one potentially optional alternative, reStr is called potentially optional.

Before a regular expression with the syntax reStr can be used for pattern matching by means of the functions stmReMatch() or stmReMatchSub() it has to be named and defined by the function stmReCreate().

The regular expressions with the syntax reStr referenceable by the construct '&'reName '&' also have to be defined by the function stmReCreate().

The action functions for qualified regular expressions qualifRe referenceable by the construct ':'actionNames ':' have to be defined by function stmQreActDef().


© Copyright Tom Michaelis 2002-2007

Distributed under the SysToMath Software License (See the accompanying file license.txt or a copy at www.SysToMath.com).