|
A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, Perl has a powerful regular expression engine built directly into its syntax. The set of utilities (including the editor sed and the filter grep) provided by Unix distributions were the first to popularize the concept of regular expressions.
Basic concepts
A regular expression, often called a pattern, is an expression that describes a set of strings. It is usually used to give a concise description of a set, without having to list all its elements. For example, the set containing the three strings Handel, Händel, and Haendel can be described by the pattern "H(ä|ae?)ndel" (alternatively, it is said that the pattern matches each of the three strings). As a side note, there are usually multiple different patterns describing any given set. Most formalisms provide the following operations to construct regular expressions.
;alternation
;grouping
;quantification
These constructions can be combined to form arbitrarily complex expressions, much as one can construct arithmetic expressions from numbers and the operations +, -, * and /.
So "H(ae?|ä)ndel" and "H(a|ae|ä)ndel" are valid patterns, and furthermore, they both match the same strings as the example from the beginning of the article.
The pattern "((great )*grand )?(father|mother)" matches any ancestor: father, mother, grand father, grand mother, great grand father, great grand mother, great great grand father, great great grand mother, great great great grand father, great great great grand mother and so on.
The exact syntax for regular expressions varies among tools and application areas, and is described in more detail below.
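The patterns above can be tried out directly. The following is a minimal sketch using Python's `re` module, chosen here only for illustration; the helper name `matches_whole` is hypothetical, and `re.fullmatch` is used so that a pattern must describe the entire string rather than a substring.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern describes the whole string."""
    return re.fullmatch(pattern, text) is not None

# Three different patterns that describe the same set of three names.
handel_patterns = ["H(ä|ae?)ndel", "H(ae?|ä)ndel", "H(a|ae|ä)ndel"]
handel_names = ["Handel", "Händel", "Haendel"]

for pattern in handel_patterns:
    for name in handel_names:
        print(pattern, name, matches_whole(pattern, name))

# Grouping, alternation and quantification combined.
ancestor = "((great )*grand )?(father|mother)"
for text in ["mother", "grand father", "great great grand mother", "great father"]:
    print(text, matches_whole(ancestor, text))
```

The last string, "great father", is rejected because the optional group requires "grand" after any number of "great".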
History
The origin of regular expressions lies in automata theory and formal language theory (both part of theoretical computer science). These fields study models of computation (automata) and ways to describe and classify formal languages. In the 1940s, Warren McCulloch and Walter Pitts described the nervous system by modelling neurons as small, simple automata. The mathematician Stephen Kleene later described these models using his mathematical notation called regular sets. Ken Thompson built this notation into the editor QED, then into the Unix editor ed and eventually into grep. Ever since that time, regular expressions have been widely used in Unix and Unix-like utilities such as: expr, awk, Emacs, vi, lex, and Perl.
Perl regular expressions were derived from regex written by Henry Spencer. Philip Hazel developed [http://www.pcre.org/ pcre] (Perl Compatible Regular Expressions) which attempts to closely mimic Perl's regular expression functionality, and is used by many modern tools such as Python, PHP, and Apache.
The integration of regular expressions in most programming languages is still rather poor and, even though Perl's regular expression integration is one of the best around, part of the effort in the design of the future [http://dev.perl.org/perl6 Perl6] is improving this integration. This is the subject of [http://dev.perl.org/perl6/apocalypse/A05.html Apocalypse 5].
In formal language theory
Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ the following constants are defined:
(empty set) ∅ denoting the set ∅.
(empty string) ε denoting the set {ε}.
(literal character) a in Σ denoting the set {a}.
and the following operations:
(concatenation) RS denoting the set { αβ | α in R and β in S }.
(alternation) R|S denoting the set union of R and S.
(Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", ...}.
Many textbooks use the symbols ∪, + or ∨ for alternation instead of the vertical bar.
To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then brackets may be omitted. For example, (ab)c is written as abc and a|(b(c*)) can be written as a|bc*.
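The three operations can be made concrete on finite sets of strings. The sketch below is illustrative only: the function names are hypothetical, and because R* is infinite whenever R contains a non-empty string, `star` takes an assumed `max_len` bound and returns only the members of R* up to that length.

```python
def concat(R, S):
    """Concatenation RS: every string of R followed by every string of S."""
    return {r + s for r in R for s in S}

def union(R, S):
    """Alternation R|S: the set union of R and S."""
    return R | S

def star(R, max_len):
    """Members of the Kleene star R* whose length is at most max_len."""
    result = {""}        # epsilon is always in R*
    frontier = {""}
    while frontier:
        # Extend every newly found string by one more string from R.
        frontier = {
            w + r for w in frontier for r in R if len(w + r) <= max_len
        } - result
        result |= frontier
    return result

print(sorted(concat({"ab", "c"}, {"d", "ef"})))
print(sorted(star({"ab", "c"}, 3), key=lambda w: (len(w), w)))
```

Bounding the length is a simplification made for this sketch; it is not part of the formal definition.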
Examples:
a|b* denotes {ε, a, b, bb, bbb, ...}
(a|b)* denotes the set of all strings consisting of any number of a and b symbols, including the empty string
b*(ab*)* the same
ab*(c|ε) denotes the set of strings starting with a, then zero or more bs and finally optionally a c.
((a|ba(aa)*b)(b(aa)*b)*(a|ba(aa)*b)|b(aa)*b)*(a|ba(aa)*b)(b(aa)*b)* denotes the set of all strings which contain an even number of bs and an odd number of as. Note that this regular expression is of the form (X Y*X ∪ Y)*X Y* with X = a|ba(aa)*b and Y = b(aa)*b.
The formal definition of regular expressions is purposely parsimonious and avoids defining the redundant quantifiers ? and +, which can be expressed as follows: a+ = aa*, and a? = (ε|a). Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. In that case the resulting operators form a Kleene algebra. The complement operator is redundant: it can always be expressed by using only the other operators.
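The example expressions and the identities a+ = aa* and a? = (ε|a) can be checked empirically by enumerating all short strings over {a, b}. This is a brute-force sketch, not a proof: it compares languages only up to an assumed length bound of 8, the helper `language` is hypothetical, and ε is written as an empty alternative because Python's `re` module has no ε symbol.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=8):
    """All strings over the alphabet, up to max_len, that the pattern matches."""
    compiled = re.compile(pattern)
    words = (
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )
    return {w for w in words if compiled.fullmatch(w)}

all_words = language("(a|b)*")

# Two different expressions for the same language.
same = language("b*(ab*)*") == all_words

# The redundant quantifiers expressed with the basic operators.
plus_ok = language("a+") == language("aa*")
optional_ok = language("a?") == language("(|a)")

# The long example, assembled from X and Y as (X Y* X | Y)* X Y*.
X = "(a|ba(aa)*b)"
Y = "(b(aa)*b)"
even_b_odd_a = f"({X}{Y}*{X}|{Y})*{X}{Y}*"
expected = {w for w in all_words if w.count("a") % 2 == 1 and w.count("b") % 2 == 0}
parity_ok = language(even_b_odd_a) == expected

print(same, plus_ok, optional_ok, parity_ok)
```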
Regular expressions in this sense can express exactly the class of languages accepted by finite state automata: the regular languages. There is, however, a significant difference in compactness: some classes of regular languages can only be described by automata that grow exponentially in size, while the required regular expressions only grow linearly. Regular expressions correspond to the type 3 grammars of the Chomsky hierarchy and may be used to describe a regular language.
We can also study expressive power within the formalism. As the examples show, different regular expressions can express the same language: the formalism is redundant.
It is possible to write an algorithm which for two given regular expressions decides whether the described languages are equal - essentially, it reduces each expression to a minimal deterministic finite state automaton and determines whether they are isomorphic (equivalent).
To what extent can this redundancy be eliminated? Can we find an interesting subset of regular expressions that is still fully expressive? Kleene star and set union are obviously required, but perhaps we can restrict their use. This turns out to be a surprisingly difficult problem. As simple as the regular expressions are, it turns out there is no method to systematically rewrite them to some normal form. They are not finitely axiomatizable. So we have to resort to other methods. This leads to the star height problem.
It is worth noting that many real-world "regular expression" engines implement features that cannot be expressed in the regular expression algebra; see below for more on this.
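The equality test can be sketched once both expressions have been turned into deterministic automata. The code below assumes that conversion has already happened and takes two complete DFAs as input. It also swaps in a different but related technique: instead of minimising both automata and testing for isomorphism, it walks the product automaton looking for a reachable pair of states on which the two DFAs disagree. The tuple layout and all names are assumptions made for this illustration.

```python
from collections import deque

def equivalent(dfa1, dfa2, alphabet):
    """Decide whether two complete DFAs accept the same language.

    Each DFA is a tuple (start, accepting, delta) where
    delta[(state, symbol)] gives the next state.
    """
    start1, accepting1, delta1 = dfa1
    start2, accepting2, delta2 = dfa2
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # A reachable pair that disagrees on acceptance witnesses a difference.
        if (p in accepting1) != (q in accepting2):
            return False
        for symbol in alphabet:
            pair = (delta1[(p, symbol)], delta2[(q, symbol)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

# Two-state DFA: strings with an even number of a's.
parity_delta = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}
even_a = ("even", {"even"}, parity_delta)
odd_a = ("even", {"odd"}, parity_delta)

# A redundant four-state DFA for the same language: it also tracks the
# parity of b's but ignores it when accepting.
wide_delta = {
    ((i, j), s): ((i + (s == "a")) % 2, (j + (s == "b")) % 2)
    for i in (0, 1) for j in (0, 1) for s in "ab"
}
even_a_wide = ((0, 0), {(0, 0), (0, 1)}, wide_delta)

print(equivalent(even_a, even_a_wide, "ab"))
print(equivalent(even_a, odd_a, "ab"))
```

The search visits at most one pair per combination of states, so it terminates after a number of steps bounded by the product of the two state counts times the alphabet size.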
Syntax
Traditional Unix regular expressions
The "basic" Unix regular expression syntax is now defined as obsolete by POSIX, but is still widely used for the purposes of backwards compatibility. Most regular-expression-aware Unix utilities, for example grep and sed, use it by default.
In this syntax, most characters are treated as literals: they match only themselves ("a" matches "a", "(bc" matches "(bc", etc). The exceptions are called metacharacters:
Old versions of grep did not support the alternation operator "|".
Examples:
Since many ranges of characters depend on the chosen locale setting (e.g., in some settings letters are organized as abc..yzABC..YZ while in some others as aAbBcC..yYzZ) the POSIX standard defines some classes or categories of characters as shown in the following table:
{|
|- align="left"
! POSIX class !! similar to !! meaning
|-
| [:upper:]
| [A-Z]
| uppercase letters
|-
| [:lower:]
| [a-z]
| lowercase letters
|-
| [:alpha:]
| [A-Za-z]
| upper- and lowercase letters
|-
| [:alnum:]
| [A-Za-z0-9]
| digits, upper- and lowercase letters
|-
| [:digit:]
| [0-9]
| digits
|-
| [:xdigit:]
| [0-9A-Fa-f]
| hex digits
|-
| [:punct:]
| [.,!?:...]
| punctuation
|-
| [:blank:]
| [ \t]
| space and TAB
|-
| [:space:]
| [ \t\n\r\f\v]
| whitespace characters
|-
| [:cntrl:]
|
| control characters
|-
| [:graph:]
| [^ \t\n\r\f\v]
| printable characters, excluding space
|-
| [:print:]
| [^\t\n\r\f\v]
| printable characters, including space
|}
Example: [[:upper:]ab] matches only the uppercase letters and lowercase 'a' and 'b'.
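Python's `re` module does not implement POSIX bracket classes such as [:upper:], so the sketch below approximates the example with the explicit range from the "similar to" column of the table. It assumes a setting in which the uppercase letters are exactly A to Z, which is precisely the locale-dependent assumption the POSIX classes are meant to avoid.

```python
import re

# Stand-in for [[:upper:]ab], assuming uppercase letters are exactly A-Z.
upper_or_ab = re.compile("[A-Zab]")

sample = "abcABC xyz"
found = upper_or_ab.findall(sample)
print(found)
```

Only the uppercase letters and the lowercase letters a and b are reported; c, x, y, z and the space are skipped.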
POSIX modern (extended) regular expressions
The more modern "extended" regular expressions can often be used with modern Unix utilities by including the command line flag "-E".
POSIX extended regular expressions are similar in syntax to the traditional Unix regular expressions, with some exceptions. The following metacharacters are added:
{|
|-
| +
| Match the last "block" one or more times - "ba+" matches "ba", "baa", "baaa" and so on
|-
| ?
| Match the last "block" zero or one times - "ba?" matches "b" or "ba"
|-
| <nowiki>|</nowiki>
| The choice (or set union) operator: match either the expression before or the expression after the operator - "abc|def" matches "abc" or "def".
|}
Also, backslashes are removed: \{...\} becomes {...} and \(...\) becomes (...).
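The three added metacharacters behave the same way in many later syntaxes. The sketch below uses Python's `re` module as a stand-in for a POSIX extended regular expression tool; that substitution is an assumption of this example, and it holds only for the simple patterns shown here. The helper name `matches` is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# "+" : one or more repetitions of the preceding block.
print([matches("ba+", s) for s in ["b", "ba", "baaa"]])

# "?" : zero or one repetition of the preceding block.
print([matches("ba?", s) for s in ["b", "ba", "baa"]])

# "|" : either the expression before or the expression after.
print([matches("abc|def", s) for s in ["abc", "def", "abcdef"]])

# Unescaped parentheses group, so the quantifier applies to the whole block.
print([matches("(ba)+", s) for s in ["ba", "baba", "baa"]])
```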
Examples:
Since the characters '(', ')', '[', ']', '.', '*', '?', '+', '^' and '$' are used as special symbols they have to be escaped if they are meant literally. This is done by preceding them with '\' which therefore also has to be escaped this way if meant literally.
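Escaping can be illustrated as follows. The sketch again uses Python's `re` module, whose set of special characters overlaps with, but is not identical to, the POSIX list above; `re.escape` is the standard library helper that puts a backslash in front of characters that would otherwise be special.

```python
import re

# Unescaped, "." is a metacharacter and matches any single character.
print(re.fullmatch("a.c", "abc") is not None)

# Escaped, it matches only a literal dot.
print(re.fullmatch(r"a\.c", "abc") is not None)
print(re.fullmatch(r"a\.c", "a.c") is not None)

# The backslash itself must be escaped to be matched literally.
print(re.fullmatch(r"a\\b", "a\\b") is not None)

# re.escape builds a pattern that matches a given string literally.
literal = "1+1=2 (really?)"
escaped = re.escape(literal)
print(re.fullmatch(escaped, literal) is not None)
```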
Examples:
Perl-compatible regular expressions (PCRE)
Perl has a much richer syntax than even the extended POSIX regexp. In addition, its syntax is somewhat more predictable (for example, a backslash always quotes a non-alphanumeric character). For these reasons, the Perl syntax has been adopted in other utilities and applications, for example Tcl, exim, and BBEdit.
Patterns for irregular languages
Many patterns provide an expressive power that far exceeds the regular languages.
For example, the ability to group subexpressions with brackets and recall them in the same expression means that a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory. The pattern for these strings is just "(.*)\1". However, the language of squares is not regular, nor is it context-free. Pattern matching with an unbounded number of back references, as supported by a number of modern tools, is NP-complete.
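The square pattern can be tried in any engine that supports back references. The following sketch uses Python's `re` module; the function name `is_square` is hypothetical, and `fullmatch` is used so that the whole word, not just a part of it, must have the form ww.

```python
import re

# Group 1 captures some string; \1 requires the same string again.
square = re.compile(r"(.*)\1")

def is_square(word):
    """Return True if word consists of some string repeated twice."""
    return square.fullmatch(word) is not None

for word in ["papa", "WikiWiki", "papaya", "abc"]:
    print(word, is_square(word))
```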
However, many tools, libraries, and engines that provide such constructions still use the term regular expression for their patterns. This has led to a nomenclature where the term "regular expression" has different meanings in formal language theory and pattern matching. It has been suggested to use the term regex or simply "pattern" for the latter. Larry Wall (author of Perl) writes in Apocalypse 5:
Implementations and running times
There are at least two different algorithms that decide if (and how) a given string matches a regular expression.
The oldest and fastest relies on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process takes time linear in the length of the input string. More precisely, an input string of size n can be tested against a regular expression of size m in time O(n + 2^m) or O(nm), depending on the details of the implementation. This algorithm is often referred to as DFA. It is fast, but can be used only for matching and not for recalling grouped subexpressions.
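The "simulates" variant can be sketched by tracking the set of NFA states that are reachable after each input symbol; each such set corresponds to one state of the equivalent DFA, built on the fly. The example is a simplified illustration: the NFA for (a|b)*abb is written out by hand, it has no ε-transitions, and the data layout and names are assumptions of this sketch.

```python
def nfa_accepts(text, start, accepting, delta):
    """Run an NFA without epsilon-transitions on text, one symbol at a time.

    delta maps (state, symbol) to the set of possible next states.
    """
    current = {start}
    for symbol in text:
        # One step of the subset construction: follow every possible move.
        current = {t for s in current for t in delta.get((s, symbol), ())}
        if not current:
            return False    # no state survives, so the input is rejected
    return bool(current & accepting)

# Hand-built NFA for (a|b)*abb: state 0 loops on a and b and guesses
# where the final "abb" starts.
ABB_DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def ends_with_abb(text):
    return nfa_accepts(text, 0, {3}, ABB_DELTA)

for text in ["abb", "aabb", "babb", "abab", ""]:
    print(repr(text), ends_with_abb(text))
```

Each input symbol costs work proportional to the number of NFA states at most, which is where the O(nm) bound comes from; building all DFA states in advance instead leads to the O(n + 2^m) variant.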
The other algorithm is to match the pattern against the input string by backtracking. (This algorithm is sometimes called NFA, but this terminology is highly confusing.) Its running time can be exponential, which many implementations exhibit when matching against expressions like "(a|aa)*b" that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of subcases. Even though backtracking implementations give no running time guarantee in the worst case, they allow much greater flexibility and provide more expressive power.
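The blow-up can be observed with a tiny backtracking matcher written for the single pattern "(a|aa)*b". This is a hand-rolled sketch for illustration, not the implementation of any particular engine; it counts how many positions it tries, and on an input of n letters a with no final b that count grows like the Fibonacci numbers.

```python
def count_backtracking_steps(text):
    """Match (a|aa)*b against all of text by naive backtracking.

    Returns (matched, steps) where steps counts the attempts made.
    """
    steps = 0

    def match_from(i):
        nonlocal steps
        steps += 1
        # Alternative 1: another iteration of the star consuming "a".
        if text.startswith("a", i) and match_from(i + 1):
            return True
        # Alternative 2: another iteration of the star consuming "aa".
        if text.startswith("aa", i) and match_from(i + 2):
            return True
        # Leave the star: the remainder must be exactly "b".
        return text[i:] == "b"

    matched = match_from(0)
    return matched, steps

print(count_backtracking_steps("aaab"))
for n in (5, 10, 15, 20):
    print(n, count_backtracking_steps("a" * n))
```

When the input does match, the first alternative succeeds immediately and the step count stays linear; the exponential cost appears only when every way of splitting the run of a's has to be tried and rejected.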
Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.
|