annotate doc/regex.texi @ 17274:69f030e5cec4
doc: avoid small caps
* doc/parse-datetime.texi, doc/regex.texi: Don't use small caps;
they're more trouble than they're worth. Suggested by Karl Berry
in <http://bugs.gnu.org/13360>.
author | Paul Eggert <eggert@cs.ucla.edu> |
---|---|
date | Sat, 05 Jan 2013 17:23:52 -0800 |
parents | a712776b11ce |
children |
rev | line source |
---|---|
13533
ca70a11e70e2
Integrate the regex documentation.
Bruno Haible <bruno@clisp.org>
parents:
13532
diff
changeset
|
1 @node Overview |
13531 | 2 @chapter Overview |
3 | |
4 A @dfn{regular expression} (or @dfn{regexp}, or @dfn{pattern}) is a text | |
5 string that describes some (mathematical) set of strings. A regexp | |
6 @var{r} @dfn{matches} a string @var{s} if @var{s} is in the set of | |
7 strings described by @var{r}. | |
8 | |
9 Using the Regex library, you can: | |
10 | |
11 @itemize @bullet | |
12 | |
13 @item | |
13532 | 14 see if a string matches a specified pattern as a whole, and |
13531 | 15 |
16 @item | |
17 search within a string for a substring matching a specified pattern. | |
18 | |
19 @end itemize | |
20 | |
21 Some regular expressions match only one string, i.e., the set they | |
22 describe has only one member. For example, the regular expression | |
23 @samp{foo} matches the string @samp{foo} and no others. Other regular | |
24 expressions match more than one string, i.e., the set they describe has | |
25 more than one member. For example, the regular expression @samp{f*} | |
26 matches the set of strings made up of any number (including zero) of | |
27 @samp{f}s. As you can see, some characters in regular expressions match | |
28 themselves (such as @samp{f}) and some don't (such as @samp{*}); the | |
29 ones that don't match themselves instead let you specify patterns that | |
30 describe many different strings. | |
31 | |
32 To either match or search for a regular expression with the Regex | |
33 library functions, you must first compile it with a Regex pattern | |
34 compiling function. A @dfn{compiled pattern} is a regular expression | |
35 converted to the internal format used by the library functions. Once | |
36 you've compiled a pattern, you can use it for matching or searching any | |
37 number of times. | |
38 | |
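The compile-once, use-many-times flow described above can be sketched with the GNU functions. This is a minimal sketch rather than a definitive implementation: it assumes the glibc (or Gnulib) @file{regex.h} with @code{_GNU_SOURCE} defined, and the helper names @code{compile_gnu}, @code{matches_whole}, and @code{find_first} are hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Compile PATTERN under the default (Emacs-style) GNU syntax.
   Returns 0 on success, -1 if the pattern is invalid.
   (Hypothetical helper, not part of the library.)  */
int
compile_gnu (struct re_pattern_buffer *buf, const char *pattern)
{
  memset (buf, 0, sizeof *buf);      /* let the library allocate the buffer */
  re_set_syntax (RE_SYNTAX_EMACS);
  return re_compile_pattern (pattern, strlen (pattern), buf) == NULL ? 0 : -1;
}

/* Nonzero if the compiled pattern matches TEXT as a whole.  */
int
matches_whole (struct re_pattern_buffer *buf, const char *text)
{
  regoff_t len = (regoff_t) strlen (text);
  /* re_match returns the number of characters matched from the start.  */
  return re_match (buf, text, len, 0, NULL) == len;
}

/* Index of the first substring of TEXT that matches, or -1 if none.  */
regoff_t
find_first (struct re_pattern_buffer *buf, const char *text)
{
  regoff_t len = (regoff_t) strlen (text);
  return re_search (buf, text, len, 0, len, NULL);
}
```

With these helpers, a buffer compiled from @samp{f*} matches @samp{fff} as a whole but not @samp{ffo}, and a buffer compiled from @samp{foo} is found at index 2 of @samp{a foo b}. The same compiled buffer can be reused for any number of calls; release it with @code{regfree} when you are done.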
13553
8fc3314fe460
Document not_eol and remove mention of regex.c.
Reuben Thomas <rrt@sc3d.org>
parents:
13549
diff
changeset
|
39 The Regex library is used by including @file{regex.h}. |
13531 | 40 @pindex regex.h |
41 Regex provides three groups of functions with which you can operate on | |
17274 | 42 regular expressions. One group---the GNU group---is more |
13647
e5c0e28232bc
regex documentation update from Reuben Thomas <rrt@sc3d.org>, 20 Aug 2010 12:04:39 +0100
Karl Berry <karl@freefriends.org>
parents:
13554
diff
changeset
|
43 powerful but not completely compatible with the other two, namely the |
17274 | 44 POSIX and Berkeley Unix groups; its interface was designed |
45 specifically for GNU. | |
13531 | 46 |
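For comparison with the GNU group, here is a small sketch of the POSIX group of functions (@code{regcomp}, @code{regexec}, @code{regfree}). The wrapper name @code{posix_contains} is hypothetical, and the choice of @code{REG_EXTENDED | REG_NOSUB} is only one reasonable configuration.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if TEXT contains a match for the POSIX extended regular
   expression PATTERN, 0 if it does not, and -1 if PATTERN is invalid.
   (Hypothetical helper, not part of the library.)  */
int
posix_contains (const char *pattern, const char *text)
{
  regex_t re;
  int status;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;                        /* the pattern did not compile */
  status = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}
```

The POSIX functions select between basic and extended syntax with a flag at compile time; they do not consult the syntax set by @code{re_set_syntax}, which belongs to the GNU group.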
47 We wrote this chapter with programmers in mind, not users of | |
48 programs---such as Emacs---that use Regex. We describe the Regex | |
49 library in its entirety, not how to write regular expressions that a | |
50 particular program understands. | |
51 | |
52 | |
13533 | 53 @node Regular Expression Syntax |
13531 | 54 @chapter Regular Expression Syntax |
55 | |
56 @cindex regular expressions, syntax of | |
57 @cindex syntax of regular expressions | |
58 | |
59 @dfn{Characters} are things you can type. @dfn{Operators} are things in | |
60 a regular expression that match one or more characters. You compose | |
61 regular expressions from operators, which in turn you specify using one | |
62 or more characters. | |
63 | |
64 Most characters represent what we call the match-self operator, i.e., | |
65 they match themselves; we call these characters @dfn{ordinary}. Other | |
66 characters represent either all or parts of fancier operators; e.g., | |
67 @samp{.} represents what we call the match-any-character operator | |
68 (which, no surprise, matches (almost) any character); we call these | |
69 characters @dfn{special}. Two different things determine what | |
70 characters represent what operators: | |
71 | |
72 @enumerate | |
73 @item | |
74 the regular expression syntax your program has told the Regex library to | |
75 recognize, and | |
76 | |
77 @item | |
78 the context of the character in the regular expression. | |
79 @end enumerate | |
80 | |
81 In the following sections, we describe these things in more detail. | |
82 | |
83 @menu | |
84 * Syntax Bits:: | |
85 * Predefined Syntaxes:: | |
86 * Collating Elements vs. Characters:: | |
87 * The Backslash Character:: | |
88 @end menu | |
89 | |
90 | |
13533 | 91 @node Syntax Bits |
13532 | 92 @section Syntax Bits |
13531 | 93 |
94 @cindex syntax bits | |
95 | |
96 In any particular syntax for regular expressions, some characters are | |
97 always special, others are sometimes special, and others are never | |
98 special. The particular syntax that Regex recognizes for a given | |
13647 | 99 regular expression depends on the current syntax (as set by |
100 @code{re_set_syntax}) when the pattern buffer of that regular expression | |
101 was compiled. | |
13531 | 102 |
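To illustrate that the syntax is captured when the pattern is compiled, the following sketch sets a syntax, compiles, and then restores the previous syntax before matching. It assumes the glibc (or Gnulib) @file{regex.h} with @code{_GNU_SOURCE} defined; @code{match_length} is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX, then match it against the start of TEXT.
   Returns the number of characters matched, -1 if there is no match,
   or -2 if PATTERN does not compile (re_match also uses -2 for
   internal errors).  Hypothetical helper, not part of the library.  */
regoff_t
match_length (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  const char *err;
  regoff_t n;

  memset (&buf, 0, sizeof buf);       /* let the library allocate */
  saved = re_set_syntax (syntax);     /* returns the previous syntax */
  err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);              /* BUF keeps the syntax it was compiled with */
  if (err != NULL)
    return -2;

  n = re_match (&buf, text, (regoff_t) strlen (text), 0, NULL);
  regfree (&buf);
  return n;
}
```

Under @code{RE_SYNTAX_POSIX_EXTENDED} the pattern @samp{a+} matches all three characters of @samp{aaa}. Under @code{RE_SYNTAX_POSIX_BASIC}, which sets @code{RE_BK_PLUS_QM}, the same pattern matches only the literal text @samp{a+}, and you would write @samp{a\+} to get the repetition.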
103 You get a pattern buffer by compiling a regular expression. @xref{GNU | |
13647 | 104 Pattern Buffers}, for more information on pattern buffers. @xref{GNU |
13531 | 105 Regular Expression Compiling}, and @ref{BSD Regular Expression |
106 Compiling}, for more information on compiling. | |
107 | |
13647 | 108 Regex considers the current syntax to be a collection of bits; we refer |
109 to these bits as @dfn{syntax bits}. In most cases, they affect what | |
110 characters represent what operators. We describe the meanings of the | |
111 operators to which we refer in @ref{Common Operators}, @ref{GNU | |
112 Operators}, and @ref{GNU Emacs Operators}. | |
13531 | 113 |
114 For reference, here is the complete list of syntax bits, in alphabetical | |
115 order: | |
116 | |
117 @table @code | |
118 | |
119 @cnindex RE_BACKSLASH_ESCAPE_IN_LISTS | |
120 @item RE_BACKSLASH_ESCAPE_IN_LISTS | |
121 If this bit is set, then @samp{\} inside a list (@pxref{List Operators}) | |
122 quotes (makes ordinary, if it's special) the following character; if | |
123 this bit isn't set, then @samp{\} is an ordinary character inside lists. | |
16236
8d0c35a0ae1d
doc: fix minor quoting issues, mostly with `
Paul Eggert <eggert@cs.ucla.edu>
parents:
15563
diff
changeset
|
124 (@xref{The Backslash Character}, for what @samp{\} does outside of lists.) |
13531 | 125 |
126 @cnindex RE_BK_PLUS_QM | |
127 @item RE_BK_PLUS_QM | |
128 If this bit is set, then @samp{\+} represents the match-one-or-more | |
129 operator and @samp{\?} represents the match-zero-or-one operator; if | |
130 this bit isn't set, then @samp{+} represents the match-one-or-more | |
131 operator and @samp{?} represents the match-zero-or-one operator. This | |
132 bit is irrelevant if @code{RE_LIMITED_OPS} is set. | |
133 | |
134 @cnindex RE_CHAR_CLASSES | |
135 @item RE_CHAR_CLASSES | |
136 If this bit is set, then you can use character classes in lists; if this | |
137 bit isn't set, then you can't. | |
138 | |
139 @cnindex RE_CONTEXT_INDEP_ANCHORS | |
140 @item RE_CONTEXT_INDEP_ANCHORS | |
141 If this bit is set, then @samp{^} and @samp{$} are special anywhere outside | |
142 a list; if this bit isn't set, then these characters are special only in | |
143 certain contexts. @xref{Match-beginning-of-line Operator}, and | |
144 @ref{Match-end-of-line Operator}. | |
145 | |
146 @cnindex RE_CONTEXT_INDEP_OPS | |
147 @item RE_CONTEXT_INDEP_OPS | |
148 If this bit is set, then certain characters are special anywhere outside | |
149 a list; if this bit isn't set, then those characters are special only in | |
150 some contexts and are ordinary elsewhere. Specifically, if this bit | |
151 isn't set, then @samp{*}, and (if the syntax bit @code{RE_LIMITED_OPS} | |
152 isn't set) @samp{+} and @samp{?} (or @samp{\+} and @samp{\?}, depending | |
153 on the syntax bit @code{RE_BK_PLUS_QM}) represent repetition operators | |
154 only if they're not first in a regular expression or just after an | |
155 open-group or alternation operator. The same holds for @samp{@{} (or | |
156 @samp{\@{}, depending on the syntax bit @code{RE_NO_BK_BRACES}) if | |
157 it is the beginning of a valid interval and the syntax bit | |
158 @code{RE_INTERVALS} is set. | |
159 | |
13647 | 160 @cnindex RE_CONTEXT_INVALID_DUP |
161 @item RE_CONTEXT_INVALID_DUP | |
162 If this bit is set, then an open-interval operator cannot occur at the | |
163 start of a regular expression, or immediately after an alternation, | |
164 open-group or close-interval operator. | |
165 | |
13531 | 166 @cnindex RE_CONTEXT_INVALID_OPS |
167 @item RE_CONTEXT_INVALID_OPS | |
168 If this bit is set, then repetition and alternation operators can't be | |
169 in certain positions within a regular expression. Specifically, the | |
170 regular expression is invalid if it has: | |
171 | |
172 @itemize @bullet | |
173 | |
174 @item | |
175 a repetition operator first in the regular expression or just after a | |
176 match-beginning-of-line, open-group, or alternation operator; or | |
177 | |
178 @item | |
179 an alternation operator first or last in the regular expression, just | |
180 before a match-end-of-line operator, or just after an alternation or | |
181 open-group operator. | |
182 | |
183 @end itemize | |
184 | |
185 If this bit isn't set, then you can put the characters representing the | |
186 repetition and alternation operators anywhere in a regular expression. | |
187 Whether or not they will in fact be operators in certain positions | |
188 depends on other syntax bits. | |
189 | |
13647 | 190 @cnindex RE_DEBUG |
191 @item RE_DEBUG | |
192 If this bit is set, and the regex library was compiled with | |
193 @code{-DDEBUG}, then internal debugging is turned on; if unset, then | |
194 it is turned off. | |
195 | |
13531 | 196 @cnindex RE_DOT_NEWLINE |
197 @item RE_DOT_NEWLINE | |
198 If this bit is set, then the match-any-character operator matches | |
199 a newline; if this bit isn't set, then it doesn't. | |
200 | |
201 @cnindex RE_DOT_NOT_NULL | |
202 @item RE_DOT_NOT_NULL | |
203 If this bit is set, then the match-any-character operator doesn't match | |
204 a null character; if this bit isn't set, then it does. | |
205 | |
13647 | 206 @cnindex RE_HAT_LISTS_NOT_NEWLINE |
207 @item RE_HAT_LISTS_NOT_NEWLINE | |
208 If this bit is set, nonmatching lists @samp{[^...]} do not match | |
209 newline; if not set, they do. | |
210 | |
211 @cnindex RE_ICASE | |
212 @item RE_ICASE | |
213 If this bit is set, then ignore case when matching; otherwise, case is | |
214 significant. | |
215 | |
13531 | 216 @cnindex RE_INTERVALS |
217 @item RE_INTERVALS | |
218 If this bit is set, then Regex recognizes interval operators; if this bit | |
219 isn't set, then it doesn't. | |
220 | |
13647 | 221 @cnindex RE_INVALID_INTERVAL_ORD |
222 @item RE_INVALID_INTERVAL_ORD | |
223 If this bit is set, a syntactically invalid interval is treated as a | |
224 string of ordinary characters. For example, the extended regular | |
225 expression @samp{a@{1} is treated as @samp{a\@{1}. | |
226 | |
13531 | 227 @cnindex RE_LIMITED_OPS |
228 @item RE_LIMITED_OPS | |
229 If this bit is set, then Regex doesn't recognize the match-one-or-more, | |
230 match-zero-or-one or alternation operators; if this bit isn't set, then | |
231 it does. | |
232 | |
233 @cnindex RE_NEWLINE_ALT | |
234 @item RE_NEWLINE_ALT | |
235 If this bit is set, then newline represents the alternation operator; if | |
236 this bit isn't set, then newline is ordinary. | |
237 | |
238 @cnindex RE_NO_BK_BRACES | |
239 @item RE_NO_BK_BRACES | |
240 If this bit is set, then @samp{@{} represents the open-interval operator | |
241 and @samp{@}} represents the close-interval operator; if this bit isn't | |
242 set, then @samp{\@{} represents the open-interval operator and | |
243 @samp{\@}} represents the close-interval operator. This bit is relevant | |
244 only if @code{RE_INTERVALS} is set. | |
245 | |
246 @cnindex RE_NO_BK_PARENS | |
247 @item RE_NO_BK_PARENS | |
248 If this bit is set, then @samp{(} represents the open-group operator and | |
249 @samp{)} represents the close-group operator; if this bit isn't set, then | |
250 @samp{\(} represents the open-group operator and @samp{\)} represents | |
251 the close-group operator. | |
252 | |
253 @cnindex RE_NO_BK_REFS | |
254 @item RE_NO_BK_REFS | |
255 If this bit is set, then Regex doesn't recognize @samp{\}@var{digit} as | |
256 the back reference operator; if this bit isn't set, then it does. | |
257 | |
258 @cnindex RE_NO_BK_VBAR | |
259 @item RE_NO_BK_VBAR | |
260 If this bit is set, then @samp{|} represents the alternation operator; | |
261 if this bit isn't set, then @samp{\|} represents the alternation | |
262 operator. This bit is irrelevant if @code{RE_LIMITED_OPS} is set. | |
263 | |
264 @cnindex RE_NO_EMPTY_RANGES | |
265 @item RE_NO_EMPTY_RANGES | |
266 If this bit is set, then a regular expression with a range whose ending | |
267 point collates lower than its starting point is invalid; if this bit | |
268 isn't set, then Regex considers such a range to be empty. | |
269 | |
13647 | 270 @cnindex RE_NO_GNU_OPS |
271 @item RE_NO_GNU_OPS | |
272 If this bit is set, GNU regex operators are not recognized; otherwise, | |
273 they are. | |
274 | |
275 @cnindex RE_NO_POSIX_BACKTRACKING | |
276 @item RE_NO_POSIX_BACKTRACKING | |
277 If this bit is set, succeed as soon as we match the whole pattern, | |
278 without further backtracking. This means that a match may not be | |
279 the leftmost longest; see @ref{What Gets Matched?}, for what this means. | |
280 | |
281 @cnindex RE_NO_SUB | |
282 @item RE_NO_SUB | |
283 If this bit is set, then @code{no_sub} will be set to one during | |
284 @code{re_compile_pattern}. This causes matching and searching routines | |
285 not to record substring match information. | |
286 | |
13531 | 287 @cnindex RE_UNMATCHED_RIGHT_PAREN_ORD |
288 @item RE_UNMATCHED_RIGHT_PAREN_ORD | |
289 If this bit is set and the regular expression has no matching open-group | |
290 operator, then Regex considers what would otherwise be a close-group | |
291 operator (based on how @code{RE_NO_BK_PARENS} is set) to match @samp{)}. | |
292 | |
293 @end table | |
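Because syntax bits are plain bit flags, you can OR individual bits into a base syntax. The sketch below does this for @code{RE_ICASE} and @code{RE_DOT_NEWLINE}. It assumes the glibc (or Gnulib) @file{regex.h} with @code{_GNU_SOURCE} defined; @code{matches_under} is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Returns 1 if PATTERN, compiled under SYNTAX, matches TEXT as a whole,
   0 if it does not, and -1 if PATTERN does not compile.
   (Hypothetical helper, not part of the library.)  */
int
matches_under (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  const char *err;
  regoff_t len = (regoff_t) strlen (text);
  regoff_t n;

  memset (&buf, 0, sizeof buf);
  saved = re_set_syntax (syntax);
  err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);              /* leave the global setting as we found it */
  if (err != NULL)
    return -1;

  n = re_match (&buf, text, len, 0, NULL);
  regfree (&buf);
  return n == len;
}
```

For example, @samp{hello} does not match @samp{HELLO} under @code{RE_SYNTAX_POSIX_EXTENDED}, but it does under @code{RE_SYNTAX_POSIX_EXTENDED | RE_ICASE}. Likewise, @samp{a.b} matches the three characters @samp{a}, newline, @samp{b} only when @code{RE_DOT_NEWLINE} is part of the syntax.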
294 | |
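The effect of @code{RE_NO_SUB} on the pattern buffer can be observed directly. This sketch reads the @code{no_sub} field after compiling; it assumes the glibc (or Gnulib) @file{regex.h} with @code{_GNU_SOURCE} defined, which exposes the field under that name, and @code{no_sub_after_compile} is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX and return the pattern buffer's no_sub
   field (0 or 1), or -1 if PATTERN does not compile.
   (Hypothetical helper, not part of the library.)  */
int
no_sub_after_compile (reg_syntax_t syntax, const char *pattern)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  const char *err;
  int no_sub;

  memset (&buf, 0, sizeof buf);
  saved = re_set_syntax (syntax);
  err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    return -1;

  no_sub = buf.no_sub;                /* set by re_compile_pattern */
  regfree (&buf);
  return no_sub;
}
```

Compiling @samp{(a)b} under @code{RE_SYNTAX_POSIX_EXTENDED} leaves @code{no_sub} at zero; adding @code{RE_NO_SUB} to the syntax sets it to one.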
295 | |
13533 | 296 @node Predefined Syntaxes |
13532 | 297 @section Predefined Syntaxes |
13531 | 298 |
299 If you're programming with Regex, you can set a pattern buffer's | |
13647 | 300 (@pxref{GNU Pattern Buffers}) |
301 syntax either to an arbitrary combination of syntax bits | |
13531 | 302 (@pxref{Syntax Bits}) or else to the configurations defined by Regex. |
303 These configurations define the syntaxes used by certain | |
17274 | 304 programs---GNU Emacs, |
13532 | 305 @cindex Emacs |
17274 | 306 POSIX Awk, |
13531 | 307 @cindex POSIX Awk |
13532 | 308 traditional Awk, |
13531 | 309 @cindex Awk |
310 Grep, | |
311 @cindex Grep | |
312 @cindex Egrep | |
17274 | 313 Egrep---in addition to syntaxes for POSIX basic and extended |
13531 | 314 regular expressions. |
315 | |
13549
bb0ceefd22dc
avoid some overlong lines from posix urls, etc.
Karl Berry <karl@freefriends.org>
parents:
13537
diff
changeset
|
316 The predefined syntaxes---taken directly from @file{regex.h}---are: |
317 | |
318 @smallexample | |
13531 | 319 #define RE_SYNTAX_EMACS 0 |
320 | |
321 #define RE_SYNTAX_AWK \ | |
322 (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \ | |
323 | RE_NO_BK_PARENS | RE_NO_BK_REFS \ | |
324 | RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \ | |
325 | RE_UNMATCHED_RIGHT_PAREN_ORD) | |
326 | |
327 #define RE_SYNTAX_POSIX_AWK \ | |
328 (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS) | |
329 | |
330 #define RE_SYNTAX_GREP \ | |
331 (RE_BK_PLUS_QM | RE_CHAR_CLASSES \ | |
332 | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \ | |
333 | RE_NEWLINE_ALT) | |
334 | |
335 #define RE_SYNTAX_EGREP \ | |
336 (RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \ | |
337 | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \ | |
338 | RE_NEWLINE_ALT | RE_NO_BK_PARENS \ | |
339 | RE_NO_BK_VBAR) | |
340 | |
341 #define RE_SYNTAX_POSIX_EGREP \ | |
342 (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES) | |
343 | |
344 /* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */ | |
345 #define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC | |
346 | |
347 #define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC | |
348 | |
349 /* Syntax bits common to both basic and extended POSIX regex syntax. */ | |
350 #define _RE_SYNTAX_POSIX_COMMON \ | |
351 (RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \ | |
352 | RE_INTERVALS | RE_NO_EMPTY_RANGES) | |
353 | |
354 #define RE_SYNTAX_POSIX_BASIC \ | |
355 (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM) | |
356 | |
357 /* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes | |
358 RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this | |
359 isn't minimal, since other operators, such as \`, aren't disabled. */ | |
360 #define RE_SYNTAX_POSIX_MINIMAL_BASIC \ | |
361 (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS) | |
362 | |
363 #define RE_SYNTAX_POSIX_EXTENDED \ | |
364 (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \ | |
365 | RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \ | |
366 | RE_NO_BK_PARENS | RE_NO_BK_VBAR \ | |
367 | RE_UNMATCHED_RIGHT_PAREN_ORD) | |
368 | |
369 /* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS | |
370 replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */ | |
371 #define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \ | |
372 (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \ | |
373 | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \ | |
374 | RE_NO_BK_PARENS | RE_NO_BK_REFS \ | |
375 | RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD) | |
13549 | 376 @end smallexample |
13531 | 377 |
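The practical difference between two predefined syntaxes can be seen by searching with the same pattern under each. The sketch below assumes the glibc (or Gnulib) @file{regex.h} with @code{_GNU_SOURCE} defined; the exact bit combinations in an installed header may differ from the listing above, and @code{search_under} is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX and search TEXT for it.  Returns the
   index of the first match, -1 if there is none, or -2 if PATTERN does
   not compile (re_search also uses -2 for internal errors).
   (Hypothetical helper, not part of the library.)  */
regoff_t
search_under (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  const char *err;
  regoff_t len = (regoff_t) strlen (text);
  regoff_t pos;

  memset (&buf, 0, sizeof buf);
  saved = re_set_syntax (syntax);
  err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    return -2;

  pos = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return pos;
}
```

Under @code{RE_SYNTAX_EGREP}, which sets @code{RE_NO_BK_VBAR}, the pattern @samp{cat|dog} finds @samp{dog} at index 4 of @samp{hot dog}. Under @code{RE_SYNTAX_GREP} the @samp{|} is ordinary, so that search fails, and the alternation is written @samp{cat\|dog} instead.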
13533 | 378 @node Collating Elements vs. Characters |
13532 | 379 @section Collating Elements vs.@: Characters |
13531 | 380 |
17274 | 381 POSIX generalizes the notion of a character to that of a |
13531 | 382 collating element. It defines a @dfn{collating element} to be ``a |
383 sequence of one or more bytes defined in the current collating sequence | |
384 as a unit of collation.'' | |
385 | |
386 This generalizes the notion of a character in | |
387 two ways. First, a single character can map into two or more collating | |
388 elements. For example, the German | |
389 @tex | |
16236 | 390 ``\ss'' |
13531 | 391 @end tex |
392 @ifinfo | |
393 ``es-zet'' | |
394 @end ifinfo | |
395 collates as the collating element @samp{s} followed by another collating | |
396 element @samp{s}. Second, two or more characters can map into one | |
397 collating element. For example, the Spanish @samp{ll} collates after | |
398 @samp{l} and before @samp{m}. | |
399 | |
17274 | 400 Since POSIX's ``collating element'' preserves the essential idea of |
13531 | 401 a ``character,'' we use the latter, more familiar, term in this document. |
402 | |
13533 | 403 @node The Backslash Character |
13531 | 404 @section The Backslash Character |
405 | |
406 @cindex \ | |
407 The @samp{\} character has one of four different meanings, depending on | |
408 the context in which you use it and what syntax bits are set | |
409 (@pxref{Syntax Bits}). It can: 1) stand for itself, 2) quote the next | |
410 character, 3) introduce an operator, or 4) do nothing. | |
411 | |
412 @enumerate | |
413 @item | |
414 It stands for itself inside a list | |
415 (@pxref{List Operators}) if the syntax bit | |
416 @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set. For example, @samp{[\]} | |
417 would match @samp{\}. | |
418 | |
419 @item | |
420 It quotes (makes ordinary, if it's special) the next character when you | |
421 use it either: | |
422 | |
423 @itemize @bullet | |
424 @item | |
425 outside a list,@footnote{Sometimes | |
426 you don't have to explicitly quote special characters to make | |
427 them ordinary. For instance, most characters lose any special meaning | |
428 inside a list (@pxref{List Operators}). In addition, if the syntax bits | |
429 @code{RE_CONTEXT_INVALID_OPS} and @code{RE_CONTEXT_INDEP_OPS} | |
430 aren't set, then (for historical reasons) the matcher considers special | |
431 characters ordinary if they are in contexts where the operations they | |
432 represent make no sense; for example, the match-zero-or-more | |
433 operator (represented by @samp{*}) matches itself in the regular | |
434 expression @samp{*foo} because there is no preceding expression on which | |
435 it can operate. It is poor practice, however, to depend on this | |
436 behavior; if you want a special character to be ordinary outside a list, | |
437 it's better to always quote it, regardless.} or | |
438 | |
439 @item | |
440 inside a list and the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is set. | |
441 | |
442 @end itemize | |
443 | |
444 @item | |
445 It introduces an operator when followed by certain ordinary | |
446 characters---sometimes only when certain syntax bits are set. See the | |
447 cases @code{RE_BK_PLUS_QM}, @code{RE_NO_BK_BRACES}, @code{RE_NO_BK_VBAR}, | |
448 @code{RE_NO_BK_PARENS}, and @code{RE_NO_BK_REFS} in @ref{Syntax Bits}. Also: | |
449 | |
450 @itemize @bullet | |
451 @item | |
452 @samp{\b} represents the match-word-boundary operator | |
453 (@pxref{Match-word-boundary Operator}). | |
454 | |
455 @item | |
456 @samp{\B} represents the match-within-word operator | |
457 (@pxref{Match-within-word Operator}). | |
458 | |
459 @item | |
460 @samp{\<} represents the match-beginning-of-word operator @* | |
461 (@pxref{Match-beginning-of-word Operator}). | |
462 | |
463 @item | |
464 @samp{\>} represents the match-end-of-word operator | |
465 (@pxref{Match-end-of-word Operator}). | |
466 | |
467 @item | |
468 @samp{\w} represents the match-word-constituent operator | |
469 (@pxref{Match-word-constituent Operator}). | |
470 | |
471 @item | |
472 @samp{\W} represents the match-non-word-constituent operator | |
473 (@pxref{Match-non-word-constituent Operator}). | |
474 | |
475 @item | |
476 @samp{\`} represents the match-beginning-of-buffer | |
477 operator and @samp{\'} represents the match-end-of-buffer operator | |
478 (@pxref{Buffer Operators}). | |
479 | |
480 @item | |
481 If Regex was compiled with the C preprocessor symbol @code{emacs} | |
482 defined, then @samp{\s@var{class}} represents the match-syntactic-class | |
483 operator and @samp{\S@var{class}} represents the | |
484 match-not-syntactic-class operator (@pxref{Syntactic Class Operators}). | |
485 | |
486 @end itemize | |
487 | |
488 @item | |
489 In all other cases, Regex ignores @samp{\}. For example, | |
490 @samp{\n} matches @samp{n}. | |
491 | |
492 @end enumerate | |
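As a rough illustration of quoting, the sketch below uses the POSIX functions @code{regcomp} and @code{regexec} from @file{regex.h} with the @code{REG_EXTENDED} flag, under which @samp{+} is an operator outside a list. The helper name @code{search} is invented for this example and is not part of Regex.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

Under these assumptions, @code{search("1\\+1", REG_EXTENDED, "1+1")} finds a match because the backslash makes @samp{+} ordinary, whereas the unquoted pattern @samp{1+1} needs two adjacent @samp{1}s and so does not match the text @samp{1+1}.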
493 | |
13533 | 494 @node Common Operators |
13531 | 495 @chapter Common Operators |
496 | |
497 You compose regular expressions from operators. In the following | |
498 sections, we describe the regular expression operators specified by | |
17274 | 499 POSIX; GNU also uses these. Most operators have more than one |
13531 | 500 representation as characters. @xref{Regular Expression Syntax}, for |
501 what characters represent what operators under what circumstances. | |
502 | |
503 For most operators that can be represented in two ways, one | |
504 representation is a single character and the other is that character | |
505 preceded by @samp{\}. For example, either @samp{(} or @samp{\(} | |
506 represents the open-group operator. Which one does depends on the | |
507 setting of a syntax bit, in this case @code{RE_NO_BK_PARENS}. Why is | |
508 this so? Historical reasons dictate some of the varying | |
17274 | 509 representations, while POSIX dictates others. |
13531 | 510 |
511 Finally, almost all characters lose any special meaning inside a list | |
512 (@pxref{List Operators}). | |
513 | |
514 @menu | |
515 * Match-self Operator:: Ordinary characters. | |
516 * Match-any-character Operator:: . | |
517 * Concatenation Operator:: Juxtaposition. | |
518 * Repetition Operators:: * + ? @{@} | |
519 * Alternation Operator:: | | |
520 * List Operators:: [...] [^...] | |
521 * Grouping Operators:: (...) | |
522 * Back-reference Operator:: \digit | |
523 * Anchoring Operators:: ^ $ | |
524 @end menu | |
525 | |
13533 | 526 @node Match-self Operator |
13531 | 527 @section The Match-self Operator (@var{ordinary character}) |
528 | |
529 This operator matches the character itself. All ordinary characters | |
530 (@pxref{Regular Expression Syntax}) represent this operator. For | |
531 example, @samp{f} is always an ordinary character, so the regular | |
532 expression @samp{f} matches only the string @samp{f}. In | |
533 particular, it does @emph{not} match the string @samp{ff}. | |
534 | |
13533 | 535 @node Match-any-character Operator |
13531 | 536 @section The Match-any-character Operator (@code{.}) |
537 | |
538 @cindex @samp{.} | |
539 | |
540 This operator matches any single printing or nonprinting character, | |
541 except that it won't match a: | |
542 | |
543 @table @asis | |
544 @item newline | |
545 if the syntax bit @code{RE_DOT_NEWLINE} isn't set. | |
546 | |
547 @item null | |
548 if the syntax bit @code{RE_DOT_NOT_NULL} is set. | |
549 | |
550 @end table | |
551 | |
552 The @samp{.} (period) character represents this operator. For example, | |
553 @samp{a.b} matches any three-character string beginning with @samp{a} | |
554 and ending with @samp{b}. | |
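The following sketch exercises the match-any-character operator through the POSIX interface in @file{regex.h}. It assumes that interface rather than the syntax bits themselves: there, @samp{.} matches a newline by default, and the @code{REG_NEWLINE} compile flag turns that off. The helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With this helper, @samp{a.b} matches @samp{axb} but not @samp{ab}, and it matches the three-character string consisting of @samp{a}, a newline, and @samp{b} only when @code{REG_NEWLINE} is not given.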
555 | |
13533 | 556 @node Concatenation Operator |
13531 | 557 @section The Concatenation Operator |
558 | |
559 This operator concatenates two regular expressions @var{a} and @var{b}. | |
560 No character represents this operator; you simply put @var{b} after | |
561 @var{a}. The result is a regular expression that will match a string if | |
562 @var{a} matches its first part and @var{b} matches the rest. For | |
563 example, @samp{xy} (two match-self operators) matches @samp{xy}. | |
564 | |
13533 | 565 @node Repetition Operators |
13532 | 566 @section Repetition Operators |
13531 | 567 |
568 Repetition operators repeat the preceding regular expression a specified | |
569 number of times. | |
570 | |
571 @menu | |
572 * Match-zero-or-more Operator:: * | |
573 * Match-one-or-more Operator:: + | |
574 * Match-zero-or-one Operator:: ? | |
575 * Interval Operators:: @{@} | |
576 @end menu | |
577 | |
13533 | 578 @node Match-zero-or-more Operator |
13531 | 579 @subsection The Match-zero-or-more Operator (@code{*}) |
580 | |
581 @cindex @samp{*} | |
582 | |
583 This operator repeats the smallest possible preceding regular expression | |
584 as many times as necessary (including zero) to match the pattern. | |
585 @samp{*} represents this operator. For example, @samp{o*} | |
586 matches any string made up of zero or more @samp{o}s. Since this | |
587 operator operates on the smallest preceding regular expression, | |
588 @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}. So, | |
589 @samp{fo*} matches @samp{f}, @samp{fo}, @samp{foo}, and so on. | |
590 | |
591 Since the match-zero-or-more operator is a suffix operator, it may be | |
592 useless as such when no regular expression precedes it. This is the | |
593 case when it: | |
594 | |
595 @itemize @bullet | |
13532 | 596 @item |
13531 | 597 is first in a regular expression, or |
598 | |
13532 | 599 @item |
13531 | 600 follows a match-beginning-of-line, open-group, or alternation |
601 operator. | |
602 | |
603 @end itemize | |
604 | |
605 @noindent | |
606 Three different things can happen in these cases: | |
607 | |
608 @enumerate | |
609 @item | |
610 If the syntax bit @code{RE_CONTEXT_INVALID_OPS} is set, then the | |
611 regular expression is invalid. | |
612 | |
613 @item | |
614 If @code{RE_CONTEXT_INVALID_OPS} isn't set, but | |
615 @code{RE_CONTEXT_INDEP_OPS} is, then @samp{*} represents the | |
616 match-zero-or-more operator (which then operates on the empty string). | |
617 | |
618 @item | |
619 Otherwise, @samp{*} is ordinary. | |
620 | |
621 @end enumerate | |
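Two of these outcomes can be observed through the POSIX interface, which selects the syntax bits for you. The sketch below assumes the GNU C Library's @code{regcomp}: its basic syntax treats a leading @samp{*} as ordinary, while its extended syntax rejects the pattern. Other implementations may behave differently, and the helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

Here @code{search("*foo", 0, "a*foo")} reports a match because the @samp{*} is taken literally, and @code{search("*foo", REG_EXTENDED, "foo")} reports a compile failure.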
622 | |
623 @cindex backtracking | |
624 The matcher processes a match-zero-or-more operator by first matching as | |
625 many repetitions of the smallest preceding regular expression as it can. | |
13532 | 626 Then it continues to match the rest of the pattern. |
13531 | 627 |
628 If it can't match the rest of the pattern, it backtracks (as many times | |
629 as necessary), each time discarding one of the matches until it can | |
630 either match the entire pattern or be certain that it cannot get a | |
631 match. For example, when matching @samp{ca*ar} against @samp{caaar}, | |
632 the matcher first matches all three @samp{a}s of the string with the | |
633 @samp{a*} of the regular expression. However, it cannot then match the | |
634 final @samp{ar} of the regular expression against the final @samp{r} of | |
635 the string. So it backtracks, discarding the match of the last @samp{a} | |
636 in the string. It can then match the remaining @samp{ar}. | |
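The backtracking just described can be imitated by hand for this one pattern. The function below is only a teaching sketch and is not how Regex is implemented: it hard-codes the pattern @samp{ca*ar}, matches the whole string, and counts how many times it has to give back an @samp{a}. The name @code{match_ca_star_ar} is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if S as a whole matches the fixed pattern "ca*ar", else 0.
   *BACKTRACKS receives the number of 'a's that had to be given back.  */
static int match_ca_star_ar(const char *s, int *backtracks)
{
    assert(s != NULL && backtracks != NULL);
    *backtracks = 0;
    if (s[0] != 'c')
        return 0;

    /* Greedy step: let "a*" take every 'a' that follows the 'c'.  */
    size_t taken = 0;
    while (s[1 + taken] == 'a')
        taken++;

    /* Try to match the rest of the pattern ("ar") against the rest of
       the string; on failure, discard one repetition and try again.  */
    for (;;) {
        if (strcmp(s + 1 + taken, "ar") == 0)
            return 1;
        if (taken == 0)
            return 0;           /* nothing left to give back */
        taken--;
        (*backtracks)++;
    }
}
```

For @samp{caaar} the greedy step takes all three @samp{a}s, the remaining @samp{r} fails to match @samp{ar}, and a single backtrack is enough to succeed.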
637 | |
638 | |
13533 | 639 @node Match-one-or-more Operator |
13531 | 640 @subsection The Match-one-or-more Operator (@code{+} or @code{\+}) |
641 | |
13532 | 642 @cindex @samp{+} |
13531 | 643 |
644 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't recognize | |
645 this operator. Otherwise, if the syntax bit @code{RE_BK_PLUS_QM} isn't | |
646 set, then @samp{+} represents this operator; if it is, then @samp{\+} | |
647 does. | |
648 | |
649 This operator is similar to the match-zero-or-more operator except that | |
650 it repeats the preceding regular expression at least once; | |
651 @pxref{Match-zero-or-more Operator}, for what it operates on, how some | |
652 syntax bits affect it, and how Regex backtracks to match it. | |
653 | |
654 For example, supposing that @samp{+} represents the match-one-or-more | |
655 operator, then @samp{ca+r} matches, e.g., @samp{car} and | |
656 @samp{caaaar}, but not @samp{cr}. | |
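The example can be checked with the POSIX interface, where the @code{REG_EXTENDED} flag makes @samp{+} represent the match-one-or-more operator. This is a minimal sketch: the helper name @code{search} is hypothetical, and the @samp{^} and @samp{$} anchors are added so that the whole string must match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With it, @code{search("^ca+r$", REG_EXTENDED, text)} succeeds for @samp{car} and @samp{caaaar} and fails for @samp{cr}.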
657 | |
13533 | 658 @node Match-zero-or-one Operator |
13531 | 659 @subsection The Match-zero-or-one Operator (@code{?} or @code{\?}) |
660 @cindex @samp{?} | |
661 | |
662 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't | |
663 recognize this operator. Otherwise, if the syntax bit | |
664 @code{RE_BK_PLUS_QM} isn't set, then @samp{?} represents this operator; | |
665 if it is, then @samp{\?} does. | |
666 | |
667 This operator is similar to the match-zero-or-more operator except that | |
668 it repeats the preceding regular expression once or not at all; | |
669 @pxref{Match-zero-or-more Operator}, to see what it operates on, how | |
670 some syntax bits affect it, and how Regex backtracks to match it. | |
671 | |
672 For example, supposing that @samp{?} represents the match-zero-or-one | |
673 operator, then @samp{ca?r} matches both @samp{car} and @samp{cr}, but | |
674 nothing else. | |
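Whether @samp{?} is an operator depends on the syntax in force. The sketch below contrasts the two POSIX syntaxes offered by @code{regcomp}: with @code{REG_EXTENDED} the character is the match-zero-or-one operator, and without it (basic syntax) an unescaped @samp{?} is ordinary. The helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

In extended syntax @samp{^ca?r$} accepts @samp{car} and @samp{cr} only. In basic syntax the same pattern accepts only the literal four-character string @samp{ca?r}.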
675 | |
13533 | 676 @node Interval Operators |
13531 | 677 @subsection Interval Operators (@code{@{} @dots{} @code{@}} or @code{\@{} @dots{} @code{\@}}) |
678 | |
679 @cindex interval expression | |
680 @cindex @samp{@{} | |
681 @cindex @samp{@}} | |
682 @cindex @samp{\@{} | |
683 @cindex @samp{\@}} | |
684 | |
685 If the syntax bit @code{RE_INTERVALS} is set, then Regex recognizes | |
686 @dfn{interval expressions}. They repeat the smallest possible preceding | |
687 regular expression a specified number of times. | |
688 | |
689 If the syntax bit @code{RE_NO_BK_BRACES} is set, @samp{@{} represents | |
690 the @dfn{open-interval operator} and @samp{@}} represents the | |
691 @dfn{close-interval operator}; otherwise, @samp{\@{} and @samp{\@}} do. | |
692 | |
693 Specifically, supposing that @samp{@{} and @samp{@}} represent the | |
694 open-interval and close-interval operators, then: | |
695 | |
696 @table @code | |
697 @item @{@var{count}@} | |
698 matches exactly @var{count} occurrences of the preceding regular | |
699 expression. | |
700 | |
13537 | 701 @item @{@var{min},@} |
13531 | 702 matches @var{min} or more occurrences of the preceding regular |
703 expression. | |
704 | |
13537 | 705 @item @{@var{min},@var{max}@} |
13531 | 706 matches at least @var{min} but no more than @var{max} occurrences of |
707 the preceding regular expression. | |
708 | |
709 @end table | |
710 | |
711 The interval expression (but not necessarily the regular expression that | |
712 contains it) is invalid if: | |
713 | |
714 @itemize @bullet | |
715 @item | |
13532 | 716 @var{min} is greater than @var{max}, or |
13531 | 717 |
718 @item | |
719 any of @var{count}, @var{min}, or @var{max} are outside the range | |
720 zero to @code{RE_DUP_MAX} (which symbol @file{regex.h} | |
721 defines). | |
722 | |
723 @end itemize | |
724 | |
725 If the interval expression is invalid and the syntax bit | |
726 @code{RE_NO_BK_BRACES} is set, then Regex considers all the | |
727 characters in the would-be interval to be ordinary. If that bit | |
728 isn't set, then the regular expression is invalid. | |
729 | |
730 If the interval expression is valid but there is no preceding regular | |
731 expression on which to operate, then if the syntax bit | |
732 @code{RE_CONTEXT_INVALID_OPS} is set, the regular expression is invalid. | |
733 If that bit isn't set, then Regex considers all the characters---other | |
734 than backslashes, which it ignores---in the would-be interval to be | |
735 ordinary. | |
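The interval forms can be tried out with the POSIX interface. This sketch assumes @code{regcomp} from @file{regex.h}: with @code{REG_EXTENDED} the braces are written bare, and in basic syntax they are written @samp{\@{} and @samp{\@}}. It also assumes that an interval whose minimum exceeds its maximum makes compilation fail, which is what the GNU C Library does. The helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

For instance, @samp{^a@{2,3@}$} in extended syntax accepts @samp{aa} and @samp{aaa} but neither @samp{a} nor @samp{aaaa}, and @code{search("a@{3,2@}", REG_EXTENDED, "aaa")} returns the compile-failure value.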
736 | |
737 | |
13533 | 738 @node Alternation Operator |
13531 | 739 @section The Alternation Operator (@code{|} or @code{\|}) |
740 | |
741 @kindex | | |
742 @kindex \| | |
743 @cindex alternation operator | |
744 @cindex or operator | |
745 | |
746 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't | |
747 recognize this operator. Otherwise, if the syntax bit | |
748 @code{RE_NO_BK_VBAR} is set, then @samp{|} represents this operator; | |
749 otherwise, @samp{\|} does. | |
750 | |
751 Alternatives match one of a choice of regular expressions: | |
752 if you put the character(s) representing the alternation operator between | |
753 any two regular expressions @var{a} and @var{b}, the result matches | |
754 the union of the strings that @var{a} and @var{b} match. For | |
755 example, supposing that @samp{|} is the alternation operator, then | |
756 @samp{foo|bar|quux} would match any of @samp{foo}, @samp{bar} or | |
757 @samp{quux}. | |
758 | |
759 The alternation operator operates on the @emph{largest} possible | |
760 surrounding regular expressions. (Put another way, it has the lowest | |
761 precedence of any regular expression operator.) | |
762 Thus, the only way you can | |
763 delimit its arguments is to use grouping. For example, if @samp{(} and | |
764 @samp{)} are the open and close-group operators, then @samp{fo(o|b)ar} | |
765 would match either @samp{fooar} or @samp{fobar}. (@samp{foo|bar} would | |
766 match @samp{foo} or @samp{bar}.) | |
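The effect of grouping on alternation can be checked with a small sketch. It assumes the POSIX interface with @code{REG_EXTENDED}, where @samp{|}, @samp{(}, and @samp{)} are operators. The helper @code{ere_matches_whole} is hypothetical: it wraps the pattern in @samp{^(} @dots{} @samp{)$} so that the whole text must match. The wrapper needs its own group precisely because alternation has the lowest precedence.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if the extended-syntax PATTERN matches all of TEXT, 0 if it
   does not, and -1 if the pattern is too long or fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int ere_matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;

    assert(pattern != NULL && text != NULL);
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With this helper, @samp{fo(o|b)ar} accepts @samp{fooar} and @samp{fobar}, while @samp{foo|bar} accepts @samp{foo} and @samp{bar} but not @samp{fobar}.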
767 | |
768 @cindex backtracking | |
13532 | 769 The matcher usually tries all combinations of alternatives so as to |
13531 | 770 match the longest possible string. For example, when matching |
771 @samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux}, it cannot | |
772 take, say, the first (``depth-first'') combination it could match, since | |
13532 | 773 then it would be content to match just @samp{fooqbar}. |
13531 | 774 |
13647 | 775 Note that since the default behavior is to return the leftmost longest |
776 match, when more than one of a series of alternatives matches, the actual | |
777 match will be the longest matching alternative, not necessarily the | |
778 first in the list. | |
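The leftmost-longest rule can be observed by asking for the extent of the overall match. The sketch below assumes the POSIX interface with @code{REG_EXTENDED} and an implementation that follows the POSIX longest-match rule, as the GNU C Library does. The helper name @code{match_length} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the length of the first match of the extended-syntax PATTERN
   in TEXT, or -1 if there is no match or PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static long match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t whole[1];
    long length = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, whole, 0) == 0)
        length = (long)(whole[0].rm_eo - whole[0].rm_so);
    regfree(&re);
    return length;
}
```

Under these assumptions, matching @samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux} yields a length of 11, the whole string, rather than the 7 characters of @samp{fooqbar}. Likewise @samp{foo|foobar} against @samp{foobar} yields 6 even though @samp{foo} is listed first.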
13531 | 779 |
780 | |
13533 | 781 @node List Operators |
13531 | 782 @section List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]}) |
783 | |
784 @cindex matching list | |
785 @cindex @samp{[} | |
786 @cindex @samp{]} | |
787 @cindex @samp{^} | |
788 @cindex @samp{-} | |
789 @cindex @samp{\} | |
790 @cindex @samp{[^} | |
791 @cindex nonmatching list | |
792 @cindex matching newline | |
793 @cindex bracket expression | |
794 | |
795 A @dfn{list}, also called a @dfn{bracket expression}, is a set of one or | |
796 more items. An @dfn{item} is a character, | |
13532 | 797 a collating symbol, an equivalence class expression, |
13531 | 798 a character class expression, or a range expression. The syntax bits |
799 affect which kinds of items you can put in a list. We explain the last | |
13647 | 800 four items in subsections below. Empty lists are invalid. |
13531 | 801 |
802 A @dfn{matching list} matches a single character represented by one of | |
803 the list items. You form a matching list by enclosing one or more items | |
804 within an @dfn{open-matching-list operator} (represented by @samp{[}) | |
13532 | 805 and a @dfn{close-list operator} (represented by @samp{]}). |
13531 | 806 |
807 For example, @samp{[ab]} matches either @samp{a} or @samp{b}. | |
808 @samp{[ad]*} matches the empty string and any string composed of just | |
809 @samp{a}s and @samp{d}s in any order. Regex considers invalid a regular | |
810 expression with a @samp{[} but no matching | |
811 @samp{]}. | |
812 | |
813 @dfn{Nonmatching lists} are similar to matching lists except that they | |
814 match a single character @emph{not} represented by one of the list | |
815 items. You use an @dfn{open-nonmatching-list operator} (represented by | |
816 @samp{[^}@footnote{Regex therefore doesn't consider the @samp{^} to be | |
817 the first character in the list. If you put a @samp{^} character first | |
818 in (what you think is) a matching list, you'll turn it into a | |
819 nonmatching list.}) instead of an open-matching-list operator to start a | |
13532 | 820 nonmatching list. |
13531 | 821 |
822 For example, @samp{[^ab]} matches any character except @samp{a} or | |
13532 | 823 @samp{b}. |
13531 | 824 |
13647 | 825 If the syntax bit @code{RE_HAT_LISTS_NOT_NEWLINE} is set, then |
826 nonmatching lists do not match a newline. | |
13531 | 827 |
828 Most characters lose any special meaning inside a list. The special | |
829 characters inside a list follow. | |
830 | |
831 @table @samp | |
832 @item ] | |
833 ends the list if it's not the first list item. So, if you want to make | |
834 the @samp{]} character a list item, you must put it first. | |
835 | |
836 @item \ | |
837 quotes the next character if the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is | |
838 set. | |
839 | |
840 @item [. | |
841 represents the open-collating-symbol operator (@pxref{Collating Symbol | |
842 Operators}). | |
843 | |
844 @item .] | |
845 represents the close-collating-symbol operator. | |
846 | |
847 @item [= | |
848 represents the open-equivalence-class operator (@pxref{Equivalence Class | |
849 Operators}). | |
850 | |
851 @item =] | |
852 represents the close-equivalence-class operator. | |
853 | |
854 @item [: | |
855 represents the open-character-class operator (@pxref{Character Class | |
856 Operators}) if the syntax bit @code{RE_CHAR_CLASSES} is set and what | |
857 follows is a valid character class expression. | |
858 | |
859 @item :] | |
860 represents the close-character-class operator if the syntax bit | |
861 @code{RE_CHAR_CLASSES} is set and what precedes it is an | |
862 open-character-class operator followed by a valid character class name. | |
863 | |
13532 | 864 @item - |
13531 | 865 represents the range operator (@pxref{Range Operator}) if it's |
866 not first or last in a list or the ending point of a range. | |
867 | |
868 @end table | |
869 | |
870 @noindent | |
13532 | 871 All other characters are ordinary. For example, @samp{[.*]} matches |
872 @samp{.} and @samp{*}. | |
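The list rules above can be exercised with a short sketch. It assumes the POSIX interface (@code{regcomp} in basic syntax, selected by passing no syntax flag) and uses a hypothetical helper named @code{search}. The @samp{^} and @samp{$} around each list are anchors, added so that the text must consist of exactly one character.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With it, @samp{[ab]} accepts @samp{a}, @samp{[^ab]} accepts @samp{c}, @samp{[]x]} accepts @samp{]} because the bracket comes first, @samp{[.*]} accepts @samp{*}, and the unterminated pattern @samp{[ab} fails to compile.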
13531 | 873 |
874 @menu | |
13647 | 875 * Collating Symbol Operators:: [.elem.] |
876 * Equivalence Class Operators:: [=class=] | |
13531 | 877 * Character Class Operators:: [:class:] |
878 * Range Operator:: start-end | |
879 @end menu | |
880 | |
13647 | 881 |
882 @node Collating Symbol Operators | |
883 @subsection Collating Symbol Operators (@code{[.} @dots{} @code{.]}) | |
884 | |
13648 | 885 Collating symbols can be represented inside lists. |
13647 | 886 You form a @dfn{collating symbol} by |
13531 | 887 putting a collating element between an @dfn{open-collating-symbol |
14774 | 888 operator} and a @dfn{close-collating-symbol operator}. @samp{[.} |
13531 | 889 represents the open-collating-symbol operator and @samp{.]} represents |
890 the close-collating-symbol operator. For example, if @samp{ll} is a | |
891 collating element, then @samp{[[.ll.]]} would match @samp{ll}. | |
892 | |
13647 | 893 @node Equivalence Class Operators |
894 @subsection Equivalence Class Operators (@code{[=} @dots{} @code{=]}) | |
13531 | 895 @cindex equivalence class expression in regex |
896 @cindex @samp{[=} in regex | |
897 @cindex @samp{=]} in regex | |
898 | |
13647 | 899 Regex recognizes equivalence class |
13531 | 900 expressions inside lists. An @dfn{equivalence class expression} is a set |
901 of collating elements which all belong to the same equivalence class. | |
902 You form an equivalence class expression by putting a collating | |
903 element between an @dfn{open-equivalence-class operator} and a | |
904 @dfn{close-equivalence-class operator}. @samp{[=} represents the | |
905 open-equivalence-class operator and @samp{=]} represents the | |
906 close-equivalence-class operator. For example, if @samp{a} and @samp{A} | |
907 were an equivalence class, then both @samp{[[=a=]]} and @samp{[[=A=]]} | |
908 would match both @samp{a} and @samp{A}. If the collating element in an | |
909 equivalence class expression isn't part of an equivalence class, then | |
910 the matcher considers the equivalence class expression to be a collating | |
911 symbol. | |
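Multi-character collating elements such as @samp{ll} and non-trivial equivalence classes exist only in suitable locales, so they are hard to demonstrate portably. The sketch below therefore stays in the default C locale and assumes the GNU C Library's @code{regcomp}, which there accepts a single character between @samp{[.} and @samp{.]} or between @samp{[=} and @samp{=]} and treats it as that character. Some other C libraries reject these forms altogether. The helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

Under those assumptions, @samp{[[.-.]]} matches a hyphen and @samp{[[=a=]]} matches @samp{a} but not @samp{b}.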
912 | |
13533 | 913 @node Character Class Operators |
13531 | 914 @subsection Character Class Operators (@code{[:} @dots{} @code{:]}) |
915 | |
916 @cindex character classes | |
15563 | 917 @cindex @samp{[colon} in regex |
918 @cindex @samp{colon]} in regex | |
13531 | 919 |
13647 | 920 If the syntax bit @code{RE_CHAR_CLASSES} is set, then Regex recognizes |
921 character class expressions inside lists. A @dfn{character class | |
922 expression} matches one character from a given class. You form a | |
923 character class expression by putting a character class name between | |
924 an @dfn{open-character-class operator} (represented by @samp{[:}) and | |
925 a @dfn{close-character-class operator} (represented by @samp{:]}). | |
926 The character class names and their meanings are: | |
13531 | 927 |
928 @table @code | |
929 | |
13532 | 930 @item alnum |
13531 | 931 letters and digits |
932 | |
933 @item alpha | |
934 letters | |
935 | |
936 @item blank | |
17274 | 937 system-dependent; for GNU, a space or tab |
13531 | 938 |
939 @item cntrl | |
17274 | 940 control characters (in the ASCII encoding, code 0177 and codes |
13531 | 941 less than 040) |
942 | |
943 @item digit | |
944 digits | |
945 | |
946 @item graph | |
947 same as @code{print} except omits space | |
948 | |
13532 | 949 @item lower |
13531 | 950 lowercase letters |
951 | |
952 @item print | |
17274 | 953 printable characters (in the ASCII encoding, space through |
13531 | 954 tilde, codes 040 through 0176) |
955 | |
956 @item punct | |
957 printing characters that are neither space nor alphanumeric | |
958 | |
959 @item space | |
960 space, tab, carriage return, newline, vertical tab, and form feed | |
961 | |
962 @item upper | |
963 uppercase letters | |
964 | |
965 @item xdigit | |
966 hexadecimal digits: @code{0}--@code{9}, @code{a}--@code{f}, @code{A}--@code{F} | |
967 | |
968 @end table | |
969 | |
970 @noindent | |
971 These correspond to the definitions in the C library's @file{<ctype.h>} | |
972 facility. For example, @samp{[:alpha:]} corresponds to the standard | |
973 facility @code{isalpha}. Regex recognizes character class expressions | |
974 only inside of lists; so @samp{[[:alpha:]]} matches any letter, but | |
975 @samp{[:alpha:]} outside of a bracket expression is itself an ordinary | |
976 list that matches any one of @samp{:}, @samp{a}, @samp{l}, @samp{p}, or @samp{h}. | |
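A few character class expressions can be tried with the POSIX interface, which always recognizes them inside lists. This is a minimal sketch in the default C locale, and the helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

With it, @samp{[[:alpha:]]} accepts @samp{q} but not @samp{7}, @samp{[[:xdigit:]]} accepts @samp{F} but not @samp{g}, and a list may combine several classes, as in @samp{[[:digit:][:space:]]*}.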
977 | |
13533 | 978 @node Range Operator |
13531 | 979 @subsection The Range Operator (@code{-}) |
980 | |
981 Regex recognizes @dfn{range expressions} inside a list. They represent | |
982 those characters | |
983 that fall between two elements in the current collating sequence. You | |
13532 | 984 form a range expression by putting a @dfn{range operator} between two |
13531 | 985 of any of the following: characters, collating elements, collating symbols, |
986 and equivalence class expressions. The starting point of the range and | |
987 the ending point of the range don't have to be the same kind of item, | |
988 e.g., the starting point could be a collating element and the ending | |
989 point could be an equivalence class expression. If a range's ending | |
990 point is an equivalence class, then all the collating elements in that | |
13647 | 991 class will be in the range.@footnote{You can't use a character class for the starting |
13531 | 992 or ending point of a range, since a character class is not a single |
993 character.} @samp{-} represents the range operator. For example, | |
994 @samp{a-f} within a list represents all the characters from @samp{a} | |
995 through @samp{f} | |
996 inclusively. | |
997 | |
998 If the syntax bit @code{RE_NO_EMPTY_RANGES} is set, then if the range's | |
999 ending point collates less than its starting point, the range (and the | |
1000 regular expression containing it) is invalid. For example, the regular | |
1001 expression @samp{[z-a]} would be invalid. If this bit isn't set, then | |
1002 Regex considers such a range to be empty. | |
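The POSIX interface behaves as if @code{RE_NO_EMPTY_RANGES} were set, so it can be used to see a reversed range being rejected. The sketch below assumes @code{regcomp} in the default C locale, and the helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

Here @samp{[a-f]} accepts @samp{c} and rejects @samp{g}, while @code{search("[z-a]", 0, "m")} returns the compile-failure value.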
1003 | |
1004 Since @samp{-} represents the range operator, if you want to make a | |
1005 @samp{-} character itself | |
1006 a list item, you must do one of the following: | |
1007 | |
1008 @itemize @bullet | |
1009 @item | |
1010 Put the @samp{-} either first or last in the list. | |
1011 | |
1012 @item | |
1013 Include a range whose starting point collates strictly lower than | |
1014 @samp{-} and whose ending point collates equal or higher. Unless a | |
1015 range is the first item in a list, a @samp{-} can't be its starting | |
1016 point, but @emph{can} be its ending point. That is because Regex | |
1017 considers @samp{-} to be the range operator unless it is preceded by | |
17274 | 1018 another @samp{-}. For example, in the ASCII encoding, @samp{)}, |
13531 | 1019 @samp{*}, @samp{+}, @samp{,}, @samp{-}, @samp{.}, and @samp{/} are |
1020 contiguous characters in the collating sequence. You might think that | |
1021 @samp{[)-+--/]} has two ranges: @samp{)-+} and @samp{--/}. Rather, it | |
1022 has the ranges @samp{)-+} and @samp{+--}, plus the character @samp{/}, so | |
1023 it matches, e.g., @samp{,}, not @samp{.}. | |
1024 | |
1025 @item | |
1026 Put a range whose starting point is @samp{-} first in the list. | |
1027 | |
1028 @end itemize | |
1029 | |
1030 For example, @samp{[-a-z]} matches a lowercase letter or a hyphen (in | |
17274 | 1031 English, in ASCII). |
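The first of these techniques, putting the @samp{-} first or last in the list, is easy to check. The sketch assumes the POSIX interface in the default C locale, and the helper name @code{search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (compiled with CFLAGS) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.
   Hypothetical helper, written only for this illustration.  */
static int search(const char *pattern, int cflags, const char *text)
{
    regex_t re;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}
```

Both @samp{[-a-z]} and @samp{[a-z-]} accept a hyphen as well as any lowercase letter, and neither accepts @samp{Q}.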
13531 | 1032 |
1033 | |
13533 | 1034 @node Grouping Operators |
13531 | 1035 @section Grouping Operators (@code{(} @dots{} @code{)} or @code{\(} @dots{} @code{\)}) |
1036 | |
1037 @kindex ( | |
1038 @kindex ) | |
1039 @kindex \( | |
1040 @kindex \) | |
1041 @cindex grouping | |
1042 @cindex subexpressions | |
1043 @cindex parenthesizing | |
1044 | |
1045 A @dfn{group}, also known as a @dfn{subexpression}, consists of an | |
1046 @dfn{open-group operator}, any number of other operators, and a | |
1047 @dfn{close-group operator}. Regex treats this sequence as a unit, just | |
1048 as mathematics and programming languages treat a parenthesized | |
1049 expression as a unit. | |
1050 | |
1051 Therefore, using @dfn{groups}, you can: | |
1052 | |
1053 @itemize @bullet | |
1054 @item | |
1055 delimit the argument(s) to an alternation operator (@pxref{Alternation | |
1056 Operator}) or a repetition operator (@pxref{Repetition | |
1057 Operators}). | |
1058 | |
13532 | 1059 @item |
13531 | 1060 keep track of the indices of the substring that matched a given group. |
1061 @xref{Using Registers}, for a precise explanation. | |
1062 This lets you: | |
1063 | |
1064 @itemize @bullet | |
1065 @item | |
1066 use the back-reference operator (@pxref{Back-reference Operator}). | |
1067 | |
13532 | 1068 @item |
13531 | 1069 use registers (@pxref{Using Registers}). |
1070 | |
1071 @end itemize | |
1072 | |
1073 @end itemize | |
1074 | |
1075 If the syntax bit @code{RE_NO_BK_PARENS} is set, then @samp{(} represents | |
1076 the open-group operator and @samp{)} represents the | |
1077 close-group operator; otherwise, @samp{\(} and @samp{\)} do. | |
1078 | |
1079 If the syntax bit @code{RE_UNMATCHED_RIGHT_PAREN_ORD} is set and a | |
1080 close-group operator has no matching open-group operator, then Regex | |
1081 considers it to match @samp{)}. | |
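
The effect of the two syntax bits above can be observed directly. The following is a sketch under stated assumptions: it uses the GNU functions as declared by glibc's @file{regex.h} with @code{_GNU_SOURCE}, and the names @code{group_match} and @code{SKETCH_COMPILE_FAILED} are made up for this illustration.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

enum { SKETCH_COMPILE_FAILED = -3 };    /* sketch-only; not a Regex value */

/* Compile PATTERN under SYNTAX and return re_match's result for TEXT
   at index 0, or SKETCH_COMPILE_FAILED if PATTERN is invalid.  */
int
group_match (const char *pattern, reg_syntax_t syntax, const char *text)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && text != NULL);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return SKETCH_COMPILE_FAILED;
  result = re_match (&buf, text, (regoff_t) strlen (text), 0, NULL);
  regfree (&buf);
  return result;
}
```

For instance, @code{group_match ("(ab)*", RE_NO_BK_PARENS, "abab")} treats the parentheses as a group, whereas with no bits set the same pattern needs a literal @samp{(} in the string; the group is then written @samp{\(ab\)*}.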
1082 | |
1083 | |
1084 @node Back-reference Operator |
13531 | 1085 @section The Back-reference Operator (@dfn{\}@var{digit}) |
1086 | |
1087 @cindex back references | |
1088 | |
1089 If the syntax bit @code{RE_NO_BK_REFS} isn't set, then Regex recognizes | |
1090 back references. A back reference matches a specified preceding group. | |
1091 The back reference operator is represented by @samp{\@var{digit}} | |
1092 anywhere after the end of a regular expression's @w{@var{digit}-th} | |
1093 group (@pxref{Grouping Operators}). | |
1094 | |
1095 @var{digit} must be between @samp{1} and @samp{9}. The matcher assigns | |
1096 numbers 1 through 9 to the first nine groups it encounters. By using | |
1097 one of @samp{\1} through @samp{\9} after the corresponding group's | |
1098 close-group operator, you can match a substring identical to the | |
1099 one that the group does. | |
1100 | |
1101 Back references match according to the following (in all examples below, | |
1102 @samp{(} represents the open-group, @samp{)} the close-group, @samp{@{} | |
1103 the open-interval and @samp{@}} the close-interval operator): | |
1104 | |
1105 @itemize @bullet | |
1106 @item | |
1107 If the group matches a substring, the back reference matches an | |
1108 identical substring. For example, @samp{(a)\1} matches @samp{aa} and | |
1109 @samp{(bana)na\1bo\1} matches @samp{bananabanabobana}. Likewise, | |
1110 @samp{(.*)\1} matches any (newline-free if the syntax bit | |
1111 @code{RE_DOT_NEWLINE} isn't set) string that is composed of two | |
1112 identical halves; the @samp{(.*)} matches the first half and the | |
1113 @samp{\1} matches the second half. | |
1114 | |
1115 @item | |
1116 If the group matches more than once (as it might if followed | |
1117 by, e.g., a repetition operator), then the back reference matches the | |
1118 substring the group @emph{last} matched. For example, | |
1119 @samp{((a*)b)*\1\2} matches @samp{aabababa}; first @w{group 1} (the | |
1120 outer one) matches @samp{aab} and @w{group 2} (the inner one) matches | |
1121 @samp{aa}. Then @w{group 1} matches @samp{ab} and @w{group 2} matches | |
1122 @samp{a}. So, @samp{\1} matches @samp{ab} and @samp{\2} matches | |
1123 @samp{a}. | |
1124 | |
1125 @item | |
1126 If the group doesn't participate in a match, i.e., it is part of an | |
1127 alternative not taken or a repetition operator allows zero repetitions | |
1128 of it, then the back reference makes the whole match fail. For example, | |
1129 @samp{(one()|two())-and-(three\2|four\3)} matches @samp{one-and-three} | |
1130 and @samp{two-and-four}, but not @samp{one-and-four} or | |
1131 @samp{two-and-three}. To see why, suppose the pattern has matched | |
1132 @samp{one-and-}; then its @w{group 2} matches the empty string and its | |
1133 @w{group 3} doesn't participate in the match. So, if it then matches | |
1134 @samp{four}, then when it tries to back reference @w{group 3}---which it | |
1135 will attempt to do because @samp{\3} follows the @samp{four}---the match | |
1136 will fail because @w{group 3} didn't participate in the match. | |
1137 | |
1138 @end itemize | |
1139 | |
1140 You can use a back reference as an argument to a repetition operator. For | |
1141 example, @samp{(a(b))\2*} matches @samp{a} followed by one or more | |
1142 @samp{b}s. Similarly, @samp{(a(b))\2@{3@}} matches @samp{abbbb}. | |
1143 | |
1144 If there is no preceding @w{@var{digit}-th} subexpression, the regular | |
1145 expression is invalid. | |
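
Some of the simpler back-reference cases above can be checked with a few lines of C. This is only a sketch: it assumes glibc's GNU interface (@code{_GNU_SOURCE}) and the predefined syntax @code{RE_SYNTAX_POSIX_EXTENDED}, in which @samp{(}, @samp{)}, @samp{@{} and @samp{@}} are operators; @code{backref_match} and @code{SKETCH_COMPILE_FAILED} are hypothetical names. Patterns that repeat a group containing a back reference are left out on purpose, since implementations are known to differ on them.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

enum { SKETCH_COMPILE_FAILED = -3 };    /* sketch-only; not a Regex value */

/* Return how many characters of TEXT the extended-syntax PATTERN
   matches from index 0, -1 if it does not match there, or
   SKETCH_COMPILE_FAILED if PATTERN is invalid.  */
int
backref_match (const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && text != NULL);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return SKETCH_COMPILE_FAILED;
  result = re_match (&buf, text, (regoff_t) strlen (text), 0, NULL);
  regfree (&buf);
  return result;
}
```

Note that in C source the backslash must be doubled, so @samp{(a)\1} is written @code{"(a)\\1"}. A reference to a group that does not exist, as in @code{"(a)\\2"}, should be rejected at compile time.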
1146 | |
1147 | |
1148 @node Anchoring Operators |
13532 | 1149 @section Anchoring Operators |
13531 | 1150 |
1151 @cindex anchoring | |
1152 @cindex regexp anchoring | |
1153 | |
1154 These operators can constrain a pattern to match only at the beginning or | |
1155 end of the entire string or at the beginning or end of a line. | |
1156 | |
1157 @menu | |
1158 * Match-beginning-of-line Operator:: ^ | |
1159 * Match-end-of-line Operator:: $ | |
1160 @end menu | |
1161 | |
1162 | |
1163 @node Match-beginning-of-line Operator |
13531 | 1164 @subsection The Match-beginning-of-line Operator (@code{^}) |
1165 | |
1166 @kindex ^ | |
1167 @cindex beginning-of-line operator | |
1168 @cindex anchors | |
1169 | |
1170 This operator can match the empty string either at the beginning of the | |
1171 string or after a newline character. Thus, it is said to @dfn{anchor} | |
1172 the pattern to the beginning of a line. | |
1173 | |
1174 In the cases following, @samp{^} represents this operator. (Otherwise, | |
1175 @samp{^} is ordinary.) | |
1176 | |
1177 @itemize @bullet | |
1178 | |
1179 @item | |
1180 It (the @samp{^}) is first in the pattern, as in @samp{^foo}. | |
1181 | |
1182 @cnindex RE_CONTEXT_INDEP_ANCHORS @r{(and @samp{^})} | |
1183 @item | |
1184 The syntax bit @code{RE_CONTEXT_INDEP_ANCHORS} is set, and it is outside | |
1185 a bracket expression. | |
1186 | |
1187 @cindex open-group operator and @samp{^} | |
1188 @cindex alternation operator and @samp{^} | |
1189 @item | |
1190 It follows an open-group or alternation operator, as in @samp{a\(^b\)} | |
1191 and @samp{a\|^b}. @xref{Grouping Operators}, and @ref{Alternation | |
1192 Operator}. | |
1193 | |
1194 @end itemize | |
1195 | |
1196 These rules imply that some valid patterns containing @samp{^} cannot be | |
1197 matched; for example, @samp{foo^bar} if @code{RE_CONTEXT_INDEP_ANCHORS} | |
1198 is set. | |
1199 | |
1200 @vindex not_bol @r{field in pattern buffer} | |
1201 If the @code{not_bol} field is set in the pattern buffer (@pxref{GNU | |
1202 Pattern Buffers}), then @samp{^} fails to match at the beginning of the | |
13647
e5c0e28232bc
regex documentation update from Reuben Thomas <rrt@sc3d.org>, 20 Aug 2010 12:04:39 +0100
Karl Berry <karl@freefriends.org>
parents:
13554
diff
changeset
|
1203 string. This lets you match against pieces of a line, as you would need to if, |
1204 say, searching for repeated instances of a given pattern in a line; it |
1205 would work correctly for patterns both with and without |
1206 match-beginning-of-line operators. |
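
As an illustration of @code{not_bol}, the sketch below searches for @samp{^foo} with the field set or cleared. It assumes glibc's GNU interface (@code{_GNU_SOURCE}); the helper name @code{search_bol} is hypothetical, and @code{re_search} is described in @ref{GNU Searching}. The field is set after compiling, because in glibc compiling a pattern resets it.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Search TEXT for the pattern "^foo" and return the index where the
   match starts, or -1 if there is none.  If NOT_BOL is nonzero, the
   start of TEXT is not treated as the beginning of a line.  */
int
search_bol (const char *text, int not_bol)
{
  struct re_pattern_buffer buf;
  const char *pattern = "^foo";
  regoff_t len;
  int result;

  assert (text != NULL);
  len = (regoff_t) strlen (text);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;
  buf.not_bol = not_bol ? 1 : 0;        /* set after compiling */
  result = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return result;
}
```

With @code{not_bol} set, @samp{^} should still match after a newline inside the string; only the very start of the string stops counting as a line beginning.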
13531 | 1207 |
1208 | |
1209 @node Match-end-of-line Operator |
13531 | 1210 @subsection The Match-end-of-line Operator (@code{$}) |
1211 | |
1212 @kindex $ | |
1213 @cindex end-of-line operator | |
1214 @cindex anchors | |
1215 | |
1216 This operator can match the empty string either at the end of | |
1217 the string or before a newline character in the string. Thus, it is | |
1218 said to @dfn{anchor} the pattern to the end of a line. | |
1219 | |
1220 It is always represented by @samp{$}. For example, @samp{foo$} usually | |
1221 matches @samp{foo} and also the first three characters of | |
1222 @samp{foo\nbar}. | |
1223 | |
1224 Its interaction with the syntax bits and pattern buffer fields is | |
1225 exactly the dual of @samp{^}'s; see the previous section. (That is, | |
13554 | 1226 ``@samp{^}'' becomes ``@samp{$}'', ``beginning'' becomes ``end'', |
1227 ``next'' becomes ``previous'', ``after'' becomes ``before'', and | |
1228 ``@code{not_bol}'' becomes ``@code{not_eol}''.) | |
13531 | 1229 |
1230 | |
1231 @node GNU Operators |
13531 | 1232 @chapter GNU Operators |
1233 | |
17274 | 1234 Following are operators that GNU defines (and POSIX doesn't). |
13531 | 1235 |
1236 @menu | |
1237 * Word Operators:: | |
1238 * Buffer Operators:: | |
1239 @end menu | |
1240 | |
1241 @node Word Operators |
13531 | 1242 @section Word Operators |
1243 | |
1244 The operators in this section require Regex to recognize parts of words. | |
1245 Regex uses a syntax table to determine whether or not a character is | |
1246 part of a word, i.e., whether or not it is @dfn{word-constituent}. | |
1247 | |
1248 @menu | |
1249 * Non-Emacs Syntax Tables:: | |
1250 * Match-word-boundary Operator:: \b | |
1251 * Match-within-word Operator:: \B | |
1252 * Match-beginning-of-word Operator:: \< | |
1253 * Match-end-of-word Operator:: \> | |
1254 * Match-word-constituent Operator:: \w | |
1255 * Match-non-word-constituent Operator:: \W | |
1256 @end menu | |
1257 | |
1258 @node Non-Emacs Syntax Tables |
13532 | 1259 @subsection Non-Emacs Syntax Tables |
13531 | 1260 |
1261 A @dfn{syntax table} is an array indexed by the characters in your | |
17274 | 1262 character set. With 8-bit characters, therefore, a syntax table |
13531 | 1263 has 256 elements. Regex always uses a @code{char *} variable |
1264 @code{re_syntax_table} as its syntax table. In some cases, it | |
1265 initializes this variable and in others it expects you to initialize it. | |
1266 | |
1267 @itemize @bullet | |
1268 @item | |
1269 If Regex is compiled with the preprocessor symbols @code{emacs} and | |
1270 @code{SYNTAX_TABLE} both undefined, then Regex allocates | |
1271 @code{re_syntax_table} and initializes an element @var{i} either to | |
1272 @code{Sword} (which it defines) if @var{i} is a letter, number, or | |
1273 @samp{_}, or to zero if it's not. | |
1274 | |
1275 @item | |
1276 If Regex is compiled with @code{emacs} undefined but @code{SYNTAX_TABLE} | |
1277 defined, then Regex expects you to define a @code{char *} variable | |
1278 @code{re_syntax_table} to be a valid syntax table. | |
1279 | |
1280 @item | |
1281 @xref{Emacs Syntax Tables}, for what happens when Regex is compiled with | |
1282 the preprocessor symbol @code{emacs} defined. | |
1283 | |
1284 @end itemize | |
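
The default initialization described in the first item can be pictured as follows. This is a standalone sketch, not Regex's own code: @code{SKETCH_SWORD} stands in for Regex's @code{Sword}, and @code{init_word_table} and @code{is_word_constituent} are hypothetical helpers. It assumes 8-bit characters and the C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { SKETCH_SWORD = 1 };      /* hypothetical stand-in for Sword */

/* Fill TABLE, which must have 256 elements, so that letters, digits
   and '_' are marked word-constituent and everything else is zero.  */
void
init_word_table (char *table)
{
  int c;

  assert (table != NULL);
  for (c = 0; c < 256; c++)
    table[c] = (isalnum (c) || c == '_') ? SKETCH_SWORD : 0;
}

/* Return nonzero if C is word-constituent according to TABLE.  */
int
is_word_constituent (const char *table, unsigned char c)
{
  assert (table != NULL);
  return table[c] == SKETCH_SWORD;
}
```

Indexing with an @code{unsigned char} keeps the subscript in the range 0 to 255 even on systems where plain @code{char} is signed.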
1285 | |
1286 @node Match-word-boundary Operator |
13531 | 1287 @subsection The Match-word-boundary Operator (@code{\b}) |
1288 | |
1289 @cindex @samp{\b} | |
1290 @cindex word boundaries, matching | |
1291 | |
1292 This operator (represented by @samp{\b}) matches the empty string at | |
1293 either the beginning or the end of a word. For example, @samp{\brat\b} | |
1294 matches the separate word @samp{rat}. | |
1295 | |
1296 @node Match-within-word Operator |
13531 | 1297 @subsection The Match-within-word Operator (@code{\B}) |
1298 | |
1299 @cindex @samp{\B} | |
1300 | |
1301 This operator (represented by @samp{\B}) matches the empty string within | |
1302 a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but | |
1303 @samp{dirty \Brat} doesn't match @samp{dirty rat}. | |
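
The @samp{\b} and @samp{\B} examples can be reproduced with the sketch below. It assumes glibc's GNU interface (@code{_GNU_SOURCE}) and a syntax, here @code{RE_SYNTAX_POSIX_EXTENDED}, that leaves the GNU operators enabled; @code{boundary_search} is a hypothetical helper, and @code{re_search} is described in @ref{GNU Searching}.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return the index in TEXT where the first match of PATTERN starts,
   or -1 if PATTERN does not match or cannot be compiled.  */
int
boundary_search (const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  regoff_t len;
  int result;

  assert (pattern != NULL && text != NULL);
  len = (regoff_t) strlen (text);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;
  result = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return result;
}
```

For example, @code{boundary_search ("\\brat\\b", "the rat ran")} should find the separate word at index 4, while the same pattern finds nothing in @samp{crate}.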
1304 | |
1305 @node Match-beginning-of-word Operator |
13531 | 1306 @subsection The Match-beginning-of-word Operator (@code{\<}) |
1307 | |
1308 @cindex @samp{\<} | |
1309 | |
1310 This operator (represented by @samp{\<}) matches the empty string at the | |
1311 beginning of a word. | |
1312 | |
1313 @node Match-end-of-word Operator |
13531 | 1314 @subsection The Match-end-of-word Operator (@code{\>}) |
1315 | |
1316 @cindex @samp{\>} | |
1317 | |
1318 This operator (represented by @samp{\>}) matches the empty string at the | |
1319 end of a word. | |
1320 | |
1321 @node Match-word-constituent Operator |
13531 | 1322 @subsection The Match-word-constituent Operator (@code{\w}) |
1323 | |
1324 @cindex @samp{\w} | |
1325 | |
1326 This operator (represented by @samp{\w}) matches any word-constituent | |
1327 character. | |
1328 | |
1329 @node Match-non-word-constituent Operator |
13531 | 1330 @subsection The Match-non-word-constituent Operator (@code{\W}) |
1331 | |
1332 @cindex @samp{\W} | |
1333 | |
1334 This operator (represented by @samp{\W}) matches any character that is | |
1335 not word-constituent. | |
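
The four operators @samp{\<}, @samp{\>}, @samp{\w} and @samp{\W} can be exercised the same way. Again this is a sketch that assumes glibc's GNU interface (@code{_GNU_SOURCE}), the default word-constituent set (letters, digits and @samp{_}) and the C locale; @code{word_search} is a hypothetical helper.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return the index in TEXT where the first match of PATTERN starts,
   or -1 if PATTERN does not match or cannot be compiled.  */
int
word_search (const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  regoff_t len;
  int result;

  assert (pattern != NULL && text != NULL);
  len = (regoff_t) strlen (text);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;
  result = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return result;
}
```

In @samp{crate rat}, the pattern @samp{\<rat} should skip the @samp{rat} inside @samp{crate} and match at index 6; in @samp{rate rat}, @samp{rat\>} should likewise match only the final word, at index 5.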
1336 | |
1337 | |
1338 @node Buffer Operators |
13532 | 1339 @section Buffer Operators |
13531 | 1340 |
1341 Following are operators which work on buffers. In Emacs, a @dfn{buffer} | |
1342 is, naturally, an Emacs buffer. For other programs, Regex considers the | |
1343 entire string to be matched as the buffer. | |
1344 | |
1345 @menu | |
1346 * Match-beginning-of-buffer Operator:: \` | |
1347 * Match-end-of-buffer Operator:: \' | |
1348 @end menu | |
1349 | |
1350 | |
1351 @node Match-beginning-of-buffer Operator |
13531 | 1352 @subsection The Match-beginning-of-buffer Operator (@code{\`}) |
1353 | |
1354 @cindex @samp{\`} | |
1355 | |
1356 This operator (represented by @samp{\`}) matches the empty string at the | |
1357 beginning of the buffer. | |
1358 | |
1359 @node Match-end-of-buffer Operator |
13531 | 1360 @subsection The Match-end-of-buffer Operator (@code{\'}) |
1361 | |
1362 @cindex @samp{\'} | |
1363 | |
1364 This operator (represented by @samp{\'}) matches the empty string at the | |
1365 end of the buffer. | |
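
The difference between the buffer operators and the line anchors @samp{^} and @samp{$} shows up as soon as the string contains a newline. The sketch below assumes glibc's GNU interface (@code{_GNU_SOURCE}), where the whole string is the buffer; @code{buffer_search} is a hypothetical helper.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return the index in TEXT where the first match of PATTERN starts,
   or -1 if PATTERN does not match or cannot be compiled.  */
int
buffer_search (const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  regoff_t len;
  int result;

  assert (pattern != NULL && text != NULL);
  len = (regoff_t) strlen (text);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;
  result = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return result;
}
```

Here @samp{^foo} is expected to match the second line of @samp{bar\nfoo}, whereas @samp{\`foo} matches only at the very start of the string and so fails.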
1366 | |
1367 | |
1368 @node GNU Emacs Operators |
13531 | 1369 @chapter GNU Emacs Operators |
1370 | |
17274 | 1371 Following are operators that GNU defines (and POSIX doesn't) |
13531 | 1372 that you can use only when Regex is compiled with the preprocessor |
13532 | 1373 symbol @code{emacs} defined. |
13531 | 1374 |
1375 @menu | |
1376 * Syntactic Class Operators:: | |
1377 @end menu | |
1378 | |
1379 | |
1380 @node Syntactic Class Operators |
13531 | 1381 @section Syntactic Class Operators |
1382 | |
1383 The operators in this section require Regex to recognize the syntactic | |
1384 classes of characters. Regex uses a syntax table to determine this. | |
1385 | |
1386 @menu | |
1387 * Emacs Syntax Tables:: | |
1388 * Match-syntactic-class Operator:: \sCLASS | |
1389 * Match-not-syntactic-class Operator:: \SCLASS | |
1390 @end menu | |
1391 | |
1392 @node Emacs Syntax Tables |
13531 | 1393 @subsection Emacs Syntax Tables |
1394 | |
1395 A @dfn{syntax table} is an array indexed by the characters in your | |
17274 | 1396 character set. With 8-bit characters, therefore, a syntax table |
13531 | 1397 has 256 elements. |
1398 | |
1399 If Regex is compiled with the preprocessor symbol @code{emacs} defined, | |
1400 then Regex expects you to define and initialize the variable | |
1401 @code{re_syntax_table} to be an Emacs syntax table. Emacs' syntax | |
1402 tables are more complicated than Regex's own (@pxref{Non-Emacs Syntax | |
1403 Tables}). @xref{Syntax, , Syntax, emacs, The GNU Emacs User's Manual}, | |
1404 for a description of Emacs' syntax tables. | |
1405 | |
1406 @node Match-syntactic-class Operator |
13531 | 1407 @subsection The Match-syntactic-class Operator (@code{\s}@var{class}) |
1408 | |
1409 @cindex @samp{\s} | |
1410 | |
1411 This operator matches any character whose syntactic class is represented | |
1412 by a specified character. @samp{\s@var{class}} represents this operator | |
1413 where @var{class} is the character representing the syntactic class you | |
1414 want. For example, @samp{w} represents the syntactic | |
1415 class of word-constituent characters, so @samp{\sw} matches any | |
1416 word-constituent character. | |
1417 | |
1418 @node Match-not-syntactic-class Operator |
13531 | 1419 @subsection The Match-not-syntactic-class Operator (@code{\S}@var{class}) |
1420 | |
1421 @cindex @samp{\S} | |
1422 | |
1423 This operator is similar to the match-syntactic-class operator except | |
1424 that it matches any character whose syntactic class is @emph{not} | |
1425 represented by the specified character. @samp{\S@var{class}} represents | |
1426 this operator. For example, @samp{w} represents the syntactic class of | |
1427 word-constituent characters, so @samp{\Sw} matches any character that is | |
1428 not word-constituent. | |
1429 | |
1430 | |
1431 @node What Gets Matched? |
13531 | 1432 @chapter What Gets Matched? |
1433 | |
1434 Regex usually matches strings according to the ``leftmost longest'' | |
1435 rule; that is, it chooses the longest of the leftmost matches. This | |
1436 does not mean that, for a regular expression containing subexpressions, | |
1437 it simply chooses the longest match for each subexpression, left to | |
1438 right; the overall match must also be the longest possible one. | |
1439 | |
1440 For example, @samp{(ac*)(c*d[ac]*)\1} matches @samp{acdacaaa}, not | |
1441 @samp{acdac}, as it would if it were to choose the longest match for the | |
1442 first subexpression. | |
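
The ``leftmost longest'' rule can be observed with a simpler pattern that uses alternation instead of a back reference. The following sketch assumes glibc's GNU interface (@code{_GNU_SOURCE}); @code{find_span} is a hypothetical helper that reads the overall match bounds from register 0 (@pxref{Using Registers}).

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Search TEXT for PATTERN.  On success, store the bounds of the whole
   match in *START and *END and return 1; otherwise return 0.  */
int
find_span (const char *pattern, const char *text, int *start, int *end)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t len;
  int pos;

  assert (pattern != NULL && text != NULL && start != NULL && end != NULL);
  len = (regoff_t) strlen (text);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  memset (&regs, 0, sizeof regs);       /* let Regex allocate registers */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return 0;
  pos = re_search (&buf, text, len, 0, len, &regs);
  if (pos >= 0)
    {
      *start = regs.start[0];
      *end = regs.end[0];
    }
  free (regs.start);                    /* free (NULL) is harmless */
  free (regs.end);
  regfree (&buf);
  return pos >= 0;
}
```

Searching @samp{xab} for @samp{a|ab} should report the span from 1 to 3: the match starts at the leftmost possible position, and of the matches starting there the longer alternative @samp{ab} wins even though @samp{a} is listed first.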
1443 | |
1444 | |
1445 @node Programming with Regex |
13531 | 1446 @chapter Programming with Regex |
1447 | |
1448 Here we describe how you use the Regex data structures and functions in | |
17274 | 1449 C programs. Regex has three interfaces: one designed for GNU, one |
1450 compatible with POSIX (as specified by POSIX, draft | |
1451 1003.2/D11.2), and one compatible with Berkeley Unix. The | |
1452 POSIX interface is not documented here; see the documentation of | |
1453 GNU libc, or the POSIX man pages. The Berkeley Unix interface is | |
1454 documented here for convenience, since its documentation is not |
1455 otherwise readily available on GNU systems. |
13531 | 1456 |
1457 @menu | |
1458 * GNU Regex Functions:: | |
1459 * BSD Regex Functions:: | |
1460 @end menu | |
1461 | |
1462 | |
1463 @node GNU Regex Functions |
13531 | 1464 @section GNU Regex Functions |
1465 | |
1466 If you're writing code that doesn't need to be compatible with either | |
17274 | 1467 POSIX or Berkeley Unix, you can use these functions. They |
13531 | 1468 provide more options than the other interfaces. |
1469 | |
1470 @menu | |
1471 * GNU Pattern Buffers:: The re_pattern_buffer type. | |
1472 * GNU Regular Expression Compiling:: re_compile_pattern () | |
1473 * GNU Matching:: re_match () | |
1474 * GNU Searching:: re_search () | |
1475 * Matching/Searching with Split Data:: re_match_2 (), re_search_2 () | |
1476 * Searching with Fastmaps:: re_compile_fastmap () | |
16236
8d0c35a0ae1d
doc: fix minor quoting issues, mostly with `
Paul Eggert <eggert@cs.ucla.edu>
parents:
15563
diff
changeset
|
1477 * GNU Translate Tables:: The @code{translate} field. |
13531 | 1478 * Using Registers:: The re_registers type and related fns. |
1479 * Freeing GNU Pattern Buffers:: regfree () | |
1480 @end menu | |
1481 | |
1482 | |
1483 @node GNU Pattern Buffers |
13531 | 1484 @subsection GNU Pattern Buffers |
1485 | |
1486 @cindex pattern buffer, definition of | |
1487 @tindex re_pattern_buffer @r{definition} | |
1488 @tindex struct re_pattern_buffer @r{definition} | |
1489 | |
1490 To compile, match, or search for a given regular expression, you must | |
1491 supply a pattern buffer. A @dfn{pattern buffer} holds one compiled | |
1492 regular expression.@footnote{Regular expressions are also referred to as | |
1493 ``patterns,'' hence the name ``pattern buffer.''} | |
1494 | |
1495 You can have several different pattern buffers simultaneously, each | |
1496 holding a compiled pattern for a different regular expression. | |
1497 | |
1498 @file{regex.h} defines the pattern buffer @code{struct} with the |
1499 following public fields: |
13531 | 1500 |
1501 @example | |
1502 unsigned char *buffer; | |
1503 unsigned long allocated; | |
1504 char *fastmap; | |
1505 char *translate; | |
1506 size_t re_nsub; | |
1507 unsigned no_sub : 1; | |
1508 unsigned not_bol : 1; | |
1509 unsigned not_eol : 1; | |
1510 @end example | |
1511 | |
1512 | |
1513 @node GNU Regular Expression Compiling |
13531 | 1514 @subsection GNU Regular Expression Compiling |
1515 | |
17274 | 1516 In GNU, you can both match and search for a given regular |
13531 | 1517 expression. To do either, you must first compile it in a pattern buffer |
1518 (@pxref{GNU Pattern Buffers}). | |
1519 | |
1520 @cindex syntax initialization | |
1521 @vindex re_syntax_options @r{initialization} | |
1522 Regular expressions match according to the syntax with which they were | |
17274 | 1523 compiled; with GNU, you indicate what syntax you want by setting |
13553
8fc3314fe460
Document not_eol and remove mention of regex.c.
Reuben Thomas <rrt@sc3d.org>
parents:
13549
diff
changeset
|
1524 the variable @code{re_syntax_options} (declared in @file{regex.h}) |
1525 before calling the compiling function, @code{re_compile_pattern} (see |
1526 below). @xref{Syntax Bits}, and @ref{Predefined Syntaxes}. |
13531 | 1527 |
1528 You can change the value of @code{re_syntax_options} at any time. | |
1529 Usually, however, you set its value once and then never change it. | |
1530 | |
1531 @cindex pattern buffer initialization | |
1532 @code{re_compile_pattern} takes a pattern buffer as an argument. You | |
1533 must initialize the following fields: | |
1534 | |
1535 @table @code | |
1536 | |
1539 @item translate | |
1540 @vindex translate @r{initialization} | |
1541 Initialize this to point to a translate table if you want one, or to | |
1542 zero if you don't. We explain translate tables in @ref{GNU Translate | |
1543 Tables}. | |
1544 | |
1545 @item fastmap | |
1546 @vindex fastmap @r{initialization} | |
1547 Initialize this to nonzero if you want a fastmap, or to zero if you | |
1548 don't. | |
1549 | |
1550 @item buffer | |
1551 @itemx allocated | |
1552 @vindex buffer @r{initialization} | |
1553 @vindex allocated @r{initialization} | |
1554 @findex malloc | |
1555 If you want @code{re_compile_pattern} to allocate memory for the | |
1556 compiled pattern, set both of these to zero. If you have an existing | |
1557 block of memory (allocated with @code{malloc}) you want Regex to use, | |
1558 set @code{buffer} to its address and @code{allocated} to its size (in | |
1559 bytes). | |
1560 | |
1561 @code{re_compile_pattern} uses @code{realloc} to extend the space for | |
1562 the compiled pattern as necessary. | |
1563 | |
1564 @end table | |
1565 | |
1566 To compile a pattern buffer, use: | |
1567 | |
1568 @findex re_compile_pattern | |
1569 @example | |
13532 | 1570 char * |
1571 re_compile_pattern (const char *@var{regex}, const int @var{regex_size}, | |
13531 | 1572 struct re_pattern_buffer *@var{pattern_buffer}) |
1573 @end example | |
1574 | |
1575 @noindent | |
1576 @var{regex} is the regular expression's address, @var{regex_size} is its | |
1577 length, and @var{pattern_buffer} is the pattern buffer's address. | |
1578 | |
1579 If @code{re_compile_pattern} successfully compiles the regular | |
1580 expression, it returns zero and sets @code{*@var{pattern_buffer}} to the | |
1581 compiled pattern. It sets the pattern buffer's fields as follows: | |
1582 | |
1583 @table @code | |
1584 @item buffer | |
1585 @vindex buffer @r{field, set by @code{re_compile_pattern}} | |
1586 to the compiled pattern. | |
1587 | |
1588 @item syntax | |
1589 @vindex syntax @r{field, set by @code{re_compile_pattern}} | |
1590 to the current value of @code{re_syntax_options}. | |
1591 | |
1592 @item re_nsub | |
1593 @vindex re_nsub @r{field, set by @code{re_compile_pattern}} | |
1594 to the number of subexpressions in @var{regex}. | |
1595 | |
1596 @end table | |
1597 | |
1598 If @code{re_compile_pattern} can't compile @var{regex}, it returns an | |
17274 | 1599 error string corresponding to a POSIX error code. |
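
Putting the steps of this section together, a compile call might look like the sketch below. It assumes glibc's GNU interface (@code{_GNU_SOURCE}), in which the length parameter has type @code{size_t} and the returned error string is @code{const}; @code{count_groups} is a hypothetical helper. Zeroing the whole pattern buffer initializes @code{translate}, @code{fastmap}, @code{buffer} and @code{allocated} as described above.

```c
#define _GNU_SOURCE             /* exposes the GNU interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN under the POSIX extended syntax.  Return the number
   of subexpressions it contains, or -1 if it is invalid.  *ERRMSG
   receives the error string, or NULL on success.  */
int
count_groups (const char *pattern, const char **errmsg)
{
  struct re_pattern_buffer buf;
  int nsub;

  assert (pattern != NULL && errmsg != NULL);
  memset (&buf, 0, sizeof buf);         /* no translate table, no fastmap,
                                           Regex allocates the buffer */
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  *errmsg = re_compile_pattern (pattern, strlen (pattern), &buf);
  if (*errmsg != NULL)
    return -1;
  nsub = (int) buf.re_nsub;             /* set by re_compile_pattern */
  regfree (&buf);
  return nsub;
}
```

For @samp{(a)(b(c))} the helper should report three subexpressions; for the unbalanced @samp{a(b} it should return @math{-1} and leave a message in @code{*errmsg}, whose exact wording is not relied on here.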
13531 | 1600 |
1601 | |
1602 @node GNU Matching |
13532 | 1603 @subsection GNU Matching |
13531 | 1604 |
1605 @cindex matching with GNU functions | |
1606 | |
17274 | 1607 Matching the GNU way means trying to match as much of a string as |
13531 | 1608 possible, starting at a position you specify within it. Once you've compiled |
1609 a pattern into a pattern buffer (@pxref{GNU Regular Expression | |
1610 Compiling}), you can ask the matcher to match that pattern against a | |
1611 string using: | |
1612 | |
1613 @findex re_match | |
1614 @example | |
1615 int | |
13532 | 1616 re_match (struct re_pattern_buffer *@var{pattern_buffer}, |
1617 const char *@var{string}, const int @var{size}, | |
13531 | 1618 const int @var{start}, struct re_registers *@var{regs}) |
1619 @end example | |
1620 | |
1621 @noindent | |
1622 @var{pattern_buffer} is the address of a pattern buffer containing a | |
1623 compiled pattern. @var{string} is the string you want to match; it can | |
1624 contain newline and null characters. @var{size} is the length of that | |
1625 string. @var{start} is the string index at which you want to | |
1626 begin matching; the first character of @var{string} is at index zero. | |
14775
a152da4489c4
maint: replace misused "a" with "an"
Jim Meyering <meyering@redhat.com>
parents:
14774
diff
changeset
|
1627 @xref{Using Registers}, for an explanation of @var{regs}; you can safely |
13531 | 1628 pass zero. |
1629 | |
1630 @code{re_match} matches the regular expression in @var{pattern_buffer} | |
1631 against the string @var{string} according to the syntax of |
1632 @var{pattern_buffer}. (@xref{GNU Regular Expression Compiling}, for how |
1633 to set it.) The function returns @math{-1} if the compiled pattern does |
1634 not match any part of @var{string} and @math{-2} if an internal error |
1635 happens; otherwise, it returns how many (possibly zero) characters of |
1636 @var{string} the pattern matched. |
13531 | 1637 |
1638 An example: suppose @var{pattern_buffer} points to a pattern buffer | |
1639 containing the compiled pattern for @samp{a*}, and @var{string} points | |
1640 to @samp{aaaaab} (whereupon @var{size} should be 6). Then if @var{start} | |
1641 is 2, @code{re_match} returns 3, i.e., @samp{a*} would have matched the | |
1642 last three @samp{a}s in @var{string}. If @var{start} is 0, | |
1643 @code{re_match} returns 5, i.e., @samp{a*} would have matched all the | |
1644 @samp{a}s in @var{string}. If @var{start} is either 5 or 6, it returns | |
1645 zero. | |
1646 | |
1647 If @var{start} is not between zero and @var{size}, then | |
1648 @code{re_match} returns @math{-1}. | |
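
The example above can be sketched in C roughly as follows. The helper name @code{match_length_at} is hypothetical, not part of Regex. The sketch assumes the GNU C Library's @file{regex.h}, which declares the GNU functions only when @code{_GNU_SOURCE} is defined, and it selects the POSIX extended syntax explicitly.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN, try to match it against S
   starting exactly at index START, and return what re_match returns.
   Returns -2 if PATTERN does not compile.  */
int
match_length_at (const char *pattern, const char *s, int start)
{
  struct re_pattern_buffer buf;
  int result = -2;

  memset (&buf, 0, sizeof buf);   /* no buffer, fastmap or translate table */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    /* A null REGS argument means no register information is wanted.  */
    result = (int) re_match (&buf, s, (int) strlen (s), start, NULL);

  regfree (&buf);
  return result;
}
```

Under these assumptions, @code{match_length_at ("a*", "aaaaab", 2)} should return 3, and a @var{start} beyond the end of the string should yield @math{-1}.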
1649 | |
1650 | |
13533
ca70a11e70e2
Integrate the regex documentation.
Bruno Haible <bruno@clisp.org>
parents:
13532
diff
changeset
|
1651 @node GNU Searching |
13532 | 1652 @subsection GNU Searching |
13531 | 1653 |
1654 @cindex searching with GNU functions | |
1655 | |
1656 @dfn{Searching} means trying to match starting at successive positions | |
1657 within a string. The function @code{re_search} does this. | |
1658 | |
1659 Before calling @code{re_search}, you must compile your regular | |
1660 expression. @xref{GNU Regular Expression Compiling}. | |
1661 | |
1662 Here is the function declaration: | |
1663 | |
1664 @findex re_search | |
1665 @example | |
13532 | 1666 int |
1667 re_search (struct re_pattern_buffer *@var{pattern_buffer}, | |
1668 const char *@var{string}, const int @var{size}, | |
1669 const int @var{start}, const int @var{range}, | |
13531 | 1670 struct re_registers *@var{regs}) |
1671 @end example | |
1672 | |
1673 @noindent | |
1674 @vindex start @r{argument to @code{re_search}} | |
1675 @vindex range @r{argument to @code{re_search}} | |
1676 whose arguments are the same as those to @code{re_match} (@pxref{GNU | |
1677 Matching}) except that the two arguments @var{start} and @var{range} | |
1678 replace @code{re_match}'s argument @var{start}. | |
1679 | |
1680 If @var{range} is positive, then @code{re_search} attempts a match | |
1681 starting first at index @var{start}, then at @math{@var{start} + 1} if | |
1682 that fails, and so on, up to @math{@var{start} + @var{range}}; if | |
1683 @var{range} is negative, then it attempts a match starting first at | |
1684 index @var{start}, then at @math{@var{start} - 1} if that fails, and so | |
13532 | 1685 on. |
13531 | 1686 |
1687 If @var{start} is not between zero and @var{size}, then @code{re_search} | |
1688 returns @math{-1}. When @var{range} is positive, @code{re_search} | |
1689 adjusts @var{range} so that @math{@var{start} + @var{range} - 1} is | |
1690 between zero and @var{size}, if necessary; that way it won't search | |
1691 outside of @var{string}. Similarly, when @var{range} is negative, | |
1692 @code{re_search} adjusts @var{range} so that @math{@var{start} + | |
1693 @var{range} + 1} is between zero and @var{size}, if necessary. | |
1694 | |
1695 If the @code{fastmap} field of @var{pattern_buffer} is zero, | |
1696 @code{re_search} matches starting at consecutive positions; otherwise, | |
1697 it uses @code{fastmap} to make the search more efficient. | |
1698 @xref{Searching with Fastmaps}. | |
1699 | |
1700 If no match is found, @code{re_search} returns @math{-1}. If | |
1701 a match is found, it returns the index where the match began. If an | |
1702 internal error happens, it returns @math{-2}. | |
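
A minimal sketch of a forward or backward search follows. The helper @code{search_from} is a hypothetical name, and the code assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined) and the POSIX extended syntax.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <string.h>

/* Hypothetical helper: search S for PATTERN, trying start positions
   START, START + 1, ... when RANGE is positive, or START, START - 1, ...
   when RANGE is negative.  Returns the index where the match begins,
   -1 if there is no match, or -2 on a compile or internal error.  */
int
search_from (const char *pattern, const char *s, int start, int range)
{
  struct re_pattern_buffer buf;
  int result = -2;

  memset (&buf, 0, sizeof buf);   /* fastmap stays null: no fastmap is used */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    result = (int) re_search (&buf, s, (int) strlen (s), start, range, NULL);

  regfree (&buf);
  return result;
}
```

For instance, @code{search_from ("b", "abcabc", 0, 6)} searches forward from the beginning, while @code{search_from ("b", "abcabc", 5, -5)} searches backward from the last character.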
1703 | |
1704 | |
1705 @node Matching/Searching with Split Data |
13531 | 1706 @subsection Matching and Searching with Split Data |
1707 | |
1708 Using the functions @code{re_match_2} and @code{re_search_2}, you can | |
13532 | 1709 match or search in data that is divided into two strings. |
13531 | 1710 |
1711 The function: | |
1712 | |
1713 @findex re_match_2 | |
1714 @example | |
1715 int | |
13532 | 1716 re_match_2 (struct re_pattern_buffer *@var{buffer}, |
1717 const char *@var{string1}, const int @var{size1}, | |
1718 const char *@var{string2}, const int @var{size2}, | |
1719 const int @var{start}, | |
1720 struct re_registers *@var{regs}, | |
13531 | 1721 const int @var{stop}) |
1722 @end example | |
1723 | |
1724 @noindent | |
1725 is similar to @code{re_match} (@pxref{GNU Matching}) except that you | |
1726 pass @emph{two} data strings and sizes, and an index @var{stop} beyond | |
1727 which you don't want the matcher to try matching. As with | |
1728 @code{re_match}, if it succeeds, @code{re_match_2} returns how many | |
1729 characters of the data it matched. Regard @var{string1} and | |
1730 @var{string2} as concatenated when you set the arguments @var{start} and | |
1731 @var{stop} and use the contents of @var{regs}; @code{re_match_2} never | |
13532 | 1732 returns a value larger than @math{@var{size1} + @var{size2}}. |
13531 | 1733 |
1734 The function: | |
1735 | |
1736 @findex re_search_2 | |
1737 @example | |
1738 int | |
13532 | 1739 re_search_2 (struct re_pattern_buffer *@var{buffer}, |
1740 const char *@var{string1}, const int @var{size1}, | |
1741 const char *@var{string2}, const int @var{size2}, | |
1742 const int @var{start}, const int @var{range}, | |
1743 struct re_registers *@var{regs}, | |
13531 | 1744 const int @var{stop}) |
1745 @end example | |
1746 | |
1747 @noindent | |
1748 is similarly related to @code{re_search}. | |
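
As an illustration, the sketch below searches data that is split into two pieces. The helper @code{search_split} is hypothetical, and the code assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined) and the POSIX extended syntax. The returned index counts from the start of the first piece, as if the two pieces were concatenated.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <string.h>

/* Hypothetical helper: search for PATTERN in the data formed by S1
   followed by S2.  Returns the index of the match in the combined
   data, -1 if there is none, or -2 on a compile or internal error.  */
int
search_split (const char *pattern, const char *s1, const char *s2)
{
  struct re_pattern_buffer buf;
  int size1 = (int) strlen (s1);
  int size2 = (int) strlen (s2);
  int total = size1 + size2;
  int result = -2;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    /* Start at index 0, try every start position, and let matches
       extend up to the end of the combined data (STOP == TOTAL).  */
    result = (int) re_search_2 (&buf, s1, size1, s2, size2,
                                0, total, NULL, total);

  regfree (&buf);
  return result;
}
```

A match may straddle the boundary between the two pieces: searching for @samp{world} in @samp{hello wo} followed by @samp{rld} should report index 6.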
1749 | |
1750 | |
1751 @node Searching with Fastmaps |
13531 | 1752 @subsection Searching with Fastmaps |
1753 | |
1754 @cindex fastmaps | |
1755 If you're searching through a long string, you should use a fastmap. | |
1756 Without one, the searcher tries to match at consecutive positions in the | |
1757 string. Generally, most of the characters in the string could not start | |
1758 a match. It takes much longer to try matching at a given position in the | |
1759 string than it does to check in a table whether or not the character at | |
1760 that position could start a match. A @dfn{fastmap} is such a table. | |
1761 | |
1762 More specifically, a fastmap is an array indexed by the characters in | |
17274 | 1763 your character set. Under the ASCII encoding, therefore, a fastmap |
13531 | 1764 has 256 elements. If you want the searcher to use a fastmap with a |
1765 given pattern buffer, you must allocate the array and assign the array's | |
1766 address to the pattern buffer's @code{fastmap} field. You either can | |
1767 compile the fastmap yourself or have @code{re_search} do it for you; | |
1768 when @code{fastmap} is nonzero, it automatically compiles a fastmap the | |
13532 | 1769 first time you search using a particular compiled pattern. |
13531 | 1770 |
1771 By setting the buffer’s @code{fastmap} field before calling |
1772 @code{re_compile_pattern}, you can reuse a buffer data structure across |
1773 multiple searches with different patterns, and allocate the fastmap only |
1774 once. Nonetheless, the fastmap must be recompiled each time the buffer |
1775 has a new pattern compiled into it. |
1776 |
13531 | 1777 To compile a fastmap yourself, use: |
1778 | |
1779 @findex re_compile_fastmap | |
1780 @example | |
1781 int | |
1782 re_compile_fastmap (struct re_pattern_buffer *@var{pattern_buffer}) | |
1783 @end example | |
1784 | |
1785 @noindent | |
1786 @var{pattern_buffer} is the address of a pattern buffer. If the | |
1787 character @var{c} could start a match for the pattern, | |
1788 @code{re_compile_fastmap} makes | |
1789 @code{@var{pattern_buffer}->fastmap[@var{c}]} nonzero. It returns | |
1790 @math{0} if it can compile a fastmap and @math{-2} if there is an | |
1791 internal error. For example, if @samp{|} is the alternation operator | |
1792 and @var{pattern_buffer} holds the compiled pattern for @samp{a|b}, then | |
1793 @code{re_compile_fastmap} sets @code{fastmap['a']} and | |
1794 @code{fastmap['b']} (and no others). | |
1795 | |
1796 @code{re_search} uses a fastmap as it moves along in the string: it | |
1797 checks the string's characters until it finds one that's in the fastmap. | |
1798 Then it tries matching at that character. If the match fails, it | |
1799 repeats the process. So, by using a fastmap, @code{re_search} doesn't | |
1800 waste time trying to match at positions in the string that couldn't | |
1801 start a match. | |
1802 | |
1803 If you don't want @code{re_search} to use a fastmap, | |
1804 store zero in the @code{fastmap} field of the pattern buffer before | |
1805 calling @code{re_search}. | |
1806 | |
1807 Once you've initialized a pattern buffer's @code{fastmap} field, you | |
1808 need never do so again---even if you compile a new pattern in | |
1809 it---provided the way the field is set still reflects whether or not you | |
1810 want a fastmap. @code{re_search} will still either do nothing if | |
1811 @code{fastmap} is null or, if it isn't, compile a new fastmap for the | |
1812 new pattern. | |
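
The following sketch shows one way to allocate a fastmap and compile it explicitly. The helper @code{can_start_match} is a hypothetical name. The code assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined), 8-bit characters, the POSIX extended syntax (so that @samp{|} is the alternation operator), and that @code{regfree} releases a fastmap obtained from @code{malloc}, as glibc's implementation does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if character C could start a match
   of PATTERN according to its fastmap, 0 if it could not, and -1 on
   an allocation, compile, or internal error.  */
int
can_start_match (const char *pattern, unsigned char c)
{
  struct re_pattern_buffer buf;
  int result = -1;

  memset (&buf, 0, sizeof buf);
  buf.fastmap = malloc (256);     /* one entry per 8-bit character */
  if (buf.fastmap == NULL)
    return -1;

  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL
      && re_compile_fastmap (&buf) == 0)
    result = buf.fastmap[c] != 0;

  regfree (&buf);                 /* in glibc this also frees the fastmap */
  return result;
}
```

With the pattern @samp{a|b}, the helper should report that @samp{a} and @samp{b} can start a match and that @samp{c} cannot.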
1813 | |
1814 @node GNU Translate Tables |
13531 | 1815 @subsection GNU Translate Tables |
1816 | |
1817 If you set the @code{translate} field of a pattern buffer to a translate | |
17274 | 1818 table, then the GNU Regex functions to which you've passed that |
13531 | 1819 pattern buffer use it to apply a simple transformation |
1820 to all the regular expression and string characters at which they look. | |
1821 | |
1822 A @dfn{translate table} is an array indexed by the characters in your | |
17274 | 1823 character set. Under the ASCII encoding, therefore, a translate |
13531 | 1824 table has 256 elements. The array's elements are also characters in |
1825 your character set. When the Regex functions see a character @var{c}, | |
1826 they use @code{translate[@var{c}]} in its place, with one exception: the | |
1827 character after a @samp{\} is not translated. (This ensures that the | |
1828 operators, e.g., @samp{\B} and @samp{\b}, are always distinguishable.) | |
1829 | |
1830 For example, a table that maps all lowercase letters to the | |
1831 corresponding uppercase ones would cause the matcher to ignore | |
1832 differences in case.@footnote{A table that maps all uppercase letters to | |
1833 the corresponding lowercase ones would work just as well for this | |
1834 purpose.} Such a table would map all characters except lowercase letters | |
1835 to themselves, and lowercase letters to the corresponding uppercase | |
17274 | 1836 ones. Under the ASCII encoding, here's how you could initialize |
13531 | 1837 such a table (we'll call it @code{case_fold}): |
1838 | |
1839 @example | |
1840 for (i = 0; i < 256; i++) | |
1841 case_fold[i] = i; | |
1842 for (i = 'a'; i <= 'z'; i++) | |
1843 case_fold[i] = i - ('a' - 'A'); | |
1844 @end example | |
1845 | |
1846 You tell Regex to use a translate table on a given pattern buffer by | |
1847 assigning that table's address to the @code{translate} field of that | |
1848 buffer. If you don't want Regex to do any translation, put zero into | |
1849 this field. You'll get unpredictable results if you change the table's contents | |
1850 anytime between compiling the pattern buffer, compiling its fastmap, and | |
1851 matching or searching with the pattern buffer. | |
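
Putting the pieces together, a case-insensitive search might look like the sketch below. The helper @code{search_ignoring_case} is hypothetical. The code assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined), the ASCII encoding with 8-bit characters, and that @code{regfree} releases a translate table obtained from @code{malloc}, as glibc's implementation does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: search S for PATTERN, ignoring differences in
   the case of ASCII letters.  Returns the index of the match, -1 if
   there is none, or -2 on an allocation, compile, or internal error.  */
int
search_ignoring_case (const char *pattern, const char *s)
{
  struct re_pattern_buffer buf;
  unsigned char *case_fold = malloc (256);
  int i, len = (int) strlen (s), result = -2;

  if (case_fold == NULL)
    return -2;
  for (i = 0; i < 256; i++)       /* identity mapping by default */
    case_fold[i] = (unsigned char) i;
  for (i = 'a'; i <= 'z'; i++)    /* lowercase maps to uppercase */
    case_fold[i] = (unsigned char) (i - ('a' - 'A'));

  memset (&buf, 0, sizeof buf);
  buf.translate = case_fold;      /* set before compiling the pattern */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    result = (int) re_search (&buf, s, len, 0, len, NULL);

  regfree (&buf);                 /* in glibc this also frees the table */
  return result;
}
```

The table is installed before the pattern is compiled and is left unchanged until the search is finished, in line with the caution above.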
1852 | |
1853 @node Using Registers |
13531 | 1854 @subsection Using Registers |
1855 | |
16358 | 1856 A group in a regular expression can match a (possibly empty) substring |
13531 | 1857 of the string that regular expression as a whole matched. The matcher |
1858 remembers the beginning and end of the substring matched by | |
1859 each group. | |
1860 | |
1861 To find out what they matched, pass a nonzero @var{regs} argument to a | |
17274 | 1862 GNU matching or searching function (@pxref{GNU Matching} and |
13531 | 1863 @ref{GNU Searching}), i.e., the address of a structure of this type, as |
1864 defined in @file{regex.h}: | |
1865 | |
1866 @c We don't bother to include this directly from regex.h, | |
1867 @c since it changes so rarely. | |
1868 @example | |
1869 @tindex re_registers | |
1870 @vindex num_regs @r{in @code{struct re_registers}} | |
1871 @vindex start @r{in @code{struct re_registers}} | |
1872 @vindex end @r{in @code{struct re_registers}} | |
1873 struct re_registers | |
1874 @{ | |
1875 unsigned num_regs; | |
1876 regoff_t *start; | |
1877 regoff_t *end; | |
1878 @}; | |
1879 @end example | |
1880 | |
1881 Except for (possibly) the @var{num_regs}'th element (see below), the | |
1882 @var{i}th element of the @code{start} and @code{end} arrays records | |
1883 information about the @var{i}th group in the pattern. (They're declared | |
1884 as C pointers, but this is only because not all C compilers accept | |
1885 zero-length arrays; conceptually, it is simplest to think of them as | |
1886 arrays.) | |
1887 | |
1888 The @code{start} and @code{end} arrays are allocated in one of two ways. |
13531 | 1889 The simplest and perhaps most useful is to let the matcher (re)allocate |
1890 enough space to record information for all the groups in the regular | |
1891 expression. If @code{re_set_registers} is not called before searching |
1892 or matching, then the matcher allocates two arrays each of @math{1 + |
1893 @var{re_nsub}} elements (@var{re_nsub} is another field in the pattern |
1894 buffer; @pxref{GNU Pattern Buffers}). The extra element is set to |
1895 @math{-1}. Then on subsequent calls with the same pattern buffer and |
1896 @var{regs} arguments, the matcher reallocates more space if necessary. |
1897 |
1898 The function: |
1899 |
1900 @findex re_set_registers |
1901 @example |
1902 void |
1903 re_set_registers (struct re_pattern_buffer *@var{buffer}, |
1904 struct re_registers *@var{regs}, |
1905 size_t @var{num_regs}, |
1906 regoff_t *@var{starts}, regoff_t *@var{ends}) |
1907 @end example |
1908 |
1909 @noindent sets @var{regs} to hold @var{num_regs} registers, storing |
1910 them in @var{starts} and @var{ends}. Subsequent matches using |
1911 @var{buffer} and @var{regs} will use this memory for recording |
1912 register information. @var{starts} and @var{ends} must be allocated |
1913 with malloc, and must each be at least @math{@var{num_regs} * |
1914 @code{sizeof (regoff_t)}} bytes long. |
1915 |
1916 If @var{num_regs} is zero, then subsequent matches should allocate |
1917 their own register data. |
1918 |
1919 Unless this function is called, the first search or match using |
1920 @var{buffer} will allocate its own register data, without freeing the |
1921 old data. |
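
A sketch of supplying your own register arrays with @code{re_set_registers} follows. The helper @code{whole_match_end} and the array size of 10 are hypothetical choices. The code assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined) and the POSIX extended syntax, and it calls @code{re_set_registers} after compiling, on the assumption that compiling may reset how the buffer treats its registers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: search S for PATTERN using caller-allocated
   registers and return the index just past the end of the whole
   match, or -1 if nothing matched or an error occurred.  */
int
whole_match_end (const char *pattern, const char *s)
{
  enum { NUM_REGS = 10 };         /* arbitrary size chosen for this sketch */
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t *starts = malloc (NUM_REGS * sizeof (regoff_t));
  regoff_t *ends = malloc (NUM_REGS * sizeof (regoff_t));
  int len = (int) strlen (s), result = -1;

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (starts != NULL && ends != NULL
      && re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    {
      re_set_registers (&buf, &regs, NUM_REGS, starts, ends);
      if (re_search (&buf, s, len, 0, len, &regs) >= 0)
        result = (int) regs.end[0];
      /* The matcher may have reallocated the arrays, so free whatever
         REGS points to now rather than the original pointers.  */
      starts = regs.start;
      ends = regs.end;
    }

  free (starts);
  free (ends);
  regfree (&buf);
  return result;
}
```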
13531 | 1922 |
1923 The following examples illustrate the information recorded in the | |
1924 @code{re_registers} structure. (In all of them, @samp{(} represents the | |
1925 open-group and @samp{)} the close-group operator. The first character | |
1926 in the string @var{string} is at index 0.) | |
1927 | |
1928 @itemize @bullet | |
1929 | |
13532 | 1930 @item |
13531 | 1931 If the regular expression has an @w{@var{i}-th} |
1932 group that matches a |
13531 | 1933 substring of @var{string}, then the function sets |
1934 @code{@w{@var{regs}->}start[@var{i}]} to the index in @var{string} where | |
1935 the substring matched by the @w{@var{i}-th} group begins, and | |
1936 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that | |
1937 substring's end. The function sets @code{@w{@var{regs}->}start[0]} and | |
1938 @code{@w{@var{regs}->}end[0]} to analogous information about the entire | |
1939 pattern. | |
1940 | |
1941 For example, when you match @samp{((a)(b))} against @samp{ab}, you get: | |
1942 | |
1943 @itemize | |
1944 @item | |
13532 | 1945 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]} |
13531 | 1946 |
1947 @item | |
13532 | 1948 0 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]} |
13531 | 1949 |
1950 @item | |
13532 | 1951 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]} |
13531 | 1952 |
1953 @item | |
13532 | 1954 1 in @code{@w{@var{regs}->}start[3]} and 2 in @code{@w{@var{regs}->}end[3]} |
13531 | 1955 @end itemize |
1956 | |
1957 @item | |
1958 If a group matches more than once (as it might if followed by, | |
1959 e.g., a repetition operator), then the function reports the information | |
1960 about what the group @emph{last} matched. | |
1961 | |
1962 For example, when you match the pattern @samp{(a)*} against the string | |
1963 @samp{aa}, you get: | |
1964 | |
1965 @itemize | |
1966 @item | |
13532 | 1967 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]} |
13531 | 1968 |
1969 @item | |
13532 | 1970 1 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]} |
13531 | 1971 @end itemize |
1972 | |
1973 @item | |
1974 If the @w{@var{i}-th} group does not participate in a | |
1975 successful match, e.g., it is an alternative not taken or a | |
1976 repetition operator allows zero repetitions of it, then the function | |
1977 sets @code{@w{@var{regs}->}start[@var{i}]} and | |
1978 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}. | |
1979 | |
1980 For example, when you match the pattern @samp{(a)*b} against | |
1981 the string @samp{b}, you get: | |
1982 | |
1983 @itemize | |
1984 @item | |
13532 | 1985 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]} |
13531 | 1986 |
1987 @item | |
13532 | 1988 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]} |
13531 | 1989 @end itemize |
1990 | |
1991 @item | |
1992 If the @w{@var{i}-th} group matches a zero-length string, then the | |
1993 function sets @code{@w{@var{regs}->}start[@var{i}]} and | |
1994 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that | |
13532 | 1995 zero-length string. |
13531 | 1996 |
1997 For example, when you match the pattern @samp{(a*)b} against the string | |
1998 @samp{b}, you get: | |
1999 | |
2000 @itemize | |
2001 @item | |
13532 | 2002 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]} |
13531 | 2003 |
2004 @item | |
13532 | 2005 0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]} |
13531 | 2006 @end itemize |
2007 | |
2008 @item | |
13532 | 2009 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group |
13531 | 2010 in turn not contained within any other group within group @var{i} and |
2011 the function reports a match of the @w{@var{i}-th} group, then it | |
2012 records in @code{@w{@var{regs}->}start[@var{j}]} and | |
2013 @code{@w{@var{regs}->}end[@var{j}]} the last match (if it matched) of | |
2014 the @w{@var{j}-th} group. | |
2015 | |
2016 For example, when you match the pattern @samp{((a*)b)*} against the | |
2017 string @samp{abb}, @w{group 2} last matches the empty string (at | |
2018 index 2), so you get: | |
2019 | |
2020 @itemize | |
2021 @item | |
13532 | 2022 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]} |
13531 | 2023 |
2024 @item | |
13532 | 2025 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]} |
13531 | 2026 |
2027 @item | |
13532 | 2028 2 in @code{@w{@var{regs}->}start[2]} and 2 in @code{@w{@var{regs}->}end[2]} |
13531 | 2029 @end itemize |
2030 | |
2031 When you match the pattern @samp{((a)*b)*} against the string | |
2032 @samp{abb}, @w{group 2} doesn't participate in the last match, so you | |
2033 get: | |
2034 | |
2035 @itemize | |
2036 @item | |
13532 | 2037 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]} |
13531 | 2038 |
2039 @item | |
13532 | 2040 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]} |
13531 | 2041 |
2042 @item | |
13532 | 2043 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]} |
13531 | 2044 @end itemize |
2045 | |
2046 @item | |
2047 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group | |
2048 in turn not contained within any other group within group @var{i} | |
13532 | 2049 and the function sets |
2050 @code{@w{@var{regs}->}start[@var{i}]} and | |
13531 | 2051 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}, then it also sets |
2052 @code{@w{@var{regs}->}start[@var{j}]} and | |
2053 @code{@w{@var{regs}->}end[@var{j}]} to @math{-1}. | |
2054 | |
2055 For example, when you match the pattern @samp{((a)*b)*c} against the | |
2056 string @samp{c}, you get: | |
2057 | |
2058 @itemize | |
2059 @item | |
13532 | 2060 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]} |
13531 | 2061 |
2062 @item | |
13532 | 2063 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]} |
13531 | 2064 |
2065 @item | |
13532 | 2066 @math{-1} in @code{@w{@var{regs}->}start[2]} and @math{-1} in @code{@w{@var{regs}->}end[2]} |
13531 | 2067 @end itemize |
2068 | |
2069 @end itemize | |
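
The register values listed in the examples above can be inspected with a sketch like the following. The helper @code{group_span} is hypothetical. The code assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined) and the POSIX extended syntax, so that @samp{(} and @samp{)} are the group operators. It lets the matcher allocate the @code{start} and @code{end} arrays and frees them afterwards.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the beginning of S and store
   the start and end registers of group GROUP in *START and *END.
   Returns 0 on success and -1 if the pattern fails to compile or
   match, or if GROUP has no register.  */
int
group_span (const char *pattern, const char *s, unsigned group,
            int *start, int *end)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int status = -1;

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs); /* the matcher allocates the arrays */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL
      && re_match (&buf, s, (int) strlen (s), 0, &regs) >= 0
      && group < regs.num_regs)
    {
      *start = (int) regs.start[group];
      *end = (int) regs.end[group];
      status = 0;
    }

  free (regs.start);              /* null if no match was recorded */
  free (regs.end);
  regfree (&buf);
  return status;
}
```

For example, asking for group 3 of @samp{((a)(b))} matched against @samp{ab} should give 1 and 2, and asking for group 1 of @samp{(a)*b} matched against @samp{b} should give @math{-1} for both values.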
2070 | |
2071 @node Freeing GNU Pattern Buffers |
13531 | 2072 @subsection Freeing GNU Pattern Buffers |
2073 | |
17274 | 2074 To free any allocated fields of a pattern buffer, use the POSIX |
2075 function @code{regfree}: |
13531 | 2076 |
2077 @findex regfree | |
2078 @example | |
13532 | 2079 void |
13531 | 2080 regfree (regex_t *@var{preg}) |
2081 @end example | |
2082 | |
2083 @noindent | |
2084 @var{preg} is the pattern buffer whose allocated fields you want freed; |
2085 this works because the type @code{regex_t}---the type for |
17274 | 2086 POSIX pattern buffers---is equivalent to the type |
2087 @code{re_pattern_buffer}. |
2088 |
2089 @code{regfree} also sets @var{preg}'s @code{allocated} field to zero. |
2090 After a buffer has been freed, it must have a regular expression |
2091 compiled in it before passing it to a matching or searching function. |
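
As a sketch of this life cycle, the hypothetical helper @code{count_matching} below reuses a single pattern buffer for several patterns, freeing it after each use and compiling a new pattern into it before the next search. It assumes the GNU C Library's @file{regex.h} (with @code{_GNU_SOURCE} defined) and the POSIX extended syntax.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* assumption: glibc needs this for the GNU API */
#endif
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return how many of the N patterns in PATTERNS
   match somewhere in S, reusing one pattern buffer throughout.  */
int
count_matching (const char **patterns, int n, const char *s)
{
  struct re_pattern_buffer buf;
  int i, len = (int) strlen (s), count = 0;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  for (i = 0; i < n; i++)
    {
      if (re_compile_pattern (patterns[i], strlen (patterns[i]), &buf) == NULL
          && re_search (&buf, s, len, 0, len, NULL) >= 0)
        count++;
      /* After this call the buffer holds no pattern; the next
         iteration compiles a new one before searching again.  */
      regfree (&buf);
    }
  return count;
}
```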
13531 | 2092 |
2093 | |
2094 @node BSD Regex Functions |
13531 | 2095 @section BSD Regex Functions |
2096 | |
17274 | 2097 If you're writing code that has to be Berkeley Unix compatible, |
13531 | 2098 you'll need to use these functions, whose interfaces are the same as those |
17274 | 2099 in Berkeley Unix. |
13531 | 2100 |
2101 @menu | |
2102 * BSD Regular Expression Compiling:: re_comp () | |
2103 * BSD Searching:: re_exec () | |
2104 @end menu | |
2105 | |
2106 @node BSD Regular Expression Compiling |
13531 | 2107 @subsection BSD Regular Expression Compiling |
2108 | |
17274 | 2109 With Berkeley Unix, you can only search for a given regular |
13531 | 2110 expression; you can't match one. To search for it, you must first |
2111 compile it. Before you compile it, you must indicate the regular | |
13532 | 2112 expression syntax you want it compiled according to by setting the |
13531 | 2113 variable @code{re_syntax_options} (declared in @file{regex.h}) to some |
2114 syntax (@pxref{Regular Expression Syntax}). | |
2115 | |
2116 To compile a regular expression use: | |
2117 | |
2118 @findex re_comp | |
2119 @example | |
2120 char * | |
2121 re_comp (char *@var{regex}) | |
2122 @end example | |
2123 | |
2124 @noindent | |
2125 @var{regex} is the address of a null-terminated regular expression. | |
2126 @code{re_comp} uses an internal pattern buffer, so you can use only the | |
2127 most recently compiled pattern buffer. This means that if you want to | |
2128 use a given regular expression that you've already compiled---but it | |
2129 isn't the latest one you've compiled---you'll have to recompile it. If | |
2130 you call @code{re_comp} with a null pointer (@emph{not} the empty | |
2131 string) as the argument, it doesn't change the contents of the pattern | |
2132 buffer. | |
2133 | |
2134 If @code{re_comp} successfully compiles the regular expression, it | |
2135 returns zero. If it can't compile the regular expression, it returns | |
2136 an error string. @code{re_comp}'s error messages are identical to those | |
2137 of @code{re_compile_pattern} (@pxref{GNU Regular Expression | |
2138 Compiling}). | |
2139 | |
2140 @node BSD Searching |
13532 | 2141 @subsection BSD Searching |
13531 | 2142 |
17274 | 2143 Searching the Berkeley Unix way means searching in a string |
13531 | 2144 starting at its first character and trying successive positions within |
2145 it to find a match. Once you've compiled a pattern using @code{re_comp} | |
2146 (@pxref{BSD Regular Expression Compiling}), you can ask Regex | |
2147 to search for that pattern in a string using: | |
2148 | |
2149 @findex re_exec | |
2150 @example | |
2151 int | |
2152 re_exec (char *@var{string}) | |
2153 @end example | |
2154 | |
2155 @noindent | |
2156 @var{string} is the address of the null-terminated string in which you | |
2157 want to search. | |
2158 | |
2159 @code{re_exec} returns either 1 for success or 0 for failure. It | |
17274 | 2160 automatically uses a GNU fastmap (@pxref{Searching with Fastmaps}). |