annotate doc/regex.texi @ 13553:8fc3314fe460
Document not_eol and remove mention of regex.c.
author | Reuben Thomas <rrt@sc3d.org> |
---|---|
date | Sat, 14 Aug 2010 16:40:16 +0100 |
parents | bb0ceefd22dc |
children | 3a3b9d29af1b |
rev | line source |
---|---|
13533
ca70a11e70e2
Integrate the regex documentation.
Bruno Haible <bruno@clisp.org>
parents:
13532
diff
changeset
|
1 @node Overview |
13531 | 2 @chapter Overview |
3 | |
4 A @dfn{regular expression} (or @dfn{regexp}, or @dfn{pattern}) is a text | |
5 string that describes some (mathematical) set of strings. A regexp | |
6 @var{r} @dfn{matches} a string @var{s} if @var{s} is in the set of | |
7 strings described by @var{r}. | |
8 | |
9 Using the Regex library, you can: | |
10 | |
11 @itemize @bullet | |
12 | |
13 @item | |
13532 | 14 see if a string matches a specified pattern as a whole, and |
13531 | 15 |
16 @item | |
17 search within a string for a substring matching a specified pattern. | |
18 | |
19 @end itemize | |
20 | |
21 Some regular expressions match only one string, i.e., the set they | |
22 describe has only one member. For example, the regular expression | |
23 @samp{foo} matches the string @samp{foo} and no others. Other regular | |
24 expressions match more than one string, i.e., the set they describe has | |
25 more than one member. For example, the regular expression @samp{f*} | |
26 matches the set of strings made up of any number (including zero) of | |
27 @samp{f}s. As you can see, some characters in regular expressions match | |
28 themselves (such as @samp{f}) and some don't (such as @samp{*}); the | |
29 ones that don't match themselves instead let you specify patterns that | |
30 describe many different strings. | |
31 | |
32 To either match or search for a regular expression with the Regex | |
33 library functions, you must first compile it with a Regex pattern | |
34 compiling function. A @dfn{compiled pattern} is a regular expression | |
35 converted to the internal format used by the library functions. Once | |
36 you've compiled a pattern, you can use it for matching or searching any | |
37 number of times. | |
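As a rough illustration of the compile-once, use-many-times pattern described above, here is a minimal sketch using the @sc{posix} functions declared in @file{regex.h} (@code{regcomp}, @code{regexec}, and @code{regfree}). The helper name @code{count_matching} is hypothetical and not part of Regex; the sketch assumes @sc{posix} extended syntax.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count how many of the N strings contain a
   substring matching PATTERN (POSIX extended syntax).  Returns -1 if
   PATTERN does not compile.  */
int
count_matching (const char *pattern, const char *const *strings, size_t n)
{
  regex_t compiled;
  int count = 0;
  size_t i;

  /* Compile the regular expression once ...  */
  if (regcomp (&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  /* ... then reuse the compiled pattern for any number of searches.  */
  for (i = 0; i < n; i++)
    if (regexec (&compiled, strings[i], 0, NULL, 0) == 0)
      count++;

  regfree (&compiled);
  return count;
}
```

Note that @code{regexec} searches for a matching substring; to test a string as a whole, the pattern can carry its own @samp{^} and @samp{$} anchors.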
38 | |
39 The Regex library is used by including @file{regex.h}. |
13531 | 40 @pindex regex.h |
41 Regex provides three groups of functions with which you can operate on | |
42 regular expressions. One group---the @sc{gnu} group---is more powerful | |
43 but not completely compatible with the other two, namely the @sc{posix} | |
44 and Berkeley @sc{unix} groups; its interface was designed specifically | |
45 for @sc{gnu}. The other groups have the same interfaces as do the | |
46 regular expression functions in @sc{posix} and Berkeley | |
47 @sc{unix}. | |
48 | |
49 We wrote this chapter with programmers in mind, not users of | |
50 programs---such as Emacs---that use Regex. We describe the Regex | |
51 library in its entirety, not how to write regular expressions that a | |
52 particular program understands. | |
53 | |
54 | |
55 @node Regular Expression Syntax |
13531 | 56 @chapter Regular Expression Syntax |
57 | |
58 @cindex regular expressions, syntax of | |
59 @cindex syntax of regular expressions | |
60 | |
61 @dfn{Characters} are things you can type. @dfn{Operators} are things in | |
62 a regular expression that match one or more characters. You compose | |
63 regular expressions from operators, which in turn you specify using one | |
64 or more characters. | |
65 | |
66 Most characters represent what we call the match-self operator, i.e., | |
67 they match themselves; we call these characters @dfn{ordinary}. Other | |
68 characters represent either all or parts of fancier operators; e.g., | |
69 @samp{.} represents what we call the match-any-character operator | |
70 (which, no surprise, matches (almost) any character); we call these | |
71 characters @dfn{special}. Two different things determine what | |
72 characters represent what operators: | |
73 | |
74 @enumerate | |
75 @item | |
76 the regular expression syntax your program has told the Regex library to | |
77 recognize, and | |
78 | |
79 @item | |
80 the context of the character in the regular expression. | |
81 @end enumerate | |
82 | |
83 In the following sections, we describe these things in more detail. | |
84 | |
85 @menu | |
86 * Syntax Bits:: | |
87 * Predefined Syntaxes:: | |
88 * Collating Elements vs. Characters:: | |
89 * The Backslash Character:: | |
90 @end menu | |
91 | |
92 | |
93 @node Syntax Bits |
13532 | 94 @section Syntax Bits |
13531 | 95 |
96 @cindex syntax bits | |
97 | |
98 In any particular syntax for regular expressions, some characters are | |
99 always special, others are sometimes special, and others are never | |
100 special. The particular syntax that Regex recognizes for a given | |
101 regular expression depends on the value in the @code{syntax} field of | |
102 the pattern buffer of that regular expression. | |
103 | |
104 You get a pattern buffer by compiling a regular expression. @xref{GNU | |
105 Pattern Buffers}, and @ref{POSIX Pattern Buffers}, for more information | |
106 on pattern buffers. @xref{GNU Regular Expression Compiling}, @ref{POSIX | |
107 Regular Expression Compiling}, and @ref{BSD Regular Expression | |
108 Compiling}, for more information on compiling. | |
109 | |
110 Regex considers the value of the @code{syntax} field to be a collection | |
111 of bits; we refer to these bits as @dfn{syntax bits}. In most cases, | |
112 they affect what characters represent what operators. We describe the | |
113 meanings of the operators to which we refer in @ref{Common Operators}, | |
13532 | 114 @ref{GNU Operators}, and @ref{GNU Emacs Operators}. |
13531 | 115 |
116 For reference, here is the complete list of syntax bits, in alphabetical | |
117 order: | |
118 | |
119 @table @code | |
120 | |
121 @cnindex RE_BACKSLASH_ESCAPE_IN_LISTS | |
122 @item RE_BACKSLASH_ESCAPE_IN_LISTS | |
123 If this bit is set, then @samp{\} inside a list (@pxref{List Operators}) | |
124 quotes (makes ordinary, if it's special) the following character; if | |
125 this bit isn't set, then @samp{\} is an ordinary character inside lists. | |
126 (@xref{The Backslash Character}, for what @samp{\} does outside of lists.) | |
127 | |
128 @cnindex RE_BK_PLUS_QM | |
129 @item RE_BK_PLUS_QM | |
130 If this bit is set, then @samp{\+} represents the match-one-or-more | |
131 operator and @samp{\?} represents the match-zero-or-one operator; if | |
132 this bit isn't set, then @samp{+} represents the match-one-or-more | |
133 operator and @samp{?} represents the match-zero-or-one operator. This | |
134 bit is irrelevant if @code{RE_LIMITED_OPS} is set. | |
135 | |
136 @cnindex RE_CHAR_CLASSES | |
137 @item RE_CHAR_CLASSES | |
138 If this bit is set, then you can use character classes in lists; if this | |
139 bit isn't set, then you can't. | |
140 | |
141 @cnindex RE_CONTEXT_INDEP_ANCHORS | |
142 @item RE_CONTEXT_INDEP_ANCHORS | |
143 If this bit is set, then @samp{^} and @samp{$} are special anywhere outside | |
144 a list; if this bit isn't set, then these characters are special only in | |
145 certain contexts. @xref{Match-beginning-of-line Operator}, and | |
146 @ref{Match-end-of-line Operator}. | |
147 | |
148 @cnindex RE_CONTEXT_INDEP_OPS | |
149 @item RE_CONTEXT_INDEP_OPS | |
150 If this bit is set, then certain characters are special anywhere outside | |
151 a list; if this bit isn't set, then those characters are special only in | |
152 some contexts and are ordinary elsewhere. Specifically, if this bit | |
153 isn't set then @samp{*}, and (if the syntax bit @code{RE_LIMITED_OPS} | |
154 isn't set) @samp{+} and @samp{?} (or @samp{\+} and @samp{\?}, depending | |
155 on the syntax bit @code{RE_BK_PLUS_QM}) represent repetition operators | |
156 only if they're not first in a regular expression or just after an | |
157 open-group or alternation operator. The same holds for @samp{@{} (or | |
158 @samp{\@{}, depending on the syntax bit @code{RE_NO_BK_BRACES}) if | |
159 it is the beginning of a valid interval and the syntax bit | |
160 @code{RE_INTERVALS} is set. | |
161 | |
162 @cnindex RE_CONTEXT_INVALID_OPS | |
163 @item RE_CONTEXT_INVALID_OPS | |
164 If this bit is set, then repetition and alternation operators can't be | |
165 in certain positions within a regular expression. Specifically, the | |
166 regular expression is invalid if it has: | |
167 | |
168 @itemize @bullet | |
169 | |
170 @item | |
171 a repetition operator first in the regular expression or just after a | |
172 match-beginning-of-line, open-group, or alternation operator; or | |
173 | |
174 @item | |
175 an alternation operator first or last in the regular expression, just | |
176 before a match-end-of-line operator, or just after an alternation or | |
177 open-group operator. | |
178 | |
179 @end itemize | |
180 | |
181 If this bit isn't set, then you can put the characters representing the | |
182 repetition and alternation operators anywhere in a regular expression. | |
183 Whether or not they will in fact be operators in certain positions | |
184 depends on other syntax bits. | |
185 | |
186 @cnindex RE_DOT_NEWLINE | |
187 @item RE_DOT_NEWLINE | |
188 If this bit is set, then the match-any-character operator matches | |
189 a newline; if this bit isn't set, then it doesn't. | |
190 | |
191 @cnindex RE_DOT_NOT_NULL | |
192 @item RE_DOT_NOT_NULL | |
193 If this bit is set, then the match-any-character operator doesn't match | |
194 a null character; if this bit isn't set, then it does. | |
195 | |
196 @cnindex RE_INTERVALS | |
197 @item RE_INTERVALS | |
198 If this bit is set, then Regex recognizes interval operators; if this bit | |
199 isn't set, then it doesn't. | |
200 | |
201 @cnindex RE_LIMITED_OPS | |
202 @item RE_LIMITED_OPS | |
203 If this bit is set, then Regex doesn't recognize the match-one-or-more, | |
204 match-zero-or-one or alternation operators; if this bit isn't set, then | |
205 it does. | |
206 | |
207 @cnindex RE_NEWLINE_ALT | |
208 @item RE_NEWLINE_ALT | |
209 If this bit is set, then newline represents the alternation operator; if | |
210 this bit isn't set, then newline is ordinary. | |
211 | |
212 @cnindex RE_NO_BK_BRACES | |
213 @item RE_NO_BK_BRACES | |
214 If this bit is set, then @samp{@{} represents the open-interval operator | |
215 and @samp{@}} represents the close-interval operator; if this bit isn't | |
216 set, then @samp{\@{} represents the open-interval operator and | |
217 @samp{\@}} represents the close-interval operator. This bit is relevant | |
218 only if @code{RE_INTERVALS} is set. | |
219 | |
220 @cnindex RE_NO_BK_PARENS | |
221 @item RE_NO_BK_PARENS | |
222 If this bit is set, then @samp{(} represents the open-group operator and | |
223 @samp{)} represents the close-group operator; if this bit isn't set, then | |
224 @samp{\(} represents the open-group operator and @samp{\)} represents | |
225 the close-group operator. | |
226 | |
227 @cnindex RE_NO_BK_REFS | |
228 @item RE_NO_BK_REFS | |
229 If this bit is set, then Regex doesn't recognize @samp{\}@var{digit} as | |
230 the back reference operator; if this bit isn't set, then it does. | |
231 | |
232 @cnindex RE_NO_BK_VBAR | |
233 @item RE_NO_BK_VBAR | |
234 If this bit is set, then @samp{|} represents the alternation operator; | |
235 if this bit isn't set, then @samp{\|} represents the alternation | |
236 operator. This bit is irrelevant if @code{RE_LIMITED_OPS} is set. | |
237 | |
238 @cnindex RE_NO_EMPTY_RANGES | |
239 @item RE_NO_EMPTY_RANGES | |
240 If this bit is set, then a regular expression with a range whose ending | |
241 point collates lower than its starting point is invalid; if this bit | |
242 isn't set, then Regex considers such a range to be empty. | |
243 | |
244 @cnindex RE_UNMATCHED_RIGHT_PAREN_ORD | |
245 @item RE_UNMATCHED_RIGHT_PAREN_ORD | |
246 If this bit is set and the regular expression has no matching open-group | |
247 operator, then Regex considers what would otherwise be a close-group | |
248 operator (based on how @code{RE_NO_BK_PARENS} is set) to match @samp{)}. | |
249 | |
250 @end table | |
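To make the effect of syntax bits concrete, the following sketch compares @sc{posix} basic and extended syntax through the @sc{posix} interface. It assumes that @code{regcomp} selects basic syntax when called without @code{REG_EXTENDED} and extended syntax with it; the helper name @code{syntax_found} is hypothetical. Recognition of @samp{\+} in basic syntax is a @sc{gnu} behavior and may not hold in other implementations.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for PATTERN,
   0 if it does not, and -1 if PATTERN does not compile.  CFLAGS is 0
   for POSIX basic syntax or REG_EXTENDED for POSIX extended syntax.  */
int
syntax_found (const char *pattern, const char *text, int cflags)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```

Under these assumptions, @code{syntax_found ("a+", "aa", REG_EXTENDED)} reports a match because @samp{+} is the match-one-or-more operator, while @code{syntax_found ("a+", "aa", 0)} does not, because basic syntax sets @code{RE_BK_PLUS_QM} and @samp{+} is ordinary. Likewise, @samp{(ab)} is a group in extended syntax but three ordinary characters plus parentheses in basic syntax, where @code{RE_NO_BK_PARENS} isn't set.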
251 | |
252 | |
253 @node Predefined Syntaxes |
13532 | 254 @section Predefined Syntaxes |
13531 | 255 |
256 If you're programming with Regex, you can set a pattern buffer's | |
257 (@pxref{GNU Pattern Buffers}, and @ref{POSIX Pattern Buffers}) | |
258 @code{syntax} field either to an arbitrary combination of syntax bits | |
259 (@pxref{Syntax Bits}) or else to the configurations defined by Regex. | |
260 These configurations define the syntaxes used by certain | |
261 programs---@sc{gnu} Emacs, | |
13532 | 262 @cindex Emacs |
13531 | 263 @sc{posix} Awk, |
264 @cindex POSIX Awk | |
13532 | 265 traditional Awk, |
13531 | 266 @cindex Awk |
267 Grep, | |
268 @cindex Grep | |
269 @cindex Egrep | |
270 Egrep---in addition to syntaxes for @sc{posix} basic and extended | |
271 regular expressions. | |
272 | |
13549
bb0ceefd22dc
avoid some overlong lines from posix urls, etc.
Karl Berry <karl@freefriends.org>
parents:
13537
diff
changeset
|
273 The predefined syntaxes---taken directly from @file{regex.h}---are: |
274 |
275 @smallexample |
13531 | 276 #define RE_SYNTAX_EMACS 0 |
277 | |
278 #define RE_SYNTAX_AWK \ | |
279 (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \ | |
280 | RE_NO_BK_PARENS | RE_NO_BK_REFS \ | |
281 | RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \ | |
282 | RE_UNMATCHED_RIGHT_PAREN_ORD) | |
283 | |
284 #define RE_SYNTAX_POSIX_AWK \ | |
285 (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS) | |
286 | |
287 #define RE_SYNTAX_GREP \ | |
288 (RE_BK_PLUS_QM | RE_CHAR_CLASSES \ | |
289 | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \ | |
290 | RE_NEWLINE_ALT) | |
291 | |
292 #define RE_SYNTAX_EGREP \ | |
293 (RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \ | |
294 | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \ | |
295 | RE_NEWLINE_ALT | RE_NO_BK_PARENS \ | |
296 | RE_NO_BK_VBAR) | |
297 | |
298 #define RE_SYNTAX_POSIX_EGREP \ | |
299 (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES) | |
300 | |
301 /* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */ | |
302 #define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC | |
303 | |
304 #define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC | |
305 | |
306 /* Syntax bits common to both basic and extended POSIX regex syntax. */ | |
307 #define _RE_SYNTAX_POSIX_COMMON \ | |
308 (RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \ | |
309 | RE_INTERVALS | RE_NO_EMPTY_RANGES) | |
310 | |
311 #define RE_SYNTAX_POSIX_BASIC \ | |
312 (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM) | |
313 | |
314 /* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes | |
315 RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this | |
316 isn't minimal, since other operators, such as \`, aren't disabled. */ | |
317 #define RE_SYNTAX_POSIX_MINIMAL_BASIC \ | |
318 (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS) | |
319 | |
320 #define RE_SYNTAX_POSIX_EXTENDED \ | |
321 (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \ | |
322 | RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \ | |
323 | RE_NO_BK_PARENS | RE_NO_BK_VBAR \ | |
324 | RE_UNMATCHED_RIGHT_PAREN_ORD) | |
325 | |
326 /* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS | |
327 replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */ | |
328 #define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \ | |
329 (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \ | |
330 | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \ | |
331 | RE_NO_BK_PARENS | RE_NO_BK_REFS \ | |
332 | RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD) | |
333 @end smallexample |
13531 | 334 |
335 @node Collating Elements vs. Characters |
13532 | 336 @section Collating Elements vs.@: Characters |
13531 | 337 |
338 @sc{posix} generalizes the notion of a character to that of a | |
339 collating element. It defines a @dfn{collating element} to be ``a | |
340 sequence of one or more bytes defined in the current collating sequence | |
341 as a unit of collation.'' | |
342 | |
343 This generalizes the notion of a character in | |
344 two ways. First, a single character can map into two or more collating | |
345 elements. For example, the German | |
346 @tex | |
347 `\ss' | |
348 @end tex | |
349 @ifinfo | |
350 ``es-zet'' | |
351 @end ifinfo | |
352 collates as the collating element @samp{s} followed by another collating | |
353 element @samp{s}. Second, two or more characters can map into one | |
354 collating element. For example, the Spanish @samp{ll} collates after | |
355 @samp{l} and before @samp{m}. | |
356 | |
357 Since @sc{posix}'s ``collating element'' preserves the essential idea of | |
358 a ``character,'' we use the latter, more familiar, term in this document. | |
359 | |
360 @node The Backslash Character |
13531 | 361 @section The Backslash Character |
362 | |
363 @cindex \ | |
364 The @samp{\} character has one of four different meanings, depending on | |
365 the context in which you use it and what syntax bits are set | |
366 (@pxref{Syntax Bits}). It can: 1) stand for itself, 2) quote the next | |
367 character, 3) introduce an operator, or 4) do nothing. | |
368 | |
369 @enumerate | |
370 @item | |
371 It stands for itself inside a list | |
372 (@pxref{List Operators}) if the syntax bit | |
373 @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set. For example, @samp{[\]} | |
374 would match @samp{\}. | |
375 | |
376 @item | |
377 It quotes (makes ordinary, if it's special) the next character when you | |
378 use it either: | |
379 | |
380 @itemize @bullet | |
381 @item | |
382 outside a list,@footnote{Sometimes | |
383 you don't have to explicitly quote special characters to make | |
384 them ordinary. For instance, most characters lose any special meaning | |
385 inside a list (@pxref{List Operators}). In addition, if the syntax bits | |
386 @code{RE_CONTEXT_INVALID_OPS} and @code{RE_CONTEXT_INDEP_OPS} | |
387 aren't set, then (for historical reasons) the matcher considers special | |
388 characters ordinary if they are in contexts where the operations they | |
389 represent make no sense; for example, then the match-zero-or-more | |
390 operator (represented by @samp{*}) matches itself in the regular | |
391 expression @samp{*foo} because there is no preceding expression on which | |
392 it can operate. It is poor practice, however, to depend on this | |
393 behavior; if you want a special character to be ordinary outside a list, | |
394 it's better to always quote it, regardless.} or | |
395 | |
396 @item | |
397 inside a list and the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is set. | |
398 | |
399 @end itemize | |
400 | |
401 @item | |
402 It introduces an operator when followed by certain ordinary | |
403 characters---sometimes only when certain syntax bits are set. See the | |
404 cases @code{RE_BK_PLUS_QM}, @code{RE_NO_BK_BRACES}, @code{RE_NO_BK_VBAR}, | |
405 @code{RE_NO_BK_PARENS}, @code{RE_NO_BK_REFS} in @ref{Syntax Bits}. Also: | |
406 | |
407 @itemize @bullet | |
408 @item | |
409 @samp{\b} represents the match-word-boundary operator | |
410 (@pxref{Match-word-boundary Operator}). | |
411 | |
412 @item | |
413 @samp{\B} represents the match-within-word operator | |
414 (@pxref{Match-within-word Operator}). | |
415 | |
416 @item | |
417 @samp{\<} represents the match-beginning-of-word operator @* | |
418 (@pxref{Match-beginning-of-word Operator}). | |
419 | |
420 @item | |
421 @samp{\>} represents the match-end-of-word operator | |
422 (@pxref{Match-end-of-word Operator}). | |
423 | |
424 @item | |
425 @samp{\w} represents the match-word-constituent operator | |
426 (@pxref{Match-word-constituent Operator}). | |
427 | |
428 @item | |
429 @samp{\W} represents the match-non-word-constituent operator | |
430 (@pxref{Match-non-word-constituent Operator}). | |
431 | |
432 @item | |
433 @samp{\`} represents the match-beginning-of-buffer | |
434 operator and @samp{\'} represents the match-end-of-buffer operator | |
435 (@pxref{Buffer Operators}). | |
436 | |
437 @item | |
438 If Regex was compiled with the C preprocessor symbol @code{emacs} | |
439 defined, then @samp{\s@var{class}} represents the match-syntactic-class | |
440 operator and @samp{\S@var{class}} represents the | |
441 match-not-syntactic-class operator (@pxref{Syntactic Class Operators}). | |
442 | |
443 @end itemize | |
444 | |
445 @item | |
446 In all other cases, Regex ignores @samp{\}. For example, | |
447 @samp{\n} matches @samp{n}. | |
448 | |
449 @end enumerate | |
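The first three roles of @samp{\} listed above can be observed with a short sketch. It assumes the @sc{posix} interface compiled in basic syntax (where @code{RE_BACKSLASH_ESCAPE_IN_LISTS} isn't set) and a @sc{gnu} implementation that recognizes @samp{\<}; the helper name @code{bs_found} is hypothetical. Remember that each backslash must be doubled inside a C string literal.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for PATTERN
   (POSIX basic syntax), 0 if not, -1 if PATTERN does not compile.  */
int
bs_found (const char *pattern, const char *text)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern, REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}

/* Patterns to try, written as C string literals:
     "a\\.b"   backslash quotes '.', so only a literal "a.b" matches;
     "[\\]"    backslash stands for itself inside the list;
     "\\<cat"  backslash introduces the match-beginning-of-word
               operator (a GNU operator).  */
```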
450 | |
451 @node Common Operators |
13531 | 452 @chapter Common Operators |
453 | |
454 You compose regular expressions from operators. In the following | |
455 sections, we describe the regular expression operators specified by | |
456 @sc{posix}; @sc{gnu} also uses these. Most operators have more than one | |
457 representation as characters. @xref{Regular Expression Syntax}, for | |
458 what characters represent what operators under what circumstances. | |
459 | |
460 For most operators that can be represented in two ways, one | |
461 representation is a single character and the other is that character | |
462 preceded by @samp{\}. For example, either @samp{(} or @samp{\(} | |
463 represents the open-group operator. Which one does depends on the | |
464 setting of a syntax bit, in this case @code{RE_NO_BK_PARENS}. Why is | |
465 this so? Historical reasons dictate some of the varying | |
13532 | 466 representations, while @sc{posix} dictates others. |
13531 | 467 |
468 Finally, almost all characters lose any special meaning inside a list | |
469 (@pxref{List Operators}). | |
470 | |
471 @menu | |
472 * Match-self Operator:: Ordinary characters. | |
473 * Match-any-character Operator:: . | |
474 * Concatenation Operator:: Juxtaposition. | |
475 * Repetition Operators:: * + ? @{@} | |
476 * Alternation Operator:: | | |
477 * List Operators:: [...] [^...] | |
478 * Grouping Operators:: (...) | |
479 * Back-reference Operator:: \digit | |
480 * Anchoring Operators:: ^ $ | |
481 @end menu | |
482 | |
483 @node Match-self Operator |
13531 | 484 @section The Match-self Operator (@var{ordinary character}) |
485 | |
486 This operator matches the character itself. All ordinary characters | |
487 (@pxref{Regular Expression Syntax}) represent this operator. For | |
488 example, @samp{f} is always an ordinary character, so the regular | |
489 expression @samp{f} matches only the string @samp{f}. In | |
490 particular, it does @emph{not} match the string @samp{ff}. | |
491 | |
492 @node Match-any-character Operator |
13531 | 493 @section The Match-any-character Operator (@code{.}) |
494 | |
495 @cindex @samp{.} | |
496 | |
497 This operator matches any single printing or nonprinting character | |
498 except it won't match a: | |
499 | |
500 @table @asis | |
501 @item newline | |
502 if the syntax bit @code{RE_DOT_NEWLINE} isn't set. | |
503 | |
504 @item null | |
505 if the syntax bit @code{RE_DOT_NOT_NULL} is set. | |
506 | |
507 @end table | |
508 | |
509 The @samp{.} (period) character represents this operator. For example, | |
510 @samp{a.b} matches any three-character string beginning with @samp{a} | |
511 and ending with @samp{b}. | |
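The newline case can be sketched with the @sc{posix} interface. This assumes that the @sc{posix} syntaxes set @code{RE_DOT_NEWLINE} and that passing @code{REG_NEWLINE} to @code{regcomp} makes the match-any-character operator stop matching newlines; the helper name @code{dot_found} is hypothetical. The null-character case cannot be shown this way, since @code{regexec} takes a null-terminated string.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for PATTERN
   (POSIX basic syntax plus CFLAGS), 0 if not, -1 if PATTERN does not
   compile.  */
int
dot_found (const char *pattern, const char *text, int cflags)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```

With the anchored pattern @samp{^a.b$}, the strings @samp{axb} and (by default) @samp{a}, newline, @samp{b} match, but @samp{ab} does not; with @code{REG_NEWLINE} the newline case no longer matches.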
512 | |
513 @node Concatenation Operator |
13531 | 514 @section The Concatenation Operator |
515 | |
516 This operator concatenates two regular expressions @var{a} and @var{b}. | |
517 No character represents this operator; you simply put @var{b} after | |
518 @var{a}. The result is a regular expression that will match a string if | |
519 @var{a} matches its first part and @var{b} matches the rest. For | |
520 example, @samp{xy} (two match-self operators) matches @samp{xy}. | |
521 | |
522 @node Repetition Operators |
13532 | 523 @section Repetition Operators |
13531 | 524 |
525 Repetition operators repeat the preceding regular expression a specified | |
526 number of times. | |
527 | |
528 @menu | |
529 * Match-zero-or-more Operator:: * | |
530 * Match-one-or-more Operator:: + | |
531 * Match-zero-or-one Operator:: ? | |
532 * Interval Operators:: @{@} | |
533 @end menu | |
534 | |
535 @node Match-zero-or-more Operator |
13531 | 536 @subsection The Match-zero-or-more Operator (@code{*}) |
537 | |
538 @cindex @samp{*} | |
539 | |
540 This operator repeats the smallest possible preceding regular expression | |
541 as many times as necessary (including zero) to match the pattern. | |
542 @samp{*} represents this operator. For example, @samp{o*} | |
543 matches any string made up of zero or more @samp{o}s. Since this | |
544 operator operates on the smallest preceding regular expression, | |
545 @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}. So, | |
546 @samp{fo*} matches @samp{f}, @samp{fo}, @samp{foo}, and so on. | |
547 | |
548 Since the match-zero-or-more operator is a suffix operator, it may be | |
549 useless as such when no regular expression precedes it. This is the | |
550 case when it: | |
551 | |
552 @itemize @bullet | |
13532 | 553 @item |
13531 | 554 is first in a regular expression, or |
555 | |
13532 | 556 @item |
13531 | 557 follows a match-beginning-of-line, open-group, or alternation |
558 operator. | |
559 | |
560 @end itemize | |
561 | |
562 @noindent | |
563 Three different things can happen in these cases: | |
564 | |
565 @enumerate | |
566 @item | |
567 If the syntax bit @code{RE_CONTEXT_INVALID_OPS} is set, then the | |
568 regular expression is invalid. | |
569 | |
570 @item | |
571 If @code{RE_CONTEXT_INVALID_OPS} isn't set, but | |
572 @code{RE_CONTEXT_INDEP_OPS} is, then @samp{*} represents the | |
573 match-zero-or-more operator (which then operates on the empty string). | |
574 | |
575 @item | |
576 Otherwise, @samp{*} is ordinary. | |
577 | |
578 @end enumerate | |
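The third case (an ordinary @samp{*}) can be observed in @sc{posix} basic syntax, where neither @code{RE_CONTEXT_INVALID_OPS} nor @code{RE_CONTEXT_INDEP_OPS} applies to a leading @samp{*}. The sketch below assumes the @sc{posix} interface; the helper name @code{star_found} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for PATTERN
   (POSIX basic syntax), 0 if not, -1 if PATTERN does not compile.  */
int
star_found (const char *pattern, const char *text)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern, REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```

Here @code{star_found ("*foo", "*foo")} reports a match and @code{star_found ("*foo", "foo")} does not, because the leading @samp{*} has nothing to repeat and so matches itself.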
579 | |
580 @cindex backtracking | |
581 The matcher processes a match-zero-or-more operator by first matching as | |
582 many repetitions of the smallest preceding regular expression as it can. | |
13532 | 583 Then it continues to match the rest of the pattern. |
13531 | 584 |
585 If it can't match the rest of the pattern, it backtracks (as many times | |
586 as necessary), each time discarding one of the matches until it can | |
587 either match the entire pattern or be certain that it cannot get a | |
588 match. For example, when matching @samp{ca*ar} against @samp{caaar}, | |
589 the matcher first matches all three @samp{a}s of the string with the | |
590 @samp{a*} of the regular expression. However, it cannot then match the | |
591 final @samp{ar} of the regular expression against the final @samp{r} of | |
592 the string. So it backtracks, discarding the match of the last @samp{a} | |
593 in the string. It can then match the remaining @samp{ar}. | |
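The outcome of this example can be checked by wrapping the repeated part in a group and asking how much text the group matched. The sketch below assumes the @sc{posix} interface with extended syntax; the helper name @code{group1_length} is hypothetical. It shows only the final result: an implementation may reach it by a strategy other than the backtracking described here.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search TEXT for PATTERN (POSIX extended syntax)
   and return the length of the text matched by the first parenthesized
   group, or -1 if nothing matched or the group did not participate.  */
int
group1_length (const char *pattern, const char *text)
{
  regex_t compiled;
  regmatch_t groups[2];   /* groups[0]: whole match; groups[1]: group 1 */
  int length = -1;

  if (regcomp (&compiled, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&compiled, text, 2, groups, 0) == 0
      && groups[1].rm_so != -1)
    length = (int) (groups[1].rm_eo - groups[1].rm_so);
  regfree (&compiled);
  return length;
}
```

For @samp{c(a*)ar} against @samp{caaar}, the group ends up holding two @samp{a}s rather than three, leaving the third for the @samp{ar} that follows.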
594 | |
595 | |
596 @node Match-one-or-more Operator |
13531 | 597 @subsection The Match-one-or-more Operator (@code{+} or @code{\+}) |
598 | |
13532 | 599 @cindex @samp{+} |
13531 | 600 |
601 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't recognize | |
602 this operator. Otherwise, if the syntax bit @code{RE_BK_PLUS_QM} isn't | |
603 set, then @samp{+} represents this operator; if it is, then @samp{\+} | |
604 does. | |
605 | |
606 This operator is similar to the match-zero-or-more operator except that | |
607 it repeats the preceding regular expression at least once; | |
608 @pxref{Match-zero-or-more Operator}, for what it operates on, how some | |
609 syntax bits affect it, and how Regex backtracks to match it. | |
610 | |
611 For example, if @samp{+} represents the match-one-or-more | |
612 operator, then @samp{ca+r} matches, e.g., @samp{car} and | |
613 @samp{caaaar}, but not @samp{cr}. | |
614 | |
615 @node Match-zero-or-one Operator |
13531 | 616 @subsection The Match-zero-or-one Operator (@code{?} or @code{\?}) |
617 @cindex @samp{?} | |
618 | |
619 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't | |
620 recognize this operator. Otherwise, if the syntax bit | |
621 @code{RE_BK_PLUS_QM} isn't set, then @samp{?} represents this operator; | |
622 if it is, then @samp{\?} does. | |
623 | |
624 This operator is similar to the match-zero-or-more operator except that | |
625 it repeats the preceding regular expression once or not at all; | |
626 @pxref{Match-zero-or-more Operator}, to see what it operates on, how | |
627 some syntax bits affect it, and how Regex backtracks to match it. | |
628 | |
629 For example, if @samp{?} represents the match-zero-or-one | |
630 operator, then @samp{ca?r} matches both @samp{car} and @samp{cr}, but | |
631 nothing else. | |
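The examples given for the match-one-or-more and match-zero-or-one operators can be checked with a small sketch. It assumes the @sc{posix} interface with extended syntax, in which @samp{+} and @samp{?} represent these operators, and it anchors each pattern with @samp{^} and @samp{$} so that the string must match as a whole. The helper name @code{ere_found} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT contains a match for PATTERN
   (POSIX extended syntax), 0 if not, -1 if PATTERN does not compile.  */
int
ere_found (const char *pattern, const char *text)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```

With these assumptions, @samp{^ca+r$} accepts @samp{car} and @samp{caaaar} but rejects @samp{cr}, while @samp{^ca?r$} accepts @samp{car} and @samp{cr} but rejects @samp{caar}.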
632 | |
633 @node Interval Operators |
13531 | 634 @subsection Interval Operators (@code{@{} @dots{} @code{@}} or @code{\@{} @dots{} @code{\@}}) |
635 | |
636 @cindex interval expression | |
637 @cindex @samp{@{} | |
638 @cindex @samp{@}} | |
639 @cindex @samp{\@{} | |
640 @cindex @samp{\@}} | |
641 | |
642 If the syntax bit @code{RE_INTERVALS} is set, then Regex recognizes | |
643 @dfn{interval expressions}. They repeat the smallest possible preceding | |
644 regular expression a specified number of times. | |
645 | |
646 If the syntax bit @code{RE_NO_BK_BRACES} is set, @samp{@{} represents | |
647 the @dfn{open-interval operator} and @samp{@}} represents the | |
648 @dfn{close-interval operator} ; otherwise, @samp{\@{} and @samp{\@}} do. | |
649 | |
650 Specifically, if @samp{@{} and @samp{@}} represent the | |
651 open-interval and close-interval operators, then: | |
652 | |
653 @table @code | |
654 @item @{@var{count}@} | |
655 matches exactly @var{count} occurrences of the preceding regular | |
656 expression. | |
657 | |
13537
77dd6d58a96b
erroneous commas inside @var
Karl Berry <karl@freefriends.org>
parents:
13533
diff
changeset
|
658 @item @{@var{min},@} |
13531 | 659 matches @var{min} or more occurrences of the preceding regular |
660 expression. | |
661 | |
662 @item @{@var{min}, @var{max}@} |
13531 | 663 matches at least @var{min} but no more than @var{max} occurrences of |
664 the preceding regular expression. | |
665 | |
666 @end table | |
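The three interval forms can be exercised with a short C sketch. It assumes the POSIX interface with @code{REG_EXTENDED}, where unescaped braces are the interval operators; the helper name @code{repeated_between} is hypothetical. It also shows that an interval applies to the smallest preceding expression, which is why the repeated unit is wrapped in a group.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if TEXT is made of at least MIN and at most MAX copies of
   UNIT, 0 otherwise.  The helper name is illustrative only.  */
static int repeated_between(const char *unit, int min, int max,
                            const char *text)
{
    char pattern[128];
    regex_t re;
    int rc;

    assert(unit != NULL && text != NULL);
    assert(0 <= min && min <= max);

    /* Group UNIT so that the interval applies to all of it rather than
       to its last character only.  */
    snprintf(pattern, sizeof pattern, "^(%s){%d,%d}$", unit, min, max);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, @code{repeated_between ("ab", 2, 3, "abab")} returns 1, but the same call with @code{"ab"} or @code{"abababab"} as the text returns 0.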
667 | |
668 The interval expression (but not necessarily the regular expression that | |
669 contains it) is invalid if: | |
670 | |
671 @itemize @bullet | |
672 @item | |
13532 | 673 @var{min} is greater than @var{max}, or |
13531 | 674 |
675 @item | |
676 any of @var{count}, @var{min}, or @var{max} is outside the range | |
677 zero to @code{RE_DUP_MAX} (a symbol that @file{regex.h} | |
678 defines). | |
679 | |
680 @end itemize | |
681 | |
682 If the interval expression is invalid and the syntax bit | |
683 @code{RE_NO_BK_BRACES} is set, then Regex considers all the | |
684 characters in the would-be interval to be ordinary. If that bit | |
685 isn't set, then the regular expression is invalid. | |
686 | |
687 If the interval expression is valid but there is no preceding regular | |
688 expression on which to operate, then if the syntax bit | |
689 @code{RE_CONTEXT_INVALID_OPS} is set, the regular expression is invalid. | |
690 If that bit isn't set, then Regex considers all the characters---other | |
691 than backslashes, which it ignores---in the would-be interval to be | |
692 ordinary. | |
693 | |
694 | |
13533 | 695 @node Alternation Operator |
13531 | 696 @section The Alternation Operator (@code{|} or @code{\|}) |
697 | |
698 @kindex | | |
699 @kindex \| | |
700 @cindex alternation operator | |
701 @cindex or operator | |
702 | |
703 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't | |
704 recognize this operator. Otherwise, if the syntax bit | |
705 @code{RE_NO_BK_VBAR} is set, then @samp{|} represents this operator; | |
706 otherwise, @samp{\|} does. | |
707 | |
708 Alternatives match one of a choice of regular expressions: | |
709 if you put the character(s) representing the alternation operator between | |
710 any two regular expressions @var{a} and @var{b}, the result matches | |
711 the union of the strings that @var{a} and @var{b} match. For | |
712 example, supposing that @samp{|} is the alternation operator, then | |
713 @samp{foo|bar|quux} would match any of @samp{foo}, @samp{bar} or | |
714 @samp{quux}. | |
715 | |
716 @ignore | |
717 @c Nobody needs to disallow empty alternatives any more. | |
718 If the syntax bit @code{RE_NO_EMPTY_ALTS} is set, then if either of the regular | |
719 expressions @var{a} or @var{b} is empty, the | |
720 regular expression is invalid. More precisely, if this syntax bit is | |
721 set, then the alternation operator can't: | |
722 | |
723 @itemize @bullet | |
724 @item | |
725 be first or last in a regular expression; | |
726 | |
727 @item | |
728 follow either another alternation operator or an open-group operator | |
729 (@pxref{Grouping Operators}); or | |
730 | |
731 @item | |
732 precede a close-group operator. | |
733 | |
734 @end itemize | |
735 | |
736 @noindent | |
737 For example, supposing @samp{(} and @samp{)} represent the open and | |
738 close-group operators, then @samp{|foo}, @samp{foo|}, @samp{foo||bar}, | |
739 @samp{foo(|bar)}, and @samp{(foo|)bar} would all be invalid. | |
740 @end ignore | |
741 | |
742 The alternation operator operates on the @emph{largest} possible | |
743 surrounding regular expressions. (Put another way, it has the lowest | |
744 precedence of any regular expression operator.) | |
745 Thus, the only way you can | |
746 delimit its arguments is to use grouping. For example, if @samp{(} and | |
747 @samp{)} are the open and close-group operators, then @samp{fo(o|b)ar} | |
748 would match either @samp{fooar} or @samp{fobar}. (@samp{foo|bar} would | |
749 match @samp{foo} or @samp{bar}.) | |
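The precedence rule matters as soon as you combine alternation with anchors. The sketch below assumes the POSIX interface with @code{REG_EXTENDED}, where @samp{|}, @samp{(} and @samp{)} are operators; the helper name @code{matches_whole} is ours. It groups the pattern before anchoring it, because without the group @samp{^foo|bar$} would mean ``@samp{^foo} or @samp{bar$}''.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if PATTERN matches the whole of TEXT, 0 otherwise.
   Illustrative helper, not part of Regex.  */
static int matches_whole(const char *pattern, const char *text)
{
    char grouped[256];
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* Alternation has the lowest precedence, so the pattern must be
       grouped before the anchors are added around it.  */
    snprintf(grouped, sizeof grouped, "^(%s)$", pattern);
    if (regcomp(&re, grouped, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Here @code{matches_whole ("fo(o|b)ar", "fobar")} returns 1 and @code{matches_whole ("fo(o|b)ar", "foo")} returns 0, whereas @code{matches_whole ("foo|bar|quux", "bar")} returns 1.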
750 | |
751 @cindex backtracking | |
13532 | 752 The matcher usually tries all combinations of alternatives so as to |
13531 | 753 match the longest possible string. For example, when matching |
754 @samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux}, it cannot | |
755 take, say, the first (``depth-first'') combination it could match, since | |
13532 | 756 then it would be content to match just @samp{fooqbar}. |
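The longest-match behavior can be observed through the POSIX interface, which reports the offsets of the overall match. This is a sketch under the assumption that the C library's @code{regexec} follows the POSIX leftmost-longest rule, as the GNU C Library does; @code{match_length} is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the length of the first match of PATTERN in TEXT, or -1 if
   there is no match or the pattern is invalid.  Illustrative only.  */
static int match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t whole[1];
    int len = -1;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* whole[0] receives the start and end offsets of the entire match.  */
    if (regexec(&re, text, 1, whole, 0) == 0)
        len = (int) (whole[0].rm_eo - whole[0].rm_so);

    regfree(&re);
    return len;
}
```

For the example above, @code{match_length ("(fooq|foo)*(qbarquux|bar)", "fooqbarquux")} yields 11, the length of the whole string, rather than 7 for @samp{fooqbar}.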
13531 | 757 |
758 @comment xx something about leftmost-longest | |
759 | |
760 | |
13533 | 761 @node List Operators |
13531 | 762 @section List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]}) |
763 | |
764 @cindex matching list | |
765 @cindex @samp{[} | |
766 @cindex @samp{]} | |
767 @cindex @samp{^} | |
768 @cindex @samp{-} | |
769 @cindex @samp{\} | |
770 @cindex @samp{[^} | |
771 @cindex nonmatching list | |
772 @cindex matching newline | |
773 @cindex bracket expression | |
774 | |
775 A @dfn{list}, also called a @dfn{bracket expression}, is a set of one or | |
776 more items. An @dfn{item} is a character, | |
777 @ignore | |
778 (These get added when they get implemented.) | |
13532 | 779 a collating symbol, an equivalence class expression, |
13531 | 780 @end ignore |
781 a character class expression, or a range expression. The syntax bits | |
782 affect which kinds of items you can put in a list. We explain the last | |
783 two items in subsections below. Empty lists are invalid. | |
784 | |
785 A @dfn{matching list} matches a single character represented by one of | |
786 the list items. You form a matching list by enclosing one or more items | |
787 within an @dfn{open-matching-list operator} (represented by @samp{[}) | |
13532 | 788 and a @dfn{close-list operator} (represented by @samp{]}). |
13531 | 789 |
790 For example, @samp{[ab]} matches either @samp{a} or @samp{b}. | |
791 @samp{[ad]*} matches the empty string and any string composed of just | |
792 @samp{a}s and @samp{d}s in any order. Regex considers a regular | |
793 expression with a @samp{[} but no matching @samp{]} to be | |
794 invalid. | |
795 | |
796 @dfn{Nonmatching lists} are similar to matching lists except that they | |
797 match a single character @emph{not} represented by one of the list | |
798 items. You use an @dfn{open-nonmatching-list operator} (represented by | |
799 @samp{[^}@footnote{Regex therefore doesn't consider the @samp{^} to be | |
800 the first character in the list. If you put a @samp{^} character first | |
801 in (what you think is) a matching list, you'll turn it into a | |
802 nonmatching list.}) instead of an open-matching-list operator to start a | |
13532 | 803 nonmatching list. |
13531 | 804 |
805 For example, @samp{[^ab]} matches any character except @samp{a} or | |
13532 | 806 @samp{b}. |
13531 | 807 |
808 If the @code{posix_newline} field in the pattern buffer (@pxref{GNU | |
809 Pattern Buffers}) is set, then nonmatching lists do not match a newline. | |
810 | |
811 Most characters lose any special meaning inside a list. The special | |
812 characters inside a list follow. | |
813 | |
814 @table @samp | |
815 @item ] | |
816 ends the list if it's not the first list item. So, if you want to make | |
817 the @samp{]} character a list item, you must put it first. | |
818 | |
819 @item \ | |
820 quotes the next character if the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is | |
821 set. | |
822 | |
823 @ignore | |
824 Put these in if they get implemented. | |
825 | |
826 @item [. | |
827 represents the open-collating-symbol operator (@pxref{Collating Symbol | |
828 Operators}). | |
829 | |
830 @item .] | |
831 represents the close-collating-symbol operator. | |
832 | |
833 @item [= | |
834 represents the open-equivalence-class operator (@pxref{Equivalence Class | |
835 Operators}). | |
836 | |
837 @item =] | |
838 represents the close-equivalence-class operator. | |
839 | |
840 @end ignore | |
841 | |
842 @item [: | |
843 represents the open-character-class operator (@pxref{Character Class | |
844 Operators}) if the syntax bit @code{RE_CHAR_CLASSES} is set and what | |
845 follows is a valid character class expression. | |
846 | |
847 @item :] | |
848 represents the close-character-class operator if the syntax bit | |
849 @code{RE_CHAR_CLASSES} is set and what precedes it is an | |
850 open-character-class operator followed by a valid character class name. | |
851 | |
13532 | 852 @item - |
13531 | 853 represents the range operator (@pxref{Range Operator}) if it's |
854 not first or last in a list or the ending point of a range. | |
855 | |
856 @end table | |
857 | |
858 @noindent | |
13532 | 859 All other characters are ordinary. For example, @samp{[.*]} matches |
860 @samp{.} and @samp{*}. | |
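The list rules above can be checked one character at a time. The following sketch assumes the POSIX interface with @code{REG_EXTENDED}; @code{list_matches_char} is a hypothetical helper that asks whether a single character is matched by a given list.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if the bracket expression LIST (written with its brackets,
   e.g. "[ab]") matches the single character C, 0 otherwise.  */
static int list_matches_char(const char *list, char c)
{
    char pattern[64];
    char text[2];
    regex_t re;
    int rc;

    assert(list != NULL && c != '\0');

    text[0] = c;
    text[1] = '\0';
    snprintf(pattern, sizeof pattern, "^%s$", list);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;               /* e.g. a '[' with no matching ']' */

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, @code{list_matches_char ("[.*]", '*')} returns 1 because @samp{*} is ordinary inside a list, and @code{list_matches_char ("[]a]", ']')} returns 1 because a @samp{]} placed first is a list item.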
13531 | 861 |
862 @menu | |
863 * Character Class Operators:: [:class:] | |
864 * Range Operator:: start-end | |
865 @end menu | |
866 | |
867 @ignore | |
868 (If collating symbols and equivalence class expressions get implemented, | |
869 then add this.) | |
870 | |
871 node Collating Symbol Operators | |
872 subsubsection Collating Symbol Operators (@code{[.} @dots{} @code{.]}) | |
873 | |
874 If the syntax bit @code{XX} is set, then you can represent | |
875 collating symbols inside lists. You form a @dfn{collating symbol} by | |
876 putting a collating element between an @dfn{open-collating-symbol | |
877 operator} and an @dfn{close-collating-symbol operator}. @samp{[.} | |
878 represents the open-collating-symbol operator and @samp{.]} represents | |
879 the close-collating-symbol operator. For example, if @samp{ll} is a | |
880 collating element, then @samp{[[.ll.]]} would match @samp{ll}. | |
881 | |
882 node Equivalence Class Operators | |
883 subsubsection Equivalence Class Operators (@code{[=} @dots{} @code{=]}) | |
884 @cindex equivalence class expression in regex | |
885 @cindex @samp{[=} in regex | |
886 @cindex @samp{=]} in regex | |
887 | |
888 If the syntax bit @code{XX} is set, then Regex recognizes equivalence class | |
889 expressions inside lists. A @dfn{equivalence class expression} is a set | |
890 of collating elements which all belong to the same equivalence class. | |
891 You form an equivalence class expression by putting a collating | |
892 element between an @dfn{open-equivalence-class operator} and a | |
893 @dfn{close-equivalence-class operator}. @samp{[=} represents the | |
894 open-equivalence-class operator and @samp{=]} represents the | |
895 close-equivalence-class operator. For example, if @samp{a} and @samp{A} | |
896 were an equivalence class, then both @samp{[[=a=]]} and @samp{[[=A=]]} | |
897 would match both @samp{a} and @samp{A}. If the collating element in an | |
898 equivalence class expression isn't part of an equivalence class, then | |
899 the matcher considers the equivalence class expression to be a collating | |
900 symbol. | |
901 | |
902 @end ignore | |
903 | |
13533 | 904 @node Character Class Operators |
13531 | 905 @subsection Character Class Operators (@code{[:} @dots{} @code{:]}) |
906 | |
907 @cindex character classes | |
908 @cindex @samp{[:} in regex | |
909 @cindex @samp{:]} in regex | |
910 | |
911 If the syntax bit @code{RE_CHAR_CLASSES} is set, then Regex | |
912 recognizes character class expressions inside lists. A @dfn{character | |
913 class expression} matches one character from a given class. You form a | |
914 character class expression by putting a character class name between an | |
915 @dfn{open-character-class operator} (represented by @samp{[:}) and a | |
916 @dfn{close-character-class operator} (represented by @samp{:]}). The | |
917 character class names and their meanings are: | |
918 | |
919 @table @code | |
920 | |
13532 | 921 @item alnum |
13531 | 922 letters and digits |
923 | |
924 @item alpha | |
925 letters | |
926 | |
927 @item blank | |
928 system-dependent; for @sc{gnu}, a space or tab | |
929 | |
930 @item cntrl | |
931 control characters (in the @sc{ascii} encoding, code 0177 and codes | |
932 less than 040) | |
933 | |
934 @item digit | |
935 digits | |
936 | |
937 @item graph | |
938 same as @code{print} except omits space | |
939 | |
13532 | 940 @item lower |
13531 | 941 lowercase letters |
942 | |
943 @item print | |
13532 | 944 printable characters (in the @sc{ascii} encoding, space through |
13531 | 945 tilde, i.e., codes 040 through 0176) |
946 | |
947 @item punct | |
948 neither control nor alphanumeric characters | |
949 | |
950 @item space | |
951 space, horizontal tab, carriage return, newline, vertical tab, and form feed | |
952 | |
953 @item upper | |
954 uppercase letters | |
955 | |
956 @item xdigit | |
957 hexadecimal digits: @code{0}--@code{9}, @code{a}--@code{f}, @code{A}--@code{F} | |
958 | |
959 @end table | |
960 | |
961 @noindent | |
962 These correspond to the definitions in the C library's @file{<ctype.h>} | |
963 facility. For example, @samp{[:alpha:]} corresponds to the standard | |
964 facility @code{isalpha}. Regex recognizes character class expressions | |
965 only inside of lists; so @samp{[[:alpha:]]} matches any letter, but | |
966 @samp{[:alpha:]} on its own is an ordinary list that matches any one of | |
967 the characters @samp{:}, @samp{a}, @samp{l}, @samp{p}, and @samp{h}. | |
968 | |
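A character class expression is easy to test for a single character. The sketch below assumes the POSIX interface with @code{REG_EXTENDED} and the default @samp{C} locale; the helper name @code{class_matches_char} is ours.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if character C belongs to the class NAME ("alpha",
   "digit", ...), 0 otherwise or if NAME is not a valid class.  */
static int class_matches_char(const char *name, char c)
{
    char pattern[64];
    char text[2];
    regex_t re;
    int rc;

    assert(name != NULL && c != '\0');

    text[0] = c;
    text[1] = '\0';
    /* The class expression must itself sit inside a list.  */
    snprintf(pattern, sizeof pattern, "^[[:%s:]]$", name);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, @code{class_matches_char ("xdigit", 'F')} returns 1 and @code{class_matches_char ("xdigit", 'g')} returns 0.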
13533 | 969 @node Range Operator |
13531 | 970 @subsection The Range Operator (@code{-}) |
971 | |
972 Regex recognizes @dfn{range expressions} inside a list. They represent | |
973 those characters | |
974 that fall between two elements in the current collating sequence. You | |
13532 | 975 form a range expression by putting a @dfn{range operator} between two |
13531 | 976 @ignore |
977 (If these get implemented, then substitute this for ``characters.'') | |
978 of any of the following: characters, collating elements, collating symbols, | |
979 and equivalence class expressions. The starting point of the range and | |
980 the ending point of the range don't have to be the same kind of item, | |
981 e.g., the starting point could be a collating element and the ending | |
982 point could be an equivalence class expression. If a range's ending | |
983 point is an equivalence class, then all the collating elements in that | |
984 class will be in the range. | |
985 @end ignore | |
986 characters.@footnote{You can't use a character class for the starting | |
987 or ending point of a range, since a character class is not a single | |
988 character.} @samp{-} represents the range operator. For example, | |
989 @samp{a-f} within a list represents all the characters from @samp{a} | |
990 through @samp{f} | |
991 inclusively. | |
992 | |
993 If the syntax bit @code{RE_NO_EMPTY_RANGES} is set, then if the range's | |
994 ending point collates less than its starting point, the range (and the | |
995 regular expression containing it) is invalid. For example, the regular | |
996 expression @samp{[z-a]} would be invalid. If this bit isn't set, then | |
997 Regex considers such a range to be empty. | |
998 | |
999 Since @samp{-} represents the range operator, if you want to make a | |
1000 @samp{-} character itself | |
1001 a list item, you must do one of the following: | |
1002 | |
1003 @itemize @bullet | |
1004 @item | |
1005 Put the @samp{-} either first or last in the list. | |
1006 | |
1007 @item | |
1008 Include a range whose starting point collates strictly lower than | |
1009 @samp{-} and whose ending point collates equal or higher. Unless a | |
1010 range is the first item in a list, a @samp{-} can't be its starting | |
1011 point, but @emph{can} be its ending point. That is because Regex | |
1012 considers @samp{-} to be the range operator unless it is preceded by | |
1013 another @samp{-}. For example, in the @sc{ascii} encoding, @samp{)}, | |
1014 @samp{*}, @samp{+}, @samp{,}, @samp{-}, @samp{.}, and @samp{/} are | |
1015 contiguous characters in the collating sequence. You might think that | |
1016 @samp{[)-+--/]} has two ranges: @samp{)-+} and @samp{--/}. Rather, it | |
1017 has the ranges @samp{)-+} and @samp{+--}, plus the character @samp{/}, so | |
1018 it matches, e.g., @samp{,}, not @samp{.}. | |
1019 | |
1020 @item | |
1021 Put a range whose starting point is @samp{-} first in the list. | |
1022 | |
1023 @end itemize | |
1024 | |
1025 For example, @samp{[-a-z]} matches a lowercase letter or a hyphen (in | |
1026 English, in @sc{ascii}). | |
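One way to see what a range contains is to count the @sc{ascii} characters that a list matches. This is only a sketch: it assumes the POSIX interface with @code{REG_EXTENDED} and the default @samp{C} locale, in which the collating sequence follows the character codes, and @code{count_ascii_matches} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return how many of the ASCII characters 1..127 are matched by the
   bracket expression LIST, or -1 if LIST does not compile.  */
static int count_ascii_matches(const char *list)
{
    char pattern[64];
    char text[2] = { '\0', '\0' };
    regex_t re;
    int c, count = 0;

    assert(list != NULL);

    snprintf(pattern, sizeof pattern, "^%s$", list);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;              /* invalid list, e.g. a reversed range */

    for (c = 1; c < 128; c++) {
        text[0] = (char) c;
        if (regexec(&re, text, 0, NULL, 0) == 0)
            count++;
    }
    regfree(&re);
    return count;
}
```

With these assumptions, @samp{[a-f]} matches 6 characters, and both @samp{[-a-z]} and @samp{[a-z-]} match 27: the lowercase letters plus the hyphen. With the GNU C Library, the POSIX syntaxes reject the reversed range @samp{[z-a]}, so the helper returns -1 for it.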
1027 | |
1028 | |
13533 | 1029 @node Grouping Operators |
13531 | 1030 @section Grouping Operators (@code{(} @dots{} @code{)} or @code{\(} @dots{} @code{\)}) |
1031 | |
1032 @kindex ( | |
1033 @kindex ) | |
1034 @kindex \( | |
1035 @kindex \) | |
1036 @cindex grouping | |
1037 @cindex subexpressions | |
1038 @cindex parenthesizing | |
1039 | |
1040 A @dfn{group}, also known as a @dfn{subexpression}, consists of an | |
1041 @dfn{open-group operator}, any number of other operators, and a | |
1042 @dfn{close-group operator}. Regex treats this sequence as a unit, just | |
1043 as mathematics and programming languages treat a parenthesized | |
1044 expression as a unit. | |
1045 | |
1046 Therefore, using @dfn{groups}, you can: | |
1047 | |
1048 @itemize @bullet | |
1049 @item | |
1050 delimit the argument(s) to an alternation operator (@pxref{Alternation | |
1051 Operator}) or a repetition operator (@pxref{Repetition | |
1052 Operators}). | |
1053 | |
13532 | 1054 @item |
13531 | 1055 keep track of the indices of the substring that matched a given group. |
1056 @xref{Using Registers}, for a precise explanation. | |
1057 This lets you: | |
1058 | |
1059 @itemize @bullet | |
1060 @item | |
1061 use the back-reference operator (@pxref{Back-reference Operator}). | |
1062 | |
13532 | 1063 @item |
13531 | 1064 use registers (@pxref{Using Registers}). |
1065 | |
1066 @end itemize | |
1067 | |
1068 @end itemize | |
1069 | |
1070 If the syntax bit @code{RE_NO_BK_PARENS} is set, then @samp{(} represents | |
1071 the open-group operator and @samp{)} represents the | |
1072 close-group operator; otherwise, @samp{\(} and @samp{\)} do. | |
1073 | |
1074 If the syntax bit @code{RE_UNMATCHED_RIGHT_PAREN_ORD} is set and a | |
1075 close-group operator has no matching open-group operator, then Regex | |
1076 considers it to match @samp{)}. | |
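In the POSIX interface, the indices recorded for each group are returned through the @code{regmatch_t} array passed to @code{regexec}. The sketch below assumes @code{REG_EXTENDED}, so that @samp{(} and @samp{)} are the group operators; @code{group_span} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the offsets of group number GROUP (0 is the whole match) for
   the first match of PATTERN in TEXT.  Both offsets are -1 if there is
   no match or if the group did not participate in it.  */
static regmatch_t group_span(const char *pattern, const char *text,
                             size_t group)
{
    regex_t re;
    regmatch_t m[10];
    regmatch_t none;

    assert(pattern != NULL && text != NULL && group < 10);

    none.rm_so = -1;
    none.rm_eo = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return none;
    if (regexec(&re, text, 10, m, 0) != 0) {
        regfree(&re);
        return none;
    }
    regfree(&re);
    return m[group];
}
```

Matching @samp{(fo+)(bar)?} against @samp{xfoobar} gives offsets 1 and 4 for group 1 and offsets 4 and 7 for group 2. With @samp{(fo+)(baz)?} the second group does not participate, so both of its offsets are -1.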
1077 | |
1078 | |
13533 | 1079 @node Back-reference Operator |
13531 | 1080 @section The Back-reference Operator (@dfn{\}@var{digit}) |
1081 | |
1082 @cindex back references | |
1083 | |
1084 If the syntax bit @code{RE_NO_BK_REFS} isn't set, then Regex recognizes | |
1085 back references. A back reference matches a specified preceding group. | |
1086 The back reference operator is represented by @samp{\@var{digit}} | |
1087 anywhere after the end of a regular expression's @w{@var{digit}-th} | |
1088 group (@pxref{Grouping Operators}). | |
1089 | |
1090 @var{digit} must be between @samp{1} and @samp{9}. The matcher assigns | |
1091 numbers 1 through 9 to the first nine groups it encounters. By using | |
1092 one of @samp{\1} through @samp{\9} after the corresponding group's | |
1093 close-group operator, you can match a substring identical to the | |
1094 one that the group does. | |
1095 | |
1096 Back references match according to the following (in all examples below, | |
1097 @samp{(} represents the open-group, @samp{)} the close-group, @samp{@{} | |
1098 the open-interval and @samp{@}} the close-interval operator): | |
1099 | |
1100 @itemize @bullet | |
1101 @item | |
1102 If the group matches a substring, the back reference matches an | |
1103 identical substring. For example, @samp{(a)\1} matches @samp{aa} and | |
1104 @samp{(bana)na\1bo\1} matches @samp{bananabanabobana}. Likewise, | |
1105 @samp{(.*)\1} matches any (newline-free if the syntax bit | |
1106 @code{RE_DOT_NEWLINE} isn't set) string that is composed of two | |
1107 identical halves; the @samp{(.*)} matches the first half and the | |
1108 @samp{\1} matches the second half. | |
1109 | |
1110 @item | |
1111 If the group matches more than once (as it might if followed | |
1112 by, e.g., a repetition operator), then the back reference matches the | |
1113 substring the group @emph{last} matched. For example, | |
1114 @samp{((a*)b)*\1\2} matches @samp{aabababa}; first @w{group 1} (the | |
1115 outer one) matches @samp{aab} and @w{group 2} (the inner one) matches | |
1116 @samp{aa}. Then @w{group 1} matches @samp{ab} and @w{group 2} matches | |
1117 @samp{a}. So, @samp{\1} matches @samp{ab} and @samp{\2} matches | |
1118 @samp{a}. | |
1119 | |
1120 @item | |
1121 If the group doesn't participate in a match, i.e., it is part of an | |
1122 alternative not taken or a repetition operator allows zero repetitions | |
1123 of it, then the back reference makes the whole match fail. For example, | |
1124 @samp{(one()|two())-and-(three\2|four\3)} matches @samp{one-and-three} | |
1125 and @samp{two-and-four}, but not @samp{one-and-four} or | |
1126 @samp{two-and-three}. For example, if the pattern matches | |
1127 @samp{one-and-}, then its @w{group 2} matches the empty string and its | |
1128 @w{group 3} doesn't participate in the match. So, if it then matches | |
1129 @samp{four}, then when it tries to back reference @w{group 3}---which it | |
1130 will attempt to do because @samp{\3} follows the @samp{four}---the match | |
1131 will fail because @w{group 3} didn't participate in the match. | |
1132 | |
1133 @end itemize | |
1134 | |
1135 You can use a back reference as an argument to a repetition operator. For | |
1136 example, @samp{(a(b))\2*} matches @samp{a} followed by one or more | |
1137 @samp{b}s. Similarly, @samp{(a(b))\2@{3@}} matches @samp{abbbb}. | |
1138 | |
1139 If there is no preceding @w{@var{digit}-th} subexpression, the regular | |
1140 expression is invalid. | |
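Some of the back-reference examples can be reproduced with the POSIX interface. The sketch below compiles without @code{REG_EXTENDED}, that is, in the POSIX basic syntax, where groups are written @samp{\(} @dots{} @samp{\)} and intervals @samp{\@{} @dots{} @samp{\@}}. The patterns carry their own @samp{^} and @samp{$} so that no extra group shifts the numbering. The helper name @code{bre_matches} is ours.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the POSIX basic regular expression PATTERN matches
   somewhere in TEXT, 0 otherwise.  Illustrative helper only.  */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* cflags 0 selects the basic syntax; REG_NOSUB is deliberately not
       used here, since the pattern relies on its groups.  */
    if (regcomp(&re, pattern, 0) != 0)
        return 0;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, @code{bre_matches ("^\\(.*\\)\\1$", "abcabc")} returns 1 because the string is composed of two identical halves, while the same pattern applied to @code{"abcabd"} returns 0.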
1141 | |
1142 | |
13533 | 1143 @node Anchoring Operators |
13532 | 1144 @section Anchoring Operators |
13531 | 1145 |
1146 @cindex anchoring | |
1147 @cindex regexp anchoring | |
1148 | |
1149 These operators can constrain a pattern to match only at the beginning or | |
1150 end of the entire string or at the beginning or end of a line. | |
1151 | |
1152 @menu | |
1153 * Match-beginning-of-line Operator:: ^ | |
1154 * Match-end-of-line Operator:: $ | |
1155 @end menu | |
1156 | |
1157 | |
13533 | 1158 @node Match-beginning-of-line Operator |
13531 | 1159 @subsection The Match-beginning-of-line Operator (@code{^}) |
1160 | |
1161 @kindex ^ | |
1162 @cindex beginning-of-line operator | |
1163 @cindex anchors | |
1164 | |
1165 This operator can match the empty string either at the beginning of the | |
1166 string or after a newline character. Thus, it is said to @dfn{anchor} | |
1167 the pattern to the beginning of a line. | |
1168 | |
1169 In the cases following, @samp{^} represents this operator. (Otherwise, | |
1170 @samp{^} is ordinary.) | |
1171 | |
1172 @itemize @bullet | |
1173 | |
1174 @item | |
1175 It (the @samp{^}) is first in the pattern, as in @samp{^foo}. | |
1176 | |
1177 @cnindex RE_CONTEXT_INDEP_ANCHORS @r{(and @samp{^})} | |
1178 @item | |
1179 The syntax bit @code{RE_CONTEXT_INDEP_ANCHORS} is set, and it is outside | |
1180 a bracket expression. | |
1181 | |
1182 @cindex open-group operator and @samp{^} | |
1183 @cindex alternation operator and @samp{^} | |
1184 @item | |
1185 It follows an open-group or alternation operator, as in @samp{a\(^b\)} | |
1186 and @samp{a\|^b}. @xref{Grouping Operators}, and @ref{Alternation | |
1187 Operator}. | |
1188 | |
1189 @end itemize | |
1190 | |
1191 These rules imply that some valid patterns containing @samp{^} cannot be | |
1192 matched; for example, @samp{foo^bar} if @code{RE_CONTEXT_INDEP_ANCHORS} | |
1193 is set. | |
1194 | |
1195 @vindex not_bol @r{field in pattern buffer} | |
1196 If the @code{not_bol} field is set in the pattern buffer (@pxref{GNU | |
1197 Pattern Buffers}), then @samp{^} fails to match at the beginning of the | |
1198 string. @xref{POSIX Matching}, for when you might find this useful. | |
1199 | |
1200 @vindex newline_anchor @r{field in pattern buffer} | |
1201 If the @code{newline_anchor} field isn't set in the pattern buffer, then | |
1202 @samp{^} fails to match after a newline. This is useful when you do not | |
1203 regard the string to be matched as broken into lines. | |
1204 | |
1205 | |
13533 | 1206 @node Match-end-of-line Operator |
13531 | 1207 @subsection The Match-end-of-line Operator (@code{$}) |
1208 | |
1209 @kindex $ | |
1210 @cindex end-of-line operator | |
1211 @cindex anchors | |
1212 | |
1213 This operator can match the empty string either at the end of | |
1214 the string or before a newline character in the string. Thus, it is | |
1215 said to @dfn{anchor} the pattern to the end of a line. | |
1216 | |
1217 It is always represented by @samp{$}. For example, @samp{foo$} usually | |
1218 matches @samp{foo} and also the first three characters of | |
1219 @samp{foo\nbar}. | |
1220 | |
1221 Its interaction with the syntax bits and pattern buffer fields is | |
1222 exactly the dual of @samp{^}'s; see the previous section. (That is, | |
1223 ``beginning'' becomes ``end'', ``next'' becomes ``previous'', | |
1224 ``after'' becomes ``before'', and @code{not_bol} becomes @code{not_eol}.) | |
1225 | |
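Both anchoring operators can be tried through the POSIX interface. The sketch assumes that the compile flag @code{REG_NEWLINE} and the execution flags @code{REG_NOTBOL} and @code{REG_NOTEOL} are the POSIX-level controls for matching at newlines and at the ends of the string; @code{anchored_span} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the offsets of the first match of PATTERN in TEXT, compiled
   with REG_EXTENDED | CFLAGS and executed with EFLAGS.  Both offsets
   are -1 if nothing matches.  */
static regmatch_t anchored_span(const char *pattern, const char *text,
                                int cflags, int eflags)
{
    regex_t re;
    regmatch_t m[1];

    assert(pattern != NULL && text != NULL);

    m[0].rm_so = -1;
    m[0].rm_eo = -1;

    if (regcomp(&re, pattern, REG_EXTENDED | cflags) != 0)
        return m[0];
    if (regexec(&re, text, 1, m, eflags) != 0) {
        /* Make the "no match" result explicit.  */
        m[0].rm_so = -1;
        m[0].rm_eo = -1;
    }
    regfree(&re);
    return m[0];
}
```

With @code{REG_NEWLINE}, @samp{^bar} matches in @samp{foo\nbar} starting at offset 4, and @samp{foo$} matches the first three characters. Without it, neither pattern matches that string. Passing @code{REG_NOTBOL} stops @samp{^foo} from matching @samp{foo}, and @code{REG_NOTEOL} does the same for @samp{foo$}.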
1226 | |
13533 | 1227 @node GNU Operators |
13531 | 1228 @chapter GNU Operators |
1229 | |
1230 Following are operators that @sc{gnu} defines (and @sc{posix} doesn't). | |
1231 | |
1232 @menu | |
1233 * Word Operators:: | |
1234 * Buffer Operators:: | |
1235 @end menu | |
1236 | |
13533 | 1237 @node Word Operators |
13531 | 1238 @section Word Operators |
1239 | |
1240 The operators in this section require Regex to recognize parts of words. | |
1241 Regex uses a syntax table to determine whether or not a character is | |
1242 part of a word, i.e., whether or not it is @dfn{word-constituent}. | |
1243 | |
1244 @menu | |
1245 * Non-Emacs Syntax Tables:: | |
1246 * Match-word-boundary Operator:: \b | |
1247 * Match-within-word Operator:: \B | |
1248 * Match-beginning-of-word Operator:: \< | |
1249 * Match-end-of-word Operator:: \> | |
1250 * Match-word-constituent Operator:: \w | |
1251 * Match-non-word-constituent Operator:: \W | |
1252 @end menu | |
1253 | |
13533 | 1254 @node Non-Emacs Syntax Tables |
13532 | 1255 @subsection Non-Emacs Syntax Tables |
13531 | 1256 |
1257 A @dfn{syntax table} is an array indexed by the characters in your | |
1258 character set. With 8-bit characters, therefore, a syntax table | |
1259 has 256 elements. Regex always uses a @code{char *} variable | |
1260 @code{re_syntax_table} as its syntax table. In some cases, it | |
1261 initializes this variable and in others it expects you to initialize it. | |
1262 | |
1263 @itemize @bullet | |
1264 @item | |
1265 If Regex is compiled with the preprocessor symbols @code{emacs} and | |
1266 @code{SYNTAX_TABLE} both undefined, then Regex allocates | |
1267 @code{re_syntax_table} and initializes an element @var{i} either to | |
1268 @code{Sword} (which it defines) if @var{i} is a letter, digit, or | |
1269 @samp{_}, or to zero if it's not. | |
1270 | |
1271 @item | |
1272 If Regex is compiled with @code{emacs} undefined but @code{SYNTAX_TABLE} | |
1273 defined, then Regex expects you to define a @code{char *} variable | |
1274 @code{re_syntax_table} to be a valid syntax table. | |
1275 | |
1276 @item | |
1277 @xref{Emacs Syntax Tables}, for what happens when Regex is compiled with | |
1278 the preprocessor symbol @code{emacs} defined. | |
1279 | |
1280 @end itemize | |
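The default table described in the first case above can be pictured with a small C sketch. It is not the library's code: the names @code{CHAR_SET_SIZE}, @code{SWORD}, @code{word_table}, @code{init_word_table_once} and @code{is_word_constituent} are all hypothetical, and @code{SWORD} merely stands in for the @code{Sword} value that Regex defines.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <string.h>

#define CHAR_SET_SIZE 256   /* hypothetical: one entry per 8-bit character */
#define SWORD 1             /* stand-in for the Sword value Regex defines */

static char word_table[CHAR_SET_SIZE];

/* Fill word_table the first time it is needed: letters, digits and
   '_' are word-constituent, everything else is zero.  */
static void init_word_table_once(void)
{
    static int done = 0;
    int c;

    if (done)
        return;
    assert(UCHAR_MAX < CHAR_SET_SIZE);

    memset(word_table, 0, sizeof word_table);
    for (c = 0; c < CHAR_SET_SIZE; c++)
        if (isalnum(c) || c == '_')
            word_table[c] = SWORD;
    done = 1;
}

/* Return nonzero if C is word-constituent according to word_table.  */
static int is_word_constituent(unsigned char c)
{
    init_word_table_once();
    return word_table[c] == SWORD;
}
```

Under these assumptions @code{is_word_constituent ('_')} is nonzero while @code{is_word_constituent ('-')} is zero.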
1281 | |
13533 | 1282 @node Match-word-boundary Operator |
13531 | 1283 @subsection The Match-word-boundary Operator (@code{\b}) |
1284 | |
1285 @cindex @samp{\b} | |
1286 @cindex word boundaries, matching | |
1287 | |
1288 This operator (represented by @samp{\b}) matches the empty string at | |
1289 either the beginning or the end of a word. For example, @samp{\brat\b} | |
1290 matches the separate word @samp{rat}. | |
1291 | |
13533 | 1292 @node Match-within-word Operator |
13531 | 1293 @subsection The Match-within-word Operator (@code{\B}) |
1294 | |
1295 @cindex @samp{\B} | |
1296 | |
1297 This operator (represented by @samp{\B}) matches the empty string within | |
1298 a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but | |
1299 @samp{dirty \Brat} doesn't match @samp{dirty rat}. | |
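The @samp{\b} and @samp{\B} examples can be checked from C. Both operators are @sc{gnu} extensions, so this sketch assumes a @code{regcomp} that accepts them, such as the one in the GNU C Library. The helper name @code{found} is ours.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN matches somewhere in TEXT, 0 otherwise.
   Illustrative helper only.  */
static int found(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Remember that each backslash must be doubled in a C string literal: @code{found ("\\brat\\b", "a rat here")} returns 1, @code{found ("\\brat\\b", "crate")} returns 0, and @code{found ("c\\Brat\\Be", "crate")} returns 1.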
1300 | |
13533 | 1301 @node Match-beginning-of-word Operator |
13531 | 1302 @subsection The Match-beginning-of-word Operator (@code{\<}) |
1303 | |
1304 @cindex @samp{\<} | |
1305 | |
1306 This operator (represented by @samp{\<}) matches the empty string at the | |
1307 beginning of a word. | |
1308 | |
13533 | 1309 @node Match-end-of-word Operator |
13531 | 1310 @subsection The Match-end-of-word Operator (@code{\>}) |
1311 | |
1312 @cindex @samp{\>} | |
1313 | |
1314 This operator (represented by @samp{\>}) matches the empty string at the | |
1315 end of a word. | |
1316 | |
13533 | 1317 @node Match-word-constituent Operator |
13531 | 1318 @subsection The Match-word-constituent Operator (@code{\w}) |
1319 | |
1320 @cindex @samp{\w} | |
1321 | |
1322 This operator (represented by @samp{\w}) matches any word-constituent | |
1323 character. | |
1324 | |
13533 | 1325 @node Match-non-word-constituent Operator |
13531 | 1326 @subsection The Match-non-word-constituent Operator (@code{\W}) |
1327 | |
1328 @cindex @samp{\W} | |
1329 | |
1330 This operator (represented by @samp{\W}) matches any character that is | |
1331 not word-constituent. | |
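The remaining word operators (@samp{\<}, @samp{\>}, @samp{\w} and @samp{\W}) have no examples in the text, so here is a hedged sketch of how they behave. It assumes a @code{regcomp} that accepts these @sc{gnu} extensions, such as the one in the GNU C Library, and the default word definition (letters, digits and @samp{_}). The helper name @code{match_start} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the offset at which the first match of PATTERN starts in
   TEXT, or -1 if there is no match.  Illustrative helper only.  */
static int match_start(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    int start = -1;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        start = (int) m[0].rm_so;

    regfree(&re);
    return start;
}
```

In @samp{brat rat}, the pattern @samp{\<rat} skips the @samp{rat} inside @samp{brat} and matches at offset 5. Likewise @samp{rat\>} skips the one inside @samp{rats} in @samp{rats rat}.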
1332 | |
1333 | |
13533 | 1334 @node Buffer Operators |
13532 | 1335 @section Buffer Operators |
13531 | 1336 |
1337 Following are operators which work on buffers. In Emacs, a @dfn{buffer} | |
1338 is, naturally, an Emacs buffer. For other programs, Regex considers the | |
1339 entire string to be matched as the buffer. | |
1340 | |
1341 @menu | |
1342 * Match-beginning-of-buffer Operator:: \` | |
1343 * Match-end-of-buffer Operator:: \' | |
1344 @end menu | |
1345 | |
1346 | |
13533 | 1347 @node Match-beginning-of-buffer Operator |
13531 | 1348 @subsection The Match-beginning-of-buffer Operator (@code{\`}) |
1349 | |
1350 @cindex @samp{\`} | |
1351 | |
1352 This operator (represented by @samp{\`}) matches the empty string at the | |
1353 beginning of the buffer. | |
1354 | |
1355 @node Match-end-of-buffer Operator |
13531 | 1356 @subsection The Match-end-of-buffer Operator (@code{\'}) |
1357 | |
1358 @cindex @samp{\'} | |
1359 | |
1360 This operator (represented by @samp{\'}) matches the empty string at the | |
1361 end of the buffer. | |
1362 | |
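The difference between the buffer operators and the line anchors
@samp{^} and @samp{$} is easiest to see on a string that contains a
newline. The sketch below is only an illustration: it assumes glibc's
@file{regex.h} with @code{_GNU_SOURCE} defined, the
@code{RE_SYNTAX_POSIX_EXTENDED} syntax, and that
@code{re_compile_pattern} leaves @code{newline_anchor} set; the helper
name @code{search_index} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: index of the first match of PATTERN in TEXT,
   -1 if there is no match, -2 on any error.  */
int
search_index (const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  int len = (int) strlen (text);
  int index;

  memset (&buf, 0, sizeof buf);     /* no fastmap, no translate table */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* Try every starting position from 0 up to the end of TEXT.  */
  index = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return index;
}
```

Searching the three-character string @code{"a\nb"}, the pattern
@code{"^b"} should match at index 2 (after the newline), while
@code{"\\`b"} should not match at all, since @samp{b} is not at the
beginning of the buffer. Likewise @code{"a$"} should match at index 0,
but @code{"a\\'"} should fail.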
1363 | |
1364 @node GNU Emacs Operators |
13531 | 1365 @chapter GNU Emacs Operators |
1366 | |
1367 Following are operators that @sc{gnu} defines (and @sc{posix} doesn't) | |
1368 that you can use only when Regex is compiled with the preprocessor | |
13532 | 1369 symbol @code{emacs} defined. |
13531 | 1370 |
1371 @menu | |
1372 * Syntactic Class Operators:: | |
1373 @end menu | |
1374 | |
1375 | |
1376 @node Syntactic Class Operators |
13531 | 1377 @section Syntactic Class Operators |
1378 | |
1379 The operators in this section require Regex to recognize the syntactic | |
1380 classes of characters. Regex uses a syntax table to determine this. | |
1381 | |
1382 @menu | |
1383 * Emacs Syntax Tables:: | |
1384 * Match-syntactic-class Operator:: \sCLASS | |
1385 * Match-not-syntactic-class Operator:: \SCLASS | |
1386 @end menu | |
1387 | |
1388 @node Emacs Syntax Tables |
13531 | 1389 @subsection Emacs Syntax Tables |
1390 | |
1391 A @dfn{syntax table} is an array indexed by the characters in your | |
1392 character set. In the @sc{ascii} encoding, therefore, a syntax table | |
1393 has 256 elements. | |
1394 | |
1395 If Regex is compiled with the preprocessor symbol @code{emacs} defined, | |
1396 then Regex expects you to define and initialize the variable | |
1397 @code{re_syntax_table} to be an Emacs syntax table. Emacs' syntax | |
1398 tables are more complicated than Regex's own (@pxref{Non-Emacs Syntax | |
1399 Tables}). @xref{Syntax, , Syntax, emacs, The GNU Emacs User's Manual}, | |
1400 for a description of Emacs' syntax tables. | |
1401 | |
1402 @node Match-syntactic-class Operator |
13531 | 1403 @subsection The Match-syntactic-class Operator (@code{\s}@var{class}) |
1404 | |
1405 @cindex @samp{\s} | |
1406 | |
1407 This operator matches any character whose syntactic class is represented | |
1408 by a specified character. @samp{\s@var{class}} represents this operator | |
1409 where @var{class} is the character representing the syntactic class you | |
1410 want. For example, @samp{w} represents the syntactic | |
1411 class of word-constituent characters, so @samp{\sw} matches any | |
1412 word-constituent character. | |
1413 | |
1414 @node Match-not-syntactic-class Operator |
13531 | 1415 @subsection The Match-not-syntactic-class Operator (@code{\S}@var{class}) |
1416 | |
1417 @cindex @samp{\S} | |
1418 | |
1419 This operator is similar to the match-syntactic-class operator except | |
1420 that it matches any character whose syntactic class is @emph{not} | |
1421 represented by the specified character. @samp{\S@var{class}} represents | |
1422 this operator. For example, @samp{w} represents the syntactic class of | |
1423 word-constituent characters, so @samp{\Sw} matches any character that is | |
1424 not word-constituent. | |
1425 | |
1426 | |
1427 @node What Gets Matched? |
13531 | 1428 @chapter What Gets Matched? |
1429 | |
1430 Regex usually matches strings according to the ``leftmost longest'' | |
1431 rule; that is, it chooses the longest of the leftmost matches. This | |
1432 does not mean that, for a regular expression containing subexpressions, |
1433 it simply chooses the longest match for each subexpression, left to |
1434 right; the overall match must also be the longest possible one. | |
1435 | |
1436 For example, @samp{(ac*)(c*d[ac]*)\1} matches @samp{acdacaaa}, not | |
1437 @samp{acdac}, as it would if it were to choose the longest match for the | |
1438 first subexpression. | |
1439 | |
1440 | |
1441 @node Programming with Regex |
13531 | 1442 @chapter Programming with Regex |
1443 | |
1444 Here we describe how you use the Regex data structures and functions in | |
1445 C programs. Regex has three interfaces: one designed for @sc{gnu}, one | |
1446 compatible with @sc{posix} and one compatible with Berkeley @sc{unix}. | |
1447 | |
1448 @menu | |
1449 * GNU Regex Functions:: | |
1450 * POSIX Regex Functions:: | |
1451 * BSD Regex Functions:: | |
1452 @end menu | |
1453 | |
1454 | |
1455 @node GNU Regex Functions |
13531 | 1456 @section GNU Regex Functions |
1457 | |
1458 If you're writing code that doesn't need to be compatible with either | |
1459 @sc{posix} or Berkeley @sc{unix}, you can use these functions. They | |
1460 provide more options than the other interfaces. | |
1461 | |
1462 @menu | |
1463 * GNU Pattern Buffers:: The re_pattern_buffer type. | |
1464 * GNU Regular Expression Compiling:: re_compile_pattern () | |
1465 * GNU Matching:: re_match () | |
1466 * GNU Searching:: re_search () | |
1467 * Matching/Searching with Split Data:: re_match_2 (), re_search_2 () | |
1468 * Searching with Fastmaps:: re_compile_fastmap () | |
1469 * GNU Translate Tables:: The `translate' field. | |
1470 * Using Registers:: The re_registers type and related fns. | |
1471 * Freeing GNU Pattern Buffers:: regfree () | |
1472 @end menu | |
1473 | |
1474 | |
1475 @node GNU Pattern Buffers |
13531 | 1476 @subsection GNU Pattern Buffers |
1477 | |
1478 @cindex pattern buffer, definition of | |
1479 @tindex re_pattern_buffer @r{definition} | |
1480 @tindex struct re_pattern_buffer @r{definition} | |
1481 | |
1482 To compile, match, or search for a given regular expression, you must | |
1483 supply a pattern buffer. A @dfn{pattern buffer} holds one compiled | |
1484 regular expression.@footnote{Regular expressions are also referred to as | |
1485 ``patterns,'' hence the name ``pattern buffer.''} | |
1486 | |
1487 You can have several different pattern buffers simultaneously, each | |
1488 holding a compiled pattern for a different regular expression. | |
1489 | |
1490 @file{regex.h} defines the pattern buffer @code{struct} as follows: | |
1491 | |
1492 @example | |
1493 /* Space that holds the compiled pattern. It is declared as | |
1494 `unsigned char *' because its elements are | |
1495 sometimes used as array indexes. */ | |
1496 unsigned char *buffer; | |
1497 | |
1498 /* Number of bytes to which `buffer' points. */ | |
1499 unsigned long allocated; | |
1500 | |
1501 /* Number of bytes actually used in `buffer'. */ | |
13532 | 1502 unsigned long used; |
13531 | 1503 |
1504 /* Syntax setting with which the pattern was compiled. */ | |
1505 reg_syntax_t syntax; | |
1506 | |
1507 /* Pointer to a fastmap, if any, otherwise zero. re_search uses | |
1508 the fastmap, if there is one, to skip over impossible | |
1509 starting points for matches. */ | |
1510 char *fastmap; | |
1511 | |
1512 /* Either a translate table to apply to all characters before | |
1513 comparing them, or zero for no translation. The translation | |
1514 is applied to a pattern when it is compiled and to a string | |
1515 when it is matched. */ | |
1516 char *translate; | |
1517 | |
1518 /* Number of subexpressions found by the compiler. */ | |
1519 size_t re_nsub; | |
1520 | |
1521 /* Zero if this pattern cannot match the empty string, one else. | |
1522 Well, in truth it's used only in `re_search_2', to see | |
1523 whether or not we should use the fastmap, so we don't set | |
1524 this absolutely perfectly; see `re_compile_fastmap' (the | |
1525 `duplicate' case). */ | |
1526 unsigned can_be_null : 1; | |
1527 | |
1528 /* If REGS_UNALLOCATED, allocate space in the `regs' structure | |
1529 for `max (RE_NREGS, re_nsub + 1)' groups. | |
1530 If REGS_REALLOCATE, reallocate space if necessary. | |
1531 If REGS_FIXED, use what's there. */ | |
1532 #define REGS_UNALLOCATED 0 | |
1533 #define REGS_REALLOCATE 1 | |
1534 #define REGS_FIXED 2 | |
1535 unsigned regs_allocated : 2; | |
1536 | |
1537 /* Set to zero when `regex_compile' compiles a pattern; set to one | |
1538 by `re_compile_fastmap' if it updates the fastmap. */ | |
1539 unsigned fastmap_accurate : 1; | |
1540 | |
1541 /* If set, `re_match_2' does not return information about | |
1542 subexpressions. */ | |
1543 unsigned no_sub : 1; | |
1544 | |
1545 /* If set, a beginning-of-line anchor doesn't match at the | |
13532 | 1546 beginning of the string. */ |
13531 | 1547 unsigned not_bol : 1; |
1548 | |
1549 /* Similarly for an end-of-line anchor. */ | |
1550 unsigned not_eol : 1; | |
1551 | |
1552 /* If true, an anchor at a newline matches. */ | |
1553 unsigned newline_anchor : 1; | |
1554 | |
1555 @end example | |
1556 | |
1557 | |
1558 @node GNU Regular Expression Compiling |
13531 | 1559 @subsection GNU Regular Expression Compiling |
1560 | |
1561 In @sc{gnu}, you can both match and search for a given regular | |
1562 expression. To do either, you must first compile it in a pattern buffer | |
1563 (@pxref{GNU Pattern Buffers}). | |
1564 | |
1565 @cindex syntax initialization | |
1566 @vindex re_syntax_options @r{initialization} | |
1567 Regular expressions match according to the syntax with which they were | |
1568 compiled; with @sc{gnu}, you indicate what syntax you want by setting | |
1569 the variable @code{re_syntax_options} (declared in @file{regex.h}) |
1570 before calling the compiling function, @code{re_compile_pattern} (see |
1571 below). @xref{Syntax Bits}, and @ref{Predefined Syntaxes}. |
13531 | 1572 |
1573 You can change the value of @code{re_syntax_options} at any time. | |
1574 Usually, however, you set its value once and then never change it. | |
1575 | |
1576 @cindex pattern buffer initialization | |
1577 @code{re_compile_pattern} takes a pattern buffer as an argument. You | |
1578 must initialize the following fields: | |
1579 | |
1580 @table @code | |
1581 | |
1584 @item translate |
1585 @vindex translate @r{initialization} | |
1586 Initialize this to point to a translate table if you want one, or to | |
1587 zero if you don't. We explain translate tables in @ref{GNU Translate | |
1588 Tables}. | |
1589 | |
1590 @item fastmap | |
1591 @vindex fastmap @r{initialization} | |
1592 Initialize this to nonzero if you want a fastmap, or to zero if you | |
1593 don't. | |
1594 | |
1595 @item buffer | |
1596 @itemx allocated | |
1597 @vindex buffer @r{initialization} | |
1598 @vindex allocated @r{initialization} | |
1599 @findex malloc | |
1600 If you want @code{re_compile_pattern} to allocate memory for the | |
1601 compiled pattern, set both of these to zero. If you have an existing | |
1602 block of memory (allocated with @code{malloc}) you want Regex to use, | |
1603 set @code{buffer} to its address and @code{allocated} to its size (in | |
1604 bytes). | |
1605 | |
1606 @code{re_compile_pattern} uses @code{realloc} to extend the space for | |
1607 the compiled pattern as necessary. | |
1608 | |
1609 @end table | |
1610 | |
1611 To compile a pattern buffer, use: | |
1612 | |
1613 @findex re_compile_pattern | |
1614 @example | |
13532 | 1615 char * |
1616 re_compile_pattern (const char *@var{regex}, const int @var{regex_size}, | |
13531 | 1617 struct re_pattern_buffer *@var{pattern_buffer}) |
1618 @end example | |
1619 | |
1620 @noindent | |
1621 @var{regex} is the regular expression's address, @var{regex_size} is its | |
1622 length, and @var{pattern_buffer} is the pattern buffer's address. | |
1623 | |
1624 If @code{re_compile_pattern} successfully compiles the regular | |
1625 expression, it returns zero and sets @code{*@var{pattern_buffer}} to the | |
1626 compiled pattern. It sets the pattern buffer's fields as follows: | |
1627 | |
1628 @table @code | |
1629 @item buffer | |
1630 @vindex buffer @r{field, set by @code{re_compile_pattern}} | |
1631 to the compiled pattern. | |
1632 | |
1633 @item used | |
1634 @vindex used @r{field, set by @code{re_compile_pattern}} | |
1635 to the number of bytes the compiled pattern in @code{buffer} occupies. | |
1636 | |
1637 @item syntax | |
1638 @vindex syntax @r{field, set by @code{re_compile_pattern}} | |
1639 to the current value of @code{re_syntax_options}. | |
1640 | |
1641 @item re_nsub | |
1642 @vindex re_nsub @r{field, set by @code{re_compile_pattern}} | |
1643 to the number of subexpressions in @var{regex}. | |
1644 | |
1645 @item fastmap_accurate | |
1646 @vindex fastmap_accurate @r{field, set by @code{re_compile_pattern}} | |
1647 to zero on the theory that the pattern you're compiling is different | |
1648 than the one previously compiled into @code{buffer}; in that case (since | |
13532 | 1649 you can't make a fastmap without a compiled pattern), |
13531 | 1650 @code{fastmap} would either contain an incompatible fastmap, or nothing |
1651 at all. | |
1652 | |
1653 @c xx what else? | |
1654 @end table | |
1655 | |
1656 If @code{re_compile_pattern} can't compile @var{regex}, it returns an | |
1657 error string corresponding to one of the errors listed in @ref{POSIX | |
1658 Regular Expression Compiling}. | |
1659 | |
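The initialization and compilation steps described in this section can
be sketched in C as follows. This is a minimal illustration rather than
a definitive recipe: it assumes glibc's @file{regex.h} with
@code{_GNU_SOURCE} defined, and the helper name @code{compile_gnu} is
hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN into *BUF using SYNTAX.
   Returns NULL on success, or the library's error string.  */
const char *
compile_gnu (struct re_pattern_buffer *buf, const char *pattern,
             reg_syntax_t syntax)
{
  /* Zero every field: `buffer' and `allocated' at zero ask the library
     to allocate the space, `translate' at zero means no translation,
     and `fastmap' at zero means no fastmap.  */
  memset (buf, 0, sizeof *buf);

  /* The syntax is read from this global when the pattern is compiled.  */
  re_syntax_options = syntax;

  return re_compile_pattern (pattern, strlen (pattern), buf);
}
```

A caller would check the result against a null pointer, use the
pattern buffer, and release it with @code{regfree} when done. For
instance, compiling @samp{a(b|c)d} with
@code{RE_SYNTAX_POSIX_EXTENDED} should succeed and leave
@code{re_nsub} at 1, whereas @samp{a(b} should yield an error string.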
1660 | |
1661 @node GNU Matching |
13532 | 1662 @subsection GNU Matching |
13531 | 1663 |
1664 @cindex matching with GNU functions | |
1665 | |
1666 Matching the @sc{gnu} way means trying to match as much of a string as | |
1667 possible starting at a position within it you specify. Once you've compiled | |
1668 a pattern into a pattern buffer (@pxref{GNU Regular Expression | |
1669 Compiling}), you can ask the matcher to match that pattern against a | |
1670 string using: | |
1671 | |
1672 @findex re_match | |
1673 @example | |
1674 int | |
13532 | 1675 re_match (struct re_pattern_buffer *@var{pattern_buffer}, |
1676 const char *@var{string}, const int @var{size}, | |
13531 | 1677 const int @var{start}, struct re_registers *@var{regs}) |
1678 @end example | |
1679 | |
1680 @noindent | |
1681 @var{pattern_buffer} is the address of a pattern buffer containing a | |
1682 compiled pattern. @var{string} is the string you want to match; it can | |
1683 contain newline and null characters. @var{size} is the length of that | |
1684 string. @var{start} is the string index at which you want to | |
1685 begin matching; the first character of @var{string} is at index zero. | |
1686 @xref{Using Registers}, for an explanation of @var{regs}; you can safely |
1687 pass zero. | |
1688 | |
1689 @code{re_match} matches the regular expression in @var{pattern_buffer} | |
1690 against the string @var{string} according to the syntax in | |
1691 @var{pattern_buffer}'s @code{syntax} field. (@xref{GNU Regular |
1692 Expression Compiling}, for how to set it.) The function returns | |
1693 @math{-1} if the compiled pattern does not match any part of | |
1694 @var{string} and @math{-2} if an internal error happens; otherwise, it | |
1695 returns how many (possibly zero) characters of @var{string} the pattern | |
1696 matched. | |
1697 | |
1698 An example: suppose @var{pattern_buffer} points to a pattern buffer | |
1699 containing the compiled pattern for @samp{a*}, and @var{string} points | |
1700 to @samp{aaaaab} (whereupon @var{size} should be 6). Then if @var{start} | |
1701 is 2, @code{re_match} returns 3, i.e., @samp{a*} would have matched the | |
1702 last three @samp{a}s in @var{string}. If @var{start} is 0, | |
1703 @code{re_match} returns 5, i.e., @samp{a*} would have matched all the | |
1704 @samp{a}s in @var{string}. If @var{start} is either 5 or 6, it returns | |
1705 zero. | |
1706 | |
1707 If @var{start} is not between zero and @var{size}, then | |
1708 @code{re_match} returns @math{-1}. | |
1709 | |
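The @samp{a*} example above can be reproduced with a short C sketch.
It assumes glibc's @file{regex.h} with @code{_GNU_SOURCE} defined; the
helper name @code{match_length_at} is hypothetical, and for brevity it
reports a compilation failure with the same value (@math{-2}) that
@code{re_match} uses for an internal error.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN and return how many characters
   of TEXT it matches starting at index START (-1 if none, -2 on error).  */
int
match_length_at (const char *pattern, const char *text, int start)
{
  struct re_pattern_buffer buf;
  int matched;

  memset (&buf, 0, sizeof buf);     /* let the library allocate */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* A null REGS argument means no register information is wanted.  */
  matched = re_match (&buf, text, (int) strlen (text), start, NULL);
  regfree (&buf);
  return matched;
}
```

Calling @code{match_length_at ("a*", "aaaaab", 2)} should return 3, a
@var{start} of 0 should return 5, and a @var{start} of 5 or 6 should
return zero. A @var{start} of 7 lies outside the string, so the result
should be @math{-1}.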
1710 | |
1711 @node GNU Searching |
13532 | 1712 @subsection GNU Searching |
13531 | 1713 |
1714 @cindex searching with GNU functions | |
1715 | |
1716 @dfn{Searching} means trying to match starting at successive positions | |
1717 within a string. The function @code{re_search} does this. | |
1718 | |
1719 Before calling @code{re_search}, you must compile your regular | |
1720 expression. @xref{GNU Regular Expression Compiling}. | |
1721 | |
1722 Here is the function declaration: | |
1723 | |
1724 @findex re_search | |
1725 @example | |
13532 | 1726 int |
1727 re_search (struct re_pattern_buffer *@var{pattern_buffer}, | |
1728 const char *@var{string}, const int @var{size}, | |
1729 const int @var{start}, const int @var{range}, | |
13531 | 1730 struct re_registers *@var{regs}) |
1731 @end example | |
1732 | |
1733 @noindent | |
1734 @vindex start @r{argument to @code{re_search}} | |
1735 @vindex range @r{argument to @code{re_search}} | |
1736 whose arguments are the same as those to @code{re_match} (@pxref{GNU | |
1737 Matching}) except that the two arguments @var{start} and @var{range} | |
1738 replace @code{re_match}'s argument @var{start}. | |
1739 | |
1740 If @var{range} is positive, then @code{re_search} attempts a match | |
1741 starting first at index @var{start}, then at @math{@var{start} + 1} if | |
1742 that fails, and so on, up to @math{@var{start} + @var{range}}; if | |
1743 @var{range} is negative, then it attempts a match starting first at | |
1744 index @var{start}, then at @math{@var{start} - 1} if that fails, and so |
13532 | 1745 on. |
13531 | 1746 |
1747 If @var{start} is not between zero and @var{size}, then @code{re_search} | |
1748 returns @math{-1}. When @var{range} is positive, @code{re_search} | |
1749 adjusts @var{range} so that @math{@var{start} + @var{range} - 1} is | |
1750 between zero and @var{size}, if necessary; that way it won't search | |
1751 outside of @var{string}. Similarly, when @var{range} is negative, | |
1752 @code{re_search} adjusts @var{range} so that @math{@var{start} + | |
1753 @var{range} + 1} is between zero and @var{size}, if necessary. | |
1754 | |
1755 If the @code{fastmap} field of @var{pattern_buffer} is zero, | |
1756 @code{re_search} matches starting at consecutive positions; otherwise, | |
1757 it uses @code{fastmap} to make the search more efficient. | |
1758 @xref{Searching with Fastmaps}. | |
1759 | |
1760 If no match is found, @code{re_search} returns @math{-1}. If | |
1761 a match is found, it returns the index where the match began. If an | |
1762 internal error happens, it returns @math{-2}. | |
1763 | |
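To make the roles of @var{start} and @var{range} concrete, here is a
small C sketch of a forward and a backward search. It assumes glibc's
@file{regex.h} with @code{_GNU_SOURCE} defined; the helper name
@code{search_from} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: search TEXT for PATTERN, trying starting
   positions from START through START + RANGE.  Returns the index where
   the match begins, -1 if there is none, -2 on any error.  */
int
search_from (const char *pattern, const char *text, int start, int range)
{
  struct re_pattern_buffer buf;
  int index;

  memset (&buf, 0, sizeof buf);     /* fastmap stays zero: none is used */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  index = re_search (&buf, text, (int) strlen (text), start, range, NULL);
  regfree (&buf);
  return index;
}
```

In the string @samp{aabcabc}, @samp{bc} occurs at indexes 2 and 5.
A forward search, @code{search_from ("bc", "aabcabc", 0, 7)}, should
return 2, while a backward search from the end,
@code{search_from ("bc", "aabcabc", 7, -7)}, should return 5.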
1764 | |
1765 @node Matching/Searching with Split Data |
13531 | 1766 @subsection Matching and Searching with Split Data |
1767 | |
1768 Using the functions @code{re_match_2} and @code{re_search_2}, you can | |
13532 | 1769 match or search in data that is divided into two strings. |
13531 | 1770 |
1771 The function: | |
1772 | |
1773 @findex re_match_2 | |
1774 @example | |
1775 int | |
13532 | 1776 re_match_2 (struct re_pattern_buffer *@var{buffer}, |
1777 const char *@var{string1}, const int @var{size1}, | |
1778 const char *@var{string2}, const int @var{size2}, | |
1779 const int @var{start}, | |
1780 struct re_registers *@var{regs}, | |
13531 | 1781 const int @var{stop}) |
1782 @end example | |
1783 | |
1784 @noindent | |
1785 is similar to @code{re_match} (@pxref{GNU Matching}) except that you | |
1786 pass @emph{two} data strings and sizes, and an index @var{stop} beyond | |
1787 which you don't want the matcher to try matching. As with | |
1788 @code{re_match}, if it succeeds, @code{re_match_2} returns how many | |
1789 characters of the two strings it matched. Regard @var{string1} and |
1790 @var{string2} as concatenated when you set the arguments @var{start} and | |
1791 @var{stop} and use the contents of @var{regs}; @code{re_match_2} never | |
13532 | 1792 returns a value larger than @math{@var{size1} + @var{size2}}. |
13531 | 1793 |
1794 The function: | |
1795 | |
1796 @findex re_search_2 | |
1797 @example | |
1798 int | |
13532 | 1799 re_search_2 (struct re_pattern_buffer *@var{buffer}, |
1800 const char *@var{string1}, const int @var{size1}, | |
1801 const char *@var{string2}, const int @var{size2}, | |
1802 const int @var{start}, const int @var{range}, | |
1803 struct re_registers *@var{regs}, | |
13531 | 1804 const int @var{stop}) |
1805 @end example | |
1806 | |
1807 @noindent | |
1808 is similarly related to @code{re_search}. | |
1809 | |
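The following C sketch shows a match that straddles the boundary
between the two strings. It assumes glibc's @file{regex.h} with
@code{_GNU_SOURCE} defined; the helper names @code{compile_pattern},
@code{search_split} and @code{match_split} are hypothetical, and
@var{stop} is simply set to the combined length.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN into *BUF; 0 on success.  */
int
compile_pattern (struct re_pattern_buffer *buf, const char *pattern)
{
  memset (buf, 0, sizeof *buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  return re_compile_pattern (pattern, strlen (pattern), buf) == NULL ? 0 : -1;
}

/* Search the concatenation of S1 and S2; returns the match index,
   counted as if the two strings were one.  */
int
search_split (const char *pattern, const char *s1, const char *s2)
{
  struct re_pattern_buffer buf;
  int n1 = (int) strlen (s1), n2 = (int) strlen (s2);
  int index;

  if (compile_pattern (&buf, pattern) != 0)
    return -2;
  index = re_search_2 (&buf, s1, n1, s2, n2, 0, n1 + n2, NULL, n1 + n2);
  regfree (&buf);
  return index;
}

/* Match at index START of the concatenation; returns the match length.  */
int
match_split (const char *pattern, const char *s1, const char *s2, int start)
{
  struct re_pattern_buffer buf;
  int n1 = (int) strlen (s1), n2 = (int) strlen (s2);
  int matched;

  if (compile_pattern (&buf, pattern) != 0)
    return -2;
  matched = re_match_2 (&buf, s1, n1, s2, n2, start, NULL, n1 + n2);
  regfree (&buf);
  return matched;
}
```

With the data split as @samp{foo} and @samp{bar}, searching for
@samp{ob} should return 2, and matching @samp{ob} at index 2 should
return a length of 2, even though the match begins in the first string
and ends in the second.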
1810 | |
1811 @node Searching with Fastmaps |
13531 | 1812 @subsection Searching with Fastmaps |
1813 | |
1814 @cindex fastmaps | |
1815 If you're searching through a long string, you should use a fastmap. | |
1816 Without one, the searcher tries to match at consecutive positions in the | |
1817 string. Generally, most of the characters in the string could not start | |
1818 a match. It takes much longer to try matching at a given position in the | |
1819 string than it does to check in a table whether or not the character at | |
1820 that position could start a match. A @dfn{fastmap} is such a table. | |
1821 | |
1822 More specifically, a fastmap is an array indexed by the characters in | |
1823 your character set. Under the @sc{ascii} encoding, therefore, a fastmap | |
1824 has 256 elements. If you want the searcher to use a fastmap with a | |
1825 given pattern buffer, you must allocate the array and assign the array's | |
1826 address to the pattern buffer's @code{fastmap} field. You either can | |
1827 compile the fastmap yourself or have @code{re_search} do it for you; | |
1828 when @code{fastmap} is nonzero, it automatically compiles a fastmap the | |
13532 | 1829 first time you search using a particular compiled pattern. |
13531 | 1830 |
1831 To compile a fastmap yourself, use: | |
1832 | |
1833 @findex re_compile_fastmap | |
1834 @example | |
1835 int | |
1836 re_compile_fastmap (struct re_pattern_buffer *@var{pattern_buffer}) | |
1837 @end example | |
1838 | |
1839 @noindent | |
1840 @var{pattern_buffer} is the address of a pattern buffer. If the | |
1841 character @var{c} could start a match for the pattern, | |
1842 @code{re_compile_fastmap} makes | |
1843 @code{@var{pattern_buffer}->fastmap[@var{c}]} nonzero. It returns | |
1844 @math{0} if it can compile a fastmap and @math{-2} if there is an | |
1845 internal error. For example, if @samp{|} is the alternation operator | |
1846 and @var{pattern_buffer} holds the compiled pattern for @samp{a|b}, then | |
1847 @code{re_compile_fastmap} sets @code{fastmap['a']} and | |
1848 @code{fastmap['b']} (and no others). | |
1849 | |
1850 @code{re_search} uses a fastmap as it moves along in the string: it | |
1851 checks the string's characters until it finds one that's in the fastmap. | |
1852 Then it tries matching at that character. If the match fails, it | |
1853 repeats the process. So, by using a fastmap, @code{re_search} doesn't | |
1854 waste time trying to match at positions in the string that couldn't | |
1855 start a match. | |
1856 | |
1857 If you don't want @code{re_search} to use a fastmap, | |
1858 store zero in the @code{fastmap} field of the pattern buffer before | |
1859 calling @code{re_search}. | |
1860 | |
1861 Once you've initialized a pattern buffer's @code{fastmap} field, you | |
1862 need never do so again---even if you compile a new pattern in | |
1863 it---provided the way the field is set still reflects whether or not you | |
1864 want a fastmap. @code{re_search} will still either do nothing if | |
1865 @code{fastmap} is null or, if it isn't, compile a new fastmap for the | |
1866 new pattern. | |
1867 | |
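As a sketch of compiling a fastmap yourself, the C fragment below
allocates the table, compiles a pattern, and then asks whether a given
character could start a match. It assumes glibc's @file{regex.h} with
@code{_GNU_SOURCE} defined, a 256-entry table, and glibc's behavior
that @code{regfree} also releases the fastmap (which is why the table
comes from @code{malloc}); the helper name @code{fastmap_has} is
hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if character C could start a match for
   PATTERN according to its fastmap, 0 if not, -1 on any error.  */
int
fastmap_has (const char *pattern, unsigned char c)
{
  struct re_pattern_buffer buf;
  int hit;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  /* One entry per possible byte value.  */
  buf.fastmap = malloc (256);
  if (buf.fastmap == NULL)
    return -1;

  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL
      || re_compile_fastmap (&buf) != 0)
    {
      regfree (&buf);           /* assumption: also frees the fastmap */
      return -1;
    }

  hit = buf.fastmap[c] != 0;
  regfree (&buf);
  return hit;
}
```

For the pattern @samp{a|b}, @code{fastmap_has} should report 1 for
@samp{a} and @samp{b} and 0 for @samp{c}.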
1868 @node GNU Translate Tables |
13531 | 1869 @subsection GNU Translate Tables |
1870 | |
1871 If you set the @code{translate} field of a pattern buffer to a translate | |
1872 table, then the @sc{gnu} Regex functions to which you've passed that | |
1873 pattern buffer use it to apply a simple transformation | |
1874 to all the regular expression and string characters at which they look. | |
1875 | |
1876 A @dfn{translate table} is an array indexed by the characters in your | |
1877 character set. Under the @sc{ascii} encoding, therefore, a translate | |
1878 table has 256 elements. The array's elements are also characters in | |
1879 your character set. When the Regex functions see a character @var{c}, | |
1880 they use @code{translate[@var{c}]} in its place, with one exception: the | |
1881 character after a @samp{\} is not translated. (This ensures that the |
1882 operators, e.g., @samp{\B} and @samp{\b}, are always distinguishable.) | |
1883 | |
1884 For example, a table that maps all lowercase letters to the | |
1885 corresponding uppercase ones would cause the matcher to ignore | |
1886 differences in case.@footnote{A table that maps all uppercase letters to | |
1887 the corresponding lowercase ones would work just as well for this | |
1888 purpose.} Such a table would map all characters except lowercase letters | |
1889 to themselves, and lowercase letters to the corresponding uppercase | |
1890 ones. Under the @sc{ascii} encoding, here's how you could initialize | |
1891 such a table (we'll call it @code{case_fold}): | |
1892 | |
1893 @example | |
1894 for (i = 0; i < 256; i++) | |
1895 case_fold[i] = i; | |
1896 for (i = 'a'; i <= 'z'; i++) | |
1897 case_fold[i] = i - ('a' - 'A'); | |
1898 @end example | |
1899 | |
1900 You tell Regex to use a translate table on a given pattern buffer by | |
1901 assigning that table's address to the @code{translate} field of that | |
1902 buffer. If you don't want Regex to do any translation, put zero into | |
1903 this field. You'll get weird results if you change the table's contents | |
1904 anytime between compiling the pattern buffer, compiling its fastmap, and | |
1905 matching or searching with the pattern buffer. | |
1906 | |
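Putting the @code{case_fold} table to work might look like the
following C sketch. It assumes glibc's @file{regex.h} with
@code{_GNU_SOURCE} defined, where the @code{translate} field points to
@code{unsigned char} and @code{regfree} releases the table (hence the
@code{malloc}); other builds of Regex may differ. The helper name
@code{match_ignoring_case} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the start of TEXT, ignoring
   case.  Returns the match length, -1 if none, -2 on any error.  */
int
match_ignoring_case (const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  unsigned char *case_fold;
  int i, matched;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  case_fold = malloc (256);
  if (case_fold == NULL)
    return -2;
  for (i = 0; i < 256; i++)         /* identity mapping by default */
    case_fold[i] = (unsigned char) i;
  for (i = 'a'; i <= 'z'; i++)      /* fold lowercase to uppercase */
    case_fold[i] = (unsigned char) (i - ('a' - 'A'));

  /* The table must be in place before compiling, because it is
     applied to the pattern as well as to the string.  */
  buf.translate = case_fold;

  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    {
      regfree (&buf);           /* assumption: also frees the table */
      return -2;
    }

  matched = re_match (&buf, text, (int) strlen (text), 0, NULL);
  regfree (&buf);
  return matched;
}
```

With this table, @code{match_ignoring_case ("hello", "HeLLo")} should
return 5, while a string such as @samp{help!} should still fail to
match.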
1907 @node Using Registers |
13531 | 1908 @subsection Using Registers |
1909 | |
1910 A group in a regular expression can match a (possibly empty) substring |
1911 of the string that regular expression as a whole matched. The matcher | |
1912 remembers the beginning and end of the substring matched by | |
1913 each group. | |
1914 | |
1915 To find out what they matched, pass a nonzero @var{regs} argument to a | |
1916 @sc{gnu} matching or searching function (@pxref{GNU Matching} and | |
1917 @ref{GNU Searching}), i.e., the address of a structure of this type, as | |
1918 defined in @file{regex.h}: | |
1919 | |
1920 @c We don't bother to include this directly from regex.h, | |
1921 @c since it changes so rarely. | |
1922 @example | |
1923 @tindex re_registers | |
1924 @vindex num_regs @r{in @code{struct re_registers}} | |
1925 @vindex start @r{in @code{struct re_registers}} | |
1926 @vindex end @r{in @code{struct re_registers}} | |
1927 struct re_registers | |
1928 @{ | |
1929 unsigned num_regs; | |
1930 regoff_t *start; | |
1931 regoff_t *end; | |
1932 @}; | |
1933 @end example | |
1934 | |
1935 Except for (possibly) the @var{num_regs}'th element (see below), the | |
1936 @var{i}th element of the @code{start} and @code{end} arrays records | |
1937 information about the @var{i}th group in the pattern. (They're declared | |
1938 as C pointers, but this is only because not all C compilers accept | |
1939 zero-length arrays; conceptually, it is simplest to think of them as | |
1940 arrays.) | |
1941 | |
1942 The @code{start} and @code{end} arrays are allocated in various ways, | |
1943 depending on the value of the @code{regs_allocated} | |
1944 @vindex regs_allocated | |
1945 field in the pattern buffer passed to the matcher. | |
1946 | |
1947 The simplest and perhaps most useful is to let the matcher (re)allocate | |
1948 enough space to record information for all the groups in the regular | |
1949 expression. If @code{regs_allocated} is @code{REGS_UNALLOCATED}, | |
1950 @vindex REGS_UNALLOCATED | |
1951 the matcher allocates @math{1 + @var{re_nsub}} elements (@code{re_nsub} is another |
1952 field in the pattern buffer; @pxref{GNU Pattern Buffers}), sets the extra |
1953 element to @math{-1}, and sets @code{regs_allocated} to @code{REGS_REALLOCATE}. |
1954 @vindex REGS_REALLOCATE | |
1955 Then on subsequent calls with the same pattern buffer and @var{regs} | |
1956 arguments, the matcher reallocates more space if necessary. | |
1957 | |
1958 It would perhaps be more logical to make the @code{regs_allocated} field | |
1959 part of the @code{re_registers} structure, instead of part of the | |
1960 pattern buffer. But in that case the caller would be forced to | |
1961 initialize the structure before passing it. Much existing code doesn't | |
1962 do this initialization, and it's arguably better to avoid it anyway. | |
1963 | |
1964 @code{re_compile_pattern} sets @code{regs_allocated} to | |
1965 @code{REGS_UNALLOCATED}, | |
1966 so if you use the GNU regular expression | |
1967 functions, you get this behavior by default. | |
1968 | |
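A minimal C sketch of this default behavior follows: the caller passes
a zeroed @code{re_registers} structure, lets the matcher allocate the
arrays, reads one group's offsets, and frees the arrays afterwards.
It assumes glibc's @file{regex.h} with @code{_GNU_SOURCE} defined and
that the @code{start} and @code{end} arrays are separate
@code{malloc} blocks; the helper name @code{group_span} is
hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc, GNU interface enabled */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the start of TEXT and store
   the offsets of group GROUP in *START and *END.  Returns 0 on
   success, -1 if there is no match or on any error.  */
int
group_span (const char *pattern, const char *text, unsigned group,
            int *start, int *end)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int result = -1;

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);   /* nothing allocated yet */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;

  /* regs_allocated is REGS_UNALLOCATED after compiling, so the
     matcher allocates regs.start and regs.end on a successful match.  */
  if (re_match (&buf, text, (int) strlen (text), 0, &regs) >= 0
      && group < regs.num_regs)
    {
      *start = (int) regs.start[group];
      *end = (int) regs.end[group];
      result = 0;
    }

  free (regs.start);            /* both are null if nothing matched */
  free (regs.end);
  regfree (&buf);
  return result;
}
```

For @samp{((a)(b))} matched against @samp{ab}, group 2 should come
back as 0 and 1, and group 3 as 1 and 2. A group that takes no part in
the match, such as group 1 of @samp{(a)*b} against @samp{b}, should
come back as @math{-1} and @math{-1}.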
1969 @c xx document re_set_registers |
1970 | |
1971 @sc{posix}, on the other hand, requires a different interface: the | |
1972 caller is supposed to pass in a fixed-length array which the matcher | |
13532 | 1973 fills. Therefore, if @code{regs_allocated} is @code{REGS_FIXED}, |
13531 | 1974 @vindex REGS_FIXED |
1975 the matcher simply fills that array. | |
1976 | |
1977 The following examples illustrate the information recorded in the | |
1978 @code{re_registers} structure. (In all of them, @samp{(} represents the | |
1979 open-group and @samp{)} the close-group operator. The first character | |
1980 in the string @var{string} is at index 0.) | |
1981 | |
1982 @c xx i'm not sure this is all true anymore. | |
1983 | |
1984 @itemize @bullet | |
1985 | |
13532 | 1986 @item |
13531 | 1987 If the regular expression has an @w{@var{i}-th} |
1988 group not contained within another group that matches a | |
1989 substring of @var{string}, then the function sets | |
1990 @code{@w{@var{regs}->}start[@var{i}]} to the index in @var{string} where | |
1991 the substring matched by the @w{@var{i}-th} group begins, and | |
1992 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that | |
1993 substring's end. The function sets @code{@w{@var{regs}->}start[0]} and | |
1994 @code{@w{@var{regs}->}end[0]} to analogous information about the entire | |
1995 pattern. | |
1996 | |
1997 For example, when you match @samp{((a)(b))} against @samp{ab}, you get: | |
1998 | |
1999 @itemize | |
2000 @item | |
13532 | 2001 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]} |
13531 | 2002 |
2003 @item | |
13532 | 2004 0 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]} |
13531 | 2005 |
2006 @item | |
13532 | 2007 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]} |
13531 | 2008 |
2009 @item | |
13532 | 2010 1 in @code{@w{@var{regs}->}start[3]} and 2 in @code{@w{@var{regs}->}end[3]} |
13531 | 2011 @end itemize |
2012 | |
2013 @item | |
2014 If a group matches more than once (as it might if followed by, | |
2015 e.g., a repetition operator), then the function reports the information | |
2016 about what the group @emph{last} matched. | |
2017 | |
2018 For example, when you match the pattern @samp{(a)*} against the string | |
2019 @samp{aa}, you get: | |
2020 | |
2021 @itemize | |
2022 @item | |
13532 | 2023 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]} |
13531 | 2024 |
2025 @item | |
13532 | 2026 1 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]} |
13531 | 2027 @end itemize |
2028 | |
2029 @item | |
2030 If the @w{@var{i}-th} group does not participate in a | |
2031 successful match, e.g., it is an alternative not taken or a | |
2032 repetition operator allows zero repetitions of it, then the function | |
2033 sets @code{@w{@var{regs}->}start[@var{i}]} and | |
2034 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}. | |
2035 | |
2036 For example, when you match the pattern @samp{(a)*b} against | |
2037 the string @samp{b}, you get: | |
2038 | |
2039 @itemize | |
2040 @item | |
13532 | 2041 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]} |
13531 | 2042 |
2043 @item | |
13532 | 2044 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]} |
13531 | 2045 @end itemize |
2046 | |
2047 @item | |
2048 If the @w{@var{i}-th} group matches a zero-length string, then the | |
2049 function sets @code{@w{@var{regs}->}start[@var{i}]} and | |
2050 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that | |
13532 | 2051 zero-length string. |
13531 | 2052 |
2053 For example, when you match the pattern @samp{(a*)b} against the string | |
2054 @samp{b}, you get: | |
2055 | |
2056 @itemize | |
2057 @item | |
13532 | 2058 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]} |
13531 | 2059 |
2060 @item | |
13532 | 2061 0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]} |
13531 | 2062 @end itemize |
2063 | |
2064 @ignore | |
2065 The function sets @code{@w{@var{regs}->}start[0]} and | |
2066 @code{@w{@var{regs}->}end[0]} to analogous information about the entire | |
2067 pattern. | |
2068 | |
2069 For example, when you match the pattern @samp{(a*)} against the empty | |
2070 string, you get: | |
2071 | |
2072 @itemize | |
2073 @item | |
13532 | 2074 0 in @code{@w{@var{regs}->}start[0]} and 0 in @code{@w{@var{regs}->}end[0]} |
13531 | 2075 |
2076 @item | |
13532 | 2077 0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]} |
13531 | 2078 @end itemize |
2079 @end ignore | |
2080 | |
2081 @item | |
13532 | 2082 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group |
13531 | 2083 in turn not contained within any other group within group @var{i} and |
2084 the function reports a match of the @w{@var{i}-th} group, then it | |
2085 records in @code{@w{@var{regs}->}start[@var{j}]} and | |
2086 @code{@w{@var{regs}->}end[@var{j}]} the last match (if it matched) of | |
2087 the @w{@var{j}-th} group. | |
2088 | |
2089 For example, when you match the pattern @samp{((a*)b)*} against the | |
2090 string @samp{abb}, @w{group 2} last matches the empty string at | |
2091 index 2, so you get: | |
2092 | |
2093 @itemize | |
2094 @item | |
13532 | 2095 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]} |
13531 | 2096 |
2097 @item | |
13532 | 2098 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]} |
13531 | 2099 |
2100 @item | |
13532 | 2101 2 in @code{@w{@var{regs}->}start[2]} and 2 in @code{@w{@var{regs}->}end[2]} |
13531 | 2102 @end itemize |
2103 | |
2104 When you match the pattern @samp{((a)*b)*} against the string | |
2105 @samp{abb}, @w{group 2} doesn't participate in the last match, so you | |
2106 get: | |
2107 | |
2108 @itemize | |
2109 @item | |
13532 | 2110 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]} |
13531 | 2111 |
2112 @item | |
13532 | 2113 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]} |
13531 | 2114 |
2115 @item | |
13532 | 2116 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]} |
13531 | 2117 @end itemize |
2118 | |
2119 @item | |
2120 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group | |
2121 in turn not contained within any other group within group @var{i} | |
13532 | 2122 and the function sets |
2123 @code{@w{@var{regs}->}start[@var{i}]} and | |
13531 | 2124 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}, then it also sets |
2125 @code{@w{@var{regs}->}start[@var{j}]} and | |
2126 @code{@w{@var{regs}->}end[@var{j}]} to @math{-1}. | |
2127 | |
2128 For example, when you match the pattern @samp{((a)*b)*c} against the | |
2129 string @samp{c}, you get: | |
2130 | |
2131 @itemize | |
2132 @item | |
13532 | 2133 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]} |
13531 | 2134 |
2135 @item | |
13532 | 2136 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]} |
13531 | 2137 |
2138 @item | |
13532 | 2139 @math{-1} in @code{@w{@var{regs}->}start[2]} and @math{-1} in @code{@w{@var{regs}->}end[2]} |
13531 | 2140 @end itemize |
2141 | |
2142 @end itemize | |
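The rules above can be checked with a small program. The following is only a sketch: it uses the @sc{posix} function @code{regexec} rather than the @sc{gnu} matching functions, on the assumption (stated in the section on byte offsets) that @code{@var{pmatch}[@var{i}].rm_so} and @code{@var{pmatch}[@var{i}].rm_eo} follow the same rules as @code{@var{regs}->start[@var{i}]} and @code{@var{regs}->end[@var{i}]}. The helper @code{group_offsets} and the limit @code{MAX_GROUPS} are hypothetical names introduced for illustration, and the patterns are compiled with the extended syntax so that @samp{(} and @samp{)} are the group operators.

```c
#include <regex.h>
#include <stddef.h>

/* Arbitrary limit on the number of groups this sketch can report.  */
enum { MAX_GROUPS = 10 };

/* Match PATTERN (extended syntax) against STRING and store the offsets
   recorded for group I in *START and *END.  Returns 0 on a match and a
   nonzero value otherwise.  Hypothetical helper, not part of Regex.  */
int
group_offsets (const char *pattern, const char *string, size_t i,
               long *start, long *end)
{
  regex_t preg;
  regmatch_t pmatch[MAX_GROUPS];
  int status;

  if (i >= MAX_GROUPS)
    return -1;
  status = regcomp (&preg, pattern, REG_EXTENDED);
  if (status != 0)
    return status;

  status = regexec (&preg, string, MAX_GROUPS, pmatch, 0);
  if (status == 0)
    {
      /* Entry 0 describes the whole match; entry I describes group I.  */
      *start = (long) pmatch[i].rm_so;
      *end = (long) pmatch[i].rm_eo;
    }
  regfree (&preg);
  return status;
}
```

For instance, calling @code{group_offsets ("(a)*b", "b", 1, &s, &e)} should leave both @code{s} and @code{e} at @math{-1}, because group 1 does not participate in the match.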
2143 | |
2144 @node Freeing GNU Pattern Buffers |
13531 | 2145 @subsection Freeing GNU Pattern Buffers |
2146 | |
2147 To free any allocated fields of a pattern buffer, you can use the | |
2148 @sc{posix} function described in @ref{Freeing POSIX Pattern Buffers}, | |
2149 since the type @code{regex_t}---the type for @sc{posix} pattern | |
2150 buffers---is equivalent to the type @code{re_pattern_buffer}. After | |
2151 freeing a pattern buffer, you need to again compile a regular expression | |
2152 in it (@pxref{GNU Regular Expression Compiling}) before passing it to | |
2153 a matching or searching function. | |
2154 | |
2155 | |
2156 @node POSIX Regex Functions |
13531 | 2157 @section POSIX Regex Functions |
2158 | |
2159 If you're writing code that has to be @sc{posix} compatible, you'll need | |
2160 to use these functions. Their interfaces are as specified by @sc{posix}, | |
2161 draft 1003.2/D11.2. | |
2162 | |
2163 @menu | |
2164 * POSIX Pattern Buffers:: The regex_t type. | |
2165 * POSIX Regular Expression Compiling:: regcomp () | |
2166 * POSIX Matching:: regexec () | |
2167 * Reporting Errors:: regerror () | |
2168 * Using Byte Offsets:: The regmatch_t type. | |
2169 * Freeing POSIX Pattern Buffers:: regfree () | |
2170 @end menu | |
2171 | |
2172 | |
2173 @node POSIX Pattern Buffers |
13531 | 2174 @subsection POSIX Pattern Buffers |
2175 | |
2176 To compile or match a given regular expression the @sc{posix} way, you | |
2177 must supply a pattern buffer exactly the way you do for @sc{gnu} | |
2178 (@pxref{GNU Pattern Buffers}). @sc{posix} pattern buffers have type | |
2179 @code{regex_t}, which is equivalent to the @sc{gnu} pattern buffer | |
2180 type @code{re_pattern_buffer}. | |
2181 | |
2182 | |
2183 @node POSIX Regular Expression Compiling |
13531 | 2184 @subsection POSIX Regular Expression Compiling |
2185 | |
2186 With @sc{posix}, you can only search for a given regular expression; you | |
2187 can't match it. To do this, you must first compile it in a | |
2188 pattern buffer, using @code{regcomp}. | |
2189 | |
2190 @ignore | |
2191 Before calling @code{regcomp}, you must initialize this pattern buffer | |
2192 as you do for @sc{gnu} (@pxref{GNU Regular Expression Compiling}). See | |
2193 below, however, for how to choose a syntax with which to compile. | |
2194 @end ignore | |
2195 | |
2196 To compile a pattern buffer, use: | |
2197 | |
2198 @findex regcomp | |
2199 @example | |
2200 int | |
2201 regcomp (regex_t *@var{preg}, const char *@var{regex}, int @var{cflags}) | |
2202 @end example | |
2203 | |
2204 @noindent | |
2205 @var{preg} is the initialized pattern buffer's address, @var{regex} is | |
2206 the regular expression's address, and @var{cflags} is the compilation | |
2207 flags, which Regex considers as a collection of bits. Here are the | |
2208 valid bits, as defined in @file{regex.h}: | |
2209 | |
2210 @table @code | |
2211 | |
2212 @item REG_EXTENDED | |
2213 @vindex REG_EXTENDED | |
2214 says to use @sc{posix} Extended Regular Expression syntax; if this bit | |
2215 isn't set, @code{regcomp} uses @sc{posix} Basic Regular Expression syntax. | |
2216 @code{regcomp} sets @var{preg}'s @code{syntax} field accordingly. | |
2217 | |
2218 @item REG_ICASE | |
2219 @vindex REG_ICASE | |
2220 @cindex ignoring case | |
2221 says to ignore case; @code{regcomp} sets @var{preg}'s @code{translate} | |
2222 field to a translate table which ignores case, replacing anything you've | |
2223 put there before. | |
2224 | |
2225 @item REG_NOSUB | |
2226 @vindex REG_NOSUB | |
2227 says to set @var{preg}'s @code{no_sub} field; @pxref{POSIX Matching}, | |
2228 for what this means. | |
2229 | |
2230 @item REG_NEWLINE | |
2231 @vindex REG_NEWLINE | |
2232 says that a: | |
2233 | |
2234 @itemize @bullet | |
2235 | |
2236 @item | |
2237 match-any-character operator (@pxref{Match-any-character | |
2238 Operator}) doesn't match a newline. | |
2239 | |
2240 @item | |
2241 nonmatching list not containing a newline (@pxref{List | |
2242 Operators}) matches a newline. | |
2243 | |
2244 @item | |
2245 match-beginning-of-line operator (@pxref{Match-beginning-of-line | |
2246 Operator}) matches the empty string immediately after a newline, | |
2247 regardless of how @code{REG_NOTBOL} is set (@pxref{POSIX Matching}, for | |
2248 an explanation of @code{REG_NOTBOL}). | |
2249 | |
2250 @item | |
2251 match-end-of-line operator (@pxref{Match-end-of-line | |
2252 Operator}) matches the empty string immediately before a newline, | |
2253 regardless of how @code{REG_NOTEOL} is set (@pxref{POSIX Matching}, | |
2254 for an explanation of @code{REG_NOTEOL}). | |
2255 | |
2256 @end itemize | |
2257 | |
2258 @end table | |
2259 | |
2260 If @code{regcomp} successfully compiles the regular expression, it | |
2261 returns zero and sets @code{*@var{preg}} to the compiled | |
2262 pattern. Except for @code{syntax} (which it sets as explained above), it | |
2263 also sets the same fields the same way as does the @sc{gnu} compiling | |
2264 function (@pxref{GNU Regular Expression Compiling}). | |
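As an illustration of the compilation flags, here is a minimal sketch that compiles a pattern with caller-supplied @var{cflags} and reports whether it matches. The helper name @code{matches_with_flags} is an assumption made for this example; only @code{regcomp}, @code{regexec} and @code{regfree} come from the @sc{posix} interface.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN, compiled with CFLAGS, matches somewhere in TEXT,
   0 if it does not, and -1 if the pattern fails to compile.
   Hypothetical helper, not part of Regex.  */
int
matches_with_flags (const char *pattern, int cflags, const char *text)
{
  regex_t preg;
  int status;

  /* REG_NOSUB is added because only a yes/no answer is wanted.  */
  if (regcomp (&preg, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&preg, text, 0, NULL, 0);
  regfree (&preg);
  return status == 0;
}
```

With this helper, @code{matches_with_flags ("^b", REG_NEWLINE, "a\nb")} should report a match, since @code{REG_NEWLINE} lets the match-beginning-of-line operator match just after the newline, while the same call with @var{cflags} of zero should not.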
2265 | |
2266 If @code{regcomp} can't compile the regular expression, it returns one | |
2267 of the error codes listed here. (Except when noted differently, the | |
2268 syntax in all examples below is basic regular expression syntax.) | |
2269 | |
2270 @table @code | |
2271 | |
2272 @comment repetitions | |
2273 @item REG_BADRPT | |
2274 For example, the consecutive repetition operators @samp{**} in | |
2275 @samp{a**} are invalid. As another example, if the syntax is extended | |
2276 regular expression syntax, then the repetition operator @samp{*} with | |
2277 nothing on which to operate in @samp{*} is invalid. | |
2278 | |
2279 @item REG_BADBR | |
2280 For example, the @var{count} @samp{-1} in @samp{a\@{-1} is invalid. | |
2281 | |
2282 @item REG_EBRACE | |
2283 For example, @samp{a\@{1} is missing a close-interval operator. | |
2284 | |
2285 @comment lists | |
2286 @item REG_EBRACK | |
2287 For example, @samp{[a} is missing a close-list operator. | |
2288 | |
2289 @item REG_ERANGE | |
2290 For example, the range ending point @samp{a} that collates lower than | |
2291 does its starting point @samp{z} in @samp{[z-a]} is invalid. Also, the | |
2292 range with the character class @samp{[:alpha:]} as its starting point in | |
2293 @samp{[[:alpha:]-|]} is invalid. | |
2294 | |
2295 @item REG_ECTYPE | |
2296 For example, the character class name @samp{foo} in @samp{[[:foo:]]} is | |
2297 invalid. | |
2298 | |
2299 @comment groups | |
2300 @item REG_EPAREN | |
2301 For example, @samp{a\)} is missing an open-group operator and @samp{\(a} | |
2302 is missing a close-group operator. | |
2303 | |
2304 @item REG_ESUBREG | |
2305 For example, the back reference @samp{\2} that refers to a nonexistent | |
2306 subexpression in @samp{\(a\)\2} is invalid. | |
2307 | |
2308 @comment unfinished business | |
2309 | |
2310 @item REG_EEND | |
2311 Returned when a regular expression causes no other more specific error. | |
2312 | |
2313 @item REG_EESCAPE | |
2314 For example, the trailing backslash @samp{\} in @samp{a\} is invalid, as is the | |
2315 one in @samp{\}. | |
2316 | |
2317 @comment kitchen sink | |
2318 @item REG_BADPAT | |
2319 For example, in the extended regular expression syntax, the empty group | |
2320 @samp{()} in @samp{a()b} is invalid. | |
2321 | |
2322 @comment internal | |
2323 @item REG_ESIZE | |
2324 Returned when a regular expression needs a pattern buffer larger than | |
2325 65536 bytes. | |
2326 | |
2327 @item REG_ESPACE | |
2328 Returned when a regular expression causes Regex to run out of memory. | |
2329 | |
2330 @end table | |
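To see which code a given pattern produces, one can simply look at the value @code{regcomp} returns. The sketch below assumes a hypothetical helper called @code{compile_status}; the exact code reported for a given invalid pattern may differ between implementations, so treat the table above as a guide rather than a guarantee.

```c
#include <regex.h>

/* Compile PATTERN with CFLAGS and return regcomp's status: zero on
   success, otherwise one of the REG_* error codes.
   Hypothetical helper, not part of Regex.  */
int
compile_status (const char *pattern, int cflags)
{
  regex_t preg;
  int status = regcomp (&preg, pattern, cflags);

  /* Only a successfully compiled buffer is passed to regfree.  */
  if (status == 0)
    regfree (&preg);
  return status;
}
```

For example, @code{compile_status ("[a", 0)} would be expected to return @code{REG_EBRACK}, and @code{compile_status ("\\(a", 0)} to return @code{REG_EPAREN}.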
2331 | |
2332 | |
2333 @node POSIX Matching |
13532 | 2334 @subsection POSIX Matching |
13531 | 2335 |
2336 Matching the @sc{posix} way means trying to match a null-terminated | |
2337 string starting at its first character. Once you've compiled a pattern | |
2338 into a pattern buffer (@pxref{POSIX Regular Expression Compiling}), you | |
2339 can ask the matcher to match that pattern against a string using: | |
2340 | |
2341 @findex regexec | |
2342 @example | |
2343 int | |
13532 | 2344 regexec (const regex_t *@var{preg}, const char *@var{string}, |
13531 | 2345 size_t @var{nmatch}, regmatch_t @var{pmatch}[], int @var{eflags}) |
2346 @end example | |
2347 | |
2348 @noindent | |
2349 @var{preg} is the address of a pattern buffer for a compiled pattern. | |
13532 | 2350 @var{string} is the string you want to match. |
13531 | 2351 |
2352 @xref{Using Byte Offsets}, for an explanation of @var{pmatch}. If you | |
2353 pass zero for @var{nmatch} or you compiled @var{preg} with the | |
2354 compilation flag @code{REG_NOSUB} set, then @code{regexec} will ignore | |
2355 @var{pmatch}; otherwise, you must allocate it to have at least | |
2356 @var{nmatch} elements. @code{regexec} will record up to @var{nmatch} | |
2357 pairs of byte offsets in @var{pmatch}, and set to @math{-1} any unused | |
2358 elements up to @code{@var{pmatch}[@var{nmatch} - 1]}. | |
2359 | |
2360 @var{eflags} specifies @dfn{execution flags}---namely, the two bits | |
2361 @code{REG_NOTBOL} and @code{REG_NOTEOL} (defined in @file{regex.h}). If | |
2362 you set @code{REG_NOTBOL}, then the match-beginning-of-line operator | |
2363 (@pxref{Match-beginning-of-line Operator}) always fails to match. | |
2364 This lets you match against pieces of a line, as you would need to if, | |
2365 say, searching for repeated instances of a given pattern in a line; it | |
2366 would work correctly for patterns both with and without | |
2367 match-beginning-of-line operators. @code{REG_NOTEOL} works analogously | |
2368 for the match-end-of-line operator (@pxref{Match-end-of-line | |
2369 Operator}); it exists for symmetry. | |
2370 | |
2371 @code{regexec} tries to find a match for @var{preg} in @var{string} | |
2372 according to the syntax in @var{preg}'s @code{syntax} field. | |
2373 (@xref{POSIX Regular Expression Compiling}, for how to set it.) The | |
2374 function returns zero if the compiled pattern matches @var{string} and | |
2375 @code{REG_NOMATCH} (defined in @file{regex.h}) if it doesn't. | |
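The use of @code{REG_NOTBOL} for finding repeated instances of a pattern in one line can be sketched as follows. The function name @code{count_matches} is hypothetical, and the way it steps over empty matches is one reasonable choice among several, not something prescribed by @sc{posix}.

```c
#include <regex.h>
#include <stddef.h>

/* Count the non-overlapping matches of the extended regular expression
   PATTERN in LINE.  Returns -1 if the pattern does not compile.
   Hypothetical helper, not part of Regex.  */
int
count_matches (const char *pattern, const char *line)
{
  regex_t preg;
  regmatch_t pmatch[1];
  const char *p = line;
  int count = 0;

  if (regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return -1;

  /* Only the first call starts at the real beginning of the line, so
     every later call passes REG_NOTBOL.  */
  while (regexec (&preg, p, 1, pmatch, p == line ? 0 : REG_NOTBOL) == 0)
    {
      count++;
      if (pmatch[0].rm_eo > 0)
        p += pmatch[0].rm_eo;   /* resume just past the match */
      else if (*p != '\0')
        p++;                    /* empty match at P: step one byte */
      else
        break;                  /* empty match at the end of the string */
    }

  regfree (&preg);
  return count;
}
```

Without @code{REG_NOTBOL}, a pattern such as @samp{^a} would match again at every restart position in @samp{aaa}; with it, only the occurrence at the true beginning of the line is counted.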
2376 | |
2377 @node Reporting Errors |
13531 | 2378 @subsection Reporting Errors |
2379 | |
2380 If either @code{regcomp} or @code{regexec} fails, it returns a nonzero | |
2381 error code, the possibilities for which are defined in @file{regex.h}. | |
2382 @xref{POSIX Regular Expression Compiling}, and @ref{POSIX Matching}, for | |
2383 what these codes mean. To get an error string corresponding to these | |
2384 codes, you can use: | |
2385 | |
2386 @findex regerror | |
2387 @example | |
2388 size_t | |
2389 regerror (int @var{errcode}, | |
2390 const regex_t *@var{preg}, | |
2391 char *@var{errbuf}, | |
2392 size_t @var{errbuf_size}) | |
2393 @end example | |
2394 | |
2395 @noindent | |
2396 @var{errcode} is an error code, @var{preg} is the address of the pattern | |
2397 buffer which provoked the error, @var{errbuf} is the error buffer, and | |
2398 @var{errbuf_size} is @var{errbuf}'s size. | |
2399 | |
2400 @code{regerror} returns the size in bytes of the error string | |
2401 corresponding to @var{errcode} (including its terminating null). If | |
2402 @var{errbuf} and @var{errbuf_size} are nonzero, it also returns in | |
2403 @var{errbuf} the first @math{@var{errbuf_size} - 1} characters of the | |
13532 | 2404 error string, followed by a null. |
13531 | 2405 @var{errbuf_size} must be a nonnegative number less than or equal to the |
2406 size in bytes of @var{errbuf}. | |
2407 | |
2408 You can call @code{regerror} with a null @var{errbuf} and a zero | |
2409 @var{errbuf_size} to determine how large @var{errbuf} need be to | |
2410 accommodate @code{regerror}'s error string. | |
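The two-call idiom just described might look like the sketch below. The helper @code{compile_error_message} is a hypothetical name; it returns a string allocated with @code{malloc} that the caller must free, and it returns a null pointer both when the pattern compiles and when memory runs out, which a real program would probably want to distinguish.

```c
#include <regex.h>
#include <stdlib.h>

/* Try to compile PATTERN with CFLAGS.  On failure, return a newly
   allocated copy of the error string; on success (or if malloc fails),
   return NULL.  Hypothetical helper, not part of Regex.  */
char *
compile_error_message (const char *pattern, int cflags)
{
  regex_t preg;
  int errcode = regcomp (&preg, pattern, cflags);
  size_t size;
  char *errbuf;

  if (errcode == 0)
    {
      regfree (&preg);
      return NULL;
    }

  /* First call: null buffer and zero size, just to learn the size
     needed, which includes the terminating null.  */
  size = regerror (errcode, &preg, NULL, 0);
  errbuf = malloc (size);
  if (errbuf != NULL)
    regerror (errcode, &preg, errbuf, size);   /* second call fills it */
  return errbuf;
}
```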
2411 | |
2412 @node Using Byte Offsets |
13531 | 2413 @subsection Using Byte Offsets |
2414 | |
2415 In @sc{posix}, variables of type @code{regmatch_t} hold information | |
2416 analogous to, but not identical to, @sc{gnu}'s registers (@pxref{Using | |
2417 Registers}). To get information about registers in @sc{posix}, pass to | |
2418 @code{regexec} a nonnull @var{pmatch}, i.e., the address of an array of | |
2419 structures of type @code{regmatch_t}, defined in | |
2420 @file{regex.h}: | |
2421 | |
2422 @tindex regmatch_t | |
2423 @example | |
2424 typedef struct | |
2425 @{ | |
2426 regoff_t rm_so; | |
2427 regoff_t rm_eo; | |
2428 @} regmatch_t; | |
2429 @end example | |
2430 | |
2431 When reading in @ref{Using Registers}, about how the matching function | |
2432 stores the information into the registers, substitute @var{pmatch} for | |
2433 @var{regs}, @code{@var{pmatch}[@var{i}].rm_so} for | |
2434 @code{@w{@var{regs}->}start[@var{i}]} and | |
2435 @code{@var{pmatch}[@var{i}].rm_eo} for | |
2436 @code{@w{@var{regs}->}end[@var{i}]}. | |
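A typical use of the byte offsets is to copy out the text that a group matched. The following sketch assumes a hypothetical helper @code{group_text} and an arbitrary limit @code{MAX_MATCHES}; it compiles with the extended syntax and returns a string allocated with @code{malloc}, or a null pointer if there is no match or the group did not participate.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary limit on the number of pmatch entries in this sketch.  */
enum { MAX_MATCHES = 10 };

/* Return a newly allocated copy of the text matched by group I of
   PATTERN in STRING, or NULL if there is none.
   Hypothetical helper, not part of Regex.  */
char *
group_text (const char *pattern, const char *string, size_t i)
{
  regex_t preg;
  regmatch_t pmatch[MAX_MATCHES];
  char *copy = NULL;

  if (i >= MAX_MATCHES || regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return NULL;

  /* rm_so is -1 when group I took no part in the match.  */
  if (regexec (&preg, string, MAX_MATCHES, pmatch, 0) == 0
      && pmatch[i].rm_so != -1)
    {
      size_t len = (size_t) (pmatch[i].rm_eo - pmatch[i].rm_so);

      copy = malloc (len + 1);
      if (copy != NULL)
        {
          memcpy (copy, string + pmatch[i].rm_so, len);
          copy[len] = '\0';
        }
    }
  regfree (&preg);
  return copy;
}
```

Note that the members are reached with @samp{.}, since each element of @var{pmatch} is a structure rather than a pointer.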
2437 | |
2438 @node Freeing POSIX Pattern Buffers |
13531 | 2439 @subsection Freeing POSIX Pattern Buffers |
2440 | |
2441 To free any allocated fields of a pattern buffer, use: | |
2442 | |
2443 @findex regfree | |
2444 @example | |
13532 | 2445 void |
13531 | 2446 regfree (regex_t *@var{preg}) |
2447 @end example | |
2448 | |
2449 @noindent | |
2450 @var{preg} is the pattern buffer whose allocated fields you want freed. | |
2451 @code{regfree} also sets @var{preg}'s @code{allocated} and @code{used} | |
2452 fields to zero. After freeing a pattern buffer, you need to again | |
2453 compile a regular expression in it (@pxref{POSIX Regular Expression | |
2454 Compiling}) before passing it to the matching function (@pxref{POSIX | |
2455 Matching}). | |
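The compile, match, free cycle can be repeated on the same buffer. As a sketch, the hypothetical function @code{first_matching_pattern} below reuses a single @code{regex_t} for several patterns, recompiling after each call to @code{regfree}.

```c
#include <regex.h>
#include <stddef.h>

/* Return the index of the first of the N extended PATTERNS that matches
   TEXT, or -1 if none does.  Patterns that fail to compile are skipped.
   Hypothetical helper, not part of Regex.  */
int
first_matching_pattern (const char *const *patterns, size_t n,
                        const char *text)
{
  regex_t preg;   /* one buffer, compiled and freed once per pattern */
  size_t k;

  for (k = 0; k < n; k++)
    {
      int found;

      if (regcomp (&preg, patterns[k], REG_EXTENDED | REG_NOSUB) != 0)
        continue;
      found = regexec (&preg, text, 0, NULL, 0) == 0;
      regfree (&preg);   /* preg must be recompiled before its next use */
      if (found)
        return (int) k;
    }
  return -1;
}
```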
2456 | |
2457 | |
2458 @node BSD Regex Functions |
13531 | 2459 @section BSD Regex Functions |
2460 | |
2461 If you're writing code that has to be Berkeley @sc{unix} compatible, | |
2462 you'll need to use these functions, whose interfaces are the same as those | |
13532 | 2463 in Berkeley @sc{unix}. |
13531 | 2464 |
2465 @menu | |
2466 * BSD Regular Expression Compiling:: re_comp () | |
2467 * BSD Searching:: re_exec () | |
2468 @end menu | |
2469 | |
2470 @node BSD Regular Expression Compiling |
13531 | 2471 @subsection BSD Regular Expression Compiling |
2472 | |
2473 With Berkeley @sc{unix}, you can only search for a given regular | |
2474 expression; you can't match one. To search for it, you must first | |
2475 compile it. Before you compile it, you must indicate the regular | |
13532 | 2476 expression syntax you want it compiled according to by setting the |
13531 | 2477 variable @code{re_syntax_options} (declared in @file{regex.h}) to some | |
2478 syntax (@pxref{Regular Expression Syntax}). | |
2479 | |
2480 To compile a regular expression use: | |
2481 | |
2482 @findex re_comp | |
2483 @example | |
2484 char * | |
2485 re_comp (char *@var{regex}) | |
2486 @end example | |
2487 | |
2488 @noindent | |
2489 @var{regex} is the address of a null-terminated regular expression. | |
2490 @code{re_comp} uses an internal pattern buffer, so you can use only the | |
2491 most recently compiled pattern buffer. This means that if you want to | |
2492 use a given regular expression that you've already compiled---but it | |
2493 isn't the latest one you've compiled---you'll have to recompile it. If | |
2494 you call @code{re_comp} with a null pointer (@emph{not} the empty | |
2495 string) as the argument, it doesn't change the contents of the pattern | |
2496 buffer. | |
2497 | |
2498 If @code{re_comp} successfully compiles the regular expression, it | |
2499 returns zero. If it can't compile the regular expression, it returns | |
2500 an error string. @code{re_comp}'s error messages are identical to those | |
2501 of @code{re_compile_pattern} (@pxref{GNU Regular Expression | |
2502 Compiling}). | |
2503 | |
2504 @node BSD Searching |
13532 | 2505 @subsection BSD Searching |
13531 | 2506 |
2507 Searching the Berkeley @sc{unix} way means searching in a string | |
2508 starting at its first character and trying successive positions within | |
2509 it to find a match. Once you've compiled a pattern using @code{re_comp} | |
2510 (@pxref{BSD Regular Expression Compiling}), you can ask Regex | |
2511 to search for that pattern in a string using: | |
2512 | |
2513 @findex re_exec | |
2514 @example | |
2515 int | |
2516 re_exec (char *@var{string}) | |
2517 @end example | |
2518 | |
2519 @noindent | |
2520 @var{string} is the address of the null-terminated string in which you | |
2521 want to search. | |
2522 | |
2523 @code{re_exec} returns either 1 for success or 0 for failure. It | |
2524 automatically uses a @sc{gnu} fastmap (@pxref{Searching with Fastmaps}). |