Is the Emacs 23.2 elisp doc on regex and multibyte chars still correct?



In the elisp doc for Emacs 23.2, the section on regex
has a part that talks about multibyte chars.

Is that info still correct?


This is edition 3.0 of the GNU Emacs Lisp Reference Manual,
corresponding to Emacs version 23.2.

(elisp) Regexp Special

The beginning and end of a range of multibyte characters must be in
the same character set (*note Character Sets::). Thus,
`"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with
grave accent) is in the Emacs character set for Latin-1 but the
character 0x97c (`u' with diaeresis) is in the Emacs character set
for Latin-2. (We use Lisp string syntax to write that example,
and a few others in the next few paragraphs, in order to include
hex escape sequences in them.)

If a range starts with a unibyte character C and ends with a
multibyte character C2, the range is divided into two parts: one
is `C..?\377', the other is `C1..C2', where C1 is the first
character of the charset to which C2 belongs.

You cannot always match all non-ASCII characters with the regular
expression `"[\200-\377]"'. This works when searching a unibyte
buffer or string (*note Text Representations::), but not in a
multibyte buffer or string, because many non-ASCII characters have
codes above octal 0377. However, the regular expression
`"[^\000-\177]"' does match all non-ASCII characters (see below
regarding `^'), in both multibyte and unibyte representations,
because only the ASCII characters are excluded.
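
One way to check the last claim is to evaluate it directly. The following is only a sketch: the helper name `my-first-non-ascii' and the sample string are made up for illustration and are not from the manual. It looks for the first non-ASCII character in a multibyte string using the negated range quoted above.

```elisp
;; Hypothetical helper, not part of Emacs: return the 0-based index of the
;; first non-ASCII character in STRING, or nil if there is none.
(defun my-first-non-ascii (string)
  ;; "[^\000-\177]" excludes only the ASCII range, as the manual describes.
  (string-match "[^\000-\177]" string))

;; Sample multibyte string (an assumption for this test):
;; "abc" followed by U+00E9 (e with acute) and U+4E2D (a CJK character).
(my-first-non-ascii "abc\u00e9\u4e2d")   ; => 3
(my-first-non-ascii "abc")               ; => nil
```

The same helper can be pointed at `"[\200-\377]"' to see how the current Emacs treats that pattern on multibyte text. No result is given for that case here, since it is exactly the behavior in question.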

A character alternative can also specify named character classes
(*note Char Classes::). This is a POSIX feature whose syntax is
`[:CLASS:]'. Using a character class is equivalent to mentioning
each of the characters in that class; but the latter is not
feasible in practice, since some classes include thousands of
different characters.
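
As a rough illustration of the character class syntax, here is a sketch that uses the `[:nonascii:]' and `[:alpha:]' classes inside a character alternative. The function name `my-class-positions' and the sample string are assumptions made for this example only.

```elisp
;; Hypothetical helper, not part of Emacs: return a list of two values,
;; the index of the first non-ASCII character in STRING and the index of
;; the first alphabetic character (nil where there is no match).
(defun my-class-positions (string)
  (list (string-match "[[:nonascii:]]" string)
        (string-match "[[:alpha:]]" string)))

;; "abc" followed by U+00E9 and U+4E2D, the same sample as a multibyte string.
(my-class-positions "abc\u00e9\u4e2d")   ; => (3 0)
```

Note the doubled brackets: the class name `[:nonascii:]' has to appear inside the outer `[...]' of a character alternative.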


Xah ∑ http://xahlee.org/ ☄
