Regex - Never Complete Only Abandoned

## Character Classes and "Assertions"

The assertions (`\A`, `\b`, `\B`, `\z`, `\Z`) and `\R` cannot be used inside a standard character class definition `[]`; inside one, `\b` means backspace instead. In PCRE the class shorthands (`\d`, `\s`, `\w`, `\h`, `\v` and their negations) remain valid there.

| PCRE | POSIX          | Standard           | Info                                     |
| ---- | -------------- | ------------------ | ---------------------------------------- |
| `\A` |                |                    | Start of Subject                         |
| `\b` |                |                    | Word Boundary                            |
| `\B` |                |                    | Not Word Boundary                        |
| `\d` | `[[:digit:]]`  | `[0-9]`            | Digit                                    |
| `\D` |                | `[^0-9]`           | Not a Digit                              |
| `\h` |                | `[ \t]`            | [[#Horizontal Space Characters]] [^1]    |
| `\H` |                | `[^ \t]`           | Not [[#Horizontal Space Characters]] [^1] |
| `\R` |                | -                  | [[#Newline Sequences]]                   |
| `\s` | `[[:space:]]`  | `[ \t\n\r\f\v]`    | Whitespace [^1]                          |
| `\S` | `[^[:space:]]` | `[^ \t\n\r\f\v]`   | Not Whitespace [^1]                      |
| `\v` |                | `[\n\x0B\f\r]`     | [[#Vertical Space Characters]] [^1]      |
| `\V` |                | `[^\n\x0B\f\r]`    | Not [[#Vertical Space Characters]] [^1]  |
| `\w` | `[[:word:]]`   | `[A-Za-z0-9_]`     | Word Character                           |
| `\W` |                | `[^A-Za-z0-9_]`    | Not Word Character                       |
| `\z` |                |                    | End of Subject                           |
| `\Z` |                |                    | End of Subject, or before a final newline |

[^1]: Some implementations will also match [[Unicode Standard|Unicode]] whitespace, not just the basic ASCII.

### Newline Sequences

```regex
(?>\r\n|\n|\x0b|\f|\r|\x85)
```

### Horizontal Space Characters

```
U+0009 Horizontal tab (HT)
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
```

### Vertical Space Characters

```
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator
```

## Character Escapes

If a backslash precedes a character that has no defined escape, many regex implementations ignore the backslash and match the character literally; stricter ones (Python's `re`, for example) reject an unknown letter escape as an error.
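
How an undefined escape is treated depends on the engine, so it is worth checking rather than assuming. The following is a minimal sketch using Python's `re` module, which is on the strict side; the helper name `compiles` is hypothetical and only exists for this illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A backslash before punctuation is accepted and matches the character itself.
print(compiles(r"\-"))                        # True
print(re.fullmatch(r"\-", "-") is not None)   # True

# A backslash before an ASCII letter with no defined meaning is rejected.
print(compiles(r"\q"))                        # False
```

Other engines may silently treat `\q` as a literal `q`, so a pattern that relies on that behaviour is not portable.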
See also: [[ASCII#Tables]]

| Escape | Name | Hex  |     | Info                                              |
| ------ | ---- | ---- | --- | ------------------------------------------------- |
| `\0`   | NUL  | 0x00 |     | Null                                              |
| `\a`   | BEL  | 0x07 |     | Bell                                              |
| `\b`   | BS   | 0x08 |     | Backspace (only inside `[]`)                      |
| `\e`   | ESC  | 0x1B |     | Escape                                            |
| `\f`   | FF   | 0x0C |     | Form Feed                                         |
| `\n`   | LF   | 0x0A |     | Line Feed                                         |
| `\r`   | CR   | 0x0D |     | Carriage Return                                   |
| `\t`   | HT   | 0x09 |     | Horizontal Tab                                    |
| `\v`   | VT   | 0x0B |     | Vertical Tab (PCRE treats `\v` as any vertical space) |
| `\cZ`  | SUB  | 0x1A |     | Substitute                                        |

```
| EOT | 0x04 | | End of Transmission |
| ---- | ---- | --- | ------------------- |
```

# Tips

## IPv4 Addresses

```regex
^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}$
```

The `\b` after the optional dot forces each octet to end at a word boundary, so a run of digits such as `1234` is not accepted as four one-digit octets.

## Match First Line of Document

```regex
^(?<!\n)
^(?<![\w\W])
^(?<![\s\S\r])
^(?<![\s\S\r])//\n
```

- `^` - start of a line
- `//` - a `//` substring
- `\n` - a line break
- `([\s\S\r]*)` - if appended, Group 1 (`$1`): any 0 or more chars, as many as possible, up to the file end.

The `^(?<![\s\S\r])//\n` regex means:

- `^(?<![\s\S\r])` - match the start of the first line only: `^` matches the start of any line, and the `(?<![\s\S\r])` negative lookbehind fails the match if there is any character immediately to the left of the current location
- `//\n` - `//` and a line break.
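
Both tips above can be tried out quickly. The sketch below uses Python's `re` module as one possible engine; the names `IPV4`, `FIRST_LINE`, `is_ipv4` and `first_line_hits` are hypothetical. The IPv4 pattern is written here with `^`, `$` and a `\b` after the optional dot so that it validates a whole string; treat that anchoring as this sketch's choice.

```python
import re

# IPv4 pattern, anchored so the whole string must be an address.
IPV4 = re.compile(r"^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}$")

# First-line pattern. re.MULTILINE lets ^ match at every line start, so it is
# the negative lookbehind that restricts the match to the start of the text.
FIRST_LINE = re.compile(r"^(?<![\s\S\r])//\n", re.MULTILINE)


def is_ipv4(text: str) -> bool:
    """Return True if text is a dotted-quad IPv4 address."""
    return IPV4.match(text) is not None


def first_line_hits(text: str) -> list[int]:
    """Return the offsets where the first-line pattern matches."""
    return [m.start() for m in FIRST_LINE.finditer(text)]


sample = "//\ncode\n//\nmore\n"

print(is_ipv4("192.168.0.1"))    # True
print(is_ipv4("256.1.1.1"))      # False
print(is_ipv4("1234"))           # False
print(first_line_hits(sample))   # [0]

# Without the lookbehind, the same text matches at both "//" lines.
print(len(re.findall(r"^//\n", sample, re.MULTILINE)))  # 2
```

Lookbehind support and the meaning of `^` differ between engines and editors, so the first-line pattern should be re-checked in whatever tool it is actually used in.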

# Resources

## Analysis

- https://ihateregex.io
    - Breaks regexes down into flowcharts
    - Has a large library of example regexes
- https://regexr.com/
    - Explains regexes token by token
    - Has a solid collection of reference information
    - Supports multiple implementations
- https://regex101.com
    - Also has explanations, but less clear than Regexr
    - Breaks out each match and capture group separately

# References

- https://pcre.org/original/doc/html/pcrepattern.html

## Application Specific

- https://stackoverflow.com/questions/60090875/match-the-beginning-of-a-file-in-a-vscode-regex-search

## Performance

- https://reidy-p.github.io/regex-performance/
- https://www.regular-expressions.info/catastrophic.html

## Libraries

- https://devopedia.org/regex-engines
- https://github.com/google/re2

## Regexes

- https://stackoverflow.com/questions/5284147/validating-ipv4-addresses-with-regexp