
# Regex

## Notes

Regex (regular expressions) is a way to search and manipulate strings based on a pattern, for example to find all the email addresses within a text, or to add "-" to all numbers.

### Basic Commands:

- `match = re.search(pattern, text)` – find the first match of `pattern` in `text` (returns `None` if nothing matches)
- `match.span()` – get a tuple of the start/end location of the match (only the first one, since `re.search` stops at the first match)
- `re.findall(pattern, text)` – find all non-overlapping matches and return them as a list
- `match.group()` – get the matched text; pass a number, e.g. `match.group(1)`, to get a specific group

### Basic Regex

#### Quantifiers

- `a|b` – Match either "a" or "b" (alternation)
- `?` – Zero or one
- `+` – One or more
- `*` – Zero or more
- `{N}` – Exactly N number of times (where N is a number)
- `{N,}` – N or more number of times (where N is a number)
- `{N,M}` – Between N and M number of times (where N and M are numbers and N ≤ M)
- `*?` – Zero or more, but lazy: match as few characters as possible

#### Character Sets and Exclusion

- `[A-Z]` – Match any uppercase character from "A" to "Z"
- `[a-z]` – Match any lowercase character from "a" to "z"
- `[0-9]` – Match any digit
- `[asdf]` – Match any character that's either "a", "s", "d", or "f"
- `[^asdf]` – Match any character that's not any of the following: "a", "s", "d", or "f"

#### Identifiers

- `.` – Any character (except a newline, by default)
- `\n` – Newline character
- `\t` – Tab character
- `\s` – Any whitespace character (including `\t`, `\n` and a few others)
- `\S` – Any non-whitespace character
- `\w` – Any word character (uppercase and lowercase Latin alphabet, numbers 0-9, and `_`; in Python 3 string patterns also other Unicode letters and digits unless `re.ASCII` is set)
- `\W` – Any non-word character (the inverse of the `\w` token)
- `\b` – Word boundary: the position between a `\w` and a `\W` character (or the start/end of the string); it matches in-between characters and consumes none
- `\d` – Digit: any 0-9 digit (inverse: `\D`)
- `\B` – Non-word boundary: the inverse of `\b`
- `^` – The start of a line (in Python, the start of the string unless `re.MULTILINE` is set)
- `$` – The end of a line (in Python, the end of the string unless `re.MULTILINE` is set)
- `\\` – The literal character "\"

#### Groups

By adding "()" around a regex component, you can mark that section as a separate group, which means you can retrieve it specifically. For example, let's say that we have email addresses, but we are only interested in the services they use, and not the username. We still want to identify emails, but take only the last section:

```python
import re

my_text = "username@gmail.com"  # placeholder address for illustration
pattern = re.compile(r'[a-zA-Z]+@([a-zA-Z]+)\.com')
match = pattern.match(my_text)
service = match.group(1)  # group(1) is the first "()" group: this will return "gmail"
# match.group(0) would return the whole match instead: "username@gmail.com"
```

## Overview

🔼Topic:: [[NLP (MOC)]]
◀Origin::
🔗Link:: [Testing website](https://regex101.com)
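
## Examples: Basic Commands

The basic commands from the notes can be tried out together in a short script. This is only a minimal sketch: the sample text and variable names are made up for illustration, and `re.sub` is not listed in the notes; it is included as one possible way to do the "add '-' to all numbers" case mentioned in the introduction.

```python
import re

# Hypothetical sample text, made up for this sketch.
text = "Order 12 shipped, order 345 pending."
pattern = r"\d+"  # one or more digits

match = re.search(pattern, text)         # first match only
first_span = match.span()                # (start, end) of that first match
first_text = match.group()               # the matched text itself
all_numbers = re.findall(pattern, text)  # every non-overlapping match

# Assumed approach for "add '-' to all numbers": re.sub with a
# backreference (\1) to the captured number.
dashed = re.sub(r"(\d+)", r"-\1", text)

print(first_span)   # (6, 8)
print(first_text)   # 12
print(all_numbers)  # ['12', '345']
print(dashed)       # Order -12 shipped, order -345 pending.
```

Note that `re.search` returns `None` when nothing matches, so real code would usually check the result before calling `.span()` or `.group()`.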
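
## Examples: Pattern Syntax

A few of the tokens from the cheat sheet are easier to see side by side: greedy `*` versus lazy `*?`, the word boundary `\b`, the negated set `[^...]`, and `^` with and without `re.MULTILINE`. The following is a rough sketch; all sample strings and variable names are hypothetical.

```python
import re

# Greedy vs. lazy: `*` grabs as much as it can, `*?` as little as it can.
html = "<b>one</b> and <b>two</b>"  # hypothetical sample string
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

# Word boundary: only the standalone word "cat" matches,
# not the "cat" inside "concat" or "cats".
words = re.findall(r"\bcat\b", "cat concat cats cat.")

# Negated set: every character that is NOT a, s, d, or f.
others = re.findall(r"[^asdf]", "fads!x")

# `^` matches only at the start of the string by default,
# and at the start of every line with re.MULTILINE.
lines = "first line\nsecond line"
start_default = re.findall(r"^\w+", lines)
start_multiline = re.findall(r"^\w+", lines, re.MULTILINE)

print(greedy)           # ['<b>one</b> and <b>two</b>']
print(lazy)             # ['<b>one</b>', '<b>two</b>']
print(words)            # ['cat', 'cat']
print(others)           # ['!', 'x']
print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
```

The lazy form is usually the safer choice when a pattern has a clear closing delimiter, since the greedy form will run to the last delimiter in the text.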