phy-regex-audit

GitHub author: PHY041

Static ReDoS (Regular Expression Denial of Service) vulnerability scanner and regex quality auditor for codebases. Walks all source files to extract regex literals, detects catastrophic backtracking patterns (nested quantifiers, overlapping alternation, unbounded repetition on complex groups), severity-ranks each finding as CRITICAL/HIGH/MEDIUM, reports file and line number with the dangerous sub-pattern highlighted, identifies high-risk call sites (HTTP request handlers, form validators, URL parsers), and suggests safe rewrites using atomic groups or simplified alternatives. Also detects hardcoded locale assumptions (character classes assuming ASCII), overly permissive patterns, and regexes missing anchors. Supports JS/TS, Python, Go, Java, Ruby, PHP, Rust. Zero external API — pure static analysis. Triggers on "regex security", "ReDoS", "catastrophic backtracking", "regex audit", "slow regex", "regex vulnerability", "/regex-audit".

Install / Download

TotalClaw CLI (recommended)
totalclaw install github:LeoYeAI~openclaw-master-skills~phy-regex-audit
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/github%3ALeoYeAI~openclaw-master-skills~phy-regex-audit/file -o phy-regex-audit.md
# ReDoS & Regex Quality Auditor

One regex. One crafted input. Your Node.js server hangs for 30 seconds.

ReDoS (Regular Expression Denial of Service) is real, underestimated, and embarrassingly fixable. This skill walks every source file in your project, extracts regex literals, identifies catastrophic backtracking patterns, and tells you exactly which ones are dangerous and how to fix them.

**Supports JS/TS, Python, Go, Java, Ruby, PHP, Rust. Zero external API.**
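To make the failure mode concrete, here is a minimal sketch that times Python's backtracking `re` engine on the classic `^(a+)+$` pattern. The helper name `time_match` and the input sizes are illustrative choices, not part of the skill. Exact timings depend on the machine, but each extra `a` should roughly double the work.

```python
import re
import time

# Classic nested-quantifier pattern; Python's `re` is a backtracking engine.
EVIL = re.compile(r'^(a+)+$')


def time_match(n: int) -> float:
    """Return seconds spent failing to match n 'a' characters followed by 'X'."""
    payload = 'a' * n + 'X'          # the trailing 'X' forces the match to fail
    start = time.perf_counter()
    result = EVIL.match(payload)
    elapsed = time.perf_counter() - start
    assert result is None            # this input can never match
    return elapsed


if __name__ == '__main__':
    # Sizes kept small on purpose; larger values quickly become impractical.
    for n in (14, 16, 18, 20):
        print(f'n={n:2d}  {time_match(n):.4f}s')
```

The input sizes are deliberately small. Raising `n` by even a handful of characters is usually enough to turn milliseconds into seconds.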

---

## Trigger Phrases

- "regex security", "ReDoS", "catastrophic backtracking"
- "regex audit", "slow regex", "regex vulnerability"
- "check my regexes", "regex denial of service"
- "is this regex safe", "regex performance"
- "hardcoded locale regex", "missing anchor"
- "/regex-audit"

---

## How to Provide Input

```bash
# Option 1: Audit current directory (auto-detect all source files)
/regex-audit

# Option 2: Specific directory or file
/regex-audit src/
/regex-audit lib/validators.js

# Option 3: Focus on specific language
/regex-audit --lang js
/regex-audit --lang python

# Option 4: Show only CRITICAL and HIGH severity
/regex-audit --min-severity high

# Option 5: Check a single regex pattern for safety
/regex-audit --pattern "^(a+)+"

# Option 6: Focus on HTTP handler files (highest risk)
/regex-audit --high-risk-only

# Option 7: Output machine-readable JSON for CI
/regex-audit --json
```

---

## Step 1: Discover Source Files

```bash
python3 -c "
import glob
from pathlib import Path

# Language file patterns
patterns = {
    'JavaScript/TypeScript': ['**/*.js', '**/*.ts', '**/*.jsx', '**/*.tsx', '**/*.mjs'],
    'Python':    ['**/*.py'],
    'Go':        ['**/*.go'],
    'Java':      ['**/*.java'],
    'Ruby':      ['**/*.rb'],
    'PHP':       ['**/*.php'],
    'Rust':      ['**/*.rs'],
}

skip_dirs = {'node_modules', '.git', 'dist', 'build', '.next', 'vendor', '__pycache__', '.venv', 'venv'}

all_files = []
for lang, file_patterns in patterns.items():
    lang_files = []
    for p in file_patterns:
        for f in glob.glob(p, recursive=True):
            parts = set(Path(f).parts)
            if not parts & skip_dirs:
                lang_files.append(f)
    if lang_files:
        print(f'{lang}: {len(lang_files)} files')
        all_files.extend(lang_files)

print(f'\\nTotal: {len(all_files)} source files to scan')
"
```
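The one-liner above filters skipped directories only after `glob` has already descended into them, which can be slow in a project with a large `node_modules`. A possible alternative, sketched below, prunes those directories during the walk. The function name `discover_sources` and the `EXT_LANG` table are assumptions for illustration, built from the same extensions and skip list as above.

```python
import os
from collections import defaultdict

# Extension -> language label (same extensions as the glob patterns above)
EXT_LANG = {
    '.js': 'JavaScript/TypeScript', '.ts': 'JavaScript/TypeScript',
    '.jsx': 'JavaScript/TypeScript', '.tsx': 'JavaScript/TypeScript',
    '.mjs': 'JavaScript/TypeScript',
    '.py': 'Python', '.go': 'Go', '.java': 'Java',
    '.rb': 'Ruby', '.php': 'PHP', '.rs': 'Rust',
}

SKIP_DIRS = {'node_modules', '.git', 'dist', 'build', '.next',
             'vendor', '__pycache__', '.venv', 'venv'}


def discover_sources(root: str = '.') -> dict[str, list[str]]:
    """Group source files under root by language, pruning skipped directories."""
    found = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            lang = EXT_LANG.get(os.path.splitext(name)[1].lower())
            if lang:
                found[lang].append(os.path.join(dirpath, name))
    return dict(found)


if __name__ == '__main__':
    by_lang = discover_sources('.')
    for lang, files in sorted(by_lang.items()):
        print(f'{lang}: {len(files)} files')
    print(f'Total: {sum(len(f) for f in by_lang.values())} source files to scan')
```

Unlike `glob`, `os.walk` also visits hidden directories other than the ones listed in `SKIP_DIRS`, so the counts can differ slightly from the one-liner.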

---

## Step 2: Extract Regex Literals

```python
import re
from pathlib import Path
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegexMatch:
    file: str
    line: int
    pattern: str
    raw_context: str      # the surrounding code line
    language: str
    in_handler: bool      # is this in an HTTP handler / validator?

# Language-specific regex extraction patterns
EXTRACTORS = {
    'js': [
        # Regex literals: /pattern/flags
        re.compile(r'(?<![=!<>])\/([^\/\n\r]{3,}?)\/([gimsuy]*)'),
        # new RegExp("pattern")
        re.compile(r'new\s+RegExp\(["\']([^"\']{3,})["\']'),
        # .test(), .match(), .exec(), .replace(), .search() with a string or regex literal
        re.compile(r'\.(?:test|match|exec|replace|search)\(["\'/]([^"\'\/\n]{3,})["\'/]'),
    ],
    'python': [
        # re.compile("pattern") and re.compile(r"pattern"): plain and raw string forms
        re.compile(r're\.(?:compile|match|search|fullmatch|findall|finditer|sub|subn|split)\(["\']([^"\']{3,})["\']'),
        re.compile(r're\.(?:compile|match|search|fullmatch|findall|finditer|sub|subn|split)\(r["\']([^"\']{3,})["\']'),
    ],
    'go': [
        # regexp.MustCompile(`pattern`)
        re.compile(r'regexp\.(?:MustCompile|Compile|Match|MatchString)\(["`]([^"`]{3,})["`]'),
    ],
    'java': [
        # Pattern.compile("pattern")
        re.compile(r'Pattern\.compile\(["\']([^"\']{3,})["\']'),
        re.compile(r'\.matches\(["\']([^"\']{3,})["\']'),
    ],
    'ruby': [
        # /pattern/ or Regexp.new("pattern")
        re.compile(r'\/([^\/\n]{3,})\/'),
        re.compile(r'Regexp\.new\(["\']([^"\']{3,})["\']'),
    ],
    'php': [
        # preg_match('/pattern/', ...)
        re.compile(r'preg_(?:match|replace|split|grep)\(["\']([^"\']{3,})["\']'),
    ],
    'rust': [
        # Regex::new(r"pattern")
        re.compile(r'Regex::new\([r]?["\']([^"\']{3,})["\']'),
    ],
}

# Keywords that indicate high-risk call sites
HIGH_RISK_CONTEXTS = [
    'app.get', 'app.post', 'app.put', 'app.delete',
    'router.', 'express', 'fastify', 'koa',
    'validate', 'validator', 'sanitize',
    'request.body', 'req.body', 'req.params', 'req.query',
    'process.argv', 'sys.argv',
    'input(', 'readline(',
    'url.parse', 'new URL(',
    '@app.route', 'flask.request',
    'r.URL.Query', 'r.FormValue',
]

LANG_MAP = {
    '.js': 'js', '.jsx': 'js', '.ts': 'js', '.tsx': 'js', '.mjs': 'js',
    '.py': 'python',
    '.go': 'go',
    '.java': 'java',
    '.rb': 'ruby',
    '.php': 'php',
    '.rs': 'rust',
}


def extract_regexes_from_file(fpath: str) -> list[RegexMatch]:
    """Extract all regex literals from a source file."""
    ext = Path(fpath).suffix.lower()
    lang = LANG_MAP.get(ext)
    if not lang or lang not in EXTRACTORS:
        return []

    try:
        lines = Path(fpath).read_text(encoding='utf-8', errors='replace').splitlines()
    except Exception:
        return []

    results = []
    for line_num, line in enumerate(lines, 1):
        # Check if this line is in a high-risk context: 5 lines before and after.
        # line_num is 1-based, so the current line sits at index line_num - 1.
        idx = line_num - 1
        context_window = '\n'.join(lines[max(0, idx - 5):idx + 6])
        in_handler = any(kw in context_window for kw in HIGH_RISK_CONTEXTS)

        for extractor in EXTRACTORS[lang]:
            for m in extractor.finditer(line):
                pattern = m.group(1)
                if len(pattern) >= 3:
                    results.append(RegexMatch(
                        file=fpath,
                        line=line_num,
                        pattern=pattern,
                        raw_context=line.strip(),
                        language=lang,
                        in_handler=in_handler,
                    ))

    return results
```
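As a quick sanity check of the extraction approach, the sketch below applies two of the extractors above to made-up sample lines. It also shows a limitation worth knowing: because the capture group stops at the first quote character, a pattern that itself contains a quote is truncated. The sample lines and the helper `first_capture` are illustrative only.

```python
import re

# Copies of two extractors from the EXTRACTORS table above
JS_NEW_REGEXP = re.compile(r'new\s+RegExp\(["\']([^"\']{3,})["\']')
PY_RAW_STRING = re.compile(
    r're\.(?:compile|match|search|fullmatch|findall|finditer|sub|subn|split)'
    r'\(r["\']([^"\']{3,})["\']'
)


def first_capture(extractor: re.Pattern, line: str):
    """Return the first captured regex body on the line, or None."""
    m = extractor.search(line)
    return m.group(1) if m else None


js_line = 'const re = new RegExp("^(a+)+$");'
py_line = r'EMAIL = re.compile(r"^[\w.]+@[\w.]+$")'
quoted = 'GREETING = re.compile(r"say \'hi\' twice")'

print(first_capture(JS_NEW_REGEXP, js_line))   # ^(a+)+$
print(first_capture(PY_RAW_STRING, py_line))   # ^[\w.]+@[\w.]+$
print(repr(first_capture(PY_RAW_STRING, quoted)))  # 'say ' (cut at the embedded quote)
```

Truncated captures like the last one can hide a dangerous sub-pattern from the detectors in the next step, so findings from lines containing mixed quotes deserve a manual look.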

---

## Step 3: Detect ReDoS Patterns

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = 4   # Exponential backtracking — proven DoS vector
    HIGH = 3       # Polynomial backtracking — slow on long inputs
    MEDIUM = 2     # Potentially slow — context-dependent
    LOW = 1        # Style/quality issue

@dataclass
class ReDoSFinding:
    regex_match: RegexMatch
    severity: Severity
    vulnerability_type: str
    dangerous_subpattern: str
    description: str
    attack_input_example: str
    fix_suggestion: str


# ReDoS pattern signatures
REDOS_PATTERNS = [

    # ===== CRITICAL: Exponential Backtracking =====

    {
        'name': 'NESTED_QUANTIFIERS',
        'severity': Severity.CRITICAL,
        'detector': re.compile(r'\(([^()]{1,30}\+[^()]{0,10})\)\+|\(([^()]{1,30}\*[^()]{0,10})\)\+'),
        'description': 'Nested quantifiers (a+)+ or (a*)+ create exponential backtracking.',
        'attack_shape': 'Long string of matching chars followed by one non-matching char',
        'example_attack': '"aaaaaaaaaaaaaaaaaaaaaaaaaX"',
        'fix': 'Rewrite to remove the nesting, or use an atomic group (?>...) or possessive quantifier where the engine supports them.',
    },
    {
        'name': 'NESTED_STAR_PLUS',
        'severity': Severity.CRITICAL,
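        # Caveat: in Java, PCRE (PHP), Ruby, and Python 3.11+, `(x)*+` is
        # possessive-quantifier syntax, which prevents backtracking. Matches of
        # the first alternative in those languages need manual review.
        'detector': re.compile(r'\(([^()]{1,30})\)\*\+|\(([^()]{1,30})\)\+\*'),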
        'description': 'Nested star/plus combination enables exponential path explosion.',
        'attack_shape': 'Repeated matching chars followed by failure',
        'example_attack': '"aaaaaaaaaaaX"',
        'fix': 'Flatten quantifiers or use possessive quantifiers.',
    },

    # ===== HIGH: Polynomial Backtracking =====

    {
        'name': 'ALTERNATION_OVERLAP',
        'severity': Severity.HIGH,
        'detector': re.compile(r'\(([a-zA-Z]{1,5})\|([a-zA-Z]{1,5})\)\+|\(([a-zA-Z]{1,5})\|([a-zA-Z]{1,5})\)\*'),
        'description': 'Overlapping alternation with quantifier: (ab|a)+ causes polynomial backtracking.',