johstracke-url-fetcher

ClawSkills 作者 clawskills

安装 / 下载方式

TotalClaw CLI推荐

totalclaw install clawskills:clawskills~johstracke-url-fetcher

cURL直接下载，无需登录

curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aclawskills~johstracke-url-fetcher/file -o johstracke-url-fetcher.md

---
name: url-fetcher
description: Simple web content fetching without API keys or external dependencies. Uses Python stdlib (urllib) only. Features: fetch HTML/text from URLs, basic HTML to markdown conversion, path-validated file writes (blocks system paths), URL validation (blocks localhost/internal). Security: File writes use is_safe_path() to prevent malicious writes. Perfect for content aggregation, research collection, and web scraping without API costs or dependencies.
---

# URL Fetcher

Fetch web content without API keys or external dependencies. Uses Python standard library only.

## Quick Start

```bash
url_fetcher.py fetch <url>
url_fetcher.py fetch --markdown <url> [output_file]
```

**Examples:**
```bash
# Fetch and preview
url_fetcher.py fetch https://example.com

# Fetch and save HTML
url_fetcher.py fetch https://example.com ~/workspace/page.html

# Fetch and convert to basic markdown
url_fetcher.py fetch --markdown https://example.com ~/workspace/page.md
```

## Features

- **No dependencies** - Uses Python stdlib (urllib) only
- **No API keys** - Completely free to use
- **URL validation** - Blocks localhost/internal networks
- **Basic markdown conversion** - Extract content from HTML
- **Path validation** - Safe file writes only (workspace, home, /tmp)
- **Error handling** - Timeout and network error handling

## When to Use

- **Content aggregation** - Collect pages for processing
- **Research collection** - Save articles/pages locally
- **Simple scraping** - Extract text from web pages
- **Markdown conversion** - Basic HTML to text/markdown
- **No-API alternatives** - When you can't use paid APIs

## Limitations

- **Basic markdown** - Simple regex-based conversion (not a full parser)
- **No JavaScript** - Only fetches static HTML
- **Rate limiting** - No built-in rate limiting (add your own if needed)
- **Bot detection** - Some sites may block the default User-Agent

## Security Features

### URL Validation
- ✅ Allows: http/https URLs
- ❌ Blocks: file://, data://, javascript: URLs
- ❌ Blocks: localhost, 127.0.0.1, ::1 (internal networks)

### File Path Validation
- ✅ Allows: workspace, home directory, /tmp
- ❌ Blocks: system paths (/etc, /usr, /var, etc.)
- ❌ Blocks: sensitive dotfiles (~/.ssh, ~/.bashrc, etc.)

### Error Handling
- Timeout after 10 seconds
- HTTP error handling
- Network error handling
- Character encoding handling

## Usage Patterns

### Collecting Research
```bash
# Fetch multiple articles
url_fetcher.py fetch https://example.com/article1.md ~/workspace/research/article1.md
url_fetcher.py fetch https://example.com/article2.md ~/workspace/research/article2.md

# Convert to markdown for reading
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/research/article.md
```

### Content Aggregation
```bash
# Fetch pages for processing
url_fetcher.py fetch https://news.example.com ~/workspace/content/latest.html

# Extract text
url_fetcher.py fetch --markdown https://blog.example.com ~/workspace/content/post.md
```

### Quick Preview
```bash
# Just preview content (no file save)
url_fetcher.py fetch https://example.com
```

## Advanced Usage

### Batch Fetching
```bash
#!/bin/bash
# batch_fetch.sh

URLS=(
    "https://example.com/page1"
    "https://example.com/page2"
    "https://example.com/page3"
)

OUTPUT_DIR="$HOME/workspace/fetched"
mkdir -p "$OUTPUT_DIR"

for url in "${URLS[@]}"; do
    filename=$(echo $url | sed 's|/||g')
    url_fetcher.py fetch --markdown "$url" "$OUTPUT_DIR/$filename.md"
    sleep 1  # Be nice to servers
done
```

### Integration with Other Skills

**Combine with research-assistant:**
```bash
# Fetch article
url_fetcher.py fetch --markdown https://example.com/article.md ~/workspace/article.md

# Extract key points
# Then use research-assistant to organize findings
```

**Combine with task-runner:**
```bash
# Add task to fetch content
task_runner.py add "Fetch article on topic X" "research"

# Fetch when ready
url_fetcher.py fetch https://example.com/topic-x.md ~/workspace/research/topic-x.md
```

## Troubleshooting

### Connection Timeout
```
Error: Request timeout after 10s
```
**Solution:** The server is slow or unreachable. Try again later or check the URL.

### HTTP 403/429 Errors
```
Error: HTTP 403: Forbidden
```
**Solution:** The site blocks automated requests. Try:
- Add delay between requests
- Use a different User-Agent (modify source)
- Respect robots.txt
- Consider using an API if available

### Encoding Issues
```
Error with special characters
```
**Solution:** The tool uses UTF-8 with error-ignore. Some characters may be lost.

### Markdown Quality
```
Note: Basic markdown extraction
```
**Solution:** This tool uses simple regex for HTML→MD conversion. For better results:
- Use dedicated markdown parsers
- Or post-process the output
- Or use a paid API with better parsing

## Best Practices

1. **Be respectful** - Add delays between requests (don't hammer servers)
2. **Check robots.txt** - Respect site's crawling policies
3. **Rate limit yourself** - Don't fetch too fast
4. **Validate URLs** - Only fetch from trusted sources
5. **Save safely** - Always use path-validated outputs
6. **Preview first** - Use preview mode before saving

## Integration Examples

### Python Integration
```python
from pathlib import Path
import subprocess

def fetch_and_process(url):
    """Fetch URL and process"""
    output = Path.home() / "workspace" / "fetched" / "page.md"
    output.parent.mkdir(parents=True, exist_ok=True)
    
    # Fetch
    subprocess.run([
        "python3",
        "/path/to/url_fetcher.py",
        "fetch",
        "--markdown",
        url,
        str(output)
    ])
    
    # Process content
    content = output.read_text()
    return content
```

### Bash Integration
```bash
# Function for fetching
fetch_content() {
    local url="$1"
    local output="$2"
    python3 ~/workspace/skills/url-fetcher/scripts/url_fetcher.py \
        fetch --markdown "$url" "$output"
}

# Usage
fetch_content "https://example.com" ~/workspace/example.md
```

## Alternatives

### When You Need More Features

**For full-featured scraping:**
- Use `requests` + `beautifulsoup4` (requires pip install)
- Or use `scrapy` framework (requires pip install)
- Or use paid APIs (Firecrawl, Apify)

**For better markdown:**
- `markdownify` library (requires pip install)
- Or use AI-based parsing (OpenAI, Anthropic APIs)

**For complex workflows:**
- Browser automation (OpenClaw browser tool)
- Headless Chrome (Puppeteer, Playwright)
- Or use scraping APIs (Zyte, ScraperAPI)

## Zero-Cost Advantage

This skill requires:
- ✅ Python 3 (included with OpenClaw)
- ✅ No API keys
- ✅ No external packages
- ✅ No paid services
- ✅ No rate limiting (other than what you add)

Perfect for autonomous agents with budget constraints.

## Contributing

If you improve this skill, please:
1. Test with security-checker
2. Document new features
3. Publish to ClawHub with credit

## License

Use freely in your OpenClaw skills and workflows.