Deeplake

ClawSkills, author: kaghni, v1.0.1

SDK for ingesting data into Deeplake managed tables. Use when users want to store, ingest, or query data in Deeplake.


Installation / Download

TotalClaw CLI (recommended)
totalclaw install clawskills:kaghni~deeplake-skills
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Akaghni~deeplake-skills/file -o deeplake-skills.md
Git repository (get the source)
git clone https://github.com/openclaw/skills/commit/63c066c593097085051b4cd86e5fc09a5613f7fb
# Deeplake Managed Service SDK

> Agent-friendly SDK for ingesting data into Deeplake managed tables.
> Use this skill when users want to store, ingest, or query data in Deeplake.
> Available in both **Python** and **Node.js/TypeScript**.

---

## Quick Reference

### Python

```bash
pip install deeplake  # or: uv add deeplake
```

**Python import (primary):**
```python
from deeplake import Client

# Async variant (requires aiohttp: pip install aiohttp):
from deeplake.managed import AsyncClient
```


```python
from deeplake import Client

# Initialize: token is read from the DEEPLAKE_API_KEY env var, workspace defaults to "default"
client = Client()
# Or pass the token and workspace explicitly
client = Client(token="dl_xxx", workspace_id="my-workspace")

# Ingest files (FILE schema)
client.ingest("videos", {"path": ["video1.mp4", "video2.mp4"]}, schema={"path": "FILE"})

# Ingest structured data with indexes for search
client.ingest("embeddings", {
    "text": ["doc1", "doc2", "doc3"],
    "embedding": [[0.1, 0.2, ...], [0.3, 0.4, ...], [0.5, 0.6, ...]],
}, index=["embedding", "text"])

# Ingest from HuggingFace
client.ingest("cifar", {"_huggingface": "cifar10"})

# Ingest with format object (see formats.md for CocoPanoptic, Coco, LeRobot, custom)
client.ingest("table", format=my_format)

# Fluent query
results = client.table("videos").select("id", "text").where("file_id = $1", "abc").limit(10)()

# Raw SQL
results = client.query("SELECT * FROM videos LIMIT 10")

# Vector similarity search
results = client.query("""
    SELECT id, text, embedding <#> $1 AS similarity
    FROM embeddings ORDER BY similarity DESC LIMIT 10
""", (query_embedding,))

# Table management
client.list_tables()
client.drop_table("old_table")
client.create_index("embeddings", "embedding")
```
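
The structured-data call above takes column-oriented data (`{column: [values]}`), while application data often arrives as a list of row records. The following is a minimal sketch of the pivot, not part of the SDK: `rows_to_columns` is a hypothetical helper name, and it assumes every record carries the same keys.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row records into the {column: [values]} layout (hypothetical helper)."""
    if not rows:
        return {}
    columns = list(rows[0].keys())
    for i, row in enumerate(rows):
        # Reject ragged input early so every column ends up the same length.
        if set(row) != set(columns):
            raise ValueError(f"row {i} has keys {sorted(row)}, expected {sorted(columns)}")
    return {col: [row[col] for row in rows] for col in columns}


records = [
    {"text": "doc1", "embedding": [0.1, 0.2]},
    {"text": "doc2", "embedding": [0.3, 0.4]},
]
data = rows_to_columns(records)
# The result could then be passed on, e.g.:
# client.ingest("embeddings", data, index=["embedding", "text"])
```

Validating the keys before pivoting keeps all columns the same length, which a column-oriented payload implicitly requires.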

### Node.js / TypeScript

```bash
npm install deeplake
```

**TypeScript import:**
```typescript
import { ManagedClient, initializeWasm } from 'deeplake';
```

**WASM initialization (required before any operations):**
```typescript
await initializeWasm();
```
Call `initializeWasm()` once at startup before any `ManagedClient` operations (ingest, query, etc.). It initializes the underlying WASM module.


```typescript
import { ManagedClient, initializeWasm } from 'deeplake';

await initializeWasm();

const client = new ManagedClient({ token: 'dl_xxx', workspaceId: 'my-workspace' });

// Ingest files (FILE schema)
await client.ingest("videos", { path: ["video1.mp4"] }, { schema: { path: "FILE" } });

// Ingest structured data
await client.ingest("embeddings", {
    text: ["doc1", "doc2"],
    embedding: [[0.1, 0.2], [0.3, 0.4]],
});

// Ingest with format object (see formats.md)
await client.ingest("table", null, { format: myFormat });

// Fluent query (use .execute())
const results = await client.table("videos")
    .select("id", "text").where("file_id = $1", "abc").limit(10).execute();

// Raw SQL
const rows = await client.query("SELECT * FROM videos LIMIT 10");

// Table management
await client.listTables();
await client.dropTable("old_table");
await client.createIndex("embeddings", "embedding");
```

---

## Dependencies and Prerequisites

**Required services:**
- Deeplake API server running (default: `https://api.deeplake.ai`)

**Optional Python dependencies (per file type):**
- Video ingestion: `ffmpeg` (`sudo apt-get install ffmpeg`)
- PDF ingestion: `pymupdf` (`pip install pymupdf`)
- Thumbnail generation: `Pillow` (`pip install Pillow`)
- COCO detection format: `pycocotools`, `Pillow`, `numpy` (`pip install pycocotools Pillow numpy`)
- LeRobot frames format: `pandas`, `numpy` (`pip install pandas numpy`)

**Optional TypeScript dependencies (per file type):**
- Video ingestion: `ffmpeg` (system binary)
- PDF ingestion: `pdfjs-dist` (`npm install pdfjs-dist`)
- Thumbnail generation: `sharp` (`npm install sharp`)
- COCO detection format: no external deps (pure JS mask rendering)

## Architecture

```
Python:  Client(token, workspace_id)
Node.js: ManagedClient({ token, workspaceId })
  |-- .ingest(table, data)       -> creates PG table via API, opens al://{ws}/{table}
  |                                 via deeplake SDK (auto credential rotation)
  |-- .query(sql)                -> POST /workspaces/{id}/tables/query -> list[dict] / QueryRow[]
  |-- .table(table)...           -> fluent SQL builder -> list[dict] / QueryRow[]
  |-- .create_index(table, col)  -> CREATE INDEX USING deeplake_index (for search)
  |-- .open_table(table)         -> deeplake.open("al://{ws}/{table}") with auto creds
  |-- .list_tables()             -> GET /workspaces/{id}/tables -> list[str] / string[]
  `-- .drop_table(table)         -> DELETE /workspaces/{id}/tables/{name}
                    |
                    v
              REST API -> PostgreSQL + pg_deeplake
  - All DB operations go through the REST API (no direct PG connection)
  - Dataset access uses al:// paths with automatic credential resolution
  - Creds endpoint: GET /api/org/{workspace}/ds/{table}/creds
  - Vector similarity: embedding <#> query_vec
  - BM25 text search:  text <#> 'search query'
  - Hybrid search:     (embedding, text)::deeplake_hybrid_record
```
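
The routes in the diagram can be written down as small URL builders, which may help when debugging requests or mocking the API in tests. This is a sketch under assumptions: the function names are hypothetical, percent-encoding the path segments is a defensive choice of this sketch, and it assumes the creds route hangs off the same base URL as the other endpoints. Request bodies and authentication headers are not described in this document, so they are left out.

```python
from urllib.parse import quote

DEFAULT_API_URL = "https://api.deeplake.ai"


def _seg(value: str) -> str:
    # Percent-encode a single path segment (an assumption of this sketch).
    return quote(value, safe="")


def query_endpoint(workspace_id: str, api_url: str = DEFAULT_API_URL) -> tuple:
    """(method, url) used for SQL queries."""
    return "POST", f"{api_url.rstrip('/')}/workspaces/{_seg(workspace_id)}/tables/query"


def list_tables_endpoint(workspace_id: str, api_url: str = DEFAULT_API_URL) -> tuple:
    """(method, url) used to list tables."""
    return "GET", f"{api_url.rstrip('/')}/workspaces/{_seg(workspace_id)}/tables"


def drop_table_endpoint(workspace_id: str, table: str, api_url: str = DEFAULT_API_URL) -> tuple:
    """(method, url) used to drop a table."""
    return "DELETE", f"{api_url.rstrip('/')}/workspaces/{_seg(workspace_id)}/tables/{_seg(table)}"


def creds_endpoint(workspace_id: str, table: str, api_url: str = DEFAULT_API_URL) -> tuple:
    """(method, url) of the credentials route; the shared base URL is assumed."""
    return "GET", f"{api_url.rstrip('/')}/api/org/{_seg(workspace_id)}/ds/{_seg(table)}/creds"


def dataset_path(workspace_id: str, table: str) -> str:
    """al:// path under which a table's dataset is opened."""
    return f"al://{workspace_id}/{table}"


method, url = query_endpoint("my-workspace")
path = dataset_path("my-workspace", "videos")
```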

---

## Client Initialization

### Python

```python
from deeplake import Client

client = Client(
    token=None,              # str, optional: API token (falls back to DEEPLAKE_API_KEY env var)
    workspace_id="default",  # str: target workspace (this is the default)
    api_url=None,            # str, optional: API URL (default: https://api.deeplake.ai)
)
```

### Node.js / TypeScript

```typescript
import { ManagedClient, initializeWasm } from 'deeplake';

await initializeWasm();

const client = new ManagedClient({
    token: 'dl_xxx',                    // string, required: API token
    workspaceId: 'default',             // string, optional: target workspace (default: "default")
    apiUrl: 'https://api.deeplake.ai',  // string, optional: API URL (this is the default)
});
```

**Token:** Create API tokens on the Deeplake platform at `https://app.deeplake.ai/<org_name>/workspace/<workspace>/apitoken`. The token is a JWT with the `org_id` embedded. In Python, omitting `token` makes the client fall back to the `DEEPLAKE_API_KEY` environment variable; the Node.js client has no such fallback and requires `token`.
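
The token behavior described above can be illustrated with a short standard-library sketch. It is not the SDK's implementation: `resolve_token` and `peek_jwt_claims` are hypothetical names, the payload is decoded without verifying the signature (inspection only), and it assumes `org_id` is a top-level claim in the JWT payload.

```python
import base64
import json
import os
from typing import Optional


def resolve_token(token: Optional[str] = None) -> str:
    """Prefer the explicit argument, then the DEEPLAKE_API_KEY environment variable."""
    resolved = token or os.environ.get("DEEPLAKE_API_KEY")
    if not resolved:
        raise ValueError("No API token: pass token= or set DEEPLAKE_API_KEY")
    return resolved


def peek_jwt_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT signature verification (for inspection only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token does not look like a JWT (expected 3 dot-separated parts)")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64url(obj: dict) -> str:
    # Helper used only to build the demo token below.
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# A made-up, unsigned token for demonstration; real tokens come from the platform.
demo_token = ".".join([_b64url({"alg": "none"}), _b64url({"org_id": "demo-org"}), "sig"])
claims = peek_jwt_claims(resolve_token(demo_token))
```

Reading `claims["org_id"]` this way can help confirm which organization a token belongs to before using it, but it proves nothing about the token's validity.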

**Backend endpoint:** The client sets the C++ backend endpoint to `api_url` before each dataset open (not on initialization) so that `al://` path resolution (credential fetching) goes through deeplake-api instead of the legacy controlplane. This avoids global state clobbering when multiple clients use different API URLs. Python: `deeplake.client.endpoint = api_url`. Node.js: `deeplakeSetEndpoint(apiUrl)`.
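
The reasoning behind setting the endpoint before each open can be seen in a toy model. Everything here is a stand-in: `FakeBackend` plays the role of the process-wide backend setting and `SketchClient` the role of the client; neither is the deeplake API.

```python
class FakeBackend:
    """Hypothetical stand-in for a process-wide backend setting."""
    endpoint = None


class SketchClient:
    def __init__(self, api_url: str, workspace_id: str, backend=FakeBackend):
        # Only remember the URL; the shared setting is NOT touched on init.
        self.api_url = api_url
        self.workspace_id = workspace_id
        self._backend = backend

    def open_table(self, table: str) -> dict:
        # Point the shared setting at this client's URL right before the open.
        self._backend.endpoint = self.api_url
        return self._open(f"al://{self.workspace_id}/{table}")

    def _open(self, path: str) -> dict:
        # Stand-in for the real dataset open: record which endpoint was active.
        return {"path": path, "resolved_via": self._backend.endpoint}


a = SketchClient("https://api-a.example", "ws-a")
b = SketchClient("https://api-b.example", "ws-b")  # creating b does not disturb a
first = a.open_table("videos")
second = b.open_table("videos")
third = a.open_table("videos")  # still resolves through a's URL
```

Had the constructor set the shared endpoint instead, creating `b` would have silently redirected `a`. The toy model covers sequential use only; it says nothing about concurrent opens.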

**Connection lifecycle:**
```python
# Python: just create and use -- no connection to manage
client = Client()
client.ingest("table", {"path": ["file.txt"]}, schema={"path": "FILE"})
# No close() method -- client is stateless (REST API calls only)
```

---

## Ingestion

### Python: client.ingest()

```python
result = client.ingest(
    table_name: str,                    # Table name to create (must not already exist)
    data: dict[str, list] = None,       # Data dict (required unless format= is set).
                                        #   {"_huggingface": "name"} -> HuggingFace dataset
                                        #   schema has "FILE" cols -> file paths processed
                                        #   otherwise -> column data {col: [values]}
    *,
    format: Format = None,              # Format object (subclass of Format) with
                                        #   normalize() method. When set, data is ignored.
                                        #   e.g. CocoPanoptic(images_dir=..., ...)
    schema: dict[str, str] = None,      # Schema override {col: type}
                                        #   Use "FILE" for columns containing file paths
                                        #   See reference.md for all type names
    index: list[str] = None,            # Columns to create deeplake_index on after ingestion.
                                        #   Use for EMBEDDING (vector search) and TEXT (BM25) columns.
    on_progress: Callable = None,       #