Deeplake
SDK for ingesting data into Deeplake managed tables. Use when users want to store, ingest, or query data in Deeplake.
Installation / Download
TotalClaw CLI (recommended)
totalclaw install clawskills:kaghni~deeplake-skills
cURL direct download (no login required)
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Akaghni~deeplake-skills/file -o deeplake-skills.md
Get the source from the Git repository
git clone https://github.com/openclaw/skills/commit/63c066c593097085051b4cd86e5fc09a5613f7fb
# Deeplake Managed Service SDK
> Agent-friendly SDK for ingesting data into Deeplake managed tables.
> Use this skill when users want to store, ingest, or query data in Deeplake.
> Available in both **Python** and **Node.js/TypeScript**.
---
## Quick Reference
### Python
```bash
pip install deeplake # uv add deeplake
```
**Python import (primary):**
```python
from deeplake import Client
# Async variant (requires aiohttp: pip install aiohttp):
from deeplake.managed import AsyncClient
```
```python
from deeplake import Client
# Initialize -- token from DEEPLAKE_API_KEY env var, workspace defaults to "default"
client = Client()
client = Client(token="dl_xxx", workspace_id="my-workspace")
# Ingest files (FILE schema)
client.ingest("videos", {"path": ["video1.mp4", "video2.mp4"]}, schema={"path": "FILE"})
# Ingest structured data with indexes for search
client.ingest("embeddings", {
    "text": ["doc1", "doc2", "doc3"],
    "embedding": [[0.1, 0.2, ...], [0.3, 0.4, ...], [0.5, 0.6, ...]],
}, index=["embedding", "text"])
# Ingest from HuggingFace
client.ingest("cifar", {"_huggingface": "cifar10"})
# Ingest with format object (see formats.md for CocoPanoptic, Coco, LeRobot, custom)
client.ingest("table", format=my_format)
# Fluent query
results = client.table("videos").select("id", "text").where("file_id = $1", "abc").limit(10)()
# Raw SQL
results = client.query("SELECT * FROM videos LIMIT 10")
# Vector similarity search
results = client.query("""
SELECT id, text, embedding <#> $1 AS similarity
FROM embeddings ORDER BY similarity DESC LIMIT 10
""", (query_embedding,))
# Table management
client.list_tables()
client.drop_table("old_table")
client.create_index("embeddings", "embedding")
```
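The structured-data examples above pass a column-oriented dict (`{col: [values]}`). A minimal sketch of a pre-flight check for such a dict follows. It assumes every column should have the same number of rows, which the source implies but does not state, and `check_columns` is a hypothetical helper, not part of the Deeplake SDK.

```python
def check_columns(data: dict[str, list]) -> int:
    """Return the shared row count, or raise ValueError if columns differ in length."""
    if not data:
        raise ValueError("data must contain at least one column")
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()))


# Same shape as the "embeddings" example, with short illustrative vectors.
rows = {
    "text": ["doc1", "doc2", "doc3"],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}
row_count = check_columns(rows)
```

Running a check like this before `client.ingest(...)` surfaces a mismatched column locally instead of partway through an upload. Whether `ingest` performs its own validation is not covered here.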
### Node.js / TypeScript
```bash
npm install deeplake
```
**TypeScript import:**
```typescript
import { ManagedClient, initializeWasm } from 'deeplake';
```
**WASM initialization (required before any operations):**
```typescript
await initializeWasm();
```
Call `initializeWasm()` once at startup before any `ManagedClient` operations (ingest, query, etc.). It initializes the underlying WASM module.
```typescript
import { ManagedClient, initializeWasm } from 'deeplake';
await initializeWasm();
const client = new ManagedClient({ token: 'dl_xxx', workspaceId: 'my-workspace' });
// Ingest files (FILE schema)
await client.ingest("videos", { path: ["video1.mp4"] }, { schema: { path: "FILE" } });
// Ingest structured data
await client.ingest("embeddings", {
  text: ["doc1", "doc2"],
  embedding: [[0.1, 0.2], [0.3, 0.4]],
});
// Ingest with format object (see formats.md)
await client.ingest("table", null, { format: myFormat });
// Fluent query (use .execute())
const results = await client.table("videos")
  .select("id", "text").where("file_id = $1", "abc").limit(10).execute();
// Raw SQL
const rows = await client.query("SELECT * FROM videos LIMIT 10");
// Table management
await client.listTables();
await client.dropTable("old_table");
await client.createIndex("embeddings", "embedding");
```
---
## Dependencies and Prerequisites
**Required services:**
- Deeplake API server running (default: `https://api.deeplake.ai`)
**Optional Python dependencies (per file type):**
- Video ingestion: `ffmpeg` (`sudo apt-get install ffmpeg`)
- PDF ingestion: `pymupdf` (`pip install pymupdf`)
- Thumbnail generation: `Pillow` (`pip install Pillow`)
- COCO detection format: `pycocotools`, `Pillow`, `numpy` (`pip install pycocotools Pillow numpy`)
- LeRobot frames format: `pandas`, `numpy` (`pip install pandas numpy`)
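The Python list above can be turned into a quick environment check. The sketch below uses only the standard library. The feature labels (`video`, `pdf`, and so on) and the helper names are illustrative assumptions, and the import names (`PIL` for Pillow, `pymupdf` or the older `fitz` for PyMuPDF) differ from the pip package names.

```python
import importlib.util
import shutil


def has_module(*names: str) -> bool:
    """True if at least one of the given top-level modules is importable."""
    return any(importlib.util.find_spec(name) is not None for name in names)


def missing_optional_deps() -> dict[str, list[str]]:
    """Map each optional feature to the dependencies that are not installed."""
    checks = {
        "video": {"ffmpeg": shutil.which("ffmpeg") is not None},  # system binary
        "pdf": {"pymupdf": has_module("pymupdf", "fitz")},
        "thumbnails": {"Pillow": has_module("PIL")},
        "coco": {
            "pycocotools": has_module("pycocotools"),
            "Pillow": has_module("PIL"),
            "numpy": has_module("numpy"),
        },
        "lerobot": {
            "pandas": has_module("pandas"),
            "numpy": has_module("numpy"),
        },
    }
    return {
        feature: [name for name, present in deps.items() if not present]
        for feature, deps in checks.items()
    }


report = missing_optional_deps()
for feature, missing in report.items():
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{feature}: {status}")
```

An empty list for a feature means everything that feature needs was found on this machine.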
**Optional TypeScript dependencies (per file type):**
- Video ingestion: `ffmpeg` (system binary)
- PDF ingestion: `pdfjs-dist` (`npm install pdfjs-dist`)
- Thumbnail generation: `sharp` (`npm install sharp`)
- COCO detection format: no external deps (pure JS mask rendering)
## Architecture
```
Python: Client(token, workspace_id)
Node.js: ManagedClient({ token, workspaceId })
|-- .ingest(table, data) -> creates PG table via API, opens al://{ws}/{table}
| via deeplake SDK (auto credential rotation)
|-- .query(sql) -> POST /workspaces/{id}/tables/query -> list[dict] / QueryRow[]
|-- .table(table)... -> fluent SQL builder -> list[dict] / QueryRow[]
|-- .create_index(table, col) -> CREATE INDEX USING deeplake_index (for search)
|-- .open_table(table) -> deeplake.open("al://{ws}/{table}") with auto creds
|-- .list_tables() -> GET /workspaces/{id}/tables -> list[str] / string[]
`-- .drop_table(table) -> DELETE /workspaces/{id}/tables/{name}
|
v
REST API -> PostgreSQL + pg_deeplake
- All DB operations go through the REST API (no direct PG connection)
- Dataset access uses al:// paths with automatic credential resolution
- Creds endpoint: GET /api/org/{workspace}/ds/{table}/creds
- Vector similarity: embedding <#> query_vec
- BM25 text search: text <#> 'search query'
- Hybrid search: (embedding, text)::deeplake_hybrid_record
```
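To make the routing in the diagram concrete, here is a minimal sketch that maps three of the client operations to the HTTP method and URL shown above. It assumes the paths sit directly under the API root with no version prefix and that identifiers are URL-encoded. `route` is a hypothetical helper, and request bodies and auth headers are left out because the diagram does not specify them.

```python
from urllib.parse import quote

DEFAULT_API_URL = "https://api.deeplake.ai"  # default API URL named in this document


def route(operation, workspace_id, table=None, api_url=DEFAULT_API_URL):
    """Return (http_method, url) for an operation from the architecture diagram."""
    base = f"{api_url.rstrip('/')}/workspaces/{quote(workspace_id, safe='')}/tables"
    if operation == "query":
        return "POST", f"{base}/query"
    if operation == "list_tables":
        return "GET", base
    if operation == "drop_table":
        if table is None:
            raise ValueError("drop_table needs a table name")
        return "DELETE", f"{base}/{quote(table, safe='')}"
    raise ValueError(f"unknown operation: {operation}")


query_route = route("query", "my-workspace")
list_route = route("list_tables", "my-workspace")
drop_route = route("drop_table", "my-workspace", table="old_table")
```

Because every operation resolves to a plain HTTP call, the client never holds a database connection of its own.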
---
## Client Initialization
### Python
```python
from deeplake import Client
client = Client(
    token: str = None,             # API token (falls back to DEEPLAKE_API_KEY env var)
    workspace_id: str = "default", # Target workspace (default: "default")
    api_url: str = None,           # API URL (default: https://api.deeplake.ai)
)
```
### Node.js / TypeScript
```typescript
import { ManagedClient, initializeWasm } from 'deeplake';
await initializeWasm();
const client = new ManagedClient({
  token: string,        // API token (required)
  workspaceId?: string, // Target workspace (default: "default")
  apiUrl?: string,      // API URL (default: https://api.deeplake.ai)
});
```
**Token:** Create API tokens on the Deeplake platform at `https://app.deeplake.ai/<org_name>/workspace/<workspace>/apitoken`. The token is a JWT with `org_id` embedded. If no token is passed, the Python client falls back to the `DEEPLAKE_API_KEY` environment variable; the Node.js client has no such fallback, so `token` is required.
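The fallback and the embedded `org_id` can be illustrated with the standard library alone. This is a sketch, not the SDK's implementation: `resolve_token` and `jwt_claims` are hypothetical helpers, the token is assumed to be a standard three-part JWT, and the payload is decoded without verifying the signature, so it is suitable for inspection only.

```python
import base64
import json
import os


def resolve_token(token=None):
    """Prefer an explicit token, else fall back to DEEPLAKE_API_KEY."""
    token = token or os.environ.get("DEEPLAKE_API_KEY")
    if not token:
        raise ValueError("no token given and DEEPLAKE_API_KEY is not set")
    return token


def jwt_claims(token):
    """Decode the JWT payload segment WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token does not look like a JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _segment(obj):
    """Encode a dict as an unpadded base64url JWT segment (demo only)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Fabricated demo token; a real token comes from the Deeplake platform.
demo_token = ".".join([_segment({"alg": "none"}), _segment({"org_id": "demo-org"}), "sig"])
claims = jwt_claims(resolve_token(demo_token))
```

Reading `claims["org_id"]` this way is a quick check that a token belongs to the organization you expect before any requests are sent.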
**Backend endpoint:** The client sets the C++ backend endpoint to `api_url` before each dataset open (not on initialization) so that `al://` path resolution (credential fetching) goes through deeplake-api instead of the legacy controlplane. This avoids global state clobbering when multiple clients use different API URLs. Python: `deeplake.client.endpoint = api_url`. Node.js: `deeplakeSetEndpoint(apiUrl)`.
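The reason for setting the endpoint at open time rather than at construction can be shown with a stand-in for the process-wide backend state. Everything in this sketch is hypothetical: `FakeBackend` and `SketchClient` only imitate the pattern and do not call the deeplake module, and the second URL is a placeholder.

```python
class FakeBackend:
    """Stand-in for process-wide backend state (not the real deeplake module)."""

    endpoint = None

    @classmethod
    def open(cls, path):
        # Report which endpoint would resolve credentials for this path.
        return f"{cls.endpoint} -> {path}"


class SketchClient:
    def __init__(self, api_url, workspace_id="default"):
        # Only remember the URL here; the global is left untouched.
        self.api_url = api_url
        self.workspace_id = workspace_id

    def open_table(self, table):
        # Set the shared endpoint immediately before each open.
        FakeBackend.endpoint = self.api_url
        return FakeBackend.open(f"al://{self.workspace_id}/{table}")


prod = SketchClient("https://api.deeplake.ai")
other = SketchClient("https://staging.example.invalid")

first = prod.open_table("videos")
second = other.open_table("videos")
third = prod.open_table("videos")  # still resolves against prod's URL
```

Had the endpoint been set in `__init__`, constructing `other` would have redirected every later open by `prod`. In this sketch, opens running concurrently on several threads would still need a lock around the set-then-open pair.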
**Connection lifecycle:**
```python
# Python: just create and use -- no connection to manage
client = Client()
client.ingest("table", {"path": ["file.txt"]}, schema={"path": "FILE"})
# No close() method -- client is stateless (REST API calls only)
```
---
## Ingestion
### Python: client.ingest()
```python
result = client.ingest(
table_name: str, # Table name to create (must not already exist)
data: dict[str, list] = None, # Data dict (required unless format= is set).
# {"_huggingface": "name"} -> HuggingFace dataset
# schema has "FILE" cols -> file paths processed
# otherwise -> column data {col: [values]}
*,
format: Format = None, # Format object (subclass of Format) with
# normalize() method. When set, data is ignored.
# e.g. CocoPanoptic(images_dir=..., ...)
schema: dict[str, str] = None, # Schema override {col: type}
# Use "FILE" for columns containing file paths
# See reference.md for all type names
index: list[str] = None, # Columns to create deeplake_index on after ingestion.
# Use for EMBEDDING (vector search) and TEXT (BM25) columns.
on_progress: Callable = None, #