
# Data tools

Nineteen in-process dataframe operations over tabular data — the **transform** half of ETL, including **Excel in/out** (`from-xlsx` / `to-xlsx`, the format DuckDB can't emit) and structured `derive` / cleaning ops. Each is a `POST https://api.relaystation.ai/v1/data/<op>`. Input is the csv-family union (`csv`, `tsv`, `json` array-of-rows, `ndjson`); pass `from`/`to` to pick formats (default `csv`). cputools reshapes your rows — it never evaluates a cell or a user expression as code (see `filter` below), so there is no expression-execution surface. For full SQL, see [Data — SQL](/docs/data-sql); to chain several of these in one call, see [Pipeline](/docs/pipeline).

## Inputs and outputs

Same uniform shape as the [CSV tools](/docs/csv-tools): the `file` is an **input source** — `{ "inline": "<base64>" }` for ≤ 4 MB, or `{ "inputKey": "..." }` (from `POST /v1/cputools/upload-url`) for up to 50 MB. The verbs return the uniform **output envelope** (the reshaped table, default `text/csv`); `profile` returns a JSON report.

## Billing

Per-MB of input, **minimum 1 MiB**, at the operator-tunable `cputools.price.data.<op>.per_mb_micros` (launch default $0.0002 / MB). `join` bills on the **sum** of both inputs. The live `402` challenge is authoritative.

| Op | Billed on | Rate |
|---|---|---|
| `filter` | input MB | $0.0002 / MB |
| `sort` | input MB | $0.0002 / MB |
| `groupby` | input MB | $0.0002 / MB |
| `join` | input MB (both inputs) | $0.0002 / MB |
| `union` | input MB (all inputs) | $0.0002 / MB |
| `profile` | input MB | $0.0002 / MB |
| `cast` | input MB | $0.0002 / MB |
| `pivot` | input MB | $0.0002 / MB |
| `schema-infer` | input MB | $0.0002 / MB |
| `validate` | input MB | $0.0002 / MB |
| `diff` | input MB (both inputs) | $0.0002 / MB |
| `from-xlsx` | input MB | $0.0002 / MB |
| `to-xlsx` | input MB | $0.0002 / MB |
| `derive` | input MB | $0.0002 / MB |
| `fillna` | input MB | $0.0002 / MB |
| `dropna` | input MB | $0.0002 / MB |
| `rename` | input MB | $0.0002 / MB |
| `slice` | input MB | $0.0002 / MB |
| `explode` | input MB | $0.0002 / MB |

## filter

Keep rows matching a **structured predicate** — a tree of `{ column, op, value }` leaves combined with `and` / `or` / `not`. Ops: `eq`, `ne`, `gt`, `gte`, `lt`, `lte`, `contains`, `startsWith`, `endsWith`, `in`, `isNull`, `notNull`. Comparisons are value-typed (numeric when both sides parse as numbers, else string). There is **no expression string** — a `value` is always a literal compared against your cell data, never executed — so the predicate is safe by construction.

```json
POST /v1/data/filter
{ "file": {"inline":"<b64>"},
  "where": { "and": [ {"column":"status","op":"eq","value":"active"},
                      {"column":"age","op":"gte","value":18} ] } }
```

## sort

Sort by one or more columns. Each key is `{ column, dir?: "asc"|"desc", type?: "string"|"number" }`.

```json
POST /v1/data/sort
{ "file": {"inline":"<b64>"}, "by": [ {"column":"age","dir":"desc","type":"number"} ] }
```

## groupby

Group by one or more columns and apply named aggregate functions — `count`, `sum`, `avg`, `min`, `max`, `first`, `last`. `count` needs no column; the rest require one. Name an aggregate with `as`.

```json
POST /v1/data/groupby
{ "file": {"inline":"<b64>"}, "by": ["country"],
  "aggregate": [ {"fn":"count","as":"n"}, {"column":"spend","fn":"sum","as":"total"} ] }
```

## join

Join two tables on one or more key pairs. `how` is `inner` (default), `left`, `right`, or `outer`. Pass `left` and `right` as input sources and `on` as `[{ "left": "...", "right": "..." }]`.

```json
POST /v1/data/join
{ "left": {"inline":"<b64>"}, "right": {"inputKey":"..."},
  "on": [ {"left":"id","right":"user_id"} ], "how": "left" }
```

## union

Concatenate two or more tables, preserving row order; the output schema is the ordered union of all columns (missing cells come back empty).

```json
POST /v1/data/union
{ "files": [ {"inline":"<b64>"}, {"inline":"<b64>"} ] }
```

## profile

A read-only data-quality report: per column the inferred type, row/null counts, distinct count, min/max, numeric mean, and the top values. Returns JSON.

```json
POST /v1/data/profile
{ "file": {"inline":"<b64>"} }
```

## cast

Coerce named columns to a target type — `string`, `integer`, `number`, `boolean`, or `date`. Best-effort: a value that can't be coerced becomes null (the cell isn't dropped).

```json
POST /v1/data/cast
{ "file": {"inline":"<b64>"}, "columns": [ {"column":"age","to":"integer"}, {"column":"joined","to":"date"} ] }
```

## pivot

Reshape between long and wide. `mode: "pivot"` spreads a column's values into new columns — `{ index: [...], columns: "<col>", values: "<col>", agg?: count|sum|avg|min|max }`. `mode: "unpivot"` melts columns into key/value rows — `{ id: [...], value?: [...] }`.

```json
POST /v1/data/pivot
{ "file": {"inline":"<b64>"}, "mode": "pivot",
  "index": ["country"], "columns": "quarter", "values": "spend", "agg": "sum" }
```

## schema-infer

Infer each column's type from the data and emit a schema. `dialect: "json-schema"` (default) returns a JSON Schema; `dialect: "sql-ddl"` returns a `CREATE TABLE` statement. Returns JSON/text.

```json
POST /v1/data/schema-infer
{ "file": {"inline":"<b64>"}, "dialect": "sql-ddl" }
```

## validate

Check each row against a **structured schema** — per column `{ type, required?, min?, max?, pattern? }`. `pattern` is a bounded, length-capped regular expression matched against the cell string (it is **not** executable code). Returns `{ valid, rowCount, errorCount, errors: [{row, column, reason}] }`.

```json
POST /v1/data/validate
{ "file": {"inline":"<b64>"},
  "schema": { "email": {"type":"string","required":true,"pattern":"^[^@]+@[^@]+$"},
              "age": {"type":"integer","min":0,"max":120} } }
```

## diff

Row-level changeset between two tables. Pass `left` and `right` as input sources and optional `key` columns (omit to compare whole rows). Returns `{ added, removed, changed: [{key, before, after}] }`.

```json
POST /v1/data/diff
{ "left": {"inline":"<b64>"}, "right": {"inputKey":"..."}, "key": ["id"] }
```

## from-xlsx

Parse an Excel workbook into a table. `sheet` (name or index, default the first) picks the sheet; `to` is the output format. The first row is the header, and cell **values** are read — formulas are never evaluated. A workbook over the cell-count cap is rejected pre-charge.

```json
POST /v1/data/from-xlsx
{ "file": {"inputKey":"..."}, "sheet": 0, "to": "json" }
```

## to-xlsx

Write a table to an Excel `.xlsx`. Any cell whose value starts with `= + - @` is written as **text**, so a payload can't become an active formula when the file is opened — formula-injection-safe, and it round-trips cleanly.

```json
POST /v1/data/to-xlsx
{ "file": {"inline":"<b64 csv>"}, "sheetName": "Export" }
```

**Multi-sheet workbooks.** Instead of a single `file`, pass a `sheets` array (≤ 20) — each entry is `{ name, file, from? }` with its own tabular source, written to its own tab. Provide **exactly one** of `file` or `sheets`. Multi-sheet billing sums every sheet's input bytes.

**Styling.** An optional `style` object applies to the workbook: `headerBold`, `freezeHeader`, `columnWidths` (an array by index or an object keyed by column name), and `numberFormats` (per-column Excel format strings like `"#,##0.00"`, `"yyyy-mm-dd"`, `"@"`).

```json
POST /v1/data/to-xlsx
{ "sheets": [
    { "name": "Sales", "file": {"inline":"<b64 csv>"} },
    { "name": "Costs", "file": {"inputKey":"..."}, "from": "json" }
  ],
  "style": { "headerBold": true, "freezeHeader": true, "numberFormats": { "amount": "#,##0.00" } } }
```

The result is binary `.xlsx` in the uniform output envelope — see [receiving outputs](/docs/receiving-outputs) and [persistence tiers](/docs/persistence-tiers).

## derive

Add computed columns from a **structured expression** — no formula string, no eval. Each column is `{ name, expr }` where `expr` is `{col}` | `{lit}` | `{op, args}`: arithmetic (`add`/`sub`/`mul`/`div`/`mod`/`abs`/`round`), string (`concat`/`upper`/`lower`/`trim`/`substring`), `coalesce`, and comparisons. Values are literals; ops are an allowlist; nothing is executed.

```json
POST /v1/data/derive
{ "file": {"inline":"<b64>"},
  "columns": [ {"name":"total","expr":{"op":"mul","args":[{"col":"qty"},{"col":"price"}]}} ] }
```

## fillna

Fill empty / null cells. `value` is a literal, or one of `ffill` / `bfill` / `mean` / `median` / `mode`. Restrict to `columns`, or apply to all.

```json
POST /v1/data/fillna
{ "file": {"inline":"<b64>"}, "columns": ["spend"], "value": 0 }
```

## dropna

Drop rows with empty cells. `how: "any"` (default) drops a row if any of the named `columns` is empty; `"all"` only if all are.

```json
POST /v1/data/dropna
{ "file": {"inline":"<b64>"}, "columns": ["email"], "how": "any" }
```

## rename

Rename columns. Each pair is `{ from, to }`; a reserved or duplicate target → 422.

```json
POST /v1/data/rename
{ "file": {"inline":"<b64>"}, "columns": [ {"from":"col1","to":"name"} ] }
```

## slice

Row slicing / sampling — `offset` + `limit`, `head` (first n), `tail` (last n), or `sample` (random n).

```json
POST /v1/data/slice
{ "file": {"inline":"<b64>"}, "head": 100 }
```

## explode

Unnest one cell's array/list into multiple rows (one per element). Default reads a JSON array; pass `separator` to split a delimited string instead.

```json
POST /v1/data/explode
{ "file": {"inline":"<b64>"}, "column": "tags" }
```

## sample

Sample rows from a dataset — pass **exactly one** of `n` (an exact count) or `fraction` (0–1, a share of the rows). `method` is `random` (default), `head` (first n), or `systematic` (every step-th row). Pass a `seed` for **reproducible** random draws — the same seed over the same input returns a byte-identical sample. The result preserves original row order. Billed per MB of input.

```json
POST /v1/data/sample
{ "file": {"inline":"<b64>"}, "fraction": 0.1, "seed": 42 }
```

(`data/slice` also has a bare `sample` parameter for a one-off random count; `data/sample` is the richer tool — fractions, methods, and the reproducible seed.)

## Sample

```bash
curl -X POST https://api.relaystation.ai/v1/data/groupby \
  -H 'X-Payment: <base64 EIP-3009 auth>' \
  -H 'Idempotency-Key: rollup-by-country-20260609' \
  -H 'Content-Type: application/json' \
  -d '{"file":{"inline":"Y291bnRyeSxzcGVuZApVUywxMApVUyw1CkRFLDcK"},"by":["country"],"aggregate":[{"fn":"count","as":"n"},{"column":"spend","fn":"sum","as":"total"}]}'
```

## Errors

- `402 PAYMENT_REQUIRED` — no valid payment.
- `422 DATA_PARSE_FAILED` — malformed input. Rejected **before** any charge.
- `422 UNKNOWN_COLUMN` — a referenced column isn't in the data (the response lists the offenders).
- `422 PREDICATE_TOO_DEEP` — a `filter` `where` nests past the depth cap.
- `422 BAD_AGGREGATE` — an unknown aggregate function, a missing/duplicate output name, or a reserved name.
- `422 BAD_PATTERN` — a `validate` `pattern` isn't a valid bounded regular expression.
- `422 XLSX_TOO_LARGE` — a `from-xlsx` workbook exceeds the cell-count cap (`cputools.data.max_xlsx_cells`).
- `422 BAD_EXPRESSION` — a `derive` expression uses an unknown op or references a missing column.
- `413 OUTPUT_TOO_LARGE` — a `join`/`union`/`groupby`/`pivot` result exceeds the output-byte cap (the cartesian-blowup guard); the charge is refunded.

## Next

[Data — SQL](/docs/data-sql) · [Pipeline](/docs/pipeline) · [CSV tools](/docs/csv-tools) · [Pricing](/pricing) · [API reference](/api-reference)
