Data tools

Nineteen in-process dataframe operations over tabular data — the transform half of ETL, including Excel in/out (from-xlsx / to-xlsx, the format DuckDB can’t emit) and structured derive / cleaning ops. Each is a POST https://api.relaystation.ai/v1/data/<op>. Input is the csv-family union (csv, tsv, json array-of-rows, ndjson); pass from/to to pick formats (default csv). cputools reshapes your rows — it never evaluates a cell or a user expression as code (see filter below), so there is no expression-execution surface. For full SQL, see Data — SQL; to chain several of these in one call, see Pipeline.

Inputs and outputs

Same uniform shape as the CSV tools: the file is an input source — { "inline": "<base64>" } for ≤ 4 MB, or { "inputKey": "..." } (from POST /v1/cputools/upload-url) for up to 50 MB. The verbs return the uniform output envelope (the reshaped table, default text/csv); profile returns a JSON report.

Billing

Per-MB of input, minimum 1 MiB, at the operator-tunable cputools.price.data.<op>.per_mb_micros (launch default $0.0002 / MB). join bills on the sum of both inputs. The live 402 challenge is authoritative.

Op	Billed on	Rate
`filter`	input MB	$0.0002 / MB
`sort`	input MB	$0.0002 / MB
`groupby`	input MB	$0.0002 / MB
`join`	input MB (both inputs)	$0.0002 / MB
`union`	input MB (all inputs)	$0.0002 / MB
`profile`	input MB	$0.0002 / MB
`cast`	input MB	$0.0002 / MB
`pivot`	input MB	$0.0002 / MB
`schema-infer`	input MB	$0.0002 / MB
`validate`	input MB	$0.0002 / MB
`diff`	input MB (both inputs)	$0.0002 / MB
`from-xlsx`	input MB	$0.0002 / MB
`to-xlsx`	input MB	$0.0002 / MB
`derive`	input MB	$0.0002 / MB
`fillna`	input MB	$0.0002 / MB
`dropna`	input MB	$0.0002 / MB
`rename`	input MB	$0.0002 / MB
`slice`	input MB	$0.0002 / MB
`explode`	input MB	$0.0002 / MB

filter

Keep rows matching a structured predicate — a tree of { column, op, value } leaves combined with and / or / not. Ops: eq, ne, gt, gte, lt, lte, contains, startsWith, endsWith, in, isNull, notNull. Comparisons are value-typed (numeric when both sides parse as numbers, else string). There is no expression string — a value is always a literal compared against your cell data, never executed — so the predicate is safe by construction.

POST /v1/data/filter
{ "file": {"inline":"<b64>"},
  "where": { "and": [ {"column":"status","op":"eq","value":"active"},
                      {"column":"age","op":"gte","value":18} ] } }

sort

Sort by one or more columns. Each key is { column, dir?: "asc"|"desc", type?: "string"|"number" }.

POST /v1/data/sort
{ "file": {"inline":"<b64>"}, "by": [ {"column":"age","dir":"desc","type":"number"} ] }

groupby

Group by one or more columns and apply named aggregate functions — count, sum, avg, min, max, first, last. count needs no column; the rest require one. Name an aggregate with as.

POST /v1/data/groupby
{ "file": {"inline":"<b64>"}, "by": ["country"],
  "aggregate": [ {"fn":"count","as":"n"}, {"column":"spend","fn":"sum","as":"total"} ] }

join

Join two tables on one or more key pairs. how is inner (default), left, right, or outer. Pass left and right as input sources and on as [{ "left": "...", "right": "..." }].

POST /v1/data/join
{ "left": {"inline":"<b64>"}, "right": {"inputKey":"..."},
  "on": [ {"left":"id","right":"user_id"} ], "how": "left" }

union

Concatenate two or more tables, preserving row order; the output schema is the ordered union of all columns (missing cells come back empty).

POST /v1/data/union
{ "files": [ {"inline":"<b64>"}, {"inline":"<b64>"} ] }

profile

A read-only data-quality report: per column the inferred type, row/null counts, distinct count, min/max, numeric mean, and the top values. Returns JSON.

POST /v1/data/profile
{ "file": {"inline":"<b64>"} }

cast

Coerce named columns to a target type — string, integer, number, boolean, or date. Best-effort: a value that can’t be coerced becomes null (the cell isn’t dropped).

POST /v1/data/cast
{ "file": {"inline":"<b64>"}, "columns": [ {"column":"age","to":"integer"}, {"column":"joined","to":"date"} ] }

pivot

Reshape between long and wide. mode: "pivot" spreads a column’s values into new columns — { index: [...], columns: "<col>", values: "<col>", agg?: count|sum|avg|min|max }. mode: "unpivot" melts columns into key/value rows — { id: [...], value?: [...] }.

POST /v1/data/pivot
{ "file": {"inline":"<b64>"}, "mode": "pivot",
  "index": ["country"], "columns": "quarter", "values": "spend", "agg": "sum" }

schema-infer

Infer each column’s type from the data and emit a schema. dialect: "json-schema" (default) returns a JSON Schema; dialect: "sql-ddl" returns a CREATE TABLE statement. Returns JSON/text.

POST /v1/data/schema-infer
{ "file": {"inline":"<b64>"}, "dialect": "sql-ddl" }

validate

Check each row against a structured schema — per column { type, required?, min?, max?, pattern? }. pattern is a bounded, length-capped regular expression matched against the cell string (it is not executable code). Returns { valid, rowCount, errorCount, errors: [{row, column, reason}] }.

POST /v1/data/validate
{ "file": {"inline":"<b64>"},
  "schema": { "email": {"type":"string","required":true,"pattern":"^[^@]+@[^@]+$"},
              "age": {"type":"integer","min":0,"max":120} } }

diff

Row-level changeset between two tables. Pass left and right as input sources and optional key columns (omit to compare whole rows). Returns { added, removed, changed: [{key, before, after}] }.

POST /v1/data/diff
{ "left": {"inline":"<b64>"}, "right": {"inputKey":"..."}, "key": ["id"] }

from-xlsx

Parse an Excel workbook into a table. sheet (name or index, default the first) picks the sheet; to is the output format. The first row is the header, and cell values are read — formulas are never evaluated. A workbook over the cell-count cap is rejected pre-charge.

POST /v1/data/from-xlsx
{ "file": {"inputKey":"..."}, "sheet": 0, "to": "json" }

to-xlsx

Write a table to an Excel .xlsx. Any cell whose value starts with = + - @ is written as text, so a payload can’t become an active formula when the file is opened — formula-injection-safe, and it round-trips cleanly.

POST /v1/data/to-xlsx
{ "file": {"inline":"<b64 csv>"}, "sheetName": "Export" }

Multi-sheet workbooks. Instead of a single file, pass a sheets array (≤ 20) — each entry is { name, file, from? } with its own tabular source, written to its own tab. Provide exactly one of file or sheets. Multi-sheet billing sums every sheet’s input bytes.

Styling. An optional style object applies to the workbook: headerBold, freezeHeader, columnWidths (an array by index or an object keyed by column name), and numberFormats (per-column Excel format strings like "#,##0.00", "yyyy-mm-dd", "@").

POST /v1/data/to-xlsx
{ "sheets": [
    { "name": "Sales", "file": {"inline":"<b64 csv>"} },
    { "name": "Costs", "file": {"inputKey":"..."}, "from": "json" }
  ],
  "style": { "headerBold": true, "freezeHeader": true, "numberFormats": { "amount": "#,##0.00" } } }

The result is binary .xlsx in the uniform output envelope — see Passing & receiving files.

derive

Add computed columns from a structured expression — no formula string, no eval. Each column is { name, expr } where expr is {col} | {lit} | {op, args}: arithmetic (add/sub/mul/div/mod/abs/round), string (concat/upper/lower/trim/substring), coalesce, and comparisons. Values are literals; ops are an allowlist; nothing is executed.

POST /v1/data/derive
{ "file": {"inline":"<b64>"},
  "columns": [ {"name":"total","expr":{"op":"mul","args":[{"col":"qty"},{"col":"price"}]}} ] }

fillna

Fill empty / null cells. value is a literal, or one of ffill / bfill / mean / median / mode. Restrict to columns, or apply to all.

POST /v1/data/fillna
{ "file": {"inline":"<b64>"}, "columns": ["spend"], "value": 0 }

dropna

Drop rows with empty cells. how: "any" (default) drops a row if any of the named columns is empty; "all" only if all are.

POST /v1/data/dropna
{ "file": {"inline":"<b64>"}, "columns": ["email"], "how": "any" }

rename

Rename columns. Each pair is { from, to }; a reserved or duplicate target → 422.

POST /v1/data/rename
{ "file": {"inline":"<b64>"}, "columns": [ {"from":"col1","to":"name"} ] }

slice

Row slicing / sampling — offset + limit, head (first n), tail (last n), or sample (random n).

POST /v1/data/slice
{ "file": {"inline":"<b64>"}, "head": 100 }

explode

Unnest one cell’s array/list into multiple rows (one per element). Default reads a JSON array; pass separator to split a delimited string instead.

POST /v1/data/explode
{ "file": {"inline":"<b64>"}, "column": "tags" }

sample

Sample rows from a dataset — pass exactly one of n (an exact count) or fraction (0–1, a share of the rows). method is random (default), head (first n), or systematic (every step-th row). Pass a seed for reproducible random draws — the same seed over the same input returns a byte-identical sample. The result preserves original row order. Billed per MB of input.

POST /v1/data/sample
{ "file": {"inline":"<b64>"}, "fraction": 0.1, "seed": 42 }

(data/slice also has a bare sample parameter for a one-off random count; data/sample is the richer tool — fractions, methods, and the reproducible seed.)

Sample

curl -X POST https://api.relaystation.ai/v1/data/groupby \
  -H 'X-Payment: <base64 EIP-3009 auth>' \
  -H 'Idempotency-Key: rollup-by-country-20260609' \
  -H 'Content-Type: application/json' \
  -d '{"file":{"inline":"Y291bnRyeSxzcGVuZApVUywxMApVUyw1CkRFLDcK"},"by":["country"],"aggregate":[{"fn":"count","as":"n"},{"column":"spend","fn":"sum","as":"total"}]}'

Errors

402 PAYMENT_REQUIRED — no valid payment.
422 DATA_PARSE_FAILED — malformed input. Rejected before any charge.
422 UNKNOWN_COLUMN — a referenced column isn’t in the data (the response lists the offenders).
422 PREDICATE_TOO_DEEP — a filter where nests past the depth cap.
422 BAD_AGGREGATE — an unknown aggregate function, a missing/duplicate output name, or a reserved name.
422 BAD_PATTERN — a validate pattern isn’t a valid bounded regular expression.
422 XLSX_TOO_LARGE — a from-xlsx workbook exceeds the cell-count cap (cputools.data.max_xlsx_cells).
422 BAD_EXPRESSION — a derive expression uses an unknown op or references a missing column.
413 OUTPUT_TOO_LARGE — a join/union/groupby/pivot result exceeds the output-byte cap (the cartesian-blowup guard); the charge is refunded.

Data — SQL · Pipeline · CSV tools · Pricing · API reference