openclaw

dataset-foundry

Framework + 2 starter datasets for selling niche data on Datarade / Snowflake. Ship a $5-50K dataset in a weekend.


Launch kit

dataset-foundry — launch kit

1-liner

Framework + 2 starter datasets for selling niche data on Datarade / Snowflake. Ship a $5-50K dataset in a weekend.

Tweet hook

The most underrated revenue stream for indie devs in 2026: niche data.

Datarade buyers pay $1K-50K for clean, refreshed lists nobody else has.

I built the framework + 2 example schemas to ship them fast.

Open code 🧵

Reddit

  • r/datasets: "Open-source framework for packaging niche datasets for sale"
  • r/dataisbeautiful: cross-promote a polished sample
  • r/sideproject: "$5K dataset → first sale on Datarade in 30 days"

Cold-email ICP

  • AI training-data buyers (Scale, Surge subcontractors needing niche corpora)
  • Industry analysts who'd buy ready-to-go niche feeds
  • Boutique consulting firms

Cold-email template

Subject: niche data feed for {their domain}

Hi {first} — your post on {topic} mentioned needing {data type}.

I run a small data-product practice. We assemble + QC + license niche
datasets (sample at link). I've worked on:
- US craft breweries with weekly tap lists
- US podcasts with verified-active RSS

If you have a recurring data need, we could build it into a feed for you:
$5-25K range, refreshed monthly. Reply with the spec and I'll come
back with a quote.

SEO content

  1. "How to sell data on Datarade: 2026 walkthrough"
  2. "Niche dataset selection: 50 ideas with TAM estimates"
  3. "QC + schema validation for sellable data"
  4. "Snowflake Marketplace vs Datarade vs AWS Data Exchange: where to list first"
  5. "From scraper to sellable: a 4-week workflow"

Documentation

dataset-foundry

Framework for building, QC'ing, and packaging niche datasets for sale on Datarade, Snowflake Marketplace, AWS Data Exchange, and direct-to-buyer.

What this gives you

  1. Schema framework — declarative YAML defining columns, types, validation rules, and documentation.
  2. QC pipeline — runs your dataset through completeness + correctness checks. Fails loudly when something's broken.
  3. Multi-format export — CSV, JSONL, Parquet from one source.
  4. Marketplace manifest — Datarade-compatible YAML manifest with schema docs + sample link.
  5. Two starter dataset schemas in examples/:
    • us-podcasts-with-rss — verified-active RSS feeds (AI training data buyers love this).
    • us-craft-breweries-with-tap-list — weekly-refresh beer lists (beer-rating apps, distribution, tourism).
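
To make the first two items concrete, here is a minimal sketch of what a declarative schema and a completeness/correctness check could look like. The schema is written as a Python dict rather than YAML to keep the example dependency-free. The keys (`columns`, `required`, `pattern`) and the `check_rows` helper are illustrative assumptions, not dataset-foundry's actual schema format or API.

```python
import re

# Hypothetical schema: these keys are assumptions for illustration,
# not the real dataset-foundry YAML layout.
SCHEMA = {
    "name": "us-podcasts-with-rss",
    "columns": {
        "title": {"type": str, "required": True},
        "rss_url": {"type": str, "required": True, "pattern": r"^https?://"},
        "episode_count": {"type": int, "required": False},
    },
}


def check_rows(schema, rows, min_rows=1):
    """Return a list of human-readable QC errors (an empty list means pass)."""
    errors = []
    if len(rows) < min_rows:
        errors.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for col, rule in schema["columns"].items():
            value = row.get(col)
            # Completeness: required columns must be present and non-empty.
            if value is None or value == "":
                if rule.get("required"):
                    errors.append(f"row {i}: missing required column '{col}'")
                continue
            # Correctness: the value must convert to the declared type.
            try:
                typed = rule["type"](value)
            except (TypeError, ValueError):
                errors.append(f"row {i}: '{col}' is not a valid {rule['type'].__name__}")
                continue
            # Correctness: optional regex pattern on the converted value.
            pattern = rule.get("pattern")
            if pattern and not re.search(pattern, str(typed)):
                errors.append(f"row {i}: '{col}' does not match {pattern}")
    return errors


rows = [
    {"title": "Data Talk", "rss_url": "https://example.com/feed.xml", "episode_count": "42"},
    {"title": "", "rss_url": "ftp://example.com/feed", "episode_count": "many"},
]
errors = check_rows(SCHEMA, rows, min_rows=2)
for message in errors:
    print(message)
```

Returning a list of errors instead of raising on the first one lets a single run report every broken row, which fits the "fails loudly" goal without hiding later problems.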

Usage

cd C:\openclaw-products\dataset-foundry
python -m venv .venv
.\.venv\Scripts\activate
pip install -e ".[dev]"

# 1. Pick a schema (or write your own)
cd examples/us-podcasts-with-rss

# 2. Populate raw.csv with your scraped data (use whatever scraper)

# 3. QC it
foundry check schema.yaml raw.csv --min-rows 50000

# 4. Package for marketplace upload
foundry package schema.yaml raw.csv --out dist/

# Output:
#   dist/us-podcasts-with-rss.csv
#   dist/us-podcasts-with-rss.jsonl
#   dist/us-podcasts-with-rss.parquet
#   dist/us-podcasts-with-rss-sample.csv
#   dist/manifest.yaml
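
The packaging step above can be approximated with the standard library alone. This sketch converts CSV text into JSONL and a small sample CSV. The function names and the sample size are assumptions for illustration, not the `foundry package` implementation, and Parquet output is left out because it needs a third-party library.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Convert CSV text (with a header row) into JSONL text, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in reader)


def make_sample(csv_text, n=100):
    """Return CSV text containing the header plus the first n data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        if i > n:  # row 0 is the header, so rows 0..n are kept
            break
        writer.writerow(row)
    return out.getvalue()


raw = (
    "title,rss_url\n"
    "Data Talk,https://example.com/a.xml\n"
    "Brew Cast,https://example.com/b.xml\n"
)
jsonl = csv_to_jsonl(raw)
sample = make_sample(raw, n=1)
print(jsonl, end="")
print(sample, end="")
```

Working on text in memory keeps the sketch easy to test. A real packaging step would read `raw.csv` and write the results under `dist/`.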

Pricing strategy by dataset type

| Dataset shape | Recurring price | One-time price | Best marketplace |
|---|---|---|---|
| Verified-active feeds (podcasts, RSS, sitemaps) | $1-5K/yr | $499-1999 | Datarade |
| Locational directory (breweries, restaurants, dentists) | $500/yr | $99-499 | Direct |
| Niche industry tracking (M&A, hiring, regulation) | $5-20K/yr | $999-4999 | Snowflake |
| Real-time price feeds | $10-50K/yr | n/a | AWS Data Exchange |
| AI training corpora (text, audio, code) | $5-100K | $1K-50K | Direct + Datarade |
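
One rough way to read the recurring and one-time columns side by side is to estimate what a single buyer is worth over a few years. The sketch below uses the low end of the "Verified-active feeds" row. The 50% renewal rate and the three-year horizon are assumptions chosen only for illustration, not figures from any marketplace.

```python
def expected_recurring_revenue(price_per_year, years, renewal_rate):
    """Expected revenue from one buyer over `years`.

    Year 1 is always paid; each later year is paid only if the buyer
    renewed in every previous year, with probability `renewal_rate` each time.
    """
    return sum(price_per_year * renewal_rate ** k for k in range(years))


# Low end of the "Verified-active feeds" row: $1K/yr recurring vs $499 one-time.
recurring = expected_recurring_revenue(1000, years=3, renewal_rate=0.5)
one_time = 499
print(f"recurring over 3 years: ${recurring:,.0f}")
print(f"one-time sale:          ${one_time:,.0f}")
```

Under these assumptions the recurring license is worth $1,750 per buyer (1000 + 500 + 250) against $499 for a one-time sale.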

Selecting a niche

Good dataset niches share three properties:

  1. Hard to assemble. If anyone can pull it from a single API, the data is a commodity. Look for things requiring scraping, normalization, or manual verification.
  2. Recurring buyer need. A snapshot sells once; a refreshed feed renews every year.
  3. Buyer has budget. B2B > consumer. Niche industry > generalist.

Bad niches:

  • Anything available on Kaggle for free.
  • Data pulled straight from a single public API (commodity).
  • Infrequently updated reference data (one sale, then the revenue stops).

Distribution channels

  • Datarade — easiest onboarding, ~30% take, no exclusivity.
  • Snowflake Marketplace — premium pricing; requires Snowflake account; longer onboarding.
  • AWS Data Exchange — enterprise reach; slowest onboarding.
  • Direct sale — best margin, highest sales effort. Pitch to: industry analysts, AI training labs, market research firms.
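
The margin trade-off between channels comes down to simple arithmetic. The sketch below uses the ~30% Datarade take quoted above and assumes a 0% take for direct sales. The helper names are hypothetical, and sales effort and payment fees are ignored.

```python
def net_revenue(list_price, take_percent):
    """Seller's revenue after the marketplace keeps `take_percent` of the price."""
    if not 0 <= take_percent < 100:
        raise ValueError("take_percent must be in [0, 100)")
    return list_price * (100 - take_percent) / 100


def price_to_match(direct_price, take_percent):
    """Marketplace list price needed to net the same as a direct sale."""
    if not 0 <= take_percent < 100:
        raise ValueError("take_percent must be in [0, 100)")
    return direct_price * 100 / (100 - take_percent)


datarade_net = net_revenue(5000, 30)   # $5K listing with a ~30% take
direct_net = net_revenue(5000, 0)      # assumed: no intermediary on a direct sale
print(datarade_net, direct_net)
print(round(price_to_match(5000, 30), 2))
```

On these numbers a $5K listing nets $3,500 through Datarade, so matching a $5K direct sale would need a list price of roughly $7,143.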

Test

pytest tests/ -v

Roadmap

  • Direct upload-to-Datarade automation
  • Diff-and-update for delta refreshes (vs full snapshot)
  • PII redaction utility
  • Schema migration tool (when you reorder/rename columns)
  • Buyer-facing data dictionary HTML generator
  • License-text generator for common terms
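
As an illustration of the diff-and-update item, a delta refresh can be sketched as a keyed comparison between two snapshots. This is not part of the framework today. The `diff_snapshots` helper and the `id` key column are hypothetical names.

```python
def diff_snapshots(old_rows, new_rows, key):
    """Compare two snapshots keyed by `key`; return added, removed, changed rows."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    changed = [new[k] for k in new if k in old and new[k] != old[k]]
    return {"added": added, "removed": removed, "changed": changed}


last_week = [
    {"id": "b1", "name": "Hop House", "taps": 12},
    {"id": "b2", "name": "Malt Works", "taps": 8},
]
this_week = [
    {"id": "b1", "name": "Hop House", "taps": 14},
    {"id": "b3", "name": "Yeast Coast", "taps": 6},
]
delta = diff_snapshots(last_week, this_week, key="id")
for kind, rows in delta.items():
    print(kind, [row["id"] for row in rows])
```

Shipping only the delta keeps weekly refreshes small, but it relies on a stable key column that never changes between snapshots.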