dataset-finder
This skill searches, downloads, previews, and documents datasets via a CLI (python scripts/dataset.py) across Kaggle, Hugging Face, the UCI ML Repository, and Data.gov. It runs shell commands (e.g., pip install, python scripts/dataset.py ...), requires API credentials in ~/.kaggle/kaggle.json (or %USERPROFILE%\.kaggle\ on Windows) and the HF_TOKEN environment variable, and makes network requests to https://archive.ics.uci.edu/ml/datasets.php and other external APIs.
Dataset Finder
A powerful OpenClaw skill for discovering, downloading, and managing datasets from multiple repositories.
Features
✅ Multi-Repository Search
- Kaggle (ML competitions & community datasets)
- Hugging Face (NLP, vision, audio datasets)
- UCI ML Repository (classic ML datasets)
- Data.gov (US government open data)
✅ Smart Download
- Automatic format detection
- Multiple format support (CSV, JSON, Parquet, Excel)
- Batch downloading
- Progress tracking
✅ Dataset Preview
- Quick statistics without full load
- Column types and missing values
- Sample data inspection
- Memory usage estimation
✅ Documentation Generation
- Auto-generate data cards
- Schema documentation
- Usage examples
- Statistics summaries
Installation
Prerequisites
- Install OpenClaw CLI for Windows or macOS
- Install Python dependencies:
# Standard installation
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
# Or install from requirements.txt
pip install -r requirements.txt
Using Virtual Environment (Recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
⚠️ Never use --break-system-packages - use virtual environments instead!
API Credentials
Kaggle Setup
- Go to https://www.kaggle.com/settings
- Click "Create New API Token"
- Save kaggle.json to:
  - Linux/Mac: ~/.kaggle/
  - Windows: %USERPROFILE%\.kaggle\
- Set permissions (Linux/Mac):
chmod 600 ~/.kaggle/kaggle.json
Hugging Face Setup (Optional)
# Login via CLI
huggingface-cli login
# Or set environment variable
export HF_TOKEN="your_token_here"
Quick Start
Search Datasets
# Search Kaggle
python scripts/dataset.py kaggle search "house prices"
# Search Hugging Face
python scripts/dataset.py huggingface search "sentiment analysis"
# Search UCI ML Repository
python scripts/dataset.py uci search "classification"
Download Datasets
# Download from Kaggle
python scripts/dataset.py kaggle download "zillow/zecon"
# Download from Hugging Face
python scripts/dataset.py huggingface download "imdb"
# Download from UCI
python scripts/dataset.py uci download "iris"
Preview and Document
# Preview dataset
python scripts/dataset.py preview data.csv --detailed
# Generate data card
python scripts/dataset.py datacard data.csv --output DATACARD.md
Common Use Cases
1. ML Project Setup
# Search for datasets
python scripts/dataset.py kaggle search "housing prices" --max-results 10
# Download selected dataset
python scripts/dataset.py kaggle download "zillow/zecon"
# Preview the data
python scripts/dataset.py preview datasets/kaggle/zillow_zecon/train.csv --detailed
# Generate documentation
python scripts/dataset.py datacard datasets/kaggle/zillow_zecon/train.csv
2. NLP Dataset Collection
# Search for sentiment datasets
python scripts/dataset.py huggingface search "sentiment" --task text-classification --language en
# Download multiple datasets
python scripts/dataset.py huggingface download "imdb"
python scripts/dataset.py huggingface download "sst2"
python scripts/dataset.py huggingface download "yelp_polarity"
3. Dataset Comparison
# Search multiple sources
python scripts/dataset.py kaggle search "titanic" --output kaggle_results.json
python scripts/dataset.py huggingface search "titanic" --output hf_results.json
# Compare results
cat kaggle_results.json hf_results.json | jq '.'
4. Build Dataset Library
# Create organized structure
mkdir -p datasets/{kaggle,huggingface,uci}
# Download datasets
python scripts/dataset.py kaggle download "dataset1" --output-dir datasets/kaggle/
python scripts/dataset.py huggingface download "dataset2" --output-dir datasets/huggingface/
# Generate data cards for all (the ** glob requires bash with globstar enabled)
shopt -s globstar
for file in datasets/**/*.csv; do
python scripts/dataset.py datacard "$file" --output "${file%.csv}_DATACARD.md"
done
Repository-Specific Features
Kaggle
# Search with filters
python scripts/dataset.py kaggle search "NLP" \
--file-type csv \
--sort-by hotness \
--max-results 20
# Download specific files
python scripts/dataset.py kaggle download "owner/dataset" --file "train.csv"
# List dataset files
python scripts/dataset.py kaggle list "owner/dataset"
Hugging Face
# Search with task filter
python scripts/dataset.py huggingface search "text" \
--task text-classification \
--language en \
--max-results 15
# Download specific split
python scripts/dataset.py huggingface download "imdb" --split train
# Download with configuration
python scripts/dataset.py huggingface download "glue" --config mrpc
# Stream large datasets
python scripts/dataset.py huggingface download "large-dataset" --streaming
UCI ML Repository
# Search by task type
python scripts/dataset.py uci search "regression" --task-type regression
# Search by size
python scripts/dataset.py uci search "classification" --min-samples 1000
# Download classic datasets
python scripts/dataset.py uci download "iris"
python scripts/dataset.py uci download "wine-quality"
Dataset Preview Features
Basic Preview
python scripts/dataset.py preview data.csv
Shows:
- Dataset shape (rows × columns)
- Column names and types
- Missing value counts
- Memory usage
- Sample rows
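The same basic statistics can be computed directly with pandas. The sketch below is illustrative, not the CLI's actual implementation; the function name quick_preview is an assumption.

```python
import pandas as pd

def quick_preview(path_or_buf, n_rows=5):
    """Collect preview-style stats: shape, dtypes, missing counts, memory, sample."""
    df = pd.read_csv(path_or_buf)
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),       # NaN count per column
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "sample": df.head(n_rows),                   # first rows for inspection
    }
```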
Detailed Preview
python scripts/dataset.py preview data.csv --detailed
Additional information:
- Numeric statistics (mean, std, min, max)
- Categorical value counts
- Unique value counts
- Top values per column
Save Preview
python scripts/dataset.py preview data.csv --detailed --output preview.txt
Data Card Generation
Generate professional dataset documentation:
# Basic data card
python scripts/dataset.py datacard dataset.csv --output DATACARD.md
# Include statistics
python scripts/dataset.py datacard dataset.csv --include-stats --output README.md
Generated data card includes:
- Dataset description
- File information
- Schema table
- Statistics (if enabled)
- Sample data
- Usage examples
- License placeholder
- Citation placeholder
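As a rough sketch of what such a data card looks like when built from a DataFrame (the function name write_datacard and the exact column set are assumptions, not the CLI's internals):

```python
import pandas as pd

def write_datacard(df: pd.DataFrame, name: str) -> str:
    """Build a minimal Markdown data card: title, row count, schema table."""
    lines = [
        f"# {name}",
        "",
        f"Rows: {len(df)}",
        "",
        "| Column | Type | Missing |",
        "| --- | --- | --- |",
    ]
    for col in df.columns:
        lines.append(f"| {col} | {df[col].dtype} | {int(df[col].isna().sum())} |")
    return "\n".join(lines)
```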
Supported File Formats
Reading:
- CSV, TSV
- JSON, JSONL
- Parquet
- Excel (XLSX, XLS)
- HDF5
- Feather
Writing:
- CSV
- JSON
- Parquet
- Markdown (data cards)
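A loader covering these formats can dispatch on file extension, roughly as sketched below (pandas provides a reader for each; the dispatch table and load_any name are illustrative assumptions):

```python
from pathlib import Path
import pandas as pd

# Map file extension to the matching pandas reader.
_READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p, **kw: pd.read_csv(p, sep="\t", **kw),
    ".json": pd.read_json,
    ".jsonl": lambda p, **kw: pd.read_json(p, lines=True, **kw),
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".feather": pd.read_feather,
}

def load_any(path: str) -> pd.DataFrame:
    """Load a dataset by inspecting its file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return _READERS[suffix](path)
```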
Command Reference
python scripts/dataset.py <command> <subcommand> [OPTIONS]
KAGGLE:
kaggle search QUERY Search Kaggle datasets
--file-type TYPE Filter by file type
--license LICENSE Filter by license
--sort-by SORT Sort by (hotness|votes|updated|relevance)
--max-results N Limit results
--output FILE Save to JSON
kaggle download DATASET Download dataset
--file FILE Download specific file
--output-dir DIR Output directory
kaggle list DATASET List dataset files
HUGGING FACE:
huggingface search QUERY Search HF datasets
--task TASK Filter by task
--language LANG Filter by language
--max-results N Limit results
--output FILE Save to JSON
huggingface download ID Download dataset
--split SPLIT Specific split
--config CONFIG Configuration
--streaming Stream mode
--output-dir DIR Output directory
UCI:
uci search QUERY Search UCI datasets
--task-type TYPE Filter by task
--min-samples N Minimum samples
uci download ID Download dataset
--output-dir DIR Output directory
PREVIEW:
preview FILE Preview dataset
--detailed Detailed stats
--sample N Sample size
--output FILE Save output
DATACARD:
datacard FILE Generate data card
--output FILE Output file
--include-stats Include statistics
Best Practices
Search Strategy
- Start with broad keywords
- Use filters to narrow results
- Check multiple repositories
- Review metadata before downloading
Download Management
- Organize by repository
- Check dataset size first
- Use descriptive directory names
- Keep original file structures
Data Quality
- Always preview before using
- Generate data cards for documentation
- Check for missing values
- Validate data types
Storage
- Use Parquet for large datasets
- Compress when possible
- Keep separate train/test/val sets
- Version control dataset metadata
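For the compression point: Parquet (via pyarrow) is the better choice for large data, but pandas can also compress CSV with no extra dependency, as in this small sketch (the helper name compress_csv is an assumption):

```python
import pandas as pd

def compress_csv(src: str, dst: str) -> None:
    """Rewrite src as a compressed CSV; pandas infers gzip from a .gz suffix."""
    pd.read_csv(src).to_csv(dst, index=False, compression="infer")
```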
Troubleshooting
"Kaggle API credentials not found"
# Download from https://www.kaggle.com/settings
# Place in ~/.kaggle/kaggle.json (Linux/Mac)
# Or %USERPROFILE%\.kaggle\kaggle.json (Windows)
# Set permissions (Linux/Mac)
chmod 600 ~/.kaggle/kaggle.json
"Library not installed"
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
"Download failed"
- Check internet connection
- Verify dataset still exists
- Check available disk space
- Try downloading specific files
"Cannot load dataset"
- Verify file format
- Check file encoding
- Ensure file is not corrupted
- Try different reader options
"Out of memory"
- Use streaming mode for large datasets
- Preview with smaller sample size
- Use Parquet instead of CSV
- Process in chunks
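Chunked processing with pandas looks roughly like this, keeping only one chunk in memory at a time (a sketch, not the CLI's implementation):

```python
import pandas as pd

def count_rows_chunked(path_or_buf, chunksize=1000):
    """Stream a CSV in fixed-size chunks, accumulating a running row count."""
    total = 0
    for chunk in pd.read_csv(path_or_buf, chunksize=chunksize):
        total += len(chunk)  # each chunk is a small DataFrame
    return total
```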
Tips and Tricks
Quick Dataset Search
# Create alias for common searches
alias kaggle-search='python scripts/dataset.py kaggle search'
alias hf-search='python scripts/dataset.py huggingface search'
# Use them
kaggle-search "house prices"
hf-search "sentiment"
Batch Operations
# Search and save results
python scripts/dataset.py kaggle search "ML" --output results.json
# Extract dataset IDs
cat results.json | jq -r '.[].owner' > datasets.txt
# Download all
while read dataset; do
python scripts/dataset.py kaggle download "$dataset"
done < datasets.txt
Preview Multiple Files
# Preview all CSV files
for file in *.csv; do
echo "=== $file ==="
python scripts/dataset.py preview "$file"
done
Version
0.1.0 - Initial release
License
Proprietary - See LICENSE.txt
Credits
Built for OpenClaw using:
- kaggle - Kaggle API
- datasets - Hugging Face Datasets
- pandas - Data analysis
- requests - HTTP library
- BeautifulSoup - Web scraping