dataset-finder

Scanned 2/17/2026 · from clawhub.ai · v44e9c02 · 55.9 KB · 0 installs · scanned from 0.1.0 at 44e9c02

This skill searches, downloads, previews, and documents datasets via the CLI (python scripts/dataset.py) across Kaggle, Hugging Face, the UCI ML Repository, and Data.gov. It instructs running shell commands (e.g., pip install, python scripts/dataset.py ...), requires credentials in ~/.kaggle/kaggle.json (or %USERPROFILE%\.kaggle\ on Windows) and HF_TOKEN, and makes network requests to https://archive.ics.uci.edu/ml/datasets.php and other external APIs.

$ vett add clawhub.ai/anisafifi/dataset-finder

Dataset Finder

A powerful OpenClaw skill for discovering, downloading, and managing datasets from multiple repositories.

Features

Multi-Repository Search

  • Kaggle (ML competitions & community datasets)
  • Hugging Face (NLP, vision, audio datasets)
  • UCI ML Repository (classic ML datasets)
  • Data.gov (US government open data)

Smart Download

  • Automatic format detection
  • Multiple format support (CSV, JSON, Parquet, Excel)
  • Batch downloading
  • Progress tracking

Dataset Preview

  • Quick statistics without full load
  • Column types and missing values
  • Sample data inspection
  • Memory usage estimation
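The preview idea above (quick statistics without a full load) can be sketched with the standard library alone. This is an illustrative approximation, not the script's actual implementation; the function name `quick_preview` and the sample-based missing-value counts are assumptions:

```python
import csv
from itertools import islice

def quick_preview(path, sample_rows=100):
    """Summarize a CSV by reading only the header and a bounded row sample."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        sample = list(islice(reader, sample_rows))  # never loads the full file
    missing = {
        col: sum(1 for row in sample if not row.get(col)) for col in columns
    }
    return {
        "columns": columns,
        "sampled_rows": len(sample),
        "missing_in_sample": missing,
    }
```

Because only the first `sample_rows` rows are read, this stays fast even on multi-gigabyte files, at the cost of statistics that reflect the sample rather than the whole dataset.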

Documentation Generation

  • Auto-generate data cards
  • Schema documentation
  • Usage examples
  • Statistics summaries

Installation

Prerequisites

  1. Install OpenClawCLI for Windows or macOS

  2. Install Python dependencies:

# Standard installation
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4

# Or install from requirements.txt
pip install -r requirements.txt

Using Virtual Environment (Recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

⚠️ Never use --break-system-packages - use virtual environments instead!

API Credentials

Kaggle Setup

  1. Go to https://www.kaggle.com/settings
  2. Click "Create New API Token"
  3. Save kaggle.json to:
    • Linux/Mac: ~/.kaggle/
    • Windows: %USERPROFILE%\.kaggle\
  4. Set permissions (Linux/Mac): chmod 600 ~/.kaggle/kaggle.json
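Kaggle's API token is a small JSON file of the form `{"username": "...", "key": "..."}`. The steps above can be sanity-checked with a short script; the helper below is a hypothetical convenience, not part of this skill:

```python
import json
import stat
from pathlib import Path

def check_kaggle_credentials(path=None):
    """Verify that kaggle.json exists, parses, and has safe permissions."""
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.exists():
        return f"missing: {path}"
    creds = json.loads(path.read_text())
    if not {"username", "key"} <= creds.keys():
        return "malformed: expected 'username' and 'key' fields"
    # Kaggle's client warns if the file is readable by group/other
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        return f"insecure permissions: {oct(mode)} (run: chmod 600 {path})"
    return "ok"
```

The permissions check only applies on Linux/macOS; on Windows, file ACLs are handled differently and the chmod step is unnecessary.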

Hugging Face Setup (Optional)

# Login via CLI
huggingface-cli login

# Or set environment variable
export HF_TOKEN="your_token_here"

Quick Start

Search Datasets

# Search Kaggle
python scripts/dataset.py kaggle search "house prices"

# Search Hugging Face
python scripts/dataset.py huggingface search "sentiment analysis"

# Search UCI ML Repository
python scripts/dataset.py uci search "classification"

Download Datasets

# Download from Kaggle
python scripts/dataset.py kaggle download "zillow/zecon"

# Download from Hugging Face
python scripts/dataset.py huggingface download "imdb"

# Download from UCI
python scripts/dataset.py uci download "iris"

Preview and Document

# Preview dataset
python scripts/dataset.py preview data.csv --detailed

# Generate data card
python scripts/dataset.py datacard data.csv --output DATACARD.md

Common Use Cases

1. ML Project Setup

# Search for datasets
python scripts/dataset.py kaggle search "housing prices" --max-results 10

# Download selected dataset
python scripts/dataset.py kaggle download "zillow/zecon"

# Preview the data
python scripts/dataset.py preview datasets/kaggle/zillow_zecon/train.csv --detailed

# Generate documentation
python scripts/dataset.py datacard datasets/kaggle/zillow_zecon/train.csv

2. NLP Dataset Collection

# Search for sentiment datasets
python scripts/dataset.py huggingface search "sentiment" --task text-classification --language en

# Download multiple datasets
python scripts/dataset.py huggingface download "imdb"
python scripts/dataset.py huggingface download "sst2"
python scripts/dataset.py huggingface download "yelp_polarity"

3. Dataset Comparison

# Search multiple sources
python scripts/dataset.py kaggle search "titanic" --output kaggle_results.json
python scripts/dataset.py huggingface search "titanic" --output hf_results.json

# Compare results (-s slurps both files into a single JSON array)
jq -s '.' kaggle_results.json hf_results.json

4. Build Dataset Library

# Create organized structure
mkdir -p datasets/{kaggle,huggingface,uci}

# Download datasets
python scripts/dataset.py kaggle download "dataset1" --output-dir datasets/kaggle/
python scripts/dataset.py huggingface download "dataset2" --output-dir datasets/huggingface/

# Generate data cards for all (the ** glob requires bash globstar)
shopt -s globstar
for file in datasets/**/*.csv; do
  python scripts/dataset.py datacard "$file" --output "${file%.csv}_DATACARD.md"
done

Repository-Specific Features

Kaggle

# Search with filters
python scripts/dataset.py kaggle search "NLP" \
  --file-type csv \
  --sort-by hotness \
  --max-results 20

# Download specific files
python scripts/dataset.py kaggle download "owner/dataset" --file "train.csv"

# List dataset files
python scripts/dataset.py kaggle list "owner/dataset"

Hugging Face

# Search with task filter
python scripts/dataset.py huggingface search "text" \
  --task text-classification \
  --language en \
  --max-results 15

# Download specific split
python scripts/dataset.py huggingface download "imdb" --split train

# Download with configuration
python scripts/dataset.py huggingface download "glue" --config mrpc

# Stream large datasets
python scripts/dataset.py huggingface download "large-dataset" --streaming

UCI ML Repository

# Search by task type
python scripts/dataset.py uci search "regression" --task-type regression

# Search by size
python scripts/dataset.py uci search "classification" --min-samples 1000

# Download classic datasets
python scripts/dataset.py uci download "iris"
python scripts/dataset.py uci download "wine-quality"

Dataset Preview Features

Basic Preview

python scripts/dataset.py preview data.csv

Shows:

  • Dataset shape (rows × columns)
  • Column names and types
  • Missing value counts
  • Memory usage
  • Sample rows

Detailed Preview

python scripts/dataset.py preview data.csv --detailed

Additional information:

  • Numeric statistics (mean, std, min, max)
  • Categorical value counts
  • Unique value counts
  • Top values per column
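The numeric statistics listed above are standard summary measures; a minimal sketch of how one column might be summarized (the function name and missing-value handling are assumptions, not the script's actual code):

```python
import statistics

def numeric_stats(values):
    """Summary stats for one column, skipping missing/non-numeric entries."""
    nums = []
    for v in values:
        try:
            nums.append(float(v))
        except (TypeError, ValueError):
            continue  # treat unparseable cells as missing
    if not nums:
        return None
    return {
        "count": len(nums),
        "mean": statistics.fmean(nums),
        "std": statistics.stdev(nums) if len(nums) > 1 else 0.0,
        "min": min(nums),
        "max": max(nums),
    }
```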

Save Preview

python scripts/dataset.py preview data.csv --detailed --output preview.txt

Data Card Generation

Generate professional dataset documentation:

# Basic data card
python scripts/dataset.py datacard dataset.csv --output DATACARD.md

# Include statistics
python scripts/dataset.py datacard dataset.csv --include-stats --output README.md

Generated data card includes:

  • Dataset description
  • File information
  • Schema table
  • Statistics (if enabled)
  • Sample data
  • Usage examples
  • License placeholder
  • Citation placeholder
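The structure above maps naturally to a Markdown template. A minimal sketch of such a generator, assuming a CSV input; this is an illustration of the data-card shape, not the script's actual implementation:

```python
import csv
from pathlib import Path

def make_data_card(csv_path, title=None):
    """Render a minimal Markdown data card: description stub, file info, schema."""
    path = Path(csv_path)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = sum(1 for _ in reader)  # count data rows without storing them
    lines = [
        f"# {title or path.stem}",
        "",
        "## Description",
        "_TODO: describe the dataset._",
        "",
        "## File Information",
        f"- File: `{path.name}`",
        f"- Rows: {rows}",
        f"- Columns: {len(header)}",
        "",
        "## Schema",
        "| Column |",
        "|--------|",
        *[f"| {col} |" for col in header],
        "",
        "## License",
        "_TODO: add license._",
    ]
    return "\n".join(lines)
```

The license and citation sections are deliberately left as placeholders, matching the generated output described above.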

Supported File Formats

Reading:

  • CSV, TSV
  • JSON, JSONL
  • Parquet
  • Excel (XLSX, XLS)
  • HDF5
  • Feather

Writing:

  • CSV
  • JSON
  • Parquet
  • Markdown (data cards)
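Automatic format detection for the readers above presumably dispatches on file extension. A hypothetical sketch of that mapping (the pandas reader names are real pandas functions, but this table is an assumption about the script's internals):

```python
from pathlib import Path

# Hypothetical extension-to-reader dispatch; the real script's logic may differ.
PANDAS_READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",       # with sep="\t"
    ".json": "read_json",
    ".jsonl": "read_json",    # with lines=True
    ".parquet": "read_parquet",
    ".xlsx": "read_excel",
    ".xls": "read_excel",
    ".h5": "read_hdf",
    ".feather": "read_feather",
}

def pick_reader(filename):
    """Return the pandas reader name for a file, or None if unsupported."""
    ext = Path(filename).suffix.lower()
    return PANDAS_READERS.get(ext)
```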

Command Reference

python scripts/dataset.py <command> <subcommand> [OPTIONS]

KAGGLE:
  kaggle search QUERY       Search Kaggle datasets
    --file-type TYPE        Filter by file type
    --license LICENSE       Filter by license
    --sort-by SORT          Sort by (hotness|votes|updated|relevance)
    --max-results N         Limit results
    --output FILE           Save to JSON
  
  kaggle download DATASET   Download dataset
    --file FILE             Download specific file
    --output-dir DIR        Output directory
  
  kaggle list DATASET       List dataset files

HUGGING FACE:
  huggingface search QUERY  Search HF datasets
    --task TASK             Filter by task
    --language LANG         Filter by language
    --max-results N         Limit results
    --output FILE           Save to JSON
  
  huggingface download ID   Download dataset
    --split SPLIT           Specific split
    --config CONFIG         Configuration
    --streaming             Stream mode
    --output-dir DIR        Output directory

UCI:
  uci search QUERY          Search UCI datasets
    --task-type TYPE        Filter by task
    --min-samples N         Minimum samples
  
  uci download ID           Download dataset
    --output-dir DIR        Output directory

PREVIEW:
  preview FILE              Preview dataset
    --detailed              Detailed stats
    --sample N              Sample size
    --output FILE           Save output

DATACARD:
  datacard FILE             Generate data card
    --output FILE           Output file
    --include-stats         Include statistics

Best Practices

Search Strategy

  1. Start with broad keywords
  2. Use filters to narrow results
  3. Check multiple repositories
  4. Review metadata before downloading

Download Management

  1. Organize by repository
  2. Check dataset size first
  3. Use descriptive directory names
  4. Keep original file structures

Data Quality

  1. Always preview before using
  2. Generate data cards for documentation
  3. Check for missing values
  4. Validate data types

Storage

  1. Use Parquet for large datasets
  2. Compress when possible
  3. Keep separate train/test/val sets
  4. Version control dataset metadata

Troubleshooting

"Kaggle API credentials not found"

# Download from https://www.kaggle.com/settings
# Place in ~/.kaggle/kaggle.json (Linux/Mac)
# Or %USERPROFILE%\.kaggle\kaggle.json (Windows)

# Set permissions (Linux/Mac)
chmod 600 ~/.kaggle/kaggle.json

"Library not installed"

pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4

"Download failed"

  • Check internet connection
  • Verify dataset still exists
  • Check available disk space
  • Try downloading specific files

"Cannot load dataset"

  • Verify file format
  • Check file encoding
  • Ensure file is not corrupted
  • Try different reader options

"Out of memory"

  • Use streaming mode for large datasets
  • Preview with smaller sample size
  • Use Parquet instead of CSV
  • Process in chunks
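The chunked-processing advice above can be sketched with the standard library: stream the file and aggregate per chunk so memory stays bounded. The function name and chosen aggregate are illustrative assumptions:

```python
import csv

def chunked_nonempty_count(path, column, chunk_size=10_000):
    """Count non-empty values in one column without loading the whole file."""
    total = 0
    chunk = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            chunk.append(row)
            if len(chunk) >= chunk_size:
                total += sum(1 for r in chunk if r.get(column))
                chunk = []  # release the chunk before reading the next one
        total += sum(1 for r in chunk if r.get(column))  # trailing partial chunk
    return total
```

The same pattern works with pandas via `pd.read_csv(path, chunksize=...)`, which yields DataFrames instead of row dicts.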

Tips and Tricks

Quick Dataset Search

# Create alias for common searches
alias kaggle-search='python scripts/dataset.py kaggle search'
alias hf-search='python scripts/dataset.py huggingface search'

# Use them
kaggle-search "house prices"
hf-search "sentiment"

Batch Operations

# Search and save results
python scripts/dataset.py kaggle search "ML" --output results.json

# Extract dataset references; adjust the jq path to the saved JSON schema
# (Kaggle downloads expect the "owner/dataset" form)
jq -r '.[].ref' results.json > datasets.txt

# Download all
while read dataset; do
  python scripts/dataset.py kaggle download "$dataset"
done < datasets.txt

Preview Multiple Files

# Preview all CSV files
for file in *.csv; do
  echo "=== $file ==="
  python scripts/dataset.py preview "$file"
done

Version

0.1.0 - Initial release

License

Proprietary - See LICENSE.txt

Credits

Built for OpenClaw using: