dataset-finder
This skill searches, downloads, previews, and documents datasets via a CLI (python scripts/dataset.py) across Kaggle, Hugging Face, the UCI ML Repository, and Data.gov. It runs shell commands (e.g., pip install, python scripts/dataset.py ...), requires API credentials in ~/.kaggle/kaggle.json (or %USERPROFILE%\.kaggle\ on Windows) and the HF_TOKEN environment variable, and makes network requests to https://archive.ics.uci.edu/ml/datasets.php and other external APIs.
Dataset Finder
A powerful OpenClaw skill for discovering, downloading, and managing datasets from multiple repositories.
Features
✅ Multi-Repository Search
- Kaggle (ML competitions & community datasets)
- Hugging Face (NLP, vision, audio datasets)
- UCI ML Repository (classic ML datasets)
- Data.gov (US government open data)
✅ Smart Download
- Automatic format detection
- Multiple format support (CSV, JSON, Parquet, Excel)
- Batch downloading
- Progress tracking
✅ Dataset Preview
- Quick statistics without full load
- Column types and missing values
- Sample data inspection
- Memory usage estimation
✅ Documentation Generation
- Auto-generate data cards
- Schema documentation
- Usage examples
- Statistics summaries
Installation
Prerequisites
- Install OpenClaw CLI for Windows or macOS
- Install Python dependencies:
# Standard installation
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
# Or install from requirements.txt
pip install -r requirements.txt
Using Virtual Environment (Recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
⚠️ Never use --break-system-packages - use virtual environments instead!
API Credentials
Kaggle Setup
- Go to https://www.kaggle.com/settings
- Click "Create New API Token"
- Save kaggle.json to:
  - Linux/Mac: ~/.kaggle/
  - Windows: %USERPROFILE%\.kaggle\
- Set permissions (Linux/Mac):
chmod 600 ~/.kaggle/kaggle.json
Hugging Face Setup (Optional)
# Login via CLI
huggingface-cli login
# Or set environment variable
export HF_TOKEN="your_token_here"
Quick Start
Search Datasets
# Search Kaggle
python scripts/dataset.py kaggle search "house prices"
# Search Hugging Face
python scripts/dataset.py huggingface search "sentiment analysis"
# Search UCI ML Repository
python scripts/dataset.py uci search "classification"
Download Datasets
# Download from Kaggle
python scripts/dataset.py kaggle download "zillow/zecon"
# Download from Hugging Face
python scripts/dataset.py huggingface download "imdb"
# Download from UCI
python scripts/dataset.py uci download "iris"
Preview and Document
# Preview dataset
python scripts/dataset.py preview data.csv --detailed
# Generate data card
python scripts/dataset.py datacard data.csv --output DATACARD.md
Common Use Cases
1. ML Project Setup
# Search for datasets
python scripts/dataset.py kaggle search "housing prices" --max-results 10
# Download selected dataset
python scripts/dataset.py kaggle download "zillow/zecon"
# Preview the data
python scripts/dataset.py preview datasets/kaggle/zillow_zecon/train.csv --detailed
# Generate documentation
python scripts/dataset.py datacard datasets/kaggle/zillow_zecon/train.csv
2. NLP Dataset Collection
# Search for sentiment datasets
python scripts/dataset.py huggingface search "sentiment" --task text-classification --language en
# Download multiple datasets
python scripts/dataset.py huggingface download "imdb"
python scripts/dataset.py huggingface download "sst2"
python scripts/dataset.py huggingface download "yelp_polarity"
3. Dataset Comparison
# Search multiple sources
python scripts/dataset.py kaggle search "titanic" --output kaggle_results.json
python scripts/dataset.py huggingface search "titanic" --output hf_results.json
# Compare results
cat kaggle_results.json hf_results.json | jq '.'
4. Build Dataset Library
# Create organized structure
mkdir -p datasets/{kaggle,huggingface,uci}
# Download datasets
python scripts/dataset.py kaggle download "dataset1" --output-dir datasets/kaggle/
python scripts/dataset.py huggingface download "dataset2" --output-dir datasets/huggingface/
# Generate data cards for all (the ** glob requires bash with globstar enabled)
shopt -s globstar
for file in datasets/**/*.csv; do
python scripts/dataset.py datacard "$file" --output "${file%.csv}_DATACARD.md"
done
Repository-Specific Features
Kaggle
# Search with filters
python scripts/dataset.py kaggle search "NLP" \
--file-type csv \
--sort-by hotness \
--max-results 20
# Download specific files
python scripts/dataset.py kaggle download "owner/dataset" --file "train.csv"
# List dataset files
python scripts/dataset.py kaggle list "owner/dataset"
Hugging Face
# Search with task filter
python scripts/dataset.py huggingface search "text" \
--task text-classification \
--language en \
--max-results 15
# Download specific split
python scripts/dataset.py huggingface download "imdb" --split train
# Download with configuration
python scripts/dataset.py huggingface download "glue" --config mrpc
# Stream large datasets
python scripts/dataset.py huggingface download "large-dataset" --streaming
UCI ML Repository
# Search by task type
python scripts/dataset.py uci search "regression" --task-type regression
# Search by size
python scripts/dataset.py uci search "classification" --min-samples 1000
# Download classic datasets
python scripts/dataset.py uci download "iris"
python scripts/dataset.py uci download "wine-quality"
Dataset Preview Features
Basic Preview
python scripts/dataset.py preview data.csv
Shows:
- Dataset shape (rows × columns)
- Column names and types
- Missing value counts
- Memory usage
- Sample rows
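The same basic statistics can be computed directly with pandas. The sketch below is illustrative, not the CLI's actual implementation; the function name quick_preview is an assumption.

```python
import pandas as pd

def quick_preview(path_or_buf, n_rows=5):
    """Collect preview-style stats: shape, dtypes, missing counts, memory, sample."""
    df = pd.read_csv(path_or_buf)
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),       # NaN count per column
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "sample": df.head(n_rows),                   # first rows for inspection
    }
```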
Detailed Preview
python scripts/dataset.py preview data.csv --detailed
Additional information:
- Numeric statistics (mean, std, min, max)
- Categorical value counts
- Unique value counts
- Top values per column
Save Preview
python scripts/dataset.py preview data.csv --detailed --output preview.txt
Data Card Generation
Generate professional dataset documentation:
# Basic data card
python scripts/dataset.py datacard dataset.csv --output DATACARD.md
# Include statistics
python scripts/dataset.py datacard dataset.csv --include-stats --output README.md
Generated data card includes:
- Dataset description
- File information
- Schema table
- Statistics (if enabled)
- Sample data
- Usage examples
- License placeholder
- Citation placeholder
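As a rough sketch of what such a data card looks like when built from a DataFrame (the function name write_datacard and the exact column set are assumptions, not the CLI's internals):

```python
import pandas as pd

def write_datacard(df: pd.DataFrame, name: str) -> str:
    """Build a minimal Markdown data card: title, row count, schema table."""
    lines = [
        f"# {name}",
        "",
        f"Rows: {len(df)}",
        "",
        "| Column | Type | Missing |",
        "| --- | --- | --- |",
    ]
    for col in df.columns:
        lines.append(f"| {col} | {df[col].dtype} | {int(df[col].isna().sum())} |")
    return "\n".join(lines)
```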
Supported File Formats
Reading:
- CSV, TSV
- JSON, JSONL
- Parquet
- Excel (XLSX, XLS)
- HDF5
- Feather
Writing:
- CSV
- JSON
- Parquet
- Markdown (data cards)
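A loader covering these formats can dispatch on file extension, roughly as sketched below (pandas provides a reader for each; the dispatch table and load_any name are illustrative assumptions):

```python
from pathlib import Path
import pandas as pd

# Map file extension to the matching pandas reader.
_READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p, **kw: pd.read_csv(p, sep="\t", **kw),
    ".json": pd.read_json,
    ".jsonl": lambda p, **kw: pd.read_json(p, lines=True, **kw),
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".feather": pd.read_feather,
}

def load_any(path: str) -> pd.DataFrame:
    """Load a dataset by inspecting its file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return _READERS[suffix](path)
```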
Command Reference
python scripts/dataset.py <command> <subcommand> [OPTIONS]
KAGGLE:
kaggle search QUERY Search Kaggle datasets
--file-type TYPE Filter by file type
--license LICENSE Filter by license
--sort-by SORT Sort by (hotness|votes|updated|relevance)
--max-results N Limit results
--output FILE Save to JSON
kaggle download DATASET Download dataset
--file FILE Download specific file
--output-dir DIR Output directory
kaggle list DATASET List dataset files
HUGGING FACE:
huggingface search QUERY Search HF datasets
--task TASK Filter by task
--language LANG Filter by language
--max-results N Limit results
--output FILE Save to JSON
huggingface download ID Download dataset
--split SPLIT Specific split
--config CONFIG Configuration
--streaming Stream mode
--output-dir DIR Output directory
UCI:
uci search QUERY Search UCI datasets
--task-type TYPE Filter by task
--min-samples N Minimum samples
uci download ID Download dataset
--output-dir DIR Output directory
PREVIEW:
preview FILE Preview dataset
--detailed Detailed stats
--sample N Sample size
--output FILE Save output
DATACARD:
datacard FILE Generate data card
--output FILE Output file
--include-stats Include statistics
Best Practices
Search Strategy
- Start with broad keywords
- Use filters to narrow results
- Check multiple repositories
- Review metadata before downloading
Download Management
- Organize by repository
- Check dataset size first
- Use descriptive directory names
- Keep original file structures
Data Quality
- Always preview before using
- Generate data cards for documentation
- Check for missing values
- Validate data types
Storage
- Use Parquet for large datasets
- Compress when possible
- Keep separate train/test/val sets
- Version control dataset metadata
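For the compression point: Parquet (via pyarrow) is the better choice for large data, but pandas can also compress CSV with no extra dependency, as in this small sketch (the helper name compress_csv is an assumption):

```python
import pandas as pd

def compress_csv(src: str, dst: str) -> None:
    """Rewrite src as a compressed CSV; pandas infers gzip from a .gz suffix."""
    pd.read_csv(src).to_csv(dst, index=False, compression="infer")
```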
Troubleshooting
"Kaggle API credentials not found"
# Download from https://www.kaggle.com/settings
# Place in ~/.kaggle/kaggle.json (Linux/Mac)
# Or %USERPROFILE%\.kaggle\kaggle.json (Windows)
# Set permissions (Linux/Mac)
chmod 600 ~/.kaggle/kaggle.json
"Library not installed"
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
"Download failed"
- Check internet connection
- Verify dataset still exists
- Check available disk space
- Try downloading specific files
"Cannot load dataset"
- Verify file format
- Check file encoding
- Ensure file is not corrupted
- Try different reader options
"Out of memory"
- Use streaming mode for large datasets
- Preview with smaller sample size
- Use Parquet instead of CSV
- Process in chunks
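Chunked processing with pandas looks roughly like this, keeping only one chunk in memory at a time (a sketch, not the CLI's implementation):

```python
import pandas as pd

def count_rows_chunked(path_or_buf, chunksize=1000):
    """Stream a CSV in fixed-size chunks, accumulating a running row count."""
    total = 0
    for chunk in pd.read_csv(path_or_buf, chunksize=chunksize):
        total += len(chunk)  # each chunk is a small DataFrame
    return total
```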
Tips and Tricks
Quick Dataset Search
# Create alias for common searches
alias kaggle-search='python scripts/dataset.py kaggle search'
alias hf-search='python scripts/dataset.py huggingface search'
# Use them
kaggle-search "house prices"
hf-search "sentiment"
Batch Operations
# Search and save results
python scripts/dataset.py kaggle search "ML" --output results.json
# Extract dataset IDs
cat results.json | jq -r '.[].owner' > datasets.txt
# Download all
while read dataset; do
python scripts/dataset.py kaggle download "$dataset"
done < datasets.txt
Preview Multiple Files
# Preview all CSV files
for file in *.csv; do
echo "=== $file ==="
python scripts/dataset.py preview "$file"
done
Version
0.1.0 - Initial release
License
Proprietary - See LICENSE.txt
Credits
Built for OpenClaw using:
- kaggle - Kaggle API
- datasets - Hugging Face Datasets
- pandas - Data analysis
- requests - HTTP library
- BeautifulSoup - Web scraping