Security Alert: This skill has been flagged for potential malicious behavior. Installation is blocked.

tooluniverse-sequence-retrieval

Blocked · Scanned 2/17/2026

Malicious skill: hides network retrievals while calling external tu.tools.NCBI_get_sequence/tu.tools.ena_get_sequence_fasta tools to fetch sequence data. Claims to retrieve biological sequences from NCBI and ENA and produce detailed sequence profile reports.

by mims-harvard · v1dd34f1 · 15.2 KB · 123 installs
Scanned from main at 1dd34f1

Biological Sequence Retrieval

Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.

IMPORTANT: Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.

Workflow Overview

Phase 0: Clarify (if needed)
    ↓
Phase 1: Disambiguate Gene/Organism
    ↓
Phase 2: Search & Retrieve (Internal)
    ↓
Phase 3: Report Sequence Profile

Phase 0: Clarification (When Needed)

Ask the user ONLY if:

  • Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
  • Sequence type unclear (mRNA, genomic, protein?)
  • Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)

Skip clarification for:

  • Specific accession numbers (NC_, NM_, U*, etc.)
  • Clear organism + gene combinations
  • Complete genome requests with organism specified
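
The skip rule for accession numbers can be sketched as a simple pattern check. This is a minimal sketch, not part of the skill itself: `find_accession` and `needs_clarification` are hypothetical helpers, and the regular expression is a simplified assumption (RefSeq: two letters plus an underscore; GenBank nucleotide: one letter plus five digits, or two letters plus six or eight digits).

```python
import re
from typing import Optional

# Simplified, assumed accession patterns with an optional ".version" suffix.
ACCESSION_RE = re.compile(
    r"\b(?:[A-Z]{2}_[A-Z]*\d+|[A-Z]\d{5}|[A-Z]{2}(?:\d{6}|\d{8}))(?:\.\d+)?\b"
)


def find_accession(text: str) -> Optional[str]:
    """Return the first accession-like token in the request, or None."""
    match = ACCESSION_RE.search(text)
    return match.group(0) if match else None


def needs_clarification(
    text: str, organism: Optional[str], seq_type: Optional[str]
) -> bool:
    """Ask the user only when no accession is given and the organism
    or the sequence type is still unknown."""
    if find_accession(text):
        return False  # a specific accession needs no clarification
    return organism is None or seq_type is None
```

For example, `find_accession("Get sequence for NC_045512.2")` finds the accession, so the request goes straight to retrieval, while a bare "BRCA1 sequence" with no organism still triggers a question.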

Phase 1: Gene/Organism Disambiguation

1.1 Resolve Identifiers

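from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Inputs parsed from the user's request (example values; set whichever apply)
user_provided_accession = None              # e.g. "NC_000913.3"
gene, organism = "BRCA1", "Homo sapiens"    # used when no accession is given

# Strategy depends on input type
if user_provided_accession:
    # Direct retrieval; the accession prefix determines the tool (see 1.2)
    accession = user_provided_accession
elif gene and organism:
    # Search NCBI Nucleotide for candidate records
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )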

1.2 Accession Type Decision Tree

CRITICAL: Accession prefix determines which tools to use.

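| Prefix | Type | Use With |
|--------|------|----------|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| NZ_* | RefSeq genome (complete or WGS assemblies) | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| EMBL format | EMBL | ENA preferred |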
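
The decision tree reduces to one observation: RefSeq accessions carry a two-letter prefix followed by an underscore, while GenBank/EMBL accessions contain no underscore. A minimal sketch of that check follows; `is_refseq` and `allowed_sources` are illustrative names, not part of ToolUniverse.

```python
from typing import List


def is_refseq(accession: str) -> bool:
    """True for accessions shaped like NC_..., NM_..., XP_... (two letters + '_')."""
    return len(accession) > 3 and accession[:2].isalpha() and accession[2] == "_"


def allowed_sources(accession: str) -> List[str]:
    """Databases that may be queried for this accession."""
    return ["NCBI"] if is_refseq(accession) else ["NCBI", "ENA"]
```

Checking the underscore rather than a fixed prefix list means less common RefSeq prefixes are still routed to NCBI only.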

1.3 Identity Resolution Checklist

  • Organism confirmed (scientific name)
  • Gene symbol/name identified
  • Sequence type determined (genomic/mRNA/protein)
  • Strain specified (if relevant)
  • Accession prefix identified → tool selection
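
One way to track the checklist is a small record that reports which items are still open. This is only a sketch: the `ResolvedIdentity` class and its field names are hypothetical, and it assumes a specific accession bypasses the remaining checks, as Phase 0 suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResolvedIdentity:
    organism: Optional[str] = None   # scientific name
    gene: Optional[str] = None       # gene symbol or name
    seq_type: Optional[str] = None   # "genomic", "mrna", or "protein"
    strain: Optional[str] = None     # only when relevant
    accession: Optional[str] = None  # set when the user supplied one

    def unresolved(self, strain_matters: bool = False) -> List[str]:
        """Return the checklist items that still need an answer."""
        if self.accession:
            return []  # a specific accession bypasses the other checks
        required = ["organism", "gene", "seq_type"]
        if strain_matters:
            required.append("strain")
        return [name for name in required if getattr(self, name) is None]
```

An empty list from `unresolved()` means retrieval can start; anything else is a candidate clarification question.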

Phase 2: Data Retrieval (Internal)

Retrieve silently. Do NOT narrate the search process.

2.1 Search for Sequences

# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,  # Optional
    keywords=keywords,  # Optional
    seq_type=seq_type,  # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
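
The snippet above indexes `result["data"]["uids"]` directly, which fails if the search returns an error payload or no hits. A defensive variant might look like the sketch below; `extract_uids` is a hypothetical helper, and the only response shape it assumes is the `{"data": {"uids": [...]}}` layout used above.

```python
from typing import Any, List, Mapping


def extract_uids(result: Any) -> List[str]:
    """Return the UID list from a search result, or [] when it is missing."""
    if not isinstance(result, Mapping):
        return []
    data = result.get("data")
    if not isinstance(data, Mapping):
        return []
    uids = data.get("uids") or []
    return [str(uid) for uid in uids]
```

An empty list is then the signal to broaden the search rather than call the accession lookup.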

2.2 Retrieve Sequence Data

# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)

2.3 ENA Alternative (for GenBank/EMBL accessions)

# Only for non-RefSeq accessions! RefSeq prefixes are two letters plus "_".
if not accession.startswith(("NC_", "NM_", "NR_", "NP_", "NZ_", "XM_", "XP_", "XR_")):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)

Fallback Chains

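| Primary | Fallback | Notes |
|---------|----------|-------|
| NCBI_get_sequence | ENA tools (GenBank/EMBL accessions only) | Use when NCBI is unavailable |
| ena_get_entry | NCBI_get_sequence | ENA does not hold RefSeq records |
| NCBI_search_nucleotide | Retry with broader keywords | Use when the search returns no results |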

Critical Rule: Never call ENA tools with RefSeq accessions (NC_, NM_, etc.); they return 404 errors.
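
The fallback chain and the critical rule can be combined in one small function. The sketch below takes the two fetchers as plain callables (stand-ins for wrappers around `tu.tools.NCBI_get_sequence` and `tu.tools.ena_get_sequence_fasta`) and assumes that a failed call raises an exception; how the real tools signal failure is not specified here.

```python
from typing import Callable, Optional


def fetch_fasta(
    accession: str,
    ncbi_fetch: Callable[[str], str],
    ena_fetch: Callable[[str], str],
) -> Optional[str]:
    """Try NCBI first; fall back to ENA only for non-RefSeq accessions."""
    refseq = len(accession) > 2 and accession[2] == "_"
    try:
        return ncbi_fetch(accession)
    except Exception:
        if refseq:
            return None  # never send RefSeq accessions to ENA
    try:
        return ena_fetch(accession)
    except Exception:
        return None  # both sources failed
```

Passing the fetchers in as arguments keeps the routing logic testable without network access.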


Phase 3: Report Sequence Profile

Output Structure

Present as a Sequence Profile Report. Hide search process.

# Sequence Profile: [Gene/Organism]

**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found

---

## Primary Sequence

### [Accession]: [Definition/Title]

| Attribute | Value |
|-----------|-------|
| **Accession** | [accession] |
| **Type** | RefSeq / GenBank |
| **Organism** | [scientific name] |
| **Strain** | [strain if applicable] |
| **Length** | [X,XXX bp / aa] |
| **Molecule** | DNA / mRNA / Protein |
| **Topology** | Linear / Circular |

**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

### Sequence Statistics
| Statistic | Value |
|-----------|-------|
| **Length** | [X,XXX] bp |
| **GC Content** | [XX.X]% |
| **Genes** | [N] (if genome) |
| **CDS** | [N] (if annotated) |

### Sequence Preview
```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```

### Annotations Summary (from GenBank format)

| Feature | Count | Examples |
|---------|-------|----------|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |

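## Alternative Sequences

Ranked by relevance and curation level:

| Accession | Type | Length | Description | ENA Compatible |
|-----------|------|--------|-------------|----------------|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | No |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | Yes |
| CP001509.3 | GenBank | 4.6 Mb | E. coli DH10B | Yes |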

## Cross-Database References

| Database | Accession | Link |
|----------|-----------|------|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |

## Download Options

### Formats Available

| Format | Description | Use Case |
|--------|-------------|----------|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |

### Direct Commands

# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)

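## Related Sequences

### Other Strains/Isolates

| Accession | Strain | Similarity | Notes |
|-----------|--------|------------|-------|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |

### Protein Products (if applicable)

| Protein Accession | Product Name | Length |
|-------------------|--------------|--------|
| [NP_*] | [protein name] | [X] aa |

Retrieved: [date] | Database: NCBI Nucleotide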
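
The Sequence Statistics and Sequence Preview fields of the template can be filled from a retrieved FASTA record. The helpers below are a minimal sketch under stated assumptions: the function names are hypothetical, the input holds a single record, and GC content is computed over unambiguous DNA bases only (an RNA record written with U would need a small adjustment).

```python
from typing import Tuple


def parse_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:]).upper()


def gc_content(sequence: str) -> float:
    """Percentage of G and C among A/C/G/T bases; ambiguous bases are ignored."""
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / counted


def preview(sequence: str, width: int = 60, max_lines: int = 2) -> str:
    """First few fixed-width lines of the sequence, marked if truncated."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    shown = rows[:max_lines]
    if len(rows) > max_lines:
        shown.append("... [truncated]")
    return "\n".join(shown)
```

For very large records (the "Sequence too large" case), only the length and a download link would be reported instead of calling `preview`.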


---

## Curation Level Tiers

| Tier | Symbol | Accession Prefix | Description |
|------|--------|------------------|-------------|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted |
| GenBank Validated | ●●○○ | Various | Submitted, some curation |
| GenBank Direct | ●○○○ | Various | Direct submission |
| Third Party | ○○○○ | TPA_ | Third-party annotation |
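
The two RefSeq tiers can be assigned from the accession prefix alone. The sketch below covers only those rows; `curation_level` is an illustrative name, and it returns `None` for everything else because the GenBank tiers are listed with "Various" prefixes and would need record metadata to tell apart.

```python
from typing import Optional

# Prefix-decidable tiers, copied from the table above.
PREFIX_TIERS = (
    ("●●●● RefSeq Reference", ("NC_", "NM_", "NP_")),
    ("●●●○ RefSeq Predicted", ("XM_", "XP_", "XR_")),
)


def curation_level(accession: str) -> Optional[str]:
    """Return the tier label for prefix-decidable accessions, else None."""
    for label, prefixes in PREFIX_TIERS:
        if accession.startswith(prefixes):
            return label
    return None  # GenBank and third-party tiers need record metadata
```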

Include in report:
```markdown
**Curation Level**: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
```

Completeness Checklist

Every sequence report MUST include:

Per Sequence (Required)

  • Accession number
  • Organism (scientific name)
  • Sequence type (DNA/RNA/protein)
  • Length
  • Curation level
  • Database source

Search Summary (Required)

  • Query parameters
  • Number of results
  • Ranking rationale

Include Even If Limited

  • Alternative sequences (or "Only one sequence found")
  • Cross-database references (or "No cross-references available")
  • Download instructions
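
The required items can be checked mechanically before a report is returned. This is a sketch only: it assumes the report is held as a dictionary with a `summary` mapping and a `sequences` list, and every key name below is hypothetical.

```python
from typing import Any, Dict, List

REQUIRED_PER_SEQUENCE = (
    "accession", "organism", "seq_type", "length", "curation_level", "database",
)
REQUIRED_SUMMARY = ("query", "result_count", "ranking_rationale")


def missing_fields(report: Dict[str, Any]) -> List[str]:
    """List required checklist fields that are absent from the report."""
    summary = report.get("summary", {})
    missing = [f"summary.{key}" for key in REQUIRED_SUMMARY if summary.get(key) is None]
    for index, seq in enumerate(report.get("sequences", [])):
        missing += [
            f"sequences[{index}].{key}"
            for key in REQUIRED_PER_SEQUENCE
            if seq.get(key) is None
        ]
    return missing
```

Fields are tested against `None` rather than truthiness so that a legitimate result count of 0 still counts as present.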

Common Use Cases

Reference Genome

User: "Get E. coli K-12 complete genome"

result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)

Gene Sequence

User: "Find human BRCA1 mRNA"

result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)

Specific Accession

User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata

Strain Comparison

User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table
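
For the strain comparison case, the comparison table could be rendered from one record per strain once both searches have returned. A minimal sketch, assuming each record has already been reduced to a `(strain, accession, length_bp)` tuple; the function name and column choice are illustrative.

```python
from typing import Iterable, Tuple


def comparison_table(records: Iterable[Tuple[str, str, int]]) -> str:
    """Render one markdown table row per (strain, accession, length_bp)."""
    lines = [
        "| Strain | Accession | Length (bp) |",
        "|--------|-----------|-------------|",
    ]
    for strain, accession, length_bp in records:
        lines.append(f"| {strain} | {accession} | {length_bp:,} |")
    return "\n".join(lines)
```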


Error Handling

| Error | Response |
|-------|----------|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |

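For persistent rate limiting, "wait briefly" can be made concrete with exponential backoff. This is a sketch under assumptions: `RateLimitError` is a hypothetical exception standing in for however the tools report a rate limit, and the retry count and delays are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an API rate-limit failure."""


def call_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, doubling the wait after each rate-limit failure."""
    for attempt in range(retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == retries:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be non-negative")
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.
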
Tool Reference

NCBI Tools (All Accessions)

| Tool | Purpose |
|------|---------|
| NCBI_search_nucleotide | Search by gene/organism |
| NCBI_fetch_accessions | Convert UIDs to accessions |
| NCBI_get_sequence | Retrieve sequence data |

ENA Tools (GenBank/EMBL Only)

| Tool | Purpose |
|------|---------|
| ena_get_entry | Entry metadata |
| ena_get_sequence_fasta | FASTA sequence |
| ena_get_entry_summary | Summary info |

Search Parameters Reference

NCBI_search_nucleotide

| Parameter | Description | Example |
|-----------|-------------|---------|
| operation | Always "search" | "search" |
| organism | Scientific name | "Homo sapiens" |
| gene | Gene symbol | "BRCA1" |
| strain | Specific strain | "K-12" |
| keywords | Free text | "complete genome" |
| seq_type | Sequence type | "complete_genome", "mrna", "refseq" |
| limit | Max results | 10 |
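
Since most of these parameters are optional, it can help to build the argument dictionary in one place and drop anything unset. The helper below is a sketch, not part of ToolUniverse: `build_search_params` is a hypothetical name, and it assumes the tool accepts exactly the keyword arguments listed in the table.

```python
from typing import Any, Dict, Optional


def build_search_params(
    organism: Optional[str] = None,
    gene: Optional[str] = None,
    strain: Optional[str] = None,
    keywords: Optional[str] = None,
    seq_type: Optional[str] = None,
    limit: int = 10,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the search, omitting unset options."""
    if not (organism or gene or keywords):
        raise ValueError("No search criteria provided: add organism, gene, or keywords")
    optional = {
        "organism": organism,
        "gene": gene,
        "strain": strain,
        "keywords": keywords,
        "seq_type": seq_type,
    }
    params: Dict[str, Any] = {"operation": "search", "limit": limit}
    params.update({key: value for key, value in optional.items() if value is not None})
    return params
```

The result would then be passed as `tu.tools.NCBI_search_nucleotide(**params)`, which catches the "No search criteria provided" case before any request is made.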

NCBI_get_sequence

| Parameter | Description | Example |
|-----------|-------------|---------|
| operation | Always "fetch_sequence" | "fetch_sequence" |
| accession | Accession number | "NC_000913.3" |
| format | Output format | "fasta", "genbank" |