Skip to contents

A comprehensive function for automated cell type annotation using multiple Large Language Models (LLMs). This function supports both Seurat's differential gene expression results and custom gene lists as input. It implements a sophisticated annotation pipeline that leverages state-of-the-art LLMs to identify cell types based on marker gene expression patterns.

Usage

annotate_cell_types(
  input,
  tissue_name = NULL,
  model = "gpt-4o",
  api_key = NA,
  top_gene_count = 10,
  debug = FALSE
)

Arguments

input

One of the following:

  • A data frame from Seurat's FindAllMarkers() function containing differential gene expression results (must have columns: 'cluster', 'gene', and 'avg_log2FC'). The function will select the top genes based on avg_log2FC for each cluster.

  • A list where each element has a 'genes' field containing marker genes for a cluster. This can be in one of these formats:

    • Named with cluster IDs: list("0" = list(genes = c(...)), "1" = list(genes = c(...)))

    • Named with cell type names: list(t_cells = list(genes = c(...)), b_cells = list(genes = c(...)))

    • Unnamed list: list(list(genes = c(...)), list(genes = c(...)))

  • For both input types, if cluster IDs are numeric and start from 1, they will be automatically converted to 0-based indexing (e.g., cluster 1 becomes cluster 0) for consistency.

IMPORTANT NOTE ON CLUSTER IDs: The 'cluster' column must contain numeric values or values that can be converted to numeric. Non-numeric cluster IDs (e.g., "cluster_1", "T_cells", "7_0") may cause errors or unexpected behavior. Before using this function, it is recommended to:

  1. Ensure all cluster IDs are numeric or can be cleanly converted to numeric values

  2. If your data contains non-numeric cluster IDs, consider creating a mapping between original IDs and numeric IDs:

    # Example of standardizing cluster IDs
    original_ids <- unique(markers$cluster)
    id_mapping <- data.frame(
      original = original_ids,
      numeric = seq(0, length(original_ids) - 1)
    )
    markers$cluster <- id_mapping$numeric[match(markers$cluster, id_mapping$original)]

tissue_name

Character string specifying the tissue type or cell source (e.g., 'human PBMC', 'mouse brain'). This helps provide context for more accurate annotations.

model

Character string specifying the LLM model to use. Supported models:

  • OpenAI: 'gpt-4o', 'o1'

  • Anthropic: 'claude-3-7-sonnet-20250219', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'

  • DeepSeek: 'deepseek-chat', 'deepseek-reasoner'

  • Google: 'gemini-2.0-flash', 'gemini-2.0-flash-exp', 'gemini-1.5-pro', 'gemini-1.5-flash'

  • Alibaba: 'qwen-max-2025-01-25', 'qwen3-72b'

  • Stepfun: 'step-2-16k', 'step-2-mini', 'step-1-8k'

  • Zhipu: 'glm-4-plus', 'glm-3-turbo'

  • MiniMax: 'minimax-text-01'

  • X.AI: 'grok-3-latest', 'grok-3', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'

  • OpenRouter: Provides access to models from multiple providers through a single API. Format: 'provider/model-name'

    • OpenAI models: 'openai/gpt-4o', 'openai/gpt-4o-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'

    • Anthropic models: 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'

    • Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'

    • Google models: 'google/gemini-2.5-pro-preview-03-25', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'

    • Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'

    • Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'

api_key

Character string containing the API key for the selected model. Each provider requires a specific API key format and authentication method:

  • OpenAI: "sk-..." (obtain from https://platform.openai.com/api-keys)

  • Anthropic: "sk-ant-..." (obtain from https://console.anthropic.com/keys)

  • Google: A Google API key for Gemini models (obtain from https://ai.google.dev/)

  • DeepSeek: API key from DeepSeek platform

  • Qwen: API key from Alibaba Cloud

  • Stepfun: API key from Stepfun AI

  • Zhipu: API key from Zhipu AI

  • MiniMax: API key from MiniMax

  • X.AI: API key for Grok models

  • OpenRouter: "sk-or-..." (obtain from https://openrouter.ai/keys) OpenRouter provides access to multiple models through a single API key

The API key can be provided directly or stored in environment variables:

# Direct API key
result <- annotate_cell_types(input, tissue_name, model="gpt-4o",
                             api_key="sk-...")

# Using environment variables
Sys.setenv(OPENAI_API_KEY="sk-...")
Sys.setenv(ANTHROPIC_API_KEY="sk-ant-...")
Sys.setenv(OPENROUTER_API_KEY="sk-or-...")

# Then use with environment variables
result <- annotate_cell_types(input, tissue_name, model="claude-3-opus",
                             api_key=Sys.getenv("ANTHROPIC_API_KEY"))

If NA, returns the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API.

top_gene_count

Integer specifying the number of top marker genes to use per cluster.

debug

Logical. If TRUE, prints additional debugging information during execution. when input is from Seurat's FindAllMarkers(). Default: 10

Value

A character vector containing:

  • When api_key is provided: One cell type annotation per cluster, in the order of input clusters

  • When api_key is NA: The generated prompt string that would be sent to the LLM

Examples

# Example 1: Using custom gene lists, returning prompt only (no API call)
annotate_cell_types(
  input = list(
    t_cells = list(genes = c('CD3D', 'CD3E', 'CD3G', 'CD28')),
    b_cells = list(genes = c('CD19', 'CD79A', 'CD79B', 'MS4A1')),
    monocytes = list(genes = c('CD14', 'CD68', 'CSF1R', 'FCGR3A'))
  ),
  tissue_name = 'human PBMC',
  model = 'gpt-4o',
  api_key = NA  # Returns prompt only without making API call
)
#> [2025-05-10 07:27:43] Processing input with Model: gpt-4o (Provider: openai)
#> DEBUG: Formatted lines for prompt:
#> [2025-05-10 07:27:43] 
#> Gene lists for each cluster:
#> [2025-05-10 07:27:43] Cluster t_cells: CD3D, CD3E, CD3G, CD28
#> [2025-05-10 07:27:43] Cluster b_cells: CD19, CD79A, CD79B, MS4A1
#> [2025-05-10 07:27:43] Cluster monocytes: CD14, CD68, CSF1R, FCGR3A
#> [2025-05-10 07:27:43] 
#> Generated prompt:
#> [2025-05-10 07:27:43] You are a cell type annotation expert. Below are marker genes for different cell clusters in human PBMC.
#> 
#> 
#> 
#> For each numbered cluster, provide only the cell type name in a new line, without any explanation.
#> [1] "You are a cell type annotation expert. Below are marker genes for different cell clusters in human PBMC.\n\n\n\nFor each numbered cluster, provide only the cell type name in a new line, without any explanation."

# Example 2: Using with Seurat pipeline and OpenAI model
if (FALSE) { # \dontrun{
library(Seurat)

# Load example data
data("pbmc_small")

# Find marker genes
all.markers <- FindAllMarkers(
  object = pbmc_small,
  only.pos = TRUE,
  min.pct = 0.25,
  logfc.threshold = 0.25
)

# Set API key in environment variable (recommended approach)
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")

# Get cell type annotations using OpenAI model
openai_annotations <- annotate_cell_types(
  input = all.markers,
  tissue_name = 'human PBMC',
  model = 'gpt-4o',
  api_key = Sys.getenv("OPENAI_API_KEY"),
  top_gene_count = 15
)

# Example 3: Using Anthropic Claude model
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")

claude_annotations <- annotate_cell_types(
  input = all.markers,
  tissue_name = 'human PBMC',
  model = 'claude-3-opus',
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  top_gene_count = 15
)

# Example 4: Using OpenRouter to access multiple models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")

# Access OpenAI models through OpenRouter
openrouter_gpt4_annotations <- annotate_cell_types(
  input = all.markers,
  tissue_name = 'human PBMC',
  model = 'openai/gpt-4o',  # Note the provider/model format
  api_key = Sys.getenv("OPENROUTER_API_KEY"),
  top_gene_count = 15
)

# Access Anthropic models through OpenRouter
openrouter_claude_annotations <- annotate_cell_types(
  input = all.markers,
  tissue_name = 'human PBMC',
  model = 'anthropic/claude-3-opus',  # Note the provider/model format
  api_key = Sys.getenv("OPENROUTER_API_KEY"),
  top_gene_count = 15
)

# Example 5: Using with mouse brain data
mouse_annotations <- annotate_cell_types(
  input = mouse_markers,  # Your mouse marker genes
  tissue_name = 'mouse brain',  # Specify correct tissue for context
  model = 'gpt-4o',
  api_key = Sys.getenv("OPENAI_API_KEY"),
  top_gene_count = 20,  # Use more genes for complex tissues
  debug = TRUE  # Enable debug output
)
} # }