Cell Type Annotation with Multi-LLM Framework
Source:R/cell_type_annotation.R
annotate_cell_types.RdA comprehensive function for automated cell type annotation using multiple Large Language Models (LLMs). This function supports both Seurat's differential gene expression results and custom gene lists as input. It implements a sophisticated annotation pipeline that leverages state-of-the-art LLMs to identify cell types based on marker gene expression patterns.
A data frame from Seurat's FindAllMarkers() function containing differential gene expression results (must have columns: 'cluster', 'gene', and 'avg_log2FC'). The function will select the top genes based on avg_log2FC for each cluster.
A list where each element has a 'genes' field containing marker genes for a cluster. This can be in one of these formats:
Named with cluster IDs: list("0" = list(genes = c(...)), "1" = list(genes = c(...)))
Named with cell type names: list(t_cells = list(genes = c(...)), b_cells = list(genes = c(...)))
Unnamed list: list(list(genes = c(...)), list(genes = c(...)))
For both input types, if cluster IDs are numeric and start from 1, they will be automatically converted to 0-based indexing (e.g., cluster 1 becomes cluster 0) for consistency.
IMPORTANT NOTE ON CLUSTER IDs: The 'cluster' column must contain numeric values or values that can be converted to numeric. Non-numeric cluster IDs (e.g., "cluster_1", "T_cells", "7_0") may cause errors or unexpected behavior. Before using this function, it is recommended to:
Ensure all cluster IDs are numeric or can be cleanly converted to numeric values
If your data contains non-numeric cluster IDs, consider creating a mapping between original IDs and numeric IDs:
# Example of standardizing cluster IDs original_ids <- unique(markers$cluster) id_mapping <- data.frame( original = original_ids, numeric = seq(0, length(original_ids) - 1) ) markers$cluster <- id_mapping$numeric[match(markers$cluster, id_mapping$original)]
'mouse brain'). This helps provide context for more accurate annotations.
OpenAI: 'gpt-5', 'gpt-5-mini', 'gpt-5-nano', 'gpt-4o', 'gpt-4o-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'
Anthropic: 'claude-opus-4-1-20250805', 'claude-sonnet-4-20250514', 'claude-opus-4-20250514', 'claude-3-7-sonnet-20250219', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'
DeepSeek: 'deepseek-chat', 'deepseek-r1', 'deepseek-r1-zero', 'deepseek-reasoner'
Google: 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-1.5-pro-latest', 'gemini-1.5-flash-latest', 'gemini-1.5-flash-8b'
Alibaba: 'qwen-max-2025-01-25', 'qwen3-72b'
Stepfun: 'step-2-16k', 'step-2-mini', 'step-1-8k'
Zhipu: 'glm-4-plus', 'glm-3-turbo'
MiniMax: 'minimax-text-01'
X.AI: 'grok-3-latest', 'grok-3', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'
OpenRouter: Provides access to models from multiple providers through a single API. Format: 'provider/model-name'
OpenAI models: 'openai/gpt-5', 'openai/gpt-5-mini', 'openai/gpt-4o', 'openai/gpt-4o-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'
Anthropic models: 'anthropic/claude-opus-4.1', 'anthropic/claude-sonnet-4', 'anthropic/claude-opus-4', 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'
Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'
Google models: 'google/gemini-2.5-pro', 'google/gemini-2.5-flash', 'google/gemini-2.0-flash', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'
Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'
Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b' Each provider requires a specific API key format and authentication method:
OpenAI: "sk-..." (obtain from OpenAI platform)
Anthropic: "sk-ant-..." (obtain from Anthropic console)
Google: A Google API key for Gemini models (obtain from Google AI)
DeepSeek: API key from DeepSeek platform
Qwen: API key from Alibaba Cloud
Stepfun: API key from Stepfun AI
Zhipu: API key from Zhipu AI
MiniMax: API key from MiniMax
X.AI: API key for Grok models
OpenRouter: "sk-or-..." (obtain from OpenRouter) OpenRouter provides access to multiple models through a single API key
The API key can be provided directly or stored in environment variables:
# Direct API key
result <- annotate_cell_types(input, tissue_name, model="gpt-5",
api_key="sk-...")
# Using environment variables
Sys.setenv(OPENAI_API_KEY="sk-...")
Sys.setenv(ANTHROPIC_API_KEY="sk-ant-...")
Sys.setenv(OPENROUTER_API_KEY="sk-or-...")
# Then use with environment variables
result <- annotate_cell_types(input, tissue_name, model="claude-3-opus",
api_key=Sys.getenv("ANTHROPIC_API_KEY"))If NA, returns the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API. when input is from Seurat's FindAllMarkers(). Default: 10
A single character string: Applied to all providers (e.g., "https://api.proxy.com/v1")
A named list: Provider-specific URLs (e.g., list(openai = "https://openai-proxy.com/v1", anthropic = "https://anthropic-proxy.com/v1")). This is useful for:
Users accessing international APIs through proxies
Enterprise users with internal API gateways
Development/testing with local or alternative endpoints If NULL (default), uses official API endpoints for each provider.
Usage
annotate_cell_types(
input,
tissue_name = NULL,
model = "gpt-5",
api_key = NA,
top_gene_count = 10,
debug = FALSE,
base_urls = NULL
)Arguments
- input
Either a data frame from Seurat's FindAllMarkers() containing columns 'cluster', 'gene', and 'avg_log2FC', or a list with 'genes' field for each cluster
- tissue_name
Optional tissue context (e.g., 'human PBMC', 'mouse brain') for more accurate annotations
- model
Model name to use. Default: 'gpt-5'. See details for supported models
- api_key
API key for the selected model provider. If NA, returns prompt only
- top_gene_count
Number of top genes to use per cluster when input is from Seurat. Default: 10
- debug
Logical indicating whether to enable debug output. Default: FALSE
- base_urls
Optional base URLs for API endpoints. Can be a string or named list for custom endpoints
Value
When api_key is provided: Vector of cell type annotations per cluster. When api_key is NA: The generated prompt string
Examples
# Example 1: Using custom gene lists, returning prompt only (no API call)
annotate_cell_types(
input = list(
t_cells = list(genes = c('CD3D', 'CD3E', 'CD3G', 'CD28')),
b_cells = list(genes = c('CD19', 'CD79A', 'CD79B', 'MS4A1')),
monocytes = list(genes = c('CD14', 'CD68', 'CSF1R', 'FCGR3A'))
),
tissue_name = 'human PBMC',
model = 'gpt-5',
api_key = NA # Returns prompt only without making API call
)
#> {"timestamp":"2025-11-23 02:57:31","session_id":"20251123_025731","level":"INFO","message":"Unified logger initialized","context":{"session_id":"20251123_025731","log_level":"INFO","log_dir":"logs"},"pid":10815}
#> {"timestamp":"2025-11-23 02:57:31","session_id":"20251123_025731","level":"INFO","message":"Processing input with model and provider","context":{"model":"gpt-5","provider":"openai","custom_url":false},"pid":10815}
#> DEBUG: Formatted lines for prompt:
#> [1] "You are a cell type annotation expert. Below are marker genes for different cell clusters in human PBMC.\n\n\n\nFor each numbered cluster, provide only the cell type name in a new line, without any explanation."
# Example 2: Using with Seurat pipeline and OpenAI model
if (FALSE) { # \dontrun{
library(Seurat)
# Load example data
data("pbmc_small")
# Find marker genes
all.markers <- FindAllMarkers(
object = pbmc_small,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25
)
# Set API key in environment variable (recommended approach)
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")
# Get cell type annotations using OpenAI model
openai_annotations <- annotate_cell_types(
input = all.markers,
tissue_name = 'human PBMC',
model = 'gpt-5',
api_key = Sys.getenv("OPENAI_API_KEY"),
top_gene_count = 15
)
# Example 3: Using Anthropic Claude model
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")
claude_annotations <- annotate_cell_types(
input = all.markers,
tissue_name = 'human PBMC',
model = 'claude-3-opus',
api_key = Sys.getenv("ANTHROPIC_API_KEY"),
top_gene_count = 15
)
# Example 4: Using OpenRouter to access multiple models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")
# Access OpenAI models through OpenRouter
openrouter_gpt4_annotations <- annotate_cell_types(
input = all.markers,
tissue_name = 'human PBMC',
model = 'openai/gpt-5', # Note the provider/model format
api_key = Sys.getenv("OPENROUTER_API_KEY"),
top_gene_count = 15
)
# Access Anthropic models through OpenRouter
openrouter_claude_annotations <- annotate_cell_types(
input = all.markers,
tissue_name = 'human PBMC',
model = 'anthropic/claude-3-opus', # Note the provider/model format
api_key = Sys.getenv("OPENROUTER_API_KEY"),
top_gene_count = 15
)
# Example 5: Using with mouse brain data
mouse_annotations <- annotate_cell_types(
input = mouse_markers, # Your mouse marker genes
tissue_name = 'mouse brain', # Specify correct tissue for context
model = 'gpt-5',
api_key = Sys.getenv("OPENAI_API_KEY"),
top_gene_count = 20, # Use more genes for complex tissues
debug = TRUE # Enable debug output
)
} # }