
Interactive consensus building for cell type annotation
Source: R/consensus_annotation.R
interactive_consensus_annotation.Rd
This function implements an interactive voting and discussion mechanism where multiple LLMs collaborate to reach a consensus on cell type annotations, particularly focusing on clusters with low agreement. The process includes:
Initial voting by all LLMs
Identification of controversial clusters
Detailed discussion for controversial clusters
Final summary by a designated LLM (default: Claude)
Usage
interactive_consensus_annotation(
  input,
  tissue_name = NULL,
  models = c("claude-sonnet-4-20250514", "claude-3-7-sonnet-20250219",
    "claude-3-5-sonnet-20241022", "claude-3-5-haiku-20241022", "gemini-2.0-flash",
    "gemini-1.5-pro", "qwen-max-2025-01-25", "gpt-4o", "grok-3-latest"),
  api_keys,
  top_gene_count = 10,
  controversy_threshold = 0.7,
  entropy_threshold = 1,
  max_discussion_rounds = 3,
  consensus_check_model = NULL,
  log_dir = "logs",
  cache_dir = "consensus_cache",
  use_cache = TRUE,
  base_urls = NULL,
  clusters_to_analyze = NULL,
  force_rerun = FALSE
)
Arguments
- input
One of the following:
A data frame from Seurat's FindAllMarkers() function containing differential gene expression results (must have columns: 'cluster', 'gene', and 'avg_log2FC'). The function will select the top genes based on avg_log2FC for each cluster.
A list where each element has a 'genes' field containing marker genes for a cluster. This can be in one of these formats:
Named with cluster IDs: list("0" = list(genes = c(...)), "1" = list(genes = c(...)))
Named with cell type names: list(t_cells = list(genes = c(...)), b_cells = list(genes = c(...)))
Unnamed list: list(list(genes = c(...)), list(genes = c(...)))
For both input types, if cluster IDs are numeric and start from 1, they will be automatically converted to 0-based indexing (e.g., cluster 1 becomes cluster 0) for consistency.
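The data frame path described above can be sketched in base R. This is only an illustration, not the package's internal code: `top_genes_per_cluster()` and `to_zero_based()` are hypothetical helper names, the marker values are made up, and "start from 1" is interpreted here as "the smallest numeric ID equals 1".

```r
# Toy marker table with the three required columns (values are made up).
markers <- data.frame(
  cluster    = c(1, 1, 1, 2, 2, 2),
  gene       = c("CD3D", "CD3E", "IL7R", "MS4A1", "CD79A", "CD79B"),
  avg_log2FC = c(2.1, 3.0, 1.2, 2.5, 1.9, 3.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n genes with the largest avg_log2FC per cluster
# and return them in the list format accepted by `input`.
top_genes_per_cluster <- function(df, n = 10) {
  lapply(split(df, df$cluster), function(d) {
    d <- d[order(d$avg_log2FC, decreasing = TRUE), ]
    list(genes = head(d$gene, n))
  })
}

# Hypothetical helper: shift numeric IDs whose minimum is 1 down to 0-based IDs.
# Non-numeric IDs (e.g. cell type names) are returned unchanged.
to_zero_based <- function(ids) {
  num <- suppressWarnings(as.numeric(ids))
  if (!anyNA(num) && min(num) == 1) as.character(num - 1) else as.character(ids)
}

gene_lists <- top_genes_per_cluster(markers, n = 2)
names(gene_lists) <- to_zero_based(names(gene_lists))
str(gene_lists)
```

The resulting `gene_lists` has the same shape as the "named with cluster IDs" list format, so it shows how the two input types relate.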
- tissue_name
Optional tissue name (default: NULL)
- models
Vector of model names to participate in the discussion. Supported models:
OpenAI: 'gpt-4o', 'gpt-4o-mini', 'gpt-4.1', 'gpt-4.1-mini', 'gpt-4.1-nano', 'gpt-4-turbo', 'gpt-3.5-turbo', 'o1', 'o1-mini', 'o1-preview', 'o1-pro'
Anthropic: 'claude-sonnet-4-20250514', 'claude-opus-4-20250514', 'claude-3-7-sonnet-20250219', 'claude-3-5-sonnet-20241022', 'claude-3-5-haiku-20241022', 'claude-3-opus-20240229'
DeepSeek: 'deepseek-chat', 'deepseek-r1', 'deepseek-r1-zero', 'deepseek-reasoner'
Google: 'gemini-2.5-pro', 'gemini-2.5-flash', 'gemini-2.0-flash', 'gemini-2.0-flash-lite', 'gemini-1.5-pro-latest', 'gemini-1.5-flash-latest', 'gemini-1.5-flash-8b'
Alibaba: 'qwen-max-2025-01-25', 'qwen3-72b'
Stepfun: 'step-2-16k', 'step-2-mini', 'step-1-8k'
Zhipu: 'glm-4-plus', 'glm-3-turbo'
MiniMax: 'minimax-text-01'
X.AI: 'grok-3-latest', 'grok-3', 'grok-3-fast', 'grok-3-fast-latest', 'grok-3-mini', 'grok-3-mini-latest', 'grok-3-mini-fast', 'grok-3-mini-fast-latest'
OpenRouter: Provides access to models from multiple providers through a single API. Format: 'provider/model-name'
OpenAI models: 'openai/gpt-4o', 'openai/gpt-4o-mini', 'openai/gpt-4-turbo', 'openai/gpt-4', 'openai/gpt-3.5-turbo'
Anthropic models: 'anthropic/claude-sonnet-4', 'anthropic/claude-opus-4', 'anthropic/claude-3.7-sonnet', 'anthropic/claude-3.5-sonnet', 'anthropic/claude-3.5-haiku', 'anthropic/claude-3-opus'
Meta models: 'meta-llama/llama-3-70b-instruct', 'meta-llama/llama-3-8b-instruct', 'meta-llama/llama-2-70b-chat'
Google models: 'google/gemini-2.5-pro', 'google/gemini-2.5-flash', 'google/gemini-2.0-flash', 'google/gemini-1.5-pro-latest', 'google/gemini-1.5-flash'
Mistral models: 'mistralai/mistral-large', 'mistralai/mistral-medium', 'mistralai/mistral-small'
Other models: 'microsoft/mai-ds-r1', 'perplexity/sonar-small-chat', 'cohere/command-r', 'deepseek/deepseek-chat', 'thudm/glm-z1-32b'
- api_keys
Named list of API keys. Can be provided in two formats:
With provider names as keys:
list("openai" = "sk-...", "anthropic" = "sk-ant-...", "openrouter" = "sk-or-...")
With model names as keys:
list("gpt-4o" = "sk-...", "claude-3-opus" = "sk-ant-...")
The system first tries to find the API key using the provider name. If not found, it then tries using the model name. Example:
api_keys <- list(
  "openai" = Sys.getenv("OPENAI_API_KEY"),
  "anthropic" = Sys.getenv("ANTHROPIC_API_KEY"),
  "openrouter" = Sys.getenv("OPENROUTER_API_KEY"),
  "claude-3-opus" = "sk-ant-api03-specific-key-for-opus"
)
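The provider-first, model-second lookup order can be sketched as follows. `find_api_key()` is a hypothetical helper, not a function exported by the package; the provider is passed in explicitly because the model-to-provider mapping is internal to the package. Treating an empty string (what `Sys.getenv()` returns for an unset variable) as "not found" is an assumption of this sketch.

```r
# Hypothetical helper: look up a key by provider name first, then by model name.
# Returns NULL when neither lookup yields a non-empty key.
find_api_key <- function(model, provider, api_keys) {
  key <- api_keys[[provider]]
  # Assumption: an empty string counts as a missing key.
  if (is.null(key) || !nzchar(key)) key <- api_keys[[model]]
  if (is.null(key) || !nzchar(key)) return(NULL)
  key
}

# Placeholder keys for illustration only.
keys <- list(
  "openai" = "sk-openai-placeholder",
  "claude-3-opus" = "sk-ant-placeholder"
)

find_api_key("gpt-4o", "openai", keys)            # found via provider name
find_api_key("claude-3-opus", "anthropic", keys)  # found via model name
find_api_key("gemini-2.0-flash", "google", keys)  # no key available
```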
- top_gene_count
Number of top differentially expressed genes to use for each cluster (default: 10)
- controversy_threshold
Consensus proportion threshold (default: 0.7). Clusters with consensus proportion below this value will be marked as controversial
- entropy_threshold
Entropy threshold for identifying controversial clusters (default: 1.0)
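To make the two thresholds concrete, the sketch below computes both metrics for one cluster's votes. It rests on several assumptions that this page does not confirm: consensus proportion is taken as the share of votes for the most common label, entropy is Shannon entropy in bits (log base 2), and a cluster is flagged when either threshold is crossed. `vote_metrics()` and `is_controversial()` are hypothetical names.

```r
# Hypothetical helper: agreement metrics for the votes cast on one cluster.
vote_metrics <- function(votes) {
  p <- as.numeric(table(votes)) / length(votes)  # share of each distinct label
  list(
    consensus_proportion = max(p),   # assumed: share of the most common label
    entropy = -sum(p * log2(p))      # assumed: Shannon entropy in bits
  )
}

# Assumed rule: controversial if agreement is too low OR entropy is too high.
is_controversial <- function(votes,
                             controversy_threshold = 0.7,
                             entropy_threshold = 1.0) {
  m <- vote_metrics(votes)
  m$consensus_proportion < controversy_threshold || m$entropy > entropy_threshold
}

agree       <- c("T cell", "T cell", "T cell", "T cell")
split_votes <- c("T cell", "NK cell", "T cell", "NK cell", "ILC")

is_controversial(agree)        # unanimous votes
is_controversial(split_votes)  # votes spread over three labels
```

With unanimous votes the proportion is 1 and the entropy is 0, so the cluster passes both checks; the split votes give a top share of 0.4, which is already below the default `controversy_threshold`.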
- max_discussion_rounds
Maximum number of discussion rounds for controversial clusters (default: 3)
- consensus_check_model
Model to use for consensus checking (default: NULL)
- log_dir
Directory for storing logs (default: "logs")
- cache_dir
Directory for storing cached results (default: "consensus_cache")
- use_cache
Logical. Whether to use cached results (default: TRUE)
- base_urls
Optional custom base URLs for API endpoints. Can be:
A single character string: Applied to all providers (e.g., "https://api.proxy.com/v1")
A named list: Provider-specific URLs (e.g., list(openai = "https://openai-proxy.com/v1", anthropic = "https://anthropic-proxy.com/v1"))
This is useful for:
Chinese users accessing international APIs through proxies
Enterprise users with internal API gateways
Development/testing with local or alternative endpoints
If NULL (default), uses official API endpoints for each provider.
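The three accepted shapes of base_urls can be illustrated with a small resolver. `resolve_base_url()` is a hypothetical helper written for this example; in particular, falling back to the official endpoint (represented here by NULL) for a provider that is absent from a named list is an assumption, not documented behavior.

```r
# Hypothetical helper: pick the base URL for one provider.
# A NULL result means "use the provider's official endpoint".
resolve_base_url <- function(provider, base_urls = NULL) {
  if (is.null(base_urls)) return(NULL)
  # A single string applies to every provider.
  if (is.character(base_urls) && length(base_urls) == 1) return(base_urls)
  # A named list is provider-specific; missing providers yield NULL (assumed).
  if (is.list(base_urls)) return(base_urls[[provider]])
  stop("base_urls must be NULL, a single string, or a named list")
}

urls <- list(openai = "https://openai-proxy.com/v1")

resolve_base_url("openai")                              # NULL: official endpoint
resolve_base_url("openai", "https://api.proxy.com/v1")  # shared proxy
resolve_base_url("openai", urls)                        # provider-specific URL
resolve_base_url("anthropic", urls)                     # NULL: not in the list
```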
- clusters_to_analyze
Optional vector of cluster IDs to analyze. If NULL (default), all clusters in the input are analyzed. Values must be character or numeric and must match the cluster IDs in your input, for example c(0, 2, 5) or c("0", "2", "5") for numeric clusters.
This lets you focus on specific clusters without filtering the input data.
Non-existent cluster IDs are ignored with a warning.
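The selection behavior described for clusters_to_analyze can be sketched like this. `select_clusters()` is a hypothetical helper and the gene lists are toy data; the sketch only shows the matching of numeric or character IDs and the warning for IDs that do not exist, and makes no claim about how the package implements it.

```r
# Hypothetical helper: keep only the requested clusters, warn about unknown IDs.
select_clusters <- function(gene_lists, clusters_to_analyze = NULL) {
  if (is.null(clusters_to_analyze)) return(gene_lists)  # NULL means all clusters
  wanted <- as.character(clusters_to_analyze)           # c(0, 2) and c("0", "2") match alike
  missing_ids <- setdiff(wanted, names(gene_lists))
  if (length(missing_ids) > 0) {
    warning("Ignoring non-existent cluster IDs: ",
            paste(missing_ids, collapse = ", "))
  }
  gene_lists[intersect(wanted, names(gene_lists))]
}

# Toy input in the "named with cluster IDs" list format.
gene_lists <- list(
  "0" = list(genes = c("CD3D", "CD3E")),
  "2" = list(genes = c("MS4A1", "CD79A")),
  "5" = list(genes = c("NKG7", "GNLY"))
)

subset_a <- select_clusters(gene_lists, c(0, 2))
subset_b <- suppressWarnings(select_clusters(gene_lists, c("2", "9")))
names(subset_a)
names(subset_b)
```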
- force_rerun
Logical. If TRUE, ignore cached results and force re-analysis of all specified clusters. Useful when you want to re-analyze clusters with different context or for subtype identification. Default is FALSE. Note: This parameter only affects the discussion phase for controversial clusters.