
Getting Started with mLLMCelltype
Chen Yang
2025-05-10
Source: vignettes/03-getting-started.Rmd
This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We’ll cover the basic workflow, input data requirements, and a simple example to get you started.
Basic Workflow
The mLLMCelltype workflow consists of these main steps:
- Prepare marker gene data for each cluster
- Run annotation using one or multiple LLMs
- Create consensus from multiple model predictions (optional)
- Integrate results with your Seurat or Scanpy object
- Visualize results with uncertainty metrics
Loading the Package and Setting Up API Keys
First, load the mLLMCelltype package:
library(mLLMCelltype)
Setting Up API Keys
Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:
# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key") # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key") # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key") # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models
You can obtain API keys from:
- Anthropic: https://console.anthropic.com/keys
- OpenAI: https://platform.openai.com/api-keys
- Google (Gemini): https://ai.google.dev/
- OpenRouter: https://openrouter.ai/keys
Alternatively, you can provide API keys directly in function calls:
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-3-7-sonnet-20250219",
  api_key = "your-anthropic-api-key", # Direct API key
  top_gene_count = 10
)
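Before making any calls, it can help to confirm which keys are actually visible to the current R session. The following is a minimal sketch in base R; the variable names `key_vars` and `keys_set` are illustrative and not part of mLLMCelltype.

```r
# Environment variables used in this guide
key_vars <- c(
  "ANTHROPIC_API_KEY",
  "OPENAI_API_KEY",
  "GEMINI_API_KEY",
  "OPENROUTER_API_KEY"
)

# Sys.getenv() returns "" for unset variables, so nzchar() flags the ones that are set
keys_set <- vapply(key_vars, function(v) nzchar(Sys.getenv(v)), logical(1))

# Named logical vector: TRUE means the key is available in this session
print(keys_set)
```

Only the providers you plan to use need to show `TRUE`.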
Input Data Requirements
mLLMCelltype accepts marker gene data in several formats:
1. Data Frame Format
A data frame with the following columns:
- cluster: Cluster ID (must be 0-based)
- gene: Gene name/symbol
- avg_log2FC or similar metric: Log fold change
- p_val_adj or similar metric: Adjusted p-value
Example:
# Example marker data frame
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
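A quick sanity check on the data frame can catch missing columns or 1-based cluster IDs before any API call is made. The helper below, `check_marker_df()`, is a hypothetical function written for this guide and is not part of mLLMCelltype; it only checks the two columns named above as fixed (`cluster` and `gene`) and assumes cluster IDs can be read as numbers.

```r
# Hypothetical helper (not part of mLLMCelltype): basic checks on a marker data frame
check_marker_df <- function(markers) {
  required <- c("cluster", "gene")
  missing_cols <- setdiff(required, names(markers))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Cluster IDs may arrive as a factor, so convert via character first
  cluster_ids <- as.numeric(as.character(markers$cluster))
  if (min(cluster_ids, na.rm = TRUE) != 0) {
    warning("Cluster IDs do not start at 0; this guide expects 0-based IDs")
  }
  invisible(TRUE)
}

markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)

# Passes silently for the example data
check_marker_df(markers_df)
```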
2. Seurat FindAllMarkers Output
You can directly use the output from Seurat’s FindAllMarkers() function:
# Assuming you have a Seurat object named 'seurat_obj'
library(Seurat)
all_markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
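To preview which genes are likely to matter for each cluster, you can rank markers by fold change yourself. This is a sketch in base R on a small mock table with Seurat-style column names; `top_n_per_cluster()` is a hypothetical helper, and the package's own gene selection (controlled by `top_gene_count`) may differ from this ranking.

```r
# Mock marker table with the same column names as the data frame format above
all_markers <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n genes with the highest avg_log2FC in each cluster
top_n_per_cluster <- function(markers, n = 2) {
  ordered <- markers[order(markers$cluster, -markers$avg_log2FC), ]
  # Position of each row within its cluster after sorting
  rank_in_cluster <- ave(seq_len(nrow(ordered)), ordered$cluster, FUN = seq_along)
  ordered[rank_in_cluster <= n, ]
}

top2 <- top_n_per_cluster(all_markers, n = 2)
print(top2[, c("cluster", "gene", "avg_log2FC")])
```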
Function Parameters
The annotate_cell_types function has the following parameters:
Parameter | Description | Default Value |
---|---|---|
input | Marker gene data (data frame, list, or file path) | (required) |
tissue_name | Tissue name (e.g., “human PBMC”, “mouse brain”) | NULL |
model | LLM model to use | "gpt-4o" |
api_key | API key (if not set in environment) | NA |
top_gene_count | Number of top genes per cluster to use | 10 |
debug | Whether to print debugging information | FALSE |
Note: If api_key is set to NA, the function will return the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API.
Basic Usage Example
Here’s a simple example using a single LLM model for annotation:
# Example marker data
markers <- data.frame(
  cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
  avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
  p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002)
)
# Run annotation with a single model
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-3-7-sonnet-20250219",
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  top_gene_count = 10,
  debug = FALSE # Set to TRUE for more detailed output
)
# Print results
print(results)
Multi-Model Consensus Example
For more reliable annotations, you can use multiple models and create a consensus:
# Define models to use
models <- c(
  "claude-3-7-sonnet-20250219", # Anthropic
  "gpt-4o", # OpenAI
  "gemini-1.5-pro" # Google
)
# API keys for different providers
api_keys <- list(
  anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
  openai = Sys.getenv("OPENAI_API_KEY"),
  gemini = Sys.getenv("GEMINI_API_KEY")
)
# Run annotation with multiple models
results <- list()
for (model in models) {
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]
  results[[model]] <- annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )
}
# Create consensus
consensus_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human PBMC",
  models = models, # Use all the models defined above
  api_keys = api_keys,
  controversy_threshold = 0.7,
  entropy_threshold = 1.0,
  consensus_check_model = "claude-3-7-sonnet-20250219"
)
# Print consensus results
print_consensus_summary(consensus_results)
Consensus Output Example
The consensus results contain more detailed information:
> print_consensus_summary(consensus_results)
Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters
Cluster 0:
Final annotation: T cells
Consensus proportion: 1.0
Entropy: 0.0
Model predictions:
- claude-3-7-sonnet-20250219: T cells
- gpt-4o: T cells
- gemini-1.5-pro: T cells
Cluster 1:
Final annotation: Monocytes
Consensus proportion: 1.0
Entropy: 0.0
Model predictions:
- claude-3-7-sonnet-20250219: Monocytes
- gpt-4o: Monocytes
- gemini-1.5-pro: Monocytes
Integrating with Seurat
To add the annotations to your Seurat object:
# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)
# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = as.character(0:(length(consensus_results$final_annotations)-1)),
  to = consensus_results$final_annotations
)
# Extract consensus metrics from the consensus results
# Note: These metrics are available in the consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) {
  metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
  return(list(
    cluster = cluster_id,
    consensus_proportion = metrics$consensus_proportion,
    entropy = metrics$entropy
  ))
})
# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))
# Add consensus proportion to Seurat object
# (mapvalues() returns character here because x is character, so convert back to numeric)
seurat_obj$consensus_proportion <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$consensus_proportion
))
# Add entropy to Seurat object (same conversion as above)
seurat_obj$entropy <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$entropy
))
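The cluster-to-label mapping itself can be tried without a Seurat object. The sketch below uses mock data and a named character vector as a lookup table; it assumes the annotations are keyed by cluster ID as a string, which is an assumption of this example and should be checked against your own `consensus_results`.

```r
# Mock per-cell cluster identities, standing in for as.character(Idents(seurat_obj))
cell_clusters <- c("0", "1", "1", "0", "1")

# Mock annotations keyed by cluster ID (assumed structure for this sketch)
final_annotations <- c("0" = "T cells", "1" = "Monocytes")

# Indexing a named vector by the cluster IDs yields one label per cell
cell_types <- unname(final_annotations[cell_clusters])
print(cell_types)
```

Cluster IDs with no entry in the lookup come back as `NA`, which makes unannotated clusters easy to spot.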
Basic Visualization
Here’s a simple visualization of the results using Seurat:
library(ggplot2) # ggtitle() and theme() come from ggplot2
# Plot UMAP with cell type annotations
DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) +
  ggtitle("Cell Type Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
Understanding the Output
The output of annotate_cell_types() is a vector of cell type annotations, where each element corresponds to a cluster.
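Because each model returns one label per cluster, per-model outputs can be lined up side by side to see where they disagree. The following sketch uses mock vectors in place of real annotate_cell_types() results; the model names and labels are made up for illustration, and agreement is judged by exact string match only.

```r
# Mock per-model outputs: one label per cluster, in cluster order (0, 1)
mock_results <- list(
  model_a = c("T cells", "Monocytes"),
  model_b = c("T cells", "Monocytes"),
  model_c = c("T cells", "Macrophages")
)

# One row per cluster, one column per model
comparison <- data.frame(
  cluster = seq_along(mock_results[[1]]) - 1,
  do.call(cbind, mock_results),
  stringsAsFactors = FALSE
)

# Flag clusters where every model gave exactly the same label
model_cols <- names(mock_results)
comparison$all_agree <- vapply(
  seq_len(nrow(comparison)),
  function(i) length(unique(unlist(comparison[i, model_cols]))) == 1,
  logical(1)
)

print(comparison)
```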
The output of interactive_consensus_annotation() is a list containing:
- final_annotations: Final consensus cell type annotations
- initial_results: Initial predictions from each model
- controversial_clusters: List of clusters that required discussion
- discussion_logs: Detailed logs of the discussion process
- session_id: Unique identifier for the annotation session
Understanding Uncertainty Metrics
When using consensus annotation, two key metrics help evaluate the reliability of annotations:
- Consensus Proportion: Ranges from 0 to 1, indicating the proportion of models that agree on the final annotation. Higher values indicate stronger agreement.
- Entropy: Measures the uncertainty in model predictions. Lower values indicate more certainty. An entropy of 0 means all models agree perfectly.
Clusters with low consensus proportion or high entropy may require manual review.
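To build intuition for these two metrics, the sketch below computes them for a single cluster from a vector of model predictions. It assumes Shannon entropy with a base-2 logarithm over exact label matches; `agreement_metrics()` is a hypothetical helper, and the package may normalize labels or compute the values differently.

```r
# Hypothetical helper (not part of mLLMCelltype): agreement metrics for one cluster
agreement_metrics <- function(predictions) {
  counts <- table(predictions)
  probs <- as.numeric(counts) / length(predictions)
  list(
    # Share of models that voted for the most common label
    consensus_proportion = max(probs),
    # Shannon entropy in bits; 0 when all models give the same label
    entropy = -sum(probs * log2(probs))
  )
}

# Full agreement: proportion 1, entropy 0
full <- agreement_metrics(c("T cells", "T cells", "T cells"))

# Two of three models agree: proportion 2/3, entropy above 0
split <- agreement_metrics(c("T cells", "T cells", "NK cells"))

print(full)
print(split)
```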
Using OpenRouter Free Models
If you don’t have access to paid API keys, you can use OpenRouter’s free models:
# Set OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")
# Use a free model
free_results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free", # Note the :free suffix
  api_key = Sys.getenv("OPENROUTER_API_KEY"),
  top_gene_count = 10
)
# Print results
print(free_results)
Available free models include:
- meta-llama/llama-4-maverick:free - Meta Llama 4 Maverick (256K context)
- nvidia/llama-3.1-nemotron-ultra-253b-v1:free - NVIDIA Nemotron Ultra 253B
- deepseek/deepseek-chat-v3-0324:free - DeepSeek Chat v3
- microsoft/mai-ds-r1:free - Microsoft MAI-DS-R1
Free models don’t consume credits but may have limitations compared to paid models.
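If you keep a mixed list of model names, a simple guard can make sure only free OpenRouter models are used. This sketch relies only on the `:free` suffix convention noted above; the `candidate_models` vector is illustrative.

```r
# Illustrative mix of model names
candidate_models <- c(
  "meta-llama/llama-4-maverick:free",
  "gpt-4o",
  "deepseek/deepseek-chat-v3-0324:free"
)

# Keep only names that end with the :free suffix
free_only <- candidate_models[endsWith(candidate_models, ":free")]
print(free_only)
```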
Troubleshooting
Common Issues
- API Key Not Found: Ensure you’ve set the correct API key environment variable or provided it directly in the function call.
- Rate Limiting: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.
- Invalid Model Name: Check that you’re using a supported model name and that it’s spelled correctly.
- Network Issues: Check your internet connection and try again. If the problem persists, the API service might be down.
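For transient failures such as rate limits or brief network drops, wrapping a call in a retry loop with increasing waits is one common approach. The helper below, `with_retry()`, is a hypothetical base R sketch and not part of mLLMCelltype; the example uses a simulated flaky function rather than a real API call.

```r
# Hypothetical helper: retry a function with exponential backoff between attempts
with_retry <- function(fn, max_attempts = 3, base_wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_attempts) {
      # Wait base_wait, then 2 * base_wait, then 4 * base_wait seconds, and so on
      Sys.sleep(base_wait * 2^(attempt - 1))
    }
  }
  stop("All ", max_attempts, " attempts failed: ", conditionMessage(result))
}

# Simulated call that fails twice before succeeding
state <- new.env()
state$calls <- 0
flaky <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("simulated rate limit")
  "ok"
}

out <- with_retry(flaky, max_attempts = 3, base_wait = 0)
print(out)
```

In practice, `fn` would be a function that wraps your own annotate_cell_types() call, with `base_wait` set to a few seconds or more.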
Next Steps
Now that you understand the basics of mLLMCelltype, you can explore:
- Usage Tutorial: More detailed usage examples
- Consensus Annotation Principles: Learn about the consensus mechanism
- Visualization Guide: Create publication-ready visualizations
- Advanced Features: Explore hierarchical annotation and other advanced features
- FAQ: Answers to common questions
If you encounter any issues, check the FAQ or open an issue on our GitHub repository.