Getting Started with mLLMCelltype

This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We’ll cover the basic workflow, input data requirements, and a simple example to get you started.

Basic Workflow

The mLLMCelltype workflow consists of these main steps:

  1. Prepare marker gene data for each cluster
  2. Run annotation using one or multiple LLMs
  3. Create consensus from multiple model predictions (optional)
  4. Integrate results with your Seurat or Scanpy object
  5. Visualize results with uncertainty metrics

Loading the Package and Setting Up API Keys

First, load the mLLMCelltype package:

library(mLLMCelltype)

Setting Up API Keys

Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:

# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")  # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")        # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key")        # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models

You can obtain API keys from:

  • Anthropic: https://console.anthropic.com/keys
  • OpenAI: https://platform.openai.com/api-keys
  • Google (Gemini): https://ai.google.dev/
  • OpenRouter: https://openrouter.ai/keys
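Before making any API calls, it can help to confirm which keys are actually visible to your R session. The following is a minimal sketch in base R, not part of mLLMCelltype; the variable names are the ones used in the setup snippet above, and `key_status` is an illustrative name.

```r
# Environment variable names taken from the setup snippet above.
key_vars <- c(
  "ANTHROPIC_API_KEY",
  "OPENAI_API_KEY",
  "GEMINI_API_KEY",
  "OPENROUTER_API_KEY"
)

# Sys.getenv() returns "" for unset variables, so nzchar() marks the ones that are set.
key_status <- nzchar(Sys.getenv(key_vars))
names(key_status) <- key_vars

# TRUE means the key is set in this session; the key values are never printed.
print(key_status)
```

Only the providers you plan to use need to show TRUE.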

Alternatively, you can provide API keys directly in function calls:

results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-3-7-sonnet-20250219",
  api_key = "your-anthropic-api-key",  # Direct API key
  top_gene_count = 10
)

Input Data Requirements

mLLMCelltype accepts marker gene data in several formats:

1. Data Frame Format

A data frame with the following columns:

  • cluster: Cluster ID (must be 0-based)
  • gene: Gene name/symbol
  • avg_log2FC or similar metric: Log fold change
  • p_val_adj or similar metric: Adjusted p-value

Example:

# Example marker data frame
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
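A quick sanity check on the data frame can catch format problems before any API call is made. The sketch below assumes that only the cluster and gene columns are strictly needed and that cluster IDs start at 0, as stated above; `check_markers()` is a hypothetical helper, not a function from mLLMCelltype.

```r
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)

# check_markers() is a hypothetical helper, not part of mLLMCelltype.
check_markers <- function(df) {
  required <- c("cluster", "gene")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Cluster IDs may be numeric, character, or factor; compare them as numbers.
  ids <- suppressWarnings(as.numeric(as.character(df$cluster)))
  if (anyNA(ids) || min(ids) != 0) {
    stop("Cluster IDs should be numeric and start at 0")
  }
  invisible(TRUE)
}

check_markers(markers_df)
```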

2. Seurat FindAllMarkers Output

You can directly use the output from Seurat’s FindAllMarkers() function:

# Assuming you have a Seurat object named 'seurat_obj'
library(Seurat)
all_markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
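FindAllMarkers() output can contain hundreds of genes per cluster, so it is sometimes useful to inspect the top genes yourself. The sketch below uses a small mock table in place of real Seurat output and assumes that "top" means highest avg_log2FC; the package's own selection via top_gene_count may rank genes differently. `top_n_per_cluster()` is a hypothetical helper.

```r
# Mock table with the columns used in this guide; real FindAllMarkers() output has more.
all_markers <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD2", "CD3D", "CD3E", "CST3", "CD14", "LYZ"),
  avg_log2FC = c(2.1, 2.5, 2.3, 2.5, 3.1, 2.8)
)

# top_n_per_cluster() is a hypothetical helper, not part of mLLMCelltype.
top_n_per_cluster <- function(df, n = 10) {
  # Sort by cluster, then by decreasing fold change within each cluster.
  df <- df[order(df$cluster, -df$avg_log2FC), ]
  # ave() numbers the rows 1, 2, ... within each cluster in the sorted order.
  rank_in_cluster <- ave(seq_len(nrow(df)), df$cluster, FUN = seq_along)
  df[rank_in_cluster <= n, ]
}

top_markers <- top_n_per_cluster(all_markers, n = 2)
print(top_markers)
```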

3. CSV File Path

A path to a CSV file containing marker gene data:

# Path to your CSV file
markers_file <- "path/to/markers.csv"
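This guide does not spell out the CSV layout, so the sketch below assumes the file uses the same columns as the data frame format. It writes a small marker table to a temporary file, which stands in for "path/to/markers.csv", and reads it back to confirm the columns are intact.

```r
markers_df <- data.frame(
  cluster = c(0, 0, 1, 1),
  gene = c("CD3D", "CD3E", "CD14", "LYZ"),
  avg_log2FC = c(2.5, 2.3, 3.1, 2.8),
  p_val_adj = c(0.001, 0.001, 0.0001, 0.0002)
)

# A temporary file stands in for "path/to/markers.csv".
markers_file <- tempfile(fileext = ".csv")
write.csv(markers_df, markers_file, row.names = FALSE)

# Read the file back to confirm the columns survived the round trip.
markers_check <- read.csv(markers_file)
print(names(markers_check))
```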

4. List Format

A list where each element contains marker genes for a cluster:

# Example marker list
markers_list <- list(
  "0" = c("CD3D", "CD3E", "CD2", "IL7R", "LTB"),
  "1" = c("CD14", "LYZ", "CST3", "MS4A7", "FCGR3A")
)
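If your markers are already in a data frame, a list of this shape can be built with base R's split(). This is a minimal sketch using only the cluster and gene columns.

```r
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  stringsAsFactors = FALSE
)

# split() groups genes by cluster; the list names are the cluster IDs as strings.
markers_list <- split(markers_df$gene, markers_df$cluster)
str(markers_list)
```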

Function Parameters

The annotate_cell_types function has the following parameters:

Parameter      | Description                                       | Default Value
input          | Marker gene data (data frame, list, or file path) | (required)
tissue_name    | Tissue name (e.g., “human PBMC”, “mouse brain”)   | NULL
model          | LLM model to use                                  | "gpt-4o"
api_key        | API key (if not set in environment)               | NA
top_gene_count | Number of top genes per cluster to use            | 10
debug          | Whether to print debugging information            | FALSE

Note: If api_key is set to NA, the function will return the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API.
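The prompt preview described in the note can be sketched as follows. The call uses only the parameters listed above; the requireNamespace() guard is an addition so the snippet does nothing when the package is not installed, and the exact form of the returned prompt is not documented here, so it is simply printed.

```r
markers <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)

# Guarded so the snippet is a no-op when mLLMCelltype is not installed.
if (requireNamespace("mLLMCelltype", quietly = TRUE)) {
  prompt_preview <- mLLMCelltype::annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = "gpt-4o",
    api_key = NA,  # per the note above: return the prompt, make no API call
    top_gene_count = 3
  )
  print(prompt_preview)
}
```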

Basic Usage Example

Here’s a simple example using a single LLM model for annotation:

# Example marker data
markers <- data.frame(
  cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
  avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
  p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002)
)

# Run annotation with a single model
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-3-7-sonnet-20250219",
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  top_gene_count = 10,
  debug = FALSE  # Set to TRUE for more detailed output
)

# Print results
print(results)

Example Output

When using a single model like Claude, the output will be a character vector with one annotation per cluster:

> print(results)
[1] "0: T cells"   "1: Monocytes"
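To look up annotations by cluster ID, the strings can be split into IDs and labels. This sketch assumes the "id: label" format shown in the example output above; if your version of the package returns a different shape, the parsing step needs adjusting.

```r
# Example output copied from above.
results <- c("0: T cells", "1: Monocytes")

# Split each "id: label" string at the first colon.
cluster_ids <- sub(":.*$", "", results)
cell_types <- trimws(sub("^[^:]*:", "", results))

# Named character vector: names are cluster IDs, values are cell types.
annotations <- setNames(cell_types, cluster_ids)
print(annotations)
```

With this in place, annotations[["1"]] returns the label for cluster 1.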

Multi-Model Consensus Example

For more reliable annotations, you can use multiple models and create a consensus:

# Define models to use
models <- c(
  "claude-3-7-sonnet-20250219",  # Anthropic
  "gpt-4o",                      # OpenAI
  "gemini-1.5-pro"               # Google
)

# API keys for different providers
api_keys <- list(
  anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
  openai = Sys.getenv("OPENAI_API_KEY"),
  gemini = Sys.getenv("GEMINI_API_KEY")
)

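# Run annotation with each model individually (useful for inspecting per-model
# predictions; the consensus call below takes the model list directly)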
results <- list()
for (model in models) {
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]

  results[[model]] <- annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )
}

# Create consensus
consensus_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human PBMC",
  models = models,  # Use all the models defined above
  api_keys = api_keys,
  controversy_threshold = 0.7,
  entropy_threshold = 1.0,
  consensus_check_model = "claude-3-7-sonnet-20250219"
)

# Print consensus results
print_consensus_summary(consensus_results)

Consensus Output Example

The consensus results contain more detailed information:

> print_consensus_summary(consensus_results)
Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters

Cluster 0:
  Final annotation: T cells
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-3-7-sonnet-20250219: T cells
    - gpt-4o: T cells
    - gemini-1.5-pro: T cells

Cluster 1:
  Final annotation: Monocytes
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-3-7-sonnet-20250219: Monocytes
    - gpt-4o: Monocytes
    - gemini-1.5-pro: Monocytes

Integrating with Seurat

To add the annotations to your Seurat object:

# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)

# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = as.character(0:(length(consensus_results$final_annotations)-1)),
  to = consensus_results$final_annotations
)

# Extract consensus metrics from the consensus results
# Note: these metrics are stored in consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) {
  metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
  return(list(
    cluster = cluster_id,
    consensus_proportion = metrics$consensus_proportion,
    entropy = metrics$entropy
  ))
})

# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))

# Add consensus proportion to Seurat object
# (mapvalues() returns a character vector here, so convert back to numeric)
seurat_obj$consensus_proportion <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$consensus_proportion
))

# Add entropy to Seurat object (same conversion as above)
seurat_obj$entropy <- as.numeric(plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$entropy
))

Basic Visualization

Here’s a simple visualization of the results using Seurat:

library(ggplot2)  # provides ggtitle() and theme()

# Plot UMAP with cell type annotations
DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) +
  ggtitle("Cell Type Annotations") +
  theme(plot.title = element_text(hjust = 0.5))

Understanding the Output

The output of annotate_cell_types() is a vector of cell type annotations, where each element corresponds to a cluster.

The output of interactive_consensus_annotation() is a list containing:

  • final_annotations: Final consensus cell type annotations
  • initial_results: Initial predictions from each model
  • controversial_clusters: List of clusters that required discussion
  • discussion_logs: Detailed logs of the discussion process
  • session_id: Unique identifier for the annotation session

Understanding Uncertainty Metrics

When using consensus annotation, two key metrics help evaluate the reliability of annotations:

  • Consensus Proportion: Ranges from 0 to 1, indicating the proportion of models that agree on the final annotation. Higher values indicate stronger agreement.
  • Entropy: Measures the uncertainty in model predictions. Lower values indicate more certainty. An entropy of 0 means all models agree perfectly.

Clusters with low consensus proportion or high entropy may require manual review.
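To make the two metrics concrete, here is a small worked sketch for a single cluster. It rests on several assumptions: the most common label stands in for the final annotation, entropy is Shannon entropy in base 2, and the cutoffs reuse the controversy_threshold and entropy_threshold values from the consensus example. The package's own calculation may normalize labels or use a different log base. `agreement_metrics()` is a hypothetical helper, and the predictions are made up for illustration.

```r
# Hypothetical predictions for one cluster from three models.
predictions <- c("T cells", "T cells", "NK cells")

# agreement_metrics() is an illustrative helper, not a package function.
agreement_metrics <- function(preds) {
  # Share of models behind each distinct label.
  p <- as.numeric(table(preds)) / length(preds)
  list(
    consensus_proportion = max(p),   # share of the most common label
    entropy = -sum(p * log2(p))      # Shannon entropy, base 2 (assumed)
  )
}

m <- agreement_metrics(predictions)
print(m)

# Thresholds borrowed from the consensus example (0.7 and 1.0).
needs_review <- m$consensus_proportion < 0.7 || m$entropy > 1.0
print(needs_review)
```

Here two of three models agree, so the consensus proportion is 2/3 and the entropy is about 0.92; the cluster is flagged because the proportion falls below 0.7. With three identical predictions, the proportion is 1 and the entropy is 0.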

Using OpenRouter Free Models

If you don’t have access to paid API keys, you can use OpenRouter’s free models:

# Set OpenRouter API key
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")

# Use a free model
free_results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "meta-llama/llama-4-maverick:free",  # Note the :free suffix
  api_key = Sys.getenv("OPENROUTER_API_KEY"),
  top_gene_count = 10
)

# Print results
print(free_results)

Available free models include:

  • meta-llama/llama-4-maverick:free - Meta Llama 4 Maverick (256K context)
  • nvidia/llama-3.1-nemotron-ultra-253b-v1:free - NVIDIA Nemotron Ultra 253B
  • deepseek/deepseek-chat-v3-0324:free - DeepSeek Chat v3
  • microsoft/mai-ds-r1:free - Microsoft MAI-DS-R1

Free models don’t consume credits but may have limitations compared to paid models.

Troubleshooting

Common Issues

  1. API Key Not Found:

    Error: No auth credentials found

    Solution: Ensure you’ve set the correct API key environment variable or provided it directly in the function call.

  2. Rate Limiting:

    Error: Rate limit exceeded

    Solution: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.

  3. Invalid Model Name:

    Error: Unsupported model: [model_name]

    Solution: Check that you’re using a supported model name and that it’s spelled correctly.

  4. Network Issues:

    Error: Could not connect to API

    Solution: Check your internet connection and try again. If the problem persists, the API service might be down.
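For rate limits and transient network errors, one common approach is to retry with an increasing wait between attempts. The sketch below is generic base R and not a feature of mLLMCelltype; `with_retry()` is a hypothetical helper, and `flaky_call()` is a stand-in that fails twice before succeeding.

```r
# with_retry() is a hypothetical helper, not part of mLLMCelltype.
with_retry <- function(fn, max_attempts = 3, base_wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    # Capture the error object instead of letting it abort the loop.
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < max_attempts) {
      # Wait base_wait, then 2 * base_wait, then 4 * base_wait, ...
      Sys.sleep(base_wait * 2^(attempt - 1))
    }
  }
  stop("All ", max_attempts, " attempts failed: ", conditionMessage(result))
}

# Stand-in for an API call: fails twice, then succeeds.
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Rate limit exceeded")
  "ok"
}

outcome <- with_retry(flaky_call, max_attempts = 5, base_wait = 0.01)
print(outcome)
```

In practice, fn would be a function that wraps your annotate_cell_types() call, with base_wait set to a few seconds or more depending on the provider's limits.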

Next Steps

Now that you understand the basics of mLLMCelltype, you can explore the rest of the package documentation.

If you encounter any issues, check the FAQ or open an issue on our GitHub repository.