Vertex AI Setup Guide
Setup guide for Google Vertex AI (Gemini 2.5) powering StemBlock AI evaluations, RAG embeddings, context caching, and fine-tuning.
Table of Contents
- Overview
- GCP Setup
- Local Development Setup
- Production Deployment
- Environment Variables Reference
- Testing the Integration
- RAG & Embeddings
- Context Caching
- Fine-Tuning
- Rollback Procedure
- Cost Monitoring
- Troubleshooting
Overview
StemBlock AI uses the @google/genai SDK (v1.43.0+) with the Vertex AI backend for all AI features. A minimal initialization sketch follows the SDK details below.
SDK Details
| Field | Value |
|---|---|
| Package | @google/genai |
| Backend | Vertex AI (vertexai: true) |
| Auth | Service account (JSON/base64) or default credentials |
| Key File | src/shared/genai.service.ts |
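Concretely, the client is constructed with the Vertex AI backend flag enabled. A short sketch, assuming the environment variables defined later in this guide (the real wiring, including base64 key handling, lives in src/shared/genai.service.ts):
import { GoogleGenAI } from '@google/genai';

// Minimal sketch; src/shared/genai.service.ts is the authoritative setup.
const client = new GoogleGenAI({
  vertexai: true, // route calls through Vertex AI instead of the Gemini API
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: process.env.GOOGLE_CLOUD_LOCATION ?? 'us-central1',
});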
Models in Use
| Use Case | Model | Why |
|---|---|---|
| AI Evaluation | gemini-2.5-flash | High volume, cost-effective |
| Coach Feedback | gemini-2.5-flash | Balanced quality/speed |
| Parent Insights | gemini-2.5-flash | Cached heavily |
| Content Moderation | gemini-2.5-flash-lite | Speed critical, low cost |
| English Writing Feedback | gemini-2.5-flash | Quality + speed balance |
| English Writing Assessment | gemini-2.5-flash | Scoring accuracy |
| RAG Embeddings | text-embedding-005 | 256D vectors for pgvector |
| Fine-Tuning Base | gemini-2.0-flash-001 | Supervised tuning |
GCP Setup
Step 1: Create or Select a GCP Project
- Go to Google Cloud Console
- Create or select a project (e.g., stemblock-ai-prod)
- Note your Project ID
Step 2: Enable Required APIs
# Required APIs
gcloud services enable aiplatform.googleapis.com --project=YOUR_PROJECT_ID
Step 3: Create a Service Account
- Go to IAM & Admin > Service Accounts
- Click + Create Service Account
- Name: stemblock-vertex-ai
- Description: Service account for StemBlock AI Vertex AI access
Step 4: Grant Required Roles
Add these roles to the service account:
| Role | Purpose |
|---|---|
| Vertex AI User (roles/aiplatform.user) | API calls for generation and embeddings |
| Service Usage Consumer (roles/serviceusage.serviceUsageConsumer) | API quota |
| Vertex AI Tuning User (roles/aiplatform.tuningUser) | Fine-tuning jobs (if using fine-tuning) |
Step 5: Create and Download Service Account Key
- Click the service account email → Keys tab
- Add Key > Create new key > JSON
- Rename the downloaded file to gcp-service-account.json
- Never commit this file to git; it's covered by .gitignore
Step 6: Set Up Billing & Alerts
- Ensure a billing account is linked to your project
- Go to Billing > Budgets & alerts
- Create a budget with alerts at 50%, 80%, 100% thresholds
Local Development Setup
Step 1: Place the Service Account Key
cp ~/Downloads/gcp-service-account.json ./stemblockai-backend/
# Verify it's in .gitignore
grep -q "gcp-service-account.json" .gitignore || echo "gcp-service-account.json" >> .gitignore
Step 2: Update Your .env File
# AI Provider Selection
LLM_PROVIDER="gemini"
WRITING_EVALUATOR_PROVIDER="gemini"
# Google GenAI SDK Configuration
GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
GOOGLE_CLOUD_LOCATION="us-central1"
GOOGLE_APPLICATION_CREDENTIALS="./gcp-service-account.json"
# Model Configuration (defaults — usually don't need to change)
VERTEX_FLASH_MODEL="gemini-2.5-flash"
VERTEX_PRO_MODEL="gemini-2.5-pro"
VERTEX_LITE_MODEL="gemini-2.5-flash-lite"
Step 3: Test Locally
cd stemblockai-backend
npm run start:dev
Verify in logs:
[GenAIService] GenAI initialized: project=your-project, location=us-central1, flash=gemini-2.5-flash, pro=gemini-2.5-pro, lite=gemini-2.5-flash-lite
[GeminiLLMProvider] Initialized with model: gemini-2.5-flash, rate limit: 500ms, RAG enabled
[GeminiWritingProvider] Initialized with Flash: gemini-2.5-flash, Pro: gemini-2.5-pro
Production Deployment
Authentication via Base64-Encoded Key (Recommended for non-GCP hosts)
The GenAIService supports base64-encoded service account keys natively — no startup script needed.
Step 1: Encode the Key
# macOS
cat gcp-service-account.json | base64 > gcp-key-base64.txt
# Linux
cat gcp-service-account.json | base64 -w 0 > gcp-key-base64.txt
Step 2: Set Environment Variables
| Variable | Value | Encrypt |
|---|---|---|
| LLM_PROVIDER | gemini | No |
| WRITING_EVALUATOR_PROVIDER | gemini | No |
| GOOGLE_CLOUD_PROJECT | your-gcp-project-id | No |
| GOOGLE_CLOUD_LOCATION | us-central1 | No |
| GCP_SERVICE_ACCOUNT_KEY_BASE64 | <paste base64 content> | Yes |
| VERTEX_FLASH_MODEL | gemini-2.5-flash | No |
| VERTEX_PRO_MODEL | gemini-2.5-pro | No |
| VERTEX_LITE_MODEL | gemini-2.5-flash-lite | No |
The SDK initialization in genai.service.ts handles base64 decoding automatically:
if (serviceAccountKeyBase64) {
const credentials = JSON.parse(
Buffer.from(serviceAccountKeyBase64, 'base64').toString('utf-8'),
);
options.googleAuthOptions = { credentials };
}
Authentication via Default Credentials (For GCP-hosted environments)
On Cloud Run or GKE, no credentials file is needed. The SDK uses the attached service account automatically. Just set:
GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
GOOGLE_CLOUD_LOCATION="us-central1"
Environment Variables Reference
Required
| Variable | Description | Example |
|---|---|---|
| LLM_PROVIDER | Provider for STEM evaluations | gemini, mistral, mock |
| WRITING_EVALUATOR_PROVIDER | Provider for English writing | gemini, mistral, claude, mock |
| GOOGLE_CLOUD_PROJECT | GCP project ID | stemblock-ai-prod |
| GOOGLE_CLOUD_LOCATION | GCP region for Vertex AI | us-central1 |
Authentication (one of these)
| Variable | Description | When |
|---|---|---|
| GCP_SERVICE_ACCOUNT_KEY_BASE64 | Base64-encoded service account JSON | Non-GCP environments |
| GOOGLE_APPLICATION_CREDENTIALS | Path to service account JSON file | Local development |
| (none) | Default credentials | Cloud Run / GKE |
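Credential resolution follows the order in the table above. A simplified sketch of that precedence, assuming the same environment variables (illustrative; genai.service.ts is authoritative):
import { GoogleGenAI, GoogleGenAIOptions } from '@google/genai';

const options: GoogleGenAIOptions = {
  vertexai: true,
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: process.env.GOOGLE_CLOUD_LOCATION,
};

const keyBase64 = process.env.GCP_SERVICE_ACCOUNT_KEY_BASE64;
if (keyBase64) {
  // 1. Explicit base64 key (non-GCP hosts)
  options.googleAuthOptions = {
    credentials: JSON.parse(Buffer.from(keyBase64, 'base64').toString('utf-8')),
  };
}
// 2. Otherwise GOOGLE_APPLICATION_CREDENTIALS (file path), then
// 3. Application Default Credentials (Cloud Run / GKE), are picked up
//    automatically by the underlying google-auth-library.
const client = new GoogleGenAI(options);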
Optional
| Variable | Description | Default |
|---|---|---|
| VERTEX_FLASH_MODEL | Flash model version | gemini-2.5-flash |
| VERTEX_PRO_MODEL | Pro model version | gemini-2.5-pro |
| VERTEX_LITE_MODEL | Lite model version | gemini-2.5-flash-lite |
| VERTEX_MIN_REQUEST_INTERVAL | Min ms between requests | 500 |
| VERTEX_MAX_RETRIES | Max retry attempts | 5 |
Available GCP Regions
| Region | Location | Notes |
|---|---|---|
| us-central1 | Iowa, USA | Default, best availability |
| us-east1 | South Carolina, USA | East coast US |
| us-west1 | Oregon, USA | West coast US |
| northamerica-northeast1 | Montreal, Canada | Canadian users |
| europe-west1 | Belgium | European users |
Testing the Integration
Verify Initialization
Check startup logs for:
[GenAIService] Using service account credentials from GCP_SERVICE_ACCOUNT_KEY_BASE64
[GenAIService] GenAI initialized: project=..., location=..., flash=gemini-2.5-flash, pro=gemini-2.5-pro, lite=gemini-2.5-flash-lite
Test AI Evaluation
curl -X POST https://api.stemblock.ai/api/v1/evaluations/generate/{submissionId} \
-H "Authorization: Bearer YOUR_JWT_TOKEN"
Test RAG System
Currently RAG has no public API controller — it's used internally. Verify via:
- Check the health endpoint: GET /health
- Trigger an evaluation (RAG context is retrieved automatically)
- Check logs for RAG context retrieval skipped (if no documents have been ingested yet) or for retrieved context sources
RAG & Embeddings
The RAG system uses Vertex AI for embedding generation.
Embedding Model
| Setting | Value |
|---|---|
| Model | text-embedding-005 |
| Dimensions | 256 (reduced from default 768) |
| Storage | PostgreSQL + pgvector (HNSW index) |
SDK Usage
// Single text embedding
const single = await client.models.embedContent({
  model: 'text-embedding-005',
  contents: text,
  config: { outputDimensionality: 256 },
});
// single.embeddings[0].values holds the 256-dimension vector

// Batch embedding (one embedding per input text, in order)
const batch = await client.models.embedContent({
  model: 'text-embedding-005',
  contents: ['text1', 'text2', 'text3'],
  config: { outputDimensionality: 256 },
});
Key Files
| File | Purpose |
|---|---|
| src/rag/embedding.service.ts | Embedding generation |
| src/rag/vector-store.service.ts | pgvector storage & similarity search |
| src/rag/rag.service.ts | RAG pipeline orchestrator |
| src/rag/ingestion.service.ts | Document chunking & ingestion |
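As the tables above note, vectors live in PostgreSQL behind a pgvector HNSW index. A hypothetical similarity lookup, assuming an illustrative rag_chunks(id, content, embedding vector(256)) table rather than the actual schema in vector-store.service.ts:
import { Client } from 'pg';

// Hypothetical schema with a cosine-distance HNSW index, e.g.:
//   CREATE INDEX ON rag_chunks USING hnsw (embedding vector_cosine_ops);
async function findSimilarChunks(db: Client, queryVector: number[], k = 5) {
  // pgvector accepts vectors as '[0.1,0.2,...]' text literals
  const vectorLiteral = JSON.stringify(queryVector);
  const { rows } = await db.query(
    `SELECT id, content, embedding <=> $1 AS distance
       FROM rag_chunks
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [vectorLiteral, k],
  );
  return rows; // nearest chunks by cosine distance
}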
Context Caching
Server-side Gemini context caching provides a 90% discount on cached input tokens.
How It Works
- System prompts (evaluation guidelines, rubrics) are cached server-side
- Subsequent requests referencing the same cache pay 10% of normal input token cost
- Minimum cache size: 32,768 tokens (Gemini enforced)
- Default TTL: 1 hour
SDK Usage
// Create cache
const cache = await client.caches.create({
model,
config: {
contents: [{ role: 'user', parts: [{ text: systemInstruction }] }],
systemInstruction: '...',
ttl: '3600s',
},
});
// Use cache in generation
const response = await client.models.generateContent({
model,
contents: userPrompt,
config: { cachedContent: cache.name },
});
Key File
src/shared/context-cache.service.ts — handles cache creation, usage, invalidation, and graceful fallback.
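The graceful fallback can be pictured as the sketch below, built on the generateContent call shown earlier (illustrative only, not the actual service code):
import { GoogleGenAI } from '@google/genai';

// Illustrative fallback sketch; context-cache.service.ts is authoritative.
async function generateWithOptionalCache(
  client: GoogleGenAI,
  model: string,
  systemInstruction: string,
  userPrompt: string,
  cacheName?: string,
) {
  if (cacheName) {
    try {
      // Cached path: input tokens covered by the cache are billed at 10%
      return await client.models.generateContent({
        model,
        contents: userPrompt,
        config: { cachedContent: cacheName },
      });
    } catch {
      // Cache expired or invalid: fall through to the inline prompt
    }
  }
  // Inline fallback: full input-token price, functionally identical output
  return client.models.generateContent({
    model,
    contents: userPrompt,
    config: { systemInstruction },
  });
}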
Fine-Tuning
Supervised fine-tuning for custom model creation from user feedback.
Configuration
| Setting | Value |
|---|---|
| Base model | gemini-2.0-flash-001 |
| Min examples | 20 |
| Default epochs | 5 |
| Cost | ~$3/1M training tokens |
SDK Usage
const tuningJob = await client.tunings.tune({
baseModel: 'gemini-2.0-flash-001',
trainingDataset: {
examples: [
{ textInput: 'prompt', output: 'corrected response' },
// ... 20+ examples
],
},
config: { epochCount: 5, learningRateMultiplier: 1.0 },
});
// Check status
const job = await client.tunings.get({ name: tuningJob.name });
// job.state: 'JOB_STATE_SUCCEEDED' | 'JOB_STATE_FAILED' | ...
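Because tuning jobs run asynchronously, callers typically poll until a terminal state. A small helper built on the client.tunings.get call above (the one-minute interval and the terminal-state list are assumptions):
import { GoogleGenAI } from '@google/genai';

// Polling sketch; interval and terminal states are assumptions.
async function waitForTuningJob(client: GoogleGenAI, jobName: string) {
  const terminal = ['JOB_STATE_SUCCEEDED', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED'];
  for (;;) {
    const job = await client.tunings.get({ name: jobName });
    if (job.state && terminal.includes(job.state)) return job;
    await new Promise((resolve) => setTimeout(resolve, 60_000)); // check once a minute
  }
}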
Key Files
| File | Purpose |
|---|---|
| src/rag/fine-tuning.service.ts | Job submission, status tracking, training data collection |
| src/rag/feedback-loop.service.ts | Collects user feedback, triggers fine-tuning at 50+ examples |
IAM Requirement
The service account needs the Vertex AI Tuning User role for fine-tuning operations.
Rollback Procedure
If Gemini issues occur, switch to Mistral immediately:
- Change LLM_PROVIDER from gemini to mistral
- Change WRITING_EVALUATOR_PROVIDER from gemini to mistral
- Ensure MISTRAL_API_KEY is set
- Redeploy
Verify rollback in logs:
[LLMProviderFactory] Using provider: Mistral AI (mistral-large-latest)
Note: RAG context retrieval and fine-tuning will be unavailable under Mistral — evaluations will work without RAG augmentation.
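Conceptually, the factory reads LLM_PROVIDER once at startup, which is why an env change plus redeploy is sufficient. A hypothetical sketch of that selection (the interface and class names here are illustrative, not the project's actual ones):
// Hypothetical sketch; the real LLMProviderFactory is authoritative.
interface LLMProvider {
  name: string;
  evaluate(prompt: string): Promise<string>;
}

declare const GeminiLLMProvider: new () => LLMProvider;
declare const MistralLLMProvider: new () => LLMProvider;
declare const MockLLMProvider: new () => LLMProvider;

function createLLMProvider(): LLMProvider {
  switch (process.env.LLM_PROVIDER) {
    case 'gemini':
      return new GeminiLLMProvider();
    case 'mistral':
      return new MistralLLMProvider();
    default:
      return new MockLLMProvider(); // mock keeps the app running without API keys
  }
}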
Cost Monitoring
Set Up GCP Budget Alerts
- Go to Billing > Budgets & alerts
- Create budget: scope to Vertex AI service
- Set alerts at 50%, 80%, 100% thresholds
- Add notification email addresses
Monitor in GCP Console
- Go to Vertex AI > Dashboard
- View: total predictions, token usage, latency, error rates
Expected Monthly Costs
| Workload | Model | Est. Cost/Month |
|---|---|---|
| AI Evaluation | gemini-2.5-flash | ~$75 |
| English Writing (3-stage) | flash-lite + flash | ~$300 |
| Coach Feedback | gemini-2.5-flash | ~$12 |
| Parent Insights | gemini-2.5-flash | ~$6 |
| RAG Embeddings | text-embedding-005 | ~$5 |
| Context Caching | (90% savings on cached) | ~-$200 savings |
| Total | Mixed | ~$200-400 |
Costs depend on volume. Above estimates assume moderate usage with context caching enabled.
Troubleshooting
Error: "GenAI client not initialized"
Cause: GOOGLE_CLOUD_PROJECT not set or credentials invalid.
Solution:
- Verify GOOGLE_CLOUD_PROJECT is set
- Check credentials (base64 decode test: echo $GCP_SERVICE_ACCOUNT_KEY_BASE64 | base64 -d | jq .project_id)
- Restart the application
Error: "Permission denied" or "403 Forbidden"
Cause: Service account missing IAM roles.
Solution:
- Verify roles: Vertex AI User + Service Usage Consumer
- Add Vertex AI Tuning User if using fine-tuning
- Wait 1-2 min for propagation
Error: "429 Too Many Requests" / "RESOURCE_EXHAUSTED"
Cause: Hit Vertex AI rate limits.
Solution:
- Built-in retry logic handles this (exponential backoff, max 5 retries; see the sketch after this list)
- If persistent, request quota increase in IAM & Admin > Quotas
- Adjust VERTEX_MIN_REQUEST_INTERVAL (default 500ms)
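The built-in retry behaves roughly like the sketch below (illustrative; per the note above, the real logic caps attempts at VERTEX_MAX_RETRIES and spaces requests by VERTEX_MIN_REQUEST_INTERVAL):
// Illustrative exponential-backoff sketch, not the project's actual code.
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // in practice only 429 / RESOURCE_EXHAUSTED is worth retrying
      if (attempt === maxRetries) break;
      // Back off 1s, 2s, 4s, ... plus jitter before the next attempt
      const delayMs = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}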
Error: "Model not found"
Cause: Model name invalid or not available in region.
Solution:
- Verify model names in env vars
- Check region availability for the model
- Fallback: change VERTEX_FLASH_MODEL to a known-available model
RAG: "No embedding returned from GenAI"
Cause: Embedding API call failed.
Solution:
- Verify text-embedding-005 is available in your region
- Check that the Vertex AI API is enabled
- Check that the service account has the Vertex AI User role
Context Cache: Falls back to inline prompt
Info: Not an error. Context caching requires 32,768+ tokens minimum. Short system prompts will gracefully fall back to inline prompts with no functional impact.
Security Best Practices
- Never commit credentials to git — use encrypted env vars
- Least-privilege IAM roles — only grant what's needed
- Rotate service account keys every 90 days
- Enable Cloud Audit Logs for Vertex AI
- Set billing alerts to detect unexpected usage spikes
Document Version: 2.0
Last Updated: February 27, 2026
Previous Location: stemblockai-docs/VERTEX_AI_SETUP.md