Vertex AI Setup Guide

Setup guide for Google Vertex AI (Gemini 2.5) powering StemBlock AI evaluations, RAG embeddings, context caching, and fine-tuning.

Table of Contents

  1. Overview
  2. GCP Setup
  3. Local Development Setup
  4. Production Deployment
  5. Environment Variables Reference
  6. Testing the Integration
  7. RAG & Embeddings
  8. Context Caching
  9. Fine-Tuning
  10. Rollback Procedure
  11. Cost Monitoring
  12. Troubleshooting

Overview

StemBlock AI uses the @google/genai SDK (v1.43.0+) with the Vertex AI backend for all AI features.

SDK Details

| Field | Value |
| --- | --- |
| Package | `@google/genai` |
| Backend | Vertex AI (`vertexai: true`) |
| Auth | Service account (JSON/base64) or default credentials |
| Key File | `src/shared/genai.service.ts` |

Models in Use

| Use Case | Model | Why |
| --- | --- | --- |
| AI Evaluation | gemini-2.5-flash | High volume, cost-effective |
| Coach Feedback | gemini-2.5-flash | Balanced quality/speed |
| Parent Insights | gemini-2.5-flash | Cached heavily |
| Content Moderation | gemini-2.5-flash-lite | Speed critical, low cost |
| English Writing Feedback | gemini-2.5-flash | Quality + speed balance |
| English Writing Assessment | gemini-2.5-flash | Scoring accuracy |
| RAG Embeddings | text-embedding-005 | 256-D vectors for pgvector |
| Fine-Tuning Base | gemini-2.0-flash-001 | Supervised tuning |
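For illustration, the routing in this table can be sketched as a small resolver. The function `resolveModel` and its `env` parameter are hypothetical (the real service reads configuration via its own config layer); only the model names and `VERTEX_*_MODEL` variables come from this guide:

```typescript
// Hypothetical resolver mirroring the model-selection table above.
// VERTEX_FLASH_MODEL / VERTEX_LITE_MODEL override the defaults.
type UseCase =
  | 'evaluation'
  | 'coach-feedback'
  | 'parent-insights'
  | 'moderation'
  | 'writing-feedback'
  | 'writing-assessment'
  | 'embeddings'
  | 'fine-tuning-base';

function resolveModel(
  useCase: UseCase,
  env: Record<string, string | undefined> = {},
): string {
  const flash = env.VERTEX_FLASH_MODEL ?? 'gemini-2.5-flash';
  const lite = env.VERTEX_LITE_MODEL ?? 'gemini-2.5-flash-lite';
  switch (useCase) {
    case 'moderation':
      return lite; // speed-critical, low cost
    case 'embeddings':
      return 'text-embedding-005'; // 256-D vectors for pgvector
    case 'fine-tuning-base':
      return 'gemini-2.0-flash-001'; // supervised tuning base
    default:
      return flash; // high-volume, cost-effective default
  }
}
```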

GCP Setup

Step 1: Create or Select a GCP Project

  1. Go to Google Cloud Console
  2. Create or select a project (e.g., stemblock-ai-prod)
  3. Note your Project ID

Step 2: Enable Required APIs

```bash
# Required APIs
gcloud services enable aiplatform.googleapis.com --project=YOUR_PROJECT_ID
```

Step 3: Create a Service Account

  1. Go to IAM & Admin > Service Accounts
  2. Click + Create Service Account
  3. Name: stemblock-vertex-ai
  4. Description: Service account for StemBlock AI Vertex AI access

Step 4: Grant Required Roles

Add these roles to the service account:

| Role | Purpose |
| --- | --- |
| Vertex AI User (`roles/aiplatform.user`) | API calls for generation and embeddings |
| Service Usage Consumer (`roles/serviceusage.serviceUsageConsumer`) | API quota |
| Vertex AI Tuning User (`roles/aiplatform.tuningUser`) | Fine-tuning jobs (if using fine-tuning) |

Step 5: Create and Download Service Account Key

  1. Click the service account email → Keys tab
  2. Add Key > Create new key > JSON
  3. Rename downloaded file to gcp-service-account.json
  4. Never commit this file to git — it's in .gitignore

Step 6: Set Up Billing & Alerts

  1. Ensure a billing account is linked to your project
  2. Go to Billing > Budgets & alerts
  3. Create a budget with alerts at 50%, 80%, 100% thresholds

Local Development Setup

Step 1: Place the Service Account Key

```bash
cp ~/Downloads/gcp-service-account.json ./stemblockai-backend/
# Verify it's in .gitignore
grep -q "gcp-service-account.json" .gitignore || echo "gcp-service-account.json" >> .gitignore
```

Step 2: Update Your .env File

```bash
# AI Provider Selection
LLM_PROVIDER="gemini"
WRITING_EVALUATOR_PROVIDER="gemini"

# Google GenAI SDK Configuration
GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
GOOGLE_CLOUD_LOCATION="us-central1"
GOOGLE_APPLICATION_CREDENTIALS="./gcp-service-account.json"

# Model Configuration (defaults — usually don't need to change)
VERTEX_FLASH_MODEL="gemini-2.5-flash"
VERTEX_PRO_MODEL="gemini-2.5-pro"
VERTEX_LITE_MODEL="gemini-2.5-flash-lite"
```

Step 3: Test Locally

```bash
cd stemblockai-backend
npm run start:dev
```

Verify in logs:

```
[GenAIService] GenAI initialized: project=your-project, location=us-central1, flash=gemini-2.5-flash, pro=gemini-2.5-pro, lite=gemini-2.5-flash-lite
[GeminiLLMProvider] Initialized with model: gemini-2.5-flash, rate limit: 500ms, RAG enabled
[GeminiWritingProvider] Initialized with Flash: gemini-2.5-flash, Pro: gemini-2.5-pro
```

Production Deployment

The GenAIService supports base64-encoded service account keys natively — no startup script needed.

Step 1: Encode the Key

```bash
# macOS
base64 -i gcp-service-account.json > gcp-key-base64.txt

# Linux
base64 -w 0 gcp-service-account.json > gcp-key-base64.txt
```

Step 2: Set Environment Variables

| Variable | Value | Encrypt |
| --- | --- | --- |
| LLM_PROVIDER | gemini | No |
| WRITING_EVALUATOR_PROVIDER | gemini | No |
| GOOGLE_CLOUD_PROJECT | your-gcp-project-id | No |
| GOOGLE_CLOUD_LOCATION | us-central1 | No |
| GCP_SERVICE_ACCOUNT_KEY_BASE64 | <paste base64 content> | Yes |
| VERTEX_FLASH_MODEL | gemini-2.5-flash | No |
| VERTEX_PRO_MODEL | gemini-2.5-pro | No |
| VERTEX_LITE_MODEL | gemini-2.5-flash-lite | No |

The SDK initialization in genai.service.ts handles base64 decoding automatically:

```typescript
if (serviceAccountKeyBase64) {
  const credentials = JSON.parse(
    Buffer.from(serviceAccountKeyBase64, 'base64').toString('utf-8'),
  );
  options.googleAuthOptions = { credentials };
}
```
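The full credential-selection order (base64 variable, key-file path, default credentials) can be sketched as a standalone helper. `resolveAuth` and its return shape are illustrative, not the actual `genai.service.ts` API:

```typescript
// Sketch of credential selection mirroring the behaviour described in this
// guide. The function name and return shape are illustrative.
interface AuthResolution {
  source: 'base64' | 'file' | 'default';
  credentials?: Record<string, unknown>;
}

function resolveAuth(env: Record<string, string | undefined>): AuthResolution {
  const b64 = env.GCP_SERVICE_ACCOUNT_KEY_BASE64;
  if (b64) {
    // Decode the base64-encoded service account JSON (non-GCP hosts).
    const credentials = JSON.parse(
      Buffer.from(b64, 'base64').toString('utf-8'),
    );
    return { source: 'base64', credentials };
  }
  if (env.GOOGLE_APPLICATION_CREDENTIALS) {
    // The SDK reads the key file from this path itself (local development).
    return { source: 'file' };
  }
  // On Cloud Run / GKE the attached service account is used automatically.
  return { source: 'default' };
}
```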

Authentication via Default Credentials (For GCP-hosted environments)

On Cloud Run or GKE, no credentials file is needed. The SDK uses the attached service account automatically. Just set:

```bash
GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
GOOGLE_CLOUD_LOCATION="us-central1"
```

Environment Variables Reference

Required

| Variable | Description | Example |
| --- | --- | --- |
| LLM_PROVIDER | Provider for STEM evaluations | gemini, mistral, mock |
| WRITING_EVALUATOR_PROVIDER | Provider for English writing | gemini, mistral, claude, mock |
| GOOGLE_CLOUD_PROJECT | GCP project ID | stemblock-ai-prod |
| GOOGLE_CLOUD_LOCATION | GCP region for Vertex AI | us-central1 |

Authentication (one of these)

| Variable | Description | When |
| --- | --- | --- |
| GCP_SERVICE_ACCOUNT_KEY_BASE64 | Base64-encoded service account JSON | Non-GCP environments |
| GOOGLE_APPLICATION_CREDENTIALS | Path to service account JSON file | Local development |
| (none) | Default credentials | Cloud Run / GKE |

Optional

| Variable | Description | Default |
| --- | --- | --- |
| VERTEX_FLASH_MODEL | Flash model version | gemini-2.5-flash |
| VERTEX_PRO_MODEL | Pro model version | gemini-2.5-pro |
| VERTEX_LITE_MODEL | Lite model version | gemini-2.5-flash-lite |
| VERTEX_MIN_REQUEST_INTERVAL | Min ms between requests | 500 |
| VERTEX_MAX_RETRIES | Max retry attempts | 5 |
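A rough sketch of how the last two knobs could interact, assuming a doubling backoff from a 1 s base with the minimum request interval as a floor (the exact formula used by the service is not documented here, so treat this as an assumption):

```typescript
// Illustrative retry-delay schedule: exponential backoff doubling from a
// base delay, capped, never shorter than the minimum request interval.
// The doubling-from-1s formula is an assumption, not the service's code.
function backoffDelays(
  maxRetries = 5,       // VERTEX_MAX_RETRIES default
  minIntervalMs = 500,  // VERTEX_MIN_REQUEST_INTERVAL default
  baseMs = 1000,
  capMs = 30_000,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const exponential = baseMs * 2 ** attempt;
    delays.push(Math.max(minIntervalMs, Math.min(exponential, capMs)));
  }
  return delays;
}

// backoffDelays() → [1000, 2000, 4000, 8000, 16000]
```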

Available GCP Regions

| Region | Location | Notes |
| --- | --- | --- |
| us-central1 | Iowa, USA | Default, best availability |
| us-east1 | South Carolina, USA | East coast US |
| us-west1 | Oregon, USA | West coast US |
| northamerica-northeast1 | Montreal, Canada | Canadian users |
| europe-west1 | Belgium | European users |

Testing the Integration

Verify Initialization

Check startup logs for:

```
[GenAIService] Using service account credentials from GCP_SERVICE_ACCOUNT_KEY_BASE64
[GenAIService] GenAI initialized: project=..., location=..., flash=gemini-2.5-flash, pro=gemini-2.5-pro, lite=gemini-2.5-flash-lite
```

Test AI Evaluation

```bash
curl -X POST https://api.stemblock.ai/api/v1/evaluations/generate/{submissionId} \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"
```

Test RAG System

Currently RAG has no public API controller — it's used internally. Verify via:

  1. Check the health endpoint: GET /health
  2. Trigger an evaluation (RAG context is retrieved automatically)
  3. Check logs for: RAG context retrieval skipped (if no documents ingested yet) or retrieved context sources

RAG & Embeddings

The RAG system uses Vertex AI for embedding generation.

Embedding Model

| Setting | Value |
| --- | --- |
| Model | text-embedding-005 |
| Dimensions | 256 (reduced from default 768) |
| Storage | PostgreSQL + pgvector (HNSW index) |

SDK Usage

```typescript
// Single text embedding
const response = await client.models.embedContent({
  model: 'text-embedding-005',
  contents: text,
  config: { outputDimensionality: 256 },
});

// Batch embedding
const batchResponse = await client.models.embedContent({
  model: 'text-embedding-005',
  contents: ['text1', 'text2', 'text3'],
  config: { outputDimensionality: 256 },
});
```
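Retrieval then ranks stored chunks by vector similarity. As a minimal reference point, this is cosine similarity, the measure underlying pgvector's cosine distance operator (which returns 1 minus this value):

```typescript
// Cosine similarity between two embedding vectors of equal dimension.
// pgvector's cosine distance is 1 - cosineSimilarity(a, b).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In production the similarity search runs inside PostgreSQL via the HNSW index; this helper only documents the math.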

Key Files

| File | Purpose |
| --- | --- |
| src/rag/embedding.service.ts | Embedding generation |
| src/rag/vector-store.service.ts | pgvector storage & similarity search |
| src/rag/rag.service.ts | RAG pipeline orchestrator |
| src/rag/ingestion.service.ts | Document chunking & ingestion |

Context Caching

Server-side Gemini context caching provides a 90% discount on cached input tokens.

How It Works

  1. System prompts (evaluation guidelines, rubrics) are cached server-side
  2. Subsequent requests referencing the same cache pay 10% of normal input token cost
  3. Minimum cache size: 32,768 tokens (Gemini enforced)
  4. Default TTL: 1 hour

SDK Usage

```typescript
// Create cache
const cache = await client.caches.create({
  model,
  config: {
    contents: [{ role: 'user', parts: [{ text: systemInstruction }] }],
    systemInstruction: '...',
    ttl: '3600s',
  },
});

// Use cache in generation
const response = await client.models.generateContent({
  model,
  contents: userPrompt,
  config: { cachedContent: cache.name },
});
```

Key File

src/shared/context-cache.service.ts — handles cache creation, usage, invalidation, and graceful fallback.


Fine-Tuning

Supervised fine-tuning for custom model creation from user feedback.

Configuration

| Setting | Value |
| --- | --- |
| Base model | gemini-2.0-flash-001 |
| Min examples | 20 |
| Default epochs | 5 |
| Cost | ~$3 / 1M training tokens |

SDK Usage

```typescript
const tuningJob = await client.tunings.tune({
  baseModel: 'gemini-2.0-flash-001',
  trainingDataset: {
    examples: [
      { textInput: 'prompt', output: 'corrected response' },
      // ... 20+ examples
    ],
  },
  config: { epochCount: 5, learningRateMultiplier: 1.0 },
});

// Check status
const job = await client.tunings.get({ name: tuningJob.name });
// job.state: 'JOB_STATE_SUCCEEDED' | 'JOB_STATE_FAILED' | ...
```

Key Files

| File | Purpose |
| --- | --- |
| src/rag/fine-tuning.service.ts | Job submission, status tracking, training data collection |
| src/rag/feedback-loop.service.ts | Collects user feedback, triggers fine-tuning at 50+ examples |
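The trigger logic implied by the table can be sketched as follows. The function name is hypothetical; only the 20-example tuning minimum and the 50-example feedback threshold come from this guide:

```typescript
// Illustrative gate for the feedback loop: fine-tuning jobs need at least
// 20 examples, and the loop waits for 50 before submitting a job.
const MIN_TUNING_EXAMPLES = 20;   // supervised tuning minimum
const FEEDBACK_TRIGGER_COUNT = 50; // feedback-loop threshold

function shouldTriggerTuning(collectedExamples: number): boolean {
  // 50 already exceeds the 20-example minimum, but checking both keeps the
  // gate correct if either constant changes independently.
  return collectedExamples >= Math.max(FEEDBACK_TRIGGER_COUNT, MIN_TUNING_EXAMPLES);
}
```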

IAM Requirement

The service account needs the Vertex AI Tuning User role for fine-tuning operations.


Rollback Procedure

If Gemini issues occur, switch to Mistral immediately:

  1. Change LLM_PROVIDER from gemini to mistral
  2. Change WRITING_EVALUATOR_PROVIDER from gemini to mistral
  3. Ensure MISTRAL_API_KEY is set
  4. Redeploy

Verify rollback in logs:

```
[LLMProviderFactory] Using provider: Mistral AI (mistral-large-latest)
```

Note: RAG context retrieval and fine-tuning will be unavailable under Mistral — evaluations will work without RAG augmentation.
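What the rollback trades away can be summarized as a capability map. The types and function here are illustrative, not the actual provider-factory code:

```typescript
// Illustrative capability map for the rollback decision: RAG retrieval
// and fine-tuning are Vertex AI features, so they are only available
// under the gemini provider.
type Provider = 'gemini' | 'mistral' | 'mock';

interface ProviderCapabilities {
  provider: Provider;
  rag: boolean;
  fineTuning: boolean;
}

function capabilitiesFor(provider: Provider): ProviderCapabilities {
  const isGemini = provider === 'gemini';
  return { provider, rag: isGemini, fineTuning: isGemini };
}
```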


Cost Monitoring

Set Up GCP Budget Alerts

  1. Go to Billing > Budgets & alerts
  2. Create budget: scope to Vertex AI service
  3. Set alerts at 50%, 80%, 100% thresholds
  4. Add notification email addresses

Monitor in GCP Console

  1. Go to Vertex AI > Dashboard
  2. View: total predictions, token usage, latency, error rates

Expected Monthly Costs

| Workload | Model | Est. Cost/Month |
| --- | --- | --- |
| AI Evaluation | gemini-2.5-flash | ~$75 |
| English Writing (3-stage) | flash-lite + flash | ~$300 |
| Coach Feedback | gemini-2.5-flash | ~$12 |
| Parent Insights | gemini-2.5-flash | ~$6 |
| RAG Embeddings | text-embedding-005 | ~$5 |
| Context Caching | 90% discount on cached input | ~-$200 (savings) |
| Total | Mixed | ~$200-400 |

Costs depend on volume. Above estimates assume moderate usage with context caching enabled.


Troubleshooting

Error: "GenAI client not initialized"

Cause: GOOGLE_CLOUD_PROJECT not set or credentials invalid.

Solution:

  1. Verify GOOGLE_CLOUD_PROJECT is set
  2. Check credentials (base64 decode test: `echo $GCP_SERVICE_ACCOUNT_KEY_BASE64 | base64 -d | jq .project_id`)
  3. Restart the application

Error: "Permission denied" or "403 Forbidden"

Cause: Service account missing IAM roles.

Solution:

  1. Verify roles: Vertex AI User + Service Usage Consumer
  2. Add Vertex AI Tuning User if using fine-tuning
  3. Wait 1-2 min for propagation

Error: "429 Too Many Requests" / "RESOURCE_EXHAUSTED"

Cause: Hit Vertex AI rate limits.

Solution:

  1. Built-in retry logic handles this (exponential backoff, max 5 retries)
  2. If persistent, request quota increase in IAM & Admin > Quotas
  3. Adjust VERTEX_MIN_REQUEST_INTERVAL (default 500ms)

Error: "Model not found"

Cause: Model name invalid or not available in region.

Solution:

  1. Verify model names in env vars
  2. Check region availability for the model
  3. Fallback: change VERTEX_FLASH_MODEL to a known-available model

RAG: "No embedding returned from GenAI"

Cause: Embedding API call failed.

Solution:

  1. Verify text-embedding-005 is available in your region
  2. Check Vertex AI API is enabled
  3. Check service account has Vertex AI User role

Context Cache: Falls back to inline prompt

Info: Not an error. Context caching requires 32,768+ tokens minimum. Short system prompts will gracefully fall back to inline prompts with no functional impact.
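The eligibility check can be sketched as below, assuming a rough 4-characters-per-token estimate. The heuristic is an assumption for illustration; only the 32,768-token minimum comes from the Gemini caching requirements stated above:

```typescript
// Illustrative eligibility check for context caching: Gemini enforces a
// 32,768-token minimum, so shorter prompts fall back to inline prompts.
// The ~4 chars/token estimate is a rough heuristic, not the SDK's counter.
const MIN_CACHE_TOKENS = 32_768;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldCache(systemPrompt: string): boolean {
  return estimateTokens(systemPrompt) >= MIN_CACHE_TOKENS;
}
```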


Security Best Practices

  1. Never commit credentials to git — use encrypted env vars
  2. Least-privilege IAM roles — only grant what's needed
  3. Rotate service account keys every 90 days
  4. Enable Cloud Audit Logs for Vertex AI
  5. Set billing alerts to detect unexpected usage spikes

Document Version: 2.0
Last Updated: February 27, 2026
Previous Location: stemblockai-docs/VERTEX_AI_SETUP.md