Vertex AI Setup Guide
Setup guide for Google Vertex AI (Gemini 2.5) powering StemBlock AI evaluations, RAG embeddings, context caching, and fine-tuning.
Table of Contents
- Overview
- GCP Setup
- Local Development Setup
- Production Deployment
- Environment Variables Reference
- Testing the Integration
- RAG & Embeddings
- Context Caching
- Fine-Tuning
- Rollback Procedure
- Cost Monitoring
- Troubleshooting
Overview
StemBlock AI uses the @google/genai SDK (v1.43.0+) with the Vertex AI backend for all AI features. A minimal initialization sketch follows the SDK details below.
SDK Details
| Field | Value |
|---|---|
| Package | @google/genai |
| Backend | Vertex AI (vertexai: true) |
| Auth | Service account (JSON/base64) or default credentials |
| Key File | src/shared/genai.service.ts |
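Concretely, the client is constructed with the Vertex AI backend flag enabled. A short sketch, assuming the environment variables defined later in this guide (the real wiring, including base64 key handling, lives in src/shared/genai.service.ts):
import { GoogleGenAI } from '@google/genai';

// Minimal sketch; src/shared/genai.service.ts is the authoritative setup.
const client = new GoogleGenAI({
  vertexai: true, // route calls through Vertex AI instead of the Gemini API
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: process.env.GOOGLE_CLOUD_LOCATION ?? 'us-central1',
});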
Models in Use
| Use Case | Model | Why |
|---|---|---|
| AI Evaluation | gemini-2.5-flash | High volume, cost-effective |
| Coach Feedback | gemini-2.5-flash | Balanced quality/speed |
| Parent Insights | gemini-2.5-flash | Cached heavily |
| Content Moderation | gemini-2.5-flash-lite | Speed critical, low cost |
| English Writing Feedback | gemini-2.5-flash | Quality + speed balance |
| English Writing Assessment | gemini-2.5-flash | Scoring accuracy |
| RAG Embeddings | text-embedding-005 | 256D vectors for pgvector |
| Fine-Tuning Base | gemini-2.0-flash-001 | Supervised tuning |
GCP Setup
Step 1: Create or Select a GCP Project
- Go to Google Cloud Console
- Create or select a project (e.g., stemblock-ai-prod)
- Note your Project ID
Step 2: Enable Required APIs
# Required APIs
gcloud services enable aiplatform.googleapis.com --project=YOUR_PROJECT_ID
Step 3: Create a Service Account
- Go to IAM & Admin > Service Accounts
- Click + Create Service Account
- Name: stemblock-vertex-ai
- Description: Service account for StemBlock AI Vertex AI access
Step 4: Grant Required Roles
Add these roles to the service account:
| Role | Purpose |
|---|---|
| Vertex AI User (roles/aiplatform.user) | API calls for generation and embeddings |
| Service Usage Consumer (roles/serviceusage.serviceUsageConsumer) | API quota |
| Vertex AI Tuning User (roles/aiplatform.tuningUser) | Fine-tuning jobs (if using fine-tuning) |
Step 5: Create and Download Service Account Key
- Click the service account email → Keys tab
- Add Key > Create new key > JSON
- Rename the downloaded file to gcp-service-account.json
- Never commit this file to git; it's covered by .gitignore
Step 6: Set Up Billing & Alerts
- Ensure a billing account is linked to your project
- Go to Billing > Budgets & alerts
- Create a budget with alerts at 50%, 80%, 100% thresholds
Local Development Setup
Step 1: Place the Service Account Key
cp ~/Downloads/gcp-service-account.json ./stemblockai-backend/
# Verify it's in .gitignore
grep -q "gcp-service-account.json" .gitignore || echo "gcp-service-account.json" >> .gitignore
Step 2: Update Your .env File
# AI Provider Selection
LLM_PROVIDER="gemini"
WRITING_EVALUATOR_PROVIDER="gemini"
# Google GenAI SDK Configuration
GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
GOOGLE_CLOUD_LOCATION="us-central1"
GOOGLE_APPLICATION_CREDENTIALS="./gcp-service-account.json"
# Model Configuration (defaults — usually don't need to change)
VERTEX_FLASH_MODEL="gemini-2.5-flash"
VERTEX_PRO_MODEL="gemini-2.5-pro"
VERTEX_LITE_MODEL="gemini-2.5-flash-lite"
Step 3: Test Locally
cd stemblockai-backend
npm run start:dev
Verify in logs:
[GenAIService] GenAI initialized: project=your-project, location=us-central1, flash=gemini-2.5-flash, pro=gemini-2.5-pro, lite=gemini-2.5-flash-lite
[GeminiLLMProvider] Initialized with model: gemini-2.5-flash, rate limit: 500ms, RAG enabled
[GeminiWritingProvider] Initialized with Flash: gemini-2.5-flash, Pro: gemini-2.5-pro
Production Deployment
Authentication via Base64-Encoded Key (Recommended for non-GCP hosts)
The GenAIService supports base64-encoded service account keys natively — no startup script needed.
Step 1: Encode the Key
# macOS
cat gcp-service-account.json | base64 > gcp-key-base64.txt
# Linux
cat gcp-service-account.json | base64 -w 0 > gcp-key-base64.txt
Step 2: Set Environment Variables
| Variable | Value | Encrypt |
|---|---|---|
| LLM_PROVIDER | gemini | No |
| WRITING_EVALUATOR_PROVIDER | gemini | No |
| GOOGLE_CLOUD_PROJECT | your-gcp-project-id | No |
| GOOGLE_CLOUD_LOCATION | us-central1 | No |
| GCP_SERVICE_ACCOUNT_KEY_BASE64 | <paste base64 content> | Yes |
| VERTEX_FLASH_MODEL | gemini-2.5-flash | No |
| VERTEX_PRO_MODEL | gemini-2.5-pro | No |
| VERTEX_LITE_MODEL | gemini-2.5-flash-lite | No |
The SDK initialization in genai.service.ts handles base64 decoding automatically:
if (serviceAccountKeyBase64) {
const credentials = JSON.parse(
Buffer.from(serviceAccountKeyBase64, 'base64').toString('utf-8'),
);
options.googleAuthOptions = { credentials };
}
Authentication via Default Credentials (For GCP-hosted environments)
On Cloud Run or GKE, no credentials file is needed. The SDK uses the attached service account automatically. Just set:
GOOGLE_CLOUD_PROJECT="your-gcp-project-id"
GOOGLE_CLOUD_LOCATION="us-central1"
Environment Variables Reference
Required
| Variable | Description | Example |
|---|---|---|
| LLM_PROVIDER | Provider for STEM evaluations | gemini, mistral, mock |
| WRITING_EVALUATOR_PROVIDER | Provider for English writing | gemini, mistral, claude, mock |
| GOOGLE_CLOUD_PROJECT | GCP project ID | stemblock-ai-prod |
| GOOGLE_CLOUD_LOCATION | GCP region for Vertex AI | us-central1 |
Authentication (one of these)
| Variable | Description | When |
|---|---|---|
| GCP_SERVICE_ACCOUNT_KEY_BASE64 | Base64-encoded service account JSON | Non-GCP environments |
| GOOGLE_APPLICATION_CREDENTIALS | Path to service account JSON file | Local development |
| (none) | Default credentials | Cloud Run / GKE |
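Credential resolution follows the order in the table above. A simplified sketch of that precedence, assuming the same environment variables (illustrative; genai.service.ts is authoritative):
import { GoogleGenAI, GoogleGenAIOptions } from '@google/genai';

const options: GoogleGenAIOptions = {
  vertexai: true,
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: process.env.GOOGLE_CLOUD_LOCATION,
};

const keyBase64 = process.env.GCP_SERVICE_ACCOUNT_KEY_BASE64;
if (keyBase64) {
  // 1. Explicit base64 key (non-GCP hosts)
  options.googleAuthOptions = {
    credentials: JSON.parse(Buffer.from(keyBase64, 'base64').toString('utf-8')),
  };
}
// 2. Otherwise GOOGLE_APPLICATION_CREDENTIALS (file path), then
// 3. Application Default Credentials (Cloud Run / GKE), are picked up
//    automatically by the underlying google-auth-library.
const client = new GoogleGenAI(options);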
Optional
| Variable | Description | Default |
|---|---|---|
| VERTEX_FLASH_MODEL | Flash model version | gemini-2.5-flash |
| VERTEX_PRO_MODEL | Pro model version | gemini-2.5-pro |
| VERTEX_LITE_MODEL | Lite model version | gemini-2.5-flash-lite |
| VERTEX_MIN_REQUEST_INTERVAL | Min ms between requests | 500 |
| VERTEX_MAX_RETRIES | Max retry attempts | 5 |
Available GCP Regions
| Region | Location | Notes |
|---|---|---|
| us-central1 | Iowa, USA | Default, best availability |
| us-east1 | South Carolina, USA | East coast US |
| us-west1 | Oregon, USA | West coast US |
| northamerica-northeast1 | Montreal, Canada | Canadian users |
| europe-west1 | Belgium | European users |
Testing the Integration
Verify Initialization
Check startup logs for:
[GenAIService] Using service account credentials from GCP_SERVICE_ACCOUNT_KEY_BASE64
[GenAIService] GenAI initialized: project=..., location=..., flash=gemini-2.5-flash, pro=gemini-2.5-pro, lite=gemini-2.5-flash-lite
Test AI Evaluation
curl -X POST https://api.stemblock.ai/api/v1/evaluations/generate/{submissionId} \
-H "Authorization: Bearer YOUR_JWT_TOKEN"
Test RAG System
Currently RAG has no public API controller — it's used internally. Verify via:
- Check the health endpoint: GET /health
- Trigger an evaluation (RAG context is retrieved automatically)
- Check logs for RAG context retrieval skipped (if no documents have been ingested yet) or for retrieved context sources
RAG & Embeddings
The RAG system uses Vertex AI for embedding generation.
Embedding Model
| Setting | Value |
|---|---|
| Model | text-embedding-005 |
| Dimensions | 256 (reduced from default 768) |
| Storage | PostgreSQL + pgvector (HNSW index) |
SDK Usage
// Single text embedding
const single = await client.models.embedContent({
  model: 'text-embedding-005',
  contents: text,
  config: { outputDimensionality: 256 },
});
// single.embeddings[0].values holds the 256-dimension vector

// Batch embedding (one embedding per input text, in order)
const batch = await client.models.embedContent({
  model: 'text-embedding-005',
  contents: ['text1', 'text2', 'text3'],
  config: { outputDimensionality: 256 },
});
Key Files
| File | Purpose |
|---|---|
| src/rag/embedding.service.ts | Embedding generation |
| src/rag/vector-store.service.ts | pgvector storage & similarity search |
| src/rag/rag.service.ts | RAG pipeline orchestrator |
| src/rag/ingestion.service.ts | Document chunking & ingestion |
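As the tables above note, vectors live in PostgreSQL behind a pgvector HNSW index. A hypothetical similarity lookup, assuming an illustrative rag_chunks(id, content, embedding vector(256)) table rather than the actual schema in vector-store.service.ts:
import { Client } from 'pg';

// Hypothetical schema with a cosine-distance HNSW index, e.g.:
//   CREATE INDEX ON rag_chunks USING hnsw (embedding vector_cosine_ops);
async function findSimilarChunks(db: Client, queryVector: number[], k = 5) {
  // pgvector accepts vectors as '[0.1,0.2,...]' text literals
  const vectorLiteral = JSON.stringify(queryVector);
  const { rows } = await db.query(
    `SELECT id, content, embedding <=> $1 AS distance
       FROM rag_chunks
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [vectorLiteral, k],
  );
  return rows; // nearest chunks by cosine distance
}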
Context Caching
Server-side Gemini context caching provides a 90% discount on cached input tokens.
How It Works
- System prompts (evaluation guidelines, rubrics) are cached server-side
- Subsequent requests referencing the same cache pay 10% of normal input token cost
- Minimum cache size: 32,768 tokens (Gemini enforced)
- Default TTL: 1 hour
SDK Usage
// Create cache
const cache = await client.caches.create({
model,
config: {
contents: [{ role: 'user', parts: [{ text: systemInstruction }] }],
systemInstruction: '...',
ttl: '3600s',
},
});
// Use cache in generation
const response = await client.models.generateContent({
model,
contents: userPrompt,
config: { cachedContent: cache.name },
});
Key File
src/shared/context-cache.service.ts — handles cache creation, usage, invalidation, and graceful fallback.
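The graceful fallback can be pictured as the sketch below, built on the generateContent call shown earlier (illustrative only, not the actual service code):
import { GoogleGenAI } from '@google/genai';

// Illustrative fallback sketch; context-cache.service.ts is authoritative.
async function generateWithOptionalCache(
  client: GoogleGenAI,
  model: string,
  systemInstruction: string,
  userPrompt: string,
  cacheName?: string,
) {
  if (cacheName) {
    try {
      // Cached path: input tokens covered by the cache are billed at 10%
      return await client.models.generateContent({
        model,
        contents: userPrompt,
        config: { cachedContent: cacheName },
      });
    } catch {
      // Cache expired or invalid: fall through to the inline prompt
    }
  }
  // Inline fallback: full input-token price, functionally identical output
  return client.models.generateContent({
    model,
    contents: userPrompt,
    config: { systemInstruction },
  });
}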
Fine-Tuning
Supervised fine-tuning for custom model creation from user feedback.
Configuration
| Setting | Value |
|---|---|
| Base model | gemini-2.0-flash-001 |
| Min examples | 20 |
| Default epochs | 5 |
| Cost | ~$3/1M training tokens |
SDK Usage
const tuningJob = await client.tunings.tune({
baseModel: 'gemini-2.0-flash-001',
trainingDataset: {
examples: [
{ textInput: 'prompt', output: 'corrected response' },
// ... 20+ examples
],
},
config: { epochCount: 5, learningRateMultiplier: 1.0 },
});
// Check status
const job = await client.tunings.get({ name: tuningJob.name });
// job.state: 'JOB_STATE_SUCCEEDED' | 'JOB_STATE_FAILED' | ...
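Because tuning jobs run asynchronously, callers typically poll until a terminal state. A small helper built on the client.tunings.get call above (the one-minute interval and the terminal-state list are assumptions):
import { GoogleGenAI } from '@google/genai';

// Polling sketch; interval and terminal states are assumptions.
async function waitForTuningJob(client: GoogleGenAI, jobName: string) {
  const terminal = ['JOB_STATE_SUCCEEDED', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED'];
  for (;;) {
    const job = await client.tunings.get({ name: jobName });
    if (job.state && terminal.includes(job.state)) return job;
    await new Promise((resolve) => setTimeout(resolve, 60_000)); // check once a minute
  }
}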
Key Files
| File | Purpose |
|---|---|
| src/rag/fine-tuning.service.ts | Job submission, status tracking, training data collection |
| src/rag/feedback-loop.service.ts | Collects user feedback, triggers fine-tuning at 50+ examples |
IAM Requirement
The service account needs the Vertex AI Tuning User role for fine-tuning operations.
Rollback Procedure
If Gemini issues occur, switch to Mistral immediately:
- Change LLM_PROVIDER from gemini to mistral
- Change WRITING_EVALUATOR_PROVIDER from gemini to mistral
- Ensure MISTRAL_API_KEY is set
- Redeploy
Verify rollback in logs:
[LLMProviderFactory] Using provider: Mistral AI (mistral-large-latest)
Note: RAG context retrieval and fine-tuning will be unavailable under Mistral — evaluations will work without RAG augmentation.
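Conceptually, the factory reads LLM_PROVIDER once at startup, which is why an env change plus redeploy is sufficient. A hypothetical sketch of that selection (the interface and class names here are illustrative, not the project's actual ones):
// Hypothetical sketch; the real LLMProviderFactory is authoritative.
interface LLMProvider {
  name: string;
  evaluate(prompt: string): Promise<string>;
}

declare const GeminiLLMProvider: new () => LLMProvider;
declare const MistralLLMProvider: new () => LLMProvider;
declare const MockLLMProvider: new () => LLMProvider;

function createLLMProvider(): LLMProvider {
  switch (process.env.LLM_PROVIDER) {
    case 'gemini':
      return new GeminiLLMProvider();
    case 'mistral':
      return new MistralLLMProvider();
    default:
      return new MockLLMProvider(); // mock keeps the app running without API keys
  }
}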
Cost Monitoring
Set Up GCP Budget Alerts
- Go to Billing > Budgets & alerts
- Create budget: scope to Vertex AI service
- Set alerts at 50%, 80%, 100% thresholds
- Add notification email addresses
Monitor in GCP Console
- Go to Vertex AI > Dashboard
- View: total predictions, token usage, latency, error rates
Expected Monthly Costs
| Workload | Model | Est. Cost/Month |
|---|---|---|
| AI Evaluation | gemini-2.5-flash | ~$75 |
| English Writing (3-stage) | flash-lite + flash | ~$300 |
| Coach Feedback | gemini-2.5-flash | ~$12 |
| Parent Insights | gemini-2.5-flash | ~$6 |
| RAG Embeddings | text-embedding-005 | ~$5 |
| Context Caching | (90% savings on cached) | ~-$200 savings |
| Total | Mixed | ~$200-400 |
Costs depend on volume. Above estimates assume moderate usage with context caching enabled.
Troubleshooting
Error: "GenAI client not initialized"
Cause: GOOGLE_CLOUD_PROJECT not set or credentials invalid.
Solution:
- Verify GOOGLE_CLOUD_PROJECT is set
- Check credentials (base64 decode test: echo $GCP_SERVICE_ACCOUNT_KEY_BASE64 | base64 -d | jq .project_id)
- Restart the application
Error: "Permission denied" or "403 Forbidden"
Cause: Service account missing IAM roles.
Solution:
- Verify roles: Vertex AI User + Service Usage Consumer
- Add Vertex AI Tuning User if using fine-tuning
- Wait 1-2 min for propagation
Error: "429 Too Many Requests" / "RESOURCE_EXHAUSTED"
Cause: Hit Vertex AI rate limits.
Solution:
- Built-in retry logic handles this (exponential backoff, max 5 retries; see the sketch after this list)
- If persistent, request quota increase in IAM & Admin > Quotas
- Adjust VERTEX_MIN_REQUEST_INTERVAL (default 500ms)
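The built-in retry behaves roughly like the sketch below (illustrative; per the note above, the real logic caps attempts at VERTEX_MAX_RETRIES and spaces requests by VERTEX_MIN_REQUEST_INTERVAL):
// Illustrative exponential-backoff sketch, not the project's actual code.
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // in practice only 429 / RESOURCE_EXHAUSTED is worth retrying
      if (attempt === maxRetries) break;
      // Back off 1s, 2s, 4s, ... plus jitter before the next attempt
      const delayMs = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}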
Error: "Model not found"
Cause: Model name invalid or not available in region.
Solution:
- Verify model names in env vars
- Check region availability for the model
- Fallback: change VERTEX_FLASH_MODEL to a known-available model
RAG: "No embedding returned from GenAI"
Cause: Embedding API call failed.
Solution:
- Verify text-embedding-005 is available in your region
- Check that the Vertex AI API is enabled
- Check that the service account has the Vertex AI User role
Context Cache: Falls back to inline prompt
Info: Not an error. Context caching requires 32,768+ tokens minimum. Short system prompts will gracefully fall back to inline prompts with no functional impact.
Security Best Practices
- Never commit credentials to git — use encrypted env vars
- Least-privilege IAM roles — only grant what's needed
- Rotate service account keys every 90 days
- Enable Cloud Audit Logs for Vertex AI
- Set billing alerts to detect unexpected usage spikes
Document Version: 2.0
Last Updated: February 27, 2026
Previous Location: stemblockai-docs/VERTEX_AI_SETUP.md