AI Engineer
Updated for 2026: AI Engineer interview questions and answers covering core skills, tools, and best practices for roles in the US, Europe & Canada.
What is Retrieval-Augmented Generation (RAG) and how do you build it?
RAG combines retrieval (search) with generation (LLM) to ground answers in your data. Core steps: - Chunk documents and create embeddings - Store in a vector database - Retrieve top-k relevant chunks - Prompt the model with retrieved context Quality depends on chunking, retrieval, and evaluation—not just the LLM.
What are embeddings and how do you use them for search and recommendations?
Embeddings are vector representations that capture semantic similarity. Use cases: - Semantic search - Clustering - Recommendations - Deduplication Key considerations: model choice, normalization, distance metric, and evaluation with real queries. Monitor drift and update embeddings when content changes.
How do vector databases work and what should you consider when choosing one?
Vector DBs store embeddings and support approximate nearest neighbor (ANN) search. Consider: - Index type and recall/latency trade-offs - Filtering + hybrid search (keyword + vector) - Update frequency and reindexing - Multi-tenant isolation and cost Choose based on workload: query volume, freshness needs, and filter complexity.
What is prompt injection and how do you mitigate it in LLM applications?
Prompt injection is when untrusted input manipulates the model to ignore instructions or reveal secrets. Mitigations: - Treat all external text as untrusted - Separate system instructions from user content - Use allowlisted tools/actions - Output filtering + policy checks - Least-privilege tool permissions Test with red-team prompts and monitor for policy violations.
How do you evaluate LLM applications beyond simple accuracy?
LLM evaluation is multi-dimensional. Measure: - Factuality/grounding - Relevance and completeness - Toxicity/safety - Latency and cost - User satisfaction Use golden sets, human review, and automated checks. Track regressions when prompts/models change.
How do you reduce hallucinations in LLM-powered products?
Hallucinations happen when the model generates unsupported claims. Mitigations: - Use RAG with high-quality retrieval - Require citations from sources - Add refusal behavior when context is missing - Use constrained outputs (schemas) Also improve prompts and evaluate on failure cases. Never present answers as authoritative without grounding for high-stakes domains.
Why use structured outputs (JSON schemas) with LLMs and how do you implement them safely?
Structured outputs reduce parsing errors and enable reliable automation. Use: - JSON schema constraints - Post-parse validation - Retry-on-parse-failure with strict prompts Never let the model execute privileged actions directly—validate and authorize tool calls server-side.
Fine-tuning vs RAG: when should you use each for an AI product?
RAG is best for injecting up-to-date knowledge and citations. Fine-tuning is best for style, format, and domain behavior. Often you combine them: - Fine-tune for tone and instruction-following - RAG for factual, current content Choose based on latency, cost, update frequency, and evaluation results.
What safety guardrails should AI engineers implement for user-facing assistants?
Guardrails reduce harmful outputs and unsafe actions. Include: - Content policy filters - Sensitive topic handling - Tool/action allowlists - Rate limiting and abuse detection - Logging + review workflows Design for least privilege and handle jailbreak attempts as a normal threat, not an edge case.
How do you build maintainable prompt templates and avoid prompt spaghetti?
Treat prompts like code. Best practices: - Use versioned templates and small reusable components - Separate instructions, context, and output schema - Add tests with golden inputs - Track changes and regressions This makes prompt iterations auditable and reduces accidental behavior changes.
How do you implement conversation memory without leaking sensitive data or growing costs?
Memory should be selective and privacy-safe. Approaches: - Summarize history - Store structured user preferences - Retrieve only relevant past context Avoid storing secrets, implement retention policies, and cap tokens. Use RAG-style retrieval for long-term memory instead of sending full history every time.
How do you design safe tool calling (function calling) in AI agents?
Tool calling must be constrained and authorized. Best practices: - Allowlist tools and validate arguments - Require confirmations for destructive actions - Enforce permissions server-side - Log tool calls for auditing Never let the model directly execute privileged actions without validation and policy checks.
How do you cache LLM responses safely to reduce latency and cost?
Cache when outputs are deterministic enough. Techniques: - Cache embeddings and retrieval results - Cache prompt+context hashes - Use short TTLs for dynamic data Avoid caching sensitive content, and include model/version in cache keys to prevent mixing outputs across model changes.
How do you choose chunking strategies for RAG (size, overlap, structure)?
Chunking quality strongly affects retrieval. Guidelines: - Chunk by structure (headings/sections) - Keep chunks small enough to be specific - Use overlap to preserve context - Store metadata (source, section) Evaluate retrieval with real queries and tune chunk size/overlap based on recall and answer quality.
What is hybrid search and when is it better than pure vector search?
Hybrid search combines keyword (BM25) and vector similarity. It’s better when: - Exact terms matter (IDs, error codes) - Queries are short or ambiguous - You need filtering and precision Hybrid approaches often outperform pure vector search for enterprise docs where terminology is important.
How do you curate datasets for evaluation and fine-tuning in AI products?
Dataset quality drives model behavior. Practices: - Define user intents and failure cases - Create balanced, labeled examples - Remove sensitive data - Version datasets and track provenance Use a golden set for regression testing and update it as product requirements evolve.
How do you handle privacy and sensitive data in AI/LLM applications?
LLM apps can leak data if not designed carefully. Practices: - Minimize what you send to the model - Redact sensitive fields - Use retention controls - Apply access control and auditing For enterprise, consider on-prem/isolated deployments and strict data processing agreements.
What should you monitor in production LLM applications?
Monitor both system and quality signals. Track: - Latency and error rate - Cost per request - Safety policy violations - Retrieval quality (for RAG) - User feedback and escalations Use sampling for human review and track regressions when prompts/models change.
How do you reduce LLM costs without harming quality?
Cost reduction is a system design problem. Levers: - Smaller/cheaper models for simple tasks - Caching and batching - Shorter prompts and better retrieval - Multi-model routing Measure quality with a golden set so you don’t optimize cost at the expense of user experience.
What is multi-model routing and how do you implement it?
Multi-model routing chooses different models based on task complexity. Examples: - Cheap model for classification/summaries - Strong model for reasoning - Fallback when confidence is low Implement with routing rules, confidence scoring, and evaluation. Always log routing decisions to debug failures and costs.