Notes from the Google Gen AI Intensive Course
Being updated as I go
Day 1: Introduction to Agents
Definitions
Recurring core definition: The Model = The “Brain”. Tools = “Hands” and “Eyes”. Orchestration Layer = “Nervous System”. Deployment = “Body” and “Legs”
“In essence, an agent is a system dedicated to the art of context window curation. It is a relentless loop of assembling context, prompting the model, observing the result, and then re-assembling a context for the next step. […]”
Agentic Workflow
Get the mission (user prompt)
Scan the scene (gather context, orchestration layer accesses resources)
Think It Through (core “think” loop driven by a reasoning model: analyzes mission #1 against scene #2, devises a plan)
Take Action (orchestration layer selects and invokes a tool)
Observe and Iterate (agent observes the outcome, adds to context or “memory”, returns to step #3). In short: “Think, Act, Observe”
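The five steps above can be sketched as a plain-Python loop. Everything here is an illustrative stand-in (a stubbed model and a fake tool registry), not any real agent framework API:

```python
# Minimal Think-Act-Observe loop with a stubbed model and fake tools.
# `fake_model` and `TOOLS` are illustrative stand-ins, not a real API.

def fake_model(context: list[str]) -> dict:
    """Pretend reasoning step: decide on a tool call or a final answer."""
    if not any("observation" in c for c in context):
        return {"action": "search", "args": {"query": "weather Berlin"}}
    return {"action": "final_answer", "args": {"text": "It is sunny."}}

TOOLS = {
    "search": lambda query: f"observation: search results for {query!r}",
}

def run_agent(mission: str, max_steps: int = 5) -> str:
    context = [f"mission: {mission}"]           # 1. get the mission / 2. scan the scene
    for _ in range(max_steps):
        decision = fake_model(context)          # 3. think it through
        if decision["action"] == "final_answer":
            return decision["args"]["text"]
        tool = TOOLS[decision["action"]]        # 4. take action
        result = tool(**decision["args"])
        context.append(result)                  # 5. observe and iterate
    return "gave up"                            # step budget exhausted

print(run_agent("What's the weather in Berlin?"))  # → It is sunny.
```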
Taxonomy of Agentic Systems
Level 0: The Core Reasoning System. The language model itself, in isolation. Complete lack of real-time awareness; it is functionally “blind” to any event or fact outside its training data.
Level 1: The Connected Problem-Solver. Becomes a functional agent by connecting to and utilizing external tools - the aforementioned “Hands”. No longer confined to its static, pre-trained knowledge, it can interact with the world: web search, realtime API access, database access.
Level 2: The Strategic Problem-Solver. Multi-step goal achievement through Context Engineering. The agent actively selects, packages, and manages the most relevant information for each step of its plan.
Level 3: The Collaborative Multi-Agent System. Paradigm shift from a single, all-powerful “super-agent” toward a “team of specialists” working in concert, dividing labor. Agents treat other agents as tools. Constrained by the reasoning limitations of today’s LMs.
Level 4: The Self-Evolving System. Agentic system identifies gaps in its own capabilities and dynamically creates new tools or even new agents to fill them. Learning and evolving organization.
This was misrepresented on social media as “Google seems very confident that its agents will be able to solve problems autonomously in the very near future.” This was nowhere stated, and since Level 3 is explicitly described as “Constrained by the reasoning limitations of today’s LMs”, it seems clear that this was not the intended messaging.
Core Agent Architecture: Model, Tools, Orchestration
Model: the “best” model is the one that sits at the optimal
intersection of quality, speed, and price for your specific task. “You don’t use a sledgehammer to crack a nut.”. The whitepaper compares Gemini 2.5 Pro vs. Flash. However, the hands-on sessions exclusively use Flash Lite (gemini-2.5-flash-lite).
Because the Gemini API Free Tier showed very low rate limits (less than 1 per minute?!), I switched to paid Tier 1. The costs including web search were a fraction of a cent.
Tools:
Retrieval: Natural Language to SQL (NL2SQL) was given as one example, besides RAG via Vector Stores and Knowledge Graphs.
Actions: wrapping existing functions and APIs. But also generating and executing code on the fly - through Agent Engine Code Execution. “Autonomous Actor”. Also includes tools for human interaction (human-in-the-loop - HIL, HITL), pausing the workflow and asking for confirmation.
Function Calling: OpenAPI. MCP presented as popular because more convenient. Native tools like Google Search “where the function invocation happens as part of the LM call itself”.
Orchestration layer: Think-Act-Observe loop. State-machine.
Agent Ops
Evals: traditional unit tests (assert x==y) do not work for agent responses that are probabilistic. Solution 1: A/B tests using KPIs; Solution 2: LM judge; Solution 3: metrics-driven.
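A tiny sketch of the metrics-driven idea: instead of asserting exact equality, score a probabilistic response against soft criteria and gate on a threshold. The specific checks and threshold here are my own illustrative assumptions, not from the course material:

```python
# Metrics-driven eval sketch: score agent output against soft criteria
# instead of exact string equality. Checks and threshold are illustrative.

def score_response(response: str) -> float:
    checks = [
        len(response) > 0,                   # non-empty answer
        len(response.split()) < 200,         # reasonably concise
        "error" not in response.lower(),     # no surfaced failure
    ]
    return sum(checks) / len(checks)

def passes(response: str, threshold: float = 0.9) -> bool:
    return score_response(response) >= threshold

print(passes("The invoice was sent to the customer."))  # True
print(passes("error: tool call failed"))                # False
```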
Debugging and logging through OpenTelemetry Traces. “Trace data can be seamlessly collected in platforms like Google Cloud Trace, which visualize and search across vast quantities of traces, streamlining root cause analysis.”
Integrations
Humans and Agents
MCP UI given as an example of how agents don’t use an existing user interface on behalf of the user. AG UI as a specialized UI messaging system which can sync client state with an agent. A2UI for generation of bespoke interfaces [broken link; footnote: “A2UI is a protocol of generating UI via structured output and A2A message passing”].
Gemini Live API: multi-modal, streaming, interruptible.
Agents and Agents: Agent2Agent (A2A) protocol for enterprise-grade discovery and communication. “As opposed to MCP which focuses on solving
transactional requests, Agent 2 Agent communication is typically for additional problem solving.”.
Money: Agent Payments Protocol (AP2) for agentic commerce. Cryptographically-signed digital “mandates”, acting as verifiable proof of user intent, creating a non-repudiable audit trail for every transaction. Complementing this is x402, an open internet payment protocol that uses the standard HTTP 402 “Payment Required” status code. Together: foundational trust layer for the agentic web.
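To illustrate the verifiable-intent idea behind a signed mandate: the sketch below signs a mandate and detects tampering. This is only a conceptual stand-in - AP2 itself is a full protocol built on public-key signatures, whereas HMAC here is just the simplest way to demonstrate sign-and-verify:

```python
# Illustration only: signing and verifying a "mandate" to show the
# verifiable-proof-of-intent idea. HMAC is a stand-in for AP2's real
# public-key signatures; key and mandate fields are invented for the demo.
import hashlib
import hmac
import json

USER_KEY = b"demo-secret"  # stand-in for the user's signing key

def sign_mandate(mandate: dict) -> str:
    payload = json.dumps(mandate, sort_keys=True).encode()
    return hmac.new(USER_KEY, payload, hashlib.sha256).hexdigest()

def verify_mandate(mandate: dict, signature: str) -> bool:
    return hmac.compare_digest(sign_mandate(mandate), signature)

mandate = {"intent": "buy flight", "max_price_eur": 300}
sig = sign_mandate(mandate)
print(verify_mandate(mandate, sig))   # True: mandate matches the signature
mandate["max_price_eur"] = 3000       # agent (or attacker) tampers with it
print(verify_mandate(mandate, sig))   # False: tampering is detectable
```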
Security considerations
Defense-in-depth approach, including reasoning-based defenses. Agent Identity - a new class of principal: distinct from the identity of the user who invoked it and the developer who built it. SPIFFE mentioned as a standard for cryptographically verifiable identity. Google Model Armor “for organizations that prefer a fully managed, enterprise-grade solution for these dynamic checks”, including against PII leakage. Gateway “Control Plane” for governance: inspect, route, monitor, and manage every interaction.
Hands-on
The hands-on Jupyter notebooks introduced single- and multi-agent configurations. Four workflow patterns of agent invocation were shown: sequential, parallel, loop (refine), and LLM-orchestrated. There is also a “custom” agent type. A quick standalone starter project can be created with:
!adk create sample-agent --model gemini-2.5-flash-lite --api_key $GOOGLE_API_KEY
Learnings
Gemini 2.5 Flash Lite appears to be a viable reasoning model for agentic tasks
The previously generous rate limits on the free tier may have been reduced in actual practice. This could have been due to very high demand during the course, though.
The Payment aspect we had included in our ENISA CA Day talk on Trustworthy AI has been actualized through AP2.
There is a Special Interest Group on AI Agent Observability at OpenTelemetry.
Kaggle notebooks are reachable from the outside: there is a reverse proxy in place so that `adk web` can be started in a Jupyter notebook and be used from a regular web browser.
Day 3: Context Engineering: Sessions, Memory
Core concepts
Context Engineering: The process of dynamically assembling and managing information within an LLM’s context window to enable stateful, intelligent agents.
includes data returned from tools and sub-agents, besides the system prompt, tool definitions, RAG input, …
the whitepaper acknowledges “context rot,” a phenomenon where the LLM’s ability to pay attention to critical information diminishes as context grows. Context Engineering directly addresses this by employing strategies to dynamically mutate the history—such as summarization, selective pruning, or other compaction techniques—to preserve vital information while managing the overall token count, ultimately leading to more robust and personalized AI experiences.
Sessions: The container for an entire conversation with an agent, holding the
chronological history of the dialogue and the agent’s working memory.
Session storage can hold user- and app-specific information, though (“state” vs. the “events” from chat-style interaction) 👉 “transferable characteristics”.
`ToolContext.state["user:name"] = user_name`
can be persisted to SQLite
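A minimal sketch of what SQLite-backed session state looks like under the hood (in ADK this is handled by a session service; the schema and helper functions below are my own assumptions for illustration):

```python
# Sketch: persisting session "state" (user/app key-value pairs) to SQLite,
# mimicking what an ADK database session service does. Schema is an assumption.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute("CREATE TABLE sessions (session_id TEXT PRIMARY KEY, state TEXT)")

def save_state(session_id: str, state: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?)",
        (session_id, json.dumps(state)),
    )

def load_state(session_id: str) -> dict:
    row = conn.execute(
        "SELECT state FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}

save_state("s1", {"user:name": "Ada"})  # like ToolContext.state["user:name"]
print(load_state("s1"))                 # {'user:name': 'Ada'}
```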
Memory: The mechanism for long-term persistence, capturing and consolidating key information across multiple sessions to provide a continuous and personalized experience for LLM agents.
Memory considerations
Memory (vs. Session) storage is the engine of long-term personalization. It moves beyond RAG (which makes an agent an expert on facts) to make the agent an expert on the user.
Memory is an active, LLM-driven ETL pipeline—responsible for extraction, consolidation, and retrieval—that distills the most important information from conversation history. With extraction, the system distills information of interest (can be customized) into key memory points. Following this, consolidation curates and integrates this new information with the existing corpus, resolving conflicts, and deleting redundant data to ensure a coherent knowledge base. Current storage implementations include keyword-based (local in-memory), vector database (Vertex AI Engine deployment), and knowledge-graph based. Another emerging strategy discussed in the QA session is a narrative structure. Multi-modal memory artefacts are turned into text.
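The extract-then-consolidate flow can be sketched with rule-based stand-ins where a real system would call an LLM. All of the logic below (the trigger phrase, the key names, newest-wins conflict resolution) is illustrative, not the actual pipeline:

```python
# Sketch of the extract -> consolidate memory pipeline, with rule-based
# stand-ins where a real implementation would use an LLM.

def extract(turn: str) -> dict:
    """Distill a key memory point from one conversation turn (stubbed)."""
    if "my name is" in turn.lower():
        return {"key": "user:name", "value": turn.split()[-1].strip(".")}
    return {}

def consolidate(memory: dict, point: dict) -> dict:
    """Integrate a new point into the corpus, resolving conflicts (newest wins)."""
    if point:
        memory[point["key"]] = point["value"]
    return memory

memory: dict = {}
for turn in ["Hi, my name is Bob.", "Actually, my name is Alice."]:
    memory = consolidate(memory, extract(turn))
print(memory)  # {'user:name': 'Alice'}  -- the conflict resolved to the newest
```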
By tracking provenance and employing safeguards against risks like memory poisoning, developers can build trusted, adaptive assistants that learn (declarative memory) and adapt (procedural memory).
Security considerations
Model Armor is introduced to scrub PII from user data before persisting.
Day 4: Agent Quality
While “Evals” were hyped up quite a bit this year, this whitepaper posits that:
The Trajectory is the Truth: We must evolve beyond evaluating just the final output. The true measure of an agent’s quality and safety lies in its entire decision-making process.
… and proposes Observability as the foundation.
Problem statement
Agent Quality in a Non-Deterministic World: An agent can pass 100 unit tests and still fail catastrophically in production because its failure isn’t a bug in the code; it’s a flaw in its judgment. In traditional software, failure is explicit: a system crashes, throws a NullPointerException, or returns an explicitly incorrect calculation. These failures are obvious, deterministic, and traceable to a specific error in logic. Agents fail differently, and dealing with this is also different. You cannot use a breakpoint to debug a hallucination. You cannot write a unit test to prevent emergent bias.
The core technical challenge stems from the evolution from model-centric AI to system-centric AI. Agents decompose complex goals (“plan my trip”) into multiple sub-tasks. This creates a trajectory (Thought → Action → Observation → Thought...). The non-determinism of the LLM now compounds at every step. Further, as Agents interact with external tools, the Agent’s next action depends entirely on the state of an external, uncontrollable world. On the other hand, Agents maintain state. Short-term “scratchpad” memory tracks the current task, while long-term memory allows the agent to learn from past interactions. This means the agent’s behavior evolves, and an input that worked yesterday might produce a different result today based on what the agent has “learned”.
When evaluating Multi-Agent systems, the objective function may become ambiguous. The whitepaper contrasts cooperative MAS (e.g., supply chain optimization) with competitive MAS (example given: game theory scenarios or auction systems), noting that “performance” can be a global or a local, per-agent, metric - and these may not align.
Thus, the primary unit of evaluation is no longer the model, but the entire system trajectory.
The Pillars of Agent Quality
Effectiveness (Goal Achievement): user-centered metrics and business KPIs.
task success rate, user satisfaction, overall quality
`adk web` facilitates regression tests: when an agent produces an ideal interaction that should be used as a benchmark, navigate to the Eval tab and click “Add current session.” This saves the entire interaction as an Eval Case and locks in the agent’s current text as the ground-truth final_response.
You can then run this Eval Set via the CLI (adk eval) or pytest to automatically check future agent versions against this saved answer, catching any regressions in output quality.
Efficiency (Operational Cost): measured in resources consumed: total tokens (cost), wall-clock time (latency), and trajectory complexity (total number of steps)
Robustness (Reliability): a robust agent retries failed calls, asks the user for clarification when needed, and reports what it couldn’t do and why rather than crashing or hallucinating.
Safety & Alignment (Trustworthiness): ensures the agent stays on task, refuses harmful instructions, and operates as a trustworthy proxy for your organization
You cannot measure any of these pillars if you only see the final answer.
Trajectory evaluation
LLM Planning, Tool Use, Tool Response Interpretation, RAG performance. Beyond correctness, we must evaluate the process itself: exposing inefficient resource allocation, such as an excessive number of API calls, high latency, or redundant efforts. It also reveals robustness failures, such as unhandled exceptions. Evaluations in Multi-Agent systems must also include inter-agent communication logs to check for misunderstandings or communication loops and ensure agents are adhering to their defined roles without conflicting with others. 👉 measure “Trajectory adherence”
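The efficiency and robustness signals named above can be computed directly from a recorded trajectory. The span format below is my own assumption, loosely modeled on OpenTelemetry span attributes:

```python
# Sketch: computing efficiency metrics (tokens, latency, step count, redundant
# tool calls) from a recorded trajectory. The span format is an assumption.

trajectory = [
    {"kind": "model", "tokens": 350, "ms": 900},
    {"kind": "tool",  "tokens": 0,   "ms": 250, "name": "search"},
    {"kind": "model", "tokens": 420, "ms": 1100},
    {"kind": "tool",  "tokens": 0,   "ms": 260, "name": "search"},  # redundant
]

def trajectory_metrics(spans: list[dict]) -> dict:
    tool_calls = [s["name"] for s in spans if s["kind"] == "tool"]
    return {
        "total_tokens": sum(s["tokens"] for s in spans),
        "latency_ms": sum(s["ms"] for s in spans),
        "steps": len(spans),
        "redundant_tool_calls": len(tool_calls) - len(set(tool_calls)),
    }

print(trajectory_metrics(trajectory))
# {'total_tokens': 770, 'latency_ms': 2510, 'steps': 4, 'redundant_tool_calls': 1}
```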
Automated metrics
Traditional ML metrics such as ROUGE, BLEU, BERTScore, … still have their place: as trend indicators (not as absolute measures of quality) during CI/CD. Track changes: if your main branch consistently averages a 0.8 BERTScore on your “golden set,” and a new code commit drops that average to 0.6, you have automatically detected a significant regression. This makes metrics the perfect, low-cost “first filter” to catch obvious failures at scale before escalating to more expensive LLM-as-a-Judge or human evaluation.
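That trend-indicator gate is simple to express in code. The scores and tolerance below are illustrative, standing in for per-example BERTScores on a golden set:

```python
# Sketch of a metric-as-trend-indicator CI gate: fail the commit if the
# average golden-set score drops too far below the main-branch baseline.
# Scores and tolerance are illustrative placeholders.

def regression_gate(baseline: list[float], candidate: list[float],
                    tolerance: float = 0.05) -> bool:
    """Return False (fail) if the average drops by more than `tolerance`."""
    base_avg = sum(baseline) / len(baseline)
    cand_avg = sum(candidate) / len(candidate)
    return cand_avg >= base_avg - tolerance

main_scores = [0.81, 0.79, 0.80]   # ~0.8 average on the golden set
print(regression_gate(main_scores, [0.80, 0.78, 0.82]))  # True  (no regression)
print(regression_gate(main_scores, [0.62, 0.58, 0.60]))  # False (escalate)
```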
LLM-as-a-Judge Paradigm: the whitepaper recommends paired comparisons over a biased, noisy 1-5 grading scale
emerging Agent-as-a-Judge: uses one agent to evaluate the full execution trace of another: Plan quality, Tool use, Context handling. Create a specialized “Critic Agent” with a prompt (rubric) that asks it to evaluate the collected trace object directly.
Human-in-the-Loop (HITL) Evaluation: method to establish a human-calibrated benchmark, ensuring the agent’s behavior aligns with complex human values, contextual needs, and domain-specific accuracy. “We must move away from the idea that human rating provides a perfect “objective ground truth.” For highly subjective tasks (like assessing creative quality or nuanced tone), perfect inter-annotator agreement is rare.”
Security considerations
The whitepaper proposes several procedural measures (e.g. systematic red teaming), but also a concrete Safety Plugin mechanism: this would register e.g. its `check_input_safety()` prompt injection classifier with the adk `before_model_callback` hook, or its `check_output_pii()` PII scanner with the adk `after_model_callback`.
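The hook pattern itself can be shown without the real ADK API. Below is a framework-free sketch: a "before model" callback that can block a request and an "after model" callback that scrubs the response. The classifiers are naive keyword stand-ins, and the model call is faked:

```python
# Illustrative sketch of the before/after-model hook pattern, independent of
# the real ADK API. Classifiers are naive keyword stand-ins; the model is fake.

def check_input_safety(prompt: str) -> str:
    """Stand-in prompt-injection classifier (keyword match, not a real model)."""
    if "ignore previous instructions" in prompt.lower():
        raise ValueError("blocked: possible prompt injection")
    return prompt

def check_output_pii(response: str) -> str:
    """Stand-in PII scrubber: redacts one hard-coded phone number."""
    return response.replace("555-0123", "[REDACTED]")

def call_model_with_hooks(prompt: str,
                          before=check_input_safety,
                          after=check_output_pii) -> str:
    prompt = before(prompt)                # before_model_callback analogue
    response = f"Call me at 555-0123 about: {prompt}"  # fake model output
    return after(response)                 # after_model_callback analogue

print(call_model_with_hooks("Summarize my meeting notes"))
# → Call me at [REDACTED] about: Summarize my meeting notes
```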
Observability
Observability shifts the focus from mere monitoring (verifying if an agent is active) to understanding the quality of its cognitive processes. The three pillars of Observability:
Logs: the Agent’s diary
Traces: the narrative thread that connects the logs into a coherent story. Individual log entries (OpenTelemetry: “spans”) are stitched together into a complete end-to-end view, revealing a causal chain. Span attributes contribute rich metadata, like latency.
traces recorded to Google Cloud Trace; streamlined through Agent deployment to Vertex AI Engine
Metrics: quantitative, aggregated health scores. Tokens per Task/API Cost per Run.
Integration can be done through the same type of plugin hooks as for the security considerations discussed above, in particular: before/after_agent_callbacks, before/after_tool_callbacks, before/after_model_callbacks, on_model_error_callback. (See section 3.1 in codelabs notebook 4a.) Alternative: `plugins` attribute on the Agent runner (codelab 4a @3.4). google.adk.plugins.logging_plugin.LoggingPlugin is a provided logging plugin.
Learnings
Trajectory evaluation and inter-agent communication logs as a way to expose runaway costs of agents.
The daily sessions frequently tout Google’s Model Armor product for use against prompt injection, PII scrubbing etc., but the adk hooks allow for perhaps more cost-efficient means - Microsoft Presidio comes to mind.
adk hooks are used to tie in security and logging plugins, either self-written or provided by the adk. Alternative: agent runner plugin registration.
`adk web` (as well as `adk eval`) offers a way to evaluate agents. However, the verification based on text similarity appeared brittle: replacing “on” with “off” in the reference answer for the “home automation” test case caused false accepts even with similarity threshold = 0.8. This could motivate the agent-based “user simulation” evals.
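The brittleness is easy to reproduce with any token-overlap score. The sketch below uses plain Jaccard similarity (not the actual adk eval metric) to show that flipping the one word that inverts the meaning barely moves the score:

```python
# Sketch showing why plain text-similarity scoring is brittle: flipping a
# single critical word ("on" -> "off") barely moves a token-overlap score.
# Jaccard similarity here is an illustration, not the actual adk eval metric.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

reference = "I have turned the living room lights on"
response  = "I have turned the living room lights off"

score = jaccard(reference, response)
print(round(score, 2))  # 0.78 -- near a 0.8 threshold despite opposite meaning
```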
As part of the LLM judge biases, “recency bias” was mentioned during the Q&A: the LLM judge may assign a passing verdict on a wrong trajectory if the final outcome is still correct.
New trajectories can be judged through self-consistency (several runs take same course) or clustering (thereby detecting outliers). This was also proposed during Q&A.
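The self-consistency idea from the Q&A can be sketched by clustering runs on their tool-call sequence and flagging whatever deviates from the majority. The trajectories below are hard-coded for illustration:

```python
# Sketch of self-consistency checking: cluster repeated runs by tool-call
# sequence; runs that deviate from the majority are flagged as outliers.
# Trajectories are hard-coded stand-ins for recorded agent runs.
from collections import Counter

runs = [
    ("search", "summarize"),
    ("search", "summarize"),
    ("search", "summarize"),
    ("search", "search", "search", "summarize"),  # runaway outlier
]

def find_outliers(trajectories: list[tuple]) -> list[tuple]:
    counts = Counter(trajectories)
    majority, _ = counts.most_common(1)[0]
    return [t for t in trajectories if t != majority]

print(find_outliers(runs))  # [('search', 'search', 'search', 'summarize')]
```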
An agent was shared that answers questions based on a Pandas dataframe by writing code: Nvidia research agent