Stanford CS230 | Fall 2025 | Lecture 8: Agents, Prompts, and RAG
In 2016, Microsoft launched a bot on Twitter that was meant to learn from users. It became so racist that the company shut it down after just 16 hours. This was not some improvised team: it was Microsoft. Yet even with billions of dollars and hundreds of engineers, truly controlling a language model remains an unsolved problem. And here lies the first flaw in the common narrative: we think that a “powerful model” is synonymous with a “useful model.” In reality, the recent history of LLMs is one long lesson in how hard it is to get outputs that are reliable, up to date, and, most importantly, correct.

The takeaway from this lecture: the real breakthrough lies not in building ever-larger foundation models, but in learning to orchestrate, refine, and enhance the models we already have. From simple, improved prompts to full-fledged agentic and multi-agent workflows, the difference between a toy and a product lies in the architecture surrounding the model, not in the model itself. Andrew Ng has given a name to this approach: “agentic workflows,” i.e., systems where models, external tools, memory, and APIs are combined into a chain of autonomous actions.

Take the case of a biotech company that wants to categorize customer reviews. In theory, it's as simple as asking the model, “Is this sentence positive, neutral, or negative?” But the right label depends on context: for a medical startup, a comment like “Everything went well, but I expected more” might count as negative, whereas in other industries it would be neutral. How do you align a model with real-world needs? Not with more data or a larger model, but with engineered prompts, tailored examples, and, more and more often, multi-step pipelines that guide generation, evaluate it, correct it, and adapt it to the context. A concrete example: the prompt chain.
Instead of asking for everything in a single instruction, the task is broken down into stages: first, extract the key points; then, create an outline; finally, write the final answer. This approach, used by companies like Workera, makes it possible to pinpoint where the system is actually going wrong: Is the outline weak? Is the final answer too impersonal? You can then fix that one stage. In business, this level of granularity makes the difference between a demo and a reliable solution.

An interesting fact: in a study of BCG consultants, those who had access to AI and also received brief training on prompts significantly outperformed both those who did not use AI and those who used it “blindly.” The research also identified two styles of collaboration with LLMs. The “centaurs” delegate entire blocks of work, as in “You handle the presentation, and let me know when you're done.” The “cyborgs” work in symbiosis, micro-interacting with the model at every step. Both methods work, but they require different workflows, and that difference matters when scaling up within a company.

RAG, or Retrieval-Augmented Generation, is the most practical solution to the problem of keeping information up to date and accurate. Instead of expecting the model to “know everything,” the system connects it to external databases, retrieves only the relevant documents, and incorporates them into the response. This may seem like a makeshift solution, but consider the following: even if, in the future, models could read the entire web in real time (spoiler: this will never happen, because of latency and cost), we would still need retrieval systems, such as search engines, for reasons of efficiency and source traceability. Take the example of the content generated shortly after Trump's famous “Covfefe” gaffe: Twitter's language models had no idea how to handle it, and the recommendation system went haywire.
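To make the retrieval idea concrete, here is a minimal sketch of the "retrieve, then augment the prompt" step. It is not the lecture's implementation: the documents in DOCS, the function names, and the word-overlap scoring are all illustrative assumptions (real systems typically use embeddings and a vector index), and the call to the model itself is left out.

```python
import re
from collections import Counter

# Hypothetical knowledge base; in practice this would be an external store.
DOCS = [
    "'Covfefe' is a typo from a 2017 tweet that became an internet meme.",
    "Refunds are accepted within 30 days of purchase.",
    "Standard shipping takes five business days.",
]

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query: str, doc: str) -> int:
    """Count the tokens shared by the query and the document (toy relevance score)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return up to k documents with a non-zero score, best match first."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) > 0]

def build_prompt(query: str, passages: list[str]) -> str:
    """Put the retrieved passages in the prompt so the answer can cite them."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

question = "What does covfefe mean?"
prompt = build_prompt(question, retrieve(question, DOCS))
print(prompt)
```

Adding a new term to the knowledge base is enough for the system to handle it; the model itself is not retrained, and the numbered sources give the traceability mentioned above.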
The same thing happens every day with slang, neologisms, and trends that appeared after the model was trained. RAG allows us to keep up without having to retrain everything from scratch.

Now let's move on to agents. Think of a customer support agent. It's no longer just a chatbot providing answers: it extracts data, consults the order database, checks policies, updates information, and drafts emails, all while orchestrating tools and memory. But how do you know if it's “working”? This is where “evals,” or evaluations, come into play. They combine objective metrics (percentage of requests resolved, response time, accuracy of the output) with subjective evaluations from LLM judges or human feedback. Crucially, every intermediate step is tracked: if the response is rude, you can trace it back to the prompt or subsystem that caused the problem.

This modular, traceable architecture is the real difference between traditional, deterministic software and fuzzy LLM-based systems. Here, it's not enough to write solid code once; you have to experiment, discard parts, iterate, and put human workflows in place to correct the areas where the AI makes mistakes or goes off track.

At the enterprise level, McKinsey has estimated that agentic automation can reduce the time required for processes such as credit risk assessment by 20–60%. But the real challenge is not technical: it is getting thousands of people to change their habits, rewriting job descriptions, and redefining incentives. That's why, even as the technology advances rapidly, it will take organizations years to truly transform.

One final thought: today, the real value no longer lies in building the biggest model, but in knowing how to stitch together models, tools, workflows, and memory to solve real, measurable problems in a way that can be improved over time. What's the difference between a demo and a product? Being on the side of the system that orchestrates, not just the model that generates.
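The prompt chain and the step-by-step tracking described in this summary can be sketched together. This is a minimal illustration under stated assumptions, not the pipeline used by any company named here: the LLM type (a function from prompt to completion), the prompt templates, and fake_llm (a stand-in so the example runs without a model) are all hypothetical.

```python
from typing import Callable

# Assumed interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]

# The three stages from the text; the prompt wording is illustrative.
STEPS = [
    ("key_points", "List the key points of the following text:\n{input}"),
    ("outline", "Turn these key points into an outline:\n{input}"),
    ("final_answer", "Write the final answer from this outline:\n{input}"),
]

def run_chain(text: str, llm: LLM) -> tuple[str, list[dict]]:
    """Run each stage on the previous stage's output and record a trace."""
    trace = []
    current = text
    for name, template in STEPS:
        prompt = template.format(input=current)
        output = llm(prompt)
        # Keep the prompt and output of every stage so a bad final answer
        # can be traced back to the stage that produced it.
        trace.append({"step": name, "prompt": prompt, "output": output})
        current = output
    return current, trace

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes the instruction line in brackets."""
    return f"<{prompt.splitlines()[0]}>"

answer, trace = run_chain("Everything went well, but I expected more.", fake_llm)
for entry in trace:
    print(entry["step"], "->", entry["output"])
```

Because each stage's prompt and output are stored separately, a weak outline or an impersonal final answer can be inspected and fixed in isolation, and the same trace can feed evals that score individual steps.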
This lecture comes from Stanford's CS230 course, Fall 2025: you've just saved yourself nearly two hours of class time.