Unlocking a Faster Development Loop: Q&A on Agent-Driven Development in Copilot Applied Science

Software engineers often automate repetitive tasks so they can focus on creative work. AI researcher John Doe (referred to below as the author) took this a step further by automating intellectual toil. Using GitHub Copilot, he built eval-agents to analyze agent trajectories, a common but time-consuming task. This Q&A explores how the project transformed his workflow and enabled his team to do the same.

1. What prompted the creation of eval-agents?

Analyzing coding agent performance against benchmarks like TerminalBench2 or SWEBench-Pro involved reviewing hundreds of thousands of lines of JSON in trajectory files each day. Each benchmark task produces a trajectory showing how an agent attempted it. Reading these by hand was impractical at that volume, so the author turned to GitHub Copilot to surface patterns. The analysis itself was repetitive, though: use Copilot to find patterns, investigate them, and repeat. That led to the idea of automating the intellectual work as well. eval-agents was built to handle the repetitive loop and free up time for deeper analysis.

Source: github.blog

2. How does the author rely on GitHub Copilot in daily work?

GitHub Copilot serves as a pattern detection tool. When analyzing new benchmark runs, the author uses Copilot to sift through trajectories and identify recurring behaviors or anomalies. This reduces the data from hundreds of thousands of lines to a few hundred relevant ones. The collaboration with Copilot is iterative: surface patterns, investigate them, then refine the queries. The loop was effective but repetitive, which led to the desire to automate it entirely with agents. The experience showed how AI can augment human analysis without replacing critical thinking.
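The reduction step described above can be illustrated with a small sketch. It uses plain regular expressions as a stand-in for the pattern detection Copilot performs, so it shows the shape of the loop rather than how Copilot works. The function name `surface_patterns`, the log lines, and the pattern names are all hypothetical and do not come from the original post.

```python
import re
from collections import defaultdict


def surface_patterns(lines, patterns):
    """Return {pattern_name: [(line_number, line), ...]} for lines matching each regex."""
    compiled = {name: re.compile(rx, re.IGNORECASE) for name, rx in patterns.items()}
    hits = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for name, rx in compiled.items():
            if rx.search(line):
                hits[name].append((number, line.strip()))
    return dict(hits)


# Hypothetical trajectory lines; a real run would read them from files.
LOG_LINES = [
    '{"action": "shell", "output": "ok"}',
    '{"action": "shell", "output": "Permission denied"}',
    '{"action": "edit", "output": "ok"}',
    '{"action": "shell", "output": "command timed out after 60s"}',
    '{"action": "shell", "output": "permission denied: /etc/hosts"}',
]

# Hypothetical patterns; refining this dictionary is the "refine queries" step.
PATTERNS = {
    "permission_error": r"permission denied",
    "timeout": r"timed out",
}

hits = surface_patterns(LOG_LINES, PATTERNS)
for name, matches in hits.items():
    print(name, len(matches), [number for number, _ in matches])
```

Each pass narrows the input to the lines worth reading, and the line numbers point back into the original file for the investigation step.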

3. What are trajectories and why are they important?

Trajectories are detailed logs—often JSON files with hundreds of lines—that record an agent's thought process and actions while completing a benchmark task. They expose how the agent plans, executes, and corrects itself. Analyzing these trajectories is crucial for measuring performance, debugging failures, and improving agent designs. However, with dozens of tasks per benchmark and multiple runs daily, the sheer volume becomes overwhelming. Trajectories are the raw data that eval-agents now processes automatically, extracting insights without manual reading.
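To make the idea of a trajectory concrete, here is a minimal sketch of what one might look like and how it could be condensed. The article does not give the actual schema, so the field names (`task_id`, `steps`, `thought`, `action`, `exit_code`) and the helper `summarize_trajectory` are assumptions made purely for illustration.

```python
import json
from collections import Counter

# Hypothetical trajectory; the real benchmark schema is not given in the article.
SAMPLE_TRAJECTORY = json.dumps({
    "task_id": "example-task",
    "steps": [
        {"thought": "List the files first.", "action": "shell", "exit_code": 0},
        {"thought": "Run the tests.", "action": "shell", "exit_code": 1},
        {"thought": "Fix the failing import.", "action": "edit", "exit_code": 0},
        {"thought": "Run the tests again.", "action": "shell", "exit_code": 0},
    ],
})


def summarize_trajectory(raw: str) -> dict:
    """Condense one trajectory into a few fields a reviewer can scan quickly."""
    trajectory = json.loads(raw)
    steps = trajectory.get("steps", [])
    return {
        "task_id": trajectory.get("task_id"),
        "step_count": len(steps),
        # How many times each kind of action was taken.
        "actions": dict(Counter(step.get("action") for step in steps)),
        # Zero-based indexes of steps whose command exited with a non-zero code.
        "failed_steps": [i for i, step in enumerate(steps) if step.get("exit_code", 0) != 0],
    }


summary = summarize_trajectory(SAMPLE_TRAJECTORY)
print(summary)
```

A summary like this is the kind of per-task digest that replaces reading the full JSON file by hand.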

4. What are the main goals of the eval-agents project?

Easy to share and use – The project should be accessible to all team members, allowing anyone to run analyses without deep setup.
Easy to author new agents – Creating new analytical agents should be straightforward, encouraging contributions from the whole team.
Make coding agents the primary vehicle for contributions – Instead of manual scripts, agents encapsulate analysis logic, making them reusable and shareable.

These goals align with GitHub's collaborative ethos and the author's experience as an open-source maintainer on GitHub CLI. The result is a tool that empowers the team to build solutions that fit their specific needs.
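One way to picture "easy to author new agents" is a registry where adding an analysis means writing one small function. The sketch below is an assumption about the general shape, not a description of how eval-agents is built: the `agent` decorator, the `AGENTS` registry, and both example agents are hypothetical, and plain functions stand in for what would presumably be Copilot-backed agents.

```python
from typing import Callable, Dict, List

# Hypothetical registry; the article does not describe how eval-agents registers agents.
AnalysisAgent = Callable[[List[dict]], str]
AGENTS: Dict[str, AnalysisAgent] = {}


def agent(name: str) -> Callable[[AnalysisAgent], AnalysisAgent]:
    """Decorator that registers an analysis function under a shareable name."""
    def register(func: AnalysisAgent) -> AnalysisAgent:
        AGENTS[name] = func
        return func
    return register


@agent("step-count")
def step_count(steps: List[dict]) -> str:
    return f"steps: {len(steps)}"


@agent("failures")
def failures(steps: List[dict]) -> str:
    failed = [step for step in steps if step.get("exit_code", 0) != 0]
    return f"failed steps: {len(failed)}"


def run_all(steps: List[dict]) -> Dict[str, str]:
    """Run every registered agent over the same trajectory steps."""
    return {name: fn(steps) for name, fn in AGENTS.items()}


report = run_all([{"exit_code": 0}, {"exit_code": 1}])
print(report)
```

With this shape, a teammate contributes a new analysis by adding one decorated function, and everyone who calls `run_all` picks it up without extra setup.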

5. How does this automation change the author's role?

The author moved from manually analyzing trajectories to maintaining and improving eval-agents. Having automated his own intellectual toil, he now owns the system that lets his peers do the same. This shift mirrors a common pattern among engineers: building tools to remove toil brings new responsibilities. The payoff is substantial, though. The team gains a collaborative platform where agents are the building blocks and everyone can contribute to agent-driven analysis. The author remains an AI researcher but now spends more time designing agents than reading JSON files.

6. What lessons were learned about collaborating with GitHub Copilot?

Three takeaways stand out. Using AI to surface patterns in large datasets accelerates understanding. Building agents around repetitive tasks creates a force multiplier for the entire team. Sharing these agents via GitHub's ecosystem fosters a culture of reuse and iteration. The author emphasizes that the most effective development loop combines human creativity with AI assistance: Copilot handled the grunt work while the author focused on strategy. The result is a much faster development loop, with new analytical agents prototyped and deployed quickly.
