Research Vision

My research sits at the intersection of language model evaluation, software engineering, data contamination, and real-world LM applications. A common thread runs through all of my work: standard benchmarks and standard evaluation practices frequently fail to tell us what we actually want to know — whether a model can reliably perform a task that matters in practice.

I build benchmarks grounded in real practitioner needs (not synthetic proxies), study how contamination silently inflates benchmark scores, and design evaluation protocols that are robust to that contamination. In parallel, I apply these ideas to domains where trustworthy LM behavior matters most: infrastructure automation, financial reasoning, and data science assistance.

LM Evaluation · Software Engineering · Data Contamination · Real-World LM Applications

Current Research Projects

1. IT Automation with Language Models (Active)

IT automation tools like Ansible are fundamentally different from general-purpose code generation targets: tasks are stateful, execution happens against live system state, and correctness requires understanding state reconciliation — not just syntactic validity. Existing benchmarks rely on synthetic tasks that miss this entirely. This project builds and evaluates rigorous, execution-based benchmarks that reflect the real needs of practitioners. Our current benchmark, ITABench, covers 126 diverse Ansible automation tasks evaluated in controlled Docker environments. We also study how LLMs reason (and fail to reason) about state changes, module-specific behaviors, and idempotency — failure modes that do not surface in standard code benchmarks.
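Idempotency (re-applying a task to an already-converged system must produce no further change) is exactly the kind of state-reconciliation behavior that purely syntactic benchmarks never exercise. A minimal Python sketch of the property, using illustrative names that are not part of ITABench:

```python
def ensure_line(lines: list, line: str):
    """Toy 'module': reconcile desired state by ensuring `line` is present.
    Returns (new_state, changed), mirroring how configuration-management
    modules report whether they had to act."""
    if line in lines:
        return lines, False          # already converged: nothing to do
    return lines + [line], True      # reconcile: apply the change

# Applying the same task twice: only the first run should report a change.
state, changed_first = ensure_line([], "PermitRootLogin no")
state, changed_second = ensure_line(state, "PermitRootLogin no")
```

An execution-based harness can enforce this property by running a generated playbook twice against the same environment and requiring that the second run reports zero changed tasks.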

2. Private Evaluation Protocol: Benchmark Contamination Detection and Mitigation (In Progress)

A growing body of evidence suggests that LLMs have been exposed to popular benchmark datasets during pre-training, causing reported performance to be significantly overestimated. This phenomenon — benchmark contamination — is one of the most underappreciated reliability threats in modern LM evaluation. This project develops principled methods to detect whether a model has memorized evaluation data and to mitigate contamination effects through private evaluation protocols. Our goal is to make LM evaluation trustworthy even when model training data is opaque. This work directly extends the insights from ITABench, where our use of controlled Docker execution served as one form of contamination mitigation.
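A common baseline for detecting contamination is verbatim n-gram overlap between a benchmark item and candidate training text. The sketch below is a generic illustration of that baseline, not the private evaluation protocol this project develops:

```python
def ngrams(text: str, n: int = 8):
    """All whitespace-tokenized n-grams of `text`, as a set of strings."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_score(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in the document.
    A high score suggests the item may have leaked into training data."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

Overlap checks like this catch only verbatim leakage; paraphrased or translated contamination requires the stronger, memorization-based probes this project targets.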

Papers
  • Benchmark Contamination Detection in Language Models (under review, 2026)
  • Contamination Mitigation via Private Evaluation Protocols (under review, 2026)
3. Conversational Data Science (Active)

Data science workflows are complex, iterative, and deeply conversational — analysts clarify requirements, explore hypotheses, and refine analyses through dialogue. This project investigates how LLMs can serve as reliable partners in that process, acting as personal data science assistants capable of understanding analytical intent, generating and executing code, and explaining results in natural language. A key contribution is the Forecast Utterance framework, which formalizes how users express predictive queries in natural language — enabling more precise interpretation of analytical requests. We also study where LLM-based DS agents fail, and how to evaluate them rigorously.
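As a rough illustration of what formalizing a predictive query can look like, the toy parser below maps an utterance to a structured forecast intent; the names and pattern are hypothetical and are not the actual Forecast Utterance framework:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForecastIntent:
    target: str    # the quantity the user wants predicted
    horizon: int   # how many future periods to forecast
    unit: str      # time granularity of the horizon

def parse_utterance(utterance: str) -> Optional[ForecastIntent]:
    """Toy extraction of a structured forecast intent from one phrasing."""
    m = re.search(r"predict (\w+) for the next (\d+) (\w+)", utterance.lower())
    if not m:
        return None
    return ForecastIntent(target=m.group(1),
                          horizon=int(m.group(2)),
                          unit=m.group(3).rstrip("s"))
```

A single regex obviously cannot cover how people actually phrase predictive requests; the point of a formal utterance framework is precisely to characterize that variability so interpretation can be evaluated rigorously.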
