Research Vision
My research sits at the intersection of language model evaluation, software engineering, data contamination, and real-world LM applications. A common thread runs through all of my work: standard benchmarks and evaluation practices often fail to tell us what we actually want to know, namely whether a model can reliably perform a task that matters in practice.
I build benchmarks grounded in real practitioner needs (not synthetic proxies), study how contamination silently inflates benchmark scores, and design evaluation protocols that are robust to that contamination. In parallel, I apply these ideas to domains where trustworthy LM behavior matters most: infrastructure automation, financial reasoning, and data science assistance.
Current Research Projects
IT automation tools like Ansible are fundamentally different from general-purpose code generation targets: tasks are stateful, execution happens against live system state, and correctness requires understanding state reconciliation, not just syntactic validity. Existing benchmarks rely on synthetic tasks that miss this entirely. This project builds and evaluates rigorous, execution-based benchmarks that reflect the real needs of practitioners. Our current benchmark, ITABench, covers 126 diverse Ansible automation tasks evaluated in controlled Docker environments. We also study how LLMs reason (and fail to reason) about state changes, module-specific behaviors, and idempotency, failure modes that do not surface in standard code benchmarks; a sketch of the kind of execution-based idempotency check this evaluation requires appears after the publications below.
- ACL 2026 Findings: "Large Language Models for IT Automation Tasks: Are We There Yet?" — Md. Mahadi Hasan Sibat, John Salvador, Akond Ashfaque Ur Rahman, Santu Karmaker
- FSE 2024: "State Reconciliation Defects in Infrastructure as Code" — Md. Mahadi Hasan Sibat, John Salvador, Santu Karmaker, Akond Ashfaque Ur Rahman
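To make the idempotency requirement concrete, here is a minimal sketch of an execution-based check in the spirit of ITABench's Docker evaluation. It is an illustration, not the benchmark's actual harness: the inline one-host inventory, the use of Ansible's docker connection plugin, and the recap-parsing heuristic (which assumes a single target host) are all assumptions of the sketch.

```python
import subprocess

def run_playbook(container: str, playbook: str) -> str:
    # Illustrative only: targets a running Docker container through
    # Ansible's docker connection plugin via an inline one-host inventory.
    result = subprocess.run(
        ["ansible-playbook", "-i", f"{container},", "-c", "docker", playbook],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def check_idempotency(container: str, playbook: str) -> bool:
    """Apply the playbook twice; an idempotent playbook changes nothing
    on the second run because system state already matches the spec."""
    run_playbook(container, playbook)           # first run converges state
    second = run_playbook(container, playbook)  # second run should be a no-op
    # With a single host, Ansible's PLAY RECAP reports changed=0
    # when no task modified system state on that run.
    return "changed=0" in second
```

This is exactly the failure mode a static, generation-only benchmark cannot see: a playbook can be syntactically perfect and still report changes on every run.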
A growing body of evidence suggests that LLMs have been exposed to popular benchmark datasets during pre-training, causing reported performance to be significantly overestimated. This phenomenon, benchmark contamination, is one of the most underappreciated reliability threats in modern LM evaluation. This project develops principled methods to detect whether a model has memorized evaluation data and to mitigate contamination effects through private evaluation protocols. Our goal is to make LM evaluation trustworthy even when model training data is opaque. This work directly extends the insights from ITABench, where our use of controlled Docker execution served as one form of contamination mitigation; a toy detection signal is sketched after the publications below.
- Under Review: "Benchmark Contamination Detection in Language Models" (2026)
- Under Review: "Contamination Mitigation via Private Evaluation Protocols" (2026)
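As a flavor of what detection can look like, here is a minimal sketch of one signal common in the membership-inference literature, not necessarily our method: a model that has memorized a benchmark item tends to assign it markedly higher likelihood than a meaning-preserving paraphrase of the same item. The model choice and the paraphrase-gap heuristic are assumptions of the sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works for the sketch; gpt2 is just small.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss is mean token cross-entropy
    return -out.loss.item()

def contamination_signal(benchmark_item: str, paraphrase: str) -> float:
    """Likelihood gap between the verbatim item and its paraphrase;
    a large positive gap is evidence of memorization, not proof."""
    return mean_log_likelihood(benchmark_item) - mean_log_likelihood(paraphrase)
```

In practice such a gap would be calibrated over many items and controls, since any single comparison is noisy.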
Data science workflows are complex, iterative, and deeply conversational: analysts clarify requirements, explore hypotheses, and refine analyses through dialogue. This project investigates how LLMs can serve as reliable partners in that process, acting as personal data science assistants capable of understanding analytical intent, generating and executing code, and explaining results in natural language. A key contribution is the Forecast Utterance framework, which formalizes how users express predictive queries in natural language, enabling more precise interpretation of analytical requests. We also study where LLM-based DS agents fail, and how to evaluate them rigorously; a toy illustration of resolving an utterance into a task specification follows the publications below.
- TMLR 2024: "Introducing Forecast Utterance for Conversational Data Science" — Md. Mahadi Hasan Sibat, R. Alexander Knipper, Shubhra Kanti Karmaker Santu
- ACM CSUR 2022: "AutoML to Date and Beyond: Challenges and Opportunities" — Shubhra Kanti Karmaker Santu, Md. Mahadi Hasan Sibat, et al.
- In Progress: "LLM as Personal Data Scientist: Evaluation and Failure Analysis" — manuscript in preparation (2026)
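To illustrate what formalizing a forecast utterance might mean in code, here is a hypothetical slot structure an assistant could resolve a user's predictive query into. The slot names and the example resolution are illustrative assumptions, not the schema from the TMLR paper.

```python
from dataclasses import dataclass, field

@dataclass
class ForecastTaskSpec:
    """Structured prediction task distilled from a user's forecast
    utterance. Slot names here are illustrative, not the paper's schema."""
    target_column: str    # the quantity the user wants predicted
    horizon: int          # how far ahead, in units of time_grain
    time_grain: str       # e.g. "day", "week", "month"
    filters: dict = field(default_factory=dict)  # any subpopulation scoping

# "How many orders should we expect per week over the next month?"
# might resolve to:
spec = ForecastTaskSpec(target_column="orders", horizon=4, time_grain="week")
```

Slots that cannot be filled from the utterance are exactly where a conversational assistant should ask a clarifying question rather than guess.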