Welcome to my homepage
My name is Md. Mahadi Hasan Sibat. I am a second-year PhD student in the Department of Computer Science at the University of Central Florida (UCF), advised by Dr. Shubhra Kanti Karmaker (Santu). I am a member of the Bridge-AI Lab at UCF.
My research focuses on the evaluation and trustworthiness of large language models (LLMs). In particular, I am interested in building rigorous benchmarks that expose failure modes in LLMs on real-world tasks, and in developing methods to detect and mitigate benchmark contamination, a pervasive problem that causes reported model performance to be systematically overestimated.
Before joining UCF, I completed my MS in Computer Science (Software Engineering) at Auburn University, where I also worked as a Graduate Research and Teaching Assistant. Prior to academia, I worked as a Senior Software Engineer at Reve Systems in Bangladesh for over three years.
You can download my full CV here.
Research Interests
- LLM Evaluation & Benchmarking
- Benchmark Contamination Detection and Mitigation
- IT Automation and Infrastructure as Code (Ansible, IaC)
- Code Generation for Real-World Tasks
- Natural Language Processing and its Applications
- Conversational Data Science
News and Announcements
- [April 2026] Our paper "Large Language Models for IT Automation Tasks: Are We There Yet?" has been accepted to ACL 2026 Findings. We present ITABench, a benchmark of 126 real-world Ansible tasks; the best model achieves only 23.9% pass@10.
- [April 2026] Our paper "The Path Not Taken: Duality in Reasoning about Program Execution" has been accepted to the ACL 2026 Main Conference (co-authored with Eshgin Hasanov, Santu Karmaker, and Aashish Yadavally).
- [Aug 2024] Joined the University of Central Florida as a PhD student, advised by Dr. Santu Karmaker.
- [2024] Paper accepted at FSE 2024: State Reconciliation Defects in Infrastructure as Code.
- [2024] Paper accepted at TMLR 2024: Introducing "Forecast Utterance" for Conversational Data Science.
Publications
- ACL 2026 Findings · 2026 · Large Language Models for IT Automation Tasks: Are We There Yet?
- ACL 2026 Main · 2026 · The Path Not Taken: Duality in Reasoning about Program Execution
- FSE 2024 · 2024 · State Reconciliation Defects in Infrastructure as Code
- TMLR 2024 · 2024 · Introducing "Forecast Utterance" for Conversational Data Science
- ACM CSUR 2022 · 2022
- Under Review · ARR 2026 · FinTradeBench: A Financial Reasoning Benchmark for LLMs
Experience & Education
- Fall 2024 – Present · PhD in Computer Science
- Aug 2021 – Aug 2024 · MS in Computer Science (Software Engineering)
- Sep 2017 – Dec 2020 · Senior Software Engineer
- 2012 – 2017 · B.Sc. in Computer Science & Engineering
Academic Service
- 2023 · Track Committee Member: ACL 2023
- 2022 · Track Committee Member: EMNLP 2022
- 2023, 2025, 2026 · Reviewer: ACL Rolling Review (ARR)
- 2024, 2025 · Reviewer: Transactions on Machine Learning Research (TMLR)
- 2023 · Reviewer: EMNLP 2023, Workshop BLP