Larry, Yinxi Li

yinxi.li[at]uwaterloo.ca | SWAG Lab DC 2555


Amor fati, carpe diem

🦄 About Me | Currently Master’s Student in CS

Hey there, I’m Larry! Thanks for visiting my humble page. I’m currently a first-year Computer Science M.Math. student at the University of Waterloo, advised by Pengyu Nie. Before that, I received my B.Sc. in Computer Science (ELITE Stream) from The Chinese University of Hong Kong, where my final year project was advised by Eric Lo. I also spent a semester as an exchange student at ETH Zürich during my undergraduate studies, where I explored advanced topics in Machine Learning and Software Engineering.

My research is supported by grants from Prof. Nie’s research group, funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Waterloo.

💻 Research Interests

  • Understanding and Improving the Internal Mechanisms of LLMs and DLMs: tokenization, reasoning, representation learning
  • LLM Applications in Software Engineering, Math, and Scientific Discovery

I am open to academic and research collaborations and welcome any questions or discussions.

🎉 News

Oct 17, 2025 📌 New preprint: TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar is now available on arXiv! 1️⃣ LLMs’ subword tokenizers don’t align well with programming language grammar: tiny whitespace or renaming tweaks -> different tokenization -> flipped outputs (see the small sketch below the news list). 2️⃣ Our framework TokDrift systematically tests 9 code LLMs on 3 tasks, showing their sensitivity to tokenization changes: up to 60% of outputs change under a single semantic-preserving rewrite. 3️⃣ If your win margin is ~1 pp, beware: spacing & naming can swing results.
Feb 15, 2025 Homepage Acknowledgements
Feb 15, 2025 Hello, my personal website! Let’s call today its birthday 😍. Glad to see you here, though the site is still under construction. Hopefully it will be finished soon.
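
To make the Oct 17 item concrete, here is a minimal sketch (not the TokDrift framework itself). It assumes the Hugging Face transformers package and uses the GPT-2 BPE tokenizer as a stand-in for a code LLM’s tokenizer, showing how a whitespace-only, semantic-preserving rewrite yields a different token sequence.

```python
# Minimal illustration (not the TokDrift framework itself): a
# semantic-preserving whitespace tweak changes the subword tokenization.
# Assumes the Hugging Face `transformers` package; GPT-2's BPE tokenizer
# stands in for a code LLM's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Two snippets that any parser treats as the same program,
# differing only in spacing.
dense = "x=1+2"
spaced = "x = 1 + 2"

print(tokenizer.tokenize(dense))   # e.g. ['x', '=', '1', '+', '2']
print(tokenizer.tokenize(spaced))  # e.g. ['x', 'Ġ=', 'Ġ1', 'Ġ+', 'Ġ2']
```

The two snippets are identical to a grammar-aware parser, but the model conditions on different token sequences, which is exactly the mismatch TokDrift measures.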

💡 Selected Publications / Preprints

  1. TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
    Yinxi Li, Yuntian Deng, and Pengyu Nie
    arXiv preprint arXiv:2510.14972, 2025