Hi! I am a second-year PhD student at the Language Technologies Institute (LTI), Carnegie Mellon University (CMU), advised by Prof. Chenyan Xiong. My primary research interest is exploring novel ways to efficiently train and apply large language models in sample-limited and computation-limited scenarios. I am currently working on data valuation, ideally derived from model preferences, for a more efficient pre-training procedure.
Previously, I graduated from Tsinghua University in 2023 with a major in Computer Science and Technology. I was honored to be a member of THUNLP, advised by Prof. Zhiyuan Liu, working closely with Dr. Tianyu Gao and Dr. Zhengyan Zhang on prompting and few-shot learning. I was also a research intern at UWNLP, advised by Prof. Sheng Wang, and an intern at the Baidu NLP group.
When I am not doing research, I like to work out, play guitar, and watch movies.
Updates:
- June 2024: Check out our pretraining data curation paper: MATESπ§βπ€βπ§: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models at NeurIPS 2024 (poster) β¨
- December 2023: Check out our benchmarking LLMs paper: An In-depth Look at Gemini's Language Abilities β¨
- August 2023: Began my PhD at CMU πͺ
- May 2023: Check out our generic retrieval augmentation paper: Augmentation-Adapted Retriever Improves Generalization of Language Models as Generic Plug-In at ACL 2023 (oral presentation) β¨
- August 2022: Check out our automatic prompting paper: Automatic Label Sequence Generation for Prompting Sequence-to-sequence Models at COLING 2022 (oral presentation) β¨