Type: Conference Paper
Title: ProofLang: the Language of arXiv Proofs
Authors: Hammer, Henry; Noda, Nanako; Stone, Christopher A

Abstract: The ProofLang Corpus includes 3.7M proofs (558 million words) mechanically extracted from papers that were posted on arXiv.org between 1992 and 2020. The focus of this corpus is proofs, rather than the explanatory text that surrounds them, and more specifically on the language used in such proofs. Specific mathematical content is filtered out, resulting in sentences such as "Let MATH be the restriction of MATH to MATH." This dataset reflects how people prefer to write (informal) proofs, and is also amenable to statistical analyses and experiments with Natural Language Processing (NLP) techniques.

Publisher: Springer, Cham
Date: 2023-08-08
Record ID: 10499930
Conference: 16th Conference on Intelligent Computer Mathematics (CICM 2022)
ISBN: 978-3-031-42753-4
DOI: https://doi.org/10.1007/978-3-031-42753-4_19
Award number: 1950885
Location: Cambridge, UK
Sponsor: National Science Foundation