LLM-Assisted Code Cleaning For Training Accurate Code Generators.

Jain, Naman; Zhang, Tianjun; Chiang, Wei-Lin; Gonzalez, Joseph E; Sen, Koushik; Stoica, Ion

Big data, the “new oil” of the modern data science era, has attracted much attention in the GIScience community. However, we have ignored the role of code in enabling the big data revolution in this modern gold rush. Instead, what attention code has received has focused on computational efficiency and scalability issues. In contrast, we have missed the opportunities that the more transformative aspects of code afford as ways to organize our science. These “big code” practices hold the potential for addressing some ill effects of big data that have been rightly criticized, such as algorithmic bias, lack of representation, gatekeeping, and issues of power imbalances in our communities. In this article, I consider areas where lessons from the open source community can help us evolve a more inclusive, generative, and expansive GIScience. These concern best practices for codes of conduct, data pipelines and reproducibility, refactoring our attribution and reward systems, and a reinvention of our pedagogy.

More Like this