
Cautions when using AI for coding
Generative AI is a hot topic these days, and people are finding new ways to leverage large language model (LLM) systems to streamline processes. One way that people have put AI to work is in writing code, such as with AI “assistants” like GitHub’s Copilot.
While AI co-authors can help remove some of the boring parts of writing code (how many times do we need to write another implementation of “read a data file into memory”?), I think developers need to keep in mind the limitations of using AI in this way.
The biggest drawback of using AI for coding is that AI was trained on other people’s work, and the “generative” nature of LLMs means that the AI will sometimes echo or repeat that training data. At a high level, this presents two major risks, which can affect developers working on either open source or proprietary projects:
| Open source | Proprietary |
| --- | --- |
| AI includes incompatibly-licensed open source code | AI inserts copyleft-licensed code into a proprietary codebase |
| AI copies proprietary code into your open source project | AI echoes other proprietary code into your closed project |
Let’s look at the two major concerns and how they affect both open source and proprietary software projects:
1. AI inserts open source code
Copilot and other AI coding “assistants” lure developers with the promise that AI can automate the development cycle. For example, GitHub’s Copilot features page says developers can “Ask GitHub Copilot a question, get the right answer for you, and accept the code with a single click” and that “GitHub Copilot generates what you need—so you can build faster.”
However, AI always starts with training data, and that training data had to come from somewhere. GitHub’s Copilot had a jump-start in this area because GitHub hosts so many repositories, both open source (free accounts) and proprietary (using GitHub Enterprise), which provided Copilot with a wealth of training data. Drawing from this broad set, Copilot can find inventive solutions to coding challenges.
Unfortunately, this training, combined with the “generative” nature of AI, means that Copilot can also insert copies of code from other projects into your own. One memorable example was shared by Armin Ronacher in 2021, showing how GitHub’s Copilot “autocompletes” the fast inverse square root implementation from Quake III. Id Software released the Quake III Arena source code in 2005, so it was likely included in Copilot’s training data. Copilot inserted the code into Ronacher’s coding session, adding a copy of the BSD 2-clause License, also called the “Simplified BSD License” or the “FreeBSD License.” However, as Stefan Karpinski noted in a followup comment on X, Id Software actually released Quake III under the GNU General Public License, version 2. Karpinski also highlighted that Copilot’s inserted comment attributed the wrong person as the copyright holder.
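For context, here is roughly what that famous snippet computes: a fast approximation of 1/√x using a bit-level trick plus one step of Newton’s method. The original is C code from the Quake III Arena source; this is a Python transcription for illustration, with a function name of my own choosing.

```python
import struct

def fast_inverse_sqrt(number: float) -> float:
    """Approximate 1/sqrt(number) using the Quake III bit trick."""
    # Reinterpret the 32-bit float's bits as an unsigned integer
    i = struct.unpack("<I", struct.pack("<f", number))[0]
    # The "magic" constant and shift produce a rough first guess
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One iteration of Newton's method refines the estimate
    return y * (1.5 - 0.5 * number * y * y)
```

The point is not the math but the provenance: a distinctive, instantly recognizable function like this is exactly the kind of code an AI assistant can reproduce nearly verbatim from its training data.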
The critical detail is that while the Free Software Foundation lists the FreeBSD License as compatible with the GNU GPL (meaning code released under the FreeBSD License can be included in projects covered by the GNU GPL) the reverse is not true. The FSF notes the FreeBSD License is a “lax, permissive non-copyleft free software license.” Anyone can use source code licensed under the FreeBSD License, with obligations amounting to little more than preserving the copyright notice, including in proprietary or “closed source” projects. In contrast, the GNU GPL requires that any program that uses code released under the GNU General Public License must also be released under the GPL. This also requires that the source code be made available; the GNU GPL version 2 says:
- You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:
  - a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
  - b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
  - c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
The downstream effects of Copilot inserting open source code can be enormous.
For open source projects, maintainers need to understand the origin of every source code contribution. That has usually meant code contributed by other developers, but in the era of AI code assistants, maintainers also need to consider whether code generated by an LLM might have originated from another open source project. And if so, is the license of the inserted code compatible with yours?
That might mean AI inserting code covered by a free software license that is incompatible with the GNU GPL into a project’s codebase that is actually licensed under the GNU GPL. Or it could mean an AI assistant inserting code released under the GNU GPL into a project that is covered by another, incompatible open source license.
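One lightweight mitigation, which I’ll sketch here, is a pre-merge check that flags license-indicative phrases in incoming contributions. This is an illustration of the idea rather than a standard tool (the marker list and script are my own); a real project would pair it with a dedicated license scanner and human review.

```python
import re
import sys
from pathlib import Path

# Phrases that often indicate copied, separately licensed code.
# Illustrative only; a real marker list would be much longer.
LICENSE_MARKERS = [
    r"GNU General Public License",
    r"SPDX-License-Identifier",
    r"Redistribution and use in source and binary forms",  # BSD-family licenses
    r"Apache License",
]

PATTERN = re.compile("|".join(LICENSE_MARKERS), re.IGNORECASE)

def scan(path: Path) -> list[tuple[int, str]]:
    """Return (line number, line) pairs containing a license marker."""
    hits = []
    for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
        if PATTERN.search(line):
            hits.append((lineno, line.strip()))
    return hits

if __name__ == "__main__":
    # Usage: python scan_licenses.py file1.c file2.py ...
    for name in sys.argv[1:]:
        for lineno, line in scan(Path(name)):
            print(f"{name}:{lineno}: {line}")
```

A check like this only catches code that carries its license header along with it; it does nothing for unattributed snippets, which is why provenance still has to be part of contributor review.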
For proprietary projects, the problem is made worse by the threat that an AI coding agent might insert code that was originally licensed under the GNU GPL, but without attribution or any warning about the original license. If the issue were eventually uncovered (such as via an audit), the company would need to halt all distribution and sales of the software product until the codebase could be fully investigated and any offending code contributions rewritten from scratch.
2. AI lifts from proprietary source code
The other worst-case scenario is an AI coding assistant inserting code that originated from a proprietary codebase. While I have not seen reports of this happening, I believe it is a matter of “not yet.”
Consider the Copilot example. GitHub trained Copilot on projects hosted at GitHub. And while GitHub claims that Copilot does not “copy/paste” code, Microsoft also admits that Copilot can “generate code suggestions based on patterns and examples it has seen in public code” and that “there is a possibility that suggestions might closely resemble existing public code snippets due to the nature of the training data.”
In the same reply, Microsoft advises that “Users should review and validate the suggestions provided by Copilot to ensure they meet their specific requirements and adhere to intellectual property laws” and “For enterprises, it’s important to consider regulatory compliance and internal policies regarding the use of AI-powered tools like Copilot.” This shifts the onus to developers to ensure that the source code generated by an AI coding assistant does not violate someone else’s intellectual property.
For proprietary projects, an AI coding agent inadvertently inserting another organization’s proprietary code may not present an immediate risk. Even if this were to happen, the chance of discovery is much lower due to the “closed source” nature of proprietary software development. However, with cyberattacks on the rise, including ransomware attackers publicly posting proprietary data and source code, the threat of another organization discovering code that an AI agent copied from its proprietary codebase remains.
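How would another organization discover such copying? Code-similarity tools compare overlapping token windows (“shingles”) between codebases, the same basic idea behind plagiarism detectors like MOSS. Here is a toy sketch of that idea, with names and parameters of my own choosing:

```python
import hashlib

def fingerprints(code: str, k: int = 8) -> set[str]:
    """Hash every overlapping window of k tokens in the code."""
    tokens = code.split()
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(max(0, len(tokens) - k + 1))
    }

def similarity(code_a: str, code_b: str) -> float:
    """Jaccard similarity of fingerprint sets; values near 1.0 mean heavy overlap."""
    fa, fb = fingerprints(code_a), fingerprints(code_b)
    if not (fa or fb):
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Once proprietary source leaks, a comparison like this can surface copied passages long after the fact.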
For open source projects, there is a nonzero risk that AI-generated code merged into the project might have been copied from proprietary code. This risk might be very small, especially given Microsoft’s claims that Copilot’s suggestions draw on public code, but the risk is still there.
This concern is unfortunately one-sided for open source projects, because open source necessarily lives in the open, where anyone can study how a program works. That includes review by companies who might discover re-use of their proprietary code, even when it was unintentionally and unknowingly inserted by an AI co-author. If left unresolved, the project’s developers might become the target of a costly lawsuit. Open source developers want to write software, not be the next SCO v. Linux.