Unveiling the Truth: Why GitHub Copilot's Productivity Impact Varies Across Studies

It has been over two years since GitHub released its much-discussed research on Copilot, the AI-powered pair-programming tool, which claimed a 55% improvement in task completion speed. Since then, several studies have been conducted by prominent institutions and organizations such as MIT Sloan, Microsoft, Accenture, Ness-Zinnov, and the University of Alberta. These studies have revealed significant variation in findings, with some reporting productivity gains as low as 5%, far below GitHub's original claim of 55%. Surprisingly, a few studies even suggest that Copilot can negatively impact productivity due to the poor quality of AI-generated code.

These discrepancies raise important questions about the validity of GitHub's high productivity claims. Before dismissing them, however, it is worth examining the subsequent research to understand why the results vary so widely.

Mert Demirer, an assistant professor of economics at MIT Sloan, aptly noted that measuring how AI impacts productivity in real-world workplace environments is a significant challenge. In this context, understanding the factors contributing to the wide range of productivity outcomes in different studies on Copilot becomes essential.

Given that Copilot has been the leading AI pair-programming tool over the past two years, I reviewed 17 empirical studies, papers, and articles (see the References section below) focused on its productivity impact. From this research, I identified six possible factors, plus three additional reasons suggested by ChatGPT, that can help technology leaders, planners, managers, and developers better align their expectations to achieve the desired productivity improvements.

Six factors contributing to productivity variation

Six factors outlined in the studies collectively explain much of the variation in developer productivity when using GitHub Copilot. Here's how each contributes to the observed differences:

1. Developer Experience: Productivity gains varied by developer experience, with less experienced developers benefiting more from Copilot. [Ref #8]

2. Adoption and Usage Patterns: Short-tenured and junior developers were more likely to adopt Copilot and to continue using it for more than one month; they were also more likely to accept the code Copilot generated. [Ref #8]

Takeaways from factors #1 and #2:

  • Organizations and teams with a higher proportion of junior developers can expect significant productivity gains from adopting GitHub Copilot. These teams may also benefit from reduced project delivery risks typically associated with less experienced developers, as Copilot's assistance helps bridge knowledge gaps and enhance code quality.

  • For teams with a greater number of senior developers, expectations should be adjusted to reflect incremental productivity improvements, as experienced developers may already possess much of the expertise Copilot provides.

  • Additionally, the adoption and sustained usage patterns among junior developers suggest that organizations can confidently onboard and deploy them on critical projects while leveraging Copilot to maintain or even exceed delivery expectations and reduce project cost.

3. Language-Specific Performance: The correctness of Copilot's suggestions differs notably across programming languages, ranging from 57% for Java down to 27% for JavaScript. [Ref #11]

Takeaways:

  • Recognize that GitHub Copilot's performance varies by programming language. When planning its adoption, assess the predominant languages used in your organization and set realistic expectations for productivity gains and code quality improvements.

  • Prioritize investment in languages where Copilot demonstrates higher suggestion accuracy (e.g., Java).

4. Existing Codebase Utilization: Engineers witnessed maximum impact when utilizing existing codebase functions, leading to reduced development cycle time. [Ref #9]

Takeaway:

  • Organizations and teams doing maintenance and support work can expect greater productivity benefits than those starting projects from scratch.

5. Sampling Effectiveness: A study by OpenAI found that Codex (the model underlying Copilot's code generation) is 29% effective when using the first sample, 47% effective when using the best of ten samples, and 78% effective when using the best of a hundred samples. In other words, the more code samples you generate, the more likely one of them will pass the test suite. [Ref #10]

Takeaways:

  • Tech leaders should invest in training programs to teach developers how to effectively evaluate and test generated samples, ensuring they choose the most suitable option for their requirements.

  • Developers should generate multiple samples when solving complex problems to increase the likelihood of finding a high-quality solution.
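The best-of-n idea behind these numbers can be sketched in a few lines: run each generated sample against a small test suite and keep the first one that passes every case. The candidate functions below are hypothetical stand-ins for Copilot-generated samples, not actual model output.

```python
# Hypothetical Copilot samples for "return the list sorted ascending":
def sample_a(xs):
    # Buggy sample: accidentally drops the last element.
    return sorted(xs)[:-1]

def sample_b(xs):
    # Correct sample.
    return sorted(xs)

def first_passing(candidates, tests):
    """Return the first candidate that passes every (input, expected) case."""
    for candidate in candidates:
        if all(candidate(inp) == expected for inp, expected in tests):
            return candidate
    return None  # no sample passed the suite

tests = [([3, 1, 2], [1, 2, 3]), ([], [])]
best = first_passing([sample_a, sample_b], tests)  # selects sample_b
```

With only `sample_a` available, `first_passing` returns `None`; adding more samples raises the chance that at least one clears the suite, which is exactly the effect the OpenAI study measured.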

6. Task Simplicity: Copilot is more accurate on simpler tasks, so decomposing complex tasks is worth exploring for better accuracy. [Ref #12]

Takeaway:

  • Developers should use Copilot with confidence for simpler tasks, repetitive patterns, and standard logic implementations, saving their own effort for more complex challenges. For complex work, break the task into smaller, well-defined steps before generating code with Copilot. This not only increases accuracy but also deepens understanding of the problem at hand.
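As a concrete illustration of this decomposition advice (a made-up example, not from the studies): rather than prompting Copilot for one monolithic "summarize the errors in a log file" function, each small, well-documented step below is a simple, standard pattern of the kind Copilot handles accurately.

```python
def parse_line(line):
    """Split a 'LEVEL: message' log line into (level, message)."""
    level, _, message = line.partition(": ")
    return level, message.strip()

def filter_errors(records):
    """Keep only (level, message) records whose level is ERROR."""
    return [r for r in records if r[0] == "ERROR"]

def count_messages(records):
    """Count occurrences of each distinct message."""
    counts = {}
    for _, message in records:
        counts[message] = counts.get(message, 0) + 1
    return counts

lines = ["INFO: started", "ERROR: disk full", "ERROR: disk full"]
summary = count_messages(filter_errors([parse_line(l) for l in lines]))
# summary == {"disk full": 2}
```

Each helper has a one-line, testable contract, so a generated suggestion for it is easy to verify — which is where the accuracy gain from decomposition comes from.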

Additional factors contributing to variations in productivity

Beyond the factors revealed in the studies and research papers above, other influences can also affect productivity. The following three were suggested by ChatGPT when asked why Copilot's productivity impact varies.

1. Learning Curve: Developers new to Copilot may initially experience reduced productivity due to the time needed to learn how to use it effectively.

Takeaways:

  • Organizations should plan workshops or training programs to familiarize developers with Copilot’s features, workflows, and best practices.

  • Recognize that developers new to Copilot may experience a short-term productivity decline. Set realistic expectations for adoption timelines and provide support during the learning phase.

2. Code Quality Review: The need to verify and refactor AI-generated code can sometimes offset productivity gains, particularly in critical or high-stakes projects.

Takeaway:

  • Organizations should emphasize that AI tools like GitHub Copilot are not replacements for human judgment. Integrate Copilot's use with established code review practices and coding standards, and incorporate code quality tools such as SonarQube or DeepSource.

3. Ethical and Security Concerns: Concerns about insecure or unoriginal code might lead to increased scrutiny, reducing overall efficiency.

Takeaway:

  • Organizations should encourage developers to validate Copilot's output for originality and security, and establish clear accountability for code quality by following security protocols and coding standards and by using tools such as Veracode, Checkmarx, Fortify, or Black Duck.

These factors, collectively, offer a comprehensive explanation for the observed variations in productivity when using Copilot.

Conclusion

Understanding the factors behind GitHub Copilot's varied productivity outcomes is crucial for organizations and developers looking to maximize its benefits. While Copilot holds great potential, its effectiveness depends on several variables, including developer experience, task complexity, language-specific performance, and code review practices. By addressing these factors and aligning expectations, teams can better leverage Copilot to enhance efficiency. Additionally, navigating the learning curve, improving sampling strategies, and addressing concerns around security and code quality are essential for unlocking its full potential. This nuanced understanding can help technology leaders, planners, managers, and developers make informed decisions, fostering greater innovation and productivity in software development.

References

#1. Research: Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness

#2. Is GitHub Copilot Worth It? Real-World Data Reveals the Answer

#3. Measuring GitHub Copilot's Impact on Engineering Productivity

#4. AI for Developer Productivity

#5. The Impact of GitHub Copilot on Developer Productivity: A Case Study

#6. The Productivity Effects of Generative AI

#7. eBay: Generative AI Development

#8. Copilot Developer Productivity

#9. Generative AI Improves Software Engineering Productivity by 70%, Says Ness-Zinnov Study

#10. Increasing Productivity with GitHub Copilot

#11. On the Robustness of Code Generation Techniques

#12. Evaluating the Usability of Code Generation Tools

#13. Is GitHub Copilot a Substitute for Human Pair-programming?

#14. How Generative AI Affects Highly Skilled Workers

#15. An Empirical Evaluation of GitHub Copilot's Code Suggestions

#16. Measuring GitHub Copilot's Impact on Productivity

#17. Productivity Assessment of Neural Code Completion


Written by

Raj Darshan Pachori

Technology Leader. Worked as Director of Engineering for Tyfone CDI. Currently researching and exploring Gen AI to drive productivity improvement in Product Engineering. 22+ yrs of exp and 10+ yrs in leadership roles.