
GenAI pilot project ‐ Coding with LLMs


The NCEPLIBS team is a participant in NOAA's GenAI pilot program. Our work aims to explore the use of large language models (LLMs) in code development, for example,

  • using GenAI tools to assist in code review;
  • using GenAI tools to generate drafts of code documentation;
  • using GenAI predictive-completion features;
  • using GenAI to assist with assembling materials for reporting; and
  • using GenAI to assist with writing code.

This page contains examples of applying LLMs to various aspects of code development, including best practices identified so far and lessons learned.

Writing code

Unit test development

Documentation generation

Code review with LLMs

Given the proficiency of many established LLMs at working with code, our team has also explored the use of LLMs to assess code modifications and supplement human code reviews (i.e., for GitHub pull requests).

Via web interfaces

A number of LLM providers offer free, web-based interfaces that can be used to generate code reviews. This approach does not lend itself to comprehensive, systematic, automated code reviews, but it may be sufficient for targeted reviews that provide suggestions or bug detection for modest pieces of code. Input-token limits, especially in free-tier services, are an important constraint on using these interfaces to analyze entire code bases or extensive code changes.

Via APIs (continuous integration)

Several providers offer free RESTful LLM APIs, each with its own usage limits. As a proof of concept for integrating these with our GitHub Actions-based continuous integration pipelines, we have created the NOAA-EMC/ci-llm-code-review custom GitHub action, which can be run manually or triggered automatically for a pull request.

We have found the rate limits to be fairly restrictive, so the ci-llm-code-review action generates a separate code review for each file (i.e., each file is reviewed in a separate query to an LLM API). This necessarily limits the model's insight into the code, as it cannot draw context from other files to determine, say, whether a given change to a variable's name or data type has been propagated appropriately. Here is the custom action's default prompt:

Create a code review for the following file diff. Structure the review as 'Summary of Changes,' 'Issues & Possible Errors,' and 'Suggestions' with possible better approaches if appropriate. Focus your review on the new and removed lines indicated by + and - symbols, using the surrounding context to understand the impacts of those changes. These changes are part of the NOAA EMC '${{ github.event.repository.name }}' repository, so please consider the implications of these changes for the rest of the code.
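
As a rough illustration of the per-file approach described above (not the actual ci-llm-code-review implementation), the Python sketch below splits a pull request's unified diff into one chunk per file and submits each chunk as a separate request. The endpoint URL, model name, and response layout are placeholder assumptions for a generic OpenAI-compatible chat API.

```python
# Illustrative sketch only: review a PR diff one file at a time.
# The endpoint, model name, and response format are assumptions, not the
# interface of any specific provider or of the ci-llm-code-review action.
import os
import requests

PROMPT = (
    "Create a code review for the following file diff. Structure the review as "
    "'Summary of Changes,' 'Issues & Possible Errors,' and 'Suggestions.'\n\n"
)

def split_diff_by_file(diff_text):
    """Split a unified diff into one chunk per file ('diff --git' delimiters)."""
    chunks, current = [], []
    for line in diff_text.splitlines():
        if line.startswith("diff --git") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def review_file(file_diff):
    """Send one file's diff to a hypothetical OpenAI-compatible chat endpoint."""
    resp = requests.post(
        "https://api.example.com/v1/chat/completions",  # placeholder endpoint
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={
            "model": "example-model",  # placeholder model name
            "messages": [{"role": "user", "content": PROMPT + file_diff}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    with open("pr.diff") as f:  # e.g., the output of 'git diff main...HEAD'
        for chunk in split_diff_by_file(f.read()):
            print(review_file(chunk))
```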

Among the available free providers, Google Gemini's rate limits are the most generous, with input-token limits set on a per-day basis, and we have done some testing of providing many files' worth of context within a single query. So far the model's ability to navigate the code has proven fairly limited, but this work is only preliminary.
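
The sketch below shows one way that kind of batching might work: per-file diff chunks are packed into as few queries as possible under a rough input-token budget. The budget value and the characters-per-token heuristic are illustrative assumptions, not Gemini's actual limits.

```python
# Illustrative batching sketch: pack per-file diffs into a single query while
# staying under a rough token budget. Both constants below are assumptions.
TOKEN_BUDGET = 200_000   # assumed per-request input limit, not a real quota
CHARS_PER_TOKEN = 4      # crude heuristic; real tokenizers vary

def batch_file_diffs(file_diffs, budget=TOKEN_BUDGET):
    """Group per-file diff chunks so each batch fits the rough token budget."""
    batches, current, used = [], [], 0
    for diff in file_diffs:
        est = len(diff) // CHARS_PER_TOKEN + 1
        if current and used + est > budget:
            batches.append("\n".join(current))
            current, used = [], 0
        current.append(diff)
        used += est
    if current:
        batches.append("\n".join(current))
    return batches
```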

Apart from finding free tools with reasonable rate limits, the other main challenge is developing prompts that consistently yield accurate, usable code reviews. Our initial testing produced reviews that were highly inconsistent in scope, depth, and accuracy. Whether generating one-off or automated reviews, we find it useful to give explicit instructions about the desired scope and depth, and even adding "be thorough and accurate" does not hurt.
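
As one illustration of adding explicit scope and depth instructions (the wording here is an example, not a standardized project prompt), a prompt template might look like this:

```python
# Example prompt template with explicit scope/depth instructions; fill in with
# REVIEW_PROMPT.format(repo="NCEPLIBS-bacio", diff=file_diff). The wording is
# illustrative, not the project's standard prompt.
REVIEW_PROMPT = """\
You are reviewing a pull request for the NOAA EMC '{repo}' repository.
Be thorough and accurate. Limit the review to the lines added (+) or
removed (-) in the diff below, using the surrounding context only to judge
the impact of those changes. Structure the review as 'Summary of Changes,'
'Issues & Possible Errors,' and 'Suggestions.' If something cannot be
verified from the diff alone, say so rather than guessing.

{diff}
"""
```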

Example: Reviewing NCEPLIBS-bacio modifications

Here is a sample review generated by the Microsoft Phi 4 model for a randomly chosen set of changes to the NCEPLIBS-bacio library:

[Screenshot: sample Phi 4 code review of NCEPLIBS-bacio changes]

The model has correctly parsed the diff-formatted code modifications. As with most LLM-generated reviews, especially given the limited context provided, the insights are often limited and can be a bit generic. For example, the review says, "Ensure that all version updates are consistently documented," and, "Verify that all references to bacio in the codebase are updated to baciol," because it does not have the context that would be needed to directly confirm whether corresponding changes were made.

In our investigations so far, the LLMs tested (Google Gemini 2.0, MistralAI, GPT-4o, Microsoft Phi 4) have all proven fairly adept, though not 100% accurate, at parsing diff/patch-formatted output, correctly recognizing '+' and '-' symbols as representing added and removed lines, respectively.

Via code editors/plugins

We have not extensively explored tools that integrate LLMs into code editors (Visual Studio plugins, etc.). From preliminary testing, tools such as Windsurf and GitHub Copilot appear reasonably adept at understanding code in context, including explaining and reviewing it. Note that free tools of this kind are even scarcer than free LLM APIs, so the applicability of these free-tier tools to larger projects is limited.
