
At the 2025 UK Evaluation Society conference, artificial intelligence in evaluation emerged as a standout theme. Among the presentations and conversations, our Junior Consultant Pierre Canet made a notable contribution by showcasing how we applied generative AI (genAI) in the qualitative coding process within our evaluation of the Girls’ Education Challenge Fund (GEC). In this blog, he reflects on the practical benefits, the impact on consultants and the future of evaluation and research methodologies.
Elevating evaluation with genAI
Evaluation as a discipline continues to evolve through the adoption of new tools that enhance both the quality and efficiency of analysis. Software packages like R and Stata have become essential for quantitative analysis, while NVivo and Dedoose are integral to qualitative work. However, the volume of data and documents generated in complex programmes increasingly challenges human capacity for timely and thorough analysis.
New tools like genAI offer promising ways to increase evaluators’ efficiency and their ability to undertake complex analyses. Large Language Models (LLMs), a type of genAI with particularly strong analysis capabilities, expand the possibilities for rapidly processing documents. Our Evaluation and Research practice has been piloting the use of such an LLM – in this case ChatGPT Enterprise – to conduct document reviews. Because ChatGPT Enterprise does not use input data to train its models, we are able to harness the power of its workspace to enhance our analysis and processing while meeting the confidentiality and security requirements of our clients.
As experienced evaluators, we often need to analyse large volumes of written data from interviews, monitoring reports and evaluations: we read, code and analyse text. Junior staff typically handle the basic coding of excerpts, which can be particularly time intensive, while more experienced evaluators focus on interpreting excerpts and drafting findings. So how can genAI assist?
Lessons from our genAI pilot
Over the past 12 years, our independent evaluation has followed the FCDO-funded Girls’ Education Challenge Fund to assess its implementation and outcomes through a series of evaluation studies. In our recent Lessons Learned study, we harnessed ChatGPT Enterprise to support the time-intensive qualitative coding of over 30 evaluation documents spanning the programme’s history. With FCDO’s approval, we tailored AI prompts to focus on specific thematic areas, enabling the LLM to quickly extract excerpts that could then be exported to Excel and analysed.
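To illustrate the general pattern rather than our exact process, here is a minimal Python sketch of theme-based excerpt extraction followed by an Excel export. Note that our pilot used the ChatGPT Enterprise workspace itself, not an API script; the themes, prompt wording, model name and file names below are illustrative assumptions only.

```python
# Illustrative sketch only: prompt an LLM for theme-tagged excerpts and export them to Excel.
# The theme list, prompt wording, model name and file names are assumptions, not our GEC setup.

import json
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

THEMES = ["teacher training", "girls' attendance", "community engagement"]  # illustrative themes

def extract_excerpts(document_text: str) -> list[dict]:
    """Ask the model to return theme-tagged excerpts as a JSON array."""
    prompt = (
        "You are assisting with qualitative coding of an evaluation report.\n"
        f"Themes: {', '.join(THEMES)}\n"
        "Return a JSON array of objects with keys 'theme' and 'excerpt', "
        "quoting passages verbatim from the document below.\n\n"
        f"{document_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Real use would need more robust parsing and checking of the model's reply.
    return json.loads(response.choices[0].message.content)

# Code one report and export the excerpts for human review in Excel (requires openpyxl).
with open("evaluation_report.txt") as f:
    excerpts = extract_excerpts(f.read())
pd.DataFrame(excerpts).to_excel("coded_excerpts.xlsx", index=False)
```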
Here are the key lessons we took away:
1. Balancing time and cost savings with quality assurance
The application of genAI should be prioritised for large document reviews, where the time saved in coding justifies the time invested in prompt tailoring. The LLM can significantly reduce the time and money spent on document review: it quickly filters out irrelevant information, allowing teams to accelerate the coding process and focus more on content analysis. However, human quality assurance is still required to confirm accuracy, as AI outputs depend heavily on the precision of prompts. In some cases, reviewing AI-generated content can be as time-consuming as manual coding.
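One way to make that human review more targeted, sketched below purely as an illustration (the file name and column names are assumptions, not part of our GEC workflow), is to automatically flag AI-returned excerpts that do not appear verbatim in the source document, so reviewers can concentrate on those alongside their usual thematic spot checks.

```python
# Illustrative QA helper: flag AI-returned "excerpts" that cannot be found
# word-for-word in the source text, so a human reviewer checks them first.

import pandas as pd

def flag_unverifiable_excerpts(excel_path: str, source_text: str) -> pd.DataFrame:
    """Return rows whose excerpt does not appear verbatim in the document."""
    coded = pd.read_excel(excel_path)  # assumes columns 'theme' and 'excerpt'
    unverified = coded[~coded["excerpt"].apply(lambda e: str(e).strip() in source_text)]
    return unverified

# Anything flagged here goes straight to a human reviewer; a clean result is
# still no substitute for reading the coded excerpts against the original report.
```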
2. Impact on the evaluation work and junior staff development
It is crucial to balance AI efficiency with opportunities for junior staff to engage closely with source materials. Manual coding remains a vital training ground for junior evaluators: it builds an understanding of the thematic issues, the project being evaluated, and the geographic context in which it is delivered. It is also a key process through which an evaluator makes sense of the coding framework and the research questions the study is trying to address.
Using AI to bypass this process risks fragmenting their learning experience and limiting holistic comprehension of the documents being reviewed, since individual excerpts are analysed out of context.
3. Research rigour and ethics
Rigorous evaluation demands that AI tools support, not replace, the critical thinking and expertise of evaluators. Because of the lack of transparency over how AI algorithms function, evaluators cannot fully trace how AI reaches certain conclusions or why it selects particular information from documents. This may introduce biases into the analysis while preventing evaluators from understanding what those biases might be.
Given the opaque nature of AI models, we emphasise the need for careful strategies to mitigate those risks, such as systematically quality assuring the results of AI-assisted work. Our experience shows that decisions about interpretation and insight should not be left to the algorithm alone; they must be critically assessed by evaluators.
Implications for future evaluation
Taking full advantage of AI’s time-saving potential requires having a certain level of trust in the algorithm, especially when applied to critical and complex analyses. This trust, however, must be carefully balanced with the rigour demanded by the research.
LLMs are best positioned as support tools within the evaluation process, serving as efficient starting points that speed up initial data processing and coding. But it remains essential for evaluators to critically review AI-generated outputs and verify information beyond the AI interface to ensure accuracy and reliability.
Looking ahead, we are committed to continuing our exploration of generative AI as a supporting tool in evaluation, refining our approaches to maximise its benefits while upholding the rigour and ethical standards central to our research.

Pierre Canet
Pierre Canet is a Junior Consultant in the Evaluation and Research Practice at Tetra Tech International Development.
Through his experience in MEL consulting, he has developed quantitative and qualitative research and analysis skills. Pierre has experience in designing and implementing results frameworks, including theories of change and logframes.
Pierre has worked on a Lessons Learned Report for the Girls’ Education Challenge (GEC) Fund Phase II and is working on multiple studies for the FCDO Integrated Security Fund (ISF) programme.