Testing and evaluating a Generative AI tool for NHS Research Ops

Designing and testing tools that use Generative AI, such as large language models (LLMs), requires more than user research and prototyping alone. Generative AI can be an excellent tool in specific contexts, but to determine whether a product is viable we also need to test and evaluate the quality of its output for each use case.

In this case study, I present my process for evaluating a Generative AI tool, the findings, and the ways we improved the tool.

Problem space

NHS England's User Research community includes over 125 user researchers across many programmes and directorates. Researchers conduct vast amounts of research but need an easy way to share it and learn from one another. Without one, research may be repeated rather than learnt from, which can waste:

  • project time
  • public funding
  • time and effort of users*

* This would especially affect marginalised groups, who are often asked to put time and effort into the research process repeatedly, answering the same questions again and seeing few outcomes, which further erodes trust.

Opportunities

Before I joined the team, a research phase identified the user needs for sharing and acquiring research knowledge and documentation.

Next, the team made a technical prototype of a research repository where users could upload their research reports, and a summary of the report would be generated.  

An illustration and flow chart showing how the tool works. From a user perspective, users follow onboarding steps, upload a document, review and edit the fields, and then publish. From a generative AI perspective, the input (the report) is received and processed according to prompts defined in advance, and the resulting data output is shown to the user for review and editing.

Impact

The research and design recommendations that I delivered are continuing to improve the product, as the team implements design changes and updates the prompts.

Over 200 user research projects have been added to the prototype, which forms the foundation of the production database. In the meantime, user researchers are already using it to access information and share knowledge.

This project has been showcased to public sector organisations and government as an example of how they can build and use AI products.

Jump to

My role

Evaluation approach

Evaluation results

User research with uploaders

User research with viewers

Improving prompts

Design recommendations

Lessons learned

My role

As a contract Senior User Researcher, I:

  1. Designed and carried out an evaluation of the functional prototype
  2. Planned and conducted user research with user researchers while they uploaded reports
  3. Planned and conducted user research with user researchers and designers while they tried to find relevant reports
  4. Created final design recommendations for upcoming phases
  5. Collaborated with colleagues on iterating the prompts, also known as prompt engineering, to improve generative AI outputs

Evaluation approach

The tool takes each uploaded research report and generates information for a set of fields (sketched as a simple record after this list), such as:

  • Title
  • Executive summary
  • Research methods
  • Key learnings
  • Outcomes
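These fields amount to a small structured record for each report. A minimal sketch of how such a record might be represented, assuming field names taken from the list above rather than the tool's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record generated per uploaded report; field names follow the
# list above, not the tool's actual schema.
@dataclass
class GeneratedProjectRecord:
    title: Optional[str] = None
    executive_summary: Optional[str] = None
    research_methods: list[str] = field(default_factory=list)
    key_learnings: list[str] = field(default_factory=list)
    outcomes: Optional[str] = None
```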

To evaluate the functional prototype and use of generative AI, I:

  1. Recorded data field completion and edit rates: I uploaded 47 research reports and recorded which fields were completed by the tool and which I had to edit. This helped identify where the AI prompts used to extract information needed improving, and where the accuracy of the generated content fell short. A sketch of how these rates can be tallied follows this list.
  2. Tested reliability: I uploaded the same 7 files 6 times each to see how the summarised and extracted data changed from one upload to the next.
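In practice this tallying is simple spreadsheet work, but as a rough illustration, here is a minimal sketch of how completion and edit rates could be computed from an evaluation log. The log structure and field names are hypothetical, not the actual data:

```python
from collections import defaultdict

# Hypothetical evaluation log: one row per report and field, noting whether
# the tool completed the field and whether I had to edit its output.
evaluation_log = [
    {"report": "report_01", "field": "title", "completed": True, "edited": True},
    {"report": "report_01", "field": "executive_summary", "completed": True, "edited": False},
    {"report": "report_02", "field": "research_questions", "completed": False, "edited": False},
    # ... one row per report/field combination
]

totals = defaultdict(lambda: {"uploads": 0, "completed": 0, "edited": 0})

for row in evaluation_log:
    stats = totals[row["field"]]
    stats["uploads"] += 1
    stats["completed"] += row["completed"]
    stats["edited"] += row["edited"]

for field_name, stats in totals.items():
    completion_rate = stats["completed"] / stats["uploads"]
    # Edit rate is measured against the fields the tool actually completed.
    edit_rate = stats["edited"] / stats["completed"] if stats["completed"] else 0
    print(f"{field_name}: completed {completion_rate:.0%}, edited {edit_rate:.0%}")
```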

Evaluation results

Completion rates

Completion rates varied considerably. For example, the LLM generated titles for all 47 reports, whereas it completed the project completion date for only 34 reports, with 12 entered manually.

Research questions and outcomes are two important fields that were rarely found, with a 6% completion rate. These fields are more text-heavy, which is why improving their completion rates is crucial for the product's success.

A bar graph showing the tool's completion rates for each field, and those that I edited manually. The description after this image outlines the important aspects of the data verbally.

Edit rates

Edit rates were recorded to assess the quality of output provided by the tool and to find any inaccuracies.

For example, the completion rate for project titles seems positive; however, of the 47 auto-completed title fields, I edited 38. The tool did not generate the project title by summarising the report; instead, it extracted the largest heading and displayed that, always excluding the subtitles, which often contained important information.

For the executive summary, I edited 40 of 47. Many of these edits were due to insignificant information being generated; however, this can easily be improved by adapting the AI prompt.

The tool incorrectly identified 6 of 23 user groups or teams, sometimes taking information out of context. For example, if background research was included as an introduction, the tool assumed that this was the research of interest. In another case, research constraints were seen as research findings.

A bar graph showing the tool's edit rates for each field. The description after this image outlines the important aspects of the data verbally.

Reliability testing: longer fields

Sometimes, more concise reports were hard for the tool to process. In 3 of 6 uploads, the tool filled gaps by taking a participant quote and presenting it as an overall finding in the project's background information, which was inaccurate. In 4 of 6 uploads, it said that the user group was the research team.

Here is a sample of 2 upload results from the same report, altered to describe an imaginary topic.
Inaccurate statements are shown in light pink and labelled 'inaccurate' for screen readers; accurate statements are shown in light blue and labelled 'accurate'.

Upload 1’s result:
This survey was conducted on 13 May by the Research team and aimed to gather insights into the experiences of fairies and gnomes. The team size and sample size of participants are undisclosed. The survey was conducted to gain a deeper understanding of the fairies’ process and how that affects the gnomes. It included the information they needed to check. The underlying problem being addressed is the need for fairy wands to not be misplaced so often.

Upload 2’s result:
This research was conducted on 13 May by the fairies and gnomes who have been doing this work since May 2021. It aimed to understand the process of fairies and its effect on the gnomes.
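I compared these repeated outputs by reading them side by side. As an illustrative sketch only (not the method used on the project), a lightweight way to put a number on this run-to-run variation is a simple text-similarity check:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical summaries returned by repeated uploads of the same report.
outputs = [
    "This survey was conducted on 13 May by the Research team ...",
    "This research was conducted on 13 May by the fairies and gnomes ...",
]

# Pairwise similarity: 1.0 means identical text; lower values mean more
# drift between runs for the same input document.
for (i, a), (j, b) in combinations(enumerate(outputs, start=1), 2):
    ratio = SequenceMatcher(None, a, b).ratio()
    print(f"Upload {i} vs upload {j}: similarity {ratio:.2f}")
```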

Reliability testing: shorter fields

When uploading the same document repeatedly, the user groups identified often varied. The tool could not determine which type of user group description was most helpful.

Here is a sample of 6 upload results, altered to describe a different topic. Some were quite generic, others were helpful, and upload 4 was not useful. A simple consistency tally follows the list.

Upload 1: fairy wand users

Upload 2: fairy wand users

Upload 3: individuals with conditions affecting wand use

Upload 4: users participants interviewees

Upload 5: individuals with conditions affecting wand use

Upload 6: fairy wand users  fairies with one wing fairies with a torn wing pixies
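For short categorical fields like this, a simple tally makes the run-to-run variation visible at a glance. A minimal sketch, using the altered values above:

```python
from collections import Counter

# The six user-group values returned for the same report (altered topic).
user_groups = [
    "fairy wand users",
    "fairy wand users",
    "individuals with conditions affecting wand use",
    "users participants interviewees",
    "individuals with conditions affecting wand use",
    "fairy wand users  fairies with one wing fairies with a torn wing pixies",
]

counts = Counter(user_groups)
most_common_value, frequency = counts.most_common(1)[0]

print(counts)
print(f"Most common answer appears in {frequency / len(user_groups):.0%} of uploads")
```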

Improving the longer fields is a higher priority than these shorter fields, as longer fields take more time for the uploader to edit. We therefore deprioritised the user group field for now; it will need further exploration with a multidisciplinary team.

User research with uploaders

User group

Uploaders are user researchers who have conducted a research project and want to share it with colleagues.

Method

5 moderated remote usability testing sessions where user researchers:

  1. Uploaded reports
  2. Rated the outputs for completeness and accuracy

The objective was to identify perceptions and pain points.

Top 3 findings

As the use of generative AI is relatively new, we found common misconceptions about it. In response to low completion rates, users falsely assumed that their files must not include the correct information. They requested templates to standardise reports for uploading, assuming this would increase completion rates, rather than recognising that the technology needed improvement.

Due to the low completion rate of some fields, uploading information to the finder takes a lot of time. Some participants thought the process took too long, while others said they were happy to take the time to complete it. As a product team, we agreed that the process was taking too long and set an expectation of 10 minutes. This will also be the KPI for determining product success, and our goal is to improve the technology to meet it.

Users also provided a lot of feedback on the user flow and content design. Often, when AI was mentioned in the content, users became hesitant and confused. We will therefore continue to improve these aspects to help prepare users for the process and let them know what to expect.

User research with viewers

User group

Viewers are often user researchers and designers who want to know about and learn from previous research.

Method

5 moderated remote usability testing sessions where user researchers and designers:

  1. Searched for topics
  2. Scanned the results
  3. Chose projects of interest and explored them in more depth

Top 3 findings

The viewers' goal is to scan different projects quickly and identify the most relevant ones. Some fields were not helpful and were repetitive. Simplifying and reducing the number of fields will help viewers scan while also making the uploading process easier for uploaders.

Titles and executive summaries need to be improved so that it is clear what each project is about. The titles were often unclear, which made the executive summary essential reading, but it often lacked the necessary information.

Viewers were unclear how some search results related to their query, because many unrelated projects were shown. Some falsely believed the search functionality was more sophisticated than it is, searching through the full report as well as the data shown.

Improving prompts

In response to the research findings, we began iterating the prompts. Within the first working session, we successfully improved three of the prompts, generating better-written content that surfaced more of the important information.
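The project's prompts are internal, so as an illustration only, here is a hypothetical before-and-after for an executive summary prompt; the wording is invented, not the prompts we used:

```python
# Hypothetical example of the kind of prompt iteration described above;
# the wording is illustrative, not the prompts used on the project.
PROMPT_BEFORE = "Summarise this user research report."

PROMPT_AFTER = (
    "Write an executive summary of this user research report in no more than "
    "120 words. Focus on what was researched, who the users were, and the key "
    "findings. Do not include background detail such as team names, dates or "
    "document structure."
)
```

The general pattern is to be explicit about length, focus and exclusions rather than asking for a generic summary.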

Design recommendations

Based on the user research and data findings, prototypes were made for testing and implementation. These prototypes have been simplified to show the main concepts and changes.

Changing the design flow

Before

1. Home page

The user research finder homepage. The two main options highlighted are for users to find a research project or add a project.

2. Share user research

Once a user has pressed the button to add a user research report, they are taken to a page that lists the criteria. This is to ensure the document is anonymised and to describe the process.

3. Upload report

This is the upload page. Users can review some content and then upload the file. Most of the content on this page has been simplified to lorem ipsum, as the main aim is to upload.

4. Check the details

Once the user has uploaded the report, they are brought to a screen where they can review the information that the tool has extracted from the document. It is displayed for each field, such as "executive summary", "background", and so on.

5. Edit the details

After the user has reviewed the content, they can edit the fields. On this page, each field has a box underneath containing the generated content. Each box is interactive, so users can add to or delete the content provided.

6. Submission complete

Once edited, the user submits and arrives at this page, which confirms that their report has been uploaded and thanks them.

After

1. Home page (same)

The user research finder homepage. The two main options highlighted are for users to find a research project or add a project.

2. Share user research (same)

Once a user has pressed the button to add a user research report, they are taken to a page that lists the criteria. This is to ensure the document is anonymised and to describe the process.

3. Upload report (same)

This is the upload page. Users can review some content and then upload the file. Most of the content on this page has been simplified to lorem ipsum, as the main aim is to upload.

4. Check and edit the details (new)

In this revised flow, after uploading their report the user goes straight to the edit page, where they can edit each field.

5. Confirm your submission (new)

After editing, the user continues to this page, where they check over their answers before submitting. If they notice a mistake at this point, they can go back and edit.

6. Submission complete

Once the user has confirmed their submission, they come to this page, which confirms that their report has been uploaded and thanks them.

User research identified that the existing flow did not match users' expectations. In the revised version, the editing page comes first so that users can check and edit the content at the same time. After editing, they see the full content again and are given the option to go back and edit if they missed something.

Changing the check and edit pages

Before

On the original edit page, the heading says "Check the details". Below it there is an alert box listing which fields are empty, prompting users to check and enter the details. Below that there are 13 fields, some of which are empty. Users also see character counts for the longer fields. They can then press "Submit" at the bottom.

After

On the new edit page, the heading has been changed to "Check and edit the details". After the heading, there is text describing what information is provided and what users need to do. The red alert box has been removed. There are now fewer fields: 4 have been removed and 1 has been added. Under some of the field headings, helper text has been added to describe the field.

Based on the user research and the quantitative results, it was clear that the check and edit page needed revising. I:

  • Added clear instructions on what users should do and removed the red box. Users wanted to leave missing fields blank and were not editing information they identified as inaccurate.
  • Added helper text to fields that users found confusing, for example the executive summary. Clear descriptions should reduce editing time by removing uncertainty.
  • Removed the Background, Duration, Outcomes and Key learning fields. Some were redundant, while others were better suited for the report with more context.
  • Added formatting for the research questions field and increased its character limit
  • Added a 'Terms of reference' field for user researchers to explain acronyms that have been identified by the LLM

Changing the search result page

Before

On the search results page, the heading says "Search results for Lorem Ipsum", followed by a search bar for further searching.
The next section contains the search results, noting how many there are. Each result has a title, a portion of the executive summary, the completion date, user groups and keywords.

After

On the revised search results page, the results have been changed so that the full executive summary is visible. Each result also shows the completion date, team and user group. On the right-hand side there is a filtering function with different types of information that can be expanded, for example: project phase, date, user group and research method.

After - filtering

This prototype is very similar to the design with the filter, but here I am showing how the filter might operate. The project phases have been expanded and Discovery has been selected, so only two results are shown rather than three.

User research identified that the search page was not meeting user needs. Technically, this will be a key focus in the next development phase. Design-wise, we can also improve the user experience with key changes. Here I:

  • Ensured users can see the full executive summary, rather than a portion of it, so that they can scan it quickly to see if the project is relevant for them or not.
  • Improved the executive summary prompt to remove low-impact, front-loaded content
  • Displayed the search query to remind people what they searched for
  • Added filters to help users narrow down search results

Next steps

The team has continued to improve the prompts. My design and development recommendations were delivered to the product team for prioritisation, ready for the next development phase to begin.

Lessons learned

Lesson 1

Uploading reports and gathering rich quantitative data about the completion and edit rates helped identify where we needed to improve the prompts. Next time, I would explore with developers ways to automate this process, as uploading data manually was very time-consuming.

Lesson 2

From this experience, I learnt ways in which a user researcher / designer can significantly impact systems that use generative AI and ensure that the products are accurate, helpful, desirable, and provide quality outputs. Evaluating the system through both qualitative and quantitative methods and using those outcomes to improve it was incredibly satisfying and crucial.

Lesson 3

Completing both rounds of user research before changing the design was beneficial because we could simplify and remove many fields. If we had moved forward with design changes between rounds, we would have done a lot of redundant work.