Generative Artificial Intelligence (GenAI) in Quality Assurance

January 31st, 2024 by Adam Sandman

This article outlines our understanding of the current state of the art for using artificial intelligence and machine learning in software testing and QA, and challenges/opportunities for our customers. We are focusing primarily on Generative AI (GenAI), but there are other AI/ML use cases that are very important in QA.

Disclaimer: As with any discussion around the ‘state of GenAI’, the capabilities improve daily, so what is written here may well be superseded in short order!

1. Testers are Developers/Teachers

With AI, testers can now teach and train the Large Language Models (LLMs) to understand the application(s) being tested. As described in this Microsoft document, Teachable Agents are a way to get around the limitation of LLMs where conversational assistants (such as ChatGPT) “forget” their learnings in between sessions. From the document:

“Teachability addresses these limitations by persisting user teachings across chat boundaries in long-term memory implemented as a vector database. Instead of copying all of memory into the context window, which would eat up valuable space, individual memories (called memos) are retrieved into context as needed. This allows the user to teach frequently used facts and skills to the teachable agent just once, and have it recall them in later chats.”

So, the role of the tester is to write prompts for the AI to perform tasks and test the application on the tester's behalf. The tester identifies incorrect or incomplete answers and 'teaches' the system, correcting its reasoning by providing facts or examples. With this approach, fixing is immediate and testing becomes a matter of fine-tuning.
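To make the idea of retrieving individual memos into context more concrete, here is a minimal sketch. It is not the Microsoft implementation: the `MemoStore` class, the `build_prompt` helper and the sample memos are hypothetical, and the word-count `embed` function is a toy stand-in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MemoStore:
    """Hypothetical long-term memory that survives between chat sessions."""

    def __init__(self) -> None:
        self._memos: list[tuple[Counter, str]] = []

    def teach(self, memo: str) -> None:
        """Persist one correction or fact supplied by the tester."""
        self._memos.append((embed(memo), memo))

    def recall(self, query: str, top_k: int = 1, min_score: float = 0.1) -> list[str]:
        """Return only the memos most similar to the query, not the whole memory."""
        q = embed(query)
        scored = sorted(((cosine(q, vec), memo) for vec, memo in self._memos), reverse=True)
        return [memo for score, memo in scored[:top_k] if score >= min_score]


def build_prompt(store: MemoStore, task: str) -> str:
    """Prepend the relevant memos to the task before sending it to the model."""
    memos = store.recall(task)
    if not memos:
        return f"Task: {task}"
    context = "\n".join(f"- {m}" for m in memos)
    return f"Known facts:\n{context}\n\nTask: {task}"


store = MemoStore()
store.teach("The login page requires a username and a password of at least 8 characters.")
store.teach("The checkout page rejects orders with an empty cart.")

prompt = build_prompt(store, "Test the login page with a short password")
print(prompt)
```

Only the login memo is pulled into the prompt here, which is the point of the approach: the context window carries what the current task needs rather than everything the tester has ever taught.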

2. Automation and Exploratory Testing Blends

Previously, test automation was seen mainly as a form of “checking” or regression testing, i.e. preventing existing errors or known issues from recurring. The move from traditional “scripted” manual testing to free-form exploratory or session-based testing was seen as the better use of human testers, with the scripted tests being automated.

However, with AI, a lot of exploratory testing can be potentially automated. For example, in this paper on how to use GPT-4V for testing, it was found to be a viable solution for navigating an iOS device, selecting the right application and executing a series of prompted tasks:

(Reproduced from the paper referenced above)

In the example above, the GPT-4V bot was able to complete the tasks on the iOS device, provide screenshots of its actions, and even generate textual reasoning as to why it was completing each task in a specific way and whether it succeeded or failed.

The role of the human tester in this scenario would be to review the results, judge whether they are correct and complete, and if not, tune/teach the model to improve them.

From the Inflectra point of view, it is realistic to expect that the "Play" button for a manual test in our Spira test management tool would just work without human interaction! Similarly, Inflectra hosts a software testing competition called the “SoftwareTestBowl”. At some point (just as we have seen with Chess and Go), GPT will start winning such testing competitions. This is a good example of how fast the AI world is moving.

3. Scenario Generation

Another major area that AI is changing is the creation of test scenarios themselves. For example, it will make sense to have an AI-based recorder that simply watches the user performing a task and generates an audio and/or textual description of the scenario.

This is already a valuable tool when conducting UX/UI usability interviews and assessments. The AI can look at the user performing a task and generate a summary transcript and list of tasks. The next step will be for the AI to be able to make recommendations based on common design patterns.

From the Inflectra point of view, one area of promising functionality is the ability to simply upload/draw the sketch of a proposed application into Spira and have Spira generate the full specification for it. For example, using the free LLaVA: Large Language and Vision Assistant and the Library Information System sample web application:

When you ask it: “This is a web software application. Can you generate the specification for it?” you get:

The image you've provided appears to be a screenshot of a web-based application interface for a library information system. Based on the visible elements, I can infer some of the features and functionalities that this application might have. Here's a speculative specification for the application:

Application Name: Library Information System

Purpose:
The Library Information System is a web-based application designed to manage and organize library resources, including books, authors, and genres. It aims to provide librarians and library staff with a user-friendly interface to catalog, search, and maintain their library's collection.

Features:

Book Management:

Ability to add new books to the system.

Fields for book details such as title, author, genre, and publication date.

Option to edit book details.

Fields for additional information like ISBN, publisher, and number of copies.

Ability to mark books as "Out of Print" or "In Print."

Author Management:

Fields for author details such as name, birth date, and death date.

Option to add multiple works by the same author.

Ability to link authors to their works.

Genre Management:

Fields for genre details such as name and description.

Option to add new genres.

Ability to link genres to books.

Search Functionality:

Advanced search options to find books by title, author, genre, and other criteria.

Ability to filter results by availability (in print/out of print).

User Interface:

Clean and intuitive layout for easy navigation.

Responsive design for use on different devices.

Clear and concise error messages for user feedback.

Security:

Secure login system for authorized users.

Role-based access control to restrict certain actions to specific users.

Data encryption for sensitive information.

Database Integration:

Backend database integration for efficient storage and retrieval of data.

Ability to import and export data in standard formats.

Reporting:

Generating reports on book circulation, author popularity, and genre trends.

Customizable reports to suit the needs of the library.

User Support:

Online help or FAQ section for user assistance.

Contact form for user inquiries.

Scalability:

Designed to handle a large number of books, authors, and genres.

Ability to add additional fields or modules as needed.

Technical Requirements:

Operating System: Compatible with modern web browsers (e.g., Chrome, Firefox, Safari, Edge).

Database: MySQL, PostgreSQL, or other relational database management systems.

Server Requirements: Apache, Nginx, or other web servers capable of running PHP or similar scripting languages.

Front-end Technologies: HTML, CSS, JavaScript, and possibly frameworks like React, Angular, or Vue.js for a responsive and interactive interface.

Back-end Technologies: PHP, Python, Ruby on Rails, or similar server-side scripting languages for database interaction and business logic.

Development Process:

Requirements gathering and analysis.

Design and prototyping.

Development and testing.

Deployment and maintenance.

Conclusion:
The Library Information System is a comprehensive web application designed to streamline library operations, from cataloging to searching and reporting. It is intended to be user-friendly, secure, and scalable to meet the needs of libraries of various sizes.

From this structured description we can create in Spira a list of:

Requirements and user stories
Test cases and BDD scenarios
Development tasks
Infrastructure requirements

With additional prompting, we can generate derived items such as project risks and include them in the Spira project model.
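As a rough illustration of how such a structured description could be turned into individual items, the sketch below splits the generated text into sections and flattens them into a requirements list. It does not use any Spira API: the `parse_spec` function is hypothetical, and it assumes headings end with a colon, as they do in the sample output above.

```python
def parse_spec(text: str) -> dict[str, list[str]]:
    """Group the non-empty lines of the generated spec under their headings.

    Assumption for this sketch: a heading is any line ending with a colon.
    """
    sections: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# A short excerpt of the generated specification shown above.
spec = """\
Book Management:

Ability to add new books to the system.

Option to edit book details.

Search Functionality:

Advanced search options to find books by title, author, genre, and other criteria.
"""

sections = parse_spec(spec)

# One candidate requirement per line, tagged with the feature it belongs to.
requirements = [
    f"[{feature}] {item}"
    for feature, items in sections.items()
    for item in items
]

for requirement in requirements:
    print(requirement)
```

A human reviewer would still need to check each candidate, since the specification is (in the model's own words) speculative.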

4. New Testing Subjects

Although, as we have seen, using AI can improve the productivity and efficiency of testing existing applications, GenAI itself requires testing, and that introduces new types of testing:

How do we test for AI hallucinations, insertions and “laziness”?
Previously we had resources to check, such as "memory" and "disk space". What new things do we have now (dreaming? incompleteness? instability? GPU usage? request window size)?
Intermittency. Since responses differ from run to run, we need to repeat each request several times when testing GPT to make sure the response stays valid.
There will be new types of applications to be tested, such as chatbots and voice-based applications.
Security testing. CAPTCHAs are meant to stop bots, but if bots can overcome them, how do we check for humans?
- Should applications care any more who is using them?
- GPT input vs human input. What happens when we don't want to allow our system to be used with another GPT system?
New User Interface (UI) Paradigms
- We had grids, trees with a summary.
  - Now we may produce a text summary about anything.
  - We may draw anything (picture of the day) or produce graphs in arbitrary format (org structure in anime style).
- How do we test these new UI elements?
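The intermittency point in the list above can be sketched as a repeated-request check. This is only an illustration: `pass_rate` is a hypothetical helper, `fake_model` is a stand-in that cycles through canned answers instead of calling a real LLM, and the 0.7 threshold is an arbitrary example value.

```python
from itertools import cycle
from typing import Callable


def pass_rate(ask: Callable[[str], str], prompt: str,
              is_valid: Callable[[str], bool], runs: int = 10) -> float:
    """Send the same prompt `runs` times and return the fraction of valid answers."""
    valid = sum(1 for _ in range(runs) if is_valid(ask(prompt)))
    return valid / runs


# Stand-in for a real model: the wording differs on every call and
# one answer in four is wrong.
_canned = cycle([
    "The login form has 2 fields.",
    "There are 2 fields: username and password.",
    "2",
    "The login form has 3 fields.",  # simulated hallucination
])


def fake_model(prompt: str) -> str:
    return next(_canned)


rate = pass_rate(
    fake_model,
    "How many fields does the login form have?",
    is_valid=lambda answer: "2" in answer,
    runs=8,
)
print(f"pass rate: {rate:.2f}")

# The test passes or fails on the rate across runs, not on a single response.
assert rate >= 0.7, "response validity dropped below the agreed threshold"
```

The design choice is that the validity check looks for the fact that matters (the number of fields) rather than comparing exact strings, because the wording is expected to vary.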

5. Testing Fluid AI Apps

In section 3 we talked about AI generating scenarios. The next level is that AI will be able to take the same sketches and prototypes and develop the actual applications. These “fluid” applications will be capable of being “regenerated” every day as the requirements change: from Agile to DevOps (and CI/CD) to continual development.

Imagine that we want to create a pizza ordering app:

Do we need to ask the AI to "write requirements for pizza app"?
Or can we prompt "Write pizza ordering app & create tests for it"?
Or just say "Write a fully tested pizza application" and that's it?
Then review the result, say what you want modified, and it will rewrite the application.

I.e. we can skip all the planning/requirements/test stages because they are done implicitly in a few seconds. But then the next level could be:

Do you still need an app whatsoever?
The customer will say to Bing “I want pizza”, and the rest is generated “on the fly”.

However, how do we test these dynamic, AI-composed applications? Their functionality, structure and features will change continuously, making it difficult to know what the correct behavior is. We need to do some regression testing... somehow!
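One possible direction (an assumption on our part, not an established answer) is to pin down a few behavior-level invariants that must hold no matter how the application is regenerated, and re-run them against every new version. In the sketch below, the two "generations" of an order-total function and the invariants themselves are hypothetical examples.

```python
from typing import Callable

# (quantity, unit_price) -> total price
OrderTotal = Callable[[int, float], float]


def check_invariants(order_total: OrderTotal) -> list[str]:
    """Return the business rules that the given implementation violates."""
    failures = []
    if order_total(0, 9.5) != 0:
        failures.append("an empty order must cost nothing")
    if order_total(2, 9.5) < order_total(1, 9.5):
        failures.append("ordering more pizzas must not cost less")
    return failures


def generation_monday(quantity: int, unit_price: float) -> float:
    """Hypothetical first AI-generated version."""
    return quantity * unit_price


def generation_tuesday(quantity: int, unit_price: float) -> float:
    """Hypothetical regenerated version that charges a fee even on empty orders."""
    return max(quantity - 1, 0) * unit_price + 1.0


for name, implementation in [("monday", generation_monday), ("tuesday", generation_tuesday)]:
    failures = check_invariants(implementation)
    print(name, "OK" if not failures else failures)
```

The internal structure of the two versions is completely different, yet the same checks apply to both, which is what makes this kind of test survive a regeneration.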

6. Reporting & Analytics

There are new ways for reporting and analytics to be created using AI:

New summary report - summarize documents. Summarize information based on text request.
New reports like "What is unusual in the activity?"
Visualization - ask to draw a graph using a text prompt. The prompt may describe both data and view.
Find 10 most clearly defined incidents and 10 most vague ones
Map queries
- Find regions with a lack of drug stores.
- Find similar "shops" that are too close to each other.

These are complex to test and make it more difficult to create test scenarios without context. The tester (either human or AI) needs to apply “reasoning” to determine if the system is behaving as desired.

7. AI Laziness

When using AI to do testing, there are some limitations due to AI “laziness” that need to be considered:

AI stops after finding the first match and fails to find them all.
AI may return some of the results, but not everything.
AI may lie or misrepresent facts.
AI has a limited context window. This means it may read only limited input to draw its conclusions.
AI tends to forget the middle. It focuses on the beginning and end of the request.
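Several of the limitations above can be probed by planting known items at the start, middle and end of an input and checking which ones come back. The sketch below assumes hypothetical `build_haystack` and `missed_needles` helpers, and `lazy_finder` is a stand-in that imitates a model stopping at the first match rather than a real LLM call.

```python
from typing import Callable


def build_haystack(filler: list[str], needles: list[str]) -> list[str]:
    """Spread the needles evenly through the filler lines (start, middle, end)."""
    lines = list(filler)
    step = len(filler) // (len(needles) - 1) if len(needles) > 1 else 0
    for i, needle in enumerate(needles):
        # "+ i" accounts for the needles already inserted before this one.
        lines.insert(min(i * step + i, len(lines)), needle)
    return lines


def missed_needles(find_all: Callable[[str], list[str]],
                   haystack: list[str], needles: list[str]) -> list[str]:
    """Return the planted items that the finder failed to report."""
    found = set(find_all("\n".join(haystack)))
    return [needle for needle in needles if needle not in found]


def lazy_finder(text: str) -> list[str]:
    """Stand-in for a 'lazy' model: reports the first defect and stops."""
    for line in text.splitlines():
        if line.startswith("DEFECT-"):
            return [line]
    return []


filler = [f"log line {i}: ok" for i in range(10)]
needles = ["DEFECT-1 at start", "DEFECT-2 in middle", "DEFECT-3 at end"]

haystack = build_haystack(filler, needles)
missed = missed_needles(lazy_finder, haystack, needles)
print("missed:", missed)
```

Because the tester knows exactly what was planted and where, any missed item shows whether the gap is at the start, in the middle, or at the end of the input.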

8. Cost Considerations

Ironically, the latest GPT-4 models are expensive to use, and unless the results from the prompts are cached, using them for testing is currently more expensive (in terms of number of tokens) than using human testers/designers.

However, that will change in the future.

The simpler GPT-3.5 Turbo models are currently cost-effective and can be used for more limited tasks, as we’re demonstrating already with the Inflectra Spira AI functionality that generates risks, test cases, scenarios and tasks using GPT-3.5.

To make GPT-4 models cost-effective at present, you need to use token optimization strategies such as the Teachable Agents and vector databases that we discussed in section 1.
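The caching of prompt results mentioned in this section can be sketched as a thin wrapper around the model call. This is a minimal in-memory illustration: `CachedModel` and `fake_model` are hypothetical names, and a real setup would need a persistent store and a policy for when cached answers become stale.

```python
import hashlib
from typing import Callable


class CachedModel:
    """Wrap a model call so identical prompts are only paid for once."""

    def __init__(self, ask: Callable[[str], str]) -> None:
        self._ask = ask
        self._cache: dict[str, str] = {}
        self.paid_calls = 0  # how many prompts actually reached the model

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.paid_calls += 1
            self._cache[key] = self._ask(prompt)
        return self._cache[key]


def fake_model(prompt: str) -> str:
    """Stand-in for a real (billed) LLM request."""
    return f"answer to: {prompt}"


model = CachedModel(fake_model)
first = model("Generate test cases for the login page")
second = model("Generate test cases for the login page")  # served from cache
third = model("Generate risks for the checkout page")

print("paid calls:", model.paid_calls)
```

Three requests were made but only two reached the model, so a regression suite that repeats the same prompts on every run would pay for them only once.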

Conclusion

As you can see in this article, we are witnessing a real-time revolution in how AI can help make software testing easier, while simultaneously creating new types of application to be tested, and new ways we need to think about testing. Clearly 2024 will be an interesting year in testing!

Adam Sandman is a visionary entrepreneur and a respected thought leader in the enterprise software industry, currently serving as the CEO of Inflectra. He spearheads Inflectra’s suite of ALM and software testing solutions, from test automation (Rapise) to enterprise program management (SpiraPlan). Adam has dedicated his career to revolutionizing how businesses approach software development, testing, and lifecycle management.