Presenter(s)
Saabiriin Abdi
Abstract
Automated accessibility evaluation tools such as WAVE and Google Lighthouse are widely used to assess compliance with the Web Content Accessibility Guidelines (WCAG). However, prior studies indicate that these tools often disagree, vary in the success criteria they support, and differ in the types of issues they detect. Moreover, most prior evaluations were conducted on real websites, where the precise number of accessibility violations is unknown. Without a ground-truth baseline, it is impossible to measure false negatives, false positives, or detection accuracy.
This project addresses that gap by creating a controlled HTML webpage containing 40 intentional WCAG 2.1 Level A and AA violations distributed across the four POUR principles. Each violation is documented with its success criterion, HTML location, and expected detection behavior. WAVE and Lighthouse were executed 40 times under identical system conditions using automated macros to ensure consistency, and all results were directly compared to the WCAG 2.1 success criteria to identify true positives, false negatives, and false positives.
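As an illustration of this comparison step, the sketch below (not the project's actual scripts) classifies one tool's findings against the documented ground-truth violations; the success criteria, element locations, and data structures shown are assumed for the example.

```python
# Illustrative sketch only: hypothetical data structures, not the project's
# actual tooling. Each violation is keyed by its WCAG 2.1 success criterion
# and the location of the offending element on the test page.

# Ground truth: the documented, intentional violations seeded into the page.
ground_truth = {
    ("1.1.1", "#hero-img"),      # image missing alt text (hypothetical IDs)
    ("1.4.3", "#footer-note"),   # insufficient color contrast
    ("3.3.2", "#email-field"),   # form input missing a label
    # ... remaining documented violations
}

# Findings reported by one tool, normalized to the same (criterion, location) keys.
tool_findings = {
    ("1.1.1", "#hero-img"),
    ("1.4.3", "#footer-note"),
    ("4.1.2", "#nav-menu"),      # a reported issue that was not seeded
}

true_positives = tool_findings & ground_truth    # seeded violations correctly detected
false_negatives = ground_truth - tool_findings   # seeded violations the tool missed
false_positives = tool_findings - ground_truth   # reported issues not in the ground truth

print(f"TP={len(true_positives)}  FN={len(false_negatives)}  FP={len(false_positives)}")
```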
Both tools exhibited deterministic behavior, producing identical results across all runs. WAVE detected a broader range of violations, especially in perceivability and structure, yet generated substantial informational noise. Conversely, Lighthouse identified fewer violations overall but performed more strongly in programmatically testable areas such as ARIA validation and contrast errors. Overlap analysis showed a high level of agreement on missing alt text, absent labels, and contrast failures, but a low level of agreement on keyboard operability, focus order, and landmark structure. These findings indicated that the two tools captured fundamentally different subsets of WCAG violations.
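The overlap analysis can be sketched in the same way; the per-criterion detection sets and the Jaccard measure below are illustrative assumptions rather than the study's exact procedure.

```python
# Illustrative sketch: agreement between two tools, using hypothetical
# detection sets keyed by WCAG 2.1 success criterion.

wave_detected = {"1.1.1", "1.3.1", "1.4.3", "2.4.6", "3.3.2"}
lighthouse_detected = {"1.1.1", "1.4.3", "3.3.2", "4.1.2"}

both = wave_detected & lighthouse_detected        # criteria flagged by both tools
only_wave = wave_detected - lighthouse_detected   # flagged only by WAVE
only_lighthouse = lighthouse_detected - wave_detected

# Jaccard index as one simple measure of agreement between the two tools.
jaccard = len(both) / len(wave_detected | lighthouse_detected)

print(f"Agreement (both): {sorted(both)}")
print(f"WAVE only: {sorted(only_wave)}, Lighthouse only: {sorted(only_lighthouse)}")
print(f"Jaccard agreement: {jaccard:.2f}")
```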
Overall, the findings showed that no single automated tool provided complete WCAG 2.1 coverage. WAVE offered breadth of coverage but lacked precision, while Lighthouse offered precision but lacked breadth. The results demonstrated that multi-tool evaluation is necessary for meaningful accessibility assessment, and that manual testing remains crucial for success criteria that cannot be consistently automated. This study offers empirical evidence supporting the integrated application of both automated and manual approaches in assessing web accessibility.
College
College of Science & Engineering
Department
Computer Science
Campus
Winona
First Advisor/Mentor
Mingrui Zhang; Trung Nguyen
Location
Kryzsko Great River Ballroom, Winona, Minnesota; United States
Start Date
4-23-2026 1:00 PM
End Date
4-23-2026 2:00 PM
Presentation Type
Poster Session
Format of Presentation or Performance
In-Person
Session
2a=1pm-2pm
Poster Number
1
Analyzing Accessibility Issue Detection Differences Between WAVE and Google Lighthouse Using a Controlled WCAG Violation Webpage
