Integrating artificial intelligence into radiological practice for automated pneumonia assessment
Highlight box
Key findings
• The study shows a strong concordance between artificial intelligence (AI)-based automated quantification and traditional semi-quantitative methods for assessing the extent of lung lesions in coronavirus disease 2019 (COVID-19) patients. AI can serve as a reliable alternative or complement to conventional techniques, providing consistent, objective, and reproducible results.
What is known and what is new?
• Traditional methods are subject to interpretation variations between observers. AI has shown its potential in reducing human errors and improving the consistency of assessments, but few studies have explored its effectiveness in real-world clinical settings.
• This study provides strong evidence of the concordance between AI and traditional methods, while highlighting the need to evaluate the effectiveness of AI in practical clinical contexts.
What is the implication, and what should change now?
• AI offers an opportunity to standardize the assessment of lung lesions in COVID-19 pneumonia, which could improve diagnostic accuracy and disease progression monitoring, particularly in resource-limited settings.
• The integration of AI into clinical practice should be considered, taking into account the required analysis time and adjusting workflows accordingly. Further research is needed to validate long-term outcomes and assess the effectiveness of AI across different imaging systems and patient populations.
Introduction
Background
Coronavirus disease 2019 (COVID-19) commonly leads to severe respiratory failure due to viral pneumonia, which significantly impacts lung function. At the start of the pandemic, polymerase chain reaction (PCR) testing resources were limited, which made imaging techniques, particularly computed tomography (CT) scans, critical for diagnosing COVID-19 pneumonia, identifying pulmonary lesions, and assessing the severity of the disease (1-5). Beyond diagnosis, CT scans are also useful for monitoring disease progression and quantifying the extent of lung damage. In recent years, artificial intelligence (AI) has emerged as a powerful tool to assist radiologists in analyzing CT images, improving the speed and accuracy of lesion detection and quantification. AI algorithms, with their ability to process large amounts of data, have the potential to enhance the diagnostic process, especially in critical situations where timely decision-making is essential (6).
Rationale and knowledge gap
While CT scans remain essential for diagnosing COVID-19 and assessing lung damage, traditional methods such as semi-quantitative scoring and subjective interpretation by radiologists are often subject to inter- and intra-observer variability (7,8). These limitations highlight the need for more reliable, objective, and reproducible methods. AI, with its capacity for automated image analysis, has shown promise in addressing these issues by offering consistent results without the inherent variability associated with human interpretation. Although AI has shown potential in diagnosing COVID-19 and assessing lung damage through imaging techniques, there is a lack of comprehensive studies evaluating the consistency and accuracy of these models across various clinical settings and patient populations. Most existing research focuses on isolated aspects, without assessing how these models perform under real-world conditions, considering variations in imaging data and patient demographics. Furthermore, the generalizability of AI models across different imaging equipment and hospitals has not been sufficiently explored.
This study addresses this gap by providing a detailed evaluation of AI performance in both COVID-19 diagnosis and lung damage assessment, focusing on consistency and accuracy across various clinical scenarios. We highlight the need for robust, well-validated AI tools that can provide consistent results across different clinical environments, thereby filling an important gap in the literature and advancing the practical application of AI in pandemic management. Additionally, AI could standardize lesion assessment, thereby improving diagnostic accuracy and monitoring disease progression. However, the application of AI in COVID-19 diagnosis and lesion quantification is still emerging, and its concordance with traditional methods remains insufficiently explored. Another under-researched aspect is the comparison of time efficiency between AI-based methods and conventional radiological techniques, an important consideration for understanding the practical implications of implementing AI in clinical practice (9-12).
Objective
The primary objective of this study was to evaluate the concordance between AI-based automated quantification methods and conventional semi-quantitative and manual methods for assessing the extent of lung damage in COVID-19 patients (13,14). Specifically, this study sought to investigate the reliability, accuracy, and consistency of AI in detecting lung lesions and quantifying their severity compared to traditional radiological assessments (15). Furthermore, the study aimed to analyze potential time differences between AI-based quantification methods and traditional semi-quantitative or manual techniques, providing insights into the efficiency and practicality of AI applications in clinical settings (16). Ultimately, the findings aim to assess whether AI can serve as a reliable alternative or complement to conventional methods in evaluating COVID-19 pneumonia. We present this article in accordance with the STROBE reporting checklist (available at https://tro.amegroups.com/article/view/10.21037/tro-24-26/rc).
Methods
Patient selection and study design
We conducted a single-center retrospective cohort study from 17 March to 11 May 2020, at CHU Ibn Sina in Rabat. The study included 106 patients diagnosed with COVID-19 pneumonia, based on either a positive COVID-19 PCR test (using the Roche LightCycler 480 automated thermal cycler) or clinical symptoms of COVID-19 pneumonia, such as fever, cough, dyspnea, and exposure to COVID-19, as determined by the emergency physician (17-19) (see Table 1).
Table 1
| Parameters | Total (n=106) | Death (n=17) | Survival (n=89) |
|---|---|---|---|
| Age (years) | 64±10 | 72±6 | 62±9 |
| Male | 60 [57] | 11 [65] | 50 [56] |
| BPCO | 6 [6] | 0 [0] | 6 [7] |
| Asthma | 11 [10] | 2 [11] | 9 [10] |
| Diabetes | 25 [24] | 7 [41] | 18 [20] |
| HTA | 46 [43] | 11 [65] | 35 [39] |
| Cardiopathy | 26 [25] | 7 [41] | 19 [21] |
| Cancer | 19 [18] | 7 [41] | 12 [13] |
| Intensive care | 43 [41] | 6 [35] | 37 [42] |
| PCR+ | 76 [72] | 13 [76] | 63 [70] |
Data are presented as mean ± SD or number [%]. BPCO, bronchopneumopathie chronique obstructive (chronic obstructive pulmonary disease); HTA, hypertension artérielle (arterial hypertension); PCR+, polymerase chain reaction positive; SD, standard deviation.
The study duration was chosen to reflect the specific evolution of the COVID-19 pandemic, which did not follow a linear trajectory. By selecting a period shorter than a full year, we were able to align the study with key phases of the pandemic, including changes in virus transmission and the public health measures implemented, which had varying effects on disease dynamics. The sample size was determined from the expected effect size, the primary outcomes of the study, and the desired level of statistical significance, drawing on available epidemiological data and historical trends of COVID-19. A power analysis, with power set at 80% and a significance threshold of 0.05, showed that a sample of 106 participants was required to achieve the desired statistical power.
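The power calculation above is not reported in detail. As an illustration only, a standard two-sided, two-proportion sample-size calculation with 80% power and α=0.05 can be sketched as follows; the effect-size inputs are hypothetical, not the study's:

```python
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions
    (normal approximation). Inputs here are illustrative only."""
    z_a = norm.ppf(1 - alpha / 2)   # two-sided significance
    z_b = norm.ppf(power)           # desired power
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return num / (p1 - p2) ** 2

# Hypothetical effect: 40% vs 65% event rate between groups
n = sample_size_two_proportions(0.40, 0.65)
print(round(n))  # per-group size under these assumed inputs
```

Different assumed proportions or power levels would of course yield a different required sample size.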
To standardize the assessment of patients’ respiratory status, it is essential to measure several key parameters, such as respiratory rate, breathing depth, and respiratory rhythm. These measurements should be performed under controlled conditions, considering the patient’s posture and the environment. Pulse oximetry and spirometry are essential tools for assessing oxygen saturation and respiratory volumes, thereby providing information on pulmonary ventilation efficiency. Moreover, measurement conditions must be standardized, taking into account temperature, humidity, and the patient’s position, ensuring that measurements are taken at similar times, particularly before and after physical activity or treatment. The efficiency of gas exchange, measured by the ratio of oxygen consumption to carbon dioxide production, can also be monitored using blood gas analysis, which helps detect potential dysfunctions. Finally, it is crucial that healthcare professionals receive uniform training to ensure consistent assessment, following strict protocols for measurement, result interpretation, and documentation of the patients’ respiratory status.
The typical CT appearance of COVID-19 pneumonia included peripheral, bilateral ground-glass opacities, with or without consolidation, or visible intralobular lines (20,21). No exclusion criteria were applied.
Data were retrospectively collected from electronic medical records (EMR) between May and June 2020, using Excel. These records, stored in the EMR system at CHU Ibn Sina in Rabat, contained all patient medical data, including CT scan results, which were analyzed to assess lung lesions in patients with COVID-19 pneumonia (22). The images were analyzed using semi-quantitative, manual, and AI-based automated methods. All collected data were coded and processed using R software (version 4.3.3). The primary objective of this retrospective study was to examine the radiological characteristics of COVID-19 patients and evaluate their clinical progression during this specific period of the pandemic.
A compatible clinical picture was defined by the emergency physician and included fever, cough, dyspnea, and exposure to COVID-19 (20,23).
Subsequently, we aim to address the following question: is there a significant difference between the techniques employed, specifically regarding their ability to assess the degree of severity of lung damage?
In this study, Lin’s concordance correlation coefficient was used to compare a new measurement method against the gold standard. Specifically, we tested the concordance of the lesion-extent percentages produced by the AI against the gold standard (the reconstruction performed with the Aquarius software in Augustin Huet’s study) to determine whether the new method can be used in place of the gold standard. In addition, the Kruskal-Wallis test was used as an alternative to analysis of variance (ANOVA), since the normality assumption was not acceptable. It tests whether k samples (k>2) come from the same population (see Tables 2,3).
Table 2
| Statistics | Q20 interne chrono | Q20 interne 2 chrono | Q20 senior chrono | AI chrono | Logical chrono |
|---|---|---|---|---|---|
| Average (s) | 60.4 | 35.8 | 35.1 | 231.7 | 1,269.3 |
| SD (s) | 13.9 | 12 | 11.1 | 33.1 | 465.4 |
| Median (s) | 58.5 | 34 | 33 | 235 | 1,277 |
| Min (s) | 26 | 12 | 12 | 150 | 482 |
| Max (s) | 91 | 85 | 67 | 360 | 2,392 |
AI, artificial intelligence; SD, standard deviation.
Table 3
| Method | Q20 R1 | Q20 R2 | Q20 S |
|---|---|---|---|
| Q20 AI | | | |
| ρc | 0.83 | 0.87 | 0.80 |
| ρ of Pearson | 0.88 | 0.88 | 0.88 |
AI, artificial intelligence.
Kruskal-Wallis test
H0: there is no significant difference between the time (in s) to obtain the results according to the different methods Q20, Q24, AI, or software.
H1: there is a significant difference between the time (in s) to obtain the results according to the different methods Q20, Q24, AI, or software.
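As a sketch of how this test can be run, scipy’s `kruskal` can be applied to per-patient timing samples; the values below are hypothetical, with magnitudes loosely mirroring the averages later reported in Table 2:

```python
from scipy.stats import kruskal

# Hypothetical per-patient analysis times (s) for three of the methods
q20_senior = [30, 32, 34, 36, 38]
ai         = [220, 230, 235, 240, 250]
software   = [1200, 1250, 1300, 1350, 1400]

h_stat, p_value = kruskal(q20_senior, ai, software)
print(h_stat, p_value)  # small p -> reject H0: completion times differ
```

With fully separated groups like these, the ranks do not overlap and the test rejects H0 decisively, which is the pattern the study reports.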
CT protocol
Chest imaging was performed using a Siemens SOMATOM Definition AS+ CT scanner (Siemens Healthineers, Erlangen, Germany) according to a standardized protocol: patients were placed in the supine position, with their arms elevated, in apnea during image acquisition. The decision to inject intravenous iodinated contrast material to rule out pulmonary embolism was left to the discretion of the radiologist after discussion with the emergency physician.
The acquisition parameters used were: 120 kV, automatic tube current, pitch =1.2, slice thickness =1 mm, rotation speed =0.33 s, increment =0.7 mm, and matrix =512×512 (see Figure 1).
CT image analysis
Semi-quantitative analysis with Q20 scores: for each of the five pulmonary lobes, the extent of lobe involvement was quantified from 0 to 5, with a maximum of 20 points. This semi-quantitative evaluation was performed independently and blinded by two junior radiologists (observer No. 1 and observer No. 2: residents, 2 years of experience) and a senior radiologist (observer No. 3: more than 10 years of experience in thoracic imaging).
Q24: Semi-quantitative scale that assigns a score to three zones for each lung: upper (above the carina), middle (between the carina and the inferior pulmonary vein), and lower (below the inferior pulmonary vein). Scores were defined as follows: 0 (0%), 1 (1–24%), 2 (25–49%), 3 (50–74%), 4 (75–100%). The Q24 was obtained by adding the scores of each of the six individual zones and varies from 0 to 24. This semi-quantitative evaluation was performed independently and blinded by two junior radiologists (observer No. 1 and observer No. 2: resident, 2 years of experience) and a senior radiologist (observer No. 3: more than 10 years of experience in thoracic imaging).
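The Q24 rule above can be expressed as a small scoring function; this is an illustrative sketch, not code used in the study:

```python
def zone_score(percent_involved):
    """Map a zone's percentage involvement to its 0-4 Q24 sub-score:
    0 (0%), 1 (1-24%), 2 (25-49%), 3 (50-74%), 4 (75-100%)."""
    if percent_involved == 0:
        return 0
    if percent_involved <= 24:
        return 1
    if percent_involved <= 49:
        return 2
    if percent_involved <= 74:
        return 3
    return 4

def q24(zone_percentages):
    """Sum the sub-scores of the six zones (upper/middle/lower, each lung)."""
    assert len(zone_percentages) == 6
    return sum(zone_score(p) for p in zone_percentages)

print(q24([0, 10, 30, 55, 80, 100]))  # -> 0+1+2+3+4+4 = 14
```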
Semi-automated software-based quantification was performed using TeraRecon Aquarius Intuition software. This algorithm provided an initial segmentation of lung parenchyma (see Appendix 1). If needed, corrections were made manually by the reader using contouring tools. The difference between well-aerated and affected lung regions was defined by a density threshold [between −450 and −600 Hounsfield units (HU)], set individually for each chest CT. The choice of a density threshold between −450 and −600 HU is based on the typical range of attenuation values observed in chest CT scans that distinguish well-ventilated lung tissue from areas affected by disease, such as pneumonia. In normal, healthy lung tissue, the air-filled alveoli result in relatively low attenuation values, typically ranging from around −900 to −700 HU. Well-ventilated, healthy lung regions generally have higher air content, leading to less density and more negative HU values. However, in areas affected by pneumonia or other lung pathologies (such as COVID-19), the lung tissue becomes inflamed, consolidates, or fills with fluid, leading to increased attenuation values. These regions have a higher density due to the presence of inflammatory cells, fluid, and consolidated tissue, which results in less negative HU values compared to healthy lung tissue. The range between −450 and −600 HU was chosen because it effectively captures the transition from healthy to affected lung tissue, with affected areas such as those with ground-glass opacities, consolidation, or inflammation falling within this density range. By setting the threshold between −450 and −600 HU, the algorithm can effectively differentiate well-ventilated lung areas (with lower HU values) from areas affected by disease (which have higher HU values due to increased tissue density and fluid accumulation). 
This threshold allows for an accurate segmentation of lung regions based on their density, helping to quantify the extent of disease involvement and aiding in the assessment of pulmonary conditions. A percentage of lung involvement was then automatically calculated based on the ratio of COVID-19 pneumonia volume/total lung volume. It was made by observer No. 1.
The TeraRecon Aquarius Intuition software is a powerful tool for the semi-automated segmentation of lung parenchyma from chest CT scans. It employs advanced algorithms to identify and segment the lung tissue by analyzing the HU values in the CT images. Specifically, the software uses a density threshold in the range of −450 to −600 HU, which enables it to differentiate between well-aerated lung tissue (typically represented by lower, more negative HU values) and areas affected by pneumonia (which generally exhibit higher HU values due to inflammation, consolidation, or ground-glass opacities, common in conditions like COVID-19). Once the algorithm performs the preliminary segmentation, it generates an initial delineation of the lung regions, identifying the affected areas based on their increased density. However, since the algorithm’s performance may not be flawless, particularly in challenging cases with heterogeneous disease patterns or artifacts, manual corrections can be performed using the contouring tools within the software. These tools allow the user to refine the segmentation by adding or removing segmented areas, adjusting the borders for greater precision, and smoothing contours for a more natural fit along the lung surface. After completing the segmentation, the software automatically calculates the volume of the affected lung tissue, corresponding to the regions identified as having pneumonia. The software also computes the percentage of lung involvement by calculating the ratio of the affected lung volume to the total lung volume, which includes both healthy and diseased lung tissue. This quantitative measure provides a precise assessment of pulmonary involvement, which is critical for evaluating the severity of the disease.
These results, including the volume and percentage of affected lung tissue, are essential for clinical decision-making. They assist clinicians in assessing the extent of pneumonia, monitoring disease progression over time, and determining appropriate treatment strategies. In the context of COVID-19, the ability to accurately quantify lung involvement plays a crucial role in predicting patient outcomes, evaluating the effectiveness of treatments, and managing resource allocation, such as the need for ventilation or intensive care.
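A minimal numpy sketch of the thresholding logic described above, assuming an HU array and a precomputed lung mask; this is not the TeraRecon implementation, and the fixed −500 HU threshold stands in for the per-scan adjustment within the −450 to −600 HU range:

```python
import numpy as np

def percent_involvement(hu, lung_mask, threshold=-500):
    """Percentage of lung voxels at or above the density threshold.
    Denser (less negative) voxels are counted as affected; the threshold
    would be tuned per scan in the -600 to -450 HU range."""
    lung = hu[lung_mask]
    affected = lung >= threshold
    return 100.0 * affected.sum() / lung.size

# Toy 'scan': healthy lung near -800 HU, a consolidated patch near -200 HU
hu = np.full((10, 10), -800.0)
hu[:3, :] = -200.0                      # 30 affected voxels out of 100
mask = np.ones_like(hu, dtype=bool)
print(percent_involvement(hu, mask))    # -> 30.0
```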
Quantitative automated analysis: a quantitative automated analysis was performed independently by observer No. 1 using AI software [Advanced Workstation Server (ADWS)-CT pneumonia analysis-SIEMENS]. This evaluation is based on pulmonary density levels. The results, including total lung volumes, are provided directly by the software. For the set-up of the automatic analysis, the density threshold was adjusted automatically. The software produced a summary of results expressed, on the one hand, as a Q20 score and, on the other hand, as a percentage of the extent of lung damage suggestive of COVID-19. Quantification scores were calculated in a blinded process by randomizing the CT reading order. The time spent analyzing each CT by each of the observers was measured using a smartphone, whatever the method (24).
In this project, we used transfer learning to build a deep learning model capable of detecting COVID-19 pneumonia from chest CT scans (see Appendix 2). Transfer learning allows us to take advantage of a pre-trained convolutional neural network (CNN), which has already learned useful features from a large image dataset. This significantly reduces training time and improves performance, especially with limited medical image data. The model architecture includes several convolutional and pooling layers followed by dense layers for classification. The image data was preprocessed and augmented to improve generalization. After training, the model was able to accurately distinguish between normal lungs and lungs affected by pneumonia.
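The convolution, pooling, and dense stages described above can be illustrated with a single-channel numpy forward pass. The weights here are random and the example is purely structural; the actual model uses a pre-trained backbone with trained weights:

```python
import numpy as np
rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D convolution (cross-correlation, as in CNN layers)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling."""
    h, w = x.shape
    return x[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).max(axis=(1, 3))

# Forward pass: 32x32 'CT slice' -> conv -> ReLU -> pool -> dense -> 2 logits
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((3, 3))
feat = max_pool(np.maximum(conv2d(image, kernel), 0))   # (15, 15) feature map
w_dense = rng.standard_normal((feat.size, 2))
logits = feat.ravel() @ w_dense
print(logits.shape)  # (2,): normal vs pneumonia scores
```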
Statistical analysis
Both techniques were applied to the same series of 106 patients, yielding two paired series of 106 observations. First, the results are presented in condensed form. Five patients are classified as critical by both techniques, while 23 patients are considered severe by the AI method against 19 by the Software (R1) method, and 31 are considered extensive by the AI method versus 42 by the Software (R1) method. Finally, 21 patients have a pulmonary extent considered minimal by the AI method (see Figure 2), against 12 patients by the Software (R1) method (see Table 4).
Table 4
| State of gravity | AI method | Software (R1) method |
|---|---|---|
| Minimal | 21 | 12 |
| Moderate | 26 | 29 |
| Extensive | 31 | 42 |
| Severe | 23 | 19 |
| Critical | 5 | 5 |
Data are presented as number. AI, artificial intelligence.
Thus, we can statistically test the relationship between these results. Since these are two paired series, the appropriate test is the Chi-squared test.
Hypotheses:
- H0: there is no significant difference between techniques according to the degree of severity;
- H1: there is a significant difference between techniques depending on the degree of severity.
Given that the two variables are of categorical type, we will thus use the Chi-squared test.
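As an illustration, scipy’s `chi2_contingency` can be applied to the counts in Table 4. Note that the study’s reported statistic (613.7) was computed on the paired per-patient data, so this contingency-table sketch will not reproduce it:

```python
from scipy.stats import chi2_contingency

# Severity counts from Table 4: AI method vs Software (R1) method
observed = [
    [21, 26, 31, 23, 5],   # AI method
    [12, 29, 42, 19, 5],   # Software (R1) method
]
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)  # dof = (2-1) * (5-1) = 4
```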
Lin’s concordance coefficient
We proposed utilizing the Lin approach to compute the concordance correlation coefficient for evaluating reproducibility. This method requires minimal distribution assumptions and considers the correlation between measurements taken from the same subject, enabling the simultaneous modeling of various measurements. Traditionally, indices such as intraclass correlation and within-subject coefficient of variation have been employed to assess reproducibility in scientific literature.
The choice of correlation coefficient heavily relies on the measurement range. In 2000, Lin et al. (25) emphasized the importance of reporting the data range and comparing the agreement of different measurement methods across similar analytical ranges. These considerations are crucial when applying the suggested method in practical settings. Additionally, they introduced a comprehensive index, the generalized concordance correlation coefficient, to evaluate agreement in both continuous and categorical data. The study is registered with CTRM/2021/05/01568.
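Lin’s coefficient itself is straightforward to compute; here is a minimal sketch using the standard definition with biased (1/n) variance estimates, as in Lin’s original formulation:

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient:
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    sxy = ((x - mx) * (y - my)).mean()           # biased covariance
    return 2 * sxy / (x.var() + y.var() + (mx - my) ** 2)

a = np.array([10.0, 20.0, 30.0, 40.0])
print(lin_ccc(a, a))        # perfect agreement -> 1.0
print(lin_ccc(a, a + 5.0))  # same correlation, systematic shift -> below 1.0
```

The second call shows why ρc differs from Pearson’s r: a constant offset leaves the correlation at 1 but is penalized by the (mean_x − mean_y)² term in the denominator.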
Ethical considerations
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics committee of CHU Ibn Sina in Rabat (CARA/AI/21/2234) and informed consent was taken from all individual participants.
Results
Study population
First, we recruited a total of 490 patients who underwent a chest CT for suspicion or evaluation of pulmonary SARS‑CoV‑2 infection at CHU Ibn Sina in Rabat. We then kept 76 patients with a positive reverse transcription-PCR (RT-PCR) COVID-19 test and evident signs of pneumonia on CT, and 30 patients with compatible clinical signs and pneumonia abnormalities on chest CT, for a total of 106 patients in our cohort. Some of them had several chest CTs during their hospitalization, but we kept only one per patient, the one showing the most severe damage. A total of 384 patients were excluded because their chest CT showed no pulmonary abnormalities suggestive of a COVID-19 infection, they had no history or symptoms of COVID-19, or their PCR test results were negative (see Figure 3) (20,22,26-29).
The comorbidities
Hypertension was the most frequent comorbidity, present in 43% of patients, followed by cardiopathy (25%) and diabetes (24%). As regards pulmonary diseases, there were 11 patients with asthma (10%) and 6 patients (6%) with chronic obstructive pulmonary disease. Nineteen patients (18%) had a history of cancer. We can also note that 41% of the patients went through intensive care (see Table 1).
Pulmonary damage extent according to spirometry fibrosis ratio (SFR) classification
The distribution of the study population in terms of pulmonary damage is as follows (see Table 5). Out of 106 individuals, 17 died (n=17) while 89 survived (n=89). As a percentage of the total population, 11% of individuals had damage of less than 10%, of whom no deaths were recorded. Twenty-seven percent had pulmonary damage between 10% and 25%, with three deaths reported, representing 18% of this group. Forty percent of the population had damage in the 26–50% range, with eight deaths, or 47% of the group concerned. Individuals with damage between 51% and 75% represented 18% of the population, with four deaths noted, i.e., 24% of the group. Finally, 5% of the population had damage in excess of 75%, with two deaths reported, representing 11% of this sub-group.
Table 5
| Percentage | Total (n=106) | Death (n=17) | Survival (n=89) |
|---|---|---|---|
| <10% | 12 [11] | 0 | 12 [13] |
| 10–25% | 29 [27] | 3 [18] | 26 [29] |
| 26–50% | 42 [40] | 8 [47] | 34 [38] |
| 51–75% | 19 [18] | 4 [24] | 15 [17] |
| >75% | 5 [5] | 2 [11] | 3 [3] |
Data are presented as number [%].
The P value of the Chi-squared test is less than 2.2×10−16, which is below 0.05, leading to the rejection of the null hypothesis (H0). The Chi-squared test result (Chi-squared =613.7, P<2.2×10−16) indicates that the severity distributions obtained using the contouring software from internal radiology [Software (R1)] differ significantly from those generated by the AI CT analysis software (AI). However, this Chi-squared test does not provide any indication of the degree of concordance between the two techniques. For that purpose, we chose the concordance test, which can be used whenever we want to study the relationship between two qualitative variables of two paired series.
The analysis reveals substantial agreement, with a ρc of 0.9271, though not equal to 1 [95% confidence interval (CI): 0.8975, 0.9484]. This results from both imperfect precision (Pearson’s r=0.9372) and a small systematic bias (bias correction factor Cb =0.9893).
The statistical test uses a one-tailed z-test with a significance level of 0.05. The value of ρc is 0.9271, which is considered excellent. The concordance plot in the figure supports this assessment: it shows only a slight systematic deviation from the line of perfect concordance, and this discrepancy is considered negligible (see Table 6).
Table 6
| Parameters | Results |
|---|---|
| Variable Y | AI percentage |
| Variable X | Software R1 |
| Sample size | 105 |
| ρc (95% CI) | 0.9271 (0.8975 to 0.9484) |
| Pearson (precision) | 0.9372 |
| Bias correction factor Cb (accuracy) | 0.9893 |
AI, artificial intelligence; CI, confidence interval.
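The decomposition in Table 6 can be sanity-checked, since Lin’s ρc factors into precision (Pearson’s r) times accuracy (Cb):

```python
# Lin's rho_c = precision (Pearson's r) x accuracy (bias correction Cb);
# the values from Table 6 are consistent up to rounding
r  = 0.9372   # Pearson correlation (precision)
cb = 0.9893   # bias correction factor (accuracy)
rho_c = r * cb
print(round(rho_c, 4))  # close to the 0.9271 reported in Table 6
```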
The Lin’s concordance test (with 95% CI) was used, on the one hand, to calculate the concordance between the AI and the Software (R1) method to objectify the several types of lesions, on the other hand, to determine agreement between radiologists and AI with Q20 Score (see Figure 4).
The results were considered consistent for values between 0.61 and 1.00. Agreement was considered excellent (0.81–1.00), strong (0.61–0.80), fair (0.41–0.60), and poor (0.21–0.40).
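This qualitative scale can be written as a small helper (an illustrative sketch, with band boundaries as stated in the text):

```python
def agreement_label(rho_c):
    """Map a concordance coefficient to the qualitative scale used here:
    excellent (0.81-1.00), strong (0.61-0.80), fair (0.41-0.60),
    poor (0.21-0.40)."""
    if rho_c > 0.80:
        return "excellent"
    if rho_c > 0.60:
        return "strong"
    if rho_c > 0.40:
        return "fair"
    if rho_c > 0.20:
        return "poor"
    return "no agreement"

print(agreement_label(0.9271))  # -> excellent, as for the AI vs Software (R1) result
```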
The concordance correlation coefficient ρc measures the agreement between two variables, for example to assess reproducibility or inter-rater reliability, whereas the ordinary correlation coefficient (Pearson’s ρ) assesses an association (dependence) between two variables. Pearson’s ρ is insensitive to whether the biased or unbiased version of the variance estimate is used; the concordance correlation coefficient is not (see Table 3).
Concordance between radiologists and AI was excellent regardless of the type of lesions observed (see Figure 5).
Without performing tests, we can see that the average completion times of the different methods in our sample are not the same. This leads to the following questions: how different are they? Are the averages sufficiently close for it to be concluded that the completion times of the different methods are the same? Are the averages too different for us to draw this conclusion? In order to answer these questions, we will perform the Kruskal-Wallis test.
The P value of the Kruskal-Wallis test is less than 2.2×10−16. Since the P value is less than 0.05, we reject the H0 hypothesis. This result indicates that the times to obtain the results differ globally across the methods used (Q20, Q24, AI, or software), whichever observer performed the reading (internal radiologist 1, internal radiologist 2, or senior radiologist).
The standardized time recording process for analyzing each scan with a smartphone begins with the use of an accurate timing application. The timing starts when the scan is opened and ends when the analysis is complete. Each step of the examination follows a uniform protocol to ensure consistency, and the start, end, and total times are recorded and stored centrally. Regular checks of the timing accuracy are carried out to ensure data precision, which is then analyzed to ensure consistency across different users and sessions. This process ensures the accuracy and comparability of the time measurements for analysis.
We note that the methods that take the longest are the software-based methods, compared with the semi-quantitative methods (Q20 senior, Q20 internal 2, and Q20 internal). Analysis of the graph shows that the slowest method is the software method (1,269.3 s), followed by the AI (231.7 s); these times can be explained by the overhead of operating the software. Among the semi-quantitative readings, the senior radiologist’s completion time (35.1 s) is the shortest, which may reflect this radiologist’s expertise. Next comes internal radiologist 2, who obtained the results with the Q20 method in 35.8 s on average, and finally internal radiologist 1 (60.4 s) (see Figure 5).
Discussion
The extent of pulmonary damage is a key factor explaining the elevated mortality rate observed in this study. Previous studies have shown that the severity of lesions, as measured by CT, is correlated with the severity of the infection and the patient’s prognosis, particularly in the context of COVID-19 (30,31). In our study, patients with severe pulmonary damage (25–75%) exhibited higher mortality rates, consistent with observations that severe pulmonary lesions can lead to respiratory failure and other organ complications. While other factors, such as comorbidities and age, may also influence prognosis, the extent of pulmonary damage appears to be the primary factor explaining the high mortality observed in our cohort.
While diagnosing COVID-19 using CT scans might appear straightforward for radiologists, our findings challenge this assumption. Our results demonstrate that the accuracy achieved by automated methods is comparable, if not superior, to human expertise, underscoring the potential of these technologies to aid physicians in decision-making. Specifically, AI can effectively discern not only the presence of COVID-19 in CT scans but also the characteristics of lung lesions, particularly those that are subtle or ambiguous. The outcomes of our study offer promising prospects for integrating AI models into clinical practice.
A major strength of this study lies in the high concordance between AI quantification methods and human expertise. The correlation model revealed a strong agreement (ρc=0.9271) between the AI method and the software, which was considered the gold standard. Furthermore, our findings indicate significant concordance between the semi-quantitative method and the evaluations made by senior radiologists as well as residents. Despite divergent reasoning, senior radiologists tended to arrive at conclusions largely in agreement with those of the residents. However, there are limitations to consider. The relatively small sample size limits the generalizability and robustness of our conclusions, and the wide CIs for the concordance correlation coefficients indicate that more data are needed to draw definitive conclusions. Additionally, AI systems took longer to apply than the semi-quantitative methods, although they were faster than the software method; this delay is attributable to the complexity of using AI tools rather than to the expertise of the radiologist.
It is noteworthy that the 95% CIs for the concordance correlation coefficients were considerably wide, mainly owing to the small sample size and moderate precision of our study. These estimates may also be biased downward because of the small sample size. Caution should therefore be exercised in interpreting these results, and further validation in larger patient cohorts through reproducibility studies is recommended. Future research should also investigate the causality of AI predictions, in addition to model explainability, to evaluate the reliability of the explanations provided by AI systems. Regarding the future potential of AI in pulmonary disease diagnosis, advancements in AI, particularly convolutional neural networks (CNNs), are expected to yield even more promising results. These networks, which are particularly well suited to the analysis of complex images, could enhance the detection of pulmonary abnormalities and refine predictions of disease severity and progression. This could lead to significant improvements in the automated quantification of diffuse interstitial lung disease, particularly in predicting its severity and evolution, which would be especially useful in the absence of biomarkers, for patients in whom the appropriateness of treatment is unclear and for whom treatments may prove effective but may also carry significant side effects.
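The type of concordance analysis discussed above can be illustrated with a minimal sketch: Lin's concordance correlation coefficient between two paired raters, with a percentile-bootstrap 95% CI that tends to be wide at small sample sizes. The paired lesion-extent scores below are hypothetical values for illustration only, not data from this study.

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired raters."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the CCC; small samples yield wide intervals."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)      # resample patient pairs with replacement
        stats.append(lin_ccc(x[idx], y[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical paired lesion-extent scores (%) from an AI method and software
ai       = [10, 25, 40, 55, 60, 35, 20, 70, 15, 50]
software = [12, 22, 43, 50, 62, 33, 25, 68, 18, 47]
ccc = lin_ccc(ai, software)
lo, hi = bootstrap_ci(ai, software)
print(f"CCC = {ccc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Unlike Pearson or Spearman correlation, the CCC penalizes systematic offsets between raters through the (mean difference)² term in the denominator, which is why it is preferred for method-agreement studies.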
Conclusions
We conducted this study to assess the level of agreement between the semi-quantitative methods of COVID-19 quantification (Q20, Q24), the gold-standard software, and the AI method. The results were particularly insightful: we found excellent concordance between the AI and the other methods. In terms of implementation time, the software-based approaches [AI and software (R1)] took longer to apply than the semi-quantitative techniques (Q20 and Q24). The AI method can therefore be used as an alternative to both the gold-standard software and the semi-quantitative methods. We recommend that further research employ statistical tests of concordance to determine whether simple and acceptable methods such as AI may serve as alternatives to the gold-standard software for COVID-19 quantification. The results obtained in both COVID-19 identification and lesion classification pave the way for further improvements toward robust COVID-19 implementations that identify not only the disease but also the risk of disease progression.
Acknowledgments
The authors acknowledge the support provided by the Radiology Department of CHU Ibn Sina in Rabat, Morocco. We also express our gratitude to the DOST Learning Resource Center for granting us access to their computer resources for our statistical analysis.
Footnote
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://tro.amegroups.com/article/view/10.21037/tro-24-26/rc
Data Sharing Statement: Available at https://tro.amegroups.com/article/view/10.21037/tro-24-26/dss
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tro.amegroups.com/article/view/10.21037/tro-24-26/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics committee of CHU Ibn Sina in Rabat (CARA/AI/21/2234) and informed consent was taken from all individual participants.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Jarrom D, Elston L, Washington J, et al. Effectiveness of tests to detect the presence of SARS-CoV-2 virus, and antibodies to SARS-CoV-2, to inform COVID-19 diagnosis: a rapid systematic review. BMJ Evid Based Med 2022;27:33-45. [Crossref] [PubMed]
- Khalid MF, Selvam K, Jeffry AJN, et al. Performance of Rapid Antigen Tests for COVID-19 Diagnosis: A Systematic Review and Meta-Analysis. Diagnostics (Basel) 2022;12:110. [Crossref] [PubMed]
- Dessie ZG, Zewotir T. Mortality-related risk factors of COVID-19: a systematic review and meta-analysis of 42 studies and 423,117 patients. BMC Infect Dis 2021;21:855. [Crossref] [PubMed]
- De Smet K, De Smet D, Ryckaert T, et al. Diagnostic Performance of Chest CT for SARS-CoV-2 Infection in Individuals with or without COVID-19 Symptoms. Radiology 2021;298:E30-7. [Crossref] [PubMed]
- Bompard F, Monnier H, Saab I, et al. Pulmonary embolism in patients with COVID-19 pneumonia. Eur Respir J 2020;56:2001365. [Crossref] [PubMed]
- Akl EA, Blažić I, Yaacoub S, et al. Use of Chest Imaging in the Diagnosis and Management of COVID-19: A WHO Rapid Advice Guide. Radiology 2021;298:E63-9. [Crossref] [PubMed]
- Ebrahimzadeh S, Islam N, Dawit H, et al. Thoracic imaging tests for the diagnosis of COVID-19. Cochrane Database Syst Rev 2022;5:CD013639. [Crossref] [PubMed]
- Flament T, Artaud-Macari E, Dumenil C, et al. COVID-ECHO: description échographique des pneumonies COVID-19. Rev Mal Resp Actu 2022;14:66-7.
- Bernheim A, Mei X, Huang M, et al. Chest CT Findings in Coronavirus Disease-19 (COVID-19): Relationship to Duration of Infection. Radiology 2020;295:200463. [Crossref] [PubMed]
- Salehi S, Abedi A, Balakrishnan S, et al. Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients. AJR Am J Roentgenol 2020;215:87-93. [Crossref] [PubMed]
- Kattimani AK, Biradar SG, Akki N. Spectrum of Chest HRCT findings in covid-19 pneumonia. Eur J Molec Clin Med 2022;9:251-8.
- Lodé B, Jalaber C, Orcel T, et al. Imagerie de la pneumonie COVID-19. Journal d'imagerie diagnostique et interventionnelle 2020;3:249-58.
- Lv H, Chen T, Pan Y, et al. Pulmonary vascular enlargement on thoracic CT for diagnosis and differential diagnosis of COVID-19: a systematic review and meta-analysis. Ann Transl Med 2020;8:878. [Crossref] [PubMed]
- García-Lledó A, Del Palacio-Salgado M, Álvarez-Sanz C, et al. Pulmonary embolism during SARS-CoV-2 pandemic: Clinical and radiological features. Rev Clin Esp (Barc) 2022;222:354-8. [Crossref] [PubMed]
- Martínez Chamorro E, Revilla Ostolaza TY, Pérez Núñez M, et al. Pulmonary embolisms in patients with COVID-19: a prevalence study in a tertiary hospital. Radiologia (Engl Ed) 2021;63:13-21. [Crossref] [PubMed]
- Rai DK, Kumar S, Sahay N. Post-COVID-19 pulmonary fibrosis: A case series and review of literature. J Family Med Prim Care 2021;10:2028-31. [Crossref] [PubMed]
- Debray MP, Tarabay H, Males L, et al. Observer agreement and clinical significance of chest CT reporting in patients suspected of COVID-19. Eur Radiol 2021;31:1081-9. [Crossref] [PubMed]
- Saeed GA, Gaba W, Shah A, et al. Correlation between Chest CT Severity Scores and the Clinical Parameters of Adult Patients with COVID-19 Pneumonia. Radiol Res Pract 2021;2021:6697677. [Crossref] [PubMed]
- Sun D, Li X, Guo D, et al. CT Quantitative Analysis and Its Relationship with Clinical Features for Assessing the Severity of Patients with COVID-19. Korean J Radiol 2020;21:859-68. [Crossref] [PubMed]
- Francone M, Iafrate F, Masci GM, et al. Chest CT score in COVID-19 patients: correlation with disease severity and short-term prognosis. Eur Radiol 2020;30:6808-17. [Crossref] [PubMed]
- Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020;369:m1328. [Crossref] [PubMed]
- Li K, Fang Y, Li W, et al. CT image visual quantitative evaluation and clinical classification of coronavirus disease (COVID-19). Eur Radiol 2020;30:4407-16. [Crossref] [PubMed]
- Dane B, Smereka P, Wain R, et al. Hypercoagulability in Patients With Coronavirus Disease (COVID-19): Identification of Arterial and Venous Thromboembolism in the Abdomen, Pelvis, and Lower Extremities. AJR Am J Roentgenol 2021;216:104-5. [Crossref] [PubMed]
- Zhao S, Lin Q, Ran J, et al. Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak. Int J Infect Dis 2020;92:214-7. [Crossref] [PubMed]
- Lin Z, Pearson C, Chinchilli V, et al. Polymorphisms of human SP-A, SP-B, and SP-D genes: association of SP-B Thr131Ile with ARDS. Clin Genet 2000;58:181-91. [Crossref] [PubMed]
- Yang R, Li X, Liu H, et al. Chest CT Severity Score: An Imaging Tool for Assessing Severe COVID-19. Radiol Cardiothorac Imaging 2020;2:e200047. [Crossref] [PubMed]
- Wasilewski PG, Mruk B, Mazur S, et al. COVID-19 severity scoring systems in radiological imaging - a review. Pol J Radiol 2020;85:e361-8. [Crossref] [PubMed]
- Salaffi F, Carotti M, Tardella M, et al. The role of a chest computed tomography severity score in coronavirus disease 2019 pneumonia. Medicine (Baltimore) 2020;99:e22433. [Crossref] [PubMed]
- Pan F, Li L, Liu B, et al. A novel deep learning-based quantification of serial chest computed tomography in Coronavirus Disease 2019 (COVID-19). Sci Rep 2021;11:417. [Crossref] [PubMed]
- Inoue A, Takahashi H, Ibe T, et al. Comparison of semiquantitative chest CT scoring systems to estimate severity in coronavirus disease 2019 (COVID-19) pneumonia. Eur Radiol 2022;32:3513-24. [Crossref] [PubMed]
- Xie X, Zhong Z, Zhao W, et al. Chest CT for Typical Coronavirus Disease 2019 (COVID-19) Pneumonia: Relationship to Negative RT-PCR Testing. Radiology 2020;296:E41-5. [Crossref] [PubMed]
Cite this article as: Benlakhdar S, Rziza M, Oulad Haj Thami R. Integrating artificial intelligence into radiological practice for automated pneumonia assessment. Ther Radiol Oncol 2025;9:11.