
Performance Comparison of Large Language Models on VNHSGE English Dataset: OpenAI ChatGPT, Microsoft Bing Chat, and Google Bard

A comprehensive analysis comparing the performance of ChatGPT, BingChat, and Google Bard on the Vietnamese High School Graduation Examination English dataset, with insights into educational applications and future directions.


1. Introduction

Artificial Intelligence (AI) has revolutionized education by transforming learning and teaching methods. Large language models (LLMs) such as OpenAI ChatGPT, Microsoft Bing Chat (BingChat), and Google Bard represent significant advancements in this domain. This paper evaluates their performance on the Vietnamese High School Graduation Examination (VNHSGE) English dataset, addressing three research questions: (1) What is the performance of ChatGPT, BingChat, and Bard on the VNHSGE English dataset? (2) How do these LLMs compare to Vietnamese students in English proficiency? (3) What potential do LLMs hold for English language teaching and learning in Vietnam?

2. Related Works

2.1 Large Language Models

Recent advances in LLMs, built on Transformer-based architectures such as BERT and GPT, have enabled human-like communication. These models are trained on vast corpora and fine-tuned for specific tasks, demonstrating capabilities in education, content generation, and translation.

2.2 Educational Applications of LLMs

LLMs have been applied in virtual assistants, chatbots, and online learning systems. Studies by Kasneci et al. (2023) and Kung et al. (2023) highlight their potential for personalized learning, though careful evaluation is needed for different educational contexts.

3. Methodology

3.1 Dataset

The VNHSGE English dataset consists of multiple-choice questions covering grammar, vocabulary, reading comprehension, and writing skills, designed for high school level assessment in Vietnam.

3.2 Evaluation Metrics

Performance is measured using accuracy (percentage of correct answers). The models are evaluated on the same set of questions to ensure fair comparison.

3.3 Experimental Setup

Each model (ChatGPT GPT-3.5, BingChat, and Google Bard) was tested on the dataset under controlled conditions. Responses were recorded and scored against the official answer key.
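The scoring protocol above can be sketched as a simple loop. Note that `query_model`, the question IDs, and the canned replies below are hypothetical stand-ins: the paper describes manual interaction with each chatbot, not an automated pipeline.

```python
# Canned replies for illustration only; a real run would submit each
# question to the chatbot and record its reply.
CANNED = {"Q1": "B", "Q2": "A", "Q3": "C"}

def query_model(model_name: str, question_id: str) -> str:
    """Hypothetical stand-in for querying a chatbot with one question."""
    return CANNED[question_id]

def evaluate(model_name, question_ids, answer_key):
    """Score a model's replies against the official answer key (percent)."""
    correct = sum(
        query_model(model_name, qid) == answer_key[qid] for qid in question_ids
    )
    return 100.0 * correct / len(question_ids)

answer_key = {"Q1": "B", "Q2": "D", "Q3": "C"}
print(round(evaluate("ChatGPT", ["Q1", "Q2", "Q3"], answer_key), 1))  # 2 of 3 correct
```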

4. Results

4.1 Overall Performance

BingChat achieved the highest accuracy at 92.4%, followed by Bard at 86.0% and ChatGPT at 79.2%. These results demonstrate significant variation in LLM performance on the same task.

4.2 Comparison with Human Performance

All three LLMs outperformed the average Vietnamese high school student in English proficiency, indicating their potential as supplementary educational tools.

5. Discussion

5.1 Implications for English Education

The superior performance of BingChat and Bard suggests they can serve as effective alternatives to ChatGPT, especially in regions where ChatGPT is not officially available. These models can support self-study, provide instant feedback, and enhance learning outcomes.

5.2 Limitations and Future Work

Limitations include the focus on a single dataset and the lack of qualitative analysis of model reasoning. Future work should explore broader datasets, multilingual capabilities, and integration into classroom settings.

6. Conclusion

This study demonstrates that BingChat, Bard, and ChatGPT outperform Vietnamese students on the VNHSGE English exam, with BingChat leading. These findings support the integration of LLMs into English language education, offering scalable and accessible learning solutions.

7. Original Analysis

This paper provides a timely and practical comparison of three leading LLMs on a standardized English test, addressing a critical gap in the literature regarding LLM performance in non-English educational contexts. The finding that BingChat outperforms both ChatGPT and Bard is particularly noteworthy, as it challenges the assumption that the most popular model (ChatGPT) is necessarily the best. This aligns with broader research showing that model performance can vary significantly across languages and domains (Brown et al., 2020; Devlin et al., 2019). The study's contribution lies in its direct relevance to Vietnamese educators and policymakers, offering actionable insights for integrating LLMs into the curriculum. However, the analysis could be strengthened by examining the types of errors each model makes, as this would provide deeper pedagogical insights. For instance, are errors concentrated in grammar, vocabulary, or reading comprehension? Such granularity would help tailor LLM-based interventions. Furthermore, the study does not address potential biases in the dataset or the models' training data, which could affect generalizability. Despite these limitations, the paper convincingly demonstrates that LLMs can serve as effective tools for English language learning, particularly in resource-constrained settings. Future research should explore longitudinal studies to assess the impact of LLM-assisted learning on student outcomes over time.

8. Technical Details and Mathematical Formulation

The performance of each LLM is evaluated using accuracy, defined as:

$\text{Accuracy} = \frac{\text{Number of Correct Responses}}{\text{Total Number of Questions}} \times 100\%$

For a dataset with $N$ questions, the accuracy $A$ for model $M$ is:

$A_M = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$

where $\hat{y}_i$ is the model's prediction and $y_i$ is the ground truth for question $i$.
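The formula for $A_M$ translates directly into code. The prediction and answer-key lists below are hypothetical five-question examples, not data from the paper:

```python
def accuracy(predictions, ground_truth):
    """Fraction of questions where the prediction matches the ground truth,
    i.e. (1/N) * sum of the indicator 1(y_hat_i == y_i)."""
    assert len(predictions) == len(ground_truth)
    correct = sum(1 for p, y in zip(predictions, ground_truth) if p == y)
    return correct / len(ground_truth)

# Hypothetical 5-question example:
preds = ["B", "A", "C", "D", "B"]
key   = ["B", "A", "C", "A", "B"]
print(f"{accuracy(preds, key) * 100:.1f}%")  # 4/5 correct -> 80.0%
```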

9. Experimental Results and Chart Description

The results are summarized in a bar chart comparing the accuracy of the three models. The x-axis represents the models (ChatGPT, Bard, BingChat), and the y-axis represents accuracy percentage. BingChat's bar reaches 92.4%, Bard's 86%, and ChatGPT's 79.2%. A horizontal line indicates the average human performance (approximately 70%), showing all models exceed this benchmark.

10. Analytical Framework Example

Consider a sample question from the VNHSGE English dataset: "Choose the correct word to complete the sentence: She ___ to school every day." Options: A) go, B) goes, C) going, D) gone. The correct answer is B) goes. Each model's response is recorded and scored. This simple example illustrates the evaluation process used for all questions in the dataset.
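Scoring free-text chatbot replies against a multiple-choice key requires extracting the chosen option letter. A minimal sketch, assuming the model names its choice with a standalone letter A-D (the regex and the sample reply are illustrative, not from the paper):

```python
import re

ANSWER_KEY = {"Q1": "B"}  # hypothetical key for the sample question above

def extract_choice(response):
    """Pull the first standalone option letter (A-D) from a model's reply."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

# Hypothetical model reply to the sample question:
reply = "The correct answer is B) goes, because 'she' takes the third person singular."
choice = extract_choice(reply)
print(choice == ANSWER_KEY["Q1"])  # True: scored as correct
```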

11. Future Applications and Directions

LLMs can be integrated into Vietnamese high school English education through: (1) AI-powered tutoring systems that provide personalized feedback; (2) Automated essay scoring and grammar correction; (3) Conversational agents for speaking practice; (4) Adaptive learning platforms that adjust difficulty based on student performance. Future directions include developing multilingual LLMs tailored to Vietnamese contexts, incorporating cultural nuances, and ensuring equitable access to technology.

12. Critical Summary

Core Insight: This paper is a pragmatic, data-driven comparison that cuts through the hype, showing that 'best' is context-dependent. BingChat's dominance on a Vietnamese exam is a wake-up call for those who assume ChatGPT is universally superior.

Logical Flow: The paper follows a clear, linear path: problem statement (need for LLM evaluation in Vietnam), methodology (standardized test), results (BingChat > Bard > ChatGPT), and implications (LLMs as viable educational tools). The logic is sound but lacks depth in error analysis.

Strengths & Flaws: Strengths include a focused, replicable experimental design and direct relevance to Vietnamese education policy. Flaws include a narrow dataset (a single exam), a lack of qualitative analysis (why does BingChat outperform the others?), and no discussion of model biases or dataset representativeness. The study is a useful snapshot but not a comprehensive evaluation.

Actionable Insights: For Vietnamese educators: Pilot BingChat and Bard in classrooms immediately, focusing on grammar and vocabulary drills. For researchers: Conduct error analysis to identify model-specific weaknesses. For policymakers: Invest in local LLM development tailored to the Vietnamese curriculum. The key takeaway: don't put all your eggs in one LLM basket—diversify and test locally.