Comparative Analysis of Hybrid and Single Classification Algorithms for Student Academic Performance Forecasting
Keywords:
Learning Outcomes, Artificial Intelligence, Educational Technology
Abstract
Educational data mining has become an important area of research for predicting students’ performance and enabling early
intervention at higher education levels. In this work, a comparison of hybrid and single machine learning classifiers is undertaken to predict student academic performance using a real dataset from Al al-Bayt University, Jordan, consisting of 19,700 student records, while a synthetic dataset of 10,000 student records is used for model validation. Ten single
models, i.e., Logistic Regression, Naïve Bayes, Decision Tree, K-Nearest Neighbor, Support Vector Machine, Random Forest,
Gradient Boosting, XGBoost, CatBoost, and AdaBoost, were tested via 10-fold cross-validation. Furthermore, a hybrid soft-voting
ensemble model combining Logistic Regression, Random Forest, and XGBoost was constructed. The best-performing single
model was XGBoost, with an accuracy of 80%, while the combined hybrid model achieved the highest accuracy (92.06%). This
study shows that hybrid ensemble models improve predictive performance and generalization compared to single classifiers,
providing insights for educational institutions to detect at-risk students and facilitate early academic intervention.
1. Introduction
Predicting student academic performance in higher education has received attention for decades, as it can help enhance learning outcomes and institutional effectiveness. Lecturers and educational institutions collect vast amounts of student data through admission systems, learning management systems, performance records, and demographic databases. Analyzing this data with advanced machine learning algorithms can reveal important information for early intervention and strategic decision making.
Educational Data Mining (EDM) uses data mining and machine learning (ML) techniques for improved understanding of students’ behaviors and predicting their academic standings (Romero & Ventura, 2007). Predictive analytics in education helps to identify students at-risk, bring down dropout rates and improve resource allocation and academic advising (Zorić, 2020).
Figure 1 shows that data mining in an educational system forms a loop in which students both generate data and benefit from the extracted knowledge. Mining the data for meaningful information (e.g., the relationship between courses and grades) can provide invaluable knowledge that may raise the quality of the educational system.
Figure 1. Applying data mining to the design of educational systems (Romero et al., 2010)
Teaching a computer to learn from data and make informed decisions is what ML is all about. In data mining, the two major types of ML approaches are supervised learning and unsupervised learning. Unsupervised learning uses unlabeled data, whereas supervised learning trains an algorithm on labeled examples (Shalev-Shwartz & Ben-David, 2014), as shown in Figure 2 below.
Figure 2. Machine Learning Types
ML classification algorithms have been broadly utilized in the prediction of academic performance, using demographic, historical, and academic attributes (Rastrollo-Guerrero, Gómez-Pulido, & Durán-Domínguez, 2020). Single classifiers, such as decision trees, naïve Bayes, and logistic regression, have yielded good performance while being prone to poor generalization and instability on complex datasets (Ababneh, Al-Shanableh, & Alzyoud, 2021). Recent findings show that ensemble and hybrid models provide better prediction accuracy and reliability than single algorithms because they combine multiple learning perspectives (Várkonyi-Kóczy, 2020). To the best of the authors’ knowledge, no comprehensive work has yet examined hybrid models for predicting students’ performance on real-world datasets from Middle Eastern universities.
To fill this gap, this paper performs a comparative study assessing the merits and demerits of hybrid and single ML classification models for predicting academic results. Based on real student data from Al al-Bayt University, the present study evaluates the prediction performance of ten single machine learning algorithms and develops a hybrid ensemble model fusing Extreme Gradient Boosting (XGBoost), Logistic Regression, and Random Forest (RF). The model forecasts students’ educational attainment and identifies key factors that affect learner performance.
Accurate prediction models are increasingly needed to provide timely academic intervention and increase student success in college (Salimeh, Al-Shanableh, & Alzyoud, 2022). Most of the current methods follow a single ML algorithm, which might not be sufficient to handle heterogeneous educational datasets. Hence, the construction of a hybrid approach to both enhance prediction accuracy and robustness is necessary.
The following research questions serve as the guide for this study:
- How can classification and ML algorithms be applied to predict student performance?
- Do hybrid ensemble methods enhance the prediction performance over single classifier approaches?
- Which model best predicts academic status?
The aims of this study are to: (1) assess the performance of individual ML algorithms in predicting student academic performance; (2) construct and validate a hybrid ensemble model using XGBoost, Logistic Regression, and RF; (3) compare the classification accuracy of the single models with that of the hybrid model; (4) determine which predictors contribute most strongly to student learning success.
This study has practical implications for academics through the introduction of a validated predictive model that enables the early identification of at-risk students and enhances academic advising and institutional planning. The results provide insight into the use of hybrid ML models in education analytics and contribute to the wider literature on student performance forecasting.
2. Literature Review
Educational data mining (EDM) is now a significant research area that deals with analyzing educational dataset for enhancing learning processes, predicting performance of students and decision-making purposes (Romero & Ventura, 2010). EDM uses data mining, statistics, and ML to reveal patterns and actionable information in educational systems (Baek & Doleck, 2021). One of the most investigated fields in EDM is predicting student academic performance as this has an important role in minimizing dropout rates and supporting academic planning, which leads to personalized learning interventions (Rastrollo-Guerrero, Gómez-Pulido, & Durán-Domínguez, 2020).
2.1 Prediction of Student Academic Performance Research
Studies on the prediction of student performance generally use demographic, academic, and behavioral characteristics to predict Grade Point Average (GPA), course grades, standardized test scores such as the Graduate Management Admission Test (GMAT) or Scholastic Assessment Test (SAT) (where available), or graduation status (Nedeva & Pehlivanova, 2021). Several ML techniques are applied: Logistic Regression (LR), Naïve Bayes (NB), Decision Trees (DTs), Neural Networks (NNs), and Support Vector Machines (SVMs), as well as boosting algorithms (Al-Shanableh et al., 2024). Existing research indicates that the performance of different learning models depends on source data size, feature types, and model selection.
Shahiri, Husain, & Rashid (2015) reviewed EDM studies from 2002–2015 and observed that DTs and Neural Networks achieved the highest accuracy rates (up to 98%). Another study by Namoun and Alshanqiti (2020) analyzed 586 research papers and found that RF and Hybrid Neural Networks performed better than classical statistical techniques. Albreiki, Zaki, & Alashwal (2021) surveyed 78 studies and found that classification algorithms such as SVMs, RF, and NB are often used in practice, especially for predicting at-risk students (Al-Shanableh et al., 2024).
2.2 Hybrid and ensemble ML models
ML models are further divided into three types: single, ensemble, and hybrid (Várkonyi-Kóczy, 2020). Single models rely on one classifier; ensemble models integrate homogeneous models through bagging, boosting, or stacking to strengthen generalization; and hybrid models combine heterogeneous techniques, typically including optimization or feature selection (Al-Shanableh et al., 2026).
Recent works have shown that hybrid classifiers outperform single classifiers, especially on complex data. Kumar, Singh, & Handa (2017) reported an accuracy of 75.62% by combining Radial Basis Function (RBF) and Multi-Layer Perceptron (MLP) NNs, improving on single models. Similarly, using a 4-algorithm hybrid classifier, Sokkhey and Okazaki (2020) obtained accuracies between 84.9% and 99.7%. However, few published works study hybrid methodologies for higher-education databases in Middle Eastern contexts.
Al-Husban (2021) used several algorithms in Jordan, employing a dataset from Al al-Bayt University to predict student status, and obtained 77% accuracy with XGBoost as the best single model. Extending this work, Mashagba (2022) showed that CatBoost was the best model (92.16%) among boosting-based algorithms for predicting student academic status. Nonetheless, neither of these works used a hybrid ensemble approach.
Hybrid and ensemble learning models are of great interest to researchers. Both follow the same integration principle, with one key difference: ensemble ML integrates homogeneous models, whereas a hybrid classifier integrates heterogeneous models (Al-Shanableh et al., 2026).
An ensemble combines its members to make a group decision for prediction. A hybrid classifier, by contrast, typically also folds data preprocessing and feature filtering into the model-building flow, which is why it is called hybrid; an ensemble, in contrast, imposes no constraints on data processing, and in hybrid ML each component classifier can contribute its own view of the data (Wong & Yeh, 2020).
Although existing literature has demonstrated the superior performance of boosted decision tree classifiers for educational prediction tasks, little effort has been devoted to hybrid ensemble classifiers that combine Logistic Regression with RF and XGBoost. Moreover, only a few works use large real-world datasets from Middle Eastern contexts, and only a limited number of studies compare hybrid and single models on the same dataset. The current paper fills these gaps through a comparative study that includes 19,700 student records from Al al-Bayt University and a newly developed hybrid soft-voting classifier.
Albreiki, Zaki, & Alashwal (2021), in their study “A Systematic Literature Review of Students’ Performance Prediction Using Machine Learning Techniques”, performed a systematic review of EDM literature from 2009 to 2021 (78 studies reviewed). The most common datasets were drawn from university student databases and online learning platforms. Data mining proved highly effective at predicting at-risk students and dropout rates, and thereby supported student achievement.
Albreiki, Zaki, & Alashwal’s (2021) review identified 16 studies on predicting student performance, 12 on recognizing at-risk students, and five on how e-learning affects student academic achievement. The most popular methods were Decision Tree (DT), Logistic Regression (LR), Naïve Bayes (NB), and Support Vector Machine (SVM). Student dropout prediction (21 papers) was the second most common meta-task, with the primary methods being DT, SVM, Classification and Regression Trees (CART), K-Nearest Neighbors (KNN), and NB. Twenty-four studies focused on predicting students’ performance using static and dynamic data (14 using a combination of methods), where the algorithms most commonly used were KNN, NB, SVM, DT, RF, ID3, and ICRM.
Namoun and Alshanqiti (2020) took a similar approach in “Predicting Student Performance Through Data Mining and Learning Analytics Methods – A Systematic Literature Review.” Applying inclusion criteria (a selection procedure also adopted in Bunkar et al., 2020), they began with 586 articles and filtered them down to 62 papers for review. The most utilized learning types were statistical analysis (28 articles), supervised ML (25 articles), and data mining (five articles). The distribution of algorithms/learning models was:
- Statistical models (Correlation and Regression): 32 studies.
- Neural networks: nine studies.
- Tree-based models (DT): nine studies.
- Bayesian-based model: five studies.
- Support vector machines: two studies.
- Instance-based models: one study.
They dichotomized the models into best versus worst. The best performing models were a three-layer feedforward Neural Network (98.8%), RF (98%), Hybrid RF (99%), Naïve Bayes (96.8%), and ANN (95–97%). The least accurate models were linear/Cox regression (50%), logistic regression (76.2%), repeated discriminant analysis (64–73%), mixed-effect logistic models (69%), and bagging (48–55%).
Bunkar et al. (2020) applied clustering, classification, and association rules to e-learner data. For clustering, the most widely used algorithm was K-Means. For association rules, the Apriori algorithm was employed, while the classifiers used were J48, C4.5, REPTree, and Naïve Bayes.
Aldowah, Al-Samarraie, and Fauzy (2019) reviewed 402 articles published between 2000 and 2017. They found 26% used classification methods, followed by clustering (21%), visual data mining (15%), statistics (14%), association rule mining (14%), and regression (10%). Romero and Ventura (2007), on the other hand, reported a different outcome, finding that association rule mining was employed more often than classification (43% vs. 28%) and clustering (15%). Papamitsiou and Economides (2014) likewise found classification methods applied most frequently, ahead of clustering and regression.
The work of Ashraf, Anwer & Khan (2018) also compared data mining methods and classification algorithms in terms of the impact they have on datasets attributes’ influencing student performance predictions (see Figure 3).
Figure 3. Prediction Accuracy with Algorithms based on Attribute (Ashraf, Anwer, & Khan, 2018)
In another vein, Shahiri, Husain, & Rashid (2015), in “A Review on Predicting Student’s Performance using Data Mining Techniques” (The Third Information Systems International Conference, 2015), surveyed studies from 2002 to 2015. They found ten papers that used DT to predict student performance, eight that utilized NN algorithms, four that used Naïve Bayes, and three each that employed K-Nearest Neighbor (KNN) and SVM. Among these algorithms, the best prediction accuracy was achieved by NN (98%), followed by DT (91%); KNN and SVM were tied (83%), and Naïve Bayes ranked lowest with 76% accuracy.
Rastrollo-Guerrero, Gómez-Pulido, & Durán-Domínguez (2020) found SVM to be the most frequently used algorithm and among the best at making predictions. Alongside SVM, DT, NB, and RF are common, well-investigated algorithmic recommendations that have yielded positive results. Even though neural networks are not widely used, they appear excellent at predicting students’ academic achievement (Rastrollo-Guerrero, Gómez-Pulido, & Durán-Domínguez, 2020).
The work of Al-Husban (2021) proceeded along the same lines, applying ML to real data on student outcomes collected at Al al-Bayt University, Jordan: a dataset of 25,017 students. Al-Husban developed a model to predict student status (Graduate/Non-Graduate) as a binary classification task, using multiple classifiers: XGBoost, RF, SVM, KNN, and DT. The data was split into a 75% training set and a 25% test set for all experiments, and models were evaluated using Accuracy, Precision, Recall, and F1-score. The results show the XGBoost classifier achieved the best accuracy (77%) among all models.
Moreover, Al-Husban built models for two scenarios: predicting whether a student will succeed or fail, and predicting students’ level of appreciation. The thesis of Mashagba (2022) is also directly relevant, as it used actual data collected from Al al-Bayt University in Jordan (very close to the data and aims of the present study). Several Gradient Boosting algorithms were applied to this dataset, including AdaBoost, CatBoost, and XGBoost, with 10-fold cross-validation combined with grid search to find optimal split points and parameter values. The experimental results indicated that CatBoost achieved the best prediction accuracy: 92.16% for the final-status model and 86.89% for the appreciation model. Performance was reported using Accuracy, Precision, Recall, and F1-score.
In terms of predicting student performance, we observed that most algorithm types (NN, DT, NB, SVM, and even RF and LR) have been used. Information about students’ demographics and grades, including grades in different courses, high school qualifications, behavioral data, Moodle access logs, personal details (e.g., gender), and academic performance, is employed to predict student success. Table 1 summarizes previous research and the most used approaches for student performance prediction.
| Authors, Year | Dataset / Context | Result / Key Metric(s) | Approach |
| Mashagba, 2022 | Al-al-Bayt Univ., Jordan | XGBoost: 91.61%, LightGBM: 91.95%, CatBoost: 92.16% | Gradient Boosting (XGBoost, LightGBM, CatBoost) |
| Al-Husban, 2021 | Al-al-Bayt Univ., Jordan | XGBoost: 77%, RF: 76.86%, DT: 76%, KNN: 75.71%, SVM: 75.69% | XGBoost, RF, SVM, KNN, DT |
| Kumar, Singh, & Handa, 2017 | UCI dataset (480 samples) | Hybrid model accuracy up to 76.45% | Hybrid classification (RBF+MLP & J48+RF) |
| Okoye et al., 2021 | 2013 ECOA Student Opinion Survey | KNN effective for recommendation / prediction | Text mining + ANCOVA + KNN |
| Durai & Sherly, 2021 | Engineering college, India (2016–2021) | DNN accuracy: 96.3% | Deep Neural Network |
| Kehinde et al., 2021 | UCL ML Repository dataset | ANN accuracy: 92.26% | Artificial Neural Network |
| Ünal, 2020 | Secondary-school dataset (math & Portuguese courses, Portugal) | Accuracy improved with wrapper-based feature selection | DT, RF, Naive Bayes |
| Li & Liu, 2021 | University student data (2007–2019) | Prediction error (RMSE/MAE): 0.785 | Deep Neural Network |
| Dhilipan et al., 2021 | Academic records (grades 10, 12, previous semesters) | Logistic Regression: 97.05% accuracy | KNN, DT, Entropy method, Logistic Regression |
| Sokkhey & Okazaki, 2020 | Cambodian high-school datasets (three sets) | Hybrid RF: 99.7%, Hybrid C5.0: 99.25% | Hybrid ML models (RF, C5.0, PCA, NB, SVM) |
| Zorić, 2020 | Baltazar Univ. dataset (76 students) | ANN prediction high (≈ 93.4%) | Neural Network (Allyuda Neurointelligence) |
| Alamri et al., 2020 | Two datasets (Portuguese and Mathematics) | Binary classification accuracy ~ 93% | SVM and RF |
| Kumar & Minz, 2020 | UG student’s dataset (300 samples) | Hybrid method accuracy: 62.67% | Hybrid classification (ID3 + J48) |
| Abu Zohair, 2019 | Admin-dept master’s program, 50 students | LDA & SVM had best accuracy among tested methods | NB, SVM, LDA, MLP-ANN, KNN |
| Razak et al., 2014 | Semester-6 students (257 samples) | Linear Regression: 96.2%, DT: 82.5% | DT, Linear Regression |
| Ramesh et al., 2013 | 900 higher-secondary students (9 schools) | MLP accuracy: 72.38% (best vs other methods) | J48, Naïve Bayes, MLP |
| Osmanbegović & Suljić, 2012 | First-year students at University of Tuzla (Economics Faculty) | Naïve Bayes outperformed MLP & DTs | J48, NB, MLP |
| Alsubihat & Al-Shanableh, 2023 | University student data (various features) | Heterogeneous-model accuracy: 93.46%; CatBoost: 93.15%, XGBoost: 93%, RF: 92.9% | Combined heterogeneous classification models (Logistic Regression, KNN, DT, SVM, NB, MLP, RF, Gradient-Boosting, XGBoost, CatBoost, LightGBM) |
| Alharbi & Allohibi, 2024 | Student academic dataset | Proposed hybrid classifier (PHC) accuracy: 92.40% | Hybrid classifier combining multiple algorithms (RF, C4.5/CART, SVM, NB, KNN) |
| Guanin-Fajardo et al., 2024 | College student data (various features) | High effectiveness for predicting academic success (MDPI) | ML techniques (various) |
| Airlangga, 2024 | Student demographic & educational data | CNN (deep learning) outperformed MLP, BiLSTM, LSTM-attention in score prediction | Deep Learning: CNN, MLP, BiLSTM, LSTM w/ Attention |
| Junejo et al., 2024 | Online-learning dataset (VLE clickstream + demographics + assessments) — early semester data | Neural-network model significantly outperforms baselines; strong accuracy & early prediction even at 20% course completion | Neural Network for multi-class classification (Distinction, Pass, Fail, Withdrawn) |
| Rohani et al., 2024 | Clickstream data from math students (assignments) | AUC = 0.7884 in assignment success prediction; ranked 2nd in EDM Cup 2023 | Tree-based model (CatBoost) on behavior-based features |
| Abukader, Alzubi & Adegboye, 2025 | Higher-ed educational datasets (various features) | Metaheuristic-optimized LightGBM achieved R2 = 0.941 (strong regression performance) | Metaheuristic hyperparameter optimization + LightGBM + SHAP interpretability |
| Ahmed et al., 2025 | Student performance dataset (supervised ML) | Classification + prediction of student performance (varied accuracy) | Supervised ML (various) |
| Gharkan, Radif & Alsaeedi, 2025 | Higher-ed student data (historical records) | Survey/review of predictive methods; highlight effective techniques for identifying at-risk students and dropouts | Various predictive analytics and ML / deep learning models |
3. Methodology
This study adopts a quantitative research methodology utilizing ML classification approaches to predict student academic performance based on a real dataset collected from Al al-Bayt University (AABU) in Jordan and an artificial dataset generated by ChatGPT. The methodology consists of several primary phases: dataset acquisition, data preprocessing, model implementation, and performance evaluation.
Supervised ML was used to develop and validate models for predicting student academic status. In this work, ten base classifiers and a proposed hybrid model were employed. Model comparison was performed using standard evaluation metrics to determine the most accurate predictor. The methodology flow diagram is depicted in Figure 4.
Figure 4. Methodology Flow Diagram
3.1 Dataset Description
The first dataset is a sample of the AABU student population containing academic and demographic attributes for 19,700 students from many faculties and departments across all academic years. For prediction model development, a clean file remained after data preparation (removal of records with missing values and duplicates). The synthetic dataset was constructed using ChatGPT (GPT-4) with the following pipeline: a structured prompt was designed and given to ChatGPT, asking for realistic student records with the same features as the AABU dataset. Demographic distributions were specified (age 18–25, female-to-male ratio of 55:45), along with academic performance ranges (GPA between 50 and 100, normally distributed with μ=70, σ=12) and high school rates (between 60 and 99). The synthetic data were compared against the real AABU dataset in terms of distributional statistics (mean, standard deviation, correlation matrices) to ensure realistic representation. This dual-dataset approach was used: (a) to assess model generalizability across different data sources, (b) to address privacy concerns by releasing a shareable complementary synthetic dataset, and (c) to investigate the robustness of our models against differences in data-generation processes.
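The distributional check described above can be sketched in a few lines; since the real AABU records are not public, both arrays below are hypothetical stand-ins generated from the stated specifications (μ=70, σ=12 for the synthetic GPA column):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the two GPA columns (not the actual study data):
# synthetic GPAs follow the prompt's specification (mu=70, sigma=12, clipped to 50-100),
# "real" GPAs mimic the AABU summary statistics reported in Table 3.
synthetic_gpa = np.clip(rng.normal(70, 12, 10_000), 50, 100)
real_gpa = np.clip(rng.normal(70.2, 10.7, 19_700), 35, 100)

def distribution_summary(x):
    # The summary statistics used to compare the two datasets.
    return {"mean": float(np.mean(x)), "std": float(np.std(x, ddof=1))}

syn, real = distribution_summary(synthetic_gpa), distribution_summary(real_gpa)

# Flag the synthetic column as "realistic" when mean and std fall within a tolerance
# of the real column (the 5-point tolerance here is an illustrative choice).
realistic = abs(syn["mean"] - real["mean"]) < 5 and abs(syn["std"] - real["std"]) < 5
```

The same comparison extends to correlation matrices by computing `np.corrcoef` over the numeric columns of each dataset and comparing entrywise differences.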
The features of both datasets are described in Table 2. The dependent variable is student academic performance, divided into Excellent, Very Good, Good, Pass, and Fail. The aim is to accurately assign students to the appropriate outcome category based on these features.
| Feature Name | Explanation |
| Student_ID | A unique identifier assigned to each student for tracking records across the database. |
| Specialization | The academic major or program the student is enrolled in (e.g., Data Science, AI, Nursing). |
| Study_status | Indicates whether the student is Active, Suspended, Deferred, Graduated, or Dropped out. |
| High_school_rate | The percentage or GPA the student obtained in high school before university admission. |
| Gender | The biological sex of the student (e.g., Male, Female). |
| Social_status | Describes the social or marital status of the student (e.g., Single, Married). |
| Birth_date | The date of birth of the student, used for calculating age and age-related performance trends. |
| Admission_year | The year the student joined the university; useful for cohort analysis. |
| Graduation_year | The expected or actual year of graduation; helps determine duration of study and delays. |
| GPA | The student’s cumulative Grade Point Average, representing overall academic performance. |
| Rating | A qualitative or quantitative assessment of overall student performance (e.g., Excellent, Good, etc.). |
3.2 Data Preprocessing
Data preprocessing included: (1) Handling of missing data: Records with more than 30% of missing values were removed, and remaining missing values were imputed using mode for categorical attributes and median for numerical ones; (2) Encoding of categorical attributes was applied by ordinal encoder (for ordered categories like rating levels) or one-hot encoding (for nominal categories like specialization and gender); (3) Dropping irrelevant and privacy-sensitive columns such as student name, national ID, phone number, email; (4) Outlier detection by Interquartile Range method, where we flagged any value outside 1.5 times IQR; (5) Normalization via Min-Max scaling transformation of numerical columns to ensure same scale across algorithms; and finally (6) Balancing classes using SMOTE (Synthetic Minority Over-sampling Technique).
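A minimal scikit-learn sketch of steps (1), (2), (4), and (5) above, applied to a hypothetical four-row sample (the column names and values are illustrative, not AABU data; SMOTE from the separate imbalanced-learn package would be applied after this transform):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical rows: High_school_rate, GPA, Gender (np.nan marks missing values).
X = np.array([[85.0, 76.0, "Male"],
              [np.nan, 64.0, "Female"],
              [92.0, np.nan, "Female"],
              [70.0, 88.0, "Male"]], dtype=object)

# Steps (1) and (5): median imputation then Min-Max scaling for numeric columns;
# step (2): one-hot encoding for the nominal Gender column.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])
Xt = preprocess.fit_transform(X)
Xt = np.asarray(Xt.todense()) if hasattr(Xt, "todense") else Xt

def iqr_outliers(col):
    # Step (4): flag values outside 1.5 * IQR beyond the first/third quartiles.
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
```

After the transform, the two numeric columns lie in [0, 1] and Gender expands to two indicator columns, so all features share a common scale before model training.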
3.3 Feature Engineering and Selection
Additional features were constructed from the original ones: annual averages, admission-year categories, and cumulative scores. Feature selection was performed using recursive feature elimination (RFE) and information gain (IG). This approach improved model performance as well as processing time.
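The two selection techniques can be sketched with scikit-learn on a synthetic stand-in matrix (the engineered AABU features are not reproduced here); `mutual_info_classif` serves as the information-gain score, and RFE is wrapped around a logistic regression:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the engineered feature matrix
# (annual averages, cohort codes, cumulative scores, ...).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Information-gain proxy: mutual information between each feature and the label.
ig_scores = mutual_info_classif(X, y, random_state=0)

# Recursive feature elimination down to the 5 strongest features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
selected = np.flatnonzero(rfe.support_)
```

In practice the two rankings can be intersected, keeping features that both methods score highly before model training.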
3.4 Machine Learning Algorithms
The following ten single ML classification algorithms were evaluated: LR, NB, DT, KNN, SVM, RF, Gradient Boosting, XGBoost, Adaptive Boosting (AdaBoost), and Categorical Boosting (CatBoost). All experiments were implemented using Python 3.9 with scikit-learn 1.2.0, XGBoost 1.7.0, and CatBoost 1.1.1; AdaBoost was used via scikit-learn. Key hyperparameters were optimized using GridSearchCV with 5-fold cross-validation: Logistic Regression (C=1.0, solver=’lbfgs’, max_iter=1000); DT (max_depth=10, min_samples_split=5); Random Forest (n_estimators=100, max_depth=15); KNN (n_neighbors=5, metric=’euclidean’); SVM (kernel=’rbf’, C=1.0, gamma=’scale’); XGBoost (n_estimators=100, max_depth=6, learning_rate=0.1); CatBoost (iterations=500, depth=6, learning_rate=0.1); AdaBoost (n_estimators=50, learning_rate=1.0). Data was split into 80% training and 20% testing sets with stratified sampling to maintain class distribution.
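The tuning and splitting procedure can be sketched for one of the classifiers; the grid below reuses the Decision Tree values reported above, while the data is a synthetic stand-in rather than the AABU records:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)

# Stratified 80/20 split, preserving the class distribution.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)

# GridSearchCV with 5-fold CV over the Decision Tree hyperparameters.
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    {"max_depth": [5, 10], "min_samples_split": [2, 5]},
                    cv=5).fit(X_tr, y_tr)

best = grid.best_params_          # chosen hyperparameter combination
test_acc = grid.score(X_te, y_te) # held-out accuracy of the refit best model
```

The same pattern repeats for each of the ten classifiers, substituting the estimator and its parameter grid.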
3.5 Hybrid Model Architecture
The hybrid model consists of three models:
- Random Forest — Tree ensemble
- XGBoost — Gradient boosting with weighted voting
- Logistic Regression — Linear classifier
The choice of these models was based on individual model performance and classifier diversity. Both a Soft Voting Ensemble (averaging probability predictions) and a Hard Voting Ensemble (majority vote) were tried.
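A minimal sketch of the two voting schemes with scikit-learn's `VotingClassifier`; to keep the example self-contained, `GradientBoostingClassifier` stands in for XGBoost, and the data is a synthetic stand-in for the student records:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed student features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=7)

members = [("lr", LogisticRegression(max_iter=1000)),      # linear classifier
           ("rf", RandomForestClassifier(n_estimators=50,  # tree ensemble
                                         random_state=7)),
           ("gb", GradientBoostingClassifier(random_state=7))]  # XGBoost stand-in

# Soft voting averages predict_proba outputs; hard voting takes a majority vote.
soft = VotingClassifier(members, voting="soft")
hard = VotingClassifier(members, voting="hard")

soft_acc = cross_val_score(soft, X, y, cv=5).mean()
hard_acc = cross_val_score(hard, X, y, cv=5).mean()
```

Soft voting requires every member to expose `predict_proba`, which is why probabilistic classifiers were chosen as ensemble members.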
3.6 Evaluation Metrics
Performance was evaluated based on Accuracy, Precision, Recall and F1-score. The performance metrics used in this study are defined mathematically as follows: Accuracy = (TP + TN) / (TP + TN + FP + FN), measuring the proportion of correctly classified instances; Precision = TP / (TP + FP), measuring the proportion of true positive predictions among all positive predictions; Recall (Sensitivity) = TP / (TP + FN), measuring the proportion of actual positive instances correctly identified; F1-Score = 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean of precision and recall providing a balanced performance measure. Where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative. The equations are also shown in Figure 5. Moreover, 10-fold cross-validation was used to improve the reliability of evaluation.
Figure 5. Evaluation Metrics Formulas. Where: TP – True Positive: cases correctly predicted as positive; TN – True Negative: cases correctly predicted as negative; FP – False Positive: cases incorrectly predicted as positive (Type I error); FN – False Negative: cases incorrectly predicted as negative (Type II error).
3.7 Ethical Considerations and Data Governance
The data were collected officially through university administrative staff with the consent of the Dean of Student Affairs. To ensure ethical treatment of student-related information, we took the following measures: (1) we removed all identifiable student data (names, national identification numbers, phone numbers, email addresses, and home addresses) from the dataset before any analysis was performed; (2) anonymous unique identifiers were generated through an irreversible cryptographic hash function to replace real Student IDs, preventing re-identification of individual students; (3) the files were kept on secure, encrypted university servers with password protection restricting access exclusively to members of the research team; (4) prior to viewing the data, all research team members signed confidentiality and non-disclosure agreements. Additionally, data handling procedures were designed in line with FERPA principles and Jordanian laws governing the use and transfer of educational records and personally identifiable information. For the synthetic dataset, no ethical approval was necessary because the data are synthetic and contain no human subjects.
4. Results and Discussion
In this section, we compare the performance of the ten single ML classification algorithms with that of our proposed hybrid ensemble model. Model performance was measured using accuracy, precision, recall, and F1-score. The findings show widely different classification performances across the classifiers, with boosting-based algorithms significantly outperforming typical ML models.
Table 3 presents descriptive statistics for the datasets used, and Figure 6 shows the correlation heatmap for all variables.
| Variable | Artificial Dataset (Mean) | Artificial Dataset (Std) | AABU Dataset (Mean) | AABU Dataset (Std) |
| Age | 21.52 | 2.08 | 19.4352 | 4.2276 |
| High_school_rate | 75.6629 | 23.8367 | 72.8277 | 7.3604 |
| Year1Avg | 71.7573 | 10.4899 | 71.7667 | 10.5785 |
| Year2Avg | 46.4084 | 29.9172 | 46.3284 | 34.2683 |
| GPA | 59.0828 | 15.8337 | 70.229 | 10.6519 |
Figure 6. Correlation heatmap
4.1 Performance of Single Classification Models
In the first step, each individual ML algorithm was evaluated using 10-fold cross-validation. The accuracy of all single models is shown in Tables 4 and 5. Overall, XGBoost performed best, achieving the highest accuracy on both the AABU and synthetic datasets (80% and 79%, respectively). Figure 7 shows the accuracy comparison.
Table 4. Performance of single classifiers on the AABU dataset
| Algorithm | Accuracy | Precision | Recall | F1-Score |
| Logistic Regression | 0.7300 | 0.6200 | 0.5800 | 0.5900 |
| Naïve Bayes | 0.6900 | 0.5600 | 0.5400 | 0.5400 |
| Decision Tree | 0.7600 | 0.6500 | 0.6200 | 0.6300 |
| K-Nearest Neighbor | 0.7100 | 0.6100 | 0.5700 | 0.5800 |
| Support Vector Machine | 0.7400 | 0.6300 | 0.5900 | 0.6000 |
| Random Forest | 0.7900 | 0.6900 | 0.6500 | 0.6600 |
| Gradient Boosting | 0.7800 | 0.6800 | 0.6400 | 0.6500 |
| XGBoost | 0.8000 | 0.7100 | 0.6700 | 0.6800 |
| CatBoost | 0.7900 | 0.7000 | 0.6600 | 0.6700 |
| AdaBoost | 0.7500 | 0.6400 | 0.6100 | 0.6200 |
Table 5. Performance of single classifiers on the synthetic dataset
| Algorithm | Accuracy | Precision | Recall | F1-Score |
| Logistic Regression | 0.7200 | 0.6850 | 0.6520 | 0.6680 |
| Naïve Bayes | 0.6800 | 0.6350 | 0.6180 | 0.6260 |
| Decision Tree | 0.7400 | 0.7050 | 0.6850 | 0.6950 |
| K-Nearest Neighbor | 0.6900 | 0.6520 | 0.6380 | 0.6450 |
| Support Vector Machine | 0.7100 | 0.6780 | 0.6620 | 0.6700 |
| Random Forest | 0.7800 | 0.7480 | 0.7250 | 0.7360 |
| Gradient Boosting | 0.7600 | 0.7280 | 0.7120 | 0.7200 |
| XGBoost | 0.7900 | 0.7620 | 0.7380 | 0.7500 |
| CatBoost | 0.7700 | 0.7380 | 0.7180 | 0.7280 |
| AdaBoost | 0.7300 | 0.6950 | 0.6780 | 0.6860 |
Figure 7. Accuracy comparison
The experimental results show that, among the single classifiers, XGBoost obtained the highest accuracy (80%), followed very closely by Random Forest. Classic algorithms such as SVM, Naïve Bayes, and KNN were less accurate, as they struggle with high-dimensional data and complex, nonlinear relationships.
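The 10-fold cross-validation protocol used to score the single classifiers can be sketched as follows. This is an illustrative example only: synthetic data generated by scikit-learn stands in for the real student records, and only a subset of the ten classifiers is shown for brevity.

```python
# Sketch of the 10-fold cross-validation protocol on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student dataset (features are placeholders).
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Mean accuracy over 10 folds for each model.
scores = {name: cross_val_score(m, X, y, cv=10, scoring="accuracy").mean()
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```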
4.2 Hybrid Ensemble Model Performance
The hybrid ensemble model was created by combining Random Forest, Logistic Regression, and XGBoost, using both soft voting and hard voting. The combination (RF + LR + XGBoost) demonstrated the highest predictive accuracy, 92.06%, with AUC above 90% in all cases. The hybrid ensemble model outperformed all single classifiers. Its comparative performance is shown in Table 6.
Table 6. Performance of the hybrid ensemble model
| Model | Accuracy | Precision | Recall |
| Soft Voting | 92.06% | 82.65% | 83.28% |
| Hard Voting | 92.06% | 82.65% | 83.28% |
The performance gains and reduced misclassification rate observed when multiple strong learners are combined indicate that the benefit of hybrid models lies in their robustness (Al-Shanableh et al., 2026). Soft voting, in particular, lets hybrid models use probability-based decision fusion to enhance generalization.
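The soft-voting construction described above can be sketched with scikit-learn's `VotingClassifier`. In this hedged example, `GradientBoostingClassifier` stands in for XGBoost so the sketch needs only scikit-learn; with the `xgboost` package installed, `XGBClassifier` would take its place. The data are synthetic placeholders, not the study's records.

```python
# Sketch of the soft-voting ensemble (LR + RF + boosting stand-in for XGBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
    ],
    voting="soft",  # average predicted class probabilities across members
)
ensemble.fit(X_tr, y_tr)
print(f"hold-out accuracy: {ensemble.score(X_te, y_te):.3f}")
```

Setting `voting="hard"` instead would aggregate the members' class labels by majority rule rather than averaging their probabilities.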
4.3 Discussion
Overall, the findings of this study suggest that the combined model predicts students' future performance considerably more accurately than any single classifier reported in previous research. These results demonstrate the potential of real-life institutional data and offer a measure of quality assurance for automatic academic-risk detection systems. Our research shows that a hybrid ML model offers better accuracy and stability than single classification models in predicting student academic performance.
The best overall accuracy (92.06%) was achieved by our hybrid model, which combined XGBoost, Logistic Regression, and Random Forest, compared with 80% for XGBoost, the best single classifier. Furthermore, the hybrid model showed improved robustness across validation folds and fewer misclassifications, particularly in low academic achievement categories. These conclusions are consistent with previous studies on the effectiveness of ensemble and hybrid learning methods on complex, heterogeneous educational data (Kumar, Singh, & Handa, 2017; Sokkhey & Okazaki, 2020). The hybrid model integrates multiple strong learners into a single system, exploiting their complementary decision boundaries while dampening their individual weaknesses, which promotes generalization and eases algorithmic bias. Soft voting achieved further gains through probability-based aggregation, combining the strong learners more effectively than simple majority-rule (hard) voting.
Feature importance analysis identified the factors most influential for academic performance. Cumulative GPA, high school average, academic year, and course load were the strongest predictors of students' college GPA. These findings are consistent with similar research that discusses the impact of students' academic history and background variables on their performance. Accurate prediction of students' academic behavior helps institutions develop reliable plans: by identifying at-risk students in time, they can offer more targeted academic advising and allocate resources better, including personalized learning or interventions that reduce dropout and increase the effectiveness of the whole institution. Predictive tools embedded in university management systems turn data into information that helps universities plan for their future intelligently.
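A feature-importance analysis of this kind can be sketched with a tree-based model's built-in importances. This is an illustrative example under stated assumptions: the feature names below are placeholders echoing the predictors discussed above, and the data are synthetic, so the resulting ranking does not reproduce the study's.

```python
# Sketch of feature-importance ranking with a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names echoing the predictors discussed in the text.
feature_names = ["cumulative_gpa", "high_school_avg", "academic_year",
                 "course_load", "age"]

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

# Impurity-based importances, normalized to sum to 1, sorted descending.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda kv: -kv[1])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```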
This research demonstrates the strength of ensemble learning in educational data analytics. Hybrid systems combine the abilities of several ML models, such as robustness and strong modelling in high-dimensional space, with advantages that a single classifier cannot deliver, such as handling nonlinear data and heavily imbalanced categories. Regarding feature importance, the algorithms agreed that academic history variables (for instance, cumulative GPA, high school average, program concentration, and academic year) are the key predictors, confirming similar findings from educational data mining studies.
Although the findings are encouraging, several constraints should be considered. First, the present study is based on a single university (AABU); the findings may therefore not generalize easily to other institutional contexts, student populations, academic systems, or grading styles. Second, even though the synthetic dataset is suitable for validation purposes, it may not replicate all of the complexity and subtlety present in actual student data, which can affect the reliability of cross-dataset comparisons. Third, the study is limited to academic and demographic factors; behavioral measures, including class attendance, learning management system (LMS) use, library utilization, and extracurricular involvement, were unavailable but could increase predictive ability. Fourth, the historical patterns used in the analysis are assumed to hold in the future, which may not be true during times of great institutional change or exogenous shocks (e.g., large-scale transitions to online learning during a pandemic). Fifth, although the hybrid model outperformed all models considered herein, its computational complexity may be too high for real-time use in resource-limited institutional settings. Future studies should address these limitations by: (1) externally validating the model across institutions with different cultural and educational settings; (2) incorporating realistic behavioral and engagement features; (3) investigating model interpretability to generate meaningful feedback for educators; and (4) creating lightweight model variants ready to be deployed in education information systems.
In sum, these results confirm the value for educational communities and AI researchers alike of incorporating hybrid classification models in student performance prediction.
5. Conclusion
This study compared hybrid and single classification models, trained on the same student data, for predicting subsequent-year performance. The experiments indicate that the hybrid model is comprehensively superior to the single classification models in accuracy, precision, recall, and F1-score. The hybrid model has the best predictive power, which shows the value of combining algorithms that excel in different areas. One reason hybrid models are preferred is that single classifiers, while simple and computationally efficient, are vulnerable to underfitting or overfitting; the hybrid, in contrast, tends to be much more dependable.
These results suggest that hybrid models could be effectively implemented to support course selection, placement decisions, early academic interventions, and personalized instruction. However, practitioners should be aware of the computational requirements of ensemble methods and the need for regular model retraining as student populations evolve. Additionally, predictive models should be used as decision-support tools rather than as deterministic classifiers, with human oversight remaining essential in educational decision-making. In conclusion, the hybrid classification models provide a robust approach for predicting students’ academic performance. Future studies with diverse datasets from multiple institutions and additional behavioral features will further validate the effectiveness of these approaches and contribute to evidence-based decision-making in education.
License
Copyright (c) 2026 The Author(s)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All articles published in Artificial Intelligence Advances in Education are open access and distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
This license permits non-commercial use, sharing, distribution, and reproduction in any medium or format, provided that proper credit is given to the original author(s) and the source, a link to the license is provided, and any changes to the material are clearly indicated.
Adaptations or derivatives of the material are not permitted under this license.
Images or other third-party material included in an article are covered by the article’s Creative Commons license unless otherwise indicated in a credit line. If any material is not included in the license and your intended use exceeds permitted statutory regulation, you must obtain permission directly from the copyright holder.