Student Data Visualization

STUDENT DATA POSTER

Overview

Predicting Student Retention with AI-Powered Insights

This project leverages Artificial Neural Networks (ANN) and interactive data visualizations to analyze student data and predict first-year persistence.

Importing Libraries

I’ve used Plotly Express for quick interactive visualizations and Dash to build web-based dashboards with interactive components. Plotly Subplots and Graph Objects allow for customizable multi-plot visualizations, while TensorFlow Keras is used to build deep learning models with fully connected layers.

Data Exploration

Figure: importing libraries for student data analysis

At first glance there appeared to be no missing values, but on closer inspection I found that the dataset encodes missing values as ‘?’. I used the pandas replace function to swap each ‘?’ for np.nan.

Key indicators like First Term GPA (17 missing values), Second Term GPA (160 missing values), First Language (111 missing values), and High School Average Mark (743 missing values) suggest potential gaps in academic records. Additionally, Math Score (462 missing values) and English Grade (45 missing values) indicate missing performance metrics.

Figure: replacing question marks with np.nan

Other categorical variables like Prev Education (4 missing values) and Age Group (4 missing values) have relatively few gaps. Several features, including Funding, School, Fast Track, Coop, Residency, Gender, and First-Year Persistence, have no missing values.
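
The replacement step described above can be sketched as follows. This is a minimal illustration on a made-up toy dataframe, not the real student dataset; only the ‘?’ placeholder and the column names come from the write-up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the student dataset; the values are invented.
df = pd.DataFrame({
    "First Term Gpa": ["3.2", "?", "2.5", "4.0"],
    "Math Score": ["33", "41", "?", "?"],
    "Gender": ["1", "2", "1", "2"],
})

# Before the replacement nothing looks missing, because '?' is an ordinary string.
missing_before = int(df.isna().sum().sum())

# Swap the '?' placeholder for a real missing value.
df = df.replace("?", np.nan)

# Missing values per column, largest first.
missing_counts = df.isna().sum().sort_values(ascending=False)
print(missing_before)
print(missing_counts)
```

Counting with isna() only after the replacement is what exposes the gaps that were hidden behind the placeholder.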

Data Preparation

The preprocess_data function handles all the preprocessing tasks.

We create a list of numeric columns: ‘First Term Gpa’, ‘Second Term Gpa’, ‘First Language’, ‘Math Score’.

We create a list of categorical columns: ‘First Term Gpa’, ‘Second Term Gpa’, ‘Math Score’.

The function converts the specified columns to numeric or categorical types, then imputes the numerical features with the mean and the categorical features with the mode.

The convert_to_numeric function converts the columns to numeric.

The impute_with_mean function imputes the numerical columns with the mean.

The imput_with_mode function imputes the categorical columns with the mode.

The drop_columns function drops the specified columns.

Using these helpers, the preprocess_data function prepares our dataframe for further analysis.
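
A rough sketch of how these helpers could fit together is shown below. The function names come from the description above, but their signatures, the order of the steps, and the toy data are assumptions made for illustration; the original implementation may differ.

```python
import numpy as np
import pandas as pd


def convert_to_numeric(df, columns):
    """Cast the given columns to numbers; unparseable entries become NaN."""
    df = df.copy()
    for col in columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


def impute_with_mean(df, columns):
    """Fill missing values in numeric columns with the column mean."""
    df = df.copy()
    for col in columns:
        df[col] = df[col].fillna(df[col].mean())
    return df


def imput_with_mode(df, columns):
    """Fill missing values in categorical columns with the most frequent value."""
    df = df.copy()
    for col in columns:
        # Assumes each column has at least one non-missing value.
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


def drop_columns(df, columns):
    """Remove the listed columns."""
    return df.drop(columns=columns)


def preprocess_data(df, numeric_cols, categorical_cols, cols_to_drop):
    """Assumed ordering: drop, convert, then impute."""
    df = drop_columns(df, cols_to_drop)
    df = convert_to_numeric(df, numeric_cols)
    df = impute_with_mean(df, numeric_cols)
    df = imput_with_mode(df, categorical_cols)
    return df


# Toy data (invented) with '?' already replaced by NaN.
raw = pd.DataFrame({
    "First Term Gpa": ["3.0", "?", "4.0"],
    "First Language": ["1", "1", "?"],
    "High School Average Mark": ["?", "?", "80"],
}).replace("?", np.nan)

clean = preprocess_data(
    raw,
    numeric_cols=["First Term Gpa"],
    categorical_cols=["First Language"],
    cols_to_drop=["High School Average Mark"],
)
print(clean)
```

Each helper returns a copy rather than modifying its input, so the raw dataframe stays available for comparison.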

I noticed that all the data in our dataframe has already been label-encoded, so I created a mapping and a second dataframe, df2, which is the same as df but with the readable labels.
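
The mapping step might look something like this. The numeric codes and the label_maps dictionary are hypothetical; only the names df and df2 and the label values (Domestic, International, Yes, No) appear in this write-up.

```python
import pandas as pd

# Toy encoded frame; the real dataset has many more columns.
df = pd.DataFrame({"Residency": [1, 2, 1], "Coop": [1, 2, 2]})

# Hypothetical code-to-label mappings; the dataset's actual codes may differ.
label_maps = {
    "Residency": {1: "Domestic", 2: "International"},
    "Coop": {1: "Yes", 2: "No"},
}

# df2 mirrors df but carries readable labels for plotting.
df2 = df.copy()
for col, mapping in label_maps.items():
    df2[col] = df2[col].map(mapping)

print(df2)
```

Keeping df numeric for modelling and df2 labelled for charts avoids converting back and forth later.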

Visualization

Correlation Analysis

From this correlation heatmap I noticed:

  • First Year Persistence is positively correlated with First Term GPA
  • First Term GPA and Second Term GPA are positively correlated
  • First Language is positively correlated with Funding
  • Residency is positively correlated with First Language and has a stronger correlation with Funding
  • Residency is negatively correlated with Fast Track

Several smaller correlations are also visible in the heatmap.
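
The matrix behind a heatmap like this can be computed with pandas. The sketch below uses synthetic data in which second-term GPA is deliberately built to follow first-term GPA, so the numbers are illustrative only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns: second-term GPA tracks first-term GPA plus noise.
first_gpa = rng.uniform(0, 4.5, size=200)
second_gpa = np.clip(first_gpa + rng.normal(0, 0.5, size=200), 0, 4.5)
math_score = rng.integers(10, 50, size=200)

df = pd.DataFrame({
    "First Term Gpa": first_gpa,
    "Second Term Gpa": second_gpa,
    "Math Score": math_score,
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr.round(2))
```

A matrix like corr could then be handed to a heatmap function such as Plotly Express imshow to get the colored grid.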

Figure: correlation heatmap

Boxplot of High School Average Mark and Math Score

This box plot built using Plotly Express visualizes the High School Average Mark and Math Score.

The presence of outliers in High School Average Mark shows that some students had significantly lower grades.

The High School Average Mark has a higher median than Math Score: most of its values range between 60 and 90, while most Math Score values fall between 20 and 50 with a narrower spread.

Note: The High School Average Mark column was dropped due to its high number of null values.
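
For readers who want to check outliers numerically, here is a small sketch of the conventional 1.5 × IQR rule that box plots commonly use to flag points. The marks are invented, and the 1.5 multiplier is the usual default rather than something stated in this project.

```python
import pandas as pd

# Invented marks with one unusually low value.
marks = pd.Series(
    [35, 62, 70, 74, 78, 80, 82, 85, 88, 90],
    name="High School Average Mark",
)

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a point is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = marks[(marks < lower_fence) | (marks > upper_fence)]
print(outliers.tolist())
```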

Figure: High School Average Mark and Math Score box plot

GPA Trends Filtered with COOP

This scatter plot, built using Plotly Express, visualizes GPA trends by co-op participation, with participating students (Yes) in blue and non-participating students (No) in red.

We have applied blue for males and red for females.

Hovering over a data point shows the student's Funding and Prev Education.

The Gender feature has a major class imbalance: there are 1,111 records for males and only 325 for females. We can observe this in the scatter plot. We can also observe that female students tend to perform better, with higher GPA scores.

Influence of Previous Education with COOP

This histogram, built using Plotly, visualizes Previous Education by co-op participation, with participating students (Yes) in blue and non-participating students (No) in red.

Prev Education has 863 records corresponding to High School and 482 records corresponding to Post Secondary education. From this visualization we can observe that, overall, more students do not participate in co-op, and that students whose previous education was Post Secondary participate in co-op less often. This may be due to alternate career paths, internships, and similar factors.

Figure: Prev Education with COOP

Influence of English Grade on COOP

This histogram, built using Plotly, visualizes the influence of English Grade on co-op participation, with participating students (Yes) in blue and non-participating students (No) in red.

As we can see, more students participate in co-op at English Grade Level 160 and Level 161, while Level 170 has the highest non-co-op count. This suggests that co-op participation is more common for those with higher English grades.

Influence of Funding with COOP

This histogram, built using Plotly, visualizes the influence of Funding on co-op participation, with participating students (Yes) in blue and non-participating students (No) in red.

The GPOG_FT and Intl Regular funding categories have the highest student counts, with non-co-op participants outnumbering co-op participants. Other funding categories, such as Apprentice_PS and Second Career Program, show minimal counts in both the co-op and non-co-op groups.

Influence of First Year Persistence on COOP

This histogram, built using Plotly, visualizes the distribution of First Year Persistence by co-op participation, with participating students (Yes) in blue and non-participating students (No) in red.

Students with First Year Persistence (1) make up a larger proportion than those without First Year Persistence (0). In both groups, the count of students not participating in co-op is much higher.

First Language Distribution by Residency

This histogram built using Plotly visualizes the distribution of First Language by Residency among Domestic (blue) and International (red) students.

When First Language is English, there are 711 Domestic and 116 International students.

When First Language is French, there are 4 Domestic and 0 International students.

When First Language is Other, there are 137 Domestic and 465 International students.
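
The counts above can be arranged as a cross-tabulation, which also makes row-wise shares easy to compute. The sketch rebuilds one row per student purely from the six counts quoted in this section; the column names follow the write-up, and everything else is illustrative.

```python
import pandas as pd

# Rebuild row-level data from the counts quoted above.
rows = []
for language, domestic, international in [
    ("English", 711, 116),
    ("French", 4, 0),
    ("Other", 137, 465),
]:
    rows += [(language, "Domestic")] * domestic
    rows += [(language, "International")] * international

df2 = pd.DataFrame(rows, columns=["First Language", "Residency"])

# Counts of students per language and residency.
table = pd.crosstab(df2["First Language"], df2["Residency"])

# Share of each language group that is domestic or international.
row_share = table.div(table.sum(axis=1), axis=0)

print(table)
print(row_share.round(3))
```

Looking at row shares rather than raw counts makes the contrast between the English and Other groups easier to compare.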

Age Distribution by Residency

This histogram, built using Plotly, visualizes the age distribution by Residency among Domestic (blue) and International (red) students.

We can observe that Domestic students mostly fall in the 0-25 age range, with few students in the higher age groups, whereas International students mostly fall in the 20-25 age range, with few students between 0 and 20.

This tells us that most students who come to study from other countries are in the 21-25 age group.
Also, no International students pursue their education after the age of 40, whereas 37 Domestic students fall in the 41-50 age group and only 9 students are older than 51.

Gender Distribution by Residency

This histogram, built using Plotly, visualizes the gender distribution by Residency among Domestic (blue) and International (red) students.

For both International and Domestic residency, the count of males is higher. A very small number of students identify as neither male nor female, and all of them are Domestic students.

English Grade Distribution by Residency

This histogram, built using Plotly, visualizes the distribution of English Grades by Residency among Domestic (blue) and International (red) students.

The majority of Domestic students have an English Grade of Level 170, with few students at Levels 160, 161, and 171. The majority of International students have an English Grade of Level 161, 171, or 170, with few students at Levels 160 and 141 and very few at lower levels.

This tells us that the majority of Domestic students have a higher English Grade than International students.

GPA Comparison: First Term GPA vs Second Term GPA

This scatter plot built using Plotly visualizes the relationship between First Term GPA and Second Term GPA, with Domestic (blue) and International (yellow) students differentiated by color.

There is a strong positive correlation between First Term GPA and Second Term GPA, meaning students who perform well in the first term generally continue to do so in the second term.

International students (yellow points) are more evenly distributed across GPA levels, with a noticeable concentration at higher GPA values.

Math Score Distribution by Residency

Figure: Math Score with Residency

This histogram built using Plotly visualizes the distribution of Math Scores by Residency among Domestic (blue) and International (red) students.

Most students score between 15 and 50.

There is a significant spike at 33 for both groups, especially for International students.

At higher scores (above 40), International students appear more frequently, possibly indicating stronger math proficiency.

Count of Age Groups by Gender

This histogram built using Seaborn visualizes the count of Age Groups by Gender, with Male (blue), Female (orange), and Neutral (green) students differentiated by color.

We can clearly see a predominance of males compared to females and neutral students within every Age Group. The younger age groups show a more balanced gender distribution, and as the Age Group progresses from middle age to older age there is a distinct decline in student count.

Figure: count of Age Groups by Gender

GPA Trend by Age Group

This scatter plot, built using Matplotlib, visualizes the GPA trend by Age Group, with First Term GPA (blue) and Second Term GPA (orange) differentiated by color.

Students aged 21-35 have a higher GPA than those aged 18-20. GPA peaks in the 31-35 age group.

Average GPA Comparison by Gender

This histogram built using Matplotlib visualizes the Average GPA comparison by Gender, with First Term GPA (blue), Second Term GPA (orange) differentiated by color.

For male and female students there is not much change in GPA between terms, although a slight decrease can be seen in Second Term GPA. For students in the Neutral category, there is a large improvement in Second Term GPA.

Figure: average GPA comparison by Gender

Proportion of COOP vs Non COOP Students

Figure: proportion of COOP vs non-COOP students

This pie chart, built using Matplotlib, visualizes the proportion of students who participated in co-op versus those who did not.

From the pie chart we observe that 30.4% of students participated in co-op and the remaining 69.6% did not.

Average Math Score by Age Group

This bar plot built using Seaborn visualizes the Average Math Score by Age Group.

In this barplot we notice that:

  • The average math scores seem to increase as age increases, peaking in the 41 to 50 age group.
  • Younger age groups (0-18, 19-20) have lower average math scores compared to older groups.
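
The averages behind a bar plot like this amount to a group-by mean. The sketch below uses invented scores and assumed age-group labels; Seaborn's barplot performs the same mean aggregation by default when given the raw rows.

```python
import pandas as pd

# Invented scores; the age-group labels are assumed for illustration.
df2 = pd.DataFrame({
    "Age Group": ["19 to 20", "19 to 20", "21 to 25", "21 to 25", "41 to 50"],
    "Math Score": [30, 34, 36, 40, 45],
})

# Mean math score within each age group.
avg_math = df2.groupby("Age Group")["Math Score"].mean()
print(avg_math)
```

Checking the group sizes alongside the means is worthwhile, since an average for a small group such as 41 to 50 can rest on very few students.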

 

Figure: average Math Score by Age Group bar plot

Plotly Dashboard

I developed an interactive dashboard using Plotly Express and Dash. This dashboard provides dynamic visualizations, allowing users to explore insights into student demographics, academic performance, and co-op participation.

The dashboard carousel shows the First Language Distribution, Age Group Distribution, COOP Status by Gender, and Funding for Domestic Students. I added filters to each visualization so users can explore the data dynamically.

Implementation of Neural Networks

This project involves building a binary classification model using Keras and TensorFlow. The dataset was preprocessed by handling missing values and converting features into NumPy arrays.

Model Architecture

  • Input Layer: 64 neurons, ReLU activation
  • Hidden Layer: 32 neurons, ReLU activation
  • Output Layer: 1 neuron, Sigmoid activation

Training and Performance

  • Loss Function: Binary Crossentropy
  • Optimizer: Adam
  • Training Accuracy: 85%
  • Test Accuracy: 87%
  • Best Validation Accuracy: 89.1%

The model was trained for 30 epochs and saved for deployment. Future improvements could include hyperparameter tuning and regularization for better generalization. 🚀
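
The architecture and training setup listed above could be expressed in Keras roughly as follows. The layer sizes, activations, loss, optimizer, and epoch count come from this section; the number of input features, the batch size, the validation split, and the random training data are assumptions, so the accuracy this sketch reports is unrelated to the figures quoted above.

```python
import numpy as np
from tensorflow import keras

n_features = 10  # assumption: the real feature count is not stated here

# Random stand-in data; the real inputs are the preprocessed student features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, n_features)).astype("float32")
y = (X[:, 0] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),    # first dense layer
    keras.layers.Dense(32, activation="relu"),    # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # persistence probability
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# batch_size and validation_split are illustrative choices.
history = model.fit(
    X, y,
    epochs=30,
    batch_size=32,
    validation_split=0.2,
    verbose=0,
)

best_val_acc = max(history.history["val_accuracy"])
print(f"Best validation accuracy on toy data: {best_val_acc:.3f}")
```

The single sigmoid output pairs naturally with binary cross-entropy, since the model emits one probability per student.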

Conclusion

This project brought data to life with Plotly Dash and Plotly Express, turning raw numbers into interactive, easy-to-understand visualizations. From bar charts to scatter plots, the dashboards made it simple to spot trends and patterns, making data exploration both insightful and engaging.

On the machine learning side, we built a binary classification model using Keras and TensorFlow, and after training it with carefully structured layers, it hit an impressive 87% accuracy on test data. With ReLU activation, Adam optimization, and batch processing, the model learned efficiently, making solid predictions.
