Educational Statistics Analysis Tool

Python utility to parse data from the Department of Education and National Science Foundation.

Description

This project was created to answer the question "If I have $5, should I give it to the art department or to the history department?"

It does so by gathering historical enrollment data from the Department of Education's College Scorecard, indexable by year, institution, and subject, and comparing it against the National Science Foundation's funding data, which is indexable by year, institution, subject, and funding source.
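The comparison can be pictured as a join on the shared (year, institution, subject) key. The sketch below is only an illustration under stated assumptions, not the code in run.py: the function name `funding_per_student`, the "funding per enrolled student" metric, and every number in it are made up for the example, and funding is assumed to be already summed across funding sources.

```python
# Hypothetical toy tables keyed by (year, institution, subject).
# All values are invented for illustration; they are not real DoE or NSF data.
enrollment = {
    (2025, "Example U", "Art"): 40,
    (2025, "Example U", "History"): 25,
}
funding = {
    (2025, "Example U", "Art"): 2000.0,
    (2025, "Example U", "History"): 5000.0,
    (2025, "Example U", "Physics"): 9000.0,  # no matching enrollment row
}


def funding_per_student(enrollment, funding):
    """Join the two tables on their shared key and divide funding by enrollment."""
    result = {}
    # Only keys present in both tables can be compared.
    for key in enrollment.keys() & funding.keys():
        students = enrollment[key]
        if students > 0:  # skip empty programs to avoid dividing by zero
            result[key] = funding[key] / students
    return result


per_student = funding_per_student(enrollment, funding)
for key in sorted(per_student):
    print(key, per_student[key])
```

Subjects that appear in only one table (Physics here) drop out of the result, which is why matching subject names across the two sources matters.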

Getting Started

Dependencies

This project was built against Python 3.12.
Some scripts require a large amount of memory. An HPC cluster or other server environment is recommended.

Executing program

Due to concerns about license restrictions on the data, I have opted not to redistribute any raw or modified data files with this project. Instead, data must be downloaded from the DoE and NSF directly. This is automated by the download.sh script.
The .csv files downloaded from the College Scorecard must be converted into an Awkward Array for effective data analysis. This is done with convert.py. This step requires 80 GB of RAM, so it must be run on a machine with sufficient memory.
Final plotting can be done with run.py. Sample outputs are provided for user convenience in output.txt (for tabular outputs) and in the plots/ directory.

Help Wanted!

A traditional repository would use the Issues tab on its git host as the feature request / help wanted section. However, I haven't tested the Issues tab (I'm also the only one who uses this git server instance), so a simple list will suffice:

Implement cross-comparison of subject fields with the best available subject equivalent in the funding column
Implement mass-scraping of the NSF site to gather funding data for all institutions across all years, rather than just 2025 data for UA
Modify convert.py to accept the CSVs as Dask DataFrames rather than as Pandas tables, to allow for distributed and/or delayed computing of the conversion. This would allow people with ordinary laptops to run the program
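To show why the last item would help, here is a minimal sketch of the underlying idea: process the CSV a chunk at a time so that only one chunk is in memory at once. It deliberately uses only the standard library rather than Dask, so it is a stand-in for the technique and not the proposed change to convert.py. The column names `institution` and `enrolled`, the helper names, and the sample rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def total_enrollment(csv_file, chunk_size=2):
    """Sum the hypothetical 'enrolled' column per institution, chunk by chunk."""
    totals = defaultdict(int)
    reader = csv.DictReader(csv_file)  # reads rows lazily from the file object
    for chunk in iter_chunks(reader, chunk_size):
        # Only this chunk and the running totals are held in memory.
        for row in chunk:
            totals[row["institution"]] += int(row["enrolled"])
    return dict(totals)


# Made-up rows standing in for a downloaded CSV file.
sample = io.StringIO(
    "institution,subject,enrolled\n"
    "Example U,Art,10\n"
    "Example U,History,20\n"
    "Sample College,Art,5\n"
)
result = total_enrollment(sample)
print(result)
```

A Dask-based version would follow the same shape, with the library handling the partitioning and scheduling instead of the explicit loop.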

Authors

Nathan Nguyen

Version History

0.1
- Initial Release

License

I'm not a lawyer - most of this data probably has restrictions anyway. If I were, though, I'd use the NAME HERE License - see the LICENSE.md file for details.

Acknowledgments

Inspiration, code snippets, etc.

ReadMe Template
