Educational Statistics Analysis Tool
Python utility to parse data from the Department of Education and National Science Foundation.
Description
This project was created to answer the question "If I have $5, should I give it to the art department or to the history department?"
It does so by gathering historical enrollment data from the Department of Education's College Scorecard, indexable by year, institution, and subject, and comparing it against the National Science Foundation's funding data, which is indexable by year, institution, subject, and funding source.
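As a rough illustration of that comparison, the sketch below keys enrollment by (year, institution, subject) and funding by (year, institution, subject, funding source), sums funding across sources, and divides by enrollment. The record values, the institution name, and the funding_per_student helper are all hypothetical, and the sketch assumes subject labels already match between the two datasets, which the real data does not guarantee. It is not the code used in this project.

```python
from collections import defaultdict

# Hypothetical records; the real College Scorecard and NSF field names differ.
enrollment = {
    # (year, institution, subject) -> enrolled students
    (2025, "Example University", "History"): 400,
    (2025, "Example University", "Art"): 250,
}

funding = {
    # (year, institution, subject, funding source) -> dollars
    (2025, "Example University", "History", "Federal"): 100_000.0,
    (2025, "Example University", "History", "Institutional"): 20_000.0,
    (2025, "Example University", "Art", "Federal"): 25_000.0,
}


def funding_per_student(enrollment, funding):
    """Return dollars of funding per enrolled student for each shared key."""
    # Collapse the funding-source axis so both datasets share the same index.
    totals = defaultdict(float)
    for (year, institution, subject, _source), dollars in funding.items():
        totals[(year, institution, subject)] += dollars

    # Keep only keys present in both datasets and with nonzero enrollment.
    result = {}
    for key, students in enrollment.items():
        if students > 0 and key in totals:
            result[key] = totals[key] / students
    return result


ratios = funding_per_student(enrollment, funding)
for key, value in sorted(ratios.items()):
    print(key, value)
```

Funding per student is only one possible metric; it is used here because it needs nothing beyond the two indexes described above.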
Getting Started
Dependencies
- This project was built against Python 3.12
- Some scripts require a large amount of memory. An HPC cluster or other server environment is recommended.
Executing program
- Due to concerns about license restrictions on the data, I have opted not to redistribute any raw or modified data files with this project. Instead, data must be downloaded from the DoE and NSF directly. This is automated by the download.sh file.
- The .csv files downloaded from the College Scorecard must be converted into an Awkward Array for effective data analysis. This is done with convert.py. This step requires 80GB of RAM, so it must be run on a machine with sufficient memory.
- Final plotting can be done with run.py. Sample outputs are provided for user convenience in output.txt (for tabular outputs) and in the plots/ directory.
Help Wanted!
A traditional repository would have the Issues tab in their git host acting as their feature request / help wanted section. However, I've not tested the Issues tab (as I'm also the only one who uses this git server instance), so a simple list will suffice:
- Implement cross-comparison of subject fields with the best available subject equivalent in the funding column
- Implement mass-scraping of the NSF site to gather funding data for all institutions across all years, rather than just 2025 data for UA
- Modify convert.py to accept the CSVs as Dask DataFrames rather than as Pandas tables, to allow for distributed and/or delayed computing of the conversion. This would allow people with ordinary laptops to run the program
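The last item comes down to processing the CSVs in pieces instead of loading them whole. The sketch below is not Dask and not the current convert.py. It uses only the Python standard library to show the underlying idea of holding a bounded number of rows in memory at a time. The column names, the sample rows, and both helper functions are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(csv_file, chunk_size):
    """Yield lists of row dicts, holding at most chunk_size rows at a time."""
    reader = csv.DictReader(csv_file)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_rows_by_year(csv_file, chunk_size=2):
    """Aggregate over the file one chunk at a time instead of all at once."""
    counts = {}
    for chunk in iter_chunks(csv_file, chunk_size):
        for row in chunk:
            counts[row["year"]] = counts.get(row["year"], 0) + 1
    return counts


# Hypothetical stand-in for a College Scorecard CSV; real columns differ.
sample = io.StringIO(
    "year,institution,subject,enrolled\n"
    "2024,Example University,History,390\n"
    "2025,Example University,History,400\n"
    "2025,Example University,Art,250\n"
)

counts = count_rows_by_year(sample)
print(counts)
```

A Dask-based version would hand this partitioning and scheduling to the library. The point here is only that the per-chunk work must not depend on having the entire table in memory.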
Authors
- Nathan Nguyen
Version History
- 0.1
- Initial Release
License
I'm not a lawyer - most of this data probably has restrictions, anyways. If I was, though, I'd use the NAME HERE License - see the LICENSE.md file for details
Acknowledgments
Inspiration, code snippets, etc.