# Educational Statistics Analysis Tool

Python utility to parse data from the Department of Education and National Science Foundation.

## Description

This project was created to answer the question "If I have $5, should I give it to the art department or to the history department?"

It does so by gathering historical enrollment data from the Department of Education's College Scorecard, indexable by year, institution, and subject, and comparing it against the National Science Foundation's funding data, which is indexable by year, institution, subject, and funding source.
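As a minimal sketch of what such a comparison can look like, the snippet below joins toy enrollment and funding records on the shared `(year, institution, subject)` key and reports funding per enrolled student. The record layouts, field names, and numbers are all made up for illustration, and funding per student is only one possible metric; the comparison actually performed by `run.py` may differ.

```python
from collections import defaultdict

# Hypothetical record layouts with made-up values; the real College Scorecard
# and NSF files use their own column names.
enrollment = [
    {"year": 2025, "institution": "UA", "subject": "History", "students": 200},
    {"year": 2025, "institution": "UA", "subject": "Art", "students": 100},
]
funding = [
    {"year": 2025, "institution": "UA", "subject": "History", "source": "Federal", "dollars": 50_000},
    {"year": 2025, "institution": "UA", "subject": "History", "source": "State", "dollars": 30_000},
    {"year": 2025, "institution": "UA", "subject": "Art", "source": "Federal", "dollars": 20_000},
]


def funding_per_student(enrollment, funding):
    """Join both datasets on (year, institution, subject)."""
    # Collapse the extra "funding source" index by summing over all sources.
    totals = defaultdict(float)
    for row in funding:
        totals[(row["year"], row["institution"], row["subject"])] += row["dollars"]

    # Keep only keys present in both datasets, skipping empty enrollments.
    result = {}
    for row in enrollment:
        key = (row["year"], row["institution"], row["subject"])
        if key in totals and row["students"] > 0:
            result[key] = totals[key] / row["students"]
    return result


for (year, institution, subject), dollars in funding_per_student(enrollment, funding).items():
    print(f"{year} {institution} {subject}: ${dollars:,.2f} per student")
```

With these toy numbers, History comes out at 400 dollars per student and Art at 200.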

## Getting Started

### Dependencies

* This project was built against Python 3.12
* Some scripts require a large amount of memory. An HPC cluster or other server environment is recommended.

### Executing program

* Due to concerns about license restrictions on the data, I have opted not to redistribute any raw or modified data files with this project. Instead, the data must be downloaded directly from the DoE and NSF. The `download.sh` script automates this.
* The `.csv` files downloaded from the College Scorecard must be converted into an Awkward Array for effective data analysis. This is done with `convert.py`. **This step requires 80 GB of RAM**, so it must be run on a machine with sufficient memory.
* Final plotting can be done with `run.py`. Sample outputs are provided for convenience in `output.txt` (for tabular outputs) and in the `plots/` directory.
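Because the conversion step is memory-hungry, it can be worth checking the machine before launching it. The sketch below is not part of the project; the helper names are hypothetical, it reads "80 GB" as 80 × 10⁹ bytes, and it relies on `os.sysconf`, which is only available on Unix-like systems (it returns `None` elsewhere).

```python
import os

# Assumption: "80 GB" is read as 80 * 10**9 bytes.
REQUIRED_BYTES = 80 * 10**9


def total_ram_bytes():
    """Return total physical memory in bytes, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or these names are unsupported.
        return None


def has_enough_ram(required=REQUIRED_BYTES):
    """Return True/False, or None when installed memory is unknown."""
    total = total_ram_bytes()
    return None if total is None else total >= required


status = has_enough_ram()
if status is None:
    print("Could not determine installed memory.")
elif status:
    print("This machine reports enough RAM for the conversion step.")
else:
    print("This machine reports less RAM than the conversion step needs.")
```

This only reports installed memory, not memory currently free, so a passing check is a necessary condition rather than a guarantee.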

## Help Wanted!

A traditional repository would use the Issues tab of its git host for feature requests and help-wanted items. However, I haven't tested the Issues tab (I'm also the only one who uses this git server instance), so a simple list will suffice:

- Implement cross-comparison of subject fields with the best available subject equivalent in the funding column.
- Implement mass-scraping of the NSF site to gather funding data for all institutions across all years, rather than just 2025 data for UA.
- Modify `convert.py` to accept the CSVs as Dask DataFrames rather than as pandas DataFrames, to allow for distributed and/or delayed computation of the conversion. This would allow people with ordinary laptops to run the program.
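For anyone picking up the first item, one possible starting point is plain string similarity from the standard library's `difflib`. This is only a sketch: `best_funding_field` is a hypothetical helper, the subject labels are made up, and a real mapping between the two datasets' taxonomies would likely need a curated crosswalk rather than fuzzy matching alone.

```python
from difflib import get_close_matches


def best_funding_field(subject, funding_fields, cutoff=0.5):
    """Return the funding field most similar to `subject`, or None if none is close."""
    # Compare case-insensitively, but hand back the original label.
    by_normalized = {field.lower(): field for field in funding_fields}
    matches = get_close_matches(subject.lower(), list(by_normalized), n=1, cutoff=cutoff)
    return by_normalized[matches[0]] if matches else None


# Made-up labels for illustration; the real datasets use their own names.
funding_fields = ["History", "Visual and Performing Arts", "Mathematics and Statistics"]

for subject in ["history", "Visual Arts", "Mathematics", "Nursing"]:
    print(subject, "->", best_funding_field(subject, funding_fields))
```

Subjects with no sufficiently similar funding field map to `None`, which makes it easy to list the cases that still need a manual decision.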

## Authors

* Nathan Nguyen

## Version History

* 0.1
    * Initial Release

## License

I'm not a lawyer, and most of this data probably has restrictions anyway. If I were, though, I'd use the [NAME HERE](https://www.youtube.com/watch?v=XfELJU1mRMg) License. See the LICENSE.md file for details.

## Acknowledgments

Inspiration, code snippets, etc.
* [ReadMe Template](https://gist.githubusercontent.com/DomPizzie/7a5ff55ffa9081f2de27c315f5018afc/raw/d59043abbb123089ad6602aba571121b71d91d7f/README-Template.md)