Week 3: Assignment
- Due Jan 22, 2023 by 11:59pm
- Points 10
- Submitting a text entry box or a file upload
Instructions
Download: West Point data
Follow the instructions below.
Submit: An RMarkdown knitted HTML file with your work (and set echo = TRUE so your code is visible).
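One common way to keep code visible in every chunk is to set the knitr chunk option once, near the top of the document. A minimal sketch, assuming you place it in the first chunk of your .Rmd file:

```r
# Show the source code of every chunk in the knitted HTML output.
knitr::opts_chunk$set(echo = TRUE)
```

You can still override the option on an individual chunk if needed.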
Assignment Details
See the attached Excel workbook, which has two sheets in it. One describes the data documentation, and the other is the data from the United States Military Academy at West Point (USMA). The data is already fairly clean and tidy, but still requires some more work (as would be typical!).
At USMA, cadets are basically randomly assigned to "companies", like military units, with whom they spend all their time and take all their classes. The Excel sheet contains information on the year, the standing (fresh/soph/etc.) the student has in that year, the student/cadet's gender, the gender of the other cadets in their company, and whether they progressed to the next year (1) or dropped out (0). Note that when the documentation refers to "cohort", you can think of this as referring to "class".
Complete the following data cleaning/manipulation tasks (or analysis tasks which require cleaning and manipulation) in an RMarkdown document. Set echo = TRUE for all your code so it's visible.
1. EITHER save the data sheet as its own CSV file to load in, OR use the read_excel function in the readxl package to read the sheet in directly from the Excel workbook.
2. Recreate the femalespeers, malespeers, and totpeople columns based on the documentation for those columns, and check whether your calculations match what's in the original data. In other words, look at the Excel sheet, read the variable descriptions, and create new variables that fit those descriptions. "Recreate" means "create from scratch." Do not use the femalespeers, malespeers, and totpeople columns already in the data to create your new ones. That wouldn't be "recreating", that would be "copying." (NOTE 1: you won't get an exact match with the old columns. NOTE 2: keep in mind that these variables count "peers", i.e. they do not include yourself.)
3. Investigate the rows for which your recreation *doesn't* line up exactly with the original columns. Any ideas what the issue might be? Do you trust the original or your recreation more?
4. Create two new columns from company_n: company and division. If the value is A-1, for example, A is the company and 1 is the division.
5. This data follows a certain number of cohorts, which means that in the first year of the data we see only a small portion of all students, then more the next year, and so on. Limit the data to the years in which all four classes are present in full (i.e. not just a few stragglers: all four entire classes appear to be there). This will entail finding which years those are.
6. Make the following tables:
a. Top four companies (A, B, C, etc., not A-1, A-2) with the highest continue_or_grad rates
b. continue_or_grad rates by class
c. continue_or_grad rates of women by class
Note that you can make a table by just creating the appropriate data set and showing it, or by sending it to the knitr::kable() function to get it formatted a little more nicely.
7. Bonus task (ungraded, tricky): notice anything strange about the "random assignment" of women?
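To illustrate the kind of manipulation tasks 4 and 6 call for, here is a minimal sketch on made-up toy rows, not the West Point data. The column names company_n and continue_or_grad come from the assignment text; the class column name and every value below are assumptions for illustration only, so check the documentation sheet for the real names. Base R is used here, and a tidyverse approach (for example tidyr::separate() with dplyr::group_by() and dplyr::summarise()) would work just as well.

```r
# Toy data only: the values and the `class` column name are invented.
toy <- data.frame(
  company_n        = c("A-1", "A-2", "B-1", "B-2", "C-1", "C-1"),
  class            = c(1, 2, 1, 2, 1, 2),
  continue_or_grad = c(1, 0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Split "A-1" into the part before the hyphen (company)
# and the part after it (division).
toy$company  <- sub("-.*$", "", toy$company_n)
toy$division <- as.integer(sub("^.*-", "", toy$company_n))

# The mean of a 0/1 indicator is the share of rows equal to 1,
# so the group mean is the continue_or_grad rate for that group.
rates <- aggregate(continue_or_grad ~ company, data = toy, FUN = mean)

# Sort from the highest rate to the lowest.
rates <- rates[order(-rates$continue_or_grad), ]
rates
```

Printing rates shows the table as plain output; passing it to knitr::kable(rates) instead gives the nicer formatting mentioned in task 6. Grouping by a different column, or subsetting the rows first, follows the same pattern.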