Week 3: Assignment

Due Jan 22, 2023 by 11:59pm
Points 10
Submitting a text entry box or a file upload

Instructions

Download: West Point data

Follow the instructions below.

Submit: A knitted RMarkdown HTML file with your work (set echo = TRUE so your code is visible).
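
One common way to make code visible in every chunk is to set the option once in a setup chunk near the top of the .Rmd file. This is a minimal sketch, assuming the knitr package is installed; setting echo = TRUE on each chunk individually works as well.

```r
# Sets echo = TRUE as the default for every chunk that follows,
# so code is shown alongside its output in the knitted HTML.
knitr::opts_chunk$set(echo = TRUE)
```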

Assignment Details

See the attached Excel workbook, which has two sheets. One contains the data documentation, and the other contains data from the United States Military Academy at West Point (USMA). The data is already fairly clean and tidy but still requires some work (as is typical!).

At USMA, cadets are essentially randomly assigned to "companies", which are like military units, with whom they spend all their time and take all their classes. The Excel sheet contains information on the year, the student's standing in that year (fresh/soph/etc.), the student/cadet's gender, the gender of other cadets in their company, and whether they progressed to the next year (1) or dropped out (0). Note that when the documentation refers to "cohort", you can think of this as referring to "class".

Complete the following data cleaning/manipulation tasks (or analysis tasks which require cleaning and manipulation) in an RMarkdown document. Set echo = TRUE for all your code so it's visible.

1. EITHER save the data sheet as its own CSV file to load in, OR use the read_excel function in the readxl package to read the sheet in directly from the Excel workbook.

2. Recreate the femalespeers, malespeers, and totpeople columns based on the documentation for those columns, and check whether your calculations match what's in the original data. In other words, look at the Excel sheet, read the variable descriptions, and create new variables that fit those descriptions. "Recreate" means "create from scratch." Do not use the femalespeers, malespeers, and totpeople columns already in the data to create your new ones. That wouldn't be "recreating"; that would be "copying." (NOTE 1: you won't get an exact match with the old columns. NOTE 2: keep in mind that these variables count "peers", i.e. they do not include the cadet themselves.)

3. Investigate the rows for which your recreation *doesn't* line up exactly with the original columns. Any ideas what the issue might be? Do you trust the original or your recreation more?

4. Create two new columns from company_n: company and division. For example, if company_n is A-1, then A is the company and 1 is the division.

5. This data follows a certain number of cohorts, which means that in the first year of the data we only see a small portion of all students, then more the next year, and so on. Limit the data to the years in which all four classes are present in full quantity (i.e. not just a few stragglers; all four entire classes appear to be there). This will entail finding which years those are.

6. Make the following tables:

a. Top four companies (A, B, C, etc., not A-1, A-2) with the highest continue_or_grad rates

b. continue_or_grad rates by class

c. continue_or_grad rates of women by class

Note that you can make a table by just creating the appropriate data set and showing it, or by sending it to the knitr::kable() function to get it formatted a little more nicely.

7. Bonus task (ungraded, tricky): notice anything strange about the "random assignment" of women?
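
The mechanics behind tasks 2, 4, and 6 can be illustrated on made-up toy data. This is only a sketch, assuming the dplyr and tidyr packages are installed. The columns year, class, and female, the new column names, and the choice to define a peer group by year and company_n are all hypothetical; the workbook's documentation sheet determines the real column names and the correct grouping.

```r
library(dplyr)
library(tidyr)

# Made-up toy data. Only company_n and continue_or_grad are names taken from
# the assignment; year, class, and female are hypothetical stand-ins.
toy <- tibble(
  year             = c(80, 80, 80, 80, 80, 80),
  class            = c(1, 1, 1, 2, 2, 2),
  company_n        = c("A-1", "A-1", "A-1", "A-1", "B-2", "B-2"),
  female           = c(1, 0, 0, 1, 0, 0),
  continue_or_grad = c(1, 1, 0, 1, 1, 0)
)

# Count peers within a group while excluding the row's own cadet.
# ASSUMPTION: a peer group is everyone sharing the same year and company_n.
toy_peers <- toy %>%
  group_by(year, company_n) %>%
  mutate(
    n_female_peers = sum(female) - female,            # women in group, minus self
    n_male_peers   = sum(1 - female) - (1 - female),  # men in group, minus self
    n_total_peers  = n() - 1                          # group size, minus self
  ) %>%
  ungroup()

# Split a combined label such as "A-1" into two columns at the hyphen,
# keeping the original column for reference.
toy_split <- toy_peers %>%
  separate(company_n, into = c("company", "division"), sep = "-", remove = FALSE)

# A grouped rate is the mean of a 0/1 indicator within each group.
rates_by_company <- toy_split %>%
  group_by(company) %>%
  summarize(continue_rate = mean(continue_or_grad)) %>%
  arrange(desc(continue_rate))

print(rates_by_company)
```

Comparing recreated columns against the originals (task 3) could then be done by filtering for rows where the two differ, and a "top four" table could be taken from the sorted result with head() or slice_head().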

Rubric

Criteria	Ratings	Pts
Description of criterion	Full Marks (5 pts) / No Marks (0 pts)	5 pts
Description of criterion	Full Marks (5 pts) / No Marks (0 pts)	5 pts