Chapter 2 Data sources

Columbia University offers one of the most rigorous data science programs in the world. Every year, a select cohort of undergraduates and professionals from a diverse range of backgrounds (art history, mathematics, computer science, etc.) is offered admission to the Data Science Institute, where they typically complete three semesters of formal education before joining the workforce or a doctoral program to tackle the world’s most pressing problems with data-driven solutions. Because this group is chosen so selectively, we didn’t need to go far to find a quality pool of future data scientists.

However, building our survey was no trivial task. In formulating our questionnaire, we grouped our questions into three categories in order to get the most interesting and useful results:

-Who are we?

-How are we doing?

-Where are we going?

The first category establishes exactly who makes up our cohort (e.g., gender, academic background, prior institution). These profile questions are necessary because they let us look for larger trends in the answers to our bigger-picture questions. The second and third categories contain those “bigger-picture” questions, which cover work-life balance, domain areas of interest, and post-graduation intentions. As for the survey format, some questions offered a restricted set of answer choices for ease of cleaning, while others had to remain open-ended. With this in mind, we wondered how we would deal with the likely noise in the data from the open-ended questions in particular. How would we separate useful insights from noise that could come from a number of sources, including spelling and capitalization errors, diversity in answers, and nonsensical or implausible responses?
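
One way to approach the noise described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure we actually used: the `CANONICAL` list, the function names `normalize` and `clean_response`, and the 0.8 similarity cutoff are all assumptions made for the example. The idea is to normalize case and whitespace first, then map each open-ended answer to the closest known category using the standard library's `difflib.get_close_matches`, and to flag anything that matches nothing as a candidate for manual review.

```python
import difflib
import re

# Hypothetical canonical answers; the survey's real categories are not listed here.
CANONICAL = ["computer science", "mathematics", "art history", "statistics"]


def normalize(raw):
    """Lowercase, trim, and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def clean_response(raw, canonical=CANONICAL, cutoff=0.8):
    """Map a free-text answer to its closest canonical label.

    Returns None when the answer is empty or nothing is similar enough,
    so that such responses can be reviewed by hand instead of guessed at.
    The cutoff of 0.8 is an assumed value, not a tuned one.
    """
    text = normalize(raw)
    if not text:
        return None
    matches = difflib.get_close_matches(text, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for answer in ["  Computer  Science ", "Mathmatics", "asdfgh", "   "]:
        print(repr(answer), "->", clean_response(answer))
```

A fixed cutoff trades off two kinds of error: set too low, distinct answers get merged; set too high, simple typos go unmatched. Answers that legitimately fall outside the known categories ("diversity in answers") also return `None` here, so unmatched responses should be inspected rather than discarded.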