Analyzing the NYC Subway Dataset

Intro to Data Science: Final Project 1, Part 2

(Short Questions)

Austin J. Alexander


1.2 Why is this statistical test applicable to the dataset? (see 1.1.a)

Section 3. Visualization

3.1 One visualization should contain two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days (see the ENTRIESn_hourly HISTOGRAM (by RAIN) in the *DataExploration* supplement)

3.2 One visualization can be more freeform (e.g., see the Combining Station, Latitude, Longitude Data on a Map Layer in the *DataExploration* supplement)

Section 4. Conclusion

4.1 From your analysis and interpretation of the data, do more people ride the NYC subway when it is raining or when it is not raining?

The current data set and the analyses performed do not establish whether rain has any impact on the number of NYC subway entries. Based on this data set alone, rain appeared to be an insignificant factor in subway ridership; further analysis is necessary before drawing a firmer conclusion.

On the other hand, the data exploration makes it quite clear that the number of entries depends heavily on physical location, particularly station position, with specific units carrying the most weight.

4.2 What analyses lead you to this conclusion?

Based on exploration, statistical significance tests, and attempts at modeling the data, physical location dominated as an explanatory variable with respect to the number of entries. The Mann-Whitney $U$ test indicated that the distributions of entries on rainy and non-rainy days were essentially identical. Thus, there did not appear to be either a statistically significant or, based on the effect size values, a practical difference between rainy and non-rainy days in their respective numbers of entries.
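To make the distinction between statistical and practical difference concrete, here is a minimal sketch of a Mann-Whitney $U$ computation with an accompanying effect size. It is not the code used in this analysis: the function names are hypothetical, the rank-biserial correlation is only one possible effect size (the one actually used is not specified above), the p-value relies on a plain normal approximation without tie or continuity correction, and the entry counts are made up for illustration.

```python
import math


def mann_whitney_u(xs, ys):
    """U statistic for xs: count of (x, y) pairs with x > y; ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def rank_biserial(xs, ys):
    """Effect size in [-1, 1]; 0 means neither group tends to be larger."""
    return 2.0 * mann_whitney_u(xs, ys) / (len(xs) * len(ys)) - 1.0


def approx_p_value(xs, ys):
    """Two-sided p-value from the normal approximation.

    Assumption: no tie correction and no continuity correction.
    """
    n, m = len(xs), len(ys)
    mean_u = n * m / 2.0
    sd_u = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (mann_whitney_u(xs, ys) - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up hourly entry counts, for illustration only (not from the data set).
rainy = [120, 340, 560, 80, 910]
dry = [130, 330, 570, 75, 900]
print(rank_biserial(rainy, dry), approx_p_value(rainy, dry))
```

An effect size near 0 together with a large p-value is the pattern described above: no detectable statistical difference and no practical one either. The pairwise loop is quadratic in the sample sizes, so it only suits small samples.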

Section 5. Reflection

5.1 Please discuss potential shortcomings of the methods of your analysis

The data set under consideration was limited to a single month in the late spring / early summer of a particular year. As a result, and setting aside the many other factors for which the available data did not account, precipitation may in fact have a greater impact on the number of entries at other times of the year (e.g., during the winter months). The data set was thus limited by its temporal scope.

The linear regression model that was created, while having very high $r$ and $R^{2}$ values, did not adequately model the data, as the residual analysis showed. The relationship between the explanatory and response variables under consideration is in fact not linear; thus, a non-linear model would likely be more appropriate for the current data set.
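The point that a high $R^{2}$ does not by itself validate a linear model can be illustrated with a small sketch. This is a toy example, not the model built for this analysis: it fits a single-predictor least-squares line to deliberately curved synthetic data, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals_and_r2(xs, ys):
    """Residuals (observed minus fitted) and the coefficient of determination."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return residuals, 1.0 - ss_res / ss_tot


# Synthetic, deliberately curved data (y = x^2), for illustration only.
xs = list(range(11))
ys = [x ** 2 for x in xs]
residuals, r2 = residuals_and_r2(xs, ys)
print(round(r2, 3))
print([round(r, 1) for r in residuals])
```

Here $R^{2}$ is above 0.9, yet the residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than random scatter around zero, is the kind of evidence that points toward a non-linear model.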

The statistical tests that were employed seemed effective (as long as sample sizes were kept small enough). However, it's unclear how traditional statistical tests relate to massive data sets (especially since many statistical tests need to be used with relatively small sample sizes; on this point, see 5.2 below).

5.2 (Optional) Do you have any other insight about the dataset that you would like to share with us?

Assuming all statistical tests and learning models were implemented and interpreted correctly, it became clear that computational power is very important in data science, not merely for applying methods to data, but for repeating numerous tests on random samples of the data, which, at least in this analysis, lent more confidence to the test and model results.
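The idea of repeating a test on many random samples can be sketched as follows. This is an assumption-laden illustration rather than the procedure used in this analysis: the function names, sample size, number of repeats, and the choice of a median difference as the statistic are all hypothetical, and the data are synthetic.

```python
import random
import statistics


def median_difference(xs, ys):
    """Example statistic: difference between the two sample medians."""
    return statistics.median(xs) - statistics.median(ys)


def repeated_subsample_stat(group_a, group_b, stat, sample_size, repeats, seed=0):
    """Apply `stat` to `repeats` independent random subsamples of each group."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    results = []
    for _ in range(repeats):
        a = rng.sample(group_a, sample_size)  # drawn without replacement
        b = rng.sample(group_b, sample_size)
        results.append(stat(a, b))
    return results


# Synthetic stand-ins for rainy and non-rainy hourly entries.
gen = random.Random(42)
rainy = [gen.randint(0, 2000) for _ in range(5000)]
dry = [gen.randint(0, 2000) for _ in range(5000)]

diffs = repeated_subsample_stat(rainy, dry, median_difference, 100, 200)
print(statistics.mean(diffs), statistics.stdev(diffs))
```

If the statistic stays centered in the same place across many subsamples, that stability is what supports confidence in a single test result; a wide or shifting spread would suggest the result depends on which sample happened to be drawn.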



While innumerable online resources were used to attain a better understanding of the statistical matters in this analysis, the primary source for statistical definitions and methods was Michael Sullivan's *Statistics: Informed decisions using data* (4th ed.).