Chapter 10: Exploratory data analysis
Set up
The R packages you need for this tutorial are already loaded. Once you start working in RStudio, you will need to load them yourself with library() calls at the start of each session.
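A minimal sketch of that loading step is below. The package list is an assumption, since this page does not state it: tidyverse supplies ggplot2, dplyr, and the diamonds dataset, and nycflights13 supplies the flights data used in a later exercise.

```r
# Assumed package list; adjust it to match what your own session needs.
library(tidyverse)     # ggplot2, dplyr, and the diamonds dataset
library(nycflights13)  # flights data for the scheduled departure time exercise
```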
Exercises
All exercises are from R for Data Science, 2nd Edition.
Ex 1 in Section 10.3.3
Explore the distribution of price in the diamonds dataset. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
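As a possible starting point, the sketch below draws the same histogram at a coarse and a fine binwidth and then zooms in on part of the range. The binwidths (1000 and 10) and the zoom window are illustrative assumptions, not recommended answers, and the names p_coarse and p_fine are hypothetical.

```r
library(tidyverse)

# Same variable, two very different binwidths (illustrative values only).
p_coarse <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000)

p_fine <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 10)

p_coarse
p_fine

# coord_cartesian() zooms the view without dropping observations
# that fall outside the window; the limits here are arbitrary.
p_fine + coord_cartesian(xlim = c(0, 2500))
```

Comparing the two plots side by side is usually the quickest way to see which features depend on the binwidth and which persist across it.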
In Section 10.4, the authors recommend replacing unusual values with missing values so that they do not distort later calculations. Following the reading, we create a new dataset, diamonds2, that replaces these unusual values with NA (see Chapter 10 for details).
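The step can be sketched as follows, assuming the same rule the reading applies to the y variable (values below 3 or above 20 are treated as unusual). If your copy of the tutorial uses different cutoffs, use those instead.

```r
library(tidyverse)

# Replace implausible y values with NA instead of dropping whole rows.
# The cutoffs 3 and 20 are taken from the reading and assumed here.
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA, y))
```

Because only the y values are replaced, diamonds2 keeps the same number of rows as diamonds.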
Ex 2 in Section 10.4.1
What does na.rm = TRUE do in mean() and sum()? You can use the y variable in the diamonds2 dataset to explore this question.
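Before trying this on diamonds2, it may help to see the behavior on a tiny vector. The vector x below is a made-up example, not data from the tutorial.

```r
# A small vector with one missing value (hypothetical example data).
x <- c(1, NA, 3)

mean(x)                # NA: a single missing value makes the result missing
mean(x, na.rm = TRUE)  # 2: the NA is removed before averaging
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 4
```

The same calls can then be repeated with diamonds2$y in place of x.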
Ex 3 in Section 10.4.1
Recreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.
We provide the code from Section 10.4 as a baseline. Update it to add faceting with appropriate scaling.
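For reference, here is a sketch of that baseline as it appears in the reading, followed by one possible way to facet it. The intermediate name flights2 and the choice of scales = "free_y" are assumptions made for illustration; other values of scales are worth trying.

```r
library(tidyverse)
library(nycflights13)

# Flag cancelled flights and convert the scheduled departure time
# from HHMM format to decimal hours.
flights2 <- flights |>
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  )

# Baseline: one frequency polygon per group on a shared y axis.
p_baseline <- ggplot(flights2, aes(x = sched_dep_time)) +
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)

# One option: a panel per group, each with its own y axis.
p_faceted <- p_baseline +
  facet_wrap(~cancelled, scales = "free_y")

p_baseline
p_faceted
```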
Ex 5 in Section 10.5.1.1
Create a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?
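The four plots can be sketched as below, assuming cut as the categorical variable (any other categorical column, such as color or clarity, would work the same way). The binwidth of 500 and the plot names are illustrative assumptions.

```r
library(tidyverse)

# 1. Violin plot: one violin per level of cut.
p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

# 2. Faceted histogram: one panel per level of cut.
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~cut)

# 3. Colored frequency polygon: overlaid lines showing counts.
p_freq <- ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(aes(color = cut), binwidth = 500)

# 4. Colored density: overlaid curves, each with an area of one.
p_density <- ggplot(diamonds, aes(x = price)) +
  geom_density(aes(color = cut))

p_violin
p_hist
p_freq
p_density
```

Note that the histogram and frequency polygon show counts, so levels with more diamonds look larger, while the violin and density plots are scaled per group by default.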
Open code chunk
Here is an empty code chunk in case you want to experiment with your own code.