Once you have a good understanding of the data you have to work with, the next step is to decide what question you want that data to answer. Understanding the problem at hand is part of the Business Understanding step in the Data Science Process.
A business question with a data solution can often be posed as a hypothesis, for example: “Is there a difference in the customer conversion rate between our old website design and a proposed new layout?” A hypothesis to test is required before any statistical testing can occur.
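To make the conversion-rate question concrete, here is a minimal sketch of one way such a hypothesis could be tested: a two-sided two-proportion z-test, written with only the Python standard library. The function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical, invented purely for illustration, and the z-test itself is an assumed choice of method rather than one prescribed above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: both designs share the same conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts (old design = A, new layout = B), for illustration only.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value would count as evidence against the null hypothesis that the two designs convert at the same rate. The normal approximation used here is generally considered reasonable only when each group has a fair number of conversions and non-conversions.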
Hypotheses fall into two types, exploratory and confirmatory. As the names suggest, exploratory hypotheses seek to uncover the “why” by digging into the data, while confirmatory hypotheses apply when you already have a good idea of what is going on in the data and need evidence to support that thinking. It is important to decide a priori which category each of your hypotheses belongs to. It has been argued that limiting exploratory hypothesis testing can help to increase certainty in results.
Once the hypothesis has been determined, the next question to answer is “Am I comparing the means or the medians of two groups?” Parametric tests compare group means, while non-parametric tests compare group medians. A common misconception is that the decision rests solely on whether the data is normally distributed. The shape of the distribution can matter significantly, especially with smaller sample sizes, but other factors should also be considered.
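The mean-versus-median question matters because the two measures can disagree sharply on skewed data. The short sketch below uses two hypothetical groups, invented for illustration, that differ only in a single extreme value.

```python
from statistics import mean, median

# Hypothetical measurements, invented for illustration only.
# The groups are identical except for one extreme value in group_b.
group_a = [12, 14, 15, 15, 16, 18, 20]
group_b = [12, 14, 15, 15, 16, 18, 250]

for name, values in (("A", group_a), ("B", group_b)):
    print(f"group {name}: mean = {mean(values):.2f}, median = {median(values)}")
```

The medians are identical while the means differ widely, so a comparison of means and a comparison of medians would tell different stories about these two groups.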
Parametric tests are widely regarded as handling normally distributed (Gaussian) data well. However, parametric tests also:
- Work well with skewed and non-normal distributions, provided the sample size is large enough.
- Perform well when the groups have different spreads (different amounts of variability), provided the test used does not assume equal variances.
- Typically have more statistical power than non-parametric tests when their assumptions are met.
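As an illustration of a parametric comparison of means that tolerates unequal spread, the sketch below computes Welch's t statistic and its approximate degrees of freedom using only the standard library. Welch's test is an assumed example here, not a method named above, and the function name `welch_t` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance of each group divided by its size; the two groups
    # are not assumed to share a common variance.
    va = variance(sample_a) / n_a
    vb = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


# Hypothetical samples: a tightly clustered group and a widely spread one.
tight = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
wide = [7.0, 14.5, 9.2, 16.1, 5.8, 13.4]
t, df = welch_t(tight, wide)
print(f"t = {t:.3f}, df = {df:.1f}")
```

The statistic would then be compared against a t distribution with `df` degrees of freedom to obtain a p-value. In practice a library routine such as SciPy's `scipy.stats.ttest_ind` with `equal_var=False` performs the whole calculation.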
If the sample size is sufficiently large and the group mean is the preferred measure of central tendency, parametric tests are the way to go.
If the group median is the preferred measure of central tendency for the data, go with non-parametric tests regardless of sample size…