Introduction
Earlier, we showed you what correlation looks like on scatterplots and described how the Pearson correlation coefficient, r, can vary from -1 to 1. In this post, we’ll provide some details on how to manually enter data, get the correlation coefficient, and get the p (significance) value for the correlation coefficient in Python.
What You’ll Need
Python
Anaconda (to run Jupyter Notebook)
Entering Data Manually and Getting the R Value
In your Jupyter notebook, enter the following code:
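import numpy as np
from scipy.stats import pearsonr

# Paired height and weight observations for 15 people
height1 = [67, 72, 75, 80, 60, 65, 68, 69, 69, 70, 70, 80, 76, 60, 60]
weight1 = [150, 240, 270, 300, 160, 180, 170, 175, 175, 190, 190, 260, 240, 140, 130]
# Convert the lists to NumPy arrays
height = np.array(height1)
weight = np.array(weight1)
# Returns the Pearson r followed by its two-sided p value
pearsonr(height, weight)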
Then press Ctrl-A to select all of this code within the cell; next, press Ctrl-Enter to run it.
The output lists the correlation coefficient first and the p value second. Here, r = .9156, p < .0001. These two variables are positively and significantly correlated. Remember, you can square your r value to get the coefficient of determination, which here is 0.8383. In other words, in your dataset, (0.9156)^2, or approximately 83.83%, of the variation in weight is explained by variation in height.
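If you would rather have Python report these numbers directly, the following is a minimal sketch that unpacks the two values returned by pearsonr and squares r to get the coefficient of determination. The variable names r, p, and r_squared are our own choices for illustration, and the rounding in the print statements is only for display.

```python
import numpy as np
from scipy.stats import pearsonr

height = np.array([67, 72, 75, 80, 60, 65, 68, 69, 69, 70, 70, 80, 76, 60, 60])
weight = np.array([150, 240, 270, 300, 160, 180, 170, 175, 175, 190, 190, 260, 240, 140, 130])

# pearsonr returns the correlation coefficient first and the two-sided p value second
r, p = pearsonr(height, weight)

# Squaring r gives the coefficient of determination
r_squared = r ** 2

print(f"r = {r:.4f}")
print(f"p = {p:.2e}")
print(f"r^2 = {r_squared:.4f}")
```

Unpacking into two names keeps the later arithmetic readable and avoids indexing into the result.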
Generating a Scatterplot
You can use the code below to generate a scatterplot of these two variables in relation to each other. This code anticipates one of the next steps after correlation, which, in the case of these data, might be to predict someone’s weight if you know their height: an ordinary least squares (OLS) regression. The plot superimposes the line of best fit and its 95% confidence interval (CI). Read on to learn more about OLS regression in Python.
import pandas as pd
import seaborn as sns

# Put the paired observations into a DataFrame with named columns
data = {
"height": [67, 72, 75, 80, 60, 65, 68, 69, 69, 70, 70, 80, 76, 60, 60],
"weight": [150, 240, 270, 300, 160, 180, 170, 175, 175, 190, 190, 260, 240, 140, 130]
}
df = pd.DataFrame(data)
# Scatterplot with the fitted regression line and a bootstrapped 95% CI band
sns.lmplot(x='height', y='weight', data=df, fit_reg=True, ci=95, n_boot=1000)
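The plot shows the line of best fit but does not print its equation. As a rough sketch of the prediction step mentioned above, you could estimate the slope and intercept with scipy.stats.linregress. This is a different function from the one seaborn uses to draw the plot, so treat it as an illustration, not a description of what lmplot does internally. The helper predict_weight is a hypothetical name introduced only for this example.

```python
from scipy.stats import linregress

height = [67, 72, 75, 80, 60, 65, 68, 69, 69, 70, 70, 80, 76, 60, 60]
weight = [150, 240, 270, 300, 160, 180, 170, 175, 175, 190, 190, 260, 240, 140, 130]

# Simple OLS regression of weight on height
fit = linregress(height, weight)

def predict_weight(h):
    """Hypothetical helper: predicted weight for a given height."""
    return fit.intercept + fit.slope * h

print(f"weight = {fit.intercept:.2f} + {fit.slope:.2f} * height")
print(f"r from the regression: {fit.rvalue:.4f}")
print(f"Predicted weight at height 70: {predict_weight(70):.1f}")
```

The rvalue reported by linregress should match the r from pearsonr, because both describe the same linear relationship between the two variables.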