Interpreting Scatterplots

University of Glasgow Q-Step centre

Thees F Spreckelsen


Scatterplots display the relationship between two continuous variables [1].

[1] Continuous variables:
- The value of the variable, e.g. age, can be any number (3; 4.88882; -55).
- The differences (“intervals”) between two values of the variable can be of any size (e.g. very small or very large).

An example is the plot below, which uses data from the Durex 2010 report The Face of Global Sex 2010:

All observations for which neither variable is missing are shown. The plot thus allows us to see relationships (e.g. correlations), but also other patterns (e.g. clusters of observations in one age range), and especially unusual observations [2].
Scatterplots are very powerful, particularly for smaller datasets or those with clear patterns.

[2] Unusual observations: these are often called outliers. However, that term suggests that the observations are deviations from the “norm”, rather than observations that should simply be investigated further.
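
The rule that only observations with both variables present are shown can be checked before plotting. The following is a minimal sketch in base R, assuming a small made-up data frame `toy` (its values are hypothetical and not taken from the report); it counts how many observations a scatterplot of `x` against `y` could display:

```r
# Hypothetical toy data: two continuous variables with some missing values
toy <- data.frame(
  x = c(1.2, 2.5, NA, 4.1, 5.0),
  y = c(3.4, NA, 2.2, 4.8, 5.1)
)

# TRUE for rows in which neither x nor y is missing
shown <- complete.cases(toy[, c("x", "y")])

n_shown   <- sum(shown)    # observations a scatterplot can display
n_dropped <- sum(!shown)   # observations left out because a value is missing

# The rows that would actually appear as points
toy_plotted <- toy[shown, ]
```

ggplot2's geom_point() drops such incomplete rows itself and warns about how many were removed, so counting them beforehand is mainly useful for reporting how much data the plot leaves out.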

Interpretation: Interpreting scatterplots should always start with looking at the two variables presented and the scales on which they are presented.

Sex education example:
The graph displays the mean age at which people received sex education and the mean number of sexual partners (at the point of the survey). The age values range from approx. 11 years to just above 15, and the average number of sexual partners from 2.5 to just above 5. Each point in the scatterplot represents a country average. There appears to be a slight positive relationship between average age at sex education and average number of sexual partners.
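
The ranges and the direction of the relationship read off the graph can be cross-checked numerically. The sketch below re-enters the two plotted columns from the raw data listed in the "Data & code" section; the data frame name `sexed` is an assumption made here for illustration. It uses base R's range() and cor(). The Pearson correlation is only one possible summary of the relationship and is not something the original text reports.

```r
# Country averages, copied from the raw data table in "Data & code"
sexed <- data.frame(
  Country = c("Austria", "Belgium", "France", "Germany", "Hungary",
              "Italy", "Lithuania", "Netherlands", "Poland", "Romania",
              "Slovenia", "Spain", "Switzerland", "Turkey", "United Kingdom"),
  Age_at_first_sex_education = c(11.5, 11.7, 13.1, 12.1, 12.0,
                                 15.3, 13.2, 12.1, 12.9, 14.3,
                                 11.5, 14.8, 11.9, 14.9, 11.9),
  Mean_number_of_sexual_partners = c(4.4, 3.0, 3.8, 4.7, 3.3,
                                     3.9, 2.9, 3.6, 2.5, 3.6,
                                     3.0, 3.9, 4.0, 5.3, 4.0)
)

# Smallest and largest value of each plotted variable
age_range     <- range(sexed$Age_at_first_sex_education)
partner_range <- range(sexed$Mean_number_of_sexual_partners)

# Pearson correlation between the two country averages
r <- cor(sexed$Age_at_first_sex_education,
         sexed$Mean_number_of_sexual_partners)
```

A positive but small value of `r` would be in line with the "slight positive relationship" described above. With only 15 country averages, a single point can shift it noticeably.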

Check: Any interpretation should check that the data are represented “truthfully” (i.e. without distortion), that the data used are trustworthy (for which they should ideally be openly accessible [3]), and whether the question or argument is actually answered or supported by the graph.

[3] Open data: To lie with data is easy unless everyone can check for themselves. Ideally all data should be accessible; if you collect your own, make it open by depositing it, e.g. at https://osf.io.

Sex education example:
The range of the axes (e.g. age 11 to age 16) is sensible as the data displayed are averages. If these were observations about individuals (e.g. a person’s number of sexual partners) this would not be the case, since it is possible to have no partners and the scale should thus start at zero.
The data for the graph come from the Durex 2010 report The Face of Global Sex 2010, which is freely available online. However, the raw individual survey data are not (to the author’s knowledge) openly available, and we thus cannot check whether the averages are correct.
The title is somewhat misleading, since it does not make clear that the data displayed are for countries rather than individuals.
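
The points about the axis range and the title can be illustrated with a short sketch. It assumes made-up individual-level values (the data frame `people` and its numbers are hypothetical and not from the Durex report) and uses ggplot2's expand_limits() to make sure zero is on the scale, with a title that names the unit of observation:

```r
library(ggplot2)

# Hypothetical individual-level data (NOT from the Durex report)
people <- data.frame(
  age_sex_education = c(11, 12, 12, 13, 14, 15),
  sexual_partners   = c(3, 1, 6, 2, 4, 8)
)

p <- ggplot(people, aes(x = age_sex_education, y = sexual_partners)) +
  geom_point(size = 3) +
  # Individuals can have zero partners, so zero must be on the scale
  expand_limits(y = 0) +
  # State in the title that each point is an individual
  labs(title = "Sex education & sexual partners (individuals)",
       x = "Age at sex education",
       y = "Number of sexual partners")
```

expand_limits() only extends the axis; it does not remove any observations, which is why it is used here instead of fixed limits.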

Summary

There are thus six key elements for interpreting and checking scatterplots:

What to interpret:

What to check:

Doing your own scatterplots?

Make sure you have looked at each of the six elements above when making your own plots. If one element is missing in your graph, or you cannot answer the corresponding questions with the graph (on its own), it is probably not trustworthy or easy to interpret.
A final note on data sources. Graphs are often re-used on Twitter, Instagram or the like without any of the information that a paper or other interpretation contains. Thus, make the source part of the graph.


Data & code

The graphs above use the following two packages [4]:

library(ggplot2)
library(ggthemes)

[4] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

The plots are based on data from the first table in Appendix I of the Durex 2010 report The Face of Global Sex 2010. The raw data are given below:

# Approach taken from Dirk Eddelbuettel: https://stackoverflow.com/a/38601375
durexraw <- "Country, Ever_had_sexual_intercourse_no, Ever_had_sexual_intercourse_yes, Mean_number_of_sexual_partners, Age_at_first_sex_education, Woman_can_become_pregnant_the_first_time_Agree, Woman_can_become_pregnant_the_first_time_Disagree, Woman_can_become_pregnant_the_first_time_Dk
Austria, 27.0, 73.0, 4.4, 11.5, 88.1, 8.7, 3.2
Belgium, 36.8, 63.2, 3.0, 11.7, 91.1, 6.9, 2.0
France, 34.4, 65.6, 3.8, 13.1, 92.9, 4.9, 2.2
Germany, 19.6, 80.4, 4.7, 12.1, 79.5, 17.2, 3.3
Hungary, 25.8, 74.2, 3.3, 12.0, 93.1, 4.6, 2.3
Italy, 30.5, 69.5, 3.9, 15.3, 74.9, 20.9, 4.2
Lithuania, 42.5, 57.5, 2.9, 13.2, 88.3, 8.2, 3.5
Netherlands, 29.6, 70.4, 3.6, 12.1, 91.6, 7.3, 1.1
Poland, 42.9, 57.1, 2.5, 12.9, 93.2, 4.8, 2.0
Romania, 31.4, 68.6, 3.6, 14.3, 84.9, 11.2, 3.9
Slovenia, 34.4, 65.6, 3.0, 11.5, 88.5, 8.9, 2.6
Spain, 26.5, 73.5, 3.9, 14.8, 81.0, 15.9, 3.1
Switzerland, 34.4, 65.6, 4.0, 11.9, 86.7, 11.6, 1.7
Turkey, 36.9, 63.1, 5.3, 14.9, 66.7, 27.1, 6.2
United Kingdom, 33.2, 66.8, 4.0, 11.9, 91.7, 5.7, 2.6"

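# Read the embedded CSV text into a data frame, then close the connection
con <- textConnection(durexraw)
df <- read.csv(con, strip.white = TRUE)
close(con)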

The code below generates the first scatterplot:

# Scatterplot of country averages, with each point labelled by country
ggplot(df, aes(x = Age_at_first_sex_education,
               y = Mean_number_of_sexual_partners)) +
  geom_point(size = 3.5) +
  geom_text(aes(label = Country), hjust = -0.15, vjust = 0) +
  # Fix the x-axis range to ages 11 to 16
  scale_x_continuous(limits = c(11, 16)) +
  xlab("Age sex education") +
  ylab("Mean # sexual partners") +
  # The source is given in the caption so that it travels with the graph
  labs(title = "Sex education & sexual partners",
       caption = paste0("Source: Durex 2010 ''The Face of Global Sex 2010 - They\n",
                        "won’t know unless we tell them'', Appendix 1. ISSN: 1755-3075")) +
  theme_tufte() +
  theme(text = element_text(size = 18),
        plot.caption = element_text(size = 8),
        panel.background = element_rect(fill = "#666666"))

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggthemes_4.2.0 ggplot2_3.2.1  tufte_0.6     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4       knitr_1.28       magrittr_1.5     tidyselect_1.1.0
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.1         rlang_0.4.6     
##  [9] stringr_1.4.0    dplyr_1.0.0      tools_3.6.2      grid_3.6.2      
## [13] gtable_0.3.0     xfun_0.14        withr_2.2.0      htmltools_0.4.0 
## [17] yaml_2.2.1       lazyeval_0.2.2   digest_0.6.25    lifecycle_0.2.0 
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.3      vctrs_0.3.1     
## [25] glue_1.4.1       evaluate_0.14    rmarkdown_2.2    labeling_0.3    
## [29] stringi_1.4.6    compiler_3.6.2   pillar_1.4.2     generics_0.0.2  
## [33] scales_1.0.0     pkgconfig_2.0.3