Interpreting Histograms

University of Glasgow Q-Step centre

Thees F Spreckelsen


A histogram displays the distribution of a continuous variable. For a continuous variable:
- The value of the variable, e.g. age, can be any number (3; 4.88882; -55).
- The differences (“intervals”) between two values of the variable can be of any size (e.g. very small or very large).

An example is the plot below, using data from the Durex report The Face of Global Sex 2010:

Interpretation: Step 1: Interpreting histograms should always start with looking at the variables presented and the scales on which they are presented in the plot.

Sex education example:
The graph displays the mean age at which a person received sex education. The age values range from approx. 11 years to just above 15. The means represent country averages. The y-axis (the left-hand scale in the plot) shows the “Number of countries”. This is not a variable in the dataset; rather, the “count” or “frequency” was calculated for this plot. It ranges from zero to five countries.

Interpretation: Step 2: Only when we know what is displayed should we consider what we can conclude from the visualization. A histogram shows the number of observations (frequencies) for a given value, or range of values, of our variable of interest. This allows us to consider what the “most common value” is, what values exist, particularly the highest and lowest, and what “shape” this distribution takes, e.g. do all values appear equally often? (These questions relate to quantities such as the mode and median, the range and minimum/maximum, as well as skew and kurtosis.)

Sex education example:
The most common age to have sex education appears to be 12. For five countries this is the average age of first sex education. Notably, though, there are a large number of countries where the mean age is higher.
The distribution is not symmetric: most countries cluster at the lower average ages, with a tail towards the higher ages, so the distribution is right-skewed.
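The quantities mentioned in Step 2 can also be checked numerically. The following is a minimal sketch in base R, not part of the original analysis; the vector name `age` is chosen for illustration and simply repeats the Age_at_first_sex_education column from the data table in the Data & code section.

```r
# Country-level mean age at first sex education (15 countries),
# copied from the data table in the "Data & code" section.
age <- c(11.5, 11.7, 13.1, 12.1, 12.0, 15.3, 13.2, 12.1,
         12.9, 14.3, 11.5, 14.8, 11.9, 14.9, 11.9)

age_range  <- range(age)   # lowest and highest country average: 11.5 and 15.3
age_median <- median(age)  # middle value of the 15 averages: 12.1
age_mean   <- mean(age)    # arithmetic mean: 12.88

# A mean above the median is a rough sign of a right-skewed distribution.
age_mean > age_median      # TRUE
```

The mean lies above the median here, which agrees with the right skew read off the histogram.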

The shape of the distribution is the particular focus of histograms. (Most common values or ranges are better displayed by boxplots, or reported through descriptive statistics, e.g. medians and ranges.) Three concepts help with the interpretation of the shape:

There are numerical summaries of skew and kurtosis, but their meaning is best conveyed via a histogram.
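As a rough illustration of such numerical summaries, the sketch below computes moment-based skewness and kurtosis in base R. The function names `skewness_moment` and `kurtosis_moment` are made up for this example, and the formulas are the simple population-moment versions; statistical packages may apply small-sample corrections and so report slightly different numbers.

```r
# Country-level mean age at first sex education (15 countries),
# copied from the data table in the "Data & code" section.
age <- c(11.5, 11.7, 13.1, 12.1, 12.0, 15.3, 13.2, 12.1,
         12.9, 14.3, 11.5, 14.8, 11.9, 14.9, 11.9)

# Skewness: third central moment divided by the cubed standard deviation.
# Values above zero indicate a longer tail towards high values.
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Kurtosis: fourth central moment divided by the squared variance
# (a normal distribution has a value of about 3 on this scale).
kurtosis_moment <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

age_skew <- skewness_moment(age)  # positive for these data, i.e. right-skewed
age_kurt <- kurtosis_moment(age)
```

A single number such as `age_skew` confirms the direction of the skew, but it does not show where the countries cluster; that is what the histogram adds.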

Check: Any interpretation of a graph should check that data are represented “truthfully” (i.e. without distortion), that the data used are trustworthy (for that they should ideally be openly accessible), and whether the question or argument is actually answered or supported by the graph.
Open data: To lie with data is easy unless everyone can check for themselves. Ideally all data should be accessible; if you collect your own, make it open by depositing it, e.g. at https://osf.io.
Let’s start with the axes:

Sex education example:
The range of the axis (e.g. age 10 to age 16) is sensible as the data displayed are averages. If these were observations about individuals (e.g. a person’s number of sexual partners) this would not be the case, since it is possible to have no partners and the scale should thus start at zero.
The data for the graph come from the Durex report The Face of Global Sex 2010, which is freely available online. However, the raw individual survey data are not (to the author’s knowledge) openly available, and we thus cannot check whether the averages are correct.
The title is somewhat misleading since it does not make clear that the data displayed are for a specific set of countries, rather than for individuals.

Special check for histograms: Histograms display continuous variables where values can be any possible number, e.g. the average age of sex education is 11.9 in the UK and 11.5 in Austria.

The bars in the histogram, however, bring values like these together under one bar. For example, a bar representing one year would mean that everything between 11 and 12 is summarized by one bar. This process is called binning. Researchers can decide which range of values goes under a bar. Thus you should always consider whether the bins might misrepresent the distribution. The side-bar shows the above histogram with different bins (bin-width = 1.5 years).
Reading “bin-widths” from a histogram plot is guesswork; ideally the plot should carry a note stating the bin-width and the number of observations displayed.

Sex education example:
The histogram is displayed with a bin-width of 0.3 years, and thus gives considerable detail of the overall distribution.
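To see how much the choice of bins matters, the counts per bin can be computed directly. This is a sketch in base R, not the code behind the plots: the helper `bin_counts` is a hypothetical name, the bins are assumed to run from 10 to 16 (matching the x-axis of the plot), and ggplot2 places its bin edges differently by default, so individual bars may not match the figures exactly.

```r
# Country-level mean age at first sex education (15 countries),
# copied from the data table in the "Data & code" section.
age <- c(11.5, 11.7, 13.1, 12.1, 12.0, 15.3, 13.2, 12.1,
         12.9, 14.3, 11.5, 14.8, 11.9, 14.9, 11.9)

# Count how many countries fall into bins of a given width between 10 and 16.
# hist(plot = FALSE) only tabulates the data; each bin includes its upper edge.
bin_counts <- function(x, width) {
  hist(x, breaks = seq(10, 16, by = width), plot = FALSE)$counts
}

narrow <- bin_counts(age, 0.3)  # 20 narrow bins, the tallest holds 5 countries
wide   <- bin_counts(age, 1.5)  # 4 wide bins: 2 7 3 3

# Both versions contain all 15 countries, yet suggest different shapes.
sum(narrow) == sum(wide)        # TRUE
```

With the wide bins almost half of the countries end up in a single bar, and the gaps visible with the narrow bins disappear. This is why a note stating the bin-width belongs on the plot.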

Summary

There are thus six key elements for interpreting and checking histograms:

What to interpret:

What to check:

What to check particularly for histograms:

Doing your own histogram?

Make sure you have looked at each of the elements above when making your own plots. If one element is missing in your graph, or you cannot answer the corresponding questions with the graph (on its own), it is probably not trustworthy or easy to interpret.
A final note on data sources. Graphs are often re-used on Twitter, Instagram or the like without any of the information that a paper or other interpretation contains. Thus make the source part of the graph.


Data & code

The graphs above use the following two packages:
- ggplot2: H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
- ggthemes: Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

They are loaded with:

library(ggplot2)
library(ggthemes)

The plots are based on data from the first table in Appendix I of the Durex report The Face of Global Sex 2010. The raw data are below:

# Approach taken from Dirk Eddelbuettel: https://stackoverflow.com/a/38601375
durexraw <- "Country, Ever_had_sexual_intercourse_no, Ever_had_sexual_intercourse_yes, Mean_number_of_sexual_partners, Age_at_first_sex_education, Woman_can_become_pregnant_the_first_time_Agree, Woman_can_become_pregnant_the_first_time_Disagree, Woman_can_become_pregnant_the_first_time_Dk
Austria, 27.0, 73.0, 4.4, 11.5, 88.1, 8.7, 3.2
Belgium, 36.8, 63.2, 3.0, 11.7, 91.1, 6.9, 2.0
France, 34.4, 65.6, 3.8, 13.1, 92.9, 4.9, 2.2
Germany, 19.6, 80.4, 4.7, 12.1, 79.5, 17.2, 3.3
Hungary, 25.8, 74.2, 3.3, 12.0, 93.1, 4.6, 2.3
Italy, 30.5, 69.5, 3.9, 15.3, 74.9, 20.9, 4.2
Lithuania, 42.5, 57.5, 2.9, 13.2, 88.3, 8.2, 3.5
Netherlands, 29.6, 70.4, 3.6, 12.1, 91.6, 7.3, 1.1
Poland, 42.9, 57.1, 2.5, 12.9, 93.2, 4.8, 2.0
Romania, 31.4, 68.6, 3.6, 14.3, 84.9, 11.2, 3.9
Slovenia, 34.4, 65.6, 3.0, 11.5, 88.5, 8.9, 2.6
Spain, 26.5, 73.5, 3.9, 14.8, 81.0, 15.9, 3.1
Switzerland, 34.4, 65.6, 4.0, 11.9, 86.7, 11.6, 1.7
Turkey, 36.9, 63.1, 5.3, 14.9, 66.7, 27.1, 6.2
United Kingdom, 33.2, 66.8, 4.0, 11.9, 91.7, 5.7, 2.6"

df <- textConnection(durexraw)
df <- read.csv(df)

The code below generates the first histogram:

# Histogram of the country-level mean age at first sex education
ggplot(df, aes(x = Age_at_first_sex_education,
               # x = Mean_number_of_sexual_partners,
               label = Country)) +
  geom_histogram(binwidth = 0.3, colour = "black") +
  # geom_text(aes(label = Country), hjust = -0.15, vjust = 0) +
  scale_x_continuous(limits = c(10, 16)) +
  xlab("Mean age (years)") +
  ylab("Number of countries") +
  # The caption puts the data source and the bin-width on the graph itself
  labs(title = "When sex education happens \n across the world",
       caption = "Source: Durex 2010 ''The Face of Global Sex 2010 - They \n won’t know unless we tell them'', Appendix 1.  ISSN: 1755-3075 \n Histogram: binwidth = 0.3 years") +
  theme_tufte() +
  theme(text = element_text(size = 16),
        plot.caption = element_text(size = 8))

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggthemes_4.2.0 ggplot2_3.2.1  tufte_0.6     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4       knitr_1.28       magrittr_1.5     tidyselect_1.1.0
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.1         rlang_0.4.6     
##  [9] stringr_1.4.0    dplyr_1.0.0      tools_3.6.2      grid_3.6.2      
## [13] gtable_0.3.0     xfun_0.14        withr_2.2.0      htmltools_0.4.0 
## [17] yaml_2.2.1       lazyeval_0.2.2   digest_0.6.25    lifecycle_0.2.0 
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.3      vctrs_0.3.1     
## [25] glue_1.4.1       evaluate_0.14    rmarkdown_2.2    labeling_0.3    
## [29] stringi_1.4.6    compiler_3.6.2   pillar_1.4.2     generics_0.0.2  
## [33] scales_1.0.0     pkgconfig_2.0.3