Summarising data
data distribution: what the data looks like (which values occur and how often)
a good summary shows location, spread, range, extremes, gaps/holes, symmetry, etc.
Graphical summaries
Frequency distribution (table)
| Grade | Frequency |
|-------|-----------|
| 5     | 2         |
| 6     | 1         |
| 7     | 3         |
| 8     | 2         |
| 9     | 1         |
| 10    | 2         |
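A frequency table like this can be built by counting how often each value occurs. A minimal sketch, assuming the raw data is the list of 11 grades implied by the table (the variable names are only illustrative):

```python
from collections import Counter

# Raw grades reconstructed from the frequency table (11 observations).
grades = [5, 5, 6, 7, 7, 7, 8, 8, 9, 10, 10]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(grades)

print("Grade | Frequency")
for grade in sorted(frequency):
    print(f"{grade:>5} | {frequency[grade]}")
```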
Bar chart
Pareto bar chart
bar chart with the categories ordered by frequency, highest first. only for the nominal level of measurement (ordinal categories already have their own order)
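The ordering step of a Pareto chart can be sketched in a few lines. The transport categories below are made-up example data, not from these notes, and the text bars only stand in for a real plot:

```python
from collections import Counter

# Hypothetical nominal data (invented for illustration).
transport = ["bike", "car", "bike", "train", "bike", "car"]

# most_common() returns (category, frequency) pairs, highest frequency first,
# which is exactly the left-to-right order of bars in a Pareto chart.
ordered = Counter(transport).most_common()

for category, count in ordered:
    print(f"{category:<6} {'#' * count} ({count})")
```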
Pie chart
size of each slice shows the relative frequency of its category.
Histogram
for quantitative data grouped into classes (intervals): height of each bar shows the frequency of that class.
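Counting values per class is the computation behind a histogram. A minimal sketch, assuming equal-width classes that include their lower edge and exclude their upper edge; the function name `histogram_counts` and the measurements are hypothetical:

```python
def histogram_counts(values, low, width, n_bins):
    """Count how many values fall in each class [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)   # which class the value belongs to
        if 0 <= index < n_bins:           # values outside the classes are skipped
            counts[index] += 1
    return counts


# Hypothetical measurements (invented for illustration).
measurements = [1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.0, 4.4, 5.9, 7.5]

# Four classes of width 2: [0, 2), [2, 4), [4, 6), [6, 8)
counts = histogram_counts(measurements, low=0, width=2, n_bins=4)

for i, count in enumerate(counts):
    print(f"[{2 * i}, {2 * i + 2}) {'#' * count}")
```

Changing the class width changes the shape of the histogram, so it is worth trying a few widths before drawing conclusions about the distribution.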
Time series
shows quantity that varies over time.
Descriptive summaries
qualitative description:
- shape: overall form of the graph (e.g. symmetric or skewed)
- location: position on the x-axis (around 0, around 10, etc.)
- dispersion: how spread out the graph is (spread-out graph == large dispersion)
numerical description:
- location: measure of center
- mean: average (sum everything, divide by the total number)
- median: sort, find the middle number (with an even number of values, the mean of the two middle numbers)
- mode: most often occurring value (highest frequency)
- unimodal: unique mode
- bimodal: two modes
- multimodal: more than two modes
- dispersion:
- measures of variation
- sample standard deviation (how much values deviate from mean)
- same units as data (unlike variance)
- standard deviation is $s = \sqrt{s^{2}}$, the square root of the sample variance $s^{2}$
- $s^{2} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}$
- for a population: variance σ², standard deviation σ
- range
- (maximum - minimum)
- sensitive to extreme values
- relative standing
- percentiles, quartiles (special percentiles: Q1, Q2 (median), Q3)
- IQR: interquartile range = (Q3 - Q1)
- 5-number summary: min, Q1, median (Q2), Q3, max
- boxplot is graph of this
- whiskers are lines from the box to the most extreme data points that are (by default) not more than 1.5 × IQR away from the box
- outliers: points beyond the whiskers
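The numerical measures above can be checked on a small data set. A minimal sketch using Python's standard `statistics` module, assuming the data is the list of grades from the frequency table. Quartile conventions differ between textbooks and software; this uses the module's default ("exclusive") method, so other tools may give slightly different Q1 and Q3.

```python
import math
import statistics

# Grades reconstructed from the frequency table earlier in these notes.
grades = [5, 5, 6, 7, 7, 7, 8, 8, 9, 10, 10]
n = len(grades)

# Location: measures of center
mean = statistics.mean(grades)
median = statistics.median(grades)
modes = statistics.multimode(grades)   # list of all most frequent values

# Dispersion: sample variance divides by n - 1, as in the formula for s^2
manual_variance = sum((x - mean) ** 2 for x in grades) / (n - 1)
sample_variance = statistics.variance(grades)
sample_std = statistics.stdev(grades)  # square root of the sample variance
data_range = max(grades) - min(grades)

# Relative standing: quartiles, IQR and the 5-number summary
q1, q2, q3 = statistics.quantiles(grades, n=4)
iqr = q3 - q1
five_number_summary = (min(grades), q1, q2, q3, max(grades))

# Boxplot rule: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in grades if x < lower_fence or x > upper_fence]

print("mean:", round(mean, 2), "median:", median, "modes:", modes)
print("s^2:", round(sample_variance, 2), "s:", round(sample_std, 2))
print("range:", data_range, "IQR:", iqr)
print("5-number summary:", five_number_summary)
print("outliers:", outliers)
```

Here the data has a single mode, so it is unimodal, and no grade lies beyond the fences, so a boxplot of it would show no outlier points.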