exploration.Rmd
The summarize function in dplyr, especially when combined with group_by and
across, provides powerful tools for exploring data using summary statistics.
The psyntur package provides some wrappers to these tools to
allow data exploration, albeit of a limited kind, to be done quickly and
easily. We explore some of these functions in this vignette.
Load the psyntur functions and data sets with the usual library command.
library(psyntur)
describe
We can use the describe function in psyntur to calculate summary statistics.
The first argument to describe should be the data frame. Subsequent arguments
should be named arguments, each one a call to a summary statistic function,
like mean, median, etc., applied to any variable in the data frame.
For example, using the faithfulfaces data frame, we can
obtain the arithmetic mean and standard deviation of the
faithful variable as follows.
describe(data = faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 1 × 2
#> avg stdev
#> <dbl> <dbl>
#> 1 5.14 0.957
We can apply the same or different functions to the same or different variables.
describe(data = faithfulfaces,
         avg_faith = mean(faithful),
         avg_trust = mean(trustworthy),
         sd_trust = sd(trustworthy))
#> # A tibble: 1 × 3
#> avg_faith avg_trust sd_trust
#> <dbl> <dbl> <dbl>
#> 1 5.14 4.32 0.791
We can obtain the summary statistics for the chosen variables for
each group defined by another variable using the by argument.
describe(data = faithfulfaces, by = face_sex,
         avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 2 × 3
#> face_sex avg stdev
#> <chr> <dbl> <dbl>
#> 1 female 5.55 0.802
#> 2 male 4.75 0.932
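For comparison, the grouped summary above can also be written directly in dplyr. The following is a minimal sketch of the kind of group_by and summarise pipeline that describe saves us from typing. It is not necessarily how describe is implemented internally, and the object names via_describe and via_dplyr are ours.

```r
library(psyntur)
library(dplyr)

# Grouped summary using the psyntur wrapper
via_describe <- describe(data = faithfulfaces, by = face_sex,
                         avg = mean(faithful), stdev = sd(faithful))

# The same summary written out with dplyr verbs
via_dplyr <- faithfulfaces %>%
  group_by(face_sex) %>%
  summarise(avg = mean(faithful), stdev = sd(faithful))

# Check that the two approaches agree
all.equal(via_describe$avg, via_dplyr$avg)
all.equal(via_describe$stdev, via_dplyr$stdev)
```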
The by argument may be a vector of variables. In this
case, the chosen variables are grouped by the combination of the
by variables. For example, in the following we group the
time variable in vizverb by both task and response.
describe(vizverb, by = c(task, response),
         avg = mean(time),
         median = median(time),
         iqr = IQR(time),
         stdev = sd(time)
)
#> # A tibble: 4 × 6
#> task response avg median iqr stdev
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 verbal verbal 12.8 11.2 2.92 5.17
#> 2 verbal visual 13.7 13.5 4.96 3.98
#> 3 visual verbal 9.01 7.68 4.65 3.37
#> 4 visual visual 18.2 16.0 7.59 6.12
It would be tedious and repetitive to use describe as
above if we wanted to apply the same set of summary statistic functions to
a set of variables. Instead, we can use describe_across.
For example, to calculate the mean, median, and standard deviation of two
variables, trustworthy and faithful, in the
faithfulfaces data set, we can do the following.
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd)
)
#> # A tibble: 1 × 6
#> trustworthy_avg trustworthy_median trustworthy_stdev faithful_avg
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4.32 4.24 0.791 5.14
#> # ℹ 2 more variables: faithful_median <dbl>, faithful_stdev <dbl>
Note that the data frame that is returned is in a wide format. We can
pivot this to a longer format by setting pivot = TRUE.
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                pivot = TRUE
)
#> # A tibble: 2 × 4
#> variable avg median stdev
#> <chr> <dbl> <dbl> <dbl>
#> 1 trustworthy 4.32 4.24 0.791
#> 2 faithful 5.14 5.24 0.957
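To make the relation between the wide and long formats concrete, the following sketch reshapes the wide result by hand with tidyr's pivot_longer. This is an illustration only and is not necessarily how pivot = TRUE works inside psyntur. It assumes that neither the variable names nor the function names contain an underscore, since the column names are split at that character. The object names wide and long are ours.

```r
library(psyntur)
library(tidyr)

# Wide result: one column per variable-statistic pair, e.g. trustworthy_avg
wide <- describe_across(faithfulfaces,
                        variables = c(trustworthy, faithful),
                        functions = list(avg = mean, median = median, stdev = sd))

# Split each column name at the underscore: the first part goes into the
# variable column, the second part becomes the name of a statistic column.
long <- pivot_longer(wide,
                     cols = everything(),
                     names_to = c("variable", ".value"),
                     names_sep = "_")
long
```

In practice, pivot = TRUE is the simpler route, and it does not rely on the assumption about underscores made in this sketch.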
We can use the by argument to calculate the summary
statistics for each subgroup corresponding to each value of the
by variable, as in the following example.
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.55 5.71 0.802
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.75 4.85 0.932
As in the case of describe, the by argument
can be a vector of variables.
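A short sketch of this, reusing the vizverb data set with its time, task, and response variables. The object name time_summary is ours, and the output is not shown here.

```r
library(psyntur)

# Mean and standard deviation of time for each
# combination of the task and response variables
time_summary <- describe_across(vizverb,
                                variables = time,
                                functions = list(avg = mean, stdev = sd),
                                by = c(task, response))
time_summary
```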
_xna
When variables have NA values, most summary statistic
functions will, by default, return NA. To illustrate this,
we can modify faithfulfaces to contain NA values
for the faithful variable, giving a new data frame, faithfulfaces_na.
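The code that creates the modified data frame is not shown in this text. One hypothetical way to build a data frame of this kind is sketched below, on the assumption that we simply replace every faithful value above an arbitrary cutoff of 6 with NA. Both the cutoff and the construction are ours, so the printed results in the rest of this section, which come from the vignette's own version of faithfulfaces_na, need not match what this sketch produces.

```r
library(psyntur)

# Hypothetical construction: the cutoff of 6 is an arbitrary choice
# made for illustration, not the rule used to produce the output below.
faithfulfaces_na <- faithfulfaces
faithfulfaces_na$faithful[faithfulfaces_na$faithful > 6] <- NA

# How many values were set to missing
sum(is.na(faithfulfaces_na$faithful))

# By default, mean() propagates missing values;
# with na.rm = TRUE they are dropped first
mean(faithfulfaces_na$faithful)
mean(faithfulfaces_na$faithful, na.rm = TRUE)
```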
Now, if we try one of the above describe or
describe_across functions with the faithful
variable, we will obtain corresponding NA values.
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful NA NA NA
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful NA NA NA
Of course, if we set na.rm = TRUE in any or all of the
summary functions, we will remove the NA values before the
statistics are calculated. This is relatively easy to do with
describe, as in the following example.
describe(data = faithfulfaces_na, by = face_sex,
         avg = mean(faithful, na.rm = TRUE), stdev = sd(faithful, na.rm = TRUE))
#> # A tibble: 2 × 3
#> face_sex avg stdev
#> <chr> <dbl> <dbl>
#> 1 female 5.11 0.606
#> 2 male 4.65 0.845
However, for describe_across, we pass in a list of
functions, and so to set na.rm = TRUE, we need to create
purrr-style anonymous functions that call the summary
statistic function with na.rm = TRUE, as in the following
example.
library(purrr)
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = ~mean(., na.rm = TRUE),
                                 median = ~median(., na.rm = TRUE),
                                 stdev = ~sd(., na.rm = TRUE)),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.11 5.26 0.606
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.65 4.82 0.845
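The purrr-style formulas in a call like the one above can also be written as ordinary R functions. The sketch below assumes that describe_across accepts any function in its functions list, as it does for mean and sd. It uses a small hypothetical data frame, toy, so that it runs on its own; the names toy and toy_summary are ours.

```r
library(psyntur)

# Hypothetical toy data with one missing score in group a
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                  score = c(1, 3, NA, 2, 4, 6))

# Ordinary anonymous functions in place of ~mean(., na.rm = TRUE) etc.
toy_summary <- describe_across(toy,
                               variables = score,
                               functions = list(avg = function(x) mean(x, na.rm = TRUE),
                                                stdev = function(x) sd(x, na.rm = TRUE)),
                               by = group)
toy_summary
```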
Anonymous functions such as ~mean(., na.rm = TRUE) are not very transparent
for those new to R, and the resulting code looks quite complex.
In order to avoid using code like this,
for a number of commonly used summary statistic functions
(sum, mean, median, var, sd, IQR), we have made
counterparts where na.rm is set to TRUE by
default. These functions have the same name as the original with the
suffix _xna (but the counterpart of IQR is iqr_xna,
not IQR_xna). As such, we can do the following.
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean_xna, median = median_xna, stdev = sd_xna),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.11 5.26 0.606
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.65 4.82 0.845
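To see what a counterpart of this kind might look like, here is a minimal sketch of a wrapper with na.rm fixed to TRUE. The name mean_na_rm is hypothetical, chosen so as not to clash with psyntur's mean_xna, and the definition illustrates the idea rather than reproducing psyntur's actual source.

```r
# Hypothetical helper: like mean(), but missing values are always dropped.
# Any further arguments (e.g. trim) are passed on to mean().
mean_na_rm <- function(x, ...) {
  mean(x, na.rm = TRUE, ...)
}

x <- c(1, 2, NA)
mean(x)        # NA, because the missing value propagates
mean_na_rm(x)  # 1.5, the mean of the two observed values
```

Because a helper like this is an ordinary named function, it can be placed in a functions list in the same way as mean or sd.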