`exploration.Rmd`

The `summarize` function in `dplyr`, especially when combined with
`group_by` and `across`, provides powerful tools for exploring data using
summary statistics. The `psyntur` package provides some wrappers to these
tools to allow data exploration, albeit of a limited kind, to be done quickly
and easily. We explore some of these functions in this vignette.
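As a point of reference, the following is a minimal sketch of the kind of `dplyr` pipeline that these wrappers are meant to shorten. The data frame `toy` and its columns `group` and `score` are made up for illustration and are not part of `psyntur`.

```r
library(dplyr)

# Toy data, made up for illustration (not a psyntur data set)
toy <- tibble(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# One row of summary statistics per value of `group`
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(avg = mean(score), stdev = sd(score))

toy_summary
```

Each named argument to `summarize` becomes a column in the result, which is the same convention that the `psyntur` functions below follow.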

Load the `psyntur` functions and data sets with the usual `library` command.

```
library(psyntur)
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
```

`describe`

The `describe` function in `psyntur` calculates summary statistics. The
first argument to `describe` should be the data frame. Subsequent arguments
should be named expressions that apply summary statistic functions, like
`mean`, `median`, etc., to any variables in the data frame. For example,
using the `faithfulfaces` data frame, we can obtain the arithmetic mean and
standard deviation of the `faithful` variable as follows.

```
describe(data = faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 1 × 2
#> avg stdev
#> <dbl> <dbl>
#> 1 5.14 0.957
```

We can apply the same or different functions to the same or different variables.

```
describe(data = faithfulfaces,
         avg_faith = mean(faithful),
         avg_trust = mean(trustworthy),
         sd_trust = sd(trustworthy))
#> # A tibble: 1 × 3
#> avg_faith avg_trust sd_trust
#> <dbl> <dbl> <dbl>
#> 1 5.14 4.32 0.791
```

We can obtain the summary statistics for the chosen variables separately for
each group defined by another variable by using the `by` argument.

```
describe(data = faithfulfaces, by = face_sex,
         avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 2 × 3
#> face_sex avg stdev
#> <chr> <dbl> <dbl>
#> 1 female 5.55 0.802
#> 2 male 4.75 0.932
```

The `by` argument may be a vector of variables. In this case, the chosen
variables are grouped by the combination of the `by` variables. For example,
in the following we group the `time` variable in `vizverb` by both `task` and
`response`.

```
describe(vizverb, by = c(task, response),
         avg = mean(time),
         median = median(time),
         iqr = IQR(time),
         stdev = sd(time)
)
#> # A tibble: 4 × 6
#> task response avg median iqr stdev
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 verbal verbal 12.8 11.2 2.92 5.17
#> 2 verbal visual 13.7 13.5 4.96 3.98
#> 3 visual verbal 9.01 7.68 4.65 3.37
#> 4 visual visual 18.2 16.0 7.59 6.12
```

It would be tedious and repetitive to use `describe` as above if we wanted to
apply the same set of summary statistic functions to a set of variables.
Instead, we can use `describe_across`. For example, to calculate the mean,
median, and standard deviation of two variables, `trustworthy` and `faithful`,
in the `faithfulfaces` data set, we can do the following.

```
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd)
)
#> # A tibble: 1 × 6
#> trustworthy_avg trustworthy_median trustworthy_stdev faithful_avg
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4.32 4.24 0.791 5.14
#> # ℹ 2 more variables: faithful_median <dbl>, faithful_stdev <dbl>
```

Note that the data frame that is returned is in a wide format. We can pivot
this to a longer format by setting `pivot = TRUE`.

```
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                pivot = TRUE
)
#> # A tibble: 2 × 4
#> variable avg median stdev
#> <chr> <dbl> <dbl> <dbl>
#> 1 trustworthy 4.32 4.24 0.791
#> 2 faithful 5.14 5.24 0.957
```
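To make the difference between the wide and the pivoted format concrete, here is a rough `dplyr` and `tidyr` sketch that produces both shapes. It is not necessarily how `describe_across` is implemented; the data frame `toy` and its columns are made up, and the split on `_` assumes that the variable names themselves contain no underscores.

```r
library(dplyr)
library(tidyr)

# Toy data, made up for illustration; column names contain no underscores
toy <- tibble(
  height = c(1, 2, 3, 6),
  weight = c(10, 20, 30, 40)
)

# Wide result: one column per variable/statistic pair,
# named <variable>_<statistic> by across()
wide <- toy %>%
  summarize(across(c(height, weight), list(avg = mean, median = median)))

# Long result: one row per variable, one column per statistic.
# ".value" tells pivot_longer to use the second name part as the column name.
long <- wide %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", ".value"),
    names_sep = "_"
  )

wide
long
```

The long format is usually easier to read once more than a couple of variables or statistics are involved, because the number of columns stays fixed while rows are added.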

We can use the `by` variable to calculate the summary statistics for each
subgroup corresponding to each value of the `by` variable, as in the following
example.

```
describe_across(faithfulfaces,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.55 5.71 0.802
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.75 4.85 0.932
```

As in the case of `describe`, the `by` argument can be a vector of variables.

`_xna`

When variables have `NA` values, most summary statistic functions will, by
default, return `NA`. To illustrate this, we can modify `faithfulfaces` to
contain `NA`s for the `faithful` variable. We call the modified data frame
`faithfulfaces_na`.
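As a minimal sketch of this kind of modification and its effect, the following uses a small made-up data frame rather than `faithfulfaces`. The names `toy` and `toy_na` are hypothetical, and the way the missing value is introduced here is only one of many possibilities.

```r
# Toy data frame, made up for illustration
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# Copy the data and replace one value with NA
toy_na <- toy
toy_na$score[2] <- NA

mean(toy$score)     # complete data: a number
mean(toy_na$score)  # one missing value: NA
sd(toy_na$score)    # also NA
```

A single missing value is enough for `mean` and `sd` to return `NA`, because by default they do not drop missing values before calculating.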

Now, if we try one of the above `describe` or `describe_across` functions
with the `faithful` variable, we will obtain corresponding `NA` values.

```
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean, median = median, stdev = sd),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful NA NA NA
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful NA NA NA
```

Of course, if we set `na.rm = TRUE` in any or all of the summary functions,
we will remove the `NA` values before the statistics are calculated. This is
relatively easy to do with `describe`, as in the following example.

```
describe(data = faithfulfaces_na, by = face_sex,
         avg = mean(faithful, na.rm = TRUE), stdev = sd(faithful, na.rm = TRUE))
#> # A tibble: 2 × 3
#> face_sex avg stdev
#> <chr> <dbl> <dbl>
#> 1 female 5.11 0.606
#> 2 male 4.65 0.845
```

However, for `describe_across`, we pass in a list of functions, and so to set
`na.rm = T`, we need to create `purrr` style anonymous functions that call the
summary statistic function with `na.rm = T`, as in the following example.

```
library(purrr)
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = ~mean(., na.rm = T),
                                 median = ~median(., na.rm = T),
                                 stdev = ~sd(., na.rm = T)),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.11 5.26 0.606
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.65 4.82 0.845
```

Anonymous functions like this are not very transparent for those new to R, and the resulting code looks quite complex.
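For readers unfamiliar with the formula shorthand, the sketch below writes the same idea as an ordinary R function. The name `avg_no_na` and the vector `x` are made up for illustration. Since `describe_across` accepts plain functions such as `mean` in its `functions` list, a function like this should be usable there too, though that is an assumption rather than something shown in this vignette.

```r
# Made-up vector with a missing value
x <- c(4, 8, NA)

# An ordinary function doing the same job as the formula ~mean(., na.rm = T):
# the argument `x` plays the role of the `.` placeholder
avg_no_na <- function(x) mean(x, na.rm = TRUE)

mean(x)       # NA, because of the missing value
avg_no_na(x)  # mean of the non-missing values 4 and 8
```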

In order to avoid using code like `~mean(., na.rm = T)`, for a number of
commonly used summary statistic functions (`sum`, `mean`, `median`, `var`,
`sd`, `IQR`), we have made counterparts where `na.rm` is set to `TRUE` by
default. These functions have the same name as the original with the suffix
`_xna` (but the counterpart of `IQR` is `iqr_xna`, not `IQR_xna`). As such, we
can do the following.

```
describe_across(faithfulfaces_na,
                variables = c(trustworthy, faithful),
                functions = list(avg = mean_xna, median = median_xna, stdev = sd_xna),
                by = face_sex,
                pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.11 5.26 0.606
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.65 4.82 0.845
```
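To give a sense of what such a counterpart might look like, here is a minimal sketch of wrappers with `na.rm = TRUE` as the default. The names `my_mean_xna` and `my_iqr_xna` are hypothetical stand-ins, and this is not claimed to be the actual `psyntur` source code.

```r
# Hypothetical stand-ins for wrappers such as mean_xna and iqr_xna:
# same behavior as the base function, but na.rm defaults to TRUE
my_mean_xna <- function(x, na.rm = TRUE, ...) mean(x, na.rm = na.rm, ...)
my_iqr_xna <- function(x, na.rm = TRUE, ...) IQR(x, na.rm = na.rm, ...)

# Made-up vector with a missing value
x <- c(1, 2, 3, 4, NA)

my_mean_xna(x)                 # missing value dropped before averaging
my_iqr_xna(x)                  # interquartile range of 1, 2, 3, 4
my_mean_xna(x, na.rm = FALSE)  # the default can still be overridden: NA
```

Because only the default changes, such wrappers can be passed by name in a `functions` list without any anonymous function syntax.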