`exploration.Rmd`

The `summarize` function in `dplyr`, especially when combined with `group_by` and `across`, provides powerful tools for exploring data using summary statistics. The `psyntur` package provides some wrappers to these tools to allow data exploration, albeit of a limited kind, to be done quickly and easily. We explore some of these functions in this vignette.

Load the `psyntur` functions and data sets with the usual `library` command.

```
library(psyntur)
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
```

`describe`

We can use the `describe` function in `psyntur` to calculate summary statistics. The first argument to `describe` should be the data frame. Subsequent arguments should be named arguments of summary statistics functions, like `mean`, `median`, etc., applied to any variables in the data frame. For example, using the `faithfulfaces` data frame, we can obtain the arithmetic mean and standard deviation of the `faithful` variable as follows.

```
describe(data = faithfulfaces, avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 1 × 2
#> avg stdev
#> <dbl> <dbl>
#> 1 5.14 0.957
```

We can apply the same or different functions to the same or different variables.

```
describe(data = faithfulfaces,
avg_faith = mean(faithful),
avg_trust = mean(trustworthy),
sd_trust = sd(trustworthy))
#> # A tibble: 1 × 3
#> avg_faith avg_trust sd_trust
#> <dbl> <dbl> <dbl>
#> 1 5.14 4.32 0.791
```

We can obtain the summary statistics for the chosen variables for each group of a third variable using the `by` argument.

```
describe(data = faithfulfaces, by = face_sex,
avg = mean(faithful), stdev = sd(faithful))
#> # A tibble: 2 × 3
#> face_sex avg stdev
#> <chr> <dbl> <dbl>
#> 1 female 5.55 0.802
#> 2 male 4.75 0.932
```
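Since `describe` is presented here as a wrapper around `dplyr` tools, it may help to compare the grouped call above with a plain `dplyr` pipeline. The following is a minimal sketch, assuming that a `by` variable behaves roughly like `group_by` followed by `summarise`; the `toy` data frame and its columns are made up for illustration and are not part of `psyntur`.

```r
library(dplyr)

# Hypothetical toy data, standing in for a data frame like faithfulfaces
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# Group by one variable, then compute named summary statistics per group
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(avg = mean(score), stdev = sd(score))

toy_summary
```

The result has one row per value of `group` and one column per named statistic, which is the same shape as the `describe` output above.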

The `by` argument may be a vector of variables. In this case, the chosen variables are grouped by the combination of the `by` variables. For example, in the following we group the `time` variable in `vizverb` by both `task` and `response`.

```
describe(vizverb, by = c(task, response),
avg = mean(time),
median = median(time),
iqr = IQR(time),
stdev = sd(time)
)
#> # A tibble: 4 × 6
#> task response avg median iqr stdev
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 verbal verbal 12.8 11.2 2.92 5.17
#> 2 verbal visual 13.7 13.5 4.96 3.98
#> 3 visual verbal 9.01 7.68 4.65 3.37
#> 4 visual visual 18.2 16.0 7.59 6.12
```

It would be tedious and repetitive to use `describe` as above if we wanted to apply the same set of summary statistic functions to a set of variables. Instead, we can use `describe_across`. For example, to calculate the mean, median, and standard deviation of two variables, `trustworthy` and `faithful`, in the `faithfulfaces` data set, we can do the following.

```
describe_across(faithfulfaces,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd)
)
#> # A tibble: 1 × 6
#> trustworthy_avg trustworthy_med… trustworthy_std… faithful_avg faithful_median
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4.32 4.24 0.791 5.14 5.24
#> # … with 1 more variable: faithful_stdev <dbl>
```
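Column names such as `trustworthy_avg` follow the `{variable}_{function}` pattern that `dplyr::across` produces by default when it is given a named list of functions. The sketch below shows that pattern with `dplyr` directly; it assumes, but does not confirm, that `describe_across` relies on `across` in this way, and the `toy` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data with two numeric variables
toy <- data.frame(x = c(1, 2, 6), y = c(10, 20, 60))

# Apply the same named list of functions to both variables
toy_wide <- toy %>%
  summarise(across(c(x, y), list(avg = mean, median = median, stdev = sd)))

names(toy_wide)
```

Each variable and function pair gets its own column, so the result is a single row with six columns.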

Note that the data frame that is returned is in a wide format, with one column for each combination of variable and function. We can pivot this to a longer format by setting `pivot = TRUE`.

```
describe_across(faithfulfaces,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd),
pivot = TRUE
)
#> # A tibble: 2 × 4
#> variable avg median stdev
#> <chr> <dbl> <dbl> <dbl>
#> 1 trustworthy 4.32 4.24 0.791
#> 2 faithful 5.14 5.24 0.957
```

We can use the `by` argument to calculate the summary statistics for each subgroup corresponding to each value of the `by` variable, as in the following example.

```
describe_across(faithfulfaces,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd),
by = face_sex,
pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.55 5.71 0.802
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.75 4.85 0.932
```

As in the case of `describe`, the `by` argument can be a vector of variables.

`_xna`

When a variable has `NA` values, most summary statistics functions will, by default, return `NA`. To illustrate this, we can modify `faithfulfaces` to contain `NA` values for the `faithful` variable, giving the data frame `faithfulfaces_na` used in the examples that follow.

Now, if we try one of the above `describe` or `describe_across` functions with the `faithful` variable, we will obtain corresponding `NA` values.

```
describe_across(faithfulfaces_na,
variables = c(trustworthy, faithful),
functions = list(avg = mean, median = median, stdev = sd),
by = face_sex,
pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful NA NA NA
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful NA NA NA
```
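The `NA` results above come from the default behaviour of the underlying base R functions, not from anything specific to `describe_across`. A minimal base R sketch, using a made-up vector and not the `faithfulfaces_na` data, shows the same propagation.

```r
# A toy vector with one missing value (not taken from faithfulfaces_na)
x <- c(4.2, NA, 5.1, 6.0)

# With default arguments, a single NA makes each statistic NA
avg <- mean(x)
med <- median(x)
stdev <- sd(x)
c(avg = avg, median = med, stdev = stdev)

# Counting the missing values is a quick check before summarising
n_missing <- sum(is.na(x))
n_missing
```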

Of course, if we set `na.rm = TRUE` in any or all of the summary functions, we will remove the `NA` values before the statistics are calculated. This is relatively easy to do with `describe`, as in the following example.

```
describe(data = faithfulfaces_na, by = face_sex,
  avg = mean(faithful, na.rm = TRUE), stdev = sd(faithful, na.rm = TRUE))
#> # A tibble: 2 × 3
#> face_sex avg stdev
#> <chr> <dbl> <dbl>
#> 1 female 5.11 0.606
#> 2 male 4.65 0.845
```

However, for `describe_across`, we pass in a list of functions, and so to set `na.rm = T`, we need to create `purrr` style anonymous functions that call the summary statistic function with `na.rm = T`, as in the following example.

```
library(purrr)
describe_across(faithfulfaces_na,
variables = c(trustworthy, faithful),
functions = list(avg = ~mean(., na.rm = T),
median = ~median(., na.rm = T),
stdev = ~sd(., na.rm = T)),
by = face_sex,
pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.11 5.26 0.606
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.65 4.82 0.845
```

Anonymous functions like these are not very transparent for those new to R, and the resulting code looks quite complex.
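For readers unfamiliar with the formula notation, it may help to see it next to an ordinary anonymous function. The sketch below uses `dplyr::across` directly on a hypothetical `toy` data frame; whether `describe_across` accepts both forms in the same way is an assumption that is not checked here.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(x = c(1, NA, 3))

# purrr-style formula: `.` stands for the column being summarised
with_formula <- toy %>%
  summarise(across(x, list(avg = ~mean(., na.rm = TRUE))))

# The same computation written as an ordinary anonymous function
with_function <- toy %>%
  summarise(across(x, list(avg = function(v) mean(v, na.rm = TRUE))))

with_formula$x_avg
with_function$x_avg
```

Both versions compute the mean of the non-missing values; the formula is only a more compact way of writing the function.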

In order to avoid using code like `~mean(., na.rm = T)`, for a number of commonly used summary statistic functions (`sum`, `mean`, `median`, `var`, `sd`, `IQR`), we have made counterparts where `na.rm` is set to `TRUE` by default. These functions have the same name as the original with the suffix `_xna` (but `IQR` is `iqr_xna`, not `IQR_xna`). As such, we can do the following.

```
describe_across(faithfulfaces_na,
variables = c(trustworthy, faithful),
functions = list(avg = mean_xna, median = median_xna, stdev = sd_xna),
by = face_sex,
pivot = TRUE
)
#> # A tibble: 4 × 5
#> face_sex variable avg median stdev
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 female trustworthy 4.44 4.29 0.742
#> 2 female faithful 5.11 5.26 0.606
#> 3 male trustworthy 4.21 4.18 0.822
#> 4 male faithful 4.65 4.82 0.845
```
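To make the idea behind the `_xna` functions concrete, here is a minimal sketch of how such a wrapper could be written in base R. The name `my_mean_xna` is hypothetical and chosen to avoid clashing with the package; this is an illustration of the idea, not the actual `psyntur` source.

```r
# Hypothetical helper: like mean(), but with na.rm fixed to TRUE
my_mean_xna <- function(...) mean(..., na.rm = TRUE)

x <- c(1, NA, 3)

plain <- mean(x)            # NA, because of the missing value
wrapped <- my_mean_xna(x)   # 2, the mean of the non-missing values
c(plain = plain, wrapped = wrapped)
```

Because the wrapper is an ordinary named function, it can be passed by name in a list of functions, just as `mean_xna` is passed above.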