# This fails R CMD check in a package
mtcars |> dplyr::mutate(new = mpg / wt) |> head(5)
#>> mpg cyl disp hp drat wt qsec vs am gear carb new
#>> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 8.015267
#>> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7.304348
#>> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.827586
#>> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.656299
#>> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5.436047
# Package-safe alternatives
mtcars |> dplyr::mutate(new = .data$mpg / .data$wt) |> head(5)
#>> mpg cyl disp hp drat wt qsec vs am gear carb new
#>> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 8.015267
#>> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7.304348
#>> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.827586
#>> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.656299
#>> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5.436047

1 Why “careful”?
As the title says: why “careful”? Using the tidyverse, or any other data-manipulation framework, is fine. Most introductory tidyverse tutorials teach you how to write code with it. Very few teach you when that nice-looking code becomes dangerous or unnecessarily painful six months later.
This short guide focuses on the second part — the things that experienced people wish they had known earlier.
Before we start, here are the packages I’ll be using:
- Packages that are part of the tidyverse: dplyr, readr, stringr, tibble, purrr, lubridate, and broom.
- Packages that are not tidyverse: box and numberize.
They will show up throughout the examples below.
2 Tidyverse is not just unquoted column names
You might think that what sets the tidyverse apart from other frameworks, e.g. Pandas, Polars, or data.table, is that it accepts column names without quoting them. That is part of it, and data.table does this as well, although the syntax and some of the APIs differ. But that is not the point I am trying to make here.
Why does this matter? Say you plan to use dplyr as one of your package dependencies. R CMD check will complain (“no visible binding for global variable”) if you use bare column names inside package code, so use .data$colname instead.
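As a minimal sketch, here is what that looks like inside a package function (the function name add_efficiency is made up for illustration; the .data pronoun comes from rlang, as recommended in the “Programming with dplyr” vignette):

#' @importFrom rlang .data
add_efficiency = function(df) {
  # .data$mpg and .data$wt are explicit column references, so R CMD check
  # no longer flags mpg and wt as undefined global variables
  dplyr::mutate(df, new = .data$mpg / .data$wt)
}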
3 Never trust straight data imports
Do not blindly import data (not just CSVs), especially when it wasn’t collected in one consistent pattern. For example, a CSV with “date” column(s) that don’t follow a standard format such as %Y-%m-%d. This is probably the #1 source of silent bugs in production data pipelines, especially if overlooked.
But first, one rule of thumb:
- If the data is messy in general, read all the columns as character (string) type first
box::use(readr[read_csv, cols, col_character])
# Dangerous (very common in real life)
read_csv("data.csv")
# Much safer minimum
read_csv(
  "data.csv",
  col_types = cols(
    # force everything as text first
    .default = col_character()
  )
)

Auto-guessing fails when:
- First 1000 rows look clean, row 1001 doesn’t
- Excel exported dates as text in weird formats
- IDs like “001234” get parsed as numbers, losing leading zeros (see the sketch after this list)
- A mix of "N/A", "null", "NULL", "-", and "nan" values that all mean missing
3.1 Synthetic example: Dealing with dates
Imagine you have this kind of data in a CSV, where the date/datetime column comes in like this:
sample_csv = "
date, readings, x, y, z\n
\"Mar.23,2005\",0.02,3.3,1.1,5.7\n
\"Feb.26,2005\",0.15,4.5,5.0,1.9\n
\"Apr.5,2005\",0.2,7.2,2.8,5.2\n
\"May.12,2005\", 0.5,4.9,1.3,6.8
"If you directly import this CSV data:
# A tibble: 4 × 5
date readings x y z
<chr> <dbl> <dbl> <dbl> <dbl>
1 Mar.23,2005 0.02 3.3 1.1 5.7
2 Feb.26,2005 0.15 4.5 5 1.9
3 Apr.5,2005 0.2 7.2 2.8 5.2
4 May.12,2005 0.5 4.9 1.3 6.8
The problem is that readr::read_csv() does not parse the "date" column as an actual date type as you might expect; it comes in as a character column instead.
The readr package provides an API for declaring column types explicitly.
Here’s how you do it:
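(The original chunk is not reproduced here; the following is a sketch of the kind of call that produces the output below, with the format string inferred from the sample data.)

box::use(readr[read_csv, cols, col_date, col_double])

read_csv(
  I(sample_csv),
  col_types = cols(
    # declare the date format instead of letting readr guess
    date = col_date(format = "%b.%d,%Y"),
    .default = col_double()
  )
)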
# A tibble: 4 × 5
date readings x y z
<date> <dbl> <dbl> <dbl> <dbl>
1 2005-03-23 0.02 3.3 1.1 5.7
2 2005-02-26 0.15 4.5 5 1.9
3 2005-04-05 0.2 7.2 2.8 5.2
4 2005-05-12 0.5 4.9 1.3 6.8
3.2 Synthetic example: Multiple messy date formats in one column
But what if the "date" column doesn’t follow a single pattern?
box::use(
readr[read_csv, cols, col_character],
dplyr[mutate],
lubridate[parse_date_time, as_date]
)
sample_csv = "
date, readings, x, y, z\n
\"Mar.23,2005\",0.02,3.3,1.1,5.7\n
\"Feb 26, 2005\",0.15,4.5,5.0,1.9\n
\"04-5-2005\",0.2,7.2,2.8,5.2\n
\"May12,05\", 0.5,4.9,1.3,6.8\n
\"Aug.17,2005\", 0.17,9.5,0.8,4.2
"
read_csv(
I(sample_csv),
col_types = cols(
date = col_character()
)
) |>
mutate(
date =
parse_date_time(
date,
orders = c(
"b.d,Y",
"b d, Y",
"m-d-Y",
"bd,y"
)
) |>
as_date()
)

# A tibble: 5 × 5
date readings x y z
<date> <dbl> <dbl> <dbl> <dbl>
1 2005-03-23 0.02 3.3 1.1 5.7
2 2005-02-26 0.15 4.5 5 1.9
3 2005-04-05 0.2 7.2 2.8 5.2
4 2005-05-12 0.5 4.9 1.3 6.8
5 2005-08-17 0.17 9.5 0.8 4.2
Unfortunately, readr::read_csv() cannot parse a column that mixes several date formats, so I have to take a detour: read the column as character first, then parse it with lubridate::parse_date_time() and a vector of candidate orders.
4 Use tidyverse aggressively for serious data cleaning
Now you know how easily things go wrong when importing data with a messed-up structure. Use the tidyverse for data cleaning, for goodness’ sake, trust me. Seriously, even if you can do data cleaning in other tools, e.g. Excel or Python/Pandas (I know some companies you work with will choose them), the tidyverse makes things much easier, more conventional, readable, and maintainable (arguably). Some practitioners actually use the tidyverse for modeling and then resort to base R loops or hacky solutions for cleaning. Don’t be that person.
Let us demonstrate with the Messy HR Data dataset from this repository.
hr_data = readr::read_csv(
"https://raw.githubusercontent.com/eyowhite/Messy-dataset/main/messy_HR_data.csv"
)

If you glimpse this data:
hr_data |> dplyr::glimpse()

Rows: 1,000
Columns: 10
$ Name <chr> "grace", "david", "hannah", "eve", "grace", "jack"…
$ Age <chr> "25", "nan", "35", "nan", "nan", "nan", "nan", "40…
$ Salary <chr> "50000", "65000", "SIXTY THOUSAND", "50000", "NAN"…
$ Gender <chr> "Male", "Female", "Female", "Female", "Female", "O…
$ Department <chr> "HR", "Finance", "Sales", "IT", "Finance", "Market…
$ Position <chr> "Manager", "Director", "Director", "Manager", "Man…
$ `Joining Date` <chr> "April 5, 2018", "2020/02/20", "01/15/2020", "Apri…
$ `Performance Score` <chr> "D", "F", "C", "A", "F", "F", "B", "C", "C", "A", …
$ Email <chr> "email@example.com", "user@domain.com", "email@exa…
$ `Phone Number` <chr> "nan", "123-456-7890", "098-765-4321", NA, "098-76…
You can see how bad it is, even if you would say it’s not “dirty enough”. Part of why the tidyverse keeps things “conventional” is that in R, missing values have their own representation, NA, rather than null, NULL, or NaN / nan.
Data cleaning in R with the tidyverse is easier than you might think:
box::use(
dplyr[mutate, across, everything, if_else, case_when],
stringr[detect = str_detect, capitalize = str_to_sentence],
lubridate[parse_date_time, as_date]
)
hr_data |>
mutate(
across(
everything(),
\(col) if_else(detect(col, "^(?i)nan$"), NA, col)
),
`Joining Date` = parse_date_time(
`Joining Date`, c(
"%B %d, %Y",
"%Y/%m/%d",
"%m-%d-%Y",
"%Y.%m.%d"
)
) |> as_date(),
Salary = numberize::numberize(Salary),
Name = capitalize(Name),
Age = numberize::numberize(Age)
)

# A tibble: 1,000 × 10
Name Age Salary Gender Department Position `Joining Date`
<chr> <dbl> <dbl> <chr> <chr> <chr> <date>
1 Grace 25 50000 Male HR Manager 2018-04-05
2 David NA 65000 Female Finance Director 2020-02-20
3 Hannah 35 60000 Female Sales Director 2020-01-15
4 Eve NA 50000 Female IT Manager 2018-04-05
5 Grace NA NA Female Finance Manager 2020-01-15
6 Jack NA 65000 Other Marketing Director 2019-03-25
7 Charlie NA 50000 Male Marketing Clerk 2019-12-01
8 Grace 40 50000 Other HR Director 2019-03-25
9 Hannah 40 60000 Female Marketing Manager 2020-01-15
10 Eve 30 NA Other Finance Assistant 2020-02-20
# ℹ 990 more rows
# ℹ 3 more variables: `Performance Score` <chr>, Email <chr>,
# `Phone Number` <chr>
See how easy this is? Because dplyr::mutate() accepts bare expressions along with bare column names, you can drop in other functions, such as numberize::numberize(), to translate spelled-out numbers, e.g. values like “thirty” in the Age column and “SIXTY THOUSAND” in the Salary column.
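For instance, relying on the behaviour already visible in the cleaned output above (Hannah’s Salary row), a standalone call looks like this:

# spelled-out number to numeric (same behaviour as in the Salary column above)
numberize::numberize("SIXTY THOUSAND")

which should return 60000, matching row 3 in the cleaned table.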
5 Use across() + <tidy-select> for safe type / role-based transformations
One of tidyverse’s most underrated features is across() combined with <tidy-select> helpers, commonly used in “data masking” functions. This lets you apply transformations to multiple columns based on their type or name pattern, making your code both DRY and maintainable.
5.1 Clean all text columns at once
box::use(
dplyr[mutate, across, where],
stringr[str_trim, str_to_upper]
)
survey = tibble::tibble(
resp_id = c(1, 2, 3),
name = c(" alice ", "BOB", " Charlie "),
city = c("New York ", " Boston", "Chicago "),
rating = c(4.2, 3.8, 4.5)
)
survey |>
mutate(
across(where(is.character), str_trim),
across(where(is.character) & !resp_id, str_to_upper)
)

# A tibble: 3 × 4
resp_id name city rating
<dbl> <chr> <chr> <dbl>
1 1 ALICE NEW YORK 4.2
2 2 BOB BOSTON 3.8
3 3 CHARLIE CHICAGO 4.5
The second across() shows combining conditions: all character columns except the ID field.
5.2 Scale financial columns
box::use(dplyr[mutate, across, starts_with])
financials = tibble::tibble(
company = c("TechCo", "RetailCorp"),
revenue_q1 = c(1200000, 850000),
revenue_q2 = c(1350000, 920000),
costs_q1 = c(800000, 600000),
costs_q2 = c(850000, 640000)
)
financials |>
mutate(
across(
starts_with("revenue"),
\(x) x / 1e6,
.names = "{.col}_M"
)
)

# A tibble: 2 × 7
company revenue_q1 revenue_q2 costs_q1 costs_q2 revenue_q1_M revenue_q2_M
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TechCo 1200000 1350000 800000 850000 1.2 1.35
2 RetailCorp 850000 920000 600000 640000 0.85 0.92
The .names argument controls output column names. When you add revenue_q3 later, it automatically gets converted too.
5.3 Conditional rounding
box::use(dplyr[mutate, across, where, any_of])
measurements = tibble::tibble(
sample_id = 1:4,
temperature = c(98.6234, 99.1456, 97.8921, 98.4567),
pressure = c(120.456, 118.234, 122.789, 119.123),
ph_level = c(7.4123, 7.3987, 7.4256, 7.4089)
)
measurements |>
  mutate(
    across(
      where(is.numeric) & !any_of("sample_id"),
      # use a lambda; passing extra arguments through across(...) is deprecated
      \(x) round(x, digits = 2)
    )
  )

# A tibble: 4 × 4
sample_id temperature pressure ph_level
<int> <dbl> <dbl> <dbl>
1 1 98.6 120. 7.41
2 2 99.2 118. 7.4
3 3 97.9 123. 7.43
4 4 98.5 119. 7.41
This pattern scales: add new numeric columns and they’re automatically rounded. No need to update the rounding code.
6 Join flavors in dplyr
Since the stable 1.1.x release of dplyr, join specifications are more flexible, more explicit, and more “controlled” thanks to dplyr::join_by(). The new API makes complex joins clearer to express, and it also makes them less error-prone.
6.1 Inequality joins for binning
Assign customer segments based on purchase totals:
box::use(dplyr[left_join, join_by, group_by, slice_max, ungroup])
purchases = tibble::tibble(
customer_id = 1:8,
total_spent = c(45, 280, 520, 95, 1200, 180, 650, 35)
)
segments = tibble::tibble(
segment = c("Bronze", "Silver", "Gold", "Platinum"),
min_spend = c(0, 100, 500, 1000)
)
purchases |>
left_join(segments, by = join_by(total_spent >= min_spend)) |>
group_by(customer_id) |>
slice_max(min_spend, n = 1, with_ties = FALSE) |>
ungroup()

# A tibble: 8 × 4
customer_id total_spent segment min_spend
<int> <dbl> <chr> <dbl>
1 1 45 Bronze 0
2 2 280 Silver 100
3 3 520 Gold 500
4 4 95 Bronze 0
5 5 1200 Platinum 1000
6 6 180 Silver 100
7 7 650 Gold 500
8 8 35 Bronze 0
The >= condition inside join_by() matches every tier a customer qualifies for, and slice_max() then keeps the highest one. Much clearer than manual conditional logic.
6.2 Range joins for time-based matching
Match transactions to active promotions:
box::use(
dplyr[tbl = tibble, inner_join, join_by, between, transmute],
lubridate[as_date]
)
transactions = tbl(
tx_id = 1:6,
tx_date = as_date(c(
"2024-01-20", "2024-02-15", "2024-03-25",
"2024-04-10", "2024-05-15", "2024-06-05"
)),
amount = c(150, 200, 180, 220, 175, 190)
)
promos = tbl(
promo = c("New Year", "Spring Sale", "Summer Blast"),
start = as_date(c("2024-01-01", "2024-03-01", "2024-05-01")),
end = as_date(c("2024-01-31", "2024-04-30", "2024-06-30")),
discount = c(0.15, 0.10, 0.20)
)
transactions |>
inner_join(
promos,
by = join_by(between(tx_date, start, end))
) |>
transmute(
tx_id,
tx_date,
promo,
amount,
discount,
savings = amount * discount
)

# A tibble: 5 × 6
tx_id tx_date promo amount discount savings
<int> <date> <chr> <dbl> <dbl> <dbl>
1 1 2024-01-20 New Year 150 0.15 22.5
2 3 2024-03-25 Spring Sale 180 0.1 18
3 4 2024-04-10 Spring Sale 220 0.1 22
4 5 2024-05-15 Summer Blast 175 0.2 35
5 6 2024-06-05 Summer Blast 190 0.2 38
Here’s the equivalent SQL code:
SELECT
tx_id,
tx_date,
promo,
amount,
discount,
amount * discount AS savings
FROM (
SELECT t.*, p.*
FROM transactions AS t
INNER JOIN promos AS p
ON (
t.tx_date >= p.start AND
t.tx_date <= p.end
)
) AS new_table

This beats manually checking date ranges with filter(). The join expresses intent directly.
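For comparison, a rough sketch of the manual version the range join replaces (assuming dplyr ≥ 1.1 for cross_join()):

box::use(dplyr[cross_join, filter])

# pair every transaction with every promo, then keep only the rows in range
transactions |>
  cross_join(promos) |>
  filter(tx_date >= start, tx_date <= end)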
7 Statistical pitfalls — don’t trust p-values blindly
Let’s say you have a task that examines the relationship between two variables. Linear regression is the usual first tool for that.
With the mtcars dataset, I want to examine the relationship between wt and mpg using lm():
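(The original chunk is not shown here; a sketch of the kind of call that produces the two outputs below.)

model = lm(mpg ~ wt, data = mtcars)

broom::tidy(model)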
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 37.3 1.88 19.9 8.24e-19
2 wt -5.34 0.559 -9.56 1.29e-10
broom::glance(model)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Running both the linear regression and a correlation test rejects the null hypothesis, implying a strong linear relationship between wt and mpg. Keep in mind, though, that leaning on “Null Hypothesis Significance Testing” (NHST) alone is a well-known weak point for reproducibility.
Let us examine the linear relationship by running the exact same model as above, this time within each cyl group:
box::use(dplyr[group_by, reframe])
mtcars |>
reframe(
{
box::use(
stats[linear_reg = lm, coef, pearson_r = cor],
purrr[imap_dfc, set_names],
tibble[tbl = tibble]
)
model = linear_reg(mpg ~ wt)
coefs = coef(model)
coef_table = imap_dfc(coefs, \(bi, nm) {
result = tbl(bi)
set_names(result, nm)
})
corr = pearson_r(wt, mpg)
test = summary(model)
tbl(
coef_table,
corr = corr,
rsq = test$r.squared,
adj_rsq = test$adj.r.squared
)
},
.by = cyl
)

  cyl (Intercept)        wt       corr       rsq   adj_rsq
1 6 28.40884 -2.780106 -0.6815498 0.4645102 0.3574122
2 4 39.57120 -5.647025 -0.7131848 0.5086326 0.4540362
3 8 23.86803 -2.192438 -0.6503580 0.4229655 0.3748793
Notice the problem? The 6-cylinder group, with only 7 observations, fails to reject the null hypothesis in its per-group model (p ≈ 0.09 from summary(model), not shown in the table above), despite having a correlation similar to the other groups. This isn’t because the relationship doesn’t exist – it’s because the sample size is too small to detect the effect reliably.
This demonstrates why:
- Sample size matters: The 6-cylinder group lacks statistical power
- Effect sizes differ: The slope varies substantially across groups (-2.19 to -5.65), suggesting the relationship isn’t uniform
That’s why, before conducting an analysis, it is better to run a power analysis or simulation first and make sure the sample size is large enough, as sketched below.
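As a quick illustration (not from the original analysis), here is a minimal power-simulation sketch. The true slope, the range of wt, and the noise level are assumptions chosen to be roughly in the ballpark of the 6-cylinder group:

box::use(stats[lm, rnorm, runif])

set.seed(1)

# simulate many small studies: n = 7, true slope -2.8, noisy mpg
detected = replicate(2000, {
  wt = runif(7, min = 2.5, max = 3.6)
  mpg = 28 - 2.8 * wt + rnorm(7, sd = 1.5)
  fit = summary(lm(mpg ~ wt))
  fit$coefficients["wt", "Pr(>|t|)"] < 0.05
})

# estimated power: the share of simulated studies that reach p < 0.05
mean(detected)

With such a small sample, the estimated power comes out far below the conventional 0.8 target, which is exactly the situation the 6-cylinder group is in.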
8 Remarks and Resources
Those are my “simple annotations” on fussy data work, which I hope will help you in your future jobs.
- R for Data Science
- Box: Placing module system into R
- box: Write Reusable, Composable and Modular R Code
- Statistical Rethinking
- dplyr 1.1.0 blog post - Deep dive on join_by()
- Programming with dplyr - When to use .data$ and { }