A ‘careful’ and small intro guide to data science with ‘tidyverse’

Teaching you the things you can take advantage of, and the things that weren’t taught by some tutorials you (probably) already know
R
data-science
analytics
tidyverse
Author

Joshua Marie

Published

December 28, 2025

1 Why “careful”?

As you read in the title, why “careful”? Using the tidyverse, or any other data manipulation framework out there, is fine. Most introductory tidyverse tutorials teach you how to write code with it. Very few teach you when the nice-looking code becomes dangerous or unnecessarily painful six months later.

This short guide focuses on the second part — the things that experienced people wish they had known earlier.

Before we start, here are the packages I’ll be using:

  1. Packages that are part of the tidyverse. This includes the following specifics: dplyr, readr, tibble, stringr, purrr, and lubridate.

  2. Packages that are not part of the tidyverse. This includes the following specifics: box, numberize, and broom.

And they will be used much later on, loaded with box::use() as sketched below.
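Throughout this guide, functions are loaded with box::use() rather than library(). Here is a minimal sketch of that pattern; the exact functions listed are just an example, not a required set:

# box::use() attaches only the names you list, making each function's origin explicit
box::use(
    dplyr[mutate, across, everything],    # data-masking verbs
    readr[read_csv, cols],                # controlled CSV imports
    lubridate[parse_date_time, as_date]   # date parsing helpers
)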

2 Tidyverse is not just unquoted column names

You thought the reason the tidyverse differs from other frameworks, e.g. Pandas, Polars, or data.table, is that column arguments don’t have to be quoted? I mean, yes, and data.table does this as well, but the syntax and some of the APIs are not the same. That’s not what I am trying to convey here.

# This triggers a NOTE in R CMD check when used inside a package
mtcars |> dplyr::mutate(new = mpg / wt) |> head(5)
#>>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb      new
#>> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 8.015267
#>> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 7.304348
#>> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 9.827586
#>> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 6.656299
#>> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 5.436047

# Package-safe alternatives
mtcars |> dplyr::mutate(new = .data$mpg / .data$wt) |> head(5)
#>>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb      new
#>> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 8.015267
#>> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 7.304348
#>> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 9.827586
#>> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 6.656299
#>> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 5.436047

Why did I tell you this? Let’s say you’re planning to use dplyr (or another tidyverse package) as one of your package’s dependencies. Unfortunately, R CMD check will flag bare column names with a “no visible binding for global variable” NOTE, so use .data$colname instead.
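If that code lives inside a package, the .data pronoun itself also needs a visible binding. A minimal sketch, assuming a hypothetical function add_efficiency() and the usual roxygen2 workflow; the documented way is to import .data from rlang:

# Hypothetical package function: importing the .data pronoun from rlang gives
# R CMD check a real binding, so no "no visible binding" NOTE is emitted

#' @importFrom rlang .data
add_efficiency = function(df) {
    dplyr::mutate(df, new = .data$mpg / .data$wt)
}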

3 Never trust straight data imports

Do not just import data blindly (and this is not limited to CSVs), especially when the data wasn’t collected in one consistent pattern. For example, a CSV with “date” column(s) that don’t follow a standard format, e.g. %Y-%m-%d. This is probably the #1 source of silent bugs in production data pipelines, especially when it goes unnoticed.

But first, let me tell you something:

Note: When a CSV file has a messy format in general
  • If the data has a messy format in general, read all the columns as character (string) type first
box::use(readr[read_csv, cols, col_character])

# Dangerous (very common in real life)
read_csv("data.csv")

# Much safer minimum
read_csv(
    "data.csv",
    col_types = cols(
        # force everything as text first
        .default = col_character()
    )
)

Auto-guessing fails when (a defensive sketch follows this list):

  1. First 1000 rows look clean, row 1001 doesn’t
  2. Excel exported dates as text in weird formats
  3. IDs like “001234” get parsed as numbers (losing leading zeros)
  4. Mix of "N/A", "null", "NULL", "-", "nan" all mean missing
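Here is a defensive sketch for cases 3 and 4. The file name orders.csv and the columns order_id and amount are made up for illustration; the point is the na = and col_types arguments:

box::use(readr[read_csv, cols, col_character, col_double])

read_csv(
    "orders.csv",
    na = c("", "NA", "N/A", "null", "NULL", "-", "nan"),  # unify missing markers
    col_types = cols(
        order_id = col_character(),   # keep leading zeros like "001234"
        amount   = col_double(),
        .default = col_character()    # everything else stays text until inspected
    )
)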

3.1 Synthetic example: Dealing with dates

Here, imagine you have this kind of data in a CSV file, where the date/datetime column comes in a non-standard format:

sample_csv = "
date, readings, x, y, z\n
\"Mar.23,2005\",0.02,3.3,1.1,5.7\n
\"Feb.26,2005\",0.15,4.5,5.0,1.9\n
\"Apr.5,2005\",0.2,7.2,2.8,5.2\n
\"May.12,2005\", 0.5,4.9,1.3,6.8
"

If you directly import this CSV data:

readr::read_csv(I(sample_csv))
# A tibble: 4 × 5
  date        readings     x     y     z
  <chr>          <dbl> <dbl> <dbl> <dbl>
1 Mar.23,2005     0.02   3.3   1.1   5.7
2 Feb.26,2005     0.15   4.5   5     1.9
3 Apr.5,2005      0.2    7.2   2.8   5.2
4 May.12,2005     0.5    4.9   1.3   6.8

The problem arises when readr::read_csv() fails to parse the "date" column as an actual date type as you expect; instead, it is parsed as a character (string) column.

The readr package provides an API to explicitly declare each column’s type.

Here’s how you do it:

box::use(
    readr[read_csv, cols, col_date]
)

read_csv(
    I(sample_csv),
    col_types = cols(
        date = col_date("%b.%d,%Y")
    )
)
# A tibble: 4 × 5
  date       readings     x     y     z
  <date>        <dbl> <dbl> <dbl> <dbl>
1 2005-03-23     0.02   3.3   1.1   5.7
2 2005-02-26     0.15   4.5   5     1.9
3 2005-04-05     0.2    7.2   2.8   5.2
4 2005-05-12     0.5    4.9   1.3   6.8

3.2 Synthetic example: Multiple messy date formats in one column

But what if the 'date' column doesn’t follow a single pattern?

box::use(
    readr[read_csv, cols, col_character],
    dplyr[mutate], 
    lubridate[parse_date_time, as_date]
)

sample_csv = "
date, readings, x, y, z\n
\"Mar.23,2005\",0.02,3.3,1.1,5.7\n
\"Feb 26, 2005\",0.15,4.5,5.0,1.9\n
\"04-5-2005\",0.2,7.2,2.8,5.2\n
\"May12,05\", 0.5,4.9,1.3,6.8\n
\"Aug.17,2005\", 0.17,9.5,0.8,4.2
"

read_csv(
    I(sample_csv),
    col_types = cols(
        date = col_character()
    )
) |> 
    mutate(
        date = 
            parse_date_time(
                date,
                orders = c(
                    "b.d,Y",
                    "b d, Y",
                    "m-d-Y",
                    "bd,y"
                )
            ) |> 
            as_date()
    )
# A tibble: 5 × 5
  date       readings     x     y     z
  <date>        <dbl> <dbl> <dbl> <dbl>
1 2005-03-23     0.02   3.3   1.1   5.7
2 2005-02-26     0.15   4.5   5     1.9
3 2005-04-05     0.2    7.2   2.8   5.2
4 2005-05-12     0.5    4.9   1.3   6.8
5 2005-08-17     0.17   9.5   0.8   4.2

Unfortunately, as of now, readr::read_csv() cannot parse a column that mixes several date formats, so you have to take a small detour: read the column as character, then parse it with lubridate::parse_date_time().

4 Use tidyverse aggressively for serious data cleaning

Now you know how messy things can get when importing data with an inconsistent structure. Seriously, use the tidyverse for data cleaning, trust me. Even if you can do data cleaning with other tools, e.g. Excel or Python/pandas (I know some companies you work with will choose them), the tidyverse makes things much easier, more conventional, more readable, and more maintainable (arguably). Some practitioners actually use the tidyverse for modeling and then resort to base R loops or hacky solutions for cleaning. Don’t be that person.

Let us demonstrate with the Messy HR Data dataset from this repository.

hr_data = readr::read_csv(
    "https://raw.githubusercontent.com/eyowhite/Messy-dataset/main/messy_HR_data.csv"
)

If you glimpse this data:

hr_data |> dplyr::glimpse()
Rows: 1,000
Columns: 10
$ Name                <chr> "grace", "david", "hannah", "eve", "grace", "jack"…
$ Age                 <chr> "25", "nan", "35", "nan", "nan", "nan", "nan", "40…
$ Salary              <chr> "50000", "65000", "SIXTY THOUSAND", "50000", "NAN"…
$ Gender              <chr> "Male", "Female", "Female", "Female", "Female", "O…
$ Department          <chr> "HR", "Finance", "Sales", "IT", "Finance", "Market…
$ Position            <chr> "Manager", "Director", "Director", "Manager", "Man…
$ `Joining Date`      <chr> "April 5, 2018", "2020/02/20", "01/15/2020", "Apri…
$ `Performance Score` <chr> "D", "F", "C", "A", "F", "F", "B", "C", "C", "A", …
$ Email               <chr> "email@example.com", "user@domain.com", "email@exa…
$ `Phone Number`      <chr> "nan", "123-456-7890", "098-765-4321", NA, "098-76…

You see how bad it is, even if you’d say it’s not “dirty enough”. One reason the tidyverse keeps things “conventional” is that R has a single, dedicated representation for missing values, NA, rather than null, NULL, or NaN / "nan".
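To make that concrete: the literal string "nan" sprinkled through this data is not a missing value as far as R is concerned; only NA is:

# "nan" is just text; is.na() only recognizes the real NA value
x = c("25", "nan", NA)
is.na(x)
#> [1] FALSE FALSE  TRUE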

Data cleaning in R with the tidyverse is easier than you might think:

box::use(
    dplyr[mutate, across, everything, if_else, case_when],
    stringr[detect = str_detect, capitalize = str_to_sentence],
    lubridate[parse_date_time, as_date]
)

hr_data |> 
    mutate(
        across(
            everything(), 
            \(col) if_else(detect(col, "^(?i)nan$"), NA, col)
        ), 
        `Joining Date` = parse_date_time(
            `Joining Date`, c(
                "%B %d, %Y",
                "%Y/%m/%d", 
                "%m-%d-%Y",
                "%Y.%m.%d"
            )
        ) |> as_date(), 
        Salary = numberize::numberize(Salary),
        Name = capitalize(Name),
        Age = numberize::numberize(Age)
    )
# A tibble: 1,000 × 10
   Name      Age Salary Gender Department Position  `Joining Date`
   <chr>   <dbl>  <dbl> <chr>  <chr>      <chr>     <date>        
 1 Grace      25  50000 Male   HR         Manager   2018-04-05    
 2 David      NA  65000 Female Finance    Director  2020-02-20    
 3 Hannah     35  60000 Female Sales      Director  2020-01-15    
 4 Eve        NA  50000 Female IT         Manager   2018-04-05    
 5 Grace      NA     NA Female Finance    Manager   2020-01-15    
 6 Jack       NA  65000 Other  Marketing  Director  2019-03-25    
 7 Charlie    NA  50000 Male   Marketing  Clerk     2019-12-01    
 8 Grace      40  50000 Other  HR         Director  2019-03-25    
 9 Hannah     40  60000 Female Marketing  Manager   2020-01-15    
10 Eve        30     NA Other  Finance    Assistant 2020-02-20    
# ℹ 990 more rows
# ℹ 3 more variables: `Performance Score` <chr>, Email <chr>,
#   `Phone Number` <chr>

You see how easy this is? Because dplyr::mutate() accepts bare expressions together with bare column names, you can easily mix in other functions, like numberize::numberize(), to translate spelled-out values, e.g. "thirty" in the Age column and "SIXTY THOUSAND" in the Salary column.

5 Use across() + tidy-select helpers for safe type / role-based transformations

One of tidyverse’s most underrated features is across() combined with <tidy-select> helpers, commonly used in “data masking” functions. This lets you apply transformations to multiple columns based on their type or name pattern, making your code both DRY and maintainable.

5.1 Clean all text columns at once

box::use(
    dplyr[mutate, across, where],
    stringr[str_trim, str_to_upper]
)

survey = tibble::tibble(
    resp_id = c(1, 2, 3),
    name = c("  alice  ", "BOB", "  Charlie  "),
    city = c("New York  ", "  Boston", "Chicago  "),
    rating = c(4.2, 3.8, 4.5)
)

survey |>
    mutate(
        across(where(is.character), str_trim),
        across(where(is.character) & !resp_id, str_to_upper)
    )
# A tibble: 3 × 4
  resp_id name    city     rating
    <dbl> <chr>   <chr>     <dbl>
1       1 ALICE   NEW YORK    4.2
2       2 BOB     BOSTON      3.8
3       3 CHARLIE CHICAGO     4.5

The second across() shows combining conditions: all character columns except the ID field.

5.2 Scale financial columns

box::use(dplyr[mutate, across, starts_with])

financials = tibble::tibble(
    company = c("TechCo", "RetailCorp"),
    revenue_q1 = c(1200000, 850000),
    revenue_q2 = c(1350000, 920000),
    costs_q1 = c(800000, 600000),
    costs_q2 = c(850000, 640000)
)

financials |>
    mutate(
        across(
            starts_with("revenue"),
            \(x) x / 1e6,
            .names = "{.col}_M"
        )
    )
# A tibble: 2 × 7
  company    revenue_q1 revenue_q2 costs_q1 costs_q2 revenue_q1_M revenue_q2_M
  <chr>           <dbl>      <dbl>    <dbl>    <dbl>        <dbl>        <dbl>
1 TechCo        1200000    1350000   800000   850000         1.2          1.35
2 RetailCorp     850000     920000   600000   640000         0.85         0.92

The .names argument controls output column names. When you add revenue_q3 later, it automatically gets converted too.
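To see that claim in action, here is a quick sketch with a hypothetical revenue_q3 column; the across() call picks it up without any edit:

# The same transformation automatically covers the new revenue_q3 column
financials |>
    mutate(revenue_q3 = c(1500000, 980000)) |>
    mutate(
        across(starts_with("revenue"), \(x) x / 1e6, .names = "{.col}_M")
    )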

5.3 Conditional rounding

box::use(dplyr[mutate, across, where, any_of])

measurements = tibble::tibble(
    sample_id = 1:4,
    temperature = c(98.6234, 99.1456, 97.8921, 98.4567),
    pressure = c(120.456, 118.234, 122.789, 119.123),
    ph_level = c(7.4123, 7.3987, 7.4256, 7.4089)
)

measurements |>
    mutate(
        across(
            where(is.numeric) & !any_of("sample_id"),
            \(x) round(x, digits = 2)
        )
    )
# A tibble: 4 × 4
  sample_id temperature pressure ph_level
      <int>       <dbl>    <dbl>    <dbl>
1         1        98.6     120.     7.41
2         2        99.2     118.     7.4 
3         3        97.9     123.     7.43
4         4        98.5     119.     7.41

This pattern scales: add new numeric columns and they’re automatically rounded. No need to update the rounding code.

6 Join flavors in dplyr

Since the stable 1.1.x release of dplyr, join specifications are more flexible, more explicit, and more “controlled” thanks to dplyr::join_by(). The new API expresses complex join tasks more clearly, which also makes the operation less error-prone.

6.1 Inequality joins for binning

Assign customer segments based on purchase totals:

box::use(dplyr[left_join, join_by, group_by, slice_max, ungroup])

purchases = tibble::tibble(
    customer_id = 1:8,
    total_spent = c(45, 280, 520, 95, 1200, 180, 650, 35)
)

segments = tibble::tibble(
    segment = c("Bronze", "Silver", "Gold", "Platinum"),
    min_spend = c(0, 100, 500, 1000)
)

purchases |>
    left_join(segments, by = join_by(total_spent >= min_spend)) |>
    group_by(customer_id) |>
    slice_max(min_spend, n = 1, with_ties = FALSE) |>
    ungroup()
# A tibble: 8 × 4
  customer_id total_spent segment  min_spend
        <int>       <dbl> <chr>        <dbl>
1           1          45 Bronze           0
2           2         280 Silver         100
3           3         520 Gold           500
4           4          95 Bronze           0
5           5        1200 Platinum      1000
6           6         180 Silver         100
7           7         650 Gold           500
8           8          35 Bronze           0

The >= condition finds all matching tiers, then slice_max() keeps the highest one. Much clearer than manual conditional logic.

6.2 Range joins for time-based matching

Match transactions to active promotions:

box::use(
    dplyr[tbl = tibble, inner_join, join_by, between, transmute],
    lubridate[as_date]
)

transactions = tbl(
    tx_id = 1:6,
    tx_date = as_date(c(
        "2024-01-20", "2024-02-15", "2024-03-25",
        "2024-04-10", "2024-05-15", "2024-06-05"
    )),
    amount = c(150, 200, 180, 220, 175, 190)
)

promos = tbl(
    promo = c("New Year", "Spring Sale", "Summer Blast"),
    start = as_date(c("2024-01-01", "2024-03-01", "2024-05-01")),
    end = as_date(c("2024-01-31", "2024-04-30", "2024-06-30")),
    discount = c(0.15, 0.10, 0.20)
)

transactions |>
    inner_join(
        promos,
        by = join_by(between(tx_date, start, end))
    ) |>
    transmute(
        tx_id,
        tx_date,
        promo, 
        amount,
        discount, 
        savings = amount * discount
    )
# A tibble: 5 × 6
  tx_id tx_date    promo        amount discount savings
  <int> <date>     <chr>         <dbl>    <dbl>   <dbl>
1     1 2024-01-20 New Year        150     0.15    22.5
2     3 2024-03-25 Spring Sale     180     0.1     18  
3     4 2024-04-10 Spring Sale     220     0.1     22  
4     5 2024-05-15 Summer Blast    175     0.2     35  
5     6 2024-06-05 Summer Blast    190     0.2     38  

Here’s the equivalent SQL code:

SELECT
    tx_id,
    tx_date,
    promo,
    amount,
    discount,
    amount * discount AS savings
FROM (
    SELECT t.*, p.*
    FROM transactions AS t
    INNER JOIN promos AS p
        ON (
            t.tx_date >= p.start AND
            t.tx_date <= p.end
        )
) AS new_table

This beats manually checking date ranges with filter(). The join expresses intent directly.
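For contrast, here is a rough sketch of the manual approach the join replaces: a cross join followed by filter(). It produces the same matches but buries the intent inside the filter conditions:

box::use(dplyr[cross_join, filter])

# Every transaction paired with every promo, then trimmed down by date range
transactions |>
    cross_join(promos) |>
    filter(tx_date >= start, tx_date <= end)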

7 Statistical pitfalls — don’t trust p-values blindly

Let’s say you have a task that examines the relationship between two variables, and linear regression is your go-to tool.

With the mtcars dataset, I want to examine the relationship between wt and mpg using lm():

model = lm(mpg ~ wt, data = mtcars)
broom::tidy(model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.3      1.88      19.9  8.24e-19
2 wt             -5.34     0.559     -9.56 1.29e-10
broom::glance(model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.753         0.745  3.05      91.4 1.29e-10     1  -80.0  166.  170.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Running the linear regression (and the corresponding correlation test) rejects the null hypothesis, implying a strong linear relationship between wt and mpg. Keep in mind, though, that leaning on Null Hypothesis Significance Testing (NHST) alone is a known weak spot for reproducibility.

Let us examine the linear relationship by running the exact same model as above, this time within each cyl group:

box::use(dplyr[group_by, reframe])

mtcars |> 
    reframe(
        {         
            box::use(
                stats[linear_reg = lm, coef, pearson_r = cor], 
                purrr[imap_dfc, set_names], 
                tibble[tbl = tibble]
            )
            
            model = linear_reg(mpg ~ wt)
            coefs = coef(model)
            coef_table = imap_dfc(coefs, \(bi, nm) {
                result = tbl(bi)
                set_names(result, nm)
            })
            
            corr = pearson_r(wt, mpg)
            
            test = summary(model)
            tbl(
                coef_table, 
                corr = corr, 
                rsq = test$r.squared,
                adj_rsq = test$adj.r.squared
            )
        },
        
        .by = cyl
    )
  cyl (Intercept)        wt       corr       rsq   adj_rsq
1   6    28.40884 -2.780106 -0.6815498 0.4645102 0.3574122
2   4    39.57120 -5.647025 -0.7131848 0.5086326 0.4540362
3   8    23.86803 -2.192438 -0.6503580 0.4229655 0.3748793
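The table above reports slopes, correlations, and R-squared, but no p-values. Here is a minimal sketch for pulling the per-group slope p-values referenced below, assuming broom is available and dplyr >= 1.1 (for pick()):

box::use(
    dplyr[reframe, filter, pick, everything],
    broom[tidy],
    stats[lm]
)

mtcars |>
    reframe(
        # fit mpg ~ wt on the current cyl group's rows only
        tidy(lm(mpg ~ wt, data = pick(everything()))),
        .by = cyl
    ) |>
    filter(term == "wt")   # keep the slope row, with its p.value, per group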

Notice the problem? The 6-cylinder group, with only 7 observations, fails to reject the null hypothesis (p ≈ 0.092 for the wt slope), despite having a correlation similar to the other groups. This isn’t because the relationship doesn’t exist; it’s because the sample size is too small to detect the effect reliably.

This demonstrates why:

  • Sample size matters: The 6-cylinder group lacks statistical power
  • Effect sizes differ: The slope varies substantially across groups (-2.19 to -5.65), suggesting the relationship isn’t uniform

That’s why, before conducting an analysis like this, it is better to run a simulation and make sure the sample size is large enough to detect the effect you care about.
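Here is a minimal power-simulation sketch. The data-generating assumptions (true slope, noise level, predictor range) are made up for illustration, not estimated from mtcars; tune them to your own problem:

box::use(stats[lm, rnorm, runif], purrr[map_lgl])

# Estimate the probability of detecting a slope of -3 at alpha = 0.05 for a given n
power_at_n = function(n, slope = -3, sigma = 3, n_sims = 1000) {
    rejections = map_lgl(seq_len(n_sims), \(i) {
        wt = runif(n, 1.5, 5.5)
        mpg = 35 + slope * wt + rnorm(n, sd = sigma)
        summary(lm(mpg ~ wt))$coefficients["wt", "Pr(>|t|)"] < 0.05
    })
    mean(rejections)   # share of simulated datasets where the slope is detected
}

power_at_n(7)    # roughly the 6-cylinder group's sample size
power_at_n(30)   # a more comfortable sample size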

8 Remarks and Resources

Those are all my “simple annotations” on fussy data work, which will hopefully help you in your future jobs.