# This fails R CMD check in a package
mtcars |> dplyr::mutate(new = mpg / wt) |> head(5)
#>> mpg cyl disp hp drat wt qsec vs am gear carb new
#>> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 8.015267
#>> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7.304348
#>> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.827586
#>> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.656299
#>> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5.436047
# Package-safe alternatives
mtcars |> dplyr::mutate(new = .data$mpg / .data$wt) |> head(5)
#>> mpg cyl disp hp drat wt qsec vs am gear carb new
#>> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 8.015267
#>> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7.304348
#>> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.827586
#>> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.656299
#>> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5.436047

1 Why “careful”?
As the title says: why “careful”? Using the tidyverse, or any other data-manipulation framework, is fine. Most introductory tidyverse tutorials teach you how to write code with it. Very few teach you when that nice-looking code becomes dangerous or unnecessarily painful six months later.
This short guide focuses on the second part — the things that experienced people wish they had known earlier.
Before we start, here are the packages I’ll be using:
- Packages that are part of the tidyverse: dplyr, readr, stringr, tibble, purrr, lubridate, and broom.
- Packages that are not tidyverse: box and numberize.
They will show up throughout the examples below.
2 Tidyverse is not just unquoted column names
You might think that what sets the tidyverse apart from other frameworks, e.g. Pandas, Polars, or data.table, is that it accepts column names without quoting them. That is part of it, and data.table does this as well, although the syntax and some of the APIs differ. But that is not the point I am trying to make here.
Why does this matter? Say you plan to use dplyr as one of your package dependencies. R CMD check will complain (“no visible binding for global variable”) if you use bare column names inside package code, so use .data$colname instead.
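As a minimal sketch, here is what that looks like inside a package function (the function name add_efficiency is made up for illustration; the .data pronoun comes from rlang, as recommended in the “Programming with dplyr” vignette):

#' @importFrom rlang .data
add_efficiency = function(df) {
  # .data$mpg and .data$wt are explicit column references, so R CMD check
  # no longer flags mpg and wt as undefined global variables
  dplyr::mutate(df, new = .data$mpg / .data$wt)
}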
3 Never trust straight data imports
Do not blindly import data (not just CSVs), especially when it wasn’t collected in one consistent pattern. For example, a CSV with “date” column(s) that don’t follow a standard format such as %Y-%m-%d. This is probably the #1 source of silent bugs in production data pipelines, especially if overlooked.
But first, one rule of thumb:
- If the data is messy in general, read all the columns as character (string) type first
box::use(readr[read_csv, cols, col_character])
# Dangerous (very common in real life)
read_csv("data.csv")
# Much safer minimum
read_csv(
  "data.csv",
  col_types = cols(
    # force everything as text first
    .default = col_character()
  )
)

Auto-guessing fails when:
- First 1000 rows look clean, row 1001 doesn’t
- Excel exported dates as text in weird formats
- IDs like “001234” get parsed as numbers, losing leading zeros (see the sketch after this list)
- A mix of "N/A", "null", "NULL", "-", and "nan" values that all mean missing
3.1 Synthetic example: Dealing with dates
Imagine you have this kind of data in a CSV, where the date/datetime column comes in like this:
sample_csv = "
date, readings, x, y, z\n
\"Mar.23,2005\",0.02,3.3,1.1,5.7\n
\"Feb.26,2005\",0.15,4.5,5.0,1.9\n
\"Apr.5,2005\",0.2,7.2,2.8,5.2\n
\"May.12,2005\", 0.5,4.9,1.3,6.8
"If you directly import this CSV data:
# A tibble: 4 × 5
date readings x y z
<chr> <dbl> <dbl> <dbl> <dbl>
1 Mar.23,2005 0.02 3.3 1.1 5.7
2 Feb.26,2005 0.15 4.5 5 1.9
3 Apr.5,2005 0.2 7.2 2.8 5.2
4 May.12,2005 0.5 4.9 1.3 6.8
The problem is that readr::read_csv() does not parse the "date" column as an actual date type as you might expect; it comes in as a character column instead.
The readr package provides an API for declaring column types explicitly.
Here’s how you do it:
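(The original chunk is not reproduced here; the following is a sketch of the kind of call that produces the output below, with the format string inferred from the sample data.)

box::use(readr[read_csv, cols, col_date, col_double])

read_csv(
  I(sample_csv),
  col_types = cols(
    # declare the date format instead of letting readr guess
    date = col_date(format = "%b.%d,%Y"),
    .default = col_double()
  )
)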
# A tibble: 4 × 5
date readings x y z
<date> <dbl> <dbl> <dbl> <dbl>
1 2005-03-23 0.02 3.3 1.1 5.7
2 2005-02-26 0.15 4.5 5 1.9
3 2005-04-05 0.2 7.2 2.8 5.2
4 2005-05-12 0.5 4.9 1.3 6.8
3.2 Synthetic example: Multiple messy date formats in one column
But what if the "date" column doesn’t follow a single pattern?
box::use(
readr[read_csv, cols, col_character],
dplyr[mutate],
lubridate[parse_date_time, as_date]
)
sample_csv = "
date, readings, x, y, z\n
\"Mar.23,2005\",0.02,3.3,1.1,5.7\n
\"Feb 26, 2005\",0.15,4.5,5.0,1.9\n
\"04-5-2005\",0.2,7.2,2.8,5.2\n
\"May12,05\", 0.5,4.9,1.3,6.8\n
\"Aug.17,2005\", 0.17,9.5,0.8,4.2
"
read_csv(
I(sample_csv),
col_types = cols(
date = col_character()
)
) |>
mutate(
date =
parse_date_time(
date,
orders = c(
"b.d,Y",
"b d, Y",
"m-d-Y",
"bd,y"
)
) |>
as_date()
)

# A tibble: 5 × 5
date readings x y z
<date> <dbl> <dbl> <dbl> <dbl>
1 2005-03-23 0.02 3.3 1.1 5.7
2 2005-02-26 0.15 4.5 5 1.9
3 2005-04-05 0.2 7.2 2.8 5.2
4 2005-05-12 0.5 4.9 1.3 6.8
5 2005-08-17 0.17 9.5 0.8 4.2
Unfortunately, readr::read_csv() cannot parse a column that mixes several date formats, so I have to take a detour: read the column as character first, then parse it with lubridate::parse_date_time() and a vector of candidate orders.
4 Use tidyverse aggressively for serious data cleaning
Now you know how easily things go wrong when importing data with a messed-up structure. Use the tidyverse for data cleaning, for goodness’ sake, trust me. Seriously, even if you can do data cleaning in other tools, e.g. Excel or Python/Pandas (I know some companies you work with will choose them), the tidyverse makes things much easier, more conventional, readable, and maintainable (arguably). Some practitioners actually use the tidyverse for modeling and then resort to base R loops or hacky solutions for cleaning. Don’t be that person.
Let us demonstrate with the Messy HR Data dataset from this repository.
hr_data = readr::read_csv(
"https://raw.githubusercontent.com/eyowhite/Messy-dataset/main/messy_HR_data.csv"
)

If you glimpse this data:
hr_data |> dplyr::glimpse()

Rows: 1,000
Columns: 10
$ Name <chr> "grace", "david", "hannah", "eve", "grace", "jack"…
$ Age <chr> "25", "nan", "35", "nan", "nan", "nan", "nan", "40…
$ Salary <chr> "50000", "65000", "SIXTY THOUSAND", "50000", "NAN"…
$ Gender <chr> "Male", "Female", "Female", "Female", "Female", "O…
$ Department <chr> "HR", "Finance", "Sales", "IT", "Finance", "Market…
$ Position <chr> "Manager", "Director", "Director", "Manager", "Man…
$ `Joining Date` <chr> "April 5, 2018", "2020/02/20", "01/15/2020", "Apri…
$ `Performance Score` <chr> "D", "F", "C", "A", "F", "F", "B", "C", "C", "A", …
$ Email <chr> "email@example.com", "user@domain.com", "email@exa…
$ `Phone Number` <chr> "nan", "123-456-7890", "098-765-4321", NA, "098-76…
You can see how bad it is, even if you would say it’s not “dirty enough”. Part of why the tidyverse keeps things “conventional” is that in R, missing values have their own representation, NA, rather than null, NULL, or NaN / nan.
Data cleaning in R with the tidyverse is easier than you might think:
box::use(
dplyr[mutate, across, everything, if_else, case_when],
stringr[detect = str_detect, capitalize = str_to_sentence],
lubridate[parse_date_time, as_date]
)
hr_data |>
mutate(
across(
everything(),
\(col) if_else(detect(col, "^(?i)nan$"), NA, col)
),
`Joining Date` = parse_date_time(
`Joining Date`, c(
"%B %d, %Y",
"%Y/%m/%d",
"%m-%d-%Y",
"%Y.%m.%d"
)
) |> as_date(),
Salary = numberize::numberize(Salary),
Name = capitalize(Name),
Age = numberize::numberize(Age)
)

# A tibble: 1,000 × 10
Name Age Salary Gender Department Position `Joining Date`
<chr> <dbl> <dbl> <chr> <chr> <chr> <date>
1 Grace 25 50000 Male HR Manager 2018-04-05
2 David NA 65000 Female Finance Director 2020-02-20
3 Hannah 35 60000 Female Sales Director 2020-01-15
4 Eve NA 50000 Female IT Manager 2018-04-05
5 Grace NA NA Female Finance Manager 2020-01-15
6 Jack NA 65000 Other Marketing Director 2019-03-25
7 Charlie NA 50000 Male Marketing Clerk 2019-12-01
8 Grace 40 50000 Other HR Director 2019-03-25
9 Hannah 40 60000 Female Marketing Manager 2020-01-15
10 Eve 30 NA Other Finance Assistant 2020-02-20
# ℹ 990 more rows
# ℹ 3 more variables: `Performance Score` <chr>, Email <chr>,
# `Phone Number` <chr>
See how easy this is? Because dplyr::mutate() accepts bare expressions along with bare column names, you can drop in other functions, such as numberize::numberize(), to translate spelled-out numbers, e.g. values like “thirty” in the Age column and “SIXTY THOUSAND” in the Salary column.
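For instance, relying on the behaviour already visible in the cleaned output above (Hannah’s Salary row), a standalone call looks like this:

# spelled-out number to numeric (same behaviour as in the Salary column above)
numberize::numberize("SIXTY THOUSAND")

which should return 60000, matching row 3 in the cleaned table.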
5 Use across() + <tidy-select> for safe type / role-based transformations
One of tidyverse’s most underrated features is across() combined with <tidy-select> helpers, commonly used in “data masking” functions. This lets you apply transformations to multiple columns based on their type or name pattern, making your code both DRY and maintainable.
5.1 Clean all text columns at once
box::use(
dplyr[mutate, across, where],
stringr[str_trim, str_to_upper]
)
survey = tibble::tibble(
resp_id = c(1, 2, 3),
name = c(" alice ", "BOB", " Charlie "),
city = c("New York ", " Boston", "Chicago "),
rating = c(4.2, 3.8, 4.5)
)
survey |>
mutate(
across(where(is.character), str_trim),
across(where(is.character) & !resp_id, str_to_upper)
)

# A tibble: 3 × 4
resp_id name city rating
<dbl> <chr> <chr> <dbl>
1 1 ALICE NEW YORK 4.2
2 2 BOB BOSTON 3.8
3 3 CHARLIE CHICAGO 4.5
The second across() shows combining conditions: all character columns except the ID field.
5.2 Scale financial columns
box::use(dplyr[mutate, across, starts_with])
financials = tibble::tibble(
company = c("TechCo", "RetailCorp"),
revenue_q1 = c(1200000, 850000),
revenue_q2 = c(1350000, 920000),
costs_q1 = c(800000, 600000),
costs_q2 = c(850000, 640000)
)
financials |>
mutate(
across(
starts_with("revenue"),
\(x) x / 1e6,
.names = "{.col}_M"
)
)

# A tibble: 2 × 7
company revenue_q1 revenue_q2 costs_q1 costs_q2 revenue_q1_M revenue_q2_M
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TechCo 1200000 1350000 800000 850000 1.2 1.35
2 RetailCorp 850000 920000 600000 640000 0.85 0.92
The .names argument controls output column names. When you add revenue_q3 later, it automatically gets converted too.
5.3 Conditional rounding
box::use(dplyr[mutate, across, where, any_of])
measurements = tibble::tibble(
sample_id = 1:4,
temperature = c(98.6234, 99.1456, 97.8921, 98.4567),
pressure = c(120.456, 118.234, 122.789, 119.123),
ph_level = c(7.4123, 7.3987, 7.4256, 7.4089)
)
measurements |>
  mutate(
    across(
      where(is.numeric) & !any_of("sample_id"),
      # use a lambda; passing extra arguments through across(...) is deprecated
      \(x) round(x, digits = 2)
    )
  )

# A tibble: 4 × 4
sample_id temperature pressure ph_level
<int> <dbl> <dbl> <dbl>
1 1 98.6 120. 7.41
2 2 99.2 118. 7.4
3 3 97.9 123. 7.43
4 4 98.5 119. 7.41
This pattern scales: add new numeric columns and they’re automatically rounded. No need to update the rounding code.
6 Join flavors in dplyr
Since the stable 1.1.x release of dplyr, join specifications are more flexible, more explicit, and more “controlled” thanks to dplyr::join_by(). The new API makes complex joins clearer to express, and it also makes them less error-prone.
6.1 Inequality joins for binning
Assign customer segments based on purchase totals:
box::use(dplyr[left_join, join_by, group_by, slice_max, ungroup])
purchases = tibble::tibble(
customer_id = 1:8,
total_spent = c(45, 280, 520, 95, 1200, 180, 650, 35)
)
segments = tibble::tibble(
segment = c("Bronze", "Silver", "Gold", "Platinum"),
min_spend = c(0, 100, 500, 1000)
)
purchases |>
left_join(segments, by = join_by(total_spent >= min_spend)) |>
group_by(customer_id) |>
slice_max(min_spend, n = 1, with_ties = FALSE) |>
ungroup()

# A tibble: 8 × 4
customer_id total_spent segment min_spend
<int> <dbl> <chr> <dbl>
1 1 45 Bronze 0
2 2 280 Silver 100
3 3 520 Gold 500
4 4 95 Bronze 0
5 5 1200 Platinum 1000
6 6 180 Silver 100
7 7 650 Gold 500
8 8 35 Bronze 0
The >= condition inside join_by() matches every tier a customer qualifies for, and slice_max() then keeps the highest one. Much clearer than manual conditional logic.
6.2 Range joins for time-based matching
Match transactions to active promotions:
box::use(
dplyr[tbl = tibble, inner_join, join_by, between, transmute],
lubridate[as_date]
)
transactions = tbl(
tx_id = 1:6,
tx_date = as_date(c(
"2024-01-20", "2024-02-15", "2024-03-25",
"2024-04-10", "2024-05-15", "2024-06-05"
)),
amount = c(150, 200, 180, 220, 175, 190)
)
promos = tbl(
promo = c("New Year", "Spring Sale", "Summer Blast"),
start = as_date(c("2024-01-01", "2024-03-01", "2024-05-01")),
end = as_date(c("2024-01-31", "2024-04-30", "2024-06-30")),
discount = c(0.15, 0.10, 0.20)
)
transactions |>
inner_join(
promos,
by = join_by(between(tx_date, start, end))
) |>
transmute(
tx_id,
tx_date,
promo,
amount,
discount,
savings = amount * discount
)

# A tibble: 5 × 6
tx_id tx_date promo amount discount savings
<int> <date> <chr> <dbl> <dbl> <dbl>
1 1 2024-01-20 New Year 150 0.15 22.5
2 3 2024-03-25 Spring Sale 180 0.1 18
3 4 2024-04-10 Spring Sale 220 0.1 22
4 5 2024-05-15 Summer Blast 175 0.2 35
5 6 2024-06-05 Summer Blast 190 0.2 38
Here’s the equivalent SQL code:
SELECT
tx_id,
tx_date,
promo,
amount,
discount,
amount * discount AS savings
FROM (
SELECT t.*, p.*
FROM transactions AS t
INNER JOIN promos AS p
ON (
t.tx_date >= p.start AND
t.tx_date <= p.end
)
) AS new_table

This beats manually checking date ranges with filter(). The join expresses intent directly.
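For comparison, a rough sketch of the manual version the range join replaces (assuming dplyr ≥ 1.1 for cross_join()):

box::use(dplyr[cross_join, filter])

# pair every transaction with every promo, then keep only the rows in range
transactions |>
  cross_join(promos) |>
  filter(tx_date >= start, tx_date <= end)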
7 Statistical pitfalls — don’t trust p-values blindly
Let’s say you have a task that examines the relationship between two variables. Linear regression is the usual first tool for that.
With the mtcars dataset, I want to examine the relationship between wt and mpg using lm():
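(The original chunk is not shown here; a sketch of the kind of call that produces the two outputs below.)

model = lm(mpg ~ wt, data = mtcars)

broom::tidy(model)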
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 37.3 1.88 19.9 8.24e-19
2 wt -5.34 0.559 -9.56 1.29e-10
broom::glance(model)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Running both the linear regression and a correlation test rejects the null hypothesis, implying a strong linear relationship between wt and mpg. Keep in mind, though, that leaning on “Null Hypothesis Significance Testing” (NHST) alone is a well-known weak point for reproducibility.
Let us examine the linear relationship by running the exact same model as above, this time within each cyl group:
box::use(dplyr[group_by, reframe])
mtcars |>
reframe(
{
box::use(
stats[linear_reg = lm, coef, pearson_r = cor],
purrr[imap_dfc, set_names],
tibble[tbl = tibble]
)
model = linear_reg(mpg ~ wt)
coefs = coef(model)
coef_table = imap_dfc(coefs, \(bi, nm) {
result = tbl(bi)
set_names(result, nm)
})
corr = pearson_r(wt, mpg)
test = summary(model)
tbl(
coef_table,
corr = corr,
rsq = test$r.squared,
adj_rsq = test$adj.r.squared
)
},
.by = cyl
)

  cyl (Intercept)        wt       corr       rsq   adj_rsq
1 6 28.40884 -2.780106 -0.6815498 0.4645102 0.3574122
2 4 39.57120 -5.647025 -0.7131848 0.5086326 0.4540362
3 8 23.86803 -2.192438 -0.6503580 0.4229655 0.3748793
Notice the problem? The 6-cylinder group, with only 7 observations, fails to reject the null hypothesis in its per-group model (p ≈ 0.09 from summary(model), not shown in the table above), despite having a correlation similar to the other groups. This isn’t because the relationship doesn’t exist – it’s because the sample size is too small to detect the effect reliably.
This demonstrates why:
- Sample size matters: The 6-cylinder group lacks statistical power
- Effect sizes differ: The slope varies substantially across groups (-2.19 to -5.65), suggesting the relationship isn’t uniform
That’s why, before conducting an analysis, it is better to run a power analysis or simulation first and make sure the sample size is large enough, as sketched below.
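As a quick illustration (not from the original analysis), here is a minimal power-simulation sketch. The true slope, the range of wt, and the noise level are assumptions chosen to be roughly in the ballpark of the 6-cylinder group:

box::use(stats[lm, rnorm, runif])

set.seed(1)

# simulate many small studies: n = 7, true slope -2.8, noisy mpg
detected = replicate(2000, {
  wt = runif(7, min = 2.5, max = 3.6)
  mpg = 28 - 2.8 * wt + rnorm(7, sd = 1.5)
  fit = summary(lm(mpg ~ wt))
  fit$coefficients["wt", "Pr(>|t|)"] < 0.05
})

# estimated power: the share of simulated studies that reach p < 0.05
mean(detected)

With such a small sample, the estimated power comes out far below the conventional 0.8 target, which is exactly the situation the 6-cylinder group is in.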
8 Remarks and Resources
Those are my “simple annotations” on fussy data work, which I hope will help you in your future jobs.
- R for Data Science
- Box: Placing module system into R
- box: Write Reusable, Composable and Modular R Code
- Statistical Rethinking
- dplyr 1.1.0 blog post - Deep dive on join_by()
- Programming with dplyr - When to use .data$ and { }