id | name | color |
---|---|---|
1 | floof | gray |
2 | max | black |
3 | cat | orange |
4 | donut | gray |
5 | merlin | black |
6 | panda | calico |
Day 3 - Introduction to Data Analysis with R
Freie Universität Berlin - Theoretical Ecology
March 14, 2025
Let’s look at some examples
Tidy
id | name | color |
---|---|---|
1 | floof | gray |
2 | max | black |
3 | cat | orange |
4 | donut | gray |
5 | merlin | black |
6 | panda | calico |
Non-tidy
floof | max | cat | donut | merlin | panda |
---|---|---|---|---|---|
gray | black | orange | gray | black | calico |
gray | black | orange | calico |
---|---|---|---|
floof | max | cat | panda |
donut | merlin |
Sometimes raw data is non-tidy because its structure is optimized for data entry or viewing rather than analysis.
The main advantages of tidy data is that the tidyverse
packages are built to work with it.
Let’s go back to the city data set from earlier:
cities <- c("Istanbul", "Moscow", "London", "Saint Petersburg", "Berlin", "Madrid", "Kyiv", "Rome", "Bucharest", "Paris")
population <- c(15.1e6, 12.5e6, 9e6, 5.4e6, 3.8e6, 3.2e6, 3e6, 2.8e6, 2.2e6, 2.1e6)
area_km2 <- c(2576, 2561, 1572, 1439, 891, 604, 839, 1285, 228, 105)
country <- c(
"Turkey", "Russia", "UK", "Russia", "Germany", "Spain",
"Ukraine", "Italy", "Romania", "France"
)
# tidy
cities_tbl <- tibble(
city_name = cities,
population_size = population,
city_area = area_km2,
country = country
)
cities_tbl
#> # A tibble: 10 × 4
#> city_name population_size city_area country
#> <chr> <dbl> <dbl> <chr>
#> 1 Istanbul 15100000 2576 Turkey
#> 2 Moscow 12500000 2561 Russia
#> 3 London 9000000 1572 UK
#> 4 Saint Petersburg 5400000 1439 Russia
#> 5 Berlin 3800000 891 Germany
#> 6 Madrid 3200000 604 Spain
#> 7 Kyiv 3000000 839 Ukraine
#> 8 Rome 2800000 1285 Italy
#> 9 Bucharest 2200000 228 Romania
#> 10 Paris 2100000 105 France
This already looks pretty tidy.
#> # A tibble: 2 × 11
#> type Turkey_Istanbul Russia_Moscow UK_London Russia_Saint Petersb…¹
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 population_size 15100000 12500000 9000000 5400000
#> 2 city_area 2576 2561 1572 1439
#> Germany_Berlin Spain_Madrid Ukraine_Kyiv Italy_Rome Romania_Bucharest
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3800000 3200000 3000000 2800000 2200000
#> 2 891 604 839 1285 228
#> # ℹ abbreviated name: ¹`Russia_Saint Petersburg`
#> # ℹ 1 more variable: France_Paris <dbl>
What’s not tidy here?
tidyr
pacakgeLet’s tidy this data using functions from the tidyr
package!
pivot_longer()
One variable split into multiple columns can be solved with pivot_longer
#> # A tibble: 2 × 11
#> type Turkey_Istanbul Russia_Moscow UK_London Russia_Saint Petersb…¹
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 population_size 15100000 12500000 9000000 5400000
#> 2 city_area 2576 2561 1572 1439
#> Germany_Berlin Spain_Madrid Ukraine_Kyiv Italy_Rome Romania_Bucharest
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3800000 3200000 3000000 2800000 2200000
#> 2 891 604 839 1285 228
#> # ℹ abbreviated name: ¹`Russia_Saint Petersburg`
#> # ℹ 1 more variable: France_Paris <dbl>
pivot_longer()
One variable split into multiple columns can be solved with pivot_longer
#> # A tibble: 20 × 3
#> type location value
#> <chr> <chr> <dbl>
#> 1 population_size Turkey_Istanbul 15100000
#> 2 population_size Russia_Moscow 12500000
#> 3 population_size UK_London 9000000
#> 4 population_size Russia_Saint Petersburg 5400000
#> # ℹ 16 more rows
pivot_longer()
One variable split into multiple columns can be solved with pivot_longer
Another way to select the columns to pivot:
separate_wider_delim()
Multiple variable values that are united into one can be separated using separate_wider_delim
#> # A tibble: 20 × 3
#> type location value
#> <chr> <chr> <dbl>
#> 1 population_size Turkey_Istanbul 15100000
#> 2 population_size Russia_Moscow 12500000
#> # ℹ 18 more rows
#> # A tibble: 20 × 4
#> type country city_name value
#> <chr> <chr> <chr> <dbl>
#> 1 population_size Turkey Istanbul 15100000
#> 2 population_size Russia Moscow 12500000
#> # ℹ 18 more rows
The opposite function exists as well and is called unite
. Check out ?unite
for details.
pivot_wider()
One observation split into multiple rows can solved with pivot_wider
#> # A tibble: 20 × 4
#> type country city_name value
#> <chr> <chr> <chr> <dbl>
#> 1 population_size Turkey Istanbul 15100000
#> 2 population_size Russia Moscow 12500000
#> # ℹ 18 more rows
#> # A tibble: 10 × 4
#> country city_name population_size city_area
#> <chr> <chr> <dbl> <dbl>
#> 1 Turkey Istanbul 15100000 2576
#> 2 Russia Moscow 12500000 2561
#> 3 UK London 9000000 1572
#> 4 Russia Saint Petersburg 5400000 1439
#> 5 Germany Berlin 3800000 891
#> # ℹ 5 more rows
We can also use a pipe to do all these steps in one:
drop_na()
Drop rows with missing values:
This is an easier and more intuitive alternative to filter(!is.na(...))
.
Task (30 min)
Tidy data with tidyr
Find the task description here
Selina Baldauf // Tidy data with tidyr