Solution: Import and Export Data

Read a CSV file

library(tidyverse)

Read the tree data:

trees <- read_csv("data/tree_growth.csv")
trees
# A tibble: 15 × 6
   tree_id species height_m diameter_cm age_years site    
   <chr>   <chr>      <dbl>       <dbl>     <dbl> <chr>   
 1 T001    Oak         12.3        35.2        45 Forest_A
 2 T002    Beech       15.1        28.7        38 Forest_A
 3 T003    Pine        18.5        22.1        30 Forest_B
 4 T004    Oak          9.8        25.4        32 Forest_A
 5 T005    Birch       11.2        18.9        25 Forest_B
 6 T006    Pine        20.1        26.3        35 Forest_B
 7 T007    Beech       13.7        31.5        42 Forest_A
 8 T008    Oak         14.5        38.1        50 Forest_B
 9 T009    Birch        8.9        15.2        20 Forest_A
10 T010    Pine        16.8        24.7        28 Forest_B
11 T011    Oak         11.1        29.8        36 Forest_A
12 T012    Beech       17.3        33.6        48 Forest_B
13 T013    Birch       10.5        17.4        22 Forest_A
14 T014    Pine        19.2        25.8        33 Forest_B
15 T015    Oak         13.9        36.5        47 Forest_A

Explore the data:

summary(trees)
   tree_id            species             height_m      diameter_cm   
 Length:15          Length:15          Min.   : 8.90   Min.   :15.20  
 Class :character   Class :character   1st Qu.:11.15   1st Qu.:23.40  
 Mode  :character   Mode  :character   Median :13.90   Median :26.30  
                                       Mean   :14.19   Mean   :27.28  
                                       3rd Qu.:17.05   3rd Qu.:32.55  
                                       Max.   :20.10   Max.   :38.10  
   age_years        site          
 Min.   :20.0   Length:15         
 1st Qu.:29.0   Class :character  
 Median :35.0   Mode  :character  
 Mean   :35.4                     
 3rd Qu.:43.5                     
 Max.   :50.0                     

There are 4 tree species (Beech, Birch, Oak, Pine). The tallest tree is 20.1 m.

Calculate the mean height:

mean(trees$height_m)
[1] 14.19333

Read an Excel file

library(readxl)
birds <- read_excel("data/bird_observations.xlsx")
birds
# A tibble: 12 × 6
   observation_id species   count habitat date       temperature_c
   <chr>          <chr>     <dbl> <chr>   <chr>              <dbl>
 1 OBS001         Robin         5 Garden  2024-03-15          12.3
 2 OBS002         Blackbird     3 Forest  2024-03-15          11.8
 3 OBS003         Blue Tit      8 Garden  2024-03-16          13.1
 4 OBS004         Great Tit     4 Forest  2024-03-16          12.5
 5 OBS005         Robin         6 Meadow  2024-03-17          14.2
 6 OBS006         Sparrow      12 Garden  2024-03-17          13.8
 7 OBS007         Blackbird     2 Forest  2024-03-18          11  
 8 OBS008         Blue Tit      7 Garden  2024-03-18          12.9
 9 OBS009         Sparrow       9 Meadow  2024-03-19          15.1
10 OBS010         Great Tit     3 Forest  2024-03-19          13.4
11 OBS011         Robin         4 Garden  2024-03-20          14.7
12 OBS012         Blackbird     5 Meadow  2024-03-20          15.3
summary(birds)
 observation_id       species              count          habitat         
 Length:12          Length:12          Min.   : 2.000   Length:12         
 Class :character   Class :character   1st Qu.: 3.750   Class :character  
 Mode  :character   Mode  :character   Median : 5.000   Mode  :character  
                                       Mean   : 5.667                     
                                       3rd Qu.: 7.250                     
                                       Max.   :12.000                     
     date           temperature_c  
 Length:12          Min.   :11.00  
 Class :character   1st Qu.:12.45  
 Mode  :character   Median :13.25  
                    Mean   :13.34  
                    3rd Qu.:14.32  
                    Max.   :15.30  

Write data to a file

write_csv(trees, file = "data/trees_copy.csv")
Note

For this to work, the data/ directory must already exist in your project. Otherwise you will get an error.

Challenge: a slightly messy file

If you try to read the file directly, the metadata lines on top are treated as data:

read_csv("data/water_quality.csv")

Open the file in a text editor — you can see 4 lines of metadata before the actual data starts. Use skip to skip them:

water <- read_csv("data/water_quality.csv", skip = 4)
Rows: 12 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): sample_id, lake
dbl (4): depth_m, temperature_c, dissolved_oxygen_mg_l, ph

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
water
# A tibble: 12 × 6
   sample_id lake        depth_m temperature_c dissolved_oxygen_mg_l    ph
   <chr>     <chr>         <dbl>         <dbl>                 <dbl> <dbl>
 1 WQ001     Wannsee           1           8.2                  10.5   7.8
 2 WQ002     Wannsee           5           6.8                   8.3   7.5
 3 WQ003     Wannsee          10           5.1                   6.1   7.2
 4 WQ004     Mueggelsee        1           9                    11.2   8.1
 5 WQ005     Mueggelsee        5           7.3                   9     7.7
 6 WQ006     Mueggelsee       10           5.5                   5.8   7.3
 7 WQ007     Tegeler See       1           8.8                  10.8   7.9
 8 WQ008     Tegeler See       5           7                     8.7   7.6
 9 WQ009     Tegeler See      10           5.3                   6.5   7.4
10 WQ010     Wannsee           1           9.5                  10.1   7.7
11 WQ011     Mueggelsee        1          10.2                  11.5   8  
12 WQ012     Tegeler See       1           9.1                  10.3   7.8

Optional tasks

Challenge: an even messier file

This file has two problems:

  1. Metadata lines on top (4 lines to skip)
  2. The delimiter is ; instead of ,

So read_csv() won’t work even with skip. Use read_delim() instead:

soil <- read_delim(
  "data/soil_nutrients_messy.csv",
  delim = ";",
  skip = 4
)
Rows: 10 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (2): Plot ID, Soil Type
dbl (4): Nitrogen (mg/kg), Phosphorus (mg/kg), pH, Organic Matter (%)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
soil
# A tibble: 10 × 6
   `Plot ID` `Soil Type` `Nitrogen (mg/kg)` `Phosphorus (mg/kg)`    pH
   <chr>     <chr>                    <dbl>                <dbl> <dbl>
 1 P01       Sandy                     12.3                  4.5   6.2
 2 P02       Clay                      18.7                  7.2   5.8
 3 P03       Sandy                     10.1                  3.8  NA  
 4 P04       Loam                      15.9                  6.1   6.5
 5 P05       Clay                      20.3                 NA     5.5
 6 P06       Sandy                     11.8                  4.1   6  
 7 P07       Loam                      16.2                  5.9   6.3
 8 P08       Clay                      19.5                  7.8   5.7
 9 P09       Sandy                      9.4                  3.2   6.4
10 P10       Loam                      14.7                  5.5   6.1
# ℹ 1 more variable: `Organic Matter (%)` <dbl>

Clean messy column headers

Some column names have spaces and special characters. Use janitor::clean_names() to fix this:

# install.packages("janitor")
library(janitor)
soil <- clean_names(soil)
soil

The column names are now clean without spaces and special characters, all lowercase.