Data Cleaning & EDA: Palmer Penguins

Chinda Clinton
Project Overview

This project conducts exploratory data analysis (EDA) on the Palmer Penguins dataset, showcasing essential data analyst skills: data cleaning, descriptive statistics, and visualization in R. The aim is to identify patterns in species counts, body mass, and sex differences, and to examine relationships between variables such as flipper length and body mass. Such analysis can aid ecological studies, wildlife conservation, or educational resources.

Methods / Approach

Data Cleaning

The dataset was examined for missing values using the naniar::vis_miss() function. To keep the analysis simple, every row with a missing value in any column was removed using drop_na(). Called without arguments, drop_na() checks all columns, not only key variables such as body_mass_g, sex, and flipper_length_mm. No imputation or transformation was applied, so the remaining biological measurements are exactly as recorded.
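As a numeric complement to the missingness plot, the sketch below counts missing values per column and compares drop_na() on all columns with drop_na() restricted to a few named columns. The object names na_counts, clean_all, and clean_key are illustrative and not part of the original analysis.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

# Count missing values in each column of the raw data
na_counts <- colSums(is.na(penguins))
print(na_counts)

# drop_na() with no arguments drops a row if ANY column is missing
clean_all <- penguins %>% drop_na()

# Naming columns restricts the check to those columns only
clean_key <- penguins %>% drop_na(body_mass_g, sex, flipper_length_mm)

# Compare row counts for the three versions of the data
row_counts <- c(original    = nrow(penguins),
                all_columns = nrow(clean_all),
                key_columns = nrow(clean_key))
print(row_counts)
```

If the two cleaned versions have the same number of rows, the choice between them makes no practical difference for this dataset; if they differ, the column-restricted call keeps more data.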

Tools used

tidyverse, ggplot2, dplyr, naniar, skimr

Assumptions

Rows with missing data were dropped rather than imputed, based on the assumption that the dataset was sufficiently large and representative to retain its analytical power.
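One way to probe this assumption is to check how many rows are lost and whether the species mix changes after cleaning. The following is a minimal sketch; n_dropped, share_dropped, and species_shares are names introduced here for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

penguins_clean <- penguins %>% drop_na()

# Number and share of rows removed by drop_na()
n_dropped <- nrow(penguins) - nrow(penguins_clean)
share_dropped <- n_dropped / nrow(penguins)
cat("Rows dropped:", n_dropped, "of", nrow(penguins),
    sprintf("(%.1f%%)\n", 100 * share_dropped))

# Species counts and shares before and after cleaning, side by side
species_shares <- full_join(
  penguins %>% count(species, name = "n_before"),
  penguins_clean %>% count(species, name = "n_after"),
  by = "species"
) %>%
  mutate(share_before = n_before / sum(n_before),
         share_after  = n_after / sum(n_after))

print(species_shares)
```

Similar shares before and after suggest that dropping rows did not noticeably skew the species composition, although this check says nothing about why the values were missing.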

Key Questions

  • Which penguin species are most common?

  • How does body mass vary by species and sex?

  • Is there a relationship between flipper length and body mass?

Load required libraries

library(tidyverse) # For data manipulation and visualization 
library(palmerpenguins) #  For Dataset 
library(skimr) # Summary statistics 
library(naniar) # Visualizing missing data 
library(ggplot2) # Visualization (comes with tidyverse)

Load the Data

data("penguins") # Load the data

# View structure and first few rows 
str(penguins) 
glimpse(penguins) 
head(penguins)

Output:

Step 1: Data Cleaning

# Check for missing values in the data set
vis_miss(penguins)

Output:

# Remove rows with missing values for simplicity 
# (note: in real analysis, consider imputation) 
penguins_clean <- penguins %>% drop_na() 

# Confirm no missing data 
vis_miss(penguins_clean)

Output:

Step 2: Summary Statistics

# Basic overview of the cleaned dataset 
skim(penguins_clean)

Output:

# Grouped summaries: mean body mass by species and sex 
penguins_clean %>% group_by(species, sex) %>% 
    summarize(mean_body_mass = mean(body_mass_g), .groups = 'drop')

Output:

Step 3: Data Visualization

# Distribution of species 
ggplot(penguins_clean, aes(x = species)) + geom_bar(fill = "steelblue") + 
    labs(title = "Number of Penguins by Species", x = "Species", y = "Count") +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) # after_stat() replaces the deprecated ..count..

Output:

# Body mass distribution by species 
ggplot(penguins_clean, aes(x = species, y = body_mass_g, fill = species)) + 
    geom_boxplot() +
    labs(title = "Body Mass Distribution by Species", x = "Species", y = "Body Mass (g)")

Output:

# Body mass by sex 
penguins_clean %>% group_by(sex) %>% 
    summarize(mean_mass = mean(body_mass_g)) %>% 
    ggplot(aes(x = sex, y = mean_mass, fill = sex)) + geom_col() + 
    labs(title = "Mean Body Mass by Sex", x = "Sex", y = "Mean Body Mass (g)")

Output:

# Body mass distribution by sex within each species
ggplot(penguins_clean, aes(x = sex, y = body_mass_g, fill = sex)) +
  geom_boxplot() +
  facet_wrap(~ species) +
  labs(title = "Body Mass Distribution by Sex within Each Species",
       x = "Sex",
       y = "Body Mass (g)") +
  theme_minimal()

Output:

Statistical Test: Body Mass by Sex

To test whether the observed difference in body mass between male and female penguins is statistically significant, we perform a two-sample t-test. Called with its defaults, as it is here, R's t.test() runs Welch's version of the test, which does not assume equal variances in the two groups. The test evaluates whether mean body mass differs between the two sexes in the dataset.

  • If the p-value is less than 0.05, we reject the null hypothesis of equal means and conclude that the difference in mean body mass between males and females is statistically significant.

  • The 95% confidence interval gives a plausible range, in grams, for the difference in mean body mass (female minus male, following the order of the factor levels).

# Perform t-test to compare body mass by sex
sex_mass_test <- t.test(body_mass_g ~ sex, data = penguins_clean)

# Print test results
sex_mass_test
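
The individual pieces of the result can also be pulled out of the object that t.test() returns, which helps when the numbers need to be reported or reused. This is a small sketch that repeats the test on the cleaned data so it runs on its own.

```r
library(palmerpenguins)
library(tidyr)

penguins_clean <- drop_na(penguins)

# t.test() runs Welch's unequal-variance test unless var.equal = TRUE is set
sex_mass_test <- t.test(body_mass_g ~ sex, data = penguins_clean)

sex_mass_test$p.value   # p-value of the test
sex_mass_test$estimate  # mean body mass in each group
sex_mass_test$conf.int  # 95% CI for (first level mean - second level mean)

# The factor levels fix the direction of the difference
levels(penguins_clean$sex)
```

Because the first factor level is subtracted from nothing but the second, a confidence interval that lies entirely below zero means the first group has the lower mean.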

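The results support the visual observation: pooled across species, male penguins are significantly heavier on average than female penguins. Because this test combines all three species, it does not by itself show that the difference holds within each species; the faceted boxplots above suggest that it does.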
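
To check the comparison within each species rather than on the pooled data, the same Welch test can be repeated per species. The sketch below is one possible way to do this with dplyr; by_species and its column names are illustrative, and no correction for running three tests is applied.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

penguins_clean <- penguins %>% drop_na()

# One Welch t-test per species, keeping the group means and the p-value
by_species <- penguins_clean %>%
  group_by(species) %>%
  summarize(
    mean_female = mean(body_mass_g[sex == "female"]),
    mean_male   = mean(body_mass_g[sex == "male"]),
    p_value     = t.test(body_mass_g[sex == "male"],
                         body_mass_g[sex == "female"])$p.value,
    .groups = "drop"
  )

print(by_species)
```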

# Relationship between flipper length and body mass 
# (each + must end a line; a line that starts with + is parsed as a new expression)
ggplot(penguins_clean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
    geom_point(alpha = 0.7, size = 3) + 
    facet_wrap(~ sex) +
    labs(title = "Flipper Length vs Body Mass by Species and Sex", 
         x = "Flipper Length (mm)", y = "Body Mass (g)")

Output:

# Count species occurrences by island
species_island_count <- penguins_clean %>%
  count(island, species)

# Heatmap visualization
ggplot(species_island_count, aes(x = island, y = species, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Heatmap: Penguin Species Count on Each Island",
       x = "Island", y = "Species", fill = "Count") +
  theme_minimal()

Output:

Key Insights

  • Adelie is the most frequent species in this dataset.

  • Male penguins are generally heavier than females across all species.

  • Body mass positively correlates with flipper length.

  • Some islands host only certain species.
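
The last two insights can be backed with numbers as well as plots. The sketch below computes the Pearson correlation between flipper length and body mass, overall and within each species, and tabulates species by island; overall_cor, cor_by_species, and island_species are names chosen here for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

penguins_clean <- penguins %>% drop_na()

# Pearson correlation across all penguins
overall_cor <- cor(penguins_clean$flipper_length_mm, penguins_clean$body_mass_g)
print(overall_cor)

# The same correlation within each species
cor_by_species <- penguins_clean %>%
  group_by(species) %>%
  summarize(r = cor(flipper_length_mm, body_mass_g), n = n(), .groups = "drop")
print(cor_by_species)

# Cross-tabulation of island and species; zeros mark absent species
island_species <- table(penguins_clean$island, penguins_clean$species)
print(island_species)
```

A pooled correlation can be larger than the within-species values when the species differ in both size and flipper length, so it is worth reading the two sets of numbers together.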

Reflections

This project covered core EDA skills. In a next phase, this data could be used to build predictive models (e.g., predict body mass from flipper length and species). Further improvements could include interactive dashboards using Shiny.
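
As a rough illustration of that next phase, a linear model of body mass on flipper length and species might look like the sketch below. It is a starting point under simple assumptions (additive effects, no train/test split, no diagnostics), and the 200 mm Adelie penguin used for prediction is a made-up example.

```r
library(palmerpenguins)
library(tidyr)

penguins_clean <- drop_na(penguins)

# Linear model: body mass explained by flipper length and species
mass_model <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins_clean)
summary(mass_model)

# Predict body mass for a hypothetical penguin (values are illustrative)
new_penguin <- data.frame(
  flipper_length_mm = 200,
  species = factor("Adelie", levels = levels(penguins_clean$species))
)
predicted_mass <- predict(mass_model, newdata = new_penguin)
print(predicted_mass)
```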
