R Data Analysis Tutorial: World Bank Health and Development Data

Introduction: Why Data Analysis Matters in 2026

In 2026, data-driven decision-making is more critical than ever. From tracking global health trends to understanding economic development, R programming and statistical analysis are essential skills. In this tutorial, we'll explore a cross-section of country-level data from the World Bank, focusing on variables like infant mortality, life expectancy, smoking rates, and GDP. You'll learn to compute summary statistics, visualize distributions, run t-tests and regressions, and interpret results, all while using the tidyverse collection of packages. By the end, you'll be ready to tackle Assignment 3 with confidence.

Getting Started: Load the Data and Tidyverse

First, ensure you have the tidyverse package installed. Load it and read in the dataset:

library(tidyverse)
df <- read.csv('assignment3.csv')

This dataset contains variables like infmort (infant mortality per 1,000 births), lifeexp (life expectancy of women), smoke (percentage of adults who smoke), gdp, primeduc, urbpop, pop, cerealcrops, continent, and democ. Let's dive into the analysis.

Question 1: Summarizing Infant Mortality

Five Number Summary and Central Tendency

To describe infant mortality, compute the five-number summary:

summary(df$infmort)

summary() returns the five-number summary (minimum, first quartile, median, third quartile, and maximum) plus the mean, and it also reports the number of missing values if there are any. The median is the middle value, while the mean is the arithmetic average. The interquartile range (IQR) is Q3 - Q1 and measures the spread of the middle half of the data. A mean noticeably higher than the median suggests a right-skewed distribution, which here would mean that some countries have very high infant mortality.
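As a small illustration of these quantities, here is a sketch on a made-up vector (infmort_toy is invented for this example and is not taken from assignment3.csv). It uses base R's fivenum(), quantile(), and IQR(); note that fivenum() reports hinges, which can differ slightly from the quartiles that summary() and quantile() report.

```r
# Toy infant-mortality values (made up for illustration, not from assignment3.csv)
infmort_toy <- c(2, 3, 4, 5, 6, 8, 12, 30, 75)

five_num  <- fivenum(infmort_toy)                  # min, lower hinge, median, upper hinge, max
quartiles <- quantile(infmort_toy, c(0.25, 0.75))  # Q1 and Q3
iqr_manual <- unname(quartiles[2] - quartiles[1])  # IQR computed by hand as Q3 - Q1

mean_value   <- mean(infmort_toy)
median_value <- median(infmort_toy)

print(five_num)
print(c(IQR = iqr_manual, mean = mean_value, median = median_value))
```

In this toy vector the two largest values pull the mean well above the median, which is the numeric signature of right skew described above.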

Histogram Shape and Spread

Create a histogram:

ggplot(df, aes(infmort)) + geom_histogram(bins=30)

You'll likely see a right-skewed distribution with a long tail. This is typical for health indicators: most countries have low infant mortality, but a few have extremely high rates. The shape is consistent with the summary statistics: the mean is pulled to the right of the median by the high values in the tail.

Responding to the Central Limit Theorem Misconception

Your friend claims the Central Limit Theorem (CLT) implies a symmetric bell curve. However, the CLT applies to the sampling distribution of the mean, not to the distribution of the data themselves. The raw data can be skewed; it is the distribution of sample means that becomes approximately normal as the sample size grows. Here, we're looking at the actual data, not sample means, so skewness is expected.
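A quick simulation can make this distinction concrete. The sketch below is only an illustration: the population is drawn from an exponential distribution as a stand-in for skewed data, and skewness() is a small helper defined here (it is not a base R function).

```r
set.seed(2026)  # arbitrary seed so the simulation is reproducible

# Sample skewness: mean of the cubed standardized values (helper defined here)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# A right-skewed "population" standing in for raw infant mortality values
population <- rexp(10000, rate = 1 / 30)

# Draw 2000 samples of size 50 and keep the mean of each one
sample_means <- replicate(2000, mean(sample(population, size = 50)))

skew_raw   <- skewness(population)
skew_means <- skewness(sample_means)
print(c(raw = skew_raw, sample_means = skew_means))
```

The raw values stay strongly skewed, while the sample means are much closer to symmetric. That is the sense in which the CLT describes sample means rather than the data.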

Question 2: Democracy and Life Expectancy

Difference in Means Test

You hypothesize democracies have higher life expectancy. Use a t-test:

t.test(lifeexp ~ democ, data = df)

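Interpret the output: t.test() prints the mean of each group, and the reported t statistic and confidence interval refer to the first group minus the second (if democ is coded 0/1, that is non-democracies minus democracies). By default R runs Welch's test, which does not assume equal variances. If the democracy mean is the larger of the two, democracies have longer life expectancy in this sample. Check the p-value for statistical significance (conventionally p < 0.05). Substantively, if the difference is, say, 5 years, that's a meaningful gap.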
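Because the hypothesis is directional (democracies higher), a one-sided test is also an option. The following is a sketch on simulated data, assuming democ is coded 0 for non-democracies and 1 for democracies; the data frame toy and its values are invented for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data: democ is assumed to be coded 0 (non-democracy) / 1 (democracy)
toy <- data.frame(democ = rep(c(0, 1), each = 40))
toy$lifeexp <- 68 + 5 * toy$democ + rnorm(80, sd = 6)

two_sided <- t.test(lifeexp ~ democ, data = toy)

# Groups are ordered 0 then 1, so "less" tests mean(group 0) - mean(group 1) < 0,
# i.e. that democracies have the higher mean
one_sided <- t.test(lifeexp ~ democ, data = toy, alternative = "less")

print(two_sided$estimate)  # the two group means
print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one. Decide which test to report before looking at the results.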

Regression Approach

Run a simple linear regression:

m <- lm(lifeexp ~ democ, data = df)
summary(m)

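The coefficient for democ estimates the same difference in means (democracies minus non-democracies, if democ is coded 0/1). The t-test and the regression agree on the size of the difference. The p-values are identical only if the t-test assumes equal variances (var.equal = TRUE); R's default Welch test usually gives a slightly different p-value. This reinforces that regression is a generalization of the two-sample t-test.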
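The equivalence can be checked directly. This is a minimal sketch on simulated data (toy and its values are invented, and democ is assumed to be coded 0/1); it compares the regression coefficient with the equal-variance t-test and with R's default Welch test.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in for the assignment data; democ assumed coded 0/1
toy <- data.frame(democ = rep(c(0, 1), times = c(35, 45)))
toy$lifeexp <- 66 + 6 * toy$democ + rnorm(nrow(toy), sd = 7)

fit    <- lm(lifeexp ~ democ, data = toy)
pooled <- t.test(lifeexp ~ democ, data = toy, var.equal = TRUE)  # equal-variance t-test
welch  <- t.test(lifeexp ~ democ, data = toy)                    # R's default

slope     <- unname(coef(fit)["democ"])
mean_diff <- unname(pooled$estimate[2] - pooled$estimate[1])     # democ = 1 minus democ = 0
p_lm      <- summary(fit)$coefficients["democ", "Pr(>|t|)"]

print(c(slope = slope, mean_diff = mean_diff))
print(c(lm = p_lm, pooled = pooled$p.value, welch = welch$p.value))
```

The slope equals the difference in group means, and the regression p-value matches the pooled t-test. The Welch p-value is usually close but not identical.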

Question 3: Bivariate Regressions with Life Expectancy

Primary Education

ggplot(df, aes(primeduc, lifeexp)) + geom_point() + geom_smooth(method='lm')

Expect a negative relationship: a higher percentage of adults with only primary education (lower overall education) correlates with lower life expectancy, so the fitted slope should be negative. Because method='lm' draws a straight line, judge curvature from the points: if they bend away from the line, the pattern is non-linear.

Urban Population

ggplot(df, aes(urbpop, lifeexp)) + geom_point() + geom_smooth(method='lm')

Urbanization often associates with better healthcare access, so a positive relationship is likely. However, scatter may show heteroscedasticity (spread changes) or outliers.

Smoking

ggplot(df, aes(smoke, lifeexp)) + geom_point() + geom_smooth(method='lm')

Smoking is harmful, so expect a negative slope. The bivariate relationship may still be weak, or even positive, if smoking rates are correlated with other determinants of life expectancy such as income.

Population

ggplot(df, aes(pop, lifeexp)) + geom_point() + geom_smooth(method='lm')

Population size likely has little to no linear relationship with life expectancy. The regression line may be nearly flat, and points may cluster with extreme outliers (e.g., China, India).

For each plot, comment on direction, strength (R-squared from regression), and signs of non-linearity or influential points.
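To read off direction and strength for all four predictors at once, one option is to loop over them. The sketch below runs on simulated data that borrows the tutorial's column names; the values and the true slopes are invented, so the printed numbers say nothing about the real dataset.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated data with the tutorial's column names; the values are invented
n <- 100
toy <- data.frame(
  primeduc = runif(n, 10, 80),
  urbpop   = runif(n, 15, 95),
  smoke    = runif(n, 5, 40),
  pop      = rlnorm(n, meanlog = 16, sdlog = 1.5)
)
toy$lifeexp <- 70 - 0.15 * toy$primeduc + 0.10 * toy$urbpop + rnorm(n, sd = 4)

predictors <- c("primeduc", "urbpop", "smoke", "pop")

# Fit one bivariate regression per predictor and collect slope and R-squared
results <- t(sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "lifeexp"), data = toy)
  c(slope = unname(coef(fit)[p]), r_squared = summary(fit)$r.squared)
}))

print(round(results, 4))
```

In a bivariate regression, R-squared is the squared correlation between the two variables, so it measures strength but not direction. Read the direction from the sign of the slope.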

Question 4: Cereal Crops and GDP

Scatter plot with country labels:

ggplot(df, aes(cerealcrops, gdp, label = country)) + 
  geom_point() + geom_smooth(method='lm') + geom_text(vjust=-1, hjust=0.5)

Outliers

Look for points far from the regression line. Large economies such as the United States or China may stand out with high GDP and high cereal production, and a country with high cereal production but low GDP (e.g., some African nations) would also sit far from the line. A residual is the actual GDP minus the GDP predicted by the line: it is positive if actual GDP > predicted and negative otherwise. A positive residual means the country has higher GDP than expected given its cereal production.
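Residuals can also be computed and ranked rather than judged by eye. Here is a minimal sketch with eight hypothetical countries labelled A to H; all values are invented for illustration.

```r
# Invented values for eight hypothetical countries (labels A-H)
toy <- data.frame(
  country     = LETTERS[1:8],
  cerealcrops = c(1, 2, 3, 4, 5, 6, 7, 8),
  gdp         = c(3, 5, 4, 9, 8, 30, 12, 14)
)

fit <- lm(gdp ~ cerealcrops, data = toy)

toy$predicted <- unname(fitted(fit))
toy$residual  <- unname(resid(fit))  # actual minus predicted

# Sort so the countries farthest from the line come first
ranked <- toy[order(-abs(toy$residual)), ]
print(ranked)
```

In this toy data, country F has by far the largest positive residual: its GDP is much higher than the line predicts from its cereal production. On the real data, the same ranking shows which country labels to look for in the scatter plot.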

Question 5: Smoking and Infant Mortality

Correlation

cor(df$smoke, df$infmort, use = "complete.obs")  # drop countries with missing values; otherwise cor() returns NA

You might expect a positive correlation: higher smoking rates associated with higher infant mortality. The correlation coefficient (r) quantifies both strength and direction, so check its sign in the output rather than assuming it.
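Country-level data often have gaps, and cor() returns NA if either variable contains a missing value. The sketch below uses two invented vectors to show the difference the use argument makes.

```r
# Invented values; NA marks a country with no reported figure
smoke_toy   <- c(12, 25, NA, 30, 18, 22)
infmort_toy <- c(40, 10, 55, NA, 28, 15)

r_default  <- cor(smoke_toy, infmort_toy)                        # NA because of missing values
r_complete <- cor(smoke_toy, infmort_toy, use = "complete.obs")  # uses only complete pairs

print(c(default = r_default, complete = r_complete))
```

With use = "complete.obs", only countries with both values observed enter the calculation. It is worth reporting how many countries that leaves.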

Scatter Plot

ggplot(df, aes(smoke, infmort)) + geom_point()

Check whether the plot agrees with the sign and size of r. Points may show a weak or moderate linear trend.

Regression vs. Correlation

Regression provides the slope (effect size) and allows prediction. It also enables controlling for other variables. Correlation only gives the strength and direction of association.
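The two are closely linked in the bivariate case: the regression slope is the correlation rescaled by the standard deviations of the two variables, slope = r * sd(y) / sd(x). The following sketch checks this on simulated values (smoke_toy and infmort_toy are invented).

```r
set.seed(3)  # arbitrary seed for reproducibility

# Simulated smoking rates and infant mortality; values are invented
smoke_toy   <- runif(60, 5, 40)
infmort_toy <- 10 + 0.3 * smoke_toy + rnorm(60, sd = 8)

r     <- cor(smoke_toy, infmort_toy)
fit   <- lm(infmort_toy ~ smoke_toy)
slope <- unname(coef(fit)["smoke_toy"])

# The bivariate slope is the correlation rescaled by the two standard deviations
slope_from_r <- r * sd(infmort_toy) / sd(smoke_toy)

print(c(r = r, slope = slope, slope_from_r = slope_from_r))
```

So r and the slope always share a sign, but only the slope is expressed in the units of the outcome.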

Regression Results

m <- lm(infmort ~ smoke, data = df)
summary(m)

Interpret the coefficient: for a one-percentage-point increase in the smoking rate, predicted infant mortality changes by the coefficient. If the coefficient is 0.2, then a 10-percentage-point increase in smoking predicts an increase of 2 infant deaths per 1,000 births.
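The arithmetic can be reproduced with predict(). In this sketch the data are invented and constructed so that the fitted slope is exactly 0.2, matching the example above; the real coefficient will differ.

```r
# Invented data constructed so that the fitted slope is exactly 0.2
toy <- data.frame(smoke = c(10, 20, 30, 40))
toy$infmort <- 5 + 0.2 * toy$smoke + c(1, -1, -1, 1)

fit   <- lm(infmort ~ smoke, data = toy)
slope <- unname(coef(fit)["smoke"])

# Predicted infant mortality at 15% and 25% adult smoking
pred   <- predict(fit, newdata = data.frame(smoke = c(15, 25)))
change <- unname(pred[2] - pred[1])

print(c(slope = slope, change_for_10_points = change))
```

The predicted change for a 10-point difference in smoking is 10 times the slope, here 2 deaths per 1,000 births. This is a statement about predictions from the fitted line, not by itself a causal effect.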

Potential Confounders

From lecture, the main threats are omitted variable bias, reverse causation, and measurement error. Wealth (GDP) is a plausible omitted variable: richer countries tend to have lower infant mortality, and their smoking rates may differ systematically from those of poorer countries (check the direction in the data rather than assuming it). Education is another candidate confounder. Reverse causation, by contrast, is unlikely to apply here: there is no obvious mechanism by which infant mortality would cause adults to smoke.

Colored by Continent

ggplot(df, aes(smoke, infmort, color = continent)) + geom_point(size=1.5)

This reveals clustering by continent. For example, African countries may have high infant mortality and low smoking, while European countries have low infant mortality and moderate smoking. Clustering like this can produce Simpson's paradox: the relationship in the pooled data may have a different sign from the relationship within each continent.

Controlled Regression

m <- lm(infmort ~ smoke + continent, data = df)
summary(m)

After controlling for continent, the smoking coefficient describes the association within continents, and it may change sign or become insignificant. If the plot shows that within each continent the relationship is flat or negative, the controlled regression will reflect that. This demonstrates the importance of controlling for group-level confounders.
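To see how pooling across groups can flip a sign, here is a sketch on simulated data with two hypothetical continents, "X" and "Y". The pattern is built in by construction (a within-continent slope of +0.5 and very different baselines), so it illustrates the mechanism only and says nothing about which direction the real data go.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Two hypothetical continents: "X" has low smoking and high infant mortality,
# "Y" has higher smoking and low infant mortality. Within each, the slope is +0.5.
n <- 30
toy <- data.frame(
  continent = rep(c("X", "Y"), each = n),
  smoke     = c(runif(n, 5, 20), runif(n, 20, 35))
)
baseline <- ifelse(toy$continent == "X", 50, 5)
toy$infmort <- baseline + 0.5 * toy$smoke + rnorm(2 * n, sd = 1)

pooled     <- lm(infmort ~ smoke, data = toy)              # ignores continent
controlled <- lm(infmort ~ smoke + continent, data = toy)  # compares within continent

pooled_slope     <- unname(coef(pooled)["smoke"])
controlled_slope <- unname(coef(controlled)["smoke"])

print(c(pooled = pooled_slope, controlled = controlled_slope))
```

The pooled slope is negative because it mostly reflects the gap between the two continents, while the controlled slope recovers the positive within-continent relationship. Comparing the two coefficients on the real data shows how much of the bivariate association is due to differences between continents.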

Conclusion

In this tutorial, you've practiced key R skills: summary statistics, visualization, t-tests, regression, and correlation. These techniques are foundational for data analysis in 2026, whether you're analyzing global health trends or building AI models. Remember to always interpret results in context and check for confounding variables. Good luck with Assignment 3!