Processing Car Sales Data with Linux, Hive, and Impala: DAT 5566 Lab 3 Tutorial

Introduction to Big Data Processing with Car Sales Data

In big data work, processing large datasets efficiently is a core skill. This tutorial walks through a typical lab assignment: processing a car sales dataset (CarSale.csv) with Linux command-line tools, Apache Hive, and Impala. Whether you are a student in DAT 5566 or a data enthusiast, the same techniques apply to real-world data. Along the way we draw analogies to other domains, such as tracking player stats in esports tournaments or forecasting AI model performance.

Part 1: Linux Command Practices

1. Checking for Duplicate Cars

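First, let's inspect the dataset for duplicate entries by looking at the Vehicle Identification Number (VIN). Use the head command to view the header and the first data row, then use awk, sort, and uniq -d to list any VIN that appears more than once. The command assumes the VIN is in column 1; confirm the position against the header that head prints.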

head -2 CarSale.csv
awk -F',' 'NR>1 {print $1}' CarSale.csv | sort | uniq -d

If the second command prints nothing, no VIN is repeated and each car is unique. This is similar to ensuring unique user IDs in a gaming leaderboard.
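
To see what the duplicate check reports when a VIN does repeat, here is a minimal sketch on a tiny made-up file. The file name sample_vins.csv and its rows are invented for illustration, and the VIN is assumed to be in column 1, as in the command above. The second awk call is an alternative that counts the repeated VINs without sorting.

```shell
# Build a tiny, made-up CSV (hypothetical rows; VIN assumed in column 1).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_vins.csv" <<'EOF'
vin,make,price
VIN001,Ford,10000
VIN002,Honda,12000
VIN001,Ford,10000
VIN003,Kia,9000
EOF

# Same pipeline as above: skip the header, keep column 1, list repeated values.
dups=$(awk -F',' 'NR>1 {print $1}' "$tmpdir/sample_vins.csv" | sort | uniq -d)
echo "Duplicate VINs: $dups"

# Alternative: count how many distinct VINs repeat (0 means every car is unique).
# seen[$1]++ == 1 is true only on the second occurrence of a given VIN.
dup_count=$(awk -F',' 'NR>1 && seen[$1]++ == 1 {n++} END {print n+0}' "$tmpdir/sample_vins.csv")
echo "Number of duplicated VINs: $dup_count"

rm -rf "$tmpdir"
```

On the real CarSale.csv, an empty result from the first pipeline and a count of 0 from the second should agree with each other.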

2. Finding the 5 Cars with Longest Days on Market

To find the cars that have sat on the market longest, sort numerically in descending order by the days-on-market column (assumed here to be column 23) and keep the top five.

awk -F',' 'NR>1 {print $23, $2, $3}' CarSale.csv | sort -k1,1nr | head -5

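To add a header line, print it with echo or printf before running the pipeline; a header emitted from an awk BEGIN block would be fed through sort and head along with the data. This mirrors identifying the most popular AI apps by user retention.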
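
The header step can be sketched as follows on a small made-up file. The file sample_dom.csv, its rows, and its three-column layout (make, model, days on market) are hypothetical, so the column numbers differ from the ones assumed for CarSale.csv.

```shell
# Tiny made-up file: make, model, days_on_market (hypothetical layout and rows).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_dom.csv" <<'EOF'
make,model,days_on_market
Ford,Focus,12
Honda,Civic,340
Kia,Rio,75
Toyota,Camry,201
BMW,X5,8
Audi,A4,150
EOF

# Print the header first, then the five rows with the largest days-on-market value.
# The header is printed outside the pipeline, so sort and head never see it.
top5=$(
  printf 'DaysOnMarket Make Model\n'
  awk -F',' 'NR>1 {print $3, $1, $2}' "$tmpdir/sample_dom.csv" | sort -k1,1nr | head -5
)
echo "$top5"

rm -rf "$tmpdir"
```

Because the awk output is space-separated, sort uses its default whitespace field splitting; -k1,1nr restricts the numeric, descending comparison to the first field.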

3. Average Price Using AWK

Compute the average price (assumed here to be column 12) across all data rows, skipping the header.

awk -F',' 'NR>1 {sum+=$12; count++} END {print "Average Price: " sum/count}' CarSale.csv
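
The one-liner above treats blank or non-numeric price fields as 0, which pulls the average down. The sketch below is one possible way to guard against that, shown on a made-up file; sample_price.csv, its rows, and the choice of column 3 for the price are assumptions for illustration only.

```shell
# Tiny made-up file with one blank and one non-numeric price (hypothetical rows).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_price.csv" <<'EOF'
make,model,price
Ford,Focus,10000
Honda,Civic,
Kia,Rio,N/A
Toyota,Camry,20000
EOF

# Average only the rows whose price field looks numeric, and report how many were skipped.
avg_line=$(awk -F',' '
  NR > 1 {
    if ($3 ~ /^[0-9]+(\.[0-9]+)?$/) { sum += $3; count++ } else { skipped++ }
  }
  END {
    if (count > 0)
      printf "Average Price: %.2f (rows used: %d, skipped: %d)\n", sum / count, count, skipped
    else
      print "Average Price: n/a (no numeric prices found)"
  }' "$tmpdir/sample_price.csv")
echo "$avg_line"

rm -rf "$tmpdir"
```

The count > 0 test also avoids a division-by-zero error when the file has no usable rows.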

4. Average Price and Count per Year

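Group by year (assumed here to be column 1; step 1 used column 1 for the VIN, so check the header for the real positions) to find the average price and number of cars per year, sorted by year in ascending order.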

{ echo "Year,AvgPrice,Count"; awk -F',' 'NR>1 {sum[$1]+=$12; count[$1]++} END {for (y in sum) print y "," sum[y]/count[y] "," count[y]}' CarSale.csv | sort -t, -k1,1n; }

This is like aggregating financial data per quarter.
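
As a small worked example of the per-year grouping, the sketch below runs the same pattern on a made-up file. The file sample_year.csv, its rows, and its layout (year in column 1, price in column 3) are hypothetical.

```shell
# Tiny made-up file: year, make, price (hypothetical layout and rows).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_year.csv" <<'EOF'
year,make,price
2015,Ford,10000
2012,Honda,8000
2015,Kia,14000
2012,Toyota,6000
2020,BMW,30000
EOF

# Accumulate a running sum and count per year in awk arrays, then sort by year.
# The header is echoed outside the pipeline so it always stays on the first line.
per_year=$(
  echo "Year,AvgPrice,Count"
  awk -F',' 'NR>1 {sum[$1]+=$3; count[$1]++}
             END {for (y in sum) printf "%s,%.2f,%d\n", y, sum[y]/count[y], count[y]}' \
      "$tmpdir/sample_year.csv" | sort -t, -k1,1n
)
echo "$per_year"

rm -rf "$tmpdir"
```

The explicit sort is needed because awk's for (y in sum) loop visits keys in no guaranteed order.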

5. Creating a Smaller Dataset

Select columns 1, 2, 4, 12, 13, 15, 18, 19, and 23 for cars from year 2000 onward, keeping the header row, and write the result to CarSale_subset.csv.

awk -F',' -v OFS=',' 'NR==1 || $1>=2000 {print $1,$2,$4,$12,$13,$15,$18,$19,$23}' CarSale.csv > CarSale_subset.csv
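
Before loading a subset into Hive, it can help to check that it is well formed. The sketch below applies the same filter to a made-up four-column file and then runs two quick checks; sample_cars.csv, sample_subset.csv, and the rows are hypothetical, and the same two checks could be pointed at CarSale_subset.csv.

```shell
# Tiny made-up file: year, make, model, price (hypothetical layout and rows).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_cars.csv" <<'EOF'
year,make,model,price
1998,Ford,Escort,1500
2005,Honda,Civic,4000
2018,Kia,Rio,9000
EOF

# Same idea as above: keep the header and rows from 2000 onward, columns 1, 2, and 4 only.
awk -F',' -v OFS=',' 'NR==1 || $1>=2000 {print $1,$2,$4}' \
    "$tmpdir/sample_cars.csv" > "$tmpdir/sample_subset.csv"

# Check 1: every line should have the same number of fields (one distinct value).
field_counts=$(awk -F',' '{print NF}' "$tmpdir/sample_subset.csv" | sort -u)
# Check 2: the smallest year among the data rows should be 2000 or later.
min_year=$(awk -F',' 'NR>1 {print $1}' "$tmpdir/sample_subset.csv" | sort -n | head -1)
echo "Distinct field counts: $field_counts; earliest year: $min_year"

rm -rf "$tmpdir"
```

Splitting on a bare comma does not understand quoted fields, so a value such as "Civic, EX" would shift the columns; check the raw file for quoted commas before trusting the column numbers.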

Part 2: Importing Data into Hive/Impala

6. Creating a Hive Table and Loading Data

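Upload the subset file to HDFS and create a Hive table named after your student IDs (your_ids_carsale; the commands below use 111111_222222 as a placeholder).

hdfs dfs -put CarSale_subset.csv /user/hive/warehouse/
hive -e "CREATE TABLE 111111_222222_carsale (year INT, make STRING, model STRING, price DOUBLE, ...) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES ('skip.header.line.count'='1');"
hive -e "LOAD DATA INPATH '/user/hive/warehouse/CarSale_subset.csv' INTO TABLE 111111_222222_carsale;"

Replace the ... with the remaining column definitions, in the same order as the columns in CarSale_subset.csv. The skip.header.line.count property tells Hive to ignore the header row that the subset file still contains; without it, the header is loaded as a data row. LOAD DATA INPATH moves the file from its HDFS location into the table's directory. If you then query the table from impala-shell, run INVALIDATE METADATA 111111_222222_carsale; first so that Impala picks up the table created through Hive. This step is akin to loading player data into a game database.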

Part 3: Querying with Hive/Impala

7. Max, Min, Avg Days on Market per Body Type

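SELECT body_type, MAX(days_on_market) AS max_days, MIN(days_on_market) AS min_days, AVG(days_on_market) AS avg_days FROM 111111_222222_carsale WHERE body_type IS NOT NULL AND body_type <> '' GROUP BY body_type;

The body_type <> '' filter matters because an empty field in a delimited text file is read into a STRING column as an empty string rather than NULL.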
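
One way to sanity-check the result of this query is to recompute the same statistics locally with awk and compare a few rows. The sketch below does that on a made-up file; sample_body.csv, its rows, and its two-column layout (body type, days on market) are hypothetical, so the column numbers would need adjusting for CarSale_subset.csv.

```shell
# Tiny made-up file: body_type, days_on_market (hypothetical layout and rows).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample_body.csv" <<'EOF'
body_type,days_on_market
SUV,10
Sedan,40
SUV,30
,99
Sedan,20
EOF

# Per body type: track the max, min, running sum, and count; skip rows with no body type.
body_stats=$(awk -F',' '
  NR > 1 && $1 != "" {
    d = $2 + 0
    if (!($1 in n) || d > max[$1]) max[$1] = d
    if (!($1 in n) || d < min[$1]) min[$1] = d
    sum[$1] += d; n[$1]++
  }
  END { for (b in n) printf "%s,%d,%d,%.2f\n", b, max[b], min[b], sum[b] / n[b] }
' "$tmpdir/sample_body.csv" | LC_ALL=C sort)
echo "$body_stats"

rm -rf "$tmpdir"
```

Each output line has the form body_type,max,min,avg, which lines up with the four columns the query returns.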

8. Average Price and Count per Color and Condition

SELECT color, condition, AVG(price) AS avg_price, COUNT(*) AS num_cars FROM 111111_222222_carsale GROUP BY color, condition HAVING COUNT(*) > 100 ORDER BY num_cars DESC;

The first row of the result is the most available color and condition combination; which one it is (for example, "White, Used") depends on the data.

9. Cars per Maker, Model, and Condition with Tagging

SELECT make, model, CASE WHEN condition='New' THEN 'New' ELSE 'Used' END AS condition_tag, COUNT(*) AS num_cars, AVG(price) AS avg_price, MAX(price) AS max_price FROM 111111_222222_carsale GROUP BY make, model, CASE WHEN condition='New' THEN 'New' ELSE 'Used' END HAVING COUNT(*) > 250;

This query helps identify popular car models, similar to trending items in e-commerce.

Part 4: Clean Up

Remove Table and Files

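hive -e "DROP TABLE IF EXISTS 111111_222222_carsale;"
hdfs dfs -rm -f /user/hive/warehouse/CarSale_subset.csv
rm -f CarSale_subset.csv

Dropping a managed table also deletes the data that LOAD DATA INPATH moved into it, so the hdfs dfs -rm line only matters if the file was uploaded but never loaded; the last line removes the local copy. Always clean up resources to avoid clutter, like deleting temporary files after a game patch.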

Conclusion

This tutorial covered essential big data processing techniques using Linux, Hive, and Impala. By working through duplicate checks, aggregations, and Hive queries, you've built skills applicable to modern data pipelines. Whether you're analyzing car sales or tracking AI model training logs, these commands are your toolkit. Practice with your own datasets to master big data analytics.