polabroda/distributed_data_processing_pyspark

Distributed Data Processing with PySpark

This project demonstrates distributed data processing workflows built with PySpark and Spark DataFrames.

It covers data transformations, aggregations, cross joins, window functions, distance calculations, and partitioned storage of the kind used in large-scale analytics and data engineering.


Project Overview

  • Distributed data processing using PySpark
  • Spark DataFrame transformations
  • Aggregation and filtering workflows
  • Cross join operations
  • Window function implementation
  • Distance calculations using geographic coordinates
  • Partitioned CSV generation
  • Large-scale analytical processing

Dataset

The project uses synthetic datasets, generated inside the notebook, that simulate:

  • US city geographic data
  • Road construction projects
  • Distance-based analytical workflows

The datasets are created directly in the notebook, so the project is fully reproducible and does not depend on external sources.


Processing Workflow

The project includes the following distributed processing tasks:

DataFrame Creation

The notebook creates distributed Spark DataFrames containing:

  • city names,
  • state identifiers,
  • geographic coordinates,
  • population data,
  • road construction records.

Data Transformation

The transformation workflow includes:

  • distance conversion from miles to kilometers,
  • timestamp parsing,
  • duration calculation,
  • derived column generation,
  • distributed filtering operations.

Aggregation and Analysis

The analysis demonstrates:

  • grouped aggregations,
  • population analysis,
  • average distance calculations,
  • construction duration analysis,
  • distributed analytical summaries.

Cross Join and Distance Calculation

The project performs:

  • cross joins between road projects and cities,
  • Euclidean distance calculations,
  • nearest city matching workflows,
  • geographic analytical processing.

Window Functions

The notebook uses window functions to:

  • rank nearest cities,
  • partition analytical results,
  • identify closest geographic matches.

Partitioned Storage

The transformed dataset is written as:

  • partitioned CSV files,
  • state-level distributed output,
  • structured Spark partitions for downstream processing.

Example Processing Areas

The notebook demonstrates:

  • distributed transformations,
  • Spark aggregation workflows,
  • partitioned analytical processing,
  • geographic data analysis,
  • nearest-neighbor style matching,
  • large-scale DataFrame operations.

Technologies

  • Python
  • PySpark
  • Apache Spark
  • Spark DataFrames
  • Distributed Data Processing
  • Window Functions
  • Data Transformation
  • CSV Processing
  • Big Data Analytics
  • Google Colab

Goal

The goal of this project is to demonstrate practical distributed data processing skills using PySpark, including transformations, joins, window functions, partitioned storage, and large-scale analytical workflows.


Results

The solution successfully demonstrates:

  • Distributed Spark processing
  • Spark DataFrame transformations
  • Cross join operations
  • Window function workflows
  • Distance-based analysis
  • Partitioned CSV generation
  • Large-scale analytical processing
  • Practical PySpark data engineering techniques

Author

Paulina Broda
