Distributed Data Processing with PySpark

This project presents distributed data processing workflows using PySpark and Spark DataFrames.

The solution focuses on large-scale data transformations, aggregations, cross joins, window functions, distance calculations, and partitioned storage workflows commonly used in distributed analytics and data engineering.

Project Overview

Distributed data processing using PySpark
Spark DataFrame transformations
Aggregation and filtering workflows
Cross join operations
Window function implementation
Distance calculations using geographic coordinates
Partitioned CSV generation
Large-scale analytical processing

Dataset

The project uses internally generated datasets simulating:

US city geographic data
Road construction projects
Distance-based analytical workflows

The datasets were created directly in the notebook to ensure full reproducibility and eliminate dependency on external sources.

Processing Workflow

The project includes the following distributed processing tasks:

DataFrame Creation

The notebook creates distributed Spark DataFrames containing:

city names,
state identifiers,
geographic coordinates,
population data,
road construction records.
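
As a minimal sketch of this step, the snippet below builds two small DataFrames from in-memory rows. The column names, the sample values, and the helper `build_dataframes` are illustrative assumptions; the notebook's actual schema is not shown in this README.

```python
# Illustrative rows only: names, coordinates, and populations are rough
# placeholder values, not the data used in the notebook.
CITY_COLUMNS = ["city", "state", "latitude", "longitude", "population"]
CITY_ROWS = [
    ("Austin", "TX", 30.27, -97.74, 950_000),
    ("Dallas", "TX", 32.78, -96.80, 1_300_000),
    ("Denver", "CO", 39.74, -104.99, 700_000),
]

ROAD_COLUMNS = ["project_id", "state", "latitude", "longitude",
                "length_miles", "start_time", "end_time"]
ROAD_ROWS = [
    ("P1", "TX", 31.00, -97.00, 12.5, "2023-01-10 08:00:00", "2023-03-01 17:00:00"),
    ("P2", "CO", 39.50, -105.20, 4.0, "2023-02-01 09:00:00", "2023-02-20 16:30:00"),
]


def build_dataframes(spark):
    """Create the two DataFrames from local rows using an existing SparkSession."""
    # Passing a list of column names lets Spark infer each column's type.
    cities = spark.createDataFrame(CITY_ROWS, schema=CITY_COLUMNS)
    roads = spark.createDataFrame(ROAD_ROWS, schema=ROAD_COLUMNS)
    return cities, roads


# Usage, assuming PySpark is installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("cities-demo").getOrCreate()
#   cities, roads = build_dataframes(spark)
#   cities.printSchema()
```

Keeping the rows in the notebook itself is what makes the run reproducible: no file download or external service is involved.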

Data Transformation

The transformation workflow includes:

distance conversion from miles to kilometers,
timestamp parsing,
duration calculation,
derived column generation,
distributed filtering operations.
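
The transformations listed above could look roughly like the sketch below. It assumes a road-project DataFrame with hypothetical columns `length_miles`, `start_time`, and `end_time`; the function name `transform_roads` and the `min_days` filter are likewise illustrative, not taken from the notebook.

```python
KM_PER_MILE = 1.609344  # exact length of the international mile in kilometers


def miles_to_km(miles: float) -> float:
    """Plain-Python version of the conversion, useful for spot checks."""
    return miles * KM_PER_MILE


def transform_roads(roads, min_days: int = 1):
    """Add derived columns to a Spark DataFrame of road projects.

    Assumes hypothetical columns: length_miles (numeric) and
    start_time / end_time (strings such as '2023-01-10 08:00:00').
    """
    derived = roads.selectExpr(
        "*",
        f"length_miles * {KM_PER_MILE} AS length_km",  # unit conversion
        "to_timestamp(start_time) AS start_ts",        # string -> timestamp
        "to_timestamp(end_time) AS end_ts",
        "datediff(to_date(end_time), to_date(start_time)) AS duration_days",
    )
    # Keep only projects that lasted at least `min_days` calendar days.
    return derived.filter(derived["duration_days"] >= min_days)
```

The expressions are written as Spark SQL strings, so the whole transformation is evaluated by the executors rather than row by row in Python.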

Aggregation and Analysis

The analysis demonstrates:

grouped aggregations,
population analysis,
average distance calculations,
construction duration analysis,
distributed analytical summaries.
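
A grouped aggregation of this kind might be sketched as follows. The `state` and `population` column names and both function names are assumptions made for illustration. The small plain-Python helper is only a cross-check for tiny samples and is not part of the Spark job.

```python
from collections import defaultdict


def population_by_state_local(rows):
    """Sum population per state for (state, population) tuples, without Spark."""
    totals = defaultdict(int)
    for state, population in rows:
        totals[state] += population
    return dict(totals)


def summarize_by_state(cities):
    """Per-state summary of a Spark DataFrame with state and population columns."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import functions as F

    return (
        cities.groupBy("state")
        .agg(
            F.count("*").alias("city_count"),
            F.sum("population").alias("total_population"),
            F.avg("population").alias("avg_population"),
        )
        .orderBy("state")
    )
```

On a small sample, the `total_population` column returned by `summarize_by_state` should match the dictionary produced by `population_by_state_local`.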

Cross Join and Distance Calculation

The project performs:

cross joins between road projects and cities,
Euclidean distance calculations on geographic coordinates,
nearest city matching workflows,
geographic analytical processing.
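
The cross join and distance step can be sketched as below, assuming both DataFrames carry hypothetical `latitude` and `longitude` columns. Note that a Euclidean distance on raw coordinates is expressed in degrees and is only a rough proxy for ground distance. The function names are illustrative.

```python
import math


def euclidean(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Straight-line distance between two coordinate pairs, in degrees."""
    return math.hypot(lat1 - lat2, lon1 - lon2)


def pair_projects_with_cities(roads, cities):
    """Pair every road project with every city and add a distance column."""
    # Rename the coordinate columns first so the joined result has no clashes.
    r = roads.selectExpr(
        "project_id", "latitude AS road_lat", "longitude AS road_lon"
    )
    c = cities.selectExpr(
        "city", "state", "latitude AS city_lat", "longitude AS city_lon"
    )
    # crossJoin yields len(roads) * len(cities) rows, so it grows quickly.
    return r.crossJoin(c).selectExpr(
        "*",
        "sqrt(pow(road_lat - city_lat, 2) + pow(road_lon - city_lon, 2)) AS distance",
    )
```

Because the output size is the product of the two inputs, this pattern is practical only when at least one side is small.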

Window Functions

The notebook uses window functions to:

rank nearest cities,
partition analytical results,
identify closest geographic matches.
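
One way to express the ranking is a window partitioned by project and ordered by distance, as in the sketch below. It assumes a DataFrame of project/city pairs with hypothetical columns `project_id`, `city`, and `distance`. The plain-Python function models the same logic for small samples.

```python
from itertools import groupby


def rank_locally(rows, top_n: int = 1):
    """Plain-Python model: rank (project_id, city, distance) tuples per project."""
    ordered = sorted(rows, key=lambda row: (row[0], row[2]))
    ranked = []
    for _, group in groupby(ordered, key=lambda row: row[0]):
        # Numbering restarts at 1 for each project, like row_number().
        for rank, row in enumerate(group, start=1):
            if rank <= top_n:
                ranked.append(row + (rank,))
    return ranked


def rank_nearest_cities(pairs, top_n: int = 1):
    """Keep the top_n closest cities per project in a Spark DataFrame."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import Window, functions as F

    by_project = Window.partitionBy("project_id").orderBy(F.col("distance").asc())
    return (
        pairs.withColumn("rank", F.row_number().over(by_project))
        .filter(F.col("rank") <= top_n)
    )
```

`row_number` assigns distinct ranks even when two cities are equally distant, so ties are broken arbitrarily; `rank` or `dense_rank` would keep tied rows together instead.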

Partitioned Storage

The transformed dataset is written as:

partitioned CSV files,
output split into one directory per state,
structured Spark partitions for downstream processing.
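
A partitioned write of this kind might look like the sketch below. The `state` partition column, the output path, and both function names are assumptions for illustration.

```python
def partition_dir(base_path: str, state: str) -> str:
    """Directory Spark creates for one partition value (key=value layout)."""
    return f"{base_path.rstrip('/')}/state={state}"


def write_by_state(df, base_path: str) -> None:
    """Write a Spark DataFrame as CSV, one subdirectory per distinct state."""
    (
        df.write
        .mode("overwrite")       # replace any earlier output at base_path
        .partitionBy("state")    # state moves into the path, not the files
        .option("header", True)  # write column names as the first row
        .csv(base_path)
    )


# Usage, assuming `roads` is a Spark DataFrame with a state column:
#   write_by_state(roads, "output/roads_by_state")
#   # Texas rows then live under partition_dir("output/roads_by_state", "TX")
```

Reading the base path back with `spark.read.option("header", True).csv(base_path)` restores the `state` column from the directory names.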

Example Processing Areas

The notebook demonstrates:

distributed transformations,
Spark aggregation workflows,
partitioned analytical processing,
geographic data analysis,
nearest-neighbor style matching,
large-scale DataFrame operations.

Technologies

Python
PySpark
Apache Spark
Spark DataFrames
Distributed Data Processing
Window Functions
Data Transformation
CSV Processing
Big Data Analytics
Google Colab

Goal

The goal of this project is to demonstrate practical distributed data processing skills using PySpark, including transformations, joins, window functions, partitioned storage, and large-scale analytical workflows.

Results

The notebook demonstrates:

Distributed Spark processing
Spark DataFrame transformations
Cross join operations
Window function workflows
Distance-based analysis
Partitioned CSV generation
Large-scale analytical processing
Practical PySpark data engineering techniques

Author

Paulina Broda
