This project presents distributed data processing workflows using PySpark and Spark DataFrames.
The solution focuses on large-scale data transformations, aggregations, cross joins, window functions, distance calculations, and partitioned storage workflows commonly used in distributed analytics and data engineering.
- Distributed data processing using PySpark
- Spark DataFrame transformations
- Aggregation and filtering workflows
- Cross join operations
- Window function implementation
- Distance calculations using geographic coordinates
- Partitioned CSV generation
- Large-scale analytical processing
The project uses internally generated datasets simulating:
- US city geographic data
- Road construction projects
- Distance-based analytical workflows
The datasets are generated directly in the notebook, so the project is fully reproducible and does not depend on external data sources.
The project includes the following distributed processing tasks:
The notebook creates distributed Spark DataFrames containing:
- city names,
- state identifiers,
- geographic coordinates,
- population data,
- road construction records.
The transformation workflow includes:
- distance conversion from miles to kilometers,
- timestamp parsing,
- duration calculation,
- derived column generation,
- distributed filtering operations.
The analysis demonstrates:
- grouped aggregations,
- population analysis,
- average distance calculations,
- construction duration analysis,
- distributed analytical summaries.
The project performs:
- cross joins between road projects and cities,
- Euclidean distance calculations,
- nearest city matching workflows,
- geographic analytical processing.
The notebook uses window functions to:
- rank nearest cities,
- partition analytical results,
- identify closest geographic matches.
The transformed dataset is written as:
- partitioned CSV files,
- state-level distributed output,
- structured Spark partitions for downstream processing.
The notebook demonstrates:
- distributed transformations,
- Spark aggregation workflows,
- partitioned analytical processing,
- geographic data analysis,
- nearest-neighbor style matching,
- large-scale DataFrame operations.
- Python
- PySpark
- Apache Spark
- Spark DataFrames
- Distributed Data Processing
- Window Functions
- Data Transformation
- CSV Processing
- Big Data Analytics
- Google Colab
The goal of this project is to demonstrate practical distributed data processing skills using PySpark, including transformations, joins, window functions, partitioned storage, and large-scale analytical workflows.
The solution demonstrates:
- Distributed Spark processing
- Spark DataFrame transformations
- Cross join operations
- Window function workflows
- Distance-based analysis
- Partitioned CSV generation
- Large-scale analytical processing
- Practical PySpark data engineering techniques
Paulina Broda