im-aditi/Uber-DataEngg

Uber-DataEngg

An end-to-end data engineering project: data modeling, a Postgres database on AWS RDS, ETL with AWS Glue, and analysis with Amazon Redshift.

Step 1: Data Collection
Dataset used: https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/data/uber_data.csv

More information about the dataset:
Website: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

Step 2: Understanding the data and creating a Data Model

Data Model: [Screenshot of the data model diagram]

Link: https://app.diagrams.net/#G1cpJ7ZjPlVsL3VrL5dxiFmJbyjGoe2Mn0

Step 3: Create a Data Flow Diagram

[Screenshot of the data flow diagram]

Step 4: Create a Postgres RDS instance on AWS

[Screenshot of the Postgres RDS instance]

4.1 Connect to the database using Python (File: Uber.ipynb)
4.2 Create the dimension and fact tables and populate them (File: DML Statements)
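
To illustrate step 4.2, here is a minimal sketch of creating and populating one dimension table and one fact table, then joining them. It is not the project's actual schema: the table names (`payment_type_dim`, `trip_fact`), the column choices, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the Postgres RDS instance so the example runs without AWS credentials. Against RDS, the connection would instead come from a Postgres driver such as `psycopg2`, which uses `%s` placeholders rather than `?`.

```python
import sqlite3

# Hypothetical star-schema fragment: one dimension table and one fact table.
DDL = """
CREATE TABLE payment_type_dim (
    payment_type_id   INTEGER PRIMARY KEY,
    payment_type_name TEXT NOT NULL
);
CREATE TABLE trip_fact (
    trip_id         INTEGER PRIMARY KEY,
    payment_type_id INTEGER NOT NULL
        REFERENCES payment_type_dim (payment_type_id),
    trip_distance   REAL,
    fare_amount     REAL
);
"""


def build_star_schema(conn):
    """Create the tables and load a few made-up sample rows."""
    conn.executescript(DDL)
    # Payment type codes 1 and 2 follow the TLC data dictionary.
    conn.executemany(
        "INSERT INTO payment_type_dim VALUES (?, ?)",
        [(1, "Credit card"), (2, "Cash")],
    )
    conn.executemany(
        "INSERT INTO trip_fact VALUES (?, ?, ?, ?)",
        [(1, 1, 1.5, 9.0), (2, 2, 3.0, 14.5), (3, 1, 0.5, 5.0)],
    )
    conn.commit()


def fare_by_payment_type(conn):
    """Join the fact table to its dimension and total the fares."""
    query = """
        SELECT d.payment_type_name, SUM(f.fare_amount)
        FROM trip_fact AS f
        JOIN payment_type_dim AS d
          ON d.payment_type_id = f.payment_type_id
        GROUP BY d.payment_type_name
        ORDER BY d.payment_type_name
    """
    return conn.execute(query).fetchall()


# SQLite stand-in for the RDS connection (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(fare_by_payment_type(conn))
```

The fact table holds only keys and measures, while descriptive text lives in the dimension table, so a report is a join plus an aggregate.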

Step 5: Perform ETL on the data using AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service from Amazon Web Services (AWS) for preparing and transforming data for analytics and data-driven applications. Glue runs ETL jobs on Apache Spark, so transformations can scale to large datasets. Jobs can be authored visually in AWS Glue Studio or written as Spark scripts, and Glue workflows and triggers can schedule and orchestrate them.

Because it is managed and integrates with other AWS services, Glue suits data analytics, machine learning, and data warehousing workloads within the AWS ecosystem.

File: ETL-AWS Glue.md
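
As a rough illustration of the kind of transform an ETL job performs on this data, the sketch below deduplicates trip rows and splits them into a datetime dimension and a fact list. It uses only the Python standard library rather than Spark or the Glue API, so it shows the logic, not how the Glue job in this repository is written. The sample rows and the function name `transform` are made up; the column names `VendorID`, `tpep_pickup_datetime`, `tpep_dropoff_datetime`, and `fare_amount` come from the TLC yellow taxi data dictionary.

```python
import csv
import io
from datetime import datetime

# Made-up sample rows; the second row is an exact duplicate of the first.
SAMPLE_CSV = """VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,fare_amount
1,2016-03-01 00:00:00,2016-03-01 00:07:55,9.0
1,2016-03-01 00:00:00,2016-03-01 00:07:55,9.0
2,2016-03-01 08:15:30,2016-03-01 08:40:00,14.5
"""


def transform(rows):
    """Drop exact duplicate rows, then build a datetime dimension and facts."""
    seen = set()
    datetime_dim = []
    facts = []
    for row in rows:
        key = tuple(row.values())
        if key in seen:
            continue  # skip rows already processed
        seen.add(key)

        pickup = datetime.strptime(
            row["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S"
        )
        datetime_id = len(datetime_dim) + 1  # simple surrogate key
        datetime_dim.append(
            {
                "datetime_id": datetime_id,
                "pickup_hour": pickup.hour,
                "pickup_day": pickup.day,
                "pickup_month": pickup.month,
                "pickup_year": pickup.year,
            }
        )
        facts.append(
            {
                "datetime_id": datetime_id,
                "vendor_id": int(row["VendorID"]),
                "fare_amount": float(row["fare_amount"]),
            }
        )
    return datetime_dim, facts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
datetime_dim, facts = transform(rows)
print(len(rows), "rows in,", len(facts), "facts out")
```

In a Spark job the same steps would typically be expressed as DataFrame operations so they can run in parallel across the cluster.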

Step 6: Perform Analysis using Amazon Redshift
Amazon Redshift is a fully managed, cloud-based data warehousing service from Amazon Web Services (AWS), designed for large-scale analytics workloads. Redshift stores data in a columnar format, so a query reads only the columns it needs, which reduces I/O and improves compression. It also uses a massively parallel processing (MPP) architecture: a leader node plans each query and distributes the work across compute nodes that execute it in parallel.

File: Redshift.md
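
The benefit of columnar storage described above can be seen in a toy example. The sketch below is plain Python, not Redshift code, and the sample rows and the helper `to_columns` are hypothetical. It converts row-oriented records into one list per column, after which an aggregate over `fare_amount` touches only that column's values.

```python
# Row-oriented layout: every record carries all of its fields.
rows = [
    {"vendor_id": 1, "trip_distance": 1.5, "fare_amount": 9.0},
    {"vendor_id": 2, "trip_distance": 3.0, "fare_amount": 14.5},
    {"vendor_id": 1, "trip_distance": 0.5, "fare_amount": 5.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Column-oriented layout: the aggregate reads one list and ignores the rest.
total_fare = sum(columns["fare_amount"])
print(total_fare)  # 28.5
```

Values within one column also share a type and often repeat, which is part of why columnar layouts tend to compress well.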

Credits: https://www.youtube.com/watch?v=WpQECq5Hx9g
