Uber-DataEngg

A data engineering project with an end-to-end implementation.

Step 1: Data Collection
Dataset used: https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/data/uber_data.csv

More information about the dataset:
Website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Dictionary - https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

Step 2: Understand the data and create a Data Model

Data Model link: https://app.diagrams.net/#G1cpJ7ZjPlVsL3VrL5dxiFmJbyjGoe2Mn0

Step 3: Create a Data Flow Diagram

Step 4: Create a PostgreSQL RDS instance on AWS

4.1 Connect to it using Python (File: Uber.ipynb)
4.2 Create the dimension and fact tables and populate them (File: DML Statements)
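
The shape of step 4.2 can be sketched as follows. This is a minimal illustration, not the contents of the DML Statements file: the table names `payment_type_dim` and `fact_trip` are hypothetical, the trip rows are made up, and Python's built-in `sqlite3` stands in for the Postgres RDS instance so the sketch runs without AWS access. Against the real instance, similar statements would go through a Postgres driver, and the column types may need adjusting.

```python
import sqlite3

# sqlite3 stands in for the Postgres RDS instance so the sketch runs locally.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Dimension table: one row per payment type code (hypothetical table name).
conn.execute("""
    CREATE TABLE payment_type_dim (
        payment_type_id   INTEGER PRIMARY KEY,
        payment_type_name TEXT NOT NULL
    )
""")

# Fact table: one row per trip, pointing at the dimension by key.
conn.execute("""
    CREATE TABLE fact_trip (
        trip_id         INTEGER PRIMARY KEY,
        payment_type_id INTEGER NOT NULL
            REFERENCES payment_type_dim (payment_type_id),
        trip_distance   REAL,
        fare_amount     REAL
    )
""")

# Payment type codes as listed in the TLC yellow taxi data dictionary.
payment_types = [
    (1, "Credit card"), (2, "Cash"), (3, "No charge"),
    (4, "Dispute"), (5, "Unknown"), (6, "Voided trip"),
]
conn.executemany("INSERT INTO payment_type_dim VALUES (?, ?)", payment_types)

# Made-up trips for illustration only; real rows would come from uber_data.csv.
trips = [(1, 1, 2.5, 9.0), (2, 2, 1.1, 6.5), (3, 1, 7.8, 24.0)]
conn.executemany("INSERT INTO fact_trip VALUES (?, ?, ?, ?)", trips)
conn.commit()


def fare_by_payment_type(connection):
    """Join the fact table to the dimension and total fares per payment type."""
    query = """
        SELECT d.payment_type_name, SUM(f.fare_amount)
        FROM fact_trip AS f
        JOIN payment_type_dim AS d USING (payment_type_id)
        GROUP BY d.payment_type_name
        ORDER BY d.payment_type_name
    """
    return connection.execute(query).fetchall()


totals = fare_by_payment_type(conn)
print(totals)
```

Populating the dimension before the fact table matters here: with foreign keys enforced, a trip that references a missing `payment_type_id` is rejected at insert time.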

Step 5: Perform ETL on the data using AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service from Amazon Web Services (AWS). It helps users prepare and transform data for analytics and data-driven applications. Glue runs transformations as Apache Spark ETL jobs, which lets complex processing scale out across large datasets. Jobs can be authored visually in AWS Glue Studio or written as Spark scripts, and Glue workflows and triggers can schedule and chain them so the pipeline runs without manual steps.

AWS Glue simplifies the process of preparing and transforming data, making it suitable for various data analytics, machine learning, and data warehousing applications within the AWS ecosystem.
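
To make the extract, transform, and load stages concrete, here is a small sketch in plain Python. It is not the Glue job from this project (see ETL-AWS Glue.md for that) and it does not use Spark; it only shows the kind of work a transform step does. The sample rows are made up, the column names follow the TLC yellow taxi data dictionary, and the derived field names (`pickup_hour`, `pickup_weekday`, `duration_minutes`) are assumptions.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the shape of the TLC yellow taxi export; illustration only.
RAW_CSV = """tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance
2016-03-01 00:00:00,2016-03-01 00:07:55,2.50
2016-03-01 00:00:00,2016-03-01 00:07:55,2.50
2016-03-10 07:07:10,2016-03-10 07:31:10,8.00
"""

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop exact duplicates, parse types, derive date parts."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            continue  # skip rows identical to one already processed
        seen.add(key)
        pickup = datetime.strptime(row["tpep_pickup_datetime"], TIMESTAMP_FORMAT)
        dropoff = datetime.strptime(row["tpep_dropoff_datetime"], TIMESTAMP_FORMAT)
        cleaned.append({
            "pickup_hour": pickup.hour,
            "pickup_weekday": pickup.weekday(),  # 0 = Monday
            "duration_minutes": (dropoff - pickup).total_seconds() / 60,
            "trip_distance": float(row["trip_distance"]),
        })
    return cleaned


def load(rows, target):
    """Load: append rows to a target list standing in for a warehouse table."""
    target.extend(rows)
    return len(rows)


warehouse_table = []
loaded = load(transform(extract(RAW_CSV)), warehouse_table)
print(loaded, warehouse_table)
```

In a Spark-based job the same steps would operate on distributed DataFrames rather than Python lists, but the division into extract, transform, and load stays the same.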

File: ETL-AWS Glue.md

Step 6: Perform Analysis using Amazon Redshift
Amazon Redshift is a fully managed, cloud-based data warehousing service provided by Amazon Web Services (AWS). It is designed for large-scale analytics and data warehousing workloads. Redshift stores data in a columnar format, so a query reads only the columns it needs, which reduces I/O and improves compression. Queries run on a cluster of nodes that work in parallel, with the work distributed across nodes for faster execution.
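
The benefit of a columnar layout can be seen in a toy model. The sketch below is not how Redshift is implemented (it ignores disk blocks, compression, and distribution across nodes); it only counts how many values an aggregate has to touch in a row layout versus a column layout. The trip rows and function names are made up for illustration.

```python
# Made-up trips; each dict is one row in a row-oriented layout.
rows = [
    {"trip_id": 1, "passenger_count": 1, "trip_distance": 2.5, "fare_amount": 9.0},
    {"trip_id": 2, "passenger_count": 2, "trip_distance": 1.1, "fare_amount": 6.5},
    {"trip_id": 3, "passenger_count": 1, "trip_distance": 7.8, "fare_amount": 24.0},
]


def to_columns(row_store):
    """Pivot a list of row dicts into one list per column."""
    columns = {name: [] for name in row_store[0]}
    for row in row_store:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def total_fare_row_store(row_store):
    """Row layout: each full row is read to reach a single field."""
    total = 0.0
    values_read = 0
    for row in row_store:
        values_read += len(row)  # the whole row is loaded
        total += row["fare_amount"]
    return total, values_read


def total_fare_column_store(columns):
    """Column layout: only the fare_amount column is read."""
    fares = columns["fare_amount"]
    return sum(fares), len(fares)


row_total, row_reads = total_fare_row_store(rows)
col_total, col_reads = total_fare_column_store(to_columns(rows))
print(row_total, row_reads)
print(col_total, col_reads)
```

Both layouts give the same total, but the column layout touches one value per trip instead of every field in every row. The gap widens as tables gain more columns.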

File: Redshift.md

Credits: https://www.youtube.com/watch?v=WpQECq5Hx9g
