This project covers the design, implementation, and management of a distributed database system built to improve performance, scalability, and availability. It integrates several database management techniques, including vertical and horizontal partitioning, distributed transaction management, and the combined use of relational and NoSQL database systems, and it focuses specifically on managing customer information, orders, products, shipments, and payments in a distributed environment.
The foundational structure of the distributed database is laid out in the schema, which includes multiple interconnected tables such as Customer, Payments, Orders, Products, Shipments, and OrderItems. This schema supports efficient data organization and retrieval. The project utilizes the PostgreSQL database system to implement these tables with detailed attributes, keys, and constraints.
- Customer Table:
  - Attributes: customer_id, customer_name, customer_email, customer_shipping_address, customer_region
  - Primary Key: customer_id
- Payments Table:
  - Attributes: payment_id, payment_date, payment_mode
  - Primary Key: payment_id
- Orders Table:
  - Attributes: order_id, customer_id, order_date, payment_id, quantity, price, order_sequence_id
  - Composite Primary Key: order_id, order_sequence_id
- Products Table:
  - Attributes: product_id, product_name, product_price, product_categories
  - Primary Key: product_id
- Shipments Table:
  - Attributes: shipping_id, order_id, shipment_date, customer_shipping_address, customer_region
  - Primary Key: shipping_id
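To make the schema concrete, here is a minimal sketch of how two of these tables might be declared through psycopg2; the column types, lengths, and the foreign-key constraint are assumptions, since the report lists only attribute names and keys.

```python
import psycopg2

# Placeholder connection parameters; adjust to the actual environment.
conn = psycopg2.connect(dbname="ecommerce", user="postgres", host="localhost")
cur = conn.cursor()

# Customer: single-column primary key, attributes as listed above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS Customer (
        customer_id               SERIAL PRIMARY KEY,
        customer_name             VARCHAR(100) NOT NULL,
        customer_email            VARCHAR(100),
        customer_shipping_address TEXT,
        customer_region           VARCHAR(20)
    );
""")

# Orders: composite primary key on (order_id, order_sequence_id),
# with an assumed foreign key back to Customer.
cur.execute("""
    CREATE TABLE IF NOT EXISTS Orders (
        order_id          INT,
        customer_id       INT REFERENCES Customer (customer_id),
        order_date        DATE,
        payment_id        INT,
        quantity          INT,
        price             NUMERIC(10, 2),
        order_sequence_id INT,
        PRIMARY KEY (order_id, order_sequence_id)
    );
""")
conn.commit()
```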
Data is inserted into these tables using Python scripts that leverage the Faker library to generate realistic and random data. The insert_random_data function ensures that the data is appropriately distributed across partitions, particularly for the Products table, based on the product_price field.
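The report names insert_random_data but does not show its body; the following is a hedged sketch of how such a function could generate Faker rows and route each Products record to the price-based partition tables described in the next section. The routing thresholds and column list are assumptions.

```python
import random
from faker import Faker

fake = Faker()

def insert_random_data(cur, n=100):
    """Sketch: generate fake product rows and route each one to the
    partition table that matches its price (the real script may differ)."""
    for product_id in range(1, n + 1):
        price = round(random.uniform(10, 500), 2)
        # Choose the target partition from the product_price value.
        if price <= 100:
            table = "Products_10_to_100"
        elif price <= 250:
            table = "Products_101_to_250"
        else:
            table = "Products_251_to_500"
        cur.execute(
            f"INSERT INTO {table} "
            "(product_id, product_name, product_price, product_categories) "
            "VALUES (%s, %s, %s, %s)",
            (product_id, fake.word().title(), price, fake.word()),
        )

# Usage: insert_random_data(conn.cursor(), n=500), then conn.commit().
```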
To optimize performance and manage large datasets, both vertical and horizontal partitioning techniques were employed.
- Vertical Partitioning:
  - New tables, SHIPMENT_DETAILS_TABLE and CUSTOMER_DETAILS_TABLE, were created from the original Shipments table to enhance data retrieval performance.
- Horizontal Partitioning:
  - The Products table was divided into three separate tables based on product price ranges: Products_10_to_100, Products_101_to_250, and Products_251_to_500. This approach enables efficient management of product data and improves query performance (see the DDL sketch after this list).
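One plausible way to realize the price-range split is PostgreSQL's declarative range partitioning; the table and partition names come from the list above, while the column types and exact bounds are assumptions.

```python
import psycopg2

conn = psycopg2.connect(dbname="ecommerce", user="postgres", host="localhost")
cur = conn.cursor()

# Declarative range partitioning: the lower bound is inclusive and the
# upper bound exclusive, so the bounds below assume integer-valued prices.
cur.execute("""
    CREATE TABLE Products (
        product_id         INT,
        product_name       VARCHAR(100),
        product_price      NUMERIC(10, 2),
        product_categories VARCHAR(50)
    ) PARTITION BY RANGE (product_price);

    CREATE TABLE Products_10_to_100  PARTITION OF Products
        FOR VALUES FROM (10)  TO (101);
    CREATE TABLE Products_101_to_250 PARTITION OF Products
        FOR VALUES FROM (101) TO (251);
    CREATE TABLE Products_251_to_500 PARTITION OF Products
        FOR VALUES FROM (251) TO (501);
""")
conn.commit()
```

Note that PostgreSQL requires any primary key on a partitioned table to include the partition column, so under this approach the product_id key from the schema would have to become (product_id, product_price); creating three independent tables with CHECK constraints on product_price avoids that restriction.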
A peer-to-peer replication model was implemented to increase data availability and fault tolerance. A centralized table, Customer_Region, was created to consolidate region-specific data, simplifying the replication structure and ensuring data is distributed and replicated efficiently.
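PostgreSQL has no turnkey peer-to-peer mode, so one way to approximate the model is logical replication, with each regional node publishing its copy of Customer_Region and subscribing to its peers; the publication, subscription, and connection names below are illustrative assumptions, not the project's actual configuration.

```python
import psycopg2

# Run on each regional node: publish the local Customer_Region table and
# subscribe to a peer's publication. DSNs and object names are placeholders.
node = psycopg2.connect(dbname="ecommerce", user="postgres", host="localhost")
node.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
cur = node.cursor()

cur.execute("CREATE PUBLICATION customer_region_pub FOR TABLE Customer_Region;")
cur.execute("""
    CREATE SUBSCRIPTION customer_region_sub
        CONNECTION 'host=peer-node dbname=ecommerce user=replicator'
        PUBLICATION customer_region_pub;
""")
```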
Indexes were created on several tables to improve query performance:
- APJ Region:
  - Index: customer_name_apj on customer_region_apj(customer_name)
- EMEA Region:
  - Index: customer_name_emea on customer_region_emea(customer_name)
- LATAM Region:
  - Index: customer_name_latam on customer_region_latam(customer_name)
- Generic Customer Table:
  - Index: customer_name_ on customer(customer_name)
These indexes enhance query performance for operations involving the customer_name column, leading to faster data retrieval and improved efficiency.
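These definitions map directly onto CREATE INDEX statements; below is a sketch of the corresponding DDL, run through psycopg2 for consistency with the project's other Python scripts.

```python
import psycopg2

conn = psycopg2.connect(dbname="ecommerce", user="postgres", host="localhost")
cur = conn.cursor()

# Index and table names are taken verbatim from the list above.
for stmt in (
    "CREATE INDEX customer_name_apj   ON customer_region_apj (customer_name);",
    "CREATE INDEX customer_name_emea  ON customer_region_emea (customer_name);",
    "CREATE INDEX customer_name_latam ON customer_region_latam (customer_name);",
    "CREATE INDEX customer_name_      ON customer (customer_name);",
):
    cur.execute(stmt)
conn.commit()
```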
The project also incorporated a NoSQL database, MongoDB, for handling distributed transactions. A MongoDB cluster was set up using Docker to represent a distributed database environment with three instances acting as primary and secondary replicas.
- Docker Network Creation: A network named ecommerceMongoCluster was created.
- Replica Set Initialization: The setup involved initializing a replica set to manage distributed transactions.
- Transaction Management: MongoDB's distributed transaction management system was used to ensure ACID properties are preserved across the distributed environment.
The MongoDB setup demonstrated effective handling of distributed transactions, ensuring data consistency, integrity, durability, and isolation across multiple nodes.
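The report does not include the setup commands; the sketch below shows one way to initialize such a replica set and run a multi-document transaction with pymongo. The container host names, the replica-set name rs0, and the database and collection names are all assumptions based on the description above.

```python
from pymongo import MongoClient

# Assumes three mongod containers on the shared Docker network, e.g.:
#   docker network create ecommerceMongoCluster
#   docker run -d --net ecommerceMongoCluster --name mongo1 mongo --replSet rs0
#   (and similarly for mongo2 and mongo3)
client = MongoClient("mongodb://mongo1:27017/?directConnection=true")

# Initialize the replica set once; one member is elected primary,
# the others become secondaries.
client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo1:27017"},
        {"_id": 1, "host": "mongo2:27017"},
        {"_id": 2, "host": "mongo3:27017"},
    ],
})

# Multi-document transaction: both writes commit or neither does,
# preserving ACID guarantees across the replica set.
rs_client = MongoClient(
    "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"
)
db = rs_client.ecommerce  # database and collection names are assumptions
with rs_client.start_session() as session:
    with session.start_transaction():
        db.orders.insert_one({"order_id": 1, "customer_id": 42}, session=session)
        db.payments.insert_one({"payment_id": 7, "order_id": 1}, session=session)
```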
MongoDB was selected for managing data related to the e-commerce domain. The project involved the implementation of CRUD operations and aggregation pipelines for data retrieval.
- Data Model:
  - Attributes: customer_name, customer_email, customer_shipping_address, customer_region
  - Data generated using Mockaroo was inserted into the MongoDB database.
- Aggregation Pipelines:
  - Functions like get_customers_detail_location and get_customers_detail_name were used to retrieve specific customer details based on region and name, demonstrating MongoDB's powerful data retrieval capabilities (a sketch follows this list).
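The bodies of these functions are not shown in the report; the following is a sketch of what get_customers_detail_location might look like as a pymongo aggregation pipeline. The connection URI, database, and collection names are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
customers = client.ecommerce.customers             # names assumed

def get_customers_detail_location(region):
    """Return name/email pairs for customers in one region; the pipeline
    is a sketch of the behavior described above."""
    pipeline = [
        {"$match": {"customer_region": region}},
        {"$project": {"_id": 0, "customer_name": 1, "customer_email": 1}},
        {"$sort": {"customer_name": 1}},
    ]
    return list(customers.aggregate(pipeline))

# Example: retrieve every customer in the APJ region.
apj_customers = get_customers_detail_location("APJ")
```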
The CRUD operations and aggregation pipelines were successfully executed, showing that MongoDB is effective for managing and querying large datasets in a distributed environment.
This project successfully implemented a distributed database system using both PostgreSQL and MongoDB, integrating various advanced techniques like partitioning, indexing, replication, and distributed transaction management. The combined approach enhances data availability, scalability, performance, and reliability, making it suitable for managing large-scale e-commerce data across different regions and platforms. The techniques employed provide valuable insights into optimizing database management in distributed systems.

