The customer is a Saudi budget airline based in Jeddah. The airline serves major domestic destinations in and around Saudi Arabia, targeting domestic travelers, pilgrims, and the growing number of tourists in the country with custom packages, competitive pricing, a best-in-class flight experience, and destination connectivity options.
Amazon Redshift is a fully managed, massively parallel processing (MPP), petabyte-scale, column-oriented, cloud-based data warehouse designed for storing and analyzing large data sets at scale. Redshift is designed to connect to SQL-based clients and BI tools, making data available to users in near real time. Redshift stands apart from on-premises and existing data warehouse solutions in the following areas:
- Performance: Redshift delivers fast query speeds on large data sets, handling data sizes up to a petabyte and beyond. Processing data at this scale is difficult to attain with traditional data warehousing, making Redshift a strong choice for applications that run large volumes of queries on demand.
- Cost: Redshift requires no substantial upfront cost to get set up, unlike many other data warehouse solutions. Being fully managed, Redshift carries no recurring hardware or maintenance costs, and on-demand pricing ensures you only pay for what you use.
- Scalability: Redshift offers flexible, elastic scaling. As your requirements change, Redshift can scale up or down to match your capacity and performance needs with a few clicks in the management console.
- Security: Amazon Redshift supports most AWS security features. Credentials and access are granted and managed at the AWS level through IAM. Inbound access control is configured through cluster security groups. Data encryption can be enabled at cluster creation; a cluster cannot be switched directly from encrypted to unencrypted. For organizations with a private cloud, access through an Amazon Virtual Private Cloud (VPC) environment is available. For data in transit, Redshift uses SSL encryption to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.
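The SSL-secured COPY path described above can be sketched as follows. This is an illustration only: the table, S3 path, and IAM role ARN are hypothetical placeholders, and Redshift encrypts the S3 transfer automatically.

```python
# Sketch: build a Redshift COPY statement that loads a CSV extract from S3.
# The table, S3 path, and IAM role below are hypothetical placeholders.

def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Return a COPY command that loads `table` from a CSV file in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

copy_sql = build_copy_statement(
    table="bookings",
    s3_path="s3://example-airline-staging/bookings/2024-01-01.csv",
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(copy_sql)
```

The statement would be executed against the cluster by any SQL client or ETL tool; the IAM role grants Redshift read access to the bucket without embedding long-lived keys.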
Pentaho Data Integration (PDI) provides the Extract, Transform, and Load (ETL) capabilities that facilitate capturing, cleansing, and storing data. The data can be stored in a uniform and consistent format that is easily accessible and relevant to end users. Common uses of PDI include:
- Data migration between different databases, file systems, and applications.
- Loading massive data sets into databases, taking full advantage of cloud, clustered, and massively parallel processing environments.
- Data cleansing, with steps ranging from very simple to very complex transformations.
- Data integration, including the ability to leverage real-time ETL as a data source for reporting engines.
- Data warehouse population, with support for slowly changing dimensions.
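PDI expresses pipelines like these as visual transformations rather than code. As a rough, hypothetical illustration of the watermark-based incremental pattern such a pipeline typically implements (extract only changed rows, cleanse, append), here is a minimal Python sketch with in-memory data:

```python
# Sketch of an incremental extract-cleanse-load pattern, as a PDI
# transformation might implement it. All data here is in-memory and
# hypothetical; PDI builds the same steps in its visual designer.

from datetime import date

source_rows = [
    {"id": 1, "fare": 150.0, "updated": date(2024, 1, 1)},
    {"id": 2, "fare": 210.0, "updated": date(2024, 1, 2)},
    {"id": 3, "fare": 99.0,  "updated": date(2024, 1, 3)},
]

def incremental_extract(rows, watermark):
    """Keep only rows changed after the previous load's watermark."""
    return [r for r in rows if r["updated"] > watermark]

def cleanse(rows):
    """Example cleansing step: round fares to whole currency units."""
    return [{**r, "fare": round(r["fare"])} for r in rows]

target = []
last_watermark = date(2024, 1, 1)  # recorded by the previous run
delta = cleanse(incremental_extract(source_rows, last_watermark))
target.extend(delta)
print(len(target))  # only the two rows updated after the watermark
```

The watermark (here a date) is persisted between runs, so each scheduled execution moves only the new or changed rows.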
The solution for the customer is hosted in an AWS environment, with a Redshift data warehouse and PDI deployed on an EC2 instance. The source data resides in a SQL Server database. The IP addresses of the development server on which PDI resides are whitelisted to grant access to the database. Data from SQL Server is pushed to Redshift using pipelines created in PDI's user-friendly GUI. The data refresh is scheduled incrementally, once a day at the same time (similar to the existing setup). The relevant tables from Redshift are then connected to Power BI, where the team has written queries to optimize the datasets before loading them into Power BI. The solution replaces a costly, slow-performing system with a high-performance, cost-effective, and scalable one.
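For the daily incremental refresh into Redshift, a common approach is a staging-table merge: load the day's delta into a staging table, delete the matching keys from the target, then insert the staged rows. The sketch below only builds the SQL statements for that pattern; the table and key names are hypothetical placeholders, not the customer's actual schema.

```python
# Sketch: delete-then-insert merge, a common pattern for daily
# incremental refreshes into Redshift. Table and key names are
# hypothetical placeholders.

def build_merge_statements(target: str, staging: str, key: str) -> list:
    """Return the SQL statements for a staging-table merge, in order."""
    return [
        "BEGIN;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        f"TRUNCATE {staging};",
        "COMMIT;",
    ]

for stmt in build_merge_statements("bookings", "bookings_staging", "booking_id"):
    print(stmt)
```

Wrapping the delete and insert in one transaction keeps the target table consistent for any Power BI query that runs mid-refresh.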
The project kicks off with requirement gathering and the setup of the data warehouse and PDI platform with the required configurations. The PDI and data warehouse servers are whitelisted to access the source systems and SQL Server. The implementation phase starts with data modeling and new schema creation.
Data modeling and new schema creation are based on the source systems and their integration with Redshift. Pipelines are set up in PDI to load data from the source systems into Redshift on a daily or real-time basis, as required. After the data refresh, a data gateway is set up to connect Redshift to Power BI, with the refresh cycle scheduled based on business needs. Power BI datasets are mapped to Redshift tables, and the reports and dashboards are reconfigured based on the new datasets. After mapping, unit testing and validation are performed. Once UAT and validation are confirmed by the customer, the solution is pushed to production. Training the customer team, project handover, and sign-off conclude the project.
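Beyond the scheduled cycle, a Power BI dataset refresh can also be triggered on demand through the Power BI REST API. A minimal sketch is shown below; the dataset ID is a placeholder, acquiring the Azure AD bearer token is omitted, and no request is actually sent here.

```python
# Sketch: assemble an on-demand refresh call for a Power BI dataset via
# the Power BI REST API. The dataset ID and token are placeholders, and
# the HTTP request itself is not sent in this snippet.

def build_refresh_request(dataset_id: str) -> dict:
    """Return the method, URL, and headers for a dataset refresh call."""
    return {
        "method": "POST",
        "url": f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/refreshes",
        "headers": {"Authorization": "Bearer <azure-ad-token>"},
    }

req = build_refresh_request("00000000-0000-0000-0000-000000000000")
print(req["method"], req["url"])
```

In practice this request would be issued by an HTTP client after the PDI load completes, so the dashboards pick up the new data without waiting for the next scheduled cycle.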