This project focuses on generating Month-end and Year-end reports related to Policy, Claims, and Sales for the Western European Market (WEM). The objectives are to reduce operating costs and gain efficiencies through vendor consolidation, enhance productivity and service quality annually, and accelerate the process of scaling up resources to meet business demands. WEM operations span four countries: Spain (including the "Genesis" brand), Ireland, Northern Ireland (Hughes Insurance), and Portugal. Initially, the project targets private motor insurance with Liberty Insurance (Ireland), Genesis (Spain), Liberty Seguros (Portugal), and Regal (Catalonia, Spain).
The Technical Data Architecture for WEM includes multiple source systems such as EIS (PostgreSQL RDS), Outsystems (Microsoft SQL RDS), Rating (Amazon DynamoDB), Legacy systems (AS/400/DB2), Bank sources/Hydra (Adyen, JPMC), and RDM (source CSV files). AWS Pipeline (DMS task, Lambda function, ABC function, Step Function) will be used for the PostgreSQL, Microsoft SQL, DynamoDB, and JPMC sources, while IICS mappings will extract data from DB2 and RDM files to the Redshift DB. Data will flow from sources to the Redshift DB (Stage & ODS tables) for transformation and loading into Dim/Fact tables. Final reports will be generated from views built on these tables.
The main purpose of the project is to generate different Month-end and Year-end reports (related to Policy, Claims, Sales etc.). The key objectives of the project are to:
This project is intended only for the Western European Market (WEM). WEM comprises operations in four countries:
For the purposes of the GDE (Global Digital Ecosystem) engagement with WEM, the initial focus is on private motor insurance with Liberty Insurance (Ireland) and the Genesis brand (Spain), with Liberty Seguros (Portugal) and the Regal brand (Catalonia, Spain) also in scope.
The data platform architecture consists of multiple source systems, which are listed below.
EIS – PostgreSQL RDS
Outsystems – Microsoft SQL RDS
Rating – Amazon DynamoDB
Legacy systems – AS/400 / DB2
Bank sources/Hydra – Adyen, JPMC
RDM – source CSV files
Note: for the PostgreSQL, Microsoft SQL, DynamoDB, and JPMC sources we will be using an AWS pipeline (DMS task, Lambda function, ABC function, Step Function, etc.), and for the DB2 and RDM file sources we will be using IICS mappings to extract the data from the source to the Redshift DB; we will then apply different transformations to prepare and load the data into the proper locations.
Once the data from the different sources is loaded into the Redshift DB (Stage & ODS tables), it is loaded into the Dim/Fact tables, and on top of the facts and dimensions several views are created, which are used to generate the Month-end and Year-end reports.
In our ETL process we have used Informatica Intelligent Cloud Services (IICS) as the ETL tool, as IICS has several features that help us overcome the limitations of other approaches. Some of the advantages that our ETL process has over the previously mentioned processes are as follows:
Implemented on cloud containers: As mentioned earlier, we have used IICS as the ETL tool, which is one of the best cloud ETL tools available in the market. As a cloud-native platform, IICS leverages the scalability, flexibility, and cost-effectiveness of cloud infrastructure. This architecture eliminates the need for on-premises hardware and maintenance, reducing IT overhead and enabling faster time-to-value.
Various data sources: IICS offers seamless connectivity to a wide range of data sources and applications. In our end-to-end process we are able to use different sources such as PostgreSQL, Microsoft SQL, Amazon DynamoDB, and CSV files.
Scalable: IICS is designed to scale effortlessly, allowing organizations to handle large volumes of data without compromising performance. It can easily adapt to growing data needs, making it suitable for businesses of all sizes.
Real-time integration: IICS supports real-time integration, enabling us to access and analyze data as it becomes available. This real-time visibility into business processes allows for faster decision-making and greater agility in responding to changing market conditions.
User-friendly interface: IICS provides a user-friendly interface that allows developers to quickly design, deploy, and manage data integration processes without extensive coding. Its drag-and-drop interface simplifies the development process, making it accessible to a wider range of users.
The primary objective of this study is to develop an automated ETL (Extract, Transform, Load) pipeline to generate financial reports at both month-end and year-end for the Western European Market (WEM). This project aims to improve operational efficiency, reduce costs, and enhance service quality through vendor consolidation and automation.
The specific goals include:
The dataset used in this study primarily consists of data related to policies, claims, and sales for the Western European Market. The data is sourced from multiple systems, including:
The ETL pipeline uses AWS services such as DMS tasks, Lambda functions, and Step Functions to extract data from these sources. Data is then transformed and loaded into a centralized Redshift database, where it is processed and used to generate the final financial reports.
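To make the extraction step concrete, the snippet below is a minimal sketch of how such an AWS-driven extraction could be triggered with boto3. The task and state-machine ARNs, region, and input payload are placeholders, not the project's actual resources.

import json
import boto3

dms = boto3.client("dms")
sfn = boto3.client("stepfunctions")

# Start (or resume) the DMS replication task that lands source data in S3 Tier 1.
dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:eu-west-1:123456789012:task:wem-eis-to-s3",
    StartReplicationTaskType="start-replication",
)

# Kick off the Step Function that loads the landed files into Redshift.
sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:wem-etl-load",
    input=json.dumps({"run_date": "2024-01-31"}),
)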
The data platform architecture consists of multiple source systems, as listed earlier.
From the above-mentioned diverse sources, the data is loaded into the landing layer, i.e., S3 Bucket Tier 1. The ETL framework then consists of two components: the actual data pipeline and the reference data pipeline.
Data loaded into the S3 Bucket Tier 1 landing layer contains raw data. From S3 Bucket Tier 1, data takes two different ETL paths, processing through different layers before loading into the facts and dimensions.
a. S3 Bucket Tier 1 (Landing Layer) to S3 Bucket Tier 2 (Cleansed Layer): Data from S3 Bucket Tier 1 is loaded into S3 Bucket Tier 2 using IICS, which involves different data-format conversion transformations.
b. S3 Bucket Tier 2 (Cleansed Layer) to Facts and Dimensions: External tables are configured on top of the S3 Bucket Tier 2 files. Data from the external tables is loaded into the facts and dimensions with SCD Type 2 configuration using IICS; the external-table setup is sketched below.
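As an illustration of step (b), the sketch below shows how an external schema and table could be registered over the Tier 2 Parquet files using Redshift Spectrum. The bucket, schema, columns, and IAM role are assumptions for the example; the SCD Type 2 load itself is performed by IICS.

import psycopg2

EXTERNAL_SCHEMA_DDL = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS tier2_ext
FROM DATA CATALOG DATABASE 'wem_tier2'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE tier2_ext.policysummary (
    id     BIGINT,
    status VARCHAR(50),
    ts_ms  TIMESTAMP
)
PARTITIONED BY (load_date DATE)
STORED AS PARQUET
LOCATION 's3://wem-tier2-cleansed/policysummary/';
"""

conn = psycopg2.connect(host="redshift-host", port=5439, dbname="gde_wem",
                        user="etl_user", password="***")
conn.autocommit = True  # external DDL cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute(EXTERNAL_SCHEMA_DDL)
    cur.execute(EXTERNAL_TABLE_DDL)
conn.close()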
The new ETL framework is configured to load the data from S3 Bucket Tier 1 to the facts and dimensions. This involves the flow below, which processes data using Python and SQL scripts (a skeleton of this flow is sketched after the list):
S3 Tier 1 -> Staging -> ODS -> Batch Master -> Dim/Fact
• S3 Tier 1 (maintains history)
• Staging (truncate and load) [DB: Redshift]
• ODS (maintains SCD Type 1) [DB: Redshift]
• Batch Master (truncate and load) [DB: Redshift]
• Dim/Fact (maintains SCD Type 2) [DB: Redshift]
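The skeleton below is a hedged sketch of this flow as a single Python driver; in the actual pipeline these steps run as Step Function states, and the SQL file names, connection details, and helper function are hypothetical.

from pathlib import Path
import psycopg2

def run_sql(conn, path: str) -> None:
    """Execute one SQL script file against Redshift and commit."""
    with conn.cursor() as cur:
        cur.execute(Path(path).read_text())
    conn.commit()

def run_batch(conn) -> None:
    run_sql(conn, "sql/01_load_staging.sql")       # Staging: truncate and load from S3 Tier 1
    run_sql(conn, "sql/02_merge_ods.sql")          # ODS: SCD Type 1 merge on primary keys
    run_sql(conn, "sql/03_load_batch_master.sql")  # Batch Master: truncate and load batch IDs
    run_sql(conn, "sql/04_load_dim_fact.sql")      # Dim/Fact: SCD Type 2 maintenance

if __name__ == "__main__":
    conn = psycopg2.connect(host="redshift-host", port=5439, dbname="gde_wem",
                            user="etl_user", password="***")
    try:
        run_batch(conn)
    finally:
        conn.close()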
Reference Data Management: Reference data is data used to classify or categorize other data. Typically, it is static or changes slowly over time.
a. The actual RDM model is held on an RDS PostgreSQL database and is based on a Data Vault modeling method.
b. Multiple sources of reference data can be loaded into RDM, and different delivery/retrieval methods are used for this data, for example DMS replication tasks and IICS mappings.
Fig below: WEM Technical Data Architecture
Fig below: Reference Data Management architecture diagram
The data coming from the various source systems is ingested into the AWS S3 Tier 1 bucket through AWS DMS and AWS Kinesis on a real-time basis in CSV format. This becomes the source folder for data to be merged at scheduled times, either hourly or daily (depending upon the source).
A Lambda function or ECS server (depending upon the volume of data) is scheduled via a CloudWatch cron to run on an hourly basis. This function merges files from the folder where data is ingested on S3; because the file names in this folder contain a timestamp, merging is done according to the file names. The merged files are then stored in the same S3 bucket in a different folder, in two formats: CSV and Parquet. External tables are created on top of the Parquet files, partitioned on a daily basis, which allows us to query the raw data.
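A minimal sketch of this merge step is shown below; the bucket, prefixes, and handler shape are illustrative assumptions, and header handling between the concatenated CSV parts is omitted for brevity.

import boto3

s3 = boto3.client("s3")
BUCKET = "wem-tier1-landing"  # placeholder bucket name

def lambda_handler(event, context):
    prefix = event.get("source_prefix", "EIS/public/policysummary/")
    # Collect the timestamp-named CSV parts in filename (i.e. time) order.
    keys = sorted(
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix)
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".csv")
    )
    # Concatenate the parts into one merged object in a separate folder.
    merged = b"".join(s3.get_object(Bucket=BUCKET, Key=k)["Body"].read() for k in keys)
    s3.put_object(Bucket=BUCKET, Key=f"merged/{prefix}latest.csv", Body=merged)
    return {"merged_files": len(keys)}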
The CSV files are used to load the data into the stage layer through Step Functions. Once merging is done, a new Lambda is invoked. This function reads a pointer file, which contains parameters such as start_time and end_time and resides in the same S3 bucket. The Lambda then checks whether the last triggered Step Function execution is still running; if so, the Step Function is not triggered again, otherwise a new execution starts (a minimal sketch of this check follows). The Step Function has multiple steps to load the data into the different layers, which are listed and described below.
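The sketch below illustrates that trigger logic under assumed names (the bucket, pointer-file key, and state-machine ARN are placeholders): read the pointer file, skip if the previous execution is still running, otherwise start a new one.

import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

BUCKET = "wem-tier1-landing"
POINTER_KEY = "control/pointer.json"
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:wem-stage-load"

def lambda_handler(event, context):
    pointer = json.loads(s3.get_object(Bucket=BUCKET, Key=POINTER_KEY)["Body"].read())

    # Do not start a new run while the previous one is still in progress.
    running = sfn.list_executions(
        stateMachineArn=STATE_MACHINE_ARN, statusFilter="RUNNING", maxResults=1
    )["executions"]
    if running:
        return {"started": False, "reason": "previous execution still running"}

    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"start_time": pointer["start_time"],
                          "end_time": pointer["end_time"]}),
    )
    return {"started": True}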
The Python scripts below were created for the data load process:
This Python script reads the Excel file and creates the tables in Redshift STG and ODS. The script does the following:
This Python script is used to create a template config file, which can be modified as per our needs. It will contain the following details:
Sample config.json file:

{
  "table_details": {
    "table-name": "billablepolicytern",
    "redshift_table_name": "stg_billablepolicyterm"
  },
  "column_rename": {},
  "sort_by_columns": ["ID", "TS MS"],
  "group_by_columns": ["ID"],
  "timestamp_columns": ["TS MS"],
  "Primary_key": ["ID"],
  "other_details": {
    "destination_folder": "EIS_Temp/Data/CSV/",
    "file_location": "AGGREGATE/EIS/CSV/public/"
  }
}
This Python script loads data from S3 to the ODS layer using the COPY command. The table list inside the script must be edited according to need.
{ Tablename": { "stage_table": "stg_tablename", "ods_table": "ods_tablename", "primary_key": [ List of columns acting as PK], "stage_schema_name": "stg_schemaname", "ods_schema_name": "ods_schemaname" }
{ "redshift_Creds": { "stage_schema_name": "gde_wem_stg" }, "table_name": { "qualification", "posadjustments", "paymentoption", "policysummary", "customer", "contactinfo", "addressline" } }
The above two config files (.json) store the details of all the tables that need to be created. If any new tables need to be created at later stages, their details must be added to these config files, and when the Python script runs it will create those new tables in Redshift.
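As a hedged illustration of how such a config could drive the load (the file name, bucket, and IAM role below are assumptions), the script reads the table list and issues one Redshift COPY per staging table:

import json
import psycopg2

with open("copy_config.json") as fh:  # hypothetical config file name
    cfg = json.load(fh)

schema = cfg["redshift_Creds"]["stage_schema_name"]

conn = psycopg2.connect(host="redshift-host", port=5439, dbname="gde_wem",
                        user="etl_user", password="***")
try:
    with conn.cursor() as cur:
        for table in cfg["table_name"]:
            cur.execute(f"""
                COPY {schema}.stg_{table}
                FROM 's3://wem-tier1-landing/merged/{table}/'
                IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
                FORMAT AS CSV
                IGNOREHEADER 1
                TIMEFORMAT 'auto';
            """)
    conn.commit()
finally:
    conn.close()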
This Python script is used to move the report, once it has been generated through the IICS mapping execution, to a predefined location so that the customer can access it with ease. We have used the AWS S3 cp command to move the file.
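A minimal sketch of that move step is below; the source and destination paths are placeholders, and boto3's copy_object would work equally well instead of shelling out to the AWS CLI.

import subprocess

SRC = "s3://wem-reports-staging/monthend/policy_report.csv"
DST = "s3://wem-reports-shared/2024-01/policy_report.csv"

# Mirror of the aws s3 cp command mentioned above.
subprocess.run(["aws", "s3", "cp", SRC, DST], check=True)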
Fig above: AWS S3 > Redshift using python script
This layer contains the incremental data load. A function reads a file from AWS S3, performs partitioning and sorting to pick the latest record from the file, and then writes that file to the stage folder in S3. The data is then copied into the Redshift stage internal tables.
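A small sketch of the "pick the latest record" step is shown below, assuming the key and timestamp columns from the sample config above (ID and TS MS); the file paths are illustrative.

import pandas as pd

def latest_records(csv_path: str) -> pd.DataFrame:
    """Keep only the newest row per ID, based on the TS MS change timestamp."""
    df = pd.read_csv(csv_path)
    return df.sort_values(["ID", "TS MS"]).drop_duplicates(subset=["ID"], keep="last")

# Example usage: write the deduplicated file to the stage folder before the COPY.
# latest_records("merged/policysummary/latest.csv").to_csv("stage/policysummary.csv", index=False)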
This layer does not preserve the history of the data. The load happens based on the parameter 'is_update'. If this parameter is set to 'True', it indicates we are dealing with data that has updates on the same ID or has an operation type.
The Batch Master layer table stores the IDs of the master table for every batch load for each entity. Batch master tables act as the master table (driving entity) for the dimension or fact query, and are useful for updating data in dimension tables and for updating the current-record flag for SCD Type 2 dimensions.
In this layer, data is loaded into the dimension or fact tables based on their SCD types. A source SQL query for each dimension table creates a temp table, joining the batch master table and the ODS tables to fetch the latest records.
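The SQL below is a hedged sketch of that SCD Type 2 maintenance pattern (the table, key, and hash column names are assumptions, not the project's actual schema): changed current rows are expired, then the new versions are inserted from the temp table.

import psycopg2

SCD2_SQL = """
BEGIN;

-- Expire current dimension rows whose attributes changed in this batch.
UPDATE dim.policy AS d
SET    current_flag = 'N',
       effective_to = GETDATE()
FROM   tmp_policy_latest AS t
WHERE  d.policy_id    = t.policy_id
  AND  d.current_flag = 'Y'
  AND  d.record_hash <> t.record_hash;

-- Insert new current versions for changed and brand-new keys.
INSERT INTO dim.policy (policy_id, status, record_hash,
                        effective_from, effective_to, current_flag)
SELECT t.policy_id, t.status, t.record_hash,
       GETDATE(), '9999-12-31', 'Y'
FROM   tmp_policy_latest AS t
LEFT JOIN dim.policy AS d
       ON d.policy_id = t.policy_id AND d.current_flag = 'Y'
WHERE  d.policy_id IS NULL;

COMMIT;
"""

conn = psycopg2.connect(host="redshift-host", port=5439, dbname="gde_wem",
                        user="etl_user", password="***")
conn.autocommit = True  # BEGIN/COMMIT are handled inside the SQL block
with conn.cursor() as cur:
    cur.execute(SCD2_SQL)
conn.close()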
Several materialized views are created on top of the Dim/Fact tables in the Redshift database. The main purpose of these materialized views is to serve as the primary sources for the final reports generated through the IICS mappings.
For different reports there are different materialized views with different logic. For example:
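Purely as a hypothetical illustration of such a view (the project's actual report logic is not reproduced here; the schema, table, and column names are assumptions), a month-end policy summary materialized view could look like this:

import psycopg2

MV_SQL = """
CREATE MATERIALIZED VIEW rpt.mv_monthend_policy AS
SELECT d.country,
       d.product_line,
       DATE_TRUNC('month', f.transaction_date) AS report_month,
       COUNT(DISTINCT f.policy_key)            AS policy_count,
       SUM(f.written_premium)                  AS written_premium
FROM   fct.policy_transaction f
JOIN   dim.policy d
       ON d.policy_key = f.policy_key AND d.current_flag = 'Y'
GROUP BY 1, 2, 3;
"""

conn = psycopg2.connect(host="redshift-host", port=5439, dbname="gde_wem",
                        user="etl_user", password="***")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(MV_SQL)
    # Refresh before the IICS report mappings read from the view.
    cur.execute("REFRESH MATERIALIZED VIEW rpt.mv_monthend_policy;")
conn.close()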
We have created the IICS mappings mainly for two purposes:
a) Load the data from S3 to Redshift
b) Generate the different reports by ingesting data from different Redshift tables and materialized views
For loading data from S3 to Redshift tables, we have created a simple mapping which has one Source transformation, one Expression Transformation and one Target.
For generating the different financial reports, more complex IICS mappings, along with mapping tasks and taskflows, are created. Different transformations are used in the IICS mappings to implement the business logic as per the customer's needs. The figure below shows a sample IICS mapping for generating financial reports.
While the current ETL pipeline implementation has achieved significant milestones, several areas offer potential for further enhancement and expansion.
Incorporating advanced analytics and machine learning algorithms can enhance the predictive capabilities of the financial reports. For instance, machine learning models could be used to forecast financial trends, detect anomalies, and provide deeper insights into data patterns.
Future developments should focus on strengthening data governance frameworks and security measures. Implementing role-based access controls, data encryption, and compliance with data protection regulations such as GDPR will be essential as the system scales.
Developing a more user-friendly interface and integrating advanced reporting tools can enhance the accessibility and usability of the financial reports. Tools like Tableau or Power BI could be integrated to allow for dynamic and interactive reporting capabilities.
Implementing automated documentation tools can ensure that the ETL processes are well-documented and easily maintainable. Additionally, enhancing monitoring and alerting systems will help in proactive issue detection and resolution.
While the current project is tailored for the Western European Market, similar ETL pipelines can be developed for other geographical regions or business domains. This would involve customizing the pipeline to handle different data sources, regulations, and reporting requirements.
By addressing these areas, the ETL pipeline can evolve into a more sophisticated, efficient, and versatile solution, providing even greater value to the organization and its stakeholders.
I would like to express my gratitude for the opportunities and support that have enabled me to explore various aspects of Data Science and Artificial Intelligence, applying them to an industry-specific use case. My sincere thanks to my co-worker, Mr. Amartya Roy, whose domain-specific insights and technology-related
recommendations greatly contributed to shaping this work. I also extend my appreciation to all professional and personal acquaintances who have supported me throughout this journey. Lastly, I am deeply grateful to my parents and family members for their unwavering support and encouragement.
The dissertation titled "Automated ETL Pipeline to Generate Financial Report at Month-End & Year-End" aims to enhance the efficiency and accuracy of financial reporting for the Western European Market (WEM) by automating the Extract, Transform, Load (ETL) process. The project successfully implemented an ETL pipeline that integrates multiple source systems and databases, including PostgreSQL, Microsoft SQL, and Amazon DynamoDB, into a cohesive data warehouse using AWS Pipeline and IICS mappings. This architecture supports the generation of detailed financial reports related to policies, claims, and sales, significantly reducing manual effort and operational costs.
The ETL pipeline's design ensures data integrity and consistency through structured transformation processes, including the use of various data transformation techniques and the maintenance of different data layers such as staging, operational data store (ODS), and batch master tables. The project also developed Python scripts and IICS mappings to facilitate the automatic generation of monthly and yearly financial reports, which are then placed in a shared location for easy client access.
The project's success is evident in its ability to consolidate data from diverse sources, streamline the reporting process, and provide accurate and timely financial insights. This automated approach not only improves productivity and service quality but also enables the organization to scale its resources more efficiently to meet business demands. Overall, the dissertation demonstrates a robust and scalable solution for automated financial reporting, setting a precedent for similar initiatives in the future.