As we know, ETL stands for Extract, Transform and Load. ETL generally comes into the picture with respect to data, i.e., OLAP data. Many paid and freeware tools are available in the market, such as Informatica, DataStage, Oracle Data Integrator and Pentaho. In simple terms, what we are testing here is that data from the source is modified based on some conditions and then inserted into target tables.
Source: The source may be in any format: flat files, tables, XML files, etc. This source data is generated by other jobs or applications; we don't bother about how or from where the data is coming. Our concern is that the source data gets inserted into one or more tables, as one or more records. In Informatica the jobs are called workflows; a workflow might be an hourly workflow that runs every hour, or a daily workflow that runs once a day, based on the design.
[Figure: ETL Data Flow]

The things we need to check as a part of ETL testing are:
- Verify the source data parameters
- In the Informatica repository we define that the source file must use a particular delimiter and carry a fixed number of fields; the workflow validates the incoming file against those parameters. If they don't match, an alert mail should go to the concerned team flagging bad data (see the first sketch after this list).
- Verify the source data is valid
- Some fields should be dates, some strings, some numbers; the workflow initially checks these, and if any datatype does not match, an alert is thrown (see the second sketch after this list).
- Flat file to temp table insertion
- We have to check that the data is loaded into the temp table from the flat file.
- Temp to staging table insertion
- The data is transformed using lookup tables, or some filtering or grouping conditions, and then inserted into the staging tables.
- Staging to final tables
- The data is finally copied to the target tables. A simple way to verify all three of these load steps is to reconcile row counts stage by stage (see the third sketch after this list).
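To make the first check concrete, here is a minimal Python sketch of a delimiter/field-count validation. The delimiter, field count, and file name are assumptions for illustration; in a real Informatica setup the repository definition drives this check and the alert mail to the concerned team.

```python
import csv

EXPECTED_FIELD_COUNT = 12   # hypothetical: taken from the repository definition
DELIMITER = "|"             # hypothetical: the delimiter agreed for this feed

def validate_source_file(path):
    """Return (line_number, reason) tuples for rows that break the file contract."""
    bad_rows = []
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=DELIMITER)
        for line_no, row in enumerate(reader, start=1):
            if len(row) != EXPECTED_FIELD_COUNT:
                bad_rows.append(
                    (line_no, f"expected {EXPECTED_FIELD_COUNT} fields, got {len(row)}")
                )
    return bad_rows

bad = validate_source_file("customer_feed.dat")  # hypothetical file name
if bad:
    # In a real setup this is where the bad-data alert mail goes out.
    print(f"BAD DATA: {len(bad)} rows failed the field-count check", bad[:5])
```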
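The datatype check can be sketched the same way. The column names and formats below are hypothetical; the idea is simply that each field has a rule, and any record breaking a rule raises an alert.

```python
from datetime import datetime

def _is_date(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

# Hypothetical column contract: column name -> validator for its raw string value.
COLUMN_RULES = {
    "customer_id": lambda v: v.isdigit(),
    "signup_date": lambda v: _is_date(v, "%Y-%m-%d"),
    "amount":      lambda v: _is_number(v),
}

def validate_types(record):
    """record is a dict of column -> raw string value; returns the failing columns."""
    return [col for col, rule in COLUMN_RULES.items() if not rule(record.get(col, ""))]

failures = validate_types({"customer_id": "42", "signup_date": "2015-13-01", "amount": "9.99"})
print(failures)  # ['signup_date'] -- month 13 is invalid, so an alert would be thrown
```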
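For the three load steps (flat file to temp, temp to staging, staging to final), a simple reconciliation is to compare row counts stage by stage for one batch. In the sketch below, sqlite3 stands in for the real warehouse connection, and the table names and the batch_id column are assumptions.

```python
import sqlite3

# Hypothetical table names for the three stages of the load.
STAGES = ["temp_customer", "stg_customer", "fact_customer"]

def row_counts(conn, batch_id):
    """Count the rows each stage holds for one batch."""
    counts = {}
    for table in STAGES:
        cur = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE batch_id = ?", (batch_id,))
        counts[table] = cur.fetchone()[0]
    return counts

if __name__ == "__main__":
    # sqlite3 is only a stand-in; point this at the real test warehouse.
    with sqlite3.connect("warehouse.db") as conn:
        print(row_counts(conn, batch_id="20150601"))
        # Note: temp -> staging counts may legitimately differ when the
        # transformation filters or rejects rows, so compare against the
        # expected rejection count rather than demanding equality.
```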
General test cases you need to verify are as below:
- The workflow should not run unless the previous hour's or day's job has run.
- If we rerun the job, there should not be any duplicate records in the target tables; that means the job should pull the files again from the source without double-loading them.
- If the job fails in between, there should not be any records left in the target tables for that run; the load should be all-or-nothing (see the sketch after this list).
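These three cases lend themselves to automated checks. Below is a minimal pytest sketch, assuming a job_audit table that records each run's status and a batch_id column stamped on every target row; all table, column, and batch names are hypothetical, and sqlite3 again stands in for the warehouse connection.

```python
import sqlite3
import pytest  # assumption: the checks are automated with pytest

BATCH_ID = "20150601"       # hypothetical batch id stamped on every loaded row
PREV_BATCH_ID = "20150531"  # the batch the previous daily run should have loaded

@pytest.fixture
def conn():
    # Stand-in connection; point this at the real test warehouse.
    with sqlite3.connect("warehouse.db") as c:
        yield c

def test_previous_run_completed(conn):
    # Case 1: this run must not happen unless the previous run finished.
    row = conn.execute(
        "SELECT status FROM job_audit WHERE batch_id = ?", (PREV_BATCH_ID,)
    ).fetchone()
    assert row is not None and row[0] == "SUCCESS"

def test_rerun_creates_no_duplicates(conn):
    # Case 2: after a rerun, the business key must still be unique in the target.
    dupes = conn.execute(
        """SELECT customer_id, COUNT(*)
             FROM fact_customer
            WHERE batch_id = ?
            GROUP BY customer_id
           HAVING COUNT(*) > 1""",
        (BATCH_ID,),
    ).fetchall()
    assert dupes == []

def test_failed_run_leaves_no_rows(conn):
    # Case 3: if the run failed midway, the target must hold no rows for that batch.
    status = conn.execute(
        "SELECT status FROM job_audit WHERE batch_id = ?", (BATCH_ID,)
    ).fetchone()
    if status and status[0] == "FAILED":
        count = conn.execute(
            "SELECT COUNT(*) FROM fact_customer WHERE batch_id = ?", (BATCH_ID,)
        ).fetchone()[0]
        assert count == 0
```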