Trap the data for review
The first step is to trap, or simply identify, the data that is suspect and in need of further human review before being processed completely. For this step, you need to create a new identifier and detect all the invalid data that could potentially be processed incorrectly if it were sent through the program “as is”.
How to trap data in SAS: suspect data can be identified using IF-THEN/ELSE logic (the IFN function offers a more compact alternative). For example, suppose X is a numeric variable that must fall within the interval [0,100].
/*Call data and trap invalid values to ensure data accuracy*/
data data_adminclaims2;
    set data_adminclaims1;
    /*Unique ID created for manual case review*/
    id_review = _n_;
    /*Create destination pathway for data*/
    /*Isolate invalid values of variable X with variable Y*/
    if x < 0 then y = 0;
    else if x > 100 then y = 1;
    else y = x;
run;
The program here isolates invalid numeric values with a handful of simple lines of code that can be extended to other variables (including character variables). This is powerful for administrative data, which often has many variables requiring review, whether numeric or character.
Prepare Files For Review and Create Output Data
The second step is to sequester the “trapped” data from the previous step so that it can be packaged together and exported as a review dataset for a human. This is the selective movement of data from one system to another, and it creates a load pattern that sets you up to properly identify the data created or modified since the previous processing run.
/*Prepare dataset of pending administrative data for manual external review*/
/*Set review rules for each file created*/
data data_y;
    set data_adminclaims2;
    /*Keep records trapped as out-of-range low (y = 0)*/
    if y = 0;
run;

data data_z;
    set data_adminclaims2;
    /*Keep records trapped as out-of-range high (y = 1)*/
    if y = 1;
run;

data exported_review;
    set data_y data_z;
run;

proc sort data=exported_review nodupkey;
    by id_review;
run;
Export Current Dataset for Review
The third step is to create a review file, compare it against a previously reviewed file, and export the final quality control dataset so that it can be reviewed manually. These files have been exported because there is some specific problem with their data quality or content. Because the process recurs, without a function such as an incremental load the number of review files can accumulate, become messy, and cause data quality control concerns as well as place time constraints on staff. After exportation, the person reviewing the data must be knowledgeable about how the data should be correctly entered and be able to apply a decision tree to accurately triage where the data goes after review (i.e., keep the data or remove it).
/*Export review dataset for internal manual review by employees*/
proc export data=cloud_directory.exported_review
    outfile="&parent_dir.\Data\Destination\Export\ManualReview_&todaysDate..xlsx"
    dbms=xlsx replace;
run;
This step relies on an identifier that is unique and specific to the person or event. The variable can be numeric or character, and components can be concatenated to ensure a proper number of digits and guarantee a unique review file.
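As an illustration, such a composite identifier can be built by concatenating zero-padded components. This is a sketch only: member_id and claim_date are hypothetical variables standing in for whatever person- and event-level fields the data actually contains.

```sas
/*Sketch only: member_id and claim_date are hypothetical source variables*/
data data_adminclaims2a;
    set data_adminclaims2;
    /*Zero-pad each component so every identifier has the same number of
      digits, then join them into a single character review ID*/
    id_review2 = catx("_",
                      put(member_id, z8.),        /*person component*/
                      put(claim_date, yymmddn8.), /*event date component*/
                      put(_n_, z6.));             /*row sequence component*/
run;
```

The zero-padding (z8., z6.) guarantees a fixed width, so identifiers sort predictably and never collide because of dropped leading digits.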
/*Import previously reviewed data validated by employees for comparison to the current temporary dataset in need of review*/
proc import out=already_reviewed
    datafile="\CloudDirectory\Data\already_reviewed.xlsx"
    dbms=xlsx replace;
run;

/*Create an incremental file that matches previously reviewed files with the current data run stored in the work folder to ensure only new and unique files are exported*/
proc sql;
    create table exported_review3 as
    select l.*, r.id_review2 as id_new
    from exported_review2 l
    left join already_reviewed r
        on l.id_review2 = r.id_review2;
quit;

data exported_review4;
    set exported_review3;
    /*Keep only records with no match in the previously reviewed data*/
    if missing(id_new) then output;
run;

/*These files are unique and get exported for complete manual review in the new data cycle*/
proc export data=exported_review4
    outfile="\CloudDirectory\Data\Manual_Review\ExportedReview4_&todaysdate..xlsx"
    dbms=xlsx replace;
run;
The step above is where files from the current program run are compared to previous program runs. This ensures that the exported file contains as few files for manual review as possible. Completing this step protects the integrity of data already processed and reduces overall quality control time.
Re-integration of Separated Dataset into Program for Full Data Processing
The goal of the entire process is to re-integrate the data requiring some form of official review back into the program. Once it is included back in the program, a combined dataset needs to be created that merges the reviewed data with the data currently being processed (and not requiring any review) so that the program can finish whatever tasks are being completed on the project.
/*This dataset is imported for integration into the current data cycle and is included in data processing*/
proc import out=Reviewed_Files
    datafile="\CloudDirectory\Data\ReviewedFiles\ReviewedFiles_&todaysdate..xlsx"
    dbms=xlsx replace;
run;

data Reviewed_Files2;
    set Reviewed_Files;
    /*Keep only records approved during manual review*/
    if DataApproved = "Yes";
run;

/*This dataset fully reintegrates the reviewed data with the currently processing data to ensure the program continues with only quality controlled data*/
data Quality_Controlled;
    set Exported_Review2 Reviewed_Files2;
run;
What makes the process of incremental loading valuable
First and foremost, the process saves a lot of time from both a CPU and a human vantage point. Touching a fraction of the data in a large data warehouse environment requires less computational power at an individual workstation, and exporting a small number of review files makes any process requiring human review more efficient from a time, accuracy, and security standpoint. The second benefit of leveraging this technique is that any time a program is run there is potential for the process to fail or encounter problems. An incremental load adds and modifies less data, so in the event of a program error fewer data files will need correction or verification (i.e., data validation and change verification are expedited). Finally, incremental loads are typically used when a program runs on a periodic schedule with large amounts of data. Implementing this process can minimize data bottlenecks and allow for steady run time and performance over the course of periodic program runs.
Incremental loading is also versatile and can be used for three basic functions: to insert data, to insert and update data, or to insert, update, and delete data. The process above describes the third pattern, where new data is added, old data is updated, and data that is no longer needed is deleted.
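The three patterns can be sketched in SAS as follows. This is an illustration only: master and staging are hypothetical tables keyed on id_review, and delete_flag is an assumed marker for obsolete records.

```sas
/*Sketch only: master, staging, and delete_flag are hypothetical*/

/*1. Insert only: append rows whose key is not yet in the master table*/
proc sql;
    insert into master
    select * from staging
    where id_review not in (select id_review from master);
quit;

/*2. Insert and update: MERGE adds new rows and overwrites matching ones
  (both datasets must be sorted by id_review first)*/
data master;
    merge master staging;
    by id_review;
run;

/*3. Insert, update, and delete: additionally remove rows flagged obsolete*/
proc sql;
    delete from master
    where id_review in (select id_review from staging where delete_flag = 1);
quit;
```

Each pattern builds on the previous one, which is why the insert-update-delete variant described above carries the most logic and the most risk.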
What makes incremental loading challenging
Unless you come from a programming background or are well versed in data science, the incremental loading process tends to be quite complex to learn at first. While a full load simply dumps all the previous data into the program, an incremental load requires the added logic of identifying the new and specific data that requires separate tasking. With large datasets stored in a data warehouse environment, the logic to sort through the data and process it appropriately can become complex quickly. This requires both knowledge of the process and the time to implement the appropriate code within the program. Quality controlling the process also requires a good team of programmers who can verify each other’s work within the data environment. One final consideration is that there may not be a single “good” way of identifying the data that is new, changed, or requires action; this adds the step of creating an appropriate identifier.
With these considerations in mind, it is most appropriate to implement an incremental load when the data is large, querying the data is slow because of its size or technical aspects, there is a specific means to implement and detect change, and data is either deleted or retained in an external destination.