I am building a system for monitoring data quality. It builds a daily history of all QVDs saved in the prod environment. It gets much of the table information from metadata tables: field counts, unique values, row counts.
But I would like some more direct measures as well. For numeric columns: median, min, max, average, standard deviation, outliers. For categorical columns: distribution.
The direct measures are fairly CPU intensive, since the data is fairly large. I am thinking of doing the calculations when the table is created, so that I save resources by not processing the data twice. What I am doing now is looping over the columns and calculating all measures on all columns. It takes its toll on the systems. Are there any smart hacks, or any technical details and maneuvers, I can use to make it easier?
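
To make the current approach concrete, here is a rough sketch of the per-column loop. Python with only the standard library is an assumption for illustration, not the actual stack. The function names, the dict-of-lists table layout, and the 1.5 x IQR outlier rule are also hypothetical choices, since the post does not say how outliers are defined.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev


def profile_numeric(values):
    """Compute the numeric measures for one column in a single call."""
    data = sorted(v for v in values if v is not None)
    if len(data) < 2:
        # Too few values for stdev or quartiles.
        return {"count": len(data)}
    q1, _, q3 = quantiles(data, n=4)
    spread = q3 - q1
    # Assumed outlier rule: values outside 1.5 x IQR from the quartiles.
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return {
        "count": len(data),
        "min": data[0],
        "max": data[-1],
        "median": median(data),
        "average": mean(data),
        "stdev": stdev(data),
        "outliers": sum(1 for v in data if v < low or v > high),
    }


def profile_categorical(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def profile_table(columns):
    """columns: dict of column name -> list of values (hypothetical layout)."""
    report = {}
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            report[name] = profile_numeric(non_null)
        else:
            report[name] = profile_categorical(values)
    return report
```

Calling `profile_table` once, at the point where the table is created and already in memory, would give all measures per column without reading the data a second time.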