Trust me, I get excited about a new data set just like anybody. I just want to tear straight to the good stuff and find those hidden correlations and to use all my fancy tests and methods. But before you get to that point, it’s important to run important preparation on your data – each a critical “Prep Step” that can save you time, rework, and wrong conclusions down the line.
1. Understand missing-ness
Data is only as good as the data that you have. So, missing data is not only un-helpful, it can be a thorn in your side as you start to build multi-variate models. In most cases, statistical tests, methods, and machine learning algorithms require complete cases. A complete case is a row of data for which all values are valid – that means there are no nulls, N/As, or NANs. Even if you have low rates of missing data in any one column, you may have very few complete cases as you build a larger model. So, it’s important to clean up missing-ness before you head down the route of more complex models. Understanding missing-ness may also give you important insights into how you will have to approach your data – for instance you may need to make several smaller models that have more complete samples rather than one big model that brings everything together. Further, it may be beneficial to take a more exploratory statistics approach rather than large scale modeling until you can acquire more complete cases.
- Missing – count the number of missing values in each column
- % Missing – number of missing values / total rowcount
Here you can choose to clean up some of the missing values using imputation. Imputation is a technique for filling in missing values such that it does not bias the data. There’s lots of different ways to impute your data, but on a conceptual level, you should think of imputation as spackle not sheetrock. If there are gaping holes in your data, imputation is not going to be a help – it will just make that column useless. There’s no exact % Missing that is “good” or “bad”, but we usually use 5-10% as candidates for effective imputation.
Quick fixes could include replacing missing values with the column average, minimum or zero. Often, you will need to understand how the data is acquired to determine appropriate ways to replace missing values.
2. Quantify range and distribution
Data comes in all shapes and sizes. Each column will have different univariate statistics that determine the viable tests and methods down the line. In this step, you should calculate the basic statistics for each numeric column, including:
- Min, Max, Range
- Mean, Standard Deviation, Variance – these provide the different “moments” of the distribution. SD and Variance can help explain the spread of the distribution.
- Skew, Kurtosis – these tell the shape of the distribution, should be 0 and 3 for normality
- P10, 50, 90 – percentiles give you more detailed information on the value spread of the distribution
- P90/P10 Ratio – provides the relative “length” of the meat of the distribution. If this is too high, then your data is very spread out
These are good diagnostic numbers to get a “quick look” at your distributions. As you gain familiarity with these statistics, you’ll be able to understand your input data much faster. It takes experience, though, so be patient!
If you have red flags in your univariate statistics, you can triage them with transformations or cleaning outliers.
3. Check normality
Normality is an important assumption in a variety of statistical tests and helps machine learning algorithms perform better. So, it’s good to determine if your data follows normal distributions or if it is skewed, full of zeros, or multi-modal.
A good way to test for normality is the Shapiro-Wilk test. This test assumes normality, so you don’t want to reject it.
You can run it in R with:
4. Get rid of bad outliers
Not all outliers are bad, but some can wreak havoc on your data analysis. Just like handling missing data, you need to take action for your outliers. Sometimes you can fix the outliers knowing how the data was acquired – for instance if valid values should be between 0 to 100 – or sometimes you need to just make them null.
If you have odd univariate statistics, a good place to check is for outliers. They tend to distort the genuine – or informative – part of the distribution. Some univariate statistics are more robust to it than others, like percentiles, versus means and variances. So, if you have reasonable percentiles but wonky means and variances, there’s a good chance outliers are causing the issues.
5. Transform variables
If your univariate statistics are still off, even after cleaning up the outliers, then you may need to look at transformations. Some common transformations to try:
- Log, Sqrt, Exponential – shift the data into log scale, sqrt, or exponential space
- x/y – take one variable and divide it by the other, this is useful with x and y are highly correlated
- f(x,y) – do something creative with your variables that bring out relationships, often this is nice if you have underlying physics or engineering at work
Following these steps provides not only better analytics down the line, but introduces a nice discipline to an otherwise creative pursuit. Finding patterns in large data sets or making a conclusion from observations or predictions from a machine learning algorithm is often as much artistry as engineering. It’s good whenever possible to introduce rigor and steps to make what you do reproducible – including how you analyze things.
Happy Data Hunting!