
Data Preparation and Data Processing


Data Preparation

  1. Pull the last 10 years of data across the database, at the client contract level (not client ID level)
  2. The data is written out in CSV format, approx. 400,000 rows with 50 variables
  3. The target feature is "churn_flg"
  4. The data contains categorical, numeric and date types
  5. The data contains NAs
  6. Assume the data has no writing errors (apart from NAs, all other values are valid)
  7. Assume all date fields are plain dates only

Data feature analytics

Using RStudio to pre-analyse the data features.
Look at a summary of the data (use df <- df[,-c(a:b,c)] to exclude the features you do not want as input), then run summary(fulldata1).
If a column is categorical, you may want to use df$a <- as.factor(df$a) to convert it to a factor, or just call read.table again.

Before doing any feature engineering, keeping the features as character is a better choice.

We then get a summary of 77 features in total:

  1. contract_life_year and contract_life_month have negative values, which are not valid and need to be cleaned
  2. The client attitude answers have a lot of NAs, so we decided to one-hot encode the feature
  3. The call center data has a lot of NAs, but we will treat every NA as 0 (if there is no call from the call center, the record is marked as NA)
  4. To get the most accurate features and optimize the learning, for each contract we will use AVG / MIN / MAX / TOTAL to aggregate the calling data (for one contract from one client, staff may make many calls, so we want to capture as much calling information as possible in the input)
  5. For a lot of the categorical data, if there is an NA we will mark it as a new factor level called "unknown" (a sketch of points 3 to 5 follows this list)
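
Below is a minimal dplyr sketch of points 3 to 5; the column names contract_id, call_duration and client_attitude are placeholders, not the real schema:

```r
library(dplyr)

## 3. treat call-center NAs as "no call", i.e. 0
fulldata1 <- fulldata1 %>%
  mutate(call_duration = ifelse(is.na(call_duration), 0, call_duration))

## 4. aggregate the per-call records down to one row per contract
call_features <- fulldata1 %>%
  group_by(contract_id) %>%
  summarise(avg_call   = mean(call_duration),
            min_call   = min(call_duration),
            max_call   = max(call_duration),
            total_call = sum(call_duration))

## 5. replace NAs in a categorical column with a new "unknown" level
fulldata1$client_attitude <- as.character(fulldata1$client_attitude)
fulldata1$client_attitude[is.na(fulldata1$client_attitude)] <- "unknown"
fulldata1$client_attitude <- as.factor(fulldata1$client_attitude)
```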

----------- A few ideas at this stage -----------
{A lot of the data feature work was done in SQL Server. We are thinking of using Hadoop MapReduce jobs to push the heavy lifting if the data volume grows large enough; in the next episode (a new project) we will discuss how to process big data with Hadoop MapReduce.

At the same time, with 400,000 observations we may not be able to train on the sample data quickly in the future (even if we only use a 20% portion of the total), so we need to think about distributed computation for the cross-validation during training.

With 400,000 rows of data it is not realistic to use the leave-one-out CV method; we will discuss this later.

We will also need to re-balance the data so that churned and not-churned contracts make up equal portions once the training set split is done.}

Data cleaning and sampling

  1. Remove the rows where contract_life_days/months/years have negative values
  2. Remove all the rows with categorical NAs if you do not want to add an extra level (assuming this is only a small part; in my case it is about 300 out of 400,000)
  3. Use prop.table(table(fulldata2$CHURN_FLAG_A)) to look up the churned and not-churned contract proportions
  4. Remove the columns with IDs or too many levels (usually more than 50?) -- a short sketch of steps 1, 3 and 4 appears below
  5. Do feature selection on the numeric features using cor() together with library(corrplot); for pairs of features whose correlation with each other is very high, we can remove one of them
  6. Use a Learning Vector Quantization (LVQ) model to determine feature importance --- this requires training on the data

Initially, I did not want to do steps 5 and 6, as I did not want to remove any features before my first training run.
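
A minimal sketch of steps 1, 3 and 4 (fulldata2 follows the naming in the list above; the 50-level threshold is the rule of thumb from step 4):

```r
## 1. drop rows with negative contract life values
fulldata2 <- fulldata2[fulldata2$contract_life_years >= 0, ]

## 3. check the churned vs not-churned proportions
prop.table(table(fulldata2$CHURN_FLAG_A))

## 4. drop ID-like columns with too many distinct levels (> 50 as a rule of thumb)
too_many_levels <- sapply(fulldata2, function(x)
  (is.factor(x) || is.character(x)) && length(unique(x)) > 50)
fulldata2 <- fulldata2[, !too_many_levels]
```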

We will use 25% of the data for training and testing (validation), then use the remaining 75% of the data for prediction.
library(caret)
library(ROSE)
library(dplyr)
Use the caret package to create the training and validation portion: trainingportion <- createDataPartition(y = data$target, p = 0.25, list = FALSE), then split that portion further into a training set and a validation set, half and half (see the sketch below).
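
A minimal sketch of that split, assuming data is the cleaned data frame and target is the churn flag column:

```r
library(caret)

set.seed(123)   # for reproducibility (an assumption, not from the original notes)

## carve out the 25% portion used for training + validation, stratified on the target
idx25        <- createDataPartition(y = data$target, p = 0.25, list = FALSE)
modelportion <- data[idx25, ]
predictset   <- data[-idx25, ]    # the remaining 75% kept for prediction

## split the 25% portion half and half into training and validation sets
idx50         <- createDataPartition(y = modelportion$target, p = 0.5, list = FALSE)
trainingset   <- modelportion[idx50, ]
validationset <- modelportion[-idx50, ]
```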

Use table(data$target) to check the split between the churned and not-churned portions. We can see a slightly unbalanced ratio between the two (Y: 22,500, N: 18,000). This matters for random forest: it trains each tree on a bootstrap re-sample (sampling with replacement), so the minority class gets less chance to influence the trees and the out-of-bag (OOB) error on the minority class suffers. Under-sampling would reduce the number of observations, so we decided to use an over-sampling approach (SMOTE-style).

Use ROSE to re-balance the churned and not-churned observations. The ROSE package provides a good alternative to SMOTE, supporting over/under/both sampling across the data: ovun.sample(target ~ ., data = data, method = "over", N = 45000) (see the sketch below).
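
A minimal sketch of the re-balancing step; ovun.sample returns a list, and the balanced data frame lives in its $data element (the object name and the seed are placeholders):

```r
library(ROSE)

## over-sample the minority class up to a total of 45,000 rows
data_balanced <- ovun.sample(target ~ ., data = data,
                             method = "over", N = 45000, seed = 1)$data

## check the new class balance
table(data_balanced$target)
```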

Use the dplyr package to easily select / filter the data during cleaning. For example, there are some negative values in the contract_life_years column that we can regard as database query errors (150 out of 400,000), so we can use data <- data %>% filter(contract_life_years >= 0).

After cleaning the data, we can run summary(trainingset) and summary(validationset) again to make sure every categorical feature is a factor and every numeric feature is numeric (yes, some may still be character). We can also store our target column name in a vector, outcome <- c("target"), and all the other predictors in predictors <- names(data)[!names(data) %in% outcome], for later use in the project (a short sketch follows).
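
A small sketch of those post-cleaning checks; the "target" name and the choice to turn leftover character columns into factors are assumptions:

```r
summary(trainingset)
summary(validationset)

## make sure remaining character columns are treated as factors
char_cols <- sapply(trainingset, is.character)
trainingset[char_cols]   <- lapply(trainingset[char_cols], as.factor)
validationset[char_cols] <- lapply(validationset[char_cols], as.factor)

## keep the target and predictor names handy for later use
outcome    <- c("target")
predictors <- names(trainingset)[!names(trainingset) %in% outcome]
```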

Feature selection and pre-analysis

With 77 features and 20,000 rows the data is not large, but it can still be a big burden for an 8 GB RAM, 2-core computer to build a large tree set: 1,000 trees with 5-fold CV take roughly 45 minutes to over an hour to build.
Especially if cross-validation uses the repeated-CV method over multiple folds, the total number of trees to build is k folds * n repeats * ntree (see the example below). So we may need to further reduce the features by looking at their internal correlation, or by other methods such as rfcv (random forest with cross-validation, which reports how the error changes as the number of variables is reduced, and so helps narrow down a sensible mtry once the error stabilizes).
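
As a rough illustration of that arithmetic (the fold, repeat and tree counts below are examples only, not the settings used later):

```r
library(caret)

## 5 folds x 3 repeats x 1000 trees per fit = 15,000 trees built during resampling
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
5 * 3 * 1000
```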

library(corrplot); M <- cor(data[,-c(1,3:5)]) --- calculate the correlation for the numeric features, so the categorical ones have to be removed first.
If we still want to look at associations across the categorical features, we have to use a Chi-square test and its p-value.
If it is a numeric feature against a categorical feature (with several levels), we can use ANOVA (a sketch of all three checks follows).
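
A minimal sketch of the three kinds of checks; M is recomputed over all numeric columns rather than the hard-coded index above, and cat_feature_1, cat_feature_2 and numeric_feature are placeholder column names:

```r
library(corrplot)

## numeric vs numeric: correlation matrix, visualised with corrplot
num_cols <- sapply(data, is.numeric)
M <- cor(data[, num_cols], use = "pairwise.complete.obs")
corrplot(M, method = "circle")

## categorical vs categorical: chi-square test on the contingency table
chisq.test(table(data$cat_feature_1, data$cat_feature_2))

## numeric vs categorical: one-way ANOVA
summary(aov(numeric_feature ~ cat_feature_1, data = data))
```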

For highly correlated feature pairs we will generally keep only one of the two, though this still depends on the mtry value.
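
One common way to act on this is caret's findCorrelation, sketched below on the correlation matrix M from above (the 0.9 cutoff is an assumption, not a value from the project):

```r
library(caret)

numeric_data <- data[, num_cols]
drop_idx     <- findCorrelation(M, cutoff = 0.9)   # suggests one column from each highly correlated pair
if (length(drop_idx) > 0) numeric_data <- numeric_data[, -drop_idx]
```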

Another approach is to dump all the features into a model on a small portion of the data, train 100 trees without any CV, and then check varImp on the features (see the sketch below).
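
A sketch of that quick check; the 10% sample fraction and the seed are assumptions, while ntree = 100 follows the text:

```r
library(randomForest)
library(caret)   # for varImp()

set.seed(42)
small_idx <- sample(nrow(trainingset), size = floor(0.1 * nrow(trainingset)))

## 100 trees, no CV, just to get a first look at variable importance
quick_fit <- randomForest(target ~ ., data = trainingset[small_idx, ],
                          ntree = 100, importance = TRUE)
varImp(quick_fit)   # caret's wrapper around importance(quick_fit)
```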

We use rfcv to find that the best mtry is between 16 and 30, then we choose 24 as the mtry; alternatively we could run a real CV to decide the mtry (a sketch follows).
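
A minimal rfcv sketch; cv.fold = 5 is an assumption, and predictors/outcome are the name vectors defined earlier:

```r
library(randomForest)

## cross-validated error as the number of variables is reduced
cv_result <- rfcv(trainx = trainingset[, predictors],
                  trainy = trainingset[[outcome]],
                  cv.fold = 5)

## plot CV error against the number of variables used
with(cv_result, plot(n.var, error.cv, type = "b", log = "x",
                     xlab = "number of variables", ylab = "CV error"))
```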

We will drop some of the highly correlated features, such as some of the one-hot encoded features.

That leaves us with 43 features in total.

On the next page we will discuss the training and the model building.

We will also discuss parallel computation in R using doSNOW and parallel/foreach.

We will also take a deeper look at RF itself and explain why it is such a good algorithm that does not overfit easily.