This is part 2 of a guest post by British Bondora investor ‘ParisinGOC’.
Read part 1 first.
Data Mining the Bondora data.
The initial process.
To help understand the specific data cleansing that the Bondora data set needed, I first made use of the RapidMiner metadata view, a summary of all the attributes presented to the software. It shows each attribute's name, type, statistics (dependent on type; these include the least and most frequently occurring values, the modal value and the average value), range (min, max, and the quantity of each value for polynominal and text attributes) and, most critically, “Missings” and “Role”.
“Role” is the name given by RapidMiner to the special attributes that are needed to allow certain operations. In my case, the Decision Tree module needed to know which attribute was the “Target”, that is, the attribute that is the focus of the analysis and to which the Decision Tree has to relate the other attributes in its processing. My “Target” was the “Default” attribute, a “Binominal” attribute (RapidMiner's term for an attribute with just 2 values): 1 if the loan had defaulted, 0 if not.
“Missings” is easy: this is the number of times this attribute has no valid value. For example, my import of the raw Bondora input data has 150 attributes. Only half of these attributes have no missing values. The remainder have between 13 and 19132 rows with missing values from a data set of 20767 rows.
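As a rough illustration of what the “Missings” figure represents, the same count can be reproduced outside RapidMiner. The sketch below is plain Python, not the RapidMiner process described in this post. The sample rows, the column names and the list of tokens treated as “missing” are all made up for the example and would need adjusting to the real export.

```python
import csv
import io

# Tiny made-up sample standing in for the loan export.
# Column names here are hypothetical, not the real Bondora attribute names.
SAMPLE_CSV = """LoanId,Country,Income,Default
1,EE,1200,0
2,FI,,1
3,ES,900,
4,EE,,0
"""


def count_missings(rows, missing_tokens=("", "NA", "NULL")):
    """Return {attribute: number of rows where the value is missing}.

    Which tokens count as "missing" is an assumption of this sketch.
    """
    counts = {}
    for row in rows:
        for name, value in row.items():
            is_missing = value is None or value.strip() in missing_tokens
            counts[name] = counts.get(name, 0) + (1 if is_missing else 0)
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missings = count_missings(rows)

# Attributes with no missing values at all.
complete = [name for name, n in missings.items() if n == 0]

for name, n in missings.items():
    print(f"{name}: {n} missing of {len(rows)} rows")
```

Run against the full data set, a per-attribute count like this is what shows which attributes are complete and which have gaps.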
To know whether these “missings” would impact my analysis, I needed to get to know the data in more detail.
I knew that Bondora had started to offer loans in Finland in summer 2013 with Spain following in October of that year and Slovakia in the first half of 2014.
I therefore decided not to bother with any loan issued prior to 2013.
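A date cut-off of this kind is simple to express in code. The following is a minimal Python sketch of the idea only: the records are invented, “LoanDate” is an assumed attribute name, and ISO-formatted dates are assumed. It is not the RapidMiner filter itself.

```python
from datetime import date

# Hypothetical records; "LoanDate" is an assumed attribute name
# and the ISO date format is an assumption of this sketch.
loans = [
    {"LoanId": 1, "LoanDate": "2012-11-30"},
    {"LoanId": 2, "LoanDate": "2013-01-01"},
    {"LoanId": 3, "LoanDate": "2014-06-15"},
]

# Keep only loans issued in 2013 or later.
CUTOFF = date(2013, 1, 1)


def issued_on_or_after(records, cutoff, date_field="LoanDate"):
    """Return the records whose issue date is on or after the cutoff."""
    return [r for r in records if date.fromisoformat(r[date_field]) >= cutoff]


recent = issued_on_or_after(loans, CUTOFF)
print([r["LoanId"] for r in recent])
```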