Data Selection

Our goal is to predict the probability that an incoming customer will turn out to be a good account or a bad account. To do this, we need historical data on accounts whose outcome is already known and classified as good or bad.

Sources of data

For many years the financial industry has relied on a variety of data sources; with the “big data” boom of recent years, many more are now available.


Customer-declared and verified information usually helps in understanding the demographics of the customer. An important note here: sensitive, potentially discriminatory attributes such as race and belief (and, in some countries, gender) must be excluded up front.
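This preliminary exclusion can be sketched as a simple filter applied before any modeling. The field names and the set of sensitive attributes below are hypothetical; the actual list depends on the regulations of the country you operate in.

```python
# Hypothetical set of attributes to exclude before modeling; the real list
# depends on local regulation (gender, for example, only in some countries).
SENSITIVE_FIELDS = {"race", "belief", "gender"}


def drop_sensitive(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of an application record without the sensitive fields."""
    return {key: value for key, value in record.items() if key not in sensitive}


# Illustrative application record (made-up values).
applicant = {
    "monthly_income": 5200,
    "years_at_address": 3,
    "race": "not used",
    "gender": "not used",
}
cleaned = drop_sensitive(applicant)
```

Doing this at the very start, rather than at variable selection, means the excluded fields never reach the development sample at all.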

Credit bureau data is usually highly reliable, although this depends on the country. In most developing markets the bureau is still growing and is usually negative-only, meaning it holds only customers’ negative information such as missed payments and defaults. A positive bureau also holds customers’ positive information, such as utilization, credit lines, and number of trades.

In some countries there is more than one bureau company, which makes it necessary to develop an optimal bureau strategy.

If the score is being built for existing customers, the customer’s previous performance within the financial institution is highly predictive. It is usually even more predictive than bureau data, since internal customers represent a profile the institution is already targeting, while bureau data represents the market as a whole.

In recent years, with the vast amount of data now available, using social media information has also become a trend. For example, if the friends you interact with most have high credibility, that can raise your score as well.

Another recently available source is the customer’s online behavior. This could be the content of the URLs the customer visits, which can indicate the type of transactions the customer makes or the type of profile the customer belongs to.

Data selection

The data selected should be representative of the portfolio the score model is being built for, and it should be selected with the “data sampling” stage in mind. I usually recommend selecting a period of historical data that captures portfolio trends and seasonality, and excluding the last couple of snapshot periods. These excluded snapshots should then be brought back as a “validation sample” after the development stage. Data sampling is covered in more detail separately.
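As a minimal sketch of holding out the most recent snapshots, the function below splits records into a development set and a validation sample. The record layout (a `snapshot` key in sortable `YYYY-MM` form), the function name, and the default of two held-out periods are all assumptions made for illustration.

```python
def split_by_snapshot(records, holdout_periods=2):
    """Split records into development and validation samples by snapshot.

    Each record is assumed to be a dict with a 'snapshot' key in sortable
    'YYYY-MM' form. The most recent `holdout_periods` snapshots are set
    aside as the validation sample; everything earlier is development data.
    """
    periods = sorted({r["snapshot"] for r in records})
    if holdout_periods <= 0 or holdout_periods >= len(periods):
        raise ValueError("holdout_periods must leave at least one development period")
    holdout = set(periods[-holdout_periods:])
    development = [r for r in records if r["snapshot"] not in holdout]
    validation = [r for r in records if r["snapshot"] in holdout]
    return development, validation


# Illustrative records only; 'bad' is a made-up good/bad flag.
records = [
    {"snapshot": "2023-01", "account_id": 1, "bad": 0},
    {"snapshot": "2023-02", "account_id": 2, "bad": 1},
    {"snapshot": "2023-03", "account_id": 3, "bad": 0},
    {"snapshot": "2023-04", "account_id": 4, "bad": 0},
    {"snapshot": "2023-05", "account_id": 5, "bad": 1},
]
development, validation = split_by_snapshot(records, holdout_periods=2)
```

In practice the development window would span enough months to cover a full seasonal cycle; five snapshots are used here only to keep the example short.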



