My Blog

My learnings and etc.

Santander Product Recommendation Kaggle

Santander Product Recommendation

Under their current system, a small number of Santander’s customers receive many recommendations while many others rarely see any resulting in an uneven customer experience. In their second competition, Santander is challenging Kagglers to predict which products their existing customers will use in the next month based on their past behavior and that of similar customers.

With a more effective recommendation system in place, Santander can better meet the individual needs of all customers and ensure their satisfaction no matter where they are in life.

Based on Users history and previous products they subscribed to, predict what products he will be interested in future.

Data

In this competition, you are provided with 1.5 years of customers behavior data from Santander bank to predict what new products customers will purchase. The data starts at 2015-01-28 and has monthly records of products a customer has, such as “credit card”, “savings account”, etc. You will predict what additional products a customer will get in the last month, 2016-06-28, in addition to what they already have at 2016-05-28. These products are the columns named: ind(xyz)ult1, which are the columns #25 - #48 in the training data. You will predict what a customer will buy in addition to what they already had at 2016-05-28.

The test and train sets are split by time, and public and private leaderboard sets are split randomly.

Id Column Name Description
1 fecha_dato The table is partitioned for this column
2 ncodpers Customer code
3 ind_empleado Employee index: A active, B ex employed, F filial, N not employee, P pasive
4 pais_residencia Customer’s Country residence
5 sexo Customer’s sex
6 age Age
7 fecha_alta The date in which the customer became as the first holder of a contract in the bank
8 ind_nuevo New customer Index. 1 if the customer registered in the last 6 months.
9 antiguedad Customer seniority (in months)
10 indrel 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)
11 ult_fec_cli_1t Last date as primary customer (if he isn’t at the end of the month)
12 indrel_1mes Customer type at the beginning of the month ,1 (First/Primary customer), 2 (co-owner ),P (Potential),3 (former primary), 4(former co-owner)
13 tiprel_1mes Customer relation type at the beginning of the month, A (active), I (inactive), P (former customer),R (Potential)
14 indresi Residence index (S (Yes) or N (No) if the residence country is the same than the bank country)
15 indext Foreigner index (S (Yes) or N (No) if the customer’s birth country is different than the bank country)
16 conyuemp Spouse index. 1 if the customer is spouse of an employee
17 canal_entrada channel used by the customer to join
18 indfall Deceased index. N/S
19 tipodom Addres type. 1, primary address
20 cod_prov Province code (customer’s address)
21 nomprov Province name
22 ind_actividad_cliente Activity index (1, active customer; 0, inactive customer)
23 renta Gross income of the household
24 segmento segmentation: 01 - VIP, 02 - Individuals 03 - college graduated
25 ind_ahor_fin_ult1 Saving Account
26 ind_aval_fin_ult1 Guarantees
27 ind_cco_fin_ult1 Current Accounts
28 ind_cder_fin_ult1 Derivada Account
29 ind_cno_fin_ult1 Payroll Account
30 ind_ctju_fin_ult1 Junior Account
31 ind_ctma_fin_ult1 Más particular Account
32 ind_ctop_fin_ult1 particular Account
33 ind_ctpp_fin_ult1 particular Plus Account
34 ind_deco_fin_ult1 Short-term deposits
35 ind_deme_fin_ult1 Medium-term deposits
36 ind_dela_fin_ult1 Long-term deposits
37 ind_ecue_fin_ult1 e-account
38 ind_fond_fin_ult1 Funds
39 ind_hip_fin_ult1 Mortgage
40 ind_plan_fin_ult1 Pensions
41 ind_pres_fin_ult1 Loans
42 ind_reca_fin_ult1 Taxes
43 ind_tjcr_fin_ult1 Credit Card
44 ind_valo_fin_ult1 Securities
45 ind_viv_fin_ult1 Home Account
46 ind_nomina_ult1 Payroll
47 ind_nom_pens_ult1 Pensions
48 ind_recibo_ult1 Direct Debit

My approach

Best Private score: 0.030378(Rank 120/1785) {Notebook 11. and 12.}

Divided data month wise for so that it will be easier for my local computer to handle. So a simple month based grep will create files train_2015_01_28.csv that contain data from Jan 2015 and so on.

And for each month, computed what users added in then next month, say for example for the month June 2015, there are total 631957 train data rows, but the number of products added in that month were 41746 by all the users combined. This data will be in files added_product_2015_06_28.csv. I precomputed all these so that I don’t have to do it again and again for each model. For each month we will be training on the users that added a new product the next month. That makes the data even more maneagable in terms of size. Foreach month there are on an average of 35-40k users who added a new product. We are interested how likely a user is interested in a new product. Thanks to BreakfastPirate for making the train data an order of magnitude lesser by showing in the forums this approach gives us meaningful results. The computation of added_product_* files can be found in 999. Less is more.ipynb notebook.

Feature Engineering.

So just to reiterate, the final training will be done on, for each month, we get all the users who added a product in the next month, which reduces the train size by 10x, and combine this data for all the months.

And data clean up and imputation is done by assuming for categorical variables the median values, and for varibles like rent, mean based on the city of the user. This job was made a lot easier because of Alan (AJ) Pryor, Jr.’s script that cleans up the data and does imputation.

Lag features from the last 4 months. Along with the raw lags of the product subscription history of a user, I computed 6 more features based on the past 4 month history of product subscription that goes as below.

  • product exists atleast once in the past
  • product exists all the months(last 4 months)
  • product doesn’t exist at all
  • product removed in the past(removed before this current month)
  • product added in the past(added before this current month)
  • product removed recently(removed this current month)
  • product added recently(added this current month)

These are the features that gave me the best results. Trained using xgboost. and you can find them in notebook 11. and best hyperparameters through grid search in notebook 12.

Best Hyperparameters through grid search:
num_class: 24,
silent: 0,
eval_metric: ‘mlogloss’,
colsample_bylevel: 0.95,
max_delta_step: 7,
min_child_weight: 1,
subsample: 0.9,
eta: 0.05,
objective: ‘multi:softprob’,
colsample_bytree: 0.9,
seed: 1428,
max_depth: 6

Results:

Notebooks convention is that the number of notebook will correspond to the notebook that start with that file, and the decimal number will also be an extra submission in the same notebook. So, you can find all 18, 18.4, 18.2 in the notebook that starts with 18. in my notebooks. And all the notebooks contain the results, graphs and etc.

And notebooks that start with 999. are helper scripts and etc.

Leaderboard scores of my various approaches:

Just included the submission that scored more than 0.03 in private leaderboard

Notebook Public Score Private Score Private Rank/1785
12. 0.0300179 0.030378 120
11 0.0300507 0.0303626 129
14.2 0.0300296 0.0303479 -
11.1 0.0300342 0.0303458 -
18.4 0.030012 0.0303328 -
18.3 0.0299928 0.0303158 -
14.3 0.0299976 0.0303028 -
18.2 0.0300021 0.0302888 -
9.2 0.0299777 0.0302416 -
17 0.0299372 0.0302239 -
9 0.0299042 0.0301812 -
16 0.029886 0.0301792 -
18.1 0.0298155 0.0301786 -
16.3 0.0297175 0.030066 -
17.1 0.0297196 0.0300583 -

Map 7 score for each iteration of xgboost of my best submission:

Imgur

Feature importance in my best submission:

Imgur

Additional appraoches that I tried.

I looked at product histories of several users over months, to understand when, a product is more likely to be added, to help me with my intuitions, and add more features. Below are product histories of few users.

Imgur
Imgur
Imgur
Imgur
Imgur

Below is a representation of how similar each product is to other products if each product is defined as a set of users who subscribed to that particular product.

Cosine Similarities of products
Imgur

Jacobian Similarities of products
Imgur

There are two important things from the above graphs, that I wanted to capture in terms of features.

  • From the product history vizs, if a particular product is being added and removed consitently and if it doesn’t exist it is more likely to be added.. So features like, is_recently_added, is_recently_removed, exists_in_the_past, no_of_times_product_flanked no_positive_flanks, no_negative_flanks and etc.. In my training set, I only considered the past 4 months product subscription history, but from one of the top solution sharing posts I noticed that people used entire product history to generate these features. That might have increased my score just considering the entire history for each month to generate these features.
  • Another thing I wanted to capture is how likely is a product to be subscribed based on other products that were recently added. Say from the similarity vizs you can see that how closely cno_fin is correlated to nomina or nom_pens and from some product history vizs I observed that if cno_fin was added recently even though nomina never had a history for a given user, he is likely to add it next month. So additional features I generated are based on current months subscription data, from jacobian similarity weights and cosine similarity weights I simply summed the weights of unsubscribed products of the respective subscirbed products. These ended up being valuable features, with hight feature importance scores but I didn’t find them add more to my lb score. Imgur
  • Additional features I tried are lag features related to user attributes, but I didn’t find these added much to my lb scores. Say a user changed from non primary subscriber to a primary subscriber, and he might be intersted in or be eligible for more products..
  • I wanted to caputure the trends and seasonality of product subscriptions, so along with raw month features, as jan is closer to december then what 1, 12 represents so instead use np.cos, np.sin of the month numbers. We can also use a period of 3 months by just using features np.cos(month/4) and np.sin(month/4)

Things learned from post competition solution sharing.

  • Entire product history is much more use ful than limiting myself to just 4 past months
  • Likilyhood of a product getting subscribed is also dependent on the month. I was not able to successfully exploit this.

Still reading various solutions, will update them once I get done with them.. Here are the direct links
* 1st place solution * 2nd place solution * 3rd place solution


Premilinary data analysis

2. ncodpers

1
2
$ cat train_ver2.csv | cut -d , -f 2 | sort -d | uniq -c | wc -l
  956646

3. ind_empleado

1
2
3
4
5
6
7
8
$ cat train_ver2.csv | cut -d , -f 3 | sort -d | uniq -c   
27734
2492 A
3566 B
2523 F
13610977 N
  17 S
   1 "ind_empleado"  

4. pais_residencia

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
$ cat train_ver2.csv | cut -d , -f 4 | sort -d | uniq -c
27734  
 111 AD  
 221 AE  
  17 AL  
  68 AO  
4835 AR  
 476 AT  
 424 AU  
  34 BA  
1526 BE  
 476 BG  
   6 BM  
1514 BO  
2351 BR  
 102 BY  
  17 BZ  
 446 CA  
  17 CD  
  17 CF  
  34 CG  
1995 CH  
  51 CI  
 989 CL  
  85 CM  
 563 CN  
3526 CO  
 147 CR  
 758 CU  
 102 CZ  
4625 DE  
  11 DJ  
 226 DK  
 424 DO  
  86 DZ  
2169 EC  
  45 EE  
  68 EG  
13553710 ES  
  34 ET  
 345 FI  
5161 FR  
  51 GA  
4605 GB  
  17 GE  
  17 GH  
  17 GI  
  17 GM  
  51 GN  
 119 GQ  
 243 GR  
 130 GT  
  34 GW  
  51 HK  
 282 HN  
  68 HR  
  37 HU  
 409 IE  
 413 IL  
 187 IN  
  17 IS  
2947 IT  
  11 JM  
 239 JP  
  72 KE  
  17 KH  
  96 KR  
  17 KW  
  17 KZ  
  17 LB  
  45 LT  
 124 LU  
  17 LV  
  17 LY  
 396 MA  
  96 MD  
  51 MK  
  17 ML  
  17 MM  
  51 MR  
   2 MT  
2573 MX  
  34 MZ  
 214 NG  
  62 NI  
 757 NL  
 136 NO  
  51 NZ  
  22 OM  
  77 PA  
 900 PE  
  34 PH  
  85 PK  
 599 PL  
 101 PR  
1419 PT  
1430 PY  
  52 QA  
2931 RO  
  34 RS  
 769 RU  
  79 SA  
 603 SE  
 117 SG  
  85 SK  
  17 SL  
  68 SN  
 102 SV  
  17 TG  
 102 TH  
  17 TN  
  62 TR  
  34 TW  
 493 UA  
3651 US  
 510 UY  
2331 VE  
  34 VN  
 119 ZA  
  11 ZW  
   1 "pais_residencia"

5. sexo

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 5 | sort -d | uniq -c
27804
6195253 H
7424252 V
   1 "sexo"

6. age

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
$ cat train_ver2.csv | cut -d , -f 6 | sort -d | uniq -c
 733   2
1534   3
2210   4
3004   5
3673   6
3792   7
4744   8
5887   9
7950  10
10481  11
12546  12
12745  13
12667  14
13118  15
11759  16
11953  17
10989  18
21597  19
422867  20
675988  21
736314  22
779884  23
734785  24
472016  25
347778  26
281981  27
240192  28
205709  29
186040  30
167985  31
169537  32
170477  33
174574  34
183577  35
198422  36
212420  37
231963  38
260548  39
287754  40
309051  41
319713  42
324303  43
322955  44
314771  45
299365  46
286505  47
271576  48
250484  49
236383  50
223297  51
211611  52
202527  53
181511  54
165355  55
151340  56
144645  57
134739  58
124177  59
117834  60
108356  61
101186  62
91521  63
87398  64
84750  65
81343  66
77693  67
78361  68
77745  69
70192  70
66825  71
67664  72
64431  73
59086  74
50597  75
48997  76
49218  77
34358  78
35065  79
35773  80
38217  81
33938  82
31860  83
30124  84
27754  85
24956  86
23648  87
21718  88
19175  89
16863  90
15098  91
13492  92
11642  93
10085  94
8511  95
7480  96
5962  97
4622  98
3617  99
27734  NA
3050 100
2666 101
2335 102
2003 103
1350 104
1280 105
 899 106
 594 107
 456 108
 265 109
 261 110
 252 111
 188 112
 117 113
  22 114
  82 115
  63 116
  14 117
   3 126
   8 127
   8 163
   3 164
   1 "age"

8. ind_nuevo

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 8 | sort -d | uniq -c
12808368  0
811207  1
27734 NA
   1 "ind_nuevo"

10. indrel

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 10 | sort -d | uniq -c
13594782  1
24793 99
27734 NA
   1 "indrel"

12. indrel_1mes

1
2
3
4
5
6
7
8
9
10
11
12
$ cat train_ver2.csv | cut -d , -f 12 | sort -d | uniq -c
149781
4357298 1
9133383 1.0
 577 2
 740 2.0
1570 3
2780 3.0
  83 4
 223 4.0
 874 P
   1 "indrel_1mes"

13. tiprel_1mes

1
2
3
4
5
6
7
8
$ cat train_ver2.csv | cut -d , -f 13 | sort -d | uniq -c
149781
6187123 A
7304875 I
   4 N
4656 P
 870 R
   1 "tiprel_1mes"

14. indresi

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 14 | sort -d | uniq -c
27734
65864 N
13553711 S
   1 "indresi"

15. indext

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 15 | sort -d | uniq -c
27734
12974839 N
644736 S
   1 "indext"

16. conyuemp

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 16 | sort -d | uniq -c
13645501
1791 N
  17 S
   1 "conyuemp"

17. canal_entrada

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
$ cat train_ver2.csv | cut -d , -f 17 | sort -d | uniq -c
186126
 210 004
29063 007
27139 013
  11 025
 152 K00
66656 KAA
62381 KAB
7697 KAC
10680 KAD
50764 KAE
30559 KAF
74295 KAG
24875 KAH
37699 KAI
24280 KAJ
 838 KAK
7573 KAL
11285 KAM
1370 KAN
6676 KAO
14928 KAP
18017 KAQ
32686 KAR
86221 KAS
3268209 KAT
 249 KAU
 107 KAV
34275 KAW
67350 KAY
32186 KAZ
1298 KBB
 241 KBD
 147 KBE
3760 KBF
1725 KBG
7197 KBH
 519 KBJ
 575 KBL
 555 KBM
  61 KBN
7380 KBO
  85 KBP
4179 KBQ
1875 KBR
 714 KBS
3127 KBU
 896 KBV
1130 KBW
 102 KBX
 642 KBY
46446 KBZ
1530 KCA
5187 KCB
49308 KCC
3276 KCD
 309 KCE
 420 KCF
5399 KCG
24098 KCH
26546 KCI
 215 KCJ
 957 KCK
4187 KCL
3251 KCM
1117 KCN
 179 KCO
 158 KCP
 198 KCQ
 196 KCR
 236 KCS
 105 KCT
1081 KCU
 229 KCV
  67 KCX
 379 KDA
  17 KDB
1531 KDC
 534 KDD
 828 KDE
 726 KDF
 469 KDG
 191 KDH
  17 KDI
  11 KDL
2468 KDM
 197 KDN
1763 KDO
1001 KDP
1112 KDQ
8050 KDR
1924 KDS
1299 KDT
2588 KDU
 427 KDV
 772 KDW
1728 KDX
1960 KDY
 531 KDZ
 851 KEA
 370 KEB
 231 KEC
3011 KED
 175 KEE
 329 KEF
1895 KEG
1546 KEH
 944 KEI
9247 KEJ
 542 KEK
2483 KEL
  68 KEM
4917 KEN
 850 KEO
 146 KEQ
5904 KES
 265 KEU
 844 KEV
5687 KEW
35146 KEY
2433 KEZ
409669 KFA
 107 KFB
3098360 KFC
44461 KFD
 250 KFE
5529 KFF
6800 KFG
2579 KFH
 881 KFI
6620 KFJ
3913 KFK
3806 KFL
 371 KFM
4520 KFN
9487 KFP
 371 KFR
6694 KFS
8036 KFT
4914 KFU
  67 KFV
  28 KGC
  17 KGN
  28 KGU
8950 KGV
 989 KGW
9474 KGX
4138 KGY
  51 KHA
16418 KHC
116891 KHD
4055270 KHE
20674 KHF
241084 KHK
45128 KHL
183924 KHM
116608 KHN
8992 KHO
 691 KHP
591039 KHQ
   1 KHR
   5 KHS
75607 RED
   1 "canal_entrada"

18. indfall

1
2
3
4
5
$ cat train_ver2.csv | cut -d , -f 18 | sort -d | uniq -c
27734
13584813 N
34762 S
   1 "indfall"

22. ind_actividad_cliente

1
2
3
4
5
6
7
8
9
$ cat train_ver2.csv | cut -d , -f 22 | sort -d | uniq -c
6903158  0
5841260  1
429322  A"
124933  ILLES"
85202  LA"
235700  LAS"
27734 NA
   1 "ind_actividad_cliente"

24. segmento

1
2
3
4
5
6
7
8
9
10
11
12
$ cat train_ver2.csv | cut -d , -f 24 | sort -d | uniq -c | head -100
418613
545352 01 - TOP
7542889 02 - PARTICULARES
4506805 03 - UNIVERSITARIO
  17 100000.44
  17 100001.85
  11 100007.28
  17 100010.67
  17 100013.34
  17 100013.7
  17 100014.21

Product stats

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
0    13645913
1        1396
Name: ind_ahor_fin_ult1, dtype: int64
0    13646993
1         316
Name: ind_aval_fin_ult1, dtype: int64
1    8945588
0    4701721
Name: ind_cco_fin_ult1, dtype: int64
0    13641933
1        5376
Name: ind_cder_fin_ult1, dtype: int64
0    12543689
1     1103620
Name: ind_cno_fin_ult1, dtype: int64
0    13518012
1      129297
Name: ind_ctju_fin_ult1, dtype: int64
0    13514567
1      132742
Name: ind_ctma_fin_ult1, dtype: int64
0    11886693
1     1760616
Name: ind_ctop_fin_ult1, dtype: int64
0    13056301
1      591008
Name: ind_ctpp_fin_ult1, dtype: int64
0    13623034
1       24275
Name: ind_deco_fin_ult1, dtype: int64
0    13624641
1       22668
Name: ind_deme_fin_ult1, dtype: int64
0    13060928
1      586381
Name: ind_dela_fin_ult1, dtype: int64
0    12518082
1     1129227
Name: ind_ecue_fin_ult1, dtype: int64
0    13395025
1      252284
Name: ind_fond_fin_ult1, dtype: int64
0    13566973
1       80336
Name: ind_hip_fin_ult1, dtype: int64
0    13522150
1      125159
Name: ind_plan_fin_ult1, dtype: int64
0    13611452
1       35857
Name: ind_pres_fin_ult1, dtype: int64
0    12930329
1      716980
Name: ind_reca_fin_ult1, dtype: int64
0    13041523
1      605786
Name: ind_tjcr_fin_ult1, dtype: int64
0    13297834
1      349475
Name: ind_valo_fin_ult1, dtype: int64
0    13594798
1       52511
Name: ind_viv_fin_ult1, dtype: int64
0.0    12885285
1.0      745961
Name: ind_nomina_ult1, dtype: int64
0.0    12821161
1.0      810085
Name: ind_nom_pens_ult1, dtype: int64
0    11901597
1     1745712
Name: ind_recibo_ult1, dtype: int64