Some of the features are highly correlated with each other. From each such pair I will remove one feature, because multicollinearity is not good for the model; shipping, for example, is very highly correlated with price.
Finally I have selected 15 features, as you can see here —
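As a minimal sketch of that pruning step (the frame, column names and threshold here are illustrative, not the real 15-feature selection), one column of every highly correlated pair can be dropped like this:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real dataset; 'shipping_cost' is a
# hypothetical column that is almost a linear function of 'price'
df = pd.DataFrame({
    'price': [10, 20, 30, 40, 50],
    'shipping_cost': [11, 21, 29, 41, 49],
    'item_condition_id': [1, 3, 2, 5, 4],
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
threshold = 0.95
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # the column(s) removed for multicollinearity
```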
6. Featurization of the data —
From the previous analysis we know that the unavailability of the brand name does affect the price. So what if I make a new feature with only 0 and 1: where brand_name is given I make it 1, and where it is not I make it 0? (Note that I am not removing the original feature brand_name; I am adding a new one, let's name it brand_name_exist_encoder.) After adding this feature, its correlation with price came out to almost 0.21, which is quite good.
So now we have one more feature, brand_name_exist_encoder.
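A minimal sketch of that indicator (toy data here; the real frame has about 1.4 million rows):

```python
import pandas as pd

# tiny stand-in for the real data
df = pd.DataFrame({
    'brand_name': ['Nike', None, 'Apple', None, 'PINK'],
    'price': [35.0, 10.0, 120.0, 8.0, 25.0],
})

# 1 where a brand name is present, 0 where it is missing;
# the original brand_name column stays untouched
df['brand_name_exist_encoder'] = df['brand_name'].notnull().astype(int)

# correlation of the new binary feature with price
print(df['brand_name_exist_encoder'].corr(df['price']))
```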
But I have not used this feature, because while training my model on it my system kept going down: the data is large (almost 1.4 million rows) and it was eating all my RAM, so I couldn't handle it. That said, I did train some models with it successfully, and they gave better results than the ones I will show you here. The reason could be the extra data and that one feature (brand_name_exist_encoder), which has a good correlation with price. So if your laptop has more RAM, you can try this.
Before doing featurization I will split my data into a train set and a test set. I have done this with this code —
from sklearn.model_selection import train_test_split

y=df['price']
X=df.drop('price',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Now in this section I will do featurization with three methods:
1. One Hot Encoding
2. TF-IDF Vectorizing
3. Word2Vec
I could also have done a TF-IDF weighted Word2Vec set, but I limited myself to these three only.
Basically I have made one feature set with each method.
On categorical data I am applying only one hot encoding, but on the text I am applying all three methods, making three feature sets —
(train_ohe, test_ohe),(train_tfidf,test_tfidf) and (train_w2v,test_w2v).
NOTE — Here I am fitting all encoders on the train set only and transforming both train and test, to prevent data leakage.
1. One Hot Encoding —
First I am making features with one hot encoding: I apply a CountVectorizer (bag of words) on the text columns and a OneHotEncoder on the categorical columns. See the code here —
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

### converting name to bow
count_vect=CountVectorizer(min_df=50)
count_vect.fit(X_train['name'])
train_name_ohe = count_vect.transform(X_train['name'])
test_name_ohe = count_vect.transform(X_test['name'])
print(train_name_ohe.shape)
print(test_name_ohe.shape)
### converting item_description to bow
count_vect=CountVectorizer(min_df=500)
count_vect.fit(X_train['item_description'])
train_item_description_ohe = count_vect.transform(X_train['item_description'])
test_item_description_ohe = count_vect.transform(X_test['item_description'])
print(train_item_description_ohe.shape)
print(test_item_description_ohe.shape)
### one hot encoding of brand_name
o_h_e=OneHotEncoder(handle_unknown='ignore')
o_h_e.fit(X_train['brand_name'].values.reshape(-1,1))
train_br_name_ohe=o_h_e.transform(X_train['brand_name'].values.reshape(-1,1))
test_br_name_ohe=o_h_e.transform(X_test['brand_name'].values.reshape(-1,1))
print(train_br_name_ohe.shape)
print(test_br_name_ohe.shape)
### one hot encoding of subcat1
o_h_e=OneHotEncoder(handle_unknown='ignore')
o_h_e.fit(X_train['subcat1'].values.reshape(-1,1))
train_subcat1_ohe=o_h_e.transform(X_train['subcat1'].values.reshape(-1,1))
test_subcat1_ohe=o_h_e.transform(X_test['subcat1'].values.reshape(-1,1))
print(train_subcat1_ohe.shape)
print(test_subcat1_ohe.shape)
### one hot encoding of subcat2
o_h_e=OneHotEncoder(handle_unknown='ignore')
o_h_e.fit(X_train['subcat2'].values.reshape(-1,1))
train_subcat2_ohe=o_h_e.transform(X_train['subcat2'].values.reshape(-1,1))
test_subcat2_ohe=o_h_e.transform(X_test['subcat2'].values.reshape(-1,1))
print(train_subcat2_ohe.shape)
print(test_subcat2_ohe.shape)
### one hot encoding of subcat3
o_h_e=OneHotEncoder(handle_unknown='ignore')
o_h_e.fit(X_train['subcat3'].values.reshape(-1,1))
train_subcat3_ohe=o_h_e.transform(X_train['subcat3'].values.reshape(-1,1))
test_subcat3_ohe=o_h_e.transform(X_test['subcat3'].values.reshape(-1,1))
print(train_subcat3_ohe.shape)
print(test_subcat3_ohe.shape)
### one hot encoding of subcat4
o_h_e=OneHotEncoder(handle_unknown='ignore')
o_h_e.fit(X_train['subcat4'].values.reshape(-1,1))
train_subcat4_ohe=o_h_e.transform(X_train['subcat4'].values.reshape(-1,1))
test_subcat4_ohe=o_h_e.transform(X_test['subcat4'].values.reshape(-1,1))
print(train_subcat4_ohe.shape)
print(test_subcat4_ohe.shape)
### one hot encoding of subcat5
o_h_e=OneHotEncoder(handle_unknown='ignore')
o_h_e.fit(X_train['subcat5'].values.reshape(-1,1))
train_subcat5_ohe=o_h_e.transform(X_train['subcat5'].values.reshape(-1,1))
test_subcat5_ohe=o_h_e.transform(X_test['subcat5'].values.reshape(-1,1))
print(train_subcat5_ohe.shape)
print(test_subcat5_ohe.shape)
In the code above I am encoding the text and categorical columns one by one.
I will stack all the features, including the numerical ones, using scipy's hstack, and save the result as train_OHE and test_OHE so that I don't have to run it again.
The code for stacking, saving and loading the data is here —
### concatenating all train vectors with scipy hstack
x_tr_o_h_e = sc.sparse.hstack((train_name_ohe,train_item_description_ohe,train_br_name_ohe,train_subcat1_ohe,
train_subcat2_ohe,train_subcat3_ohe,train_subcat4_ohe,train_subcat5_ohe,
x_tr_category_name_len,x_tr_item_condition_id,x_tr_name_each_sen_word,x_tr_item_desc_words_len,x_tr_shipping))
x_tr_o_h_e.shape
### concatenating all test vectors with scipy hstack
x_te_o_h_e = sc.sparse.hstack((test_name_ohe,test_item_description_ohe,test_br_name_ohe,test_subcat1_ohe,
test_subcat2_ohe,test_subcat3_ohe,test_subcat4_ohe,test_subcat5_ohe,
x_te_category_name_len,x_te_item_condition_id,x_te_name_each_sen_word,x_te_item_desc_words_len,x_te_shipping))
x_te_o_h_e.shape
# saving x_tr_o_h_e and x_te_o_h_e
sc.sparse.save_npz('train_OHE.npz',x_tr_o_h_e)
sc.sparse.save_npz('test_OHE.npz',x_te_o_h_e)
# loading data
train_OHE=sc.sparse.load_npz('train_OHE.npz')
test_OHE=sc.sparse.load_npz('test_OHE.npz')
2. TF-IDF Vectorizing —
Now, I have already done one hot encoding on the categorical data, so there is no need to do it again. Here I only have to do TF-IDF vectorization on the text data (name and item_description). Take a look at the code —
from sklearn.feature_extraction.text import TfidfVectorizer

# using tf-idf
### converting name and description to tfidf vectors
vect=TfidfVectorizer(min_df=200)
vect.fit(X_train['name'])
train_tfidf_name=vect.transform(X_train['name'])
test_tfidf_name=vect.transform(X_test['name'])
print(train_tfidf_name.shape)
print(test_tfidf_name.shape)
vect=TfidfVectorizer(min_df=500)
vect.fit(X_train['item_description'])
train_tfidf_item_desc=vect.transform(X_train['item_description'])
test_tfidf_item_desc=vect.transform(X_test['item_description'])
print(train_tfidf_item_desc.shape)
print(test_tfidf_item_desc.shape)
Now again the same thing: I will stack all the features, including the numerical ones, and save the result as train_tfidf and test_tfidf so that I don't have to run it again.
The code for stacking, saving and loading the data is here —
### concatenating all train vectors with scipy hstack
x_tr_tfidf = sc.sparse.hstack((train_tfidf_name,train_tfidf_item_desc,train_br_name_ohe,train_subcat1_ohe,train_subcat2_ohe,train_subcat3_ohe,train_subcat4_ohe,train_subcat5_ohe,x_tr_category_name_len,x_tr_item_condition_id,x_tr_name_each_sen_word,x_tr_item_desc_words_len,x_tr_shipping))
### concatenating all test vectors with scipy hstack
x_te_tfidf = sc.sparse.hstack((test_tfidf_name,test_tfidf_item_desc,test_br_name_ohe,test_subcat1_ohe,test_subcat2_ohe,test_subcat3_ohe,test_subcat4_ohe,test_subcat5_ohe,x_te_category_name_len,x_te_item_condition_id,x_te_name_each_sen_word,x_te_item_desc_words_len,x_te_shipping))
# saving x_tr_tfidf and x_te_tfidf
sc.sparse.save_npz('train_tfidf.npz',x_tr_tfidf)
sc.sparse.save_npz('test_tfidf.npz',x_te_tfidf)
# loading data
train_tfidf=sc.sparse.load_npz('train_tfidf.npz')
test_tfidf=sc.sparse.load_npz('test_tfidf.npz')
3. Word2Vec —
I am applying Word2Vec only on the text data. Basically, I convert each word of a sentence to a vector and combine these vectors (here, as a TF-IDF weighted average), so that one vector represents one text.
The image below should make this clear —
Now the question is how to generate the d-dimensional vector. Here I chose 100 dimensions and used pretrained GloVe vectors: once you download the pretrained GloVe model, you can give it a word and it will return a 100-dimensional vector. You can see how to do it in the code. You can download the GloVe model from Kaggle, or just search on Google for "glove vector 100 d kaggle download" and you will find it. Let's see the code —
import pickle
import numpy as np

### converting name and description to vectors with the word2vec model
with open('/content/drive/MyDrive/glove_vectors_100.pkl','rb') as f:
    w2v_model=pickle.load(f)
glove_vec=set(w2v_model.keys())

def making_w2v_vectors(item_description,column):
    tfidf_model=TfidfVectorizer()
    tfidf_model.fit(X_train[column])
    dictionary=dict(zip(tfidf_model.get_feature_names(),list(tfidf_model.idf_)))
    tfidf_words=set(tfidf_model.get_feature_names())
    tfidf_w2v_vec=[]
    for desc in item_description:
        vector=np.zeros(100)
        tfidf_weight=0
        for word in desc.split():
            if (word in glove_vec) and (word in tfidf_words):
                vec=w2v_model[word]
                tfidf=dictionary[word]*(desc.count(word)/len(desc.split()))
                vector+=(vec*tfidf)
                tfidf_weight+=tfidf
        if tfidf_weight != 0:
            vector/=tfidf_weight
        tfidf_w2v_vec.append(vector)
    return np.array(tfidf_w2v_vec)
x_tr_item_description_w2v=making_w2v_vectors(X_train['item_description'].values,'item_description')
x_te_item_description_w2v=making_w2v_vectors(X_test['item_description'].values,'item_description')
x_tr_name_w2v=making_w2v_vectors(X_train['name'].values,'name')
x_te_name_w2v=making_w2v_vectors(X_test['name'].values,'name')
I will save these as train_w2v and test_w2v, so that I don't have to run them again.
The code for stacking, saving and loading the data is here —
x_tr_w2v = sc.sparse.hstack((x_tr_name_w2v,x_tr_item_description_w2v,train_br_name_ohe,train_subcat1_ohe,train_subcat2_ohe,train_subcat3_ohe,train_subcat4_ohe,train_subcat5_ohe,x_tr_category_name_len,x_tr_item_condition_id,x_tr_name_each_sen_word,x_tr_item_desc_words_len,x_tr_shipping))
x_te_w2v = sc.sparse.hstack((x_te_name_w2v,x_te_item_description_w2v,test_br_name_ohe,test_subcat1_ohe,test_subcat2_ohe,test_subcat3_ohe,test_subcat4_ohe,test_subcat5_ohe,x_te_category_name_len,x_te_item_condition_id,x_te_name_each_sen_word,x_te_item_desc_words_len,x_te_shipping))
# saving word2vec vectors
sc.sparse.save_npz('train_w2v.npz',x_tr_w2v)
sc.sparse.save_npz('test_w2v.npz',x_te_w2v)
# loading data
train_w2v=sc.sparse.load_npz('train_w2v.npz')
test_w2v=sc.sparse.load_npz('test_w2v.npz')
Now all our feature sets are ready, so from here we will move on to the first cut models.
7. First cut models with hyperparameter tuning —
In the Kaggle competition they suggest RMSLE. Why? Because RMSLE is less impacted by outliers. If there is an outlier in the data and we calculate plain MSE, the error goes very high, and on the basis of that error we cannot say whether the model is good or bad. For example, say x_test has three points and all three are inliers; a good model will give a very small error on them. Now add a fourth test point which is an outlier: when you evaluate, you get a very high error and think "oh, my model is so bad". If you instead take the log before computing the error, the log compresses the outlier's contribution and shows you a more representative error.
So a single outlier point can make a good model look bad, and that is exactly what the log protects against.
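A tiny numeric illustration of that effect (all prices made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

y_true = np.array([10.0, 20.0, 30.0, 1000.0])   # last point is an outlier
y_pred = np.array([12.0, 18.0, 33.0,  500.0])   # good everywhere else

mse = mean_squared_error(y_true, y_pred)        # dominated by the one outlier
msle = mean_squared_log_error(y_true, y_pred)   # log compresses that error
print(mse, msle)
```

The squared error is huge only because of the fourth point, while the log-scale error stays small.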
Here I will do min-max scaling on y_train: I fit MinMaxScaler on y_train only and transform y_train; I will not touch y_test. See the code —
from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()
scaler.fit(np.array(y_train).reshape(-1,1))
y_train_minmax=scaler.transform(np.array(y_train).reshape(-1,1))
Now the question is how I will evaluate my model, because the model now predicts the price in the scaled format while y_test is at the original scale. For this I keep the min and max of y_train; with those I can convert the predicted prices back to the original price scale. Let's see the code for how to do it, supposing for now that I have a trained model —
y_pred=model.predict(test_tfidf)
# converting y_pred back to the original scale (inverse of min-max scaling)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
# calculating mse
mean_squared_error(y_test,y_pred1)
First I will make a random (dumb) model, so that we can see whether our first cut models are better or worse than a random baseline.
For the random model I take the average of the scaled y_train as the predicted price for the whole x_test and then calculate the MSE. See the code here —
avg_price=y_train_minmax.mean()
predicted_y = [avg_price for i in range(test_tfidf.shape[0])]
y_pred1=(np.array(predicted_y)*(y_train.max()-y_train.min()))+y_train.min()
print("mean_squared_error on Test Data using Random Model",mean_squared_error(y_test, y_pred1))
When you run this you get 2046.11.
Now you have a baseline: any model scoring worse than this line is worse than a dumb model.
1. SGDRegressor model —
I have fit this model on my three train sets (OHE, TF-IDF, Word2Vec) one by one, with hyperparameter tuning.
The code is here —
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

## sgd regression model
### on one_hot_encoded data
model1=SGDRegressor()
parameters={'penalty':('l2', 'l1', 'elasticnet'),'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_OHE,y_train_minmax.ravel())
cv.best_params_
model1=SGDRegressor(alpha=0.0001,max_iter=2000,penalty='l2')
model1.fit(train_OHE,y_train_minmax.ravel())
y_pred=model1.predict(test_OHE)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on tfidf data
model1=SGDRegressor()
parameters={'penalty':['l2'],'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_tfidf,y_train_minmax.ravel())
cv.best_params_
model=SGDRegressor(alpha=0.0001,max_iter=2000,penalty='l2')
model.fit(train_tfidf,y_train_minmax.ravel())
y_pred=model.predict(test_tfidf)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on word2vec data
model1=SGDRegressor()
parameters={'penalty':['l2'],'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_w2v,y_train_minmax.ravel())
cv.best_params_
model1=SGDRegressor(alpha=0.0001,max_iter=2000)
model1.fit(train_w2v,y_train_minmax.ravel())
y_pred=model1.predict(test_w2v)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
Here I got an MSE of 1185 on OHE data, 1263 on TF-IDF data and 1366 on W2V data; the OHE data gives the best result.
2. LGBMRegressor model —
code is below —
from lightgbm import LGBMRegressor

## LGBM regressor
### on one_hot_encoded data
model_nb=LGBMRegressor(n_jobs=-1)
parameters={
'learning_rate':[0.0001,0.001,0.1],
'n_estimators':[50,100,150,200],
'num_leaves':[20,40,60],
'max_depth':[2,3,4,5,6,7,8],
'boosting_type':['gbdt']
}
cv=GridSearchCV(model_nb,parameters,n_jobs=-1)
cv.fit(train_OHE,y_train_minmax.ravel())
cv.best_params_
model_nb=LGBMRegressor(**cv.best_params_,n_jobs=-1)  # refit with the best parameters found by the grid search
model_nb.fit(train_OHE,y_train_minmax.ravel())
y_pred=model_nb.predict(test_OHE)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on tfidf data
model_nb=LGBMRegressor(n_jobs=-1)
parameters={
'learning_rate':[0.0001,0.001,0.1],
'n_estimators':[50,100,150,200],
'num_leaves':[20,40,60],
'max_depth':[2,3,4,5,6,7,8],
'boosting_type':['gbdt']
}
cv=GridSearchCV(model_nb,parameters,n_jobs=-1)
cv.fit(train_tfidf,y_train_minmax.ravel())
cv.best_params_
model=LGBMRegressor(learning_rate=0.1,max_depth=8,n_estimators=200,num_leaves=60,n_jobs=-1)
model.fit(train_tfidf,y_train_minmax.ravel())
y_pred=model.predict(test_tfidf)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on w2v data
train_w2v=train_w2v.tocsr()
train_w2v=train_w2v.astype(dtype=np.float32)
y_train_minmax=y_train_minmax.astype(dtype=np.float32)
model_nb=LGBMRegressor()
parameters={
'learning_rate':[0.0001,0.001,0.1],
'n_estimators':[100,150,200],
'num_leaves':[20,40,60],
'max_depth':[5,6,7,8],
'boosting_type':['gbdt']
}
cv=GridSearchCV(model_nb,parameters)
cv.fit(train_w2v,y_train_minmax.ravel())
cv.best_params_
model=LGBMRegressor(boosting_type='gbdt',learning_rate=0.1,max_depth=8,n_estimators=200,num_leaves=60,n_jobs=-1)
model.fit(train_w2v,y_train_minmax.ravel())
y_pred=model.predict(test_w2v)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
Here I got an MSE of 1041 on OHE data, 1002 on TF-IDF data and 1065 on W2V data; this time the TF-IDF data gives the best result.
3. Ridge Regression Model —
from sklearn.linear_model import Ridge

## ridge regression model
### on one_hot_encoded data
model1=Ridge()
parameters={'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_OHE,y_train_minmax.ravel())
cv.best_params_
model1=Ridge(alpha=1,max_iter=1000)
model1.fit(train_OHE,y_train_minmax.ravel())
y_pred=model1.predict(test_OHE)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on tfidf data
model1=Ridge()
parameters={'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_tfidf,y_train_minmax.ravel())
cv.best_params_
model=Ridge(alpha=1,max_iter=1000)
model.fit(train_tfidf,y_train_minmax.ravel())
y_pred=model.predict(test_tfidf)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on word2vec data
model1=Ridge()
parameters={'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_w2v,y_train_minmax.ravel())
cv.best_params_
model1=Ridge(alpha=1,max_iter=1000)
model1.fit(train_w2v,y_train_minmax.ravel())
y_pred=model1.predict(test_w2v)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
Here I got an MSE of 1108 on OHE data, 1157 on TF-IDF data and 1303 on W2V data; the OHE data gives the best result.
4. Lasso Regression Model —
from sklearn.linear_model import Lasso

## lasso regression model
### on one_hot_encoded data
model1=Lasso()
parameters={'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_OHE,y_train_minmax.ravel())
cv.best_params_
model1=Lasso(alpha=0.0001,max_iter=1000)
model1.fit(train_OHE,y_train_minmax.ravel())
y_pred=model1.predict(test_OHE)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on tfidf data
model1=Lasso()
parameters={'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_tfidf,y_train_minmax.ravel())
cv.best_params_
model=Lasso(alpha=0.0001,max_iter=1000)
model.fit(train_tfidf,y_train_minmax.ravel())
y_pred=model.predict(test_tfidf)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
### on word2vec data
model1=Lasso()
parameters={'alpha':[0.0001,0.001,0.1,1],'max_iter':[1000,2000,3000,4000,5000]}
cv=GridSearchCV(model1,parameters)
cv.fit(train_w2v,y_train_minmax.ravel())
cv.best_params_
model1=Lasso(alpha=0.0001,max_iter=2000)
model1.fit(train_w2v,y_train_minmax.ravel())
y_pred=model1.predict(test_w2v)
y_pred1=(y_pred*(y_train.max()-y_train.min()))+y_train.min()
mean_squared_error(y_test,y_pred1)
Here I got an MSE of 1592 on OHE data, 1672 on TF-IDF data and 1644 on W2V data; the OHE data gives the best result, but this is the worst model so far.
5. XGBRegressor model —
from xgboost import XGBRegressor

## XGBRegressor model
### on one_hot_encoded data
model1 = XGBRegressor()
# parameters={
# 'n_estimators':[100,150,200],
# 'min_samples_leaf':[20,40,60],
# 'max_depth':[5,6,7,8],
# }
parameters = {'nthread': [4],
'learning_rate': [.01, 0.001, .0001],
'max_depth': [5, 6, 7],
'n_estimators': [300, 400, 500]
}
cv = GridSearchCV(model1, parameters)
cv.fit(train_OHE, y_train_minmax.ravel())
cv.best_params_
model1 = XGBRegressor(n_estimators=500, learning_rate=0.01, max_depth=7)
model1.fit(train_OHE, y_train_minmax.ravel())
y_pred = model1.predict(test_OHE)
y_pred1 = (y_pred * (y_train.max() - y_train.min())) + y_train.min()
mean_squared_error(y_test, y_pred1)
### on tfidf data
model1 = XGBRegressor()
parameters = {'nthread': [4],
'learning_rate': [.01, 0.001, .0001],
'max_depth': [5, 6, 7],
'n_estimators': [300, 400, 500]
}
cv = GridSearchCV(model1, parameters)
cv.fit(train_tfidf, y_train_minmax.ravel())
cv.best_params_
model = XGBRegressor(n_estimators=500, learning_rate=0.01, max_depth=7)
model.fit(train_tfidf, y_train_minmax.ravel())
y_pred = model.predict(test_tfidf)
y_pred1 = (y_pred * (y_train.max() - y_train.min())) + y_train.min()
mean_squared_error(y_test, y_pred1)
### on word2vec data
model1 = XGBRegressor()
parameters = {'nthread': [4],
'learning_rate': [.01, 0.001, .0001],
'max_depth': [5, 6, 7],
'n_estimators': [300, 400, 500]
}
cv = GridSearchCV(model1, parameters)
cv.fit(train_w2v, y_train_minmax.ravel())
cv.best_params_
model1 = XGBRegressor(n_estimators=500, learning_rate=0.01, max_depth=7)
model1.fit(train_w2v, y_train_minmax.ravel())
y_pred = model1.predict(test_w2v)
y_pred1 = (y_pred * (y_train.max() - y_train.min())) + y_train.min()
mean_squared_error(y_test, y_pred1)
Here I got an MSE of 1210 on OHE data, 1236 on TF-IDF data and 1278 on W2V data; the OHE data gives the best result.
These are the five models I tried as my first cut models. For further improvement over these results you can do more hyperparameter tuning; you will probably get somewhat better results.
Now let's see all the models' performance in one place, and which one gives the best result so far —
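For a quick comparison, the test-set MSEs reported in the sections above can be collected into one small table (the numbers are copied from the results stated earlier):

```python
import pandas as pd

# test-set MSEs reported above for each model / feature set
results = pd.DataFrame(
    {'OHE':      [1185, 1041, 1108, 1592, 1210],
     'TF-IDF':   [1263, 1002, 1157, 1672, 1236],
     'Word2Vec': [1366, 1065, 1303, 1644, 1278]},
    index=['SGDRegressor', 'LGBMRegressor', 'Ridge', 'Lasso', 'XGBRegressor'])
print(results)
print('best MSE so far:', results.min().min())
```

The lowest value in the table is the 1002 that LGBMRegressor achieved on the TF-IDF data.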