How to begin?

Exploratory Data Analysis (EDA)

  • Load the dataset and inspect the structure (df.info(), df.describe()).
  • Visualize distributions of features using histograms, box plots, and pair plots.
  • Check for missing values (df.isnull().sum()).
  • Identify categorical and numerical features.
  • Look for correlations between the features and the target variable (see the sketch below).
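  • A minimal sketch of these inspection steps, assuming the data sits in a CSV file named train.csv and the target column name is a placeholder:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv('train.csv')

    df.info()                          # dtypes and non-null counts per column
    print(df.describe())               # summary statistics for numerical columns
    print(df.isnull().sum())           # missing values per column

    # Separate categorical and numerical features
    cat_cols = df.select_dtypes(include=['object']).columns
    num_cols = df.select_dtypes(include=['number']).columns

    # Distribution of one numerical feature
    sns.histplot(df[num_cols[0]])
    plt.show()

    # Correlation of the numerical features with the target
    print(df[num_cols].corr()['<target column>'].sort_values(ascending=False))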

Concatenate the train and test files

  • Some preprocessing steps must be applied uniformly across both datasets.
  • Some columns may have missing data in the test data but not in the train data.
  • Some categories may be present in the test data but not in the train data.
  • df_2['target'] = 0 (df_2 -> test data)
    • done to ensure that both the train and test datasets have the same columns before merging.
  • df = pd.concat([df_1, df_2], axis = 0)
  • axis=0 => the new values will be stacked vertically.
  • That is, the rows of df_2 (the test data) are appended below the rows of df_1 (see the sketch below).
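  • A short sketch of this step, assuming df_1 is the train file and df_2 is the test file (file and target names are placeholders):
    import pandas as pd

    df_1 = pd.read_csv('train.csv')
    df_2 = pd.read_csv('test.csv')

    # Give the test set a placeholder target so both frames have identical columns
    df_2['<target column>'] = 0

    # axis=0 stacks the frames vertically: test rows are appended below train rows
    df = pd.concat([df_1, df_2], axis=0)

    print(len(df_1), len(df_2), len(df))   # the row counts should add up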

Indexing

  • df = df.set_index('Id')
  • The ‘Id’ column is removed from the regular columns and becomes the index of df.
  • Make sure that the entries are unique.
  • The index is used to uniquely identify rows, making certain operations (like locating rows) easier.
  • Reset the index using df = df.reset_index()
  • This will bring Id back as a regular column (see the sketch below).
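  • A small sketch of the indexing round trip, assuming an 'Id' column exists:
    # Confirm the Id values are unique before promoting them to the index
    assert df['Id'].is_unique

    df = df.set_index('Id')    # 'Id' leaves the regular columns and becomes the index
    print(df.loc[1])           # locate a row by its Id (assuming an Id of 1 exists)

    df = df.reset_index()      # bring 'Id' back as a regular column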

Dealing with NULL values

  • df.isnull().sum() tells how many null values each column has.
  • drop the columns where more than 50% of the entries have missing values.
    • df = df.drop(columns=[col_name])
  • Fill the missing values with the mean, median, mode, next value, or previous value.
  • sns.heatmap(df.isnull()): this heatmap will show blank spaces (or different colors) where NaN values are present.
  • Check both the test and train data files, as each may have a different set of columns with missing data.
  • For categorical data use mode.
  • df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean()) fills the missing values with the mean of all the rows (the mean is calculated ignoring the missing values).
    • Mean : when data is normally distributed, without extreme outliers.
    • Median : when data has outliers.
    • Mode : when data is categorical or ordinal, discrete numerical data.
  • If there are very few null values remaining, then just remove those rows: df.dropna(inplace=True).
  • df_null = df[df.isnull().sum()[df.isnull().sum()>0].index]
    • Extracts all the columns that contain at least one null value and places them in a new DataFrame.
  • Some numerical columns can also use the mode, depending on the description of that particular column (see the sketch below).
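  • A sketch of the common fill strategies (LotFrontage is from the example above; the other column names are placeholders):
    import seaborn as sns

    # Roughly normal numerical column without extreme outliers -> mean
    df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

    # Numerical column with outliers -> median
    df['<numeric column>'] = df['<numeric column>'].fillna(df['<numeric column>'].median())

    # Categorical / ordinal column -> mode (mode() returns a Series, hence [0])
    df['<categorical column>'] = df['<categorical column>'].fillna(df['<categorical column>'].mode()[0])

    # Previous / next value fills
    df['<numeric column>'] = df['<numeric column>'].ffill()   # previous value
    df['<numeric column>'] = df['<numeric column>'].bfill()   # next value

    # Visualize what is left, then drop the last few incomplete rows
    sns.heatmap(df.isnull())
    df.dropna(inplace=True)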

Preprocessing Steps

One-Hot Encoding

  • To deal with categorical data:

  • Converts all categorical columns (object or category dtype) into one-hot encoded columns.
  • Each unique category in a categorical column is converted into a separate binary column (0 or 1).
  • If df_objects has n categorical columns with m unique values across them, it will create m new columns (see the toy example below).
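  • A toy example of this behaviour (made-up data; Street and Alley are just illustrative column names):
    import pandas as pd

    toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                        'Alley':  ['Grvl', 'Pave', 'Grvl']})

    print(pd.get_dummies(toy))
    # 2 categorical columns with 4 unique values across them -> 4 binary columns:
    # Street_Grvl, Street_Pave, Alley_Grvl, Alley_Pave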

  • df_objects = df[df.select_dtypes(include=['object']).columns]
  • Select all the columns with categorical data.
  • df_objects = df_objects.drop(df_objects[df_objects.isna().sum()[df_objects.isna().sum() > 1100].index], axis = 1)
  • Drop those columns where a large share of the rows (more than 1100 here) have null values.
  • df_objects = df_objects.fillna('null')
  • If the value is null then replace it with string “null” to perform one-hot encoding.
  • df_objects_encoded = pd.get_dummies(df_objects)
  • Perform one-hot encoding.
  • It adds a column <original_column_name>_null because we filled the NA values with the string “null” above. It is only a placeholder.
  • Therefore remove it:
    # Drop only the placeholder *_null columns created by the fill above
    null_cols = [c for c in df_objects_encoded.columns if c.endswith('_null')]
    df_objects_encoded = df_objects_encoded.drop(columns=null_cols)
    
  • Merge the encoded columns with the rest of the dataframe. Drop the original categorical columns first, otherwise both the raw and the encoded versions end up in new_df.
    new_df = pd.concat([df.drop(columns=df.select_dtypes(include=['object']).columns),
                        df_objects_encoded], axis = 1)

Splitting the data into test and train sets

  • Split the data back, because the train and test files were concatenated at the beginning.
      training_data = new_df[0:len(df_1)]
      testing_data = new_df[len(df_1):]
      testing_data = testing_data.drop(columns='<target column>')
    
  • Use train_test_split to split the training data.
  • The testing data must not be used to train the model.
    from sklearn.model_selection import train_test_split
    import numpy as np

    X = training_data.drop(columns='<target column>')
    y = training_data['<target column>']
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)
    Y_train = np.reshape(Y_train, (-1, 1))
    Y_test = np.reshape(Y_test, (-1, 1))
    
  • np.reshape(Y_train, (-1, 1)) turns the 1D targets of shape (n_samples,) into a 2D column vector of shape (n_samples, 1). Note that most scikit-learn estimators actually expect y as a 1D array and may warn when given a column vector, so this reshape is only needed where a 2D column vector is explicitly required.

Training the Model

Import the models from sklearn

from sklearn.metrics import mean_squared_error 
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
  • mean_squared_error is used as the evaluation metric for comparing the models.

Train the models

model = <model_name>()
model.fit(X_train, Y_train)     # fit on the training split only, not on all of X
y_pred = model.predict(X_test)
mean_squared_error(Y_test, y_pred)

Select the model with the least error.
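
A sketch of this comparison on the held-out split, assuming the imports above and the X_train / X_test / Y_train / Y_test split from the previous section:

import numpy as np

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(),
    'XGBRegressor': XGBRegressor(),
}

for name, model in models.items():
    model.fit(X_train, np.ravel(Y_train))   # flatten y to 1D, which sklearn expects
    y_pred = model.predict(X_test)
    print(name, 'MSE:', mean_squared_error(Y_test, y_pred))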

Export the submission file

pred = <final model>.predict(testing_data)

final = pd.DataFrame()
final['Id'] = testing_data.index
final['<target col.>'] = pred

# Write DataFrame to a CSV file without index
final.to_csv('output.csv', index=False)
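
A quick sanity check of the exported file (a minimal sketch):

import pandas as pd

check = pd.read_csv('output.csv')
print(check.shape)    # expect (number of test rows, 2): Id plus the target column
print(check.head())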