Kaggle tips
How to begin?
Exploratory Data Analysis (EDA)
- Load the dataset and inspect the structure (df.info(), df.describe()).
- Visualize distributions of features using histograms, box plots, and pair plots.
- Check for missing values (df.isnull().sum()).
- Identify categorical and numerical features.
- Look for correlations between features and the target variable.
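As a concrete illustration of the checklist above, a minimal EDA sketch (the file name train.csv and the target name SalePrice are assumptions, not part of the original notes):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')            # hypothetical file name
df.info()                                 # column dtypes and non-null counts
print(df.describe())                      # summary statistics for numerical columns
print(df.isnull().sum())                  # missing values per column
df.hist(figsize=(15, 10))                 # distributions of numerical features
plt.show()
# correlation of numerical features with the target ('SalePrice' is only an example name)
print(df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False))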
Concatenate the train and test files
- Some preprocessing steps must be applied uniformly across both datasets.
- Some columns may have missing data in the test data but not in the train data.
- Some categorical values may be present in the test data but not in the train data.
df_2['<target column>'] = 0
- (df_2 -> test data) This is done so that both the train and test datasets have the same columns before merging.
df = pd.concat([df_1, df_2], axis=0)
- axis=0 => the new values are stacked vertically, i.e. the rows of df_2 are appended below the rows of df_1.
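A minimal end-to-end sketch of this step (the file names and the SalePrice target are assumptions, not part of the original notes):

import pandas as pd

df_1 = pd.read_csv('train.csv')          # hypothetical train file
df_2 = pd.read_csv('test.csv')           # hypothetical test file (no target column)

df_2['SalePrice'] = 0                    # placeholder target so both frames share the same columns
df = pd.concat([df_1, df_2], axis=0)     # test rows are stacked below the train rows
print(len(df_1), len(df_2), len(df))     # the row counts should add up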
Indexing
df = df.set_index('Id')
- The ‘Id’ column is removed from the regular columns and becomes the index of df.
- Make sure that the index values are unique; see the quick check below.
- The index is used to uniquely identify rows, making certain operations (like locating rows) easier.
- reset the index using
df = df.reset_index()
- This will bring Id back as a regular column.
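A quick way to verify the uniqueness point above (a minimal sketch, assuming the index has already been set to 'Id'):

# Duplicate Ids would make row lookups by index ambiguous
print(df.index.is_unique)   # expected: True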
Dealing with NULL values
df.isnull().sum()
- Tells how many null values each column has.
- Drop the columns where more than 50% of the entries have missing values.
df = df.drop(columns=[col_name])
- Fill the missing values with the mean, median, mode, next value, or previous value.
sns.heatmap(df.isnull())
- This heatmap will show blank spaces (or different colors) where NaN values are present.
- Check both the train and test data files, as each may have a different set of columns with missing data.
- For categorical data use mode.
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
- Fills the missing values with the mean of all the rows (the mean is calculated ignoring the missing values).
- Mean : when data is normally distributed, without extreme outliers.
- Median : when data has outliers.
- Mode : when data is categorical or ordinal, discrete numerical data.
- If there are very few null values remaining then just remove those rows:
df.dropna(inplace=True)
df_null = df[df.isnull().sum()[df.isnull().sum() > 0].index]
- Extracts all the columns that still have at least one null value and places them in a new data frame.
- Some numerical columns can also be filled with the mode, depending on the description of that particular column (see the sketch below).
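The mean/median/mode guidelines above as code, a minimal sketch (the placeholder column names are not from the original notes):

# Mean: roughly normal numerical column (LotFrontage is the example used above)
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
# Median: numerical column with outliers (placeholder name)
df['<skewed numeric column>'] = df['<skewed numeric column>'].fillna(df['<skewed numeric column>'].median())
# Mode: categorical/ordinal column (mode() returns a Series, so take the first value)
df['<categorical column>'] = df['<categorical column>'].fillna(df['<categorical column>'].mode()[0])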
Preprocessing Steps
One-Hot Encoding
- To deal with categorical data:
- Converts all categorical columns (object or category dtype) into one-hot encoded columns.
- Each unique category in a categorical column is converted into a separate binary column (0 or 1).
- If df_objects has n categorical columns with m unique values across them, it will create m new columns.
df_objects = df[df.select_dtypes(include=['object']).columns]
- Select all the columns with categorical data.
cols_to_drop = df_objects.isna().sum()[df_objects.isna().sum() > 1100].index
df_objects = df_objects.drop(columns=cols_to_drop)
- Drop the columns where a large number of rows (here, more than 1100) have null values.
df_objects = df_objects.fillna('null')
- If the value is null then replace it with string “null” to perform one-hot encoding.
df_objects_encoded = pd.get_dummies(df_objects)
- Perform one-hot encoding.
- It adds a column <original_column_name>_null because we replaced NaN values with the string "null" above. It is only a placeholder.
- Therefore remove it:
for i in df_objects_encoded.columns:
    if 'null' in i:
        df_objects_encoded = df_objects_encoded.drop(i, axis=1)
- Merge the encoded columns with the rest of the dataframe, dropping the original (unencoded) object columns so that only numeric columns remain:
new_df = pd.concat([df.select_dtypes(exclude=['object']), df_objects_encoded], axis=1)
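A quick sanity check after the merge (a small sketch, not part of the original notes):

# No raw object-dtype columns should remain before modelling
print(new_df.select_dtypes(include='object').columns)   # expected: an empty Index
print(new_df.shape)                                      # same number of rows, more columns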
Splitting the data into train and test sets
- Split the data because the train and test files were concatenated at the beginning.
training_data = new_df.iloc[0:len(df_1)]
testing_data = new_df.iloc[len(df_1):]
testing_data = testing_data.drop(columns='<target column>')
- Use train_test_split to split the training data.
- The testing data must not be used to train the model.
import numpy as np
from sklearn.model_selection import train_test_split

X = training_data.drop(columns='<target column>')
y = training_data['<target column>']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
Y_train = np.reshape(Y_train, (-1, 1))
Y_test = np.reshape(Y_test, (-1, 1))
- The reshape to (-1, 1) is done because train_test_split returns Y_train and Y_test as 1D arrays of shape (n_samples,), while some downstream steps expect a 2D column vector of shape (n_samples, 1).
Training the Model
Import the models and the evaluation metric
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
- mean_squared_error is used as the evaluation metric to compare the models.
Train the models
model = <model_name>()
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
mean_squared_error(Y_test, y_pred)
Select the model with the least error.
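A hedged sketch of that comparison as a loop over the models imported above (it assumes all missing values and categorical columns have already been handled):

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(),
    'XGBRegressor': XGBRegressor(),
}
for name, model in models.items():
    model.fit(X_train, Y_train.ravel())                 # ravel() flattens the (n, 1) target back to 1D
    y_pred = model.predict(X_test)
    print(name, mean_squared_error(Y_test, y_pred))     # lower is better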
Export the submission file
pred = <final model>.predict(testing_data)
final = pd.DataFrame()
final['Id'] = testing_data.index
final['<target column>'] = pred
# Write DataFrame to a CSV file without index
final.to_csv('output.csv', index=False)
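An optional last sanity check on the exported file (a small sketch, not part of the original notes):

submission = pd.read_csv('output.csv')
print(submission.shape)    # expect one row per test Id and two columns
print(submission.head())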