How to begin?

Exploratory Data Analysis (EDA)

  • Load the dataset and inspect the structure (df.info(), df.describe()).
  • Visualize distributions of features using histograms, box plots, and pair plots.
  • Check for missing values (df.isnull().sum()).
  • Identify categorical and numerical features.
  • Look for correlations between the features and the target variable (see the sketch below).
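  • A minimal sketch of these inspection steps, assuming the data sits in a CSV file named train.csv and the target column name is a placeholder:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv('train.csv')

    df.info()                          # dtypes and non-null counts per column
    print(df.describe())               # summary statistics for numerical columns
    print(df.isnull().sum())           # missing values per column

    # Separate categorical and numerical features
    cat_cols = df.select_dtypes(include=['object']).columns
    num_cols = df.select_dtypes(include=['number']).columns

    # Distribution of one numerical feature
    sns.histplot(df[num_cols[0]])
    plt.show()

    # Correlation of the numerical features with the target
    print(df[num_cols].corr()['<target column>'].sort_values(ascending=False))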

Concatenate the train and test files

  • Some preprocessing steps must be applied uniformly across both datasets.
  • Some columns may have missing data in the test data but not in the train data.
  • Some categories may be present in the test data but not in the train data.
  • df_2['target'] = 0 (df_2 -> test data)
    • done to ensure that both the train and test datasets have the same columns before merging.
  • df = pd.concat([df_1, df_2], axis = 0)
  • axis=0 => the new values will be stacked vertically.
  • That is, the rows of df_2 (the test data) are appended below the rows of df_1 (see the sketch below).
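  • A short sketch of this step, assuming df_1 is the train file and df_2 is the test file (file and target names are placeholders):
    import pandas as pd

    df_1 = pd.read_csv('train.csv')
    df_2 = pd.read_csv('test.csv')

    # Give the test set a placeholder target so both frames have identical columns
    df_2['<target column>'] = 0

    # axis=0 stacks the frames vertically: test rows are appended below train rows
    df = pd.concat([df_1, df_2], axis=0)

    print(len(df_1), len(df_2), len(df))   # the row counts should add up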

Indexing

  • df = df.set_index('Id')
  • The ‘Id’ column is removed from the regular columns and becomes the index of df.
  • Make sure that the entries are unique.
  • The index is used to uniquely identify rows, making certain operations (like locating rows) easier.
  • Reset the index using df = df.reset_index()
  • This will bring Id back as a regular column (see the sketch below).
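  • A small sketch of the indexing round trip, assuming an 'Id' column exists:
    # Confirm the Id values are unique before promoting them to the index
    assert df['Id'].is_unique

    df = df.set_index('Id')    # 'Id' leaves the regular columns and becomes the index
    print(df.loc[1])           # locate a row by its Id (assuming an Id of 1 exists)

    df = df.reset_index()      # bring 'Id' back as a regular column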

Dealing with NULL values

  • df.isnull().sum() tells how many null values each column has.
  • drop the columns where more than 50% of the entries have missing values.
    • df = df.drop(columns=[col_name])
  • Fill the missing values with the mean, median, mode, next value, or previous value.
  • sns.heatmap(df.isnull()): this heatmap will show blank spaces (or different colors) where NaN values are present.
  • Check both the test and train data files, as each may have a different set of columns with missing data.
  • For categorical data use mode.
  • df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean()) fills the missing values with the mean of all the rows (the mean is calculated ignoring the missing values).
    • Mean : when data is normally distributed, without extreme outliers.
    • Median : when data has outliers.
    • Mode : when data is categorical or ordinal, discrete numerical data.
  • If there are very few null values remaining, then just remove those rows: df.dropna(inplace=True).
  • df_null = df[df.isnull().sum()[df.isnull().sum()>0].index]
    • Extracts all the columns that contain at least one null value and places them in a new DataFrame.
  • Some numerical columns can also use the mode, depending on the description of that particular column (see the sketch below).
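  • A sketch of the common fill strategies (LotFrontage is from the example above; the other column names are placeholders):
    import seaborn as sns

    # Roughly normal numerical column without extreme outliers -> mean
    df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

    # Numerical column with outliers -> median
    df['<numeric column>'] = df['<numeric column>'].fillna(df['<numeric column>'].median())

    # Categorical / ordinal column -> mode (mode() returns a Series, hence [0])
    df['<categorical column>'] = df['<categorical column>'].fillna(df['<categorical column>'].mode()[0])

    # Previous / next value fills
    df['<numeric column>'] = df['<numeric column>'].ffill()   # previous value
    df['<numeric column>'] = df['<numeric column>'].bfill()   # next value

    # Visualize what is left, then drop the last few incomplete rows
    sns.heatmap(df.isnull())
    df.dropna(inplace=True)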

Preprocessing Steps

One-Hot Encoding

  • To deal with categorical data:

  • Converts all categorical columns (object or category dtype) into one-hot encoded columns.
  • Each unique category in a categorical column is converted into a separate binary column (0 or 1).
  • If df_objects has n categorical columns with m unique values across them, it will create m new columns (see the toy example below).
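  • A toy example of this behaviour (made-up data; Street and Alley are just illustrative column names):
    import pandas as pd

    toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                        'Alley':  ['Grvl', 'Pave', 'Grvl']})

    print(pd.get_dummies(toy))
    # 2 categorical columns with 4 unique values across them -> 4 binary columns:
    # Street_Grvl, Street_Pave, Alley_Grvl, Alley_Pave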

  • df_objects = df[df.select_dtypes(include=['object']).columns]
  • Select all the columns with categorical data.
  • df_objects = df_objects.drop(df_objects[df_objects.isna().sum()[df_objects.isna().sum() > 1100].index], axis = 1)
  • Drop those columns where a large share of the rows (more than 1100 here) have null values.
  • df_objects = df_objects.fillna('null')
  • If the value is null then replace it with string “null” to perform one-hot encoding.
  • df_objects_encoded = pd.get_dummies(df_objects)
  • Perform one-hot encoding.
  • It adds a column <original_column_name>_null because we filled the NA values with the string “null” above. It is only a placeholder.
  • Therefore remove it:
    # Drop only the placeholder *_null columns created by the fill above
    null_cols = [c for c in df_objects_encoded.columns if c.endswith('_null')]
    df_objects_encoded = df_objects_encoded.drop(columns=null_cols)
    
  • Merge the encoded columns with the rest of the dataframe. Drop the original categorical columns first, otherwise both the raw and the encoded versions end up in new_df.
    new_df = pd.concat([df.drop(columns=df.select_dtypes(include=['object']).columns),
                        df_objects_encoded], axis = 1)

Splitting the data into test and train sets

  • Split the data back, because the train and test files were concatenated at the beginning.
      training_data = new_df[0:len(df_1)]
      testing_data = new_df[len(df_1):]
      testing_data = testing_data.drop(columns='<target column>')
    
  • Use train_test_split to split the training data.
  • The testing data must not be used to train the model.
    from sklearn.model_selection import train_test_split
    import numpy as np

    X = training_data.drop(columns='<target column>')
    y = training_data['<target column>']
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)
    Y_train = np.reshape(Y_train, (-1, 1))
    Y_test = np.reshape(Y_test, (-1, 1))
    
  • np.reshape(Y_train, (-1, 1)) turns the 1D targets of shape (n_samples,) into a 2D column vector of shape (n_samples, 1). Note that most scikit-learn estimators actually expect y as a 1D array and may warn when given a column vector, so this reshape is only needed where a 2D column vector is explicitly required.

Training the Model

Import the models from sklearn

from sklearn.metrics import mean_squared_error 
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
  • mean_squared_error is used as the evaluation metric for comparing the models.

Train the models

model = <model_name>()
model.fit(X_train, Y_train)     # fit on the training split only, not on all of X
y_pred = model.predict(X_test)
mean_squared_error(Y_test, y_pred)

Select the model with the least error.
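
A sketch of this comparison on the held-out split, assuming the imports above and the X_train / X_test / Y_train / Y_test split from the previous section:

import numpy as np

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(),
    'XGBRegressor': XGBRegressor(),
}

for name, model in models.items():
    model.fit(X_train, np.ravel(Y_train))   # flatten y to 1D, which sklearn expects
    y_pred = model.predict(X_test)
    print(name, 'MSE:', mean_squared_error(Y_test, y_pred))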

Export the submission file

pred = <final model>.predict(testing_data)

final = pd.DataFrame()
final['Id'] = testing_data.index
final['<target col.>'] = pred

# Write DataFrame to a CSV file without index
final.to_csv('output.csv', index=False)
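
A quick sanity check of the exported file (a minimal sketch):

import pandas as pd

check = pd.read_csv('output.csv')
print(check.shape)    # expect (number of test rows, 2): Id plus the target column
print(check.head())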