Kaggle Playground Series S5E1: Tough Beginnings
Kaggle competitions offer a fantastic playground for machine learning enthusiasts to test their skills, learn from others, and tackle real-world problems. This year, I’ve decided to document my progress in a series of posts, starting with my experience in the Playground Series Season 5, Episode 1: Forecasting Sticker Sales.
The competition aimed to develop a machine learning model to predict sticker sales based on historical data. We were provided with a dataset containing features such as country, store, product type, and date. The evaluation metric was Mean Absolute Percentage Error (MAPE).
Key tasks included:
Data exploration: Understanding the dataset, checking for missing values, and identifying key features.
Feature engineering: Deriving new features to enhance model performance.
Model training: Trying different algorithms to find the best-performing model.
Evaluation: Assessing predictions against the test set using MAPE.
The First Attempt
Data Exploration
After loading the dataset, I explored its structure:
Features: The dataset included columns such as id, date, country, store, product, and num_sold (the target variable).
Missing values: Approximately 8,871 missing values were found in the target column, num_sold.
I visualized distributions and trends in the data, focusing on sales by country, store, product, and time. This exploration helped uncover patterns, such as seasonal variations and product popularity.
Feature Engineering
To improve predictions, I engineered additional features:
Extracted year, month, day, and day_of_week from the date column.
Mapped day_of_week to weekday names for better interpretability.
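With pandas, the two steps above are a few lines via the .dt accessor (the two dates here are arbitrary examples):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2010-01-01", "2010-06-15"])})

# Calendar features derived from the date column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek   # 0 = Monday … 6 = Sunday

# Map the numeric day_of_week to weekday names for readability
df["day_name"] = df["date"].dt.day_name()
```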
These new features aimed to capture time-dependent patterns in sticker sales.
Data Cleaning
Missing values in num_sold were filled with the mean sales for the corresponding product and store. While simple, this approach seemed a decent choice for an initial run: it would give me a sense of how far off I was from a proper solution.
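This group-wise mean imputation can be sketched with groupby and transform (the four rows below are a toy example):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "A", "A", "B"],
    "store":   ["s1", "s1", "s1", "s1"],
    "num_sold": [10.0, None, 20.0, 5.0],
})

# Mean of num_sold within each (product, store) group, broadcast back
# to the original row order so it can be used as fill values.
group_mean = df.groupby(["product", "store"])["num_sold"].transform("mean")
df["num_sold"] = df["num_sold"].fillna(group_mean)
```

Here the missing value in group ("A", "s1") is filled with the group mean of 15.0.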
Model Selection
For my first attempt, I chose:
Linear Regression: As a baseline model.
Random Forest Regressor: To capture complex interactions.
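The two baselines can be sketched with scikit-learn; the feature matrix below is random synthetic data standing in for the engineered features, so the printed scores are not the competition results:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and target
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 50 + 100 * X[:, 0] + rng.normal(0, 5, 200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Fit each baseline and score it on the hold-out split with MAPE,
# the competition's evaluation metric.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    mape = mean_absolute_percentage_error(y_val, model.predict(X_val))
    print(type(model).__name__, f"MAPE: {mape:.2%}")
```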
Despite the straightforward pipeline, the results were disappointing. The linear regression model performed poorly, with a MAPE of 429.81%. The random forest improved matters considerably, achieving a MAPE of 5.12%, but my final ranking was still only around 1500th out of roughly 1800 participants.
Submission
I generated predictions for the test dataset with the trained random forest model and submitted them. Although the results were far from competitive, it was motivating to give my first competition a proper shot.
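Producing the submission file is the standard Kaggle pattern: an id column plus the predicted target, written without the index. The ids and predictions here are made-up placeholders:

```python
import pandas as pd

# Hypothetical test ids and model predictions; the real file pairs each
# test id with the predicted num_sold, matching sample_submission.csv.
submission = pd.DataFrame({
    "id": [230130, 230131, 230132],
    "num_sold": [812.0, 793.5, 804.2],
})
submission.to_csv("submission.csv", index=False)
```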
Lessons Learned
The low ranking highlighted areas for improvement:
Feature engineering: My features captured some patterns but lacked deeper insights, such as interactions or lag-based features.
Model complexity: Random forests were effective but not optimized. Hyperparameter tuning and exploring advanced models might yield better results.
Benchmarking: Comparing my approach with top solutions can reveal gaps and inspire new strategies.
The Path Forward
After my initial attempt, I discovered a top competitor's notebook. Their methodology was far more sophisticated, featuring:
Advanced feature engineering techniques.
Rigorous data preprocessing.
Use of models like LightGBM and XGBoost.
In the next post, I’ll delve into this notebook, documenting what I learned and how I plan to incorporate these strategies into my future attempts.
Kaggle competitions are as much about learning as they are about competing. While my first attempt was a humbling experience, it provided a solid foundation for growth. Stay tuned for the next installment, where I’ll explore the strategies of Kaggle’s best and apply them to improve my rankings!
The notebook can be found here