Introduction
To build a model that predicts salary, we need to collect data, preprocess it, choose the features and the target value, and then pick the best model. In this short description, I'll try to show my way of thinking.
Used tools
Motivations
The purpose of this project is data analysis that can show whether the situation in IT really is hard, or which skills matter most for earning more than average. For me and for other people who are interested in programming and plan their future in IT, it can be a very useful tool.
Collecting data
The IT offers come from justjoin.it. I collected them from 21 to 23 May 2024. Currently there are about 5000 offers from Poland in the database. Unfortunately, I had to reject internship offers because there were too few of them, and I did not consider offers without a salary range.
After collecting the data I had to preprocess it. The most important step was encoding the categorical data and the technologies listed in each offer. This left only numbers in the dataset, so I could use it in a model. For the encoding, I used MultiLabelBinarizer and LabelEncoder from the sklearn library.
Writing a scraper
I wrote a scraper and then wrapped the raw data in an object.
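Wrapping the raw data in an object could look roughly like the sketch below. The `Offer` class, its fields, and the keys of the raw dictionary (`title`, `city`, `skills`) are assumptions made for illustration, not the project's real schema or the site's response format.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Hypothetical container for one scraped offer."""

    title: str
    city: str
    technologies: list[str] = field(default_factory=list)

    @classmethod
    def from_raw(cls, raw: dict) -> "Offer":
        # Missing keys fall back to empty values instead of raising.
        return cls(
            title=raw.get("title", "").strip(),
            city=raw.get("city", "").strip(),
            technologies=list(raw.get("skills", [])),
        )


# Illustrative raw record, as a scraper might return it.
raw_offer = {"title": " Python Developer ", "city": "Warszawa", "skills": ["Python", "SQL"]}
offer = Offer.from_raw(raw_offer)
```

Keeping the parsing in one `from_raw` method means later steps only deal with typed fields, not with the raw structure.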
Exploding data
I had to explode the data to get all possible values of a field, for example the cities in an offer.
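As a minimal sketch of this step, assuming the offers sit in a pandas DataFrame and each offer holds a list of cities (the column names here are made up for the example):

```python
import pandas as pd

# Hypothetical offers; column names are illustrative only.
offers = pd.DataFrame(
    {
        "title": ["Python Developer", "Data Engineer"],
        "city": [["Warszawa", "Lublin"], ["Gdynia"]],
    }
)

# One row per (offer, city) pair; the original index repeats for each city.
exploded = offers.explode("city")

# Sorted list of every distinct city that appears in any offer.
all_cities = sorted(exploded["city"].unique())
```

After exploding, an offer listed in two cities appears as two rows, which makes it easy to collect the full set of values.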
Encoding categorical data
I had to encode the categorical data as numerical values.
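A short sketch of how MultiLabelBinarizer and LabelEncoder from sklearn can be combined for this. The sample data and the column names (`experience`, `technologies`) are assumptions for the example, not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Hypothetical offers; column names are illustrative only.
offers = pd.DataFrame(
    {
        "experience": ["junior", "senior", "mid"],
        "technologies": [["Python", "SQL"], ["Java"], ["Python", "Docker"]],
    }
)

# Single-valued category -> one integer code per distinct label.
label_encoder = LabelEncoder()
offers["experience_code"] = label_encoder.fit_transform(offers["experience"])

# Multi-valued field -> one 0/1 column per technology.
mlb = MultiLabelBinarizer()
tech_matrix = mlb.fit_transform(offers["technologies"])
tech_columns = pd.DataFrame(tech_matrix, columns=mlb.classes_, index=offers.index)

# Purely numeric table that can be passed to a model.
encoded = pd.concat([offers[["experience_code"]], tech_columns], axis=1)
```

Note that LabelEncoder assigns codes in sorted label order, so the integers carry no meaning beyond identifying the category.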
Data cleaning
I had to clean the data: remove duplicates and fill in missing values.
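A minimal sketch of both steps with pandas. The `offer_id` key, the other column names, and the `"unknown"` placeholder are assumptions for the example; the real project may identify duplicates and fill gaps differently.

```python
import pandas as pd

# Hypothetical offers with one duplicate and some missing values.
offers = pd.DataFrame(
    {
        "offer_id": ["a1", "a1", "b2", "c3"],
        "city": ["Warszawa", "Warszawa", None, "Gdynia"],
        "workplace_type": ["remote", "remote", "office", None],
    }
)

# Keep the first occurrence of each offer, based on the assumed id column.
deduplicated = offers.drop_duplicates(subset="offer_id")

# Replace missing categorical values with an explicit placeholder.
cleaned = deduplicated.fillna({"city": "unknown", "workplace_type": "unknown"})
```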
Data model
The main features are shown in the picture. Here is how I handle various features in the dataset:
- Technologies/Locations: matching each value from an offer to its defined synonyms.
- Salary range: taking the average value for B2B and UoP.
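The two rules above can be sketched as follows. The synonym table and the offer structure are hypothetical, and the salary rule is one possible reading: take the midpoint of each contract type's range, then average those midpoints.

```python
from statistics import mean

# Hypothetical synonym table: canonical name -> spellings seen in offers.
TECH_SYNONYMS = {
    "JavaScript": ["javascript", "js"],
    "PostgreSQL": ["postgresql", "postgres"],
}

# Reverse lookup from lowercase synonym to canonical name.
CANONICAL = {syn: name for name, syns in TECH_SYNONYMS.items() for syn in syns}


def normalize(value: str) -> str:
    """Return the canonical name for a value, or the stripped value if unknown."""
    cleaned = value.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


def average_salary(ranges: dict) -> float:
    """Average the midpoints of the (low, high) ranges given per contract type."""
    midpoints = [(low + high) / 2 for low, high in ranges.values()]
    return mean(midpoints)


# Illustrative offer with both contract types.
offer = {
    "technologies": ["JS", "Postgres", "Python"],
    "salary": {"b2b": (18000, 24000), "uop": (15000, 19000)},
}
techs = [normalize(t) for t in offer["technologies"]]
salary = average_salary(offer["salary"])
```

The same `normalize` approach works for locations with a separate synonym table.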