Roy Lai
- Jun 9, 2019
- 4 min read

An analysis on Data Related Jobs in Sydney

Updated: Jun 17, 2019

As I am delving into the field of data science, why not use data science to explore what the job prospects are in Australia? This was a perfect opportunity for me to practice and implement what I am learning.

The purpose of this project was firstly to determine the most important factors that impacted salaries in ‘data’ related jobs and secondly to determine the skills and keywords associated with these jobs. The data was scraped from seek.com. After cleaning and pre-processing this data, multiple models were tested for each of the 2 questions I had in mind.

Part 1 - Predicting Salary

5 types of information were collected for the purposes of predicting salary:

Job Type – Full-Time / Contract
Job Title – Data Scientist / Data Analyst / Data Engineer / Business Intelligence / Academic (I also split between ‘senior’ and ‘not senior’)
Location – Sydney / Melbourne / Brisbane
Industry – Technology / Finance / Others
Salary – Also our target.

The data used for this analysis was scraped from seek.com using the Python library Beautiful Soup. Details of approximately 850 jobs were collated, although after removing job postings that were deemed irrelevant, only 654 job postings remained (i.e. 654 rows of data).

The main issue we encountered was that 74% of the jobs did not have a salary specified. The challenge here was to estimate these salaries while maintaining some granularity. The approach taken was to group jobs by job type, job title, location and industry, find the median salary of each group, and use it to fill the missing values for jobs that fell in the same group. For example, I would take the median salary of Sydney-based, Full-Time Data Scientist roles in the Technology industry and use this value for any job without a listed salary that fit the exact same criteria (Sydney/Data Scientist/Full-Time/Technology).
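The group-median imputation described above can be sketched in plain Python as follows. This is only an illustration: the rows, the dictionary keys and the `impute_group_median` helper are hypothetical, and leaving a salary empty when no job in the same group has one is an assumption, as the post does not say how that case was handled.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows; None marks a posting with no salary listed.
jobs = [
    {"job_type": "Full-Time", "job_title": "Data Scientist",
     "location": "Sydney", "industry": "Technology", "salary": 120000},
    {"job_type": "Full-Time", "job_title": "Data Scientist",
     "location": "Sydney", "industry": "Technology", "salary": 140000},
    {"job_type": "Full-Time", "job_title": "Data Scientist",
     "location": "Sydney", "industry": "Technology", "salary": None},
    {"job_type": "Contract", "job_title": "Data Analyst",
     "location": "Melbourne", "industry": "Finance", "salary": None},
]

GROUP_KEYS = ("job_type", "job_title", "location", "industry")


def impute_group_median(rows):
    """Fill missing salaries with the median of rows sharing all four categories."""
    known = defaultdict(list)
    for row in rows:
        if row["salary"] is not None:
            known[tuple(row[k] for k in GROUP_KEYS)].append(row["salary"])
    medians = {group: median(values) for group, values in known.items()}

    filled = []
    for row in rows:
        group = tuple(row[k] for k in GROUP_KEYS)
        salary = row["salary"]
        if salary is None:
            # Assumption: stays None when the group has no known salary at all.
            salary = medians.get(group)
        filled.append({**row, "salary": salary})
    return filled


filled_jobs = impute_group_median(jobs)
```

With 74% of salaries missing, many groups will be filled from only a handful of known values, which is worth keeping in mind when reading the figures that follow.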

With the missing values filled in, let’s have a look at what the data is telling us! Browsing recruiter and other job-related websites (such as Hudson), I found some interesting statistics, summarised in the table below:

The most immediate observations are that the order from highest to lowest salary generally remained the same but the numbers from Scraped Data are significantly inflated compared to what I found on the websites (Website Data). Well, there could be several reasons for this:

Firstly, the inflated numbers could be a result of contract roles in the scraped data, which generally have significantly higher salary figures when converted to an annual number. Secondly, there may have been an impact from the way we imputed the missing salary figures.
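To see why annualised contract figures can look high, here is a small worked example. Everything in it is an assumption for illustration: the post does not state how contract rates were quoted or converted, so the daily rate, the 230 working days per year and the `annualise_daily_rate` helper are all hypothetical.

```python
# Assumption: roughly 260 weekdays minus leave and public holidays.
ASSUMED_WORKING_DAYS = 230


def annualise_daily_rate(daily_rate, working_days=ASSUMED_WORKING_DAYS):
    """Convert a contract daily rate into an annual-equivalent figure."""
    return daily_rate * working_days


# A hypothetical $600/day contract, expressed as an annual number.
annual_equivalent = annualise_daily_rate(600)
```

An annual-equivalent figure like this assumes the contractor is engaged all year and ignores leave and other entitlements a permanent salary may include, so it tends to overstate the comparison with full-time roles.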

Now let’s see what we could find regarding location. I was not able to find granular data pertaining to data related roles only, but the average annual wages across all jobs were as follows:

As seen in the above table, there does not seem to be any significant difference between Queensland and Victoria, with NSW slightly ahead of both. Now looking at the Scraped Data for each of the respective capital cities, we have the following:

Again, the Scraped Data figures are higher than the Website Data, likely for the same reasons as previously discussed. What is interesting to note, however, is that the differences between the cities are much more pronounced, with a gap of roughly $7,000 between each city.

Lastly, let us consider our different industries. As with location, I was not able to locate information specific to data roles, but we do have the following:

What we identify here is that ‘Finance’ and ‘Information Media & Telecommunications’ have higher salaries on average compared to the other industries. This, again, is similar to the Scraped Data as shown below:

As for the modelling, despite salary being a continuous variable, I decided to treat this as a classification problem. This meant that I needed to first split salaries into different categories. The split used is as follows:

1) Low Salary - Below $90,000

2) Below Average Salary - Between $90,000 and $120,000

3) Average Salary - Between $120,000 and $150,000

4) Above Average Salary - Between $150,000 and $180,000

5) High Salary - Between $180,000 and $210,000

6) Exceptionally High Salary - Above $210,000
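The six bands above can be turned into class labels with a simple lookup. The sketch below is a minimal plain-Python version; the `salary_band` helper is hypothetical, and placing a salary that sits exactly on a boundary (e.g. $120,000) into the higher band is an assumption, since the post does not specify which side the boundaries belong to.

```python
from bisect import bisect_right

# Boundaries between the six salary bands listed above.
BOUNDARIES = [90_000, 120_000, 150_000, 180_000, 210_000]
LABELS = [
    "Low",
    "Below Average",
    "Average",
    "Above Average",
    "High",
    "Exceptionally High",
]


def salary_band(salary):
    """Return the band label for an annual salary.

    bisect_right counts how many boundaries are <= salary, which is
    exactly the index of the band the salary falls into.
    """
    return LABELS[bisect_right(BOUNDARIES, salary)]


bands = [salary_band(s) for s in (85_000, 135_000, 250_000)]
```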

Various models were tested with the goal of predicting which category a job would fall into. The models tested included Logistic Regression, Support Vector Machine (SVM), Random Forest and XGBoost.

The model that performed the best was SVM, which gave an accuracy score of 82.44%. Of all the features we fed into the model, those of highest importance in predicting salary group were as follows:

1. Job type;

2. Whether the job was categorised as a data analyst or not; and

3. Whether the job location was Melbourne or not.
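Features phrased as "whether the job was X or not" suggest that the categorical columns were expanded into 0/1 indicator columns before modelling. The post does not state the encoding used, so the sketch below is an assumption, and the `one_hot` helper and sample rows are hypothetical.

```python
def one_hot(rows, column):
    """Expand one categorical column into 0/1 indicator features per row."""
    categories = sorted({row[column] for row in rows})
    return [
        {f"{column}_{cat}": int(row[column] == cat) for cat in categories}
        for row in rows
    ]


# Hypothetical rows with a single categorical column.
sample = [
    {"job_title": "Data Analyst"},
    {"job_title": "Data Scientist"},
    {"job_title": "Data Analyst"},
]
encoded = one_hot(sample, "job_title")
```

Each indicator column then gets its own weight or importance in the model, which is how a single category such as "Data Analyst" or "Melbourne" can surface as an important feature in its own right.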

Part 2 – Understanding sought after skills and attributes in data related roles

The second part of this project was to determine the skills and keywords associated with each Job Title. For this purpose, I focused my attention on extracting the most commonly used keywords in the job descriptions. More specifically I was interested in words that were more common in one Job Title category compared to other categories. With this in mind, I used Term Frequency-Inverse Document Frequency (TF-IDF) as data to feed into my model.
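To illustrate why TF-IDF favours words that are distinctive to some descriptions, here is a minimal plain-Python sketch using the textbook definition (term frequency multiplied by the log of the inverse document frequency). It is not the code used for this project: the `tf_idf` helper and the sample descriptions are hypothetical, and library implementations often use smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Count each term once per document to get its document frequency.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, heavily shortened job descriptions.
descriptions = [
    "python machine learning research",
    "sql reporting dashboards python",
    "python pipelines spark",
]
scores = tf_idf(descriptions)
```

In this toy example "python" appears in every description, so its score is zero everywhere, while a word such as "machine" that appears in only one description receives a positive score there.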

Below is a list of the most common single words and two-word combinations for each of the job categories:

I again tested several different classifier models to see how well I could predict the job titles based on occurrence of the various key words. This included Logistic Regression (with and without Regularisation), SVM, and Random Forest.

The model that performed the best in this instance was Logistic Regression with Lasso Regularisation. Regularisation improved performance on the Logistic Regression model likely due to the complexity of the model resulting from it being fed a high number of keywords from TF-IDF. Regularisation may have served to simplify the model with Lasso removing a number of these keywords.
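For readers unfamiliar with Lasso, the standard L1-penalised logistic regression objective is shown below for the binary case (the job-title problem is multi-class, so this is a simplification). Here $x_i$ is the TF-IDF vector of job description $i$, $y_i \in \{-1, +1\}$ is its label, $w$ and $b$ are the model weights and intercept, and $\lambda$ controls the strength of the penalty; these symbols are introduced for illustration and do not come from the project code.

```latex
\min_{w,\,b}\; \sum_{i=1}^{n} \log\!\left(1 + \exp\!\big(-y_i\,(w^\top x_i + b)\big)\right)
\;+\; \lambda \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_{j} \lvert w_j \rvert
```

Because the penalty grows with the absolute value of each weight, the optimum typically sets many weights to exactly zero, which is the sense in which Lasso removes keywords from the model.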

Nonetheless, having a look at the most important keywords that impacted the model’s predictions, we have the following list:

· Technical;

· machine learning;

· science, research;

· complex;

· python;

· technology.

In summary, the more important factors in determining salary, based on our model, are whether the job is full time or contract, and what type of role it is, with the model placing emphasis on whether the job was a data analyst role. The keywords our model used to determine the job category include ‘Technical’, ‘Machine Learning’, ‘Science’, ‘Complex’, ‘Python’ and ‘Technology’.

For those interested, please refer to GitHub for the code that I used.

#career #data
