Please mind the Gender Gap
TL;DR: The gender pay gap is a real phenomenon, and data from the UK Gender Pay Gap Service tells an important story about how unequal men and women are when it comes to advancing in their careers.
Gender equality is a highly debated and important topic, and it deserves to be. As our society evolves, women’s role in it evolves too, and we are redefining “equality” again. However, the norms of our society do not always keep up with the change. In this article we will look at one narrow aspect of equality: the gender pay gap between women and men. We will examine it in the light of the gender pay gap data the UK Government has made available since 2017.
Before moving on, the pay gap deserves a stricter definition. Paying a different salary or bonus to two employees, a man and a woman, for the same work is not what is defined as the gender pay gap. That is unequal pay:
Unequal pay is paying men and women differently for performing the same (or similar) work. Unequal pay has been unlawful since 1970. (Reference)
The gender pay gap is the difference in the average hourly wage of all men and women across a workforce. If women do more of the less well paid jobs within an organisation than men, the gender pay gap is usually bigger. (Reference)
This distinction is important, as this article focuses precisely on the gender pay gap as defined above. Unequal pay is a big issue, and most of the time when people refer to the pay gap they mean unequal pay, but that’s a different story.
In the UK, large employers with 250 or more employees have been legally required since 2017 to publish gender pay gap data on their own website and submit it to the government’s gender pay gap website. This data is publicly available on the gender pay gap website.
Gender Pay Gap Data
The data deserves an explanation, since we use it throughout the rest of this post. For each employer, the key columns are:
- Their mean gender pay gap (DiffMeanHourlyPercent)
- Their median gender pay gap (DiffMedianHourlyPercent)
- Their mean bonus gender pay gap (DiffMeanBonusPercent)
- Their median bonus gender pay gap (DiffMedianBonusPercent)
- The proportion of men in the organisation receiving a bonus payment (MaleBonusPercent)
- The proportion of women in the organisation receiving a bonus payment (FemaleBonusPercent)
- The proportion of men and women in each quartile pay band (Fe/MaleLowerQuartile, Fe/MaleLowerMiddleQuartile, Fe/MaleUpperMiddleQuartile, Fe/MaleTopQuartile)
Using these two different types of average is helpful to give a more balanced overview of an employer’s overall gender pay gap:
- Mean averages are useful because they place the same value on every number they use, giving a good overall indication of the gender pay gap, but very large or small pay rates or bonuses can ‘dominate’ and distort the answer. For example, mean averages can be useful where most employees in an organisation receive a bonus but could be less useful in an organisation where the vast majority of bonus pay is received by a small number of board members.
- Median averages are useful to indicate what the ‘typical’ situation is, i.e. in the middle of an organisation, and are not distorted by very large or small pay rates or bonuses. However, this means that not all gender pay gap issues will be picked up. For example, a median average might show a better indication of the ‘middle of the road’ pay gap in a sports club with a mean average distorted by very highly paid players and board members, but it could also fail to pick up as effectively where the pay gap issues are most pronounced in the lowest paid or highest paid employees. (Reference)
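To make the difference between the two averages concrete, here is a minimal sketch with made-up hourly rates for a small sports club. The numbers and the variable names are assumptions for illustration only, and the gap is expressed as a percentage of the male figure, which is the convention assumed throughout this post.

```python
from statistics import mean, median

# Hypothetical hourly pay rates in GBP (illustrative only, not real data).
male_rates = [11.0, 12.0, 13.0, 14.0, 250.0]   # one very highly paid player
female_rates = [11.0, 12.0, 13.0, 14.0, 15.0]

# Gap as a percentage of the male figure; positive means men are paid more.
mean_gap = (mean(male_rates) - mean(female_rates)) / mean(male_rates) * 100
median_gap = (median(male_rates) - median(female_rates)) / median(male_rates) * 100

print(f"mean gap:   {mean_gap:.1f}%")
print(f"median gap: {median_gap:.1f}%")
```

A single extreme pay rate moves the mean gap a long way while leaving the median gap untouched, which is why it helps to look at both figures.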
For the results of the first four calculations:
- A positive percentage figure (which almost all organisations are likely to have) reveals that typically or overall, female employees have lower pay or bonuses than male employees.
- A negative percentage figure (which some organisations may have) reveals that typically or overall, male employees have lower pay or bonuses than female employees.
- A zero percentage figure (which is highly unlikely, but could exist for a median pay gap where a lot of employees are concentrated in the same pay grade) would reveal no gap between the pay or bonuses of typical male and female employees or completely equal pay or bonuses overall.
For example…
- An employer with a mean hourly rate of pay of £15.25 for all male full-pay relevant employees and £13.42 for all female full-pay relevant employees would have a 12.0% mean gender pay gap (rounded to one decimal place).
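The arithmetic behind this example can be sketched as follows. The function name is hypothetical; the formula (difference divided by the male figure) is the one implied by the numbers above.

```python
def mean_pay_gap_percent(male_mean_rate, female_mean_rate):
    """Difference between male and female mean hourly pay,
    expressed as a percentage of the male mean hourly pay."""
    return (male_mean_rate - female_mean_rate) / male_mean_rate * 100

gap = mean_pay_gap_percent(15.25, 13.42)
print(round(gap, 1))
```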
The proportion of males and females in each quartile pay band:
- This calculation requires an employer to show the proportions of male and female full-pay relevant employees in four quartile pay bands, which is done by dividing the workforce into four equal parts. These quartile pay bands are established when making the calculation, so any other pay banding used in a workplace must not be used.
For example…
- An employer has 322 full-pay relevant employees, has arranged them by lowest hourly rate of pay to the highest hourly rate of pay, has divided the list into four quartiles and ensured employees on the same hourly pay rate are distributed evenly by gender where they cross the quartile boundaries
- Of the 81 employees in the lower quartile, 48 are male and 33 are female. This means 59.3% are male and 40.7% are female.
- Of the 80 employees in the lower middle quartile, 28 are male and 52 are female. This means 35% are male and 65% are female.
- Of the 81 employees in the upper middle quartile, 40 are male and 41 are female. This means 49.4% are male and 50.6% are female.
- Of the 80 employees in the upper quartile, 58 are male and 22 are female. This means 72.5% are male and 27.5% are female.
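The percentages in this example can be reproduced with a few lines. This is only a sketch: the head counts are the ones from the example above, while the dictionary layout and the function name are assumptions.

```python
# Head counts per quartile pay band, taken from the worked example above.
quartile_counts = {
    "lower": {"male": 48, "female": 33},
    "lower middle": {"male": 28, "female": 52},
    "upper middle": {"male": 40, "female": 41},
    "upper": {"male": 58, "female": 22},
}

def gender_proportions(counts):
    """Return each gender's share of a pay band in percent, to one decimal."""
    total = sum(counts.values())
    return {gender: round(n / total * 100, 1) for gender, n in counts.items()}

proportions = {band: gender_proportions(c) for band, c in quartile_counts.items()}
for band, shares in proportions.items():
    print(band, shares)
```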
Data Analysis
We’ll download the gender pay gap data for 2017, 2018 and 2019 and process it for further analysis.
Let’s define the functions to read the data:
import io

import pandas as pd
import requests
import urllib3

# Silence the warning that requests emits when certificate verification is off
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def read_without_tls(url):
    """
    Reading data from an HTTPS endpoint with pandas gives a certificate error.
    This helper skips certificate verification and reads the CSV into a dataframe.
    """
    data = requests.get(url=url, verify=False).content
    return pd.read_csv(io.StringIO(data.decode('utf8')))


def get_data():
    """
    Gender Pay Gap data is available at: https://gender-pay-gap.service.gov.uk/viewing/download
    This function downloads the data for all three years and concatenates it.
    """
    years = [2017, 2018, 2019]
    url_base = 'https://gender-pay-gap.service.gov.uk/viewing/download-data/{year}'
    years_df = []
    for year in years:
        url = url_base.format(year=year)
        years_df.append(read_without_tls(url))
    # ignore_index gives the combined frame a clean 0..n-1 index
    return pd.concat(years_df, ignore_index=True)
Get the data and glimpse the first few rows:
data = get_data()
data.head()
data.shape
(21670, 25)
In total, 21670 gender pay gap reports have been submitted since 2017 (one row per company per reporting year).
data.columns
Standard Industrial Classification (SIC) codes
The SicCodes column tells us the nature of the company’s business. Some types of business attract more men than women (such as construction and mining) and others the reverse (education, healthcare, etc.), so our analysis should take the business type into account.
We can obtain the latest SIC codes and their explanation from datahub.io.
# SIC Codes:
sic = read_without_tls('https://datahub.io/core/uk-sic-2007-condensed/r/uk-sic-2007-condensed.csv')
sic
Below is a function to add the SIC code descriptions to the gender pay gap dataframe:
def get_sic(df):
    """
    Split multiple SIC codes into one row per code, cast them to numbers
    and attach the SIC descriptions.
    """
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    df = df[~df.SicCodes.isna()].copy()
    df['SicCodes'] = df.apply(lambda row: row.SicCodes.replace('\n', '').replace('\r', '').split(','), axis=1)
    df = df.explode('SicCodes')
    df['SicCodes'] = pd.to_numeric(df.SicCodes)
    return df.merge(sic, how='left', left_on='SicCodes',
                    right_on='sic_code')
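To see what the split-and-explode step does, here is a minimal sketch on toy data. The employer names and codes are made up, and the comma-plus-line-break layout of the SicCodes strings is an assumption based on the cleaning done in get_sic.

```python
import pandas as pd

# Toy data: names and codes are invented for illustration.
toy = pd.DataFrame({
    "EmployerName": ["A Ltd", "B Ltd", "C Ltd"],
    "SicCodes": ["1,\r\n84110", "41100", None],
})

# Drop employers without SIC codes, then turn each string into a list of codes.
rows = toy.dropna(subset=["SicCodes"]).copy()
rows["SicCodes"] = (
    rows["SicCodes"]
    .str.replace(r"\s+", "", regex=True)  # remove line breaks and spaces
    .str.split(",")
)

# explode() repeats the employer once per code in its list.
rows = rows.explode("SicCodes")
rows["SicCodes"] = pd.to_numeric(rows["SicCodes"])
print(rows)
```

An employer with two codes ends up on two rows, so any later count per sector counts such an employer once in each of its sectors.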
Adding Country Information
England, Wales, Scotland and Northern Ireland are the individual countries of the UK. Since each has some level of autonomy, business policies are likely to differ between them, so classifying each company’s location may reveal some insights. We will use the Address field to extract this information.
data.Address.to_numpy()
We will be using pgeocode to find state_name and county_name from the post code in the Address column. Some rows have no address:
data.Address[data.Address.isna()]
We remove these rows from our dataset, as we will need country information in our model later.
data = data[~data.Address.isna()]
Our dataset contains the address of each company, and for the country analysis (England, Scotland, Wales, Northern Ireland) we need to transform this address into a country code.
pgeocode is a Python library that converts a post code into country-level information. We’ll first extract the post code from the Address column and then use pgeocode to create a new state_code column:
data['postcode'] = data.apply(lambda row: row.Address.split('\n')[-1], axis=1)
import pgeocode
nomi = pgeocode.Nominatim('gb')
data['state_code'] = data.apply(lambda row: nomi.query_postal_code(row.postcode)['state_code'], axis=1)
Let’s look at these columns briefly:
data[["Address","postcode","state_code"]].head()
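The post code extraction above assumes that the post code is the last line of the address. A slightly more defensive sketch of that step is shown below. The function name and the sample address are hypothetical, and it still relies on the same last-line assumption.

```python
def extract_postcode(address):
    """Return the last non-empty line of a multi-line address, stripped
    of surrounding whitespace, or None if the address has no content."""
    lines = [line.strip() for line in address.splitlines() if line.strip()]
    return lines[-1] if lines else None

# Hypothetical address in the multi-line layout assumed above.
sample_address = "1 Example Street,\r\nExampletown,\r\nAB1 2CD\r\n"
print(extract_postcode(sample_address))
```

Compared with a plain split on the newline character, this version tolerates trailing line breaks and carriage returns, which would otherwise end up in the postcode column.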
Bonus and Salary Comparison
In the gender pay gap dataset there are four columns of interest for bonus and salary differences:
- DiffMeanHourlyPercent: Hourly mean salary difference between genders
- DiffMedianHourlyPercent: Hourly median salary difference between genders
- DiffMeanBonusPercent: Mean bonus difference in % between genders
- DiffMedianBonusPercent: Median bonus difference in % between genders
Let’s have a look at the first few rows:
diff_df = data[['DiffMeanHourlyPercent', 'DiffMedianHourlyPercent', 'DiffMeanBonusPercent', 'DiffMedianBonusPercent']]
diff_df.head()
On average, men earn a higher hourly salary and a higher mean bonus. The average median bonus gap is negative, i.e. the typical (median) bonus is higher for women than for men, although a few extreme values can dominate an average like this:
diff_df.mean()
DiffMeanHourlyPercent 14.245919
DiffMedianHourlyPercent 11.835532
DiffMeanBonusPercent 12.842605
DiffMedianBonusPercent -11.928649
Let’s visualize the distribution of the mean and median salary gaps:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
f, axes = plt.subplots(2, 2, figsize=(14, 7), sharex=False)
sns.distplot(data.DiffMeanHourlyPercent, color="skyblue", ax=axes[0, 0])
sns.distplot(data.DiffMedianHourlyPercent, color="olive", ax=axes[0, 1])
The distribution plots above tell us that the mean salary gap is greater than 0 for most companies, meaning men earn more than women in those companies.
This is also true for median salaries. However, the two distributions look a bit different: for the median, the peak is around 0, meaning that in most companies the median salary difference between genders is small, but the flatter slope on the positive side means that in many companies men earn a substantially higher median salary than women.
Let’s also plot DiffMeanHourlyPercent against DiffMedianHourlyPercent:
diff_df[~data.DiffMedianBonusPercent.isna()].plot(x='DiffMeanHourlyPercent', y='DiffMedianHourlyPercent', kind='scatter', alpha=0.5)
The centre of mass of the ellipse-shaped distribution is positive on both the x and y axes, confirming that men overall have higher mean and median salaries in these companies. There are a few outlier companies, however, where women earn more than men, or where the men’s mean salary is higher even though the median salary difference is about 0.
Let’s check which companies (and industry types) those few outliers are. For this we need SIC codes, so we’ll use the get_sic function we defined earlier:
nonkpi_cols = ['Address', 'CompanyNumber', 'SicCodes', 'sic_code',
               'DateSubmitted', 'postcode', 'DueDate',
               'SubmittedAfterTheDeadline', 'CompanyLinkToGPGInfo',
               'CurrentName', 'sic_version', 'ResponsiblePerson']
f_hour_outliers = get_sic(data[data.DiffMeanHourlyPercent < -100]).drop(nonkpi_cols, axis=1)
f_hour_outliers[['EmployerName', 'DiffMeanHourlyPercent', 'DiffMedianHourlyPercent', 'sic_description', 'section_description']]
After cleaning out some outliers, below is a close-up view of the distribution of median and mean salary differences across companies:
data[(data.DiffMeanHourlyPercent > -100) &
     (data.DiffMeanHourlyPercent < 100) &
     (data.DiffMedianHourlyPercent > -100) &
     (data.DiffMedianHourlyPercent < 100)].plot(x='DiffMeanHourlyPercent', y='DiffMedianHourlyPercent', kind='scatter', alpha=0.1)
If we focus on a narrower band of values:
d = data[(data.DiffMeanHourlyPercent > -20) &
         (data.DiffMeanHourlyPercent < 65) &
         (data.DiffMedianHourlyPercent > -10) &
         (data.DiffMedianHourlyPercent < 50)]
sns.jointplot(x="DiffMeanHourlyPercent", y="DiffMedianHourlyPercent", data=d, kind="kde", color='k');
The trend generally looks linear but, as expected, the centre of mass is positive, meaning men’s mean and median earnings are higher than women’s.
There is also a group of companies where DiffMean is substantially bigger than DiffMedian. This difference between mean and median tells us that typical employees (men or women) get similar salaries, i.e. there is less of a gender gap for the typical employee, but a few male employees earn so much that they skew DiffMean.
skewed_men_earners = data[(data.DiffMeanHourlyPercent > 50) &
                          (data.DiffMedianHourlyPercent < 50)]
skewed_men_earners.plot(x='DiffMeanHourlyPercent', y='DiffMedianHourlyPercent', kind='scatter', alpha=0.5)
Let’s print the top SIC codes among these skewed-earner companies:
skewed_men_earners = get_sic(skewed_men_earners).drop(nonkpi_cols, axis=1)
skewed_men_earners.groupby('sic_description').count()['section'].sort_values(ascending=False).head(15)
It looks like sports clubs pay a few male employees (possibly sportsmen) far more than their female employees.
Bonus Gap
Let’s have a quick look at the proportions of men and women who receive a bonus:
f, axes = plt.subplots(2, 2, figsize=(14, 7), sharex=False)
sns.distplot(data.MaleBonusPercent , color="skyblue", ax=axes[0, 0])
sns.distplot(data.FemaleBonusPercent, color='olive', ax=axes[0, 1])
There is a strong correlation between the proportion of men and the proportion of women receiving a bonus:
data[['MaleBonusPercent', 'FemaleBonusPercent']].corr()
data.plot(x='MaleBonusPercent', y='FemaleBonusPercent', kind='scatter', alpha=0.01)
How easy is it for women to climb up in their careers?
We have all heard that women are at a disadvantage when it comes to gaining seniority in their working lives. There are a number of reasons, including greater responsibility for young children, maternity leave, mistreatment by companies, etc. This is an ongoing debate, and solving these issues is very important for decreasing the gender pay gap.
Specifically, we’ll look at the distribution of women across four earning buckets (Low, Low-to-Middle, Middle-to-High and High), and it tells a startling story.
Let’s look at the distribution of employees by gender among the lowest 25% of earners in each company:
sns.distplot(data.MaleLowerQuartile, label='Men')
ax = sns.distplot(data.FemaleLowerQuartile, label='Women')
ax.set(xlabel='Percentage of Bottom Quartile Earners', ylabel='ratio')
plt.legend()
Across the ~21000 company reports, the lowest earners are mostly women, according to the graph above. That might be due to a few things:
- Women probably start their careers in lower paid jobs. Remember that this is the distribution of the lowest earners within the same company. It doesn’t say whether they are paid differently for the same job; it only tells us the gender ratio among the lowest earners, and in many companies women are the majority.
- Women might not get promoted to higher earning jobs and so keep earning less than men. The reasons are not discussed here, as the data we have tells us nothing about them.
- Women might stay in the lower earning jobs in the same company for longer.
Each of the bullet points above is a very valid line of research; however, we do not have the data to answer these questions.
Below are the remaining three graphs:
- Distribution of genders in Lower-Middle-Quartile
- Distribution of genders in Upper-Middle-Quartile
- Distribution of genders in Top-Quartile
The trend is clear: as we go from the bottom quartile to the top quartile, the percentage of men increases. In the graph for the top 25% of earners, the peak is around 90%. This tells us that in many companies 80% to 100% of the top 25% of earners are men. This is a pretty shocking result of our analysis.
In many companies in the UK, 80% to 100% of top 25% earners are men.
How about the ratio of women in different sectors?
The graphs above show the aggregate distribution of male vs female percentages. We all probably have an idea of the sectors where men are the majority of the workforce, like construction and mining. It’s a good idea to check this, and maybe we can get some more insight by industry:
g = sns.PairGrid(grouped.sort_values("FemaleLowerQuartile", ascending=False),
                 x_vars=['FemaleLowerQuartile', 'FemaleLowerMiddleQuartile',
                         'FemaleUpperMiddleQuartile', 'FemaleTopQuartile'],
                 y_vars=["section_description"],
                 height=10, aspect=.25)
# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
      palette="ch:s=1,r=-.1,h=1_r", linewidth=1, edgecolor="w")
# Use the same x axis limits on all columns and add better labels
g.set(xlim=(20, 70), xlabel="percentage", ylabel="")
# Use semantically meaningful titles for the columns
titles = ["FemaleLowerQuartile", "FemaleLowerMiddleQuartile", 'FemaleUpperMiddleQuartile', 'FemaleTopQuartile']
for ax, title in zip(g.axes.flat, titles):
    # Set a different title for each axes
    ax.set(title=title)
    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
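The grouped dataframe used in the plotting code above is not defined anywhere in this post. A minimal sketch of how it could be built is shown below, assuming it holds the mean share of women per quartile for each industry section. The toy frame stands in for the output of get_sic(data), and all of its numbers are made up.

```python
import pandas as pd

quartile_cols = ["FemaleLowerQuartile", "FemaleLowerMiddleQuartile",
                 "FemaleUpperMiddleQuartile", "FemaleTopQuartile"]

# Toy stand-in for get_sic(data); values are invented for illustration.
data_sic_toy = pd.DataFrame({
    "section_description": ["Construction", "Construction", "Education", "Education"],
    "FemaleLowerQuartile": [30.0, 20.0, 80.0, 70.0],
    "FemaleLowerMiddleQuartile": [20.0, 20.0, 70.0, 70.0],
    "FemaleUpperMiddleQuartile": [15.0, 15.0, 65.0, 65.0],
    "FemaleTopQuartile": [10.0, 10.0, 60.0, 50.0],
})

# Average the female share per quartile within each industry section.
grouped = (data_sic_toy
           .groupby("section_description")[quartile_cols]
           .mean()
           .reset_index())
print(grouped)
```

Keeping section_description as a regular column (via reset_index) matters here, because the plot refers to it by name in y_vars.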
In the graph above, some observations are in line with our expectations. Women are more concentrated in categories that require less physical strength, with more focus on the service sector. Sectors that require more physical strength, such as mining, transportation and construction, employ fewer women.
The really interesting observation is that, consistently across all categories, women are less likely to climb to higher earning positions, whether in education or in construction. This is a strong argument that women are at a disadvantage and that there is a gender pay gap, in the sense that things are more difficult for women. It means either that women are not promoted as easily, or that even when promoted they are paid less than their male coworkers for the same job, or a combination of both.
The drop in the ratio of women is as much as 45% in some industries. Below is the relative drop in the ratio of women as they climb up the earnings buckets.
grouped['ChangePercent'] = ((grouped.FemaleTopQuartile - grouped.FemaleLowerQuartile) / grouped.FemaleLowerQuartile) * 100
g = sns.PairGrid(grouped.sort_values("ChangePercent", ascending=True),
                 x_vars=['ChangePercent'],
                 y_vars=["section_description"],
                 height=10, aspect=1)
# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
      palette="ch:s=1,r=-.1,h=1_r", linewidth=1, edgecolor="w")
# Use the same x axis limits on all columns and add better labels
g.set(xlim=(-50, 0), xlabel="Percentage", ylabel="")
# Use semantically meaningful titles for the columns
titles = ['ChangePercent']
for ax, title in zip(g.axes.flat, titles):
    # Set a different title for each axes
    ax.set(title=title)
    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
The graph above shows that women in the construction and mining sectors are the most disadvantaged, as they are the least likely to climb to a higher earning quartile.
One interesting observation is that a “soft” area such as Financial and Insurance activities also shows a big obstacle to women being represented in the higher earning quartiles. This is somewhat speculative, but there is a possibility that the pressure in the finance sector penalizes women for things like pregnancy, as time lost at work directly impacts revenue, whereas other sectors such as human health and social work activities are more permissive and flexible.
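The ChangePercent figure plotted in this section is a relative change, not a difference in percentage points. A small worked example, using made-up shares for a hypothetical sector, shows the distinction:

```python
# Hypothetical sector: women make up 60% of the lowest-paid quartile
# and 33% of the top-paid quartile (illustrative numbers only).
female_lower_quartile = 60.0
female_top_quartile = 33.0

# Relative change, as in the ChangePercent column.
change_percent = (female_top_quartile - female_lower_quartile) / female_lower_quartile * 100

# Absolute change, in percentage points.
change_points = female_top_quartile - female_lower_quartile

print(f"relative change: {change_percent:.1f}%")
print(f"absolute change: {change_points:.1f} percentage points")
```

A relative drop of 45% therefore means the share of women in the top quartile is a little over half of their share in the bottom quartile, whatever the starting level was.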
Model building
Now we have all this data from ~21000 company reports: their salary and bonus distributions as well as other metadata. If there is a strong enough correlation between variables such as EmployerSize and the gender pay gap, then we might be able to build a machine learning model that predicts the pay gap from predictor variables. If the model has good accuracy, we can use it to predict the gender pay gap for other companies in the UK which haven’t submitted their gender pay gap data yet.
Can we predict the ratio of women (High, Middle, Low) among top quartile earners based on a few predictor variables, such as:
- state_code: England, Wales, Scotland and Northern Ireland.
- sic_code: different industries have different ratios of women.
- EmployerSize: are there policies in bigger companies to remove the gender pay gap?
We rule out using columns related to salary or bonus as predictors, since these won’t be available for a company which hasn’t submitted data yet. All the other information, such as company size, sector and address, is publicly available.
Let’s do some exploratory analysis on our predictor variables to see if they have any predictive power.
State Code
The UK is a union of four countries, and each is a bit different from the others, since the devolved governments can set different regulations and laws within their borders. This may have some impact on the female participation rate in working life (childcare programs, maternity rules, workplace regulations, etc.). Let’s see if there is any visible difference:
# data_sic is assumed to be the output of get_sic(data); its definition is not shown in this post
data_sic.groupby('state_code')[['FemaleTopQuartile', 'FemaleLowerQuartile']].mean()
England, Scotland and Wales all have rather similar ratios of female employees in both salary buckets, while Northern Ireland has a considerably lower ratio of women. state_code alone doesn’t look like a strong predictor, but it has some predictive power, so we keep it.
SIC Code
As we saw in an earlier part of this blog post, different sectors have different levels of female participation; there are fewer women (in %) in Construction than in Education. Therefore the SIC code by itself might be a good predictor:
data_sic.groupby('section_description')['FemaleTopQuartile'].mean().sort_values()
Employer Size
One might think bigger companies are more likely than smaller ones to have an internal policy addressing the gender pay gap. The summary statistics below don’t really show this, so the predictive power of this variable is rather small:
data_sic.groupby('EmployerSize')[['FemaleTopQuartile', 'FemaleLowerQuartile']].mean()
A few companies didn’t provide EmployerSize; for our analysis I will exclude them, along with rows containing NaN:
# Select feature and target columns and drop NaNs
data_sic_nona = data_sic[['FemaleTopQuartile', 'state_code', 'EmployerSize', 'section']].dropna()
# Remove companies that didn't provide an EmployerSize value
data_sic_nona = data_sic_nona[data_sic_nona.EmployerSize != "Not Provided"]
# Check the first few lines
data_sic_nona.head()
Since state_code, EmployerSize and section (sector) are all categorical variables, they are unlikely to predict the continuous variable FemaleTopQuartile precisely. Therefore I’ll partition FemaleTopQuartile into three equal-sized partitions and make a categorical target variable out of it.
Before doing the transform, let’s have a look at the histogram of FemaleTopQuartile:
data_sic_nona.FemaleTopQuartile.hist()
As the data is not evenly distributed across the histogram, the “High” category (over 66%) would be small compared to the “Low” (<33%) and “Middle” (>33%, <66%) categories.
To distribute the High, Middle and Low values evenly, I’ll use quantiles to decide the cutoff points:
data_sic_nona.FemaleTopQuartile.quantile([0.33, 0.66])
0.33 22.5
0.66 50.6
According to the result above;
- “Low” is defined as <22.5% women in top quartile bucket
- “Middle” is defined as >22.5% and <50.6% women in top quartile bucket
- “High” is defined as >50.6% in top quartile bucket
Note that these cutoffs are something I defined arbitrarily from the dataset; it is perfectly fine to set High, Middle and Low to something else, such as fixed boundaries at 66% and 33%.
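As an aside, pandas can compute quantile-based bins directly with pd.qcut. The sketch below is not what this post uses; it is an alternative shown on made-up values, with hypothetical variable names, to illustrate how quantile cutoffs produce equal-sized groups.

```python
import pandas as pd

# Made-up shares of women in the top quartile: 0.0, 1.0, ..., 98.0
female_top_share = pd.Series(range(99), dtype=float)

# q=3 splits the values at the 1/3 and 2/3 quantiles.
labels = pd.qcut(female_top_share, q=3, labels=["Low", "Middle", "High"])
counts = labels.value_counts().to_dict()
print(counts)
```

With real data the groups are only approximately equal in size, because many companies can share the same value at a cutoff.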
Let’s now map FemaleTopQuartile from continuous values to categorical values.
def map_femaletopquartile(x):
    # Cut-offs are the 0.33 and 0.66 quantiles computed above
    if x > 50.6:
        return "High"
    if x < 22.5:
        return "Low"
    return "Middle"
data_sic_nona['label'] = data_sic_nona.apply(lambda row: map_femaletopquartile(row.FemaleTopQuartile), axis=1)
data_sic_nona.head()
Our target variable y is the label column, and the features (predictors) X are the remaining columns except FemaleTopQuartile. However, in order to work with categorical variables in ML we need to convert them to numbers, as most ML algorithms only work with numeric input. We’ll use one-hot encoding to create multiple columns from the categorical variables.
y = data_sic_nona.label
# Drop the target and the column it was derived from, then one-hot encode
X = pd.get_dummies(data_sic_nona.drop(['label', 'FemaleTopQuartile'], axis=1))
X.columns
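To illustrate what one-hot encoding produces, here is a minimal sketch on a toy frame. The category values shown are illustrative stand-ins, not taken from the dataset.

```python
import pandas as pd

# Toy features; the category values are illustrative only.
toy_features = pd.DataFrame({
    "state_code": ["ENG", "SCT", "ENG"],
    "EmployerSize": ["250 to 499", "500 to 999", "250 to 499"],
})

# Each distinct category value becomes its own indicator column.
toy_encoded = pd.get_dummies(toy_features)
print(list(toy_encoded.columns))
print(toy_encoded.astype(int))
```

Each row has exactly one indicator set per original column, so two categorical columns with two values each become four indicator columns.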
Let’s split our data into train and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then we train our classifier. To keep things simple at this point, I only use DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.541459291952296
The test accuracy is not particularly high, but that’s somewhat expected, as our predictors are only categorical variables and do not have great predictive power, as we explored earlier. It is still a modest improvement over the baseline of 0.33, which is what a random guess among three classes would achieve.
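Two simple baselines are worth keeping in mind when reading an accuracy figure like this. The sketch below uses a handful of made-up labels in place of y_test; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical test labels standing in for y_test.
y_test_toy = pd.Series(["Low", "Low", "Middle", "High", "Middle", "Low"])

# Uniform random guessing among k classes is right 1/k of the time on average.
random_baseline = 1 / y_test_toy.nunique()

# Always predicting the most frequent class scores that class's share.
majority_baseline = y_test_toy.value_counts(normalize=True).max()

print(f"random guess:   {random_baseline:.2f}")
print(f"majority class: {majority_baseline:.2f}")
```

When the classes are not perfectly balanced, the majority-class baseline is higher than 1/k, so it is the stricter of the two to compare a model against.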
Let’s check how our model did for the individual labels:
predictions = clf.predict(X_test)
predictions
array(['Low', 'Low', 'Low', ..., 'Low', 'Low', 'Middle'], dtype=object)
How often did our predictions and the real values agree for “Low”?
((y_test == 'Low') == (predictions == 'Low')).mean()
0.6271950417879613
How often did our predictions and the real values agree for “Middle”?
((y_test == 'Middle') == (predictions == 'Middle')).mean()
0.6352709174570382
How often did our predictions and the real values agree for “High”?
((y_test == 'High') == (predictions == 'High')).mean()
0.8204526246595925
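The agreement figures above treat each label as a yes/no question, so they also count every case where both the prediction and the true value are some other label. The toy sketch below (made-up labels, hypothetical helper names) shows why this number should be read alongside recall:

```python
import numpy as np

# Made-up labels: the model never predicts "High".
y_true = np.array(["High", "Low", "Low", "Low", "Middle"])
y_pred = np.array(["Low", "Low", "Low", "Low", "Low"])

def agreement(label):
    """Share of rows where truth and prediction agree on 'is it this label?'."""
    return ((y_true == label) == (y_pred == label)).mean()

def recall(label):
    """Share of rows truly carrying this label that were predicted as such."""
    mask = y_true == label
    return (y_pred[mask] == label).mean()

print(agreement("High"), recall("High"))
```

Here the agreement for “High” is 0.8 even though not a single “High” company was identified, simply because “High” is rare in the toy labels.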
Even though this model has some predictive power, especially for the “High” label, it is not very practical to use. If we can get more features with more predictive power, then we might use this model to predict whether a completely new company has a Low, Middle or High percentage of women among its top earners.
Conclusion
The gender pay gap is a real issue and we need to do more to tackle it. The UK Government’s initiative to understand and quantify the gender pay gap is a good one. We have found that, in general, male workers are paid higher mean and median salaries (however, this does not mean men earn more than women for the same job; that would be unequal pay). We also saw that it is harder for women than for men to climb the salary ladder: the top earners in a company are mostly men, and this is consistent across all sectors. There was not a single sector where the share of female employees didn’t drop from the lowest income bucket to the top income bucket.
We also tried to build an ML classifier to predict whether a company has a Low, Middle or High percentage of female employees among its top earners, based on the metadata we know about the company, such as sector, size and location.