Ever heard of Catch Me If You Can? In this movie, the main character impersonates many professions, mainly by appropriating their uniforms, specific behaviours and lingo, and everybody falls for it!
For the uniform it is quite obvious, but what can we say about the lingo? One will agree that each occupation has its "proper words", but let's think further: a person of a specific profession will also express themselves in a certain manner. Think of a politician: he does not only speak about the economy and education, but also uses certain figures of speech, and his discourse is fluid and well conducted; quite unlike a sportsman or a scientist, for example.
Based on this premise: give us a quote, and we will tell you which profession said it!
We present here the two main datasets used for this project. You can find them by following the link in the upper menu.
What do we have here? The full dataset is made of 178 million quotations, each with a list of possible speakers ranked by probability, the name of the most probable speaker and their Wikidata QID, when it was published, and where it was published. The latter is important because this information has been extracted from 162 million English news articles published between 2008 and 2020 inclusive, so one might want to keep records of it.
162 million, 178 million, but who did all that? Well, it's the nice library assistant Quobert, and it works for free. Here is a nice picture (taken from this publication1, which explains in full detail how the dataset was collected) that could help you grasp Quobert's workflow.
Super nice, except that it correctly attributes only about 85% of the quotations. But could you expect better from an unpaid assistant?
Here are some of its failures:
- The Harry Potter he refers to is an Australian journalist.
- Joe Biden does not like himself so much.
But this will be good enough for our purpose. Just keep in mind that Quobert is an unpaid assistant.
Furthermore, for this project we did not use the full Quotebank, but only the years 2015-2020 inclusive.
We said we want to predict occupations, but we have not said a word about them so far… Quotebank is augmented with metadata about the speakers. It is an external dataset in which we find, for ~9M Wikidata entities, additional information such as occupation, gender, religion, etc.
In the first place, we had to deal with the "homonym issue". A quotation is linked to a list of names, ranked by probability, and then to a single name in a "winner takes it all" fashion. Once a name is elected, the name is linked to its corresponding QID; BUT if a name has a namesake, then several QIDs will be attached. So how do we know whether we are facing Harry Potter the magician or Harry Potter the journalist (it does not really apply in this case because there is no fictional character in Quotebank, but you get the point)? Furthermore, if Quobert is uncertain, it just assigns "None" as speaker with an empty list of QIDs. To cope with those issues, we simply filtered out all speakers whose list of QIDs contained anything other than exactly one QID.
By doing so we dealt with both the "homonym issue" and the "empty speaker issue", and got rid of 50% of the remaining data.
Fine, but now suppose that Harry Potter the magician was tired of fighting evil; after all, You-Know-Who was defeated in 2011, so… He decided to start a career as a journalist at the renowned "Daily Prophet" newspaper. We now have Harry Potter the magician and Harry Potter the journalist (the same person). So how do we know whether his quotes are those of a magician or those of a journalist? This is why we removed all speakers that have more than one occupation. By doing so we removed 12% of the remaining data. A minimal sketch of both filters is given below.
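For illustration, here is a minimal pandas sketch of both filters. The column names (`qids`, `occupations`) and the toy rows are our assumptions, not the actual Quotebank schema:

```python
import pandas as pd

# Toy rows mimicking the structure we assume (not the real schema).
df = pd.DataFrame({
    "quotation": ["Expecto Patronum!", "We will rebuild.", "Great match."],
    "qids": [["Q1", "Q2"], ["Q3"], []],           # homonym, unique, empty
    "occupations": [["magician"], ["politician"], []],
})

# Keep only speakers resolved to exactly one QID
# (drops both homonyms and "None" speakers with an empty list).
df = df[df["qids"].str.len() == 1]

# Keep only speakers with a single occupation.
df = df[df["occupations"].str.len() == 1]
df["occupation"] = df["occupations"].str[0]
print(df)
```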
By applying our filtering criteria to the external dataset, we find that there are 6800 different occupations in it. This is a lot. Here is a look at the distribution of occurrences of occupations:
We selected an amount we could sort by hand, 280 (even though it is still a piece of work), which corresponds to a threshold of > 1000 occurrences. The downside is that we lose a lot of occupations just because we can't sort them by hand. We would need another unpaid assistant to do it for us… and we will come back to that later. Those "unclustered" occupations are gathered in Other. There were some "nan" occupations that survived the df.dropna() because they were of type string; we store those in NoOcc. The latter class won't be used in the next step and can be seen as a garbage class.
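Continuing from the sketch above, the bucketing could look like this (the threshold comes from the text; on the toy data everything would of course land in Other):

```python
# Count how often each occupation occurs among the filtered speakers.
counts = df["occupation"].value_counts()

# Occupations with > 1000 occurrences are kept for hand-sorting
# (~280 of them); the rest go to "Other", and the literal string
# "nan" (which survived dropna) goes to "NoOcc".
frequent = set(counts[counts > 1000].index)

def bucket(occ: str) -> str:
    if occ == "nan":
        return "NoOcc"
    return occ if occ in frequent else "Other"

df["occupation"] = df["occupation"].map(bucket)
```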
To define some clusters, we looked at publications here2 and there about Career Clusters. After that, we classified by hand our occupations with > 1000 occurrences into similar clusters, and assigned each quotation a unique occupation (recall: 1 quotation → a unique QID speaker → a sole occupation). We built two new datasets.
The first consists of 4 classes; it will be used for the proof of concept of the (later explained) BERT-based classifier:
| Cluster | Label | Meaning | # of quotes |
|---|---|---|---|
| 0 | Research | Research and science related careers | 2'372'142 |
| 1 | Politics | Government related careers | 6'928'256 |
| 2 | Sports | Sport related careers | 13'237'461 |
| 3 | Arts | Artists and creator related careers | 2'421'718 |
The second one consists of 20 classes:
| Cluster | Label | Meaning | # of quotes |
|---|---|---|---|
| 1 | AFNR | Agriculture, Food and Natural Resources careers | 9'891 |
| 3 | AAVTC | Arts, Audio/Video Technology and Communications careers | 2'649'277 |
| 4 | BMA | Business Management and Administration careers | 961'273 |
| 5 | ET | Education and Training careers | 14'009'216 |
| 7 | GPA | Government and Public Administration careers | 6'491'038 |
| 8 | HS | Health Science careers | 46'010 |
| 9 | HumS | Human Services careers | 111'482 |
| 10 | IT | Information Technology careers | 14'356 |
| 11 | LPSCS | Law, Public Safety, Corrections, and Security careers | 639'833 |
| 13 | MSS | Marketing, Sales, and Service careers | 97'653 |
| 14 | STEM | Science, Technology, Engineering, and Mathematics careers | 2'152'432 |
| 15 | R | Religion related careers | 173'688 |
| 16 | AT | Academic and Teacher related careers | 177'524 |
| 17 | J | Journalism related careers | 839'391 |
| 18 | MW | Military and War related careers | 179'833 |
| 19 | AS | Aircraft and Space careers | 24'947 |
| 20 | NoOcc | Not clustered careers | 1'365 |
If you noticed that cluster #2 does not exist in the last table, it is just because we created a cluster with no quotes in it by mistake… if you didn't notice it, it is not so important.
We used a pretrained BERT-Base model (BertTokenizer and BertModel, hidden size 768, introduced in this paper). We added one fully connected layer with 10 output dimensions. BertTokenizer transforms the input string into tokens, then BertModel returns a 768-dimensional representation of the input string. The added layer then produces 10 values which, after applying a sigmoid function, predict the probability of each class.
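A minimal sketch of this architecture with the transformers library; the checkpoint name and the class/variable names are our assumptions:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class QuoteClassifier(nn.Module):
    """BERT-Base encoder + one fully connected layer on top."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc = nn.Linear(768, n_classes)  # 768 = BERT-Base hidden size

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # One logit per class; the sigmoid is applied inside the BCE loss.
        return self.fc(out.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = QuoteClassifier()
batch = tokenizer(["A quote to classify."], return_tensors="pt",
                  truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```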
Weighted binary cross entropy was used as the loss function. The weights were used to decrease the effect of the unbalanced data; they were computed as the normalized inverse frequencies of the classes in the training data.
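For instance, using the 4-class counts from the table above; the exact normalization and the use of pos_weight are our guesses at one plausible implementation:

```python
import torch

# Quotes per class in the 4-class training data (from the table above).
counts = torch.tensor([2_372_142., 6_928_256., 13_237_461., 2_421_718.])

# Normalized inverse frequencies: rare classes get larger weights.
weights = (1.0 / counts) / (1.0 / counts).sum()

# Per-class weighting of the binary cross entropy on the logits.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=weights)
```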
For transformer-based models, it is convenient to use a scheduler which changes the learning rate during training to make training smoother: it increases the learning rate from zero to the set value during a warmup period, then smoothly decreases it back to zero for the last part of training. We used a linear scheduler with a warmup period equal to the first 10% of the total training steps.
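With the transformers helper, reusing the model from the sketch above and a hypothetical step count:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

total_steps = 10_000          # hypothetical: epochs * batches per epoch
optimizer = AdamW(model.parameters(), lr=2e-5)

# Linear warmup over the first 10% of steps, then linear decay to zero.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)
# In the training loop, after each optimizer.step(), call scheduler.step().
```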
After several training trials, we found that the weighted loss does not fully prevent overfitting to the most frequent classes. For training the final model, we decided to create fully balanced train and test datasets.
We present here the different results: the "proof-of-concept" classification, followed by the 20-class classification, and an extra step ;)
We trained and tested with unbalanced datasets, but made use of the weighted loss. Furthermore, to save memory, we decided to crop all quotes to 300 characters. After feeding the classifier the quotes labelled following the 4-class table and training it for 20 minutes, we get the following results:
The behaviour of the curves has nothing to do with the Swiss mountains; it just comes from the fact that we used a scheduler to vary the learning rate of our AdamW optimizer. Its initialization was not well set and it was re-initialized repeatedly instead of only once, hence the strange behaviour. This is fixed in the last use of the classifier.
The important point is that it learns well and fast! Here is the roc_auc for each class:
"Amount of class*" states the number of quotes belonging to each class in the test set. As said before, it is unbalanced.
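For reference, per-class ROC AUC values like these can be computed directly with scikit-learn; the arrays below are random placeholders, not our results:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder one-hot true labels and sigmoid outputs for 4 classes.
y_true = np.eye(4)[np.random.randint(0, 4, size=1000)]
y_prob = np.random.rand(1000, 4)

# average=None returns one ROC AUC per class.
print(roc_auc_score(y_true, y_prob, average=None))
```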
The results seem convincing. Next step: just feed the classifier the quotes labelled following the 20-class table, run it for 8-10h, and that's it!
We trained and tested with unbalanced datasets, but made use of the weighted loss. After training for about 9h, we get the following results:
(Same problem with the scheduler, wait for it.) OK, we can't assess whether it is good or bad by looking at the evolution of the loss, but what about the roc_auc values?
They are close to 0.5, which means that the classifier just classifies randomly. This is super bad!
Can we do better?
We assumed that the problem came from the fact that we had too many classes, so we decided to move from 20 to only 10 classes. We merged the classes according to their similarity. Here is the new classification table:
| Cluster | Label | Meaning | # of quotes before filtering | # of quotes after filtering |
|---|---|---|---|---|
| 0 | AAVTCM | Arts, Audio/Video Technology and Communications careers | 2'778'314 | 2'289'052 |
| 1 | BMAxF | Business Management and Administration careers | 1'236'888 | 1'064'036 |
| 2 | GPAxLPSCS | Government, Law, Security careers | 7'072'268 | 6'159'748 |
| 3 | MSSxHumS | Marketing, Sales and Service careers | 209'135 | 180'941 |
| 4 | ATE | Academic and Teacher related careers | 177'524 | 152'000 |
| 6 | STEMxIT | Science, Technology, Mathematics and Health science careers | 2'316'481 | 2'005'419 |
| 7 | R | Religion related careers | 173'688 | 159'630 |
| 8 | J | Journalism related careers | 839'391 | 143'374 |
| 9 | MW | Military and War related careers | 188'863 | 678'870 |
Furthermore, after some reflection, it seemed that the unbalanced test set was fooling us somewhat in the interpretation of the results. Thus, this step is done with an unbalanced training set but a balanced test set. Finally, we made the assumption that quotes with fewer than 50 characters did not contain relevant information and we filtered them out. This allowed us to save some space, and we raised the "crop threshold" from 300 to 400 characters.
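In pandas this boils down to two lines (again with our assumed quotation column):

```python
# Drop quotes shorter than 50 characters, then crop the rest to 400.
df = df[df["quotation"].str.len() >= 50]
df["quotation"] = df["quotation"].str.slice(0, 400)
```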
Well, even though the scheduler thing is still not fixed (waaiiiiiit for iiiit), these loss curves are kind of impossible to interpret in any way. We can look at extended performance metrics:
Here, "support" states that there are 10'000 quotes per class in the test set.
We can see that the f1 score is low for some classes. We then decided to plot the confusion matrix of the classification, and here is what we got:
Even though the diagonal term seems to be the highest column-wise, some columns behave as attractors, namely classes 3, 6 (GPAxLPSCS, Sports) and 1, 2, 7 (AAVTCM, BMAxF, STEMxIT). Unsurprisingly, this is correlated with the number of quotes per class in the training set. The weighted loss does not correct the problem! We have to try to pass it a balanced training set.
Putting the pieces together
As said before, we decided as a final step to balance the training set. The bottleneck class is Journalism with 143'374 quotes. We built a final training set containing 72'000 quotes of each class, and a test set containing 72'000 quotes per class as well.
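With pandas, such a balanced set can be drawn in one call (the cluster column name is our assumption):

```python
# Sample the same number of quotes from every cluster.
balanced_train = df.groupby("cluster", group_keys=False).sample(
    n=72_000, random_state=42
)
```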
Good news: we fixed the scheduler as well. Here are our loss curves:
It finally behaves normally. What about the performance metrics?
Seems good as well. And the confusion matrix?
Great! Problem fixed :)
Room for improvement
We finally got an unpaid assistant working on the occupation classification; however, we did not have time to use it in the classification process.
Sentence-BERT algorithm for occupation clustering
Sentence-BERT is a recent technique which fine-tunes the pooled BERT sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.
This pretrained algorithm not only performed the occupation clustering automatically, but also clustered all 6800 unique occupations!
Sentence-BERT was used to cluster the 6800 occupations into the 10 defined clusters by stating 10 hypotheses for each occupation (one per cluster) and assigning the occupation to the cluster with the maximum prediction confidence. The plot below shows the distribution of the prediction confidence for each cluster over the filtered additional dataset.
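Here is a minimal sketch of this idea with the sentence-transformers library, matching occupations to cluster hypotheses by embedding similarity. The checkpoint, the phrasing of the hypotheses, and the use of cosine similarity (rather than, say, an NLI-style zero-shot model) are our assumptions:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

# One hypothesis per cluster (shortened; the real list has 10 entries).
clusters = [
    "arts and communication careers",
    "government, law and security careers",
    "science, technology and mathematics careers",
    "religion related careers",
]
occupations = ["sculptor", "senator", "astrophysicist"]

occ_emb = model.encode(occupations, convert_to_tensor=True)
cl_emb = model.encode(clusters, convert_to_tensor=True)

# Cosine similarity of each occupation to each cluster hypothesis;
# argmax gives the assigned cluster, max its prediction confidence.
sims = util.cos_sim(occ_emb, cl_emb)
assignment = sims.argmax(dim=1)
confidence = sims.max(dim=1).values
```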
We finally got a classifier performing correctly. Furthermore, it would be awesome to make use of the implemented automated occupation clusterer. But that is your job, if you want to ;)
Thanks for reading! We hope you enjoyed your ADAventure with us.
The k-dim team.
- 2017-11-11-170105.jpg. (2021, December 16). https://wallpapercave.com/w/wp3396925
- tenor.gif. (2021, December 16). https://media1.tenor.com/images/24eba459fc0a6e19c4d2d60ed678e2f9/tenor.gif?itemid=7219821
- Quobert.PNG. (2021, December 17). https://dlab.epfl.ch/people/west/pub/Vaucher-Spitz-Catasta-West_WSDM-21.pdf
- tumblr_l982reZrDD1qzcmp3o1_500.gif. (2021, December 17). http://1.bp.blogspot.com/-kQcj14MDfdA/Tl7fyQQ_hnI/AAAAAAAAAE8/KI-pUSWaHF8/s1600/tumblr_l982reZrDD1qzcmp3o1_500.gif
Vaucher, T., Spitz, A., Catasta, M., & West, R. (2021, March). Quotebank: A Corpus of Quotations from a Decade of News. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (pp. 328-336). ↩