Ever heard of Catch Me If You Can? In this movie, the main character impersonates members of many professions, mainly by appropriating their uniforms, behaviours and lingo, and everybody falls for it!

The uniform part is quite obvious, but what can we say about the lingo? Everyone will agree that each occupation has its own « proper words », but let’s think further: a person in a given profession also expresses themselves in a certain manner. Think of a politician: he does not only speak about the economy and education, but also uses figures of speech, and his discourse is fluid and well structured, unlike that of a sportsman or a scientist, for example.

Based on this observation: give us a quote, and we will tell you which profession said it!



We present here the two main datasets used for this project. You can find them by following the link in the upper menu.


What do we have here? The full dataset is made of 178 million quotations, each with a list of possible speakers ranked by probability, the name of the most probable speaker and their Wikidata Qid, the date it was published, and where it was published. The latter is important because the quotations were extracted from 162 million English news articles published between 2008 and 2020 inclusive, so one might want to keep track of the source.

162 million, 178 million, but who did that? Well, it’s the nice library assistant Quobert, and it does it for free. Here is a nice picture (taken from this publication [1], which explains in full detail how the dataset was collected) that should help you grasp Quobert’s workflow.

Super nice, except that it correctly attributes only about 85% of the quotations. But could you expect better from an unpaid assistant?
Here are some of its failures:

But this will be good enough for our purpose. Just keep in mind that Quobert is an unpaid assistant.
Furthermore, for this project, we did not use the full Quotebank but only the years 2015 to 2020 inclusive.

External sources

We said we want to predict occupations but have not said a word about them so far… Quotebank is augmented with metadata about the speakers. In this external dataset we find, for ~9M Wikidata entities, additional information such as occupation, gender, religion, etc.


First of all, we had to deal with the “homonym issue”. A quotation is linked to a list of names, ranked by probability, and then to a single name in a “winner takes it all” fashion. Once a name is elected, it is linked to its corresponding Qid; BUT if the name has namesakes, then several Qids are attached. So how do we know whether we are facing Harry Potter the magician or Harry Potter the journalist (this does not really apply here because there are no fictional characters in Quotebank, but you get the point)? Furthermore, if Quobert is uncertain, it just assigns “None” as the speaker, with an empty list of Qids. To cope with those issues, we simply filtered out all speakers whose list of Qids did not contain exactly one Qid.
By doing so we dealt with both the “homonym issue” and the “empty speaker issue”, and got rid of 50% of the remaining data.

Fine, but now suppose that Harry Potter the magician was tired of fighting evil, and after all, You-Know-Who was defeated in 2011, so… he decided to start a career as a journalist at the renowned newspaper “The Daily Prophet”. We now have Harry Potter the magician and Harry Potter the journalist (the same person this time). So how do we know whether his quotes are those of a magician or those of a journalist? This is why we removed all speakers that have more than one occupation. By doing so we removed 12% of the remaining data.
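The two filters described above can be sketched in a few lines of plain Python. This is only an illustration on toy data: the field names `qids` and `occupations_by_qid`, the helper `keep_unambiguous` and the example rows are assumptions, not the project’s actual code. Dropping speakers with no occupation at all is also an assumption made here so that every kept quote has a label.

```python
# Toy stand-ins for Quotebank rows and the speaker metadata.
# The field names ("qids", "occupation") are hypothetical.
quotes = [
    {"quote": "We will lower taxes.", "qids": ["Q1"]},
    {"quote": "Expelliarmus!", "qids": ["Q2", "Q3"]},  # namesakes: ambiguous
    {"quote": "No comment.", "qids": []},              # speaker is None
    {"quote": "I write and I cast spells.", "qids": ["Q4"]},
]
occupations_by_qid = {
    "Q1": ["politician"],
    "Q4": ["magician", "journalist"],  # two occupations: ambiguous label
}


def keep_unambiguous(quotes, occupations_by_qid):
    """Keep quotes with exactly one Qid whose speaker has exactly one occupation."""
    kept = []
    for row in quotes:
        if len(row["qids"]) != 1:
            continue  # homonym issue or empty speaker
        occupations = occupations_by_qid.get(row["qids"][0], [])
        if len(occupations) != 1:
            continue  # several occupations (or none: an assumption of this sketch)
        kept.append({**row, "occupation": occupations[0]})
    return kept


filtered = keep_unambiguous(quotes, occupations_by_qid)
print(filtered)
```

Only the first quote survives here: it has a single Qid, and that speaker has a single occupation.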

Occupation clustering

After applying our filtering criteria to the external dataset, we find that there are 6800 different occupations in it. This is a lot. Here is a look at the distribution of the number of occurrences per occupation:

We selected an amount we could sort by hand, namely 280 occupations (even though that is still quite a piece of work), which corresponds to a threshold of more than 1000 occurrences. The downside is that we lose a lot of occupations just because we can’t sort them by hand. We would need another unpaid assistant to do it for us… and we will come back to that later. Those “unclustered” occupations are gathered in Other. Some “nan” occupations survived df.dropna() because they were of type string; we store those in NoOcc. The latter class won’t be used in the next steps. It can be seen as a garbage class.
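As a rough illustration of this bucketing, here is a minimal sketch in plain Python. The toy occupation list, the helper name `bucket` and the lowered threshold are assumptions for the example; the post uses a threshold of 1000 occurrences.

```python
from collections import Counter

# The post uses 1000; the threshold is lowered here to fit the toy data.
THRESHOLD = 2

occupations = [
    "politician", "politician", "politician",
    "footballer", "footballer", "footballer",
    "beekeeper",
    "nan",  # a string "nan" is not a missing value, so dropna() keeps it
]

counts = Counter(occupations)


def bucket(occupation, counts, threshold):
    """Return the occupation itself if frequent enough, else a catch-all class."""
    if occupation == "nan":
        return "NoOcc"  # garbage class
    if counts[occupation] > threshold:
        return occupation  # frequent enough to be clustered by hand
    return "Other"  # too rare to be sorted by hand


labels = [bucket(o, counts, THRESHOLD) for o in occupations]
print(labels)
```

The string check is needed because a value of type string equal to "nan" is an ordinary value for pandas, so it has to be caught explicitly.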

To define the clusters, we looked at publications about Career Clusters, here [2] and there. After that, we classified by hand our occupations with more than 1000 occurrences into similar clusters and assigned each quotation to a unique occupation (recall: one quotation → a unique speaker Qid → a single occupation). We built two new datasets:
The first one consists of 4 classes; it will be used for the proof-of-concept of the BERT-based classifier explained later:

| Cluster | Label | Meaning | # of quotes |
|---|---|---|---|
| 0 | Research | Research and science related careers | 2’372’142 |
| 1 | Politics | Government related careers | 6’928’256 |
| 2 | Sports | Sport related careers | 13’237’461 |
| 3 | Arts | Artists and creator related careers | 2’421’718 |

The second one consists of 20 classes:

| Cluster | Label | Meaning | # of quotes |
|---|---|---|---|
| 0 | Other | Not clustered careers | 2’167’753 |
| 1 | AFNR | Agriculture, Food and Natural Resources careers | 9’891 |
| 3 | AAVTC | Arts, Audio/Video Technology and Communications careers | 2’649’277 |
| 4 | BMA | Business Management and Administration careers | 961’273 |
| 5 | ET | Education and Training careers | 14’009’216 |
| 6 | F | Finance careers | 275’615 |
| 7 | GPA | Government and Public Administration careers | 6’491’038 |
| 8 | HS | Health Science careers | 46’010 |
| 9 | HumS | Human Services careers | 111’482 |
| 10 | IT | Information Technology careers | 14’356 |
| 11 | LPSCS | Law, Public Safety, Corrections, and Security careers | 639’833 |
| 12 | M | Manufacturing careers | 129’037 |
| 13 | MSS | Marketing, Sales, and Service careers | 97’653 |
| 14 | STEM | Science, Technology, Engineering, and Mathematics careers | 2’152’432 |
| 15 | R | Religion related careers | 173’688 |
| 16 | AT | Academic and Teacher related careers | 177’524 |
| 17 | J | Journalism related careers | 839’391 |
| 18 | MW | Military and War related careers | 179’833 |
| 19 | AS | Aircraft and Space careers | 24’947 |
| 20 | NoOcc | NaN careers | 1’365 |

If you noticed that cluster #2 does not exist in the last table, it is just because we created a cluster with no quotes in it by mistake… If you didn’t notice, it is not so important.


We used the pretrained BERT-Base model (BertTokenizer and BertModel with a hidden size of 768, introduced in this paper). We added one fully connected layer with 10 output dimensions. BertTokenizer transforms the input string into tokens, then BertModel returns a 768-dimensional representation of the input string. The added layer then produces 10 values which, after applying the sigmoid function, give the predicted probability of each class.

A weighted binary cross-entropy was used as the loss function. The weights reduce the effect of the unbalanced data; they were computed as the normalized inverse frequencies of the classes in the training data.
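To make the weighting concrete, here is a small sketch in plain Python. It assumes one weight per class, applied to that class’s term of the binary cross-entropy; the exact form used in the project is not stated, and the function names `inverse_frequency_weights` and `weighted_bce` are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weights proportional to 1 / class count, normalized to sum to 1."""
    counts = Counter(labels)
    # Assumption: a class absent from the training data gets weight 0.
    inverse = [1.0 / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_bce(probs, target, weights, eps=1e-12):
    """Binary cross-entropy over per-class sigmoid outputs for one sample.

    probs: predicted probability of each class; target: index of the true class.
    """
    loss = 0.0
    for c, p in enumerate(probs):
        y = 1.0 if c == target else 0.0
        loss -= weights[c] * (y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))
    return loss


# Class 0 is three times more frequent than class 1.
weights = inverse_frequency_weights([0, 0, 0, 1], num_classes=2)
good = weighted_bce([0.9, 0.1], target=0, weights=weights)
bad = weighted_bce([0.1, 0.9], target=0, weights=weights)
print(weights, good, bad)
```

With these toy labels the rare class weighs three times as much as the frequent one, so errors on rare classes cost more.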

For transformer-based models, it is convenient to use a scheduler that changes the learning rate during training to make training smoother: it increases the learning rate from zero to the set value during a warmup period, then smoothly decreases it back to zero over the rest of training. We used a linear scheduler with a warmup period equal to the first 10% of the total training steps.

After several training trials we found that the weighted loss does not fully prevent overfitting to the most frequent classes. For training the final model, we decided to create fully balanced train and test datasets.
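The shape of such a schedule can be written as a multiplier applied to the base learning rate at each step. The sketch below is a plain-Python illustration under that assumption; the function name `linear_warmup_factor` is hypothetical. The Hugging Face transformers library offers `get_linear_schedule_with_warmup` for the same shape, though the post does not say which implementation was used.

```python
def linear_warmup_factor(step, total_steps, warmup_fraction=0.1):
    """Multiplier for the base learning rate at a given training step.

    Rises linearly from 0 to 1 during warmup, then decays linearly to 0.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)


total_steps = 1000
base_lr = 2e-5  # hypothetical value, not taken from the post
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_factor(step, total_steps))
```

The scheduler must be created once for the whole run and stepped after every optimizer update; rebuilding it repeatedly restarts the warmup each time.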


We present here the different results: the “proof-of-concept” classification, followed by the 20 classes classification, and an extra step ;)


We trained and tested with unbalanced datasets, but made use of a weighted loss. Furthermore, to save memory, we cropped all quotes to at most 300 characters. After feeding the classifier the quotes labelled according to the 4-class table and training it for 20 minutes, we got the following results:

The behaviour of the curves has nothing to do with the Swiss mountains; it just comes from the fact that we used a scheduler to vary the learning rate of our AdamW optimizer. Its initialization was not set up properly: it was initialized repeatedly instead of only once, hence the strange behaviour. This is fixed in the last use of the classifier.
The important point is that it learns well and fast! Here is the roc_auc for each class:

“Amount of class*” states the number of quotes belonging to this class in the test set. As said before, it is unbalanced.
The results seem convincing. Next step: just feed the classifier the quotes labelled according to the 20-class table, run it for 8-10h and that’s it!

20 classes

We trained and tested with unbalanced datasets, but made use of a weighted loss. After training for about 9h, we got the following results:

(Same problem with the scheduler, wait for it.) OK, we can’t assess whether it is good or bad by looking at the evolution of the loss, but what about the roc_auc values?

They are close to 0.5, which means that the classifier is essentially guessing at random. This is super bad!
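Why does 0.5 mean random? The ROC AUC of a class can be read as the probability that a randomly chosen quote of that class receives a higher score than a randomly chosen quote of another class. The sketch below computes it directly from that definition in plain Python on made-up scores; the function name `roc_auc` is hypothetical, and in practice a library routine such as scikit-learn’s `roc_auc_score` would be used.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count for half a pair. labels must contain both 0s and 1s.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], labels)      # every positive outranks every negative
uninformed = roc_auc([0.5, 0.5, 0.5, 0.5], labels)   # scores carry no information
print(perfect, uninformed)
```

A classifier whose scores ignore the input gets 0.5, which is why values close to 0.5 are bad news.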

Can we do better ?

We assumed that the problem came from having too many classes, so we decided to move from 20 to only 10 classes. We merged the classes according to their similarity. Here is the new classification table:

| Cluster | Label | Meaning | # of quotes before filtering | # of quotes after filtering |
|---|---|---|---|---|
| 0 | AAVTCM | Arts, Audio/Video Technology and Communications careers | 2’778’314 | 2’289’052 |
| 1 | BMAxF | Business Management and Administration careers | 1’236’888 | 1’064’036 |
| 2 | GPAxLPSCS | Government, Law, Security careers | 7’072’268 | 6’159’748 |
| 3 | MSSxHumS | Marketing, Sales and Service careers | 209’135 | 180’941 |
| 4 | ATE | Academic and Teacher related careers | 177’524 | 152’000 |
| 5 | SPORTS | Sport careers | 14’009’216 | 11’593’709 |
| 6 | STEMxIT | Science, Technology, Mathematics and Health science careers | 2’316’481 | 2’005’419 |
| 7 | R | Religion related careers | 173’688 | 159’630 |
| 8 | J | Journalism related careers | 839’391 | 143’374 |
| 9 | MW | Military and War related careers | 188’863 | 678’870 |

Furthermore, after some reflection, it seemed that the unbalanced test set was fooling us in the interpretation of the results. Thus, this step is done with an unbalanced training set but a balanced test set. Finally, we assumed that quotes shorter than 50 characters did not contain relevant information, and we filtered them out. This saved some space, so we raised the “crop threshold” from 300 to 400 characters.
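The length filter and the crop amount to two simple operations per quote. Here is a minimal plain-Python sketch; the constant names and the helper `preprocess` are assumptions for the illustration.

```python
MIN_CHARS = 50   # quotes shorter than this are dropped
MAX_CHARS = 400  # longer quotes are cropped to this length


def preprocess(quotes):
    """Drop quotes under MIN_CHARS characters and crop the rest to MAX_CHARS."""
    return [q[:MAX_CHARS] for q in quotes if len(q) >= MIN_CHARS]


quotes = ["Too short.", "a" * 50, "b" * 1000]
cleaned = preprocess(quotes)
print([len(q) for q in cleaned])
```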

Well, even though the scheduler thing is still not fixed (waaiiiiiit for iiiit), those loss curves are kind of impossible to interpret in any way. We can look at extended performance metrics instead:

Here, “support” states that there are 10’000 quotes per class in the test set.
We can see that the f1 score is low for some classes. We then decided to plot the confusion matrix of the classification. Here is what we got:

Even though the diagonal term seems to be the highest in each column, some columns behave as attractors, namely columns 3 and 6 (GPAxLPSCS, SPORTS) and columns 1, 2 and 7 (AAVTCM, BMAxF, STEMxIT), counting columns from 1. Unsurprisingly, this is correlated with the number of quotes per class in the training set. The weighted loss does not correct the problem! We have to try feeding it a balanced training set.

Putting the pieces together

As said before, we decided as a final step to balance the training set. The bottleneck class is Journalism with 143’374 quotes. We built a final training set containing 72’000 quotes of each class and a test set containing 72’000 quotes of each class as well.
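Drawing the same number of quotes from every class can be sketched as follows in plain Python. The post does not describe how the two sets were sampled, so the disjoint train/test split, the fixed seed and the helper name `balanced_split` are all assumptions of this example, which runs on toy sizes.

```python
import random
from collections import Counter, defaultdict


def balanced_split(samples, per_class_train, per_class_test, seed=0):
    """Draw disjoint train/test sets with the same number of samples per class.

    samples: iterable of (text, label) pairs.
    """
    by_class = defaultdict(list)
    for text, label in samples:
        by_class[label].append(text)

    rng = random.Random(seed)
    train, test = [], []
    for label, texts in by_class.items():
        needed = per_class_train + per_class_test
        if len(texts) < needed:
            raise ValueError(f"class {label!r} has {len(texts)} samples, needs {needed}")
        rng.shuffle(texts)
        train += [(t, label) for t in texts[:per_class_train]]
        test += [(t, label) for t in texts[per_class_train:needed]]
    return train, test


# Unbalanced toy data: 10 sports quotes, 4 journalism quotes.
samples = [(f"sports quote {i}", "SPORTS") for i in range(10)]
samples += [(f"journalism quote {i}", "J") for i in range(4)]

train, test = balanced_split(samples, per_class_train=2, per_class_test=2)
print(Counter(label for _, label in train), Counter(label for _, label in test))
```

The smallest class sets the ceiling: every other class is downsampled to match it.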
Good news, we fixed the scheduler as well. Here are our loss curves:

It finally behaves normally. What about the performance metrics ?

Seems good as well. And for the confusion matrix ?

Great ! Problem fixed :)

Room for improvement

We finally got an unpaid assistant working on the occupation clustering, but we did not have time to use it in the classification process.

Sentence-BERT algorithm for occupation clustering

Sentence-BERT is a recent technique which fine-tunes the pooled BERT sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.

This pretrained algorithm not only performed the occupation clustering automatically, but also clustered all 6800 unique occupations!

Sentence-BERT was used to cluster the 6800 occupations into the 10 defined clusters, by stating 10 hypotheses for each occupation and assigning the occupation to the cluster with the maximum prediction confidence. The plot below shows the distribution of the prediction confidence for each cluster over the filtered additional dataset.


We finally got a classifier that performs correctly. Furthermore, it would be awesome to make use of the implemented automated occupation clusterer. But this is your job if you want to ;)

Thanks for reading! We hope you enjoyed your ADAventure with us.

The k-dim team.


  • 2017-11-11-170105.jpg (2021, December 16). https://wallpapercave.com/w/wp3396925
  • tenor.gif (2021, December 16). https://media1.tenor.com/images/24eba459fc0a6e19c4d2d60ed678e2f9/tenor.gif?itemid=7219821
  • Quobert.PNG (2021, December 17). https://dlab.epfl.ch/people/west/pub/Vaucher-Spitz-Catasta-West_WSDM-21.pdf
  • tumblr_l982reZrDD1qzcmp3o1_500.gif (2021, December 17). http://1.bp.blogspot.com/-kQcj14MDfdA/Tl7fyQQ_hnI/AAAAAAAAAE8/KI-pUSWaHF8/s1600/tumblr_l982reZrDD1qzcmp3o1_500.gif