AI-Driven Solution for Talent Acquisition: a White Paper

Introduction

In the rapidly evolving landscape of Talent Management (TM), the integration of Artificial Intelligence (AI) into Software as a Service (SaaS) platforms is set to revolutionize talent acquisition processes. This paper presents an innovative AI-enhanced SaaS solution designed to tackle the complex challenges of resume classification, job description matching, and the creation of customized resumes and cover letters.

Project Hypothesis

Our AI Team has been tasked with developing a comprehensive solution that leverages the capabilities of Large Language Models (LLMs) to streamline HR processes. The goal is to accurately classify and match resumes to job descriptions, score resumes, generate and update resumes, cover letters, and job descriptions, and provide actionable recommendations.

Problem Framing

The industry-wide issue that needs to be addressed is the lack of a standardized frame of reference for benchmarks and comparisons, scoring, and ranking of what a candidate’s CV offers and what a job description demands. We propose tackling this problem using a “taxonomy anchor” as the benchmark scale to measure the “suitability” of a CV or set of CVs for a given job description through categorization.
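As an illustration of how a taxonomy anchor can act as a benchmark scale, the sketch below scores CVs against a job description by comparing hierarchical occupation codes. It assumes, hypothetically, that an upstream step has already assigned four-digit ISCO-08 style codes to each CV and to the JD; the function names and the prefix-based scoring rule are illustrative assumptions, not the scoring method of the proposed solution.

```python
# Illustrative sketch only: function names and scoring rule are assumptions.
# Codes are treated as hierarchical strings (e.g. ISCO-08: major group,
# sub-major group, minor group, unit group), so a longer shared prefix
# means the two occupations sit closer together in the taxonomy.

def shared_prefix_len(code_a: str, code_b: str) -> int:
    """Count how many leading digits two occupation codes have in common."""
    n = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        n += 1
    return n


def suitability(cv_codes: list[str], jd_code: str) -> float:
    """Best hierarchical agreement between any CV code and the JD code, in [0, 1]."""
    if not cv_codes or not jd_code:
        return 0.0
    best = max(shared_prefix_len(code, jd_code) for code in cv_codes)
    return best / len(jd_code)


def rank(cvs: dict[str, list[str]], jd_code: str) -> list[tuple[str, float]]:
    """Rank candidates by suitability for the JD, highest first."""
    scored = [(name, suitability(codes, jd_code)) for name, codes in cvs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates; "2512" is used here as the JD's occupation code.
candidates = {"alice": ["2511"], "bob": ["2221"], "carol": ["2512", "3512"]}
ranking = rank(candidates, "2512")
# ranking == [("carol", 1.0), ("alice", 0.75), ("bob", 0.25)]
```

Because every CV and JD is projected onto the same code hierarchy, the scores are directly comparable across candidates, which is the point of using a common frame of reference.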

Proposed Solution

Our team proposes a modularized solution for parsing CVs and Job Descriptions (JDs) and for matching CVs against a given JD, either directly or through the summarization and reasoning of an LLM fine-tuned on ESCO/ISCO/US SOC standards. The solution also aims to propose a JD for a given CV and to score the “fit” of a CV.

The proposed solution is designed to be modular, allowing for the integration of various components as needed. This includes a parser for extracting relevant information from CVs and JDs, a matcher for aligning CVs with appropriate JDs, and a scoring system for evaluating the fit of a CV for a given JD. The solution also includes a generator for creating and updating CVs, cover letters, and JDs based on the extracted information and the results of the matching and scoring processes.
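The modular design described above can be sketched as a pipeline whose four components (parser, matcher, scorer, generator) are swappable callables. Everything in this sketch is a hypothetical stand-in: the `Document` and `Pipeline` types, the keyword lookup, and the toy components are assumptions made for illustration, where a real system would plug in LLM-backed implementations.

```python
# Illustrative sketch only: all types and toy components are assumptions.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    text: str
    occupation_codes: list[str] = field(default_factory=list)


@dataclass
class Pipeline:
    parse: Callable[[str], Document]
    match: Callable[[Document, Document], bool]
    score: Callable[[Document, Document], float]
    generate: Callable[[Document, Document, float], str]

    def run(self, cv_text: str, jd_text: str) -> dict:
        """Parse both inputs, then match, score, and generate in sequence."""
        cv, jd = self.parse(cv_text), self.parse(jd_text)
        matched = self.match(cv, jd)
        fit = self.score(cv, jd) if matched else 0.0
        return {"matched": matched, "fit": fit, "draft": self.generate(cv, jd, fit)}


# Toy components standing in for LLM-backed modules.
KEYWORD_TO_CODE = {"python": "2512", "nurse": "2221"}  # hypothetical lookup


def toy_parse(text: str) -> Document:
    lowered = text.lower()
    codes = [code for kw, code in KEYWORD_TO_CODE.items() if kw in lowered]
    return Document(text=text, occupation_codes=codes)


def toy_match(cv: Document, jd: Document) -> bool:
    return bool(set(cv.occupation_codes) & set(jd.occupation_codes))


def toy_score(cv: Document, jd: Document) -> float:
    a, b = set(cv.occupation_codes), set(jd.occupation_codes)
    return len(a & b) / len(a | b) if a | b else 0.0  # Jaccard overlap


def toy_generate(cv: Document, jd: Document, fit: float) -> str:
    return f"Cover letter draft (fit={fit:.2f})"


pipeline = Pipeline(toy_parse, toy_match, toy_score, toy_generate)
result = pipeline.run("Python developer, 5 years", "Seeking a Python engineer")
```

Keeping each stage behind a plain callable means any single component can be replaced or retrained without touching the others.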

The key innovation of our solution lies in its use of LLMs, which have been fine-tuned on ESCO/ISCO/US SOC standards. These models can understand and reason about the content of CVs and JDs, allowing them to perform tasks such as summarization and matching with high accuracy. Furthermore, by grounding our solution in these occupational standards, we ensure that our analysis is always tied to a common frame of reference, allowing for meaningful comparisons and evaluations.

The goal is to facilitate stepwise and progressive improvement iterations of the architecture, ensuring it excels in various phases of the workflow, including CV data ingestion and parsing, preprocessing, and the application of specialized LLM models tailored for embeddings (retriever) and Q&A functionalities (generator). 

Essentially, our solution focuses on the following: 

Given the large variety of CV content, we parse CVs section by section: each paragraph is associated with the header of the section in which it appears, so that the extracted content can be classified by type and processed more simply downstream.
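A minimal sketch of this header-association strategy follows. It assumes section headers appear on their own line and belong to a small known vocabulary; the `KNOWN_HEADERS` set and the `split_sections` name are hypothetical, and a production parser would need far more robust header detection.

```python
# Illustrative sketch only: the header vocabulary and function name are assumptions.
KNOWN_HEADERS = {"summary", "experience", "education", "skills"}


def split_sections(cv_text: str) -> dict[str, list[str]]:
    """Group each non-empty line under the most recently seen section header."""
    sections: dict[str, list[str]] = {"preamble": []}
    current = "preamble"  # content that appears before any recognized header
    for raw in cv_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key = line.rstrip(":").lower()
        if key in KNOWN_HEADERS:
            current = key
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


sample_cv = """Jane Doe
Experience
Backend developer at Acme
Education:
BSc Computer Science
"""
parsed = split_sections(sample_cv)
# parsed["experience"] == ["Backend developer at Acme"]
```

Once every paragraph carries its section label, later stages can treat experience, education, and skills content differently without re-detecting what each paragraph is about.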

Model Training Pipeline – Framework 

Large Language Models (LLMs) represent a type of artificial intelligence that undergoes training on an extensive amount of web data and various online resources. This training enables them to produce responses to language-based inquiries that closely resemble human expression. Despite the impressive achievements already demonstrated by commercial and versatile LLMs such as ChatGPT, Gemini, and Claude, the prospective evolution of LLMs is trending towards models that are more focused on particular domains, possessing compactness and enhanced efficiency. 

In our use case, the embedding models and the LLMs can greatly improve their accuracy and the relevance of generated text if both are made aware of the ESCO/ISCO/US SOC standards and always reason within the boundaries of the associated occupational codes for every piece of information they are asked to summarize or parse. Furthermore, models can be trained on domain-specific corpora, so that a job description and a CV specific to the medical, legal, or IT domain are analyzed with very high accuracy when the data is routed to the corresponding domain fine-tuned model.
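The routing step mentioned above could look roughly like the sketch below, which picks a domain model by counting keyword hits. The keyword sets, the `route` function, and the `"general"` fallback are hypothetical; in practice the router might itself be a classifier or an embedding-based lookup.

```python
# Illustrative sketch only: keyword sets and names are assumptions.
import re

DOMAIN_KEYWORDS = {
    "medical": {"patient", "clinical", "nurse"},
    "legal": {"litigation", "contract", "attorney"},
    "it": {"software", "python", "kubernetes"},
}


def route(text: str, default: str = "general") -> str:
    """Return the domain whose keywords overlap most with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = {domain: len(tokens & kws) for domain, kws in DOMAIN_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    # Fall back to a general-purpose model when no domain keyword is present.
    return best if hits[best] > 0 else default
```

For example, `route("Senior software engineer with Python")` selects the `"it"` model, while a text with no domain keywords falls back to `"general"`.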

We propose a model training framework that builds upon pretrained, in-domain Large Language Models. Work on in-domain LLMs has been constrained by limited access to openly available checkpoints and by exceedingly large models, such as GPT-4, whose size has been reported (but not officially confirmed) at around 1.7 trillion parameters. However, the landscape of LLM research has evolved with a notable shift towards fully open-source LLM checkpoints, featuring commercially viable sizes ranging from 3 to 13 billion parameters.

These open source LLMs have undergone pre-training with trillions of tokens and strike a great balance between size and functionality, rendering them exceptionally suitable for tasks demanding specialized or niche language comprehension. 

Within the proposed framework, our primary focus lies in constructing systems by customizing publicly available LLM checkpoints with domain-specific data. To achieve this objective, we emphasize Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA, combined with quantization, which strike a balance between performance and computational expenditure.
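To make the cost trade-off concrete, the following back-of-the-envelope sketch compares trainable parameter counts for full fine-tuning and for LoRA, and estimates weight memory at different quantization levels. LoRA replaces the update of a d × k weight matrix with two low-rank factors of shapes d × r and r × k, so it trains r(d + k) parameters instead of d · k. The layer size, rank, and model size used here are illustrative assumptions, not figures from the proposed framework.

```python
# Illustrative arithmetic only: layer size, rank, and model size are assumptions.

def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when updating a full d x k weight matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for storing weights only, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


d = k = 4096  # hypothetical projection-matrix size
r = 8         # hypothetical LoRA rank

full = full_finetune_params(d, k)   # 16_777_216
lora = lora_params(d, k, r)         # 65_536
ratio = lora / full                 # 0.00390625, i.e. 1/256 of the full update

fp16_gb = weight_memory_gb(7_000_000_000, 16)  # 14.0 for a 7B-parameter model
int4_gb = weight_memory_gb(7_000_000_000, 4)   # 3.5 for the same model at 4 bits
```

These estimates cover weights only; activations, optimizer state, and quantization overhead add to the real footprint.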

The proposed framework takes the approach of in-context fine-tuning for pre-trained LLMs using domain-specific datasets. This strategic step capitalizes on the inherent potential of the pre-existing model, thereby refining its contextual comprehension in alignment with the specific requirements of many organizations. 

As language modeling techniques have evolved, we have seen an increasing effectiveness in retrieval augmented generation (RAG), where the generator model is provided with relevant context documents from a context database. Documents are fetched with a retriever model which embeds them into a semantic space where they can be matched to queries at inference time. 
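The retrieve-then-generate flow described above can be sketched as follows. To stay self-contained, the sketch swaps in a bag-of-words vector and cosine similarity as a toy stand-in for a learned embedding model, and it stops at building the prompt that would be handed to a generator. All function names and the sample documents are hypothetical.

```python
# Illustrative sketch only: a bag-of-words "embedding" stands in for a real
# retriever model, and all names here are assumptions.
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: token counts over lowercase alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k context documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 1) -> str:
    """Assemble the generator input from retrieved context plus the question."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


context_db = [
    "Software developers design and write code for applications",
    "Nursing professionals provide patient care in clinical settings",
]
top = retrieve("Which occupation writes code for software applications?", context_db)
```

In a real deployment the documents would be embedded once and stored in a vector index, with only the query embedded at inference time.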

Despite the effectiveness of RAG, the generator and retrieval systems have evolved in separate stacks and are served at inference time together with multiple API calls. We not only perceive the necessity of multiple API calls as an issue but also advocate for a unified training approach for these systems, leading to their integration into a singular API that seamlessly incorporates both the generative and retrieval functionalities.

Review of Existing Approaches

While there exist numerous attempts at SaaS or locally deployed solutions that tackle the HR AI domain issues (skill extraction, general CV parsing, etc), there is no commercially available solution that uses occupational codes as a frame of reference to parse and analyze CVs and Job Descriptions and ensure the analysis is always grounded in a common denominator for both.

Existing approaches to the problem of resume classification and job description matching typically rely on rule-based or machine learning algorithms. These methods, while effective in certain contexts, have significant limitations. Rule-based algorithms, for instance, are often brittle and unable to handle the complexity and variability of real-world data. Machine learning algorithms, on the other hand, require large amounts of labeled data and can struggle with tasks that require a deep understanding of the content of CVs and JDs.

Our review of the existing literature and commercial solutions reveals a gap in the market for a solution that leverages the power of LLMs and occupational standards to address the challenges of resume classification and job description matching. Our proposed solution aims to fill this gap by offering a versatile and effective tool for talent acquisition.

The following sections explain the main existing approaches and contextualize our approach.

Supervised Approaches 

Supervised learning has conventionally been the dominant paradigm for occupation extraction and standardization. Typically, the extraction task is framed as a multiclass classification problem. Studies such as [2, 3, 4, 6] have evaluated the merits of SVMs and convolutional neural networks using English job titles labeled with ISCO codes. In a different vein, [20] employed an ensemble of five machine learning algorithms for classifying Greek job postings into ISCO categories. Other researchers like [1] have applied Naive Bayes and Bayesian Multinomial techniques for the prediction of KldB 2010 codes using German survey data. Furthermore, investigations such as [17] and [18] have assessed the efficacy of a variety of supervised algorithms for classifying into American SOC codes. However, supervised methods often grapple with limitations tied to dataset size, language specificity, and taxonomy constraints.

Our approach, by contrast, offers a versatile framework, adept at handling multiple languages and taxonomies through a cohesive pipeline.  

Unsupervised Approaches 

Initial endeavors in the realm of occupation and skill extraction leaned on techniques like rule-based labeling [14], keyword-driven searches [15], and topic modeling [10]. An intriguing observation from [22] highlighted that CASCOT, a rule-based approach, could sometimes surpass machine learning counterparts in specific evaluations. However, as CVs and job descriptions become semantically richer, the performance of such systems falls short of a fine-tuned LLM, which can remove semantic noise through summarization.

Transformer-based LLMs  

To the best of our knowledge, there is no commercial offering harnessing the capabilities of state-of-the-art GPT-style decoder-only models for occupation coding. While there exist some parallels in the HR AI domain, such as skill extraction, the focus has primarily been on BERT-like (encoder-only) architectures [7, 16]. For instance, [21] limited their scope to fine-tuning classification layers for skill categorization. Meanwhile, works like [5, 23] leveraged LLM embeddings as alternatives to traditional word2vec embeddings, albeit with the LLM playing a somewhat peripheral role in their Named Entity Recognition (NER) frameworks. Prior endeavors have predominantly concentrated on tapping into LLMs for Natural Language Understanding (NLU), while our proposed approach breaks new ground by exploring the generative capabilities (NLG) of LLMs specifically for occupation extraction and standardization (OES).

Ethical Considerations and Implications of the AI-Driven Solution for Talent Acquisition

The proposed AI-driven solution for talent acquisition, as described in this white paper, presents several ethical considerations and implications, spanning bias and fairness, transparency and explainability, and privacy and data security:

Bias and Fairness: The solution leverages Large Language Models (LLMs) that have been fine-tuned on ESCO/ISCO/US SOC standards. These standards provide a common frame of reference for analyzing CVs and Job Descriptions (JDs), which helps ensure that all candidates are evaluated based on the same criteria. This approach can help mitigate biases that might arise from subjective interpretations of CVs and JDs. Furthermore, the use of domain-specific models for different fields like medical, legal, or IT ensures that the evaluation is tailored to the specific requirements and terminologies of each domain, which can further enhance fairness.

Transparency and Explainability: The modular design of the solution, with separate components for parsing, matching, scoring, and generating CVs and JDs, can potentially make it easier to explain how the system works and how it arrives at its decisions. This transparency is crucial for candidates to understand and trust the system.

Privacy and Data Security: The solution involves processing sensitive personal data contained in CVs. It’s therefore essential to have robust data security measures in place and to use the data only for the intended purpose of matching candidates to job descriptions. Candidates should be informed about how their data will be used and stored, and their consent should be obtained.

Our solution for talent acquisition has been designed with careful consideration of ethical issues related to bias, fairness, transparency, and data privacy. However, given the importance and criticality of this topic, we continuously monitor and evaluate the system to ensure that these ethical standards are maintained in practice.

Conclusion

Our solution stands out by offering a versatile framework that adeptly handles multiple languages and taxonomies, effectively addressing the limitations of supervised methods tied to dataset size, language specificity, and taxonomy constraints. Our model excels by leveraging the generative prowess of Large Language Models for occupation coding, offering an innovative edge over existing supervised and unsupervised methods.

Future Work

Looking forward, our team is committed to continuous refinement of the LLM models to enhance domain-specific accuracy and task-oriented capabilities. We believe that our approach not only surpasses the capabilities of rule-based and machine learning algorithms but also marks a significant advancement by employing state-of-the-art LLMs in a way that previous efforts have not, particularly in the nuanced domain of occupation standardization and skill extraction.

List of References 

[1] Bethmann, A., Schierholz, M., Wenzig, K., Zielonka, M.: Automatic coding of occupations using machine learning algorithms for occupation coding in several German panel surveys. In: Beyond traditional survey taking. Adapting to a changing world. Proceedings of Statistics Canada Symposium (2014)

[2] Boselli, R., Cesarini, M., Marrara, S., Mercorio, F., Mezzanzanica, M., Pasi, G., Viviani, M.: Wolmis: A labor market intelligence system for classifying web job vacancies. Journal of intelligent information systems 51, 477–502 (2018) 

[3] Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Using machine learning for labour market intelligence. In: Altun, Y., Das, K., Mielikäinen, T., Malerba, D., Stefanowski, J., Read, J., Žitnik, M., Ceci, M., Džeroski, S. (eds.) Machine Learning and Knowledge Discovery in Databases. pp. 330–342. Springer International Publishing, Cham (2017)

[4] Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Classifying on-line job advertisements through machine learning. Future Generation Computer Systems 86, 319–328 (2018)

[5] Chernova, M.: Occupational skills extraction with FinBERT. Master’s thesis (2020)   

[6] Colombo, E., Mercorio, F., Mezzanzanica, M.: Ai meets labor market: Exploring the link between automation and skills. Information Economics and Policy 47, 27–37 (2019) 

[7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 

[10] Gurcan, F., Cagiltay, N.E.: Big data software engineering: Analysis of knowledge domains and skill sets using lda-based topic modeling. IEEE access 7, 82541–82552 (2019) 

[14] Jones, R., Elias, P.: Cascot: Computer-assisted structured coding tool. Institute for Employment Research, Coventry, University of Warwick (2023). <https://warwick.ac.uk/fac/soc/ier/research-tools> 

[15] Kouretsis, A., Bampouris, A., Morfiris, P., Papageorgiou, K.: Labourr: classify multilingual labour market free-text to standardized hierarchical occupations (2020). <https://cran.r-project.org/web/packages/labourR/index.html>

[16] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

[17] Mukherjee, S., Widmark, D., DiMascio, V., Oates, T.: Determining standard occupational classification codes from job descriptions in immigration petitions. In: 2021 International Conference on Data Mining Workshops (ICDMW). pp. 647–652. IEEE (2021) 

[18] Russ, D.E., Ho, K.Y., Colt, J.S., Armenti, K.R., Baris, D., Chow, W.H., Davis, F., Johnson, A., Purdue, M.P., Karagas, M.R., et al.: Computer based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occupational and environmental medicine 73(6), 417–424 (2016)

[20] Varelas, G., Lagios, D., Ntouroukis, S., Zervas, P., Parsons, K., Tzimas, G.: Employing natural language processing techniques for online job vacancies classification. In: IFIP International Conference on Artificial Intelligence Applications and Innovations. pp. 333–344. Springer (2022) 

[21] Vermeer, N., Provatorova, V., Graus, D., Rajapakse, T., Mesbah, S.: Using robbert and extreme multi-label classification to extract implicit and explicit skills from Dutch job descriptions (2020)

[22] Wan, W., Ge, C.B., Friesen, M.C., Locke, S.J., Russ, D.E., Burstyn, I., Baker, C.J., Adisesh, A., Lan, Q., Rothman, N., et al.: Automated coding of job descriptions from a general population study: overview of existing tools, their application and comparison. Annals of Work Exposures and Health 67(5), 663–672 (2023) 

[23] Zhang, M., Jensen, K.N., Plank, B.: Kompetencer: Fine-grained skill classification in Danish job postings via distant supervision and transfer learning. arXiv preprint arXiv:2205.01381 (2022)
