
You may have heard that many Japanese companies are struggling with a labor shortage caused by the declining birthrate and aging population, and that recruiting core personnel is especially difficult. By some accounts, about 70% of companies face this issue. At the same time, HR departments have limited resources and often cannot devote enough time and effort to the recruitment process, interviews in particular.
Our team started from two questions: "Can we solve this problem with an AI agent?" and "Can we offer a natural interview experience that feels like a real conversation with a human being rather than a chatbot?" That is how the idea of an "AI interviewer" was born. Our goal was a solution that automates the recruitment interview, reducing the burden on HR staff while giving candidates a high-quality interview experience.

The "AI interviewer" we developed has three main functions.
1. A personalized interviewer that looks and sounds like the real person: Simply upload a face photo and voice data of a company representative, and a realistic avatar resembling that person appears as the interviewer. The voice is reproduced with voice cloning, and we aimed for a natural sound, as if the person were really speaking. The interviewer's speaking style and atmosphere can also be customized to match the company culture.
2. Customizable interview content: Each company has different questions to ask and different points to evaluate. With this system, you can configure details such as question sets, evaluation criteria, and even the interviewer's background and job title. This allows the AI to ask questions that fit the company's needs and to dig deeper into the candidate's answers.
3. Automatic objective evaluation: After the interview, the AI analyzes the candidate's answers and speaking style and automatically generates an evaluation report based on the preset evaluation criteria. The report supports decisions such as "Should this candidate proceed to the next selection step?", with the aim of a more objective and fair evaluation.
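
To make the customization and evaluation functions above more concrete, here is a minimal sketch of how a per-company interview configuration and a criteria-based recommendation could be represented. It is an illustration, not our actual implementation: the class names, the 1-to-5 rating scale, the weights, and the pass threshold are all assumptions, and the per-criterion ratings are assumed to come from the AI's analysis of the interview.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float  # relative importance of this criterion (hypothetical)


@dataclass
class InterviewConfig:
    interviewer_title: str
    questions: list[str]
    criteria: list[Criterion]
    pass_threshold: float = 3.5  # hypothetical cutoff on a 1-5 scale


def weighted_score(config: InterviewConfig, ratings: dict[str, float]) -> float:
    """Combine per-criterion ratings (1-5) into one weighted average."""
    total_weight = sum(c.weight for c in config.criteria)
    return sum(c.weight * ratings[c.name] for c in config.criteria) / total_weight


def recommend_next_step(config: InterviewConfig, ratings: dict[str, float]) -> bool:
    """Return True if the candidate meets the configured threshold."""
    return weighted_score(config, ratings) >= config.pass_threshold


config = InterviewConfig(
    interviewer_title="Engineering Manager",
    questions=["Tell us about a project you led.", "How do you handle setbacks?"],
    criteria=[Criterion("communication", 2.0), Criterion("problem_solving", 1.0)],
)
# Ratings would be produced by the AI's analysis; these values are made up.
ratings = {"communication": 4.0, "problem_solving": 3.0}
```

Keeping the criteria and threshold in the configuration rather than in code is what would let each company adjust the evaluation without touching the scoring logic.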

In the process of turning the idea into reality, we faced several major technical challenges. In particular, we went through a lot of trial and error in creating a natural, real-time conversation experience.

・Challenge: Tempo matters in an interview. If the AI takes too long to respond after the candidate finishes speaking, the conversation immediately feels unnatural. We needed to link speech recognition (STT), response generation by an LLM, and speech synthesis (TTS) smoothly enough to get close to the real-time feel of human-to-human conversation.
・Approach: For speech recognition, we first tried the Nova-2 STT API, which has a reputation for high accuracy in recognizing Japanese. The key was speeding up response generation and voice synthesis, so we combined OpenAI's Realtime API (gpt-4o-realtime-preview), which specializes in real-time processing, with WebRTC/WebSocket. This lets response audio be generated almost in parallel with voice input, with the aim of minimizing conversational delay. During the hackathon, fighting this delay was probably the most nerve-wracking part (laughs).
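
The general idea behind reducing this delay is to stream between stages instead of waiting for each one to finish. The sketch below illustrates that idea only: it does not call the Realtime API or any STT/TTS service, and `llm_tokens` and `tts_sentences` are hypothetical stand-ins. Each sentence is handed to "speech synthesis" as soon as it is complete, while the rest of the reply is still being generated.

```python
import asyncio
from typing import AsyncIterator


async def llm_tokens(reply: str, log: list) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields a reply word by word."""
    for token in reply.split():
        await asyncio.sleep(0)  # yield control, as a network read would
        log.append(("llm", token))
        yield token
    log.append(("llm", "<done>"))


async def tts_sentences(tokens: AsyncIterator[str], log: list) -> AsyncIterator[str]:
    """Hypothetical stand-in for TTS: emits each sentence once it is complete."""
    buffer = []
    async for token in tokens:
        buffer.append(token)
        if token.endswith((".", "?", "!")):
            sentence = " ".join(buffer)
            buffer.clear()
            log.append(("tts", sentence))
            yield sentence
    if buffer:  # flush trailing text that lacks final punctuation
        sentence = " ".join(buffer)
        log.append(("tts", sentence))
        yield sentence


async def run_turn(reply: str):
    """Run one conversational turn and return the audio chunks plus an event log."""
    log = []
    audio = [chunk async for chunk in tts_sentences(llm_tokens(reply, log), log)]
    return audio, log


if __name__ == "__main__":
    chunks, events = asyncio.run(run_turn("Thank you. Could you elaborate?"))
    for event in events:
        print(event)
```

In the event log, the first sentence reaches the TTS stage before the LLM stand-in has produced the second one, which is the overlap that keeps the pause after the candidate stops speaking short.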
・Challenge: Rather than simply reading text aloud, we needed to reproduce the nuances of a real interviewer's voice quality and speaking style and produce a natural voice that carries emotion.
・Approach: ElevenLabs' TTS API was attractive because it allows high-quality voice cloning from only a small number of voice samples. However, the cloned voice sometimes had unnatural intonation, so we tuned the API parameters (such as the stability and clarity settings) through repeated trial and error to get a more human-like, natural tone. Expressing specific emotions was particularly difficult, for example a slightly surprised tone or a gentle, reassuring one.
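
The kind of parameter tuning described above could be organized roughly as follows. This is a sketch under assumptions, not the exact code we ran: the endpoint path, the `xi-api-key` header, and the `voice_settings` field names reflect our understanding of ElevenLabs' REST API and should be checked against the current documentation. We also assume that the "clarity" setting corresponds to `similarity_boost`. The voice ID, model ID, and candidate values are placeholders. The code only builds request payloads and sends nothing.

```python
import itertools
import json

# Assumed base URL for the text-to-speech endpoint; verify against the docs.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(voice_id, text, stability, similarity_boost,
                      api_key="YOUR_API_KEY"):
    """Build (but do not send) a TTS request with the given voice settings."""
    for name, value in (("stability", stability),
                        ("similarity_boost", similarity_boost)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return {
        "url": f"{API_BASE}/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "body": json.dumps({
            "text": text,
            "model_id": "eleven_multilingual_v2",  # placeholder model choice
            "voice_settings": {
                "stability": stability,
                "similarity_boost": similarity_boost,
            },
        }),
    }


def settings_grid(stabilities, similarities):
    """Enumerate every (stability, similarity_boost) pair to audition."""
    return list(itertools.product(stabilities, similarities))


# Hypothetical sweep: one request per setting pair, for listening comparison.
candidates = [
    build_tts_request("VOICE_ID", "Thank you for coming today.", s, b)
    for s, b in settings_grid([0.3, 0.5], [0.7, 0.9])
]
```

Generating the same sentence across a small grid and listening to the results side by side makes it easier to hear which settings flatten the intonation and which keep it lively.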