1) HR Round: Pre-Screening.
2) 1-on-1 with Engineer Round: Open-ended discussion of the current state of AI (Screening).
3) Live PyTorch Coding Round: Implement Multi-Head Self-Attention from scratch (batched, etc.) + Causal Mask.
4) Personal Project & Quiz Round: Present a Personal Project (or Personal Research) + Quiz on LLM fundamentals and Scaling.
5) Pair Programming Round: Work with one of their engineers to fix a bug.
6) Cultural Fit Round.
TL;DR: Avoid if you value your time. The process dragged on for over a month across 5–6 rounds, with no feedback between rounds (despite repeated requests), and ended with a silent rejection (no-reply server). This level of communication is unacceptable for such a lengthy process, especially from such a small start-up that wants to be the AI leader in Europe (!).
Note to international applicants: Mistral does a lot of consultancy work, i.e., repurposing and retraining smaller LLMs (1-3B parameters) for various downstream tasks, e.g., for clients in automotive, finance, etc. Be very cautious: there may be hidden requirements for French fluency down the line in order to communicate with local clients, which could make it difficult to progress career-wise.
Interview tips:
For the live coding round, make sure you can efficiently implement from scratch (PyTorch only) all the fundamental transformer modules, e.g., MHA, GQA, MQA, self/cross-attention, LayerNorm, RMSNorm, FFNs, positional embeddings (rotary, learnt, static), masking strategies, Mixture of Experts (MoE), etc., with possible twists.
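As a reference point for that round, here is a minimal sketch of batched multi-head self-attention with a causal mask in plain PyTorch. The class name `CausalSelfAttention`, the fused QKV projection, and the absence of dropout and biases are my own choices for illustration, not what the interviewers expect verbatim.

```python
import math
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Batched multi-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One fused projection for Q, K and V (a design choice, not a requirement)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, d_head)
        q, k, v = (
            z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            for z in (q, k, v)
        )
        # Scaled dot-product scores, shape (b, heads, t, t)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # True strictly above the diagonal marks future positions
        future = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
        weights = scores.softmax(dim=-1)
        ctx = weights @ v  # (b, heads, t, d_head)
        # Merge heads back: (b, heads, t, d_head) -> (b, t, d)
        ctx = ctx.transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)
```

A quick sanity check for causality is to perturb the last few tokens of the input and confirm that the outputs at earlier positions do not change.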
For the pair programming round, the task was to debug an issue with pre-norm in a transformer block with residuals (fairly straightforward).
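I won't reproduce the actual bug, but for orientation, a correct pre-norm block looks roughly like the sketch below. The name `PreNormBlock` and the use of `nn.MultiheadAttention`, `nn.LayerNorm` and a GELU FFN are assumptions for illustration. A typical mistake in this kind of code (not necessarily the one from the interview) is adding the sublayer output to the normalized tensor instead of the raw input.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Pre-norm transformer block: x + Sublayer(Norm(x)) (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize only the sublayer input ...
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # ... and add the result to the un-normalized residual stream
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x
```

One way to check the residual path: if the output projections of both sublayers are zeroed, a pre-norm block must return its input unchanged.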
For the Quiz round, focus on the 'why'. You should be able to talk and reason in depth about everything mentioned above and all of its variations, and to provide geometric and algebraic explanations and intuitions (although the latter might not be appreciated that much). Additionally, you need to know the practicalities of training and serving LLMs at scale, such as KV-caching, flash attention, pre-training, fine-tuning, alignment, RLHF, etc. For the scaling part, read the Hugging Face blog post (The Ultra Scale Playbook). They will ask you 3-4 questions from it about FSDP, ZeRO-1/2/3, tensor, pipeline and data parallelism, computation-communication overlap, etc., so make sure you understand these concepts very well.
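To give a flavour of the scaling questions, here is a rough back-of-the-envelope sketch of per-GPU memory for model states under ZeRO-1/2/3. It assumes mixed-precision Adam with the byte counts from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations, buffers and fragmentation. The function name `zero_memory_gb` is hypothetical.

```python
def zero_memory_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for model states under ZeRO.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states sharded across GPUs
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients and parameters sharded
    """
    params = 2 * n_params   # fp16 weights
    grads = 2 * n_params    # fp16 gradients
    optim = 12 * n_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Example: a 7.5B-parameter model trained on 64 GPUs
    for stage in range(4):
        print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this gives about 120 GB with everything replicated, 31.4 GB with ZeRO-1, 16.6 GB with ZeRO-2 and 1.9 GB with ZeRO-3, which is the kind of arithmetic you should be able to do on the spot.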
Salary range in EUR: 75k-100k, based in Paris.