Research Engineer

Vishal Yadav

Building the infrastructure that tells you when AI is wrong - and working to understand why.


Published Research · Explainable AI · AI Safety · London

I design and evaluate AI systems with a focus on measurable safety, reliability, and bias reduction.

Current focus

Making AI measurably safer through better evaluation, observability, and bias mitigation.

I build evaluation frameworks and observability infrastructure for deployed AI systems. At Arva AI, I redesigned benchmarking pipelines and built end-to-end agentic tracing - replacing unstructured logs with structured visibility across the full agent lifecycle.

My research background is in clinical AI - specifically demographic bias in pediatric mental health records. That work, developed during two years as a Research Assistant at Queen Mary University of London, was published in Nature (2026).

Open to research roles in AI safety, evaluation, and interpretability.

Selected work on clinical AI bias, evaluation methodology, and research communication.


Privacy-Preserving Behaviour of Chatbot Users: Steering Through Trust Dynamics

arXiv · November 2024

Preprint

Traceability Solution for SMEs

IJSREM · June 2021

Peer reviewed

Deep Neural Network Compiler

IRISS 2020 · March 2020

Conference

Selected Talks & Presentations

2024 - Poster, AI4H Conference, Italy - Bias mitigation for pediatric EHR notes

2024 - Alan Turing Data Science Conference - EHR bias research

2023 - ICRA 2023 - Volunteer

2022 - Intelligent Sensing Winter School, QMUL - Explainable AI in Computer Vision

2020 - IRISS, IIT Gandhinagar - Deep Neural Network Compiler

Roles focused on AI evaluation infrastructure, safety research, and applied model quality.

Mar 16 2026 / Mar 21 2026

Technical AI Safety Course

BlueDot Impact · Remote

  • Completed an intensive technical program focused on practical AI safety concepts and risk-aware system design.
  • Worked through hands-on exercises covering evaluation, failure analysis, and mitigation-oriented thinking for modern AI systems.
  • Collaborated in discussion-based sessions to apply safety principles to real-world deployment scenarios.

Dec 2025 / Mar 2026

Research Engineer

Arva AI · London

  • Redesigned benchmarking framework - decomposed a flawed combined metric into factor-specific accuracy scores for verdict prediction and discounting.
  • Built full agentic observability with Langfuse - structured tracing of all agent calls across offline and online environments.
  • Developed complete evaluation infrastructure: custom evaluators, golden dataset, scoring logic, and construction guidelines.
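The factor-specific scoring idea above can be sketched in a few lines: instead of one combined number, each factor gets its own accuracy against a golden dataset. This is a minimal illustration only; the field names ("verdict", "discount") and data are hypothetical examples, not Arva AI's actual schema.

```python
def factor_accuracies(predictions, gold, factors):
    """Score each factor separately instead of one combined metric."""
    scores = {}
    for factor in factors:
        correct = sum(
            1 for pred, ref in zip(predictions, gold)
            if pred[factor] == ref[factor]
        )
        scores[factor] = correct / len(gold)
    return scores

# Hypothetical golden dataset and model outputs.
predictions = [
    {"verdict": "approve", "discount": 0.1},
    {"verdict": "reject",  "discount": 0.0},
    {"verdict": "approve", "discount": 0.2},
]
gold = [
    {"verdict": "approve", "discount": 0.1},
    {"verdict": "approve", "discount": 0.0},
    {"verdict": "approve", "discount": 0.1},
]

print(factor_accuracies(predictions, gold, ["verdict", "discount"]))
```

A per-factor breakdown like this makes it clear which part of the pipeline is failing, which a single blended score hides.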

Mar 2025 / Nov 2025

LLM Trainer & Evaluator

Mercor Intelligence · Remote

  • Designed complex scenarios to stress-test conversational AI - identifying edge cases and failure modes systematically.
  • Built evaluation rubrics quantifying performance across accuracy, contextual alignment, and user experience.

Nov 2023 / Feb 2025

Research Assistant

Queen Mary University of London · London

  • Developed bias detection and mitigation algorithms for clinical AI; published in Nature (2026).
  • Improved anxiety detection by 10% using time-series-based medical NER.
  • Co-developed a clinical AI platform with NHS DialogPlus on Azure GPU infrastructure.
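A simple form of the kind of audit that surfaces demographic bias is to compare model accuracy across groups and report the largest gap. The sketch below is illustrative only, with made-up data; it is not the published method.

```python
def group_accuracy_gap(records):
    """Per-group accuracy and the max pairwise gap between groups.

    records: iterable of (group, prediction, label) triples.
    """
    by_group = {}
    for group, pred, label in records:
        by_group.setdefault(group, []).append(pred == label)
    acc = {g: sum(hits) / len(hits) for g, hits in by_group.items()}
    gap = max(acc.values()) - min(acc.values())
    return acc, gap

# Hypothetical example: group A is scored less accurately than group B.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0),
    ("B", 1, 1), ("B", 1, 1),
]
acc, gap = group_accuracy_gap(records)
```

Tracking a gap like this over time is one way to check that a mitigation actually narrows the disparity rather than just shifting overall accuracy.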

Jul 2021 / Jul 2022

Product Engineer, AI Technology

AI Technology & Systems · California (Remote)

  • Built a Deep Neural Network Compiler using Eigen, ONNX, and Caffe for edge devices.
  • Supervised 45 interns building TinyML applications.

Nov 2019 / Mar 2021

Research Intern

Indian Institute of Technology · Indore

  • Developed AR applications and a traceability app for Android - published.
  • Contributed to an Intelligent AGV for smart manufacturing (Industry 4.0).

Practical builds spanning evaluation tooling, multimodal AI, and production-oriented research systems.

Multimodal Hate Speech Detection

Compared early, late, and cross-attention fusion across BERT, ViT, and VisualBERT for multimodal classification.

NLP · Multimodal · Transformers

May - Sep 2023
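As a toy contrast between two of the fusion strategies compared in this project: late fusion combines the outputs of independent unimodal models, while early fusion concatenates features before a single classifier. The numbers below are made up, standing in for real BERT/ViT outputs.

```python
def late_fusion(text_probs, image_probs, w=0.5):
    """Weighted average of per-class probabilities from unimodal models."""
    return [w * t + (1 - w) * i for t, i in zip(text_probs, image_probs)]

def early_fusion_features(text_feats, image_feats):
    """Concatenate features so one classifier sees both modalities."""
    return text_feats + image_feats

# Hypothetical two-class outputs (e.g. hateful vs. not-hateful).
text_probs = [0.8, 0.2]
image_probs = [0.4, 0.6]
fused = late_fusion(text_probs, image_probs)  # argmax picks class 0
```

Cross-attention fusion (as in VisualBERT) goes further by letting tokens from one modality attend to the other inside the model, rather than merging only at the feature or output level.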

Article Person Verification Agent

Autonomous adverse media screening using LangGraph and Gemini - with MLflow tracing and multilingual support across 6+ scripts.

LangGraph · MLflow · Gemini

2025

Brain Tumor Segmentation

3D MRI segmentation using NVIDIA deep learning libraries and SAM.

Medical AI · Computer Vision · PyTorch

Sep - Dec 2023

Traceability App for SMEs

End-to-end supply chain traceability; launched on Google Play Store.

Android · Industry 4.0

2021

DNN Compiler for Edge Devices

Compiled high-level DNN specs to optimised machine code for constrained hardware using C++ and Eigen.

C++ · Edge AI · ONNX

2020

Core stack for building, evaluating, and shipping reliable AI systems.

AI & Research

Evaluation Frameworks, Bias Mitigation, Benchmarking, Explainability (LIME, GradCAM), RLHF, Agentic AI, RAG, Generative AI, Reinforcement Learning, NLP, Computer Vision, Deep Learning

Infra & MLOps

Langfuse, MLflow, Vertex AI, AWS, Azure, Terraform, CUDA, GPU Optimisation, Docker

Languages

Python, C/C++, SQL, TypeScript, Bash, R

Frameworks

PyTorch, TensorFlow, HuggingFace, Scikit-Learn, SpaCy, OpenCV, NumPy, Pandas, Playwright

Databases

PostgreSQL, MongoDB, Firebase, Redis

Let's talk.

I'm open to research collaborations, full-time roles in AI safety and evaluation, and conversations about the field. Best reached by email.

visdav8@gmail.com
Contact