This job has expired

April 9, 2026

C2C

mike@staffxpertllc.com

We are seeking an experienced GenAI Engineer to design and implement on-premise LLM and vector database solutions. The ideal candidate will have hands-on expertise in deploying open-source large language models and building scalable Retrieval-Augmented Generation (RAG) pipelines in secure, enterprise environments.

Key Responsibilities

Deploy and optimize open-source LLMs such as Llama 3 and Mistral / Mixtral in on-prem or private cloud environments
Develop, test, and integrate LLM applications using Python (prompt engineering, inference workflows)
Implement and optimize CPU-based inference, including model quantization and performance tuning
Design and build scalable RAG pipelines using vector databases
Manage embeddings, indexing, and metadata filtering strategies

Required Skills & Experience

Strong Python programming skills for AI/ML application development
Hands-on experience with vector databases such as Qdrant, Chroma, Milvus, or pgvector
Proven experience building and deploying Retrieval-Augmented Generation (RAG) solutions
Solid understanding of embedding generation and semantic search techniques
Experience working in secure, air-gapped, or enterprise-controlled environments
Knowledge of access control, audit logging, and data governance

Nice to Have

Experience with LangChain or LlamaIndex
Familiarity with Docker and Kubernetes for containerized deployments
Exposure to high-performance languages like Rust, Go, or C++
Experience with inference frameworks such as vLLM, llama.cpp, or Hugging Face Transformers
Background in regulated or enterprise environments

Key Deliverables

Define a reference architecture for on-prem LLM + vector DB solutions
Build a working prototype (LLM + vector DB + RAG pipeline)
Provide clear documentation and knowledge transfer to internal teams