Project Description :
I am looking to hire an experienced NLP / ML engineer to train high-quality machine translation models for Indic languages. The goal is to develop single language-pair models, such as :
- English-Telugu
- English-Hindi
(and additional language pairs, if needed)
You may choose the most suitable model architecture based on your expertise (e.g., mBART, mT5, NLLB fine-tuning, Transformer variants, etc.), as long as the final models deliver strong translation quality.
Dataset :
You can use the AI4Bharat datasets, including :
- Samanantar
- BPCC
- Other open Indic parallel corpora
Scope of Work :
The freelancer will be responsible for :
Data Handling :
- Cleaning, filtering, and preprocessing datasets
- Sentence alignment (if needed)
- Tokenization and vocabulary preparation (SentencePiece / BPE / etc.)
Model Training :
- Selecting an appropriate model architecture
- Training single language-pair translation models
- Implementing best practices for training efficiency (FP16, gradient accumulation, etc.)
- Hyperparameter tuning
- Checkpoint management and monitoring
Evaluation :
- Computing BLEU, SacreBLEU, and other relevant metrics
- Providing side-by-side qualitative translation samples
- Benchmarking against baseline models
Delivery :
- Final trained model weights
- Inference scripts (Python) for quick testing
- Instructions for running and continuing training
- Documentation of preprocessing and training pipeline
- Optional : Dockerfile or virtual environment setup
Requirements :
- Strong experience in NLP, Transformers, and neural MT models
- Prior work with Indic languages (big plus)
- Experience with training libraries such as PyTorch, Hugging Face Transformers, Fairseq, OpenNMT, or similar
- Ability to handle large-scale training and dataset preprocessing
- Familiarity with SentencePiece, tokenization strategies, and MT evaluation metrics
- Ability to deliver clean, well-documented code
Additional Notes :
- Compute resources can be discussed (I can provide compute, or you can use yours).
- More language pairs may be added later as separate follow-up projects.
- Quality of translation is the highest priority.
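Illustration (data filtering) :
To give a rough idea of the cleaning and filtering step listed under Data Handling, here is a minimal sketch using only the Python standard library. It is not a required implementation. The function names, the thresholds (`max_ratio`, `min_script`, `max_len`), and the choice of heuristics (deduplication, word-length ratio, target-script check) are my own assumptions for illustration; the actual pipeline is up to you.

```python
import unicodedata

# Unicode block ranges for the target scripts (assumed: Devanagari for Hindi, Telugu block for Telugu).
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),
    "te": (0x0C00, 0x0C7F),
}


def script_fraction(text, lang):
    """Fraction of alphabetic characters in `text` that lie in the script block for `lang`."""
    lo, hi = SCRIPT_RANGES[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if lo <= ord(ch) <= hi)
    return in_script / len(letters)


def keep_pair(src, tgt, lang, max_ratio=3.0, min_script=0.5, max_len=200):
    """Heuristic check for one sentence pair; all thresholds are illustrative assumptions."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False  # drop pairs with an empty side
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src > max_len or n_tgt > max_len:
        return False  # drop very long segments
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return False  # drop pairs with a suspicious length mismatch
    # Drop pairs whose target side is mostly not in the expected script.
    return script_fraction(tgt, lang) >= min_script


def filter_corpus(pairs, lang):
    """Normalize to NFC, remove exact duplicates, and apply keep_pair to each (src, tgt) pair."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (
            unicodedata.normalize("NFC", src.strip()),
            unicodedata.normalize("NFC", tgt.strip()),
        )
        if key in seen or not keep_pair(key[0], key[1], lang):
            continue
        seen.add(key)
        kept.append(key)
    return kept
```

For example, `filter_corpus(pairs, "hi")` on a list of (English, Hindi) tuples would drop exact duplicates, pairs with an empty side, pairs whose target is untranslated English, and pairs with a large length mismatch. A real pipeline would likely add language identification and alignment scoring on top of this.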