GEFS Language Detector using Transformers at Hugging Face

Muhammad Imran Zaman
Apr 26, 2024

In a world with thousands of languages, being able to predict which language a piece of text is written in is nothing short of magic. Imagine a tool so sharp that it almost never misses, identifying languages with near-perfect precision. That's exactly what the GEFS-language-detector model does!

Why is language detection so important? It powers everything from translating websites to offering customer support in multiple languages. It's crucial for businesses aiming to connect with a global audience and for software that manages content across different languages.

This model has scored an F1 measure remarkably close to 100%. Its high performance comes from fine-tuning the robust "xlm-roberta-base" model on the "papluca/language-identification" dataset. This combination has turned the GEFS-language-detector into a top model that businesses and tech developers can rely on.
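For readers who want to reproduce a similar setup, the sketch below shows how such a fine-tune is typically wired up with the Hugging Face Trainer. It is not the author's training script: the dataset column names ("text" and "labels"), the four-language filter, and the hyperparameters are assumptions made for illustration.

# Minimal fine-tuning sketch, not the author's exact training script.
# Assumption: papluca/language-identification exposes "text" and "labels" columns.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

LANGS = ["de", "en", "fr", "es"]
label2id = {lang: i for i, lang in enumerate(LANGS)}
id2label = {i: lang for lang, i in label2id.items()}

# Keep only the four target languages from the dataset
dataset = load_dataset("papluca/language-identification")
dataset = dataset.filter(lambda row: row["labels"] in LANGS)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def preprocess(batch):
    encoded = tokenizer(batch["text"], truncation=True, max_length=128)
    encoded["label"] = [label2id[code] for code in batch["labels"]]
    return encoded

dataset = dataset.map(preprocess, batched=True,
                      remove_columns=dataset["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS),
    id2label=id2label, label2id=label2id)

args = TrainingArguments(
    output_dir="gefs-language-detector",
    num_train_epochs=10,               # matches the 10 epochs reported below
    per_device_train_batch_size=32,    # illustrative value
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  tokenizer=tokenizer)
trainer.train()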

This model can be tested using the Hugging Face Space https://huggingface.co/spaces/ImranzamanML/GEFS-Language-detector

Predicted output

The model returns the detected language as a language code, for example:

- de as German
- en as English
- fr as French
- es as Spanish

Supported languages

Currently the model supports 4 languages, and more will be added in the future.

The following languages are supported by the model (a small code-to-name helper sketch follows the list):

- German (de)
- English (en)
- French (fr)
- Spanish (es)
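If you need human-readable names instead of ISO codes, a small lookup table is enough. The helper below is illustrative only and not part of the model itself:

# Illustrative helper: map the model's ISO 639-1 codes to language names
LANGUAGE_NAMES = {
    "de": "German",
    "en": "English",
    "fr": "French",
    "es": "Spanish",
}

print(LANGUAGE_NAMES["de"])   # German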

Usage (Transformers)

Using this model is straightforward once the dependencies are installed; installing sentence-transformers pulls in everything that is needed, including transformers and torch.

pip install -U sentence-transformers

Use a pipeline as a high-level helper

from transformers import pipeline

# Example sentences in German, English, Spanish and French
text = ["Mir gefällt die Art und Weise, Sprachen zu erkennen",
        "I like the way to detect languages",
        "Me gusta la forma de detectar idiomas",
        "J'aime la façon de détecter les langues"]

# Load the language detector as a text-classification pipeline
pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector")

# Keep only the top prediction for each input
lang_detect = pipe(text, top_k=1)
print("The detected languages are", lang_detect)

Load the model directly using transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector")
model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector")
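To run inference without the pipeline helper, tokenize the input, take the argmax over the logits, and map the predicted id back to a language code via the model config. A minimal sketch continuing from the snippet above (assuming a PyTorch backend):

import torch

text = "I like the way to detect languages"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and look up the predicted language code
probs = torch.softmax(logits, dim=-1)
pred_id = int(probs.argmax(dim=-1))
print(model.config.id2label[pred_id], float(probs[0, pred_id]))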

Model Training Results

Epoch   Training Loss   Validation Loss
1       0.002600        0.000148
2       0.001000        0.000015
3       0.000000        0.000011
4       0.001800        0.000009
5       0.002700        0.000016
6       0.001600        0.000012
7       0.001300        0.000009
8       0.001200        0.000008
9       0.000900        0.000007
10      0.000900        0.000007

Testing Results

Language   Precision   Recall   F1       Accuracy
de         0.9997      0.9998   0.9998   0.9999
en         1.0000      1.0000   1.0000   1.0000
fr         0.9995      0.9996   0.9996   0.9996
es         0.9994      0.9996   0.9995   0.9996
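Per-language precision, recall, and F1 scores like these are typically computed from predictions on a held-out test split. The sketch below uses scikit-learn's classification_report and is not the author's evaluation script; y_true and y_pred are placeholders:

from sklearn.metrics import classification_report

# y_true: gold language codes from the test split
# y_pred: codes predicted by the model for the same texts
y_true = ["de", "en", "fr", "es"]   # placeholder data
y_pred = ["de", "en", "fr", "es"]   # placeholder data

print(classification_report(y_true, y_pred, digits=4))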

The model can be accessed on Hugging Face: https://huggingface.co/ImranzamanML/GEFS-language-detector

About Author

Muhammad Imran Zaman
Machine Learning Engineer