GEFS Language Detector using Transformers at Hugging Face
In a world with thousands of languages, being able to predict which language a piece of text is written in feels like magic. Imagine a tool so sharp it almost never misses, identifying languages with precision that nearly touches perfection. That's exactly what the GEFS-language-detector model does!
Why is language detection so important? It underpins everything from translating websites to offering customer support in multiple languages, and it's crucial for businesses aiming to reach a global audience and for software that manages multilingual content.
This model achieves an F1 score remarkably close to 100%. Its high performance comes from fine-tuning the robust xlm-roberta-base model on the papluca Language Identification dataset. This combination makes the GEFS-language-detector a dependable choice for businesses and developers who need top-tier language detection.
Predicted output
The model returns the detected language as a language code, for example:
- de as German
- en as English
- fr as French
- es as Spanish
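To turn these codes into human-readable names in downstream code, a small lookup is enough. A minimal sketch (the `code_to_name` helper is hypothetical, not part of the model's API):

```python
# Map the model's language label codes to display names.
LANG_NAMES = {"de": "German", "en": "English", "fr": "French", "es": "Spanish"}

def code_to_name(code: str) -> str:
    """Return the language name for a code, falling back to the code itself."""
    return LANG_NAMES.get(code, code)

print(code_to_name("de"))  # German
```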
Supported languages
The model currently supports 4 languages; more will be added in the future.
The following languages are supported:
- German (de)
- English (en)
- French (fr)
- Spanish (es)
Usage (Transformers)
Using this model is easy once you have the transformers library and torch installed:

```shell
pip install -U transformers torch
```
Use a pipeline as a high-level helper:

```python
from transformers import pipeline

text = ["Mir gefällt die Art und Weise, Sprachen zu erkennen",
        "I like the way to detect languages",
        "Me gusta la forma de detectar idiomas",
        "J'aime la façon de détecter les langues"]

pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector")
lang_detect = pipe(text, top_k=1)
print("The detected languages are", lang_detect)
```
Load the model directly with Transformers:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector")
model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector")
```
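When loading the model directly, the forward pass produces one raw logit per language, and the prediction is the highest-scoring class after a softmax. The decoding step can be sketched in plain Python; the logit values and the label ordering below are illustrative assumptions, not actual outputs of the model:

```python
import math

# Illustrative logits for one input, in an assumed label order (de, en, fr, es).
logits = [8.2, -1.3, -0.7, -2.1]
id2label = {0: "de", 1: "en", 2: "fr", 3: "es"}  # in practice: model.config.id2label

# Softmax turns logits into probabilities that sum to 1.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# The predicted language is the argmax over the probabilities.
pred = max(range(len(probs)), key=probs.__getitem__)
print(id2label[pred], round(probs[pred], 4))
```

This is exactly what the `text-classification` pipeline does internally when it reports a label and a score.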
Model Training Results

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.002600      | 0.000148        |
| 2     | 0.001000      | 0.000015        |
| 3     | 0.000000      | 0.000011        |
| 4     | 0.001800      | 0.000009        |
| 5     | 0.002700      | 0.000016        |
| 6     | 0.001600      | 0.000012        |
| 7     | 0.001300      | 0.000009        |
| 8     | 0.001200      | 0.000008        |
| 9     | 0.000900      | 0.000007        |
| 10    | 0.000900      | 0.000007        |
Testing Results

| Language | Precision | Recall | F1     | Accuracy |
|----------|-----------|--------|--------|----------|
| de       | 0.9997    | 0.9998 | 0.9998 | 0.9999   |
| en       | 1.0000    | 1.0000 | 1.0000 | 1.0000   |
| fr       | 0.9995    | 0.9996 | 0.9996 | 0.9996   |
| es       | 0.9994    | 0.9996 | 0.9995 | 0.9996   |
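The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the de row can be recomputed from its precision and recall (the tiny difference from the reported 0.9998 is due to rounding in the table's precision and recall values):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# de row from the table above
print(round(f1_score(0.9997, 0.9998), 5))
```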