Hi!
With iOS 12, Apple released a new framework for language recognition and other interesting stuff. It is called Natural Language, and NLLanguageRecognizer is one of its classes.
Use the Natural Language framework to perform tasks like language and script identification, tokenization, lemmatization, parts-of-speech tagging, and named entity recognition. You can also use this framework with Create ML to train and deploy custom natural language models.
This framework provides a high-level API for many text analysis features, starting with language detection.
Let’s see an example:
こんにちは、私はアルベルトです、私はイタリアに住んでいて、私は不明な言語で書いています。
- Which language is this?
- How many words does this phrase contain?
- Are there names inside? Places? Company names?
Who knows. Not me.
Answers:
- JA, or better JAPANESE
- 23 words in this phrase
- The English translation is “Hi, I’m Alberto, I live in Italy and I write in an unknown language.”, so yes, there is a name and a place inside.
Let’s try the new framework shared by iOS and macOS, NaturalLanguage, to see how it works.
Examine this phrase:
let string = "Ciao, sono Alberto, vivo a Bergamo e scrivo in an unknown language. Mi piace la CocaCola."
It’s mixed, Italian and English, so we can use it as a good example.
DETECTING LANGUAGE(s)
import NaturalLanguage

let string = "Ciao, sono Alberto, vivo a Bergamo e scrivo in an unknown language. Mi piace la CocaCola."

// create a new recognizer
let languageRecognizer = NLLanguageRecognizer()

// that should read your string
languageRecognizer.processString(string)

// get the most likely language hypotheses (at most 2)
let hypotheses = languageRecognizer.languageHypotheses(withMaximum: 2)

// get the dominant language of the phrase ("und" if none could be determined)
let language = languageRecognizer.dominantLanguage?.rawValue ?? "und"

print("First language is : \(language)")
print("Other languages are: \(hypotheses)")
output in console:
First language is : it
Other languages are: [__C.NLLanguage(_rawValue: it): 0.9752411842346191, __C.NLLanguage(_rawValue: en): 0.009950380772352219]
We receive the languages and a confidence value for each one, expressed as a probability between 0 and 1. Good. Italian scores about 0.975, that is roughly 97.5%, so we can trust the algorithm.
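Since the hypotheses come back as an unordered dictionary of probabilities, it can help to sort them and show them as percentages. This is only a minimal sketch: `describe` is a hypothetical helper name of mine, not part of the framework, and rounding to whole percentages is an arbitrary choice.

```swift
import NaturalLanguage

// Hypothetical helper (not part of NaturalLanguage): sorts the hypotheses
// by confidence, highest first, and formats each one as a whole percentage.
func describe(_ hypotheses: [NLLanguage: Double]) -> [String] {
    return hypotheses
        .sorted { $0.value > $1.value }
        .map { entry in
            let percent = Int((entry.value * 100).rounded())
            return "\(entry.key.rawValue): \(percent)%"
        }
}

let recognizer = NLLanguageRecognizer()
recognizer.processString("Ciao, sono Alberto, vivo a Bergamo e scrivo in an unknown language.")

// Print one line per hypothesis, most likely language first.
for line in describe(recognizer.languageHypotheses(withMaximum: 2)) {
    print(line)
}
```

Applied to the two values printed above, the helper would give "it: 98%" and "en: 1%".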
TOKENIZE A TEXT:
Let’s count the words (or the sentences, the paragraphs, or the whole document…) using NLTokenizer:
// create a new tokenizer
// choose your unit (word, sentence, paragraph, document)
let tokenizer = NLTokenizer(unit: .word)

// set your language (or use the discovered one: NLLanguage(language))
tokenizer.setLanguage(.italian)

// link your string
tokenizer.string = string

// get tokens
let tokens = tokenizer.tokens(for: string.startIndex..<string.endIndex)

print("Words: \(tokens.count)") // Words: 12
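The tokenizer returns ranges into the string, not the words themselves, so to look at the actual tokens you have to slice the original text. A small sketch of that step follows; `words(in:language:)` is a hypothetical helper of mine and the sample sentence is just an excerpt of the string used above.

```swift
import NaturalLanguage

// Hypothetical helper: returns the word tokens of a text as plain strings.
func words(in text: String, language: NLLanguage? = nil) -> [String] {
    let tokenizer = NLTokenizer(unit: .word)
    // Optionally tell the tokenizer which language to expect.
    if let language = language {
        tokenizer.setLanguage(language)
    }
    tokenizer.string = text
    // tokens(for:) returns ranges; slice the text to get each word.
    return tokenizer.tokens(for: text.startIndex..<text.endIndex).map { String(text[$0]) }
}

let found = words(in: "Mi piace la CocaCola.", language: .italian)
print("Words: \(found.count)")
print(found)
```

Printing the tokens next to the count is a quick way to check what the tokenizer really considers a word.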
EXTRACT pieces of information:
Another cool feature is tagging: with NLTagger you can extract tagged information such as names of people, places and organizations.
Let’s see how:
// create a tagger
let tagger = NLTagger(tagSchemes: [.nameType])

// set the text
tagger.string = string

// select the options
let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .omitOther, .joinNames]

// and the tags to extract
let tags: [NLTag] = [
    .personalName,
    .placeName,
    .organizationName
    // and much more...
]

// enumerate the tags and print the ones we asked for
tagger.enumerateTags(in: string.startIndex..<string.endIndex, unit: .word, scheme: .nameType, options: options) { tag, tokenRange in
    if let tag = tag, tags.contains(tag) {
        print("\(tag.rawValue) -> \(string[tokenRange])")
    }
    return true
}
The result is nice. With mixed languages something strange happens (“an unknown language” gets tagged as an organization), but it’s OK.
PersonalName -> Alberto
PlaceName -> Bergamo
OrganizationName -> an unknown language
OrganizationName -> CocaCola
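If you want to use the entities and not just print them, one option is to collect them in a dictionary keyed by tag. This is a sketch under my own assumptions: `namedEntities(in:)` is a hypothetical helper name, and using the tag’s raw value as the key is only one possible design.

```swift
import NaturalLanguage

// Hypothetical helper: groups the named entities of a text by tag raw value,
// for example "PersonalName", "PlaceName" or "OrganizationName".
func namedEntities(in text: String) -> [String: [String]] {
    let tagger = NLTagger(tagSchemes: [.nameType])
    tagger.string = text

    let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace, .omitOther, .joinNames]
    let wanted: [NLTag] = [.personalName, .placeName, .organizationName]
    var result: [String: [String]] = [:]

    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .nameType,
                         options: options) { tag, range in
        // Keep only the entity types we asked for.
        if let tag = tag, wanted.contains(tag) {
            result[tag.rawValue, default: []].append(String(text[range]))
        }
        return true // keep enumerating
    }
    return result
}

print(namedEntities(in: "Ciao, sono Alberto, vivo a Bergamo. Mi piace la CocaCola."))
```

What ends up under each key depends on the model shipped with the OS, so treat the result as a suggestion and not as ground truth.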
EXTRA
Now that you know the language of the phrase, you can easily have the text spoken aloud in the correct language using AVSpeechSynthesizer!
import AVFoundation

// the synthesizer that will speak the text
let speechSynthesizer = AVSpeechSynthesizer()

let speechUtterance = AVSpeechUtterance(string: string)
//speechUtterance.rate = 0.7
speechUtterance.volume = 1.0

// set your discovered language
speechUtterance.voice = AVSpeechSynthesisVoice(language: language)

speechSynthesizer.speak(speechUtterance)
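AVSpeechSynthesisVoice(language:) returns nil when no voice is available for the given code, so a slightly more defensive version sets the voice only when one is found. This is a sketch, not the only way to do it: `makeUtterance(for:)` is a hypothetical helper of mine, and leaving the voice unset so the system default is used is my assumption about a sensible fallback.

```swift
import AVFoundation
import NaturalLanguage

// Hypothetical helper: builds an utterance and, when possible, picks a voice
// that matches the dominant language detected in the text.
func makeUtterance(for text: String) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    if let code = NLLanguageRecognizer.dominantLanguage(for: text)?.rawValue,
       let voice = AVSpeechSynthesisVoice(language: code) {
        utterance.voice = voice
    }
    // Otherwise the voice is left unset and the system default is used.
    return utterance
}

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(makeUtterance(for: "Ciao, sono Alberto, vivo a Bergamo."))
```

Keep the synthesizer alive while it is speaking (for example as a property of a view controller), otherwise the speech may be cut off.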
Instead of using these old techniques that make me laugh now… 😉
And that’s all for now.
Go deeper into this framework, because it is very interesting.