Let me ask you this, Edem: do you think an independently trained LLM in Africa (or other places) could then be used to improve existing LLMs like ChatGPT, Bard, etc where "beautiful" is too often a synonym for "white"? In other words, could a much better, more comprehensive system result from these efforts?
Also, LOL @ "Goatesque"
Haha, I'm glad you like Goatesque man.
Regarding your LLM question, I think it could. Technically speaking, LLMs trained on data centred on different regions could be brought together using transfer learning to form an ensemble model.
But resource-wise, I think a smarter plan would be to train the model on a well-distributed dataset that reflects the population distribution.
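To make the "well distributed dataset" idea a bit more concrete, here is a rough sketch of one way it could be done: sample training examples so that each region's share of the dataset tracks its share of the population. The region names, population shares, and the `proportional_sample` helper are all made up for illustration, not taken from any real dataset or library.

```python
import random
from collections import Counter


def proportional_sample(examples_by_region, population_share, total, seed=0):
    """Draw roughly `total` examples so each region's share of the sample
    tracks its share of the population (hypothetical helper)."""
    rng = random.Random(seed)
    sample = []
    for region, share in population_share.items():
        pool = examples_by_region.get(region, [])
        quota = round(total * share)
        # Never ask for more examples than the region actually has.
        k = min(quota, len(pool))
        sample.extend((region, ex) for ex in rng.sample(pool, k))
    return sample


# Hypothetical corpus: region_c has far less data than the others.
examples_by_region = {
    "region_a": [f"a{i}" for i in range(1000)],
    "region_b": [f"b{i}" for i in range(1000)],
    "region_c": [f"c{i}" for i in range(50)],
}
# Hypothetical population shares (they sum to 1).
population_share = {"region_a": 0.5, "region_b": 0.3, "region_c": 0.2}

sample = proportional_sample(examples_by_region, population_share, total=100)
counts = Counter(region for region, _ in sample)
```

The catch is visible in the `min(quota, len(pool))` line: if a region has less data than its quota, it stays under-represented, which is exactly the data gap the post is about.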
Nice. I am 100% rooting for you to be successful in getting ideas like this out into the world! I see that as a huge necessary component of success in any venture like this - lots of people have to understand the idea and get excited by it.
Haha, you're an incredible human being man.
Thanks, man! I feel the same about you, and frankly about the little community I've met here. These are some remarkable folks we're surrounded by.
True words man.
Very kind of you, Edem! Thanks for the mention! I hope this dramatic situation will be solved soon. Back in 2016 I was denied a grant because I was from Georgia. I know what you're talking about and I understand your POV.
You're welcome, Nat, it was an incredible piece. I'm really shocked about your Georgian exclusion; I was of the opinion that the US was a really inclusive place.
Definitely agree with the premise, and have been thinking about what it means when it comes to languages. What happens to Africa's incredible linguistic diversity if AI tools can only be used in English, or French, or a handful of national languages which have sufficient written usage to train or fine-tune a model? Developing ways for systems like those built around LLMs to learn minority languages should be a major focus of research imo
Yeah, you're right, Matt. Most language translation corpora contain data on popular languages and leave a large number of other languages under-represented. Ironically, the regions with under-represented languages require the most help and are most susceptible to the effects of language barriers (like inefficient knowledge distillation).
I was thinking this could perhaps be solved by creating an open source platform where under-represented languages could be modelled, trained on less data, and made available via an API?
Perhaps we could talk about this?
Interesting idea! I've also seen some studies on fine-tuning models in order to teach them new languages, with the basic linguistic structures learnt during initial training remaining the same. Would be happy to discuss!
Awesome! I'll send an email.
Or rather, please send me an email at: ekmedm@gmail.com
Thanks for the mention, Edem! Definitely agree Africa needs its own data - and in fact I think if you solve the data problem, Africa will catch up in no time. Data diversity is still an issue in some use cases in the US/UK too - so sorting this out early would let Africa leapfrog in certain aspects. A bit like how Africa leapfrogged into mobile and mobile payments.
Exactly man! I think if we can provide an African solution it could potentially allow Africa to lead the new AI renaissance.
I wholeheartedly agree with your analysis, Edem. The dearth of data can be primarily attributed to the absence of local ownership of online content. The majority of discussions among Africans on various platforms, be it in comments, posts, or threads, are stored by foreign entities. I firmly believe that this limitation in training data originates from this external storage. This realization motivated me to establish Vouchaah, a forum dedicated to Nigeria, aiming to address and rectify this gap.