
Internet giants are betting fortunes the size of Latvia’s or Slovenia’s GDP on large language AI models in pursuit of much bigger returns: no sovereign can afford to let corporate machinations dominate the evolution of humanity’s social code.

Open-Source AI: why it is the best way forward for Europe

by Thomas Hoberg

Key insights

  • The explosion of humanity was fueled by its ability to create ‘social code’, whose evolutionary pace and reach vastly exceed those of genetic or animal behavioral codes: shifting even parts of its evolution to computers may be as transformative as the invention of speech

  • The degree and range of our social code’s diversity is far greater than any individual can imagine. The quest for efficiency through scale can exert unwanted and dangerous homogenization pressure

  • Opening source code and sharing knowledge are among the most potent measures to release tension

  • The vast majority of humanity’s digitally accessible global knowledge base is coded in English; local sovereignty requires both making local-language content accessible to others (push) and making global content accessible in the local language (pull)

  • G42’s Open-Source AI initiative shows how to make progress in both directions with Arabic

  • Corporate AI giants are growing out of control; immediate countermeasures are needed to retain sovereignty

Key recommendations

  • Investigate G42’s JAIS project, because it could be the basis for a near-global collaboration or a blueprint for a similar EU initiative

  • Collaborating with partners that are furthest away from a Western ethos enables AIs to incorporate social code from the largest range of populations and then provide services to them, because it helps to eliminate the invasiveness of foreign social code

  • Giant AI models in foreign corporate hands are an ‘invasive species’ with potentially catastrophic consequences for local societies: they need to be countered via regulation prescribing ground-truth partitioning, open source, and interdiction of unauthorized data grabs

Evolving at the Speed of Code

Ever since Darwin had us not just come out of the trees, but with living cousins still left there, identifying what justifies our innate sense of human superiority has become ever more involved.

Using or fabricating tools isn’t even limited to our cousins; far more distant relations like ravens manage that. Even our sense of self, awareness of us vs. others, including empathy, which unfortunately always seems to require more effort than enmity, isn’t at all unique to our kind. It is found in ever more species that also develop culture, which emerges as soon as this mix of awareness, emotions and knowledge comes together and creates the motivation to pass it on to offspring as social code via some type of communication.

Its most recognizable manifestation is play, an instinct that has cubs running through simulations of a future where they need to take care of themselves and their charges and where they constantly hone, evolve and seek to improve skills and the social code they inherit from parents and peers. That play involves demonstrative teachers who double as immensely dedicated critics, giving very immediate, sometimes gentle, sometimes forceful feedback via anything from posture to pain: you don’t need to take their word to get the message and work that into your knowledge base!

And while those abilities build on a giant genetic base, it is the vastly different pace of social code evolution which accelerated the emergence of higher life forms and finally allows us to place a defining hurdle between us and our cousins: speech as a much more efficient means of communication than live demos.

While communication has been immeasurably important for humanity, we may not be able to claim exclusivity nor the lead. The jury is still out on whether whales or octopuses can in fact achieve much higher levels of information transfer: there are indications they might have been “videoconferencing” long before we started “texting”[1].

True breakthrough discoveries, with the potential of completely transforming the future of an individual or its group vs. the rest of the planet, have been made… again and again.

And then lost, as they never made it out of the individual’s head nor beyond immediate kin, because they could not be communicated or preserved beyond immediate reach.

It took eons and millennia to develop language and abstractions to the point where tales allowed reliving an entire experience. The ability to abstract was urgently needed to manage ever greater bodies of knowledge, and it grew into another set of cousins, religion and science, which are at ever greater odds and entangled in politics.

These huge tomes of social code soon exceeded average individual capacity and led to sages, shamans and minstrels specializing in their upkeep and dissemination. The ability to preserve them with fidelity outside a mortal brain, first in writing, then in printing, and more recently in a form where they continue activity and evolution on a machine, has each time brought orders-of-magnitude gains in how effectively social code can disperse and potentially reign supreme: the speed at which social code spreads is both a measure of and a cause for its success, which is always temporary, but tends to evolve most rapidly when scale provides room.

Code & Context

As it turns out, fidelity of recording over space and time can add as many issues as the precision lost in retelling, where critical truths, but also the more egregious nonsense, could fade by being accidentally or conveniently left out of the telling: we fail to recognize just how many assumptions from our everyday life we take for granted or for perennially true. Those are then neither recorded nor put into context, creating deadening strife instead of valuable synergies when scriptures are interpreted far beyond the natural expiry date of the parts that got written down.

To illustrate, I once learned of a people for whom time and space were merged into an ever-present spatiality that I found indelibly intriguing. For them the past lies straight ahead, for the evident reason that it is known and thus visible, while slowly blurring with distance. The future is unknowable and therefore obviously behind, where it cannot be seen. Both have a fixed point on the compass, which I don’t remember but which they cannot imagine being without. It could have been the future going south or the past chasing the sunset; what matters most is that it is shared and identical amongst them all. It obviously doesn’t mean that changing your direction steers the flow of time, but going [in the direction of] Forward or Backward isn’t an individual, but a much stronger collective perspective.

With time and direction fused so deeply into everyone’s mind, they always know their exact orientation, just as we would notice if time were to turn backward. And they do not get lost, even in the dark or in a cave. As they move with full awareness through this space-time continuum on their feet, they know exactly where they are heading and which way to return.

Perfect orientation may seem a bonus, but imagine trying to converse with a member of that tribe, when it’s not just about replacing words of one language with those of another, but where the two sides perceive the world in a completely different manner. They might come to regard us as Flatlanders [1] who can only see a circle where there is really a sphere, lacking awareness of a full dimension of the time-space that surrounds us. They might in fact have far less trouble conversing with whales and their sonar minds than we have.

Yet traveling the high seas, going into space or just falling asleep in a night train may well be torture to them. Being plucked from solid ground and moved unconsciously across the time-space continuum would be like being robbed of your mind. Waking up without a perfectly valid spatial reference would be as frightful as it unfortunately is for many people with Alzheimer’s who struggle to remember their recent social context. We might deem such people unfit for the modern world, but their youngest generation might just judge us as utterly incapable of reaping the huge potential and value of a spatial Internet[2].

It is very hard to realize just how much of our social context, knowledge, prejudice, bias etc. we constantly carry with us and unwittingly assume and embed as we try to record or write the most objective and scientific truths into code or content we create.

When the earliest electronic computers were used by the Allies to replace their human counterparts and more efficiently compute projectile trajectories to end WWII, not even the Axis disputed the numbers, only the goal. But today’s IT isn’t about the less contentious physics or accounting. With the rise of social networks, the vast majority of all computing capacity is dedicated to digital media content, ever larger parts of it artificially created, and very little of it has a shred of objectivity left. And where von Neumann may have eliminated the theoretical distinction between code and data decades ago, machine learning models obliterate the distinction between content and code, also merging science and religion, if they were ever separate, as enlightened minds liked to believe. And as digitalization engulfs all but the most personal transactions in life, these models become governing bodies of social code and exert a force that becomes immensely political through sheer mass adoption.

Like all code, from genetic through social, cultural and scientific to software, these machine learning models are immensely expensive to create, but much cheaper to replicate [2]. So, once they offer a significant enough evolutionary advantage, they spread with the acceleration offered by the base technology, machine learning models beating sex, social code evolution and now even human programmers in generational speed. And looking back at how printing, machines, and programmable computers have transformed societies, it is very clear that transferring significant parts of social code to machines, not just for storage and dissemination but for execution and evolution, will transform all the still very different societies on the planet.

One might argue that slavery was not abolished because after millennia owners suddenly developed a conscience; instead, the industrial revolution enabled machines as a cheaper substitute. Manual work and even slavery aren’t gone in all forms and places, and neither machines nor machine learning will replace humans everywhere. But judging by the huge transformational power already exerted by the global content industry, where Western ethos exerts much more influence than population distribution would indicate, it is clear that control and ownership over the social code embedded in these models from the content they digest is absolutely crucial for sovereigns of any scale and persuasion. And those sovereigns who aim for and can afford a longer-term perspective took note and turned to action.

China has a very long tradition of regarding scale as a power base and adapted with urgency when machines and industrialization allowed the far less populous West to unhinge what it believes to be the natural balance during the last century. Closely managing knowledge and its dissemination was deeply ingrained in its feudal history, and it seems perhaps even more important today, where scale is regarded as the most intrinsic factor to control. At the low end, anyone is allowed to have and even voice their own opinion. But as soon as you share it with five people or more, any but the officially sanctioned opinions are filtered out. From the top, China aims to employ its giant domestic market potential to the fullest, ensuring scale via a relentlessly homogeneous social code dictated by the current sovereign over a population base that still retains traits of a rather diverse linguistic, cultural and ethnic past.

The Indian subcontinent may host an even greater diversity and continues to attempt to standardize its vast diversity of social codes into only two major dissenting variants, with more force than can be peacefully absorbed. But both Asian giants, who represent the majority of humanity, as well as others in the East much nearer to Europe, seem to ignore that this diversity isn’t just a historical burden but often evolved with a high degree of sophistication as a means of conflict avoidance when heterogeneous populations grew in a mixed environment.

Dense populations that survived for centuries are often marked not by the thorough homogeneity one might expect from a “pressure cooker”, but by a highly evolved mesh of subcultures whose peaceful cohabitation is made possible by distinct food taboos, job differentiation, castes, religions and feudal classes with very distinct bodies of social code that coexist highly differentiated for that very reason. The only safe thing to say about them is that they are highly political and thus cannot be universally applied[3].

The European Union was founded after centuries of debilitating warfare as a means to enable peaceful conflict resolution in an area where feudal sovereignty was often completely decoupled geographically, linguistically and tribally from the subjects through generations of marital diplomacy.

While that intermarriage at the highest level of society imbued Europe with a rich legacy of shared culture, which trickled from nobility via the bourgeoisie to the greater population from the Middle Ages right through the Enlightenment and industrialization, it did not close the tribal rifts below, which expanded into nations. When industrialization created a productivity explosion that manifested in an often brutal expansion of these national social mores into colonies, it also brought the new nations into conflict over social code variations their highly interbred sovereigns [3] would hardly deign to squabble over, but it had their subjects march to war and death in their millions.

While its former colonies inherited and still echo much of that cultural legacy over vast regions of the planet, Europe itself attempted a reboot after two terrible wars where religion, language, ethnicity, and most other matters that had been the source of past contentions were firmly put under lock: yes, they could all be regarded as rich cultural assets to be shared for the pleasure and benefit of all; no, none of them would be permitted to claim superiority or exclusivity at the cost of all others.

Mumbai’s Dabbawala [4] food supply system may be hard to beat in terms of the diversity per surface area, but the highly modular institutionalized social code of Europe, which aims at sharing all elements where that seems to provide a benefit, while retaining locality where it reduces potential conflict, is still rather unique in this world.

US and Chinese internet giants have used their giant domestic markets to create awe-inspiring head starts in AI, but once they try to grow beyond them, and with the scope of digitalization still expanding, the friction between cloud code and ground rules becomes too large; the lack of modularity and of the ability to support a diversity of social codes becomes an impediment.

What remains is the giant cost of developing new code, now including social code, content and AI models, and the need for scale to spread the effort. And what needs to change is the hegemonic mindset of its creators: wherever humanity couldn’t simply dominate, it had to sit down and negotiate areas of collaboration vs. exclusiveness of control.

And when it comes to planning ahead and for the greatest scale of collaboration achievable, it may be better to test such collaboration with partners where the social code gaps are much bigger than with those with whom we historically shared sovereigns or empires. It needs to support a degree of collaboration with potential, even active, enemies.

Open-Source Altruism vs Corporate Profit

Open Source at its heart is a decision to give up exclusive control over know-how or technology and share it instead with a wider group, often as part of a trade. Just like trade as an alternative to battle, it has existed for eons and built civilizations, but it has never stopped being weaponized, either.

At the time of this writing, language models are being published as open source by companies that have invested enormous resources into their creation, not because sudden altruism compels them, but because they want to staunch a near-exclusive flow of all attention and revenue to OpenAI-Microsoft as the technology leader, which would allow that company to outscale all potential competition for a technology lead that would elevate it into an AI mainframe.

Very much like open-source Linux diverted financial and intellectual resources from the Unix mid-range companies, the aim is to avoid a stranglehold of exclusivity on a base technology where the current underdogs might have fallen behind. Open-sourcing their second-rate assets reduces the leader’s ability to monetize its lead and allows the competition to continue at a higher level, where hopefully those losses are recouped.

Underselling the competition to eliminate it is probably older than Rockefeller’s Standard Oil[4], while the speed and disembodied nature of global software and content make it hard to combat when sovereigns are at odds. But the more exponential than linear cost of AI leadership makes it difficult to sustain, which is why the current top combatants fight all the harder for exclusivity, just as the open-source competition is bent on keeping them from attaining a stranglehold.

Analysts (such as [5]), using what public information is available, find it unlikely that OpenAI and Microsoft have yet turned a profit on their AI ventures; they are racing either for a GAI[5] singularity or a practical monopoly to recoup their investments. While OpenAI’s original goal as a non-profit corporation was proclaimed altruistic, Altman and his fellowship of employees and investors believe they deserve more of its valuation and are more inclined to raise the stakes than to let competitors or regulating skeptics catch up, especially when Google may have an almost ‘singularity’ advantage in the form of its proprietary TPU series chips and infrastructure, which may scale both higher and far more efficiently than those of Nvidia, the otherwise undisputed champion. With Nvidia’s top AI GPUs already apportioned mostly to Microsoft and very few other big players for the next few years, that market has ceased to exist for lack of product. It is easy to see why Microsoft and Altman would want to exploit the position they have already dearly paid for, and which may not last for long.

GAI capabilities in corporate hands that find it difficult to self-regulate with competitors at their heels deserve the slowdown the OpenAI skeptics attempted, precisely because the technology is likely to be as revolutionary as the steam engine. But that needs outside support from sovereigns. While corporate-exclusive control over an emerging technology can lead to that technology never quite realizing its potential on its own, because consumers grow wary of feeding a giant that tends to become more beastly as its appetite grows, containment becomes more difficult when ubiquitous platforms like Microsoft’s Windows and Office are infected with it before regulators react appropriately.

But then AIs, like the machines that started the industrial revolution, neither purchase products nor pay taxes for lack of income, while the resources they turn into emissions may not be appropriately taxed. Without an economy to sustain them, GAI may see more short-term success as a vehicle of political power than as something economically or socially sustainable.

Leveling a playing field where competition has ebbed away or is no longer generating taxable value is a strategy also often employed by governments, which might regulate or nationalize infrastructures like transportation, water or electricity grids to actually boost a more competitive economy that builds on top. And in the current AI race, one contender might inspire Europe.

Another Union’s Quest for Sovereign AI

A century ago, the Arabian Peninsula was a vast area with few scattered resources to sustain life: apart from a few choice spots at coasts, in oases, or in mountain valleys able to sustain agriculture, mankind subsisted in a highly competitive, mostly nomadic way of life that made for a very traditional, tribal and feudal society, with religion and politics fully intertwined.

Then huge oil deposits were struck, which started a flood of capital and investments, enabling a large growth of inhabitants and infrastructure in an environment that still doesn’t support a self-sufficient high-density population. At one hundredth of European density in 1950, the population has since grown tenfold, but mostly through labor immigrants who have as little political sway as the slaves that were an integral part of that traditional society, even if slavery has since been outlawed.

The United Arab Emirates is a federation of absolute monarchies covering only 2.5% of the peninsula along the oil-rich, north-facing coastal strip on the Persian Gulf, and its social code has seen the minimum possible evolution from its traditional tribal base, as its emirs see very little value in much of the social code Western democracies might want to offer or impose.

But very much like China’s political elite, they understand the importance of controlling the social code AIs incorporate in order to maintain sovereignty, and so they have invested very heavily in large language models, which have been trained with a strong focus on Arabic-language inputs, while English is used to augment the general knowledge base [6].

And contrary to China[6] or OpenAI, the UAE-based Group 42 Holding Ltd., largely owned by a son of the UAE’s founder and brother to the monarch of Abu Dhabi, decided to make them globally available as open source under the permissive Apache 2.0 license. That is especially remarkable, as these 13- and 30-billion-parameter JAIS models have been trained on perhaps one of the most modern and powerful AI platforms available today, a Cerebras wafer-scale based supercomputer estimated to cost almost a billion US$ when completed at the end of 2023 [7]. Fully owned by G42, it runs in Silicon Valley, in a data center just a short walk from Nvidia’s headquarters.

To recap: the UAE, with a population of 10 million, of which 1 million are citizens and 7 men are emirs holding political power, is so determined to ensure its sovereignty and control over the social code embedded in the large language models used domestically that it spent $1,000 per citizen on LLM hardware alone.

The EU, with a population of 450 million (mostly citizens), does not spend 450 billion euros on AI, but currently focuses on regulating AI to maintain sovereignty. That is extremely important, but beyond the natural boundaries of your sovereign territories, bargaining power requires deeper investments.

The UAE is not renowned for its egalitarian altruism[7], and given the political context, the JAIS 13B and 30B models may in fact be designed more as a weapon of social code hegemony among the global population of 135 million Arabic speakers and perhaps even 2 billion Muslims, whose core codex is read in Arabic, even by its closest enemies, who used to rule them from across the gulf but prefer to speak Farsi. They have all recognized the dire need for LLMs that actually support their ethos, instead of Western models showing at best ignorance, and often quite a negative bias towards Islam, the Arabic world, or a class society. But Iran, like Russia and North Korea, belongs to a group of nations the West would like to slow even more than China on their path to AI supremacy.

The United States has used its lead and stake in IT for decades of foreign political control, and apart from smuggling and piracy, open source has been one of the few escapes available to contenders. Many started their own Linux distributions, forked digital essentials, and took control over their domestic Internets. And China is only the most capable when it comes to replacing every bit of IT that the US aims to use as a control lever; many others are just as motivated and search for allies with ever fewer constraints.

With the technology foundations of IT, value-system divides seem of little consequence. RAM and flash chips, even CPUs and most ASICs, remain free of political bias. Operating systems definitely have a somewhat leftist ethos when it comes to scheduling, but they are easily configured more feudally for real-time. Likewise, much of the stack of digital essentials like browsers and document management applications can easily be made to conform to local ethos. Using them as open source across the biggest political trenches should take minimal effort, while a global user base helps to spread the development effort [8].

Of course, shared code is a gift that can be weaponized like the very first Trojan horse; the main benefit is that this is more easily discovered than in the closed binary distributions of yore. Sometimes redesigning the very von Neumann base of computing and programming languages may seem attractive, to eliminate the fundamental vulnerability that accompanied the invention of software [9]. Especially Linux as the global operating system is straining under the inadequacy of a 1960s design in a world where millions of heterogeneous cores may need to be coordinated to run applications spread across network fabrics that start computing within the data plane and cover the planet. But, as we can observe in nature, evolution practically never leaves enough of a gap for really fundamental reboots, as each iteration needs to survive first.

That is perhaps the reason why the UAE decided to step in early and with such significant effort to provide an evolutionary branch of LLMs substantial enough to avert the OpenAI supremacy that would drown out their ethos. But instead of limiting itself to some last-minute and usage-constrained leaking like Meta, it seems to have faced that challenge with a very wide, long-term and open-source perspective.

Necessity played a significant part. LLMs need data to train on, and whereas the openly available text content base for English encompasses around 2 trillion tokens[8], no other language comes even close. For Arabic, as few as 55 billion usable input tokens could be found, which limits the optimal model size in accordance with Chinchilla scaling [10] to around 7 billion weights. And those tokens would not cover large scientific bodies with sufficient substance, or many areas of practical human knowledge, because they had never been published in anything but English. Arabic itself first needed to be taught to the model, using an approach very similar to how children learn: exposure or immersion with context.
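
As a rough, back-of-the-envelope illustration (not the JAIS authors’ own calculation), the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter shows how strongly the available corpus bounds a compute-optimal model; the exact figure depends on the fitted coefficients and on whether data is repeated over several epochs:

    # Illustrative only: ~20 tokens per parameter is a common reading of the
    # Chinchilla results [10]; exact optima depend on compute budget and data reuse.
    TOKENS_PER_PARAMETER = 20

    def chinchilla_optimal_parameters(training_tokens: float) -> float:
        """Rough compute-optimal parameter count for a given token budget."""
        return training_tokens / TOKENS_PER_PARAMETER

    print(chinchilla_optimal_parameters(55e9))   # Arabic corpus alone: a few billion weights
    print(chinchilla_optimal_parameters(2e12))   # English corpus: around 100 billion weights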

The model was next empowered to understand topoi missing from the Arabic language bodies using English documents translated into that language to enable bilingualism, which could then be used to expand the knowledge base directly from English texts, as if they had been natively written. Most importantly, that knowledge can now be rendered in both languages equally, widening the scope of knowledge that is directly accessible via the Emiratis’ mother tongue[9].
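
In the abstract, such translation-based corpus augmentation might look like the sketch below; this is an illustration of the idea described in [6], not G42’s actual pipeline, and translate_to_arabic() is a hypothetical placeholder for whatever machine-translation system is used:

    # Hypothetical sketch: build a bilingual training mix from native Arabic text,
    # English originals, and machine translations of those originals into Arabic.
    from typing import Callable, Iterable, List

    def build_bilingual_corpus(arabic_docs: Iterable[str],
                               english_docs: Iterable[str],
                               translate_to_arabic: Callable[[str], str]) -> List[str]:
        corpus = list(arabic_docs)                   # native Arabic content, kept as-is
        for doc in english_docs:
            corpus.append(doc)                       # English originals broaden the knowledge base
            corpus.append(translate_to_arabic(doc))  # translations anchor the same topoi in Arabic
        return corpus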

While this effort has required a significant investment not just into infrastructure but also into base research, significant parts of that base can be reused for other languages which also have significant grammatical challenges and gaps to bridge between themselves and English. And with the expanded infrastructure available, training times have fallen as low as 21 days for the 13B model[10].

We don’t know if Group 42 will offer LLMs-of-a-given-language-as-a-service, but the EU should quickly consider asking that question.

G42 has proven that this approach is generic enough to be repeatable wherever a large enough language body exists to train the multilingualism, and has demonstrated its organizational, functional and financial viability. It propels the topic of localized large language models for the sake of sovereignty from abstract deliberations to the rather concrete decision of at which level the European Union chooses to collaborate with the UAE and like-minded others, or to replicate the approach, hopefully again within the largest possible federation of collaborators and contributors.

JAIS comprises not just the base models, but also chat variants, which have been carefully fine-tuned to match the ethos of their authors, who concede that protecting against factually or culturally wrong answers remains a challenge and will require further work. But while the cultural training of each derivative of these base models will be diverse, the basic approach, the code, a collection of curated input bodies, and even the infrastructure that performs training and inference can be shared openly, once policies, procedures, controls etc. have been developed, tested and contractually agreed.

Whether it is large language models or others that also include picture or video capabilities, the ability to incorporate and respect very localized social codes or knowledge bases is essential to increasing the value of AI solutions. As Emiratis would be quick to agree, a very loyal artificial slave sticking strictly to the rules of its owner can be far more valuable than something more akin to an employee with a college degree but far too many opinions and an outlandish ethos: reliability, trustworthiness and control may be more desirable than a GAI that lacks these qualities.

The need to add not just linguistic translations but also “grammars of ethos” to the models will grow, and those whom we might consider our greatest political contenders will be most motivated to generalize the issue and thus make it more cost-efficient to resolve if we cooperate with them.

And when it comes to translating even between “speech” and “sonar” cultures like that spatial tribe, or interacting with sonic-visual species like squid, dolphins and whales, perhaps only AIs will be able to bridge those fundamental gaps in how we look at the very same universe as “flatlanders”, to the point where peaceful coexistence and informational trade become possible.

Contrary to China’s leaders[11], with whom the United Arab Emirates shares quite a lot of opinions when it comes to freedom of speech, democracy, or equal rights and responsibilities for everyone, G42 has chosen a government-driven open-source approach, which may catch on as both collaborate more than Cerebras and the US government appreciate [11]. The EU should go much further and seek a very active collaboration and knowledge transfer that reaches as wide and deep beyond its borders as possible, because the potential benefits grow, as always, with scale.

Knowing where to draw the lines, where to insert safeguards against the inevitable temptation of weaponizing code, requires matching research efforts now and negotiations and regulations later, which should be prepared to match the underlying pace of scientific and technological advances.

Open source is ancient; since times immemorial it has been a path from war towards trade, which then often led even to marriages and the exchange of not just social but also genetic code.

Open source LLMs present a unique opportunity and a clear mandate to act now.

Countering the Giants

Microsoft, Nvidia, Google, and Meta are only the biggest known players betting their fortunes and their future, and those of their investors, on AI becoming a profitable product in the near future.

Microsoft alone has planned $50 billion for its 2024 data-center build-out, which includes more than 400,000 Nvidia GPUs already allocated and depreciating. Google is likewise investing tens of billions into TPU-based AI infrastructure build-outs, while none of these companies break even on public LLM offerings yet.

It has them irrevocably committed as these investments are fully bespoke, almost as much as Bitcoin mining rigs, and cannot be repurposed to run SAP or any other general-purpose workload.

This means these companies are not ready to sit down and calmly debate the ethical use of AI, the risks of GAI, or fundamentally changing how their LLMs might include local preferences and social code variants, because even a month of delay would redshift the profit line, and a year of delay would create havoc.

When Microsoft was faced, during Thanksgiving 2023, with the altruistic faction within OpenAI endangering its running freight train by firing Sam Altman, it immediately offered to take on the entire team of lead engineers at any salary and working terms they might wish for, because the derailing, or just a delay, of its AI strategy by AI skeptics would have cost vastly more.

None of the players can afford to pause, rethink or renegotiate, because fab volumes have been contracted with huge penalties for plan deviations. Millions of chips are either already produced or in the midst of a production run that is edging ever closer to a full year from pushing the start button to the finished product filling inventory, after processing and assembly steps that involve specialists across the globe who need to plan years ahead to adjust capacity. In the quest to push out the competition by contracting all available capacity, they have made themselves vulnerable to anything they did not foresee. Vulnerability tends to be compensated by defensiveness and escalates into aggression.

The sheer size of their commitment makes them as dangerous as stepping in front of a two-mile freight train with a hand held high as if it were a cart, because your grocery bag broke on the tracks. It may well be that they are in fact on the wrong track, a track that violates a nation’s law or even an EU directive, but penalties would need to match the stakes and actually be enforced within the living memory of acting executives.

Stopping the Big Data Steal

Microsoft is pushing AI agents into operating systems and applications that run on personal computers: PCs which belong to owners, who as such have a sovereign right to exclusive control. But taking a page from Apple’s scrapbook, which developed its products out of a DRM-enabled music player, as well as from Google and others who invented the data economy, they now have these AIs scrape PCs for any usable data to convert into sellable insights and training material for models they run in their data centers, without asking for permission from the owners of the PCs and the material stored there.

It is the concierge going about your home, reporting back to the recruiting agency on your comings & goings, habits and hobbies, including a carbon copy of any document you might receive, touch or send, in fact even on where you hesitated or rewrote a sentence. As I type this, Word freezes every ten minutes or so, trying to execute a task called ai.exe, describing itself as “Artificial Intelligence (AI) Host for the Microsoft® Windows® Operating System and Platform x64”, a pest I surely never invited into my home and therefore tried to disable, again and again. And no, I’m not “logged in”, nothing cloudy about my machine which operates entirely without a Microsoft account, or that company being permitted anything but software maintenance and occasionally unclogging blocked pipes.

Microsoft argues that this nosiness is required “to improve their services and products”. It may also argue that my personal data never leaves my personal computer, or, in the case of Office 365, that it actually never left their data centers. They are convinced that the insights their agent gathers or their AI then develops are obviously and exclusively theirs, a court case they urgently need to lose; yet neither the watertight regulation, which puts device and data sovereignty solely with the owner, nor that court case has yet been initiated by the EU.

An AI chat-bot hallucinating may be annoying but seems harmless. But Microsoft is handing Co-pilot executive power and the steering wheel of our personal computers. Giant investments into making cars sufficiently intelligent for self-driving may have turned their principal imaginator towards hallucinations, but have so far failed to deliver a safe product [12]. Tesla’s and Microsoft’s need to sell demonstrable AI value is as big as the billions on their balance sheets and dwarfs all concerns. Copilot’s potential for failure and abuse cuts into the core of human value creation in the Internet age and will inspire a completely fresh wave of criminal creativity, where little more than prompting [13] is required to bypass every known security measure invented since the code-data bulwark was breached on June 30th, 1945 [14].

Forty-two years after its first operating system, Microsoft has not managed to make printing documents safe [15] [16]. Currently self-driving Co-pilot only insists on resetting user preferences [17] but forcing it on every desktop [18] without permission or consent…

…clearly needs action.

The sole responsibility of the European Commission is to the citizens of Europe. And its main task is to take actions that benefit them.

What Microsoft, and the other Internet giants racing towards socio-economic dominance via the better, bigger, smarter and perhaps even super-human AI services, are doing here, does not benefit European citizens in any obvious way that is more significant than the risks involved.

Where they might technically argue that processing all your data via distributed AIs does not represent a violation of the GDPR or other regulation, they are feeding a knowledge base from which any insights you personally developed can no longer be deleted.

This data extraction must quite simply be stopped, because in a society where content and social code is becoming more valuable than jewels, such thievery needs to be met like the concierge absconding with your baubles. And it gives these giants an advantage that simply doesn’t balance back in EU citizen benefits.

No data or insight from EU personal computing devices may flow into AI giants’ data centers unless the owner or their loyal agent hands it over for the purpose of a transaction. No application or operating system may gather data and insights on PCs or their owners by default, nor send it back as “telemetry” or “customer feedback” etc. without the owner explicitly consenting, while no service may be denied if they do not. If payment or license data is required, a neutral clearing house needs to be put in place to safeguard the commercial exchange and contract compliance. Vendors may not demand privileged operational control over a PC to safeguard software purchases, because it is not necessary and is done in bad faith.

AI is happening and only time will tell how much good or bad it will do. There is no simple way to erase it from human memory, or to stop it from eventually reaching GAI, if that becomes cheap and attractive enough to do. Yet it is hard to imagine it doing so well without consumers able to buy the products or services it helped to create to finance its evolution.

But sovereigns at every level, be they just the owner of a PC or the rulers of nations, need to control its spread and usage, and as sovereigns they have every right to do so. And they need to exercise that right in the interest of their citizens or just their families, with the ability to do so safeguarded by their government, at least in a democracy.

Such controls can’t be too complex, and they are easier to implement if direct financial feedback can be used.

Ensure Local Sovereignty

Sovereign control needs to be decisive, with every default pointing towards the owner of the devices and the data, not the hardware or software vendor. A list of topics that should be very openly and noisily discussed to help steer planned AI giant commitments could include:

  • vendor-locked personal computing devices need to be unlocked by default; the delegation of management to a vendor must be a conscious but reversible decision that doesn’t imply complete data loss when changed

  • Any cloud synchronization or telemetry data option again needs to be off by default, only enabled when the owner agrees with informed, non-discriminative and reversible consent. These cloud synchronization offers may not be vendor-exclusive; third-party providers must be enabled via open APIs

  • No application that can run with local data storage may prescribe a cloud link to function

  • Where such a shared data provider is necessary (e.g. collaborative tools), the storage needs to be geographically local (minimum country level) and possibly 3rd party enabled

  • No generalized analysis of cloud data by code or models that is not specifically agreed by the owner: you can continue to pay with your data, but just entering a web site or installing an application may not be interpreted as eternal and all-encompassing consent

  • No generalized analysis of local data on personal computing devices or appliances by apps or models, no such modules or packages may come activated by default or be inserted via some update without explicit informed, revocable, and non-discriminative consent

  • Applications should never look at data they have not generated or were instructed to use by the owner

Philosophically it means that PCs and appliances may not be turned into agents of an AI or cloud giant: consumers must specifically want to purchase them as such and be clearly informed as to what that means now and whenever that changes. And they must be able to use these devices without the vendor connection, where that makes sense (e.g. coffee machine).

Rather concretely, it means Microsoft may not just install Co-Pilot on Windows PCs or convert Office installations into a cloud variant unless the owner gives their informed and reversible consent.

The creeping backdoorism which has been designed to feed the training pipelines needs to be cut before it’s proclaimed the universal state of the art. And after that, even more so.

Reining in the Networks

WhatsApp, Facebook, Twitter etc. have made themselves “indispensable”, mostly by just being convenient and using the network effect. It has led to schools, clubs, employers, service providers, governments etc. using them to the point where they become effectively critical infrastructure or mandatory to participate and function.

It means that EU citizens no longer have a realistic choice not to feed these foreign giants with all the data their apps might siphon from their devices.

That choice not only needs to be re-established; not feeding them needs to become the preferred default.

And that starts with all branches of EU governments going off social media not operated by themselves or by a 3rd party under their full control. Even if social media were rocket science, after ten years of use and study, far lesser nations have built commodity rockets.

And it then needs to follow through in the private sector, where no employer may rely on social media or AI platforms that are under foreign sovereign control. For example, if an employer uses WhatsApp to direct employees, any employee can file an anonymous complaint that is followed up and sanctioned, so the employer will have to switch to something like a self-hosted or EU-regulation-audited chat platform that sticks to chatting.

If citizens want to use TikTok, Facebook or Instagram for their private life, that’s their free choice. But when it concerns anything not strictly private, local sovereignty must take precedence.

Facebook could become such a trusted third-party operator if it split operationally along the sovereignty lines, with audited functionality and data flows. If it were better and cheaper than a second-rate EU implementation, that should benefit EU citizens. Facebook has asked to be regulated, perhaps for that reason, but without such strict rules, corporate profit optimizations can only grow the size of the unregulated freight train [19].

And when it comes to Microsoft, the need to break its stranglehold on the workplace desktop is bigger than ever. All branches of EU government need to switch to open-source alternatives for Microsoft products such as the desktop OS and the office suite, and every employee needs the right to refuse Office 365, Co-Pilot or anything that uses OpenAI services, without discrimination. As Russia, China, North Korea and others will attest, sovereignty depends on the ability to control the platforms that run code.

And when these use AI models that develop a mind of their own, the control needs to match their rise in capabilities.

Conclusion

As with any tool or technology, AI’s benefit to EU citizens depends on its usage.

As with most Information Technology advances, leading edge AI requires scale to distribute the cost of its evolution.

Control and scale are not diametrical opposites: when you carefully sort areas of synergy from areas where differentiation and partitioning are required, scale can actually bridge value divides and enable lower cost. But the current players are both driven and tempted too much by short-term goals and practical monopolies.

It is much easier to recognize the disruptive potential of how the current AI giants drive LLMs towards GAI than to imagine how EU citizens will benefit. They seem only concerned about scaling out for affordability and market dominance, while they disregard the need to differentiate and partition.

Yet not only must such platforms support the full diversity of existing social code; they might even need to encourage, but surely enable, a rapid further growth of diversity for conflict avoidance as people come virtually closer and become more involved with each other without wanting to give up their individual or group ethos. Interoperability is key; exerting pressure towards a standardized ethos can have explosive consequences and is a sovereign privilege.

If AI is to become truly ubiquitous, a constantly available and intensively used commodity to empower humanity, one or even a few giant global corporations are the worst place to hold control, which must match the topology of the society on the ground, from federations of nations down to the level of the individual.

Where those sovereigns are not ready to take the lead themselves, they must ensure their control in other ways. The right to internalize or nationalize such control is the inalienable right of any sovereign, but corporations may need reminding, and courts clear directions, to follow citizen over corporate interests.

Scale is the primary means of lowering the cost of technology evolution, but scale exerts gravitational pull when it becomes huge. Before corporate scale overwhelms sovereign power, sovereigns need to step in and slice, relentlessly if corporations do not anticipate the need to pre-partition along existing social code diversity.

Without that balance of power AI will fail to benefit those who ultimately enable and support it by paying for services and products which make use of it. And without customers, corporations die.

The EU needs to partner with the widest range of other sovereigns to create a framework of controls, to avoid corporate runaway freight trains accumulating through cost optimizations which then become too dangerous to operate and steer. Using and enforcing open source and open processes at speeds and power that match the technology underneath is crucial, and it is the EU’s immediate obligation to do so with the broadest possible range of partners.

AUTHOR

Thomas Hoberg is the technical director at Worldline Labs, Frankfurt, Germany

REFERENCES

[1]: "'Flatland' on Wikipedia," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Flatland. [Accessed 20 12 2023].
[2]: T. Hoberg, "Extreme reuse: the only future any code can afford. HiPEAC Vision 2021," 18 January 2021. [Online]. Available: https://doi.org/10.5281/zenodo.4719690. [Accessed 20 December 2023].
[3]: "'Descendants of Queen Victoria' on Wikipedia," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Descendants_of_Queen_Victoria . [Accessed 20 December 2023].
[4]: "'Dabbawala' on Wikipedia," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Dabbawala. [Accessed 20 December 2023].
[5]: D. Patel, "SemiAnalysis," [Online]. Available: https://www.semianalysis.com/. [Accessed 20 December 2023].
[6]: N. Sengupta et al., "Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models," 29 September 2023. [Online]. Available: https://arxiv.org/abs/2308.16149. [Accessed 20 December 2023].
[7]: S. Ward-Foxton, "Cerebras Sells $100 Million AI Supercomputer, Plans Eight More," EE Times, 20 July 2023. [Online]. Available: https://www.eetimes.com/cerebras-sells-100-million-ai-supercomputer-plans-8-more/. [Accessed 20 December 2023].
[8]: T. Hoberg, "Europe's need for digital essentials, individual sovereignty and consumer protection. HiPEAC Vision 2023," 16 January 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7461962. [Accessed 20 December 2023].
[9]: T. Hoberg, "Reversing John von Neumann and Steve Jobs, but not software. HiPEAC Vision 2021," 18 January 2021. [Online]. Available: https://doi.org/10.5281/zenodo.4719396. [Accessed 20 December 2023].
[10]: J. Hoffmann et al., "Training Compute-Optimal Large Language Models," 29 March 2022. [Online]. Available: https://arxiv.org/abs/2203.15556. [Accessed 20 December 2023].
[11]: T. Mann, "After bashing Nvidia for ‘arming’ China, Cerebras's backer G42 alarms US govt with suspected Beijing ties," The Register, 28 November 2023. [Online]. Available: https://www.theregister.com/2023/11/28/cerebras_g42_china_refile/. [Accessed 20 December 2023].
[12]: E. Helmore, "Judge finds ‘reasonable evidence’ Tesla knew self-driving tech was defective," The Guardian, 22 November 2023. [Online]. Available: https://www.theguardian.com/technology/2023/nov/22/tesla-autopilot-defective-lawsuit-musk. [Accessed 23 December 2023].
[13]: M. Pesce , "Go ahead, let the unknowable security risks of Windows Copilot onto your PC fleet," The Register, 11 October 2023. [Online]. Available: https://www.theregister.com/2023/10/11/microsoft_expects_it_pros_to/. [Accessed 23 December 2023].
[14]: J. von Neumann, "First Draft of a Report on the EDVAC," 30 June 1945. [Online]. Available: https://web.mit.edu/STS.035/www/PDFs/edvac.pdf. [Accessed 23 December 2023].
[15]: R. Speed, "Messed up metadata could be to blame for Microsoft's Windows printer woes," The Register, 8 December 2023. [Online]. Available: https://www.theregister.com/2023/12/08/messed_up_metadata_to_blame/. [Accessed 22 December 2023].
[16]: R. Speed, "Microsoft to kill off third-party printer drivers in Windows," The Register, 11 December 2023. [Online]. Available: https://www.theregister.com/2023/09/11/go_native_or_go_home/. [Accessed 8 December 2023].
[17]: R. Speed, "AMD graphics card users report gremlins with Windows 11," The Register, 3 October 2023. [Online]. Available: https://www.theregister.com/2023/10/03/amd_graphics_users_report_problems/. [Accessed 23 December 2023].
[18]: "There Will Be A Personal Computer On Every Desk In Every Home - Bill Gates (1990)," YouTube, [Online]. Available: https://youtu.be/X-M3FbIlqQA. [Accessed 23 December 2023].
[19]: "'2023 Ohio train derailment' on Wikipedia," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/2023_Ohio_train_derailment . [Accessed 20 December 2023].
[20]: T. Hoberg, "Gaming, content and the metaverse. HiPEAC Vision 2023," 16 January 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7461953. [Accessed 20 December 2023].
[21]: "'Standard Oil' on Wikipedia," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Standard_Oil. [Accessed 20 December 2023].


  1. Whales and dolphins may broadcast abstracts of sonar imagery from an individual viewpoint towards a group to enable coordinated action. Octopuses can essentially project video patterns on their skin, enabling an optical communications channel. The achievable symbol rates in both cases may be far in excess of anything possible verbally. ↩︎

  2. See ‘The Spatial Web: Interconnecting people, places, things and AI for a smarter world’ and [20] ↩︎

  3. high degrees of equalization pressure have led to revolutions, which then just resettle into different strata ↩︎

  4. The main inspiration for US anti-trust law [21] ↩︎

  5. Also acronymized AGI, it stands for the threshold where an AI might exceed a qualified human in problem solving capability ↩︎

  6. There are also various open-source LLM projects from China vying for public attention, e.g. 01.ai, which is also supposed to represent $1B in investments. Currently no documentation is available and first hands-on tests show questionable quality compared to similarly sized JAIS or Llama-2 LLMs ↩︎

  7. Its moral codex encourages giving alms to beggars, not abolishing their career path or that of emirs e.g. via universal basic income ↩︎

  8. A minimal carrier of meaning, of which there could be several in composite words or with agglutinating grammars ↩︎

  9. Spoken Arabic dialects differ vastly, even within the UAE, but a standard dialect called fuṣḥah (فصحى) is used for nearly all publishing and broadcasting ↩︎

  10. For the 30B model a figure for the training time could not be found. But it seems safe to assume that training started after the 13B publication in August 2023, while the 30B model was ready for download three months later in early November ↩︎

  11. While the Chinese government has sponsored local forks of digital essentials such as OpenKylin, it is reluctant to push open source at the leading edge. There are quite a few Chinese LLM projects, increasingly also open source, but so far they are either company-driven or academic. ↩︎

The HiPEAC project has received funding from the European Union's Horizon Europe research and innovation funding programme under grant agreement number 101069836. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.