The Script Encoding Initiative (SEI), led by researchers Anushah Hossain and Deborah Anderson in the Linguistics Department, recently received grants from the Internet Society Foundation ($300,000) and the Mellon Foundation ($1 million) for research on the digital inclusion of diverse writing systems. SEI aims to increase the number of writing systems available on digital devices and to study the processes by which inclusion occurs.
The Internet Society Foundation’s grants seek to expand internet access, build internet resiliency, support sustainability efforts and promote digital inclusion by funding initiatives that contribute to the development of global technical infrastructure. The Mellon Foundation awards grants in support of the arts and humanities, especially to scholars whose work advances equity in their fields.
Hossain, SEI’s Research Director, recently spoke to Berkeley Social Sciences about her research and grants. This interview has been edited for clarity.
Tell us about your background and how you ended up at UC Berkeley.
Anushah Hossain: The Script Encoding Initiative (SEI) has been housed in UC Berkeley’s Linguistics Department for over 20 years now. SEI was started by my colleague, Berkeley Linguistics researcher Debbie Anderson, to help uncommon writing systems become available on digital devices. To borrow a common phrase from the software world, SEI was born out of a “scratch your own itch” impulse.
Debbie wanted to use historic scripts like Old Italic for an online newsletter she was working on. After realizing how many steps were involved, and how unlikely it was that anyone else would take them on, she set about finding experts she could work with to get the script encoded. The project that grew out of that initial effort has been spectacularly successful: of the 168 scripts currently available for use on computers, SEI worked on over two-thirds!
I first met Debbie while working on my dissertation at Berkeley. I was researching histories of internet infrastructure and happened to come across stories about the difficulties of digitizing Indic scripts, which are used across South and Southeast Asia. I couldn’t peel my attention away. I started learning more and ended up basing my dissertation on the debates surrounding the digitization of the Bangla script. Debbie and I stayed in touch after I interviewed her for my project. Over the last few years, I’ve gotten more involved in the practical work of SEI alongside my historical research on language and text standards, and I returned to Berkeley this fall to take on the role of research director of SEI.
Tell us about your research at Berkeley Linguistics.
Anushah Hossain: My work centers on the Unicode Standard, a widely used technical standard for writing systems. Unicode has committees of experts that review proposals to add new characters to the Standard. Once a script or character is in the Unicode Standard, it can be implemented on our digital devices in fonts, keyboards and applications, so we can send messages with it. Many people know Unicode for its work on emojis. There’s been a lot of coverage of the tired face emoji that the Emoji Subcommittee elected to include in the latest version of the Unicode Standard.
But less well known is the work Unicode does to add new writing systems for digital use. In the most recent release alone, seven brand-new scripts were added, along with important additions to many existing ones, like Egyptian hieroglyphs. One part of my work is to research unencoded scripts and work with linguists, archaeologists, digital humanists, type designers and engineers to prepare those scripts for inclusion in the Unicode Standard. The other part is to take a critical eye to Unicode. As you might imagine, there’s a great deal of discretion involved in designing these technical systems. Beyond the practical projects I do as part of SEI, I’m most interested in documenting how the decisions that go into Unicode track broader language politics in society.
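To make that concrete, here is a minimal Python sketch of what encoding buys a script, using only the standard library’s unicodedata module. (The specific characters are illustrative choices, not examples from the interview.) An encoded character has a permanent code point and name that every font, keyboard and application can rely on; an unassigned code point gives software nothing to work with.

```python
import unicodedata

# An encoded character has a permanent code point and a standard name
# that fonts, keyboards, and applications can all agree on.
ka = "\u0995"                 # BENGALI LETTER KA, from the Bangla script
print(f"U+{ord(ka):04X}")     # -> U+0995, its fixed code point
print(unicodedata.name(ka))   # -> BENGALI LETTER KA

# An unassigned code point has no name and no agreed meaning; software
# can only show a placeholder ("tofu") glyph for it. Note: unicodedata
# reflects the Unicode version bundled with your Python build.
unassigned = "\U00050000"     # Plane 5 is currently entirely unassigned
try:
    unicodedata.name(unassigned)
except ValueError:
    print("U+50000 is unassigned: nothing standard to render")
```

Until a script clears the proposal process Hossain describes, all of its characters are effectively in that second category.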
What will your $300K grant from the Internet Society Foundation be used for?
Anushah Hossain: This grant focuses on raising public awareness of Unicode and SEI’s work. While we all use text technologies daily, few of us understand why certain characters might be missing or glitchy. Over the last 20 years, SEI has worked on more than 100 scripts that have made it into Unicode, and dozens more that are still in process. How do all of the interconnecting text standards and software tools actually work? What are the lessons learned from SEI’s trove of experiences? What successes should be better known?
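As a quick illustration of what “missing or glitchy” can look like in practice (a hypothetical sketch, not an example from the interview), two common failure modes are mojibake, where bytes written in one encoding are decoded under another, and private-use workarounds for scripts that have no Unicode code points yet:

```python
import unicodedata

# Failure mode 1: mojibake. Bytes written as UTF-8 but decoded under a
# legacy encoding turn into unrelated characters.
raw = "\u0995".encode("utf-8")    # BENGALI LETTER KA -> b'\xe0\xa6\x95'
print(raw.decode("cp1252"))       # -> 'à¦•', the same bytes, garbled

# Failure mode 2: missing characters. A script with no Unicode code points
# is often shoehorned into the Private Use Area, where the mapping only
# works for people who share the same custom font.
pua = "\uE000"                    # first Private Use Area code point
print(unicodedata.category(pua))  # -> 'Co' ("private use"): no standard meaning
```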
We plan to build an archive of stories about these efforts, documenting both successes and setbacks, and situating SEI’s work within broader historical and political contexts. The goal is to create a public-facing record that highlights key achievements in language technology and critically examines the shift toward an increasingly digital world. This archive will be made available on the SEI website, contributing to a broader conversation about linguistic inclusion in digital spaces.
What will your $1M grant from the Mellon Foundation be used for?
Anushah Hossain: This grant will keep SEI’s fundamental work going and help us explore some new avenues for collaboration. Our essential work is identifying scripts that are not yet in Unicode but are viable for encoding (meaning they’re fairly stable and many people want to use them online) and then working with experts to author proposals to submit to the Unicode Consortium. We have a rough roadmap of 20 more scripts to tackle over the next four years, spanning continents and including candidate scripts like Maya hieroglyphs in the Americas, Lampung in Southeast Asia and Mwangwego in southeastern Africa.
In addition, I’m keen on expanding the roster of people working in the language technology space. SEI is launching a fellowship program, open to international applicants, that will support exploratory research projects in the field or in archives to advance our knowledge of a script. I’m also excited to collaborate with Berkeley students on projects analyzing Unicode archives and uncovering untold histories related to script encoding.
What is your reaction to receiving both grants?
Anushah Hossain: For a long time, it seemed to me that I was working on niche topics in isolation, often focusing on the encoding of a single letter. But for better or worse, the recent public releases of large language models (LLMs) have brought renewed attention to foundational technologies like Unicode. We’re reminded that the digital divide still exists, and that it can easily grow: if a script isn’t included in Unicode, it certainly won’t appear in tools like ChatGPT. There’s a new sense of urgency bringing together experts from both STEM and the humanities, and I’m excited to have the resources and support from funders to address critical questions about language digitization.