Building bridges: strengthening black and indigenous media on Wikipedia

InternetLab produces research on the use of black, indigenous and peripheral/territorial media as references in the Lusophone Wikipedia, identifying underrepresentations, challenges, and dialogues for fostering more diverse online knowledge.

News Culture & knowledge 12.15.2023 by Stephanie Lima, Fernanda K. Martins, Alessandra Gomes and Catharina Vilela

In collaboration with the Wikimedia Foundation, InternetLab ran a fellowship program from 2021 to 2023. This initiative aims to contribute to discussions on the generation and dissemination of knowledge by black and indigenous people, both online and offline. In 2022, we organized a seminar that brought together social actors from black, indigenous, and free knowledge movements who are actively engaged in academia, the third sector, or civil society. As a result, we released the mapping entitled “Inequalities & Knowledge – Transformations, challenges and Strategies after 10 years of Quotas Act”. In this publication, we organized the insights gathered from these participants to better understand the relationship between the Internet and the quest for knowledge equity. In doing so, we looked at both the past and the future of efforts to achieve a more equal generation and dissemination of knowledge.

A fundamental strategy that we identified, in dialogue with these social actors, was the “urgent need to strengthen black and indigenous media”. This strategy is based on the understanding that these media serve as vital platforms for spreading, accessing, and recognizing knowledge from these populations, particularly given the barriers in traditional knowledge spaces, whether academic or journalistic. Furthermore, in a historical and political context marked by expanding debates on disinformation, platform regulation, and the sustainability of journalism, the strategic relationship between dissemination and empowerment becomes even more evident.

A fundamental approach to addressing inequalities in access to and generation of knowledge by historically vulnerable populations involves strengthening and promoting media that convey this knowledge. It is worth noting that, upon closer examination of the field of these media, the category “peripheral” has come to light. In the Brazilian context, many communication channels are predominantly formed by black or indigenous individuals who self-identify as “peripheral” or “territorial.” To delve deeper into this, we decided to analyze how content produced by black, indigenous, and peripheral/territorial media in Brazil is used on the Lusophone Wikipedia. This platform is the largest free online encyclopedia and one of the most visited websites globally. When we mention these media definitions, we are referring to communication outlets focusing on topics related to and/or predominantly composed of black and indigenous people.

To this end, we formulated two primary questions:

  1. How has the representativeness of the knowledge and culture of historically marginalized groups, such as black and indigenous people, been depicted on Wikipedia?
  2. What voices contribute to this information, and in what context is it presented?

Our efforts encompassed conducting comprehensive surveys that facilitated the analysis of Wikipedia’s most frequently used subcategories and terms. Additionally, we identified the websites and media sources most commonly referenced in constructing existing entries. Based on the data collected, we present several considerations. First and foremost, we emphasize the necessity to broaden discussions and reflections on using these media as sources within free knowledge platforms. 

Data Analysis

The analysis of collected data included calculating the total number of citations—i.e., appearances in Wikipedia articles—for each medium or subcategory, followed by the classification of this data. Below are some detailed results:

i. The 100 most cited reference links

The reference links in citations lead to specific online locations, such as newspaper articles, academic papers, government information sites, and more. To determine how often a particular digital medium was referenced, the count was based solely on the root of the URL address. All links from references were aggregated by their shared URL root, and the totals were then calculated and ranked in descending order.
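The aggregation by URL root described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the research: it assumes that "root" means the host name without a leading "www.", and the function names `url_root` and `rank_roots` are hypothetical. It uses only the standard library to stay self-contained.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_root(link: str) -> str:
    """Return the lower-cased host of a link without a leading 'www.'.

    Assumption: the 'root' of a URL is its host name. Links that cannot
    be parsed, or that have no host, yield an empty string.
    """
    try:
        host = urlsplit(link).hostname or ""
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host


def rank_roots(links, top=100):
    """Count links per URL root and return the `top` most cited roots."""
    counts = Counter(url_root(link) for link in links)
    counts.pop("", None)  # discard links with no recognizable host
    return counts.most_common(top)
```

Grouping by host keeps subdomains apart. Merging them into a single entry (for example, every site under one institutional domain) would need an additional rule that the text does not specify.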

ii. The Top 100 Subcategories Most Used by Articles

In addition to a references section, Wikipedia articles feature a categories section, identifying the categories and subcategories that best describe the article’s content. To determine the most frequently used categories and subcategories, all identified items were consolidated into a single list for total calculations and ranking.
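As a rough illustration of this consolidation step, the sketch below counts how many articles use each category. It assumes the collected data is available as a mapping from article title to a list of category names. The function name `rank_categories` and the sample category names in the usage note are hypothetical.

```python
from collections import Counter


def rank_categories(article_categories, top=100):
    """Rank categories by the number of articles that use them.

    `article_categories` maps an article title to its list of categories.
    Each category is counted once per article, even if listed twice.
    """
    counts = Counter()
    for categories in article_categories.values():
        counts.update(set(categories))
    return counts.most_common(top)
```

For example, `rank_categories({"A": ["Homens"], "B": ["Homens", "Mulheres"]})` places "Homens" first with a count of 2.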

iii. Additional totals

The outcomes from items i and ii prompted further investigations involving counts and searches for more specific groups or terms. These queries were handled with Python code and the Pandas library.

iv. Specific Analyses

Due to the challenges in identifying the presence of black, indigenous and peripheral/territorial media as sources in the general survey, we conducted a targeted investigation by surveying select media outlets active in Brazil today. This survey involved an active search on social networks and independent journalism associations such as Ajor. Consequently, we compiled a list of 21 media outlets identified as black, indigenous, peripheral/territorial, or marginalized that met the platform’s reliability criteria. The comprehensive list of selected media can be found in Graph 4.
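A targeted search of this kind could look roughly like the sketch below, which finds the articles that cite each outlet on a given list. It is an assumption-laden illustration: the function name `articles_citing` is hypothetical, outlets are matched by domain (including subdomains), and the domains in the usage note are placeholders rather than the outlets surveyed.

```python
from urllib.parse import urlsplit


def articles_citing(article_links, outlet_domains):
    """Map each outlet domain to the set of article titles that cite it.

    `article_links` maps an article title to its list of reference links.
    A link matches an outlet when its host equals the outlet's domain or
    is a subdomain of it.
    """
    cited = {domain: set() for domain in outlet_domains}
    for title, links in article_links.items():
        for link in links:
            try:
                host = urlsplit(link).hostname or ""
            except ValueError:
                continue  # skip links that cannot be parsed
            for domain in outlet_domains:
                if host == domain or host.endswith("." + domain):
                    cited[domain].add(title)
    return cited
```

Calling `len()` on each resulting set gives the number of articles citing that outlet, for instance `len(articles_citing(data, ["outlet-a.example"])["outlet-a.example"])`.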

Results

i. In the Top 100, over half of the cited sources are foreign or affiliated with large national media conglomerates

Initially, we examined the top 100 links referenced on Wikipedia, identifying the most frequently cited sources of knowledge on the platform. Of these 100 sources, 76 were foreign, 24 were national, and none were associated with black, indigenous, and peripheral/territorial media. This disparity between citations from foreign and domestic sources reinforces observations made by researchers regarding the concentration of knowledge and articles on the Anglophone Wikipedia compared to other languages[1]. Given this disparity, there is a broad set of entries on the Lusophone Wikipedia that are translations of articles originally from the Anglophone Wikipedia, which could be the reason for the observed difference in citations[2].

Subsequently, we focused on the top 50 national media sources, excluding any external influence. Brazil’s major media conglomerates dominated these rankings, with Rede Globo occupying the second position in the top 100, leading the top 50. It was followed by UOL, domain edu.br (linked with higher education institutions), Terra, Abril, and Estadão.

Graph 1. Author: InternetLab.

ii. In the national top 50, government sites and universities in the southeast also secure prominent positions

Notably, alongside the edu.br domain, representing numerous higher education institutions, two other Brazilian universities have independently secured positions in the top 50: FGV at position 18 and PUC Rio at position 25. This outcome prompts intriguing questions about the geographic distribution of knowledge production. It becomes evident that, although the edu.br domain encompasses multiple institutions, the knowledge referenced by two private universities located in Brazil’s southeast stands out in the national references.

Graph 2. Author: InternetLab.

During our research, we extended our examination to the remaining entries in the list of the 50 most cited sources on Wikipedia. This exploration unveiled various sources, encompassing state-owned entities such as the Brazilian Institute of Geography and Statistics, Federal Executive Power, Federal Legislative Power (Federal Senate and Legislative House), Legislative Assembly of São Paulo, and Superior Electoral Court. While Blogspot, WordPress, and Scielo do not fall under national domains, they were retained on the list due to the diverse geographic content they feature. The first two were grouped together, as both function as blog content platforms, and jointly secured the seventh position. We also identified well-known media sources, including R7, Gazeta do Povo, Isto É, and SBT. Additionally, alongside Scielo, sources about culture and entertainment, such as sports sites with a specific focus on soccer and those affiliated with samba schools, were also identified.

iii. There is a fivefold disparity in the number of articles about men

In a subsequent phase, we compiled the 100 most frequently used subcategories on Wikipedia. An important discovery emerged regarding the discrepancy in the number of articles related to men compared to women, with nearly five times as many articles about men. This may be related to existing research that shows a higher concentration of male editors compared to female editors.

Based on this data, we categorized the top 100 into general themes, aiming to comprehend the primary topics covered by the articles. The most prevalent category, encompassing 49 out of the 100 most used subcategories, was the ‘people’ group, including men, women, individuals born in specific years and decades, and notable figures in historical events. These subcategories were followed by those related to film and television, biology, and astronomy. The lower positions included articles about territories and geography, covering content about countries, states, and cities, pages dedicated to artists and bands, and a category related to games. In this context, it is crucial to emphasize the absence of mentions of black, indigenous, and peripheral/territorial media in the top 100 national and international references, signaling a significant deficit in representation.

Graph 3. Author: InternetLab.

iv. No indigenous media were cited

Finally, we focused on the media central to this study since they were not referenced in general surveys. Despite the ‘Indigenous peoples of Brazil’ subcategory containing 789 related articles, none of the indigenous media outlets we identified were cited on Wikipedia. Notably, only the journalism agency “Amazônia Real”, while not explicitly positioning itself as an indigenous media outlet, was cited in 116 articles. Amazônia Real is dedicated to providing visibility to the populations and issues of the Amazon, revealing that a significant portion of articles directly related to indigenous peoples do not cite sources led by this population.

Graph 4. Author: InternetLab.

According to our analysis, the most referenced black and peripheral/territorial media outlet is the Geledés portal, with 377 mentions in our surveys. It is noteworthy that, despite Geledés being a civil society organization, the significance of its portal as a communication platform warranted its inclusion in our surveys within the ‘media’ category. The inclusion of Geledés in the ranking supports our interpretation. Among all the articles utilizing the portal as a source, 201 mentioned individuals, the vast majority of whom were black. Additionally, 42 articles addressed cultural topics, notably carnival, while 19 delved into music, encompassing bands and artists.

While our initial approach aimed to align with the classifications already established in the 100 main subcategories, encompassing Black and marginalized media, it prompted the adoption of new categorization strategies. This shift was necessitated by identifying 12 articles in which Geledés was mentioned and was associated with terms used to reference historically minoritized groups, such as ‘sapatão’ (in English, dyke) and ‘black’. Additionally, another 12 articles addressed issues of violence, encompassing topics such as racism, torture, and massacres. Lastly, the theme of Nazism and related institutions emerged in 8 of the articles citing this media.

This data prompts us to question the manner and contexts in which black media are referenced. In contrast to our observations in the case of Indigenous media, it is apparent that black media enjoys a degree of visibility, albeit relatively small compared to the vastness of the Wikipedia universe; for instance, Globo was referenced in 516,284 articles. Nevertheless, we question the reasons why black media has predominantly emerged in contexts related to black individuals, marginalized groups, or situations of violence. Is this phenomenon a result of how black media is chosen to reference restricted topics or do black media outlets focus their journalistic activities in these specific areas? In other words, are we facing a situation where black media is only mentioned under particular circumstances on the Lusophone Wikipedia, or is this association a result of intentional editorial orientation by these media outlets themselves?

v. Black, indigenous, and peripheral/territorial media are seldom cited, and their presence is concentrated in the “people” subcategory.

Finally, we chose to conduct an analysis of the subcategories related to articles citing the selected media. The idea was to map, using Wikipedia’s own categorization system, in which contexts black, indigenous, and peripheral/territorial media were being mobilized as references. In doing so, we aimed to partially address the questions raised in the previous section.

In total, the articles mentioning any of the considered media presented 1,542 unique subcategories, totaling 5,873 occurrences. The analysis of the graph below reveals that most identified subcategories are associated with articles mentioning black media. This data reinforces the conclusion obtained in the previous section, as per Graph 4, where black media is generally more cited than indigenous and peripheral/territorial media. Therefore, it is plausible to expect that they also encompass more subcategories, as evidenced by the results.

Graph 5. Author: InternetLab.

To provide a more comprehensive visualization of the frequency of the 1,542 subcategories, an aggregation based on general themes was conducted. This process led to the formation of 15 groups, as illustrated in the subsequent graph:

Graph 6. Author: InternetLab.

The graph analysis reveals that almost 70% of the subcategory occurrences are associated with articles about “people,” totaling 4,050 occurrences. This quantity is relatively small in the overall context of the Portuguese-language Wikipedia: the “men” subcategory alone, shown in Graph 3, has almost 40,000 occurrences. This highlights a significant disparity between the number of citations of the considered media and the magnitude of the “people” subcategory.
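The grouping into themes and the share quoted above can be illustrated with the short sketch below. The text does not describe how subcategories were assigned to the 15 groups, so the `theme_of` mapping is a hypothetical stand-in for that assignment, and the function names are illustrative. Only the two totals in the final lines (4,050 and 5,873) come from the research.

```python
from collections import Counter


def theme_totals(subcategory_counts, theme_of):
    """Sum subcategory occurrences per general theme.

    `subcategory_counts` maps a subcategory to its number of occurrences.
    `theme_of` maps a subcategory to a theme (hypothetical assignment);
    subcategories without an assigned theme fall under "other".
    """
    totals = Counter()
    for subcategory, count in subcategory_counts.items():
        totals[theme_of.get(subcategory, "other")] += count
    return totals


def share(part, whole):
    """Return `part` as a fraction of `whole`."""
    return part / whole


# Figures reported in the text: 4,050 "people" occurrences out of 5,873.
people_share = share(4050, 5873)
print(f"{people_share:.1%}")  # prints 69.0%
```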

The “people” group also predominates in the visualization by media type, as illustrated in the next graph:

Graph 7. Author: InternetLab.

According to the collected data, articles about “people” and “concepts and history” are among the most frequent subcategories in both media types, although with significantly distinct quantities. It is relevant to note that Indigenous media is not represented in the above graph because no references to them were identified.

In this context, it is also noteworthy to highlight the predominance of peripheral/territorial media in specific subcategories such as geography, botany, and religion, while Black media shows higher recurrence in areas such as art and awards. Although these data do not provide definitive answers, they may suggest clues about using specific media, highlighting the limited space that certain groups still occupy as a source of knowledge.

Reflections to Consider

This analysis brings forth a fundamental realization: even within the online sphere, narratives appear predominantly influenced by white and hegemonic media. This underscores that the space occupied by independent media, particularly those of black, indigenous, and marginalized origin, remains constrained and often confined to simplistic and stereotyped perspectives, disproportionately fixated on issues related to race and violence.

Given the outcomes disclosed by our research, an invitation and simultaneous question arise for the Wiki community and the expansive realm of free knowledge: How can we actively collaborate to mitigate disparities in the production of and access to knowledge, especially for historically marginalized populations?
The response to this question requires deep reflection and a renewed commitment to diversity and inclusion. Below, we outline recommendations and aspects that we consider crucial in the quest to strengthen the voices of black, indigenous, and peripheral/territorial individuals on Wikipedia:

  1. First and foremost, it is essential to acknowledge the vital role of black, indigenous, and peripheral/territorial media as custodians and disseminators of knowledge within these populations. The call to action leads us to recognize that we can and should work together to amplify their voices, ensuring they are heard not only regarding topics related to their racial identity or the violence they face but also across a diverse range of subjects, from culture to science.
  2. It is crucial to rethink how references are constructed on Wikipedia and other open knowledge platforms. We can explore innovative ways to highlight sources from black, indigenous, and peripheral media, thus valuing their contributions in various contexts. This effort will not only enrich the diversity of knowledge available online but also promote a richer understanding of the cultures and experiences of these communities.

Data Collection

For this research, we used one of the copies (dumps) made available by Wikipedia, specifically the ptwiki-20230520-pages-articles-multistream from July 2023. This dump encompasses the titles of all pages that existed on the Lusophone Wikipedia as of July 1, 2023. The archive contained a total of 2,562,293 pages, of which 1,847,109 were Wikipedia articles. After obtaining the titles, the next step involved automated access to each page to extract the data used in this research, including the digital references (links) and category identifications of each article. To do this, a Python script was written to interact with the Wikipedia API. All collected data was stored in a .csv dataset. The subsequent stage was data processing, which included correcting and standardizing URL addresses with typos. A Python script using regular expressions (Regex) was written for these tasks.
Hence, after collecting and processing the data, data analysis techniques were employed to identify specific areas for investigation and generate the graphs presented here.
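
The collection and cleanup steps described above might be sketched as follows. This is not the script used in the research. It assumes the links and categories were read through the MediaWiki Action API with `prop=extlinks|categories`, and it corrects only one kind of typo (a malformed `http`/`https` prefix). The names `build_query`, `parse_page`, `fetch_page`, `normalize_url`, and the User-Agent string are all hypothetical.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://pt.wikipedia.org/w/api.php"


def build_query(title):
    """Build an Action API URL asking for a page's links and categories."""
    params = {
        "action": "query",
        "prop": "extlinks|categories",
        "titles": title,
        "ellimit": "max",
        "cllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_page(payload):
    """Extract (links, categories) from a formatversion=2 response."""
    page = payload["query"]["pages"][0]
    links = [item["url"] for item in page.get("extlinks", [])]
    categories = [item["title"] for item in page.get("categories", [])]
    return links, categories


def fetch_page(title):
    """Fetch one page (network call; result continuation is not handled)."""
    request = Request(
        build_query(title),
        headers={"User-Agent": "reference-survey-sketch/0.1"},
    )
    with urlopen(request) as response:
        return parse_page(json.load(response))


# Matches malformed scheme prefixes such as "htp:/", "https//" or "HTTP://".
SCHEME_TYPO = re.compile(r"^\s*h+t+p(s?)\s*:?\s*/+", re.IGNORECASE)


def normalize_url(raw):
    """Rewrite a malformed http(s) prefix into a well-formed one."""
    return SCHEME_TYPO.sub(
        lambda m: f"http{m.group(1).lower()}://", raw.strip()
    )
```

For example, `normalize_url("htp:/example.org/a")` returns `"http://example.org/a"`. A full pipeline would also need to follow the API's continuation tokens for pages with many links and write each result to the .csv dataset.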

Notes:

[1]  JEMIELNIAK, Dariusz. Common Knowledge? An ethnography of Wikipedia. Stanford: Stanford University Press, 2014

MAYER-SCHÖNBERGER, Viktor. Geographies of the world’s knowledge: An approach. In: FLICK, Corinne Michaela. Who Owns the World’s Knowledge? Munich: Convoco, 2012. p. 112-124.

[2]  TERRES, Pedro Toniazzo; PIANTÁ, Lucas Tubino. Wikipédia: públicos globais, histórias digitais. Esboços: histórias em contextos globais, v. 27, n. 45, p. 264-285, 2020.
